Applications of Social Network Analysis to Community Dynamics

A Thesis Submitted for the Degree of Master of Science (Engg.) in the Faculty of Engineering

by
Kolli Naimisha

Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore 560 012, India

February 2008
Sri Venkateswaraya Namaha

To My Parents & Brother
ACKNOWLEDGEMENT

I am eternally grateful to my research supervisor, Prof. N. Balakrishnan, for his invigorating guidance and valuable suggestions during the course of my research work. I thank him for encouraging my ideas and very patiently correcting my mistakes. I am also indebted to him for his utmost support, encouragement and inspiration throughout the period of this work. I am thankful to him for always making time for me through his hectic schedule.

My special thanks to my grandfather Prof. M. S. Murthy and my aunt Dr. Y. V. S. Lakshmi for being a great source of inspiration throughout my life. I also thank my grandmother Mrs. M. Sunanda and my uncle Mr. Y. K. Viswanath for always being there for me. I dedicate all of my work to my beloved parents, Mr. K. Sitaram Prasad and Dr. K. R. L. Suryakiran, and to my brother K. Sishir, who have always been the driving force behind all my achievements.

My heartfelt appreciation goes to all my friends, Mrs. Shijesta Victor, Ms. Jeyanthi, Ms. Shivagama Sundari, Ms. B. S. Soumya, Ms. P. Syamala and many others, for extending a helping hand and making my stay at IISc very memorable. I would like to acknowledge the support of Mr. Ryan, who helped me in all possible ways and has continuously kept my spirits high. Special thanks to Ms. Swarna, Ms. Nagaratna, Mr. Vishwas, Mr. Ravi and Mr. Sasi for their timely support throughout my tenure.

I also thank SERC and the MMSL lab for providing us the best computing facilities. It was a great joy to work in such a lab. I would also like to extend my sincere thanks to all the Professors at IISc who shared their knowledge with us.
ABSTRACT

This thesis concerns Social Network Analysis as a mechanism for exploring Community Dynamics. To be able to use Social Network methodologies, relationships existing between the modeled entities are required. In this thesis, we use two different kinds of relationships: e-mails exchanged and co-authorship of papers. The e-mails exchanged, as an indicator of information exchange in an organization, are used to facilitate the emergence of structure within the organization. We demonstrate the effectiveness of using e-mail communication patterns for crisis detection in a hierarchically structured organization. We compare the performance of a Social Network based Classifier with some of the traditional classifiers from the data mining framework for inferring this hierarchy. A generic framework for studying dynamic group transformations is presented, and the co-authorship of papers, as an indicator of collaboration in an academic institution, is used to study the community behavioral patterns evolving over time. The Enron e-mail corpus and the IISc Co-authorship Dataset are utilized for illustrative purposes.
CONTENTS

Acknowledgement
Abstract
Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Emergence of a New Science
1.2 Communities of Practice
1.3 Social Networks
1.3.1 Social Network as Discipline
1.3.2 How is Social Network Data Different
1.3.3 Properties of Social Networks
1.3.4 Social Network Methods
1.3.5 Review Work on Social Networks
1.4 Motivation
1.5 Organization of the Thesis

Chapter 2 Organizational Hierarchy from E-mail Communications
2.1 Introduction
2.2 Problem Description
2.3 Background
2.4 Methodology
2.4.1 Classification Methods
2.5 Enron E-mail Dataset
2.5.1 Different versions of Enron E-mail Corpus
2.5.2 Enron Related Work
2.5.3 Modified Enron Dataset (Our Corpus)
2.5.4 Enron Hierarchy Data
2.6 Experiments and Results
2.7 Discussion of Results
2.8 Conclusions and Future Work

Chapter 3 Organizational Crisis Detection from E-mail Communications
3.1 Introduction
3.2 Background
3.3 Enron-Events
3.4 Methodology
3.4.1 Feature Extraction
3.4.2 Identification of Informal Networks
3.5 Results and Discussion
3.6 Conclusions and Future Work

Chapter 4 Evolution of Communities
4.1 Introduction
4.2 Problem Framework
4.3 Background
4.4 IISc Co-publication Data
4.4.1 IISc Co-authorship Dataset I
4.4.2 IISc Co-authorship Dataset II
4.5 Methodology
4.5.1 Clustering
4.5.2 Community Transitions
4.5.3 Extracting Community Transitions
4.6 Results
4.6.1 Discussion
4.7 Conclusions and Future Work

Chapter 5 Summary
5.1 Summary of Contributions
5.2 Future Work

References
LIST OF FIGURES

Figure 1.1 Distribution of Degree Vs Number of Nodes
Figure 2.1 Voronoi Cells in 2-D and 3-D
Figure 2.2 K-Nearest Neighbor
Figure 2.3 Example of Decision Tree
Figure 2.4 Signal Flow Graph of a Single Perceptron
Figure 2.5 Signal Flow Graph of a Multi-layer Perceptron
Figure 2.6 Regular Equivalence Example
Figure 2.7 Enron Database Schema
Figure 2.8 Performance of Classifiers
Figure 2.9 Variations in the Performance of the Social Network Classifier with respect to E-mails Considered
Figure 3.1 Variation in the Traffic generated with Events in Enron
Figure 3.2 Variation in Distinct Senders with the Events in Enron
Figure 3.3 Variation in Distinct Receivers with the Events in Enron
Figure 3.4 Variation in Reachability within two hops with Events in Enron
Figure 3.5 Variation in Reachability within three hops with Events in Enron
Figure 3.6 Variation in Group Closeness Centrality with Events in Enron
Figure 3.7 Variation in the No. of Cliques with Events in Enron
Figure 3.8 Variation in the No. of n-clans with Events in Enron
Figure 3.9 Variation in the number of Informal Networks with Events in Enron
Figure 3.10 Variation in the Performance of Social Network Classifier With Events in Enron
Figure 4.1 Proposed Framework
Figure 4.2 Distribution in the No. of Authors with their Active Years
Figure 4.3 Distribution of Authors year-wise with No. of Active years
Figure 4.4 Distribution in the No. of Papers with varying Active years
Figure 4.5 Distribution of Authors year-wise with No. of Active Years
Figure 4.6 Distribution of Authors in IISc Co-authorship Dataset I
Figure 4.7 Distribution of Papers in IISc Co-authorship Dataset I
Figure 4.8 Distribution of Authors in IISc Co-authorship Dataset II
Figure 4.9 Distribution of Papers in IISc Co-authorship Dataset II
Figure 4.10 Distribution of Communities in IISc Co-authorship Dataset I
Figure 4.11 Distribution of Communities in IISc Co-authorship Dataset II
Figure 4.12 Continuation in IISc Co-authorship Dataset I
Figure 4.13 Continuation in IISc Co-authorship Dataset II
Figure 4.14 Creation in IISc Co-authorship Dataset I
Figure 4.15 Creation in IISc Co-authorship Dataset II
Figure 4.16 Dissolution in IISc Co-authorship Dataset I
Figure 4.17 Dissolution in IISc Co-authorship Dataset II
Figure 4.18 Merging in IISc Co-authorship Dataset I
Figure 4.19 Merging in IISc Co-authorship Dataset II
Figure 4.20 Splitting in IISc Co-authorship Dataset I
Figure 4.21 Splitting in IISc Co-authorship Dataset II
Figure 4.22 Distribution of each transformation in IISc Co-authorship Dataset I
Figure 4.23 Distribution of each transformation in IISc Co-authorship Dataset II
Figure 4.24 Community Size Vs Dominant Member Activity In Co-authorship Set I
Figure 4.25 Community Size Vs Dominant Member Activity In IISc Co-authorship Dataset II
LIST OF TABLES

Table 2.1 Classification Algorithms used
Table 2.2 Broad Titles in Enron
Table 2.3 Overall Accuracy of Classifiers
Table 2.4 Example Format for Correctly Vs Incorrectly classified statistics for a title-classifier pair
Table 2.5 Correctly Vs Incorrectly classified statistic for every title-classifier pair
Table 3.1 Snapshot of Events associated with Enron crisis
Chapter 1
INTRODUCTION

1.1. EMERGENCE OF A NEW SCIENCE

Many systems take the form of networks: sets of nodes or vertices joined together in pairs by links or edges. Examples include acquaintance networks and collaboration networks, technological networks such as the Internet, the World Wide Web, and power grids, and biological networks such as neural networks, food webs, and metabolic networks. All these systems are networks, but all are completely distinct in one sense or another. So, in essence, what we require is a language for talking about networks that is precise enough to describe not only what a network is but also what kinds of different networks there are in the world [1]. Out of this requirement, over decades of theory and experiment in many fields from physics to sociology, a new science has emerged: a science of networks.

In 1967, social psychologist Stanley Milgram performed an experiment to test an unresolved hypothesis circulating in those days, called the small-world problem. The claim of the small-world phenomenon is that the world, when viewed as a network of social acquaintances, is in a sense small: any person could be reached through a network of friends in only a few steps. Milgram asked a few hundred randomly selected people to send letters to a stockbroker in Boston via intermediaries. They could send the letter only to people they knew on a first-name basis. Among the letters that reached the destination, the average path length was found to be six. This led to the phrase "six degrees of separation". This experiment set the stage for the algorithmic aspects of this new and emerging science.

In order to make such a claim, instead of asking "How small is our world?", one could ask "What would it take for any world to be small?" In other words, we want to construct a mathematical model of the world in which individuals are represented as nodes and relationships are represented as edges. This allows analysis using the tools of mathematics. Erdős and Rényi [2] introduced the theory of random graphs.
A random graph is a network of nodes connected by links in a purely random fashion. Let N be the number of nodes. A pair of nodes has probability p of being connected.
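The random-graph model just defined can be explored numerically. The following is a minimal sketch, assuming an undirected graph in which each of the N(N-1)/2 possible pairs is linked independently with probability p, so that the expected degree of a node is p(N-1), roughly pN. The function names, the seed, and the chosen values of N and k are illustrative assumptions and do not come from the thesis.

```python
import random
from collections import deque


def random_graph(n, p, seed=0):
    """Adjacency sets of an undirected random graph: each pair is linked with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def mean_degree(adj):
    """Average number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def largest_component_fraction(adj):
    """Fraction of all nodes that lie in the largest connected component (found by BFS)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best / len(adj)


if __name__ == "__main__":
    n = 1000  # assumed network size, for illustration only
    for k in (0.5, 1.0, 2.0, 4.0):  # target average degree
        g = random_graph(n, k / n)
        print(k, round(mean_degree(g), 2), round(largest_component_fraction(g), 2))
```

Running the loop for average degrees below and above one gives a feel for how the share of nodes in the largest component changes with the average degree; the exact numbers depend on the seed.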
The average degree is therefore $k \approx pN$. What interesting things can be said for different values of $p$ or $k$ (that are true as $N \to \infty$)?

If $k < 1$: small, isolated clusters; small diameters; short path lengths.
At $k = 1$: a giant component appears; diameter peaks; path lengths are high.
For $k > 1$: almost all nodes connected; diameter shrinks; path lengths shorten.

[Figure 1.1 Distribution of Degree vs Number of Nodes. Plotted against k: percentage of nodes in the largest component and diameter of the largest component (not to scale), with the phase transition marked.]

What does this mean? If connections between people can be modeled as a random graph, then, because the average person easily knows more than one person ($k \gg 1$), we live in a small world where, within a few links, we are connected to anyone in the world.

1.2. COMMUNITIES OF PRACTICE

Why is the above finding so surprising? Imagine one has one hundred friends, each of whom also has one hundred friends. At one degree of separation one connects to one hundred people, and at two degrees to one hundred times one hundred. Proceeding in a similar fashion, in five degrees one is connected to nine billion people. So if everyone has one hundred friends, then within six steps one can connect oneself to the entire population. But there is one important omission in this reasoning. Chances are that one will come up with many of the same people in one's friends' networks. This observation turns out to be a universal feature of networks: they display what we call clustering. We tend to have groups of friends, each of which is like a community or cluster based on shared experience, location, or interests, joined to each other by overlaps created when individuals in one group also belong to other groups. This is particularly relevant because clustering breeds redundancy, and its study can tell us a great deal about networks.

The ability to detect community structure in a network could have practical applications. Communities in a social network might represent real social groupings, perhaps by interest or background; communities in a citation network might represent related papers on a single topic; communities in a metabolic network might represent cycles and other functional groupings; communities on the web might represent pages on related topics; hidden communities might represent potentially suspicious activity. Being able to identify these communities could help us understand and exploit these networks more effectively.

Communities of practice are the collaboration groups that naturally grow and coalesce within any kind of network. Any institution that provides opportunities for communication or interaction among its members is eventually threaded by communities who have similar goals and a shared understanding of their activities. These communities have been the subject of much research as a way to uncover the structure and interaction patterns within a network, in order to understand the collective behavior of the network from the individuals that constitute it. Recent research has focused on using a social network perspective to analyze these networks. A social network consists of both a set of actors, who may be arbitrary entities like persons or organizations, and one or more types of relations between them, such as information exchange or economic relationship.

1.3. SOCIAL NETWORKS

1.3.1 Social Networks as a discipline

Networks have been studied as graphs in mathematics, physics, sociology, engineering and computer science, biology and economics. Each field has its own theory of networks and its own way of aggregating collective behavior. So why is this new?
In the past, networks have been viewed as objects of pure structure whose properties are fixed in time. Both these assumptions are far from the truth. Real networks represent populations of individual components that are actually doing something: communicating, generating power, sending data, or even making decisions. Here, the structure connecting the individual components is important because it affects their individual behavior or the behavior of the system as a whole. Also, networks are dynamic objects, not just because things happen in these systems, but because the networks themselves are evolving and changing in time, driven by the activities or decisions of the individual components. Therefore, what happens and how it happens depend on the network, which in turn depends on what has happened previously. It is this view of the network, as a continuously evolving and self-constituting system [1], that is new about the science of networks. If this is to succeed, the new science of networks must become a manifestation of its own subject matter: a network of scientists collectively solving problems that cannot be solved by any single individual or any single discipline [1].

Social network analysis (SNA) is a set of research procedures for identifying structures in systems based on the relations among actors. Grounded in graph and system theories, this approach has proven to be a powerful tool for studying networks in physical and social worlds, including on the web [3, 4, 5]. SNA focuses on relations and ties in studying actors' behavior and attitudes. Thus the positions of actors within a network and the strength of ties between them become critically important. Social position can be evaluated by finding the centrality of a node, identified through the number of connections among network members. Such measures are used to characterize degrees of influence, prominence and importance of certain members [6]. Tie strength mostly involves closeness of bond. There is general agreement that strong ties contribute to intensive resource exchange and close communities, whereas weak ties provide integration of relatively separated social groups into larger social networks [7, 8].
The notion of a social network and the methods of social network analysis have attracted considerable interest from the social, behavioral, and computer science communities in recent decades. Much of this interest can be attributed to the appealing focus of social network analysis on relationships among social entities, and on the patterns and implications of these relationships. From the view of social network analysis, the presence of regular patterns in relationships is referred to as structure, and the quantities that measure structure as structural variables. The focus on relations, and the patterns of relations, requires a set of methods and analytic concepts that are distinct from the methods of traditional statistics and data analysis.

1.3.2. How is Social Network Data different?

On one hand, there really isn't anything about social network data that is all that unusual. Social network analysts do use a specialized language for describing the structure and contents of the sets of observations that they use. But network data can also be described and understood using the ideas and concepts of more familiar methods, like cross-sectional survey research.

On the other hand, the data sets that social network analysts develop usually end up looking quite different from the conventional rectangular data array so familiar to survey researchers and statistical analysts. The differences are quite important because they lead us to look at our data in a different way, and even lead us to think differently about how to apply statistics.

"Conventional" data consist of a rectangular array of measurements. The rows of the array are the cases, or subjects, or observations. The columns consist of scores (quantitative or qualitative) on attributes, or variables, or measures. This data structure leads us to compare how actors are similar or dissimilar to each other across attributes (by comparing rows). Or, we examine how variables are similar or dissimilar to each other in their distributions across actors (by comparing or correlating columns).

"Social Network" data consist of a square array of measurements. The rows of the array are the cases, or subjects, or observations. The columns of the array are the same set of cases, subjects, or observations. Each cell of the array describes a relationship between the actors. We could look at this data structure the same way as with attribute data. By comparing rows of the array, we can see which actors are similar to which other actors in whom they choose.
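The square-array view can be made concrete with a small sketch. The four actors and their choices below are hypothetical, and the helper names are illustrative assumptions; the density and reciprocity functions follow the usual definitions (ties present divided by ties possible, and the share of connected pairs whose tie runs in both directions), which are one common convention among several.

```python
# Hypothetical directed "chooses" relation among four actors.
actors = ["A", "B", "C", "D"]
# matrix[i][j] == 1 means actor i chooses actor j; the diagonal is not used.
matrix = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]


def out_ties(m, i):
    """Row view: the set of actors chosen by actor i."""
    return {j for j, x in enumerate(m[i]) if x}


def in_ties(m, j):
    """Column view: the set of actors who choose actor j."""
    return {i for i, row in enumerate(m) if row[j]}


def density(m):
    """Ties present divided by the n * (n - 1) ties that are possible."""
    n = len(m)
    ties = sum(m[i][j] for i in range(n) for j in range(n) if i != j)
    return ties / (n * (n - 1))


def reciprocity(m):
    """Share of connected pairs whose tie is returned (cells above vs below the diagonal)."""
    n = len(m)
    connected = mutual = 0
    for i in range(n):
        for j in range(i + 1, n):
            if m[i][j] or m[j][i]:
                connected += 1
                if m[i][j] and m[j][i]:
                    mutual += 1
    return mutual / connected if connected else 0.0


# Actors A and B are similar in whom they choose: both choose C.
shared = out_ties(matrix, 0) & out_ties(matrix, 1)
print([actors[j] for j in sorted(shared)])  # ['C']
print(density(matrix))       # 0.5
print(reciprocity(matrix))   # 0.5
```

Comparing two rows shows whom two actors both choose, while comparing two columns shows who chooses both of them.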
By looking at the columns, we can see who is similar to whom in terms of being chosen by others. These are useful ways to look at the data, because they help us to see which actors have similar positions in the network. This is the first major emphasis of network analysis: seeing how actors are located or embedded in the overall network. The analyst also notes the density of overall ties. The analyst might also compare the cells above and below the diagonal to see if there is reciprocity in choices. This is the second major emphasis of network analysis: seeing how the whole pattern of individual choices gives rise to more holistic patterns.

It is quite possible to think of the network data set in the same terms as "conventional data." One can think of the rows as simply a listing of cases, and the columns as attributes of each actor (i.e., the relations with other actors can be thought of as attributes of each actor). Indeed, many of the techniques used by network analysts (like calculating correlations and distances) are applied exactly the same way to network data as they would be to conventional data.

While it is possible to describe network data as just a special form of conventional data, network analysts look at the data in some rather fundamentally different ways. Rather than thinking about how an actor's ties with other actors describe the attributes of that actor, network analysts instead see a structure of connections, within which the actor is embedded. Actors are described by their relations, not by their attributes. And the relations themselves are just as fundamental as the actors that they connect.

The major difference between conventional and network data is that conventional data focus on actors and attributes, whereas network data focus on actors and relations. The difference in emphasis is consequential for the choices that a researcher must make in deciding on research design, in conducting sampling, developing measurement, and handling the resulting data. It is not that the research tools used by network analysts are different from those of other scientists (they mostly are not). But the special purposes and emphases of network research do call for some different considerations.

From our survey, there are four major criteria that have been used in prior works to infer relationships. They are self-report, communication, similarity, and co-occurrence.
Self-report is the most direct, and perhaps the most reliable, criterion, as it accepts only the links reported by the concerned actors themselves. For an actor, reporting an associate could mean revealing the associate in questionnaires or interviews, acknowledging the associate in a personal profile or home page, or including the associate in an Instant Messaging buddy list. As this reporting is done individually by each actor, self-reported links are not always mutual or equally weighted in both directions.

Communication is also a strong expression of relationship, especially if the acts of communication are frequent or intense (perhaps judged by what is exchanged or the length of conversation). In particular, Internet-based communication tools, such as e-mails, newsgroups, and Instant Messaging, often leave electronic trails that can be traced and mined. Communication-based networks may consist of either directed edges (from sender to receiver) or undirected edges (if an exchange is required).

Similarity borrows the idea from sociology that people who are more closely related tend to have greater similarity to each other. The problem of finding undirected or mutual links between pairs of actors can then be reduced to finding similar pairs. Similarity may be defined in terms of various distance functions and various attributes, such as having similar content and linkages in personal home pages, using similar vocabulary in e-mail messages, or sharing similar opinions on common areas of interest.

Co-occurrence in turn is based on the idea that entities that keep occurring together at a frequency higher than random chance usually allows are likely to have some association between them. Basic co-occurrence assumes that the data would provide transactions, or clearly defined instances of co-occurrence by a subset of actors. For instance, a transaction could be a web page, and two names that keep occurring together on the same web pages may be related. Alternatively, a transaction could be a publication, and two authors who keep co-authoring papers together are also likely to be related.

Also, the nodes or actors included in non-network studies tend to be the result of independent probability sampling. Network studies are much more likely to include all of the actors who occur within some (usually naturally occurring) boundary.
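The co-occurrence criterion described above, with publications as transactions, can be sketched as follows. The author names and papers are made up for illustration, and the fixed `min_count` threshold is an assumption standing in for a proper "more often than chance" test; it is not a procedure taken from the thesis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each entry is the author list of one paper.
papers = [
    ["Rao", "Iyer", "Menon"],
    ["Rao", "Iyer"],
    ["Menon", "Das"],
    ["Rao", "Iyer", "Das"],
]


def cooccurrence_counts(transactions):
    """Count, for every unordered pair of actors, the transactions in which both appear."""
    counts = Counter()
    for members in transactions:
        for a, b in combinations(sorted(set(members)), 2):
            counts[(a, b)] += 1
    return counts


def infer_links(transactions, min_count=2):
    """Keep pairs that co-occur at least min_count times (an assumed, simplistic threshold)."""
    counts = cooccurrence_counts(transactions)
    return {pair: c for pair, c in counts.items() if c >= min_count}


print(infer_links(papers))  # {('Iyer', 'Rao'): 3}
```

The resulting counts can serve directly as weights of undirected links between the co-occurring actors.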
The use of whole populations as a way of selecting observations in network studies makes it important for the analyst to be clear about the boundaries of each population to be studied, and how individual units of observation are to be selected within that population. Network data sets also frequently involve several levels of analysis, with actors embedded at the lowest level.

Survey research methods usually use a quite different approach to deciding which nodes to study. A list is made of all nodes (sometimes stratified or clustered), and individual elements are selected by probability methods. The logic of the method treats each individual as a separate replication, that is, as interchangeable with any other.

The populations that network analysts study are remarkably diverse. At one extreme, they might consist of symbols in texts or sounds in verbalizations; at the other extreme, nations in the world system of states might constitute the population of nodes. Most common are populations of individual persons. In each case, however, the elements of the population to be studied are defined by falling within some boundary.

Network analysts can expand the boundaries of their studies by replicating populations. Rather than studying one neighborhood, we can study several. This type of design (which could use sampling methods to select populations) allows for replication and for testing of hypotheses by comparing populations. A second and equally important way that network studies expand their scope is by the inclusion of multiple levels of analysis, or modalities.

Most social network analysts think of individual persons as being embedded in networks that are embedded in networks that are embedded in networks. Network analysts describe such structures as "multi-modal." A data set that contains information about two types of social entities (say persons and organizations) is a two-mode network. Of course, this kind of view of the nature of social structures is not unique to social network analysts. Statistical analysts deal with the same issues as "hierarchical" or "nested" designs. Theorists speak of the macro-meso-micro levels of analysis, or develop schema for identifying levels of analysis (individual, group, organization, community, institution, society, global order being the most commonly used system).
One advantage of network thinking and method is that it naturally predisposes the analyst to focus on multiple levels of analysis simultaneously. That is, the network analyst is always interested in how the individual is embedded within a structure and how the structure emerges from the micro-relations between individual parts. The ability of network methods to map such multi-modal relations is, at least potentially, a step forward in rigor.

In one way, there is little apparent difference between conventional statistical approaches and network approaches. Univariate, bivariate, and even many multivariate descriptive statistical tools are commonly used in describing, exploring, and modeling social network data. Social network data are easily represented as arrays of numbers, just like other types of data. As a result, the same kinds of operations can be performed on network data as on other types of data. Algorithms from statistics are commonly used to describe characteristics of individual observations (e.g. the median tie strength of an actor with all other actors in the network) and the network as a whole (e.g. the mean of all tie strengths among all actors in the network). Statistical algorithms are very heavily used in assessing the degree of similarity among actors, and for finding patterns in network data (e.g. factor analysis, cluster analysis, multi-dimensional scaling). Even the tools of predictive modeling are commonly applied to network data (e.g. correlation and regression).

The other major use of statistics is for testing hypotheses. The key link in the inferential chain of hypothesis testing is the estimation of the standard errors of statistics. In fact, the logic is not really different from that of testing hypotheses with non-network data.

Social network data tend to differ from more "conventional" survey data in some key ways: network data are often not probability samples, and the observations of individual nodes are not independent. These differences are quite consequential for both the questions of generalization of findings and for the mechanics of hypothesis testing. There is, however, nothing fundamentally different about the logic of the use of descriptive and inferential statistics with social network data.

1.3.3. Properties of Social Networks

Researchers have concentrated particularly on a few properties that seem to be common to many networks: the small-world property, power-law degree distributions, and network transitivity.
The small-world effect is the finding that the average distance between vertices in a network is short, usually scaling logarithmically with the total number n of vertices. A right-skewed degree distribution is another property that many networks possess. The degree of a vertex in a network is the number of other vertices to which it is connected, and one finds that there are typically many vertices in a network with low degree and a small number with high degree, the precise distribution often following a power-law or exponential form.
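The two properties just described can be measured directly on a small example. The sketch below assumes an undirected network stored as adjacency sets; the five-vertex graph and the function names are illustrative assumptions. A network this small cannot exhibit a power law or logarithmic scaling, so the example only shows how the quantities are computed.

```python
from collections import Counter, deque

# Hypothetical undirected network stored as adjacency sets.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}


def degree_distribution(adj):
    """Map each degree value to the number of vertices that have that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())


def bfs_distances(adj, source):
    """Geodesic (shortest-path) distance from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_distance(adj):
    """Mean geodesic distance over all ordered pairs of distinct, mutually reachable vertices."""
    total = pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


print(sorted(degree_distribution(graph).items()))  # [(1, 1), (2, 3), (3, 1)]
print(average_distance(graph))
```

For a real network one would plot the degree counts on logarithmic axes and compare the average distance against the logarithm of the number of vertices.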
A third property that many networks have in common is network transitivity, which is the property that two vertices that are both neighbors of the same third vertex have a heightened probability of also being neighbors of each other. In the language of social networks, two of your friends will have a greater probability of knowing one another than will two people chosen at random from the population, on account of their common acquaintance with you.

1.3.4. Social Network Methods

Many of the key structural measures and notions of social network analysis are motivated by central concepts in social theory. Of critical importance for the development of methods for social network analysis is the fact that the unit of analysis in the network is not the individual, but an entity consisting of a collection of individuals and the linkages among them. Network methods focus on dyads, triads or larger systems. Therefore, special methods are necessary. It is important to contrast approaches in which networks and structural properties are central with approaches that employ network ideas and measurements in standard individual-level analyses.

The social network perspective models the relationships to depict the structure of a group. One could then study the impact of this structure on the functioning of the group and the influence of the structure on individuals within the group. It can also be used to study the process of change within a group over time. Thus, the network perspective also extends longitudinally. The social network perspective thus has a distinctive orientation in which structures, their impact, and their evolution become the primary focus. Since structures may be behavioral, social, political, or economic, social network analysis allows a flexible set of concepts and methods with broad interdisciplinary appeal.

Before we look at related methods of social networks, we first review the terminology frequently used in this literature. An actor is a social entity. It could be a person or any other entity for which a relationship with another entity could be defined.
The relationship between a pair of actors is called a tie, link or pair. Each link may be directed or undirected, and binary (present or absent) or weighted (taking a set of values, usually with a higher value implying a stronger relationship). Links could also be of particular types, e.g., friendship or familial. All links of the same type can be grouped together as a relation. A dyad consists of a pair of actors and the ties between them. A triad is a subset of three actors and the ties among them. Relationships among larger subsets of actors include the subgroup or a group. A social network encompasses a set of actors and all the relations that could be defined on them. Usually, depending on the number of actor types n, a social network may be identified as being an n-mode network. As far as possible, these terms will be used uniformly throughout the rest of this thesis.

There are many different types of social networks that can be studied. One way of categorizing them is based on mode. The mode of a network is the number of distinct sets of entities on which the structural variables are measured. One-mode networks study just a single set of actors, while two-mode networks focus on two sets of actors, or one set of actors and one set of events.

There are many ways to describe social network data mathematically. The most popular of them are graph theoretic, sociometric and algebraic. For some forms of data and network methods, one notation scheme may be preferred to the others. Graph theoretic notation is most useful for centrality and prestige methods, cohesive subgroup ideas, and dyadic and triadic methods. Sociometric notation refers to the representation of data for each relation in a two-way matrix, termed a sociomatrix. Sociometric notation is often used for the study of structural equivalence and block models. Algebraic notation is most appropriate for role and positional analysis and relational algebras. Both graph theory and matrix operations have served as the foundations of many concepts in the analysis of social networks.

One of the primary uses of graph theory in social network analysis is the identification of the most important actors in a social network. All measures of importance attempt to describe and measure properties of actor location in a social network. Actors who are the most important are usually located in strategic locations within the network.
Several measures are defined based on degree, closeness, betweenness, information and rank. These definitions yield actor indices which attempt to quantify the prominence of an individual actor embedded in a network. The actor indices can also be aggregated across actors to obtain a single group-level index which summarizes how variable or differentiated the set of actors is as a whole with respect to a given measure. Both centrality and prestige are examples of measures of the prominence or importance of the actors in a social network.
Degree centrality is a measure of the degree of an actor in a network. An actor with a high degree centrality is where the action is in the network. Thus this measure focuses on the most visible actors in the network. Such an actor can be recognized by others as a major channel of relational information. In contrast, actors with low degrees are peripheral in the network.

A second view of centrality is based on closeness or distance. This measure focuses on how close an actor is to all the other actors in the network. Centrality based on closeness is inversely related to distance. As a node grows farther apart in distance from other nodes, its centrality will decrease, since there will be more lines in the geodesics linking that node to the other nodes.

Interactions between two non-adjacent actors might depend on the other actors in the set of actors, especially the actors who lie on the paths between the two. These other actors potentially might have some control over the interactions between the two non-adjacent actors. The important idea here is that an actor is central if it lies between other actors on their geodesics, implying that to have a large betweenness centrality, the actor must be between many of the actors via their geodesics.

Although this centrality has gained popularity because of its generality, the index assumes that all geodesics are equally likely when estimating the critical probability that an actor falls on a particular geodesic. It also ignores the fact that if some actors on the geodesics have large degrees, then the geodesics containing these expansive actors are more likely to be used as shortest paths than other geodesics. Also, it would be more realistic to consider betweenness counts which focus on paths other than geodesics. Information centrality generalizes the notion of betweenness centrality so that all paths between actors, with weights depending on their lengths, are considered when calculating the betweenness counts.

Another important concept to emerge from the early days of social network analysis was balance theory.
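The degree, closeness and betweenness indices described above can be sketched for a small undirected, connected network. The five-actor graph is hypothetical, the normalizations (dividing by n - 1) follow one common convention, and betweenness is computed with Brandes' accumulation scheme, which is a choice made here for brevity rather than a method named in the thesis. Information centrality is not covered.

```python
from collections import deque

# Hypothetical undirected, connected network: a - b - c, with d and e attached to c.
graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "e"},
    "d": {"c"},
    "e": {"c"},
}


def degree_centrality(adj):
    """Degree of each actor divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {v: len(adj[v]) / (n - 1) for v in adj}


def bfs_distances(adj, source):
    """Geodesic distance from source to every reachable actor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness_centrality(adj):
    """(n - 1) divided by the sum of geodesic distances; assumes a connected graph."""
    n = len(adj)
    return {v: (n - 1) / sum(bfs_distances(adj, v).values()) for v in adj}


def betweenness_centrality(adj):
    """Number of geodesics between other pairs that pass through each actor (unnormalized)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # vertices in order of non-decreasing distance from s
        preds = {v: [] for v in adj}     # predecessors on geodesics from s
        sigma = {v: 0 for v in adj}      # number of geodesics from s to each vertex
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):        # accumulate dependencies from the farthest vertex back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is visited from both ends, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}


print(degree_centrality(graph)["c"])        # 0.75
print(closeness_centrality(graph)["c"])     # 0.8
print(betweenness_centrality(graph)["c"])   # 5.0
```

In this example actor c scores highest on all three indices, while the peripheral actors a, d and e have zero betweenness because no geodesic between other actors passes through them.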
The focus in balance theory is on the cognition or awareness of sociometric relations, usually positive and negative affect relations from the perspective of an individual. The most important aspect of structural balance is that the nodes in a balanced graph can be partitioned into two subsets or clusters.

One of the major concerns of social network analysis is identification of cohesive subgroups of actors within a network. Cohesive subgroups are subsets of actors among whom there are relatively strong, direct, intense, or frequent ties. These
methods attempt to formalize the intuitive and theoretical notion of social group using social network properties. There are many specific properties of a social network that are related to the cohesiveness of subgroups, and hence many possible social network subgroup definitions. Cohesive subgroups are theoretically important because of social forces operating through direct contact among subgroup members, through indirect conduct transmitted via intermediaries, or through the relative cohesion within as compared to outside the subgroup. The concept of the social group can be studied by looking at properties of subsets of actors within a network. In social network analysis, the notion of subgroup is formalized by the general property of cohesion among subgroup members based on specified properties of the ties among the members. However, since the property of cohesion of a subgroup can be quantified using several different specific network properties, cohesive subgroups can be formalized by looking at many different properties of the ties among the subsets of actors.

There are four general properties of cohesive subgroups that have influenced social network formalizations of this concept. These are: the mutuality of ties, the closeness or reachability of subgroup members, the frequency of ties among members, and the relative frequency of ties among subgroup members compared to non-members. Subgroups based on mutuality of ties require that all pairs of subgroup members choose each other; subgroups based on reachability require that all subgroup members be reachable to each other, but not necessarily adjacent; subgroups based on numerous ties require that subgroup members have ties to many others within the subgroup; and subgroups based on the relative density or frequency of ties require that subgroups be relatively cohesive compared to the remainder of the network. These subgroup ideas lead to methods that focus on different social network properties.
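The first of these properties, mutuality of ties, corresponds to the classical clique definition: a maximal subset of actors in which every pair is adjacent. The sketch below enumerates such subgroups with a basic Bron-Kerbosch recursion. It is only an illustration under stated assumptions: the network is undirected and stored as a dictionary of neighbor sets, the minimum subgroup size of three is a common convention rather than something fixed by the text, and the function name and toy network are hypothetical.

```python
def maximal_cliques(graph, min_size=3):
    """Return all maximal complete subgraphs with at least min_size members.

    graph maps each actor to the set of actors it is tied to (undirected).
    """
    cliques = []

    def expand(current, candidates, excluded):
        # current: actors already in the clique
        # candidates: actors adjacent to everyone in current, still to try
        # excluded: actors adjacent to everyone in current, already handled
        if not candidates and not excluded:
            if len(current) >= min_size:
                cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical toy network: two triangles joined by the tie c-d.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
subgroups = maximal_cliques(toy)
members = set().union(*subgroups) if subgroups else set()
print(sorted(sorted(s) for s in subgroups))
print("non-members:", sorted(set(toy) - members))
```

The definitions based on reachability, numerous ties, or relative density relax the requirement that every pair be adjacent, so they would need different procedures from the one sketched here.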
The result of a cohesive subgroup analysis is a list of subsets of actors within the network who meet the specified subgroup definition. For a given analysis it might be the case that no subsets of actors meet the specified subgroup definition, or it might be the case that there are numerous subsets of actors that meet the specified subgroup definition. There are three levels at which one might interpret the results of a cohesive subgroup analysis. These levels are the individual actor, the subset of actors and the whole group. In terms of individual actors, the simplest distinction is
between actors who belong to one or more cohesive subsets, and actors who do not belong to any cohesive subset. One can make a distinction between members and non-members.

Many methods for the description of network structural properties are concerned with the notions of social position and role. In social network terms, these translate into procedures for analyzing actors' structural similarities and patterns of relations in multi-relational networks. These are the positional and role analysis methods. Although these methods are mathematically and formally diverse, they share a common goal of representing patterns in complex social network data in simplified form to reveal subsets of actors who are similarly embedded in networks of relations and to describe the associations among relations in multi-relational networks. The major task here is to locate subsets of actors who are similar across the collection of relations. Similarity will be defined in terms of the equivalence of actors with respect to some formal mathematical property. The formal mathematical property specifies which actors will be grouped together in a network position. We can think of positional analysis as mapping actors into equivalence classes, where ideally an equivalence class consists of all actors who are identical on the specified mathematical property. Structural, Isomorphic, Automorphic and Regular Equivalences are examples of formal mathematical properties for defining equivalence classes. Once equivalence classes (or positions) of actors have been identified, the ties between these positions must be described. Image matrices, density tables and block models are common ways to model ties between positions.

These social network approaches have been used for the last twenty years in different areas, including sociology, organizational psychology, and anthropology. They are used to solve social structural puzzles at any level of analysis: at the level of the individual, group or organization. Networks can be analyzed at different levels and at different time periods.

1.3.5. Review Work on Social Networks

In recent years the social network approach has been increasingly applied in computer science disciplines. With the proliferation of web technologies, there is an
increasingly greater amount of interaction by people while on the Internet. Social networks bring a multi-disciplinary approach to solving problems in this domain. They shed new light on old problems and give a new way of looking at them. This section discusses social network related work in this discipline.

The Internet poses new questions about the nature of social networks and opens new perspectives for social network analysis [9, 10]. In particular, the hyper-textual structure of the web makes linking explicit. As a consequence, as Jackson [11] points out, social network analysis is ideally suited to the web environment. A number of studies have analyzed patterns of linking on the World Wide Web. Adamic [4] showed that the web is a small world network. Gibson et al. [12] developed a technique to identify hyperlinked communities in Web environments, which includes the identification of hubs (strong central points with high numbers of outbound links) and authorities (highly referenced pages).

Another interesting area of the Internet is Internet Relay Chat (IRC). It is a system that allows groups of people to collaborate and chat from anywhere in the world. Mutton [13] describes a method of inferring the social network of a group of IRC users in a channel. An IRC bot is used to monitor a channel and perform a heuristic analysis of events to create a mathematical approximation of the social network. From this, the bot can produce a visualization of the inferred social network on demand. These visualizations reveal the structure of the social network, highlighting connectivity, clustering and strengths of relationships between users. Animated output allows viewers to see the evolution of the social network over time.

More recently, social network analysis methods have begun to be applied to weblogs. Weblogs (blogs), web-based journals in which entries are displayed in reverse chronological sequence, are a recent addition to the repertoire of Computer Mediated Communication (CMC) technologies through which people can socialize online. Kumar et al.
[14] observe and model temporally-concentrated bursts of connectivity within blog communities over time, concluding that blogspace has been expanding rapidly since the end of 2001, "not just in metrics of scale, but also in metrics of community structure and connectedness." Marlow [15] uses social network analysis to identify authoritative blog authors, and compares them with measures of opinion leadership and authority in the popular press. The blogosphere
has been claimed to be a densely interconnected conversation, with bloggers linking to other bloggers, referring to them in their entries, and posting comments on each other's blogs. This study empirically investigates the extent to which, and in what patterns, blogs are interconnected, taking randomly-selected blogs. Quantitative social network analysis and visualization of link patterns show that A-list blogs are overrepresented and central in the network, although other groupings of blogs are more densely interconnected.

Another area that has gained a lot of popularity is security. The security assurance of networks is critical to the whole modern world. As the increasingly common network security incidents show, current network security approaches are not sufficient. Zhao et al. [16] present a general framework for intelligent analysis and monitoring of the security of network information content in high-speed networks. The system can intelligently gather and transform various channels of non-structured, semi-structured and structured data based on broadband network, carry out security assurance related characteristic selection and topic identification, and perform social network analysis of email. The system can help information security experts find the association rules in the results from various analyzing levels, visualize association patterns by their relational structures using link analysis techniques, and provide early warning to system administrators.

Peer-to-Peer networks can be seen as truly distributed computing systems. Each peer is both a client and a server in these networks. A reasonable trust construction approach for these systems comes from social network analysis. Zhang et al. [17] propose a recommendation-based global trust model for Peer-to-Peer networks, which is easy to implement. In their model, a peer's trust information is defined by its past transactions with other peers.
Each peer's global reliability is decided by two factors: one is the reliability of the peer that it transacts with, the other is the corresponding recommendation degree provided by the transaction peer. A peer's trust value is calculated from the in-degree, the corresponding weight (recommendation degree) and the recommending peer's trust value. They also introduce some security mechanisms into this model to defend against several attacks, such as tampering, pretending, slander and exaggeration.

Social networks are also useful for judging the trustworthiness of outsiders. Boykin & Roychowdhury [18] propose an automated anti-spam tool that exploits the properties
of social networks to distinguish between unsolicited commercial e-mail (spam) and messages associated with people the user knows. This technique is predicated on recognizing the unique characteristics inherent to social networks. The natural instinct to form close-knit social networks operating in cyberspace has been exploited to provide an effective and automated spam-filtering algorithm. They use the quantitative definition of the clustering coefficient that involves counting the fraction of a node's neighbors that are also each other's neighbors.

1.4. MOTIVATION

With the increase in communication bandwidth and the shrinking of distances, it has become increasingly possible for various communities of interest to form and work together with a significant reduction in time and cost. Understanding the formation and sustenance of such groups has become a topic of great interest in recent times. The growth of the Internet and of computer mediated communications such as emails, newsgroups, online discussion forums and Internet Relay Chats has revolutionized the formation of such communities of interest. Weblogs and numerous social networking websites like Facebook, Orkut, and so on are a recent addition to the repertoire of computer mediated communications. Social networking web services are internet applications that help connect friends, business partners or just about anyone by allowing users to have profiles and blogs, interact with others, join or create communities and much more. This revolution in communication has shifted the techniques for understanding such dynamics from social and psychological perspectives to the domain of computer science. Social Network Analysis using such techniques has become a powerful tool for such analysis, and this thesis is a contribution towards the study of applications of social networks for the evolution of communities of interest and their dynamics.

A variety of indicators could be identified for each community. These include the medium, common interests and purported goals of the group. The basic idea of community can be applied to a variety of networks.
Communities in a citation network might represent related papers on a single topic, communities on the web might represent people with similar profiles, and hidden communities might represent potential suspicious activity. In order to bring these to the fore, several
illustrative examples have been utilized in the thesis. The Enron Dataset is one such email corpus. The IISc co-authorship data is another example.

In the social networks of large groups of people, such as companies and organizations, formal hierarchies with titles and lines of authority are established to define the responsibilities and order of power within that group. Also in recent times, email has been established as an indicator of collaboration and knowledge exchange in organizations. Hence the analysis of an organizational email archive to detect the role this hierarchy plays in the context of the communication patterns within an organization is an interesting research question. Specifically, we address the problem of mapping a set of employees in an organization to a set of broad titles existing in the organization from the email communication patterns of these employees. We analyze how well the hierarchy of an organization is reflected in the email communication among its employees. We have used the Enron email corpus for illustrative purposes.

Any institution that provides opportunities for communication among its members is eventually threaded by communities of people who share similar goals. These communities are the informal networks of collaboration that grow naturally within the organization. They coexist along with the formal structure of an organization. Hence we explore the email communication patterns of Enron to detect the evolution of these informal communities existing in the organization. The work is based on the premise that the rich temporal record of these informal networks can reflect the structural patterns of the network in the long run and help us to detect events leading to a crisis within the organization. The email is considered an indicator of communication in a hierarchically set organization and its complex interactions have been analyzed in this thesis to study the evolution of groups in Enron.
Various social network analytic features have been developed from the email corpus to aid the analysis of the temporal evolution of the communication and the formation of groups, and to correlate them to major events that occurred in the organization leading to bankruptcy. The methodologies developed have the potential for integration into early warning systems.

Apart from emails as an indicator of collaboration, in an academic institution like the Indian Institute of Science the co-publication of papers is a strong indicator of
collaboration and knowledge exchange. A relationship between two authors exists if they have co-authored a paper together. The communities of practice that arise out of these relationships are dynamic in nature and constantly evolve over time with the entering and exiting of authors from the community as well as with the growing and decreasing amount of collaborations. The study of the evolution of these communities over time can provide tremendous insight on the behavior of communities and the flow of information among them. Identifying the portions of the network that are changing and characterizing the type of change are the challenges addressed in the last part of the thesis. The database of publications by a leading academic institution like the Indian Institute of Science has been chosen for understanding the formation of scientific groups. The co-authorship data collected from the corpus of publications has been used to identify transitions associated with evolving communities. We have used this data to study the evolution of these communities, their formation, transitions and dissolution. This study can be extremely useful for characterizing effectively the changes to the network over time.

1.5. ORGANIZATION OF THE THESIS

The rest of the thesis is organized into four chapters. We give a chapter-wise organization of these four chapters.

Chapter II

In this chapter we address the problem of mapping a set of employees in an organization to a set of broad titles existing in the organization from the email communication patterns of these employees within the organization. It is our interest to know how well the hierarchy of an organization is reflected in the email communication among its employees. For our experiments we use the publicly available Enron email corpus and extend it by including the title information. We also describe the different versions of the Enron email corpus available and our own modifications to it. This chapter focuses on comparing the various classifiers developed in the data mining disciplines to that of a Social Network based classifier.
We implement Naïve Bayes, Logistic Regression, Nearest Neighbor classifier, Neural network based
classifier, C4.5 decision tree, and support vector machine (SVM) classifiers. From the social network perspective, we use a classifier based on the regular equivalence definition. The performance of the various classifiers is compared based on their overall accuracy. The chapter concludes with a discussion of the results.

Chapter III

In this chapter we explore the email communication patterns of Enron to detect the evolution of the informal communities existing in the organization. Enron is a real-world organization that faced a severe survival-threatening crisis leading to its bankruptcy filing in December 2001. Hence, we utilize the rich temporal record of these informal networks to reflect the structural patterns of the networks in the long run and help us better understand the factors leading to its bankruptcy. We identify the various events leading to the fall of Enron and study how the informal networks have changed over a period of time to reflect these events. Various social network metrics like density, centrality measures, reachability, and cohesive subgroups were calculated at different points in time to enable the detection of the informal networks as well as to observe their sensitivity to the events relating to the crisis. The findings of this study provide a better understanding of the underlying causes of the organizational failure.

Chapter IV

We built a co-authorship network from the IISc publication data collected by NCSI over a duration of ten years. This chapter focuses on the study of the evolution of communities within this co-authorship network. In this chapter we define and identify five transitions that a community can undergo during the period of its existence. We also define algorithms for identifying each of these transitions. The methodology developed is generic and can be applied to other kinds of communities as well. We attempt to characterize the changes undergone by these communities over time.

Chapter V

This chapter gives a summary of the results and suggestions for future work.
Chapter 2

ORGANIZATIONAL HIERARCHY FROM E-MAIL COMMUNICATION

2.1. INTRODUCTION

In today's world, communicating with others via e-mail has become an integral part of life. It is hard to find a college student, professional, or any educated person, who does not send and/or receive e-mails. Also, it is an established fact that a lot of the communication that occurs within companies and organizations is nowadays done by e-mails, rather than memos or common bulletin boards. As such, any average person today heavily depends on e-mail as a fast, reliable and efficient means of communication.

Given the wide scale usage of e-mail, there is a widespread interest in the storage, retrieval and analysis of e-mail traffic. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. Most studies so far have used personal collections of the people working on the experiments. These sets have been incredibly small, on the order of one to five users. In May 2002, FERC publicly released a large e-mail corpus of an organization called Enron. The Enron e-mail corpus is appealing to researchers because it represents a rich temporal record of e-mail communication within a large, real-world organization facing a severe and survival-threatening crisis.

One current area of research in the underlying social network of e-mail communications is the inference of properties of the network. Email communications between individuals imply a relationship between those individuals. It can be formal, such as a manager-employee relationship, or informal, such as friendship relationships. Although these relationships are known within the social network, they are not readily available outside the group. This provides an interesting challenge for analysts to infer these relationships by exploring the e-mail archives. A similar problem can be seen in the identification of key individuals in a social network. If we are interested in identifying who the manager of a particular group is and we do not have explicit documentation of this information, we might try to
analyze the e-mail relationships to infer this information.

In social network analysis the associated methods are often concerned with the dual notions of social position and social role. In social network analysis position refers to a collection of individuals who are similarly embedded in networks of relations, while role refers to the patterns of relations which obtain between actors or between positions. The notion of position thus refers to a collection of actors who are similar in social activity, ties or interactions, with respect to actors in other positions. These methods share a common goal of representing patterns in complex social network data in simplified form to reveal subsets of actors who are similarly embedded in networks and to describe the associations among relations in these networks. This analysis is referred to as positional and role analysis and is the focus of this chapter.

In the social networks of large groups of people, such as companies and organizations, formal hierarchies with titles and lines of authority are established to define the responsibilities and order of power within that group. Although this information may be readily available for individuals within the group, the context this hierarchy provides in communications is often not available to those outside the organization. In this chapter, we define the problem of inferring this formal hierarchy in the context of organizational e-mail archives. We illustrate our results on the widely used Enron e-mail dataset and extend it with information about the formal organization hierarchy.

The objective of this chapter is two-fold. First, we aim to find out how well the e-mail communication can be used to probe into the hierarchy of the organization. Second, we want to compare the performance of various classifiers, including a social network based classifier, for inferring this hierarchy given the e-mail communication as input.

2.2. PROBLEM DESCRIPTION

Let A = {a} be the set of actors (people who send and receive e-mail in an organization) and let T = {t} be the set of formal titles used within the organization.
The objective is to construct a mapping from the set of known actors in set A to a single title in set T. The set T consists of the broad titles present in the organization
rather than specific titles. For example, Vice President of Finance and Vice President of Operations are generalized to Vice President.

2.3. BACKGROUND

There has been a lot of work on the discovery and analysis of positions and roles of individuals within a social network. As more and more people join online communities, the ability to better understand members' roles becomes critical to preserving and improving the health of those communities. Identifying key members and their roles in an online community is an important way to support the needs of an online community. Nolker & Zhou [19] propose a novel approach to identifying key members and their roles by discovering implicit knowledge from online communities. They focus on an open discussion bulletin board and have identified three roles that are important to this type of online community, namely Leaders (individuals that are in a position to spread knowledge and provide cohesiveness and consistency), Motivators (individuals that keep conversation going) and Chatters. They view this online community as a social network connected by member-member relationships. To understand conversation and its interplay with relationships, the relationship-based measures from the social network paradigm (i.e., degree, betweenness, and closeness) were combined with behavior-based measures from the information retrieval realm (i.e., TF-IDF) to determine the key members. This approach was tested on real world data collected from a Usenet bulletin board over a one year period. It was shown that this approach was able to identify prominent members whose behaviors are community supportive and filter chatters whose behaviors are superficial to the online community.

To assist law enforcement and intelligence agencies in ascertaining terrorist network knowledge efficiently and effectively, Memon & Larsen [20] propose a framework of automated analysis, visualization and destabilization of terrorist networks.
Based on this framework, they have developed a prototype called iMiner that incorporated several techniques, including social network analysis, for automatically detecting cells from a network, identifying various roles in a network (e.g., central members,
gatekeepers, and followers), and may also help law enforcement assess the effect on the network of capturing or killing a terrorist in the network. They treat a terrorist network as an undirected graph. Using degree centrality and eigenvector centrality from SNA, the undirected graph was converted into a directed graph. They define a new centrality measure called dependence centrality. The dependence centrality (DC) of a node is defined as how much that node is dependent on any other node in the network. Their approach involves converting the directed graph into a hierarchical chart using this newly proposed centrality measure. From this hierarchical chart, it is possible to distinguish the leaders and peripheries in the network in order to destabilize the network. The key players have low dependence centrality (DC) as they have a large number of direct links with other nodes of the network and they do not depend on others to communicate with those nodes.

Gloor [21] describes the application of temporal link and content analysis to the Enron e-mail dataset. TeCFlow permits the extraction of dynamic movies of the evolution of social networks, identifying gatekeepers and other central actors, as well as the generation of temporally correlated cluster maps of e-mail content. This approach combines social network analysis with information mining and permits intelligence analysts to easily identify patterns of potentially suspicious actors and activities in large e-mail and other communication archives.

McCallum et al. [22] presented an LDA approach to topic and role discovery within the network. They present the Author-Recipient-Topic (ART) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. They illustrate the results on the Enron e-mail corpus and a researcher's e-mail archive. They also describe an extension of the ART model that explicitly captures roles of people, by generating role associations for the authors and recipients of a message, and conditioning the topic distributions on role assignments.
These person-conditioned topic distributions are used to measure similarity between people, and thus discover people's roles by clustering using this similarity. They have termed this model the Role-Author-Recipient-Topic (RART) model.

Shetty & Adibi [23] propose an Entropy model that identifies the most interesting and important nodes in a graph. This model has been successfully tested and evaluated
on the Enron e-mail dataset. The entropy model can identify an entity or a set of entities whose removal has the maximum effect on the command chain, by measuring the effect of the removal on the graph entropy, and thus provides a ranked list based on such effect.

Zhou et al. [24] introduce a two-phase framework for the problem of leadership discovery in an organization based on the e-mail communication history among the employees. They first evaluate the leadership factors between pairs of individuals and construct the directed weighted social network. Then the organizational structure is discovered from the social network obtained. Two heuristic metrics are proposed for evaluating pair-wise leadership factors among a group of employees. One metric uses the sender-receiver imbalance and the other builds on the group information inferred from e-mail group lists.

2.4. METHODOLOGY

Email archives can be classified as a collection of unstructured, structured and numeric data. Unstructured data in e-mail consists of fields like the subject and body, which allow for natural language processing approaches like bag-of-words, stemming and stop word removal. Although we are able to discover formal titles for individuals using this method, this approach often does not provide title information for all individuals. Although in some cases title information is included in e-mail communications, such as in e-mail signatures, in general this information is already known within the group and is unnecessary to repeat. Numeric data in e-mail includes such features as the message size, number of recipients, and counts of particular characters. The use of this kind of data for our current problem is limited. Structured data includes fields such as to and from. These differ from unstructured text fields in that the type of data which can be used in them is very well defined. The structured data of e-mail archives will be the focus of this work.
The only pieces of information used from each e-mail are the names of the sender and receiver (i.e., the to: and from: fields), enabling the processing of a large number of e-mails while minimizing privacy concerns. Schwartz & Wood [25] generated a social network using the To and From fields of e-mail messages to discover users of a particular interest and field.
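To make the use of the structured fields concrete, the sketch below turns a list of (sender, recipients) pairs into a directed, weighted network and derives simple per-actor counts from it. It is a minimal illustration and not the processing pipeline used in this work: the function names, the sample addresses, the choice to ignore self-addressed copies, and the use of message counts as edge weights are all assumptions made for the example.

```python
from collections import Counter


def build_email_network(messages):
    """Count messages per directed (sender, recipient) pair.

    messages is an iterable of (sender, list_of_recipients) tuples taken
    from the from: and to: fields only; bodies and subjects are not used.
    """
    edges = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:  # assumption: drop self-addressed copies
                edges[(sender, recipient)] += 1
    return edges


def degree_features(edges):
    """Return {actor: (messages_sent, messages_received)} from the edge counts."""
    sent, received = Counter(), Counter()
    for (sender, recipient), weight in edges.items():
        sent[sender] += weight
        received[recipient] += weight
    actors = set(sent) | set(received)
    return {actor: (sent[actor], received[actor]) for actor in actors}


# Hypothetical sample data; the addresses are placeholders.
sample = [
    ("manager@example.com", ["analyst@example.com", "trader@example.com"]),
    ("analyst@example.com", ["manager@example.com"]),
    ("manager@example.com", ["analyst@example.com"]),
]
network = build_email_network(sample)
for actor, (n_sent, n_received) in sorted(degree_features(network).items()):
    print(actor, "sent:", n_sent, "received:", n_received)
```

Per-actor counts of this kind are one possible input for the classifiers discussed next, while the edge counts themselves define the network on which positional analysis could operate.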
Given the e-mail communication between the employees and their title information, we classify individuals to their broad titles using the structured part of the e-mail communication. As our first attempt we tried different categories of traditional classifiers to identify which approach performs the best. For this, we have used the eight classifiers given in Table 2.1. Since the performance of these classifiers can be affected by the parameters used, we present the results for these algorithms using only the best performing parameters we found during the experiments. This provides a good estimate for their performance. As our second attempt, we used a positional analysis technique based on the regular equivalence definition, implemented purely from a social network perspective. We call this the social network classifier. We evaluate the overall accuracy of classification for comparing the results of the different classifiers.

Table 2.1 Classification Algorithms used

Classifier                   Abbreviation
Naïve Bayes                  Naïve Bayes
Logistic Regression          Log. Regr.
1-Nearest Neighbor           1-NN
10-Nearest Neighbor          10-NN
Decision Tree                C4.5
Rule-based learner           Ripper
Artificial Neural Network    ANN
Support Vector Machine       SVM
Social Network Classifier    SNC

2.4.1. Classification Methods

This section briefly describes the various traditional classification methods used in order to categorize the individuals based on their broad titles in the organization. The supervised methods used are Naïve Bayes, Logistic Regression, Nearest Neighbor, C4.5 Decision Trees, Ripper, Artificial Neural Networks and Support Vector Machines.
Naïve Bayes Classifier

Naïve Bayes classifiers can handle an arbitrary number of independent variables, whether continuous or categorical. Given a set of variables, $X = \{x_1, x_2, \ldots, x_d\}$, we want to construct the posterior probability for the event $C_j$ among a set of possible outcomes $C = \{c_1, c_2, \ldots, c_d\}$. $X$ is the set of predictors and $C$ is the set of categorical levels present in the dependent variable. Using Bayes' rule:

\[ p(C_j \mid x_1, x_2, \ldots, x_d) \propto p(x_1, x_2, \ldots, x_d \mid C_j)\, p(C_j) \tag{2.1} \]

where $p(C_j \mid x_1, x_2, \ldots, x_d)$ is the posterior probability of class membership, i.e., the probability that $X$ belongs to $C_j$. Since Naïve Bayes assumes that the conditional probabilities of the independent variables are statistically independent, we can decompose the likelihood to a product of terms:

\[ p(X \mid C_j) = \prod_{k=1}^{d} p(x_k \mid C_j) \tag{2.2} \]

and rewrite the posterior as:

\[ p(C_j \mid X) \propto p(C_j) \prod_{k=1}^{d} p(x_k \mid C_j) \tag{2.3} \]

Using Bayes' rule above, we label a new case $X$ with a class level $C_j$ that achieves the highest posterior probability.

Although the assumption that the predictor (independent) variables are independent is not always accurate, it does simplify the classification task dramatically, since it allows the class conditional densities $p(x_k \mid C_j)$ to be calculated separately for each variable, i.e., it reduces a multidimensional task to a number of one-dimensional ones. In effect, Naïve Bayes reduces a high-dimensional density estimation task to one-dimensional kernel density estimation. Furthermore, the assumption does not seem to greatly affect the posterior probabilities, especially in regions near decision boundaries, thus leaving the classification task unaffected. Naïve Bayes can be modeled in several different ways including normal, lognormal, gamma and Poisson density functions [26]:
Normal ($\mu_{kj}$: mean; $\sigma_{kj}$: standard deviation):

$$p(x_k \mid C_j) = \frac{1}{\sigma_{kj}\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu_{kj})^2}{2\sigma_{kj}^2}\right), \quad -\infty < x < \infty,\ -\infty < \mu_{kj} < \infty,\ \sigma_{kj} > 0$$

Lognormal ($m_{kj}$: scale parameter; $\sigma_{kj}$: shape parameter):

$$p(x_k \mid C_j) = \frac{1}{x\,\sigma_{kj}\sqrt{2\pi}} \exp\!\left(-\frac{[\log(x/m_{kj})]^2}{2\sigma_{kj}^2}\right), \quad 0 < x < \infty,\ m_{kj} > 0,\ \sigma_{kj} > 0$$

Gamma ($b_{kj}$: scale parameter; $c_{kj}$: shape parameter):

$$p(x_k \mid C_j) = \frac{(x/b_{kj})^{c_{kj}-1}}{b_{kj}\,\Gamma(c_{kj})} \exp\!\left(-\frac{x}{b_{kj}}\right), \quad 0 \le x < \infty,\ b_{kj} > 0,\ c_{kj} > 0$$

Poisson ($\lambda_{kj}$: mean):

$$p(x_k \mid C_j) = \frac{\lambda_{kj}^{x} \exp(-\lambda_{kj})}{x!}, \quad \lambda_{kj} > 0,\ x = 0, 1, 2, \ldots \tag{2.4}$$

Logistic Regression

Logistic regression is a method of classification belonging to a family of models known as generalized linear models. It is a well-known method used for determining the relationship between feature variables and a binary response variable. Modeling with logistic regression is very similar to multiple linear regression methods when the response is binary. We are interested in modeling the probabilities of our two classes using a linear function of the feature variable x.

$$\log\!\left[\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)}\right] = \beta_0 + \sum_{j=1}^{p} \beta_j x_j \tag{2.5}$$

For k classes, k − 1 linear functions in x are required. A common way of assessing the influence of the binary response is to look at the log-odds [27]. For logistic regression, the functional form of the prediction is a log-odds (equation 2.5). The left-hand side of the equation is the log-odds of one class versus the other, where Y = 1 corresponds to the class label for the class of interest. The log transformation maps the predicted probabilities to $(-\infty, \infty)$. The criterion used to estimate $\beta_0, \beta_1, \ldots, \beta_p$ is usually maximum likelihood. We define the likelihood function as the product of the probabilities of the observed responses, given a vector of inputs, $x_i$. Each $x_i$ has length p + 1 with the first entry being 1 (for the intercept). The log-likelihood takes the log of this function, as seen in equation 2.6. The response is $y_i$ and $\beta = [\beta_0, \beta_1, \ldots, \beta_p]^T$.

$$l(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log\!\left(1 + e^{\beta^T x_i}\right) \right] \tag{2.6}$$

We then set the derivative of this function to zero to maximize the log-likelihood and to obtain non-linear score equations. These must be solved iteratively using the Newton-Raphson algorithm. Logistic regression is a standard procedure that is very reliable and in many problems can provide competitive prediction accuracy. That is, it almost always provides a usable working base model.

Nearest Neighbor Classifier

Given the training data $D = \{x_1, x_2, \ldots, x_n\}$ as a set of n labeled examples, the nearest neighbor classifier assigns a test point x the label associated with its closest neighbor in D. Closeness is defined using a distance function. For all points x, y and z, a distance function $D(\cdot, \cdot)$ must satisfy the following properties:

Non-negativity: $D(x, y) \ge 0$.
Reflexivity: $D(x, y) = 0$ if and only if $x = y$.
Symmetry: $D(x, y) = D(y, x)$.
Triangle inequality: $D(x, y) + D(y, z) \ge D(x, z)$.

If the second property is not satisfied, the distance function $D(\cdot, \cdot)$ is called a pseudometric. A general class of metrics for d-dimensional patterns is the Minkowski metric

$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

also referred to as the $L_p$ norm. Given the distance function, the nearest neighbor classifier partitions the feature
space into cells consisting of all points closer to a given training point than to any other training points. All points in such a cell are labeled by the class of the training point, forming Voronoi cells of the feature space. In two dimensions, the nearest neighbor algorithm leads to a partitioning of the input space into Voronoi cells, each labeled by the class of the training point it contains. In three dimensions, the cells are three-dimensional, and the decision boundary resembles the surface of a crystal (Figure 2.1). The k-nearest neighbor classifier classifies x by assigning it the label most frequently represented among the k nearest samples. In other words, a decision is made by examining the labels on the k nearest neighbors and taking a vote. The k-nearest neighbor query grows a spherical region around the test point x until it encloses k training samples, and it labels the test point by a majority vote of these samples. In Figure 2.2, for k = 5, the test point will be labeled as black.

Figure 2.1 Voronoi cells in 2-D and 3-D

Figure 2.2 K-Nearest Neighbor
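The nearest neighbor rule described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the function names `minkowski` and `knn_predict` and the toy training points are assumptions made for the example, and ties in the vote are resolved arbitrarily.

```python
from collections import Counter


def minkowski(x, y, p=2):
    """Minkowski (L_p) distance between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def knn_predict(train, query, k=1, p=2):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs (hypothetical toy format).
    """
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three "black" points and two "white" points.
train = [
    ((0.0, 0.0), "black"), ((0.1, 0.2), "black"), ((0.2, 0.1), "black"),
    ((1.0, 1.0), "white"), ((0.9, 1.1), "white"),
]

print(knn_predict(train, (0.8, 0.9), k=1))  # closest single point decides
print(knn_predict(train, (0.8, 0.9), k=5))  # all five points vote
```

With k = 1 the query takes the label of its single closest point, whereas with k = 5 every toy point votes and the majority class wins, which is why the two calls can disagree.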
Decision Trees

Decision trees, also known as recursive partitioning, are another type of data mining model. An example of such a tree can be found in Figure 2.3. Unlike logistic regression or other generalized linear models, decision trees do not require specification of a functional form. Instead, the relationship between inputs and response is approximated by a series of piecewise constant functions corresponding to a tree structure. Classification trees (for binary responses) partition or split the data into sets. Each branch in a classification tree is a split of the input data. At each split, one feature variable is examined and a left/right branching decision is made based on some threshold value. To classify an observation in our data (i.e. a node), we simply follow a path down our tree until we reach a terminal leaf and obtain a prediction for that leaf. (This prediction is a probability of belonging to either the interesting class or the uninteresting one.) If a predicted class is desired, this is taken to be the class with the largest predicted probability. In Figure 2.3, our fictitious data is first partitioned into cases with values for Feature 1 above or below 0.60. The left side of the tree (Feature 1 ≤ 0.60) is next split on the value of Feature 2. If Feature 2 < 1.44, we partition yet again, and so on. On the right side of the tree (Feature 1 > 0.60), the cases are split depending on the value of Feature 3. If Feature 3 > 6.09, a terminal leaf is reached and the final vote decides that the node should be classified into Class A.

Figure 2.3 Example of Decision tree
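Following a path down a tree to a terminal leaf can be sketched as below. Only the three thresholds (0.60, 1.44, 6.09) and the Class A leaf for Feature 3 > 6.09 come from the description of Figure 2.3; the other leaf labels, the dictionary layout, the helper names and the convention that values equal to a threshold go left are assumptions made purely for illustration.

```python
def leaf(label):
    """Terminal leaf carrying a predicted class label."""
    return {"label": label}


def split(feature, threshold, left, right):
    """Internal node: go left if value <= threshold, otherwise go right."""
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right}


def predict(node, x):
    """Follow one root-to-leaf path and return the label of the leaf reached."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


# Thresholds follow the text; leaves marked hypothetical are NOT from Figure 2.3.
tree = split(
    "feature1", 0.60,
    left=split("feature2", 1.44,
               left=leaf("B"),    # hypothetical (the text partitions further here)
               right=leaf("A")),  # hypothetical
    right=split("feature3", 6.09,
                left=leaf("B"),   # hypothetical
                right=leaf("A")), # stated in the text: Feature 3 > 6.09 -> Class A
)

print(predict(tree, {"feature1": 0.9, "feature2": 0.0, "feature3": 7.0}))
```

The loop examines exactly one feature per level, so classification cost grows with tree depth rather than with the number of training cases.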
To grow a tree, a greedy growing algorithm is employed to recursively subdivide tree nodes. To subdivide a tree node, the algorithm loops over all features. For each feature, we loop again over all possible splits on the feature and calculate an impurity measure for each split, such as deviance (−2 times the binomial log-likelihood). The split that provides the lowest impurity in the two child leaves is recorded and the next feature is then examined. The feature/split combination with minimum impurity is selected, and the tree leaf is divided into two child leaves. This process is repeated until a full tree has been grown. The final tree model will be a simplified (or "pruned") version of the full tree. As with tree-growing, an impurity measure must be used to decide how to prune the tree. The growing criterion used in our study is based on log-likelihood. It would also be reasonable to use a log-likelihood for pruning. Large trees with many splits can overfit the data, whereas a tree that is too small can miss important splits. An optimal tree size can be selected by growing a full tree first and then pruning back using the cost-complexity measure, $C_\alpha(T)$ [27]

$$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} I(y_i \ne k(m)) + \alpha |T| \tag{2.7}$$

where $R_m$ is a region (or set) corresponding to the m-th terminal leaf, $|T|$ is the number of terminal leaves, $\alpha$ is a complexity cost per terminal leaf, $k(m)$ is the predicted class label for terminal leaf m of the tree and $I(y_i \ne k(m))$ is an indicator function that equals 1 when the i-th observation does not belong to class $k(m)$. The first term in equation 2.7 is the misclassification rate and can be thought of as a bad-fit penalty, and the second term is a penalty for complexity or tree size. Larger values of the cost-complexity parameter $\alpha$ will often lead to more pruning. Tree size is an important tuning parameter, which we used to select an appropriate tree model.

Rule-based Classifiers

Each classification rule is of the form r: (Condition) → y. The LHS of the rule (Condition) is called the rule antecedent or precondition, and is a conjunction of attribute tests. The RHS, also called the rule consequent, is the class label.
The classifier model arrives at a rule set of the form $R = \{r_1, r_2, \ldots, r_n\}$. A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule. There are two approaches
to generate an initial set of rules, the direct method and the indirect method. The direct method extracts rules directly from data. Examples include RIPPER, CN2, Holte's 1R, and Boolean reasoning. The indirect method extracts rules from other classification methods (e.g. decision trees). An example is C4.5 rules. Rules are pruned and simplified and are ordered to obtain a rule set R. Rule set R can be further optimized. We used the direct-method rule-based classifier Ripper. For a 2-class problem, Ripper chooses one of the classes as the positive class and the other as the negative class. It learns rules for the positive class, and the negative class is considered the default class. For a multi-class problem, we need to order the classes according to increasing class prevalence (fraction of instances that belong to a particular class), learn the rule set for the smallest class first and treat the rest as the negative class. This process is repeated with the next smallest class as the positive class. FOIL's Information Gain is used to compare the performance of a rule r before and a rule r′ after adding a new conjunct, and it is defined as t [ log2(p1/(p1 + n1)) − log2(p0/(p0 + n0)) ], where t is the number of positive instances covered by both r and r′, p0 and n0 are the numbers of positive and negative instances covered by r, and p1 and n1 are those covered by r′. In order to grow a rule, it starts from an empty rule, adds conjuncts as long as they improve FOIL's information gain and stops when the rule no longer covers negative examples. The rule is then pruned immediately using incremental reduced error pruning. The following is the measure for pruning: v = (p − n) / (p + n), where p is the number of positive examples covered by the rule in the validation set and n is the number of negative examples covered by the rule in the validation set. The pruning method deletes any final sequence of conditions that maximizes v.

Neural Network based Classifier

The basic unit of a neural network is a perceptron. The perceptron computes a single output from multiple real-valued inputs by forming a linear combination according to its input weights and then possibly putting the output through some nonlinear activation function.
Mathematically this can be written as

$$y = \varphi\!\left(\sum_{i=1}^{n} w_i x_i + b\right) = \varphi(w^T x + b) \tag{2.8}$$

where w denotes the vector of weights, x is the vector of inputs,
b is the bias and $\varphi$ is the activation function. A signal-flow graph of this operation is shown in Figure 2.4. A single perceptron is not very useful because of its limited mapping ability. No matter what activation function is used, the perceptron is only able to represent an oriented ridge-like function. The perceptrons can, however, be used as building blocks of a larger, much more practical structure.

Figure 2.4 Signal flow graph of a single perceptron

A typical multilayer perceptron (MLP) network consists of a set of source nodes forming the input layer, one or more hidden layers of computation nodes, and an output layer of nodes. The input signal propagates through the network layer by layer; a network with one hidden layer is shown in Figure 2.5.

Figure 2.5 Signal flow graph of a multi-layer perceptron
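As a small illustration of equation 2.8, the sketch below evaluates a single perceptron. The function names, the step activation and the weights chosen to realize a logical AND are assumptions made for this example and are not taken from the experiments.

```python
import math


def perceptron(w, x, b, phi=math.tanh):
    """Single perceptron: activation applied to the weighted sum w.x + b."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)


def step(a):
    """Hard-threshold activation (hypothetical choice for this example)."""
    return 1.0 if a >= 0 else 0.0


# Hypothetical weights and bias realizing a logical AND of two binary inputs.
w, b = [1.0, 1.0], -1.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(w, x, b, step) for x in inputs]
print(outputs)  # only the input (1, 1) lies on the positive side of the ridge
```

A linearly separable function such as AND fits the single oriented ridge of one perceptron; a function such as XOR does not, which is one way to see why hidden layers are needed.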
The computations performed by such a feed-forward network with a single hidden layer with nonlinear activation functions and a linear output layer can be written mathematically as

$$x = f(s) = B\varphi(As + a) + b \tag{2.9}$$

where s is a vector of inputs, x is a vector of outputs, A is the matrix of weights of the first layer, a is the bias vector of the first layer, B is the weight matrix and b is the bias vector of the second layer. The function $\varphi$ denotes an element-wise nonlinearity. The model can be generalized to more hidden layers. In multilayer networks, the activation function is often chosen to be the logistic sigmoid $1/(1 + e^{-x})$ or the hyperbolic tangent $\tanh(x)$. These functions are used because they are mathematically convenient and are close to linear near the origin while saturating rather quickly away from the origin. The training phase of a Multilayer Perceptron (MLP) when used as a classifier involves adapting all the weights and biases (A, B, a and b in equation 2.9) to their optimal values for the given pairs $(s(t), x(t))$. The criterion to be optimized is typically the squared reconstruction error

$$\sum_{t} \left\| f(s(t)) - x(t) \right\|^2$$

The supervised learning problem of the MLP can be solved with the backpropagation algorithm. The algorithm consists of two steps. In the forward pass, the predicted outputs corresponding to the given inputs are evaluated as in equation 2.9. In the backward pass, partial derivatives of the cost function with respect to the different parameters are propagated back through the network. The chain rule of differentiation gives very similar computational rules for the backward pass as the ones in the forward pass. The network weights can then be adapted using any gradient-based optimization algorithm. The whole process is iterated until the weights have converged.

Support Vector Machine (SVM)

The goal of SVM is to produce a model which predicts target value of data instances
in the testing set given only their attributes. Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where $x_i \in R^n$ and $y \in \{1, -1\}^l$, the support vector machines (SVM) require the solution of the following optimization problem:

$$\min_{w, b, \xi}\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \tag{2.10}$$

Here training vectors $x_i$ are mapped into a higher (maybe infinite) dimensional space by the function $\phi$. Then SVM finds a linear separating hyperplane with the maximal margin in this higher dimensional space. $C > 0$ is the penalty parameter of the error term. Furthermore, $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is called the kernel function. The following are the four basic kernels:

linear: $K(x_i, x_j) = x_i^T x_j$
polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d,\ \gamma > 0$
radial basis function (RBF): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2),\ \gamma > 0$
sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

Here $\gamma$, r and d are kernel parameters.

Social Network Classifier

This classifier uses the REGE algorithm, which belongs to the positional analysis techniques of social networks. One of the major objectives of a positional analysis is to simplify the information in the network data set. This simplification consists of a representation of the network in terms of the positions identified by an equivalence definition. Actors who are equivalent are assigned to the same equivalence class or position. Structural equivalence, automorphic equivalence and regular equivalence are all examples of equivalence definitions. The REGE algorithm is based on the regular equivalence definition. The notion of regular equivalence [28] formalizes the observation that actors who occupy the same social position relate in the same ways with other actors who are
themselves in the same positions. Regular equivalence does not require actors to have identical ties to identical other actors. If actors i and j are regularly equivalent, and actor i has a tie to/from some actor k, then actor j must have the same kind of tie to/from some actor l, and actors k and l must be regularly equivalent. This way of defining equivalence is very well suited for our present case of inferring hierarchy, as we are trying to find people occupying similar positions in the network as opposed to people physically close together in the network. The regular equivalence approach is suitable because it provides a method for identifying "roles" from the patterns of ties present in a network. Rather than relying on attributes of actors to define social roles and to understand how social roles give rise to patterns of interaction, regular equivalence analysis seeks to identify social roles by identifying regularities in the patterns of network ties, whether or not the occupants of the roles have names for their positions.

Definition: If actors i and j are regularly equivalent, $i \overset{RE}{\equiv} j$, then for all relations $X_r$, $r = 1, 2, \ldots, R$, and for all actors $k = 1, 2, \ldots, g$, if $i \xrightarrow{X_r} k$ then there exists some actor l such that $j \xrightarrow{X_r} l$ and $k \overset{RE}{\equiv} l$, and if $k \xrightarrow{X_r} i$ then there exists some actor l such that $l \xrightarrow{X_r} j$ and $k \overset{RE}{\equiv} l$.

Figure 2.6 shows an example to illustrate regular equivalence [27]. In this figure, there are three regular equivalence classes. The first is actor A, the second is composed of the three actors B, C, and D, and the third is composed of the remaining five actors E, F, G, H, and I. The five actors (E, F, G, H, and I) are regularly equivalent to one another because they have no tie with any actor in the first class (that is, with actor A) and each has a tie with an actor in the second class (either B or C or D). Each of the five actors, then, has an identical pattern of ties with actors in the other classes. Actors B, C, and D form a class because they each have a tie with a member of the first class (that is, with actor A) and they each have a tie with a member of the third class.
B and D actually have ties with two members of the third class, whereas actor C has a tie to only one member of the third class; this doesn't matter, as there is a tie to some member of the third class. Actor A is in a class by itself, defined by a tie to at least one member of class two and no tie to any member of class three.
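The three-class example can be checked mechanically. The sketch below tests whether a given partition is regular by comparing, for every actor in a class, the set of classes it sends ties to and the set of classes it receives ties from. The exact tie list and the tie directions are assumptions chosen to be consistent with the description of Figure 2.6 (A tied to B, C and D; B and D each tied to two members of the third class, C to one); the function name `is_regular` is likewise hypothetical.

```python
def is_regular(ties, classes):
    """Return True if `classes` (actor -> class id) is a regular partition.

    `ties` is a set of directed (source, target) pairs on a single relation.
    Actors in one class must reach the same set of classes and be reached
    from the same set of classes.
    """
    def profile(actor):
        sent = frozenset(classes[t] for s, t in ties if s == actor)
        received = frozenset(classes[s] for s, t in ties if t == actor)
        return sent, received

    seen = {}
    for actor, cls in classes.items():
        p = profile(actor)
        if seen.setdefault(cls, p) != p:
            return False
    return True


# Assumed ties, consistent with the textual description of Figure 2.6.
ties = {("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "E"), ("B", "F"), ("C", "G"), ("D", "H"), ("D", "I")}

classes = {"A": 1, "B": 2, "C": 2, "D": 2,
           "E": 3, "F": 3, "G": 3, "H": 3, "I": 3}

print(is_regular(ties, classes))

# Moving C into the third class breaks regularity.
wrong = dict(classes, C=3)
print(is_regular(ties, wrong))
```

Note that the check ignores how many ties an actor has to a class and looks only at whether at least one exists, which mirrors the remark that C having a single tie to the third class does not matter.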
Figure 2.6 Regular Equivalence Example

The measure of regular equivalence used by REGE is as follows:

$$M_{ij}^{t+1} = \frac{\displaystyle\sum_{k=1}^{g} \max_{m=1}^{g} \sum_{r=1}^{R} M_{km}^{t} \left( {}_{ij}M_{kmr}^{t} + {}_{ji}M_{kmr}^{t} \right)}{\displaystyle\sum_{k=1}^{g} \max_{m}{}^{*} \sum_{r=1}^{R} \left( {}_{ij}Max_{kmr} + {}_{ji}Max_{kmr} \right)} \tag{2.11}$$

$M_{ij}^{t+1}$ is the estimate of the degree of equivalence for actors i and j at iteration t+1. This quantity is a function of how well i's ties to and from all actors can be matched by j's ties to and from all actors and vice versa. The quantity ${}_{ij}M_{kmr} = \min(x_{ikr}, x_{jmr}) + \min(x_{kir}, x_{mjr})$, where $x_{ikr}$ is the value of the tie from actor i to actor k on relation $X_r$, quantifies how well i's ties to and from a specific actor k can be matched by j's ties to and from some actor m on the relation $X_r$. Since k and m may not be perfectly regularly equivalent, this quantity is weighted by the estimated regular equivalence of k and m from the previous iteration ($M_{km}^{t}$), and summed across relations. To locate the best matching m for i's ties to k we need to find the maximum value of ${}_{ij}M_{kmr}$ for $m = 1, 2, \ldots, g$. In the above equation the numerator does exactly this for all actors k and m adjacent to actors i and j. The denominator of the equation is the maximum possible value of the numerator when all of actor i's ties to and from its neighbors and all of actor j's ties to and from its neighbors are perfectly matched and all their neighbors are regularly equivalent. The maximum possible match on a relation $X_r$ is given by ${}_{ij}Max_{kmr} = \max(x_{ikr}, x_{jmr}) + \max(x_{kir}, x_{mjr})$. This value pertains to a particular m in the numerator; we must use the same m in the denominator. This is specified
by $\max_{m}^{*}$. The quantity in equation 2.11 ranges from 0 to 1. The equivalence of each pair of actors is updated at each iteration, based on the equivalence of other pairs of actors in the network. This procedure is run several times before an estimate of pairwise regular equivalence is accepted. The next step towards classification is assigning actors to equivalence classes. This is done by partitioning actors into subsets so that actors within each subset are closer to being equivalent than are actors in different subsets. We use hierarchical clustering for this purpose.

2.5. ENRON E-MAIL DATASET

2.5.1. Different versions of the Enron E-Mail Corpus

The release of the Enron e-mail dataset in 2002, as a part of the Federal Energy Regulatory Commission (FERC) investigation into Enron's accounting, provided researchers a unique glimpse of e-mail communications inside a major corporation. Consisting of 619,446 messages from the e-mail accounts of over 150 individuals, the Enron dataset is now released in several forms. The Enron corpus has been extensively used for various purposes since it was made available to the public. Since then, researchers have modified the corpus as per their needs or used a particular subset of the corpus for different purposes, thereby creating different versions of the original Enron corpus. Below we give a brief description of the different versions available and how we modified the dataset for the present work. This is just a short compilation of the work done by other researchers on the Enron corpus. We shall begin with a description of the original corpus, followed by variations of the original corpus.

Original Corpus (As distributed by William Cohen)

This corpus was originally distributed by William Cohen in March 2004. This corpus is almost identical to the one made public by the Federal Energy Regulatory Commission, without the integrity problems that were present. It is a huge corpus, containing 517,431 distinct e-mail messages. Attachments have been excluded. The
size of the tarred gzipped file is 400 MB. This gives us an idea of the volume of data that it contains. This corpus maintains the original folder structure and its hierarchies. It contains e-mail messages exchanged between 151 users over a period of three and a half years. The e-mail messages have been organized into 150 user folders, with numerous sub-folders. The foldering has been done according to each user. Hence, every user has a folder named after him/her. Within this folder, the individual foldering strategy of the user has been maintained. The total number of folders present in the corpus exceeds 4700. The corpus is available for download at William Cohen's Enron page http://www.cs.cmu.edu/~enron/

Klimt and Yang corpus

Klimt & Yang [29] from Carnegie Mellon were amongst the first people to work on the Enron corpus. They also wrote a paper introducing the Enron corpus. They went through the entire corpus and eliminated the duplicate messages that it contained. Most of the duplicate messages were found in computer-generated folders like Inbox, Sent Items, etc. They removed these computer-generated folders from the corpus. Only those folders that were created by the users themselves were preserved. This cleaned corpus contains 200,399 distinct e-mail messages, distributed over 158 users. It should be noted that this is just about one-third the size of the original corpus. In other words, approximately 62 percent of the original corpus is made up of duplicate e-mails.

Bekkerman corpus

This corpus was created by Ron Bekkerman [30] and his group from the University of Massachusetts, Amherst. They used a subset of the original Cohen corpus. Instead of using the entire corpus, they only used e-mail messages of seven top management personnel of the Enron Corporation. The seven people were selected based on the number of e-mails present in the user folders. They removed all the non-topical folders from the e-mail database of these seven people.
They also eliminated user-specific archiving folders, which were created due to certain circumstances like lack of time, or some user-specific strategy, rather than content. They also removed folders that contained fewer than three e-mail messages, since they were very small and would not help either in training or testing. Another unique
approach that they used was flattening folder hierarchies. They reduced the depth of the folders to just two. The first level contained individual folders for each of the seven users, while the second level contained actual top-level directories created by the users themselves. All messages contained in any further sub-directories were brought under this level. Hence, the Bekkerman corpus now contains a total of 273 folders. There are 20,581 e-mail messages. The smallest folders contain 3 e-mail messages, whereas the largest folder contains 1398 e-mails. The seven preprocessed datasets can be downloaded from Ron Bekkerman's web page on Enron at http://www.cs.umass.edu/~ronb/enron_dataset.html

Corrada-Emmanuel corpus

The Corrada-Emmanuel corpus [31] was also derived from the original Cohen corpus. This corpus was created by Andres Corrada-Emmanuel from the University of Massachusetts, Amherst. He created various mappings between e-mails within the Enron corpus. These include mappings of e-mails to relative paths, authors and recipients. He studied the relationship between user folders and e-mail addresses of various users in detail and concluded that there are actually only 149 users in the Enron corpus. He created a mapping between the top folders in the corpus and his normalization for an author's e-mail address. Corrada-Emmanuel also wrote Python scripts to extract word lists from the Enron corpus. The various MD5 files, his mapping files and Python scripts can be downloaded from http://ciir.cs.umass.edu/~corrada/enron/

Fiore and Hearst corpus

Andrew Fiore and Marti Hearst, from the University of California at Berkeley, have created a powerful search interface for the Enron e-mail database. This web-based interface searches the corpus, stored in a MySQL database of unique e-mails from the Enron corpus, for e-mail messages containing a given term. The results obtained can be sorted according to the date, sender, recipient, subject, e-mail address, etc.
More about this corpus is available on Marti Hearst's web page http://www2.sims.berkeley.edu/courses/is290-2/f04/assignments/assignment4.html
Shetty and Adibi corpus

This version of the corpus was created by Jitesh Shetty and Jafar Adibi [32], from the University of Southern California. Their version is interesting because they have used it to study and analyze social networks. In this context, social network refers to the types of professional relationships between the employees of the Enron Corporation. Shetty & Adibi aimed to understand the types of inter-personal relationships between Enron employees: who corresponded with whom, the level of communication between top management and other employees, etc. Shetty & Adibi used the corpus distributed by William Cohen and created a MySQL database for the entire corpus. Shetty & Adibi have also cleaned the corpus by eliminating blank or duplicate e-mails and e-mails that contained junk data or were returned to the sender due to some transaction failure. This corpus contains 252,759 e-mail messages exchanged between 151 users. These e-mail messages are present in around 3000 user-defined folders. They then created four relation tables, namely Employee List, Message, Recipient Info and Reference Info. They have used these relation tables to study and analyze social networks contained in the Enron organization. More detailed information about their corpus is available at http://www.isi.edu/~adibi/enron/enron.htm.

2.5.2. Enron Related Work

The Enron corpus has primarily been used by the Statistical NLP community as a test bed and new benchmark for techniques and algorithms for the automated processing and search of a large-scale collection of textual data. One common practice across the various NLP projects is the automated identification of foldering strategies, topics, specific individuals, events, and communication threads and patterns from streaming data that might indicate a threat or risk. These pieces of information are then filtered and investigated in greater detail. In the following section, a brief review of these projects is given, pointing out links between research deploying NLP and SNA. Klimt & Yang [29] and Bekkerman et al.
[30] built classifiers for predicting the user's organization of e-mails into user-defined folders. Klimt & Yang [29] provide the
baseline results of a state-of-the-art classifier, Support Vector Machines, under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights. Bekkerman et al. [30] made use of four supervised learning methods, namely Naïve Bayes, Maximum Entropy, Support Vector Machines and Wide-Margin Winnow, in order to classify the e-mail messages from their corpus. Kulkarni & Pedersen [33] have worked in the field of automatic classification of e-mail messages using unsupervised methods. They have made use of SenseClusters for this purpose. Minkov et al. [34] have made use of the Enron corpus to work on Named Entity Recognition for informal documents like e-mail. They present two methods for improving the performance of person name recognizers for e-mail: e-mail-specific structural features and a recall-enhancing method which exploits name repetition across multiple documents. Concealed relations can often be a source of interest, especially in the domain of counter-terrorism, where relations fostering malicious activities tend to be secretive or concealed from the general public. Pathak et al. [35] propose a technique for extracting concealed relations and communities from social network data. The technique analyzes actors' perceptions regarding other actors' social interactions and requires that they can be constructed from the social network data. They use the Enron corpus to test their theories. Their work uses the popular and robust tf-idf measure from the information retrieval literature to quantify the concept of concealment. Deception theory suggests that deceptive writing is characterized by reduced frequency of first-person pronouns and exclusive words, and elevated frequency of negative emotion words and action verbs. Keila & Skillicorn [36] apply this model of deception to the Enron e-mail dataset, and then apply singular value decomposition to elicit the correlation structure between e-mails.
Those e-mails that have high scores using this approach include deceptive e-mails; other e-mails that score highly using these frequency counts also indicate organizational dysfunctions such as improper communication of information. Hence this approach can be used as a tool for both external investigation of an organization, and internal management and
regulatory compliance. They investigated the applicability of structural features of e-mails, such as message length and the usage and frequency of certain words, to detect patterns of unusual communication. They were able to identify an Enron-specific vocabulary, relate some employees' formal positions to the use of certain words, and based on this information determine relations among those positions and employees. Their results could serve as a point of comparison for formal and emergent structures identified with SNA techniques. Beyond the NLP studies, the growing body of research into the Enron corpus includes publications that take a network analytic perspective. Shetty & Adibi [32] and Yener & Chapanond [37] generated an undirected social network that represents 151 and 150 Enron employees, respectively. They considered mutual exchange of at least 5 [32] and 30 [37] e-mails between any pair of individuals as a link. The resulting networks do not refer to a specific point in time, but show a consolidated snapshot of the dataset. Shetty & Adibi [32] also took the people's formal positions into account, but did not further analyze the network. Yener & Chapanond [37] report network analytic measures on their graph. McCallum et al. [22] combined social network information extracted from sender-recipient relations with information on the topic of e-mails, which they identified by statistical analysis of word distributions, into the ART model. They extended the ART model by determining people's roles (RART model) and showed experimentally that this combination of evidence provides a better prediction of similarities among people with the same roles than traditional block modeling. Berry & Browne [38] tracked and extracted topics and then clustered messages to identify critical happenings and individuals. They applied a non-negative matrix factorization approach for the extraction and detection of concepts or topics from electronic mail messages. Sparse term-by-message matrices are encoded and a low-rank non-negative matrix factorization algorithm is used to preserve natural data non-negativity.
The resulting basis vectors and matrix projections of this approach can be used to identify and monitor underlying semantic features (topics) and message clusters in a general or high-level way without the need to read individual electronic mail messages. Priebe et al. [39] introduced the idea of scan statistics on directional graphs. Scan
statistics can be thought of as moving window analysis, where we scan a portion of the data and then calculate some local statistic for that portion. The scan statistic is then defined as some function (e.g. the maximum) of the local statistics for all portions of the data. In studies involving dynamic graphs, the graph is scanned over periods of time and at each time window (e.g. a week) a local statistic or feature of the graph is computed. Their research on the Enron dataset was to observe whether a graph was relatively homogeneous, or if there were certain regions of excessive activity somewhere along the graph. Network analysts could use agents or data points identified in such ways to study properties of social networks at critical or unusual states. From a social network analysis perspective, Diesner et al. [40] studied structural properties of the graph and attempted to identify key players in the scandal across time. To observe changes in the Enron network over time, they compared two graphs: one from October 2000, when Enron was seemingly on a successful path, and one from October 2001, when Enron was undergoing a major crisis. Their research indicated that during the Enron crisis, the graph was much more connected, dense and centralized than during a non-crisis time. Also, there is clear evidence showing executives forming a small community with sparse interactions with other Enron employees. Using measures such as higher e-mail involvement in the graph, they were able to conclude that, in general, senior management employees were more likely to be key players involved in the scandal.

2.5.3. Modified Enron Dataset (Our Corpus)

Our corpus is derived from the Shetty & Adibi corpus. We found that this corpus largely meets our requirements, and hence we used this version as the starting point and cleaned it further to fit our needs. Figure 2.7 shows the schematic of this database [32].
Although there are thousands of e-mail addresses in the Enron archive, belonging to an equally large number of individuals and groups within Enron, the communications for a majority of those are not fully observable. That is, the communication for the majority of e-mail addresses consists only of the e-mail messages sent to and from the roughly 151 individuals whose accounts were subpoenaed in the investigation. Thus the only fully observable e-mail addresses,
the addresses for which we can claim to hold a majority of to and from traffic, are those which belong to this subset of individuals. Since our research is focused on identification of titles using the structured data, we use only this subgroup. Thus our corpus contains the same four relational tables as created by Shetty & Adibi [32], namely Employee List, Message, Recipient Info and Reference Info, but the number of e-mails is restricted to those exchanged among the 151 employees present in the Employee List table.

[Figure 2.7: Enron Database Schema]

Given below is a brief description of each relational table:

1. Employee List: This table contains information about every employee whose e-mail messages are present in the corpus.
2. Message: This table contains information about the e-mail message, its sender, the subject, the body of the e-mail and other details.
3. Recipient Info: This table contains information about the recipient(s) of the e-mail messages. It also tells us whether the message was sent directly To the recipient or whether it was sent as CC or BCC.
4. Reference Info: This table contains information about e-mail messages that have been used as a reference in other e-mails. This also includes messages that have been forwarded or replied to.
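The restriction of the corpus to e-mails exchanged among the listed employees can be illustrated with a minimal sketch. It is not the code used in this work: the record types `Message` and `Recipient`, their field names, and the function `restrict_to_employees` are hypothetical simplifications of the Message and Recipient Info tables.

```python
from collections import namedtuple

# Hypothetical, simplified records; the real tables carry more columns.
Message = namedtuple("Message", ["mid", "sender"])
Recipient = namedtuple("Recipient", ["mid", "address", "rtype"])


def restrict_to_employees(messages, recipients, employee_addresses):
    """Keep only e-mails whose sender AND recipient are listed employees."""
    employees = set(employee_addresses)
    # Messages sent by a listed employee.
    internal_ids = {m.mid for m in messages if m.sender in employees}
    # Recipient rows of those messages that also point at a listed employee.
    kept_recipients = [
        r for r in recipients
        if r.mid in internal_ids and r.address in employees
    ]
    # A message survives only if at least one recipient row survives.
    kept_ids = {r.mid for r in kept_recipients}
    kept_messages = [m for m in messages if m.mid in kept_ids]
    return kept_messages, kept_recipients


if __name__ == "__main__":
    msgs = [Message(1, "a@enron.com"), Message(2, "x@other.com")]
    rcps = [Recipient(1, "b@enron.com", "TO"), Recipient(2, "a@enron.com", "TO")]
    kept_m, kept_r = restrict_to_employees(msgs, rcps, ["a@enron.com", "b@enron.com"])
    print(len(kept_m), len(kept_r))  # 1 1
```

Filtering on both ends of each message is what shrinks both tables: a message from an outside address, or one addressed only to outside addresses, is dropped.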
Thus, the modified Message table consists of 21254 rows and the Recipient Info table consists of 50572 rows. The corpus is therefore reduced to almost one-fourth of its original size, but it gives a better picture of communication among the top employees of Enron.

2.5.4. Enron Hierarchy Data

Our next step of modification consists of assigning unique job titles in order to track people's job positions through the organizational hierarchy. As the basis for these modifications we gathered data pertaining to the employees in the database that was obtainable from several public sources. We utilized a data file generated by Shetty & Adibi [41] from Federal Court documents with individuals' job titles. The provided spreadsheet maps names of 151 Enron employees to broad titles within Enron. The list of employees in this dataset does not correspond directly to the list of employees whose accounts were used for the dataset, but does cover a majority of them. The set is largely incomplete, with 40 of the 151 individuals listed simply as Employee and 29 listed as N/A. Since our experiments require a mapping from individuals to their titles within the company, we needed to gather additional information about the individuals. For this purpose, we collected published documents from FERC [42] consisting of employees' names, jobs, locations, names of their supervisors, business units and trade relations. We have been able to identify the following six broad titles to form the set T. These broad titles are shown in Table 2.2. Experimental results have been evaluated on this modified dataset.

Table 2.2 Broad Titles in Enron

Title    Description
CO       CEO, President, COO
VP       Vice President
DIR      Director
SPEC     Specialist
MGR      Manager
ASSOC    Associate
2.6. EXPERIMENTS AND RESULTS

The focus of our experiment is the classification of individuals to their formal titles using the structured data of e-mail messages. For our experiments, the structured data we concentrate on are the counts of the e-mails to and from the archive base. We evaluate the classifier performance on counts of e-mails exchanged between individuals. The titles are used as the target attribute for the classifier. The attribute variables for the classifier are the individuals themselves, and the value of each attribute for an individual is the number of e-mails exchanged between that individual and the attribute individual. We evaluate the performance of the classifiers and datasets in terms of their overall accuracy in classifying an individual to their broad title. The overall accuracy is the percentage of correctly classified instances. Figure 2.8 shows the overall performance of the classifiers.

[Figure 2.8: Performance of Classifiers. x-axis: Classification Algorithms (Naïve Bayes, Log. Regr., 1-NN, 10-NN, C4.5, Ripper, ANN, SVM, SNC); y-axis: % of Accuracy]

It is found that the 10-Nearest Neighbor classifier gives the lowest accuracy of 25.8%, while the Naïve Bayes and Neural Network based classifiers give an accuracy of 26.7%. Logistic Regression does slightly better than these two, with 27.5% accuracy. The 1-Nearest Neighbor classifier gives an accuracy of 29.6%. Of the traditional classifiers implemented, C4.5 Decision Tree and Ripper perform best, with accuracies of 36.4% and 36.2% respectively. Support Vector Machine also provides an accuracy of 34.3%, comparable to that of C4.5 and
Ripper. We find that the Social Network classifier that we implemented gives an overall accuracy of 34.8%. This is comparable to that of C4.5 and Ripper. It also does marginally better than Support Vector Machine. Table 2.3 gives the accuracy of all the classifiers.

Table 2.3 Overall Accuracy of Classifiers

Classifier                     Accuracy
Naïve Bayes                    26.7%
Logistic Regression            27.5%
1-Nearest Neighbor             29.6%
10-Nearest Neighbor            25.8%
Decision Tree                  36.4%
Rule-based learner             36.2%
Artificial Neural Network      26.5%
Support Vector Machine         34.3%
Social Network Classifier      34.8%

Although the Social Network based classifier is not the best performing classifier, our experimental results show that its capabilities are comparable to those of the best performing traditional classifiers, and it performs marginally better than most of them. The better performance of the Social Network based classifier can be attributed to the inherent way in which regular equivalence is defined. Regular equivalence, defined earlier, inherently tries to capture the structural hierarchy in the network and hence forms an excellent choice for inferring the hierarchy.

2.7. DISCUSSION OF RESULTS

Although the accuracy of even the best classifier (C4.5) is only 36.4%, to find out whether the e-mail communication in an organization is in reality capable of reflecting the hierarchy of that organization, we need to look at the performance of the classifiers at each class level. In this section we present the number of correctly classified and incorrectly classified employees at each class level for all the classifiers. For each pair of title and classifier we have built Table 2.5, in the format shown in Table 2.4, to show the correctly vs. incorrectly classified members of a title with respect to a classifier. The following measures are reported for every title-classifier pair:

1. The fraction of the number of correctly classified members of a title to
the total number of members belonging to that title. This gives an idea of how well the classifier classifies that title.
2. The actual number of correctly classified members of that title.
3. The fraction of the number of incorrectly classified members of a title to the total number of members belonging to that title.
4. The actual number of incorrectly classified members of a title who have been misclassified as belonging higher in the hierarchy.
5. The actual number of incorrectly classified members of a title who have been misclassified as belonging lower in the hierarchy.

Table 2.4 Example Format for Correctly Vs Incorrectly classified statistics for a title-classifier pair

Classifier       Title 1
Classifier No.   (Correctly classified members of Title 1) / (Total members belonging to Title 1)     Number of correctly classified members of Title 1
                 (Incorrectly classified members of Title 1) / (Total members belonging to Title 1)   Number of incorrectly classified members of Title 1 misclassified into higher hierarchy
                                                                                                      Number of incorrectly classified members of Title 1 misclassified into lower hierarchy

The fraction of correctly classified members for DIR ranges from 0.37 to 0.53, and for VP it varies from 0.44 to 0.60. For ASSOC, this value varies from 0.37 to 0.55. On the other hand, for titles in the middle of the hierarchy, it ranges from 0 to 0.17 for SPEC and from 0 to 0.20 for MGR. It is seen that the classifiers perform best on the titles higher in the hierarchy, namely CO, VP and DIR, and on the title lowest in the hierarchy, namely ASSOC. Misclassification is high for titles in the middle of the hierarchy, namely MGR and SPEC. This may be attributed to the fact that we have not considered the temporal nature of the hierarchy. It is highly probable that an individual held multiple titles during the communication period that we have considered. We have aggregated the data collected over the three-year time period and assigned one title per individual over this entire time frame. The performance of the classifiers could have been affected by this
generalization. Also, the better performance of the classifiers on titles higher in the hierarchy could be owing to the fewer promotions observed in these positions. In the case of ASSOC, this is the entry-level position, and these employees may not have had the opportunity to hold other positions.

Table 2.5 Correctly Vs Incorrectly Classified Statistics for every title-classifier pair
Rows per classifier: (a) fraction and number correctly classified; (b) fraction incorrectly classified and number misclassified higher in the hierarchy; (c) number misclassified lower in the hierarchy. A dash marks a direction that does not exist for that title.

Classifier    Row   CO         VP         DIR        SPEC       MGR        ASSOC
Naïve Bayes   (a)   0.07  1    0.56 14    0.50 15    0     0    0     0    0.37 10
              (b)   0.93  -    0.44  0    0.50 12    1.0  21    1.0  22    0.63 17
              (c)        14         11          3          4          7          -
Log. Regr.    (a)   0.20  3    0.48 12    0.40 12    0.08  2    0.07  2    0.41 11
              (b)   0.80  -    0.52  0    0.60 14    0.92 10    0.93  9    0.59 16
              (c)        12         13          4         13         18          -
1-NN          (a)   0.20  3    0.52 13    0.40 12    0.16  4    0.10  3    0.37 10
              (b)   0.80  -    0.48  2    0.60 12    0.84  6    0.90  7    0.63 17
              (c)        12         10          6         15         19          -
10-NN         (a)   0.20  3    0.44 11    0.40 12    0.08  2    0.10  3    0.37 10
              (b)   0.80  -    0.54  1    0.60 12    0.92  9    0.90  7    0.63 17
              (c)        12         13          6         14         19          -
C4.5          (a)   0.33  5    0.60 15    0.50 15    0.12  3    0.10  3    0.52 14
              (b)   0.67  -    0.40  1    0.50 10    0.88  7    0.90  7    0.48 13
              (c)        10          9          5         15         19          -
Ripper        (a)   0.33  5    0.56 14    0.53 16    0.16  4    0.10  3    0.45 12
              (b)   0.67  -    0.44  2    0.46 10    0.84  6    0.90  7    0.55 15
              (c)        10          9          4         15         19          -
ANN           (a)   0.33  5    0.40 10    0.33 10    0.08  2    0.07  2    0.41 11
              (b)   0.67  -    0.60  2    0.67 15    0.92 10    0.93  7    0.59 16
              (c)        10         13          5         13         20          -
SVM           (a)   0.20  3    0.48 12    0.46 14    0.20  5    0.17  5    0.41 11
              (b)   0.80  -    0.52  1    0.54 11    0.80  6    0.83  7    0.59 16
              (c)        12         12          5         14         17          -
SNC           (a)   0.40  6    0.56 14    0.37 11    0.12  3    0.14  4    0.55 15
              (b)   0.60  -    0.44  3    0.63 12    0.88  8    0.86  9    0.45 12
              (c)         9          8          7         14         16          -

Of particular interest are the misclassifications of individuals to titles. This misclassification seems to occur mainly with titles close to the correct title in the hierarchy. For example, in Table 2.5, DIR is mainly misclassified as VP or SPEC, which are close to DIR in the hierarchy, and not as ASSOC, which comes much lower in the hierarchy. This trend is consistent among all the titles. This implies that
the e-mail communication is indeed capable of reflecting the hierarchy in the organization.

In our search for reasons for the lower performance of the classifiers, we divided the entire time period of approximately three years into thirty-five one-month partitions ranging from June 1999 to April 2002. We then constructed the counts of e-mails exchanged between the employees for each month. These are given as input to our Social Network based classifier, and the accuracy of the classifier is calculated for each month and plotted. We have also plotted the total number of e-mails considered for classification in each month, which gives a picture of the activity in that month. Both quantities are plotted in Figure 2.9 in order to compare how the accuracy of classification changes with the variation in activity.

[Figure 2.9: Variation of the Performance of the Social Network Classifier with respect to the Emails Considered. x-axis: month (tick labels from Jun 1999 to Dec 2001); y-axes: Emails Considered for Classification and Performance of the Social Network Classifier]

In general, if the number of e-mails is larger, the training set is larger and accordingly the accuracy also goes up. This is indeed the case until June 2001. The highest performance of 45.1% was achieved in April 2001, when the total number of e-mails considered
increased to 1800. After June 2001, although there was an enormous increase in the total e-mails exchanged, we found that the accuracy of the classifier decreased rapidly and reached 35% in October 2001, when the activity reached its peak. Incidentally, this is also the performance of our classifier when applied to the aggregated data. This observation is not so surprising, since the activity during this time accounts for a large part of the aggregated data.

The period after June 2001 shows an anomaly in the hierarchy of the organization. These results suggest a disassociation in the formal roles of individuals and a significant increase in the presence of informal networks in the organization. These are the informal networks of collaboration that naturally grow and coalesce within organizations. Despite their lack of official recognition, these informal networks coexist with the formal structure of the organization and are also reflected in the e-mail communication. It is well known that the presence of informal networks in significant measure indicates that the established organization structure is getting less rigid. It is interesting to note that the period of anomaly coincides with the period when the crisis broke out in Enron. The crisis could also have resulted in the disassociation of the hierarchy. This is the focus of the next chapter, where we concentrate on identifying these informal networks and use their presence as an indicator for detecting the crisis.

2.8. CONCLUSIONS & FUTURE WORK

In this chapter we have identified e-mails as a strong indicator of communication in an organization and demonstrated their capability of reflecting the hierarchy of an organization like Enron. We have classified the employees to their broad titles within Enron using the structured part of the organizational e-mail archives.
We have compared the performance of a Social Network based classifier with the traditional data mining classifiers and found that this Social Network Classifier seems to perform marginally better than most of the traditional classifiers, thus showing the advantage of using Social Network based methods for classification of this kind. Although mapping to a single title is the focus of our work, the problem can be generalized to mapping an individual to multiple titles. Such mappings are useful in
cases where a single individual might have multiple formal titles in a company at a time or over a period of time (say, due to promotions). Moreover, this addresses the temporal aspect of an organization's social networks, where titles of individuals may change over time. Our work towards this goal has been limited by the unavailability of information regarding multiple positions held by individuals in Enron.

The period after June 2001 is clearly one of antiphase variation between the percentage accuracy of our social network classifier and the number of e-mails. Such antiphase variations may be precursors either to an organizational restructuring or to an impending crisis. It is worth noting that, from the catalogue of events that took place in Enron [40], it is clear that the latter is likely to be true in the case of Enron. This requires an in-depth analysis of the events that occurred during the period for which we have the e-mail data, and this is taken up in Chapter 3.

One of the important observations of this chapter is that the percentage accuracy of the social network classifier, which is a measure of the ordered and structural hierarchy of an organization, together with the number of communications (in this instance e-mails) between the stakeholders, can become an excellent precursor for predicting the possible onset of unplanned restructuring and the resultant crisis in an organization.
Chapter 3

ORGANIZATIONAL CRISIS DETECTION FROM E-MAIL COMMUNICATION

3.1. INTRODUCTION

Our probing into the reasons for the lower (than expected) performance of the classifiers in detecting the hierarchy of the organization has led us to explore Enron further and take a closer look at the structure and patterns of e-mail communication in Enron. Enron is a real-world organization that faced a severe, survival-threatening crisis. It is desirable to understand the various factors that led to the fall of Enron in 2001. It has been established that e-mail has become an indicator of collaboration in companies and organizations. Hence we believe that the rich temporal record of e-mail communications within an organization can reflect the working of the organization and help us better understand the factors leading to its bankruptcy. In this chapter our aim is to study how the informal networks present in an organization, obtained from a social network perspective on the e-mail communication patterns, help in detecting the events related to the crisis.

Our work is motivated by the need in real-world situations to identify patterns of potentially suspicious actors and activities or to track critical happenings. One way of doing this is by monitoring the information exchange that is associated with such situations. It is based on the premise that communication patterns emerge through social interactions in response to particular events. The communicative activities exhibit a structural pattern of the network in the long run. The communicative patterns may be examined and used as an indicator to determine an organization's information processing capacity. The underlying assumption is that structure affects substantive outcomes and that structures are emergent. Social Network Analysis facilitates the analysis of this structure.

Any institution that provides opportunities for communication among its members is eventually threaded by communities of people. These communities are the informal networks of collaboration that naturally exist within organizations. These
communities have attracted much interest as a way to uncover the structure and communication patterns within an organization and to understand the reality of how people gather information and execute their tasks. These informal networks coexist with the formal structure of the organization. Although they are not officially recognized, informal networks can provide effective ways of learning and enhancing the productivity of the formal organization. Hence the detection of these informal networks over time has the ability to facilitate event detection. Our focus is on the importance of these changing informal networks with the escalating crisis.

In this chapter we aim first to detect informal communities, then to study their evolution, and to use these derived communities as an indicator of events (cause or effect). We assume that when an event occurs in an organization, the informal networks reflect this event. For comparison we also analyze the sensitivity of some of the social network features to these events. Although these features reflect the general escalation of the crisis, we see that the resulting informal communities give promising results that may help us better understand the interplay between network structure and function. The methodologies developed along these lines have the potential for integration into monitoring or early warning systems. The findings of this study provide valuable insight into a real-world organizational crisis, which may be further used for validating or developing theories and dynamic models of organizational crises, thereby leading to a better understanding of the underlying causes of, and response to, organizational failure.

3.2. BACKGROUND

The ability to detect community structure in a network clearly has major implications in a variety of applications. The ability to identify these communities could help us to exploit these networks more effectively. The simplest way to cluster a network is to cut some links until the network is no longer connected. However, cutting links haphazardly is unlikely to give useful results.
Several methods therefore exist for finding the most appropriate links to remove, so that the resulting components form meaningful communities.
One of the most successful algorithms, introduced by Girvan & Newman [43], is based on edge betweenness, which measures the fraction of all shortest paths passing through a given link. By removing links with high betweenness, we can progressively split the whole network into disconnected components, until the network is decomposed into communities consisting of a single node each. Girvan & Newman generalized Freeman's betweenness centrality to edges and defined the edge betweenness of a link as the number of shortest paths between pairs of vertices that run along it. If a network contains communities or groups that are only loosely connected by a few inter-group edges, then all shortest paths between different communities must go along one of these few edges. Thus, the edges connecting communities will have high edge betweenness. By removing these edges, we separate groups from one another and so reveal the underlying community structure. The major drawback of this algorithm is its computational cost. Calculation of link betweenness is the most computationally intensive part of the algorithm. This calculation needs to be repeated every time a link is removed, as the betweenness of all the other links is affected.

Qian et al. [44] present an algorithm based on link mining. They give two formal definitions of community for the implementation of their algorithm. The first is a community in a strong sense, where each node has more connections within the community than with the rest of the network. The second is a community in a weak sense, where the sum of all degrees within the community is larger than the sum of all degrees toward the rest of the network. Qian et al. [44] consider the edge-clustering coefficient, defined as the number of triangles to which a given edge belongs, divided by the number of triangles that might potentially include it, given the degrees of the adjacent nodes. The idea behind its use is that edges connecting nodes in different communities are included in few or no triangles and tend to have small values of the edge-clustering coefficient.
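The edge-clustering coefficient defined above can be sketched in a few lines. This is a minimal illustration rather than the implementation of [44]: the helper names are hypothetical, the graph is assumed undirected and unweighted, the number of potential triangles is taken as min(k_u - 1, k_v - 1) for endpoint degrees k_u and k_v, and an edge that cannot lie in any triangle is assigned 0 by assumption.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_clustering_coefficient(adj, u, v):
    """Triangles containing edge (u, v) divided by the triangles it could be in."""
    # Every common neighbour of u and v closes one triangle on the edge.
    triangles = len(adj[u] & adj[v])
    # Each endpoint has (degree - 1) other neighbours that could close a triangle.
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)
    if possible <= 0:
        return 0.0  # assumption: no triangle is possible, so score the edge 0
    return triangles / possible


if __name__ == "__main__":
    # Two triangles, a-b-c and d-e-f, joined by the single bridge c-d.
    edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
    adj = build_adjacency(edges)
    print(edge_clustering_coefficient(adj, "a", "b"))  # 1.0 (inside a community)
    print(edge_clustering_coefficient(adj, "c", "d"))  # 0.0 (bridge between communities)
```

Only the neighbourhoods of the two endpoints are inspected, so the coefficient of one edge can be recomputed without touching the rest of the network.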
On the other hand, many triangles exist within clusters. The algorithm is fast, since it calculates the clustering coefficient with local information only, hence overcoming the major drawback of the Girvan-Newman (GN) algorithm.

From a social network analysis perspective, Diesner et al. [40] studied structural properties of Enron in order to identify key players in the scandal. To observe changes in the Enron network over time, they compared two graphs: one from October 2000, when Enron was seemingly on a successful path, and one from
October 2001, when Enron was undergoing a major crisis. Their research indicated that during the Enron crisis, the graph was much more connected, dense and centralized than during a non-crisis time. Using measures such as higher e-mail involvement in the graph, they were able to conclude that, in general, senior management employees were more likely to be key players involved in the scandal.

In the next section, we describe the history of Enron as an organization, highlighting the various critical events associated with the Enron crisis leading to the availability of the Enron e-mail corpus.

3.3. ENRON EVENTS

In this section, we summarize the journey of Enron from a small company in the Midwest to a corporate superpower that employed thousands of people worldwide. We also see what circumstances led to the eventual bankruptcy of the corporation, highlighting the critical happenings that we wish to monitor. The time period of this journey is from its conception in 1985 to its fall in 2001 [40].

The origins of the Enron Corporation can be traced back to 1930, when it began as a modest company in Omaha named the Northern Natural Gas Company [45]. The Northern Natural Gas Company, in its turn, was a consortium of Northern American Power and Light Company, Lone Star Gas Company, and United Lights and Railways Corporation. For some time, it was a private company, owned by the board members of the consortium. The Northern Natural Gas Company gradually became public over a period of seven years between 1941 and 1947. The company existed as the Northern Natural Gas Company for many years. In 1979, the company was restructured and became known as InterNorth Incorporated. Then, in 1985, InterNorth Inc. took over the Houston-based Houston Natural Gas Company. The transaction was engineered by the Chief Executive Officer of Houston Natural Gas Company, Kenneth Lay [45]. Lay handled the acquisition in such a way that he emerged as the Chief Executive Officer of the new company too.
Thus was conceived one of the largest energy companies, whose operations were not limited to the U.S. but were spread all over the world.
Initially, Enron was involved only in the transmission and distribution of energy and gas throughout the United States and in the development, construction and operation of power plants and other infrastructure worldwide. Over time, it branched into other areas. Also, in response to changing market conditions, Enron gradually changed from being a production company to a services-oriented company. In short, rather than producing the energy that it sold, Enron was more of a middleman, earning huge commissions in the process. In 1999 Enron launched Enron Online. It was the first web-based transaction system that allowed buyers and sellers to buy, sell and trade commodity products globally. Enron insisted on being the middleman: users did not know each other and could do business only with Enron. It had set up operations in many countries across the globe, and was doing very well in the domestic market too. Enron Corporation was named America's Most Innovative Company for five consecutive years (from 1996 to 2000) by Fortune magazine.

Beginning around 1999, Enron began to lie about its profits. In 1999, Enron's senior management began to separate and distance losses from equity and derivative trades into special purpose entities (SPEs), special legal partnerships that were excluded from the company's primary financial reports. The systematic omission of negative balance sheets from SPEs in Enron's reports resulted in an off-balance-sheet financing system. The SPEs are just one example of Enron's controversial practices, which were deemed illegal accounting and business practices. Enron created a number of such SPEs, including Azurix Corporation. One of the most controversial contracts was the three-billion-dollar Dabhol Power project in Maharashtra, India. It was alleged that Enron officials used political connections to pressurize the Maharashtra State Electricity Board [45].

In December 2000, Kenneth Lay resigned as the Chief Executive Officer of Enron Corporation. Jeffrey Skilling succeeded him as the CEO.
Lay remained Chairman at a time when Enron's stock was trading at a 52-week high of $84.9. However, Jeff Skilling surprised everyone by mysteriously resigning from the position of Chief Executive Officer in August 2001, hardly seven months after taking over the helm of the company. He cited personal reasons for his resignation. It was later confirmed that this was around the same time that Sherron Watkins sent an anonymous e-mail to Kenneth Lay, advising him that things were not as rosy as they seemed. From then on, things only got worse. By October 2001, losses transferred to the Special
Purpose Entities were more than $618 million, and could not remain hidden any more. For the first time ever, Enron reported a huge loss and a reduction in the value of shareholder stake in the company, in its third-quarter report. Alerted by the sudden change from previous reports, the Securities and Exchange Commission started a formal inquiry into the affairs of Enron Corporation in October 2001. The value of Enron shares, which had been steadily falling throughout 2001, plunged to an all-time low. In January 2001, the value of an Enron share was $90, and by November 2001 it went as low as 30 cents. Finally, in December 2001, Enron Corporation officially filed for bankruptcy and announced that thousands of its employees would be laid off. The next month, January 2002, Lay resigned from the company.

In the course of the investigation that followed, the Federal Energy Regulatory Commission decided to make the e-mail corpus used during the investigation available to the general public. The corpus was put up on the Internet in May 2002. It contained around 92% of Enron e-mails ranging over a period of three and a half years, from early 1999 to mid-2002. The corpus consisted of a total of 619,449 e-mails from 158 Enron employees. However, attachments were not available. These were the events leading to the availability of the corpus. Table 3.1 gives a snapshot of the events associated with the Enron crisis.

Table 3.1 Snapshot of events associated with Enron crisis

Time Period      Event Associated
December 2000    Jeffrey Skilling succeeded Lay as CEO
August 2001      Skilling abruptly resigned and Lay became CEO & COO
October 2001     Transferred losses to SPEs' books totaled $618 million
November 2001    Fall of Enron's stock prices
December 2001    Enron filed for bankruptcy
January 2002     Lay resigned from the company
February 2002    Lawsuits filed

3.4. METHODOLOGY

We use the same modified corpus that we used to study hierarchy in the previous chapter. It consists of all the e-mails sent and received by 151 employees
among themselves, over a period of three years ranging from June 1999 to April 2002.

One major consideration in clustering is which features to use and how to apply those features to the problem. As mentioned previously, there are three kinds of data to consider in e-mail, namely unstructured data, structured data, and numeric data. In our present work we focus on features extracted from the structured part of the e-mails. Temporal data is another kind of data available in e-mail archives. Since we are concerned with analyzing communication patterns at various critical points over time, we have incorporated the temporal information associated with each e-mail along with the structured information. The entire time period $T$, ranging from June 1999 to April 2002, is divided into snapshots $T = \{t_1, t_2, \ldots, t_{35}\}$. The duration of each snapshot $t_i$ is one month, where $t_1$ corresponds to June 1999, $t_2$ corresponds to July 1999, and so on until $t_{35}$, which corresponds to April 2002.

Communication matrices are created for each of these time frames. We represent the communication networks in a format in which the employees are nodes and the e-mails exchanged between them are edges. The edges are weighted by the total number of e-mails exchanged between the two employees in that particular time frame. The matrix $A^t$, $t = 1, 2, \ldots, 35$, represents the instantaneous communication matrix of the network at time $t$, where $a^t_{ij}$ is the number of e-mails exchanged between actors $i$ and $j$ during the time period $t$. Using these monthly network approximations, we calculated various individual-level features used for clustering the individuals into groups or communities. We now describe each of the individual-level features that we computed from these communication matrices as inputs to a clustering algorithm. We find that these features are in fact sensitive to the critical events in Enron.

3.4.1. Feature Extraction

Features such as the number of e-mails that are sent or received by an individual in a month are indicators of the individual's activity during that month.
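The construction of the monthly communication matrices and of the sent and received counts can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in this work: the input is assumed to be a list of (sender, recipient, date) tuples with one tuple per recipient of a message, the matrices are stored sparsely as dictionaries of directed counts, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

START = date(1999, 6, 1)  # snapshot t_1 is June 1999
N_MONTHS = 35             # snapshot t_35 is April 2002


def month_index(d, start=START):
    """1-based snapshot index of a date: June 1999 -> 1, April 2002 -> 35."""
    return (d.year - start.year) * 12 + (d.month - start.month) + 1


def monthly_matrices(emails, n_months=N_MONTHS):
    """Return {t: {(sender, recipient): count}} for t = 1..n_months."""
    matrices = {t: defaultdict(int) for t in range(1, n_months + 1)}
    for sender, recipient, sent_on in emails:
        t = month_index(sent_on)
        if 1 <= t <= n_months:  # ignore e-mails outside the studied period
            matrices[t][(sender, recipient)] += 1
    return matrices


def sent_received_counts(matrix):
    """Per-individual numbers of e-mails sent and received in one snapshot."""
    sent, received = defaultdict(int), defaultdict(int)
    for (i, j), count in matrix.items():
        sent[i] += count
        received[j] += count
    return sent, received


if __name__ == "__main__":
    emails = [("a", "b", date(1999, 6, 3)),
              ("a", "b", date(1999, 6, 20)),
              ("b", "a", date(1999, 7, 1))]
    mats = monthly_matrices(emails)
    sent, received = sent_received_counts(mats[1])
    print(mats[1][("a", "b")], sent["a"], received["b"])  # 2 2 2
```

Directed counts are kept here so that sent and received totals can be separated; the symmetric weight between two employees is then the sum of the two directed entries.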
It is reasonable to assume that these counts would be affected by the escalating crisis within the organization, thus showing trends of increased activity. These features are normalized by the total number of e-mails that are sent or received by all users during that month. This gives an idea of the density of e-mail traffic generated by
that individual in that month, that is, the fraction of the month's traffic generated by the individual. Similarly, features such as the number of distinct senders or recipients with whom an individual has interacted in a month indicate the influence or popularity of the individual in that month. These features take into account the indegree and outdegree of the individual users and indicate how active an individual is in communicating with others.

Reachability

In our present scenario, where we wish to understand the communication of information among employees in an organization, an important consideration is whether information originating with one employee could eventually reach all other employees. The reachability of an individual in a social network evolves with the changing communication structure. This reachability is characterized by the path lengths from an individual to others in the network [28]. It is an indicator of the connectivity of a network. Two employees are said to be more connected in a network if the path lengths between them are short, that is, if they are within two or three hops of each other. Flow of information is fast and efficient when the path lengths are short. If there is a path between employees $n_i$ and $n_j$, then $n_i$ and $n_j$ are said to be reachable. It is then possible for an e-mail message to travel from one employee to the other by passing through intermediaries. If two employees are not reachable, then there is no path between them, and no way for a message to travel from one employee to the other.

Thus the communicative reachability of individuals in an organization can be measured by examining the number of people reachable within two or three hops from each other. So the individual feature that we consider is the number of employees reachable by the individual within path lengths of two and three. This indicates the variation of the connectedness of the network with the changing conditions in the organization. Matrix operations on the instantaneous adjacency or communication matrix can be used to study reachability in the network.
It is possible to consider paths of any length by studying the powers of the adjacency matrix or the communication matrix.
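As a rough illustration of this matrix-power idea, the following sketch counts, for each employee, how many others can be reached within two or three hops. It assumes a small 0/1 adjacency matrix (the e-mail counts of the relationship matrix reduced to presence or absence of a tie). The function names `mat_mul` and `reachable_within` and the four-employee chain are hypothetical and are not taken from this work.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reachable_within(adj, max_hops):
    """For each node, count the other nodes reachable in at most max_hops steps.

    adj is a 0/1 adjacency matrix: adj[i][j] == 1 if i sent e-mail to j.
    """
    n = len(adj)
    power = [row[:] for row in adj]                      # X^1
    reached = [[adj[i][j] > 0 for j in range(n)] for i in range(n)]
    for _ in range(max_hops - 1):
        power = mat_mul(power, adj)                      # X^2, X^3, ...
        for i in range(n):
            for j in range(n):
                if power[i][j] > 0:                      # at least one route of this length
                    reached[i][j] = True
    return [sum(1 for j in range(n) if j != i and reached[i][j])
            for i in range(n)]


# Hypothetical chain of four employees: 0 -> 1 -> 2 -> 3
adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
within_two = reachable_within(adj, 2)
within_three = reachable_within(adj, 3)
```

For this chain, the first employee reaches two others within two hops and all three within three hops, while the last employee reaches nobody.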
Consider the value

$$x_{ij}^{[2]} = \sum_{k=1}^{g} x_{ik} x_{kj},$$

where $g$ is the number of employees in the network. The product $x_{ik} x_{kj}$, one term in this sum, is equal to 1 if and only if both $x_{ik} = 1$ and $x_{kj} = 1$. That is, $x_{ik} x_{kj} = 1$ only if both the lines $(n_i, n_k)$ and $(n_k, n_j)$ are present. If this is true, then there is a path $n_i \to n_k \to n_j$ in the network. Thus the sum counts the number of paths of length two between employees $n_i$ and $n_j$, over all $k$. The entries of $X^2 = \{x_{ij}^{[2]}\}$ give exactly the number of paths of length 2 between $n_i$ and $n_j$. Similarly, we can consider paths of any length by studying powers of the adjacency matrix. The elements of $X^3$ count the number of paths of length three between each pair of employees. In general, the entries of the matrix $X^p$ give the total number of paths of length $p$ from employee $n_i$ to employee $n_j$. (Strictly, these counts include routes that revisit a node.)

Hence to study path lengths of two and three, we need to study the second and third powers of the relationship matrix. Here an element of the relationship matrix gives the number of e-mails exchanged between two users. A non-zero value in the relationship matrix indicates that the two people are directly connected, while a non-zero value in the second or third power of the relationship matrix indicates that they are indirectly connected via intermediaries.

Centrality measure: closeness centrality

It is intuitive that the communication behaviors of important people (say, the senior management) reflect the critical happenings in an organization. One of the primary uses of social network analysis is the identification of the most important actors in the social network. There are a variety of measures designed to highlight the differences between important and non-important actors. Prominent actors are those that are extensively involved in communication with others. The involvement makes them more visible to the others. One type of visibility is centrality. In our work we use the actor centrality based on closeness or distance, called the closeness centrality. The measure focuses on
how close a person is to all the others in a communication network. The idea is that a person is central if he can quickly interact with all others. In the context of a communication relation, such people need not rely on others for the relaying of information. The geodesics linking the central actors to the other actors must be as short as possible. With this explanation, researchers began equating closeness to minimum distance. The idea is that centrality is inversely related to distance. As an actor grows farther apart in distance from other actors, his centrality will decrease, since there will be more lines in the geodesics linking that actor to the other actors. For example, consider the star network. The actor at the center of this star is adjacent to all the other actors, has the shortest possible paths to all the other actors, and hence has maximum closeness. There is exactly one actor who can reach all the other actors in a minimum number of steps. This actor need not rely on the others for his interactions, since he is tied to all others.

Actor closeness can be measured as a function of geodesic distances [28]. As mentioned above, as geodesics increase in length, the centrality of the actors involved should decrease; consequently, distances, which measure the length of geodesics, will have to be weighted inversely to arrive at this index. This type of centrality depends not only on direct ties, but also on indirect ties, especially when any two actors are not adjacent. Let $d(n_i, n_j)$ be the number of lines in the geodesic linking actors $i$ and $j$, where $d(\cdot, \cdot)$ is a distance function. The total distance of actor $i$ from all the other actors is $\sum_{j=1}^{g} d(n_i, n_j)$, where the sum is taken over all $j \neq i$. Thus the closeness centrality index [28] of an actor is

$$C_C(n_i) = \left[ \sum_{j=1}^{g} d(n_i, n_j) \right]^{-1} \qquad (3.1)$$

This index is simply the inverse of the sum of the distances from actor $i$ to all the other actors. At a maximum, the index equals $(g - 1)^{-1}$, which arises when the actor is adjacent to all the other actors.
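The index in (3.1) can be sketched with a breadth-first search that yields geodesic distances. This is a minimal illustration rather than the code used in this work: the adjacency-list format, the function names and the five-actor star are assumptions, and an actor that cannot reach everyone is simply assigned 0.

```python
from collections import deque


def geodesic_distances(adj_list, source):
    """Breadth-first search: geodesic distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(adj_list, node):
    """Index (3.1): inverse of the summed geodesic distances from node.

    Assumption of this sketch: return 0.0 if some actor is unreachable,
    because the sum of distances is then unbounded.
    """
    dist = geodesic_distances(adj_list, node)
    if len(dist) < len(adj_list):
        return 0.0
    total = sum(d for other, d in dist.items() if other != node)
    return 1.0 / total if total > 0 else 0.0


# Hypothetical star network: actor "c" is tied to the four other actors.
star = {
    "c": ["a", "b", "d", "e"],
    "a": ["c"], "b": ["c"], "d": ["c"], "e": ["c"],
}
center = closeness(star, "c")   # 1 / 4, i.e. (g - 1)^-1 with g = 5
leaf = closeness(star, "a")     # 1 / (1 + 2 + 2 + 2) = 1 / 7
```

The center of the star attains the largest possible value for five actors, and every peripheral actor scores lower because three of its geodesics have length two.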
At a minimum, the index attains the value of 0 in its limit, which arises whenever one or more actors are not reachable from the actor
in question. The maximum value attained by this index depends on $g$; thus, comparisons across networks of different sizes are difficult. So this index is standardized so that the maximum value equals unity. To do this we simply multiply $C_C(n_i)$ by $g - 1$:

$$C'_C(n_i) = \frac{g - 1}{\sum_{j=1}^{g} d(n_i, n_j)} = (g - 1)\, C_C(n_i) \qquad (3.2)$$

This standardized index ranges between 0 and 1, and can be viewed as the inverse average distance between actor $i$ and all the other actors. We use these standardized actor closeness centralities to indicate how central an actor is in the network.

Cohesive Subgroups: Cliques and n-clans

Another interesting feature that characterizes a person's behavior in a network is the number of social groups to which he belongs. The concept of social group can be studied by looking at properties of subsets of actors within a network. In social network analysis, the notion of social group is formalized by the general property of cohesion among subgroup members based on specified properties of the ties among the members. Cohesive subgroups are subsets of actors among whom there are relatively strong, direct, intense, frequent or positive ties. However, the property of cohesion of a subgroup can be quantified using several different properties of the ties among the subsets of actors. The result of a cohesive subgroup analysis is a list of subsets of actors within the network who meet the specified subgroup definition. Although the literature on cohesive subgroups in networks contains numerous ways to conceptualize the idea, we use two definitions of subgroups: one based on the mutuality of ties, namely cliques, and the other based on reachability and diameter, namely n-clans.

Clique: A clique in a graph is a maximal complete subgraph of three or more nodes. It consists of a subset of nodes, all of which are adjacent to each other, and there are no other nodes that are also adjacent to all of the members of the clique. The restriction that the cliques contain at least three nodes is included so that mutual dyads are not considered to be cliques [28].
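To make the clique definition concrete, here is a small sketch that enumerates maximal complete subgraphs of three or more nodes with the basic Bron-Kerbosch recursion, and then counts how many cliques each node belongs to. The choice of Bron-Kerbosch, the function name `maximal_cliques` and the toy graph are assumptions made for illustration; ties are treated as undirected.

```python
def maximal_cliques(neighbors, min_size=3):
    """Enumerate maximal cliques (Bron-Kerbosch, no pivoting) of size >= min_size.

    neighbors maps each node to the set of nodes adjacent to it (undirected).
    """
    found = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can extend the current set, so it is maximal.
            if len(current) >= min_size:
                found.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v},
                   candidates & neighbors[v],
                   excluded & neighbors[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(neighbors), set())
    return found


# Hypothetical undirected ties among five people.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

cliques = maximal_cliques(neighbors)
# Number of cliques each person belongs to.
membership = {node: sum(node in c for c in cliques) for node in neighbors}
```

In this toy graph the two cliques {a, b, c} and {b, c, d} overlap in b and c, and the dyad {d, e} is excluded by the three-node restriction.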
In the present context, a clique is a subset of people, all of whom have exchanged mails with each other, and there are
no other people who have also exchanged mails with all the members of the clique. A clique is a very strict definition of cohesive subgroup. The absence of a single tie will prevent the subgraph from being a clique. Also, there is no internal differentiation among actors within a clique. Since a clique is complete, within the clique all members are graph-theoretically identical. All clique members are adjacent to all the other clique members; thus there are no distinctions among members based on graph-theoretic properties within the clique. If we expect that the cohesive subgroups within a network should exhibit interesting internal structure, then the clique might not be an appropriate definition of cohesive subgroup. On the other hand, there might be largely overlapping cliques in the group. In such cases, the cliques themselves might not be very informative. Instead, we study the overlap among the cliques. The same node or set of nodes might belong to more than one clique. In terms of individual actors, we consider the number of cliques to which an actor belongs. This feature changes from person to person and from one month to another, indicating a person's social neighborhood and communication structure.

n-clan: Reachability is the motivation for the first cohesive subgroup ideas that extend the notion of a clique. These alternative subgroup ideas are relevant here, as the diffusion of information is hypothesized to occur through intermediaries. Conceptually, there should be relatively short paths of influence or communication between all members of the subgroup. Subgroup members might not be adjacent, but if they are not adjacent, then the paths connecting them should be relatively short. The geodesic distance between two nodes is the length of a shortest path between them. Cohesive subgroups based on reachability require that the geodesic distances among the members of the subgroup be small. Thus, we can specify some cutoff value, n, as the maximum length of geodesics connecting pairs of actors within the cohesive subgroup.
Restricting geodesic distance among subgroup members is the basis for the definition of an n-clique. An n-clique is a maximal subgraph in which the largest geodesic distance between any two nodes is no greater than n. Formally, an n-clique is a subgraph with node set $N_s$, such that
$d(i, j) \le n$ for all $n_i, n_j \in N_s$, and there are no additional nodes that are also at distance n or less from all nodes in the subgraph [28]. Two issues limit the usefulness of this definition: first, an n-clique may have a diameter greater than n, and second, an n-clique might be disconnected. The first problem arises because the requirement that nodes be connected by paths of length n or less does not require that these paths remain within the subgroup. Thus the diameter of the subgraph can be larger than n. A useful restriction on n-cliques is to require that the diameter of an n-clique be no greater than n. This restriction leads us to n-clans. An n-clan starts with the n-cliques that are identified in a network and excludes those n-cliques that have a diameter greater than n. An n-clan is an n-clique in which the geodesic distance between all nodes in the subgraph is no greater than n for paths within the subgraph [28]. We use the definition of n-clan for our analysis. Again, in a manner similar to cliques, we consider the number of n-clans that a person belongs to in a particular month as our feature. In studying network processes such as information diffusion that flow through intermediaries, these features based on indirect connections of relatively short paths provide a reasonable approach. In the next sub-section, we use the K-means clustering algorithm to cluster the network at various points in time using the above feature vectors.

3.4.2. Identification of informal networks

The informal networks are the clusters of employees without regard to formal roles or hierarchy. We first extract the feature vectors of each employee based on the above features and then cluster them to obtain the informal networks. The purpose of clustering is to identify natural groupings of data from a large data set to produce a concise representation of a system's behavior. The traditional methods for detecting the natural groupings in networks based on multiple features are hierarchical clustering and K-means clustering.
In the case of hierarchical clustering, one first calculates a weight $w_{ij}$ for every pair $i, j$ of vertices in the network, which represents in some sense how closely connected the vertices are. Then one takes the n vertices in the network, with no edges
between them, and adds edges between pairs one by one in order of their weights, starting with the pair with the strongest weight and progressing to the weakest. As edges are added, the resulting graph shows a nested set of increasingly large components (connected subsets of vertices), which are taken to be the communities. Because the components are properly nested, they all can be represented by using a tree in which the lowest level at which two vertices are connected represents the strength of the edge that resulted in their first becoming members of the same community. A slice through this tree at any level gives the communities that existed just before an edge of the corresponding weight was added. Many different weights have been proposed for use with hierarchical clustering algorithms. They give reasonable results for community structure in some cases. In other cases they are less successful. In particular, they have a tendency to separate single peripheral vertices from the communities to which they should rightly belong. This makes the hierarchical clustering method, although useful, far from perfect for large amounts of data.

K-means clustering can best be described as a partitioning method. That is, it partitions the observations in the data into K mutually exclusive clusters, and returns a vector of indices indicating to which of the K clusters it has assigned each observation. Unlike the hierarchical clustering methods, K-means does not create a tree structure to describe the groupings in the data, but rather creates a single level of clusters. Another difference is that K-means clustering uses the actual observations of objects or individuals in the data, and not just their proximities. These differences often mean that K-means is more suitable for clustering large amounts of data. K-means treats each observation in the data as an object having a location in space. It finds a partition in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible.
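The partitioning idea can be sketched in a few lines. The block below is a simplified illustration and not the implementation used in this work: it assumes squared Euclidean distance, picks the initial centroids at random from the data, and alternates batch reassignment with centroid recomputation until the assignments stop changing. The function name `kmeans`, its parameters and the six feature vectors are hypothetical.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Batch K-means on a list of equal-length numeric tuples.

    Returns (labels, centroids). Initial centroids are k points drawn at
    random from the data, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break                      # assignments are stable: converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-feature vectors for six employees forming two obvious groups.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(features, k=2)
```

In practice each employee's feature vector would hold the monthly features described earlier, and the result depends on the initial centroids, so several restarts are commonly used.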
Each cluster in the partition is defined by its member objects and by its centroid, or center. The centroid for each cluster is the point to which the sum of distances from all objects in that cluster is minimized. K-means uses a two-phase iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The first phase uses "batch" updates, where each iteration consists of reassigning points to their nearest cluster centroid, all at once, followed by recalculation of cluster centroids. You can think of
this phase as providing a fast but potentially only approximate solution as a starting point for the second phase. The second phase uses "on-line" updates, where points are individually reassigned if doing so will reduce the sum of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this second phase consists of one pass through all the points.

We use the K-means clustering available in Matlab in order to find the natural clusterings present in the data. The choice of K has been based on the Bayesian Information Criterion (BIC). The following section correlates the events in Enron during the crisis with the variation of the above features over time.

3.5. RESULTS AND DISCUSSIONS

In this section, we first look at how each of the features used for clustering varies over time at the network level, as the crisis escalated in Enron. We then see how the informal networks obtained correlate with events in the organization, giving a better picture of the structure in Enron.

Figure 3.1 shows how the number of e-mails generated in a month varies with the escalation of the crisis. It can be seen that the local peaks in the number of e-mails exchanged correspond to various critical events related to the crisis. Although the actual values of the e-mail counts have decreased for the two events, namely Fall in stock prices in November 2001 and Filed for bankruptcy in December 2001, the relative communication during this time is an indication of a highly active period. The decrease in actual values can in part be associated with the vacation periods during these months. This indicates that a high level of information exchange is often associated with a critical happening in the organization. This is attributed to the tendency of the employees to discuss the major events that were happening; that the e-mails reflect these events indicates that e-mail is a major form of communication between employees.

Figure 3.2 shows the variation in the number of people communicating over the entire time period. This shows an increase in the distinct senders as the crisis escalated.
But this density does not show a clear trend with respect to events.
[Figure 3.1 Variation in the Traffic generated with the Events in Enron. Plot of e-mails exchanged in a month (0 to 8000) against month (June 1999 to December 2001), annotated with the events: Skilling succeeded Lay as CEO; Transferred 618 million to SPEs; Skilling resigned, Lay took over; Fall of stock prices; Filed for bankruptcy; Lay resigned; Lawsuits filed.]

[Figure 3.2 Variation of Distinct Senders with the Events in Enron. Plot of the number of distinct senders (0 to 120) against month (June 1999 to December 2001), annotated with the same events.]

Although the overall picture follows the general trend, at the local points no clear distinction can be made regarding the individual events. It does confirm the fact that more and more people are discussing the crisis and previously dormant
employees are also getting actively involved. The distribution shows a peak in the communication between September 2001 and November 2001, which correlates to the period in which Enron proceeded from an organizational crisis to bankruptcy.

Similarly, Figure 3.3 shows the variation in the number of people communicating over the entire time period. This shows an increase in the distinct receivers as the crisis escalated. But this density also does not show a clear trend with respect to events. Although the overall picture follows the general trend, at the local points no clear distinction can be made regarding the individual events. It does confirm the fact that more and more people are discussing the crisis and previously dormant employees are also getting actively involved. The distribution shows a peak in the communication between September and November 2001, which correlates to the period in which Enron proceeded from an organizational crisis to bankruptcy.

[Figure 3.3 Variation of Distinct Receivers with the Events in Enron. Plot of the number of distinct receivers (0 to 140) against month (June 1999 to December 2001), annotated with the same Enron events as Figure 3.1.]

Figure 3.4 shows how the reachability in the network is affected by the events in Enron. It shows the variation in people reachable within two hops with the events in Enron. Figure 3.5 shows the variation in people reachable within three hops with the events in Enron. Similar to Figures 3.2 and 3.3 these features do not show a
[Figure 3.4 Variation in Reachability within two hops with Events in Enron. Plot of people reachable within two hops (0 to 100) against month (June 1999 to December 2001), annotated with the same Enron events as Figure 3.1.]

[Figure 3.5 Variation in Reachability within three hops with Events in Enron. Plot of people reachable within three hops (0 to 110) against month (June 1999 to December 2001), annotated with the same Enron events as Figure 3.1.]

clear distinction with respect to individual events, but with more and more people communicating with each other and an increase in e-mail traffic generated, new
paths are created between people, thus making more and more people easily reachable. As a result the path lengths have shortened and people can be reached within two or three hops. This shows that with the escalating crisis, previously unconnected people can now be connected via intermediaries. There is a sudden increase in the number of people reachable within short paths when the Enron scandal surfaced in August 2001, when Skilling resigned and Lay took over. Following this, these features show an increasing trend until the Fall in Stock Prices in November 2001. More people being reachable within two or three hops implies that newer paths have opened and a person can be reached in more than one way. This is an indicator that the flow of information within the organization is diversified during the crisis period. The shortening of the path lengths and the increase in the number of communication participants indicate that during the escalation of the crisis, previously disconnected people began to engage in mutual communication, thus strengthening the connectivity in the network.

[Figure 3.6 Variation in Group Closeness Centrality with Events in Enron. Plot of group closeness centrality (0 to 0.09) against month (June 1999 to December 2001), annotated with the same Enron events as Figure 3.1.]

Figure 3.6 shows the evolution of group closeness centrality with time. The degree to which some employees have high importance while others are of low importance can provide information regarding the disparity in the network. This inequality is
represented in the closeness centrality measure, computed at the network level. Closeness centrality captures how close or how central an actor is to all the other actors in a network. As can be seen in Figure 3.6, during the crisis the disparity in the network jumped up and subsequently fell until Enron filed for bankruptcy in December 2001. This further supports the insight that during the crisis the resultant network is diversified from the established roles in the organization. There is an increase in the cohesion of the system. But since we have taken group centralization as the measure, there is a conflict in the information deduced from this figure. The group closeness centrality can be high for two reasons. One is that more people are involved in communication with a number of people and hence the path lengths have shortened; the other is that there are a few central people who are communicating with a large number of people. From the previous figures that we have analyzed, in our case it is reasonable to conclude that the first reason is the more probable one.

[Figure 3.7 Variation in Number of Cliques with Events in Enron. Plot of the number of cliques (0 to 45) against month (June 1999 to December 2001), annotated with the same Enron events as Figure 3.1.]

Figure 3.7 shows the variation in the number of cliques over time. Figure 3.8 shows the variation in the number of n-clans over time. These figures analyze the mutual communication in small groups. This reveals the trend in communication within the small groups during the crisis. It suggests a trend towards communication only
among trusted employees, possibly indicating a decrease in trust and accountability within the network. Like the density of the network, these features also do not show any clear trend relating to events, but they follow the overall escalation of the crisis.

[Figure 3.8 Variation in the Number of n-clans with Events in Enron. Plot of the number of n-clans (0 to 60) against month (June 1999 to December 2001), annotated with the same Enron events as Figure 3.1.]

Finally, Figure 3.9 is the plot of the number of informal networks that have been identified in a month, found from the clustering applied to all employees in each month. This seems to show a consistent trend with regard to events. It can be seen that there is a consistent decrease in the number of informal networks following every event. This pattern of evolution in the informal networks appears to coincide with all the events in the Enron scandal that we monitored. This indicates that people are involved in diverse communications during the crisis and that the clique-like structures within the network have disintegrated. This suggests that during the breakout of the crisis the interpersonal communication had intensified and spread throughout the network in a way that fractured the established roles in the organization. We can also see that it is the informal networks that are affected by an event in a way that can help us detect these events. This also establishes their capability as an indicator of crisis.
[Figure 3.9 Variation in the Number of Informal Networks with Events in Enron. Plot of the number of informal networks (0 to 50) against month (June 1999 to December 2001), annotated with the same Enron events as Figure 3.1.]

Thus, we conclude that the e-mail communication within an organization can in fact reflect the evolving trends within that organization. Among the various measures analyzed to detect events, only the informal networks found from features of individual employees show a consistent trend. Hence we conclude that although the network-level measures follow the overall trend, they do not follow any consistent trend with respect to an event and cannot discriminate time periods following major events from normal periods. The network-level measures are less sensitive to the perturbations in a network as they average out the finer variations. These network-level measures are useful when we want to get a consolidated snapshot of the overall picture, as in the case of post-analysis. They definitely show the window of crisis ranging from August 2001 to December 2001. The network-level changes in communication reflect changes in communication norms and the way in which groups present themselves. But in the present scenario, where we want to detect critical real-world happenings from the communication patterns, it is found advantageous to consider the structural properties of each individual at a given instant of time, rather than consider the network-level measures. Although we have used purposefully cleaned data for evaluations, we believe that it is likely that the full set of data in the corpus would reflect similar over-time trends. We believe that
cleaning the corpus is necessary to extract the subtle trends in communication that we investigated. The patterns of participation of individuals in groups show consistent trends. This suggests the importance of informal networks in analyzing an organization's information processing capacity and for event detection within an organization.

As discussed with Figure 2.9 of Section 2.7 in Chapter 2, from June 2001 onward the accuracy of our Social Network Classifier varies in antiphase with the number of e-mails. As mentioned earlier, such antiphase variations may be precursors to either an organizational restructuring or an impending crisis. We have also mentioned that for the case of Enron the latter is likely to be the case. This fact is further validated in Figure 3.10, where we have re-plotted Figure 2.9 of Section 2.7 in Chapter 2 by showing the various events that have been identified with the crisis.

[Figure 3.10 Variation in the Performance of the Social Network Classifier with the Events in Enron. Plot of e-mails considered for classification (left axis, 0 to 8000) and performance of the Social Network Classifier (right axis, 10 to 50) against month (June 1999 to December 2001), annotated with the same Enron events as Figure 3.1.]

Figure 3.10 clearly shows that these antiphase variations in the accuracy of the Social Network Classifier with the number of e-mails coincide with the major events in the crisis, namely Transferred 618 million to SPEs and Fall of Stock Prices. These two events have consistently been shown to be the peak crisis periods. This also strongly demonstrates the capabilities of our approach for crisis detection and establishes
e-mails as strong indicators of collaborations. It also validates our assumption that the structures are emergent in the e-mail communication patterns.

3.6. CONCLUSIONS & FUTURE WORK

We have shown that the analysis of e-mail communication within an SNA framework reveals communication patterns that correspond with the organizational events. We found that interpersonal communication intensified, resulting in a diverse network in which previously disconnected people engaged in mutual communication, and we showed how the network-level features evolve over time to reflect these events. We have shown how the established roles in the organization get disturbed with the impending crisis. We have demonstrated the potential role that the informal networks can play in event detection during an organizational crisis. For Enron, we have seen that a change in the usual pattern of communication accompanies major events. We conclude that the study of informal networks along with the structural hierarchy can provide a signature to the e-mail communication patterns during a crisis. In order to further validate this, a detailed analysis of the e-mail content of individual messages through text analytic approaches can be carried out, and more real-world e-mail datasets need to be examined.

The findings of this study provide valuable insight into a real-world organizational crisis, which may be further used for developing dynamic models of organizational crises. Such models can lead to a better understanding of the causes of and response to an organizational failure. We can also experiment with other social network metrics to analyze their contributions towards crisis detection.

In this chapter we have established the importance of informal networks in studying the structure of an organization. For these informal networks to contribute effectively to crisis detection systems, it is required to explore not only how these informal networks change with time but also what type of changes they undergo. We need not only to identify the changing communities but also to identify the type of change that these communities undergo.
This requires an in-depth
understanding of the transformations associated with these communities. This has been taken up in Chapter 4, where we present a methodology for identifying the changes undergone by a community during its lifetime.
Chapter 4

EVOLUTION OF COMMUNITIES

4.1. INTRODUCTION

Social networks arise from a wide variety of scientific domains like computer science, physics, biology and sociology. Online communities such as Flickr, MySpace and Orkut, e-mail networks, co-authorship networks and WWW networks are examples of interesting social networks. The study of these complex networks can provide insight into their structure, properties and behavior. Early research on these networks has primarily focused on static properties such as their modular nature, neglecting the fact that most real-world networks are dynamic in nature. In reality, many of these networks constantly evolve over time, with the addition and deletion of edges and nodes representing changes in the relationship patterns among the modeled entities. Members of a community may leave gradually, while new ones join. Then, the community is still there even if all of its original members have left. Identifying the portions of the network that are changing and characterizing the type of change for these evolving networks are challenges that need to be addressed. For instance, the rapid growth of online communities has dictated the need for analyzing large amounts of temporal data to reveal community structure, dynamics and evolution. We believe that studying the evolution of clusters of these networks, in particular their formation, transitions and dissolution, can be extremely useful for effectively characterizing the corresponding changes to the network over time.

In this chapter, we provide a framework for characterizing the evolution of these complex social networks. We begin by converting an evolving graph into static snapshot graphs at different time points. We obtain communities at each of these snapshots independently. Next, we characterize the transformations of these clusters by defining and identifying certain critical transitions, which can be efficiently detected by comparing the consecutive snapshots.
The identification of these transformations and the patterns that we discover offer new and interesting insights for the characterization of dynamic behavior of these evolving networks and help us make useful inferences about cluster evolution. We illustrate our framework on the
evolving IISc co-authorship network. The key contribution of this work is the identification of key transitions that occur in evolving networks using efficient algorithms.

4.2. PROBLEM FRAMEWORK

Our focus in this work is to study the evolution of graphs, in particular to understand behavioral patterns for communities over time. In this regard, it becomes necessary to study and characterize the transformations undergone by the communities at different timestamps along the way. For this purpose, we make use of temporal snapshots to examine static versions of the evolving network at different time points.

Definition

A social network can be represented as a graph where nodes represent actors and edges represent the collaborations among them. A graph G is said to be evolving if its structure varies over time. Let $G = (V, E)$ denote a temporally varying graph where V represents the total unique actors and E the total collaborations that exist among the actors. We define a temporal snapshot $S_i = (V_i, E_i)$ of G to be a graph representing only actors and collaborations active at a particular time instant $t$. As the graph evolves, new nodes and edges can appear. Similarly, nodes and edges can also cease to exist. This dynamic behavior of a graph over time can thus be represented as a set $S$ of equal, non-overlapping temporal snapshots.

Note that the different snapshots are mutually exclusive rather than cumulative. To see why, suppose that in the first time interval, collaborations exist between nodes A and C and between A and D, and that in the second time interval these collaborations cease to exist. Then in a cumulative snapshot of the second time interval, the information regarding the loss of collaborations AC and AD is lost. Hence, the community structure does not reflect the actual structure. To prevent this loss of information, we choose short time intervals and generate snapshots representing only the information of that specific interval. The collection of all T temporal snapshots is represented by $S = \{S_1, S_2, \ldots, S_T\}$.
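The snapshot construction can be sketched as follows. This is only an illustration under stated assumptions: collaborations are given as hypothetical (time, actor, actor) records, the observation window is split into intervals of equal width, and edges are treated as undirected. The function name `temporal_snapshots` and its parameters are not part of the original framework.

```python
def temporal_snapshots(events, start, end, num_intervals):
    """Split timestamped collaborations into equal, non-overlapping snapshots.

    events: iterable of (time, u, v) records.
    Returns a list of (nodes, edges) pairs, one per interval; each snapshot
    holds only the actors and collaborations active in that interval.
    """
    width = (end - start) / num_intervals
    snapshots = [(set(), set()) for _ in range(num_intervals)]
    for time, u, v in events:
        if not (start <= time < end):
            continue                       # outside the observation window
        index = min(int((time - start) / width), num_intervals - 1)
        nodes, edges = snapshots[index]
        nodes.update((u, v))
        edges.add(frozenset((u, v)))       # undirected collaboration
    return snapshots


# Hypothetical co-authorship records: (year, author, author).
events = [(2000, "A", "C"), (2000, "A", "D"), (2001, "B", "C")]
s1, s2 = temporal_snapshots(events, start=2000, end=2002, num_intervals=2)
```

Because the snapshots are not cumulative, the collaborations AC and AD appear only in the first snapshot, so their disappearance in the second interval remains visible.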
To study the evolution of the graph, we need a representation of its structure at different snapshots. For this purpose, we generate communities for each snapshot graph. Each $S_i$ is partitioned into $k_i$ communities denoted by $C_i = \{C_i^1, C_i^2, \ldots, C_i^{k_i}\}$. The $j$th cluster of $S_i$, $C_i^j$, is also a graph denoted by $(V_i^j, E_i^j)$, where $V_i^j$ are nodes in $S_i$ and $E_i^j$ denotes the edges between nodes in $V_i^j$. Finally, for each $S_i = (V_i, E_i)$, $V_i^1 \cup V_i^2 \cup \ldots \cup V_i^{k_i} = V_i$.

Figure 4.1 shows the outline of the framework we propose. Once the communities in each snapshot have been identified, we identify significant changes that a community undergoes between consecutive snapshots, referred to as community transitions. The first two steps of the proposed framework are described in Section 4.4 and the last two steps are described in Section 4.5.

Input: Relationship graph G = (V, E) and T, the number of intervals [Section 4.4]
Convert graph G = (V, E) into T temporal snapshots $S = \{S_1, S_2, \ldots, S_T\}$ [Section 4.4]
For i = 1 to T do
    Cluster $S_i$ into $C_i = \{C_i^1, C_i^2, \ldots, C_i^{k_i}\}$ [Section 4.5]
End For
For i = 1 to T - 1 do
    Transitions = Find transitions ($S_i$, $S_{i+1}$) [Section 4.5]
End For

Figure 4.1 Proposed Framework

4.3. BACKGROUND

There has been enormous interest in studying relationship graphs for behavioral patterns in various domains. However, the majority of these studies have focused on static graphs to identify community structures and patterns. Clauset et al. [46] present a hierarchical agglomeration algorithm for detecting
community structure whose running time on a network with n vertices and m edges is $O(md \log n)$, where d is the depth of the dendrogram describing the community structure. This algorithm has substantial practical implications, bringing within reach the analysis of extremely large networks. Their algorithm is applied to a recommender network of books from the online bookseller Amazon.com. They show that their algorithm can extract meaningful communities from this network, revealing large-scale patterns present in the purchasing habits of customers.

Flake et al. [47] describe a method for the identification of web communities. Despite the decentralized and unorganized nature of the web, they show that the web self-organizes such that communities of highly related pages can be efficiently identified based purely on connectivity. Their algorithm achieves this performance using only link information, without the text information used by algorithms such as HITS. This allows the identification of communities independent of, and unbiased by, the specific words used by authors. The algorithm assumes the existence of one or more seed web sites, exploits systematic regularities of the web graph, and fits into a framework that allows for efficient community identification by using a polynomial time algorithm that should scale well to studying the entire web graph. Applications of this algorithm include improved search engines, content filtering, and objective analysis of relationships within and between communities on the web.

Newman [48] also tackles the problem of detecting and characterizing community structure. His approach to detecting communities optimizes the quality function known as modularity over the possible divisions of a network. However, the direct application of this method using simulated annealing is computationally costly.
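For reference, the modularity of a division of a network is commonly written as below. The symbols are introduced here only for illustration and are not used elsewhere in this chapter: $A_{ij}$ is the adjacency matrix, $d_i$ the degree of node $i$, $m$ the number of edges, $c_i$ the community of node $i$, and $\delta$ the Kronecker delta.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta(c_i, c_j),
\qquad
B_{ij} = A_{ij} - \frac{d_i d_j}{2m}
```

The matrix $B$ is the modularity matrix; a division with a high $Q$ has more edges inside communities than would be expected if edges were placed at random with the same node degrees.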
He shows that the modularity can be reformulated in terms of the eigenvectors of a new characteristic matrix for the network, called the modularity matrix, and that this reformulation leads to a spectral algorithm for community detection that returns results of better quality than competing methods in noticeably shorter running times.

In recent times, the dynamic behavior of clusters and communities has attracted the interest of several groups. Chakrabarti et al. [49] define evolutionary clustering as the task of incrementally obtaining high-quality clusters for a set of objects while also maintaining similarity with clusters identified in previous timestamps. They present a
generic framework for this problem, and discuss evolutionary settings for two widely used clustering algorithms (k-means and agglomerative hierarchical clustering). They have also extensively evaluated these algorithms on real datasets and showed that the algorithms can simultaneously attain both high accuracy in capturing present data and high fidelity in reflecting past clustering.

Leskovec et al. [50] studied the evolution of graphs based on various topological properties, such as the degree distribution and small-world properties of large networks. They proposed a graph generation model, called the Forest Fire model, to explain their findings about the evolutionary behavior of graphs.

Falkowski et al. [51] analyze the interaction behavior of members over time in order to facilitate community building. They have proposed two approaches to analyze the evolution of two different types of online communities on the level of subgroups. In the first approach, the time axis is partitioned and a graph of interactions is built. On this graph, a hierarchical divisive edge betweenness clustering algorithm is applied to find subgroups of densely connected individuals. In the next step, statistical measures are used to analyze the temporal development of these subgroups. The second approach tackles the problem of analyzing the evolution of communities in environments with a high membership fluctuation. For this, they have applied the same clustering as in the first approach to detect subgroups in graphs (community instances). Then a graph of similar community instances is built and clustered to detect groups of similar instances. These clusters are visualized and their temporal development is analyzed. In this way they detect persistent structures and their transitions in a graph of interactions among fluctuating members. They have studied a student online community with both methods.

Backstrom et al. [52] have studied how the evolution of communities relates to properties such as the structure of the underlying social networks.
They found that the propensity of individuals to join communities, and of communities to grow rapidly, depends on the underlying network structure. They have used decision-tree techniques to identify the most significant structural determinants of these properties. They also develop a novel methodology for measuring the movement of individuals between communities, and show how such movements are closely aligned with changes in the topics of interest within the communities. For analysis
they have used two large sources of data: friendship links and community membership on LiveJournal, and co-authorship and conference publications in DBLP. Both of these datasets provide explicit user-defined communities, where conferences serve as proxies for communities in DBLP.

4.4. IISc CO-PUBLICATION DATA

The database of publications by a leading academic institution provides a rich temporal record of collaborations within the institute. Apart from e-mails as an indicator of collaboration, in an academic institution like the Indian Institute of Science the co-publication of papers is a strong indicator of collaboration and knowledge exchange. A relationship between two authors exists if they have co-authored a paper together.

We have collected the publication data of the Indian Institute of Science for a period of 10 years from 1995 to 2004. We believe that a ten year period is a reasonable time period over which to consider the evolution of communities in the institution. It will show variety in the authors publishing, in collaborations, and in the topics of research pursued by the authors. This data has been provided by Web of Science [53]. Web of Science, a product of Thomson Scientific, is a citation database provided within the ISI Web of Knowledge. Though the IISc faculty also publish papers in journals that are not indexed by the Web of Science [53] database, particularly in the fields of Engineering and Management, it is a good representation since it covers a majority of the journals in which the IISc community publishes.

The co-authorship data extracted from the corpus of publications has been chosen for understanding the formation of scientific groups and for identifying transformations associated with evolving communities. Collaboration networks display many of the structural features of social networks. Hence, this is a good representative dataset for this study. We believe that studying the evolution of this dataset can afford information about the nature of collaborations and the factors that influence future collaborations between authors.
We used the publication data to generate a co-authorship network representing authors who are affiliated with the Indian Institute of Science and have published papers. We
chose all papers that appeared over a 10 year period (1995-2004) spanning all the departments. We converted this data into a co-authorship graph, where each author is represented as a node and an edge between two authors corresponds to a joint publication by these two authors. This population consists of students (mainly graduate students and post-doctoral fellows), faculty, and all outsiders who have collaborated with the faculty of the institute over the selected academic years 1995-2004. The dataset consists of 8817 papers and 7387 unique authors.

We also need to choose the sampling period, i.e., the frequency with which the network is measured. This determines the number of snapshots that we wish to extract from the data. We have chosen this sampling period to be one year, as this gives a reasonable account of the natural periodicity of the authors' activities. With reference to Figure 4.1, we chose T, the number of intervals, to be ten, and hence our next step is to convert the co-authorship graph into ten consecutive snapshot graphs $S = \{S_1, S_2, \ldots, S_{10}\}$. These graphs are then clustered and analyzed to identify transitions.

In order to further process the dataset, we have analyzed the distribution of authors in the dataset. The dataset consists of unique authors with varying numbers of active years: authors active for one year, two years, and so on up to all ten years. Figure 4.2 shows this distribution, that is, how the total number of unique authors varies with the minimum number of active years required of an author. According to Figure 4.2, there are 7387 distinct authors who have published papers in at least one year. This includes all authors who have published at least once during the ten year period. There are 1020 authors who have published in at least four years out of the ten year period. Proceeding in this manner, Figure 4.2 shows that there are 93 unique authors who have published in all the ten years. Figure 4.3 shows the distribution of these different sets of authors year-wise.
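The conversion of publication records into yearly co-authorship edges, and the counts by minimum number of active years plotted in Figure 4.2, can be sketched as follows. The record format (a year together with a list of author names) and the function names are assumptions made for illustration; they do not describe the actual Web of Science export.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_edges(papers):
    """Map each year to the set of author pairs with a joint paper.

    papers: iterable of (year, [author, ...]) records (assumed format).
    Single-author papers contribute no edge.
    """
    edges = defaultdict(set)
    for year, authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[year].add((a, b))
    return dict(edges)


def authors_by_min_active_years(papers, max_years):
    """Count authors who published in at least m distinct years, m = 1..max_years."""
    active = defaultdict(set)
    for year, authors in papers:
        for author in authors:
            active[author].add(year)
    return {
        m: sum(1 for years in active.values() if len(years) >= m)
        for m in range(1, max_years + 1)
    }


# Toy records, not taken from the IISc data.
toy_papers = [
    (1995, ["A", "B"]),
    (1996, ["A", "C"]),
    (1996, ["B"]),
    (1997, ["A"]),
]
toy_edges = coauthor_edges(toy_papers)
toy_counts = authors_by_min_active_years(toy_papers, max_years=3)
```

On the real data, the second function applied with `max_years=10` would correspond to the curve of Figure 4.2, whose first and last values are reported in the text as 7387 and 93.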
It shows the distribution of the set of authors in each year for every number of minimum active years considered. As expected, this surface follows a distribution similar to that in Figure 4.2. Similarly, Figure 4.4 shows the distribution in the number of papers generated by the set of authors formed at each stage in the above
distribution. There are 8817 total papers when we include the papers by all authors who have published in at least one year. This includes all the 7387 unique authors. There are 8246 papers published by the 1020 authors who have published in at least four years. Proceeding in this manner, Figure 4.4 shows that there are 4287 papers published by the 93 authors who have published in all the ten years. Of note is the fact that these numbers of papers include the single author papers.

Figure 4.2 Distribution in the Number of Authors with their active years (x-axis: minimum number of years active; y-axis: total number of authors)

Figure 4.3 Distribution of Authors year-wise with Number of Active years (axes: year, minimum number of years active, number of authors)

Figure 4.4 Distribution in the Number of Papers with varying Active Years (x-axis: minimum number of years active; y-axis: total number of papers)

Figure 4.5 Distribution of Papers year-wise with Number of Active years (axes: year, minimum number of years active, number of papers)

Figure 4.5 shows the distribution in the number of papers published by these different sets of authors year-wise. It shows the distribution in the total number of papers published in each year for every number of minimum active years
considered. As expected, this surface follows a distribution similar to that in Figure 4.4.

4.4.1 IISc Co-authorship Dataset I

Out of the various sets of authors considered with different minimum active periods, we have chosen two sets and created two different versions of the co-authorship data. Analysis is carried out on both versions. The primary focus of this work is to characterize the evolution of the communities. Hence, in the first version of the co-authorship data, which we hereafter refer to as the IISc Co-authorship Dataset I, it has been assumed that authors who have not been active for more than three years need not be considered for analysis, as they do not contribute to the evolution of a community. This corresponds to 1020 authors and 8246 papers according to Figure 4.2 and Figure 4.4. As these numbers include single author papers as well, we have removed these single author papers and the corresponding authors from the data, as they do not represent any collaboration and thus do not aid the evolution of communities. The resulting dataset consists of 843 authors and 7410 papers spanning ten years.

As mentioned before, the population of the publication data consists of students (mostly graduate students and post-doctoral fellows), faculty, and all outsiders who have collaborated with the faculty of the institute over the selected academic years 1995-2004. It is not possible to differentiate between the various authors, i.e., we cannot segregate them into subsets of faculty and students from the available data alone. We do not have information regarding the authors' individual affiliations or their designations. Since it is difficult to separate students from faculty, we have decided to infer this by other means. Students collaborate with faculty to publish papers within the institute, and hence it is reasonable to assume that most of the communities formed in this dataset will be centered around faculty members and show the collaborations of these faculty members with their students.
The set of students may vary from year to year for a faculty member, but the faculty member will have a long-term affiliation with the institute. Figure 4.6 shows the distribution of authors in each year. The number of authors varies between 360 and 545 per year. Figure 4.7 shows the distribution of papers in each year for this dataset. This varies between 360 and 561 papers per year.
Figure 4.6 Distribution of Authors in IISc Co-authorship Dataset I (x-axis: year, 1995 to 2004; y-axis: number of authors)

Figure 4.7 Distribution of Papers in IISc Co-authorship Dataset I (x-axis: year, 1995 to 2004; y-axis: number of papers in a year)

4.4.2 IISc Co-authorship Dataset II

In the second version of the co-authorship data, which we hereafter refer to as the IISc Co-authorship Dataset II, only authors who have been active for the entire time
period of ten years from 1995 to 2004 have been considered. Hence, we have derived a small subset of the IISc Co-authorship Dataset I to create the IISc Co-authorship Dataset II. This subset consists of all the authors who have published and collaborated with other authors in all the ten years under consideration. It is reasonable to assume that the majority, if not all, of the authors in this subset are faculty of the institute, that the majority of the collaborations are faculty-faculty collaborations, and that this subset forms a very interesting group of the most active members in the institute. The resulting dataset consists of 79 authors and 497 collaborations spanning ten years. Faculty who retired during this period are not included, as they may not have published in all ten years, but this is acceptable since the set still represents the most active members in the institute during the period considered. Figure 4.8 shows the distribution of authors in each year. The number of authors varies between 45 and 59 per year. Figure 4.9 shows the corresponding distribution in the number of papers published by these authors. It varies between 53 and 92 papers per year.

Figure 4.8 Distribution of authors in IISc Co-authorship Dataset II (x-axis: year, 1995 to 2004; y-axis: number of authors)
Figure 4.9 Distribution of Papers in IISc Co-authorship Dataset II (x-axis: year, 1995 to 2004; y-axis: number of papers in a year)

4.5. METHODOLOGY

4.5.1. Clustering

With reference to Figure 4.1, we chose T, the number of intervals, to be ten and converted the co-authorship graphs of Datasets I and II into ten consecutive snapshot graphs $S = \{S_1, S_2, \ldots, S_{10}\}$ each. The next step in our framework is to choose a clustering algorithm in order to cluster each of these snapshot graphs into communities. For this work, we found that the MCL algorithm [54], a fast and scalable unsupervised clustering algorithm, consistently yielded clusters of high modularity. This algorithm was developed by Stijn van Dongen.

MCL Algorithm

We use MCL to obtain the clusters at different timestamps. The MCL algorithm does not require a parameter specifying the number of clusters. Instead it uses a granularity parameter and the cluster structure prevalent in the graph to determine
the number of partitions. Accordingly, for each snapshot, the number of clusters may vary depending on the activity in that time interval. We used a granularity parameter of 5.0 for our experiments.

Natural clusters in a graph are characterized by the presence of many edges between the members of that cluster, and the number of higher length paths between two arbitrary nodes in the cluster is high relative to node pairs lying in different natural clusters [54]. The MCL algorithm finds cluster structure in graphs by a mathematical bootstrapping procedure [54].

The MCL algorithm finds the partition of a network at the desired resolution. It starts with the relationship matrix S of the network and normalizes each column of the matrix to obtain a Markov or stochastic matrix M. It then alternates between two operations called expansion and inflation. Expansion corresponds to matrix squaring, i.e., computing $M^2$. Inflation corresponds to taking powers entry-wise followed by a scaling step, i.e., taking the $p$th power ($p > 1$) of every element of $M^2$ and normalizing each column to unity. These two steps are carried out alternately until no further change is observed in the matrix, and the final matrix is interpreted as a clustering.

The physical meaning of this procedure is the following. Through expansion we compute the probability that a random walk visits nodes two steps away from the starting position. If a walk starts within a community, it will remain inside it with greater probability. By raising these probabilities to a power through inflation and then normalizing them, we enhance these paths with respect to the others. The effect is to create a stochastic matrix corresponding to an adjacency matrix (and hence a graph) in which edges between communities are removed. After some iterations, MCL converges to a matrix which is invariant under expansion and inflation. There is in general exactly one non-zero entry per column, yielding the nodes' clusters as separated basins.
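The alternation of expansion and inflation can be sketched in plain Python as below. This is only an illustrative sketch, not the MCL program of [54]: the function names are hypothetical, the added self-loops are an assumption that does not come from the description above, and clusters are read off the rows of the converged matrix.

```python
def normalize_columns(m):
    """Scale every column to sum to one (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
            for r in range(n)]


def expand(m):
    """Expansion: square the matrix."""
    n = len(m)
    return [[sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def inflate(m, p):
    """Inflation: entry-wise p-th power, then column normalization."""
    return normalize_columns([[x ** p for x in row] for row in m])


def mcl(adjacency, p=2.0, max_iter=100, tol=1e-9):
    """Return clusters (frozensets of node indices) for a 0/1 adjacency matrix."""
    n = len(adjacency)
    # Assumption of this sketch: add self-loops before normalizing.
    m = normalize_columns(
        [[float(adjacency[r][c]) + (1.0 if r == c else 0.0) for c in range(n)]
         for r in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m), p)
        change = max(abs(nxt[r][c] - m[r][c])
                     for r in range(n) for c in range(n))
        m = nxt
        if change < tol:
            break
    # Each row with non-zero entries marks an attractor; the columns that are
    # non-zero in that row are the nodes drawn to it.
    clusters = []
    for r in range(n):
        members = frozenset(c for c in range(n) if m[r][c] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Two triangles with no edge between them.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
triangle_clusters = mcl(two_triangles, p=2.0)
```

The dense list-of-lists representation is chosen only to keep the sketch self-contained; it costs on the order of n^3 operations per expansion and would not be suitable for networks of the size analyzed here.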
The whole process of iteration corresponds to simulating many random walks on the network and strengthening their flow where it is already strong and weakening it
where it is weak. The parameter p tunes the granularity of the clustering. If p is large, the effect of inflation becomes stronger and the random walks are likely to end up in a greater number of smaller basins of attraction, or communities. On the other hand, a small p produces larger communities. In the limit of p = 1, only one cluster is found. The MCL method thus has a parameter to be tuned, determining the resolution of the resulting division of the network. Accordingly, in our case we use a granularity parameter of 5.0 and cluster the individual snapshots using this algorithm.

4.5.2. Community transitions

In this section, we introduce and formally define certain critical transitions that communities undergo during their lifetime. These transitions are inspired by a similar notion described by Samtaney et al. [55]. They described an approach for extracting coherent regions from two-dimensional and three-dimensional scalar and vector fields for tracking purposes. To study the evolution of these regions over time, they present certain evolutionary events for objects. The transitions that we define are primarily between two consecutive timestamps, but it is possible to coalesce the transitions from contiguous timestamps by analyzing the meta-data collected from the framework.

We define five basic transitions that the communities can undergo between any two consecutive time intervals or steps. Let $S_i$ and $S_{i+1}$ be snapshots of S at two consecutive time intervals, with $C_i$ and $C_{i+1}$ denoting the respective sets of clusters. The five proposed transitions are:

Continuation

A community continues from time $t_i$ to $t_{i+1}$ if it remains the same from one time step to the other. A community $C_{i+1}^j$ is a continuation of $C_i^k$ if $V_{i+1}^j$ is the same as $V_i^k$. However, the edge sets need not be the same.

Continuation($C_i^k$, $C_{i+1}^j$) = 1 iff $V_i^k = V_{i+1}^j$

The main motivation behind this is that if certain nodes are always part of the same community, then any information supplied to one node will eventually reach the
others. The addition and deletion of edges merely indicates the varying strength between the nodes. It is possible for additional edges to appear between the nodes in the community, but the membership of the communities does not change.

Dissolution

A community dissolves when there is no community existing in the subsequent timestamp that can be matched to it. A community $C_i^k$ is said to have dissolved if none of the vertices in the community are in the same community in the next timestamp, i.e., no two actors in the original community have a collaboration between them in the current timestamp.

Dissolution($C_i^k$) = 1 iff there is no $C_{i+1}^j$ such that $|V_i^k \cap V_{i+1}^j| > 1$

Intuitively, dissolution indicates the lack of collaborations within a group in a particular timestamp. This might signify the breakup of a community or a workgroup. Dissolution occurs when there is no longer collaboration between the nodes in the community, or when old nodes are deleted from the network.

Creation

A creation occurs when a community cannot be correlated with any community existing in the previous timestamp. A new community $C_{i+1}^k$ is said to have been formed if none of the nodes in the community were grouped together at the previous time interval, i.e., no two nodes in $V_{i+1}^k$ existed in the same community at the previous time period.

Creation($C_{i+1}^k$) = 1 iff there is no $C_i^j$ such that $|V_{i+1}^k \cap V_i^j| > 1$

Intuitively, a creation indicates the formation of a new community or a new collaboration. This could happen because of new collaborations being formed or because of the addition of new nodes in the network.

Merging

Merging occurs if two or more communities merge into a single community in the next timestamp. Two different communities $C_i^k$ and $C_i^l$ are said to have merged if there exists a community in the next timestamp that contains at least k% of the nodes belonging to these two communities. The essential condition for a merge is:
Merge($C_i^k$, $C_i^l$, k) = 1 iff there exists $C_{i+1}^j$ such that

$$\frac{|(V_i^k \cup V_i^l) \cap V_{i+1}^j|}{\max(|V_i^k \cup V_i^l|,\ |V_{i+1}^j|)} > k\%, \qquad |V_i^k \cap V_{i+1}^j| > \frac{|C_i^k|}{2} \quad \text{and} \quad |V_i^l \cap V_{i+1}^j| > \frac{|C_i^l|}{2}$$

Here the threshold k% is distinct from the cluster index k that appears in the superscripts. This condition will only hold if there exist edges between $V_i^k$ and $V_i^l$ in timestamp i + 1. Intuitively, it implies that new collaborations have been established between nodes which previously were part of different clusters. This caused k% of the nodes in the two original clusters to join the new cluster. A complete merge (k = 100) occurs when all the nodes in the two original communities belong to the same cluster in the next timestamp.

Splitting

Splitting occurs when a community splits into two or more communities in the next timestamp. A community $C_i^j$ is said to have split if k% of the nodes from this community are present in two different communities in the next timestamp. The essential condition is that:

Split($C_i^j$, k) = 1 iff there exist $C_{i+1}^k$, $C_{i+1}^l$ such that

$$\frac{|(V_{i+1}^k \cup V_{i+1}^l) \cap V_i^j|}{\max(|V_{i+1}^k \cup V_{i+1}^l|,\ |V_i^j|)} > k\%, \qquad |V_{i+1}^k \cap V_i^j| > \frac{|C_{i+1}^k|}{2} \quad \text{and} \quad |V_{i+1}^l \cap V_i^j| > \frac{|C_{i+1}^l|}{2}$$

Intuitively, a split signifies that the collaborations between certain nodes are broken and not carried over to the current timestamp, causing the nodes to part ways and join different communities.

4.5.3. Extracting the Community Transitions

In order to extract the community transitions between consecutive timestamps, we use bit matrix operations. For each timestamp i, we construct a binary matrix of size $k_i \times n$, where $k_i$ is the number of clusters at timestamp i and n is the number of nodes. If the number of nodes changes between the timestamps, we increase the length of the matrices to reflect the larger of the two. We then compare the matrices of consecutive snapshots to identify the transitions between them. Let $C_i^k$ correspond to the row for cluster k in the binary matrix created at timestamp i. To identify the transitions that each cluster has undergone between timestamp i and timestamp i + 1, we perform a set of binary operations on the corresponding matrices. The operations performed to identify these transitions are presented in the outline below.

Continuation
    If AND($C_i^k$, $C_{i+1}^j$) equals OR($C_i^k$, $C_{i+1}^j$) then
        Continuation($C_i^k$, $C_{i+1}^j$) = 1
    End if

Dissolution
    If $\max_{1 \le j \le k_{i+1}}$ Sum(AND($C_i^k$, $C_{i+1}^j$)) $\le 1$ then
        Dissolution($C_i^k$) = 1
    End if

Creation
    If $\max_{1 \le k \le k_i}$ Sum(AND($C_i^k$, $C_{i+1}^j$)) $\le 1$ then
        Creation($C_{i+1}^j$) = 1
    End if

Merging
    If Sum(AND(OR($C_i^k$, $C_i^l$), $C_{i+1}^j$)) / Max(Sum(OR($C_i^k$, $C_i^l$)), Sum($C_{i+1}^j$)) > k%
       && Sum(AND($C_i^k$, $C_{i+1}^j$)) > Sum($C_i^k$) / 2
       && Sum(AND($C_i^l$, $C_{i+1}^j$)) > Sum($C_i^l$) / 2 then
        Merging($C_i^k$, $C_i^l$, $C_{i+1}^j$) = 1
    End if

Splitting
    If Sum(AND(OR($C_{i+1}^k$, $C_{i+1}^l$), $C_i^j$)) / Max(Sum(OR($C_{i+1}^k$, $C_{i+1}^l$)), Sum($C_i^j$)) > k%
       && Sum(AND($C_{i+1}^k$, $C_i^j$)) > Sum($C_{i+1}^k$) / 2
       && Sum(AND($C_{i+1}^l$, $C_i^j$)) > Sum($C_{i+1}^l$) / 2 then
        Splitting($C_i^j$, $C_{i+1}^k$, $C_{i+1}^l$) = 1
    End if

4.6. RESULTS

The analysis is carried out on both versions of the dataset and the corresponding results are presented in this section. The MCL algorithm, when run with a granularity parameter of 5.0 at every timestamp, yields communities with high modularity.

Figure 4.10 shows the community distribution at each timestamp for IISc Co-authorship Dataset I. The number of clusters varies from 101 clusters in 1995 to 142 clusters in 2000. This is a variation of around 40% in the number of clusters, which indicates a lot of activity and dynamism in the community structure. This variation can be attributed both to changing collaborations and to a larger number of people leaving and joining the groups. Figure 4.11 shows the community distribution at each timestamp for IISc Co-authorship Dataset II. The number of clusters varies between 16 and 23. This is a variation of around 44% in the number of clusters, and hence it is an equally active and dynamic subgroup. Since the number of people in a year is fairly constant, the variation can be attributed to changing collaborations.

Figure 4.10 Distribution of Communities in IISc Co-authorship Dataset I (x-axis: year, 1995 to 2004; y-axis: number of clusters)
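The bit-matrix tests of Section 4.5.3 can be illustrated with a minimal sketch. It assumes that every cluster is given as a set of integer node ids and encodes each cluster row as a Python integer used as a bit mask; the names `row`, `popcount` and `find_transitions` are hypothetical, and the comparisons are taken as strict, which is an assumption of this sketch.

```python
def row(members):
    """Encode a cluster (set of integer node ids) as a bit mask."""
    mask = 0
    for node in members:
        mask |= 1 << node
    return mask


def popcount(mask):
    """Number of set bits, i.e. Sum(...) over a binary row."""
    return bin(mask).count("1")


def find_transitions(prev, curr, k_percent):
    """Compare the clusters of two consecutive snapshots.

    prev, curr: lists of node-id sets for timestamps i and i + 1.
    Returns a list of event tuples with cluster indices into prev and curr.
    """
    P = [row(c) for c in prev]
    C = [row(c) for c in curr]
    events = []
    for a, pa in enumerate(P):
        for b, cb in enumerate(C):
            # Continuation: AND equals OR, so the memberships are identical.
            if (pa & cb) == (pa | cb):
                events.append(("continuation", a, b))
        # Dissolution: no later cluster retains two or more members.
        if max((popcount(pa & cb) for cb in C), default=0) <= 1:
            events.append(("dissolution", a))
    for b, cb in enumerate(C):
        # Creation: no earlier cluster contributes two or more members.
        if max((popcount(pa & cb) for pa in P), default=0) <= 1:
            events.append(("creation", b))
    for a in range(len(P)):
        for l in range(a + 1, len(P)):
            union = P[a] | P[l]
            for b, cb in enumerate(C):
                overlap = popcount(union & cb)
                denom = max(popcount(union), popcount(cb))
                if (100 * overlap > k_percent * denom
                        and 2 * popcount(P[a] & cb) > popcount(P[a])
                        and 2 * popcount(P[l] & cb) > popcount(P[l])):
                    events.append(("merge", a, l, b))
    for a, pa in enumerate(P):
        for b in range(len(C)):
            for l in range(b + 1, len(C)):
                union = C[b] | C[l]
                overlap = popcount(union & pa)
                denom = max(popcount(union), popcount(pa))
                if (100 * overlap > k_percent * denom
                        and 2 * popcount(C[b] & pa) > popcount(C[b])
                        and 2 * popcount(C[l] & pa) > popcount(C[l])):
                    events.append(("split", a, b, l))
    return events


# Toy clusters, not taken from the IISc data.
prev_clusters = [{0, 1, 2}, {3, 4}, {5, 6, 7, 8}, {9, 10}]
curr_clusters = [{0, 1, 2, 3, 4}, {5, 6}, {7, 8}, {9, 10}]
toy_events = find_transitions(prev_clusters, curr_clusters, k_percent=50)
```

In this toy example the first two clusters merge, the third splits in two, and the fourth continues unchanged. Integer bit masks are used so that AND, OR and Sum map directly onto `&`, `|` and a bit count.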
Figure 4.11 Distribution of Communities in IISc Co-authorship Dataset II (x-axis: year, 1995 to 2004; y-axis: number of clusters)

Once the communities in each year have been obtained, we extract the community transitions between consecutive years using the bit matrix operations described in Section 4.5.3. To find the merging and splitting transitions, we need to set the parameter k. This parameter gives the percentage of merging or splitting that has occurred. We have experimented with k = 25, 30 and 50. For the IISc Co-authorship Dataset I, we found that k = 25 yields the best results, while for IISc Co-authorship Dataset II k = 50 yields the best results. The lower value of k = 25 required in the former case is possibly due to the larger variation in authors and collaborations.

The results of this process are better represented by examples, and hence we provide examples for each of the community transitions in both versions of the dataset. For privacy reasons, we do not use the actual names of the authors. Instead, each author is given a unique id number and is represented as M(i), where i represents the unique id number. Each community in a timestamp is also given a unique number in that timestamp. Therefore, an author belonging to community j at a particular timestamp is represented as M(i, j). It should be noted that i is unique for an author, irrespective of the timestamp. For IISc Co-authorship Dataset I, i ranges from 1 to 843, while for IISc Co-authorship Dataset II, i ranges from 1 to 79. With this representation, examples for each of the
transitions are presented below.

Occurrence of Continuation in a community indicates strong cohesion within the community and represents a close-knit community. Examples of Continuation can be seen in Figures 4.12 and 4.13. In Figure 4.12, the communities in consecutive years consist of the same members, but the collaborations within the community differ between consecutive years. Here the cohesion of the community has increased from 1998 to 1999, with M(162) in 1999 collaborating with more members in the same community.

Figure 4.12 Continuation in IISc Co-authorship Dataset I

Figure 4.13 Continuations in IISc Co-authorship Dataset II

This indicates a tendency in the group to form more new ties within the
group than outside the group. On the other hand, Figure 4.13 shows a community which has remained unchanged both in its members and in its collaborations between consecutive years. This is found to be the case for most of the continuation communities found in IISc Co-authorship Dataset II, indicating that most of the faculty tend to have long-term associations among themselves.

Creation of a community in both versions of the dataset is often associated with the creation of new collaborations with previously disconnected or connected authors, or with the addition of authors between consecutive years. Figure 4.14 shows the Creation of a community in IISc Co-authorship Dataset I. M(336) has not collaborated with any of his earlier collaborators of 1997 and has instead joined a new group in 1998 by collaborating with M(70).

Figure 4.14 Creation in IISc Co-authorship Dataset I

In IISc Co-authorship Dataset II, creation is found in small groups only. Figure 4.15 shows the Creation of a community in IISc Co-authorship Dataset II. M(1,4) and M(23,8), who belonged to two different clusters in 1995, have formed a new group in 1996 with M(79,5). Collaborations formed between previously disconnected authors may indicate the formation of joint ventures between different areas of research, and an increase in this trend shows an active move towards diversification of the research topics that the authors pursue.
Figure 4.15 Creation in IISc Co-authorship Dataset II

Figure 4.16 Dissolution in IISc Co-authorship Dataset I
Dissolution of a community in both versions of the dataset is often associated with the dissolution of collaborations with previously connected authors or with the deletion of authors between consecutive years. Figure 4.16 shows the dissolution of a community in IISc Co-authorship Dataset I. Authors M(383,41) and M(511,41) of 2000 have not published any papers in 2001, while the other three authors belonging to community 41, M(744,41), M(760,41) and M(811,41), have gone on to join a completely different set of authors in 2001, dissolving the existing community 41 of 2000. In IISc Co-authorship Dataset II, dissolution is found in small groups only. An example of Dissolution in this dataset is shown in Figure 4.17. Two authors who published papers with only each other in 2003, M(20) and M(66), have not collaborated with each other in 2004. They have joined different communities in 2004, resulting in the dissolution of their community. Dissolution of a community may indicate a decreasing trend in the topics previously pursued by the authors.

Figure 4.17 Dissolution in IISc Co-authorship Dataset II

Figures 4.18 and 4.19 show the merging of communities in Datasets I and II. In Figure 4.18, community 13, centered around M(406) in 1995, has grown in 1996 through the collaborations formed by M(406) with M(535), M(570) and M(632) of community 37 in 1995. The merging of groups also suggests that the resultant group represents a confluence of ideas or topics. Thus, one factor which might affect a merger could be the similarity of the topics involved. Figure 4.19 shows a merging that has occurred in IISc Co-authorship Dataset II. This has occurred due to M(11) and
M(34) in community 5 of 2003 collaborating with M(79) in 2004.

Figure 4.18 Merging in IISc Co-authorship Dataset I

Figure 4.19 Merging in IISc Co-authorship Dataset II

We find that most of the communities are centered around a faculty member who has published with most of the members of that community. We found that when this central member forms a new collaboration with a central member of another community, a merging of these two communities occurs. In fact, the merging of two groups has consistently shown this trend. In the case of IISc Co-authorship Dataset I, where we have included both students and faculty, it is reasonable to assume that this central member is a faculty member and that the other members of the community are most likely to be his or her students. In this scenario, a merging is most likely to be associated with one faculty member collaborating
wth another faculty or jontly publshng papers wth ther common students. Note that more than two communtes can merge together, but ths s dscovered as a set of two merge occurrences. Fgure 4.20 shows the splttng that occurs n IISc Co-authorshp Dataset I. In Fgure 4.20, a prevously strong communty n 1996, whch shows strong collaboratons, has splt n 1997 nto two dsconnected groups form collaboratons wth new members. One possble reason for the splttng of a communty s the dvergence of topcs between ts members and the members collaboratng wth other related authors. Fgure 4.20 Splttng n IISc Co-authorshp Dataset I Fgure 4.21 Splttng n IISc Co-authorshp Dataset II 105
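The five transformations discussed in this section (creation, dissolution, continuation, merging and splitting) can be illustrated with a minimal Python sketch. This is not the procedure used in the thesis: the function names, the representation of a community as a set of author ids, and the `min_shared` linking threshold are all assumptions made purely for illustration.

```python
from collections import Counter


def classify_transformations(old, new, min_shared=2):
    """Label community changes between two consecutive years.

    old, new   : lists of sets of author ids (communities in year t and t+1).
    min_shared : hypothetical threshold; an old and a new community are
                 treated as linked when they share at least this many authors.
    Returns a list of (label, index) pairs. The index refers to `old` for
    dissolution, splitting and continuation, and to `new` for creation and
    merging.
    """
    # Successors of each old community and predecessors of each new one.
    succ = {i: [j for j, d in enumerate(new) if len(c & d) >= min_shared]
            for i, c in enumerate(old)}
    pred = {j: [i for i, c in enumerate(old) if len(c & d) >= min_shared]
            for j, d in enumerate(new)}

    events = []
    for i, js in succ.items():
        if not js:
            events.append(("dissolution", i))   # no linked successor
        elif len(js) >= 2:
            events.append(("splitting", i))     # several linked successors
    for j, ps in pred.items():
        if not ps:
            events.append(("creation", j))      # no linked predecessor
        elif len(ps) >= 2:
            events.append(("merging", j))       # several linked predecessors
    for i, js in succ.items():
        # Simplification: any one-to-one link counts as a continuation.
        if len(js) == 1 and len(pred[js[0]]) == 1:
            events.append(("continuation", i))
    return events


def percentage_distribution(events):
    """Percentage share of each transformation label among all events."""
    counts = Counter(label for label, _ in events)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()} if total else {}


# Toy data loosely modelled on the examples in the text (ids are illustrative).
year_t = [{20, 66}, {406, 1, 2}, {535, 570, 632}]
year_t1 = [{20, 5, 6}, {66, 7, 8}, {406, 1, 2, 535, 570, 632}]
print(percentage_distribution(classify_transformations(year_t, year_t1)))
```

In this toy input the two-author community {20, 66} has no linked successor and is labelled a dissolution, while the two communities around 406 and 535 share a single successor, which is labelled a merging. A stricter definition of continuation (identical membership) could be substituted without changing the rest of the sketch.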
Figure 4.21 shows the splitting in IISc Co-authorship Dataset II. Although no collaborations with new members are seen, a strong cohesive group in 2001 has split in 2002 due to the breaking up of previous collaborations.

4.6.1. Discussion of Results

Figure 4.22 shows the percentage distribution of each transformation between consecutive years in IISc Co-authorship Dataset I. The figure shows that the dominant transformations are Creation and Dissolution, which range between 28% and 43%, while Continuation has occurred rarely, between 7% and 18%. Merging and Splitting have occurred less frequently than Creation and Dissolution, ranging between 11% and 28% and going up to 33% during 2002-03. This trend shows that very few communities have remained intact and that there is a significant number of creations and dissolutions of communities between consecutive years. This further substantiates our claim that the communities in this dataset are created with faculty as the central nodes and their students forming the communities around them. Since a student's affiliation with the institute is shorter than that of a faculty member, the members of a faculty member's community vary with the joining and leaving of students even while the faculty member is still active. This results in a larger number of creations and dissolutions and few continuations. This also suggests that this dataset does not reflect the faculty-faculty collaborations very well.

Figure 4.22 Distribution of each transformation in IISc Co-authorship Dataset I (y-axis: percentage of transformations in a year, 0 to 100; x-axis: year, 1995 to 2004; series: Creation, Dissolution, Continuation, Merging, Splitting)

Figure 4.23 shows the percentage distribution of each transformation between consecutive years in IISc Co-authorship Dataset II. In this dataset, a fairly constant 41% of creations is observed for the most part. There was a sudden increase of 50% in the creation of new communities in 2003. This suggests an inflow of new projects and a possible trend towards previously disconnected faculty participating in joint ventures, and hence the sudden upsurge of new communities and new collaborations. Dissolution also stayed fairly constant at 41% for the most part. We also notice that 50% of the communities in 2002 have dissolved in 2003. This further reinforces the earlier suggestion of an inflow of projects and a possible trend towards previously disconnected faculty participating in joint ventures, and the importance given to these newer collaborations. During the years 1997 to 2000, we see that 30% of the communities have remained constant. The continuation of these communities gives an idea of the longevity of a community in the institute and suggests close-knit communities in the institute.

Figure 4.23 Distribution of each Transformation in IISc Co-authorship Dataset II (y-axis: percentage of transformations in a year, 0 to 100; x-axis: year, 1995 to 2004; series: Creation, Dissolution, Continuation, Merging, Splitting)

The extent of Merging of communities varies continuously from year to year, but shows an increasing trend in 2002. Around 63% of the communities in 2002 have merged in 2003. The percentage of Splitting of communities is between 8% and 25%, going up to 34% in 1996. This strongly suggests that splitting is rare in the institute and that most collaborations within the faculty are long lasting. As mentioned previously, splitting happens in small groups of two or three in this dataset.

Overall, we find 2002 and 2003 to be a very interesting period for collaborations in the institute. They mark a period of joint ventures and possibly multidisciplinary ventures in the institute. This period also suggests an increase in the overall research activity of the institute. The results suggest that this dataset is capable of reflecting the structure of collaborations between the faculty and the trends in research undertaken by the institute.

We have also analyzed the correlation between the size of the communities and the activity of the dominant author in that community for each year. Activity is measured in terms of the number of papers published by the dominant author in that year. The correlation coefficient is a normalized measure of the strength of the linear relationship between two variables. We use the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. The correlation coefficient $\rho_{X,Y}$ between two random variables $X$ and $Y$ with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E\big((X-\mu_X)(Y-\mu_Y)\big)}{\sigma_X \sigma_Y} \tag{4.1}$$

where $E$ is the expected value operator and cov means covariance. Since $\mu_X = E(X)$ and $\sigma_X^2 = E(X^2) - E^2(X)$, and likewise for $Y$, we may also write

$$\rho_{X,Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}} \tag{4.2}$$
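As a numerical illustration of Equations (4.1) and (4.2), the following minimal Python sketch computes the Pearson coefficient in both forms from paired samples, using population (divide by n) moments as sample estimates of the expectations. The function names and the toy data (community sizes paired with the dominant author's paper counts) are hypothetical and are not taken from the thesis datasets.

```python
import math


def pearson_covariance_form(xs, ys):
    """Equation (4.1): covariance divided by the product of standard deviations."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    # Undefined (division by zero) when either variable is constant.
    return cov / (sigma_x * sigma_y)


def pearson_moment_form(xs, ys):
    """Equation (4.2): the same quantity written with raw moments."""
    n = len(xs)
    e_x = sum(xs) / n
    e_y = sum(ys) / n
    e_xy = sum(x * y for x, y in zip(xs, ys)) / n
    e_x2 = sum(x * x for x in xs) / n
    e_y2 = sum(y * y for y in ys) / n
    return (e_xy - e_x * e_y) / (
        math.sqrt(e_x2 - e_x ** 2) * math.sqrt(e_y2 - e_y ** 2)
    )


# Hypothetical values for one year: community sizes and the number of
# papers published by the dominant author of each community.
sizes = [3, 5, 8, 12]
papers = [2, 3, 5, 7]
print(pearson_covariance_form(sizes, papers))
print(pearson_moment_form(sizes, papers))
```

Both functions return the same value up to floating-point rounding, which is the equivalence stated by Equation (4.2). The moment form avoids a second pass over the data but can lose precision when the means are large relative to the spread.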
The correlation coefficient ranges between -1 and 1, where -1 means that the two variables have a perfect negative linear relationship, 0 means that there is no linear relationship between them, and 1 means that there is a perfect positive linear relationship between them. Figure 4.24 shows a high degree of correlation between the size of the community and the activity of the dominant member in IISc Co-authorship Dataset I. This observation suggests that the communities are formed around a few dominant members while the rest of the members meander in the social network. This is in fact the inference gained from our methodology, that communities are centered around a dominant member such as a faculty member. This further demonstrates the strength of our methodology in reflecting the inherent structure of a community and validates our approach.

Figure 4.24 Community Size Vs Dominant Member Activity in Co-authorship set I (y-axis: correlation coefficient, -1 to 1; x-axis: year, 1995 to 2004)

Figure 4.25 shows the correlation between the size of the community and the activity of the dominant member in IISc Co-authorship Dataset II. This shows a positive correlation between the two. Although the correlation is fairly high throughout, as shown by our methodology, it also shows large variation around the period 2002-03. This further supports the view that this period has seen a restructuring of research in the institute. A lower correlation indicates that previously dominant people are collaborating with each other.
Figure 4.25 Community Size Vs Activity of the Dominant Member in IISc Co-authorship Dataset II (y-axis: correlation coefficient, -1 to 1; x-axis: year, 1995 to 2004)

4.7. CONCLUSIONS AND FUTURE WORK

In this section we have presented a framework for characterizing the evolution of communities in a collaboration network. This framework is based on the computation of the various transformations that a community can undergo during its lifetime. We have used this framework successfully to study the evolution of these clusters: their formation, transitions and dissolution. We found that our methodology is able to predict the trends in the research of the institute. This establishes our methodology's capability in studying communities and effectively characterizing their changes over time. This work could be extended by incorporating semantic information and studying the information flow among the communities to gain insight into the evolution of topics and to quantify the relevance of author pairs in the datasets. The framework proposed in this chapter can be further used to predict future trends in the network, such as the future co-occurrences of authors in communities.
Chapter 5

Summary

5.1. SUMMARY OF CONTRIBUTIONS

This section summarizes the major contributions of this thesis.

1. The first part of this thesis has defined the problem of inferring the hierarchy of an organization from its email communication patterns. We have compared the performance of various traditional classifiers with that of a Social Network based classifier and found that the Social Network Classifier seems to perform marginally better than most of the traditional classifiers, thus showing the advantage of using Social Network based methods for classification of this kind. We have used the publicly available Enron email corpus for experimentation. We have also established emails as a strong indicator of communication in an organization and demonstrated their capability to reflect the hierarchy of an organization like Enron, and have established the percentage accuracy of the social network classifier, along with the number of communications (in this instance, emails), as a measure of ordered and structural hierarchy.

2. The second part of our thesis has carried out an in-depth analysis of the events that occurred during the period for which we have email data and has identified the critical events that led to the bankruptcy of Enron at the end of this period. We have analyzed how sensitively various social network features correlate with these critical events. Apart from the formal structure, this thesis recognized the presence of informal networks of collaboration that naturally exist within an organization and established the importance of identifying these informal networks to facilitate event detection. We found that the key to the success of informal networks for event detection is their ability to override the formal hierarchical structure in the organization.
3. The last part of our thesis has provided a generic framework for identifying communities and characterizing the types of transformations undergone by these communities in consecutive years. To illustrate the applicability of the proposed framework, an IISc Co-authorship database covering a ten-year period was created from the information available on the Web of Science, a citation database available within the ISI Web of Knowledge [53]. The evolution of communities and their transformations were identified successfully using our framework. Further analysis of these transformations revealed interesting insights into the behavioral patterns of individuals in collaborating with others over time and demonstrated the effectiveness of our methodology in capturing the changing structure of collaborations.

5.2. FUTURE WORK

One of the important contributions of this thesis is to establish the accuracy of the Social Network Classifier, along with the email communications between stakeholders, as a measure of the ordered structural hierarchy of an organization. It also establishes that the presence of informal networks in the organization in significant numbers undermines the established organizational structure. These two conclusions can become excellent precursors for predicting the possible onset of a resulting crisis in the organization. These findings can provide an excellent basis for developing theories and dynamic models for anomaly detection and early warning systems in various scenarios such as the stock market and organizations. Identifying and characterizing the types of changes occurring in the network over time can help law enforcement agencies to better track critical and suspicious individuals and their activities in terrorist organizations and hence provide a means for checking their malicious intent at an early stage.
The methodology presented in this thesis can be applied to study how terrorist groups form, evolve and dissolve over time, and can help identify the dominant members of these groups, whose elimination will result in the disassociation of the group, thus helping to counter terrorism.
Pandemic viruses have the potential to spread rapidly and pose a severe threat. Analysis of the evolution of the underlying social network can be used to devise effective containment policies and restrict their spread at an early stage. All the methodologies presented in this thesis can be extended by incorporating the semantic content available with the communication or any other form of collaboration data.
References

[1] Duncan J. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton & Company, February 2003.
[2] P. Erdos and A. Renyi. On random graphs. Publ. Math. Debrecen, 6:290, 1959.
[3] S. D. Berkowitz. An Introduction to Structural Analysis: The Network Approach to Social Research. Toronto: Butterworths, 1982.
[4] Lada A. Adamic. The small world web. In Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries, ECDL, number 1696, pages 443-452. Springer-Verlag, 1999.
[5] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47-97, 2002.
[6] K. Faust. Centrality in affiliation networks. Social Networks, 19:157-191, April 1997.
[7] M. S. Granovetter. The strength of weak ties. American Journal of Sociology, 78:1360-1380, 1973.
[8] Barry Wellman and Scot Wortley. Different strokes from different folks: Community ties and social support. The American Journal of Sociology, 96(3):558-588, November 1990.
[9] L. Garton, C. Haythornthwaite, and B. Wellman. Studying online social networks. In S. Jones, editor, Doing Internet Research, pages 75-105. Sage, Thousand Oaks, CA, 1999.
[10] B. Wellman. Computer networks as social networks. Science Magazine, 293:2031-2034, 2001.
[11] Michele H. Jackson. Assessing the structure of communication on the world wide web. Journal of Computer-Mediated Communication, 3(1), 1997.
[12] David Gibson, Jon M. Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In UK Conference on Hypertext, pages 225-234, 1998.
[13] Paul Mutton. Inferring and visualizing social networks on internet relay chat. In Proceedings of the Eighth International Conference on Information Visualisation, pages 35-43, Washington, DC, USA, 2004. IEEE Computer Society.
[14] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. On the bursty evolution of blogspace. In Proceedings of the Twelfth International WWW Conference, pages 568-576, 2003.
[15] Cameron Marlow. Audience, structure and authority in the weblog community. In 54th Annual Conference of the International Communications Association, New Orleans, LA, May 2004.
[16] Yangping Zhao, Jizhuang Zhao, and Rongsheng Xu. Network information content security: a framework for intelligent analysis and monitoring. icsssm, 2:841-843 Vol. 2, 2005.
[17] Zhen Zhang, Xiao-Ming Wang, and Yun-Xiao Wang. A p2p global trust model based on recommendation. Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on, 7:3975-3980 Vol. 7, 18-21 Aug. 2005.
[18] P. Oscar Boykin and Vwani P. Roychowdhury. Leveraging social networks to fight spam. Computer, 38(4):61-68, 2005.
[19] Robert D. Nolker and Lina Zhou. Social computing and weighting to identify member roles in online communities. In Web Intelligence, pages 87-93, 2005.
[20] Nasrullah Memon and Henrik Legind Larsen. Practical algorithms for destabilizing terrorist networks. In Proceedings of the First International Conference on Availability, Reliability and Security, ARES, pages 389-400, 2006.
[21] Peter A. Gloor. Capturing team dynamics through temporal social surfaces. In Proceedings of the Ninth International Conference on Information Visualisation (IV'05), pages 939-944, Washington, DC, USA, 2005. IEEE Computer Society.
[22] Andrew McCallum, Xuerui Wang, and Andres Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, 30:249-272, 2007.
[23] Jitesh Shetty and Jafar Adibi. Discovering important nodes through graph entropy: the case of Enron email database. Association for Computing Machinery, ACM Press, 2005.
[24] Ding Zhou, Yang Song, Hongyuan Zha, and Ya Zhang. Towards discovering organizational structure from email corpus. In Proceedings of the Fourth International Conference on Machine Learning and Applications (ICMLA'05), pages 279-284, Washington, DC, USA, 2005. IEEE Computer Society.
[25] M. Schwartz and D. Wood. Discovering shared interests among people using graph analysis of global electronic mail traffic. In Communications of the ACM, 1992.
[26] StatSoft R & D department. Electronic Textbook STATSOFT. http://www.statsoft.com.
[27] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, August 2001.
[28] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[29] Bryan Klimt and Yiming Yang. The Enron corpus: A new dataset for email classification research. In ECML, volume 3201 of Lecture Notes in Computer Science, pages 217-226. Springer, 2004.
[30] R. Bekkerman, A. McCallum, and G. Huang. Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora. Technical Report IR-418, Center of Intelligent Information Retrieval, UMass Amherst, 2004.
[31] Andrew McCallum, Andrés Corrada-Emmanuel, and Xuerui Wang. Topic and role discovery in social networks. In IJCAI, pages 786-791. Professional Book Center, 2005.
[32] J. Shetty and J. Adibi. The Enron email dataset database schema and brief statistical report. Information Sciences Institute Technical Report, University of Southern California, 2004.
[33] Anagha Kulkarni and Ted Pedersen. Name discrimination and email clustering using unsupervised clustering and labeling of similar contexts. In Proceedings of the 2nd Indian International Conference on Artificial Intelligence (IICAI), pages 703-722, December 2005.
[34] Einat Minkov, Richard C. Wang, and William W. Cohen. Extracting personal names from email: applying named entity recognition to informal text. In HLT'05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 443-450, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[35] Nishith Pathak, Sandeep Mane, and Jaideep Srivastava. Who thinks who knows who? Socio-cognitive analysis of email networks. In ICDM'06: Proceedings of the Sixth International Conference on Data Mining, pages 466-477, Washington, DC, USA, 2006. IEEE Computer Society.
[36] P. S. Keila and D. B. Skillicorn. Structure in the Enron email dataset. Comput. Math. Organ. Theory, 11(3):183-199, 2005.
[37] Bülent Yener and Anurat Chapanond. Graph theoretic and spectral analysis of Enron email data. 2005.
[38] Michael W. Berry and Murray Browne. Email surveillance using non-negative matrix factorization. Comput. Math. Organ. Theory, 11(3):249-264, 2005.
[39] Carey E. Priebe, John M. Conroy, David J. Marchette, and Youngser Park. Scan statistics on Enron graphs. Comput. Math. Organ. Theory, 11(3):229-247, 2005.
[40] Jana Diesner, Terrill L. Frantz, and Kathleen M. Carley. Communication networks from the Enron email corpus "it's always about the people. Enron is no different". Comput. Math. Organ. Theory, 11(3):201-228, 2005.
[41] J. Shetty and J. Adibi. Ex employee status report, retrieved from http://www.isi.edu/adibi/enron/enron_employee_status.xls.
[42] Aspen Ferc. Retrieved from http://ferc.aspensys.com.
[43] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 2002.
[44] Rong Qian, Wei Zhang, and Bingru Yang. Detect community structure from the Enron email corpus based on link mining. In Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA'06), pages 850-855. IEEE Computer Society, 2006.
[45] Enron and Wikipedia, 2006.
[46] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks, August 2004.
[47] Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans Coetzee. Self-organization of the web and identification of communities. IEEE Computer, 35(3):66-71, 2002.
[48] M. E. J. Newman. Modularity and community structure in networks. PNAS, 103(23):8577-8582, June 2006.
[49] Deepayan Chakrabarti, Ravi Kumar, and Andrew Tomkins. Evolutionary clustering. In KDD'06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 554-560. ACM Press, 2006.
[50] Jurij Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. 2005.
[51] Tanja Falkowski, Jorg Bartelheimer, and Myra Spiliopoulou. Mining and visualizing the evolution of subgroups in social networks. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pages 52-58. IEEE Computer Society, 2006.
[52] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, et al. Group formation in large social networks: Membership, growth, and evolution, 2006.
[53] Web of Science: http://isiknowledge.com, 2007.
[54] Stijn Dongen. Performance criteria for graph clustering and Markov cluster experiments. Technical report, 2000.
[55] Ravi Samtaney, Deborah Silver, Norman Zabusky, and Jim Cao. Visualizing features and tracking their evolution. Computer, 27(7):20-27, 1994.