Trust in Online Social Networks


Nikolaos Volakis

Master of Science
School of Informatics
University of Edinburgh
2011

Abstract

The recent evolution of online social networks has introduced new notions in social media, especially by giving users the opportunity to extend basic relationships beyond a simple direct connection. Many social network platforms, such as Twitter and Facebook, have been developed on the Web. In these networks many of the end users (agents) do not know each other physically. If two such participants wish to communicate with each other, they need to evaluate each other's trustworthiness along a trust path that connects them within the social network. However, the level of trustworthiness varies; it is sometimes subjective and depends on the person's specific role within the network. Evaluating it is not an easy task, as trust cannot easily be defined through mathematical formulas and algorithmic procedures. Trust may rely on many factors, ranging from psychological and sociological ones to computer security ones. There have been many attempts to define trust for online social networks, each covering one or more specific trust factors. The need to specify trust is growing, as more and more companies focus on this area, for example to build effective marketing strategies through their social activity, for which they need to gain the trust of consumers. On Twitter, for example, marketing strategies focus on persuading users not only to buy a product but also to spread news of it (for example by retweeting) within their trust circles. Trust does not concern only commercial companies: it may involve governments that quickly need to find supporters for a future plan, and, as stated above, it involves the interconnections between the users themselves. This work tries to clarify how trust is defined in online social networks and proposes mechanisms to measure the level of trust and trustworthiness of online agents.

Acknowledgements

I would like to thank my supervisor, Massimo Felici, for his invaluable help, infinite patience and guidance, not only during the dissertation but throughout my MSc year. Without his assistance, this dissertation would not have been possible. I would also like to thank my family for their love, support and encouragement, and my girlfriend Anastasia, who supported me all this time with love and patience.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Nikolaos Volakis)

To my father.

Table of Contents

1 Introduction
    Motivation-Objectives
    Dissertation Structure
2 Trust
    Offline vs. Online Trust
    Identity and reputation as trust parameters
    Online Trust and its role in online environments
    Summary
3 Online Social Networks
    What is an online social network
    Standard Features in Online Social Networking
    Structure and evolution of online social networks
    Social Networks Analysis
    Graph Theory for Social Networks Analysis
    Summary
4 Case study: Twitter
    The Twitter platform
    Twitter main mechanisms
    Twitter analysis
    Summary
5 Twitter and Trust
    Trust mechanisms in Twitter
    Algorithms for identifying trust in Twitter
    Betweenness Centrality
    5.2.2 HITS
    Summary
6 Design
    Requirements analysis
    Functional Requirements
    Non Functional Requirements
    Architecture
    Summary
7 Implementation
    Environment
    Tools
    Coding Techniques and Screenshots
    Summary
8 Evaluation
    Evaluation of the Trust Mechanisms
    Trust Maps
    Summary
9 Conclusions
    Lessons learned
    Future Work
    Final remarks
A Prototype's User Manual
Bibliography

List of Figures

1.1 The relation between structure, trust/trustworthiness and information flow
Graph showing the interactions of the example presented
The communication between nodes and the corresponding adjacency matrix
This figure shows the new communication and adjacency matrix
This figure shows an ego network and a whole one
Example of weighted graph
An example picture of betweenness centrality
Picture that shows the key players
In this picture we observe clusters
Twitter Usage
This graph shows the followers (blue) compared with the actual friends
This picture shows the number of tweets vs. the number of followers
This figure shows the tweets (posts) vs. the number of friends
This figure shows the number of friends vs. the number of followees
Another example of Betweenness Centrality
This is the HITS algorithm's pseudocode
The MVC Architecture
The system's class diagram
The initialization's sequence diagram
Twitter destroy friendship method
The destroy block method
The Prototype
The Tweet Button functionality
7.5 The Retweet button functionality
The Hashtag Button functionality
The picture shows the nodes with the highest node degree rank
This picture shows the node with the highest betweenness centrality
The betweenness centrality results
jtextarea with the downloaded tweets
The HITS results
The most highly ranked users by the HITS algorithm
The chart shows the Centrality results compared to the HITS results
The chart shows the Overall trust value compared with the node degree results
Histogram of the overall trust value
The initial trust map
The updated trust map of the growing case
The case of a shrunken trust map
The case where the map remains the same
A.1 Figure for the user's manual

List of Tables

6.1 MVC class distribution

Chapter 1 Introduction

Since the early days of the Internet, the main goal of the creators of the World Wide Web has been to give people easy access to information and the means to exchange it quickly. In recent years the World Wide Web took a new form and evolved into Web 2.0 [1], which gave developers the capability to design and implement new standardized platforms that host not only information but also real-time social interactions among their users. Some indicative examples are the well-known platforms Facebook, Twitter and MySpace. Online social networks have experienced enormous growth in a short time, and users spend a considerable part of their everyday life interacting with their peers about various topics. This interaction takes the form of direct messages or of other application mechanisms that allow authenticated users to communicate. A network's user base consists of millions of people with different cultural, social and economic backgrounds. Each user is able to generate a significant amount of information, which is exchanged through online interactions. As expected, questions arise about the nature of this information and of the social interactions that peers engage in. An important one concerns trust, which characterizes the information as well as the users. One problem that we explore and analyze in this work is whether trust, formally described as a relationship in which a trustor decides to depend on the trustee's foreseeable behavior in order to fulfill his expectations [2], can be used to describe a particular connection among users who interact exclusively online.

Motivation-Objectives

What led me to choose this research project was the fact that online social networks have become part of our everyday life and that trust plays a very important role in them, especially when we need to decide whether we can trust a user without any face-to-face contact. Additionally, this is a new and constantly growing research area, and there is not yet a paper that clarifies what trust in social networks ultimately is. Trust also has a complex meaning and cannot easily be quantified or computed. This research will give me the opportunity to decompose trust into the features that it relies on and to try to extract the mechanisms that affect it. A trust or mistrust mechanism is a description of how to combine specific features and how to apply them within social circles in order to observe whether the trust level increases or decreases.

Figure 1.1: The relation between structure, trust/trustworthiness and information flow.

The research will mainly focus on the Twitter social network and on how its structure affects trust and trustworthiness within it. The next step will be to determine how these two crucial factors influence the information flow within a network, and in particular to examine the different propagation paths of information that may arise if these factors change. Let us imagine social networks as large graphs with many users connected directly or indirectly to each other, where each node represents a participant and each link corresponds to the interactions between two participants. One participant may assign a trust value to another based mainly on their past interactions and on several other factors. For example, we may extract information from a specific node such as:

- Number of existing friends
- Number of messages sent
- Number of replies to the sent messages
- Degree of influence on other users
- Position within the network

These features are going to be analyzed in detail, and in the end we will be able to understand the relationship between them and how they can (if they can) define the trust level of this particular node within the network. The results may differ for each participant (node) with respect to the above features, because every user is able to regulate the structure of his/her own network by accepting, deleting or communicating with friends. It may also be the case that users are not connected directly but communicate within a specific group of interest. Groups of interest are dedicated to a specific topic, and their users interact with each other about this subject. We may claim that these groups form some kind of trust circle around this topic. This does not mean that somebody will trust every single person within the group, but in comparison with users who are not familiar with the topic, group members may possess a greater impact factor that affects their trustworthiness within the group.

1.2 Dissertation Structure

Our work is structured as follows. In the next chapter we investigate trust in general and try to show through the existing literature that trust and trustworthiness can emerge online. The third chapter investigates online social networks, including their structure, their evolution over time and ways of analysing them that rely on graph theory. The fourth chapter contains the case study of the Twitter social network. We analysed Twitter and pointed out its main features. We also make some interesting observations that we take into account in the implementation and evaluation parts. Chapter five outlines the main trust factors that, in our view, affect the Twitter social network. Furthermore, we describe the main algorithms that we are going to use in the development of our prototype. The next two

chapters cover the design and the implementation and explain step by step the methodology followed for the development of our system. The eighth chapter presents the results of our evaluation, and the final chapter contains our concluding remarks and thoughts for future work.

Chapter 2 Trust

In this chapter we will focus on the aspects of trust in general and argue in favor of online trust. We will show that trust can emerge in online environments and review the existing work.

2.1 Offline vs. Online Trust

Trust is regarded as a main ingredient of successful interpersonal communication, but it is debated whether the key characteristics of online communication conform to the minimal conditions for the rise of trust. In particular, it has been claimed that in an online environment it is infeasible to satisfy two of the most important conditions for trust to occur:

1. The participants should share, as far as possible, a common background with respect to cultural and institutional aspects.
2. One participant should be sure about the other's identity.

As a result, the debate on the potential of online trust reflects two opposing opinions. Some existing works (Pettit 1995, Nissenbaum 2001) [3] [4] argue that the first and second conditions cannot be fulfilled in online interactions and that, as a result, trust cannot occur. On the other hand, there are works (Weckert 2005, de Vries 2006, Papadopoulou 2007) [5] [6] [7] which claim that the two conditions are not prerequisites for the presence of trust, or that there are examples where they are satisfied in one way or another. In our work we will suggest that trust is present in online environments and plays one of the most significant roles in social interactions, and we will focus

on trust as it appears in online social networks. Consequently, we will address the claims against online trust and show that the absence of a common cultural and institutional background is not necessarily a barrier. Furthermore, we will point out several examples from online social networks which indicate that the trustee's identity can last over time, that is, be diachronic [8], and that identity can be affected by the evaluation of reputation. As a next step we will try to identify the association between trust, trustworthiness and the trustee's identity (online and offline) and analyze the nature and the role of trust in online social networks. On the basis of this analysis, we will conclude that trust occurs in online networks and contributes to the evolution of social behaviors.

When two people refer to interpersonal trust they mean the ability of the first, the trustor, to rely on the second, the trustee, to perform an agreed action. The term "agreed" does not necessarily indicate an agreement but rather the certainty that the trustee will perform the particular action that he/she is capable of. In the literature the two participants (trustor and trustee) are called agents [9]. In particular, agent 1 (the trustor) assesses the trustworthiness of agent 2 (the trustee) and decides whether to trust him/her for a particular kind of action. Of course, by trusting someone, people take the risk that the trustee will misbehave, fail to act according to what was agreed [10] and thus betray the trustor. This is why agent 1 seeks guarantees by assessing agent 2's trustworthiness before proceeding with the trust action.

Those who argue against the occurrence of online trust claim that the culture and the morality of people are fundamental aspects that lead to trustworthiness. Two main points are provided to support this argument. Firstly, shared values and norms provide a way for the trustor and the trustee to assess what correct behavior is. Secondly, the trustee feels a social pressure to behave according to the shared norms and values, a pressure that prevents him from betraying the trustor [11]. According to this view, it is almost impossible to identify such trust behaviours in the interactions performed in online social networks.

Nissenbaum (2001) [4] identifies four characteristics of a possibly trustworthy environment. The first is publicity, defined as the practice of making one's identity public and being characterized as reliable or not reliable (a person who tends to betray). Second, for a person to be characterized as trustworthy (reliable) or not, the exposure of his moral and cultural values is an important condition. The third characteristic is reward or punishment, which follows as a consequence of an agent's identity. Finally, we need a well-defined set of public

policies that act as a safety barrier for the trustor. In conclusion, Nissenbaum claims that none of the above elements can be applied in an online environment and that, as a result, the trustor's ability to assess an agent's trustworthiness is considerably limited.

In contrast, Yamagishi and Kikuchi (1999) [11] made an interesting observation about how people interact within their environments and showed that their interactions are not entirely based on their backgrounds. They claim that people develop their social intelligence, overcome the preconditions of trust described above and adjust to the needs of each interaction by gradually learning to assess the trustworthiness of their peers correctly. Their analysis involved examples of American people who communicated with peers from Japan. As a result, we may claim that trust can occur in dynamic environments in which interactions can be assessed repeatedly as time passes. If an agent proves to be untrustworthy, the damage that the trustor suffers is usually recoverable. Such a temporary failure should be regarded as an opportunity for the trustor to make his/her overall interaction with the environment more robust and efficient [11]. Agents in online environments need to take these kinds of risks. If risks are highly constrained by social and economic factors, we are talking about assurance and not trust. Consequently, assurance applies in the online space only when we interact with people with whom we have a high degree of intimacy [12], such as family members or colleagues.

Identity and reputation as trust parameters

Previously we discussed the importance of the identity of individuals. Many of those who argue against online trust hold that a diachronic individual identity is difficult to achieve in an online social network. Here it should be stressed that there are mechanisms, such as authorisation and accounting procedures, which offer the means to construct an identity management system that contributes to the transparency of communications and helps agents to establish their reputation. Consequently, knowledge of a verified identity makes agents less suspicious of a peer, and the assurance level grows. Furthermore, there are mechanisms which allow us to track and monitor the actions performed by agents online, and this gives us a powerful tool for assessing someone's trustworthiness. In this way it is possible to construct the reputation of an online identity without needing to connect it to a specific physical person. Reputation is widely recognised as one of the main criteria used to assess

the trustworthiness of a potential trustee [11]. Reputation therefore makes agents' identities last over time, so we can speak of a diachronic online identity, contrary to the arguments presented by the detractors of online trust. Let us at this point consider some popular examples of online communities in which reputation plays an important role, such as Amazon, eBay and Twitter. On these sites users have no knowledge of other people's physical identity, although each possesses a diachronic online identity. eBay users interact with each other in a purely online environment with high mutual trust, and the reputation of both sides (seller and buyer) is assessed according to their past performance [13]. Similarly, Amazon users rely on the positive feedback received from other users before proceeding with the purchase of a product. On Twitter, users mainly decide to follow another peer based on his/her past interactions and reputation.

Online Trust and its role in online environments

So far we have analysed trust and its role in social systems in general. Now we will focus on online trust. In the previous parts we argued that neither a common background nor knowledge of an agent's physical identity is a prerequisite for the occurrence of online trust. We will now concentrate on the nature of online trust and on how it affects social interactions. A comprehensive analysis of online interactions can be found in several managerial and psychological studies whose goal was to clarify how people interact in the e-commerce sector. Indicative works are those of (Bhattacherjee 2001, McKnight and Chervany 2002, Corritore 2003) [8] [14] [15], which point out the computer-centric nature of the information exchange. In these works, and in that of L. Floridi, information is regarded in the general sense of meaningful content that can be transmitted from a source to a receiver [16]. Online interactions include not only the obvious ones, such as chatting and e-mails, but also all the communications between, for example, sellers and buyers in e-commerce. The exchanged information concerns the products as well as people's honesty and loyalty (Corritore 2003) [8], and the main aim is to enhance the users' trust and attract their attention. The example of e-commerce is important for understanding how trust can be applied to online interactions. In the next sections these interactions will be extended to those that we

meet in online social networks, and we will show by analysing Twitter that in order to gain the trust of his/her followers, or to attract new ones, a user has to follow techniques similar to those applied in e-commerce. For example, through his/her website the seller of an online shop exchanges information with a potential buyer. The data may include the cost of the desired product, its quality and the delivery time. Analogously, on a social networking platform like Twitter, an online agent sends information to all his/her followers about the actions that he/she is performing at the moment, whether he/she likes a particular movie or specific content from a website, and so on. From our definition of trust we can conclude that trust is not a relation itself but a second-order property qualifying first-order relations [17]. Consequently, online trust is a special instance of this second-order property and is the dominating element of online communications. The common factor of offline and online trust is that both are based on the trustee's trustworthiness and that transparency and honesty are two of the main features of trust.

As a next step we concentrate on the role of trust in an online environment. Online trust gives the trustor the advantage of expanding his social network through many interactions with individuals that are regarded as trustworthy, and it acts as an incentive for this purpose. Let us explain this a bit further. As we stated in the previous sections, individuals are able to refine their social intelligence, and this acts as a protection mechanism against risky interactions. Users become part of a circular process of interactions that leads to the selection of trustworthy agents, while untrustworthy ones are steadily excluded and eliminated from the online social network. If we take a close look at the interactions in social networks like Twitter, we can easily spot these dynamics. At this point we may conclude that trust emerges in online communications and also provides a way for interactions in these online environments to evolve.

2.2 Summary

In this chapter we have seen the different opinions on offline and online trust, argued for the existence of online trust and identified the main factors that affect the emergence of trust and trustworthiness online. Furthermore, we looked at the role of online trust and presented some popular examples from e-commerce. We now continue to the next chapter, which covers online social networks.

Chapter 3 Online Social Networks

In this chapter we begin with a definition of online social networks, then focus on their structure and evolution over time, and finally present social network analysis as a technique that will help us achieve our objectives.

3.1 What is an online social network

A social network is a social structure whose nodes, usually individuals or organizations, are connected to other individuals or organizations through relations of similar types. These types vary; some indicative examples are ethical values, common visions, goals, ideas, commercial transactions and friendship. A social network service on the Internet aims at the development of online virtual communities whose members share common interests and activities or are interested in exploring the activities of other people. Online social networking has contributed a lot in this direction, as it has allowed rapid information flow and offered new means of communication. Social networking websites are used on a daily basis by millions of people and have become part of their everyday activities. What we have seen so far can be regarded as the external form of an online social network, in which users are free to create profiles and share them with others. The online profile stands for a global identity of the individual, and an agent's relationships can be regarded as his/her unique global social fingerprint [18]. But there are also other forms of social networking. There is the so-called internal social networking, usually a small social network in which potential users need to be recommended by an existing member in order to gain approval and

access to the network (see the Google+ social network in its early days). People's global identity (profile) is initially defined by the information that they decide to upload, but it is continuously assessed and refined through their social status. As we have seen in the previous chapter, social status contributes to knowledge of an individual's behavior and of his/her expected interaction with his/her peers. Through social status, like-minded individuals can come together and even communicate beyond social networking because of the similarity in their status. A status can be regarded as a gateway to more information that expands connectivity to other users of the network. But, as we have shown above, common patterns between people are not necessarily a prerequisite for trust to occur between individuals who do not know each other.

Standard Features in Online Social Networking

We may wonder at this point why social networking has become so popular. Its extremely high popularity would not have been possible without the well-known basic features that social networks support. The great majority of such networking platforms have integrated these standard features, which make them part of our everyday life. Without them, a platform does not belong to the category of social networking. We believe that the most important feature is the ability given to the user to create and manage his/her personalized homepage. Through it, a lot of personal information and the preferences of the user can be shared with his/her peers. These shared data range from the user's current location and hobbies to simple preferences. This information is then processed by the social network platform in order to connect him/her to like-minded individuals, if they choose to be connected. An additional feature is the ability given to users to choose peers from the like-minded set and add them as friends or contacts on their personal homepage. Consequently, we observe the importance and the dynamics of the homepage feature. Of course, potential friends might not be familiar with each other, so an additional feature, a direct message or request, is provided to ask for more information about the person who invited you to his/her circle of friends. This usually happens in social networks such as Facebook because of the existence of

privacy policies. Many social platforms, like Facebook, can be integrated with applications or gadgets. These add-on applications can also communicate with external websites outside the social network, which may contribute to the network's interactions and thus to its overall traffic. On the other hand, this may cause security concerns, as users are not aware of what happens to their data when they are connected to an external source.

There are also some additional features in social networking. Social networking platforms are not identical, and what mainly distinguishes them is not the basic features but the additional ones that each platform supports. Not all of these features exist to improve the user experience; some serve other roles that are useful in certain cases. For example, a user's social networking profile can be used as a global identity to authenticate him/herself to other social networking websites without the need to maintain different login-password pairs. Other networks assure their users of the complete security of the application in order to prevent data attacks. Finally, others focus more on news and content distribution than on aspects of real social life, for example by limiting the number of pictures that an agent can upload. It thus becomes clear that the scope and the goal of each network are the two most important factors that define its nature.

3.2 Structure and evolution of online social networks

In this section we comment on the evolution and the structure of large online social networks and present some evaluation results on the growth processes through which these networks evolved. A social network can be regarded as a platform or website on which agents are encouraged to sign up, usually for free, and get connected to other agents who participate in the same social network. This ability of people to interact with their peers in an online environment is the main factor that contributes to the success or failure of such networks. Indicative examples are social media platforms like Facebook and Twitter, which experienced explosive growth and redefined the online landscape [19]. Equally, in the sector of online shopping we have the examples of Amazon and eBay, which relied on the power of social networks, as we have seen in the previous chapter.

23 Chapter 3. Online Social Networks 13 In fact, social networking has become a significant source for new business startups which offered users the ability to build and manage their own social network. Consequently it was only a matter of time for the academic community to get involved with the subject and publish many works related to the structure, evolution and analysis of online social networks as well as attempts to identify common patterns and differences between offline and online social networking. According to R. Kumar, J. Novak, P. Raghavan 2004 in their work Structure and evolution of Blog space [20] the density of an online social network which determines the interconnections per agent, corresponds to the following rule: Firstly we observe a rapid growth at the beginning of its lifetime followed by a period of decline before stabilizing and steadily grow from that point on. This pattern could be easily explained if we think of the early users who create a significant amount of connections by the time of the exploration of the newly added system which is followed by the rapid growth, we discussed previously, in which new agents join quicker than friendships can be established [21]. Finally we observe that the increase in both connections and memberships is getting stabilized and continues to grow in a constant rate. The next step involves the classification of the peers of the network in three different classes according to their contribution and involvement. As a result users could be classified to singletons if the agent is a degree-zero node [22], which means that the member has no connections to any of his/her peers and consequently cannot be characterized as active user. The next category is the so called giant component which embodies the large net of people who are interconnected through multiple paths, either directly or indirectly, and are regarded as the most active participants. 
Finally, the middle region consists of the remaining users, who usually form small groups and interact with each other but not with the network as a whole. These agents tend to interact within small circles of trust characterized by a high degree of intimacy, and they are the dominant population of the online network. Even after significant growth in the network, the size of the middle region's communities remains largely stable, whereas the other two classes change significantly. A notable finding of a Yahoo study of social networks is that the likelihood that two different trust circles will merge is unexpectedly low [23]. This suggests that information flows with difficulty between members of different trust circles unless specific mechanisms are present, such as the Twitter hashtag, which connects people in different trust circles. Another solution involves bridge nodes.
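The three classes described above can be illustrated with a minimal sketch. It assumes an undirected friendship graph stored as a dictionary of neighbour sets, and it treats the largest connected component as the giant component and every other group of two or more users as part of the middle region. The function names and the example graph are hypothetical and are not taken from the cited studies.

```python
from collections import deque


def connected_components(adjacency):
    """Return one set of nodes per connected component of an undirected graph."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def classify_nodes(adjacency):
    """Label every node as singleton, giant component or middle region."""
    components = connected_components(adjacency)
    giant = max(components, key=len)
    labels = {}
    for component in components:
        if len(component) == 1:
            label = "singleton"          # degree-zero node
        elif component is giant:
            label = "giant component"    # largest connected group
        else:
            label = "middle region"      # small isolated community
        for node in component:
            labels[node] = label
    return labels


# Hypothetical example: one large group, one small circle, one isolated user.
example = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"},
    "g": set(),
}
labels = classify_nodes(example)
```

In this toy graph the users a to d form the giant component, e and f form a small circle in the middle region, and g is a singleton.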

Only after a significant period of time, during which several users, the so-called interconnectors, have been added between the different circles, can we claim that information flows freely between agents in different circles. At that point we can speak of communities merging, and the merged communities may become part of the giant component. On the other hand, if the interconnectors lose their ability to hold the communities together, growth will fade and the networks may separate again (see betweenness centrality in the next section and chapters). An interesting property of the giant component is that even if we remove its key users, the connectivity of the remaining nodes is not affected, in contrast to the interconnectors of the middle region. Another interesting observation is that over time the average path distance between the nodes of the giant component tends to shrink [21].

From these findings we may conclude that users choose between two ways of participating in an online social network. They may register and steadily try to form their own inner network, or they may be invited by an existing user and become part of his/her network. In the middle region the second method dominates: users are mainly invited because the inviter seeks to rebuild part or all of his/her offline network online. The agents of the giant component, on the other hand, try to build new connections with both known and unknown individuals in order to expand their influence.

The structure of online social networks described above is represented by large graphs in which the agents are the nodes and their connections are the edges. Several kinds of graph systems have been widely studied from a structural point of view.
Examples include the World Wide Web, biological networks, graphs and linguistic networks, with the focus on graph properties such as size, average distance, minimum and maximum paths, density, degree distributions and clustering coefficients [20]. There is significant work in the literature that tries to identify and quantify the structure of online social networks. Faloutsos [24] made the important observation that the degree distribution in the online environment follows a power law, which is confirmed by studies of the World Wide Web graph [25]. Jennifer Golbeck, in her dissertation, studied online friendships in terms of mail graphs by proposing a system that analyzed trust in the context of exchanged mails [26]. The problem that arose with the analysis of large systems was that they were highly dynamic, which made it difficult for researchers to produce results that would hold over time. The proposed solution was to take several snapshots of the system at different time intervals and then analyze them. This and similar approaches can be found in [27]. We have discussed graphs because graph theory is needed in the next section, where we describe how social network analysis (SNA) can be performed to extract useful information.

3.3 Social Networks Analysis

Social Networks Analysis has its origins in social science as well as in graph theory, as stated above. Network analysis is concerned with formulating and solving problems that have a network structure, and this structure is usually expressed through graphs. To analyze graphs we rely on graph theory, which provides a set of abstract concepts and methods for this purpose. These methods can be combined with other analytical tools (the tools we use are discussed in the following chapters) and with further methods developed for the visualization and analysis of such graphs. Together they form the basic concepts of social network analysis.

Social networks analysis is therefore not just a simple methodology: it offers a unique opportunity to understand in detail how society, offline and online, functions. Its goal is to focus on the relations between individuals, groups, etc. rather than on the individuals themselves and their attributes [28]. Put differently, SNA regards individuals as objects embedded in a network of relations and seeks the specific reasons why particular social behaviors occur. Social networks analysis gives us the means to approach trust in the social network environment and to identify the trust mechanisms that appear between different individuals.
More generally, researchers need SNA in the following cases:

- Whenever we need to study a social network, offline or online, or wish to understand how the effectiveness of the network could be improved with respect to what the network is supposed to do (the scope of the network).
- Whenever we need to visualize or quantify our data in order to discover specific patterns in the relations or interactions between individuals (agents).
- Whenever we want to uncover the paths that information follows (information flow analysis).
- Whenever we wish to identify the different perspectives on a specific network. For example, the range of actions that agents perform is often strongly connected to their position in the network (consider the interconnectors mentioned previously). Social networks analysis can unveil different types of actors or key players (in Twitter such individuals are called influencers), from which we can extract useful information about how a particular social network functions.
- Whenever we need to identify the causes of misbehaving networks [29].

Graph Theory for Social Networks Analysis

We can now focus in detail on the graph theory that helps us identify trust mechanisms. First we need to represent relations in terms of networks. Consider the following example: four individuals know each other and need to communicate in order to arrange a meeting. For the purposes of the example we use the names Anastasia (1), Nikos (2), Mary (3) and Dimitris (4), and we focus on their interactions:

1. Anastasia: Nikos, tell Dimitris and Mary that they are invited for tea.
2. Nikos: Mary, you and Dimitris should come for tea.
3. Nikos: Dimitris, I have already talked to Mary and both of you are invited for tea today.
4. Anastasia: Mary, did Dimitris inform you about the invitation? You should come.
5. Dimitris, we are invited for tea this evening.

From these interactions we form a small network.

Figure 3.1: Graph showing the interactions of the example presented

From the network we can derive an edge list, which shows who is communicating with whom, and a matrix that shows the information flow. This second representation is the adjacency matrix, which shows the degree of each vertex as well as how the vertices are connected. Figure 3.2 shows the adjacency matrix and the edge list for our example. The graph used in the example is a directed graph, which captures both the interactions and the information flow between the agents. A different variant is the undirected graph, which again captures the connections between the agents but does not show the information flow. For our example the adjacency matrix now becomes symmetric, whereas the edge list remains unchanged (see figure 3.3). To sum up, the directed graph shows who contacts whom and the undirected graph shows who knows whom.

We now turn briefly to whole and ego networks. The whole network represents all the agents and how they are connected with each other, and it is mainly undirected. The ego network, on the other hand, is a partial network, a subset of nodes, that corresponds to the connections of a specific individual. Figure 3.4 shows an ego network and a whole network. Ego networks help us identify how each user forms his/her own network, as well as the isolated nodes that we described as singletons in the previous section.

The next step of a social network analysis is to identify strong and weak ties within a network or sub-network. We can achieve this by assigning a weight to each edge. There is a variety of weights, and they take different forms depending on the situation in which they are applied.
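As a rough illustration of the edge list and adjacency matrix for the tea-invitation example, the following sketch builds both the directed and the undirected form. It is an assumption-laden sketch: only statements 1 to 4 are encoded, because the speaker of statement 5 is not named in the text, so the matrices here may differ from those in figures 3.2 and 3.3. The helper name `adjacency_matrix` is hypothetical.

```python
names = ["Anastasia", "Nikos", "Mary", "Dimitris"]
index = {name: position for position, name in enumerate(names)}

# Directed edge list (speaker -> addressee) for statements 1 to 4.
# Statement 5 is left out because its speaker is not stated.
edge_list = [
    ("Anastasia", "Nikos"),
    ("Nikos", "Mary"),
    ("Nikos", "Dimitris"),
    ("Anastasia", "Mary"),
]


def adjacency_matrix(edges, directed=True):
    """Build a 0/1 adjacency matrix; rows are sources, columns are targets."""
    matrix = [[0] * len(names) for _ in names]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
        if not directed:
            # An undirected tie is recorded in both directions.
            matrix[index[target]][index[source]] = 1
    return matrix


directed = adjacency_matrix(edge_list, directed=True)
undirected = adjacency_matrix(edge_list, directed=False)

# Row sums give the out-degree, column sums the in-degree of each person.
out_degree = [sum(row) for row in directed]
in_degree = [sum(column) for column in zip(*directed)]
```

The directed matrix records who contacts whom, while the undirected matrix equals its own transpose and only records who knows whom.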
For example, weights can represent the frequency of the interactions performed within a well-defined period of time, or the number of items exchanged within this period (for example, exchanged files). Furthermore, we can take into account the cost of communications in the case of successful interactions, which adds further weight and strengthens the ties. Weight can thus be a function of many different things, and the adjacency matrix and the edge list should be modified accordingly.

Figure 3.2: The communication between nodes and the corresponding adjacency matrix

Figure 3.3: The new communication and adjacency matrix

Figure 3.4: An ego network (black) and a whole network (red)

Figure 3.5: Example of weighted graph

To sum up, for social interactions a proxy for the strength of a tie can be [29]:

- The frequency of the interactions between peers or the amount of exchanged flow.
- The reciprocated interactions or flows.
- The nature of the interactions or the type of flow between agents (for example, intimate or not).
- The structure of the nodes' neighborhood (for example, many mutual friends).
- Any other attributes of the nodes or ties.

We can now ask what stimulates people to form connections. One factor is homophily, the tendency to connect with people who have similar characteristics such as social status, ethical values or interests. Homophily leads to the creation of homogeneous groups, called clusters, within which agents form relations more easily. Homophily is therefore a factor that enhances connections. On the other hand, an extreme degree of homogenization may hinder innovation and the generation of new ideas, so a certain degree of heterophily is desirable in this context.

Transitivity is another crucial factor that affects connectivity, and we explain it with a short example. Imagine that within a social network there is a tie between users 1 and 2 as well as between users 2 and 3. In a transitive network there will then also be a connection between users 1 and 3. It has been observed that strong ties between agents are more often transitive than weak ties. Transitivity may therefore indicate strong ties, but it is not a sufficient condition for them. If a network exhibits both transitivity and homophily, the agents form cliques, or fully connected clusters [29]. There are, of course, also heterophily-based networks, but these are characterized by weaker ties. A subcategory of these weaker ties is the so-called bridges, which may be regarded as nodes and edges that connect across different interlinked groups.
Bridges contribute to the information flow between different networks and may increase social cohesion. They also help spur innovation. Bridges are similar to the interconnectors of the previous section.

At this point we have a general view of the structure of social networks. It is also crucial to find mechanisms that help us identify the key players within such a network. First, it is suggested that we focus on the degree of each node of the social graph. Degree centrality represents a node's in-degree or out-degree, that is, the number of links that lead into or out of the node. In an undirected graph the two are identical. Node degree is often used as a node's degree of connectedness and helps us identify which nodes are important for spreading information. It may therefore also be used as an indicator of influence and popularity, two factors that are strongly connected with the trust and trustworthiness of a node. However, we will show in our work that node degree is not as important as other features that affect trust.

To observe how information flows we need to identify paths between agents (nodes). A path between two nodes is a sequence of non-repeating nodes that connects them. The shortest path is the path that connects two nodes with the smallest number of edges, and this number is also known as the distance between the two nodes. Shortest paths are usually desirable when we care about, for example, the speed of communication.

We have discussed paths and shortest paths because they are needed to introduce the betweenness centrality of a network's nodes. Betweenness centrality is formally defined as the number of shortest paths that pass through a specific node divided by the number of all shortest paths in the network. The result is usually, though optionally, normalized to the interval between 0 (lowest value) and 1 (highest value). Its purpose is to indicate which nodes are most likely to lie on the communication paths between the other nodes of the network. It also helps us understand which nodes are most important for the information flow and act as brokers between sub-networks of the same larger network.
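The idea of a broker node can be made concrete with a short sketch. It assumes the standard pair-wise formulation of betweenness (for every pair of nodes, the fraction of their shortest paths that pass through the node, summed over all pairs) and computes it with breadth-first searches in the style of Brandes' algorithm for an undirected, unweighted graph. The function name and the example graph are hypothetical, and the result is left unnormalized.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from source, counting shortest paths.
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    path_count[neighbour] += path_count[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes and accumulate path shares.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Every undirected pair was visited from both of its endpoints.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical example: node "m" is the only link between two small groups.
groups = {
    "a": {"b", "m"}, "b": {"a", "m"},
    "m": {"a", "b", "c", "d"},
    "c": {"d", "m"}, "d": {"c", "m"},
}
betweenness = betweenness_centrality(groups)
```

In this example every shortest path between the two groups runs through m, so m receives the highest score and all other nodes score zero.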
In other words, betweenness centrality is useful for determining the points where the network would break apart.

Figure 3.6: An example of betweenness centrality. The node in yellow has the highest betweenness centrality

Another important metric that depends on shortest paths is closeness centrality, defined as the mean length of all shortest paths from a node to all other nodes in the network. It indicates the average number of hops needed to reach any other node in the network and is therefore regarded as a measure of reach, useful when we care about the speed of information dissemination.

We now introduce a very important concept that is widely used in ranking algorithms and has its roots in linear algebra: the eigenvector. Eigenvectors are a special set of vectors associated with a linear system of equations (i.e., a matrix equation) that are sometimes also known as characteristic vectors, proper vectors, or latent vectors [30]. The eigenvector centrality of a node is proportional to the sum of the eigenvector centralities of all nodes directly connected to it. A node with high eigenvector centrality is therefore one that is connected to other nodes with high eigenvector centrality, which makes the metric useful in social networks analysis for determining which node is connected to the most connected nodes. This is similar to how Google ranks web pages with the PageRank algorithm, in which links from highly linked-to pages count for more. We stress the importance of eigenvector centrality because in our work we used a variation of HITS, a link-analysis algorithm related to PageRank, to measure how trustful a person is and thus to rank the trustworthiness of the different nodes. In particular, we measure how well connected an agent is with trustful and trustworthy people.

To sum up, we give a short interpretation of each metric so that the reader can more easily understand its meaning within an online social network:

- Degree Centrality: The number of nodes that this agent is able to reach directly.
- Betweenness Centrality: A quantified measure of the likelihood that an agent is the most direct route between any two nodes in the network.
- Closeness Centrality: How quickly can an agent reach everyone in the network?
- Eigenvector Centrality: Is an agent well connected to other well-connected agents?

These metrics provide us with the means to identify key players according to what we are researching. Consider, for example, the graph of figure 3.7.

Figure 3.7: Picture that shows the key players

In this graph, node 10 has the highest node degree. On the other hand, nodes 3 and 5 together are able to reach more nodes than node 10, and the tie between nodes 3 and 5 is critical: if we break it, the network separates into two isolated sub-networks. As a result, nodes 3 and 5 are regarded as more important players than node 10.

Having discussed and briefly analyzed the metrics that can be applied to online social networks, we now focus on properties of a network's structure, which complement what we discussed in the previous section. The first is reciprocity, defined as the ratio of relations that are reciprocated, in other words relations for which there is an edge from the source to the target and one from the target back to the source. Not all relations are reciprocated, of course. Reciprocity is defined on directed graphs, which support mutual relations between two nodes. It can help us identify cliques in directed graphs such as the Twitter graph.
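Two of the measures above lend themselves to a short sketch: closeness, taken here exactly as the mean shortest-path length described in the text (so a smaller value means a closer node), and reciprocity, taken as the share of directed edges whose reverse edge also exists. Both readings are assumptions about how the definitions would be operationalized, and the function names and example data are hypothetical.

```python
from collections import deque


def mean_distance(adjacency, source):
    """Mean number of hops from source to every node it can reach."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    hops = [d for node, d in distance.items() if node != source]
    return sum(hops) / len(hops) if hops else float("inf")


def reciprocity(edges):
    """Share of directed edges (source, target) that are returned."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for source, target in edge_set
                 if (target, source) in edge_set)
    return mutual / len(edge_set)


# Hypothetical undirected star: the hub reaches everyone in one hop.
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
hub_reach = mean_distance(star, "hub")
leaf_reach = mean_distance(star, "x")

# Hypothetical follow relations: only a and b follow each other.
follows = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
share_reciprocated = reciprocity(follows)
```

Here the hub has a mean distance of one hop while each leaf needs more, and half of the follow edges are reciprocated.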

At this point it is useful to introduce density, a common measure of how well connected a network is (in other words, how closely knit it is). Density is useful for comparing different networks or sub-networks of the same network. A perfectly connected network is called a clique: it has a density of 1, and all of its nodes are connected with each other. If the graph is directed, all nodes must be connected with each other and the relations must be reciprocated. We will need cliques again in the next chapters.

Another feature is a node's clustering coefficient, formally defined as the density of its neighborhood. By neighborhood we mean the network that consists only of the node under examination and all nodes directly connected to it. The clustering coefficient of an entire network is the average of the coefficients of its nodes. Clustering is an indicator of the presence of sub-communities within a network, and it helps us identify small worlds, which contain nodes that can be reached within a few steps. We drew on clustering techniques to build the Twitter graph presented in our prototype (see chapters 7 and 8).

A final feature that describes online social networks, and that will be used to identify trust values between nodes, is preferential attachment. It is an attribute of some networks according to which new nodes attach to existing nodes that have a high node degree. The degree of these nodes thus increases disproportionately compared with most other nodes in the network. The result is a network with a few very highly connected nodes and many nodes with a low degree.
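The three notions above can be sketched as follows, under stated assumptions. Density is computed for an undirected graph as the share of possible edges that are present. The clustering coefficient uses the standard local form, which measures the density of the links among a node's direct neighbours and leaves the node itself out, so it may differ slightly from the neighborhood definition given in the text. Preferential attachment is illustrated with a simple growth process in which each newcomer links to one existing node chosen in proportion to its degree. All names, the seed and the growth rule are hypothetical choices made for illustration.

```python
import random


def density(adjacency):
    """Share of possible undirected edges that are present (1.0 for a clique)."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    edge_count = sum(len(neighbours) for neighbours in adjacency.values()) / 2
    return edge_count / (n * (n - 1) / 2)


def clustering_coefficient(adjacency, node):
    """Density of the links among the direct neighbours of node."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Each link between two neighbours is seen once from either side.
    links = sum(1 for a in neighbours
                for b in adjacency[a] if b in neighbours) / 2
    return links / (k * (k - 1) / 2)


def grow_preferential(total_nodes, seed=0):
    """Grow a network where newcomers prefer nodes with a high degree."""
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}      # start from two connected nodes
    endpoints = [0, 1]         # each node appears once per incident edge
    for new_node in range(2, total_nodes):
        target = rng.choice(endpoints)   # chance proportional to degree
        degree[target] += 1
        degree[new_node] = 1
        endpoints.extend([target, new_node])
    return degree


triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
degrees = grow_preferential(200)
```

The triangle is a clique with density 1, the hub of the star has a clustering coefficient of 0 because none of its neighbours know each other, and the grown network can be inspected for the few highly connected nodes the text describes.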
Networks shaped by preferential attachment are said to follow a long-tailed degree distribution, and they tend to have a small-world structure. Several reasons may lead to preferential attachment. The two most common, which we also use in our Twitter analysis, are the popularity and quality factors. By popularity we mean that people tend to associate themselves with popular agents, ideas and items, thus further increasing their popularity, irrespective of any objective, measurable characteristics. This is common in Twitter, as a closer look at verified users such as Barack Obama, Kim Kardashian and David Beckham shows: they have a very large number of followers. The second property, the quality factor, indicates that people tend to evaluate their peers and everything else on the basis of objective quality criteria, and as a result the higher-quality nodes naturally attract more attention, faster. A good combination of these two criteria may lead to a high degree of trust between agents.

Figure 3.8: In this picture we observe 3 different clusters

3.4 Summary

In this chapter we began with a description of online social networks, covering their main features and some of the differences that distinguish them from each other, and we continued by studying their structure and evolution over time. Finally, we introduced the reader to social network analysis (SNA) and to the basic concepts of graph theory needed to perform it.

Chapter 4 Case study: Twitter

Now that we have clarified the basics of social networks analysis, we focus on our case study, the Twitter online social network. We begin with an analysis of Twitter and continue in chapter 5 with trust mechanisms that can be applied to it. The analysis is complemented with graphs plotted from the data of users of the Twitter platform.

4.1 The Twitter platform

Twitter.com is an online social networking platform with millions of users worldwide. It enables agents to interact with family members, friends, colleagues and people they are not close to, through their personal computers and mobile phones. The integrated user interface lets peers post short messages of at most 140 characters, which can be read by other agents. Users select the agents they wish to follow and are subsequently notified whenever a person they follow posts a new status, called a tweet. It is crucial to note that in Twitter an agent who decides to follow another agent does not need to wait for approval, and the other agent is not obliged to follow him/her back. Reciprocity is therefore not required to form a connection, which is why the Twitter graph is a directed one.

We chose Twitter for our case study because its usage has increased dramatically in recent years and because it attracts the attention of many marketing companies, which look for the most trustful and influential people to promote their products. The next figure shows the usage of Twitter in recent years.

Figure 4.1: This chart shows the Twitter usage in recent years

4.2 Twitter main mechanisms

Users of the Twitter platform can post either direct or indirect updates. Direct public posts are aimed at a specific node and are signified by the @ character in front of the target node's username. Indirect status updates, on the other hand, are used when a user's status has no pre-specified target and can be read by any follower who is attracted by the content. Although direct updates are used for direct communication between two peers, they are public, and anyone who follows the sender is able to see them. It is worth mentioning that around 27 per cent of all messages are direct posts, which indicates that this mechanism is widely used among the social network's peers.

An additional and interesting feature is the hashtag (#). Users may include the special character # in their posts to connect their statuses with a specific subject or trend. Any other agent who uses the same hashtag subject, even if he/she is not in the first agent's network, connects his/her post with the previous one. In other words, everyone who uses the same subject is able to see the other posts that refer to it.
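The two mechanisms described above can be recognized in the text of a post with a minimal sketch. It assumes that a username or hashtag subject consists of letters, digits and underscores, and that any post containing at least one @username counts as a direct post. Both are simplifying assumptions, the patterns would also match fragments such as e-mail addresses, and the function name and sample posts are hypothetical.

```python
import re

# Simplified patterns: a marker character followed by word characters.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


def parse_post(text):
    """Extract the addressed usernames and hashtag subjects from a post."""
    mentions = MENTION.findall(text)
    return {
        "direct": bool(mentions),    # addressed to at least one user
        "mentions": mentions,
        "hashtags": HASHTAG.findall(text),
    }


direct_post = parse_post("@nikos tea today? #teatime")
indirect_post = parse_post("Tea is at five #teatime")
```

Both sample posts share the subject teatime, so a reader following that hashtag would see them together, while only the first one is addressed to a specific user.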

4.3 Twitter analysis

As a next step, we obtained ten thousand different users within one and a half hours. The method used to obtain the users' data is described in a later chapter. For each user in our crawled data we extracted the count of followers and followees (the agents followed by this particular node), the number of tweets (posts/statuses) and the dates on which these tweets were posted. Of the users in our data set, 8328 posted at least twice and can be classified as active nodes. The active time of an agent is the time elapsed between the first time he/she posted something online and the most recent one. This is usually known as a user's timeline. On average, agents were actively communicating with the network for 206 days. We also use this metric in our trust mechanism, which is presented in the following chapter.

In particular, we examined how the users are connected with each other and how the number of followers is related to the number of actual friends. We define a user's friends as the persons who received at least two direct messages from that user. By applying this definition we identified the number of actual friends of each node and compared it with the total number of followers. Figure 4.2 shows clearly that the number of actual friends (red) is much smaller than the number of followers (blue), which indicates that agents communicate properly with only a part of their connections. For this reason we conclude that the total number of connections cannot stand alone as an indicator of how trustworthy a person is, because it gives no reliable evidence of his/her communication with those connections. In figure 4.3 we plot the number of tweets on the y axis against the total number of connections of each user.
It is easy to see that many people have a high count of total connections but little or no proper communication with them. This again indicates that the total number of connections does not reflect an agent's trustworthiness, because it does not give us enough evidence to judge his/her overall behavior within the social network. In contrast to the previous figures, figure 4.4 indicates that as the aggregated count of actual friends grows, the total count of tweets grows as well. This is evidence that, in order to extract useful information about an agent's behavior, we need to take most of his/her actual friends into account. This also relies on the definition of friends given above, which assumes that if a person has posted at least two direct messages to another peer, he/she will most likely continue to communicate with him/her.

Figure 4.2: This graph shows the followers (blue) compared with the actual friends

Figure 4.3: This picture shows the number of tweets vs. the number of followers

Figure 4.4: This figure shows the tweets (posts) vs. the number of friends

Figure 4.5: This figure shows the number of friends vs. the number of followees

Another interesting aspect of our Twitter analysis is the observation that the count of friends changes in line with the number of followees only up to a point. Figure 4.5 indicates that although the count of actual friends initially increases in proportion to the number of followees, after some time the number of friends levels off. This observation agrees with the experience that the cost of confirming a new connection is low compared with the cost of preserving friends by steadily exchanging messages with them. Consequently, the number of users that an agent properly interacts with eventually stops rising, whereas the number of followees may rise continually.

In the previous section we clarified the meaning of reciprocity. Reciprocity plays an important role in most economic and social transactions, and thus interactions [31]. We have also discussed the examples of Amazon, eBay etc., which demonstrate the existence of reciprocity to a certain degree. But because of all the information and signals that flood users every day, attention is a rare commodity, and that makes it a valued private good [32]. In Twitter we found that the idea of reciprocated attention exists [32]. According to our definition of friendship, agent 1 can be a friend of agent 2 while agent 2 does not return the same feeling for agent 1. Nevertheless, we found that 85 per cent of a user's friends reciprocate attention. This finding is important for defining a user's actual network, from which we can extract useful information about his/her presence in the network.

4.4 Summary

In this chapter we introduced the Twitter social networking platform and analyzed it by crawling the data of users, which led to a number of interesting observations. To sum up, even with a rather weak definition of friend, we found that agents' networks contain only a small number of actual friends compared with the total number of connections they possess. This denotes the existence of two different networks: a dense one that consists of both followers and followees, and a sparser one that contains only the actual friends. The second appears to be the more influential network, because agents with a large number of friends tend to post new messages more frequently than agents with fewer actual friends. In contrast, agents with many connections but few or no actual friends post updates less often than those with few total connections.
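The friend definition and the active time used in this chapter can be expressed as a minimal sketch. It assumes that the crawled direct posts are available as (sender, recipient) pairs and that post dates are `datetime.date` values. The data layout, the function names and the sample data are hypothetical and do not describe the crawler used in this work.

```python
from collections import Counter
from datetime import date


def actual_friends(direct_posts, user, minimum=2):
    """Recipients who received at least `minimum` direct posts from user."""
    counts = Counter(recipient for sender, recipient in direct_posts
                     if sender == user)
    return {recipient for recipient, n in counts.items() if n >= minimum}


def reciprocated_share(direct_posts, user):
    """Share of the user's friends who also count the user as a friend."""
    friends = actual_friends(direct_posts, user)
    if not friends:
        return 0.0
    returned = sum(1 for friend in friends
                   if user in actual_friends(direct_posts, friend))
    return returned / len(friends)


def active_days(post_dates):
    """Days elapsed between the first and the most recent post."""
    if not post_dates:
        return 0
    return (max(post_dates) - min(post_dates)).days


# Hypothetical sample: Anastasia writes twice to Nikos but once to Mary.
posts = [
    ("anastasia", "nikos"), ("anastasia", "nikos"), ("anastasia", "mary"),
    ("nikos", "anastasia"), ("nikos", "anastasia"), ("nikos", "anastasia"),
]
friends_of_anastasia = actual_friends(posts, "anastasia")
share = reciprocated_share(posts, "anastasia")
timeline = active_days([date(2011, 1, 1), date(2011, 1, 11)])
```

With this sample, Mary is a connection but not an actual friend, which mirrors the gap between connections and friends observed in figure 4.2.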

Chapter 5

Twitter and Trust

5.1 Trust mechanisms in Twitter

We stated in the previous section that Twitter is a directed graph whose directed links are not limited to intimate relationships; they may also represent common interests, a passion for breaking news, or even an interest in celebrity gossip. Such directed links denote the flow of information and hence indicate an agent's trustworthiness and, by extension, his/her influence on other agents [33]. In this work we present in depth five mechanisms that can be used to measure the trustworthiness of nodes within a social network such as Twitter:

1. Each node's degree (the proportion of indegree vs. outdegree, as defined in the section on social network analysis).
2. The number of retweets of a particular node's statuses.
3. The number of mentions that a particular node receives.
4. The betweenness centrality of each node, as defined in the section on social network analysis.
5. The HITS rank of each node. In the following sections we discuss in detail the variation of the HITS ranker that we used to rank each node.

Based on these measures we try to explain the dynamics of each user's trust values across time and topics of discussion. Our first observation is that popular agents with a high node degree proportion are not necessarily trustworthy in the sense of communicating properly with other peers. This means that

their influence on other users approaches zero. A second observation is that the most trustworthy users may hold significant influence over a variety of topics. A third is that, in order to become trustworthy, a user has to make a concerted effort and follow specific strategies, which we present in this section and in the following chapters. We believe that our findings could help businesses refine their marketing policies and strategies on Twitter, since we suggest that node degree alone cannot indicate how trustworthy, and by extension how influential, a person is. We chose this approach because essential components such as human choices and the ways in which our society functions are very difficult to reproduce within the scope of our work.

The analysis of the above mechanisms gives us a better understanding of the different roles that agents may have in social media. The node's degree gives an initial indication of the popularity of the node. The number of retweets represents the content value of what the node has tweeted, and the number of mentions stands for the name value of the user [34]. The betweenness centrality of each node tells us how important the node's position is within the network and how valuable the node is for the network's information flow. Finally, the weighted HITS ranker quantifies how many trustworthy nodes are connected to the node under examination and, in conjunction with the weight computed from the overall mentions and retweets over time, gives the overall HITS rank for each user. The charts presented in the previous chapter suggest that node degree alone cannot reveal much about the levels of trust and trustworthiness, and hence the influence, of a user.
Avnit (Avnit 2009) called this phenomenon the million follower fallacy. He pointed out that some users follow other users simply out of etiquette; for example, some users consider it polite to follow someone who has followed them. In addition, as Daniel Romero indicates in his work Influence and passivity in social media [35], not all broadcast tweets are read. This is why we need mechanisms, such as the frequency of mentions and retweets, that are able to identify active users.

At this point it is useful to review briefly what has been proposed in the literature about trust in Twitter. Several recent efforts have tried to capture the meaning of trust in the Twitter social network. The Web Ecology Project tracked

popular and trustworthy Twitter users for a period of ten days. The people working on the project observed that these users could be classified either as conversation-based influencers or as content-based influencers, and concluded that news media such as Breaking News are better at spreading content, while celebrities and other verified users are better at simply making conversation [36]. This indicates that users may belong to different categories of trust, which reflects the fact that some people are regarded as authorities on certain subjects. In particular, the most followed users range from public figures and celebrities such as Barack Obama and David Beckham to news sources such as CNN and Breaking News. Here the node degree is useful, because we need to understand popularity in terms of the connectivity of each user.

The most retweeted users were content aggregation services such as TweetMeme and businessmen such as Guy Kawasaki. What they have in common is that they track trending topics or are successful people with knowledge in different fields. In contrast with the node degree mechanism, retweets represent the trustworthiness of users beyond one-to-one interaction, since retweets can propagate multiple nodes away from their original source (the sender). Additionally, because of the tight connections between peers described by triadic closure [37], retweeting in a social network can act as an important mechanism for reinforcing a particular message. For example, the probability of adopting an innovative idea increases when the idea is supported not by a single person but by a whole group of people [38].
However, users should be careful with the content that they retweet. Marcelo Mendoza and Barbara Poblete [39] show that in a crisis it is difficult for users to filter messages and distinguish false rumors from confirmed truths, and that if users retweet false rumors, for example about the death of people, their trustworthiness shrinks.

Finally, the people whose identity appeared most often as a mention belonged to the celebrity group. Ordinary users showed a strong preference for celebrities by mentioning them without necessarily retweeting their posts. Mentions can therefore be regarded as similar to replying to a message, while retweets can be regarded as similar to citations in scientific papers.

Another work used a variation of the PageRank algorithm to quantify trust and influence on Twitter [40]. Its authors found high link reciprocity (almost 72 per cent) in a non-random sample of users based in Singapore and argued that high reciprocity is an indication of homophily. They exploited this fact in computing a user's trust value and consequently his/her

influence rank. However, their observations mostly concerned the structure of the network and did not take into account the social behavior of each particular user. By contrast, in this work we propose a solution that covers both the structure of the network and the social behavior of each agent.

5.2 Algorithms for identifying trust in Twitter

In a previous section we discussed social network analysis and introduced betweenness centrality. We now return to it in more detail, because we have included betweenness centrality as part of our trust mechanism and the betweenness rank of each node plays a significant role in its trust value. We will also present the HITS ranker.

Betweenness Centrality

We have already noted that in large social networks such as Twitter not all nodes can be regarded as equal. For instance, the impact of removing a node from a network depends on the node. If the node lies at a dead end [41], its removal may have no consequences, in contrast with the removal of a cut-vertex (see bridges/interconnectors in chapter 3), which may cause the network's components to break apart [42], [43]. In SNA, the problem of determining the centrality of agents as a function of their position within the network was studied in [44], [45], and different quantities were defined in that context in order to quantify centrality. One might assume that the centrality rank is proportional to the connectivity of a node. This assumption is wrong, because centrality is in general not related to connectivity. Connectivity is a purely local quantity and does not provide all the information needed to assess the importance of a node in the network.
Indeed, an agent may not have a high node degree, yet its removal may be fatal because it links together different parts of the network. A good measure of the centrality of a node therefore has to incorporate more global information, such as the role the node plays in the existence of paths between any two given nodes in the network [41]. We now look more closely at the algorithm in order to understand how the

centrality ranking is computed. Betweenness centrality counts the fraction of shortest paths going through a given node. More precisely, the betweenness centrality of a node u is given by [44], [45]

$$ g(u) = \sum_{s \neq u \neq t} \frac{\sigma_{st}(u)}{\sigma_{st}} $$

where $\sigma_{st}$ is the total number of shortest paths from node s to node t and $\sigma_{st}(u)$ is the number of shortest paths from s to t that go through node u. The quotient $\mu_{st}(u) = \sigma_{st}(u)/\sigma_{st}$ is called the pair dependency [46]. The betweenness centrality g scales with the number of pairs of nodes $s \neq t \neq u$, and some authors normalize it by $(N-1)(N-2)/2$ in order to obtain a number in the interval [0, 1], where N is the number of nodes in the giant component of the network that we discussed in a previous chapter.

A high centrality value indicates that a node can reach others on short paths, or that it lies on many shortest paths. If a node with a high betweenness centrality value is removed from the graph, two situations may arise. In the first, the paths between many pairs of nodes are lengthened. In the second, less desirable situation, the node is a cut-vertex [42], [43] and its removal splits the graph into new, smaller components. This property was used, for instance, in [47] to discover communities in large networks iteratively. There are also other centrality metrics based on the shortest paths that link pairs of nodes, namely stress, closeness and graph centrality; they can be found in [44], [45]. The pseudo-code that a programmer should consult before implementing betweenness centrality is that of the algorithm of Brandes [46].
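As a minimal sketch of how g(u) can be computed, the following Java class follows the accumulation scheme of Brandes [46] for unweighted directed graphs: one breadth-first search per source node counts shortest paths, and a backward pass accumulates the pair dependencies. It is an illustration under stated assumptions and not the prototype's own code: the class name BetweennessSketch, the helper graph and the adjacency-list representation with nodes numbered 0 to n-1 are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of Brandes-style betweenness centrality (unweighted, directed). */
class BetweennessSketch {

    /** Builds an adjacency list for n nodes from directed edges {from, to}. */
    static List<List<Integer>> graph(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) adj.get(e[0]).add(e[1]);
        return adj;
    }

    /** Returns the unnormalised betweenness g(u), summed over ordered pairs (s, t). */
    static double[] betweenness(List<List<Integer>> adj) {
        int n = adj.size();
        double[] g = new double[n];
        for (int s = 0; s < n; s++) {
            Deque<Integer> visited = new ArrayDeque<>(); // stack: farthest nodes popped first
            List<List<Integer>> pred = new ArrayList<>();
            for (int i = 0; i < n; i++) pred.add(new ArrayList<>());
            double[] sigma = new double[n];              // number of shortest paths from s
            int[] dist = new int[n];
            Arrays.fill(dist, -1);
            sigma[s] = 1.0;
            dist[s] = 0;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {                   // breadth-first search from s
                int v = queue.poll();
                visited.push(v);
                for (int w : adj.get(v)) {
                    if (dist[w] < 0) {                   // w discovered for the first time
                        dist[w] = dist[v] + 1;
                        queue.add(w);
                    }
                    if (dist[w] == dist[v] + 1) {        // v precedes w on a shortest path
                        sigma[w] += sigma[v];
                        pred.get(w).add(v);
                    }
                }
            }
            double[] delta = new double[n];              // accumulated dependencies of s
            while (!visited.isEmpty()) {
                int w = visited.pop();
                for (int v : pred.get(w)) {
                    delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w]);
                }
                if (w != s) g[w] += delta[w];
            }
        }
        return g;
    }

    public static void main(String[] args) {
        // Directed path 0 -> 1 -> 2: only node 1 lies between another pair of nodes.
        double[] g = betweenness(graph(3, new int[][] {{0, 1}, {1, 2}}));
        System.out.println(Arrays.toString(g));
    }
}
```

The result counts ordered pairs (s, t). For an undirected graph stored with both edge directions, each unordered pair is therefore counted twice, so the values should be halved before applying the normalization by (N-1)(N-2)/2 mentioned above.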

According to Ulrik Brandes [46], this algorithm requires O(n + m) space and runs in O(nm) and O(nm + n^2 log n) time on unweighted and weighted networks, respectively, where n is the number of nodes and m is the number of links.

As a next step it is crucial to clarify why we need betweenness centrality as one of our trust metrics for the Twitter social network graph. We claim that a node in a central position within a network may act as a router of the information flow. A node in such a position may contribute to the different topics that are discussed within the different sub-networks that it links together. This may raise his/her trustworthiness and grow his/her trust map.

Figure 5.1, together with the following example, illustrates the role of betweenness centrality in trust. For the purpose of the example, let us assume that nodes 34 and 3 are managers who represent both football and basketball athletes and try to find the best deal for their clients. Let us also say that the white sub-network represents the football industry and the gray sub-network represents the basketball industry. Although node 34 has a central role within the football industry (white sub-network), he/she is some hops away from the basketball industry and may therefore have to rely on other nodes, who may be competitors, to achieve his/her goals. By contrast, node 9 has access to both the football and the basketball information flow. That could give him/her a strategic advantage over his/her competitors, which could raise his/her reputation and thus his/her trustworthiness among his/her clients.

Figure 5.1: Another example of Betweenness Centrality

HITS

Another important metric that we use is a variation of the HITS ranker algorithm. HITS stands for Hyperlink-Induced Topic Search and is also known as the Hubs and Authorities algorithm. HITS is a link analysis algorithm that rates web pages. It was introduced by Jon Kleinberg and was a forerunner of the famous PageRank algorithm that Google uses to rank web pages. The idea behind Hubs and Authorities has its roots in a particular insight into the creation of web pages when the Internet was originally forming. Certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalogue of information that led users directly to other, authoritative pages [48]. In other words, a good hub represented a page that pointed to many other pages, and a

good authority represented a page that was linked by many different hubs [49]. For this reason the HITS algorithm assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages [48].

To understand why the HITS algorithm was chosen to rank Twitter users, let us consider an example that introduces the concept: how HITS could be used to rank scientific journals [49]. Many different methods have been used to evaluate the significance of published academic work. One of them is the impact factor introduced by Garfield [50], which is based on citation counts. The idea behind HITS refines this: what matters most is not how many citations a journal or an article receives but how important the citing sources are. In other words, it is better to receive citations from an important journal than from an unimportant one [48].

We use HITS to rank nodes in the Twitter social network in a similar manner. The details are given in the implementation and evaluation part of this work; here we only give an introduction. First we build weights for each node out of its tweets: we count the frequency of retweets and mentions that each node received during its time in the Twitter network and build the weights from the aggregated results. Then we apply the HITS algorithm to the weighted graph and rank each of the nodes again. As with journals, a node may have many incoming connections while the sum of the weights of the connections that point to it is not high. On the other hand, a node may be pointed to by the most trustworthy and trustful nodes in terms of retweets and mentions, which can enhance his/her trustworthiness over time.
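The weight-building step can be sketched as follows. This is a minimal illustration and not the prototype's actual rules: the class TweetWeightSketch and its methods are hypothetical, retweets are assumed to follow the textual "RT @name" convention, every other "@name" is assumed to be a mention, and retweets and mentions are assumed to contribute equally to the weight of the edge from the author of a tweet to the user named in it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: edge weights from the retweets and mentions found in tweets. */
class TweetWeightSketch {
    // Assumed conventions: "RT @name" is a retweet of name, any other "@name" is a mention.
    private static final Pattern RETWEET = Pattern.compile("\\bRT\\s+@(\\w{1,15})");
    private static final Pattern MENTION = Pattern.compile("@(\\w{1,15})");

    // weights.get(author).get(target) = retweets plus mentions of target by author
    private final Map<String, Map<String, Integer>> weights = new HashMap<>();

    /** Analyses one tweet and updates the weights of the edges leaving its author. */
    void addTweet(String author, String text) {
        count(author, RETWEET.matcher(text));
        // Blank out the retweet markers so retweeted names are not counted again as mentions.
        String withoutRetweets = RETWEET.matcher(text).replaceAll(" ");
        count(author, MENTION.matcher(withoutRetweets));
    }

    private void count(String author, Matcher m) {
        while (m.find()) {
            weights.computeIfAbsent(author.toLowerCase(), k -> new HashMap<>())
                   .merge(m.group(1).toLowerCase(), 1, Integer::sum);
        }
    }

    /** Weight of the edge author -> target, or 0 if the author never named the target. */
    int weight(String author, String target) {
        return weights.getOrDefault(author.toLowerCase(), Collections.emptyMap())
                      .getOrDefault(target.toLowerCase(), 0);
    }

    public static void main(String[] args) {
        TweetWeightSketch w = new TweetWeightSketch();
        w.addTweet("carol", "RT @alice: great news from @bob");
        w.addTweet("carol", "@alice thanks!");
        System.out.println(w.weight("carol", "alice"));
        System.out.println(w.weight("carol", "bob"));
    }
}
```

Keeping separate counters for retweets and mentions would be a natural variation if the two signals are to be weighted differently.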
We have justified why we regard the HITS algorithm as an important mechanism for measuring users' trust. We now explain how the algorithm works (the description of the algorithm was taken from the Wikipedia website). In the original HITS algorithm, which operates on web pages, the first step is to collect the results of the search query. Authority and hub values are then defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values of the pages that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. The algorithm performs a sequence of iterations, each consisting of two basic steps. The first step is the Authority Update: we update each node's Authority score to be equal to the sum of the Hub scores of each node that points to it. This

means that a node is given a high authority score by being linked to by nodes that are recognized as Hubs for information. The second step is the Hub Update: we update each node's Hub score to be equal to the sum of the Authority scores of each node that it points to. This means that a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.

The Hub score and Authority score of a node are calculated with the following steps:

1. Start with each node having a hub score and an authority score of 1.
2. Run the Authority Update step described above, followed by the Hub Update step.
3. Normalize the values by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and each Authority score by the square root of the sum of the squares of all Authority scores.
4. Repeat from the second step as necessary, since HITS is iterative.

HITS is similar to PageRank in that it relies on iterations based on the linkage of documents on the web. However, it has some major differences:

- It is executed at query time, not at indexing time, with the associated performance cost of query-time processing. Thus, the hub and authority scores assigned to a page are query-specific.
- It computes two scores per document, hub and authority, as opposed to a single score. In our work an overall HITS score is then assigned to each node.

The pseudo-code of the algorithm we used is shown in Figure 5.2. Its main difference from standard HITS is that it is applied to a weighted graph that takes into account the frequency of mentions and retweets for a specific node.
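The iteration described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions rather than the exact variation used in the prototype: the class WeightedHitsSketch is hypothetical, the graph is given as a dense matrix in which w[i][j] is the weight of the edge from node i to node j (0 meaning no edge), each edge weight is assumed simply to multiply the score passed along that edge, and the number of iterations is fixed by the caller.

```java
import java.util.Arrays;

/** Hypothetical sketch of HITS on a weighted directed graph. */
class WeightedHitsSketch {

    /** Returns {hubScores, authorityScores} after the given number of iterations. */
    static double[][] hits(double[][] w, int iterations) {
        int n = w.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);   // step 1: every node starts with hub = authority = 1
        Arrays.fill(auth, 1.0);
        for (int it = 0; it < iterations; it++) {
            // Authority update: weighted sum of the hub scores of the nodes pointing to j.
            double[] newAuth = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    newAuth[j] += w[i][j] * hub[i];
                }
            }
            // Hub update: weighted sum of the new authority scores of the nodes i points to.
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    newHub[i] += w[i][j] * newAuth[j];
                }
            }
            normalize(newAuth);
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return new double[][] {hub, auth};
    }

    /** Divides each entry by the square root of the sum of squares (no-op for a zero vector). */
    static void normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) sumSq += x * x;
        double norm = Math.sqrt(sumSq);
        if (norm == 0.0) return;
        for (int i = 0; i < v.length; i++) v[i] /= norm;
    }

    public static void main(String[] args) {
        // Nodes 0 and 1 both point to node 2; the edge from node 0 carries more weight.
        double[][] w = new double[3][3];
        w[0][2] = 3.0;
        w[1][2] = 1.0;
        double[][] scores = hits(w, 20);
        System.out.println("hub:       " + Arrays.toString(scores[0]));
        System.out.println("authority: " + Arrays.toString(scores[1]));
    }
}
```

In the small example, node 2 receives the whole authority score, and node 0 obtains a higher hub score than node 1 because its edge carries more weight. How the two scores are combined into a single overall rank is left open here.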

Figure 5.2: The pseudocode of the HITS algorithm

5.3 Summary

In this chapter we explored the different parameters that affect a Twitter user's trust and trustworthiness, such as node degree, position within the network, and interactions in terms of retweets and mentions. In order to rank users we introduced two main algorithms, betweenness centrality and the HITS ranker, and analyzed their features and structure.

Chapter 6

Design

So far we have introduced trust in general and disproved the statement that trust cannot emerge in online environments. We then explored the role of trust in online communities and social networks, and discussed online social networks themselves: we examined their structure and several methods for analyzing them. The following step was to use some of those methods to perform a network analysis of the Twitter social network, in which we observed some interesting features that we presented through charts. The final step was to identify the trust factors and the algorithms needed to build our trust mechanisms. Based on these findings we designed and implemented a prototype that helped us quantify and observe how trust emerges between the users within my Twitter graph. In this chapter we present the requirements that our prototype should fulfil and its basic architecture, followed by a class diagram and a sequence diagram.

6.1 Requirements analysis

In this section we briefly present the requirements that our system should fulfil.

Functional Requirements

1. The system should be able to identify my Twitter graph by following a specific rationale: it looks at my followers and followees lists and iteratively extracts the corresponding lists of each single user. From these lists it should identify all the friends, as defined in a previous section, that are also

included in my lists.
2. The system should be able to extract and download all the tweets of the identified users together with the date on which each tweet was created.
3. The system should be able to perform a tweet analysis by identifying all the retweets and mentions for each user.
4. The system should be able to calculate the frequency of the mentions and retweets of each user.
5. The system should be able to build and visualize the graph of the first requirement.
6. The system should support the basic mechanisms of Twitter, which include tweets, retweets, direct messages and tweets with hashtags (#), and show the information flow for each mechanism. If, for example, two users who are not connected by an edge contribute to the same subject through a hashtag (#), the system should show the appropriate information flow by keeping a state machine.
7. The system should support basic graph manipulation mechanisms, including drag-and-drop functionality for the nodes and zoom-in and zoom-out buttons.
8. The system must be able to calculate the node degree, the betweenness centrality and the HITS rank for each user and calculate each user's overall trust value.
9. The system must be able to demonstrate different scenarios by identifying strategies with which a user could grow his/her trust map, what causes a user's trust map to shrink, and in which cases the trust map of a user remains the same across time.

Non Functional Requirements

1. The system should scale to large amounts of data. This means that it should not crash or overflow if millions of users together with all their data are to be processed by the system.
2. The system should be efficient and run in an acceptable amount of time, for example by using appropriate data structures.

3. The system should be portable. By this we mean that it should be environment-independent and able to run on all known operating systems that support Java.
4. The system should be extensible. It should be designed to be as generic as possible in order to support future functionality.

6.2 Architecture

We decided that the most appropriate architecture to follow is Model-View-Controller (MVC). MVC is a classic design pattern that is often used by applications that need to maintain multiple views of the same data. The pattern relies on a clean separation of objects into three categories: models, which maintain the data; views, which display all or a portion of the data; and controllers, which handle events that affect the model or the view(s).

To understand why MVC is needed, it helps to identify the problem that led software engineers to define it. User interfaces attract many change requests over time, and different users of the same interface may ask for different changes. Furthermore, user interface technologies change rapidly, and programmers may want to support different look-and-feel standards in order to meet different users' needs. They also want to encapsulate important core code in packages that are separated from the interfaces. What was needed was a pattern in which an interactive system is arranged around a model of the core functionality and data. The system should have separate view components that present views of the model to the user, and controller components that accept user input events and translate them into appropriate requests to the views and/or the model.
Finally, it should be complemented by a change propagation mechanism that takes care of propagating changes to the model. The solution was the MVC architectural pattern, which supports the development of a core model that is independent of the style of input or output and gives the programmer the ability to define different views, which may be needed for

different parts of the project or for the project as a whole. Each of these view components is able to retrieve data from the model and display it, and is accompanied by an associated controller component that handles events from users. The separate model component notifies all appropriate view components whenever its data changes. Figure 6.1 shows a simple representation of the MVC architecture.

Figure 6.1: The MVC Architecture

Our work is designed in a similar manner. We created the model component from the downloaded data, and the controller component is responsible for manipulating the model according to the mechanisms that we wish to support. Finally, the view component provides the graphical user interface. The next step is to assign each class of our prototype implementation to the appropriate component and to give a short description of the functionality of each class. Our system consists of 22 classes in total:

1. Thesis.java: This class is responsible for initializing the main components and includes the main method of our project.
2. initializetwitterapi.java: This class is responsible for initializing the authentication keys for the Twitter API and connecting to the author's account.
3. MyController.java: This class is responsible for the information flow between the nodes of the graph and contains the action listeners for each specific button

of the Graphical User Interface.
4. hashtag.java: This class is responsible for keeping the state of the different subjects that are contained in tweets and of the nodes that contribute to those subjects.
5. Spaces.java: This class is responsible for a partial tweet analysis and determines whether tweets contain special Unicode space characters such as line separator, space, paragraph separator, medium mathematical space, etc.
6. regularexpressions.java: This class is responsible for a partial tweet analysis with the aid of regular expressions.
7. Extractor.java: This class is responsible for parsing tweets with the help of Spaces.java and regularexpressions.java. For example, it is able to extract all the mentioned names or hashtags contained in a tweet, or the URLs that a node may have included in his/her post.
8. FindCliques.java: This class is responsible for identifying cliques in the directed graph of Twitter. We defined what a clique is in the section on social network analysis.
9. twittergraph.java: This class is responsible for identifying my Twitter graph by following a specific rationale: it looks at my followers and followees lists and iteratively extracts the corresponding lists of each single user. From these lists it identifies all the actual friends, as defined in chapter 4, that are also included in my lists.
10. MyLink.java: This class is responsible for defining the nature of the edges between the nodes of the graph.
11. MyNode.java: This class is responsible for defining the nature of the nodes of the graph.
12. relation.java: This class is responsible for building all the needed relations between users with the help of the classes twittergraph.java, MyLink.java and MyNode.java. It also keeps the state of who is following whom and of the communication that users may have with each other.

13. Model.java: This class builds the main graph model by integrating twittergraph.java, MyLink.java, MyNode.java and relation.java.
14. CalculateDegreeTrust.java: This class calculates the incoming vs. outgoing degree of each node and assigns trust values according to the result of the calculation. All results are normalized into the interval [0, 1].
15. IntrefaceWithSetupAndRun.java: This is an interface that each trust mechanism class implements according to its needs.
16. IterationsOverNodes.java: This class is responsible for traversing every node within an iterative procedure and extracting useful information from the nodes, such as how many edges they have and whether the nodes converge. It also contains the setter and getter methods that each trust mechanism may define, for example to set how many iterations should be performed.
17. RankingUtilities.java: This class contains helper methods such as compare, tostring and sort that every trust mechanism can use.
18. TraverseNodesAndRankAbstract.java: This is an abstract class whose main goal is to extract the trust value of each node at a specific time. Many of its methods can be overridden according to specific needs.
19. calculatetrustfromtweetsandhits.java: This class is responsible for downloading the tweets of each particular user and, with the help of the classes Spaces.java, regularexpressions.java and Extractor.java, for building rules and constructing weights for each node out of the results of these rules. After that, the HITS algorithm is applied and, with the help of the classes IntrefaceWithSetupAndRun.java, IterationsOverNodes.java and RankingUtilities.java, the class calculates the HITS rank for each user.
20. CalculateBetweenessCentrality.java: This class is responsible for calculating the betweenness centrality rank of each node with the help of the classes IntrefaceWithSetupAndRun.java, IterationsOverNodes.java and RankingUtilities.java.
21. MyRenderer.java: This class is responsible for changing the colors of the nodes and edges, for changing the visual shape of the graphs, and for the zoom-in and zoom-out functionality of the graphs.

22. View.java: This class is responsible for the main graphical user interface component.

At this point we group together the classes that belong to the same component. Following the MVC architecture, we divided our classes into three main packages, which also contain sub-packages. We tried to make the choice of classes for each package conform as closely as possible to the needs of the MVC architecture described previously (see table 6.1).

Finally, figures 6.2 and 6.3 present our class diagram and sequence diagram. The class diagram includes the most important classes along with some important methods and attributes, while the sequence diagram captures the dynamic aspects of the system by presenting the graph initialization mechanism. The sequence diagram includes the main methods that act as connectors for invoking the functionality of other classes. In particular, the main Thesis.java class uses the initialization() method to open a connection with the social network's platform. The initializetwitterapi.java class notifies the twittergraph.java class to crawl the data. After the data set has been collected, an answer is sent back through a finalize() method to indicate that the connection with the Twitter platform has been closed, and the initializetwitterapi.java class sets the initial parameters for the view part of our system. The Thesis.java class then invokes Model.java to build a graph out of the crawled data. Model.java needs all the relations (which nodes are connected by an edge) to be ready, so it invokes relation.java. In turn, relation.java needs all the nodes and links to be initialized and therefore invokes MyNode.java and MyLink.java and waits for all the data to arrive. After relation.java has finished building the connections, it invokes Model.java again, which continues its work through the method frgraph2() in order to build the final model.
The final two steps consist of the Controller and the View component which are cooperating for the final visualization of the constructed graph. 6.3 Summary This chapter described our design methodology for our prototype. Furthermore, we did outline and describe the system requirements alongside with our java classes specifications. Additionally, we did comment on our system s architecture and described the reasons for choosing it. Finally, we did create a class and sequence diagram for the initialization functionality of the prototype.

Table 6.1: MVC class distribution

Package                         Classes
Model                           1. FindCliques.java  2. Model.java  3. MyLink.java  4. MyNode.java  5. Relation.java  6. twittergraph.java
View                            1. MyRenderer.java  2. View.java
Controller.TrustMechanisms      1. CalculateBetweenessCentrality.java  2. CalculateDegreeTrust.java  3. InterfaceWithSetupAndRun.java  4. IterationOverNodes.java  5. RankingUtilities.java  6. TravesreNodesAndRankAbstract.java  7. calculatetrustfromtweetsandhits.java
Controller.TextUtilities        1. Extractor.java  2. Spaces.java  3. regularexpressions.java
Controller.Initialization       1. Thesis.java  2. initializetwitterapi.java
Controller.MyGraphMechanisms    1. MyController.java  2. hashtag.java

Figure 6.2: The system's class diagram

Figure 6.3: The initialization sequence diagram
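The call order captured by the initialization sequence diagram can be sketched as a simple trace. This is only an illustration and not the prototype's code: the class InitializationTrace and its step labels are hypothetical stand-ins named after the classes mentioned in the text (Thesis, initializetwitterapi, twittergraph, Model, relation, MyNode, MyLink), and each method merely records a label instead of contacting Twitter or building a JUNG graph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the initialization sequence of Figure 6.3.
// Every method only records its step; no Twitter or JUNG call is made.
final class InitializationTrace {
    private final List<String> steps = new ArrayList<>();

    // Thesis opens the connection, the crawler collects the data,
    // and the view parameters are set once the connection is closed.
    void initialization() {
        steps.add("Thesis.initialization");
        steps.add("twittergraph.crawl");
        steps.add("twittergraph.finalize");
        steps.add("initializetwitterapi.setViewParameters");
    }

    // Model asks relation for the edges; relation needs nodes and links first,
    // then hands control back to Model, which finishes through frgraph2().
    void buildModel() {
        steps.add("Model.build");
        steps.add("relation.collect");
        steps.add("MyNode.init");
        steps.add("MyLink.init");
        steps.add("Model.frgraph2");
    }

    // Controller and View cooperate for the final visualization.
    void visualize() {
        steps.add("Controller.prepare");
        steps.add("View.show");
    }

    List<String> run() {
        initialization();
        buildModel();
        visualize();
        return steps;
    }

    public static void main(String[] args) {
        new InitializationTrace().run().forEach(System.out::println);
    }
}
```

Running the sketch prints the steps in the order in which the text describes them, which makes it easy to see that the model is only built after the crawl has finished.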

Chapter 7

Implementation

The design part of the project has now been completed and we continue with the implementation. In this chapter we focus on the environment and the tools that were used for the coding part of our project, as well as on the techniques that we adopted in order to acquire all the data needed for the twitter analysis and for our system. We also comment on the general coding techniques that we used in our implementation and present screenshots of our prototype. The overall implementation needed just over 2800 lines of code, and we tried to keep it as clean and structured as possible in case a future student wishes to build on the provided prototype.

7.1 Environment

In our case the environment is Sun's Java Runtime Environment (JRE), and our prototype can run on any operating system that has a Java environment installed. There is no need to worry about the operating system, as the Java Virtual Machine takes care of the compatibility problems that may arise. Additionally, we used the NetBeans IDE to code the whole system. We had to choose between Eclipse and NetBeans. The reason we used this specific IDE is that the NetBeans platform, unlike the Eclipse platform, is 100 percent pure Java. Eclipse uses native widgets for its graphical user interface, which requires a JNI module to be built for every platform on which Eclipse runs. The upshot of this is that NetBeans works on more platforms than Eclipse. An application built on the NetBeans platform can run on Windows, Linux and Solaris without recompilation.

7.2 Tools

During the coding phase we made extensive use of two main tools. The first was the twitter API, which was open to developers, and the second was the JUNG framework for designing and manipulating graphs.

JUNG stands for the Java Universal Network/Graph Framework. It is a software library that supplies its users with a common and extensible language for the modeling, analysis and visualization of data that can be represented as a graph or network. It is implemented in Java, which allows JUNG-based applications to make use of the extensive built-in capabilities of the Java API, as well as those of other existing third-party Java libraries. The most recent version of JUNG includes support for and implementations of some social network analysis algorithms, which is crucial for my project.

The Twitter API consists of two parts: the REST APIs and the Streaming API. I will focus on the REST API, whose methods allow developers to access core Twitter data. This includes update timelines, status data and user information. The Search API methods give developers the capability to interact with Twitter Search and trends data. The concern for developers given this separation is the effect on rate limiting and output format. The Streaming API provides near real-time, high-volume access to Tweets in sampled and filtered form.

The twitter API gave me the opportunity to build crawlers to collect the data needed for the application. Choosing between the two twitter APIs was a problem that I had to face at the beginning of the project. Registering for the Streaming API would have taken approximately a month because of twitter's security policy. With the REST API, on the other hand, downloading the users needed for the twitter analysis phase, and later examining the follower and followee lists of each user, would take days because of the API's rate limit.
For example, requesting the data of users would take about 29 hours using the thread sleep technique, since the API rate limit is 350 requests per hour. In the beginning I tried to use my laptop, but the internet connection went down several times during the day and as a result I had to start the whole procedure again. I tried the DICE machines (Appleton Tower labs) as well, but this caused problems for the students who had to work on them. So I had to find another solution. My next step was to examine the twitter API rate limit thoroughly and to observe all the kinds of requests that the API invokes.

In particular, the API support section of the Twitter website states that the Twitter API only allows clients to make a limited number of calls in a given hour. This policy affects the APIs in different ways. The default rate limit for calls to the REST API varies depending on the authorization method being used and on whether the method itself requires authentication. In particular, unauthenticated calls are limited to 150 requests per hour and are measured against the public-facing IP of the server or device making the request. OAuth (authenticated) calls, which are the ones we use, are limited to 350 requests per hour and are measured against the OAuth token used in the request, which is initialized in the initializetwitterapi.java class.

Furthermore, we had to make sure that we inspected all the headers returned when requesting methods that do not require authentication. If a request included invalid OAuth information, the API would do one of two things. For methods which require authentication, the API returns an error response with more information about the error, for example an HTTP 401 error with the response body: Could not authenticate with OAuth. For methods which can be requested unauthenticated, the API processes the request as if authentication had not been used. This means that the request counts against the unauthenticated rate limit. If this has happened, the API includes the following header in its response: X-Warning: Invalid OAuth credentials detected.

The most important observation, however, is that rate limits are applied to methods that request information with the HTTP GET command. Generally, API methods that use HTTP POST to submit data to twitter are not rate limited. Actions such as publishing status updates, sending direct messages, following and unfollowing are not directly rate limited by the API, but they are subject to fair use limits and return certain values if their flags are set to true.
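The arithmetic behind the thread sleep technique can be sketched as follows. This is a minimal illustration under stated assumptions rather than the crawler that was actually used: the class RateLimitPacer and its methods are hypothetical, the Runnable stands in for a single API call, and the batch of 10,000 requests in main is an illustrative figure that does not come from the text.

```java
// Illustrative pacing arithmetic for an hourly request quota.
// RateLimitPacer is a hypothetical helper, not part of the prototype.
final class RateLimitPacer {
    private static final double MILLIS_PER_HOUR = 3_600_000.0;

    // Smallest whole-millisecond pause between requests that stays within the quota.
    static long minDelayMillis(int requestsPerHour) {
        if (requestsPerHour <= 0) {
            throw new IllegalArgumentException("quota must be positive");
        }
        return (long) Math.ceil(MILLIS_PER_HOUR / requestsPerHour);
    }

    // Hours needed to issue the given number of rate-limited requests.
    static double hoursFor(long requests, int requestsPerHour) {
        if (requestsPerHour <= 0) {
            throw new IllegalArgumentException("quota must be positive");
        }
        return (double) requests / requestsPerHour;
    }

    // Sleep-based pacing loop; 'request' stands in for one API call.
    static void runPaced(int calls, int requestsPerHour, Runnable request) {
        long delay = minDelayMillis(requestsPerHour);
        for (int i = 0; i < calls; i++) {
            request.run();
            if (i < calls - 1) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop
                    return;
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("OAuth pause (ms): " + minDelayMillis(350));
        System.out.println("Unauthenticated pause (ms): " + minDelayMillis(150));
        // 10,000 is an assumed batch size, chosen only for illustration.
        System.out.printf("Hours for 10,000 requests: %.1f%n", hoursFor(10_000, 350));
    }
}
```

With the quota of 350 requests per hour quoted above, the pause between two requests comes out at a little over ten seconds, which is why a crawl of several thousand users stretches over more than a day.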
Consequently, we had to examine the majority of the methods in the API to see which request methods they invoke, so that we could acquire our data while avoiding the rate limit. We needed to identify methods that could return a User object with all its data (followers and followees lists, statuses timeline, screen name, description, etc.). The large majority of the methods invoked an HTTP GET request, which was subject to the rate limit. In the end we discovered a method that returned a user object by invoking an HTTP POST request. This method was destroy friendship, which is responsible for destroying a connection that you have with a particular user. However, there were two problems that we had to face. The first was that I did not want to destroy friendships within my own account, as this would ruin my twitter graph. For the users that I needed to download this was not a problem, as I could avoid my own connections on twitter. The second problem was that internally the method had a flag that was not set to true (successful) if a connection between me and the target was not present. Twitter's specification for that method is shown in figure 7.1.

Figure 7.1: Twitter destroy friendship method

The problem could have been solved if I had started following all the users and then destroyed the friendship with each of them using the above method, but twitter would not allow me to do that, as following so many users in such a short amount of time is prohibited. So I continued exploring the API and discovered a similar method with the same internal functionality. The method is called destroy block; its functionality is to unblock a user if a block exists and to return that user. The API documentation for that particular method claims, as before, that for the method to succeed there must exist a block between the source and the target user. However, we discovered a security flaw here. Even if there is no block between the two users, the method returns the target user without checking whether its internal flag is set to true. As a result we managed to collect the data from users in approximately one and a half hours. In figure 7.2 we can observe that the destroy block method has specifications similar to those of the destroy friendship method.

Figure 7.2: The destroy block method

7.3 Coding Techniques and Screenshots

As a next step we comment briefly on the coding phase of our prototype. We used most of the common object-oriented techniques, such as encapsulation, inheritance and, where appropriate, the singleton design pattern, as well as techniques like overloading and overriding of methods. We also needed to keep the state of the information flow between users, and each node object had to know which nodes it follows and which nodes follow it. This was achieved through data structures like hashmaps, which take for example the node object as key and, as value, a reference to an ArrayList that holds all of its followers. We chose hashmaps because looking up users was a common operation for the majority of the classes of the system, and retrieving an object stored in a hashmap takes O(1) time on average. This choice helped keep our project efficient and scalable. Finally, we made use of Java Generics as much as possible in order to enhance the system's extensibility and portability. We also paid attention to our system's security aspects by restricting the visibility of crucial methods and variables of the prototype. For example, we declared variables and methods as private or protected where needed. Below we present some screenshots of the prototype.

Figure 7.3: The Prototype

Figure 7.3 presents the main graphical user interface of our system. The generate graph button has already been pressed and the graph is visualized in the appropriate area. On the left, the panel is separated into three categories. In the top left corner there are the buttons for the basic operations of twitter, such as direct message, retweet, or tweet with hashtag (#). Below them are the buttons that invoke the trust mechanisms: the node degree button, the betweenness centrality button and the HITS button. There is also a trust map button; we explain the trust maps in the evaluation chapter. Finally, the last box of the buttons panel contains the buttons for the graph representation, which offer three different layouts for the user to choose from, together with a button that allows picking and dragging nodes. In the middle there is a large text area where the system's user may enter his/her tweet. This area is also responsible for presenting the trust strategies for the different trust maps. Additionally, if the user of the system picks a specific node, the text area shows all of that node's trust values. Finally, below the graph area there are buttons carrying the symbols + and -, which are responsible for zooming in and out.

Figure 7.4: The Tweet Button functionality

In figure 7.4 you can observe the information flow of a certain tweet within the network. In this case the node NIVOL2000 tweeted and all of his followers received a notification. All of the notified agents are marked in blue, which shows the information flow. But how is the information flow affected if one of the notified users decides to retweet the message? For the sake of the example, let us assume that the node mennenia, which is a follower of NIVOL2000, decides to retweet what NIVOL2000 previously tweeted. Then the follower of mennenia, which is the node brunopanara, is able to view the message that started from NIVOL2000. Figure 7.5 shows this information flow. The node in the green rectangle is mennenia and the node in the green circle is brunopanara.
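The follower bookkeeping and the tweet and retweet flow described above can be sketched with a hashmap from each user to the list of that user's followers. This is a simplified illustration and not the prototype's classes: FollowerGraph and its methods are hypothetical, and plain strings replace the node objects that the prototype uses as keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a hashmap from each user to the list of that user's followers.
final class FollowerGraph {
    // Key: followed user. Value: that user's followers. Lookup by key is O(1) on average.
    private final Map<String, List<String>> followers = new HashMap<>();

    // Records that 'follower' follows 'followed'.
    void follow(String follower, String followed) {
        followers.computeIfAbsent(followed, k -> new ArrayList<>()).add(follower);
    }

    List<String> followersOf(String user) {
        return followers.getOrDefault(user, List.of());
    }

    // A tweet is delivered to every follower of its author.
    Set<String> tweet(String author) {
        return new LinkedHashSet<>(followersOf(author));
    }

    // A retweet by an already notified user extends the audience with that user's followers.
    Set<String> retweet(String retweeter, Set<String> audience) {
        if (!audience.contains(retweeter)) {
            throw new IllegalArgumentException(retweeter + " has not received the tweet");
        }
        Set<String> extended = new LinkedHashSet<>(audience);
        extended.addAll(followersOf(retweeter));
        return extended;
    }

    public static void main(String[] args) {
        FollowerGraph graph = new FollowerGraph();
        graph.follow("mennenia", "NIVOL2000");
        graph.follow("brunopanara", "mennenia");

        Set<String> notified = graph.tweet("NIVOL2000");
        System.out.println("After the tweet: " + notified);
        System.out.println("After the retweet: " + graph.retweet("mennenia", notified));
    }
}
```

In this sketch brunopanara only joins the audience after mennenia retweets, which mirrors the flow shown in the screenshots.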

Figure 7.5: The Retweet button functionality

The final figure, 7.6, presents the functionality of the hashtag; in the next chapter, which is the evaluation part of the project, we present all the other mechanisms, including the trust maps. The nodes DimitriosDRS and NikosMRF both included the subject #summer within their tweets. As a result, the followers of NikosMRF and the followers of DimitriosDRS are able to see both messages, despite the fact that they belong to different sub-networks. As we stated in a previous section, it is desirable to contribute to trending subjects, because this makes a node visible to other networks and may increase the chance of enhancing his/her trustworthiness beyond his/her personal sub-network.
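The hashtag behaviour described above can be sketched as an index from each hashtag to the authors who used it. The sketch below is an assumption-laden illustration and not the prototype's hashtag.java: the class HashtagIndex, the regular expression and the case-insensitive matching of tags are all choices made for this example, and the follower names in main are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of hashtag visibility across sub-networks.
final class HashtagIndex {
    // Assumption: a hashtag is '#' followed by word characters.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    private final Map<String, Set<String>> followers = new HashMap<>();
    private final Map<String, Set<String>> authorsByTag = new HashMap<>();

    void follow(String follower, String followed) {
        followers.computeIfAbsent(followed, k -> new LinkedHashSet<>()).add(follower);
    }

    // Extracts the hashtags of a tweet, lower-cased so that matching ignores case.
    static Set<String> hashtags(String text) {
        Set<String> tags = new LinkedHashSet<>();
        Matcher matcher = HASHTAG.matcher(text);
        while (matcher.find()) {
            tags.add(matcher.group(1).toLowerCase(Locale.ROOT));
        }
        return tags;
    }

    // Registers the author under every hashtag that the tweet contains.
    void post(String author, String text) {
        for (String tag : hashtags(text)) {
            authorsByTag.computeIfAbsent(tag, k -> new LinkedHashSet<>()).add(author);
        }
    }

    // Everyone who follows at least one author that used the tag.
    Set<String> audience(String tag) {
        Set<String> result = new LinkedHashSet<>();
        for (String author : authorsByTag.getOrDefault(tag.toLowerCase(Locale.ROOT), Set.of())) {
            result.addAll(followers.getOrDefault(author, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        HashtagIndex index = new HashtagIndex();
        index.follow("followerA", "NikosMRF");      // hypothetical follower
        index.follow("followerB", "DimitriosDRS");  // hypothetical follower
        index.post("NikosMRF", "Enjoying the #summer");
        index.post("DimitriosDRS", "Plans for the #summer");
        System.out.println(index.audience("summer"));
    }
}
```

Both hypothetical followers end up in the audience of the tag even though they follow different authors, which is the cross-network visibility that the text attributes to trending subjects.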

Figure 7.6: The Hashtag Button functionality

7.4 Summary

In this chapter we have outlined our implementation methodology and the functionality of our prototype. We also commented on the security flaw in the twitter API that allowed us to perform unlimited requests and fetch the data needed for the system.

Chapter 8

Evaluation

In this chapter we include screenshots from the prototype after executing the algorithms of node degree, betweenness centrality and HITS. These are accompanied by comments and charts for the majority of the outputs. Finally, we present the trust maps and the strategies for growing, shrinking or maintaining the same level of trust in the map.

8.1 Evaluation of the Trust Mechanisms

Figure 8.1: The picture shows the nodes with the highest node degree rank

The node degree, as described previously, is a metric that shows the proportion of incoming edges to outgoing edges, and we calculated its value for each of the nodes. At this point we remind the reader that the node degree is defined as:

node degree = (incoming edges)/(outgoing edges), and if (incoming edges) == (outgoing edges), then node degree = 0.

In figure 8.1 you may observe the two most highly rated users in terms of this metric. The corresponding nodes are highlighted in green. For the evaluation version we downloaded the whole graph of my twitter network, in contrast to the examples of the previous chapter, and zoomed in on the areas of interest. Our graph is a combination of the actual friends, as described in the twitter analysis of chapter 4, and other nodes, in order to gain a complete understanding of the network's behaviour. The nodes in green are the highest rated in terms of node degree trust. They are the nodes of ioannou nikos and techabilly, with scores of 0.66 and [value missing] respectively. (All the scores are normalized in the interval [0,1].) As expected, they are regarded as popular within their networks, but in the next steps we will show that popularity by itself is not enough to characterize an agent as trustworthy or trustful.

Figure 8.2: This picture shows the node with the highest betweenness centrality

Figure 8.2 highlights the users that received the highest betweenness centrality scores. The nodes of NIVOL2000 and NikosMRF are the highest rated. Indeed, we observe that NIVOL2000 is the most central node within his sub-network. The same holds for NikosMRF, as he connects the leftmost sub-network with the bigger one. Of the two users, NIVOL2000 possesses the higher centrality score, which makes him the most important node for the information flow. Without these nodes the network would be separated into several smaller parts, which could lead to the isolation of some nodes and reduce their trust and trustworthiness. For a better understanding of this claim, let us present the following example:

Let us suppose that the node NikosMRF, in addition to his high centrality score, also receives a lot of attention through retweets and mentions, and thus his tweets are able to propagate several hops away from him. If the node NIVOL2000 is excluded from the network, then the node NikosMRF will lose the ability to influence people that are not directly connected with him and lie on the other side of the network. This will steadily reduce his trustworthiness and overall trust value. Some of the results are presented in figure 8.3, in ascending order.

Figure 8.3: The betweenness centrality results

The final trust metric refers to the HITS algorithm and includes the tweets of all the nodes. For this purpose we had to download and parse every single tweet to identify retweets and mentions, as stated in a previous chapter. Then we calculated how frequently each user receives mentions and retweets, and built weights out of these frequencies before applying HITS. The list of tweets appears in the JTextArea of our prototype and is presented in figure 8.4.

Figure 8.4: JTextArea with the downloaded tweets

The overall results are presented in figure 8.5 and are again in ascending order. Figure 8.6 shows the most highly ranked users in terms of our HITS algorithm, which are enty g and Nihal bak, with ranks [values missing] respectively.
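A weighted variant of HITS along the lines described above can be sketched as follows. This is a textbook-style sketch under stated assumptions, not the code of calculatetrustfromtweetsandhits.java: the class WeightedHits is hypothetical, an edge u -> v is assumed to carry a weight derived from how often u mentions or retweets v, the scores are normalized with the Euclidean norm after every step, and the users a, b, c and d in main are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of HITS on a weighted, directed graph.
final class WeightedHits {

    // weights.get(u).get(v) is the weight of the edge u -> v.
    // Returns the authority score of every node after the given number of iterations.
    static Map<String, Double> authorities(Map<String, Map<String, Double>> weights, int iterations) {
        Map<String, Double> hub = new HashMap<>();
        Map<String, Double> auth = new HashMap<>();
        // Every node that appears as source or target starts with score 1.
        for (Map.Entry<String, Map<String, Double>> out : weights.entrySet()) {
            hub.put(out.getKey(), 1.0);
            auth.put(out.getKey(), 1.0);
            for (String target : out.getValue().keySet()) {
                hub.put(target, 1.0);
                auth.put(target, 1.0);
            }
        }
        for (int i = 0; i < iterations; i++) {
            // Authority update: weighted sum of the hub scores of the nodes pointing at v.
            Map<String, Double> newAuth = zeroed(auth);
            for (Map.Entry<String, Map<String, Double>> out : weights.entrySet()) {
                for (Map.Entry<String, Double> edge : out.getValue().entrySet()) {
                    newAuth.merge(edge.getKey(), edge.getValue() * hub.get(out.getKey()), Double::sum);
                }
            }
            normalize(newAuth);
            // Hub update: weighted sum of the authority scores of the nodes that u points at.
            Map<String, Double> newHub = zeroed(hub);
            for (Map.Entry<String, Map<String, Double>> out : weights.entrySet()) {
                for (Map.Entry<String, Double> edge : out.getValue().entrySet()) {
                    newHub.merge(out.getKey(), edge.getValue() * newAuth.get(edge.getKey()), Double::sum);
                }
            }
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return auth;
    }

    private static Map<String, Double> zeroed(Map<String, Double> scores) {
        Map<String, Double> zeros = new HashMap<>();
        for (String node : scores.keySet()) {
            zeros.put(node, 0.0);
        }
        return zeros;
    }

    // Divides every score by the Euclidean norm, which keeps all scores in [0,1].
    private static void normalize(Map<String, Double> scores) {
        double sumOfSquares = 0.0;
        for (double value : scores.values()) {
            sumOfSquares += value * value;
        }
        if (sumOfSquares == 0.0) {
            return;
        }
        double norm = Math.sqrt(sumOfSquares);
        scores.replaceAll((node, value) -> value / norm);
    }

    public static void main(String[] args) {
        // Made-up example: a mentions c twice and d once, b mentions c once.
        Map<String, Map<String, Double>> weights = Map.of(
                "a", Map.of("c", 2.0, "d", 1.0),
                "b", Map.of("c", 1.0));
        System.out.println(authorities(weights, 20));
    }
}
```

In the made-up example the node c ends up with a higher authority score than d, because it is pointed at by more users and with larger weights.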

Figure 8.5: The HITS results

Figure 8.6: The most highly ranked users by the HITS algorithm

Indeed, enty g received the highest score not only because of the frequency of mentions and retweets but also because she is pointed to by the most authorities, such as Nihal bak, NIVOL2000, etc. The interesting observation here was that the agent TheRealTry received the most retweets and mentions, but he is not pointed to by any authority except NIVOL2000. This is a desirable result, as it indicates that a user may communicate well with a few agents without this implying that his/her level of trustworthiness attracts highly rated peers. On the other hand, if a user receives attention from authorities (nodes with a high degree of trustworthiness), this automatically increases his/her trustworthiness.

Finally, we present some charts which help us understand the above concepts better. The first chart (figure 8.7) shows a comparison between the HITS rank and the betweenness centrality value. We can see that nodes with a zero centrality value still have a small HITS value, which is connected to the fact that these nodes may be pointed to by nodes with a higher authority score. For example, the node MassimoFelici has a betweenness value of 0, but it is pointed to by DimitriosDrs and NIVOL2000, which have a certain authority score, and as a result MassimoFelici's HITS score increased (see the red line of the chart just above the blue one). The fact that MassimoFelici's node is followed by two users that have a certain degree value could probably be explained by the fact that MassimoFelici has a big impact factor as a teaching fellow of the University of Edinburgh and as supervisor of the nodes NIVOL2000 and DimitriosDrs. Furthermore, in the tail of the chart the HITS score diverges from the betweenness centrality, which means that although a node like NIVOL2000 may have a high centrality score, it does not necessarily possess a proportionally high HITS score. In particular NIVOL2000, which represents the author's node in the network, has few mentions and retweets, but it has the fifth highest HITS score within the network, as it is associated with highly authoritative nodes like enty g and nihal bak.

The next chart (figure 8.8) shows a comparison between the node degree metric and the overall trust value obtained for each node. Here it can be seen that the node degree does not reflect the overall trust value for all nodes, but only for the chart's middle region, which represents nodes that lie almost at the dead ends of the network. These nodes do not possess high connectivity with authoritative nodes and are not regarded as very important for the information flow. As a result, their node degree reflects their overall trust value.
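The node degree metric recalled at the beginning of this chapter can be written down directly. The following sketch follows the definition as stated there, with one addition that is an assumption of this example and not part of the text: a node with no outgoing edges is scored as if it had one, so that the division is always defined. The class NodeDegreeTrust is hypothetical, and the normalization of the scores into [0,1] is not shown.

```java
// Hypothetical sketch of the node degree metric: incoming edges over outgoing edges.
final class NodeDegreeTrust {

    // Returns incoming/outgoing, or 0 when both counts are equal, as defined in the text.
    static double score(int incoming, int outgoing) {
        if (incoming < 0 || outgoing < 0) {
            throw new IllegalArgumentException("edge counts must not be negative");
        }
        if (incoming == outgoing) {
            return 0.0;
        }
        // Assumption (not in the text): avoid division by zero for nodes that follow nobody.
        int denominator = Math.max(outgoing, 1);
        return (double) incoming / denominator;
    }

    public static void main(String[] args) {
        // Illustrative counts only; they are not taken from the evaluated network.
        System.out.println(score(2, 3));
        System.out.println(score(5, 5));
        System.out.println(score(4, 0));
    }
}
```

Because the metric only looks at edge counts, two nodes with very different positions in the network can receive the same score, which is consistent with the observation above that node degree alone does not capture the overall trust value.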

Figure 8.7: The chart shows the centrality results compared to the HITS results

Figure 8.8: The chart shows the overall trust value compared with the node degree results

The final figure (figure 8.9) is a histogram that presents four regions of the overall trust value. The fourth region shows the total number of the most trusted users. The interesting observation here is that the corresponding users form a clique within the network, which indicates that the most trusted agents are strongly connected through reciprocated edges.

Figure 8.9: Histogram of the overall trust value

8.2 Trust Maps

The trust map shows the connectivity of users as a result of their trust values and overall trustworthiness. We investigate three different scenarios: users growing their trust map, users shrinking their trust map, and a case where the map remains unaltered. Figures 8.10 and 8.11 present the first case, where the trustworthiness of the nodes enty g and NIVOL2000, which we presented previously, acts as a trigger for growing their connectivity. Figure 8.10 shows the initial trust map and figure 8.11 the updated one. NIVOL2000 and enty g had the highest trust values among their peers. The question here is how an ordinary user can achieve this. The answer is to follow a specific strategy that adds value to his/her trustworthiness. In figure 8.11 we see that some extra black nodes (dummy 1, 2, 3, 4) decided to follow the users NIVOL2000 and enty g. The strategy for growing your trust map is, firstly, to contribute continuously to the network's communication; secondly, to try to connect with other big networks and act as a broker between your networks; thirdly, to attract trustworthy people to follow you, for which verified users would be an ideal solution; and finally, to try to contribute to the weekly trends of twitter by posting new statuses that contain the hashtag (#).

Figure 8.10: The initial trust map

Figure 8.11: The updated trust map of the growing case

The second case is that of the shrinking trust map. This is the case when a node has a particular trust value and begins spamming constantly, for example by posting: Look at the following site is the best... As a result the node steadily becomes isolated from its network and untrustworthy. In figure 8.12 the nodes are steadily disconnected from this specific node and turn black, which means that the user (NIVOL2000) is blocked by them and becomes a singleton node in the end. Additionally, this picture makes clear what happens when an agent with a high betweenness centrality rank, like NIVOL2000, is disconnected from the network (see what we stated before about bridges and interconnectors).

Figure 8.12: The case of a shrunken trust map

In the final case the trust map remains untouched. For this to happen we have to talk about cliques, where people are all connected with each other and no communication is performed outside this network. We are referring to a closed group of peers that do not seek outside communication and thus do not care about their trustworthiness outside their clique network.
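The notion of a clique used above, a group whose members are all connected with each other, can be checked mechanically. The sketch below is a hypothetical illustration and not the prototype's FindCliques.java: the class CliqueCheck and its methods are assumptions of this example, connections are modelled as directed "follows" edges, and the closedness test only looks at edges that leave the group.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: testing whether a group of users forms a closed, reciprocated clique.
final class CliqueCheck {

    // follows.get(u) is the set of users that u follows.
    // True when every member follows every other member, so all edges are reciprocated.
    static boolean isReciprocalClique(Map<String, Set<String>> follows, Collection<String> members) {
        for (String u : members) {
            Set<String> followedByU = follows.getOrDefault(u, Set.of());
            for (String v : members) {
                if (!u.equals(v) && !followedByU.contains(v)) {
                    return false;
                }
            }
        }
        return true;
    }

    // True when no member follows anyone outside the group.
    static boolean isClosed(Map<String, Set<String>> follows, Collection<String> members) {
        for (String u : members) {
            for (String v : follows.getOrDefault(u, Set.of())) {
                if (!members.contains(v)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up users a, b and c who all follow each other.
        Map<String, Set<String>> follows = Map.of(
                "a", Set.of("b", "c"),
                "b", Set.of("a", "c"),
                "c", Set.of("a", "b"));
        List<String> group = List.of("a", "b", "c");
        System.out.println(isReciprocalClique(follows, group));
        System.out.println(isClosed(follows, group));
    }
}
```

A group that passes both checks corresponds to the closed set of peers described above, whose trust map has no reason to grow or shrink.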

Figure 8.13: The case where the map remains the same

8.3 Summary

In this chapter we have seen how the different trust mechanisms behave within the network, and we also drew useful conclusions about how they are related to each other. Overall, this chapter supported our hypothesis that trust in online social networks is a multifactor concept and should not be examined from just a single angle.

Chapter 9

Conclusions

Trust in online social networks was one of the most interesting topics that I have ever worked on, as it involved aspects of different disciplines that I had to integrate in order to come up with a concrete result. We began with a definition of trust and the parties that it involves in an offline environment, and then we argued in favor of the existence of online trust. The next step was to investigate the facets of online trust, and we presented some examples that explored this matter. In order to understand clearly what trust is and how it is defined, we had to focus on the social aspects of trust as well. This approach helped us to understand the role of trust and to find ways to quantify and compute it in online environments. But of course this was not enough by itself.

The next step involved the online social networks. We started with a definition of an online social network and commented on its main features. Additionally, we studied the evolution and structure of such networks, because we showed that trust is closely connected with a network's main structure and mechanisms. Furthermore, in order to achieve our goals we needed to use social network analysis (SNA) theory, which is based on graph theory. The third part of chapter three thoroughly presented SNA and the graph theory needed for the purposes of this dissertation.

Consequently, we were able to present our case study of twitter. We began with an overview of twitter and its main features and complemented it with a twitter analysis based on users. Interesting observations arose from this analysis, as we managed to show that agents do not actively interact with all their connections but only with a part of them, and as a result a node's degree cannot stand alone as a prerequisite for trust. So we focused on other factors that could enhance a user's trustworthiness. We explained that we should take into account the agent's interactions across time, by including the features of retweets and mentions as trust indicators complementary to the node's degree. Furthermore, we showed that a node's position within the network is one of the most important factors for the information flow, and thus for trust. We also implemented the betweenness centrality and the HITS algorithm for ranking users and showed that the more trustworthy peers are connected with a specific node, the better the trust value that node acquires.

9.1 Lessons learned

By the end of this dissertation we were able to define trust and understand its online role. We also gained a deeper understanding of online social networks and of the ways to analyze them. Furthermore, we were able to explore the connection points between the different disciplines that contribute to the emergence of trust in online social networks such as Twitter. We also observed that a common background is not a necessity for online trust, and that only people's online identity and their online behavior, which are known as their online fingerprint, can be regarded as evidence for the evolution of the trustworthiness of an agent. Finally, the most important observation was that online trust is not only a matter of connectivity (node degree) but a complex concept that relies on several other factors, which we analyzed in our work.

9.2 Future Work

Our prototype managed to capture the aspects of trust in twitter that have been discussed throughout this work. We have identified three main extensions that could be applied to the whole concept of our prototype as future work. The first extension involves the inclusion of semantic web techniques for the analysis of tweets. Apart from building weights for each node out of the frequency of the retweets and mentions that the specific agent receives, we could extend our system to support semantic web techniques.
For example, if a specific user mentions a friend X within the post "I admit @X is a great person to work with", then an additional trust value could be added to the agent X, and the mention would be classified as a positive link. On the other hand, a mention could also be classified as a negative link and consequently affect his/her trustworthiness. The second extension involves the processing of data with the Map-Reduce framework. For example, we could use the streaming API of twitter to download millions of users and process their data, including their tweets, in a small amount of time. This could help us build the whole graph of twitter and apply our trust mechanisms to it. The last extension is the improvement of the user's experience by upgrading the graphical user interface of our system to support 3D interactive graphics.

9.3 Final remarks

It is a fact that a lot of people face difficulties in defining what trust and trustworthiness are, especially when they are referring to their online extensions. Furthermore, in the online environment it is hard to quantify trust and trustworthiness. We hope that through our work we have helped people gain a better understanding of online trust and trustworthiness and given them the means to assess their online peers.

86 Appendix A Prototype s User Manual 1. You can run the provided jar file by either double-clicking on it or execute from a command line environment the following command: java -jar trust in social networks.jar 2. Press the Generate Graph button below the Initial Graph panel to generate a new graph. The newly generated graph refers to the actual twitter graph of the user Nikolaos Volakis. 3. Test the Tweet button functionality by pressing on the tweet button and then pick a node that you wish to tweet. 4. Test the Direct message button by clicking on it and then pick a source node and a destination node. Observe the information flow. 5. Test the Retweet button. Without refreshing the graph from the previous step press the retweet button and select a node to retweet the previous direct message. 6. Test the HashTag button. Press generate graph button and then enter a text in the jtextarea that contains the character # followed by a word. For example: very nice city #Edinburgh. Then press Enter your Text followed by the HashTag button. Repeat the whole procedure twice. The first time put a different subject from the #Edinburgh one. The second time put again a tweet containing #Edinburgh and press Enter your Text and HashTag again. You will be able to witness that the information associated with the subject #Edinburgh will be visible to both group of users. 7. Test the node degree by selecting the appropriate box 76

8. Test the betweenness centrality by selecting the appropriate box.
9. Test the HITS functionality by selecting the appropriate box.
10. Test the case of a growing trust map by selecting it and then pressing the Grow button. (Do the same for shrinking, but press the Shrink button instead.)
11. Test the graph's layouts by selecting one of the choices.
12. Test the zoom-in and zoom-out functionality by pressing + or -.
13. Test the graph's drag and pick functionality by selecting transforming and picking respectively.

Figure A.1: Figure for the user's manual

Footnote 1: It may take several minutes until all tweets have been downloaded.
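To make clearer what the node degree, betweenness centrality and HITS options in the manual report, the following is a minimal Python sketch of the three measures on a tiny directed graph. It is not the prototype's implementation: the function names, the example users and the convention that an edge u -> v means "u follows v" are assumptions made for illustration only. Betweenness is computed with Brandes' algorithm for unweighted graphs, and HITS with a fixed number of power iterations.

```python
from collections import deque

# Hypothetical follow graph: an edge u -> v means "u follows v".
# Every node must appear as a key, even if it follows nobody.
EXAMPLE = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
}


def degrees(graph):
    """Return {node: (in_degree, out_degree)} for a directed graph."""
    in_deg = {n: 0 for n in graph}
    for targets in graph.values():
        for dst in targets:
            in_deg[dst] += 1
    return {n: (in_deg[n], len(graph[n])) for n in graph}


def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes, unweighted, directed)."""
    score = {n: 0.0 for n in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {n: [] for n in graph}
        sigma = {n: 0 for n in graph}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = {n: 0.0 for n in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


def hits(graph, iterations=50):
    """Return (hub, authority) scores after a fixed number of iterations."""
    hubs = {n: 1.0 for n in graph}
    auths = {n: 1.0 for n in graph}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the nodes pointing in.
        auths = {n: 0.0 for n in graph}
        for src, targets in graph.items():
            for dst in targets:
                auths[dst] += hubs[src]
        norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
        auths = {n: v / norm for n, v in auths.items()}
        # Hub score: sum of the authority scores of the nodes pointed to.
        hubs = {n: sum(auths[dst] for dst in graph[n]) for n in graph}
        norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
        hubs = {n: v / norm for n, v in hubs.items()}
    return hubs, auths


if __name__ == "__main__":
    print(degrees(EXAMPLE))
    print(betweenness(EXAMPLE))
    hub_scores, auth_scores = hits(EXAMPLE)
    print(max(hub_scores, key=hub_scores.get), max(auth_scores, key=auth_scores.get))
```

In this example carol lies on the shortest paths from alice to dave and from bob to dave, so she has the highest betweenness. She also ends up as the strongest authority, while alice, who follows two users, ends up as the strongest hub.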
