The Impact of Social Affinity on Phone Calling Patterns: Categorizing Social Ties from Call Data Records

Size: px
Start display at page:

Download "The Impact of Social Affinity on Phone Calling Patterns: Categorizing Social Ties from Call Data Records"

Transcription

1 The Impact of Social Affinity on Phone Calling Patterns: Categorizing Social Ties from Call Data Records Sara Motahari Sprint Labs Burlingame, CA Sandeep Appala Carnegie Mellon University Moffett Field, CA Ole J. Mengshoel Carnegie Mellon University Moffett Field, CA Luca Zoia Carnegie Mellon University Moffett Field, CA Phyllis Reuther Sprint Labs Burlingame, CA Jay Shah Carnegie Mellon University Moffett Field, CA ABSTRACT Social ties defined by phone calls made between people can be grouped to various affinity networks, such as family members, utility network, friends, coworkers, etc. An understanding of call behavior within each social affinity network and the ability to infer the type of a social tie from call patterns is invaluable for various industrial purposes. For example, the telecom industry can use such information for consumer retention, targeted advertising, and customized services. In this paper, we analyze the patterns of 4.3 million phone call data records produced by 360,000 subscribers from two California cities. Our findings can be summarized as follows. We reveal significant differences among different affinity networks in terms of different call attributes. For example, members within the family network generate the highest average number of calls. Despite the differences between the two cities, for a given affinity network they show similar phone call behaviors. We identify specific features that model statistically meaningful changes in call patterns and can be used for prediction and classification of affinity networks, and we also find correlations between the features associated with call behavior. For example, when subscribers call each other after a long time, their calls tend to take longer. This knowledge leads to discussions of proper machine learning classification approaches as well as promising applications in telecom and security. Categories and Subject Descriptors G.3 [Probability and Statistics]: Statistical Computing; J.4 [Social and Behavioral Sciences]: Sociology. General Terms Measurement, Experimentation. Keywords Call Data, Social Networks, Call Behavior, Social Tie Inference Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. The 6th SNA-KDD Workshop 12 (SNA-KDD 12), August 12, 2012, Beijing, China. Copyright 2012 ACM $ INTRODUCTION Today s pervasive use of mobile phones produces huge volumes of call data records that can potentially provide significant amount of valuable information about various patterns of human interaction, social relations, mobility, etc. In particular, being able to understand, distinguish and discover social affinities and their nature from people s call data records are invaluable in many practical areas, such as planning advanced marketing strategies, dynamic pricing, customized services, and recommendation services. Telecom providers can use this information for customer retention and targeted advertising. For example, wireless service providers can identify the family members that are not under the same family plan as potential customers, or identify subscribers within a network of non-subscriber friends to predict and prevent churn. The impact of social ties on costumer attraction and retention is already known to telecom providers. For example, previous studies have shown that the number of customers who churn out of a service provider s network depends on the number of their friends that have already churned [1]. The strength of a social tie is also typically proportional to degree of trust [26], and trust plays a crucial role in security. Consequently, this line of research may also have applications to security. Despite the importance of such information about social relations and call behavior, there is a major gap in previous studies when it comes to the analysis of social network based on phone calls. Previous work has mainly focused on the structure of the obtained social graphs and the communities inside them, such as topological properties [1] [12][8][13]. Some previous studies have also tried to predict the existence of links and social connections [8][3][14]. Such analyses, while essential to our understanding of social networks and call graphs in general, do not provide any detailed insights on how to characterize, categorizes, or classify the type of a social tie from calling patterns. In this paper, we analyze a set of features that abstract calling patterns between subscribers, and investigate their ability to discriminate between different affinity networks. We collect and process Call Detail Records (CDRs) of a major wireless service provider from two different cities in California: a rural city (Modesto) with a relatively small population situated in an agricultural region and an urban city (San Francisco) with a larger and more diverse population, located close to Silicon Valley and two other cities (Oakland and San Jose). We assign each subscriber into various types of social ties (see also [21]

2 [23]) such as family, toll free, utility services or other based on our access to information sources about their registration accounts. We define each social group as an affinity network. In other words, according to our terminology an affinity network is a group of subscribers in which all the subscribers have a common social tie. We then identify and calculate various features that model call behavior and patterns such as the frequency, length, timing, and symmetry of calls. The results show that the affinity networks show meaningful and statistically significant differences in terms of their identified features. For example, family members call each other more often, their calls are shorter, and they have a more mutual (outgoing versus incoming) tie compared to other affinity networks. We will see that these distinctive patterns are consistent across the two geographical areas of Modesto and San Francisco despite the differences in population characteristics between the two cities. For example, in Modesto 39.3% of households contain children under the age of 18, while this is the case for 18.4% of households in San Francisco. We highlight some of the underlying correlations between these features. For example, when subscribers call each other after a long time, they tend to make longer calls. All these finding have implications on predictability and classifiability of social ties, as well as which machine learning techniques to be fruitfully used, as we also discuss in this paper. After a brief review of previous work in Section 2, we will describe the dataset that was used in this study in Section 3. Section 4 explains our methodology, and Section 5 summarizes the empirical results. Highlights of the results and their implications are discussed in Section SOCIAL NETWORK AND CALL DATA ANALYSIS Social networks have been analyzed from many perspectives. A number of recent studies have specifically used mobile call graph data to investigate and characterize the social interactions of subscribers [1][12], the evolution of social groups and the adoption of new products and services [6]. The main focus of previous work on social networks (whether from phone calls or other sources of data) has been on the structure of the obtained social graphs, including topological properties, degree distributions, core clusters, strongly connected components, extraction of communities, and community structure identification [1][12][8][13]. In particular, graph partitioning, such as spectral clustering is a popular approach for studying community structure in graphs [16][18][19][21]. Several previous studies have tried to predict the existence of social connections in different contexts, for example, by considering mutual connections on Facebook and social networking sites, or by considering proximity patterns on university campuses and other specific environments [8][3][14]. This issue has also been explored in the form of predicting missing and future links in co-authorship networks [5], phone call graphs [5][4], and simulated social graphs. The issues of node (or vertex) partitioning [2] and, more recently, link (or edge) partitioning [21][23] have also been studied in network science. When node partitioning is performed, one assigns each node to a partition or class. For edge partitioning, each edge is assigned to a partition or class. The benefit of edge partitioning compared to node partitioning is that it can model situations in which nodes do not neatly separate into disjoint, nonoverlapping classes. As we mentioned, such studies have been essential to our understanding of social networks. However they do not give us information about call patterns with respect to the type of social relations, and do not enable us to infer the affinity networks. This is the focus of this paper. Specifically, we focus on the characteristics of the edges of a social network induced by CDRs instead of focusing on node characteristics as done in most previous work. 3. DATA SET The data set we have used in our analysis contains CDRs of mobile subscribers in the cities of Modesto and San Francisco, in the state of California, as shown in Figure 1. The geographic locations of the base stations in downtown Modesto and San Francisco were taken as reference points. From each reference point, all the regions within a radius of 20 miles were covered and the data from their base stations were collected. The CDRs were collected by the telecommunication service provider Sprint. Our data collection methodology resulted in millions of phone calls between subscribers in these two cities from 30 consecutive days in the month of October, Call Detail Records CDRs are collected from various base stations and stored in a distributed file system warehouse. Each cell phone call of a subscriber is saved as a record that contains the following information: Unique subscriber IDs and phone number of the caller; Unique subscriber IDs and phone number of the call receiver; Date and time of the call s initiation; Date and time of the call s end; Direction of the call (outgoing versus incoming); Switch ID; Cell tower ID; Sector ID. From the CDRs, we constructed, as further discussed in Section 4, a social network that represents each mobile user as a vertex and the calls between them as an edge. 3.2 Population Characteristics Modesto and San Francisco are rural and urban cities respectively. They have different population size and demographics. Below, we summarize some of their characteristics Modesto Modesto has a population of about 200,000 and the population density is 5,423.4 people per square mile (2,094.0/km²). Ninety eight percent of the population lives in a household. There are about 69,000 households, out of which 39.3% (~27,000) have children under the age of 18 living in the house; 48.1% (~33,000) are married couples living together; the rest is a householder (male or female) living alone. The average family size is The population is spread out with 26.8% under the age of 18 (~54,000), 10.4% of the population aged 18 to 24 (~20,000); 26.4% of the population aged 25 to 44 (~53,000); 24.7% aged 45 to 64 (~49,000) and 11.7% of the population is 65 years of age or older (~23,000). The median age is 34.2 years. For every Our source is the United States Census for the year 2010.

3 females there are 95.0 males. For every 100 females age 18 and over, there are 91.5 males San Francisco San Francisco has a population around 805,000 and the population density is 17,160 per square mile (6,632/km²). Ninty seven percent of the population lives in a household. There are about 345,000 households, out of which 18.4% (around 63,000) have children under the age of 18 living in them; 31.6% (around 109,000) are married couples living together; the rest is a householder (male or female) without another present. The average family size is The population is spread out with 13.4% under the age of 18 (~107,000), 9.6% of the population aged 18 to 24 (~77,000), 37.5% of the population aged 25 to 44 (~301,000), 25.9% of the population aged 45 to 64 (~301,000), and 13.6% are 65 years of age or older (~109,000). The median age is 38.5 years. For every 100 females there are males. For every 100 females age 18 and over, there are males. Figure 1. Areas of interest: San Francisco and Modesto 4. METHODOLOGY In this section, we explain our methodology of processing and analyzing the CDRs as well as identifying features. 4.1 Anonymization Aside from the security measures and control of access to subscriber information, the CDRs were anonymized as the originating and destination phone numbers and identification numbers were encrypted using hashing. Hence, no personal identification was accessible. The subscriber numbers tied to a family plan account were also anonymized using hashes. Also, all our results are presented as aggregates and calculated for the overall demographic population, and no individual subscriber is pinpointed for the study. 4.2 Identification of Affinity Networks As we mentioned, a social affinity network may be a network of friends, family members, coworkers, etc. Obviously, the information about the relation types between subscribers is not provided in the CDRs. Therefore, we had to use outside information to identify the social ties without compromising the privacy of individual and personal accounts. We therefore grouped the social ties into the following affinity types: Family: for every primary subscriber, other subscribers belonging to the primary subscriber s family plan were identified. If any two subscribers were part of a common family accounts in a family plan, they were identified as family members. Toll: the numbers starting with were categorized as Toll free numbers. Utility: 2 we considered a pair of subscribers to be part of a utility network if one of the subscribers is a business establishment in Modesto or San Francisco. We created a limited but representative list of business establishments by scraping the Web. Others: all the other numbers which may include friends, professional colleagues, or any other personal relationship that a subscriber could hold. Above, we discuss toll free numbers (i.e. nodes), but in Table 1 we present toll free edges. This reflects the fact that the node partitioning into {Family, Toll, Utility, Others}, as introduced above, induces a corresponding edge partitioning. In other words, for each of the affinity edge types, we have a rule that says how it is defined in terms of the affinity node types. For example, if one node is a toll node, then all adjacent (both incoming and outgoing edges) are toll edges. By considering all the edges of one affinity type, in our case {Family, Toll, Utility, Other}, we obtain an affinity network. We could potentially identify additional affinity networks (e.g. grouping coworkers by accessing domain and corporate accounts information). However, the use of more information from Sprint or other external sources could compromise the privacy and confidentiality of the subscribers. Therefore, we limited the identified networks to the above ones. Nevertheless, we will see below that the above grouping is enough for pinpointing and analyzing the call patterns and identifying meaningful features, which is the purpose of this study. 4.3 Sampling We extracted CDRs for a set of subscribers from San Francisco and Modesto and ranked subscribers by their total number of calls. We then removed the outliers (top 10% and bottom 10%) and uniformly sampled 10,000 subscribers and their one-hop connections from this set. We then filtered all the CDRs from the two locations for this set of subscribers. The number of subscriber pairs in each edge partition (affinity network type) from each city is shown in Table 1. The table excludes self-edges, which may be used to reflect calls to voic . Table 1. Statistics for call data record (CDR) data sets Modesto San Francisco Number of call data records in the sample 1,966,022 (~1.97 million calls) 2,333,826 (~2.33 million calls) Number of subscribers 203, ,591 including primary and onehop nodes Number of social ties (edges) 304, ,236 Number of family member 534 4,220 edges Number of utility edges 25,232 4,111 Number of toll free edges 19,528 27,618 Number of other edges 258, , Visualization and Feature Identification We used a visualization tool for getting an intuitive understanding of the characteristics of different affinity networks. The visualization software NetEx [24] shows the vertices and the edges of the social graph using multiple GUI elements and 2

4 configurations. For example, as seen in Figure 2, the thickness of the edges can be varied based on the average number of calls. Such visualizations enabled us to identify and consider various features that may distinguish between affinity networks. In particular, we observed two categories of features: 1. Graph topological features: For example, in Figure 2(a), nodes 23, 24, 28, and 32 belong to the same family plan and have several mutual contacts. Figure 2(a) also shows a random set of three utility numbers and their one-hop neighbors. The topology resembles an ego centric network where the calls occur between the utility service subscriber and all the subscriber s one-hop neighbors, but no calls occur between their one-hop neighbors. In addition, the utility numbers picked do not have any mutual contacts. 2. Call pattern features: These features relate to the number, frequency, length, and timing of the calls. For example, when the thickness of the line is selected to represent the number of calls, we observe thicker edges between many family members in Figure 2. Here is how Figure 2(a) was created. All of the nodes belonging to a family plan were selected, and the number of mutual contacts between them was calculated by looking at their individual onehop neighbors. For this visualization, a set of family nodes with high numbers of mutual contacts and total calls between them were chosen. However, we observed similar graph structures and features across other family sub-networks as well. The above discussion illustrates the identification of potential candidates for the features that model the differences between call graphs of different affinity networks. We now turn to the discussion of the definition and discussion of the features that we picked for the purpose of this paper. 4.5 Definition of the Features Before introducing the features that were used, we define how call records and graphs were defined and constructed, respectively. Definition. A call record represents a call and can be defined as a tuple, where u and v are subscribers, t is the call start time, and d is the call duration. Note that this is a subset of a CDR as introduced in Section 3.1. The order of u and v in is important: u is the subscriber initiating the call while v receives the call. A set of call records is denoted. Among calls between u and v, the outgoing calls from u are defined as Θ u (u, = { { }}, while the outgoing calls from v are defined as: Θ v (u, ={ { }}. A call record relation with all calls between u and v can now be defined as Θ(u, = Θ u (u, Θ v (u,. Below, in our features, we are generally using Θ(u, since call direction generally does not matter except in one case (the engagement feature) where we use both Θ v (u, and Θ u (u,. Let G = (V, E) be an undirected graph where V represents the set of all subscribers in our sample, and E represents the set of all undirected edges {u, v}, with Θ(u, > 1. We define a social network N that consists of G along with a set of features defined for each edge; N = (G,). A key point in this paper is that we are, similar to previous work on link partitions [21] and link communities [23], focusing on features defined on edges E rather than features defined on nodes N. Figure 2. Average total call feature: (a) Family and (b) Utility affinity sub-networks plus one-hop neighbors Features Our features are defined for a given pair of subscribers, {u, v} E, and our goal is to characterize the nature of the social tie between u and v through the distribution of these features. The total number of calls defined as Θ(u,. between subscribers u and v is The average call duration between u and v is defined as: (u,= (u,v,t,d)ө(u, d/ Θ(u,. Given a pair of subscribers u and v, we consider the calls between them, arranged in ascending order by their start times. Now, consider for any two consecutive call records (u i,v i,t i,d i ) and (u i+1,v i+1,t i+1,d i+1 ) from, and define Θ(u,. The average inter-call interval is then defined as: ( ) The following feature is based on partitioning Θ(u, into weekend calls and weekday calls such that (u, U (u, and (u, (u,

5 For the purpose of this paper, we treat Saturday and Sunday as weekend days, and other days as weekdays. The asymmetry between weekend and weekday calls, or weekday to weekend asymmetry, is defined with respect to the total number of calls: ww ( u, wd ( u, we( u,. ( u, If there is maximal asymmetry, where calls take place on weekends or weekdays only. If, this means minimal asymmetry, with calls evenly distributed. The engagement ratio feature gives a measure of the social interaction between u and v. eg u( u, v( u, ( u, 1 ( u, The purpose of this feature is to characterize the interaction patterns between subscribers u and v. If the ratio of engagement is, there is a symmetric interaction between the pair of subscribers. If, there are either only outgoing calls to v (from u) or incoming calls from v (to u ). After we calculated features for all edges, we compared the features between different types of affinity networks and for different geographic regions (Modesto and San Francisco). We also explored the underlying correlations between the features. Statistically significant results are presented in Section Software Tools and Techniques As the CDRs for subscribers typically amount to massive data sets, it can be computationally intensive to construct features from them. We used Hadoop, see which implements the MapReduce parallel computing model [25] in order to process CDRs. Specifically, CDRs were stored in a Hadoop distributed file system and processed by feature construction algorithms written using the MapReduce framework. Figure 3 illustrates how we used the MapReduce mappers and reducers for our purpose. For every CDR, a mapper function emits a (key,value) pair, in our case a pair for each pair of subscribers, and the reducer function performs the aggregation operation needed for the desired features. In other words, the key emitted by the mapper application is the subscriber pair (i.e., the source subscriber phone number and the destination subscriber phone number), the direction of the call (incoming or outgoing), and so forth. For every key (subscriber pair), the corresponding values from the CDR are emitted as a value. Once the information had been calculated using Hadoop, SQL scripts were written to mine the information accordingly in the database. The Java-based NetEx software [24] was interfaced with the database to present data for visualizations. nc Figure 3. Call data record (CDR) processing using MapReduce and Hadoop 5. RESULTS We analyzed the call behavior with respect to the above features from several perspectives: 1) the behavioral differences among different affinity networks which is reflected in meaningful variations of their associated features; 2) dependencies between various characteristics of human call behavior; and 3) differences between geographic regions. 5.1 Differences between Affinity Networks We here present the results of analyzing and comparing the above features with respect to different affinity networks in the two cities. We observed different call patterns reflected in various features as explained in the following subsections Total Number of Calls Most of the subscribers in a family network have very high number of calls between each other. The average value of total number of calls made is the highest between family members ( calls in a month) in Modesto, whereas the average value of the total calls made to utility and toll free numbers have relatively low values of 2.48 and = 2.29 respectively. Analysis of variance shows a significant difference between the family members and the others [Fisher s Score, F = 2946, p (pvalue for statistical significance) < 0.001]. The average value of total number of calls made to family members in San Francisco is high, too, ( 31.44) when compared to toll free and utility service numbers which are 2.26 and The difference of the means between family members and others is again statistically significant (F = 20931, p < 0.001). No meaningful difference was found in the number of family member calls between the two cities. Figure 4 shows the mean value of total number of calls made within different affinity networks in the Modesto and San Francisco regions.

6 other on the weekends (F = 56.7 and p < for Modesto, and F =350, p < for San Francisco). Figure 4. Average number of calls Engagement Ratio The engagement ratio defines how reciprocal the call pattern between two subscribers is. The higher the engagement ratio, the more balance we observe in the number of outgoing and incoming calls. The average engagement ratios calculated for various affinity networks are shown in Figure 5. The family network has a high engagement ratio of = 0.57 whereas the toll free and utility services have very low values of = 0.05 and = 0.14, respectively, in Modesto. The engagement ratio in San Francisco also shows a similar value for the family network, namely = 0.56, and low values of = 0.03 and = 0.07 respectively for the toll free and utility service networks. An analysis of variance confirms a statistically meaningful difference between family members and other affinity networks (F = 27.3 and p < for Modesto, and F = 480 and p < for San Francisco). The above results suggest that family subscribers have a reciprocal social interaction with each other. In contrast, mostly outgoing calls are made to utility and toll free numbers; utility and toll free numbers do not reciprocate equally. Figure 5. Average engagement ratio Weekend versus Weekday Figure 6 shows the percentage of the calls that were made on the weekends and Figure 7 shows the asymmetry between weekday and weekend calls for each affinity network. The main difference, perhaps, is observed in the asymmetry of weekday to weekend calls between family members compared to the rest (others, toll, and utility). Toll free and utility numbers are called much more during the weekdays. Family members are more likely to call each Figure 6. Average weekend percentage of the calls Figure 7. Average weekday to weekend asymmetry Call Duration Figure 8 reports empirical results for average call duration. The calls occurring between pairs of subscribers in a family network have small average call duration and seconds, respectively, both in Modesto and San Francisco. The utility services and toll free networks have higher average call durations = and seconds in Modesto, and = and = seconds in San Francisco, respectively. Running an Anova analysis, we also observed a statistically meaningful difference between the call duration of the three groups of 1) family members, 2) toll-free/utility network, and 3) others (F = 8.2 and p = for Modesto, and F = 1302 and p < for San Francisco). This suggests that the calls between family members are quicker and contain short conversations, whereas calls to business establishments and tollfree numbers last for a substantially longer time. The mean values are depicted in Figure 8.

7 Figure 8. Average call duration Inter-call Interval The calls occurring between subscribers in a family network have a smaller average inter-call interval of 3272 minutes (~2 days) in Modesto and 3586 minutes (~ 2.5 days) in San Francisco when compared to other affinity networks. The utility service networks have the highest inter-call interval of 8361 minutes (~6 days) and 8589 (~7 days) in Modesto and San Francisco respectively, i.e. they occur very sporadically. This suggests that the subscribers belonging to a family network call each other more frequently compared to how often they call their utility or toll free numbers. The significance of the results were confirmed by an Anova analysis (F = 25.5, p < for Modesto, and F = 458 and p < for San Francisco). The mean values for the inter-call duration for each affinity network are plotted in Figure 9. interval, there needs to be more frequent calls (i.e., shorter time intervals). The total number of calls and the weekday to weekend asymmetry (ρ=-0.31): The subscriber pairs that make more calls to each other also make more calls on the weekends (meaning a smaller asymmetry ). This could be because family members make more calls on the weekends. We talk more about the cause versus effect issue in Section 6, as this discussion applies to most correlations found. The average call duration and average inter-call arrival time (ρ=0.21): people who call each other less frequently make longer calls. This effect is illustrated in Figure 10 and Figure 11. The total number of calls and the engagement ratio (ρ=0.34): The subscriber pairs that make more calls have a more reciprocal call pattern. The weekday to weekend asymmetry and the engagement ratio (ρ=-0.36): The subscriber pairs with more calls on the weekends have more reciprocal call patterns. The significance of the above correlations also holds for each city separately and is consistent across the two geographical regions. For example, Figures 10 and 11 show the average inter-call interval versus the average call duration in Modesto and San Francisco respectively. We see that in both cities, the calls take longer as the time interval between them increases. Figure 9. Average inter-call interval In the next subsection, we will analyze the dependencies between the above features and in Section 6 we will talk about the meanings and implications of the results. 5.2 Correlations between Features When it comes to feature selection, feature extraction, and the selection of the appropriate machine learning and data mining techniques for classification, it is important to be aware of the correlations among the features. The features that we analyzed in the previous subsection are obviously not independent. We looked at the correlation matrix and investigated the statistically meaningful correlations. The following pairs of features were found to have significant correlations: The total number of calls and the average inter-call interval (ρ=-0.48): This correlation is relatively obvious and is due to the definition of these features. For the pairs with more calls during a given time Figure 10. Average call duration (y-axis) as a function of intercall interval (x-axis) in Modesto Figure 11. Average call duration (y-axis) as a function of intercall interval (x-axis) in San Francisco

8 6. DISCUSSION AND CONCLUSION In this study, we aimed to address the current gap in our knowledge of human phone call behavior, one manifestation of ties in social networks. In particular, we have investigated how behaviors vary across different social affinity networks, which could include family members, coworkers, friends, service providers, customers, etc. We collected and processed Call Detail Records (CDRs) provided by Sprint from the two California cities of Modesto and San Francisco. We built a social graph with the edges (social ties) between two subscribers specified by phone calls between them. We then assigned each subscriber into four groups of social ties: family, toll free, utility services or other based on our access to information sources about their registration accounts. We defined each social group as an affinity network, in which all the subscribers grouped in an affinity networks have a certain common social tie. We then identified, calculated, analyzed, and compared various features that model call behavior and patterns such as the frequency, length, timing, and symmetry of the calls. Some key results and their implications can be summarized as follows: Call patterns of family members in both cities indicate strong social ties between them, which is reflected in the total number of calls and the frequency of the calls. These two features show statistically significant changes between family members and all other affinity networks. Furthermore, subscribers belonging to a family network in Modesto and San Francisco make on average 77.31% and 75.4%, respectively, of their total calls to each other and not outside the family network; on average only 9.46% and 10.65%, respectively, of their total calls are made to the utility and toll free numbers affinity networks when combined. Family members also call each other very frequently with a small time interval between successive calls, but calls to utility numbers are sporadic as the time taken between successive calls is long. Family members show a more reciprocal and mutual call behavior. This is reflected in the strong engagement ratios of 57% and 56% in Modesto and San Francisco respectively. In contrast, most of the calls made to utility numbers are not reciprocated equally (this is measured by weak engagement ratios of 13% and 10% respectively). While previous research [16][19] identified differences between communities in terms of personal network topologies and some behavioral characteristics, we found that the nature of social ties and their associated call patterns in Modesto and San Francisco have strong similarities. They also show very similar variations in terms of their related features across different affinity networks. We found strong correlations between the features that model call patterns. Some of these correlations are rather intuitive. For example, the number of calls is correctly expected to have a negative correlation with the inter-call arrival time. However, some show interesting call behavior. For example, the subscribers who call each other more frequently make shorter calls. Currently, it is not clear whether such correlations are the cause or the effect of some of the differences across affinity networks. For example, the above correlation could be the cause or the effect of shorter calls among family members. This is a topic for future research. Statistically significant features and their correlations have implications for research on prediction and classification of social ties. They mean that given these features and the social graph, we should be able to classify each edge of the graph into a specific social affinity network and infer the type of relationship. However, the dependencies among the features indicate that the classification approach would benefit from being able to capture such dependencies. This appears to benefit, for example, machine learning using random forests and Bayesian networks over simpler approaches such as single decision trees and naïve Bayesian classifiers. As a limitation of this study, we should note that the affinity network types that we identified and used as our ground truth were inferred and not directly verified. For example, we assumed that family plans must be shared among family members. While this is common sense, we did not verify this assumption. This means that non-family members who share a family plan introduce some noise into the dataset. There are similar limitations associated with the Utility, Toll, and Others edge types. From an application point of view, there are several opportunities to build on and expand this work. Knowledge about social tie strength and type is invaluable for different research and industrial purposes, especially for telecom providers. For example, customer acquisition can be enhanced by focusing on family members of existing customers that are not under their network; Churn prevention can be enhanced by focusing on friends of churned customers; Search engines and mediating businesses that recommend services to subscribers or customers to services can refine their recommendations and categorize them based on relationship type information, etc. Furthermore, knowledge and modeling of distinct call behavior of different social affinities, even without the affinity prediction capabilities, can be useful in predicting how call patterns and consequently, the load on the network may change in future as a result of predicted acquisitions and churns. There are also applications of tie strength and type in the area of security. For example, strength of social ties is a useful indicator of trust in many real-world relationships [26], and trust plays a crucial role in security applications. As future work, we intend to a) expand the type of social affinities that we considered and add coworkers, friends, etc. by using more information sources without invading subscriber privacy; b) investigate the impact of the features that are associated with the social graph topology and also location and proximity patterns of the subscribers [3]; c) include additional cities and geographical areas from different parts of the country and the world to have more diverse datasets; and d) apply the proper classification techniques to classify the edges into affinity networks. 7. ACKNOWLEDGMENT The authors would like to thank Kevin Leduc and Jason Powers, from Sprint Applied Research & Advanced Technology Labs, for providing technical and infrastructure support. We would also like to thank Sprint-Nextel and the Carnegie Mellon University - Silicon Valley community for their support of this work. This work is supported, in part, by NSF award CCF to Ole J. Mengshoel.

9 REFERENCES [1] A. Nanavati, S. Gurumurthy, G. Das, D. Chakraborty, K. Dasgupta, S. Mukherjea, and A. Joshi. On the structural properties of massive telecom call graphs: findings and implications. In Proc. of 15th ACM CIKM, pages , [2] A. Clauset, C. Moore, M. E. J. Newman, Hierarchical structure and the prediction of missing links in networks, Nature, 453, pages , May [3] D. J. Crandall, L. Backstrom, D. Cosley, S. Suri, D. Huttenlocher, and J. Kleinberg: Inferring social ties from geographic coincidences. PNAS, 107, pages , December [4] D. Wang, D. Pedreschi, C. Song, F. Giannotti, A. L. Barabasi. Human mobility, social ties, and link prediction. In Proc. of KDD-11, pages , [5] D. Liben-Nowell and J. Kleinberg, The link prediction problem for social networks. In Proc. of the 12 th International Conference on Information and Knowledge management (CIKM-03), [6] E. Cho, S. A. Myers, J. Leskovec. Friendship and mobility: user movement in location-based social networks. In Proc. of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. San Diego, California, USA, August 21-24, [7] G. Palla, A.-L. Barabasi, and T. Vicsek. Quantifying social group evolution. Nature, 446, pages , April [8] H. Zhang and R. Dantu, Predicting social ties in mobile phone networks. In Proc. of 2010 IEEE International Conference on Intelligence and Security Informatics, pages 25-30, [9] J. Resig, S. Dawara, C. Homan, and A. Teredesai. Extracting social networks from instant messaging populations. In Proc. of ACM SIGKDD, pages 22-25, [10] J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási. Structure and tie strengths in mobile communication networks. PNAS, 104, pages , May [11] K. Dasgupta, R. Singh, B. Viswanathan, D. Chakraborty, S. Mukherjea, A. A. Nanavati, and A. Joshi. Social ties and their relevance to churn in mobile telecom networks. In Proc. of EDBT-08, pages , [12] M. Seshadri, S. Machiraju, A. Sridharan, J. Bolot, C. Faloutsos, and J. Leskovec. Mobile call graphs: Beyond power-law and lognormal distributions. In Proc. of KDD-08, pages , [13] M. E. J. Newman. Detecting community structure in networks. The European Physical Journal B - Condensed Matter and Complex Systems, 38, pages , March 2004), [14] M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi. Understanding individual human mobility patterns. Nature: 453(7196), pages , [15] N. Eagle, A. Pentland, and D. Lazer. Inferring friendship network structure by using mobile phone data. Proceeding of National Academic Science, 106(36), pages , [16] N. Eagle, Y.-A. de Montjoye, and L. M. A. Bettencourt. Community Computing: Comparisons between Rural and Urban Societies using Mobile Phone Data, In Proc. of CSE- 09, pages , [17] P. Perona and W. T. Freeman. A factorization approach to grouping. In Proc. of European Conference on Computer Vision, pages , [18] S. Sarkar and K. L. Boyer. Quantitative measures of changed based on feature organization: Eigenvalues and eigenvectors. Computer Vision and Image Understanding, 71, pages , July [19] S. Isaacman, R. Becker, R. Cáceres, S. Kobourov, J. Rowland, and A. Varshavsky. A tale of two cities. In Proc. of HotMobile-10, pages [20] T. Shi, M. Belkin, and B. Yu. Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators. In Proc. of ICML, [21] T. S. Evans and R. Lambiotte. Line Graphs, Link Partitions, and Overlapping Communities. Phys. Rev. E, , [22] U. V. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17, pages , [23] Y.-Y. Ahn, J. P. Bagrow, and S. Lehman. Link communities reveal multiscale complexity in networks. Nature, 466. pages , June [24] M. Cossalter, O. J. Mengshoel, and T. Selker. Visualizing and understanding large-scale Bayesian networks. In Proc. of the AAAI-11 Workshop on Scalable Integration of Analytics and Visualization, pages 12-21, [25] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI-04), San Francisco, CA, [26] T. Hyun-Jin Kim, A. Yamada, V. Gligor, J. I. Hong, and A. Perrig. User-Controlled Trust Establishment through Visualization of Tie Strength. Technical report CMU-CyLab , Carnegie Mellon University, February 2011.

User Modeling for Telecommunication Applications: Experiences and Practical Implications

User Modeling for Telecommunication Applications: Experiences and Practical Implications User Modeling for Telecommunication Applications: Experiences and Practical Implications Heath Hohwald, Enrique Frias-Martinez, and Nuria Oliver Data Mining and User Modeling Group Telefonica Research,

More information

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS Carlos Andre Reis Pinheiro 1 and Markus Helfert 2 1 School of Computing, Dublin City University, Dublin, Ireland

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network , pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and

More information

Detecting Anomalies in Mobile Telecommunication Networks Using a Graph Based Approach

Detecting Anomalies in Mobile Telecommunication Networks Using a Graph Based Approach Proceedings of the Twenty-Eighth International Florida Artificial Intelligence Research Society Conference Detecting Anomalies in Mobile Telecommunication Networks Using a Graph Based Approach Cameron

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

Graph Processing and Social Networks

Graph Processing and Social Networks Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph

More information

Group CRM: a New Telecom CRM Framework from Social Network Perspective

Group CRM: a New Telecom CRM Framework from Social Network Perspective Group CRM: a New Telecom CRM Framework from Social Network Perspective Bin Wu Beijing University of Posts and Telecommunications Beijing, China wubin@bupt.edu.cn Qi Ye Beijing University of Posts and Telecommunications

More information

Estimation of Human Mobility Patterns and Attributes Analyzing Anonymized Mobile Phone CDR:

Estimation of Human Mobility Patterns and Attributes Analyzing Anonymized Mobile Phone CDR: Estimation of Human Mobility Patterns and Attributes Analyzing Anonymized Mobile Phone CDR: Developing Real-time Census from Crowds of Greater Dhaka Ayumi Arai 1 and Ryosuke Shibasaki 1,2 1 Department

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

How To Find Influence Between Two Concepts In A Network

How To Find Influence Between Two Concepts In A Network 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Influence Discovery in Semantic Networks: An Initial Approach Marcello Trovati and Ovidiu Bagdasar School of Computing

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Discovery of User Groups within Mobile Data

Discovery of User Groups within Mobile Data Discovery of User Groups within Mobile Data Syed Agha Muhammad Technical University Darmstadt muhammad@ess.tu-darmstadt.de Kristof Van Laerhoven Darmstadt, Germany kristof@ess.tu-darmstadt.de ABSTRACT

More information

Social Network Discovery based on Sensitivity Analysis

Social Network Discovery based on Sensitivity Analysis Social Network Discovery based on Sensitivity Analysis Tarik Crnovrsanin, Carlos D. Correa and Kwan-Liu Ma Department of Computer Science University of California, Davis tecrnovrsanin@ucdavis.edu, {correac,ma}@cs.ucdavis.edu

More information

arxiv:1408.1519v2 [cs.si] 8 Aug 2014

arxiv:1408.1519v2 [cs.si] 8 Aug 2014 Group colocation behavior in technological social networks arxiv:48.59v2 [cs.si] 8 Aug 24 Chloë Brown, Neal Lathia, Anastasios Noulas, Cecilia Mascolo, and Vincent Blondel 2 Computer Laboratory, University

More information

Part 2: Community Detection

Part 2: Community Detection Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -

More information

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

DIGITS CENTER FOR DIGITAL INNOVATION, TECHNOLOGY, AND STRATEGY THOUGHT LEADERSHIP FOR THE DIGITAL AGE

DIGITS CENTER FOR DIGITAL INNOVATION, TECHNOLOGY, AND STRATEGY THOUGHT LEADERSHIP FOR THE DIGITAL AGE DIGITS CENTER FOR DIGITAL INNOVATION, TECHNOLOGY, AND STRATEGY THOUGHT LEADERSHIP FOR THE DIGITAL AGE INTRODUCTION RESEARCH IN PRACTICE PAPER SERIES, FALL 2011. BUSINESS INTELLIGENCE AND PREDICTIVE ANALYTICS

More information

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6 International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: editor@ijermt.org November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering

More information

APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION ANALYSIS. email paul@esru.strath.ac.uk

APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION ANALYSIS. email paul@esru.strath.ac.uk Eighth International IBPSA Conference Eindhoven, Netherlands August -4, 2003 APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION Christoph Morbitzer, Paul Strachan 2 and

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Exploring spatial decay effect in mass media and social media: a case study of China

Exploring spatial decay effect in mass media and social media: a case study of China Exploring spatial decay effect in mass media and social media: a case study of China Yihong Yuan Department of Geography, Texas State University, San Marcos, TX, USA, 78666. Tel: +1(512)-245-3208 Email:yuan@txtate.edu

More information

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS

IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS V.Sudhakar 1 and G. Draksha 2 Abstract:- Collective behavior refers to the behaviors of individuals

More information

Clustering Anonymized Mobile Call Detail Records to Find Usage Groups

Clustering Anonymized Mobile Call Detail Records to Find Usage Groups Clustering Anonymized Mobile Call Detail Records to Find Usage Groups Richard A. Becker, Ramón Cáceres, Karrie Hanson, Ji Meng Loh, Simon Urbanek, Alexander Varshavsky, and Chris Volinsky AT&T Labs - Research

More information

Link Prediction in Social Networks

Link Prediction in Social Networks CS378 Data Mining Final Project Report Dustin Ho : dsh544 Eric Shrewsberry : eas2389 Link Prediction in Social Networks 1. Introduction Social networks are becoming increasingly more prevalent in the daily

More information

Network Performance Monitoring at Small Time Scales

Network Performance Monitoring at Small Time Scales Network Performance Monitoring at Small Time Scales Konstantina Papagiannaki, Rene Cruz, Christophe Diot Sprint ATL Burlingame, CA dina@sprintlabs.com Electrical and Computer Engineering Department University

More information

How To Identify A Churner

How To Identify A Churner 2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

Social Media Mining. Network Measures

Social Media Mining. Network Measures Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns 20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns John Aogon and Patrick J. Ogao Telecommunications operators in developing countries are faced with a problem of knowing

More information

Use of mobile phone data to estimate mobility flows. Measuring urban population and inter-city mobility using big data in an integrated approach

Use of mobile phone data to estimate mobility flows. Measuring urban population and inter-city mobility using big data in an integrated approach Use of mobile phone data to estimate mobility flows. Measuring urban population and inter-city mobility using big data in an integrated approach Barbara Furletti, Lorenzo Gabrielli, Giuseppe Garofalo,

More information

Policy-based Pre-Processing in Hadoop

Policy-based Pre-Processing in Hadoop Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

On the Analysis and Visualisation of Anonymised Call Detail Records

On the Analysis and Visualisation of Anonymised Call Detail Records On the Analysis and Visualisation of Anonymised Call Detail Records Varun J, Sunil H.S, Vasisht R SJB Research Foundation Uttarahalli Road, Kengeri, Bangalore, India Email: (varun.iyengari,hs.sunilrao,vasisht.raghavendra)@gmail.com

More information

Graph Mining Techniques for Social Media Analysis

Graph Mining Techniques for Social Media Analysis Graph Mining Techniques for Social Media Analysis Mary McGlohon Christos Faloutsos 1 1-1 What is graph mining? Extracting useful knowledge (patterns, outliers, etc.) from structured data that can be represented

More information

Educational Social Network Group Profiling: An Analysis of Differentiation-Based Methods

Educational Social Network Group Profiling: An Analysis of Differentiation-Based Methods Educational Social Network Group Profiling: An Analysis of Differentiation-Based Methods João Emanoel Ambrósio Gomes 1, Ricardo Bastos Cavalcante Prudêncio 1 1 Centro de Informática Universidade Federal

More information

The Impact of YouTube Recommendation System on Video Views

The Impact of YouTube Recommendation System on Video Views The Impact of YouTube Recommendation System on Video Views Renjie Zhou, Samamon Khemmarat, Lixin Gao College of Computer Science and Technology Department of Electrical and Computer Engineering Harbin

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

ORGANIZATIONAL KNOWLEDGE MAPPING BASED ON LIBRARY INFORMATION SYSTEM

ORGANIZATIONAL KNOWLEDGE MAPPING BASED ON LIBRARY INFORMATION SYSTEM ORGANIZATIONAL KNOWLEDGE MAPPING BASED ON LIBRARY INFORMATION SYSTEM IRANDOC CASE STUDY Ammar Jalalimanesh a,*, Elaheh Homayounvala a a Information engineering department, Iranian Research Institute for

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant

More information

Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R. 5. Link Analysis Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

More information

Big Data para comprender la Dinámica Humana

Big Data para comprender la Dinámica Humana Big Data para comprender la Dinámica Humana Carlos Sarraute Grandata Labs Hablemos de Big Data 26 noviembre 2014 @ 2014 All rights reserved. Brief presentation Grandata Founded in 2012. Leverages advanced

More information

NETZCOPE - a tool to analyze and display complex R&D collaboration networks

NETZCOPE - a tool to analyze and display complex R&D collaboration networks The Task Concepts from Spectral Graph Theory EU R&D Network Analysis Netzcope Screenshots NETZCOPE - a tool to analyze and display complex R&D collaboration networks L. Streit & O. Strogan BiBoS, Univ.

More information

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA natarajan.meghanathan@jsums.edu

More information

2015 Workshops for Professors

2015 Workshops for Professors SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

More information

IBM SPSS Modeler Social Network Analysis 15 User Guide

IBM SPSS Modeler Social Network Analysis 15 User Guide IBM SPSS Modeler Social Network Analysis 15 User Guide Note: Before using this information and the product it supports, read the general information under Notices on p. 25. This edition applies to IBM

More information

CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES

CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES International Journal of Scientific and Research Publications, Volume 4, Issue 4, April 2014 1 CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES DR. M.BALASUBRAMANIAN *, M.SELVARANI

More information

Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce

Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce Erik B. Reed Carnegie Mellon University Silicon Valley Campus NASA Research Park Moffett Field, CA 94035 erikreed@cmu.edu

More information

Single Level Drill Down Interactive Visualization Technique for Descriptive Data Mining Results

Single Level Drill Down Interactive Visualization Technique for Descriptive Data Mining Results , pp.33-40 http://dx.doi.org/10.14257/ijgdc.2014.7.4.04 Single Level Drill Down Interactive Visualization Technique for Descriptive Data Mining Results Muzammil Khan, Fida Hussain and Imran Khan Department

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

Customer Analytics. Turn Big Data into Big Value

Customer Analytics. Turn Big Data into Big Value Turn Big Data into Big Value All Your Data Integrated in Just One Place BIRT Analytics lets you capture the value of Big Data that speeds right by most enterprises. It analyzes massive volumes of data

More information

Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer

Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer 1 Content What is Community Detection? Motivation Defining a community Methods to find communities Overlapping communities

More information

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Viral Marketing in Social Network Using Data Mining

Viral Marketing in Social Network Using Data Mining Viral Marketing in Social Network Using Data Mining Shalini Sharma*,Vishal Shrivastava** *M.Tech. Scholar, Arya College of Engg. & I.T, Jaipur (Raj.) **Associate Proffessor(Dept. of CSE), Arya College

More information

How To Filter Spam Image From A Picture By Color Or Color

How To Filter Spam Image From A Picture By Color Or Color Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Expansion Properties of Large Social Graphs

Expansion Properties of Large Social Graphs Expansion Properties of Large Social Graphs Fragkiskos D. Malliaros 1 and Vasileios Megalooikonomou 1,2 1 Computer Engineering and Informatics Department University of Patras, 26500 Rio, Greece 2 Data

More information

Transforming the Telecoms Business using Big Data and Analytics

Transforming the Telecoms Business using Big Data and Analytics Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

Analyzing Customer Churn in the Software as a Service (SaaS) Industry

Analyzing Customer Churn in the Software as a Service (SaaS) Industry Analyzing Customer Churn in the Software as a Service (SaaS) Industry Ben Frank, Radford University Jeff Pittges, Radford University Abstract Predicting customer churn is a classic data mining problem.

More information

Divide-n-Discover Discretization based Data Exploration Framework for Healthcare Analytics

Divide-n-Discover Discretization based Data Exploration Framework for Healthcare Analytics for Healthcare Analytics Si-Chi Chin,KiyanaZolfaghar,SenjutiBasuRoy,AnkurTeredesai,andPaulAmoroso Institute of Technology, The University of Washington -Tacoma,900CommerceStreet,Tacoma,WA980-00,U.S.A.

More information

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II. Matrix Algorithms DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

More information

Location Analytics for Financial Services. An Esri White Paper October 2013

Location Analytics for Financial Services. An Esri White Paper October 2013 Location Analytics for Financial Services An Esri White Paper October 2013 Copyright 2013 Esri All rights reserved. Printed in the United States of America. The information contained in this document is

More information

Data Mining for Fun and Profit

Data Mining for Fun and Profit Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

An Empirical Study of Application of Data Mining Techniques in Library System

An Empirical Study of Application of Data Mining Techniques in Library System An Empirical Study of Application of Data Mining Techniques in Library System Veepu Uppal Department of Computer Science and Engineering, Manav Rachna College of Engineering, Faridabad, India Gunjan Chindwani

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Mobile Call Graphs: Beyond Power-Law and Lognormal Distributions

Mobile Call Graphs: Beyond Power-Law and Lognormal Distributions Mobile Call Graphs: Beyond Power-Law and Lognormal Distributions Mukund Seshadri Sprint Burlingame, California, USA mukund.seshadri@sprint.com Jean Bolot Sprint Burlingame, California, USA bolot@sprint.com

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS Kyoungjin Park Alper Yilmaz Photogrammetric and Computer Vision Lab Ohio State University park.764@osu.edu yilmaz.15@osu.edu ABSTRACT Depending

More information

Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro

Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,

More information

Three Effective Top-Down Clustering Algorithms for Location Database Systems

Three Effective Top-Down Clustering Algorithms for Location Database Systems Three Effective Top-Down Clustering Algorithms for Location Database Systems Kwang-Jo Lee and Sung-Bong Yang Department of Computer Science, Yonsei University, Seoul, Republic of Korea {kjlee5435, yang}@cs.yonsei.ac.kr

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Fault Analysis in Software with the Data Interaction of Classes

Fault Analysis in Software with the Data Interaction of Classes , pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental

More information

Analysis of Internet Topologies: A Historical View

Analysis of Internet Topologies: A Historical View Analysis of Internet Topologies: A Historical View Mohamadreza Najiminaini, Laxmi Subedi, and Ljiljana Trajković Communication Networks Laboratory http://www.ensc.sfu.ca/cnl Simon Fraser University Vancouver,

More information

How To Perform An Ensemble Analysis

How To Perform An Ensemble Analysis Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry Hadoop Operations Management for Big Data Clusters in Telecommunication Industry N. Kamalraj Asst. Prof., Department of Computer Technology Dr. SNS Rajalakshmi College of Arts and Science Coimbatore-49

More information

Demographics of Atlanta, Georgia:

Demographics of Atlanta, Georgia: Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

IDENTIFICATION OF KEY LOCATIONS BASED ON ONLINE SOCIAL NETWORK ACTIVITY

IDENTIFICATION OF KEY LOCATIONS BASED ON ONLINE SOCIAL NETWORK ACTIVITY H. Efstathiades, D. Antoniades, G. Pallis, M. D. Dikaiakos IDENTIFICATION OF KEY LOCATIONS BASED ON ONLINE SOCIAL NETWORK ACTIVITY 1 Motivation Key Locations information is of high importance for various

More information

Measurement of Human Mobility Using Cell Phone Data: Developing Big Data for Demographic Science*

Measurement of Human Mobility Using Cell Phone Data: Developing Big Data for Demographic Science* Measurement of Human Mobility Using Cell Phone Data: Developing Big Data for Demographic Science* Nathalie E. Williams 1, Timothy Thomas 2, Matt Dunbar 3, Nathan Eagle 4 and Adrian Dobra 5 Working Paper

More information

Clustering Marketing Datasets with Data Mining Techniques

Clustering Marketing Datasets with Data Mining Techniques Clustering Marketing Datasets with Data Mining Techniques Özgür Örnek International Burch University, Sarajevo oornek@ibu.edu.ba Abdülhamit Subaşı International Burch University, Sarajevo asubasi@ibu.edu.ba

More information