# You Are What You Like! Information Leakage Through Users Interests

Save this PDF as:

Size: px
Start display at page:

## Transcription

6 (a) Sample a topic z i from M(φ). (b) Sample a word w i from M(β zi ). Note that α and B = z {β z } are corpus-level parameters, while φ is a document-level parameter (i.e., it is sampled once for each interest description). Given the parameters α and B,thejointprobabilitydistributionofaninterest topic mixture φ, asetofwordsw,andasetofk topics Z for a description is p(φ, Z, W α, B) =p(φ α) i p(z i φ)p(w i β zi ) (1) The observable variable is W (i.e., the set of words in the interest descriptions) while α, B, and φ are latent variables. Equation (1) describes a parametric empirical Bayes model, where we can estimate the parameters using Bayes inference. In this work, we used collapsed Gibbs sampling [19] to recover the posterior marginal distribution of φ for each interest description. Recall that φ is a vector, i.e., φ i is the probability that the interest description belongs to Topic i Step 3: Interest Feature Vector (IFV) Extraction The probability that a user is interested in Topic i is the probability that his interest descriptions belong to Topic i. Let V denote a user s interest feature vector, I is the set of his interest descriptions, and φ I i is the probability that interest description I belongs to Topic i.then,forall1 i k, V i =1 I I(1 φ I i ) is the probability that the user is interested in Topic i. For instance, in Figure 1, User4 has interests Lady Gaga, Michael Jackson, and Fadl Shaker. The probability that User4 belongs to Topic 1, which represents American singers, is the probability that at least one of these interests belongs to Topic 1.Thisequals1 ((1 0.8)(1 0.8)) = Step 4: Inference Neighbors Computation Observe that an IFV uniquely defines the interest of an individual in a k-dimensional feature space. Defining an appropriate distance measure in this space, we can quantify the similarity between the interests of any two users. This allows the identification of users who share similar interests, and likely have correlated profile data that can be used to infer their hidden profile data. We use a chi-squared distance metric. In particular, the correlation distance d V,W between two IFV vectors V and W is k (V i W i ) 2 d V,W = (V i + W i ) i=1 In [23], authors showed that the chi-squared distance gives better results when dealing with vectors of probabilities than others. Indeed, we conducted several tests with different other distance metrics: Euclidean, Manhattan and Kullback-Leibler, and results show that the chi-squared distance outperforms all of them. Using the above metric, we can compute the l nearest neighbors of a user u (i.e., the users who are the closest to u in the interest feature space). Anaiveapproachistocompute all M 2 /2 pairwise distances, where M is the number of all users, and then to find the l closest ones for each user. However, it becomes impractical for large values of M and k. Amoreefficientapproachusingk-d tree is taken. The main motivation behind k-d trees is that the tree can be constructed efficiently (with complexity O(M log 2 M)), then saved and used afterwards to compute the closest neighbor of any user with a worst case computation of O(k M 1 1/k ) Inference We can infer a user s hidden profile attribute x from that of its l nearest neighbors: first, we select the l nearest neighbors out of all whose attribute x is defined and public. Then, we do majority voting for the hidden value (i.e., we select the attribute value which the most users out of the l nearest neighbor have). If more than one attribute value has the maximal number of votes, we randomly choose one. For instance, suppose that we want to infer User4 s country-level location in Figure 1, and User4 has 5 nearest neighbors (who publish their locations) because all of them are interested in Topic 2 with high probability (e.g., they like Fadl Shaker ). If 3 out of these 5 are from Egypt and the others are from Lebanon then our guess for User4 s location is Egypt. Although there are multiple techniques besides majority voting to derive the hidden attribute value, we will show in Section 6.2 that, surprisingly, even this simple technique results in remarkable inference accuracy. 5. Dataset Description For the purpose of our study, we collected two profile datasets from Facebook. The first is composed of Facebook profiles that we crawled and which we accessed as everyone (see Section 5.1). The second is a set of more than 4000 private profiles that we collected from volunteers using a Facebook application (see Section 5.2). Next, we describe our methodology used to collect these datasets. We

8 Figure 2: Left: Complementary Cumulative Distribution Function of Music Interests. Right: Cumulative Distribution Function (CDF) of Country-level Locations (retrieved from the CurrentCity attribute) 5.4. Dataset Description In the following, we provide statistics that describe the datasets used in this study. First, Table 1 summarizes the statistics about the availability of attributes in the three datasets (i.e., in RawProfiles, PubProfiles and VolunteerProfiles). Attributes Raw(%) Pub(%) Volunteer(%) Gender Interests Current City Looking For Home Town Relationship Interested In Birth date Religion Table 1: The availability of attributes in our datasets. We observe that Gender is the most common attribute that users publicly reveal. However, three attributes that we want to infer are largely kept private. The age is the information that users conceal the most (89% are undisclosed in PubProfiles). Comparing the availability of the attributes in PubProfiles and VolunteerProfiles is enlightening. We can clearly note that users tend to hide their attribute values from public access even though these attributes are frequently provided (in their private profiles). For instance, the birth date is provided in more than 72% in VolunteerProfiles, whereas it is rarely available in PubProfiles (only 1.62% of users provide their full birth date). The current city is publicly revealed in almost 30% of the cases, whereas half of all volunteers provided this data in their private profile. Recall that the attributes we are interested in are either binary (Gender, Relationship) or multivalued (Age, Country-level location). Finally, note that, as it is shown in Table 1, the public availability of attributes in PubProfiles and in RawProfiles are roughly similar. Also note that the availability of interests slightly changes from RawProfiles (57%) to VolunteerProfiles (62%), yet still relatively abundant. This behavior might have at least two explanations: (1) by default, Facebook sets Interest to be a public attribute, (2) users are more willing to reveal their interests compared to other attributes. Figure 2(left)depictsthecomplementaryCDFofmusicinterests publicly revealed by users in the three datasets. Note that more than 30% of RawProfiles profiles reveal at least one music interest. Private profiles show a higher ratio which is more than 75%. Figure 2 (right) plots the cumulative distribution of the country-level locations of users in our datasets. The three curves show that a single country is over-represented, and that a large fraction of users locations is represented only by a few countries. Independently from the dataset, 40% of users come from a single country, and the top 10 countries represent more than 78% of users. The gentler slope of the curves above 10 countries indicates that other countries are more widely spread across the remaining profiles. Notably, the number of countries appearing in VolunteerProfiles shows that the distribution does not cover all countries in the world. In particular, our volunteers only come from less than 35 different countries. Nevertheless, we believe that VolunteerProfiles still fits for purpose because the overrepresentation shape of location distributions is kept, and illustrated by the Facebook statistics [4] in general (more than 50% of users come from only 9 countries). Motivated by this over-representation in our datasets, we validate our inference technique in Section 6.2 on users that come from the top 10 countries (following the Facebook statistics).

10 nique. Finally, note that previous works [27, 18] computed the inference accuracy using private data (i.e., their dataset is acrawlofacommunity,andthus,theycouldaccessall attributes that can only be seen by community members). Hence, these results are obtained with different attacker model, and the assumption that 50% of all attributes are accessible, as suggested in [27], is unrealistic in our model Experiments In order to validate our interest-based inference technique, we follow two approaches. First, for each attribute, we randomly sample users from PubProfiles such that the sampled dataset has the same OMD as the real Facebook dataset [4]. Then, we measure the inference accuracy on this sampled dataset. Second, we test our technique on the VolunteerProfiles dataset where both private and public attributes are known. Since we know the attribute values in the collected profiles, we can check if the inference is successful or not. In particular, we infer four attributes in both approaches: Gender, Relationship status, Age, and the Country of current location. We run experiments to infer an attribute a in PubProfiles as follows: 1. From all users that provide a in PubProfiles, werandomly sample a set of users (denoted by S) following the OMD of Facebook. The size of S for each attribute is tailored by (i) Facebook OMD and (ii) the number of available samples in PubProfiles. Table3showsthe size of S. Attribute Baseline Random guess IFV Inference Gender 51% 50% 69% Relationship 50% 50% 71% Country 41% 10% 60% Age 26% 16.6% 49% Table 4: Inference Accuracy of PubProfiles accuracy for each attribute is essential. Figure 3 depicts the inference accuracy in function of the number of neighbors. This figure clearly shows that each attribute has a specific number of neighbors that results in the best inference accuracy. Note that, as discussed at the beginning of this section, we rely on repeated random sampling to compute the results, and hence, the computed parameters are independent from the input data. Age inference requires two neighbors. This can be explained by the limited number of users that disclose their age which causes the IFV space to be very sparse: the more neighbors we consider the more likely it is that these neighbors are far and have different attribute values. For other attributes, the optimal number of neighbors is between 3 and 5. We tested different IFV sizes (i.e., k the number of topics). Notably, best results were achieved with k =100. In the sequel, we will use these estimated numbers of neighbors as well as k =100which yield the best inference accuracy. 2. For this sampled set, we compute the inference accuracy as it has been described in Section We repeat Steps 2 and 3 fifteen times and compute the average of all inference accuracy values (Monte Carlo experiment). For VolunteerProfiles weproceedasforpubprofiles,but since the attributes are a mix of public and private attributes, there is no need to do sampling, and we skip Step 1. Attribute Size of S Gender 1000 Relationship 400 Country 1000 Age 105 Table 3: Size of S Parameter Estimation Recall from Section that our algorithm is based on majority voting. Hence, estimating the number of neighbors that provides the best inference Figure 3: Correlation between Number of Neighbors and Inference accuracy Table 4 provides a summary of the results for PubProfiles. The information leakage can be estimated to 20% in comparison with the baseline inference. Surprisingly, the amount of the information is independent from the inferred attribute since the gain is about 20% for all of them. These results show that music interest is a good predictor of all attributes.

11 Inferred Attribute Male Female Male 53% 47% Female 14% 86% Table 5: Confusion Matrix of Gender Gender Inference Table 4 shows that the gender can be inferred with a high accuracy even if only one music interest is known in the PubProfiles. Ouralgorithmperforms18% better than the baseline. Recall that the baseline guesses male for all users (Table 2). To compare the inference accuracy for both males and females, we computed the confusion matrix in Table 5. Surprisingly, memale inference is highly accurate (86%) with a low false negative rate (14%). However, it is not the case for male inference. This behavior can be explained by the number of female profiles in our dataset. In fact, females represent 61.41% of all collected profiles (with publicly revealed gender attribute) and they were subscribed to music interests. However, males share only music interests which represents 35% less than woman. Hence, our technique is more capable of predicting females since the amount of their disclosed (music) interest information is larger compared to males. This also confirms that the amount of disclosed interest information is correlated with inference accuracy. Inferred Attribute Single Married Single 78% 22% Married 36% 64% Table 6: Confusion Matrix of Relationship Relationship Inference Inferring the relationship status (married/single) is challenging since less than 17% of crawled users disclose this attribute showing that it is highly sensitive. Recall that, as there is no publicly available statistics about the distribution of this attribute, we do random guessing as the baseline (having an accuracy of 50%). Our algorithm performs well with 71% of good inference for all users in PubProfiles. Aspreviously,weinvestigatehow music interests are a good predictor for both single and married users by computing the confusion matrix (Table 6). We notice that single users are more distinguishable, based on their IFV, than married ones. The explanation is that single users share more interests than married ones. In particular, asingleuserhasanaverageof9musicinterestswhereasa married user has only Likewise in case of gender, this confirms that the amount of disclosed interest information is correlated with inference accuracy. Country of Location Inference As described in Section 5, we are interested in inferring the users location in the top 10 countries in Facebook. Our approach can easily be extended to all countries, however, as shown by Figure 2, more than 80% of users in PubProfiles belong to 10 countries and these countries represent more than 55% of all Facebook users according to [4]. As the number of users belonging to the top 10 countries is very limited in VolunteerProfiles, wedonotevaluateourschemeonthat dataset. Table 4 shows that our algorithm has an accuracy of 60% on PubProfiles with 19% increase compared to the baseline (recall that, following Table 2, the baseline gives U.S. as a guess for all users). Figure 4 draws the confusion matrix 6 and gives more insight about the inference accuracy. In fact, countries with a specific (regional) music have better accuracy than others. Particularly, U.S. has more than 94% of correct inference, Philippine 80%, India 62%, Indonesia 58% and Greece 42%. This highlights the essence of our algorithm where semantically correlated music interests are grouped together and hence allow us to extract users interested in the same topics (e.g., Philippine music). Without asemanticknowledgethatspecifiestheoriginofasinger or band this is not possible. As for Gender and relationship, the number of collected profiles can also explain the incapacity of the system to correctly infer certain countries such as Italy, Mexico or France. In particular, as shown in Table 7, the number of users belonging to these countries is very small. Hence, their interests may be insufficient to compute a representative IFV whichyieldspooraccuracy. Inferred Att % 30% 11.6% 0% % 67% 3.4% 1.3% % 46.15% 38.4% 0% 35+ 0% 100% 0% 0% Table 8: Confusion Matrix of Age Inference Age Inference Finally, we are interested in inferring the age of users. To do that, we created five age categories 7 that are depicted in Table 8. Recall that the baseline technique always predicts the category of 26 and 34 years for all users. Table 4 shows that our algorithm performs 23% better than the baseline attack. Note that our technique gives good results despite that only 3% of all users provide their age (3133 users in total) in PubProfiles. Weinvestigatehow music interests are correlated with the age bin by computing 6 We removed Brazil since all its entries (2) were wrongly inferred. This is caused by the small number of Brazilians in our dataset. 7 We created six categories but since in PubProfiles we have only few users in the last 3 bins, we merge them together. For VolunteerProfiles we have six bins.

12 Country %ofusers US 71.9% PH 7.80% IN 6.21% ID 5.08% GB 3.62% GR 2.32% FR 2.12% MX 0.41% IT 0.40% BR 0.01% Figure 4: Confusion Matrix of Country Inference Table 7: Top 10 countries distribution in PubProfiles the confusion matrix in Table 8. We find that, as expected, most errors come from falsely putting users into their neighboring bins. For instance, our method puts 30% of years old users into the bin of years. However, note that fewer bins (such as teenager, adult and senior) would yield better accuracy, and it should be sufficient to many applications (e.g., for targeted ads). Observe that we have an error of 100% for the last bin. This is due to the small number of users (3 in PubProfiles) whobelongtothisbin (we cannot extract useful information and build a representative IFV for such a small number of users) VolunteerProfiles Inference Attribute Baseline Random guess IFV Inference Gender 51% 50% 72.5% Relationship 50% 50% 70.5% Age 26% 16.6% 42% Table 9: Inference Accuracy for VolunteerProfiles As a second step to validate our IFV technique, we perform inference on VolunteerProfiles. Table 9 shows that our algorithm also performs well on this dataset. Notice that Age inference is slightly worse than in PubProfiles. Recall from Section 6.2 that we had only a few users in the last three bins in PubProfiles, and hence, we merged these bins. However, in VolunteerProfiles, we have enough users and we can have 6 different age categories. This explains the small difference in inference accuracy between VolunteerProfiles and PubProfiles. Regarding other attributes, the accuracy is slightly worse for Relationship (-0.5%) and better for Gender (+3.5%). This small variation in inference accuracy between PubProfiles and VolunteerProfiles demonstrates that our technique has also good results with users having private attributes: in PubProfiles,wecouldcomputetheinference accuracy only on users who published their attribute values, while in VolunteerProfiles, wecouldalsotestourmethodonusers hiding their attributes. 7. Discussion Topic modeling We used LDA for semantic extraction. Another alternative is to use Latent Semantic Analysis (LSA) [17]. As opposed to LDA, LSA is not a generative model. It consists in extracting a spatial representation for words from a multi-document corpus by applying singular value decomposition. However, as pointed out in [12], spatial representations are inadequate for capturing the structure of semantic association; LSA assumes symmetric similarity between words which is not the case for a vast majority of associations. One classical example given in [12] involves China and North Korea: Griffiths et al. noticed that, generally speaking, people have always the intuition that North Korea is more similar to China than China to North Korea. This problem is resolved in LDA where P (occurence of word1 occurence of word2 ) P (occurence of word2 occurence of word1 ). In addition, [12] showed that LDA outperforms LSA in terms of drawing semantic correlations between words. Collaborative Filtering Our algorithm is based on discovering latent correlations between user interests in order to cluster users. An alternative approach could be to employ model-based collaborative filtering (MBCF) that avoids using semantic-knowledge. In MBCF, each user is represented by his interest vector. The size of this vector equals the number of all defined interest names, and its coordinates are defined as follows: a coordinate is 1 if the user has the corresponding interest name, otherwise it is 0. Since interest names are user-generated, the universe of all such names, and hence the vector size can be huge. This negatively impacts the performance.

14 References [1] Qt port of webkit: an open source web browser engine. [2] Robots Exclusion Protocol. RobotsExclusionProtocol, [3] OpenCyc. [4] Facebook Statistics. com/facebook/facebook-stats/,2011. [5] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th international conference on World Wide Web, WWW 07, pages , New York, NY, USA, ACM. [6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: , [7] W.-Y. Chen, J.-C. Chu, J. Luan, H. Bai, Y. Wang, and E. Y. Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In Proceedings of the 18th international conference on World wide web,www 09,pages , New York, NY, USA, ACM. [8] P. Cremonesi and R. Turrin. Analysis of cold-start recommendations in IPTV systems. In RecSys 09: Proceedings of the third ACM conference on Recommender systems, pages , New York, NY, USA, ACM. [9] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMA- TION SCIENCE,41(6): ,1990. [10] C. Fellbaum, editor. WordNet An Electronic Lexical Database. The MIT Press, Cambridge, MA ; London, May [11] L. Getoor and C. P. Diehl. Link mining: a survey. SIGKDD Explor. Newsl.,7,December2005. [12] T. L. Griffiths, J. B. Tenenbaum, and M. Steyvers. Topics in semantic representation. Psychological Review,114:2007, [13] J. He, W. W. Chu, and Z. (victor Liu. Inferring privacy information from social networks. In IEEE International Conference on Intelligence and Security Informatics, [14] T. Hofmann. Probabilistic Latent Semantic Analysis. In Proceedings of Uncertainty in Artificial Intelligence, UAI,1999. [15] M. Kurant, M. Gjoka, C. T. Butts, and A. Markopoulou. Walking on a Graph with a Magnifying Glass. In Proceedings of ACM SIGMETRICS 11, SanJose,CA,June2011. [16] S. Lai, L. Xiang, R. Diao, Y. Liu, H. Gu, L. Xu, H. Li, D. Wang, K. Liu, J. Zhao, and C. Pan. Hybrid recommendation models for binary user preference prediction problem. In KDD Cup, [17] T. K. Landauer and S. T. Dumais. Solution to plato s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review,1997. [18] J. Lindamood and M. Kantarcioglu. Inferring Private Information Using Social Network Data. Technical report, University of Texas at Dallas, [19] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning,2011.Softwareavailableathttp://code.google. com/p/plda. [20] D. B. Michael. Learning collaborative information filters, [21] D. Milne. An open-source toolkit for mining wikipedia. In Proc. New Zealand Computer Science Research Student Conf, [22] A. Mislove, B. Viswanath, K. P. Gummadi, and P. Druschel. You are who you know: Inferring user profiles in online social networks. [23] J. Puzicha, T. Hofmann, and J. Buhmann. Non-parametric similarity measures for unsupervised texture segmentation and image retrieval. In Computer Vision and Pattern Recognition, Proceedings., 1997 IEEE Computer Society Conference on, pages ,jun1997. [24] A. I. Schein, A. Popescul, L. H., R. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,pages ACMPress,2002. [25] A. L. Traud, P. J. Mucha, and M. A. Porter. Social Structure of Facebook Networks [26] G. Wondracek, T. Holz, E. Kirda, and C. Kruegel. A practical attack to de-anonymize social network users. In 31st IEEE Symposium on Security and Privacy, Oakland, California, USA,2010. [27] E. Zheleva and L. Getoor. To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. In In WWW, [28] E. Zheleva, J. Guiver, E. M. Rodrigues, and N. Milic- Frayling. Statistical models of music-listening sessions in social media. In In WWW, 2010.

### Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### On the Effectiveness of Obfuscation Techniques in Online Social Networks

On the Effectiveness of Obfuscation Techniques in Online Social Networks Terence Chen 1,2, Roksana Boreli 1,2, Mohamed-Ali Kaafar 1,3, and Arik Friedman 1,2 1 NICTA, Australia 2 UNSW, Australia 3 INRIA,

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

### CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,

### Data Mining is the process of knowledge discovery involving finding

using analytic services data mining framework for classification predicting the enrollment of students at a university a case study Data Mining is the process of knowledge discovery involving finding hidden

### A Practical Attack to De Anonymize Social Network Users

A Practical Attack to De Anonymize Social Network Users Gilbert Wondracek () Thorsten Holz () Engin Kirda (Institute Eurecom) Christopher Kruegel (UC Santa Barbara) http://iseclab.org 1 Attack Overview

### Private Record Linkage with Bloom Filters

To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### The Scientific Data Mining Process

Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

### BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

### Exploring Big Data in Social Networks

Exploring Big Data in Social Networks virgilio@dcc.ufmg.br (meira@dcc.ufmg.br) INWEB National Science and Technology Institute for Web Federal University of Minas Gerais - UFMG May 2013 Some thoughts about

### Music Lyrics Spider. Wang Litan 1 and Kan Min Yen 2 School of Computing, National University of Singapore 3 Science Drive 2, Singapore ABSTRACT

NUROP CONGRESS PAPER Music Lyrics Spider Wang Litan 1 and Kan Min Yen 2 School of Computing, National University of Singapore 3 Science Drive 2, Singapore 117543 ABSTRACT The World Wide Web s dynamic and

### A High Scale Deep Learning Pipeline for Identifying Similarities in Online Communities Robert Elwell, Tristan Chong, John Kuner, Wikia, Inc.

A High Scale Deep Learning Pipeline for Identifying Similarities in Online Communities Robert Elwell, Tristan Chong, John Kuner, Wikia, Inc. Abstract: We outline a multi step text processing pipeline employed

### 131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

### So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

### Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

### !"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"

!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"!"#"\$%&#'()*+',\$\$-.&#',/"-0%.12'32./4'5,5'6/%&)\$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

### Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

### Introduction. A. Bellaachia Page: 1

Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

### Better decision making under uncertain conditions using Monte Carlo Simulation

IBM Software Business Analytics IBM SPSS Statistics Better decision making under uncertain conditions using Monte Carlo Simulation Monte Carlo simulation and risk analysis techniques in IBM SPSS Statistics

### Deposit Identification Utility and Visualization Tool

Deposit Identification Utility and Visualization Tool Colorado School of Mines Field Session Summer 2014 David Alexander Jeremy Kerr Luke McPherson Introduction Newmont Mining Corporation was founded in

### Statistical Models in Data Mining

Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

### Citation Context Sentiment Analysis for Structured Summarization of Research Papers

Citation Context Sentiment Analysis for Structured Summarization of Research Papers Niket Tandon 1,3 and Ashish Jain 2,3 ntandon@mpi-inf.mpg.de, ashish.iiith@gmail.com 1 Max Planck Institute for Informatics,

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### WHAT DEVELOPERS ARE TALKING ABOUT?

WHAT DEVELOPERS ARE TALKING ABOUT? AN ANALYSIS OF STACK OVERFLOW DATA 1. Abstract We implemented a methodology to analyze the textual content of Stack Overflow discussions. We used latent Dirichlet allocation

### Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

### WEB PAGE CATEGORISATION BASED ON NEURONS

WEB PAGE CATEGORISATION BASED ON NEURONS Shikha Batra Abstract: Contemporary web is comprised of trillions of pages and everyday tremendous amount of requests are made to put more web pages on the WWW.

### The primary goal of this thesis was to understand how the spatial dependence of

5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

### Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

### Music Classification by Composer

Music Classification by Composer Janice Lan janlan@stanford.edu CS 229, Andrew Ng December 14, 2012 Armon Saied armons@stanford.edu Abstract Music classification by a computer has been an interesting subject

### Mark Elliot October 2014

Final Report on the Disclosure Risk Associated with the Synthetic Data Produced by the SYLLS Team Mark Elliot October 2014 1. Background I have been asked by the SYLLS project, University of Edinburgh

### Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

### CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

### Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

CS346: Advanced Databases Alexandra I. Cristea A.I.Cristea@warwick.ac.uk Data Security and Privacy Outline Chapter: Database Security in Elmasri and Navathe (chapter 24, 6 th Edition) Brief overview of

### Numerical Field Extraction in Handwritten Incoming Mail Documents

Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 Mont-Saint-Aignan, France Laurent.Heutte@univ-rouen.fr

### Author Gender Identification of English Novels

Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

### Music Mood Classification

Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may

### Dynamics of Genre and Domain Intents

Dynamics of Genre and Domain Intents Shanu Sushmita, Benjamin Piwowarski, and Mounia Lalmas University of Glasgow {shanu,bpiwowar,mounia}@dcs.gla.ac.uk Abstract. As the type of content available on the

### Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

### Data Mining Applications in Higher Education

Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

### Chapter 20: Data Analysis

Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

### Evaluation of a New Method for Measuring the Internet Degree Distribution: Simulation Results

Evaluation of a New Method for Measuring the Internet Distribution: Simulation Results Christophe Crespelle and Fabien Tarissan LIP6 CNRS and Université Pierre et Marie Curie Paris 6 4 avenue du président

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

### Neural Networks for Sentiment Detection in Financial Text

Neural Networks for Sentiment Detection in Financial Text Caslav Bozic* and Detlef Seese* With a rise of algorithmic trading volume in recent years, the need for automatic analysis of financial news emerged.

### Web Content Mining. Dr. Ahmed Rafea

Web Content Mining Dr. Ahmed Rafea Outline Introduction The Web: Opportunities & Challenges Techniques Applications Introduction The Web is perhaps the single largest data source in the world. Web mining

### A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1

A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.

### IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

### White Paper. Data Mining for Business

White Paper Data Mining for Business January 2010 Contents 1. INTRODUCTION... 3 2. WHY IS DATA MINING IMPORTANT?... 3 FUNDAMENTALS... 3 Example 1...3 Example 2...3 3. OPERATIONAL CONSIDERATIONS... 4 ORGANISATIONAL

### Predicting Airbnb user destination using user demographic and session information

1 Predicting Airbnb user destination using user demographic and session information Srinivas Avireddy, Sathya Narayanan Ramamirtham, Sridhar Srinivasa Subramanian Abstract In this report, we develop a

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

### WHITE PAPER. Social media analytics in the insurance industry

WHITE PAPER Social media analytics in the insurance industry Introduction Insurance is a high involvement product, as it is an expense. Consumers obtain information about insurance from advertisements,

### Understanding and Specifying Social Access Control Lists

Understanding and Specifying Social Access Control Lists Mainack Mondal MPI-SWS mainack@mpi-sws.org Krishna P. Gummadi MPI-SWS gummadi@mpi-sws.org Yabing Liu Northeastern University ybliu@ccs.neu.edu Alan

### IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

### MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

### Sentiment analysis using emoticons

Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was

### Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:

### Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

### Projektgruppe. Categorization of text documents via classification

Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

### lop Building Machine Learning Systems with Python en source

Building Machine Learning Systems with Python Master the art of machine learning with Python and build effective machine learning systems with this intensive handson guide Willi Richert Luis Pedro Coelho

### MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS

### Classification with Decision Trees

Classification with Decision Trees Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong 1 / 24 Y Tao Classification with Decision Trees In this lecture, we will discuss

### Mining the Dynamics of Music Preferences from a Social Networking Site

Mining the Dynamics of Music Preferences from a Social Networking Site Nico Schlitter Faculty of Computer Science Otto-v.-Guericke-University Magdeburg Nico.Schlitter@ovgu.de Tanja Falkowski Faculty of

### Efficient visual search of local features. Cordelia Schmid

Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images

### CHAPTER 1 INTRODUCTION

21 CHAPTER 1 INTRODUCTION 1.1 PREAMBLE Wireless ad-hoc network is an autonomous system of wireless nodes connected by wireless links. Wireless ad-hoc network provides a communication over the shared wireless

### Information, Entropy, and Coding

Chapter 8 Information, Entropy, and Coding 8. The Need for Data Compression To motivate the material in this chapter, we first consider various data sources and some estimates for the amount of data associated

### Web Content Summarization Using Social Bookmarking Service

ISSN 1346-5597 NII Technical Report Web Content Summarization Using Social Bookmarking Service Jaehui Park, Tomohiro Fukuhara, Ikki Ohmukai, and Hideaki Takeda NII-2008-006E Apr. 2008 Jaehui Park School

### The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon ABSTRACT Effective business development strategies often begin with market segmentation,

### Spam Filtering with Naive Bayesian Classification

Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09-April-2011

### Spam Filtering based on Naive Bayes Classification. Tianhao Sun

Spam Filtering based on Naive Bayes Classification Tianhao Sun May 1, 2009 Abstract This project discusses about the popular statistical spam filtering process: naive Bayes classification. A fairly famous

### Invited Applications Paper

Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

### Predict the Popularity of YouTube Videos Using Early View Data

000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

### A Lightweight Solution to the Educational Data Mining Challenge

A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

### Organizing Your Approach to a Data Analysis

Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### Marketing Mix Modelling and Big Data P. M Cain

1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

### Introduction to Statistical Machine Learning

CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,

### Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

### II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

### Design and Analysis of Ecological Data Conceptual Foundations: Ec o lo g ic al Data

Design and Analysis of Ecological Data Conceptual Foundations: Ec o lo g ic al Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

### Detection of local affinity patterns in big data

Detection of local affinity patterns in big data Andrea Marinoni, Paolo Gamba Department of Electronics, University of Pavia, Italy Abstract Mining information in Big Data requires to design a new class

### A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS Charanma.P 1, P. Ganesh Kumar 2, 1 PG Scholar, 2 Assistant Professor,Department of Information Technology, Anna University

### Web Search Engines: Solutions

Web Search Engines: Solutions Problem 1: A. How can the owner of a web site design a spider trap? Answer: He can set up his web server so that, whenever a client requests a URL in a particular directory

### Predicting Flight Delays

Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

### Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

### Component visualization methods for large legacy software in C/C++

Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

### PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software