The Splog Detection Task and A Solution Based on Temporal and Link Properties


Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng
NEC Laboratories America, N. Wolfe Road, Suite SW3-350, Cupertino, CA

ABSTRACT
Spam blogs (splogs) have become a major problem in the increasingly popular blogosphere. Splogs are detrimental in that they corrupt the quality of information retrieved and waste tremendous network and storage resources. We study several research issues in splog detection. First, in comparison to web spam and email spam, we identify some unique characteristics of splogs. Second, in addition to tasks based on the traditional IR evaluation framework, we propose a new online task that captures these unique characteristics: a time-sensitive detection evaluation that indicates how quickly a detector can identify splogs. Third, we propose a splog detection algorithm that combines traditional content features with temporal and link regularity features that are unique to blogs. Finally, we develop an annotation tool to generate ground truth on a sampled subset of the TREC-Blog dataset. We conducted experiments on both the offline (traditional) splog detection task and our proposed online task. Experiments based on the annotated ground truth set show excellent results on both tasks.

1. INTRODUCTION
The blogosphere is growing extremely fast and provides new business opportunities in areas such as advertisement, opinion extraction, and marketing. However, spam blogs (splogs) have become a major problem in the blogosphere: they reduce the quality of information retrieval results and waste network and storage resources [5,8]. Detecting splogs in the blogosphere is therefore of great importance. In this paper, we propose a solution to detect splogs in the blogosphere. The main contributions of our work are as follows:

1. Modeling the splog problem: Unlike web or email spam, a splog is dynamic, since it continuously generates fresh content to drive traffic. To solve the splog problem, we need to take advantage of these unique splog properties.
2. Evaluation: Splogs are characterized by temporal content dynamics and hence need to be identified as quickly as possible, before they waste network and storage resources. We propose a time-sensitive evaluation framework that measures splog detection performance based on how fast the detection is made.
3. Regularity-based detection: Our detection algorithm uses features such as temporal and link properties that are useful for detecting splogs. We identify temporal content regularity (self-similarity of content) and temporal structural regularity (regular post times), as well as regularity in the linking structure (frequent links to non-authoritative websites).

We have evaluated our approach using the traditional offline task as well as our proposed online metrics. The results are excellent, indicating that the combined feature set works well in both offline and online splog detection tasks. The online task reveals the sensitivity of detection to the amount of evidence (posts), as well as the complementary roles played by content and regularity features.

Most previous work related to splog detection comes from web spam detection. Prior work on detecting web spam can be categorized into content analysis [6,7] and link analysis [2,3]. Our work combines traditional features with temporal and link features that are unique to blogs.

The rest of this paper is organized as follows. In the next section we provide a high-level definition of splogs; we also discuss splog characteristics and how splogs differ from spam web sites. In Section 3, we define the online task and also provide a baseline offline splog detection task. In Section 4, we present our splog detection framework and discuss our proposed regularity (temporal and link) based features. In Section 5, we discuss data pre-processing and our annotation tool for labeling data. In Section 6, we present our experimental results, and we present our conclusions in Section 7.

2. WHAT ARE SPLOGS?
In this section, we provide a high-level definition of splogs and the splog problem we face today (Section 2.1), typical splog characteristics (Section 2.2), and the differences between splogs and other types of spam (Section 2.3).

2.1 Working definition of splogs
Spam blogs, or splogs, are undesirable weblogs that their creators use solely for promoting affiliated sites [1]. As blogs have become increasingly mainstream, the presence of splogs has had a detrimental effect on the blogosphere. According to multiple reports, the statistics are alarming:
- For the week of Oct. 24, 2005, 2.7 million blogs out of 20.3 million were splogs [8].
- An average of 44 of the top 100 blog search results in three popular blog search engines came from splogs [8].
- 75% of new pings came from splogs; more than 50% of claimed blogs pinging weblogs.com are splogs [5].
These statistics point to serious problems caused by splogs, including (1) degradation of information retrieval quality and (2) tremendous waste of network and storage resources.

Figure 1: Splogs use different schemes to achieve spamming. B represents a blog, S represents a splog, and W refers to an affiliate site. There is usually a profitable mechanism (ads/PPC) in the splog or affiliated site(s).

Figure 1 illustrates the overall scheme used by splog creators. Their motive is to drive visitors to affiliated sites (including the splog itself) that have some profitable mechanism. By profitable mechanism, we refer to web-based business methods such as search engine advertising programs (e.g. Google AdSense) or pay-per-click (PPC) affiliate programs. Spammers use several schemes to increase the visibility of splogs by getting them indexed with high ranks on popular search engines. To deceive a search engine, the spammer may boost (1) relevancy (e.g. via keyword stuffing), (2) popularity (e.g. via link farms), or (3) recency (e.g. via frequent posts), depending on the ranking criteria used by the search engine. The increased visibility is unjustified, since the content in splogs is often nonsense or stolen from other sites [1]. Spammers also attack regular blogs through comments and trackbacks to boost splog rankings.

2.2 Typical splog characteristics
In a typical splog, content is usually machine-generated in order to attract visitors through its appearance in either search engines or individual blogs. By splog, we refer to a blog created by an author who intends to spam. Note that a blog that merely contains spam in the form of comment spam or trackback spam is not considered a splog. Typical characteristics observed in splogs are:
1. Machine-generated content: splog entries are generated automatically and are usually nonsense, gibberish, repetitive, or copied from other blogs or websites.
2. No value-addition: splogs provide useless or no unique information to their readers. Some blogs use automatic content-aggregation techniques to provide useful services such as podcasting; these are legitimate blogs because of their value addition.
3. Hidden agenda, usually an economic goal: splogs have commercial intent, which can be revealed by the presence of affiliate ads or out-going links to affiliate sites.

Some of these characteristics, such as no value-addition or a hidden agenda, can also be found in other types of spam (e.g. web spam). However, splogs have unique properties, which we highlight in the next section.

2.3 Uniqueness of splogs
Splogs differ from web spam in the following aspects:
1. Dynamic content: blog readers are mostly interested in recent entries. Unlike web spam, where the content is static, a splog continuously generates fresh content to drive traffic.
2. Non-endorsement links: a hyperlink is often interpreted as an endorsement of another page, and it is unlikely that a web spam page gets endorsements from normal sites. However, since spammers can create hyperlinks (comment links or trackbacks) in normal blogs, links in blogs cannot simply be treated as endorsements.
Because of these two significant differences, the splog problem differs from the traditional web spam problem, as discussed next.

3. TASK DEFINITION
In this section we propose our evaluation methodology for comparing splog detection techniques on the TREC-Blog dataset. We first describe the detection task framework in Section 3.1. Next, two detection tasks used in traditional information retrieval are given in Section 3.2. In Section 3.3 we propose an online detection task with a novel assessment method.

3.1 Framework for detection task
The objective of a splog detector is to remove unwanted blogs. Blog search engines need splog detectors to improve the quality of their search results. Blog search engines differ from general web search engines in that their content, namely feeds, keeps growing. The detection decision is performed on a blog that consists of a growing list of entries. Because entries become available gradually, there can be a time delay in gathering enough evidence (i.e.,

entries) for detection. Since a splog persists in the index until it is detected, early detection with little evidence is crucial for overall search quality. We call a detector that can make a decision with less evidence "fast". Figure 2 illustrates how early splog detection is beneficial. The grid represents how the number of entries (x-axis) increases over time for each blog (y-axis). At a specific time, the gray area ("downloaded in the storage") shows the blogs discovered so far with the corresponding number of entries. As time passes, more blogs are indexed and the number of entries grows, as shown by the dashed border and arrows.

Figure 2: Blogs are discovered and downloaded over time. Similarly, the number of downloaded entries grows over time. The gray area represents blog data that have been downloaded, and the dashed border and arrows show the downloading process continuing over time.

The target objective for blog search engines is to detect splogs as early as possible. As a result, we need to measure the speed of splog detection. Traditional detectors are evaluated offline: a batch of data is fed to the detector and performance metrics are calculated on the detection results. Because we also want to evaluate the speed of splog detection, we propose an online detection evaluation. Another aspect of evaluation depends on the availability of ground truth information. Both offline and online detection can be evaluated with or without ground truth. Accordingly, there are four tasks, as identified in Table 1.

Table 1: Four detection tasks are identified based on offline and online detection.

Dataset \ Task Type      Offline (Traditional)   Online (Time-Sensitive)
With Ground Truth        TASK 1                  TASK 3
Without Ground Truth     TASK 2                  TASK 4

3.2 Traditional IR-based detection
To compare different detection methods, there are two evaluation frameworks used in traditional information retrieval research and widely applied in many TREC tracks.

3.2.1 Evaluation with ground truth
This evaluation is designed to compare detectors on an input set of blogs. Given a set of input blogs B with labels, the detectors can be evaluated by k-fold cross-validation, where performance is measured by metrics such as precision/recall, AUC, or the ROC plot.

3.2.2 Evaluation without ground truth
When evaluating detector performance on a large dataset, only a limited amount of labeled ground truth is available. Each detector makes its decisions on the large dataset and returns the detection results as a ranked list. Detector performance is evaluated by measuring the precision at top N (precision@N) of the ranked list, based on pooling of multiple detection lists.

Based on the availability of ground truth, splog detectors can be compared using one of the above offline evaluations. However, to measure the speed of detection, we propose an online detection framework.

3.3 Online detection
As discussed above, the benefit of early splog detection is to quickly remove splog entries from the search index. Hence, we propose a new framework to evaluate time-sensitive detection performance. We want to measure detection performance on newly discovered blogs and observe how the decisions on these blogs improve as more entries become available. First, blogs in the dataset are partitioned based on their time of discovery (i.e., first appearance in the dataset). We assume the splog detector evaluates blog contents at a uniform frequency, i.e. t_0 = t, t_1 = t + Δt, ..., t_k = t + kΔt. B(t_i) is defined as the partition consisting of blogs discovered after time t_{i-1} and before t_i. B(t_0) is the initial training set, usually given with labels (splog or non-splog).

Figure 3: Online time-sensitive evaluation performance. For each partition B(t_k) (k > 0), the detector gives a decision at time t_j (j >= k), for which the performance p_jk is measured; p_jk is the detection performance at time t_j on the partition B(t_k).

To measure the speed of detection, we are interested in how the detector performance p_jk improves as j increases. We then introduce an overall performance measure over all decisions made with a specific delay. More specifically, for each delay i = j − k, the overall performance P_i is given as an average: P_i = E[p_jk | i = j − k]. The performance is plotted with i on the x-axis, as shown in Figure 3, to demonstrate how quickly the detector can make a good decision.
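The delay-averaged metric above can be sketched in a few lines. The snippet below is an illustrative implementation of P_i = E[p_jk | i = j − k]; the p_jk values in the toy example are placeholders, not measurements from the paper.

```python
# Sketch of the delay-averaged online metric P_i = E[ p_jk | i = j - k ].
# The performance values used below are illustrative placeholders.

def delay_averaged_performance(p):
    """p maps (j, k) with j >= k to a performance score p_jk.
    Returns {i: P_i} where P_i averages all p_jk with j - k == i."""
    sums, counts = {}, {}
    for (j, k), score in p.items():
        i = j - k
        sums[i] = sums.get(i, 0.0) + score
        counts[i] = counts.get(i, 0) + 1
    return {i: sums[i] / counts[i] for i in sums}

# Toy example: decisions on two partitions, re-evaluated at later times.
p = {(1, 1): 0.70, (2, 1): 0.80, (3, 1): 0.90,
     (2, 2): 0.60, (3, 2): 0.84}
P = delay_averaged_performance(p)
# P[0] averages p_11 and p_22; P[1] averages p_21 and p_32; P[2] is p_31.
```

Plotting P[i] against the delay i then shows how quickly detection quality improves as more entries accumulate.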

Note that each performance value p_jk is measured with the same evaluation metrics as the traditional offline evaluations. We expect the proposed online evaluation to provide significant insight into early detection of splogs.

4. OUR DETECTION METHOD
We have developed new techniques for splog detection based on temporal and linking patterns, unique features that distinguish blogs from regular web pages. Due to the special characteristics of splogs, traditional content-based or link-based spam detection techniques are not sufficient. It is difficult to detect spam in individual pages (i.e., entries) with content-based techniques, since a splog can steal (copy) content from normal blogs. Link-based techniques based on propagating trust from legitimate sites work poorly for blogs, since spammers can create links (comments and trackbacks) from normal blogs to splogs. Our observation is that a blog is a growing sequence of entries rather than a set of individual pages. We expect that splogs can be recognized by the abnormal temporal and link patterns observed in their entry sequences, since their motivation differs from that of normal, human-generated blogs. In a splog, the content and link structures are typically machine-generated (possibly copied from other blogs or websites), and the link structure is focused on driving traffic to a specific set of affiliate websites. To capture such differences, we introduce new features, namely temporal regularity and link regularity, described in the following subsections.

4.1 Baseline features
We now discuss the content-based features used in this work; these serve as the baseline feature set, as they are widely used in splog detection. We use a subset of the content features presented in [7]. These features distinguish between two classes of blogs, normal blogs and splogs, based on statistical properties of the content.

In this work we extract features from five different parts of a blog: (1) tokenized URLs, (2) blog and post titles, (3) anchor text, (4) blog homepage content and (5) post content. For each category we extract the following features: word count (w_c), average word length (w_l) and a vector containing the word frequency distribution (w_f). Each content category is analyzed separately from the rest for computational efficiency.

4.1.1 Feature selection using Fisher linear discriminant analysis (LDA)
We need to reduce the length of the vector w_f, as the total number of unique terms (excluding words containing digits) is greater than 100,000 (this varies per category, and includes non-traditional usage such as "helloooo"). Such a long vector can easily lead to overfitting. Secondly, the distribution of the words is long-tailed, i.e. most words are rarely used. We expect good feature subsets to contain features highly correlated with (predictive of) the class, yet uncorrelated with each other. The objective of Fisher LDA is to determine discriminative features while preserving as much of the class discrimination as possible. The solution is to compute the optimal transformation of the feature space based on a criterion that simultaneously minimizes the within-class scatter and maximizes the between-class scatter. This criterion can also be used as a separability measure for feature selection. We use the trace criterion J = tr(S_w^{-1} S_b), where S_w denotes the within-class scatter matrix and S_b the between-class scatter matrix. This criterion computes the ratio of between-class variance to within-class variance in terms of the trace of the product (the trace is the sum of the eigenvalues of S_w^{-1} S_b). We select the top k eigenvalues to determine the key dimensions of the w_f vector.
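To make the selection step concrete, the sketch below scores each feature dimension by a simplified, per-dimension variant of the Fisher criterion (between-class variance over within-class variance) and keeps the top k. This diagonal approximation is an illustrative stand-in for the full trace criterion J = tr(S_w^{-1} S_b) used in the paper, not the authors' exact implementation; the data are toy values.

```python
# Simplified, per-feature Fisher criterion: score each dimension by
# (between-class variance) / (within-class variance) and keep the top k.
# A diagonal approximation of J = tr(S_w^{-1} S_b), for illustration only.

def fisher_scores(X, y):
    """X: list of feature vectors; y: list of 0/1 class labels."""
    d = len(X[0])
    scores = []
    for f in range(d):
        a = [x[f] for x, lab in zip(X, y) if lab == 0]
        b = [x[f] for x, lab in zip(X, y) if lab == 1]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / len(a)
        vb = sum((v - mb) ** 2 for v in b) / len(b)
        within = (va + vb) or 1e-12        # guard against zero variance
        scores.append((mb - ma) ** 2 / within)
    return scores

def select_top_k(X, y, k):
    s = fisher_scores(X, y)
    return sorted(range(len(s)), key=lambda f: -s[f])[:k]

# Toy data: feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 5.0], [0.1, 4.0], [1.0, 5.1], [1.1, 3.9]]
y = [0, 0, 1, 1]
print(select_top_k(X, y, 1))  # → [0]
```

The full criterion additionally rewards feature subsets that are mutually uncorrelated, which the per-dimension score ignores.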
4.2 Temporal regularity
Temporal regularity captures consistency in the timing of content creation (structural regularity) and similarity between contents (content regularity). Content regularity is given by the autocorrelation of the content, derived by computing a similarity measure on the baseline content feature vectors; we define the similarity measure based on the histogram intersection distance. Structural regularity is given by the entropy of the post time difference distribution: a splog will have low entropy, indicating machine-generated posting behavior.

4.2.1 Temporal Content Regularity (TCR)
We use the autocorrelation of the content to estimate the TCR value. Intuitively, the autocorrelation function (conventionally denoted R(τ)) of a time series estimates how much a future sample depends on the current sample. A noise-like signal has a sharp autocorrelation function, while a highly coherent signal's autocorrelation function falls off gradually. Since splogs are usually financially motivated, we conjecture that their content will be highly similar over time, whereas human bloggers tend to post on a diverse set of topics, leading to a sharper autocorrelation function. We define TCR as a self-similarity measure of content and compute it by autocorrelation.

Figure 4: The difference in the autocorrelation function between a splog and a normal blog. The autocorrelation function for a splog is very high and nearly constant, while the values for a normal blog are relatively low and fluctuate. These graphs were derived from a splog and a normal blog in the TREC dataset.

We compute the discrete-time autocorrelation function R(k) for the posts. The posts are time-difference normalized, i.e. we are only interested in the similarity between the current post and a future post in terms of the number of posts in between (e.g. will the post after next be related to the current post), ignoring actual time. This is a simplifying assumption, but it is useful because many posts do not have time metadata associated with them. The autocorrelation function is defined as follows:

R(k) = 1 − E[ d(p(l), p(l+k)) ],
d(p(l), p(l+k)) = 1 − |w_f(l) ∩ w_f(l+k)| / |w_f(l) ∪ w_f(l+k)|,    <1>

where E is the expectation operator, R(k) is the expected value of the autocorrelation between the current (l-th) post and the (l+k)-th post, d is the dissimilarity measure, |·| is the cardinality operator, and ∪ and ∩ are the familiar union and intersection operators.
We use the autocorrelation vector R(k) as a feature to discriminate between splogs and normal blogs.
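Equation <1> can be sketched directly. The snippet below treats each post's word-frequency vector as a plain word set, so the intersection-over-union term reduces to a Jaccard-style similarity; this is a simplified special case of the histogram intersection measure, and the posts are toy examples.

```python
# Sketch of the content autocorrelation R(k) of Eq. <1>, with each post
# reduced to a word set (a Jaccard-style simplification of the
# histogram intersection measure). Toy posts, for illustration only.

def content_autocorrelation(posts, k):
    """posts: list of word sets in posting order. Returns R(k)."""
    pairs = [(posts[l], posts[l + k]) for l in range(len(posts) - k)]
    sims = [len(a & b) / len(a | b) for a, b in pairs if a | b]
    return sum(sims) / len(sims) if sims else 0.0

# A machine-generated splog repeats near-identical content ...
splog = [{"cheap", "meds", "buy"}, {"cheap", "meds", "buy"},
         {"cheap", "meds", "now"}, {"cheap", "meds", "buy"}]
# ... while a human blogger drifts across topics.
blog = [{"travel", "rome"}, {"recipe", "pasta"},
        {"movie", "review"}, {"garden", "tips"}]

r_splog = content_autocorrelation(splog, 1)
r_blog = content_autocorrelation(blog, 1)
assert r_splog > r_blog  # high, flat autocorrelation flags the splog
```

Computing R(k) over a range of lags k yields the autocorrelation vector used as a feature.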

4.2.2 Temporal Structural Regularity (TSR)
We estimate the TSR of a blog by computing the entropy of the post time difference distribution. To estimate the distribution, we apply hierarchical clustering to the post time difference values of the blog.

Figure 5: The differences in temporal regularity between a splog and a normal blog (post-interval distributions: number of posts vs. post interval). The splog posts at a fixed interval of 20 minutes, while the normal blog has highly varied post intervals ranging from several hours to nearly a week.

The temporal structure of post time intervals can be discovered by clustering similar post intervals. We use hierarchical clustering with a single-link merge criterion on the post interval values. The dataset is initialized into N clusters for N data points, and two clusters are merged into one if the distance (linkage) between them is the smallest among all pairwise cluster distances. We use an average-variance stopping criterion for the clustering. Once the clusters are determined, we compute the cluster entropy as a measure of TSR:

B_e = − Σ_{i=1}^{M} p_i log p_i,  p_i = n_i / N,    <2>
TSR = 1 − B_e / B_max,

where B_e is the blog entropy, B_max is the maximum observed entropy, N is the total number of posts, n_i and p_i are the number of posts and the probability of the i-th cluster respectively, and M is the number of clusters. Note that for some blogs, normal or splog, post time is not available as part of the post metadata. We treat such cases as missing data: if a blog does not have post time information, we do not use TSR as a feature.
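The TSR computation in Eq. <2> can be sketched as follows. The clustering step is replaced here by simple rounding of the intervals, an illustrative stand-in for the single-link hierarchical clustering the paper uses; B_MAX and the toy timestamps are assumed values.

```python
import math

# Sketch of TSR (Eq. <2>): entropy of the post-interval distribution.
# Rounding replaces the single-link clustering step, and B_MAX and the
# timestamps are assumed toy values, not the authors' implementation.

def tsr(post_times, b_max):
    """post_times: sorted post timestamps (in hours). Returns the TSR score."""
    intervals = [round(b - a) for a, b in zip(post_times, post_times[1:])]
    n = len(intervals)
    counts = {}
    for iv in intervals:                      # histogram of interval "clusters"
        counts[iv] = counts.get(iv, 0) + 1
    b_e = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - b_e / b_max                  # high TSR = machine-like regularity

B_MAX = math.log(10)                          # assumed maximum observed entropy
splog_times = [0, 2, 4, 6, 8, 10]             # posts every 2 hours, like clockwork
blog_times = [0, 5, 6, 19, 30, 50]            # irregular human posting
assert tsr(splog_times, B_MAX) > tsr(blog_times, B_MAX)
```

A perfectly regular poster has zero interval entropy and hence TSR = 1, while a human-like poster's varied intervals push TSR toward 0.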

4.3 Link regularity estimation
Link regularity (LR) measures consistency in the target websites a blog points to. We expect a splog to exhibit more consistent linking behavior, since its main intent is to drive traffic to affiliate websites. Secondly, we conjecture that a significant portion of its links will target affiliate websites rather than normal blogs or websites. Importantly, these affiliate websites are not authoritative, and we do not expect normal bloggers to link to them. We analyze splog linking behavior using the HITS algorithm [4]. The intuition is that splogs target a focused set of websites, while normal blogs usually link to a more diverse set of websites. We use HITS with out-link normalization to compute hub scores; the normalized hub score of a blog is a useful indicator of the blog being a splog.

Figure 6: Normal blogs tend to link to authoritative websites, while splogs frequently link to affiliate websites that are not authorities. The thickness of the arrow from the splog indicates the frequency with which the splog links to a non-authoritative affiliate website.

We put blogs and the websites they link to on the two sides of a bipartite graph to construct an adjacency matrix A, where A_ij = 1 indicates a hyperlink from blog b_i to website w_j. In the original HITS algorithm, good hubs and good authorities are identified by the mutually reinforcing relationship a = A^T h, h = A a, where a is the authority score vector, h is the hub score vector, and A is the adjacency matrix. A blog with divergent out-links to authoritative websites would obtain a high hub score. To suppress this effect and instead reinforce the influence of blogs with focused targets, we normalize A by the out-degrees of the blogs and then compute the hub scores of the blogs as LR. Our splog detector combines these new features (TCR, TSR, LR) with traditional content features into a large feature vector.
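The out-degree-normalized HITS iteration can be sketched as below. The bipartite graph, link counts, and iteration count are illustrative assumptions; only the update rule (a = A^T h, h = A a on the row-normalized matrix, with hub renormalization) follows the description above.

```python
# Sketch of the link-regularity score: HITS on the blog-to-website
# bipartite graph, with the adjacency matrix row-normalized by each
# blog's out-degree so that a blog repeatedly hitting a small, focused
# set of sites keeps a high hub score. Graph and counts are toy values.

def lr_hub_scores(links, n_sites, iters=50):
    """links: per-blog lists of link counts over n_sites websites."""
    # Row-normalize by out-degree (total out-link count per blog).
    A = []
    for row in links:
        total = sum(row)
        A.append([c / total for c in row] if total else [0.0] * n_sites)
    h = [1.0] * len(A)
    for _ in range(iters):
        # a = A^T h, then h = A a, then renormalize the hub vector.
        a = [sum(A[b][w] * h[b] for b in range(len(A))) for w in range(n_sites)]
        h = [sum(A[b][w] * a[w] for w in range(n_sites)) for b in range(len(A))]
        norm = sum(x * x for x in h) ** 0.5 or 1.0
        h = [x / norm for x in h]
    return h

# Sites: 0 = affiliate site, 1..3 = diverse authoritative sites.
# Blog 0 (splog-like) sends 9 of its 10 links to the affiliate site;
# blog 1 spreads its links evenly over many sites.
links = [[9, 1, 0, 0],
         [1, 1, 1, 1]]
h = lr_hub_scores(links, 4)
assert h[0] > h[1]  # the focused blog gets the higher normalized hub score
```

Without the out-degree normalization, the diverse blog's many links would inflate its hub score; normalization reverses that, which is the point of the LR feature.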
We then use standard machine learning techniques (an SVM classifier with a radial basis function kernel) to classify each blog into one of two classes: splog or normal blog.

5. DATA PRE-PROCESSING AND GROUND TRUTH DEFINITION
We have made significant efforts to pre-process the TREC-Blog dataset and to establish ground truth for training and testing. Our major contributions are summarized as follows:

1. Pre-processing: The TREC-Blog 2006 dataset is a crawl of 100,649 feeds collected over 11 weeks, from Dec. 6, 2005 to Feb. 21, 2006, totaling 77 days. After removing duplicate feeds and feeds without a homepage or permalinks, we have about 43.6K unique blogs. We focus our analysis on this subset of blogs having a homepage and at least one entry.
2. Annotation tool: We have developed a user-friendly interface (Figure 7) for annotators to label the TREC-Blog dataset. A detailed description of the tool is available on our webpage. In the interface, the blog homepages and their entries are fetched from the database and presented to the annotator.

Figure 7: Splog Annotation Tool to view and label blogs.

Through the interface shown in Figure 7, an annotator can browse the blog homepage and entries that have been downloaded in the TREC-Blog dataset, or visit the blog site directly online, in order to assign one of the following five labels: (N) Normal, (S) Splog, (B) Borderline, (U) Undecided, and (F) Foreign Language.
3. Disagreement among annotators: We performed a pilot study to investigate how different annotators identify splogs. We presented a set of 60 blogs to a group of 6 annotators, asking them to assign each blog one of the five labels. One interesting result is that the annotators agree on normal blogs but have varying opinions on splogs (S/B/U), which suggests that splog detection is not trivial even for humans. We plan to conduct further intensive user studies.
4. Ground truth: As of August 24, 2006, we have labeled 9240 blogs using our annotation tool. The 9240 blogs were selected using random sampling as well as stratified sampling. Among these 9240 blogs, 7905 are labeled as normal blogs, 525 as splogs, and the rest as borderline/undecided/foreign.

The annotated splog percentage is lower than previously reported because (1) some known splogs were pre-filtered from the TREC dataset, and (2) we selected the 43.6K subset of blogs that have both homepages and entries downloaded. Using the ground truth generated with the annotation tool, we built a baseline splog detector and our proposed detector.

6. EXPERIMENTAL RESULTS
In this section, we present our experimental results. We manually labeled 7905 normal blogs and 525 splogs, and created a symmetric evaluation set containing 525 splogs and 525 normal blogs.

6.1 Offline (Traditional)
In the traditional offline task, we used all 1050 samples for evaluation, with five-fold cross-validation to evaluate the performance of the splog detector. The results show that the proposed classifier and the new temporal and structural features work well together. Table 2 summarizes the results of offline detection. The table compares content features of different dimensionality against the combination of the same features with the regularity features (TCR, TSR, LR). We use four measures: AUC (area under the ROC curve; see also Figure 8), accuracy, precision and recall. The results indicate that the proposed feature set combines well with the traditional content-based features; the largest gains occur when the dimensionality of the content features is low.

Table 2: Comparison of the baseline content scheme against the combination of the baseline (designated base-n, where n is the dimension of the baseline feature) with the temporal and link-structure features (designated R). Rows cover base-n and R+base-n for each n, plus R alone; columns report AUC, accuracy, precision and recall. The improvement due to the non-baseline features shrinks as the number of baseline feature dimensions increases.

Figure 8: ROC curves (TPF vs. FPF) for different baseline feature sizes, for base-n + R, and for the new features alone, with AUC values: base253 (0.966), base127 (0.957), base64 (0.938), base32 (0.895), R+base253 (0.974), R+base127 (0.968), R+base64 (0.948), R+base32 (0.921), R (0.814).

6.2 Online (Proposed)
In the online evaluation framework, we are interested in understanding the temporal effects on splog classifier performance. To do so, we create a training set B_0 and testing sets B_1, ..., B_7. In the TREC dataset the crawler discovers new blogs every day but refreshes each feed only once a week. Due to this refresh pattern, we test the data weekly, as no intermediate download is available for the blogs. The testing sets B_1, ..., B_7 refer to the blogs that were discovered on day i, where 1 <= i <= 7. We use 400 blogs for training (B_0) and 360 blogs for testing (B_1–B_7). The remaining 290 blogs were discovered after the first week and are not part of the testing set. We now introduce some notation for clarity:
- t_1, t_2, ..., t_n: the testing periods, weeks 1, 2, ..., 7.
- T_r(t): the training-set blog (B_0) data downloaded until time t; T_r(1) denotes the data for those blogs up to the end of week 1.
- T_e(t): the testing-set blog (B_1, B_2, ..., B_7, all discovered in week 1) data downloaded until time t. Note that T_r ∩ T_e = ∅.
- C(T_r(t)), or C_t: a classifier trained on T_r(t). C_{t,F} denotes the classifier trained using feature set F. For example, C_{1,base32+R} denotes classifier C_1 using feature set base32+R, where R = <TCR, TSR, LR>.

- P(T_r(t_1), T_e(t_2)), or P(t_1, t_2): the performance of the classifier trained on T_r(t_1) (i.e. C_1) and tested on T_e(t_2). For example, P(1, 3) denotes the performance of classifier C_1, trained on the first-week training set, tested on the third-week testing set.

Figure 9: The online testing framework. The training set B_0 is partitioned weekly, as are the testing sets B_1–B_7. At the end of each week, the splog detector is re-trained to include the latest week's labeled data and is then tested on blogs B_1–B_7; the test blogs likewise have additional downloaded entries for the latest week. The symbols P_i refer to testing at the end of the i-th week.

Figure 10: Impact of adding the structural features in the online evaluation (AUC and accuracy per week, for base-n, R+base-n, and R alone). In the absence of content data, the regularity features provide a significant boost to both the AUC and the accuracy results.

The results for the online evaluation scheme are shown in Figure 10. They plot P(i,i) against the weekly index i, using AUC (area under the ROC curve) and accuracy as metrics. The plots show the result of testing the classifier on the data (testing sets B_1–B_7) collected after the i-th week, with the classifier being

14 retrained on B 0, with additional data for the set B 0. They indicate that the utility of adding the structural regularity features in on-line detection. In the absence of enough content, these features play a critical role in discriminating between splogs and normal blogs. It is striking to compare the results as shown in Figure 10 against the results for offline testing as shown in Table 2 and Figure 8. They indicate that the information provided by the structural information is complementary to the information in the content this is evidenced by the jump in the AUC and the accuracy values for the corresponding week. Importantly the contribution to the results in the online case due to the regularity features is significant. Not only is there a clear improvement for classifiers C i, base-n+r over the corresponding classifiers C i, base-n for all weeks, but also the difference is high for the low dimensional features. The explanation for the significant difference is as follows. In online tests when in the first week there is not enough content to train the classifier based on content analysis alone, and hence the regularity features become very important. Over time, the training data available to the classifier increases (i.e. the blogs in the training set B 0 have more entries over time) hence making the decision surface more stable. Additionally, we note that the content features that we have extracted are statistical in nature. Hence the feature vector corresponding to the content of both the training data and test data begin to stabilize only after a sufficient number of posts in each blog. Note also that over time the combination of both baseline and regularity features shows less fluctuation than the baseline features alone (ref. Figure 10). 7. CONCLUDING REMARKS In this paper we analyze the splog detection problem as an important open task for TREC-Blog track. 
A splog is significantly different from web spam, and thus new online detection tasks are identified. The new task measures how quickly a detector can identify splogs; this is important because temporal dynamics are a key feature distinguishing a blog from traditional web pages. We identify new features based on temporal content regularity, structural regularity, and link regularity. We also provide a set of ground-truth labels generated through our annotation tool. Our experimental results on both the offline and the online tasks are excellent, and they validate the importance of both the proposed online task and the addition of regularity-based features to traditional content features.

8. REFERENCES
[1] Wikipedia, Spam blog.
[2] Z. Gyöngyi, P. Berkhin, H. Garcia-Molina and J. Pedersen (2006). Link Spam Detection Based on Mass Estimation. 32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea.
[3] Z. Gyöngyi, H. Garcia-Molina and J. Pedersen (2004). Combating Web Spam with TrustRank. Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada.
[4] J. M. Kleinberg (1999). Authoritative Sources in a Hyperlinked Environment. Journal of the ACM 46(5).
[5] P. Kolari (2005). Welcome to the Splogosphere: 75% of new pings are spings (splogs). Permalink:
[6] P. Kolari, A. Java, T. Finin, T. Oates and A. Joshi (2006). Detecting Spam Blogs: A Machine Learning Approach. Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006), July 2006, Boston, MA.
[7] A. Ntoulas, M. Najork, M. Manasse and D. Fetterly (2006). Detecting Spam Web Pages through Content Analysis. Proceedings of the 15th International Conference on World Wide Web (WWW 2006), May 2006, Edinburgh, Scotland.
[8] Umbria (2006). SPAM in the Blogosphere.
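To make the online evaluation protocol concrete, the weekly retrain-and-test loop described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the synthetic blog traits, the nearest-centroid scorer, and the rank-based AUC helper are all assumptions introduced for the example. What it preserves is the protocol: at the end of each week i, features are re-extracted from the accumulated entries, the classifier is retrained on B0, and P(i, i) is reported as AUC and accuracy on the held-out test blogs.

```python
import random

random.seed(7)

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    # Count splog/blog pairs where the splog outscores the blog; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.0):
    return sum((s >= threshold) == (y == 1)
               for s, y in zip(scores, labels)) / len(labels)

def make_blog(is_splog):
    # Hypothetical latent traits: splogs post with machine-like regularity.
    return {"splog": is_splog,
            "content": random.gauss(0.7 if is_splog else 0.4, 0.2),
            "reg": random.gauss(0.8 if is_splog else 0.3, 0.1)}

def observe(blog, weeks):
    # Feature estimates are noisy at first and stabilize as entries accumulate.
    noise = 1.0 / weeks ** 0.5
    return (blog["content"] + random.gauss(0, 0.3 * noise),
            blog["reg"] + random.gauss(0, 0.1 * noise))

def centroid(points):
    return tuple(sum(p[k] for p in points) / len(points) for k in (0, 1))

def train(feats, labels):
    # Nearest-centroid model: one centroid per class (stand-in for the SVM).
    return (centroid([f for f, y in zip(feats, labels) if y == 1]),
            centroid([f for f, y in zip(feats, labels) if y == 0]))

def score(f, model):
    # Signed score: positive when f lies nearer the splog centroid.
    sp, bl = model
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return dist(f, bl) - dist(f, sp)

train_blogs = [make_blog(i % 2 == 0) for i in range(200)]  # plays the role of B0
test_blogs = [make_blog(i % 2 == 0) for i in range(200)]   # plays B1-B7, pooled

for week in range(1, 8):
    # Re-extract features with one more week of entries, retrain, and test.
    tr = [(observe(b, week), int(b["splog"])) for b in train_blogs]
    model = train([f for f, _ in tr], [y for _, y in tr])
    te = [(observe(b, week), int(b["splog"])) for b in test_blogs]
    scores = [score(f, model) for f, _ in te]
    labels = [y for _, y in te]
    print(f"P({week},{week}): AUC={auc(scores, labels):.3f} "
          f"accuracy={accuracy(scores, labels):.3f}")
```

On this toy data the early weeks are the noisiest, mirroring the observation above that statistical content features only stabilize once each blog has enough posts.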

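The regularity features themselves are derived from each blog's temporal and link structure. As a minimal, hypothetical illustration (the interval-entropy measure below is an assumption for this sketch, not the paper's exact feature definition), temporal regularity can be quantified as the Shannon entropy of a blog's inter-post intervals: a scripted splog posting on a fixed schedule yields low entropy, while a human author's erratic schedule yields high entropy.

```python
import math

def interval_entropy(post_times, bins=8):
    """Shannon entropy of a blog's inter-post intervals (times in hours).

    Low entropy suggests machine-like, regular posting; higher entropy
    suggests the irregular rhythm typical of human authors.
    """
    gaps = [b - a for a, b in zip(post_times, post_times[1:])]
    if not gaps:
        return 0.0
    lo, hi = min(gaps), max(gaps)
    if hi == lo:                      # perfectly regular posting schedule
        return 0.0
    width = (hi - lo) / bins
    counts = [0] * bins
    for g in gaps:
        counts[min(int((g - lo) / width), bins - 1)] += 1
    n = len(gaps)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

# A scripted splog posting every 6 hours vs. a human's erratic schedule.
splog_times = [6.0 * i for i in range(30)]
human_times = [0, 5, 7, 30, 31, 50, 80, 82, 90, 140, 141, 200]

print(interval_entropy(splog_times))   # 0.0: perfectly regular
print(interval_entropy(human_times))   # substantially higher
```

A feature like this needs no page content at all, which is consistent with the finding above that regularity features carry the discrimination burden in the first weeks, before content features stabilize.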