
User Interest Analysis in Web Filtering

A-Ning DU and Bin-Xing FANG

Abstract: Web filtering can help people find the most interesting and valuable information. However, current web filtering techniques cannot retrieve results that accurately represent the user interest. This paper investigates the user interest in web filtering and analyzes the problems of current machine-learning-based web filters. According to the differences in user interest, the task of web filtering is divided into three levels: relativity-filter, similarity-filter and homology-filter. A Biased Support Vector Machine (BSVM) is used to make the filter adapt to these differences in user interest. Experiments show that BSVM can greatly improve web filtering performance.

Index Terms: Web Filtering, User Interest, Biased Support Vector Machine.

I. INTRODUCTION

The growth of Internet resources brings the problem of information overload and the need for quality enhancement: people want to read the most interesting messages and to avoid reading low-quality or uninteresting ones. Web filtering is the activity of classifying a stream of incoming web pages dispatched asynchronously by an information producer to an information consumer [1]. It helps people find the most interesting and valuable information and saves Internet users from drowning in the flood of incoming information. In recent years, the machine learning (ML) paradigm [2] has become more popular than knowledge engineering by domain experts for solving this problem, because of its abilities of automatic learning and relativity analysis. However, these ML algorithms are insufficiently accurate and do not adapt well to the ever-changing user interest, that is, the appropriateness of a web document to the user. For example, distinguishing Pornography from SexEd may be difficult, and distinguishing Pornography from Erotica is even harder, since the border is extremely subjective.
This paper studies how to adjust web filtering results to better fit the user interest. Based on a careful study of the user interest, the web filtering result is divided into three scopes (relativity, similarity and homology), which help describe the user interest more accurately. To make the filtering result more precise, the inductive process is improved so that it achieves better precision and recall with respect to the user interest. The improved machine learning algorithm in this paper is based on the Support Vector Machine (SVM), because among the generic machine learning algorithms (Decision Tree, Rule Induction, Bayesian algorithms and SVM), SVM has been shown to be superior, with a solid foundation in Statistical Learning Theory (SLT). The improved algorithm is called Biased Support Vector Machine (BSVM). It introduces a stimulant function that uses the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the different classes of pre-assigned pages unequally, so as to adjust the filtering result to best fit the user interest.

The remainder of the paper is organized as follows. Section II introduces web filtering, analyzes the user interest and the corresponding differences in the filtering result, and discusses the failure of current machine learning approaches. Section III puts forward the model of the Biased Support Vector Machine and analyzes its efficiency in web filtering. Section IV closes the paper with our conclusions and future work.

A-Ning DU and Bin-Xing FANG are with the Research Center of Computer Network and Information Security Technology, Harbin Institute of Technology, People's Republic of China.

II. WEB FILTERING AND USER INTEREST

A. Web Filtering

Web filtering is the task of assigning a boolean value to each web page vector $d_i \in D$, where $D$ is a domain of web pages.
A value of TRUE assigned to $d_i$ indicates a decision that page $d_i$ is relevant to the user interest, while FALSE indicates that it is not. More formally, the task is to approximate the unknown target function $\Psi : D \to \{TRUE, FALSE\}$ (which describes how web pages ought to be assigned) by means of a function $\Phi : D \to \{TRUE, FALSE\}$ called the filter. How to improve the precision and recall of the filter $\Phi$ is the core problem of web filtering.

The general process of web filtering includes five steps:
1) User interest acquisition: acquire many user-assigned web pages as the training set.
2) Web page pre-processing: translate the assigned pages into a set of compact representations of page content. Usually a page $d_i$ is represented as a vector of term weights $d_i = \{w_{1i}, w_{2i}, \ldots, w_{|F|i}\}$, where $F$ is the set of features that occur at least once in at least one document of $D$, and $0 < w_{ki} < 1$ represents how much feature $f_k$ contributes to the semantics of page $d_i$.
3) Dimensionality reduction: select the features of high contribution to reduce the size of the feature set $F$.
4) Construction of web filters: build a filter that describes the user interest automatically.
5) Prediction of unfiltered web pages: use the filter to predict whether an unmarked web page is relevant or not.

Representation of web pages is the basic step of the process, while the degree of dimensionality reduction is the key influencing factor. The effectiveness of a web filter is determined by the generalization and description ability of the filtering algorithm.

Current implementations of web filtering mainly use four techniques: URL blocking, keyword filtering, rating systems, and intelligent content analysis. URL blocking restricts or allows access by comparing the requested web page's URL
(and equivalent IP address) with URLs in a stored list. Its advantages are speed and efficiency, but the approach requires a URL list, which is quite costly to generate and maintain. Keyword filtering blocks access to web sites on the basis of the occurrence of offensive words and phrases on those sites. However, many web sites that do not contain objectionable content will also be blocked. Rating systems let web publishers associate labels or metadata with web pages to limit certain web content to target audiences; in general, however, this approach cannot provide a reliable source of information. Intelligent content analysis systems automatically classify web content by means of ML algorithms. Such learning and adaptation programs can help give semantic meaning to context-dependent words, and they are therefore the dominant approach used in web filtering. Almost all existing filtering software uses URL blocking, while some products also provide rating and keyword options.

The performance of a filtering system can be measured in terms of the blocking rate, which is the percentage of objectionable web pages that are correctly blocked, and the overblocking rate, which is the percentage of legitimate pages that are blocked. The NetProtect project evaluated 50 commercially available filtering systems using 2,794 URLs with pornographic content and 1,655 URLs with normal content [3]. Their results, reproduced in Table I, show that the accuracy of existing systems is far from satisfactory.

TABLE I
NETPROTECT'S EVALUATION OF WEB FILTERING TOOLS [3]

Filtering Tool             Blocking Efficiency   Overblocking Rate
BizGuard                   55%                   10%
Cyber Patrol               52%                   2%
CYBERsitter                46%                   3%
Cyber Snoop                65%                   23%
Internet Watcher           %                     0%
Net Nanny                  20%                   5%
Norton Internet Security   45%                   6%
Optenet                    79%                   25%
SurfMonkey                 65%                   11%
X-Stop                     65%                   4%

B. Analysis of User Interest

In practical web filtering applications, the set of web pages related to the user interest may be considerably large. However, what the user actually wants may be only a few homologous pages.
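As a minimal sketch of the two measures defined in Section II-A and reported in Table I, the following Python function computes a blocking rate and an overblocking rate from labeled evaluation pages. The function name `filter_rates`, its arguments and the toy data are illustrative assumptions; they are not taken from the NetProtect evaluation.

```python
def filter_rates(blocked, is_objectionable):
    """Return (blocking_rate, overblocking_rate) as fractions in [0, 1].

    blocked[i]          -- True if the filter blocked page i
    is_objectionable[i] -- True if page i really is objectionable
    (Both argument names are hypothetical, chosen for this sketch.)
    """
    if len(blocked) != len(is_objectionable):
        raise ValueError("both lists must have the same length")

    bad_total = sum(1 for bad in is_objectionable if bad)
    good_total = len(is_objectionable) - bad_total

    # Objectionable pages that were correctly blocked.
    bad_blocked = sum(
        1 for b, bad in zip(blocked, is_objectionable) if b and bad
    )
    # Legitimate pages that were blocked by mistake.
    good_blocked = sum(
        1 for b, bad in zip(blocked, is_objectionable) if b and not bad
    )

    blocking_rate = bad_blocked / bad_total if bad_total else 0.0
    overblocking_rate = good_blocked / good_total if good_total else 0.0
    return blocking_rate, overblocking_rate


# Toy data: 4 objectionable pages followed by 4 legitimate pages.
labels = [True, True, True, True, False, False, False, False]
decisions = [True, True, True, False, True, False, False, False]
print(filter_rates(decisions, labels))  # (0.75, 0.25)
```

A good filter pushes the first number toward 1 and the second toward 0; the tools in Table I trade one against the other.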
In order to show the differences in user interest, we first give some examples and analyze what the user truly requires.

Example 1: Problems in Pornographic Page Filtering

Nowadays, the Internet has become an important source of information. However, it is also host to pornographic, violent and other content that is inappropriate for most viewers. Web filtering can be used to block access to pages that violate a defined policy. If a page contains a certain number of forbidden keywords, it is considered undesirable. The problem is that the meanings of words depend on the context.
Different Page Subjects: For example, sites about breast cancer research or sexual harassment, or even the home page of someone named Sexton, could be blocked as forbidden pages of the Pornographic class.
Different Writer's Viewpoint: Articles on combating pornographic pages are harmless.
Different Expression Orientation: Pornographic pages also contain many sub-classes, such as gambling, nudity, violence, drugs, alcohol and so on. For example, Itzin [4] classified pornography into three sets: the sexually explicit and violent; the sexually explicit and nonviolent, but subordinating and dehumanizing; and the sexually explicit, nonviolent, and non-subordinating, based upon mutuality. Research consistently shows that harmful effects are associated with the first two, but that the third is usually harmless.

Example 2: Problems in Personal Information Filtering

Information filtering deals with the timely delivery of information that is relevant to the user. An information filtering system assists users by filtering the data stream and delivering the relevant information to the user. The system selects the articles deemed to be interesting to the user and eliminates the rest. However, a filtering system might not be able to perfectly differentiate the articles that are actually relevant to the user from the ones that are not.
The proportion of irrelevant articles delivered to the user should be as low as possible. The proportion of relevant articles eliminated should also be as low as possible.
Different Page Subjects: An information filtering agent assists the user with the task of finding interesting news articles. The articles may belong to one particular domain or to many domains, such as academics, entertainment, migration, sports, etc.
Different Writer's Viewpoint: The user's task of finding interesting news articles may include only articles supporting the event, or all the articles about the event.
Different Expression Orientation: For example, the user's task of finding news articles about a disaster may include articles about relief, damage, etc., or only the articles about one aspect.

As shown in the examples above, the user may be interested in only a portion of the web filtering result, depending on differences in page subject, writer's viewpoint and expression orientation. We can therefore divide web filtering tasks into three levels according to the user interest:
relativity-filter: the filtering result contains all the web pages with the same key phrases or key sentences. These web pages express the same subject, but may not be consistent in viewpoint or orientation. Typical applications of relativity-filtering include erotic web page filtering and hot topic tracing, which aim to collect all the web pages related to the topic, regardless of whether they approve of it or not.
similarity-filter: the filtering result contains all the web pages that hold the same subject, viewpoint and orientation as the user. Typical applications of similarity-filtering include the filtering of web pages on racism or separatism. Similarity-filtering is stricter than relativity-filtering, as not only key words or sentences but also orientation is taken into consideration.
homology-filter: the filtering result contains only the web
pages that share many of the same sentences or paragraphs. The filtering results are almost identical to the user interest, usually because articles from an official or authoritative website are redistributed by other websites with little modification. An example of homology-filtering is counting which article is the most reprinted one on Bulletin Board Systems.

We define all the filtering results acquired by ML algorithms as the relative results ($R_1$), and the filtering results to which the ML algorithms assign TRUE with probability close to 1 as the homologous results ($R_k$). The results of similarity-filtering are then $R_i \in \{R_i \mid R_k \subseteq R_i \subseteq R_1\}$. As illustrated on the left of Fig. 1, most filtering tasks can be described as applications of similarity-filtering with different degrees of similarity between the web pages acquired and the user interest.

[Figure 1: nested regions labeled Internet (U), General filtering result (R1), Adaptable filtering result (Ri), User interest (Uk), and User interest of high similarity (Rk).]

Fig. 1. Analysis and demonstration of filtering result estimation. The area outside the biggest circle is the filtering scope $U$, and the smallest circle is the user interest $U_k$. The biggest circle $R_1$ is the filtering result of general ML algorithms (content relativity), and the smaller circle $R_k$ is the filtering result for content homology. The middle circle $R_i$ is the biased filtering result according to user demand (content similarity).

C. Current Machine Learning Approaches and Their Failure

Web filtering by ML techniques is widely discussed in the literature. A few major ML algorithms are often chosen to construct web filters because of their simplicity, flexibility and robustness:

Decision Trees are an ML approach to the automatic induction of filtering trees from training data [5], [6], [7]. A decision tree is a graph of nodes connected by arcs, with each internal node corresponding to a feature and each arc to a possible value of that feature.
A decision tree is easily interpretable by humans and has low computational complexity, which makes it a simple and practical idea in the field of ML.

Rule Induction methods [8], [9] try to find a proper set of DNF rules for the filtering task such that the error rate on the training set is minimal. Using local optimization techniques, rule induction methods dynamically evaluate rules and revise the covering rule set.

K-Nearest Neighbor (KNN) [10], [11] selects the k most similar documents from the training set and uses the categories of these documents to determine the category of the document being classified. Documents are represented by vectors of words, and the similarity between two documents is measured using the Euclidean distance or another function of these vectors.

Naïve Bayes has been applied to web page filtering in [10], [12], [13], [14], [15]. It uses the joint probabilities of words and categories to estimate the probabilities of categories given a document. Documents with a probability above a certain threshold are considered relevant.

Lee et al. [2] applied Artificial Neural Networks, which learn patterns by modifying the weights among nodes based on learning examples, to identify members of the forbidden class.

Support Vector Machines (SVM) [16], [17], [18] are also a major statistical method. SVM finds, among all the surfaces in the $|F|$-dimensional space, a surface that separates the positives from the negatives with the widest possible margin. SVM performs well on large-scale training sets and needs no human or machine effort for parameter tuning.

According to the comparisons in [19], [20], [21], SVM achieved the best performance on different filtering corpora, with strong robustness and acceptable efficiency. By contrast, the Naïve Bayes assumption that features are independent reduces its ability to analyze web content.
Artificial Neural Networks are computationally expensive, and the over-fitting of Decision Trees and Rule Induction when describing the user interest makes them unsatisfactory.

However, as shown above, web filters based on ML algorithms cannot achieve satisfactory results. This is because it is difficult to understand and express the true meaning of the user interest. Current ML algorithms acquire the user interest only by analyzing the arrangement patterns of words and expressions in the training examples. They neglect much information hidden in the training set, such as the numbers of positive and negative examples, the maximum distribution radius of the positives, the maximum distribution radius of the negatives, and so on. In fact, such hidden information is quite valuable for expressing which portions of the web filtering result the user may be interested in. This paper therefore tries to give ML algorithms the ability to analyze this information. The improved ML algorithm is based on SVM because of its strong robustness and acceptable efficiency.

III. BIASED SUPPORT VECTOR MACHINE FOR WEB FILTERING

A. Biased Support Vector Machine Algorithm

To fit the user interest better, we must give the ML algorithm the ability to adjust. The approach proposed in this paper therefore introduces a stimulant function, which uses the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the different classes of pre-assigned pages unequally, so as to best fit the user interest. The approach is called Biased Support Vector Machine; a detailed description and analysis are given in [22].

In the classical SVM, a penalty function $F = C \sum_i \xi_i$ is introduced as an additional capacity control function, where the non-negative variable $\xi_i$ measures the misclassification error and the coefficient $C$ sets the degree of tolerance of misclassification errors. Consequently, the width of the margin decreases as $C$ increases.
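The classical penalty term can be made concrete with a small sketch. Assuming a fixed linear hyperplane and the usual soft-margin constraint $y_i (w \cdot x_i - b) \ge 1 - \xi_i$, the slack of each example is $\xi_i = \max(0, 1 - y_i (w \cdot x_i - b))$, and the objective adds $C \sum_i \xi_i$ to the margin term $\frac{1}{2}\|w\|^2$. The helper names and the toy points below are hypothetical and serve only as illustration; the sketch evaluates the objective for a given hyperplane and does not train an SVM.

```python
def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * (w . x - b)) of one example (x, y)."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, xs, ys, C):
    """Evaluate 0.5 * ||w||^2 + C * sum(xi_i) for a fixed hyperplane (w, b)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    penalty = C * sum(slack(w, b, x, y) for x, y in zip(xs, ys))
    return regularizer + penalty


# Toy 2-D data (assumed for illustration): two positives, two negatives.
xs = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 0.0]]
ys = [+1, +1, -1, -1]
w, b = [1.0, 0.0], 0.0

# Slacks are 0.0, 0.5, 0.0 and 1.5, so the slack sum is 2.0.
print(soft_margin_objective(w, b, xs, ys, C=1.0))   # 2.5
print(soft_margin_objective(w, b, xs, ys, C=10.0))  # 20.5
```

For the same hyperplane, raising C from 1 to 10 raises the cost of the same two margin violations from 2.0 to 20.0, which is why a larger C pushes the optimizer toward a narrower margin with fewer violations.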

BSVM introduces a stimulant function,

$$F = \frac{C}{n}\Big[(k-1)\, n_- \sum_{y_i=+1} \xi_i \;-\; n_+ \sum_{y_i=-1} \xi_i\Big],$$

as an extension of the penalty function. In BSVM, the positives are the examples with $y_i = +1$ and the negatives are the examples with $y_i = -1$; thus we define $n_+ = |\{y_i = +1\}|$ and $n_- = |\{y_i = -1\}|$, with $n = n_+ + n_-$. The stimulant function uses both the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to express the degree of user bias between the classes. Together with the effect of the penalty function, the bias is described in Equation 1. The width of the margin on the positive side decreases as $n_+/n_-$ or $k$ increases. Thus BSVM can find a proper separating hyperplane with a filtering result $R_i$ between $R_1$ and $R_k$.

$$\text{bias} = \frac{C + C\,(k-1)\,n_-/n}{C - C\,n_+/n} = \frac{1 + (k-1)\,n_-/n}{1 - n_+/n} = \frac{n_+/n + k\,n_-/n}{n_-/n} = k + \frac{n_+}{n_-} \tag{1}$$

BSVM is formulated as follows. The generalized optimal separating hyperplane is determined by the vector $w$ that minimizes the functional

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i}\xi_i + C_1\sum_{y_i=+1}\xi_i - C_2\sum_{y_i=-1}\xi_i, \quad \text{where } C_1 = C\,(k-1)\,n_-/n,\ \ C_2 = C\,n_+/n,\ \ k \ge 0 \tag{2}$$

subject to the constraints

$$y_i\,(w \cdot x_i - b) \ge 1 - \xi_i, \quad \text{where } \xi_i \ge 0,\ \forall i \tag{3}$$

Here $C_1$ and $C_2$ are the stimulant coefficients for classification errors, and $k \ge 0$ is an adaptable parameter. The solution to the optimization problem of Equation 2 under the constraints of Equation 3 is given by the saddle point of the Lagrangian:

$$L(w,b,\xi,\alpha,\beta) = \frac{1}{2}\|w\|^2 + (C + C_1)\sum_{y_i=+1}\xi_i + (C - C_2)\sum_{y_i=-1}\xi_i - \sum_{i}\alpha_i\big(y_i[w^T x_i - b] - 1 + \xi_i\big) - \sum_{i}\beta_i\,\xi_i \tag{4}$$

where $\alpha$, $\beta$ are the Lagrange multipliers. The Lagrangian has to be minimized with respect to $w$, $b$, $\xi$ and maximized with respect to $\alpha$, $\beta$. Hence the solution to the problem is given by

$$\min_{\alpha}\ Q(\alpha) = \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i,x_j) - \sum_{i=1}^{n}\alpha_i \tag{5}$$

with the constraints

$$\sum_{i=1}^{n} y_i\,\alpha_i = 0 \tag{6}$$

and

$$0 \le \alpha_i \le C + C_1 \ \text{ if } y_i = +1, \qquad 0 \le \alpha_i \le C - C_2 \ \text{ if } y_i = -1 \tag{7}$$

B. Experiments and Analysis

In our experiment, the forbidden pages belong to the category of adult content. We collected a total of 500 web pages by searching with the keyword "porn". The corpus was reviewed by human editors and classified according to whether it contains adult content; it includes 100 non-pornographic web pages and 400 pornographic web pages. After taking 1/5 of each class as training examples, we measured the training accuracy of SVM and BSVM, as shown in Table II.

TABLE II
TRAINING ACCURACY OF SVM AND BIASED SVM (K=5)

Algorithm   Web Page           Correct        Incorrect     Total
SVM         Pornographic       378 (94.5%)    22 (5.5%)     400
            Non-pornographic   69 (69.0%)     31 (31.0%)    100
            Total              447 (89.4%)    53 (10.6%)    500
BSVM        Pornographic       396 (99.0%)    4 (1.0%)      400
            Non-pornographic   78 (78.0%)     22 (22.0%)    100
            Total              474 (94.8%)    26 (5.2%)     500

To show the impact of the adaptable parameters on BSVM, we experiment on the benchmark collections of Chinese web pages¹ prepared by Fudan University. The collections include 9804 training examples and 9833 evaluation documents, which consist of Chinese newswire stories classified under 20 categories. In this paper, we experiment on a document set made of two related categories (history and politics) of the benchmark. The document set contains 2800 web pages in total (2000 pages about politics as positives and 800 pages about history as negatives, with 1/10 of each as training examples). We compute the positive sentences filtering precision under different values of $C$, and show the influence of $d = n_+/n_-$ and $k$ in Fig. 2. The results show that the positive sentences filtering precision increases as $n_+/n_-$ and $k$ increase.

¹ The benchmark and a detailed description (in Chinese) are available at 16\&type=

Fig. 2. BSVM filtering efficiency for different $k$ and $n_+/n_-$. The left figure shows the influence of the parameter $d = n_+/n_-$ on the positive sentences filtering precision ($k = 1$). The right figure shows the influence of the parameter $k$ on the positive sentences filtering precision ($n_+/n_- = 1$).

IV. CONCLUSION AND FUTURE WORK

In this paper, we study the different scopes of the filtering result according to different filtering tasks and user interests. We find that the web filtering result can be divided into three sets, the relative page set ($R_1$), the similar page set ($R_i$) and the homologous page set ($R_k$), with the relationship $R_k \subseteq R_i \subseteq R_1$. To make the web filtering result better fit the user
interest, a Biased Support Vector Machine (BSVM) algorithm is introduced, which uses a stimulant function based on the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the different classes of pre-assigned pages unequally. Experiments show that BSVM can improve web filtering performance: in Table II, overall accuracy rises from 89.4% with SVM to 94.8% with BSVM. However, the problems of describing the user bias and of self-adapting the parameters are still open, and we leave them as future work.

REFERENCES

[1] N. J. Belkin and W. B. Croft, "Information filtering and information retrieval: Two sides of the same coin?" Communications of the ACM, vol. 35, no. 12, Dec.
[2] P. Y. Lee, S. C. Hui, and A. C. M. Fong, "Neural networks for web content filtering," IEEE Intelligent Systems, vol. 17, no. 5.
[3] NetProtect Project, "Report on currently available COTS filtering tools," Technical report.
[4] O. B. Longe and F. A. Longe, "The Nigerian web content: Combating pornographic using content filters," Journal of Information Technology Impact, vol. 5, no. 2.
[5] J. R. Quinlan, "Discovering rules by induction from large collections of examples," Expert Systems in the Micro-Electronic Age.
[6] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1.
[7] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
[8] F. D. Chid Apte and S. Weiss, "Text mining with decision rules and decision trees," in Proceedings of the Conference on Automated Learning and Discovery, CMU, June.
[9] P. Clark and T. Niblett, "The CN2 induction algorithm," Mach. Learn., vol. 3, no. 4.
[10] M. Iwayama and T. Tokunaga, "Cluster-based text categorization: a comparison of category search strategies," in Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, E. A. Fox, P. Ingwersen, and R. Fidel, Eds. Seattle, US: ACM Press, New York, US, 1995.
[11] B. Masand, G. Linoff, and D.
Waltz, "Classifying news stories using memory based reasoning," in SIGIR '92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM Press, 1992.
[12] S. Chakrabarti, B. E. Dom, and P. Indyk, "Enhanced hypertext categorization using hyperlinks," in Proceedings of SIGMOD-98, ACM International Conference on Management of Data, L. M. Haas and A. Tiwary, Eds. Seattle, US: ACM Press, New York, US, 1998.
[13] K. M. A. Chai, H. L. Chieu, and H. T. Ng, "Bayesian online classifiers for text classification and filtering," in SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM Press, 2002.
[14] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in AAAI-98 Workshop on Learning for Text Categorization.
[15] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, "A machine learning approach to building domain-specific search engines," in The Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99).
[16] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proceedings of the European Conference on Machine Learning. Berlin, Germany: Springer, 1998.
[17] T. Joachims, N. Cristianini, and J. Shawe-Taylor, "Composite kernels for hypertext categorisation," in Proceedings of ICML-01, 18th International Conference on Machine Learning, C. Brodley and A. Danyluk, Eds. Williams College, US: Morgan Kaufmann Publishers, San Francisco, US, 2001.
[18] V. Vapnik, Statistical Learning Theory. New York: John Wiley & Sons.
[19] A. Du and B. Fang, "Comparison of machine learning algorithms in Chinese web filtering," in Proceedings of the Third International Conference on Machine Learning and Cybernetics. Shanghai, China: IEEE Press, 2004.
[20] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surv., vol. 34, no. 1, pp. 1-47.
[21] Y. Yang and X. Liu, "A re-examination of text categorization methods," in SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM Press, 1999.
[22] A. Du and B. Fang, "A biased support vector machine approach to web filtering," in ICAPR '05: Proceedings of the Third International Conference on Advances in Pattern Recognition, C. A. P. P. Sameer Singh, Maneesha Singh, Ed. Springer Verlag, Heidelberg, D-69121, Germany, 2005.