Fusion of Text and Image Features: A New Approach to Image Spam Filtering


Congfu Xu 1, Kevin Chiew 2, Yafang Chen 1, and Juxin Liu 1

1 Institute of Artificial Intelligence, Zhejiang University, Hangzhou, China
2 School of Engineering, Tan Tao University, Long An, Vietnam

Abstract. While enjoying the convenience of email communications, many users have also experienced annoying spam. Although current spam detection approaches have gained a competitive edge against text-based spam, they still face the challenge posed by image-based spam (image spam for short). Image spam normally includes embedded images that carry the spam message in binary format rather than text format, and it costs more storage and bandwidth resources. In this paper, we propose a hybrid image spam filtering framework that detects spam images based on both extracted text features and image features. Our experimental results show that our approach achieves a significant improvement in detection accuracy compared with methods that use only text features or only image features, and that it works robustly in the presence of complex backgrounds or compression artifacts.

1 Introduction

Nowadays one of the most pervasive applications of the Internet is the email service, which has brought great convenience to our communications. While enjoying the facilities of the email service, users also face a large volume of annoying spam. Email spam, whose volume has reportedly grown tremendously in the past few years, has also degraded the quality of the email service. This is partly because spam consumes storage and communication bandwidth. Moreover, a recent news report (see footnote 1) cites research finding that email spam produces millions of tons of CO2 globally every year.

Many solutions have been proposed for detecting and filtering spam emails to prevent them from being received, forwarded, and spread. The basic technique behind these solutions is to train classifiers to distinguish spam images from ham (hold-and-modify) images.
These classifiers normally use two types of rules: (a) rules based on the connection and relay properties of emails, and (b) rules using features extracted from the contents of emails. Rules of the second type, which carry out content filtering with machine learning mechanisms such as Naive Bayes classification or support vector machines (SVMs), have been a cornerstone of anti-spam systems [16] and have shown the advantage of high accuracy.

Footnote 1: See
Y. Wang and T. Li (Eds.): Practical Applications of Intelligent Systems, AISC 124, pp. springerlink.com © Springer-Verlag Berlin Heidelberg 2011

However, there is now a new attack that could be devastating to content filters. Instead of obscuring the message's text, spammers are now able to

defeat text analysis techniques by replacing text with images. A whitepaper released in November 2006 [17] shows the rise of image spam from 10% of all spam in April to 27% in October 2006, totaling up to 48 billion emails every day.

Fig. 1. Examples of spam images (note the large amount of text and the use of text obfuscation techniques against OCR)

A possible way to detect image spam is to use a pipeline consisting of an optical character recognition (OCR) system, which extracts and recognizes the embedded text, followed by a text classifier that separates spam from legitimate content. This approach was found to be effective for clean images [8]. However, image spam allows spammers to design spam as CAPTCHAs (see the right part of Figure 1) or to obscure the image text to defeat OCR tools. Thus, if an image spam filter relies on an OCR-based module as its only countermeasure, it is vulnerable to image spam with obfuscated text.

In this paper, we propose a solution for image spam filtering. Since most spam images contain a large proportion of text, as shown in Figure 1, our solution first extracts the text information embedded in images, together with the image information that can be identified from the distinctive properties [14] of spam images as compared with natural scene images or generic computer-generated graphic images. We then use a combinational filter with a two-layer structure for training and classification: the bottom-layer classifiers obtain image spam confidence scores from the two types of features, and a top-layer classifier makes the final decision from the outputs of the bottom-layer classifiers.

The remaining sections of the paper are organized as follows. In Section 2 we review related work on filtering techniques for content-based image spam, after which in Section 3 we introduce the framework of image spam filtering in detail.
In Section 4, we report experimental results on real data sets of ham and spam images, and we conclude the paper in Section 5.

2 Related Work

The detection of image spam is a special case of image categorization. It is treated as a two-class classification task between ham and spam images in [1, 6, 8] and has been extensively studied in the context of many important applications. In [1], Aradhye et al. used a support vector classifier to extract the text regions in an image, after which they identified five visual features of the spam. The first feature is the relative area of the image occupied by text. It rests on the idea that spam images usually contain more text than

legitimate images. The other features, such as color heterogeneity and saturation, are computed over text and non-text regions, based on the assumption that images whose main part is synthetic are more likely to be spam. Building on the method in [1], Dredze et al. [6] proposed a different kind of features. Although some visual features are used (such as average RGB colors, the relative area occupied by the most common color, and color saturation features as in [1]), the most important role is played by metadata extracted from the images. They also introduced a feature selection algorithm (JIT) that selects the most discriminative features based on their speed as well as their predictive power.

Fumera et al. [8] proposed an approach to anti-spam filtering that exploits the text information embedded in images sent as attachments. This approach is based on the consideration that text embedded in images plays the same role as text in the body of emails without images (i.e., it conveys the spam message). After extracting text with OCR tools from images attached to emails, they carried out a semantic analysis of the text using text categorization techniques like those applied to the body of emails without images.

A method presented in [4] recognizes image spam by detecting the presence of content obscuring techniques that aim to compromise OCR effectiveness. The implementation is based on two low-level image features: one measures the extent of character breaking or the presence of small noise components, and the other measures the presence of merged characters or large noise components. Nhung and Phuong used simple edge-based features [16] to compute a vector of similarity scores between an image and a set of templates. This similarity vector is then used with an SVM to separate spam images from other common categories of images.
In [11], specific features are selected for inspection by a component-based method, and the spam filter system then uses these features to identify image spam by feature matching.

3 Hybrid Framework for Image Spam Filtering

Since content obscuring techniques can defeat attempts to use OCR tools [8] to detect text embedded in images, we propose an image categorization approach that uses both text and image features to filter such image spam. Figure 2 shows the proposed hybrid framework for image spam filtering. The framework works in three phases. First, we calculate the text features of an input spam email; this includes keyword detection and text-related feature extraction. We then use an SVM to obtain the image spam confidence score. Second, we define a small number of reliable spam-indicative features from the image metadata and image color properties, and use a second SVM to classify the image. Last, we use a fusion classifier to make a decision based on the outputs of both the text classifier and the image classifier.

An example of a spam image is shown in Figure 3. Our framework identifies this spam image as a ham image with the confidence score of by the image classifier, and as a spam image with the confidence score of by the text classifier. Thus the image is finally identified as a spam image after fusion

of both confidence scores. The functions of the major components are introduced as follows.

Fig. 2. Architecture of our hybrid framework for image spam filtering

Fig. 3. An example of a spam image

3.1 Keyword Detection

Semantic analysis of text embedded in images first requires text extraction by techniques such as OCR, which raises two issues: (a) high computational complexity and (b) susceptibility to content obscuring techniques. For the first issue, the computational complexity can be reduced by using a hierarchical architecture for the spam filter: text extraction and analysis are carried out only if the preceding, less complex modules are unable to reliably determine whether an email is legitimate. To further reduce computational complexity, techniques based on image signatures could be employed. For the second issue, since embedded text extraction is often inaccurate, we use keyword detection to improve classification accuracy. We first define a keyword set composed of thirty words and five phrases. Then, for every image, we calculate a feature indicating whether at least one element of the keyword set is detected in the text extracted by an OCR system. OCR on images attached to emails is carried out by the demo version of the commercial software ABBYY FineReader 8.0 Professional with default parameter settings.
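The keyword feature described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the keyword set shown is hypothetical (the paper does not list its thirty words and five phrases), the function name `keyword_feature` is ours, and whole-word matching after normalization is an assumption.

```python
import re

# Hypothetical keyword set for illustration only; the paper's actual
# thirty words and five phrases are not listed in the text.
KEYWORDS = {"viagra", "stock", "buy now", "free offer"}


def keyword_feature(ocr_text, keywords=KEYWORDS):
    """Return 1 if at least one keyword or phrase occurs in the OCR text, else 0."""
    # Lowercase and collapse every run of non-alphanumeric characters to a
    # single space, so that OCR punctuation noise does not block a match.
    normalized = " " + re.sub(r"[^a-z0-9]+", " ", ocr_text.lower()).strip() + " "
    # Pad with spaces so that only whole words and whole phrases match.
    return int(any(" " + kw + " " in normalized for kw in keywords))
```

Because the feature is a single binary value, it stays informative even when OCR recovers only a fragment of the embedded text, which is the situation the section is concerned with.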

3.2 Text-Related Features Extraction

The text-related features capture the properties of text in an image. The text regions in the image are extracted first; a subsequent step derives features from the extracted text regions. Our method of text region extraction comprises the following three main steps.

Step 1: Edge detection. A convolution with a compass operator [12] is used to generate intensity images of four oriented edges, at 0°, 45°, 90° and 135° respectively. Color images are first converted into gray images.

Step 2: Feature generation. We first subdivide an image into a grid of w × h equally sized cells C_ij, where i = 1, ..., w and j = 1, ..., h (each cell is as big as pixels in this work), and then compute six features over all cells. These six features, namely mean μ, standard deviation σ, energy E_g, entropy E_t, inertial-quadrature I, and local homogeneity H, are defined by the following Equations (1) to (6) [5, 9]:

$$\mu = \frac{1}{w h} \sum_{i=1}^{w} \sum_{j=1}^{h} E(i,j) \tag{1}$$

$$\sigma = \sqrt{\frac{1}{w h} \sum_{i=1}^{w} \sum_{j=1}^{h} \left[ E(i,j) - \mu \right]^{2}} \tag{2}$$

$$E_g = \sum_{i,j} E^{2}(i,j) \tag{3}$$

$$E_t = -\sum_{i,j} E(i,j) \log E(i,j) \tag{4}$$

$$I = \sum_{i,j} (i-j)^{2} E(i,j) \tag{5}$$

$$H = \sum_{i,j} \frac{1}{1+(i-j)^{2}} E(i,j) \tag{6}$$

in which E(i, j) is the normalized symmetrical gray level co-occurrence matrix (GLCM) of cell C_ij [10].

Step 3: Text region detection. We first apply K-means clustering to the above features to separate text areas from background areas, and then refine the text regions by morphological dilation and erosion. Figure 4 illustrates the process of text region detection.

Based on the extracted text regions, we calculate the following simple features, which are the most indicative of spam images:
(1) Extent of text regions. The extent of text in the image is defined as the ratio between the area of the extracted text regions and the total area of the image;
(2) Amount of text regions; and
(3) Amount of text letters.
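To make Step 2 concrete, the sketch below computes a normalized symmetric GLCM for one cell and the six texture features of Equations (1) to (6). It is an illustration under stated assumptions, not the authors' code: the function names `glcm` and `texture_features`, the horizontal pixel offset, the number of gray levels, and the reading of the normalizer in Equations (1) and (2) as the number of GLCM entries are all ours. Gray values are assumed to be already quantized to integers in `range(levels)`.

```python
import math


def glcm(cell, levels, dx=1, dy=0):
    """Normalized symmetric gray-level co-occurrence matrix of one cell.

    `cell` is a 2-D list of integer gray levels in range(levels); (dx, dy)
    is the pixel offset (horizontal neighbor by default, an assumption).
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(cell), len(cell[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = cell[r][c], cell[r2][c2]
                counts[a][b] += 1  # count each pair in both orders
                counts[b][a] += 1  # so that the matrix is symmetric
    total = sum(map(sum, counts))
    if total == 0:
        return counts
    return [[v / total for v in row] for row in counts]


def texture_features(E):
    """Six features of Equations (1)-(6) for a square GLCM E."""
    n = len(E)
    entries = [(i, j, E[i][j]) for i in range(n) for j in range(n)]
    mean = sum(v for _, _, v in entries) / (n * n)
    std = math.sqrt(sum((v - mean) ** 2 for _, _, v in entries) / (n * n))
    return {
        "mean": mean,
        "std": std,
        "energy": sum(v * v for _, _, v in entries),
        # Zero entries are skipped, using the convention 0 * log 0 = 0.
        "entropy": -sum(v * math.log(v) for _, _, v in entries if v > 0),
        "inertia": sum((i - j) ** 2 * v for i, j, v in entries),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in entries),
    }
```

A flat cell puts all GLCM mass on the diagonal (zero inertia, homogeneity 1), whereas a cell with strong local contrast, as produced by character strokes, moves mass off the diagonal. This contrast is what the K-means step in Step 3 can exploit.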

Fig. 4. Illustration of the process of text region detection: (a) Initial picture, (b) Candidate of text region, (c) After erosion operation, (d) After dilation operation, (e) Final result, (f) Labeled by pane

Text may be inherently present in natural scene images in the form of road signs, building names, company names and so on, and synthetic images may also include text. Nevertheless, the text features defined above are intuitively expected to discriminate between spam images and non-spam images. Figure 5 shows the distributions of features 1 and 3, from which we can see that spam images and non-spam images fall into different value ranges. For feature 1, more than 40% of ham images fall in the range of 0 to 0.1, and more than 80% of spam images in the range of 0.2 to 0.6; for feature 3, more ham images fall in the range of 0 to 6, and more spam images in the range of 6 to 60.

Following [3], we also use three features to detect the presence of content obscuring. The idea is to measure the perimetric complexity, a quantity used in the literature on the psychophysics of reading, and the aspect ratio (the ratio between width and height). The perimetric complexity is defined as the squared length of the boundary between black and white pixels in the whole image, divided by the black area.
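The perimetric complexity defined above can be sketched for a binarized image as follows. This is an illustrative reading of the definition, not the authors' code: the function name is ours, the boundary length is approximated by counting pixel edges between black and white 4-neighbors, and treating everything outside the image as white is an assumption.

```python
def perimetric_complexity(binary):
    """Squared black/white boundary length divided by the black area.

    `binary` is a 2-D list in which truthy values mark black pixels.
    Returns 0.0 for an image without black pixels (a convention chosen here).
    """
    rows, cols = len(binary), len(binary[0])
    area = 0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c]:
                continue
            area += 1
            # Each side of a black pixel that faces a white pixel (or the
            # image border, assumed white) contributes one unit of boundary.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < rows and 0 <= cc < cols
                if not inside or not binary[rr][cc]:
                    perimeter += 1
    return perimeter ** 2 / area if area else 0.0
```

Broken characters and small noise components add boundary length without adding much black area, so they push this value up, which is why it is useful for spotting obscured text.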

Fig. 5. Feature distributions in all images (percentage of ham images and spam images per value range): (a) Distribution of extent of text regions (feature 1); (b) Distribution of amount of text letters (feature 3)

3.3 Image Features Extraction

Our first group of image features relies on the following metadata:
(1) File format. The file format of an image includes its extension, the actual file format (as identified by metadata), and whether the two match; and
(2) Image metadata. We extract 10 features contained in the image metadata, including whether the image has comments, bits per pixel, number of bands, progressive flag, sample precision, transparent color, approx high, index value, logical height and width.

The rest of our image features are based on the following color properties:
(1) Color saturation. As defined by Frankel et al. [7], color saturation is quantified as the fraction of the total number of pixels in the image for which the difference max(R, G, B) - min(R, G, B) is greater than a predefined threshold;
(2) Color histogram. The color histogram is a compact summary of the image, and legitimate images typically contain a much larger number of colors than spam images. We chose a 6-bit color space, leading to 64 features; and
(3) Color moments. The use of color moments is based on the assumption that the distribution of color in an image can be interpreted as a probability distribution. The color distribution of spam images is typically not continuous since they are synthetic. In our study, we use three moments of an image's color distribution, namely mean, standard deviation and skewness. Using the RGB channels and three moments for each channel, we obtain nine features.
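The three groups of color features can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function names are ours, the saturation threshold of 50 is a hypothetical value (the paper only says "predefined"), the 6-bit color space is assumed to keep the two most significant bits of each 8-bit channel, and skewness is taken as the signed cube root of the third central moment, which is one common convention for color moments.

```python
import math


def saturation_fraction(pixels, threshold=50):
    """Fraction of pixels with max(R, G, B) - min(R, G, B) above the threshold."""
    return sum(1 for p in pixels if max(p) - min(p) > threshold) / len(pixels)


def color_histogram_64(pixels):
    """Normalized 64-bin histogram in a 6-bit color space (2 bits per channel)."""
    hist = [0.0] * 64
    for r, g, b in pixels:
        # Keep the two most significant bits of each 8-bit channel.
        hist[((r >> 6) << 4) | ((g >> 6) << 2) | (b >> 6)] += 1
    return [h / len(pixels) for h in hist]


def color_moments(pixels):
    """Mean, standard deviation and skewness for each of R, G, B (9 values)."""
    n = len(pixels)
    feats = []
    for ch in range(3):
        vals = [p[ch] for p in pixels]
        mean = sum(vals) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)
        third = sum((v - mean) ** 3 for v in vals) / n
        # Signed cube root keeps the skewness in the same units as the mean.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        feats += [mean, std, skew]
    return feats
```

Here `pixels` is a flat list of `(R, G, B)` tuples with 8-bit channel values. Concatenating the three outputs yields 1 + 64 + 9 = 74 color features per image.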
Figure 6 shows several ham and spam images and Figure 7 shows their color saturation, from which we can see that spam images are generally more saturated than images of natural scenes.

3.4 Bottom-Layer Classifiers

Some significant advantages of an SVM, such as excellent generalization ability through the maximum margin approach, the absence of local minima, and the sparse representation of the solution, are the major reasons for using an SVM as a

powerful model in classification tasks.

Fig. 6. Three ham images and two spam images: (a) Ham image 1, (b) Ham image 2, (c) Ham image 3, (d) Spam image 1, (e) Spam image 2

Fig. 7. Color saturation of images in Figure 6

Footnote 2: The software is available at

Both the text classifier and the image classifier first apply an SVM to their respective features, and the resulting spam confidence scores serve as the inputs to classifier fusion for the final decision. The kernel trick is another important factor in the success of SVMs. The polynomial kernel, the radial basis function (RBF) kernel and the sigmoid kernel are three typical kernels. In our study, LIBSVM (see footnote 2) is adopted and RBF is used as the kernel function since the corresponding Hilbert space is of infinite dimension. The

default parameters are used. In the previous sections, we extracted features to obtain the vector space model (VSM) that represents each image. The text-based vector space includes seven features and the image-based vector space includes 87 features. The latter count is consistent with Section 3.3 if the file format contributes three features: 3 file format, 10 metadata, 1 color saturation, 64 color histogram and 9 color moment features. The text classifier and the image classifier use their respective vectors as inputs to the SVM for training and classification.

3.5 Classifier Fusion

Combining the outputs of multiple tools has been reported to be effective in improving information retrieval [13, 15] and classification performance [2, 18]. Our experiments also show that accuracy improves when the results of several classifiers are combined. Furthermore, including the outputs of several types of classifiers protects the filter against the risk of any single classifier being compromised. We use an SVM again to fuse the confidence scores of the text and image classifiers. The outputs of the bottom-layer classifiers constitute a vector for SVM training and classification. The vector is defined as (S_t, S_i), in which S_t is the confidence score of the text classifier and S_i the confidence score of the image classifier. As with the bottom-layer classifiers, LIBSVM and the RBF kernel are adopted for classifier fusion.

4 Experiment

4.1 Experimental Setup

The experiments are carried out on corpora of images taken from real emails. The corpora are the collections of personal emails used in [6], containing 2006 ham images and 3297 spam images. To the best of our knowledge, this is the only corpus of real ham images publicly available to the research community (see footnote 3). For the experiments, the images are first split into two subsets: about 60% are randomly chosen for training the classifiers on the bottom layer, and the other 40% for testing. Then, for the fusion stage, about 50% of the images are randomly chosen for training, and the other 50% for testing. We repeat this random selection 10 times and average all of the results.
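The random selection protocol can be sketched as follows. The text does not say which pool the fusion-stage images are drawn from, so this sketch makes an explicit assumption: the 50/50 fusion split is applied to the 40% of images held out from bottom-layer training, so that the fusion SVM is trained on confidence scores for images the bottom-layer classifiers have not seen. The function names `random_split` and `two_stage_split` are ours.

```python
import random


def random_split(items, train_fraction, rng):
    """Shuffle a copy of `items` and cut it into (train, rest)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def two_stage_split(image_ids, seed):
    """Return (bottom_train, fusion_train, fusion_test) for one repetition."""
    rng = random.Random(seed)
    # Stage 1: about 60% of the images train the text and image classifiers.
    bottom_train, held_out = random_split(image_ids, 0.6, rng)
    # Stage 2 (assumption): the held-out 40% is halved into training and
    # test images for the fusion classifier.
    fusion_train, fusion_test = random_split(held_out, 0.5, rng)
    return bottom_train, fusion_train, fusion_test


# Ten repetitions with different seeds, as in the experimental setup;
# 5303 = 2006 ham images + 3297 spam images.
repetitions = [two_stage_split(range(5303), seed) for seed in range(10)]
```

Each repetition would be followed by training and evaluation, with the reported measures averaged over the ten runs.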
Footnote 3: Available at spam/

We first reduce the images by scaling so that the width and height are no more than 200 pixels. This simple mechanism makes our method robust to random pixels and simple scaling. It also keeps the computation manageable, since image analysis has high computational complexity. We then extract features from all the images in the positive and negative test sets. In our evaluation, accuracy, precision, image spam recall (recall for short) and image non-spam recall (non-spam recall for short) are defined as follows:

$$\text{accuracy} = \frac{\#\text{ of all images correctly classified}}{\#\text{ of all images}}$$

$$\text{precision} = \frac{\#\text{ of spam images correctly classified}}{\#\text{ of images classified as spam}}$$

$$\text{recall} = \frac{\#\text{ of spam images correctly classified}}{\#\text{ of all spam images}}$$

$$\text{non-spam recall} = \frac{\#\text{ of non-spam images correctly classified}}{\#\text{ of all non-spam images}}$$

Fig. 8. Experimental results: (a) Performance comparison for different approaches (image classifier, text classifier, fusion classifier with averaging, fusion classifier with SVM; measures: accuracy, precision, recall, non-spam recall); (b) Performance comparison with Huang's approach (SA with Bayes-OCR, Huang's approach in [11], our approach; measures: precision, recall)

All the experiments are conducted on a typical PC with a Core 2 Quad Q6600 CPU and 4GB memory, running Windows XP.

4.2 Experimental Results

Figure 8(a) shows the details of the experimental results. Compared with the text classifier, the image classifier obtains higher accuracy in classifying common categories of images, whereas the text classifier has a better discriminative capability for classifying spam images. The fusion classifier with averaging achieves better total accuracy, but we cannot see any improvement in the other indicators. The discriminative capability is greatly improved when we fuse the confidence scores of the text classifier and the image classifier with an SVM. We therefore conclude that the fusion classifier with an SVM combines the classification performance of the text and image classifiers in a complementary fashion that unites the strengths of both.

To evaluate the performance of our approach, we compare it with the open-source spam filter SpamAssassin (see footnote 4; SA for short) in its standard configuration, equipped with the Bayes-OCR plug-in for filtering image spam, and with the existing approach presented in a recent paper [11]. The comparative results are shown in Figure 8(b).
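The four evaluation measures of Section 4.1 follow directly from the confusion counts. The sketch below is a minimal illustration; the function name `evaluate`, the label encoding (1 for spam, 0 for non-spam) and the convention of returning 0.0 when a denominator is zero are our assumptions.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision, recall and non-spam recall (1 = spam, 0 = non-spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "non_spam_recall": ratio(tn, tn + fp),
    }
```

Reporting non-spam recall alongside precision matters for a spam filter because a misclassified ham image means a legitimate email may be lost, which is usually costlier than letting one spam image through.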
Footnote 4: Available at

The results of SA with Bayes-OCR serve as our baseline: its precision is very good (almost 100%) while its recall remains low (below 40%). Although our experiment

and the approach in [11] do not use the same corpora, Figure 8(b) shows that our approach obtains better results: the precision is high enough to compete with that of SA with Bayes-OCR, while the recall is much improved. We also compare our approach with the existing approach in [6], which uses the same corpus. The average accuracy of our approach is %, better than the result of % by the approach in [6].

For text-based anti-spam filtering experiments, a number of benchmark datasets are publicly available; for our experiments, however, no other shared ham images are available, and the only other public corpus is SpamArchive (see footnote 5), which consists of 16,021 spam images. We hope that a larger corpus with real spam and non-spam images will become available in the future, so that a fairer comparison of the above-mentioned approaches can be conducted.

5 Conclusion

In this paper, we have presented a novel hybrid framework for detecting spam whose content is embedded in images, based on the fusion of classifiers. Given a spam image, our method extracts both the text features and the image features, feeds the respective feature vectors into the bottom-layer classifiers, and obtains the final decision from the fusion of the classifiers' outputs. Our experimental results show that our approach achieves a significant improvement in the accuracy of image spam detection compared with other approaches. In the next stage of our study, we will further formalize our framework and approach, develop an online version of the fusion method that takes the spam filter's handling capacity into account, and test the image model's ability in spam detection.

Acknowledgments. This paper is supported by the 863 Plan project of China (No.
2007AA01Z197) and the Natural Science Foundations of China (No ), and partially supported by the National Basic Research Program of China (No. 2010CB327903). We would like to thank Dr. Mark Dredze, who is now in the Department of Computer Science at the University of Pennsylvania, for making his data set publicly available and sending us his code for performing the feature extraction.

Footnote 5: SpamArchive was downloadable from SpamArchive.org, which has been shut down. It is now available at image spam/

References

1. Aradhye, H.B., Myers, G.K., Herson, J.A.: Image analysis for efficient categorization of image-based spam e-mail. In: Proceedings of International Conference on Document Analysis and Recognition, pp. (August 2005)

2. Bennett, P.N., Dumais, S.T., Horvitz, E.: The combination of text classifiers using reliability indicators. Information Retrieval 8(1), (2005)
3. Biggio, B., Fumera, G., Pillai, I., Roli, F.: Image spam filtering by content obscuring detection. In: Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS 2007), pp. 2 3 (August 2007)
4. Biggio, B., Fumera, G., Pillai, I., Roli, F.: Image spam filtering using visual information. In: Proceedings of the 14th International Conference on Image Analysis and Processing (ICIAP 2007), pp. (September 2007)
5. Cheng, H.D., Sun, Y.: A hierarchical approach to color image segmentation using homogeneity 9(12), (2000)
6. Dredze, M., Gevaryahu, R., Elias-Bachrach, A.: Learning fast classifiers for image spam. In: Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS 2007), pp. (August 2007)
7. Frankel, C., Swain, M., Athitsos, V.: Webseer: an image search engine for the world wide web. Technical report, University of Chicago (1996)
8. Fumera, G., Pillai, I., Roli, F.: Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research (special issue on Machine Learning in Computer Security) 7, (2006)
9. Gopalan, C., Manjula, D.: Statistical modeling for the detection, localization and extraction of text from heterogeneous textual images using combined feature scheme, (2010)
10. Haralick, R., Shanmugam, K., Dinstein, I.: Textural features for image classification 3(6), (1973)
11. Huang, H., Guo, W., Zhang, Y.: A novel method for image spam filtering. In: Proceedings of the 9th International Conference for Young Computer Scientists (ICYCS 2008), pp. (November 2008)
12. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice-Hall, Inc., Upper Saddle River (1989)
13. Lynam, T.R., Buckley, C., Clarke, C.L.A., Cormack, G.V.: A multi-system analysis of document and term selection for blind feedback.
In: Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM 2004), pp. (November 2004)
14. Mehta, B., Nangia, S., Gupta, M., Nejdl, W.: Detecting image spam using visual features and near duplicate detection. In: Proceedings of the 17th International Conference on World Wide Web (WWW 2008), pp. (April 2008)
15. Montague, M., Aslam, J.A.: Condorcet fusion for improved retrieval. In: Proceedings of the 11th ACM Conference on Information and Knowledge Management (CIKM 2002), pp. (November 2002)
16. Nhung, N.P., Phuong, T.M.: An efficient method for filtering image-based spam. In: Proceedings of 2007 IEEE International Conference on Research, Innovation and Vision for the Future, pp. (March 2007)
17. Secure Computing Whitepaper. Image spam: The latest attack on the enterprise inbox. Technical report (November 2006)
18. Zhang, Y.: Using Bayesian priors to combine classifiers for adaptive filtering. In: Proceedings of the 27th Conference on Research and Development in Information Retrieval (SIGIR 2004), pp. (July 2004)