Embedded-Text Detection and Its Application to Anti-Spam Filtering


UNIVERSITY OF CALIFORNIA
Santa Barbara

Embedded-Text Detection and Its Application to Anti-Spam Filtering

A Thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Computer Science

by Ching-Tung Wu

Committee in Charge:
Professor Kwang-Ting Cheng, Chair
Professor Yuan-Fang Wang
Associate Professor Matthew Turk

April 2005

The Thesis of Ching-Tung Wu is approved:

Professor Yuan-Fang Wang
Associate Professor Matthew Turk
Professor Kwang-Ting Cheng, Committee Chairperson

April 2005

Embedded-Text Detection and Its Application to Anti-Spam Filtering
Copyright © 2005 by Ching-Tung Wu

Abstract

Embedded-Text Detection and Its Application to Anti-Spam Filtering

Ching-Tung Wu

Embedded-text in images usually carries important messages about the content. In the past, several algorithms have been proposed to detect text boxes in video frames. Previous work often followed a multi-step framework using a combination of image-analysis and machine-learning techniques. In this work, we propose a unified embedded-text detection framework to locate text boxes efficiently and accurately, particularly in web and email images. We approach the embedded-text problem from the angle of object detection. We define position-independent features to capture the essence of characters and a smart-scan algorithm to trace text lines using their spatial and geometrical properties. We also propose a novel anti-spam system which utilizes visual clues, including the embedded-text information. The experimental results demonstrate the effectiveness of the proposed embedded-text detection framework and the anti-spam filtering system.

Professor Kwang-Ting Cheng, Thesis Committee Chair

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Previous Work
  1.2 Motivation
  1.3 Contributions
  1.4 Outline
2 Embedded-Text Detection
  2.1 Cascade Detector
    2.1.1 Position-Independent Feature - Motivation
    2.1.2 Position-Independent Feature - Definition
    2.1.3 Cascade Detector Training
    2.1.4 Experiments and Discussions
  2.2 Detection Framework
    2.2.1 One-Step Framework
    2.2.2 Two-Step Framework
  2.3 Experiments
    2.3.1 Metrics
    2.3.2 Experimental Results
    2.3.3 Computation Cost
    2.3.4 Memory Consumption
3 Visual-Based Anti-Spam Filtering
  3.1 The Spam Datasets and Analysis
  3.2 The Anti-Spam Filter
    3.2.1 Feature Description
    3.2.2 Feature Extraction
    3.2.3 The Classifier
  3.3 Experiments
    3.3.1 Datasets
    3.3.2 Experimental Results
4 Conclusions
Bibliography

List of Figures

1.1 Embedded-Text in Images Example
2.1 Top two features selected by AdaBoosting in Viola's work
2.2 Canny edge detection result of a sample image
2.3 Edge coding, 3x3 sub-window, horizontal, and vertical pattern
2.4 Edge map of text and confusing background objects
2.5 Top three features selected by BrownBoosting
2.6 Tough case of negative training samples
One-Step Framework Flow
Example of Locality Shifts
Locality-Filter Rectangle Features
Vertical Locality-Filter Example of Letter A
Horizontal Locality-Filter Example of Letter A
Horizontal Locality-Filter Example of Letter I
Horizontal Locality-Filter Example of Letter L
Diagram of the search algorithm
Two-Step Framework Flow
Embedded-Text Detection Raw Results
Embedded-Text Detection Final Results
Spam with embedded-text in image example
Text embedded in image example
Detected embedded-text regions

List of Tables

2.1 18-stage cascade detector using BrownBoosting
2.2 Feature sets comparison
Locality-Filter Verification
Performance Comparison
Computation Cost Comparison
Statistics of emails containing accessible images
Classifier comparison
Filter comparison
Blocked/Missed analysis

Chapter 1

Introduction

As the Internet continues to grow, the type of information available to users has shifted from text-only to multimedia-enriched. Embedded-text in multimedia content (e.g. images and videos) is becoming one of the prevalent means for delivering messages to content viewers. Several embedded-text detection and recognition algorithms have been proposed in the past. However, as multimedia content becomes more diverse and its volume grows exponentially, developing an embedded-text detector that is both accurate and efficient has become an interesting topic and a challenging task.

Embedded-text often carries important messages regarding the where, what, and who of the content. This information clearly helps with the recognition and annotation of the content. Embedded-text within images and videos can be categorized as natural scene text and artificial text. Natural scene text refers to text that appears as part of the scene in images or videos; for example, a news video showing an event that happened in a recognizable location. Artificial text, on the other hand, is manually added to provide content to viewers. One example of artificial text is the closed caption of a video; another is the text banner box on a web image.

The first step in extracting embedded-text from an image or a video frame is to pinpoint the location of the text boxes. In this thesis, we aim to develop a robust and accurate embedded-text detector, particularly for detecting artificial text in web and email images. The detection result can be fed into existing Optical Character Recognition (OCR) packages such as OmniPage or GNU OCR. In the second part of this thesis, we also demonstrate an application to anti-spam filtering which directly uses the text-detection result.

1.1 Previous Work

Artificial text boxes are designed to attract attention. These text boxes usually present strong coherence in space, geometry, and color. For example, they usually align horizontally, contain homogeneous color regions, have strong edge energy, and exhibit high intensity differences from neighboring objects or contrasting

background. Since visual, spatial, and geometric coherence provides strong clues, previous work has primarily focused on these characteristics for the task of text detection.

Based on the detection framework, previous work can be roughly classified into two categories: (1) image-analysis-based and (2) image-analysis plus machine-learning-based. In the first category, embedded-text detection depends heavily on image-analysis techniques, such as edge detection, connected-component analysis, and texture analysis. In the second category, a machine-learning method is applied in addition to the image-analysis techniques. The machine-learning step is used mainly for verification because it can further reduce the number of false positives.

Approaches in the image-analysis category can be further sub-categorized into edge-based and texture-based approaches. The common flow works as follows: First, edge detection or texture analysis is performed on the image under consideration. The image's raw features are then grouped according to connectivity, spatial, and geometric correlations to form potential text regions. Second, potential text regions are further examined using rule-based heuristics, such as the size, aspect ratio, and orientation of the region. Fletcher et al. [6] used edge detection and connected-component analysis to generate connected text blocks. Then, they applied a

Hough transform to aggregate the text blocks into text strings. LeBourgeois [12] proposed using a linear filter to transform the gradient map, such as the edge energy map of an image, into a texture map by smearing and dilating the gradient map in one direction. Since text pixels appear in a coherent way, text regions usually stand out and can be easily identified after the transformation. Zhou et al. [39] proposed a texture-based approach to extract text from web images using color clustering and connected-component analysis. Wu et al. [23] presented a text segmentation method based on the distinctive textures of text. Text strokes are extracted and refined in multiple phases to form the final bounding boxes.

In the second category, the machine-learning method is applied in conjunction with image-analysis techniques. The machine-learning step serves as the text-pixel or text-region verifier. It is usually applied after the image-analysis step to further reduce the false-positive rate. Chen et al. [2] used edge dilation to merge horizontal and vertical edges into regions, then applied an SVM model to classify these text regions. The assumption was that text lines contain short horizontal and vertical edges within a confined space. The dilation and merging of these edges make text appear as continuous regions. These regions are then classified using the trained SVM model. Wolf et al. [25] extended LeBourgeois's idea by applying more geometrical and spatial heuristics to text-line identification. Then, an SVM model was trained for the final classification. Kim et al. [10] also

presented a texture-based approach that uses an SVM to report the probability of a pixel being part of a text texture and then applies the CAMSHIFT algorithm to the probability map to locate text regions.

What the image-analysis-based and the image-analysis plus machine-learning-based approaches have in common is that they often follow a two-step procedure. In the first step, potential text regions are identified. Then, in the second step, statistically driven and/or heuristics-based models are applied to refine the potential text regions and to determine the final ones.

1.2 Motivation

The first motivation for our work is that the number of images with embedded-text has increased rapidly, especially in web and email applications. Being able to detect text embedded in these images is a good starting point for further analysis of the received content. One example is Unsolicited Commercial Email (UCE), also known as spam, on the Internet. With the increasing importance of email and the incursions of Internet marketers, spam has become a major problem. In the past, researchers have addressed this problem as a text classification or categorization problem. Recently, spammers have been using embedded-text in

images to avoid these text-based anti-spam filters. Without the ability to detect embedded-text, it will be very difficult to alleviate this problem.

The second motivation for our work is to improve the text-detection accuracy beyond the previous two-step framework. We hope to keep the advantages of the existing two-step framework while reducing or eliminating its disadvantages. In the previous embedded-text frameworks, the text detection problem is divided into two sub-tasks. In the first step, potential text regions are identified using primarily image-analysis techniques. Then, in the second step, potential text regions are further refined using statistically driven models or heuristic rules based on spatial and geometrical constraints.

One advantage of these frameworks is that the embedded-text detection problem can be divided into two easier sub-problems. Each step in the two-step framework deals with one of the sub-tasks, making the process more efficient. The first step locates potential text regions. The second step, which usually takes more time, refines them and determines the final text regions. By separating the tasks, the slower step only has to deal with part of the image, rather than the entire image. Another advantage is that false alarms can be further reduced as the text regions pass through each step.

A major disadvantage of the two-step framework is that the two steps depend on each other, so the overall detection rate is constrained by the lower of

the two. For example, if the lower detection rate of the two steps is 80%, then the overall detection rate cannot surpass 80%.

Viola and Jones [22] proposed a rapid object-detection framework using rectangle features and a boosted cascade detector for the task of face detection. The rectangle features are position-dependent and somewhat rigid. They are suitable for capturing the differences in intensity among facial regions. The cascade detector is a degenerate decision tree which acts as a rejection-based classifier. In each stage of the cascade detector, the classifier is trained using AdaBoosting to pass almost every positive sample while rejecting most of the negative ones. The combination of stages yields a high detection rate together with a low false-positive rate. Both the rectangle features and the cascade detector can be evaluated efficiently at every location and scale.

In our work, we focus on developing a robust embedded-text detection framework to locate artificial text boxes embedded in images, especially web and email images. Our goals are (1) efficiency, especially in detection speed, and (2) a favorable ratio of high detection rate to low false-positive rate. We approach the embedded-text detection problem from the angle of object detection. We define three sets of Position-Independent Features (PIF) and propose using a boosted cascade detector, similar to that of Viola and Jones, as the text detection engine. The PIF features can be evaluated efficiently. They are suitable for tackling the diverse and somewhat unpredictable shapes of text. Then, we propose a unified embedded-text detection framework using the cascade detector and PIF features. The proposed framework is capable of detecting horizontal text lines embedded within an image at every location and scale.

1.3 Contributions

The main contributions of this thesis are summarized as follows:

We define three sets of position-independent features (PIF) which can be computed efficiently for the task of text detection. The three sets of PIF features are local edge-pattern, local edge-density, and global edge-density. These PIF features are able to capture diverse types and sizes of fonts as well as to differentiate edges of characters from those of random objects.

We propose using a cascade detector for the task of text detection. The proposed cascade detector is trained using the position-independent features mentioned above and BrownBoosting.

We propose a novel unified-step embedded-text detection framework using the above cascade detector. This embedded-text detector does not require any pre-processing step and can do full image scans.

We introduce the use of visual information for anti-spam filtering. By thoroughly analyzing a large collection of spam emails, we demonstrate that useful features and parameters can be derived from images in spam emails for the purpose of anti-spam filtering.

Based on such visual features and parameters, we propose a novel visual-based anti-spam filter. The proposed filter, used in conjunction with existing text-based filters, can improve the filtering accuracy. We have successfully integrated the proposed anti-spam filter with Thunderbird and demonstrated very promising results.

1.4 Outline

The rest of this thesis is organized as follows:

In Chapter 2, we present the details of the proposed embedded-text detector. In Section 2.1, we introduce the concept of the position-independent features as well as details of feature selection and training. In Section 2.2, we present the detection framework and explain how to apply the cascade detector for the task of text detection. In Section 2.3, we show experimental results to evaluate the effectiveness of the position-independent features and to demonstrate the robustness of the proposed embedded-text detector.

In Chapter 3, we present the details of the visual-based anti-spam filter. In Section 3.1, we present the statistics of some visual parameters from a thorough analysis of more than 120K spam emails downloaded from SpamArchive [30]. In Section 3.2, we present the proposed filtering system in detail. Then, in Section 3.3, we present experimental results to show the effectiveness of the proposed anti-spam filter.

In Chapter 4, we present the conclusions of this thesis.

Figure 1.1: Embedded-Text in Images Example

Chapter 2

Embedded-Text Detection

The heart of the proposed embedded-text detection system is the cascade detector and the Position-Independent Features (PIF). The cascade detector is a degenerate decision tree which consists of several stages organized in a sequential manner. Each stage, in turn, is a boosted classifier that is trained to pass almost every positive sample while rejecting as many negative ones as possible. The PIF features are designed to differentiate the edges of text from those of random objects. Both the cascade detector and the PIF features can be evaluated efficiently. Therefore, they can be applied to part of, or even the entire, image under consideration.

In this chapter, we present the architecture of the proposed embedded-text detection system. In Section 2.1, we discuss the details of the PIF features, feature

selection, and cascade detector training. We elaborate on the rationale behind the design of the PIF features and the training strategy. In Section 2.2, we demonstrate how to apply the text detector engine (the trained cascade detector and PIF features) in the traditional two-step framework and the proposed unified one-step framework.

2.1 Cascade Detector

Viola and Jones [22] proposed a cascade framework that is capable of detecting faces rapidly while achieving high detection accuracy. There are three key contributions in their approach. The first is the introduction of a new image structure called the Integral Image, which allows the features to be computed efficiently. The second is the use of the AdaBoosting learning algorithm to select and combine a small number of critical features into efficient yet strong classifiers. The third is a smart framework for organizing classifiers in a cascade, which allows background regions to be quickly discarded while spending more computation on promising face-like regions. Our purpose is to extend a similar idea to a new task: detecting embedded-text within images.
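As an illustration of the Integral Image idea, the following is a minimal sketch in plain Python rather than the implementation used in this work. The function names (`integral_image`, `rect_sum`, `two_rect_feature`) and the zero-padded table layout are assumptions made for the example. It shows why any rectangle sum, and therefore any rectangle feature, costs only a handful of table lookups once the table has been built in a single pass.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h = len(img)
    w = len(img[0]) if h else 0
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above-left of (x, y), inclusive.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Hypothetical two-rectangle feature: top half minus bottom half (h even)."""
    half = h // 2
    return rect_sum(ii, x, y, w, half) - rect_sum(ii, x, y + half, w, half)


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(image)
total = rect_sum(ii, 0, 0, 3, 3)   # whole image: 45
inner = rect_sum(ii, 1, 1, 2, 2)   # 5 + 6 + 8 + 9 = 28
```

Padding the table with a zero row and column is a design choice that removes boundary checks from `rect_sum`; other layouts work equally well.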

2.1.1 Position-Independent Feature - Motivation

Viola and Jones defined a set of rectangle features, which can be calculated by computing the differences of raw pixel sums among regions within the rectangle under consideration. More than 40K rectangle features can be extracted from a 24x24 training sample, and each feature value can be calculated in fewer than 100 machine instructions using a data structure called the integral image. Although rectangle features have proven effective and efficient for the specific purpose of face detection, no work has been done so far on general object detection using this type of feature. The main reason is that the rectangle features carry position information (they are position-dependent), and thus the target object has to strictly follow a well-defined rigid structure. For example, the first feature selected by the AdaBoosting algorithm, as shown in Figure 2.1, measures the difference in intensity between the eyes and the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. However, in the task of text detection, due to the different fonts, sizes, and shapes of text, position-dependent features cannot capture information that is common to a set of samples.

Edge information is a dominant feature used by many researchers for embedded-text detection in images. Figure 2.2 shows the Canny edge detection result of a

sample image. By comparing the edges from the text and the background, we arrive at several key observations:

(1) Text edges always follow a few fixed patterns, such as vertical and horizontal lines, half-circles, and T-like conjunctions, while background edges are totally unpredictable.

(2) Text edges should be smooth and continuous, and thus edge points of an individual character should be connected to each other. Therefore, if we define a small sub-window around a text edge point, the number of edge points within the sub-window should fall within a certain range. Nevertheless, the number of edge points within a sub-window around a background edge point is completely random.

(3) For a sub-window containing a full character, the edge points are often spread evenly over the entire sub-window. However, for a sub-window containing a background object, the edge points often cluster into small regions.

Figure 2.1: Top two features selected by AdaBoosting in Viola's work

Figure 2.2: Canny edge detection result of a sample image
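The observation about the number of edge points in a small sub-window can be made concrete with a short sketch. It is written in plain Python under stated assumptions: the edge map is a list of rows of 0/1 values, the function name `local_edge_counts` is hypothetical, and the brute-force window loop is used for readability where an integral image would normally be used for speed.

```python
def local_edge_counts(edge_map, size=3):
    """For every edge point, count the edge points in the size x size window centered on it."""
    h, w = len(edge_map), len(edge_map[0])
    r = size // 2
    counts = {}
    for y in range(h):
        for x in range(w):
            if not edge_map[y][x]:
                continue  # only edge points get a count
            n = 0
            # Clip the window at the image border.
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    n += edge_map[yy][xx]
            counts[(x, y)] = n
    return counts


# A thin vertical stroke, as produced by a character edge.
stroke = [[0, 0, 1, 0, 0] for _ in range(5)]
# A filled 3x3 blob, standing in for a cluttered background region.
blob = [[1 if 1 <= x <= 3 and 1 <= y <= 3 else 0 for x in range(5)] for y in range(5)]

stroke_counts = local_edge_counts(stroke)
blob_counts = local_edge_counts(blob)
```

In this toy example every point on the stroke sees two or three edge points in its 3x3 window, while points in the blob see between four and nine. The toy inputs are illustrative only and are not data from the experiments.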

2.1.2 Position-Independent Feature - Definition

In this section, we introduce three different types of edge-based features corresponding to the three key observations mentioned above. In addition, the proposed features satisfy the following rules:

(1) Features must be able to differentiate edges of text from those of background objects.

(2) Features do not include any position information, i.e. they are position-independent.

(3) Features can be computed efficiently.

Local Edge-Pattern

Figure 2.3: Edge coding, 3x3 sub-window, horizontal, and vertical pattern
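The 3x3 edge coding named in the Figure 2.3 caption can be sketched as follows. This is a hedged illustration, not the thesis code: the neighbor ordering in `NEIGHBORS` is an assumption, chosen column by column so that a horizontal line encodes as 66 and a vertical line as 24, the decimal values the text gives for Figure 2.3b and Figure 2.3c. The function names and the dictionary-based histogram are likewise hypothetical.

```python
# Neighbor offsets (dx, dy), listed column by column with the center skipped.
# ASSUMPTION: this ordering is not stated in the text; it is picked so that
# the codes match the quoted values (horizontal = 66, vertical = 24).
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def edge_pattern(edge_map, x, y):
    """8-bit code of the 3x3 neighborhood of (x, y); the first neighbor is the top bit."""
    h, w = len(edge_map), len(edge_map[0])
    code = 0
    for dx, dy in NEIGHBORS:
        nx, ny = x + dx, y + dy
        # Neighbors outside the image count as non-edge points.
        bit = edge_map[ny][nx] if 0 <= nx < w and 0 <= ny < h else 0
        code = (code << 1) | bit
    return code


def pattern_histogram(edge_map, x0, y0, width, height):
    """Count how often each pattern code occurs among the edge points of a rectangle."""
    hist = {}
    for y in range(y0, y0 + height):
        for x in range(x0, x0 + width):
            if edge_map[y][x]:
                code = edge_pattern(edge_map, x, y)
                hist[code] = hist.get(code, 0) + 1
    return hist


horizontal = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
vertical = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

h_code = edge_pattern(horizontal, 1, 1)   # 0b01000010 = 66
v_code = edge_pattern(vertical, 1, 1)     # 0b00011000 = 24
h_hist = pattern_histogram(horizontal, 0, 0, 3, 3)
```

Because the histogram records only how often a code occurs inside the rectangle and not where, it carries no position information, which is the property the rules above ask for.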

To compute the local edge-pattern, we first apply the Canny edge detector to the image under consideration, then binarize the detection result by using 0 to represent non-edge points and 1 to represent edge points. For each edge point, if we define a 3x3 sub-window as shown in Figure 2.3a, we can use 8-bit binary strings to represent different edge patterns. Figure 2.3b and Figure 2.3c show the horizontal and vertical patterns. They represent a horizontal and a vertical line respectively. Each binary number can be mapped to a decimal number. For example, the binary and decimal numbers for Figure 2.3b and Figure 2.3c are 01000010 (66) and 00011000 (24) respectively. By using the binary coding scheme and the 3x3 sub-window, we can extract and compute edge patterns efficiently. The total number of edge patterns defined by a 3x3 sub-window is 256. These edge patterns can be used to represent horizontal, vertical, diagonal, curvy, and even conjunctional patterns.

In Algorithm 1, we show the local edge-pattern extraction work flow. Theoretically, we can compute 256 integral images, one for each edge pattern. In a real implementation, however, we do not need to compute all of them. Instead, we only need to compute the integral images for the features that are actually selected during the cascade training. It should be noted that multiple integral images can be generated in a single scan. The additional cost is mainly in memory, not in computation. Moreover, we use a histogram to represent the features of a rectangle to be

evaluated. This design implicitly eliminates the position information. In other words, we only consider the frequency of a pattern of interest in a sub-window, not the position of that pattern. Therefore, we refer to our features as Position-Independent Features (PIF), in contrast to Viola and Jones's rectangle features.

Algorithm 1 Local Edge-Pattern Extraction
INPUT: Image
OUTPUT: Local Edge-Pattern Features
step1: Produce an edge map by using the Canny edge detector
step2: Obtain a pattern map by evaluating each edge point
step3: Compute integral images based on the pattern map
step4: Extract and evaluate histogram features

Local Edge-Density

Based on the second observation in Section 2.1.1, we concluded that the local edge-density around each edge point is useful for discriminating edges of text from those of background objects. Therefore, we propose the local edge-density feature set to capitalize on this information. We define a number of small sub-windows around each edge point, and count the number of edge points within the sub-windows. Empirically, we adopt sub-windows of different sizes, from 3x3 to 7x7. Using an algorithm similar to the one above, we can calculate the integral

images based on local edge-density maps with respect to each pre-defined window. In order to eliminate position information, we adopt a normalized histogram representation as well.

Global Edge-Density

Besides local edge-density, we have also observed that the edges of text in the rectangle under consideration often form a well-proportioned distribution. Therefore, we propose the global edge-density feature set to capitalize on this finding. For an NxN rectangle, we use an N/2 x N/2 sub-window to do a sliding scan. Each sub-window steps N/4 pixels in the x and/or y direction from the previous one. As a result, nine sub-windows are scanned in total (three positions along each axis). Based on the integral image calculated from the edge map, we can extract nine numbers from the nine sub-windows. These nine numbers provide an intuitive cue for the global distribution of the edge points. However, a potential problem is that they are not position-independent, since each sub-window contains position information with respect to the rectangle. Therefore, in order to remove the position information, we sort these nine numbers to make them position-independent. Moreover, we can produce more features based on the nine basic numbers, such as the sum or difference of any pair of them. Because this feature set requires the additional cost of

sorting, the computation is significantly increased. In Section 2.1.3, we introduce a hierarchical training algorithm to minimize the computation cost.

2.1.3 Cascade Detector Training

The cascade detector is a degenerate decision tree in which several simple classifiers (stages) are organized in a sequential manner. During the testing phase, the stages are applied successively to a region of interest until the candidate is rejected at some stage or all the stages are passed. In general, such a hierarchical structure, i.e. a cascade, can be learned by using a Boosting algorithm, which is a method of combining several weak learners into a strong learner. In Viola and Jones's framework, a variant of AdaBoosting, discrete AdaBoosting, was adopted to train the cascade detector for the task of face detection. However, for the specific purpose of embedded-text detection, we adopted not only a different boosting algorithm, BrownBoosting [8], but also a different training strategy, hierarchical training. The reasons are explained in detail as follows:

BrownBoosting: Different boosting algorithms use different strategies to re-weight the training samples. In discrete AdaBoosting, the weight of a sample is adjusted based on its classification result. If a sample is correctly classified, then the weight of this sample will be reduced for the next round of training. On the other hand, if a sample is misclassified, the weight of the

sample will remain the same. While the success of AdaBoosting is indisputable, there is a rising awareness that the algorithm is quite susceptible to noise. Dietterich has given a detailed explanation of this behavior in [9]. He shows that, in general, boosting tends to assign much higher weights to noisy samples than to legitimate ones. As a result, the hypotheses generated in the later iterations often over-fit the noisy samples. In our embedded-text detection, this weakness would significantly degrade the performance of the resulting classifier. In Figure 2.4, we show the edge maps of two letters, T and H, as well as three examples of noisy objects. We observed an interesting phenomenon: if we only consider the features proposed in Section 2.1.2, i.e. local edge-pattern, local edge-density, and global edge-density, it is very difficult to differentiate the edges of text from those of noisy background objects. Because (1) some negative samples are very similar to positive ones (as shown in Figure 2.4) and (2) millions of negative samples are required during the cascade training, a careful and thorough check of every sample is practically impossible. If we followed the original AdaBoosting algorithm, the final classifier would inevitably be less desirable. Therefore, we borrow a key idea from BrownBoosting [8], another variant of AdaBoosting. We use a non-monotone function to re-weight the misclassified samples: for small values of margin, the weight increases in a way very

similar to AdaBoosting; however, from some point onwards, the weight decreases to restrain the impact caused by the noisy examples (e.g. confusing background objects).

Hierarchical Training: We noticed that the computation costs of the three proposed sets of PIF features differ from each other. The local edge-pattern feature set is as cheap to compute as the rectangle features proposed in Viola and Jones's work, while the local and global edge-density feature sets require more time to compute. During the testing phase, the number of testing samples decreases as they pass through the cascade. Therefore, it is important to keep the early stages as fast as possible. To avoid using the expensive sets of features at early stages of the cascade, we adopted a hierarchical training strategy. The training procedure is as follows: (1) for the first eleven stages, we only used the local edge-pattern feature set; (2) from the 12th to the 14th stage, we added the second feature set, which is the local edge-density set; (3) from the 14th stage on, we applied all three sets of features until the training process converged and the final stopping criterion was reached. Our experimental results showed that the hierarchical training scheme significantly boosted the detection performance without much compromise on the total computation cost. In other words, the hierarchical training scheme helps the boosting algorithm to find the best combination

of the PIF feature sets by balancing the detection performance and computation cost.

Figure 2.4: Edge map of text and confusing background objects

2.1.4 Experiments and Discussions

For the cascade detector training, we used 5,000 positive samples and 5,000 negative samples in each stage. The positive samples were manually labeled and selected from a pool of web images. Each positive sample contained a single English letter. The negative samples were dynamically generated from a selected set of Corel images [4]. Table 2.1 shows the details of the 18-stage cascade detector and its stage classifiers. The target criteria for each stage were a detection rate of 0.97 and a false-positive rate of 0.60. In other words, each stage classifier must pass at least 97% of the positive samples while rejecting at least 40% of the negative ones. To save computation cost, the local edge-density and global edge-density feature sets were only used in later stages. As shown in Table 2.2, once the new feature sets were added, the number of features selected in that stage was greatly

reduced. This behavior indicated that the classification accuracy and efficiency can be further improved by combining the three PIF feature sets.

Table 2.1: 18-stage cascade detector using BrownBoosting
Columns: Stage, Features, Detection Rate, False Positive
Stage groups by feature sets used: Local Edge-Pattern; Local Edge-Pattern, Local Edge-Density; Local Edge-Pattern, Local Edge-Density, Global Edge-Density

In Figure 2.5, we show the top three features selected in the first round of the boosting. The depicted features are considered the most discriminative ones. An intuitive explanation of these three patterns is as follows: Figure 2.5a represents a horizontal line, Figure 2.5b represents a vertical line, and Figure 2.5c

corresponds to a half circle. These patterns appear frequently as parts of the text edges.

Figure 2.5: Top three features selected by BrownBoosting

Table 2.2: Feature sets comparison
Columns: Stage, Features, Detection Rate, False Positive
Rows: w/ Local Edge-Density; w/o Local Edge-Density; w/ Global Edge-Density; w/o Global Edge-Density

Figure 2.6 shows several negative samples which were misclassified when the local edge-pattern feature set was used alone. These samples were correctly classified after adding the local and global edge-density feature sets. This example

clearly demonstrates the effectiveness of the edge-density feature sets and supports the rationale behind the design.

Figure 2.6: Tough case of negative training samples

As shown in Table 2.1, the final detection rate is around 80%. This is relatively low compared with face-detection applications, where the common detection rate is around 90%. However, faces appear in images as independent objects, while text objects appear as regions. This implies that if a text object is detected at some location, there should be text objects of similar size at nearby positions. These additional spatial and geometric regularities can be used not only to enhance the overall detection accuracy but also to reduce the overall false-positive rate. In the following section, we present the details of our embedded-text detection framework.

2.2 Detection Framework

Here we describe two detection frameworks: the proposed unified one-step framework and the traditional two-step framework. In the one-step framework, the

cascade detector acts as a full-image scanner, while in the two-step framework, the cascade detector acts as the text-region classifier. In the following subsections, we present both frameworks in detail.

One-Step Framework

In the one-step framework, the cascade detector is applied to the entire image at every location and at various scales. For the task of text detection, we use a sliding window to scan through the entire image. The scanning starts with an initial window size. After each full image scan, the sliding window is enlarged by 10% until it is larger than the stopping window size or the image size. In our experiments, we found that an initial window size of 12x10 and a stopping window size of 72x60 are sufficient to detect embedded text in web images. Once a sub-window is detected as text, its neighboring windows are examined and aggregated to form the text region.

In addition to the cascade detector, there are three major components in the proposed one-step framework: the locality filter, the search algorithm, and post-processing. The locality filter centers the detection window on the text object. The search algorithm helps the cascade detector find the text line. Then, the post-processing cleans up the detection results and merges them into regions. These components guide the cascade detector to the right starting points

and optimize the detection performance during the full image scan. They are discussed in detail in the following sub-sections.

Figure 2.7: One-Step Framework Flow

During the scanning, the corresponding PIF features of the sub-window are calculated using integral images, and their values are sent to the cascade detector to determine whether the sub-window contains text.

A. Locality Filter

Since the PIF features used in the cascade detector are completely position independent, they are insensitive to changes in the location of the text within a window. An example is depicted in Figure 2.8. The cascade detector and PIF features cannot tell the difference between the left and right samples without additional help. To avoid this problem, we define three locality filters

to tackle the horizontal and vertical shifts of text objects in the detection sub-windows. The three locality filters are the central, vertical, and horizontal locality filters. The locality filters serve the following purposes: (1) to center the detection sub-window on the edge-rich region by eliminating sub-windows that do not contain enough edge points in the center region; (2) to reduce the number of detected sub-windows around the same text object; (3) to reject sub-windows that do not lie entirely on the text line by eliminating sub-windows with vertical shifts; (4) to pinpoint the start and end positions of a text line by eliminating sub-windows with left and right horizontal shifts.

Figure 2.8: Example of Locality Shifts

The locality filters act as the gatekeeper and can be considered the first stage of the cascade detector. Although the locality filters are not trained with a boosting algorithm, they are empirically verified to meet the same detection-rate criterion, i.e., to maintain a near-100% detection rate.

The evaluation of these filters is based on rectangle features, which can be calculated from the integral image of edge points in the original image. Figure 2.9 shows the rectangle features used in the three locality filters. These rectangle features are similar to some of the extended features in [14].

The central locality filter examines the number of edge points in the center region of a sub-window. The area of the center region is set to 81% of the overall area of the sub-window (9/10 in width and height). The sub-window is rejected if the number of edge points in the center region is lower than a predefined threshold. In our implementation, we set the threshold to 60% of the total number of edge points in the sub-window.

The vertical locality filter divides the sub-window into top, middle, and bottom regions, then examines the number of edge points in these regions. The height of each region is set to 10% of the sub-window height. If the number of edge points in any of the regions is zero, the sub-window is rejected. Figure 2.10 shows an example of the vertical locality filter on the letter A.

The horizontal locality filter divides the sub-window into left and right regions, then examines the number of edge points in them. The width of each region is set to 10% of the sub-window width. If the number of edge points in one of the regions is zero while the other is not zero, the sub-window is rejected. The reason is that some letters are very narrow, such as the letter I, while

others are nearly square, such as the letters A and D. If the detection sub-window aligns with the letter, the letter should overlap with both regions (square letters) or with neither region (narrow letters). Figures 2.11, 2.12, and 2.13 show examples of the horizontal locality filter on the letters A, I, and L, respectively.

Figure 2.9: Locality-Filter Rectangle Features

Figure 2.10: Vertical Locality-Filter Example of Letter A
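The three locality filters described above can be sketched in a few lines on top of an integral image of the binary edge map. This is only an illustrative reading of the text, not the thesis implementation: the function names are hypothetical, the middle band of the vertical filter is assumed to be vertically centered, band sizes are rounded down to whole pixels, and a window with no edge points at all is assumed to be rejected by the central filter.

```python
def integral_image(edges):
    """Summed-area table with a zero border: ii[y][x] = sum of edges[:y][:x]."""
    h, w = len(edges), len(edges[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += edges[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Number of edge points in the rectangle with top-left (x, y) and size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def central_ok(ii, x, y, w, h, ratio=0.6):
    """Keep the window only if the center region (9/10 of width and height)
    holds at least `ratio` of the window's edge points."""
    total = box_sum(ii, x, y, w, h)
    cw, ch = max(1, round(0.9 * w)), max(1, round(0.9 * h))
    cx, cy = x + (w - cw) // 2, y + (h - ch) // 2
    return total > 0 and box_sum(ii, cx, cy, cw, ch) >= ratio * total


def vertical_ok(ii, x, y, w, h):
    """Reject if the top, middle, or bottom band (10% of the height each) is empty."""
    band = max(1, h // 10)
    tops = (y, y + (h - band) // 2, y + h - band)  # middle band position is an assumption
    return all(box_sum(ii, x, top, w, band) > 0 for top in tops)


def horizontal_ok(ii, x, y, w, h):
    """Reject if exactly one of the left/right bands (10% of the width each) is empty."""
    band = max(1, w // 10)
    left = box_sum(ii, x, y, band, h)
    right = box_sum(ii, x + w - band, y, band, h)
    return (left == 0) == (right == 0)


def passes_locality_filters(ii, x, y, w, h):
    """Gatekeeper applied before the cascade detector sees the sub-window."""
    return (central_ok(ii, x, y, w, h)
            and vertical_ok(ii, x, y, w, h)
            and horizontal_ok(ii, x, y, w, h))
```

Because every test is a handful of `box_sum` lookups, the cost per sub-window is constant regardless of window size, which is consistent with the filters acting as a cheap first stage.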

Figure 2.11: Horizontal Locality-Filter Example of Letter A

Figure 2.12: Horizontal Locality-Filter Example of Letter I

B. Search Algorithm

In face detection, since faces have no correlation with each other, a random and exhaustive scan of the entire image is necessary in order to detect all the independent faces in the image. Text objects, on the other hand, appear in an image as horizontal regions and have strong spatial and geometrical correlation with each other along the same line. If a text object is detected at a particular location, its neighboring sub-windows are likely to be text objects of the same height as well. This spatial and geometrical coherence can be used to (1)

verify whether a detected sub-window is indeed text, hence reducing false positives; and (2) effectively detect text objects along the same line.

Figure 2.13: Horizontal Locality-Filter Example of Letter L

In this paper, we propose an efficient search algorithm which captures the spatial and geometrical coherence among text objects on the same text line. The search algorithm works as follows:

For each sub-window size, WxH, we divide the image into 20 bands horizontally. The width of each band is equal to the width of the sub-window, W. Depending on the sub-window width, W, each band may or may not overlap with its neighbors.

For each band, we do a vertical sliding scan. Each vertical scan steps 2 pixels from the previous one toward the bottom of the image until it reaches the end.

During the vertical sliding scan, if a sub-window is detected as text at band I, its ten neighboring sub-windows to the left and to the right of the detected location are scanned. Each sub-window steps by half of the sub-window width to the left or to the right. If fewer than half of the neighboring sub-windows contain text, the original sub-window is regarded as a false positive and discarded. This step is the text verification step.

If the sub-window is verified as a text sub-window, we do a horizontal sliding scan starting from band I to band I-1 and from band I to band I+1. Each sub-window steps by half of the sub-window width to the left or to the right. The sub-windows detected consecutively are merged into a single text rectangle. This step is the horizontal sliding step.

The vertical sliding scan and horizontal sliding scan are repeatedly applied to the image until the entire image has been visited.

Figure 2.14 shows the diagram of the search algorithm. The search algorithm makes use of the spatial and geometrical coherence between text objects along the same line. Figure 2.16 shows the raw detection results for the images in Figure 1.1 after the vertical and horizontal sliding scans.
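The vertical scan, text verification, and horizontal sliding steps can be sketched as follows for a single sub-window size. This is a minimal sketch under several assumptions that the text leaves open: `is_text` is a hypothetical callable standing in for the locality filters plus cascade detector, the "ten neighboring sub-windows" are read as five on each side, neighbors that fall outside the image are ignored, and a location already covered by a merged rectangle on the same row is skipped.

```python
def smart_scan(img_w, img_h, win_w, win_h, is_text,
               n_bands=20, v_step=2, n_neighbors=10):
    """Return merged text rectangles (x, y, w, h) for one sub-window size."""
    half = max(1, win_w // 2)          # horizontal step: half the window width
    max_x = img_w - win_w              # right-most valid left edge
    if max_x < 0 or img_h < win_h:
        return []
    # Left edges of the bands, spread evenly across the image width.
    band_xs = sorted({round(i * max_x / max(1, n_bands - 1))
                      for i in range(n_bands)})
    rects = []

    def in_bounds(x):
        return 0 <= x <= max_x

    def covered(x, y):
        # True if this window already lies inside a merged rectangle on row y.
        return any(ry == y and rx <= x and x + win_w <= rx + rw
                   for rx, ry, rw, _ in rects)

    for x0 in band_xs:
        for y in range(0, img_h - win_h + 1, v_step):   # vertical sliding scan
            if covered(x0, y) or not is_text(x0, y, win_w, win_h):
                continue
            # Text verification: check neighbors at half-width steps.
            per_side = n_neighbors // 2
            offsets = [k * half for k in range(-per_side, per_side + 1) if k != 0]
            neighbors = [x0 + d for d in offsets if in_bounds(x0 + d)]
            hits = sum(is_text(x, y, win_w, win_h) for x in neighbors)
            if 2 * hits < len(neighbors):
                continue                                 # fewer than half: false positive
            # Horizontal sliding: extend left and right while text continues.
            left = right = x0
            while in_bounds(left - half) and is_text(left - half, y, win_w, win_h):
                left -= half
            while in_bounds(right + half) and is_text(right + half, y, win_w, win_h):
                right += half
            rects.append((left, y, right + win_w - left, win_h))
    return rects
```

In this sketch an isolated detection with no text-like neighbors is dropped at the verification step, while a run of consecutive detections collapses into one rectangle, which is the behavior the two steps are described as providing.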

Figure 2.14: Diagram of the search algorithm

C. Post-Processing

Due to the exhaustive scan over the entire image, there are usually multiple text rectangles along the same text line. These rectangles can be neighboring, overlapping, or duplicate. Two rectangles are overlapping if their bounding boxes overlap. Two rectangles are neighboring if they do not overlap but the distance between them is within a threshold (in our application, we set the threshold to 10% of the width of the shorter rectangle). A rectangle is a duplicate if the major part of it is covered by another one (in our application, we set the threshold to 80%).

In the post-processing step, actions are carried out as follows: (1) overlapping text rectangles along the same line are merged into a single rectangle; (2) duplicate rectangles are removed; (3) single rectangles that are narrow or short are removed. As in [22], detected rectangles are partitioned into subsets based on their height and centroid location on the Y-axis. For two rectangles in the same subset, if their bounding boxes overlap, they are merged by enlarging one of the rectangles to include the other. This process is repeated until there are no more overlapping text rectangles. After the merging process, each partition usually yields a single text line, sometimes a few. Figure 2.16 shows examples of raw detection results. Figure 2.17 shows the final detection results after post-processing.

Two-Step Framework

The cascade detector can also be applied in the two-step framework. In this case, the cascade detector works as a text-region classifier. There are two advantages of using the cascade detector for classification. First, the cascade detector is relatively fast. Second, the cascade detector can achieve a very high ratio of detection rate to false-positive rate.

To classify a text region, we use a sliding window to scan the entire region. The height of the sub-window is set to the height of the text region, and the width of the sub-window is set to 1.2 times the height. Each subsequent sub-window

steps 2 pixels from the previous one. The final decision for the text region is determined by the majority vote of the individual sub-windows.

Figure 2.15: Two-Step Framework Flow

2.3 Experiments

We have designed and conducted experiments to examine the following questions:

The effectiveness of the locality filters.
The comparison between the cascade detector and other learning-based classifiers, particularly Support Vector Machines (SVM).
The performance comparison between the proposed top-down framework and traditional bottom-up approaches.
The computation cost comparison between the proposed top-down framework and traditional bottom-up approaches.

Metrics

To evaluate the performance of the proposed embedded-text detector, we defined two types of metrics: pixel-based and region-based. Before giving the definitions of the metrics, we define the following types of text regions:

detected region: a text region reported by the text detector.
matched region: a detected text region which covers more than 90% and less than 110% of the ground-truth region. In other words, a detected text region matches a ground-truth region if they are similar in position and size.
refined region: a matched region whose top and height are set to those of the matching ground-truth region. This is done because there is sometimes a small shift between the detected region and the ground truth. The refined region compensates for this shift and better reflects the detection results.

The pixel-based metric is measured as the ratio of the number of correctly detected pixels to the total number of ground-truth pixels. The DR_{pixel} (Detection Rate) and FP_{pixel} (False Positive) are defined as follows:

\[ DR_{pixel} = \frac{\text{area of refined regions}}{\text{area of ground-truth regions}} \tag{2.1} \]

For the pixel-based metric, false positives come from the following two sources: (1) If a detected region is larger than the ground-truth region it matches, the region is considered matched; however, the exceeding part is counted as false positive. (2) If a detected text region does not match any of the ground-truth regions, the entire region is counted as false positive.

\[ FP_{pixel} = \frac{\text{area of mismatched regions}}{\text{area of the image}} \tag{2.2} \]

The region-based metric is measured as the ratio of the number of correctly detected regions to the total number of ground-truth regions. The false positive is defined as the ratio of the number of mismatched regions to the number of detected regions. The DR_{region} (Detection Rate) and FP_{region} (False Positive) are defined as follows:

\[ DR_{region} = \frac{\text{\# of matched regions}}{\text{\# of ground-truth regions}} \tag{2.3} \]

\[ FP_{region} = \frac{\text{\# of mismatched regions}}{\text{\# of detected regions}} \tag{2.4} \]
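As an illustration of Equations (2.3) and (2.4), the region-based metrics can be computed as in the sketch below. The matching rule is one possible reading of the "more than 90% and less than 110%" criterion and is an assumption: a detected rectangle is taken to match a ground-truth rectangle when their overlap exceeds 90% of the ground-truth area and the detected area is below 110% of it. All function names are hypothetical, and rectangles are `(x, y, w, h)` tuples.

```python
def area(rect):
    """Area of an (x, y, w, h) rectangle."""
    _, _, w, h = rect
    return w * h


def overlap_area(a, b):
    """Area of the intersection of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ow = min(ax + aw, bx + bw) - max(ax, bx)
    oh = min(ay + ah, by + bh) - max(ay, by)
    return max(0, ow) * max(0, oh)


def matches(det, gt, lo=0.9, hi=1.1):
    """Assumed matching rule: similar position (overlap) and similar size (area)."""
    return overlap_area(det, gt) > lo * area(gt) and area(det) < hi * area(gt)


def region_metrics(detected, ground_truth):
    """Return (DR_region, FP_region) following Equations (2.3) and (2.4)."""
    matched = sum(1 for g in ground_truth
                  if any(matches(d, g) for d in detected))
    mismatched = sum(1 for d in detected
                     if not any(matches(d, g) for g in ground_truth))
    dr = matched / len(ground_truth) if ground_truth else 0.0
    fp = mismatched / len(detected) if detected else 0.0
    return dr, fp
```

The pixel-based metrics in Equations (2.1) and (2.2) follow the same pattern with areas in place of counts, after snapping each matched rectangle's top and height to its ground-truth region.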

Experimental Results

Locality-Filter Verification

In this experiment, we verify the performance of the locality filters. To conduct this experiment, we prepared three sets of data: the positive set, the vertical shift set, and the horizontal shift set. They are summarized as follows:

Positive set: 1,000 positive samples randomly selected from the cascade detector training samples.
Vertical shift set: 5,000 samples manually generated from the positive set. Each sample in the positive set was used to generate five samples by shifting it vertically by a random amount between zero and half of the sample height.
Horizontal shift set: 5,000 samples manually generated from the positive set. Each sample in the positive set was used to generate five samples by shifting it horizontally by a random amount between zero and half of the sample width.

Table 2.3: Locality-Filter Verification
Columns: Filter, Positive Set, Vertical Shift Set, Horizontal Shift Set
Rows: Central Filter; Vertical Filter; Horizontal Filter; Overall

Table 2.3 shows the experimental results. As we can see, the proposed locality filters maintain a detection rate of nearly 99% while rejecting over 50% of the sub-windows with vertical or horizontal shift. Also, the vertical locality filter is very powerful in rejecting sub-windows with vertical shift. This filter can help the cascade detector find the text line more precisely.

Framework Experiments

In this experiment, we compared the performance of different frameworks: image analysis based [6, 12, 13, 15, 23], image analysis + machine learning based [2, 25, 38], image analysis + cascade detector, and the proposed framework. To conduct this experiment, we extracted 2,000 images from a pool of spam emails for training and testing. We manually labeled each image and identified 4,461 text regions. We reserved 3,000 text regions (1,435 images) for training Support Vector Machine (SVM) classifiers. The rest of the images were used for testing. The datasets are summarized as follows:

Testing dataset: 565 images, with 1,461 labeled regions in total.
SVM training positive dataset: 1,435 images, with 3,000 labeled regions in total.
SVM training negative dataset: 3,000 regions randomly cropped from a pool of Corel images.

We implemented two commonly used features to train the SVM classifiers. The first was the distance map [2], and the second was the star-like pixel pattern [38]. The distance map is generated by transforming every pixel from its raw value to its distance to the closest edge point. The star-like pixel pattern is extracted directly from the gray level using a star pattern. The SVM program we used for this experiment was SVMTorch [3]. The two implementations are referred to as SVM1 and SVM2, respectively, in Table 2.4 and Table 2.5.

Table 2.4: Performance Comparison
Framework                   Pixel-based    Region-based
Proposed                    93.60/         /3.70
Image Analysis              90.68/         /14.50
Image Analysis + SVM        /              /15.13
Image Analysis + SVM        /              /15.89
Image Analysis + Cascade    87.76/         /9.89

Table 2.4 shows the comparison results. There are three observations. (1) Clearly, the proposed embedded-text detection framework outperforms the rest. (2) For the other frameworks, which are based on an image-analysis step, the detection rate is bounded by that step. (3) The cascade-detector-based framework yields a relatively lower false-positive rate than the others.
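For reference, the distance-map feature used for the first SVM classifier (every pixel replaced by its distance to the closest edge point) can be sketched with a multi-source breadth-first search. The text does not say which distance is used, so the city-block (4-neighbor) distance here is an assumption, as is the function name; a Euclidean transform would be an equally plausible choice.

```python
from collections import deque


def distance_map(edges):
    """Map each pixel to its city-block distance to the nearest edge point.

    `edges` is a 2-D list of 0/1 values. Pixels stay None if the map has no
    edge points at all.
    """
    h, w = len(edges), len(edges[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    # Seed the search with every edge point at distance 0.
    for y in range(h):
        for x in range(w):
            if edges[y][x]:
                dist[y][x] = 0
                queue.append((y, x))
    # Breadth-first expansion visits pixels in order of increasing distance.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist
```

The resulting map is as large as the region itself, so an SVM trained on it works with a high-dimensional input, which is one plausible reason for the higher Step2 cost reported for SVM1.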

Computation Cost

In this experiment, we compared the computation cost of the different frameworks. The experiment was carried out on a 2.8 GHz Pentium 4 Linux machine with 1 GB of memory. Table 2.5 shows the experimental results. As indicated in the table, the proposed embedded-text detection system took only 0.32 seconds per image, while the SVM1- and SVM2-based systems required 1.98 and 0.54 seconds, respectively.

Table 2.5: Computation Cost Comparison
Framework                   Step1    Step2    Overall
Proposed                    na       na       0.32s
Image Analysis              0.11s    na       0.11s
Image Analysis + SVM1       0.11s    1.87s    1.98s
Image Analysis + SVM2       0.11s    0.43s    0.54s
Image Analysis + Cascade    0.11s    0.26s    0.37s

Comparing the SVMs and the cascade alone (Step2), the cascade approach is more efficient than the SVM-based ones. The star-like pixel pattern in SVM2 is based on raw pixels, so its cost is close to that of the cascade; however, the cascade is still faster because it generally requires less computation than an SVM.

Memory Consumption

The memory consumption of the embedded-text detector is higher than that of the face detector. In the face-detection problem, rectangle features can be calculated from a single integral image. In the proposed text detector, multiple integral images have to be built. To reduce the memory consumption, the image can be divided into sub-regions horizontally or vertically. For example, if the image is divided into four regions, the memory consumption can be reduced to 1/4 of the original without much computation overhead.
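One way to realize the sub-region idea is to build the integral images one horizontal strip at a time. The sketch below only computes the row ranges of the strips; it assumes (this is not stated in the text) that consecutive strips overlap by one window height minus one row, so that every sub-window lies entirely inside some strip. The function name is hypothetical.

```python
def strip_ranges(img_h, n_strips, win_h):
    """Yield (top, bottom) row ranges, bottom exclusive, for strip-wise processing.

    Each strip is extended by win_h - 1 rows so that any window whose top row
    falls in the strip's own share of the image fits inside the strip.
    """
    base = -(-img_h // n_strips)          # ceiling division: rows per strip
    for i in range(n_strips):
        top = i * base
        if top >= img_h:
            break
        yield top, min(img_h, top + base + win_h - 1)
```

With a 100-row image, four strips, and a 10-row window, the tallest strip has 34 rows rather than 25, so under this assumption the saving is slightly less than the ideal 1/4 and shrinks further as the window grows toward the strip height.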

Figure 2.16: Embedded-Text Detection Raw Results

Figure 2.17: Embedded-Text Detection Final Results

Chapter 3

Visual-Based Anti-Spam Filtering

With the increasing importance of email and the incursions of Internet marketers, unsolicited commercial email (also known as spam) has become a major problem on the Internet. Junk email has been recognized as a problem since 1975 [27]. It was not a serious concern until marketers began to flood the system, overtaxing the resources of Internet Service Providers (ISPs).

Since the late 90s, several anti-spam filtering solutions have been proposed. In general, these approaches treat the spam filtering problem as a text classification or categorization problem, employing various machine learning techniques to solve the problem. In [31], the authors proposed using a Naive Bayesian classifier to filter junk emails. Some researchers have also suggested using Support Vector Machines (SVM) [33] and decision trees

[34] for this task. These text-based approaches have achieved remarkable accuracy in filtering spam emails.

However, there are two major limitations to these text-based approaches. First, spammers often use various tricks to confuse text-based anti-spam filters [28]. Examples of these tricks are text obfuscation, random space or word insertion, HTML layout, and text embedded in images. Second, as the scale and capacity of the Internet continue to grow, the type of information in emails has become more diverse. The genre of content has moved from text-only to multimedia-enriched. These limitations greatly reduce the effectiveness of existing text-based anti-spam filters.

The key issue behind these challenges is that the type of content in emails has shifted from text-based to visual-based. On one hand, legitimate message senders have started to add more multimedia content, particularly images, to text-only emails to enrich the message. On the other hand, spammers mask spam emails with unreadable content, such as images, and employ HTML-based tricks to confound text-based anti-spam filters. The raw content of these spam emails may not make any sense, but the key messages are rendered visible when recipients open them.

Since visual information is becoming more prevalent in emails, it becomes increasingly necessary to use such information to achieve high accuracy for anti-spam filtering. Our research team has investigated ways of using visual information, particularly images, in anti-spam filtering. We studied the spam emails to analyze the characteristics of the visual information in spam. One noticeable characteristic is the type of images used in spam emails. These images are usually artificially generated and contain embedded text (i.e., text boxes embedded into image files).

In this work, we propose a novel anti-spam filter, which classifies multimedia-enriched emails based on their visual information. Specifically, the proposed anti-spam filter extracts image features and uses a one-class SVM classifier to decide whether an unseen email is in the spam category. Experimental results show that, for emails containing image data, the proposed anti-spam filter can achieve a detection rate of 80% or more with less than 1% false positives. In addition, the proposed anti-spam filter can work with existing text-based filters. In comparison with the Bayesian text-based filter used in Thunderbird (a mail client of Mozilla [29]), our proposed filter can improve the spam detection rate from 47.7% to 83.9% for the validation set derived from the SpamArchive dataset [30].

The main contributions of this paper are summarized as follows:

We introduce the use of visual information for anti-spam filtering. By thoroughly analyzing a large collection of spam emails, we demonstrate that

useful features and parameters can be derived from images in spam emails for the purpose of anti-spam filtering.
Based on such visual features and parameters, we propose a novel visual-based anti-spam filter. The proposed filter, used in conjunction with existing text-based filters, can improve the filtering accuracy.
We have successfully integrated the proposed anti-spam filter with Thunderbird and demonstrated very promising results.

The rest of this chapter is organized as follows: In Section 3.1, we present the statistics of some visual parameters from a thorough analysis of more than 120K spam emails randomly downloaded from SpamArchive [30]. In Section 3.2, we present the proposed filtering system in detail. Then, in Section 3.3, we present experimental results to show the effectiveness of the proposed anti-spam filter.

3.1 The Spam Datasets and Analysis

We prepared the following datasets for analysis and experiments:

SpamArchive dataset: 122,877 spam emails.
Ling-Spam dataset: 2,412 legitimate emails from the Linguist mailing list and 481 spam emails [32].

The SpamArchive dataset was randomly downloaded from the SpamArchive website. The total number of downloaded and processed spam messages is 122,877. The Ling-Spam dataset is a public anti-spam filtering corpus [32]. For this work, we used the Ling-Spam dataset to train the Bayesian anti-spam filter in Thunderbird. Using these two datasets and the trained Bayesian anti-spam filter, we intended to show experimentally the limitations of text-based anti-spam filters, and also to demonstrate the additional power they can gain with our visual-based anti-spam filter.

In the SpamArchive dataset, 37.76% (46,395/122,877) of the emails contain images. Among those emails containing images, only 43.72% (20,283/46,395) contain accessible images. Many emails did not have images explicitly attached but provided links to the images. A fraction of these links were no longer accessible at the time we processed them. We further analyzed these 20,283 spam emails. The statistics are shown in Table 3.1.

Table 3.1: Statistics of emails containing accessible images
Type                           No. of emails    Percentage
w/ image with embedded-text    16,              %
w/ banner or graphics          19,              %
w/ external image              19,              %
blocked by Bayesian filter     9,               %

Figure 3.1: Spam with embedded-text in image example

The results shown in Table 3.1 clearly indicate that if a spam email contains images, the images are likely to be artificially generated and external (i.e., not explicitly attached to the email). These images may be banners/graphics or text boxes embedded within the images. Figure 3.1 shows an example of a spam email that contains only an embedded-text message and no traditional text. The message embedded in the spam email is clearly visible when rendered by email clients, even though the email contains no text at all. Figure 3.2 shows an example of a computer-generated image with embedded-text regions. In the following section, we introduce the ideas behind our visual-based anti-spam filter.

3.2 The Anti-Spam Filter

For each email, we extract a set of features from the images contained in the email. The set of features is then used for classification, where the one-class Support Vector Machine (SVM) is used as the base classifier. In the following subsections, we discuss the two major components, the features and the classifier, respectively.

Feature Description

Based on the observations summarized in Section 3.1, we define three sets of features as follows:

Embedded-text features
Banner and graphic features
Image location features

More and more frequently, spam emails embed text messages in images to get around text-based anti-spam filters. To detect such devious techniques, it would be helpful to know (1) whether there is embedded text in the images and (2) if so, the area of text regions vs. the total image area. To derive such information, we have developed a text-in-image detector which is capable of detecting the text region(s) in an image. The details of the detector will be described later. We use

this text-in-image detector to scan through each image in the email and derive the following embedded-text features: (1) the total number of text regions detected in all images in the email, (2) the percentage of images with detected embedded-text regions, and (3) the ratio of the pixel count of the detected text regions to that of the overall image area. Figure 3.3 shows an example of an image with identified text regions.

Many of the images in spam emails are banners and computer-generated graphics which are part of advertisements. We have developed a banner detector and a graphics detector. Banner images are usually very narrow in width or height. Also, banner images usually have a large aspect ratio vertically or horizontally. Graphic images, on the other hand, usually contain a homogeneous background and very little texture. Using these detectors, we can extract the following banner and graphic features: (1) the ratio of the number of banner images to the total number of images, and (2) the ratio of the number of graphic images to the total.

Spammers usually put the images behind web servers and create references in the emails to save server and network resources. This is in contrast to personal emails, where images are usually attached to the emails. We define the image location feature to be the ratio of the number of external images to the total number of images in the email.
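The six features defined above can be assembled into a per-email feature vector as in the following sketch. It assumes the detectors have already run, so each image carries its detected text rectangles and three boolean flags; the `ImageInfo` container, its field names, and the ordering of the vector are all hypothetical, and text rectangles within an image are assumed not to overlap.

```python
from dataclasses import dataclass, field


@dataclass
class ImageInfo:
    """Hypothetical per-image record produced by the detectors."""
    width: int
    height: int
    text_regions: list = field(default_factory=list)  # (x, y, w, h) tuples
    is_banner: bool = False
    is_graphic: bool = False
    is_external: bool = False


def email_features(images):
    """Return [n_text_regions, frac_images_with_text, text_area_ratio,
    banner_ratio, graphic_ratio, external_ratio] for one email."""
    n = len(images)
    if n == 0:
        return [0.0] * 6
    n_regions = sum(len(im.text_regions) for im in images)
    with_text = sum(1 for im in images if im.text_regions)
    text_area = sum(w * h for im in images for (_, _, w, h) in im.text_regions)
    total_area = sum(im.width * im.height for im in images)
    return [
        float(n_regions),
        with_text / n,
        text_area / total_area if total_area else 0.0,
        sum(im.is_banner for im in images) / n,
        sum(im.is_graphic for im in images) / n,
        sum(im.is_external for im in images) / n,
    ]
```

All features except the raw region count are ratios in the range 0 to 1, so only the first entry would need rescaling before being passed to a classifier that is sensitive to feature scale.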

Feature Extraction

Banner and Graphic Feature Extraction

Since banner images carry certain geometric characteristics, they can be detected by using a very simple rule-based detector. In extracting banner features, we first use the rule-based detector to check the size and aspect ratio of images; then, we calculate the corresponding features according to the detected number of banner images.

Because computer-generated graphics usually contain homogeneous color patterns, they contain almost no texture at fine resolution. To extract graphics features, we first apply a wavelet transformation to the input images. Then, we extract texture features in three orientations (vertical, horizontal, and diagonal) at fine resolution. If any of these extracted texture features falls below a predefined threshold, the image is likely to be a computer-generated graphic. We calculate the graphic features based on the detected number of graphic images.

Embedded-Text Feature Extraction

Several text-detection and text-recognition methods have been proposed in the past. Previous text-detection methods often followed a multi-step framework using a combination of image analysis and machine learning techniques. While a multi-step framework greatly simplifies the problem by dividing the problem

into several sub-tasks, the overall detection accuracy is the product of the accuracies of the individual steps and is bounded by the lowest one. Since web and email images are diverse, previous text-detection methods could not work well if any of the steps yielded a low detection rate.

We have developed a unified embedded-text detector particularly for web and email images. Based on Viola and Jones's object detection framework [22], we define position-independent features to capture the essence of characters and a smart-scan algorithm to trace text lines using their spatial and geometrical regularities. The embedded-text detector can be applied at every location and at every size of the image.

The heart of the embedded-text detector is the boosted cascade detector. The cascade detector is a degenerate decision tree which acts as a rejection-based classifier. In each stage of the cascade detector, a classifier is trained using BrownBoosting [8], which is a variant of AdaBoosting [7], to reject as many negative samples as possible while keeping almost all the positive samples. As a result, the cascade detector can yield a very high ratio of detection rate to false-positive rate. Figure 3.2 shows an example of an image with embedded text. Figure 3.3 shows the detection result of our embedded-text detector. We calculate the embedded-text features based on the detected text regions, such as the number and area of regions.
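The rejection-based behavior of the cascade can be illustrated with a short sketch. The structure of the weak learners is not given in this section, so the thresholded single-feature form used here (in the style of Viola and Jones) is an assumption, and all names are hypothetical. The second function shows how per-stage rates compound across stages, using the per-stage targets reported for the 18-stage detector in Chapter 2 (0.97 detection rate, 0.60 false-positive rate).

```python
def cascade_classify(features, stages):
    """Return True only if the sample passes every stage.

    `stages` is a list of (weak_learners, stage_threshold) pairs; each weak
    learner is (feature_index, threshold, polarity, weight) and votes with
    its weight when polarity * value < polarity * threshold.
    """
    for weak_learners, stage_threshold in stages:
        score = 0.0
        for idx, thr, polarity, weight in weak_learners:
            if polarity * features[idx] < polarity * thr:
                score += weight
        if score < stage_threshold:
            return False      # rejected early; later stages are never evaluated
    return True


def compounded_rate(per_stage_rate, n_stages):
    """Overall rate if every stage independently achieves `per_stage_rate`."""
    return per_stage_rate ** n_stages
```

If every one of the 18 stages only just met its targets, `compounded_rate(0.97, 18)` gives an overall detection rate of roughly 0.58 and `compounded_rate(0.60, 18)` an overall false-positive rate of roughly 1e-4. Stages that exceed the 0.97 target raise the first figure, which is consistent with the final detection rate of around 80% reported in Chapter 2.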

Figure 3.2: Text embedded in image example

The Classifier

In previous approaches, the anti-spam filtering problem has typically been treated as a two-class or multiple-class classification problem. In the two-class case, researchers were trying to determine whether an unseen email was spam, whereas in the multiple-class case, the unseen emails were divided into several categories (such as commercial, financial, objectionable, health, spiritual, etc.).

One difficulty with two-class and multiple-class classification is the need for multiple sets of training samples. For example, in the Naive Bayesian approach,

one set of spam emails and one set of legitimate emails are required to train the classifier. While spam datasets are easily accessible, a representative set of legitimate emails is difficult to collect. In our anti-spam filter, we define the anti-spam filtering problem as the task of finding whether an unseen email is in the spam class. We propose using the one-class SVM [37] as the base classifier.

Figure 3.3: Detected embedded-text regions

One-Class SVM

The basic model of SVM [36] is a maximal margin classifier. Given a positive and a negative dataset, the SVM classifier maps the data from the input space to