Labeling of Textured Data with Co-training and Active Learning

M. Turtinen and M. Pietikäinen: Labeling of textured data with co-training and active learning. In Texture 2005: Proceedings of the 4th International Workshop on Texture Analysis and Synthesis.

Markus Turtinen and Matti Pietikäinen
Machine Vision Group, Electrical and Information Engineering Department
P.O.Box 4500, FI University of Oulu, Finland

Abstract

In this paper, we present a robust texture labeling method that requires minimal user interaction. Initially only a fraction of the textures needs to be manually labeled, and then a co-training procedure is used to automatically label most of the unlabeled samples. Simultaneously, an active learning framework is used to find those unlabeled samples that would provide the most information for the system if labeled. Samples found by active learning are labeled explicitly with a visualization-based approach, which provides a very user-friendly view into the data and makes it possible to learn new classes. In the experiments, the labeling framework is applied to real texture image data for building a training set for a classifier.

1. Introduction

A major problem when building a practical computer vision system is collecting an extensive labeled data set for training. Typically it is not difficult to acquire a large number of images of the targets using, for example, multiple cameras or video streaming. Problems arise when one is asked to describe their content and label the images according to their appearance. Manual labeling of each image is laborious, time consuming and prone to human error, which makes the labeling procedure a very critical task in many computer vision applications.

Real surfaces are not usually flat but contain variations at some scale. Several real-world computer vision applications might utilize proper texture information derived from images.
Much research has been done to describe, model and recognize textural properties of digital images, but only little effort has been devoted to the design of robust and user-friendly texture data labeling systems. A typical example of an application where labeling is required is texture classification. The performance of the analysis is directly connected to the success of labeling. For a human, it is typically very difficult to label textures reliably. Instead of manual labeling, it would be preferable to automate the labeling as much as possible. This defines a relevant learning problem: we want to learn interesting models from the current texture representation and provide labels for the data automatically using this information [14]. On the other hand, we should be able to determine the level of labeling confidence and prevent the system from learning wrong models for textures.

To better understand the difficulties of manual labeling of textured samples, let us consider a digital image of an outdoor view taken with a typical four megapixel camera. One should assign an individual label to each pixel or region indicating classes like sky, grass or trees, for example. It should not be too difficult to manually segment a few images into predefined categories. But huge within-class variations caused by, for example, illumination changes, non-homogeneity of objects, foreshortening, shadowing or occlusions require a lot of labeled samples per class, and one has to manually analyze several input images.

A straightforward approach for labeling textures would be to ask a human to give labels for a small subset of samples and then automatically provide labels for the rest of the data with supervised learning. This requires that the features used are very effective and that the initial samples represent all classes reliably. Often these restrictions are too strict in real-world applications.
Invariant feature sets, see for example [10], are effective for tackling intraclass variation and reducing the number of models needed per class, but labeling is still required.

We approach the texture labeling problem from the viewpoint of learning, utilizing both active learning [18] and co-training [2]. In active learning, those unlabeled samples that provide the most additional information are automatically selected for explicit labeling. In practice, these samples are usually the ones that are difficult to classify. Simultaneously, a great number of samples can be automatically labeled with high confidence utilizing co-training.

Several approaches for both active learning and learning from labeled and unlabeled data have been proposed. In active learning, new samples to be labeled are typically selected to maximize the performance of the classifier in some way [4, 8]. Some attempts at learning with both labeled and unlabeled data sets have also been made; see, for example, the review by Seeger [15].

Our goal is to label textures with only a small human effort. In our framework two different texture feature sets are extracted from images. The feature sets should complement each other. We chose local binary pattern (LBP) [13] and Gabor [11] features. LBP features are used to capture microstructural information of the textures. Gabor features offer larger spatial support than LBP and describe more macrostructural elements. The feature sets are then used in the co-training algorithm [2] for labeling a part of the samples automatically. Active learning, resembling the one proposed in [18], is used to select the most informative and ambiguous samples for manual labeling. The actual labeling in this phase is made with the visualization-based approach [16], utilizing the self-organizing map [9] to cluster the feature data and represent it in a low-dimensional space.

Our major contribution is a multi-class texture labeling system that is robust, easy to use and flexible. Textures are modeled with complementary features providing two separate representations of the images. Confidence-based co-training is used to automatically label those samples that can be predicted with high confidence. The visualization-based approach can be used to learn new classes during labeling, which makes the framework very flexible. The proposed method is applied to analyzing outdoor scene images and to training a classifier to classify image regions into classes like road, tree and grass, for example. In addition, the approach is validated with three other real-world texture sets.

2. Texture Labeling Framework

In our labeling system two different texture feature sets, LBP [13] and Gabor [11], are extracted from the image data.
A small number of textures is selected to be labeled initially, while most of the data remains unlabeled. At this point, we have a small labeled data set $D_l$ with labels from a finite set $Y$ of size $k$. We also have a larger unlabeled data set $D_u$ obtained with both features. Multi-class co-training and active learning are then applied to the data. In the co-training step, a part of those unlabeled samples that are labeled with high confidence are automatically added to the labeled set $D_l$. In the active learning step, some of those samples that have ambiguously predicted labels are selected for explicit labeling. The labeling is made with the visualization-based approach, and the labeled samples are added to $D_l$. With visualization we are able to learn new texture classes and increase the size of $Y$. The following subsections give a more detailed explanation of the components used.

2.1. Texture Feature Extraction

Textures are modeled with two separate feature sets providing an extensive representation of the textures. The local binary pattern (LBP) [13] distributions and Gabor filtering [11] are selected as texture measures because both have performed very well in various texture analysis problems. They also give different types of representations of the textured images, which is useful in the co-training and active learning phases.

LBP detects microstructures like curved edges, spots, flat areas etc. in the images. LBP is invariant to monotonic gray-scale variations, and it has been extended to have rotation invariant and multiscale properties.

Gabor filtering offers larger spatial support than LBP and is more likely to capture the macrostructures of a texture. The image is convolved with Gabor filters at different scales and orientations, providing a multiresolution representation of the texture. As a texture measure, the means and the standard deviations of the filtered images are used. There are also rotation invariant extensions to the Gabor filtering approach.
In principle, the proposed approach does not set strict limits on the feature sets used, but they should provide complementary information. It is also possible to combine texture with some other cue, like color.

2.2. Co-training Based on Labeling Confidence

The co-training algorithm proposed by Blum and Mitchell [2] utilizes two different views of the data and trains a pair of learning algorithms to classify samples in a two-class problem. They assume that the feature sets used are statistically independent, so that samples that are confidently labeled with one feature set might be mislabeled by the other learner. The learners can therefore train each other by allowing each one to label some of the confidently labeled samples in both the negative and positive categories. The assumption that the feature sets are conditionally independent rarely holds with real-world data distributions. Regardless of that, co-training has been successfully used, for example, in text classification [2].

Surprisingly, co-training has seldom been applied to multi-class problems. Ghani [6] presented a multi-class text categorization approach that decomposes the original classification problem into multiple binary problems using error-correcting output codes (ECOC). Co-training was then applied separately to each binary classification problem.

In our approach the original classification task is also decomposed into multiple binary problems. Allwein et al. [1] proposed a unified approach for reducing a multi-class problem into multiple binary problems to be solved by a margin classifier, like the support vector machine (SVM) [17], which is used in this paper. The decomposition can be performed via a coding matrix $M \in \{-1, 0, +1\}^{k \times l}$, where $k$ is the number of classes and $l$ is the number of binary problems. $M(r, s) = +1$ if learner $s$ should consider samples from class $r$ as positive examples, $M(r, s) = -1$ if the samples from class $r$ should be considered as negative examples, and $M(r, s) = 0$ if it does not matter how learner $s$ treats samples from class $r$. For example, for the popular one-against-all coding, $M$ is a $k \times k$ matrix in which all diagonal elements are +1 and all other elements are -1.

As there are several methods for constructing a coding matrix, there is also more than one way to find the predicted label for a sample after classifying it with multiple binary classifiers. Allwein et al. [1] proposed two different decoding methods. In the first approach, the Hamming distance between the predicted labels obtained with the different binary classifiers and the codebook rows indicating the classes is calculated. Hamming decoding entirely ignores the magnitude of the predictions, which can often indicate the level of confidence. For this reason they also proposed the loss-based decoding method

$$d_L(M(r), f(x)) = \sum_{s=1}^{l} L\big(M(r, s)\, f_s(x)\big), \qquad (1)$$

where $M(r)$ is row $r$ of the coding matrix and $f_s(x)$ is the output of binary learner $s$. After calculating the distances with some distance measure, the predicted label $\hat{y} \in \{1, \ldots, k\}$ is obtained with

$$\hat{y} = \arg\min_{r} \, \mathrm{dist}(M(r), f(x)).$$

Allwein et al. [1] suggested using the same loss function $L$ in Eq. 1 as the one used in the learning algorithm. They also obtained better results with loss-based decoding than with Hamming decoding. In our co-training algorithm the loss-based approach is used.

Even though the texture features used measure different properties of texture, they are still pure texture statistics. For this reason the features are not fully statistically independent, which might lead to erroneous label predictions if the basic co-training is used.
Therefore, we slightly modify the labeling rule typically used in co-training algorithms, where both feature sets are able to label some of the confidently labeled samples [2]. At the beginning we have two labeled data sets $D_l^1$ and $D_l^2$. Multi-class learners are applied to the data sets, and the estimated output $\hat{y}$ and the level of confidence (estimated loss) $d_L(M(y), f_D(x))$ are calculated for every $x$ in the unlabeled set $D_u$. The total loss of a prediction is obtained by summing the individual losses of the separate learning algorithms. Then the $n$ samples with the highest confidence (smallest $d_L(M(y), f(x))$) are labeled and added to the labeled sets, but only if both predictions $\hat{y}_1$ and $\hat{y}_2$ are the same. If no samples receive the same label from both learners with high confidence, then both learners are allowed to label some of their most confidently predicted samples, as in traditional co-training. This modification improves the labeling especially at the beginning of the learning, when there is only a limited number of training samples.

2.3. Active Learning of Texture Models

The active learning model used in this work is based on the approach of Yan et al. [18], who labeled video frames utilizing color features to identify different humans in the scene. The approach uses a margin-based learner and decomposes a multi-class problem into a set of binary problems. The settings used in the co-training phase can therefore be directly extended and utilized in the active learning step. The goal is to find those unlabeled samples that minimize the expected loss on the data set.

Consider a labeled data set $D_l$ of samples used to train the learner that classifies the unlabeled samples. $P(y \mid x)$ is the conditional distribution over a sample $x$, and $P(x)$ is the marginal distribution of $x$. The multi-class learner gives an estimated output $\hat{y}$ and an estimated loss $d_L(M(y), f_{D_l}(x))$ for each $x$ in the unlabeled set $D_u$.
The expected risk function of such a learner is

$$R(f_{D_l}) = E_x E_{y \mid x}\big[d_L(M(y), f_{D_l}(x))\big] = \int_X \sum_{y \in Y} d_L(M(y), f_{D_l}(x))\, P(y \mid x)\, P(x)\, dx.$$

The task of an active learner is to select some of the unlabeled samples, ask for their labels, and add the labeled samples to $D_l$ to form a new labeled data set $D_l'$. Yan et al. [18] showed that the optimal query set can be found by maximizing $R(f_{D_l}) - R(f_{D_l'})$. The maximization is usually impractical and computationally intensive due to the enormous number of combinations. There exist, however, some simple heuristic strategies that can be applied for sample selection without having to re-learn the classifier to estimate the expected risk for each combination.

One solution is to use the best worst case model, which chooses the most ambiguous samples for labeling [3]. It is based on the assumption that the loss function can be expected to be small if the confidently labeled samples are most likely correctly labeled. Now, let $y_x$ be the predicted label for sample $x$. If Eq. 1 is used for calculating the confidence of each labeling, in the best worst case we choose the sample with the maximum expected loss for the predicted label with

$$\arg\max_{x} \sum_{s=1}^{l} L\big(M(y_x, s)\, f_s(x)\big). \qquad (2)$$

In our case, we have two feature sets extracted from the same images. The total loss for the predicted label is formed by summing the individual losses, as in the co-training phase. The ambiguous samples (those for which the two learners predict different labels $\hat{y}$ and the loss value is high) are selected for explicit labeling. When the learning process progresses, the labels predicted by the two learners are most likely the same, and only Eq. 2 is used for selecting samples.
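
The selection rules of the co-training and active learning steps can be made concrete with a small sketch of one labeling round. It is an illustration under stated assumptions rather than the authors' code: the hinge loss $L(z) = \max(0, 1 - z)$ is assumed as the loss of the SVM learner, the binary outputs $f_s(x)$ are passed in as plain lists, and the function names `decode` and `labeling_round` are hypothetical. The fallback to traditional co-training and the later phase that uses Eq. 2 alone are left out.

```python
def hinge(z):
    """Hinge loss L(z) = max(0, 1 - z); assumed here as the SVM loss."""
    return max(0.0, 1.0 - z)

def decode(M, f, loss=hinge):
    """Loss-based decoding (Eq. 1): return (predicted class, its loss)."""
    dists = [sum(loss(m * fs) for m, fs in zip(row, f)) for row in M]
    y_hat = min(range(len(M)), key=dists.__getitem__)
    return y_hat, dists[y_hat]

def labeling_round(M, margins_1, margins_2, n_auto, n_query):
    """Split pool indices into auto-labeled and to-be-queried samples.

    margins_1[i] and margins_2[i] hold the binary outputs f_s(x_i) of the
    two learners (one per feature set) for pool sample i.
    """
    agree, ambiguous = [], []
    for i, (f1, f2) in enumerate(zip(margins_1, margins_2)):
        y1, d1 = decode(M, f1)
        y2, d2 = decode(M, f2)
        total = d1 + d2  # total loss: sum over both learners
        if y1 == y2:
            agree.append((total, i, y1))
        else:
            ambiguous.append((total, i))
    # Co-training: smallest total loss among samples both learners agree on.
    auto = [(i, y) for _, i, y in sorted(agree)[:n_auto]]
    # Active learning: disagreeing samples with the largest total loss.
    query = [i for _, i in sorted(ambiguous, reverse=True)[:n_query]]
    return auto, query
```

With a one-against-all matrix for three classes, a sample on which both learners give a large positive margin for the same class ends up in `auto`, while a sample with small margins and conflicting predictions ends up in `query` and would be passed on to the visualization-based labeling.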

2.4. Visualization-based Learning

Explicit labeling is made with a visualization-based method that resembles the approach presented in [16]. In that approach, the self-organizing map (SOM) [9] was applied to scene image data obtained by dividing the original images into non-overlapping image patches. A two-phase visualization of the data extracted from each training image was made, and the training set was built incrementally. Suspicious samples and samples that were hard to visualize were mapped back to the original image, which provided contextual information for the image patches and helped to determine the correct class for them.

The SOM represents the original high-dimensional data on a low-dimensional, typically 2D, grid. It preserves the topology of the data, i.e. samples close to each other in the high-dimensional space are also located close to each other in the projection space. This is a very attractive property in terms of visualization-based learning: similar data form clusters on the 2D plane, and the user is able to visualize them and rapidly label a number of samples. The user can select nodes representing some class and label all the samples inside them, or inspect the samples more carefully before selecting them. During visualization new classes can be revealed and introduced, making the whole learning process very flexible.

Compared to [16], we do not use all the data available to train the SOMs. Only those samples that the active learner requests are fed to the map. Visualization is thus easier because the map is not so dense, and only a one-stage visualization is required. Determining the label for individual image patches may be difficult without contextual information [7], and in such an application the back-projection of patches into the original image is useful.

3. Experiments

A. Outdoor Scene Image Analysis

Natural scene images from the Outex texture database [12] were used in the experiments.
Test suite ID Outex NS consists of 22 sequential outdoor scene images of pixels taken by a human walking on the street. Half of them were selected into the training image set to be used in the labeling framework, and the others were left for the validation set. There are manually defined ground truth regions for each image (sky, trees, grass, road and buildings).

To study the performance of the labeling framework, only these ground truth regions were used at first, and manual labeling with the visualization-based approach was not used at all. We selected the first and the last training image from the sequence and manually defined ground truth regions for them. The ground truth regions of the other nine training images were then used in the co-training and active learning, and the ground truth regions of the remaining eleven images were used for testing. Images were divided into blocks of pixels, and only those blocks in which every pixel belonged to some ground truth class were considered. With a block-based approach the amount of data is reduced dramatically compared to the case where all pixels in the images are covered using sliding blocks.

A support vector machine (SVM) with a radial basis function (RBF) kernel was used as the learner. Grid search and 5-fold cross-validation on the initial training data obtained from the two manually labeled images were used to estimate the SVM parameters C and γ. Multi-class co-training and active learning were then performed using the two manually labeled images as the initial training data $D_l$. The remaining nine training images were used to generate the sample pool $D_u$. LBP8,1 (dimensionality 59) and Gabor filtering at four scales and six orientations (dimensionality 48) texture features were extracted from pixels sub-images. Two different coding schemes, one-against-all and pairwise, were used in the multi-class to binary coding.
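
The two coding schemes named above can be written out explicitly as coding matrices $M$ with one row per class and one column per binary problem. The following is a minimal sketch; the function names are hypothetical, and the sign convention for the pairwise columns (first class of the pair positive, second negative) is an assumption made for illustration.

```python
from itertools import combinations

def one_against_all(k):
    """k x k coding matrix: +1 on the diagonal, -1 elsewhere."""
    return [[1 if r == s else -1 for s in range(k)] for r in range(k)]

def pairwise(k):
    """k x k(k-1)/2 coding matrix, one column per pair of classes.

    Column (a, b) treats class a as positive and class b as negative;
    all other classes get 0 and are ignored by that binary learner.
    """
    pairs = list(combinations(range(k), 2))
    return [[1 if r == a else -1 if r == b else 0 for a, b in pairs]
            for r in range(k)]

# The five scene classes of this experiment.
classes = ["sky", "trees", "grass", "road", "buildings"]
M_ova = one_against_all(len(classes))   # 5 binary problems
M_pair = pairwise(len(classes))         # 10 binary problems
```

For the five scene classes, one-against-all therefore requires five binary SVMs and pairwise coding ten, each of the latter trained on samples from two classes only.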
The algorithm was iterated for 10 rounds; in each round 100 samples were labeled with co-training and another 100 with the active learning process. Fig. 1 shows the classification results obtained with both feature sets using separate testing data. The classification is made with the same kind of SVM classifier that is used in the labeling phase, using the same parameters C and γ as before. The last results (iteration 11) are achieved by optimizing C and γ with the new training set. With fully supervised learning using all the training data available and an optimized SVM classifier, the baseline results for LBP8,1 and Gabor features were 93.5% and 89.3%. The results obtained with co-training and active learning indicate that the classification rate comes close to the optimum during learning. There is not much difference between the coding schemes used.

[Figure 1. Classification rates during learning. Vertical axis: classification rate [%]; horizontal axis: iterations; curves: LBP8,1 OneVsAll, Gabor OneVsAll, LBP8,1 Pairwise, Gabor Pairwise.]

To study the effects of the co-training and active learning phases further, a separate experiment was run where only co-training or only active learning was used. In each iteration a maximum of 50 samples were labeled automatically with co-training or explicitly with active learning and added to $D_l$. The classification performance converges more slowly with co-training than with active learning. This is reasonable because with co-training the easiest samples are labeled first and added to the training set. Active learning can boost the classification rate very fast by discovering the most ambiguous samples at the beginning of the training. This is a very important property in a data labeling system.

For this reason, an experiment was run where active learning was first applied four times to the data and 50 samples were explicitly labeled. After that, co-training was used to label the remaining samples automatically, and the active learning step was applied in every fifth iteration to improve the labeling. With these settings the classification rate increased from 85.5% to 90.3% during the first five iterations with LBP8,1 and from 83.8% to 86.9% with Gabor features. At the end of the learning the rates were 93.3% and 89.1%, respectively, which are very close to the optimum. It can be noted that very good classification performance can be achieved with only a small labeling effort.

In the last experiment all the available data was used, and visualization-based learning was also applied in the labeling. The initial training set consisted of the same two labeled scene images as before. The sample pool was created by dividing the rest of the training images into pixels subimages. In total there were 8190 unlabeled subimages. For the first four iterations active learning alone was used to select 100 samples to be explicitly labeled in each round. Visualization-based learning was applied for the manual labeling. After that, co-training was used to automatically label a maximum of 200 samples in each iteration. Active learning with visualization-based labeling was performed in every fifth iteration to improve the labeling. Those samples whose class was difficult to determine with visualization (i.e. several classes mixed) were left out of the training set. Fig.
2 shows examples of a segmented training image (top) and testing image (bottom) created by classifying pixels regions in the images using the SVM classifier and the created training set (LBP8,1). Most of the regions are correctly classified, but blocks on the category boundaries are especially difficult to classify.

[Figure 2. Examples of training and testing image classified into five categories.]

B. Other Texture Classification Problems

Three texture classification problems from the Outex [12] and CUReT [5] texture databases were used to study the performance of the learning framework in a more general way. Datasets 1 and 2 are from the Outex database. Dataset 1 (ID Outex TC 00004) represents a typical texture classification problem. It has 24 texture classes with 88 images of 64*64 pixels in each. Half of them (N=1056) were selected into the training set and the other half for testing. Dataset 2 (ID Outex TC 00013) is a color texture classification problem having *128 pixels images from 68 classes. Half of the images (N=680) were used in the training phase. Dataset 3 consists of CUReT textures. For each of 20 classes, 118 images taken under varying viewpoint and illumination with a viewing angle of less than 60 degrees were selected into the set. Half of the images were used as training images (N=1180).

With the first dataset, 70 of the 1056 images were selected into the initial training set. LBP8,1 and Gabor (4 scales, 6 orientations) features were extracted from the images. Active learning (without visualization-based labeling) was used to explicitly label samples in the first four iterations (50 samples in each iteration). After that, co-training was used to automatically label 50 samples in each iteration, and active learning was applied in every fifth round. Five different coding schemes were tested: one-against-all, pairwise, ECOC with a 15-bit BCH code, ECOC with a 31-bit BCH code, and ECOC with a 63-bit BCH code.
There was no significant difference in performance between the coding schemes. During the first five iterations the classification rate increased from 73% to nearly 97% with Gabor features and from 89% to nearly 97% with LBP8,1 with all coding schemes. The baseline classification results obtained with the full training data and supervised learning were 98.5% with Gabor and 98.8% with LBP8,1. All coding schemes were able to achieve very good performance during learning. It can also be noted that active learning was able to find the most informative samples very fast (during the first few iterations).

The multiresolution LBP8,1+16,3+24,5 and Gabor filtering at four scales and six orientations were applied to the Outex TC textures. Initially 225 images were put into the labeled training set. Similar settings as before were used, except that only 25 samples were labeled in each iteration and one-against-all coding was used. In the first five rounds the classification rates increased from 68.1% to 77.8% with LBP8,1+16,3+24,5 and from 68.1% to 73.4% with Gabor. During learning the classification rates exceeded 80.2% and 74.0%, while with supervised learning on the full training data the rates were 84.4% and 74.9%, respectively.

Because the Outex TC images are color textures, we also tested how texture features and separate color features can be used together in the learning framework. The same LBP8,1+16,3+24,5 features as before were used as one feature set. The color feature set was constructed by quantizing each color channel into four levels in the RGB color cube. Using the same kind of learning procedure as before, the classification rates increased from 68.1% to 76.1% with LBP8,1+16,3+24,5 and from 67.1% to 71.8% with color features in the first four iterations. During learning, the rates reached 84.4% and 76.6%, which are close to the rates obtained with fully supervised learning (84.4% and 76.8%).

With the CUReT textures, 20% of the training images were selected into the initial training set. Rotation-invariant (ri) LBP8,1+16,3+24,5 and Gabor features were extracted, and settings similar to those for the Outex TC textures were used in the labeling procedure. During the first five iterations the classification rate increased from 70.9% to 92.6% with LBP8,1+16,3+24,5 (ri) and from 79.8% to 92.0% with Gabor features. During learning the rates increased to 94.5% and 97.3%, which are close to the ones obtained with supervised learning (95.0% and 97.6%).

4. Conclusions

Collection of labeled training data is typically laborious and prone to human error in many computer vision tasks. In this paper, we proposed a robust texture data labeling framework utilizing efficient texture features, co-training and active learning in order to ease the human work in labeling. Visualization-based learning was used as a user interface in the actual labeling task, providing a very useful view into the data. We tested the labeling framework on various texture image sets to create a labeled data set for a texture classifier.
Active learning was able to select unlabeled samples that provide useful information to the training set, and it increased the testing classification rate very rapidly. Co-training was used to automatically label those samples whose label can be known with high confidence. Those unlabeled samples that the active learner suggested for labeling were analyzed and labeled with the visualization-based user interface. With this kind of approach the amount of human work needed in the labeling is reduced dramatically, while the labeling performance remains very good. In addition, the user can introduce totally new classes in the visualization phase, which makes the labeling procedure very flexible.

Acknowledgments

Financial support of the Infotech Oulu Graduate Schools and Academy of Finland is gratefully acknowledged.

References

[1] E. Allwein, R. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1.
[2] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Computational Learning Theory.
[3] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In ICML '00.
[4] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4.
[5] K. Dana, B. van Ginneken, S. Nayar, and J. Koenderink. Reflectance and texture of real world surfaces. ACM Trans. Graphics, 18(1):1-34.
[6] R. Ghani. Combining labeled and unlabeled data for multiclass text categorization. In ICML '02.
[7] X. He, R. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In IEEE Conf. CVPR '04.
[8] V. Iyengar, C. Apte, and T. Zhang. Active learning using adaptive resampling. In Int'l Conf. on Knowledge Discovery and Data Mining, pages 91-98.
[9] T. Kohonen. Self-organizing Maps. Springer-Verlag, Berlin, Germany.
[10] S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine regions. IEEE Trans. PAMI, 27(8).
[11] B. Manjunath and W. Ma. Texture features for browsing and retrieval of image data. IEEE Trans. PAMI, 18(8).
[12] T. Ojala, T. Mäenpää, M. Pietikäinen, J. Viertola, J. Kyllönen, and S. Huovinen. Outex - new framework for empirical evaluation of texture analysis algorithms. In ICPR '02, volume 1.
[13] T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. PAMI, 24(7).
[14] R. Picard and T. Minka. Vision texture for annotation. Multimedia Systems, 3(1):3-14.
[15] M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute of Adaptive and Neural Computation, University of Edinburgh.
[16] M. Turtinen and M. Pietikäinen. Visual training and classification of textured scene images. In The 3rd International Workshop on Texture Analysis and Synthesis.
[17] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
[18] R. Yan, J. Yang, and A. Hauptmann. Automatically labeling video data using multi-class active learning. In IEEE Conf. ICCV '03, 2003.