Supervised Image Segmentation across Scanner Protocols: A Transfer Learning Approach
Annegreet van Opbroek 1, M. Arfan Ikram 2, Meike W. Vernooij 2, and Marleen de Bruijne 1,3

1 Biomedical Imaging Group Rotterdam, Departments of Medical Informatics and Radiology, Erasmus MC - University Medical Center Rotterdam, the Netherlands
2 Departments of Epidemiology and Radiology, Erasmus MC - University Medical Center Rotterdam, the Netherlands
3 Department of Computer Science, University of Copenhagen, Denmark

Abstract. Supervised classification techniques are among the most powerful methods used for automatic segmentation of medical images. A disadvantage of these methods is that they require a representative training set and thus encounter problems when the training data is acquired, e.g., with a different scanner protocol than the target segmentation data. We therefore propose a framework for supervised biomedical image segmentation across different scanner protocols, by means of transfer learning. We establish a transfer learning algorithm for classification that can exploit a large amount of labeled samples from different sources in addition to a small amount of samples from the target source. The algorithm iteratively re-weights the contribution of training samples from these different sources based on classification by a weighted SVM classifier. We evaluate this technique by performing tissue classification on MRI brain data from four substantially different scanning protocols. For a small number of labeled samples from a single image obtained with the same protocol, the proposed transfer learning method outperforms both classification on all available training data and classification based on the labeled target samples only. The classification errors in these cases can be reduced by up to 40 percent compared to traditional classification techniques.
1 Introduction

Supervised classification techniques are commonly used in automatic segmentation of biomedical images. A major drawback of these methods is that they require a sufficiently large training set from a distribution similar to that of the images to be segmented. In practice this means that these techniques often cannot be applied to data obtained with a different scanning protocol, scanner, or image modality without establishing a new, usually manually annotated, dataset. Common methods to cope with differences between training and test distributions exploit similarities between training and test data, e.g. by embedding the physics of the image acquisition process in the segmentation
framework [1], or by using prior tissue probability maps to identify new training samples from target data [2]. Another approach is to use unsupervised clustering methods on the target source [3-5].

In this article we present a new approach to the problem of learning automatic classification across scanners, which can reduce the effort of re-training by transferring knowledge from different scanners. Our method relies on a relatively new area of machine learning called transfer learning [6]. Transfer learning copes with cases where distributions, feature spaces, and/or tasks differ between training and test data, as opposed to traditional machine learning techniques, where these are assumed to be the same between training and testing. As a proof of concept we investigate whether transfer learning can improve between-scanner segmentation performance in a basic voxelwise-classification segmentation framework. To this end we establish a transfer learning method that makes use of a large amount of different-distribution training samples, which come from different sources than the target data, in addition to a relatively small amount of same-distribution samples from the target distribution. Our method iteratively weights these different-distribution samples according to the classification outcome of a weighted support vector machine (SVM) classifier trained on all available training data. This way, the suitable different-distribution samples are selected, which help regularize the classification and thus reduce the problems related to the small sample size of the same-distribution data.

2 A Transfer Learning Approach to Classification

We make use of a small amount of labeled data from the target source, which we will call the same-distribution training data, denoted by $T^s = \{(x_i^s, y_i^s)\}_{i=1}^{N^s}$, where $N^s$ represents the number of same-distribution training samples $x_i^s$, with corresponding labels $y_i^s$.
Apart from $T^s$ we also have a large amount of training data from other sources, which may have different distributions. This different-distribution training data is denoted by $T^d = \{(x_i^d, y_i^d)\}_{i=1}^{N^d}$, where $N^d$ is the total number of different-distribution training samples $x_i^d$, with labels $y_i^d$ (typically $N^d \gg N^s$), so that the total training set is $T = T^d \cup T^s$.

Our algorithm iteratively calculates a weighted SVM classifier $c_t$ from all available training data, which is then used to determine a new weight vector $w^{t+1}$ for all training samples. Samples from $T^d$ that contradict the labeled same-distribution data may be misclassified by the trained classifier. In the next round these samples will receive a lower weight, and will thus have less influence on the decision boundary. The weighting of the different-distribution training samples is achieved by multiplication with $\beta^{1 - \delta(c_t(x_i^d),\, y_i^d)}$, where $\delta(\cdot,\cdot)$ denotes the Kronecker delta. The value for $\beta$ is taken as $\beta = 1/(1 + \sqrt{2 \ln N^d / N_{it}})$, as determined for the TrAdaBoost algorithm [7], which is also based on iterative re-weighting of training samples. Here $N_{it}$ denotes the total number of iterations performed; thus, when more iterations are performed, the reduction of weights in one iteration diminishes. An initial weight vector $w^1$ gives each of the different-distribution training samples a weight $R/N^d$ and each of the same-distribution training samples a weight $1/N^s$, so that $R$ ($R \geq 0$) denotes the ratio between the total weight of the $T^d$ samples and the total weight of the $T^s$ samples. The algorithm is summarized in Table 1.

Table 1. Our transfer learning algorithm in pseudo code

Input:  $T$, $N^s$, $N^d$, $N_{it}$, $R$
Output: classifier for test data $c_{N_{it}}$

Set $\beta = 1/(1 + \sqrt{2 \ln N^d / N_{it}})$
Set $w_i^1 = R/N^d$ for $i = 1, 2, \ldots, N^d$
    $w_i^1 = 1/N^s$ for $i = N^d+1, \ldots, N^d+N^s$
For $t = 1, 2, \ldots, N_{it}$:
    Normalize $w^t = w^t / \sum_{j=1}^{N^d+N^s} w_j^t$
    Calculate classifier $c_t$ from $T$ and $w^t$
    Update $w_i^{t+1} = w_i^t \, \beta^{1 - \delta(c_t(x_i^d),\, y_i^d)}$ for $i = 1, 2, \ldots, N^d$
End

2.1 Weighted Support Vector Machine Classification

For classification we use a weighted support vector machine, as provided by LIBSVM [8]. In this weighted SVM every training sample $x_i$ is given a weight $w_i^t \geq 0$ that describes the importance of the sample, so that a training sample with a high weight is more important to classify correctly. The objective function for the optimal classifier $c_t(x) = V^T x + v_0$ by weighted SVM reads

$$\min_{V, v_0, \xi} \; \tfrac{1}{2} V^T V + C \sum_{i=1}^{N} w_i^t \xi_i \qquad (1)$$

$$\text{s.t.} \quad V^T x_i + v_0 \geq 1 - \xi_i \;\text{ for } y_i = +1, \qquad V^T x_i + v_0 \leq -1 + \xi_i \;\text{ for } y_i = -1, \qquad \xi_i \geq 0, \; i = 1, 2, \ldots, N.$$

As in regular SVM, the trade-off parameter $C$ can be determined with cross-validation.

3 Experiment: MRI Brain Tissue Segmentation

We consider the application of MRI brain tissue segmentation by voxelwise classification on images acquired with a certain scanner, for which no manual labels
are yet available. Since manual segmentation is time-consuming, only a very small number of voxels in a single image is labeled, to train a classification scheme for the remaining images. In addition to these same-distribution training samples, a larger amount of manually labeled samples from different patients, made with a variety of scanners, is already available from other studies. We evaluate whether our transfer learning algorithm can outperform traditional classification.

Data Description
We use annotated MRI brain data from four different sources, which display a large amount of variation in intensity, tissue contrast, and voxel size:

1. 6 T1-weighted images from the Rotterdam Scan Study [9], made with a 1.5T GE scanner.
2. HASTE-Odd images (inversion time = 4400 ms, TR = 2800 ms, TE = 29 ms) from the Rotterdam Scan Study [9], made with a 1.5T Siemens scanner.
3. T1-weighted images from the Internet Brain Segmentation Repository (IBSR) [10].
4. T1-weighted images from the IBSR, 10 from a 1.5T Siemens scanner and 10 from a 1.5T GE scanner.

Slices from one image of each of the four sources are shown in Fig. 1.

Fig. 1. Slices of images from the four different sources: (a) Source 1, (b) Source 2, (c) Source 3, (d) Source 4. The HASTE-Odd images are shown inverted, so that CSF has the lowest intensity and WM the highest, as in the T1-weighted images.

All 56 images are corrected for non-uniformity in intensity with the N3 method [11] within a mask, and then normalized so that the voxels between the 4th and the 96th intensity percentile are mapped between 0 and 1. For all images, expert segmentations of white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) are available for training and testing. Some images also have expert segmentations of white matter lesions; these voxels are included in the WM class.

Experimental Setup
A set of cross-validation experiments is performed.
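As a concrete illustration of the preprocessing described above, the percentile-based intensity normalization might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name is illustrative, and leaving out-of-range intensities unclipped is an assumption, since the paper does not say how values beyond the percentiles are treated.

```python
import numpy as np

def normalize_intensities(image, mask, low_pct=4, high_pct=96):
    """Linearly map intensities so that the 4th and 96th percentiles
    (computed within the mask) go to 0 and 1 respectively.

    Note: values outside the percentile range fall below 0 / above 1;
    whether to clip them is not specified in the paper (assumption).
    """
    voxels = image[mask]
    lo, hi = np.percentile(voxels, [low_pct, high_pct])
    return (image - lo) / (hi - lo)
```

Because the mapping is affine and increasing, the 4th and 96th percentiles of the normalized masked voxels land exactly at 0 and 1.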
All four sources are in turn used as the test source, while the other three sources are
used to extract different-distribution training data. Training and test samples are voxels randomly selected within the manually annotated brain mask. As different-distribution training data, a fixed number of samples per source is randomly selected from the different images. Same-distribution training data is sampled from a single image from the test source: we start by adding one same-distribution sample from every class, and subsequently add more samples, randomly distributed over the classes, to generate learning curves that show the classification performance as a function of the number of same-distribution training samples. The learning curves are created by testing on randomly selected samples from each test-source image that is not used for training, resulting in a total of … test samples, depending on the number of images in the source. This gives a reliable estimate of the classification performance on whole images, but is computationally less expensive. We repeat the experiment with every test-source image as the same-distribution training source, and determine the average classification accuracy per image. In addition, to compare to brain tissue segmentation results reported in the literature, full image segmentation is performed at $N^s = 20$.

A total of four features is used for classification: the image intensity plus three spatial features, which give the x, y, and z coordinates of the voxel as a fraction of the total dimensions of the brain. The features within each source are normalized to zero mean and unit standard deviation. From now on we will refer to the transfer learning algorithm with the weighted SVM classifier as a Transfer SVM. The Transfer SVM is compared to normal SVM classification both on all available training data ($T^s \cup T^d$) and on the same-distribution training data ($T^s$) alone.
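The four-feature representation described above can be sketched as follows. This is an illustrative sketch under one assumption: the "total dimensions of the brain" are taken here as the bounding box of the brain mask, which the paper does not spell out.

```python
import numpy as np

def voxel_features(image, brain_mask):
    """Build per-voxel features: intensity plus x, y, z coordinates
    expressed as a fraction of the brain's extent (bounding box of the
    mask -- an assumption), then normalize to zero mean, unit std."""
    idx = np.argwhere(brain_mask)                  # (n_voxels, 3) coordinates
    lo, hi = idx.min(axis=0), idx.max(axis=0)
    spatial = (idx - lo) / np.maximum(hi - lo, 1)  # fractions in [0, 1]
    intensity = image[brain_mask][:, None]
    feats = np.hstack([intensity, spatial])        # (n_voxels, 4)
    # Per-source normalization to zero mean and unit standard deviation.
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)
```

The per-source standardization matters here: it puts intensity and spatial coordinates on a comparable scale before SVM training.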
In all classification schemes the SVM classifier is extended to a multi-class classifier by one-vs-one classification, which overall gives better results on the data than one-vs-rest classification. All SVM classifiers use a linear kernel.

Parameter Selection
The SVM parameter $C$ for each of the four experiments is determined by cross-validation with a regular SVM over the three different-distribution training sources; the parameter with the best performance over the three sources is selected. The total number of iterations of the transfer learning algorithm is set to $N_{it} = 20$, which is sufficient for convergence in all cases. For each number of same-distribution training samples, the parameter $R$, which sets the balance between the initial weights of the different- and same-distribution samples in the Transfer SVM, is determined by cross-validation with the Transfer SVM on the data from the three different-distribution sources: in turn, each source is selected as the target source, from which labeled same-distribution samples and test data are extracted, while the other two sources are used to extract different-distribution training data. The best $R$ is then determined by averaging over the three sources. The resulting $R$ is around 5 for three same-distribution training samples, but falls off exponentially to $R \approx 1$ for 200 same-distribution training samples.
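Putting the pieces together, the re-weighting loop of Table 1 combined with a weighted linear SVM might look like the sketch below. This is not the authors' code: it uses scikit-learn's `SVC`, which accepts per-sample weights via `sample_weight` (the paper uses LIBSVM's weighted SVM), and all function and variable names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def transfer_svm(Xd, yd, Xs, ys, R=5.0, C=1.0, n_it=20):
    """Transfer SVM sketch following Table 1: iteratively down-weight
    different-distribution samples (Xd, yd) that the current weighted
    SVM misclassifies; same-distribution samples (Xs, ys) keep their
    weight. Returns the final classifier."""
    Nd, Ns = len(yd), len(ys)
    X = np.vstack([Xd, Xs])
    y = np.concatenate([yd, ys])
    # beta = 1 / (1 + sqrt(2 ln N^d / N_it)), as in TrAdaBoost.
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(Nd) / n_it))
    # Initial weights: R/N^d per different-distribution sample, 1/N^s per same-distribution sample.
    w = np.concatenate([np.full(Nd, R / Nd), np.full(Ns, 1.0 / Ns)])
    for _ in range(n_it):
        w = w / w.sum()                        # normalize the weight vector
        clf = SVC(kernel="linear", C=C)
        clf.fit(X, y, sample_weight=w)
        wrong = clf.predict(Xd) != yd           # Kronecker delta = 0 when misclassified
        w[:Nd] *= np.where(wrong, beta, 1.0)    # multiply by beta^(1 - delta)
    return clf
```

Since `beta < 1`, different-distribution samples that keep contradicting the target labels lose influence over the iterations, while consistent ones continue to regularize the small same-distribution set.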
Results
Fig. 2 shows the learning curves for each of the four experiments. In all experiments we see that, for a small number of same-distribution training samples (between 3 and 40 in Fig. 2(b), and a broader range in the other three experiments), the Transfer SVM outperforms both the SVM on all training data ($T^s \cup T^d$) and the SVM on the labeled same-distribution data ($T^s$) only. When more same-distribution training samples are available, the SVM classifier on $T^s$ converges to the Transfer SVM and might even perform slightly better. For a very small amount of same-distribution training data, the SVM classifier on all available training samples performs best in two of the four cases. In these cases the Transfer SVM seems to give too little weight to the different-distribution training samples, which results in a classifier that does not outperform the SVM on $T^s \cup T^d$.

Fig. 2. Mean classification errors and 95%-confidence intervals for our transfer learning SVM, the conventional SVM on $T^d \cup T^s$, and the SVM on $T^s$, all with a linear kernel. The test source consists of (a) Source 1, (b) Source 2, (c) Source 3, (d) Source 4; the different-distribution training data comes from the three remaining sources.

Fig. 3 shows an example of segmentations from Source 1 obtained by the three methods for 20 same-distribution training samples. The Transfer SVM classifier produces a good segmentation, with a classification error of 6.7%. The
segmentation by the SVM on $T^s \cup T^d$ undersegments WM and CSF and gives an error of 9.3%, while the SVM on $T^s$ oversegments the CSF, resulting in an error of 14.0%. According to the learning curve in Fig. 2(a), the number of labeled same-distribution training samples needed for the SVM on $T^s$ to produce a similar result lies around $N^s = 80$. Thus, in this case, transferring knowledge from the different-distribution training samples with a transfer classifier reduces the cost of labeling new same-distribution data by a factor of 4.

Fig. 3. Segmentation results on Source 1 for $N^s = 20$: (a) original T1 image, (b) manual segmentation, (c) SVM on $T^s \cup T^d$, (d) SVM on $T^s$, (e) Transfer SVM.

4 Conclusion and Discussion

We presented a transfer learning method to segment biomedical images from different scanning protocols. Our algorithm makes use of labeled image data from a variety of scanners, on top of a small amount of labeled training data from the target scanner. Experiments on MRI brain tissue segmentation show that our algorithm can reduce the number of misclassified voxels in the image by up to 40%. To do this, the algorithm exploits already available training data from different scanners and weights these samples according to their correspondence with the training data from the target scanner. With our algorithm, just a few manually annotated samples in a single image obtained with a new protocol are sufficient to retrain the method and provide much better results than a classifier trained on different- or same-distribution data alone. Our transfer learning algorithm has proved capable of handling data from four drastically different sources, with different pulse-sequence parameters and different slice thicknesses in different orientations. To allow direct comparison of different learning algorithms, we have in this work focused on segmentation based on voxelwise classification.
This framework could be used as the basis for a more advanced brain tissue segmentation algorithm that includes, e.g., regularization, atlas-based priors, and modeling of partial volume effects. To give an indication of the performance of our segmentation
method in comparison with the current state of the art, we compare the performance reported for nine methods in [3] applied to the IBSR dataset with 20 subjects. For $N^s = 20$ our method obtains a mean Jaccard index on the 20 images of 0.15 for CSF, 0.68 for GM, and 0.61 for WM. The best of the nine algorithms in [3] reports coefficients of 0.10 for CSF, 0.68 for GM, and 0.69 for WM. Three methods in [3] report coefficients higher than 0.61 for WM, but our algorithm outperforms all methods on the CSF scores, and all but one on the GM scores. Moreover, when $N^s$ is increased, our method performs even better.

Even though the focus of the experiments in this paper is on MRI brain tissue segmentation, the proposed method is general, and we expect it to be useful in many other applications where classification is used, such as voxel classification in other areas, brain structure segmentation, and computer-aided diagnosis.

To conclude, the experiments give a strong indication that supervised image segmentation techniques can benefit from different-distribution training data in a transfer learning setting. This way, the large amount of (publicly) available annotated data from previous studies can be used to segment images with relatively large differences in imaging protocols, which forms an important step towards application in a clinical setting.

References
1. Fischl, B., Salat, D., van der Kouwe, A., Makris, N., Ségonne, F., Quinn, B., Dale, A.: Sequence-independent segmentation of magnetic resonance images. NeuroImage 23 (2004) S69-S84
2. Cocosco, C., Zijdenbos, A., Evans, A.: A fully automatic and robust brain MRI tissue classification method. Medical Image Analysis 7(4) (2003)
3. Mayer, A., Greenspan, H.: An adaptive mean-shift framework for MRI brain segmentation. IEEE Transactions on Medical Imaging 28(8) (2009)
4. Grabowski, T., Frank, R., Szumski, N., Brown, C., Damasio, H.: Validation of partial tissue segmentation of single-channel magnetic resonance images of the brain. NeuroImage 12(6) (2000)
5. Van Leemput, K., Maes, F., Vandermeulen, D., Suetens, P.: Automated model-based tissue classification of MR images of the brain. IEEE Transactions on Medical Imaging 18(10) (1999)
6. Pan, S., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10) (2010)
7. Dai, W., Yang, Q., Xue, G., Yu, Y.: Boosting for transfer learning. In: Proceedings of the 24th International Conference on Machine Learning, ACM (2007)
8. Chang, C., Lin, C.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3) (2011)
9. Hofman, A., Breteler, M., Van Duijn, C., Janssen, H., Krestin, G., Kuipers, E., Stricker, B., Tiemeier, H., Uitterlinden, A., Vingerling, J., et al.: The Rotterdam Study: 2010 objectives and design update. European Journal of Epidemiology 24(9) (2009)
10. Worth, A.: The Internet Brain Segmentation Repository (IBSR)
11. Sled, J., Zijdenbos, A., Evans, A.: A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Transactions on Medical Imaging 17(1) (1998) 87-97
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
More informationMusic Mood Classification
Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may
More informationData Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent
More informationThe Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy
BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.
More informationAward Number: W81XWH-10-1-0739
AD Award Number: W81XWH-10-1-0739 TITLE: Voxel-Wise Time-Series Analysis of Quantitative MRI in Relapsing-Remitting MS: Dynamic Imaging Metrics of Disease Activity Including Prelesional Changes PRINCIPAL
More informationNeovision2 Performance Evaluation Protocol
Neovision2 Performance Evaluation Protocol Version 3.0 4/16/2012 Public Release Prepared by Rajmadhan Ekambaram rajmadhan@mail.usf.edu Dmitry Goldgof, Ph.D. goldgof@cse.usf.edu Rangachar Kasturi, Ph.D.
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationSegmentation of Brain MR Images Through a Hidden Markov Random Field Model and the Expectation-Maximization Algorithm
IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 20, NO. 1, JANUARY 2001 45 Segmentation of Brain MR Images Through a Hidden Markov Random Field Model and the Expectation-Maximization Algorithm Yongyue Zhang*,
More informationTHREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
More informationTHEORY, SIMULATION, AND COMPENSATION OF PHYSIOLOGICAL MOTION ARTIFACTS IN FUNCTIONAL MRI. Douglas C. Noll* and Walter Schneider
THEORY, SIMULATION, AND COMPENSATION OF PHYSIOLOGICAL MOTION ARTIFACTS IN FUNCTIONAL MRI Douglas C. Noll* and Walter Schneider Departments of *Radiology, *Electrical Engineering, and Psychology University
More informationA fast multi-class SVM learning method for huge databases
www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,
More informationEmory University RT to Bachelor of Medical Science Degree Medical Imaging
Emory University RT to Bachelor of Medical Science Degree Medical Imaging Courses: All RT-BMSc students must complete all of the program s core courses and the courses specific to their selected minor
More information5 Factors Affecting the Signal-to-Noise Ratio
5 Factors Affecting the Signal-to-Noise Ratio 29 5 Factors Affecting the Signal-to-Noise Ratio In the preceding chapters we have learned how an MR signal is generated and how the collected signal is processed
More informationFeature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,
More informationONLINE SUPPLEMENTARY DATA. Potential effect of skull thickening on the associations between cognition and brain atrophy in ageing
ONLINE SUPPLEMENTARY DATA Potential effect of skull thickening on the associations between cognition and brain atrophy in ageing Benjamin S. Aribisala 1,2,3, Natalie A. Royle 1,2,3, Maria C. Valdés Hernández
More informationTrading Strategies and the Cat Tournament Protocol
M A C H I N E L E A R N I N G P R O J E C T F I N A L R E P O R T F A L L 2 7 C S 6 8 9 CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS MARK GRUMAN MANJUNATH NARAYANA Abstract In the CAT Tournament,
More informationA Feature Selection Methodology for Steganalysis
A Feature Selection Methodology for Steganalysis Yoan Miche 1, Benoit Roue 2, Amaury Lendasse 1, Patrick Bas 12 1 Laboratory of Computer and Information Science Helsinki University of Technology P.O. Box
More informationComparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
More informationThis unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.
Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationMRI DATA PROCESSING. Compiled by: Nicolas F. Lori and Carlos Ferreira. Introduction
MRI DATA PROCESSING Compiled by: Nicolas F. Lori and Carlos Ferreira Introduction Magnetic Resonance Imaging (MRI) is a clinical exam that is safe to the patient. Nevertheless, it s very important to attend
More informationData Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationWhitepapers on Imaging Infrastructure for Research Paper 1. General Workflow Considerations
Whitepapers on Imaging Infrastructure for Research Paper 1. General Workflow Considerations Bradley J Erickson, Tony Pan, Daniel J Marcus, CTSA Imaging Informatics Working Group Introduction The use of
More informationSVM Ensemble Model for Investment Prediction
19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of
More informationGetting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise
More informationEarly defect identification of semiconductor processes using machine learning
STANFORD UNIVERISTY MACHINE LEARNING CS229 Early defect identification of semiconductor processes using machine learning Friday, December 16, 2011 Authors: Saul ROSA Anton VLADIMIROV Professor: Dr. Andrew
More informationA Sarsa based Autonomous Stock Trading Agent
A Sarsa based Autonomous Stock Trading Agent Achal Augustine The University of Texas at Austin Department of Computer Science Austin, TX 78712 USA achal@cs.utexas.edu Abstract This paper describes an autonomous
More informationMAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS
MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a
More informationDistributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationCoding science news (intrinsic and extrinsic features)
Coding science news (intrinsic and extrinsic features) M I G U E L Á N G E L Q U I N T A N I L L A, C A R L O S G. F I G U E R O L A T A M A R G R O V E S 2 Science news in Spain The corpus of digital
More informationHYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
More informationThe End of Primary School Test Marleen van der Lubbe, Cito, The Netherlands
The End of Primary School Test Marleen van der Lubbe, Cito, The Netherlands The Dutch Education System Before explaining more about the End of Primary School Test (better known as Citotest), it is important
More informationIntelligent Tools For A Productive Radiologist Workflow: How Machine Learning Enriches Hanging Protocols
GE Healthcare Intelligent Tools For A Productive Radiologist Workflow: How Machine Learning Enriches Hanging Protocols Authors: Tianyi Wang Information Scientist Machine Learning Lab Software Science &
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationD-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationEnhanced Boosted Trees Technique for Customer Churn Prediction Model
IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction
More informationGE Medical Systems Training in Partnership. Module 12: Spin Echo
Module : Spin Echo Spin Echo Objectives Review the SE PSD. Review the concepts of T, T, and T*. Spin Echo PSD RF Gz Gy 90 80 Gx Spin Echo - SE Spin echo is a standard pulse sequence on Signa MRi/LX and
More informationDescriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),
More informationadvancing magnetic resonance imaging and medical data analysis
advancing magnetic resonance imaging and medical data analysis INFN - CSN5 CdS - Luglio 2015 Sezione di Genova 2015-2017 objectives Image reconstruction, artifacts and noise management O.1 Components RF
More informationAnalysis of Micro-Macro Transformations of Railway Networks
Konrad-Zuse-Zentrum für Informationstechnik Berlin Takustraße 7 D-14195 Berlin-Dahlem Germany MARCO BLANCO THOMAS SCHLECHTE Analysis of Micro-Macro Transformations of Railway Networks Zuse Institute Berlin
More informationA new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique
A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique Aida Parbaleh 1, Dr. Heirsh Soltanpanah 2* 1 Department of Computer Engineering, Islamic Azad University, Sanandaj
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationNeuroimaging module I: Modern neuroimaging methods of investigation of the human brain in health and disease
1 Neuroimaging module I: Modern neuroimaging methods of investigation of the human brain in health and disease The following contains a summary of the content of the neuroimaging module I on the postgraduate
More informationRecognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang
Recognizing Cats and Dogs with Shape and Appearance based Models Group Member: Chu Wang, Landu Jiang Abstract Recognizing cats and dogs from images is a challenging competition raised by Kaggle platform
More informationA Modified Fuzzy C-Means Algorithm for Bias Field Estimation and Segmentation of MRI Data
IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 21, NO. 3, MARCH 2002 193 A Modified Fuzzy C-Means Algorithm for Bias Field Estimation and Segmentation of MRI Data Mohamed N. Ahmed, Member, IEEE, Sameh M. Yamany,
More informationNeural Network Add-in
Neural Network Add-in Version 1.5 Software User s Guide Contents Overview... 2 Getting Started... 2 Working with Datasets... 2 Open a Dataset... 3 Save a Dataset... 3 Data Pre-processing... 3 Lagging...
More information