Valmet: A new validation tool for assessing and improving 3D object segmentation




Guido Gerig (1,2), Matthieu Jomier (1), Miranda Chakos (2)
(1) Department of Computer Science, UNC, Chapel Hill, NC 27599, USA
(2) Department of Psychiatry, UNC, Chapel Hill, NC 27599, USA
Software: http://www.ia.unc.edu/public/valmet
Email: gerig@cs.unc.edu

Abstract. Extracting 3D structures from volumetric images like MRI or CT is becoming a routine process for diagnosis based on quantitation, for radiotherapy planning, for surgical planning and image-guided intervention, for studying neurodevelopmental and neurodegenerative aspects of brain diseases, and for clinical drug trials. Key issues for segmenting anatomical objects from 3D medical images are validity and reliability. We have developed VALMET, a new tool for validation and comparison of object segmentation. New features not available in commercial and public-domain image processing packages are the choice between different metrics to describe differences between segmentations and the use of graphical overlay and 3D display for visual assessment of the locality and magnitude of segmentation variability. Inputs to the tool are an original 3D image (MRI, CT, ultrasound) and a series of segmentations generated by several human raters and/or by automatic methods (machine). Quantitative evaluation includes intra-class correlation of resulting volumes and four different shape distance metrics: a) percentage overlap of segmented structures (R intersect S)/(R union S), b) probabilistic overlap measure for non-binary segmentations, c) mean/median absolute distances between object surfaces, and d) maximum (Hausdorff) distance. All these measures are calculated for arbitrarily selected 2D cross-sections and full 3D segmentations. Segmentation results are overlaid onto the original image data for visual comparison.
A 3D graphical display of the segmented organ is color-coded depending on the selected metric for measuring segmentation difference. The new tool is in routine use for intra- and inter-rater reliability studies and for testing novel automatic machine segmentation versus a gold standard established by human experts. Preliminary studies showed that the new tool could significantly improve intra- and inter-rater reliability of hippocampus segmentation to achieve intra-class correlation coefficients significantly higher than published elsewhere.

1. Scoring measurement methods

The computer vision community has devoted a number of workshops, conferences, and special issues of journals to the topic of empirical evaluation techniques; see for example [Bowyer 1998], [Niessen 2000], [Vincken 2000], [Chalana 1997], and [Remiejer 1999]. Measuring performance of algorithms or human raters in image segmentation requires an appropriate metric, a goodness index that gives us a valid measure of the quality of a segmentation result. A good source and discussion of techniques is found in the most recent book, Performance Characterization in Computer Vision [Klette 2000].

Typical procedures for validation of computer-assisted segmentation are listed in [Kapur 1996]: segmentation results are validated by a) visual inspection, b) comparison with manual segmentation, c) tests with synthetic data, d) use of fiducials on patients, and e) use of fiducials and/or cadavers. The problem is not that there is no ground truth for medical data, but that the ground truth is not typically available to the segmentation validation system in any form that can be readily used. For some structures or parts of structures the boundary can only be known with non-negligible tolerance. For other structures, the ground truth/gold standard is in reality barely fuzzy.

The following sections cover metrics that quantify differences between segmentation results in order to measure and compare intra-rater, inter-rater and machine-to-rater reliability. We have developed a new tool called VALMET [freely available at http://www.ia.unc.edu/public/valmet ] that reads an original 3D image to be segmented and a series of 3D segmentation results. VALMET calculates different metrics to assess pairwise segmentation differences and differences between groups. It further displays volumetric images with overlaid segmentations as 3 orthogonal sections with coupled cursors and as 3D renderings (see Fig. 2).

1.1 Volumes

A feature most easily accessible is the total volume of a structure. This is the simplest morphologic measure and is often used in reliability studies in neuroimaging applications. For binary segmentations, we calculate the number of voxels adjusted by the voxel volume.
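
As a minimal sketch of the voxel-counting volume described above (not the VALMET implementation), the following assumes the binary segmentation is stored as a nested list indexed by slice, row and column, and that the voxel size is known in millimetres. The function name `binary_volume` and the toy data are hypothetical.

```python
def binary_volume(mask, voxel_size_mm):
    """Volume of a binary segmentation: voxel count times voxel volume.

    mask is a nested list [slice][row][column] of 0/1 labels (assumed layout).
    voxel_size_mm is the (dx, dy, dz) voxel spacing in millimetres.
    """
    dx, dy, dz = voxel_size_mm
    n_voxels = sum(1 for plane in mask for row in plane for v in row if v)
    return n_voxels * dx * dy * dz


# Toy 2x2x2 label volume with three foreground voxels (made-up data).
mask = [[[1, 0], [0, 1]],
        [[0, 0], [1, 0]]]
volume_mm3 = binary_volume(mask, (1.0, 1.0, 1.5))
print(volume_mm3)  # 3 voxels of 1.5 mm^3 each -> 4.5
```

Two segmentations with equal volume from this function can still disagree everywhere along the boundary, which is why the spatial measures in the following sections are needed.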
More precise volumetric measurements can be obtained by fitting a surface (e.g. by marching cubes) with sub-voxel accuracy and calculating the volume by integration. Comparing volumes of segmented structures does not take into account any regional differences and does not answer the question of where differences occur. Further, over- and underestimation along boundaries or surfaces cancel and can give excellent agreement even if the boundary segmentation is poor [Niessen 2000].

1.2 Volumetric Overlap (true and false positives, true and false negatives)

One approach for taking into account the spatial properties of structures is a pair-wise comparison of two binary segmentations by relative overlap. Assuming spatial registration, images are analyzed voxel by voxel to calculate false positive, false negative, true positive and true negative voxels. Well accepted measures are the intersection of subject and reference divided by the union, $(S \cap R)/(S \cup R)$, or the intersection divided by the reference, $(S \cap R)/R$. Both measures give a score of 1 for perfect agreement and 0 for complete disagreement. The first is more sensitive to differences since both denominator and numerator change with increasing or decreasing overlap. The measure gives comparable results when applied at different institutions if structures and resolution of image data are standardized. However, the overlap measure depends on the size and the shape complexity of the object and is related to the image sampling. Assuming that most of the error occurs at the boundary of objects, small objects are penalized and get a much lower score than large objects.

1.3 Probabilistic distances between segmentations

In many medical image segmentation tasks there are no clear boundaries between anatomical structures. Absolute ground truth by manual segmentation does not exist and only a 'fuzzy' probabilistic segmentation is possible. Manual probabilistic segmentations can be generated by aggregating repeated segmentations of the same structure done either by a trained individual rater or by multiple raters. We have developed a probabilistic overlap measure between two fuzzy segmentations derived from the normalized $L_1$ distance between two probability distributions. The probabilistic overlap is defined as

$$\mathrm{POV}(A, B) = 1 - \frac{\int |P_A - P_B|}{2 \int P_{AB}},$$

where $P_A$ and $P_B$ are the probability distributions representing the two fuzzy segmentations and $P_{AB}$ is the pooled joint probability distribution.

1.4 Maximum Surface Distance (Hausdorff distance)

The Hausdorff-Chebyshev metric defines the largest difference between two contours. Given two contours $C$ and $D$, we first calculate for each point $c$ on $C$ the minimal distance to all the points on contour $D$, $d_c(c, D) = \min\{ d(c, s),\ s \in D \}$.
We calculate this minimal distance for each boundary point and take the maximum minimal distance as the worst case distance, $h_c(C, D) = \max\{ d_c(c, D),\ c \in C \}$. This directed distance is not symmetric and $h_c(C, D)$ is not equal to $h_c(D, C)$ (see drawn figure), which is accounted for by finally calculating $H_C(C, D) = \max\{ h_c(C, D),\ h_c(D, C) \}$. The Hausdorff metric calculation is computationally very expensive, as we need to compare each contour point to all the other ones. A comparison of complex 3D surfaces would require a huge number of calculations. The VALMET implementation uses a 3D Euclidean distance transform calculation on one object and overlay of the second object to efficiently calculate the measure. The measure is extremely sensitive to outliers and does not reflect properties integrated along the whole boundary or surface. In certain cases, however, where a procedure has to stay within certain limits, this measure would be the metric of choice.

1.5 Mean absolute surface distance

The mean absolute surface distance tells us how much on average the two surfaces differ. This measure integrates over both over- and under-estimation of a contour, and results in an $L_1$ norm with intuitive explanation [Chalana 1997]. The calculation is not straightforward if point-to-point correspondence on two surfaces is not available. We use a similar strategy as for the Hausdorff metric calculation, namely signed Euclidean distance transforms on one object and overlay of the second object surface. We then trace the surface and integrate the distance values. This calculation is not symmetric, since distances from A to B are not the same as from B to A (see the discussion of the Hausdorff distance above). We therefore derive a common average by combining the two averages. The mean absolute distance, as opposed to binary overlap, does not depend on the object size. As a prerequisite, however, it requires existing surfaces and is therefore only suitable for single object comparison.

1.6 Intraclass correlation coefficient for assessing intra-rater, inter-rater and rater-machine reliability

A common measure of reliability of segmentation tasks is the intraclass correlation coefficient. The measure is the ratio between the variance of a normally distributed population and the variance of the population of measurements, i.e. the variance of the population $\sigma_b^2$ plus the variance of the rater $\sigma_0^2$. The intraclass correlation is thus defined as

$$\rho = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_0^2}.$$

If the rater variance is small relative to the total, then the variation in measurements among different cases will be due largely to natural variation in the population, and $\rho$ is thus close to 1. Hence we can be confident in the rater's reliability. In neuroimaging applications, inter- and intra-rater reliability studies based on volumetric measurements have become standard. Commonly accepted values range from 0.9 to 0.99 for volume assessments with manual tracing of simply shaped subcortical structures or organs like kidneys and liver, for example.
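
The pairwise measures introduced in this section (relative overlap, probabilistic overlap, Hausdorff distance and mean absolute surface distance) can be sketched as follows. This is a sketch under stated assumptions and not the VALMET implementation: binary segmentations are assumed to be sets of voxel coordinates, fuzzy segmentations dictionaries from voxel to probability, and surfaces plain lists of points. Distances are computed by brute force rather than by the Euclidean distance transform that VALMET uses, the pooled distribution is assumed to be the voxelwise mean of the two probability maps (the text does not spell out its construction), and the two directed mean distances are assumed to be combined with equal weights. All function names are hypothetical.

```python
import math


def overlap_scores(S, R):
    """Intersection/union and intersection/reference for two voxel sets."""
    inter = len(S & R)
    return inter / len(S | R), inter / len(R)


def probabilistic_overlap(PA, PB):
    """POV = 1 - sum|PA - PB| / (2 * sum PAB) for dicts voxel -> probability.

    ASSUMPTION: the pooled distribution PAB is the voxelwise mean of PA, PB.
    """
    voxels = set(PA) | set(PB)
    l1 = sum(abs(PA.get(v, 0.0) - PB.get(v, 0.0)) for v in voxels)
    pooled = sum(0.5 * (PA.get(v, 0.0) + PB.get(v, 0.0)) for v in voxels)
    return 1.0 - l1 / (2.0 * pooled)


def directed_distances(C, D):
    """For each point c on C, the minimal Euclidean distance to D (brute force)."""
    return [min(math.dist(c, s) for s in D) for c in C]


def hausdorff(C, D):
    """Symmetric Hausdorff distance: the larger of the two directed maxima."""
    return max(max(directed_distances(C, D)), max(directed_distances(D, C)))


def mean_absolute_distance(C, D):
    """Average of the two directed mean distances (ASSUMPTION: equal weights)."""
    c_to_d = directed_distances(C, D)
    d_to_c = directed_distances(D, C)
    return 0.5 * (sum(c_to_d) / len(c_to_d) + sum(d_to_c) / len(d_to_c))


# Toy data: two 3-voxel segmentations shifted by one voxel.
S = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
R = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}
jaccard, over_ref = overlap_scores(S, R)

# Two contours where one has an outlier point: the directed distances differ.
C = [(0.0, 0.0), (1.0, 0.0)]
D = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(max(directed_distances(C, D)), max(directed_distances(D, C)), hausdorff(C, D))
```

In the contour example the single outlier on D determines the Hausdorff distance, while the mean absolute distance is affected far less, which illustrates the sensitivity to outliers noted above.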

2. Visualization of intra- and inter-rater reliability

Visualization of intra- and inter-rater reliability on 2D cross-sections with label overlay and by 3D surface renderings is shown in Fig. 1. The concept is as follows: we load into the tool a 3D volumetric gray level image dataset and a series of segmentation results, either by different raters or as repeated measurements of one rater. The labels are overlaid onto the original image with variable opacity. The 3D rendering reconstructs the 3D surfaces and displays either intra-rater or inter-rater variability as color overlays. The tool has proven useful as a training tool for manual raters' segmentation. The capability to visually assess rater differences on 2D slices and 3D views is new and not available in other packages.

Fig. 1: Qualitative assessment of intra- and inter-rater reliability. The images show 2D orthogonal sections of a region of interest of the interior brain with segmentation of the left and the right hippocampal structures. Left: Intra-rater variability of 3 segmentations (observations) by one rater with yellow=3, green=2, and blue=1 votes per voxel. The 3D rendering displays the regional fuzziness of the boundary. Right: Inter-rater variability between two raters by comparison of two average segmentations. Yellow marks the region segmented by both, and blue and green mark regions segmented by only one of them. This display clearly illustrates the agreement/disagreement between the raters, which is dominant in the hippocampus amygdala transition area (HATA) and the region of the hippocampal tail.

Fig. 2 shows the screen of VALMET applied to hippocampus segmentation. Repeated 3D manual segmentations of the hippocampus provided by several experts are compared to qualitatively and quantitatively assess the intra- and inter-rater reliability. The tool displays three orthogonal cuts with overlay of labeled regions and a 3D surface rendering of the object boundaries.
The hue indicates the local surface direction, either inwards (blue, see color bar in Fig. 2) or outwards (red, again see color bar in Fig. 2) relative to the reference object, and the distance between the surfaces, according to the metric chosen.
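
The per-voxel vote counts displayed in Fig. 1 (yellow=3, green=2, blue=1), and the fuzzy segmentation obtained by aggregating repeated segmentations, can be sketched as follows. This is an illustrative sketch, not VALMET code; it assumes each segmentation is a set of voxel coordinates, and the names `vote_map` and `fuzzy_segmentation` are hypothetical.

```python
def vote_map(segmentations):
    """Count, for every voxel, how many segmentations label it as object."""
    votes = {}
    for seg in segmentations:
        for voxel in seg:
            votes[voxel] = votes.get(voxel, 0) + 1
    return votes


def fuzzy_segmentation(segmentations):
    """Turn vote counts into per-voxel probabilities (votes / observations)."""
    n = len(segmentations)
    return {voxel: count / n for voxel, count in vote_map(segmentations).items()}


# Three made-up observations of the same structure by one rater.
observations = [
    {(0, 0, 0), (1, 0, 0)},
    {(1, 0, 0), (2, 0, 0)},
    {(0, 0, 0), (1, 0, 0)},
]
votes = vote_map(observations)
probabilities = fuzzy_segmentation(observations)
print(votes[(1, 0, 0)], votes[(0, 0, 0)], votes[(2, 0, 0)])  # 3 2 1
```

Voxels with the full number of votes form the stable core of the structure, while voxels with fewer votes mark the regions where the boundary is uncertain.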

Fig. 2: User interface of VALMET. The tool calculates overlap measures, Hausdorff distance, mean absolute (and signed) surface distances, and probabilistic overlap. The 3D rendering provides a color display of both intersecting surfaces (green and red), showing regional differences between two surfaces. The application shows the result of an inter- and intra-rater hippocampus segmentation study.

3. Segmentation Validation: Manual hippocampus segmentation

Unlike some anatomical structures, the hippocampus as imaged through MRI has no clear boundaries, and it is very difficult to establish ground truth by manual segmentation. Hence it is very important to quantify variability in manual segmentations done by trained raters. As part of a large schizophrenia neuroimaging study, intra- and inter-rater reliability were tested with blind studies of series of 3D image data. For each series, we randomly selected 5 cases from an ongoing schizophrenia study. The 5 cases were replicated 3 times and numbered randomly, resulting in 15 image datasets, numbered differently for each rater. Trained raters go through all these cases and segment left and right hippocampal structures using a new 3D segmentation tool, IRIS [IRIS, 1999], developed by our group. The tool allows triplanar region editing and graphical 2D/3D interaction between image planes and segmented objects. We used an intraclass correlation program written in SAS (SAS Institute Inc.) to calculate intra- and inter-rater reliability.
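
The blinded replication scheme described above (5 cases, 3 replications, a different random numbering for each rater) can be sketched as follows. This only illustrates the bookkeeping and is not the procedure actually used in the study; the function name, the case identifiers and the per-rater seeds are hypothetical.

```python
import random


def blinded_numbering(case_ids, n_replications, rater_seed):
    """Return a key {dataset number: (case id, replication index)}.

    Every case is replicated n_replications times and the copies are shuffled,
    so a rater cannot tell which numbered datasets show the same case.
    """
    rng = random.Random(rater_seed)  # separate generator per rater
    datasets = [(case, rep) for case in case_ids for rep in range(n_replications)]
    rng.shuffle(datasets)
    return {number: item for number, item in enumerate(datasets, start=1)}


cases = ["case1", "case2", "case3", "case4", "case5"]  # placeholder identifiers
key_rater_a = blinded_numbering(cases, 3, rater_seed=1)
key_rater_b = blinded_numbering(cases, 3, rater_seed=2)
print(len(key_rater_a))  # 15 numbered datasets per rater
```

The key is kept away from the raters and is only used afterwards to group the three observations of each case for the reliability analysis.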

Table 1 shows the reliability of two of the raters. We tested the reliability in two series: a first series after raters had been trained with the tools and had become familiar with the instructions for hippocampus segmentation, and a second series after they had evaluated and compared their results using the new tools described above. The results of the first series show that the reliability of raters A and B differs significantly between right and left hippocampus, each achieving a high reliability for one of the structures. The inter-rater reliability of 0.75 for the right and 0.62 for the left hippocampus suggests that the left hippocampus is more difficult to segment than the right hippocampus.

The second series was measured after the two raters visualized their segmentations using VALMET and revised the protocol. It used 5 different cases with 3 replications. Interestingly, both raters became very reliable. This is reflected in reliabilities up to 0.95 and in the pooled intra-rater reliability of 0.93 and 0.96. However, the reliability between raters (inter-rater) became worse and dropped significantly, from 0.75 to 0.67 for the right and from 0.62 to 0.48 for the left hippocampal structures.

In conclusion, we find that the intra-rater reliability for manual hippocampus segmentation was very high in comparison to studies done at other sites [Hogan 2000]. A reliability of 0.95 for the manual segmentation of a structure as difficult as the hippocampus has to be considered excellent. We attribute this performance to the 2D/3D capabilities of the IRIS segmentation tool and VALMET. The inter-rater reliability is insufficient and reflects that both raters do excellent but different segmentations.

Table 1: Reliability of manual hippocampus segmentation.
Intraclass Correlation: Manual Hippocampus Segmentation
Study design: 2 raters, 5 cases, 3 observations each
Analysis: Individual and pooled analysis

First reliability series
                     individual analysis          pooled analysis
                     intra-rater   intra-rater    intra-rater   inter-rater
                     rater A       rater B        A and B       A vs. B
right hippocampus    0.89067       0.66422        0.77241       0.75062
left hippocampus     0.69061       0.85157        0.81391       0.61923

Second reliability series
                     individual analysis          pooled analysis
                     intra-rater   intra-rater    intra-rater   inter-rater
                     rater A       rater B        A and B       A vs. B
right hippocampus    0.96073       0.88145        0.93229       0.67325
left hippocampus     0.95416       0.94822        0.96094       0.48218
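
The values in Table 1 were computed with a SAS program whose exact model is not given here. As a hedged illustration of the variance ratio behind an intraclass correlation, the sketch below uses the standard one-way random-effects ANOVA estimator: the within-case mean square estimates the rater variance, and the difference of between-case and within-case mean squares divided by the number of repetitions estimates the population variance. Choosing this estimator is an assumption, and the function name and the volumes are made up.

```python
def intraclass_correlation(measurements):
    """One-way random-effects ICC from repeated measurements.

    measurements is a list of cases, each a list of k repeated volume
    measurements (same k for every case). Returns
    var_population / (var_population + var_rater).
    """
    n = len(measurements)      # number of cases
    k = len(measurements[0])   # repetitions per case
    grand_mean = sum(sum(row) for row in measurements) / (n * k)
    case_means = [sum(row) / k for row in measurements]
    ms_between = k * sum((m - grand_mean) ** 2 for m in case_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(measurements, case_means) for x in row
    ) / (n * (k - 1))
    var_rater = ms_within                      # estimate of the rater variance
    var_population = (ms_between - ms_within) / k  # estimate of case variance
    return var_population / (var_population + var_rater)


# Made-up volumes (cm^3) for 3 cases, each measured twice by one rater.
volumes = [[10.0, 12.0], [20.0, 19.0], [30.0, 33.0]]
icc = intraclass_correlation(volumes)
print(round(icc, 3))
```

When repeated measurements of each case agree closely relative to the spread between cases, the estimate approaches 1; with few cases the estimate can be unstable and may even become negative.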

4. Discussion

No consensus exists regarding a necessary and sufficient set of measures to characterize segmentation performance. We plan to provide a suite comprising a reasonable variety of geometric and statistical methods. In addition to the measures already implemented in the prototype validation tool VALMET, we will consider providing a number of others, including moments and volume of error voxels normalized by the surface area. Measures implemented in VALMET and other geometric measures reported in the literature tend to favor least-squares measures. Measures in this class are intuitive and work well for noise-free data. However, real medical images have structure noise and random noise that can lead to high spatial frequencies in segmented surfaces. Methods based on least-squares measures are very sensitive to extreme data values: a small number of outlier voxels can disproportionately bias a measure and make an otherwise good segmentation appear to compare poorly with truth. Statistically robust methods include quantiles of distance, which are robust to extreme values. A next version of VALMET will include the calculation of a surface distance histogram and a choice of arbitrary quantiles.

Bibliography

Bowyer, K.W. and Phillips, P., Empirical Evaluation Techniques in Computer Vision, IEEE Computer Society, 1998

Chalana, V. and Kim, Y., A methodology for evaluation of boundary detection algorithms on medical images, IEEE Trans. Med. Imaging 16: 642-652, 1997

Hogan, R.E., Mark, K.E., Wang, L., Joshi, S., Miller, M.I. and Bucholz, R.D., Mesial Temporal Sclerosis and Temporal Lobe Epilepsy: MR Imaging Deformation-based Segmentation of the Hippocampus in Five Patients, Radiology 216, pp. 291-297, July 2000

IRIS (1999): Interactive Rendering and Image Segmentation, UNC student project spring 1999, Gregg, D., Larsen, E., Neelamkavil, A., Sthapit, S. and Wynn, Chris; Dave Stotts and Guido Gerig, supervisors, http://www.cs.unc.edu/~stotts/comp145/homes/iris/

Kapur, T., Grimson, E.L., Wells, W.M. and Kikinis, R., Segmentation of brain tissue from magnetic resonance images, Medical Image Analysis 1(2): 109-127, 1996

Klette, R., Stiehl, S.H., Viergever, M.A. and Vincken, K.L., eds., Performance Characterization in Computer Vision, Kluwer Academic Publishers, 2000

Niessen, W.J., Bouma, C.J., Vincken, K.L. and Viergever, M.A., Error Metrics for Quantitative Evaluation of Medical Image Segmentation, in Performance Characterization in Computer Vision, Kluwer Academic Publishers, pp. 299-311, 2000

Remiejer, P., Rasch, C., Lebesque, J.V. and van Herk, M., A general methodology for three-dimensional analysis of variation in target volume delineation, Med. Phys. 27: 1961-1970, 1999

Vincken, K.L., Koster, A.S.E., De Graaf, C.N. and Viergever, M.A., Model-based evaluation of image segmentation methods, in Performance Characterization in Computer Vision, Kluwer Academic Publishers, pp. 299-311, 2000