Differential Protein Expression Analysis via Liquid-Chromatography/Mass-Spectrometry Data Visualization

Lars Linsen, Julia Löcherbach, Matthias Berth, Jörg Bernhardt, Dörte Becher

Department of Mathematics and Computer Science, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald, Germany
DECODON GmbH, Greifswald, Germany
Department of Microbiology, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald, Germany
E-mail: linsen@uni-greifswald.de, {loeba,berth}@decodon.com, {Joerg.Bernhardt,dbecher}@uni-greifswald.de

Figure 1: Interactive visualization of liquid-chromatography/mass-spectrometry data: (a) Global view of entire data set. (b) Close-up view of region of interest.

ABSTRACT

Differential protein expression analysis is one of the main challenges in proteomics. It denotes the search for proteins whose encoding genes are differentially expressed under a given experimental setup. An important task in this context is to identify the differentially expressed proteins or, more generally, all proteins present in the sample. One of the most promising and recently widely used approaches for protein identification is to cleave proteins into peptides, separate the peptides using liquid chromatography, and determine the masses of the separated peptides using mass spectrometry. The resulting data needs to be analyzed and matched against protein sequence databases. The analysis step is typically done by searching for intensity peaks in a large number of 2D graphs. We present an interactive visualization tool for the exploration of liquid-chromatography/mass-spectrometry data in a 3D space, which allows for the understanding of the data in its entirety and a detailed analysis of regions of interest. We compute differential expression over the liquid-chromatography/mass-spectrometry domain and embed it visually in our system. Our exploration tool can treat single liquid-chromatography/mass-spectrometry data sets as well as data acquired using multi-dimensional protein identification technology. For efficiency purposes we perform a peak-preserving data resampling and multiresolution hierarchy generation prior to visualization.

CR Categories: I.3.7 [Computer Graphics]: Three-dimensional Graphics and Realism; I.3.8 [Computer Graphics]: Applications

Keywords: interactive visual exploration, hierarchical data representation, visualization in bioinformatics, proteomics

1 INTRODUCTION

While the last decade was dominated by the study of genomes (genomics), the past few years have seen tremendous progress in the study of gene expression, that is, in transcriptomics and especially proteomics. The term proteome was first used to describe the set of proteins encoded by the genome [22]. Nowadays, proteomics refers to the study of the function of all expressed proteins [20]. One of the main challenges in proteomics is the analysis of differential protein expression or, more precisely, the identification of proteins whose encoding genes are differentially expressed. Its significance in biology and medicine is evident. For example, in order to understand how diseases affect organisms, one can compare the expression in healthy and diseased cells.

For protein identification, a protein consisting of a long chain of amino acids can be digested into many smaller peptides, which are separated using liquid chromatography and measured using mass spectrometry. The measured data is analyzed and matched against existing protein sequence databases. When using multi-dimensional protein identification technology (MudPIT) [21], the liquid chromatography method is applied under multiple different conditions in order to obtain data even from proteins whose properties are hard to capture. The multiple results need to be merged during the analysis step.

For differential protein expression analysis, data from different cell populations (e. g., diseased vs. healthy) or from one cell population acquired under different conditions need to be compared. Protein identification is used to register the data sets and to identify differentially expressed proteins. More information on the relevant biological background in the field of proteomics is given in Section 2.

We present an interactive visualization tool whose goal is to aid protein identification and differential protein expression analysis. Our tool displays data from liquid chromatography and mass spectrometry in a three-dimensional space. Traditionally, a two-dimensional plot of the liquid-chromatography result and a large number of two-dimensional plots of the mass-spectrometry results are examined independently. The integration of data from both techniques allows for a better understanding of the data set in its entirety, quantitative depiction of expression ratios on a global scale, and easy visual detection of data acquisition errors.

Our tool also supports the visualization of data acquired using multi-dimensional protein identification technology. Each fraction can be displayed individually or in a merged setup. In the merged setup, each peptide (or its fragments) is represented by its maximum intensity peak, while color coding indicates the condition under which the maximum intensity has been observed. The merged setup requires a registration of the individual fractions prior to visualization, as liquid chromatography-mass spectrometry experiments are not precisely reproducible.
Registration is also required prior to differential expression computation. Differential expression of a test vs. a control data set is computed and displayed over the liquid chromatography-mass spectrometry domain. Thus, differential expression visualization can naturally be integrated into our visual exploration tool. Our tool is described in detail in Section 5. In Section 6, we describe scenarios in which our visualization tool is used for error detection, protein identification, and differential protein expression analysis.

Depending on the experimental setup, data sets may be quite large. Moreover, the data is unstructured in one of the dimensions and has non-equidistant spacing in the second dimension. In order to obtain an efficient visualization at interactive frame rates, we resample the data prior to visualization. Our resampling method ensures that maximum intensity peaks are maintained. After resampling, we build a hierarchical data scheme, which allows for multiresolution visualization of the data set. The data preparation is described in Section 4.

2 BACKGROUND

2.1 Protein Identification

Proteins consist of large chains of amino acids, while smaller chains of amino acids form peptides. A protein is defined by its sequence of amino acids. A common technique to identify a protein is to cut the protein into peptides or even fragments of peptides, identify the fragments, and determine the amino acid sequence of the protein from the fragments. There are several ways of performing the individual steps. For our experiments, we have decided to use liquid chromatography (LC) followed by mass spectrometry (MS) or even multi-dimensional protein identification technology (MudPIT) [21]. It has been demonstrated that LC/MS-based methods are very powerful and in certain aspects superior to other approaches such as 2D electrophoresis, see for example [12]. In particular, LC/MS-based methods are capable of capturing both cell proteins and membrane proteins and seem to perform especially well for the latter.

Figure 2 illustrates the individual processing steps leading to protein identification. In the first step of our processing pipeline, proteins are digested by the enzyme trypsin. One can even predict where a protein will be cut. For example, trypsin cuts after lysine and arginine (unless proline follows). We obtain the peptides of a protein.

In order to examine the peptides individually, we need to separate them. Peptide separation is done by liquid chromatography (LC). A liquid containing the peptides is forced through a column (loading). The column contains a substrate that binds the peptides. Afterwards, the peptides are washed out of the column using a water-acetonitrile mixture (eluting). The weaker a peptide is bound, the faster it gets washed out. Thus, peptides can be separated by their binding properties (hydrophobicity).

The masses of the separated peptides can be determined individually using mass spectrometry (MS). Mass spectrometry is a technique that separates ions by their mass-to-charge ratios (m/z-ratios). Thus, we need to ionize the peptides. Different approaches for ionization exist. Among the most popular are electrospray ionization (ESI) [5] and matrix-assisted laser desorption ionization (MALDI) [8]. When using electrospray ionization one may have to normalize the results due to decreasing ionization. After ionization the molecules are accelerated and handed to the mass analyzer. The mass analyzer uses electric or magnetic fields (quadrupole) to deflect the charged particles, while the kinetic energy imparted by motion gives the particles inertia dependent on their mass. The mass analyzer steers the particles to a detector based on their m/z-ratio. The detector measures intensity in counts per second. The result is typically displayed in a two-dimensional graph, where intensity is shown over mass or m/z-ratio, respectively, see Figure 3.

Figure 3: Mass spectrometry outputs intensity values over m/z-ratios. Intensity peaks represent the amount of most frequently present peptides.

For the final identification step, the fragment pattern of each peptide is matched against patterns in a database (e. g., Mascot [13]) to determine the amino acid sequence.

One problem of the described LC-MS approach is that the simultaneous digest of the protein mixture results in a highly complex collection of thousands of peptides. Thus, a single LC step may not be capable of separating all of them. This problem is typically solved by dividing the peptides into several fractions using multiple LC iterations. Each fraction is then analyzed using mass spectrometry. This is referred to as multi-dimensional protein identification technology (MudPIT) [21].

Figure 2: Protein identification pipeline: (i) Proteins are cleaved to form peptides. (ii) Peptides are separated using liquid chromatography: reversed-phase trap (RP) column is loaded with sample peptides and eluted using acetonitrile gradient. (iii) Masses of separated peptides are determined using mass spectrometry: ionize peptides, deflect ions using quadrupole, and detect ions. (iv) Analyze data using 3D visual exploration and quantification tool. (v) Match analysis results with database.

2.2 Structure of the Data

The data coming out of a liquid chromatograph is a function over time. A detector measuring the peptides leaving the column detects them at discrete points in time $t_1, \ldots, t_n \in \mathbb{R}$. The points in time $t_i$, $i = 1, \ldots, n$, are not distributed equidistantly. The number $n$ can be in the range of many thousands.

The data coming out of the mass spectrometer is a function over the m/z-ratio. The intensity is measured in counts per time and stored as an intensity list for discrete m/z-ratios $m_1, \ldots, m_p \in \mathbb{R}$. The m/z-ratios $m_j$, $j = 1, \ldots, p$, are not distributed equidistantly either. The number $p$ depends on the experimental setup and can be in the range of tens of thousands. Figure 3 shows a plot of such an intensity list. The plot exhibits several high peaks.

Instead of generating such two-dimensional graphs for each point in time $t_i$, $i = 1, \ldots, n$, we use a three-dimensional setup, where the intensity is shown as a heightfield over the dimensions time and m/z-ratio. Unfortunately, the m/z-ratios $m_1, \ldots, m_p$ vary from one point in time $t_i$ to the subsequent point in time $t_{i+1}$, and even the number $p$ of values in the m/z-ratio dimension varies significantly. Thus, when looking at a two-dimensional domain with the dimensions being m/z-ratio and time, data positions are scattered in one dimension and non-equidistant in the other dimension.
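To make this data layout concrete, the following is a minimal sketch of how one LC-MS run could be held in memory. The `Scan` container, its field names, the helper `mz_range`, and the toy numbers are illustrative assumptions and are not taken from our system; the sketch also assumes that the m/z values of each scan are sorted in ascending order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Scan:
    """One mass spectrum recorded at retention time t (hypothetical container)."""
    t: float                # retention time; spacing between scans is not uniform
    mz: List[float]         # m/z positions, ascending; they differ from scan to scan
    intensity: List[float]  # measured intensity for each m/z position

    def __post_init__(self) -> None:
        if len(self.mz) != len(self.intensity):
            raise ValueError("mz and intensity must have the same length")


def mz_range(scans: List[Scan]) -> Tuple[float, float]:
    """Smallest and largest m/z value over all non-empty scans."""
    lows = [s.mz[0] for s in scans if s.mz]
    highs = [s.mz[-1] for s in scans if s.mz]
    return min(lows), max(highs)


# Toy run: both the m/z positions and their number p change between time steps.
run = [
    Scan(t=0.00, mz=[300.1, 300.7, 302.4], intensity=[10.0, 250.0, 40.0]),
    Scan(t=0.13, mz=[300.3, 301.9], intensity=[12.0, 900.0]),
    Scan(t=0.31, mz=[299.8, 300.6, 301.2, 303.0], intensity=[5.0, 80.0, 30.0, 7.0]),
]
samples_per_scan = [len(s.mz) for s in run]
```

Under the same assumptions, a MudPIT data set would simply be a list of such runs, one per fraction.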
Figure 4 sketches the discrete data positions for LC-MS in the two-dimensional domain. In terms of acquired data, MudPIT can be regarded as multiple LC-MS. A MudPIT data set consists of several data sets of the structure shown in Figure 4, each of which represents one fraction. Thus, MudPIT adds another dimension.

Figure 4: Structure of LC-MS data: Data value positions are scattered in m/z-dimension and non-equidistant in t-dimension.

2.3 Driving Biological Questions

In the following, we formulate the driving biological questions that we intend to answer using our visual exploration tool.

Since liquid chromatography operations are rather tricky to execute and many things can go wrong, the first thing that biologists need to know before they proceed with an extensive data analysis is whether the data is actually correct.

Question 1: Can visualization provide an immediate and obvious check for correctness of the data?

Typically, LC-MS and MudPIT data are explored by looking at intensity plots over m/z-ratio for many points in time (many peptides) and various fractions (cf. Figure 3). Looking at many or even all of these mass spectrometry plots individually is a very tiring process. It also does not provide an intuitive understanding of the entire data on a global scale. When looking at a selected set of mass spectrometry plots, it is unknown whether the shown peaks are significant, i. e., among the ones with highest intensity, or not.

Question 2: Can visualization provide a global understanding of the data while still displaying full (quantitative) information about position and size of intensity peaks? Does such a visualization allow for interactive data exploration even for large data sets without losing detail information?

Most important for a visualization tool in terms of biological insight is not to display one data set, but to compare different data sets, where each data set represents an experiment taken under certain conditions.

Question 3: Can visualization display information that allows for differential protein expression analysis?

3 RELATED WORK

Visualization of LC-MS data is commonly restricted to a simple plotting of mass spectrometry functions for each peptide (cf. Figure 3). To allow for a global understanding of the data, a visualization method over a two-dimensional domain (like the one sketched in Figure 4) has been introduced by Li et al. [9]. The output of the visualization method, however, is restricted to two-dimensional images, where intensity is displayed using different shades of gray in a linear or logarithmic mapping. The images clearly exhibit the positions of intensity peaks, but the different shades of gray are hard to distinguish, which limits the perception of the actual intensity values.

In the work by Li et al. [9], the resolution of the resulting images can be chosen by the user. To deal with the challenges of not operating on a Cartesian grid (cf. Section 2.2), the authors averaged the intensities collected within each pixel.
Due to the averaging step, intensity peaks may not have been displayed with their maximum intensity but with a much lower value. Significant intensity peaks may have been reduced to smaller peaks that do not seem to be noteworthy. From a biological point of view, one would rather maintain the actual maximum intensity of a peak, even if it gets slightly shifted. The shift is bounded by the value 1/ 2 of the pixel resolution. Such maximum-preserving operations have been used in other contexts, e. g., [17].

To deal with large data sets, multiresolution methods are commonly applied in visualization. Many different approaches exist for two-dimensional and even higher-dimensional domains [4, 7, 10, 14]. A common technique to build multiresolution hierarchies is the use of wavelets [18]. By storing data values explicitly for the lowest resolution only and computing higher-resolution representations by successively adding details, wavelet-based hierarchies do not require any additional storage space (compared to the storage space of the original data at highest resolution). In the work by Andersson et al. [1], one-dimensional wavelets have been applied to LC data for data reduction and de-noising.

Differential expression data has been visualized in the context of functional genomics. Gene expression determined using microarray technology is displayed using two-dimensional [6, 19] or three-dimensional scatter plots [16]. In proteomics, differential expression can be visualized by (differential) intensities over the two dimensions time (from LC) and m/z-ratio (from MS). When using 2D electrophoresis, the proteins of a sample are separated and identified by 2D orthogonal displacement. Visualizations using 2D images are common and have been used for differential expression display in [2]. Registration of such 2D images [3] allows for a subsequent fusion of various images [11].
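The difference between averaging and maximum-preserving binning discussed earlier in this section can be illustrated with a small sketch. The function `bin_intensities`, its parameters, the choice of 0.0 for empty bins, and the toy spectrum are assumptions made for illustration only; they describe neither the implementation of Li et al. [9] nor our own.

```python
import math
from typing import List, Sequence


def bin_intensities(mz: Sequence[float], intensity: Sequence[float],
                    mz_min: float, bin_width: float, n_bins: int,
                    mode: str = "max") -> List[float]:
    """Resample one scattered spectrum onto n_bins equidistant m/z bins.

    mode="max" keeps the largest intensity per bin (peak-preserving),
    mode="mean" averages all intensities that fall into a bin.
    Empty bins are set to 0.0, which is an assumption of this sketch.
    """
    buckets: List[List[float]] = [[] for _ in range(n_bins)]
    for m, value in zip(mz, intensity):
        k = math.floor((m - mz_min) / bin_width)
        if 0 <= k < n_bins:          # samples outside the grid are ignored
            buckets[k].append(value)
    if mode == "max":
        return [max(b) if b else 0.0 for b in buckets]
    if mode == "mean":
        return [sum(b) / len(b) if b else 0.0 for b in buckets]
    raise ValueError("mode must be 'max' or 'mean'")


# A narrow peak of height 1000 flanked by two low samples in the same bin.
mz = [300.0, 300.2, 300.4, 301.1]
intensity = [10.0, 1000.0, 10.0, 50.0]
as_max = bin_intensities(mz, intensity, mz_min=300.0, bin_width=1.0, n_bins=2, mode="max")
as_mean = bin_intensities(mz, intensity, mz_min=300.0, bin_width=1.0, n_bins=2, mode="mean")
```

In this toy case the averaged bin reports 340 for a peak whose true height is 1000, while the maximum-preserving variant keeps the height and only moves the peak to the bin position.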
4 DATA PREPARATION

4.1 Resampling

Since LC-MS data is scattered in the m/z-dimension, a direct data visualization method would have to apply scattered data interpolation techniques or domain triangulation (e. g., Delaunay triangulation). In their general form, neither approach accounts for the non-scattered structure in the time-dimension. Moreover, we observed that they are too computationally expensive for practical purposes when applied to large data sets.

To allow for an efficient visualization with an acceptable amount of preprocessing, we decided to perform a resampling step. Since the time-dimension is already structured, we only apply a one-dimensional resampling in the m/z-direction. We generate a structured domain with non-equidistant samples in the t-direction and equidistant samples in the m/z-direction.

Resampling should be done such that all intensity peaks are preserved (with their maximum intensity). The only way to fulfill this condition is to use a sufficiently high resampling rate. If the resolution res of the mass spectrograph is known, it is best to resample with rate 1/(2 res). If the resolution of the mass spectrograph is unknown, it can be estimated by determining the minimum distance between any two measurements.

Obviously, we generate a lot of redundant information. For displaying data visually on a computer screen, there is no need to go beyond the screen's resolution. We reduce the generated data by merging adjacent data values. However, we still want to be able to retrieve the highest-resolution data for display when zooming into regions of interest and when outputting data to peak quantification tools. We therefore generate a hierarchical data representation that allows for multiresolution visualization and data access.

4.2 Hierarchical Data Representation

A hierarchical data representation scheme stores a data set at various resolutions. For downsampling, the sample positions of resolution $L^n$ are split into two groups: the ones that belong to the next coarser resolution $L^{n-1}$ (even vertices) and the ones that belong to $L^n \setminus L^{n-1}$ (odd vertices). The values at the even vertices are computed from the values at the sample positions of $L^n$. When using wavelet-based techniques, the values at the odd vertices store the difference between the values at the even vertices and the values at the sample positions of $L^n$. Thus, resolution $L^n$ can be reconstructed from resolution $L^{n-1}$ at any time by adding the differences. Only resolution $L^{n-1}$ and the difference set $L^n \setminus L^{n-1}$ need to be stored, which is the same amount of data storage as needed to store $L^n$. Thus, setting up a multiresolution hierarchy using a wavelet scheme does not require additional data storage.

The simplest and, thus, most widely used wavelets are Haar wavelets, see [18]. One-dimensional Haar wavelets compute the values $f_i^{n-1} \in L^{n-1}$ at the even vertices by averaging the values $f_i^n$ and $f_{i+1}^n$ at the respective sample point pairs of $L^n$. The values $f_{i+1}^{n-1}$ at the odd vertices store the differences $f_i^n - f_i^{n-1}$.

We adopt the ideas from wavelet-based multiresolution hierarchy generation. However, averaging data values would cause intensity peaks to lose their maximum intensity or even to vanish. To maintain the maximum intensity of all peaks, we set the values at the even vertices to $f_i^{n-1} = \max(f_i^n, f_{i+1}^n)$. The values at the odd vertices store the differences $f_{i+1}^{n-1} = f_i^{n-1} - \min(f_i^n, f_{i+1}^n)$. The sign of $f_{i+1}^{n-1}$ is used to indicate whether $f_i^n$ or $f_{i+1}^n$ was the larger value. Figure 5 shows an example for our peak-preserving multiresolution hierarchy generation.
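The peak-preserving scheme above can be sketched in a few lines. The function names, the flat-list layout, and in particular the sign convention (a non-negative detail value means the left sample of a pair was the maximum, a negative one means the right sample was) are assumptions of this sketch; the text only states that the sign records which value was larger. The sketch further assumes integer counts and a sample count that is a power of two, so that reconstruction is exact.

```python
from typing import List, Tuple


def decompose(values: List[int]) -> Tuple[List[int], List[int]]:
    """One coarsening step: pairwise maximum plus signed max-min difference."""
    if len(values) % 2:
        raise ValueError("expected an even number of samples")
    coarse, detail = [], []
    for left, right in zip(values[0::2], values[1::2]):
        coarse.append(max(left, right))
        diff = abs(left - right)  # equals max - min, never negative
        # Assumed convention: the sign tells which sample of the pair was the maximum.
        detail.append(diff if left >= right else -diff)
    return coarse, detail


def reconstruct(coarse: List[int], detail: List[int]) -> List[int]:
    """Invert one coarsening step."""
    fine: List[int] = []
    for peak, d in zip(coarse, detail):
        other = peak - abs(d)
        fine.extend((peak, other) if d >= 0 else (other, peak))
    return fine


def build_hierarchy(values: List[int]) -> Tuple[List[int], List[List[int]]]:
    """Coarsen down to a single value; details[0] belongs to the finest level."""
    details = []
    level = list(values)
    while len(level) > 1:
        level, d = decompose(level)
        details.append(d)
    return level, details


signal = [3, 9, 4, 1]
root, details = build_hierarchy(signal)

# Rebuild the finest level from the coarsest value and the stored details.
level = root
for d in reversed(details):
    level = reconstruct(level, d)
stored_values = len(root) + sum(len(d) for d in details)
```

For this input the coarsest level holds the global maximum 9, and the stored values (one root value plus three detail values) number exactly as many as the original samples, which mirrors the storage argument made above.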

Figure 5: Hierarchical data representation: Multiresolution scheme is peak-preserving ($f_0^{n-2} = \max\{f_0^n, f_1^n, f_2^n, f_3^n\}$) and does not require additional storage space (stores only $[\{f_0^{n-2}\}; \{f_2^{n-2}\}; \{f_1^{n-1}, f_3^{n-1}\}]$).

5 VISUAL DATA EXPLORATION

5.1 Interactive Visualization System

We have developed a three-dimensional visualization system for interactive exploration of LC-MS data. We render heightfields over the two-dimensional resampled domain with dimensions time and m/z-ratio. The system allows for visualization of the entire data set on a global scale, see Figure 1(a), and zoomed views into regions of interest, see Figure 1(b). To fulfill the real-time requirements, the appropriate resolution is selected from the multiresolution data hierarchy described in the previous section.

The color scheme for the visualization can be changed interactively. A one-dimensional transfer function is used to map intensity values to RGB color values. Thus, in the general setting, the color of a peak indicates its intensity. Also, the material properties and the shading method can be chosen by the user. In terms of viewing, the system supports both parallel and perspective projection.

To filter, i. e., select and display only the most significant peaks, a thresholding method is provided that culls all peaks beneath a chosen threshold. The system also allows for peak labeling. The labels show the peaks' properties (e. g., intensity, m/z-value, time-value) or, if known, the name of the respective peptide. Figures 1 and 6 illustrate certain aspects of the system's functionality.

Figure 6: Options for interactive visualization system (cf. Figure 1): (a) Changing color, material properties, and shading method. (b) Thresholding for peak selection combined with peptide labeling of significant peaks.

5.2 Peak Quantification

While smooth shading, diffuse and specular material, and perspective projection are typically used to render the heightfield in an easy-to-perceive volumetric fashion, the other options are offered to serve specific tasks. If we look at the heightfield from a bird's-eye view using a parallel projection, a two-dimensional image is rendered, where color encodes intensity with respect to the chosen transfer function. We use flat shading without diffuse or specular lighting so as not to distort the color. Figure 7(a) shows such a bird's-eye view.

The rendered image can be written to a file using common image formats. The resolution for the output can be chosen arbitrarily from the multiresolution data hierarchy. To maximize the accuracy of further processing steps, one would typically choose the highest resolution. Also, the output can be the entire data set as well as a selected region.

Since most image formats use a rather limited number of bits for storing the data, the high-precision representation used for visualization must be converted to a lower-precision representation for the exported images. As the intensity peaks tend to exhibit very high intensities, it is often beneficial to convert the data using a logarithmic rather than a linear mapping. Both conversion options are supported, though. Figure 7(b) shows an exported image using a logarithmic scale of gray values.

Existing quantification tools can be used to measure the spots in the two-dimensional images. The quantified values are used to match with existing databases, which allows us to identify peptides and, finally, proteins. We have used Delta2D for quantification and Mascot [13] for database look-ups.

Figure 7: Peak quantification: Bird's-eye view of heightfield (a) is outputted to 2D image (b) using logarithmic scale. Existing software is used for peak quantification on spots in 2D image.

5.3 Differential Protein Expression

For differential protein expression analysis, data is acquired by experiments under changing conditions or from different cell populations. Of interest is the change in expression from one experimental setup to another. Typically, one experimental setup serves as control data, while the other setups provide the test data. Each experiment can be visualized and analyzed individually using our interactive visualization tool described above. To visualize the differential expression of test vs. control data, we use the same three-dimensional setup as before and display differential expression in terms of heightfields over the dimensions time (LC) and m/z-ratio (MS).

To compute the differential expression, we have to determine differences in peak intensity. Unfortunately, LC-MS measurements are reproducible only within a certain tolerance, and we observed that this tolerance is rather large. Intensity peaks for one and the same peptide may shift significantly when executing an experiment multiple times. Thus, prior to differential expression computation, we need to register (or align) test and control data. The registration step warps the domain of the test data set onto the domain of the control data set. Figure 8 illustrates the shift by visualizing test and control data as two heightfields over the same domain. The positions of the most dominant peaks of the two heightfields should coincide but, instead, exhibit a severe shift.

Figure 8: Visualization of test (green) and control (red) data as two heightfields over same domain without registration. Locations of most dominant peaks are supposed to coincide but clearly exhibit a shift.

The warping transformation is computed using a landmark approach. In both data sets we identify the most significant intensity peaks. A one-to-one mapping of the test intensity peaks to the control intensity peaks is done by hand for a few intensity peaks. These are the so-called landmarks. The user intervention is moderate, since only a few landmarks are required. (In our experiments, we used between five and 30 landmarks.) Our visual representation of the data makes it very easy to match the peaks even for nonprofessionals. We triangulate the domains using the landmarks. The warping transformation is linear within each triangle.

After the warping step, we can compute the differential expression by mere subtraction of the intensity values. Figure 9 shows an example. We merge the two heightfields (test vs. control) by taking the maximum intensities and color the resulting heightfield green in case of downregulation and red in case of upregulation. (The resulting heightfield here represents the maximum of the two intensities, not the difference.)

5.4 Multi-dimensional Protein Identification Technology

To visualize MudPIT data, we generate a heightfield for each fraction. We provide a slider for the user to switch between the heightfield renderings of the individual fractions. Figure 10 shows the visualization of three fractions exhibiting changes in intensity.

MudPIT data also suffers from results that are not precisely reproducible. Thus, the locations of intensity peaks representing peptides present in subsequent fractions do not coincide. When trying to integrate MudPIT data from various fractions into one setup, we need to perform a registration step again. We proceed as described in the previous section. Figure 11 shows the integrated visualization of various fractions after registration. Each fraction gets assigned one color. Thus, colors indicate in which fraction intensity peaks are highest.

Using the registered fractions, we can also interpolate between the intensities of subsequent fractions. Instead of using a slider to switch between visualizations of different fractions as in Figure 10, we generate an animation over the fraction dimension, where intensity values are animated over fractions in the order of increasing ammonium chloride concentration. (A movie of an animation accompanies the paper.) For smooth transitions we use linear interpolation of the heightfields. The animation allows for an even better perception of intensity changes with changing fractions.
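The landmark-based registration and the subsequent subtraction described in Section 5.3, which Section 5.4 reuses to align fractions, can be sketched as follows. This is a minimal illustration under stated assumptions: the triangulation is taken as given, a point is warped with the barycentric coordinates of the landmark triangle that contains it, and the two intensity grids are assumed to be already sampled over the same registered domain. The function names and the color label "none" for unchanged values are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (time, m/z)
Triangle = Tuple[Point, Point, Point]    # three landmark positions


def barycentric(p: Point, tri: Triangle) -> Tuple[float, float, float]:
    """Barycentric coordinates of p with respect to triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate landmark triangle")
    a = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    b = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return a, b, 1.0 - a - b


def warp_point(p: Point, test_tri: Triangle, control_tri: Triangle) -> Point:
    """Map p from the test domain to the control domain, linearly per triangle."""
    a, b, c = barycentric(p, test_tri)
    (u1, v1), (u2, v2), (u3, v3) = control_tri
    return (a * u1 + b * u2 + c * u3, a * v1 + b * v2 + c * v3)


def differential_expression(test: List[List[float]], control: List[List[float]]):
    """Return the merged heightfield (maximum), the difference, and a color label."""
    height = [[max(t, c) for t, c in zip(tr, cr)] for tr, cr in zip(test, control)]
    diff = [[t - c for t, c in zip(tr, cr)] for tr, cr in zip(test, control)]
    # Red marks upregulation, green marks downregulation; "none" is an assumed label.
    color = [["red" if d > 0 else "green" if d < 0 else "none" for d in row]
             for row in diff]
    return height, diff, color


# Three matched landmarks: the test peaks are shifted and stretched.
test_landmarks: Triangle = ((0.0, 0.0), (10.0, 0.0), (0.0, 10.0))
control_landmarks: Triangle = ((1.0, 0.0), (11.0, 0.0), (1.0, 12.0))
warped = warp_point((5.0, 5.0), test_landmarks, control_landmarks)

height, diff, color = differential_expression([[5.0, 0.0], [2.0, 7.0]],
                                              [[3.0, 0.0], [4.0, 7.0]])
```

Resampling the warped test intensities back onto the control grid is omitted here; a complete pipeline would need that step between warping and subtraction.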

Figure 9: Visualization of differential protein expression over dimensions time (LC) and m/z-ratio (MS): Color green indicates downregulation, color red upregulation. Differential expression computation requires prior data registration by domain warping.

Figure 10: Visualization of three fractions obtained by using MudPIT. Slider allows to interactively switch between fractions.

Figure 11: Integrated visualization of various fractions from MudPIT: Colors indicate in which fraction each peak has maximum intensity. The color scheme assigns colors cyan, white, magenta, yellow, red, green, and blue to fractions in the order of increasing ammonium chloride.

6 APPLICATION SCENARIO, RESULTS, AND DISCUSSION

To test our system, we have applied our methods to data acquired from the human cell line SiHa, which is used to model cervical cancer [15]. The cell line was grown under normal conditions and showed no perturbation. The MudPIT experiment was done using liquid chromatography (LC) with a reversed-phase trap (RP) column and mass spectrometry (MS) with electrospray ionization (ESI). Eleven fractions with 0 mM, 20 mM, 40 mM, 60 mM, 80 mM, 100 mM, 150 mM, 200 mM, 300 mM, 500 mM, and 900 mM ammonium chloride were used. The retention times during liquid chromatography were in the range from 0 to 85 minutes. The m/z-ratios measured by mass spectrometry were in the range of 300 to [...]. The measured intensities were as high as $10^9$ counts per second.

In a first step, we visualized the data as heightfields over the time-dimension measured by liquid chromatography and the mass-over-charge-dimension measured by mass spectrometry. Various visualizations are shown in Figures 1 and 6. By looking at the data from a bird's-eye view as in Figure 7(a), we detected some artifacts. Some of the intensity peaks are not separated but form streaks in the time-dimension. The streaks indicate problems with the liquid chromatograph, which did not separate the peptides properly. Using our visualization tool, the error becomes obvious and is detected immediately in an intuitive way. Consequently, the experiments would have needed to be repeated to gain any significant biological insight from further analysis.

Despite the data acquisition error, the bird's-eye view of our heightfield has been written to a two-dimensional image, see Figure 7(b). We used the highest resolution and logarithmic mapping. Since the time-dimension of the data set has a resolution in the range of the resolution of modern computer screens, the multiresolution hierarchy has only been generated in the m/z-dimension using a one-dimensional scheme. The two-dimensional image can be used to quantify the intensity spots and to match the results with the values obtained from an existing database. This procedure allows for identification of the peptides with highest occurrence. The identified peptides can be labeled with their names.

For the MudPIT data, we can interactively explore the various fractions, as shown in Figure 10 and the video. After registration of the fractions, we integrate the various fractions into one heightfield. The result is shown in Figure 11. From there, we could proceed with the quantification step as done for the LC-MS data.
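The conversion of high-precision intensities to a limited number of gray values for image export, mentioned above and in Section 5.2, can be sketched as follows. The exact mapping used by our system is not specified in the text; the `log1p`-based normalization, the function name `to_gray`, and the default of 256 gray levels are assumptions made for this illustration.

```python
import math


def to_gray(intensity: float, max_intensity: float,
            mode: str = "log", levels: int = 256) -> int:
    """Map a non-negative intensity to an integer gray value in [0, levels - 1]."""
    if max_intensity <= 0:
        raise ValueError("max_intensity must be positive")
    x = min(max(intensity, 0.0), max_intensity)   # clamp to the valid range
    if mode == "linear":
        fraction = x / max_intensity
    elif mode == "log":
        # log1p keeps an intensity of 0 at gray value 0 (assumed normalization).
        fraction = math.log1p(x) / math.log1p(max_intensity)
    else:
        raise ValueError("mode must be 'linear' or 'log'")
    return round(fraction * (levels - 1))


# A peak three orders of magnitude below the maximum of 1e9 counts per second.
gray_linear = to_gray(1e6, 1e9, mode="linear")
gray_log = to_gray(1e6, 1e9, mode="log")
```

With a linear mapping such a peak falls into the darkest gray level and becomes invisible in the exported image, whereas the logarithmic mapping keeps it clearly distinguishable, which is the motivation given for preferring the logarithmic scale.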

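The differential expression display of Figure 9 assigns green to downregulation and red to upregulation over the registered LC-MS domain. The sketch below shows one plausible way to obtain such a map. It assumes a per-sample log2 ratio of test over control intensity, which is a common choice but not necessarily the measure used by our system; `log_ratio`, `regulation_color`, `floor`, and `saturation` are hypothetical names and parameters.

```python
import math
from typing import List, Tuple

Heightfield = List[List[float]]


def log_ratio(control: Heightfield, test: Heightfield, floor: float = 1.0) -> Heightfield:
    """Per-sample log2(test / control) for two registered heightfields.

    Assumption: intensities below `floor` are clamped to `floor` to avoid
    division by zero and to damp ratios in noise-dominated regions.
    Positive values mean upregulation, negative values downregulation.
    """
    if len(control) != len(test) or any(
        len(c_row) != len(t_row) for c_row, t_row in zip(control, test)
    ):
        raise ValueError("control and test must share one grid (register them first)")
    return [
        [math.log2(max(t, floor) / max(c, floor)) for c, t in zip(c_row, t_row)]
        for c_row, t_row in zip(control, test)
    ]


def regulation_color(value: float, saturation: float = 3.0) -> Tuple[int, int, int]:
    """Map a log ratio to RGB: red for up, green for down, black for no change."""
    strength = min(abs(value) / saturation, 1.0)
    level = round(255 * strength)
    return (level, 0, 0) if value > 0 else (0, level, 0)
```

The log ratio treats a fourfold increase and a fourfold decrease symmetrically (+2 and -2), which keeps the red and green color ramps comparable.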
After the identification step, we looked into differential protein expression to analyze changes under varying conditions. We defined the control and test data, registered them, and explored the differential expression using, once again, the 3D visualization setup. The resulting heightfield is shown in Figure 9. In an intuitive way, we could visually detect significant up- and downregulation of certain peptides, which had been identified beforehand. This new insight can help to formulate biological hypotheses.

It remains to discuss whether the driving biological questions formulated in Section 2.3 have been answered. Question 1: We have demonstrated that our visualization tool is, indeed, capable of immediately exposing data acquisition errors in an intuitive and obvious way. Question 2: Our examples also show that our visual system allows for a global understanding of the entire data set as well as exploration of regions of interest. We were able to achieve interactive frame rates when displaying the entire data set by generating a multiresolution hierarchy. The hierarchical data representation retains all data, including all details, without requiring additional storage space. Question 3: The scenario presented in this section describes in detail the individual steps toward differential protein expression analysis. It documents how our visualization system helps to accomplish several important tasks.

7 CONCLUSIONS AND FUTURE WORK

We have presented an interactive visualization system for the analysis of differential protein expression. The system renders heightfields of liquid-chromatography/mass-spectrometry data. It uses data resampling and hierarchical data representation to meet real-time requirements even for large data sets. The visualization tool also allows for integration of data obtained by multi-dimensional protein identification technology. Differential expression for control vs.
test data is computed and visualized after registration via domain warping. Our system provides an intuitive understanding of the data on a global scale and allows for detailed data exploration. Data acquisition errors become easy to detect. Moreover, the visualization tool supports protein/peptide identification. It also provides an intuitive means to match intensity peaks of different data sets, which can be used for data registration purposes.

In terms of future work, we plan to integrate a quantification step based on the highest-resolution three-dimensional representation into our system, as the current conversion into a standard image format loses precision. Moreover, we would like to incorporate the simultaneous visualization of multiple test data sets for differential protein expression analysis.

REFERENCES

[1] F. O. Andersson, R. Kaiser, and S. P. Jakobsson. Data preprocessing by wavelets and genetic algorithms for enhanced multivariate analysis of LC peptide mapping. Journal of Pharmaceutical and Biomedical Analysis, 34.
[2] J. Bernhardt, K. Buttner, C. Scharf, and M. Hecker. Dual channel imaging of two-dimensional electropherograms in Bacillus subtilis. Electrophoresis, 20(11).
[3] J. Bernhardt, J. Weibezahn, C. Scharf, and M. Hecker. Bacillus subtilis during feast and famine: visualization of the overall regulation of protein synthesis during glucose starvation by proteome analysis. Genome Res., 13(2).
[4] P. Cignoni, C. Montani, E. Puppo, and R. Scopigno. Multiresolution modeling and visualization of volume data. IEEE Transactions on Visualization and Computer Graphics, 3(4).
[5] J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrospray ionization for mass spectrometry of large biomolecules. Science, 246(64).
[6] D. R. Gilbert, M. Schroeder, and J. van Helden. Interactive visualization and exploration of relationships between biological objects. Trends Biotechnol., 18(12).
[7] R. Grosso, C. Lürig, and T. Ertl.
The multilevel finite element method for adaptive mesh optimization and visualization of volume data. In R. Yagel and H. Hagen, editors, Proceedings of IEEE Conference on Visualization 1997. IEEE Computer Society Press.
[8] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding daltons. Anal Chem, 60.
[9] X.-J. Li, P. G. A. Pedrioli, J. E. J, D. Martin, E. C. Yi, H. Lee, and R. Aebersold. A tool to visualize and evaluate data obtained by liquid chromatography/electrospray ionization/mass spectrometry. Anal Chem, 76.
[10] L. Linsen, V. Pascucci, M. A. Duchaineau, B. Hamann, and K. I. Joy. Wavelet-based multiresolution with n 2 subdivision. Journal on Computing, 71(1+2).
[11] S. Luhn, M. Berth, M. Hecker, and J. Bernhardt. Using standard positions and image fusion to create proteome maps from collections of two-dimensional gel electrophoresis images. Proteomics, 3(7).
[12] D. M. Maynard, J. Masuda, X. Yang, J. A. Kowalak, and S. P. Markey. Characterizing complex peptide mixtures using a multidimensional liquid chromatography-mass spectrometry system: Saccharomyces cerevisiae as a model system. Journal of Chromatography B, 810(1):69-76.
[13] D. N. Perkins, D. J. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18).
[14] D. Pinskiy, E. Brugger, H. R. Childs, and B. Hamann. An octree-based multiresolution approach supporting interactive rendering of very large volume data sets. In H. Arabnia, R. Erbacher, X. He, C. Knight, B. Kovalerchuk, M. Lee, Y. Mun, M. Sarfraz, J. Schwing, and H. Tabrizi, editors, Proceedings of the 2001 International Conference on Imaging Science, Systems, and Technology (CISST 2001), Volume 1. Computer Science Research, Education, and Applications Press (CSREA), Athens, Georgia.
[15] J. T. Prince, M. W. Carlson, R. Wang, P. Lu, and E. M. Marcotte.
The need for a public proteomics repository (commentary). Nature Biotechnology, 22.
[16] N. Shah, V. Filkov, B. Hamann, and K. I. Joy. GeneBox: interactive visualization of microarray data sets. In F. Valafar and H. Valafar, editors, International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS 03). Computer Science Research, Education, and Applications Press (CSREA), Athens, Georgia.
[17] Y. Shinagawa and T. L. Kunii. Unconstrained automatic image matching using multiresolutional critical-point filters. IEEE Trans. Pattern Anal. Mach. Intell., 20(9).
[18] E. J. Stollnitz, T. D. DeRose, and D. H. Salesin. Wavelets for Computer Graphics: Theory and Applications. The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling, Brian A. Barsky (series editor), Morgan Kaufmann Publishers, San Francisco, U.S.A.
[19] C. Tang, L. Zhang, and A. Zhang. Interactive visualization and analysis for gene expression data. In Hawaii International Conference on System Sciences.
[20] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422.
[21] M. P. Washburn, D. Wolters, and J. R. Yates III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 19.
[22] M. R. Wilkins, C. Pasquali, R. D. Appel, K. Ou, O. Golaz, J. C. Sanchez, J. X. Yan, A. A. Gooley, G. Hughes, I. Humphery-Smith, K. L. Williams, and D. F. Hochstrasser. From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology, 14:61-65, 1996.