Visual Analysis of Gel-free Proteome Data


Lars Linsen, Julia Löcherbach, Matthias Berth, Dörte Becher, Jörg Bernhardt

Abstract

We present a visual exploration system supporting protein analysis when using gel-free data acquisition methods. The data to be analyzed are obtained by coupling liquid chromatography (LC) with mass spectrometry (MS). LC-MS data are non-equidistantly distributed in the time dimension (measured by LC) and scattered in the mass-to-charge ratio dimension (measured by MS). We describe a hierarchical data representation and visualization method for large LC-MS data. Based on this visualization we have developed a tool that supports various data analysis steps. Our visual tool provides a global understanding of the data, intuitive detection and classification of experimental errors, and extensions to LC-MS/MS, LC/LC-MS, and LC/LC-MS/MS data analysis. Due to the presence of randomly occurring rare isotopes within the same protein molecule, several intensity peaks may be detected that all refer to the same peptide. We have developed methods to unite such intensity peaks. This deisotoping step is visually documented by our system, such that misclassification can be detected intuitively. For differential protein expression analysis we compute and visualize the differences in protein amounts between experiments. In order to compute the differential expression, the experimental data need to be registered. For registration we perform a non-rigid warping step based on landmarks. The landmarks can be assigned automatically using protein identification methods. We evaluate our methods by comparing protein analysis with and without our interactive visualization-based exploration tool.

Index Terms

Interactive visual exploration, hierarchical data representation, visualization in bioinformatics, proteomics.

I. INTRODUCTION

Proteomics is the study of the function of proteins [29].
The goal of proteomics is to determine how much of which protein is present under which conditions. Thus, protein analyses include protein identification and qualitative analysis (which proteins are present in a given sample), quantitative analysis (how much of a protein is present in a given sample), and differential protein expression analysis (how protein expression differs under changing conditions). Differential protein expression experiments are typically performed for a test sample vs. a control sample. The computed differences indicate up- or down-regulation of certain proteins.

Lars Linsen and Julia Löcherbach are with the Department of Mathematics and Computer Science, Ernst-Moritz-Arndt-Universität Greifswald; Matthias Berth is with DECODON GmbH; Dörte Becher and Jörg Bernhardt are with the Department of Microbiology, Ernst-Moritz-Arndt-Universität Greifswald; all are located in Greifswald, Germany. E-mail addresses: linsen@uni-greifswald.de, {loeba,berth}@decodon.com, {dbecher,joerg.bernhardt}@uni-greifswald.de.

Various approaches exist to quantitatively measure the occurrence of proteins in a given sample. A common approach is to use 2D gel electrophoresis, where whole or intact proteins are separated in two orthogonal directions. This setup results in a 2D image, which can be analyzed using image processing approaches. A relatively new approach is to couple liquid chromatography (LC) with mass spectrometry (MS). The LC step is used to separate proteins (or parts thereof that are generated by a cleavage process), and the subsequent MS step is used to determine the masses of the individual parts, called peptides. More experimental background on the deployed methods is given in Section III.

The structure of the obtained LC-MS data is rather complicated. While the LC step leads to sample locations that are non-equidistantly distributed in the time dimension, the MS step creates scans for each time step with unstructured or scattered samples.
Thus, the visualization of LC-MS data is not as trivial as generating a 2D image. The first approaches to visualize LC-MS data in an accurate and interactive way have recently been introduced by us, Linsen et al. [17], and by de Corral and Pfister [7].

In this paper, we present a system for visual exploration of LC-MS data and its extensions to support interactive protein analysis. For visualization purposes, we create a hierarchical representation of LC-MS data. The hierarchy generation follows a one-dimensional resampling step and ensures that intensity peaks are preserved throughout the different levels of resolution. The hierarchical data representation and visualization methods are described in Section IV.

When representing the LC-MS data over the described two-dimensional domain, the peptides (i.e., parts of a cleaved protein) correspond to individual intensity peaks. However, due to the presence of different isotopes of the same chemical element, one kind of peptide may be represented by several intensity peaks. To determine the peptide's and, thus, the protein's quantity in the sample, the intensity peaks of one kind of peptide need to be united. Our system supports such a deisotoping step and visually shows its effect.

In order to identify the proteins in our sample, the LC-MS experiment is extended by another MS step. This is called tandem mass spectrometry. The second MS step determines the amino acid sequence of a peptide, such that the proteins can be identified by matching the sequence to data obtained from an online database. Our visual exploration tool also supports the visualization of such LC-MS/MS data, as well as further extensions to LC/LC-MS and LC/LC-MS/MS experiments.

For the computation of differential protein expression, the results of two experiments (typically test vs. control) need to be compared quantitatively.
As the LC step of LC-MS experiments is not reproducible with sufficient precision, a registration step is required prior to the difference computations. We execute a non-rigid registration step based on landmarks. The landmarks can be obtained automatically by performing a protein identification step for both experiments and matching the equally identified intensity peaks. After registration, the computed differential expressions are visualized by our system in a 3D setup. The visual exploration methods supported by our system are described in Section V.

This work is an extension of our previous efforts [17]. The extended system supports further important features for visual protein analysis. The new features include the extension to visual exploration of data obtained by tandem mass spectrometry, the deisotoping step to unite scattered intensity peaks belonging to one peptide, the description of an automated domain warping step based on protein identification to register data sets for differential expression computation, and the quantitative depiction of up- and down-regulation for a test vs. control experiment in our 3D interactive environment. Moreover, we conduct a thorough evaluation of our method. We apply it to a real data set examining the protein expression in Bacillus subtilis and compare visual protein analysis using our interactive exploration tool with methods used prior to the introduction of our tool. The evaluation is described in Section VI.

II. RELATED WORK

LC-MS data can be displayed directly using one 2D plot of intensity over time (output of the LC step) and many thousands of 2D plots of intensity over mass-to-charge ratio (output of the MS step). Obviously, exploring the data by looking at some of these 2D plots does not lead to a good comprehension of the data as a whole. Instead, one would like to have a global view of the entire data, i.e., to look at intensities over a two-dimensional domain.

A first approach has been taken by Li et al. [14]. The output of the visualization method, however, restricts itself to two-dimensional images, where intensity is displayed using different shades of gray in a linear or logarithmic mapping.
The images clearly exhibit the positions of intensity peaks, but the different shades of gray are hard to distinguish, which limits the perception of the actual intensity values.

The usage of 2D images is quite common for data visualization in proteomics, e.g., when using gel electrophoresis [2]. However, if we generate such images for LC-MS data, we run into the problem that the samples are not structured over a Cartesian grid. In the work by Li et al. [14], the resolution of the resulting images can be chosen by the user. To deal with the challenges of not operating on a Cartesian grid, the authors averaged the intensities collected within each pixel. Due to the averaging step, intensity peaks may not have been displayed with their maximum intensity but with a much lower value. Significant intensity peaks may have been reduced to smaller peaks that do not seem to be noteworthy. From a biological point of view, one is particularly interested in the height of the high intensity peaks. Thus, a reduction of the intensities would lead to a significant distortion of the actual data.

We have developed a 3D visualization technique for LC-MS data, where intensities are displayed over the 2D domain [17]. The 3D view allows for a global understanding of the absolute and relative intensities. For data representation, we do not perform any averaging steps but preserve all the intensity peaks. The approach is revisited in Section IV. Such maximum-preserving operations have been used in other contexts, e.g., by Shinagawa and Kunii [26].

Meanwhile, de Corral and Pfister [7] have presented a visualization method for LC-MS data similar to ours. Their approach is motivated by the adaptive representation of height fields in terrain visualization applications. They decided to adjust the method by Losasso and Hoppe [19] to their needs.
The main focus of de Corral and Pfister's work is the real-time display of large data sets using sophisticated data management and hardware-accelerated methods. The driving biological questions, including the need for preserving the maximum intensities throughout all levels of resolution, are not discussed in depth.

To deal with large data sets, multiresolution methods are commonly applied in visualization. Many different approaches exist for two-dimensional and even higher-dimensional domains [4], [11], [18], [23]. Hierarchical methods are also common in terrain visualization [5], [6], [8], [12], [16], [19], which was the motivation for de Corral and Pfister [7] to use such approaches. A common technique to build multiresolution hierarchies is the use of wavelets [27]. By storing data values explicitly for the lowest resolution only and computing higher-resolution representations by successively adding details, wavelet-based hierarchies do not require any additional storage space (compared to the storage space of the original data at highest resolution). In the work by Andersson et al. [1], one-dimensional wavelets have been applied to LC data for data reduction and de-noising.

III. GEL-FREE PROTEOMICS

A. Liquid Chromatography-Mass Spectrometry (LC-MS)

Liquid chromatography-fed mass spectrometry (LC-MS) is an analytical technique that combines physical separation via liquid chromatography with mass analysis via mass spectrometry. It has recently received a lot of attention in the field of proteomics. It has been demonstrated that LC-MS-based methods are very powerful and in certain aspects superior or complementary to other approaches such as 2D electrophoresis, see for example [21]. In particular, LC-MS-based methods are capable of capturing both intracellular proteins and membrane proteins and seem to perform especially well for the latter.

Figure 1 illustrates the individual LC-MS processing steps.
Proteins are macromolecules consisting of hundreds or thousands of amino acids. A biological sample, in turn, can be a mix of thousands of different proteins. In the first step of our processing pipeline, protein molecules are cut into smaller fragments (called peptides), e.g., by the enzyme trypsin. Trypsin cuts at well-defined positions in the amino acid chain (after lysine and also after arginine if not followed by proline), such that the sequences of potential fragments are known when a protein's sequence is known.

Fig. 1. Liquid chromatography-mass spectrometry (LC-MS) workflow: After growing and isolating the biological material (A), the sample is taken and its proteins are digested using peptidases (B). This process leads to cleaved proteins called peptides, which are fed to the liquid chromatograph, where the peptides are separated. This diagram shows the utilization of two consecutive LC steps. The peptides are separated by loading the peptide mixture onto an ion exchange column (C), eluting the column step by step using an ammonium chloride gradient, directly loading the eluate onto a reverse phase column (D), and eluting the reverse phase column using a continuous acetonitrile gradient (E). The eluate directly enters the mass spectrometer to determine the masses of the separated peptides. The MS ionizes the peptides, deflects them using a quadrupole, and detects the ions (F). Finally, the data is delivered to a computer, where our visual data analysis tool is applied (G).

In order to examine the peptides individually, we need to separate them. Peptide separation is done by liquid chromatography (LC). A solvent containing the peptides is forced through a separation column (loading). The column contains the stationary phase that binds the peptides. Afterwards, the peptides are washed out of the column by the mobile phase (eluting). The diagram in Figure 1 shows the utilization of two consecutive LC steps, which is described below. The more weakly a peptide is bound to the substrate, the faster it gets washed out. Thus, peptides can be separated by their binding properties (e.g., hydrophobicity). The output data of the LC step can be displayed using a 2D plot, where intensity in counts per second is plotted over time, see Figure 2.

Fig. 2. Liquid chromatography outputs intensity values over time.

The masses of the separated peptides can be determined individually using mass spectrometry (MS). Mass spectrometry separates ions by their mass-to-charge ratios (m/z-ratios).
Thus, we need to ionize the peptides. Different approaches for ionization exist. Among the most popular are electrospray ionization (ESI) [9] and matrix-assisted laser desorption ionization (MALDI) [13]. When using electrospray ionization, one may have to normalize the results due to decreasing ionization. After ionization the molecules are accelerated and handed to the mass analyzer. The mass analyzer uses electric (time-of-flight) or magnetic fields (quadrupole) to deflect the charged particles, while the kinetic energy imparted by motion gives the particles inertia dependent on their mass. The mass analyzer steers the particles to a detector based on their m/z-ratio. The detector measures intensity in counts per second. The MS output can be displayed by one 2D plot for each time step. The 2D plot shows intensity over mass-to-charge ratio (or m/z-ratio), see Figure 3.

Fig. 3. Mass spectrometry outputs intensity values over m/z-ratios.

The data coming out of a liquid chromatograph is a function over time. When delivering the data from the detector to a computer system, the values are given at discrete points in time $t_1, \dots, t_n \in \mathbb{R}$. The points in time $t_i$, $i = 1, \dots, n$, are not distributed equidistantly. The number $n$ can be in the range of many thousands. Figure 2 shows such a plot.

The data coming out of the mass spectrometer is a function over the m/z-ratio. The intensity is measured in counts per time and stored as an intensity list for discrete m/z-ratios $m_1, \dots, m_p \in \mathbb{R}$. The m/z-ratios $m_j$, $j = 1, \dots, p$, are not distributed equidistantly, either. The number $p$ depends on the experimental setup and can be in the range of tens to hundreds of thousands. Figure 3 shows a 2D plot of such an intensity list. The plot exhibits several high peaks.

Instead of generating such two-dimensional graphs for each point in time $t_i$, $i = 1, \dots, n$, we use a three-dimensional setup, where the intensity is shown as a heightfield over the dimensions time and m/z-ratio. Unfortunately, the m/z-ratios $m_1, \dots, m_p$ vary from one point in time $t_i$ to the subsequent point in time $t_{i+1}$, and even the number $p$ of values in the m/z-ratio dimension varies significantly. Thus, when looking at a two-dimensional domain with the dimensions being m/z-ratio and time, data positions are scattered in one dimension and non-equidistant in the other dimension. Figure 4 sketches the discrete data positions for LC-MS in the two-dimensional domain.

Fig. 4. Structure of LC-MS data: Data value positions are scattered in the m/z-dimension and non-equidistant in the t-dimension (axes: m/z over the time steps $t_1, t_2, \dots, t_n$).

B. Tandem Mass Spectrometry

For tandem mass spectrometry (also referred to as LC-MS/MS) we attach another MS step to the end of the pipeline shown in Figure 1(A-F). The goal of the second MS step is to determine the amino acid sequence of selected peptides. Of interest are the peptides with the highest peaks in the first-order MS spectrum. The respective ions can be selected using the quadrupole of the mass spectrometer and fragmented using collision-induced dissociation. The fragments are detected by a TOF analyzer. Since peptide bonds (bonds between amino acids) are the weakest bonds within a peptide, they are the first to break during fragmentation. Thus, the intensity plot of the second-order MS spectrum (i.e., the peak list we obtain by the second MS step) exhibits peaks whose peptide fragments differ in their number of amino acids. The mass differences of the peaks allow for the determination of the amino acid sequence. For protein identification, the determined amino acid sequence can be matched against amino acid sequences retrieved from online databases using analysis systems such as Mascot [22].

C. Multi-dimensional Protein Identification

One problem of the described LC-MS approach is that the simultaneous digest of the protein mixture results in a highly complex collection of thousands of peptides. Thus, a single LC step may not be capable of separating all of them. This problem is typically solved by adding another LC step preceding the one described above. This method is referred to as multidimensional protein identification technology (MudPIT) or LC/LC-MS [30]. The diagram in Figure 1 illustrates this two-step LC using an ion exchange column and a reverse phase column. The ion exchange column is eluted using a stepwise increasing ammonium chloride gradient. Certain peptides can only be eluted using certain fractions of ammonium chloride concentration. Thus, we elute the peptides in several steps, while the fraction increases from one step to the next. For each fraction, we perform a reverse phase LC step (the peptides eluted from the ion exchange column are directly loaded into the reverse phase column) and an MS step as before.

We can also couple MudPIT with tandem mass spectrometry, leading to so-called LC/LC-MS/MS experiments. In terms of data representation, both MudPIT and tandem mass spectrometry add another dimension to our LC-MS data.

IV. HIERARCHICAL DATA REPRESENTATION AND VISUALIZATION

A. Resampling

Since LC-MS data is scattered in the m/z-dimension, a direct data visualization method would have to apply scattered data interpolation techniques or domain triangulation (e.g., Delaunay triangulation). Neither approach (in its general form) accounts for the non-scattered structure in the time-dimension.
Moreover, scattered data interpolation leads to a loss of precision, while domain triangulation can be computationally expensive. To allow for an efficient visualization with an acceptable amount of preprocessing, we decided to perform a resampling step. Since the time-dimension is already structured, we only apply a one-dimensional resampling in the m/z-direction. We generate a structured domain with non-equidistant samples in the t-direction and equidistant samples in the m/z-direction.
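
The one-dimensional resampling can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the system's implementation: the function name `resample_scans`, the parameters `mz_min`, `step`, and `n_bins`, the choice of keeping the largest intensity per grid cell, and the value 0 for empty cells are all hypothetical choices made for illustration.

```python
def resample_scans(scans, mz_min, step, n_bins):
    """Resample every scan onto a shared equidistant m/z grid.

    scans: list of (time, mz_values, intensities), one entry per LC time step.
    Returns (times, grid); grid[i][k] is the largest intensity of scan i
    whose m/z falls into [mz_min + k*step, mz_min + (k+1)*step).
    Assumption: bin-wise maximum is used so that no peak is lowered.
    """
    times, grid = [], []
    for t, mz_values, intensities in scans:
        row = [0.0] * n_bins              # cells without a measurement stay at 0
        for mz, value in zip(mz_values, intensities):
            k = int((mz - mz_min) // step)
            if 0 <= k < n_bins:           # ignore samples outside the grid
                row[k] = max(row[k], value)
        times.append(t)                   # the time axis is kept as measured
        grid.append(row)
    return times, grid


# Two scans at non-equidistant times with scattered m/z positions.
scans = [
    (0.0, [100.0, 100.6, 100.7], [5.0, 2.0, 9.0]),
    (0.7, [101.9], [4.0]),
]
times, grid = resample_scans(scans, mz_min=100.0, step=0.5, n_bins=4)
```

The result is structured in both dimensions: rows follow the measured (non-equidistant) time steps, columns follow the equidistant m/z grid.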

Resampling should be done such that all intensity peaks are preserved (with their maximum intensity). The only way to fulfill this condition is to use a sufficiently high resampling rate. If the resolution $\mathit{res}$ of the mass spectrograph is known, it is best to resample with rate $1/(2\,\mathit{res})$. If the resolution of the mass spectrograph is unknown, it can be estimated by determining the minimum distance between any two measurements.

Obviously, we generate a lot of redundant information. For displaying data visually on a computer screen, there is no need to go beyond the screen's resolution. We reduce the amount of data by merging adjacent data values. However, we still want to be able to retrieve the highest resolution data for display when zooming into regions of interest and when outputting data to peak quantification tools. We generate a hierarchical data representation that allows for multiresolution visualization and data access.

B. Hierarchical Data Representation

A hierarchical data representation scheme stores a data set at various resolutions. For downsampling, the sample positions of resolution $L^n$ are split into two groups: the ones that belong to the next coarser resolution $L^{n-1}$ (called even vertices) and the ones that belong to $L^n \setminus L^{n-1}$ (called odd vertices). The values at the even vertices are computed from the values at the sample positions $L^n$. When using wavelet-based techniques, the values at the odd vertices store the difference between the values at the even vertices and the values at the sample positions $L^n$. Thus, resolution $L^n$ can be reconstructed from resolution $L^{n-1}$ at any time by adding the differences. Only resolution $L^{n-1}$ and the difference set $L^n \setminus L^{n-1}$ need to be stored, which is the same amount of data storage as needed to store $L^n$. Thus, setting up a multiresolution hierarchy using a wavelet scheme does not require additional data storage. The simplest and, thus, most widely used wavelets are Haar wavelets, see [27].
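
To make the storage argument concrete, the following Python sketch runs one level of a Haar-style scheme: pairwise averages form the coarser resolution, and one difference per pair is kept for reconstruction. It is only an illustration; the function names `haar_decompose` and `haar_reconstruct` are hypothetical, and the input is assumed to have even length.

```python
def haar_decompose(values):
    """One level of pairwise averaging on an even-length list.

    Returns (coarse, detail): coarse holds the average of each pair,
    detail holds (first value of the pair) - (average).
    Together they need exactly as many numbers as the input.
    """
    coarse, detail = [], []
    for a, b in zip(values[0::2], values[1::2]):
        avg = (a + b) / 2.0
        coarse.append(avg)
        detail.append(a - avg)
    return coarse, detail


def haar_reconstruct(coarse, detail):
    """Invert haar_decompose: the pair is (avg + d, avg - d)."""
    fine = []
    for avg, d in zip(coarse, detail):
        fine.extend([avg + d, avg - d])
    return fine


values = [0.0, 100.0, 4.0, 2.0]
coarse, detail = haar_decompose(values)
restored = haar_reconstruct(coarse, detail)
```

In this example the finer resolution is recovered exactly from the same number of stored values, while the isolated peak of height 100 appears with height 50 at the coarser resolution.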
One-dimensional Haar wavelets compute the values $f_i^{n-1} \in L^{n-1}$ at the even vertices by averaging the values $f_i^n$ and $f_{i+1}^n$ at the respective sample point pairs in $L^n$. The values $f_{i+1}^{n-1}$ at the odd vertices store the differences $f_i^n - f_i^{n-1}$.

We adopt the ideas from wavelet-based multiresolution hierarchy generation. However, averaging data values would cause intensity peaks to lose their maximum intensity or even to vanish. To maintain the maximum intensity of all peaks, we set the values at the even vertices to $f_i^{n-1} = \max(f_i^n, f_{i+1}^n)$. The values at the odd vertices store the differences $f_{i+1}^{n-1} = f_i^{n-1} - \min(f_i^n, f_{i+1}^n)$. The sign of $f_{i+1}^{n-1}$ is used to indicate whether $f_i^n$ or $f_{i+1}^n$ was the larger value. Figure 5 shows an example of our peak-preserving multiresolution hierarchy generation.

Fig. 5. Hierarchical data representation (levels $L^n$, $L^{n-1}$, $L^{n-2}$ with even and odd vertices): The multiresolution scheme is peak-preserving ($f_0^{n-2} = \max\{f_0^n, f_1^n, f_2^n, f_3^n\}$) and does not require additional storage space (it stores only $[\{f_0^{n-2}\}; \{f_2^{n-2}\}; \{f_1^{n-1}, f_3^{n-1}\}]$).

C. Interactive Visualization System

We have developed a three-dimensional visualization system for interactive exploration of LC-MS data. We render heightfields over the two-dimensional resampled domain with dimensions time and m/z-ratio. Figure 6 shows a resulting image when displaying intensity over the 2D domain.

Fig. 6. Visualization of LC-MS data: Intensities are shown over time and m/z-ratio dimensions.

The system allows for visualization of the entire data set on a global scale and zoomed views into regions of interest. To fulfill the real-time requirements, the appropriate resolution is selected from the multiresolution data hierarchy described in the previous section. Figure 7 shows how the system switches from a low-resolution data visualization (Figure 7(a)) to a higher-resolution one during zooming (Figure 7(b)).
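
The peak-preserving hierarchy from which the resolution is selected can be sketched in Python as follows. This is a minimal illustration, not the system's code. The function names are hypothetical, the input length is assumed to be a power of two, and the sign convention (non-negative detail if the first value of a pair is the larger or equal one, negative otherwise) is one possible reading of how the sign marks the larger value.

```python
def peak_decompose(values):
    """One level: coarse = pairwise maximum, detail = signed (max - min).

    Assumed sign convention: detail >= 0 means the first value of the
    pair was the maximum, detail < 0 means the second one was.
    """
    coarse, detail = [], []
    for a, b in zip(values[0::2], values[1::2]):
        diff = max(a, b) - min(a, b)
        coarse.append(max(a, b))
        detail.append(diff if a >= b else -diff)
    return coarse, detail


def peak_reconstruct(coarse, detail):
    """Invert peak_decompose (exact for integer counts; floats may round)."""
    fine = []
    for m, d in zip(coarse, detail):
        if d >= 0:
            fine.extend([m, m - d])   # first value was the maximum
        else:
            fine.extend([m + d, m])   # second value was the maximum
    return fine


def build_hierarchy(values):
    """Reduce a power-of-two-length list to one value plus detail levels."""
    details = []
    level = list(values)
    while len(level) > 1:
        level, d = peak_decompose(level)
        details.append(d)             # finest details first
    return level[0], details


values = [3, 7, 5, 1]
top, details = build_hierarchy(values)

# Rebuild the finest level from the coarsest value and the details.
level = [top]
for d in reversed(details):
    level = peak_reconstruct(level, d)
```

The coarsest value equals the maximum of all samples, and the stored numbers (one coarse value plus the details) match the input size, which mirrors the example in Figure 5.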
Standard interaction mechanisms are provided: The color scheme for the visualization can be changed interactively. A one-dimensional transfer function is used to map intensity values to RGB color values such that the color of a peak indicates its intensity. The material properties and the shading method can be chosen by the user, and the system supports both parallel and perspective projection. To filter, i.e., to select and display only the most significant peaks, a thresholding method is provided that culls all peaks beneath a chosen threshold. The system also allows for peak labeling (see footnote 1).

Fig. 7. Multiresolution LC-MS data visualization: When zooming into regions of interest, resolution switches from coarse for global view (a) to fine for a more detailed view (b).

V. VISUAL DATA EXPLORATION

A. Deisotoping

When investigating LC-MS data using our visual exploration tool, it becomes apparent that there are intensity peaks that seem to form groups. A frequent pattern shows three to five peaks in a row. They all belong to the same time step and exhibit an equal spacing between them. Often, the second intensity peak is the highest (depending on the peptide/protein), and the maximum intensities decrease monotonically for the subsequent peaks. Figure 8 shows a typical example of such an intensity peak pattern.

Further investigation leads to the observation that the distance between successive intensity peaks of a group is about $1/z$ Da, where $z$ is the charge of the peptide. Thus, the group of intensity peaks represents the (stable) isotopic distribution of one peptide. For example, carbon is known to exist primarily in the form of the stable isotope $^{12}$C, but about 1.1% of all carbon atoms are of the form $^{13}$C, where the numbers 12 and 13 denote the masses (in Dalton). Alternative stable isotopes are also known for hydrogen ($^{2}$H), nitrogen ($^{15}$N), oxygen ($^{17}$O, $^{18}$O), and sulfur ($^{32}$S, $^{33}$S, $^{34}$S, $^{36}$S). The larger the peptide, the higher the probability that it contains one or more rarely occurring stable isotopes that result in a mass shift of one Dalton.
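
The peak pattern described above lends itself to a simple scan over the peaks of one MS spectrum. The following Python sketch is a minimal illustration under stated assumptions, not the system's implementation: the function name `deisotope`, the parameters `charge`, `tol`, and `min_peaks`, and the restriction to directly consecutive peaks are hypothetical. The sketch checks only the spacing of about $1/z$ Da, not the intensity profile, and it replaces each detected group by a single peak at the position of its highest member whose intensity is the sum over the group.

```python
def deisotope(peaks, charge=1, tol=0.01, min_peaks=3):
    """Unite chains of equally spaced peaks within one spectrum.

    peaks: list of (mz, intensity), sorted by m/z, all from one time step.
    A chain is a run of consecutive peaks whose spacing is 1/charge Da
    within +/- tol. Chains with at least min_peaks members are replaced
    by one peak at the m/z of the highest member, with summed intensity.
    Assumption: isotope patterns do not overlap with other peaks.
    """
    spacing = 1.0 / charge
    result = []
    i = 0
    while i < len(peaks):
        chain = [peaks[i]]
        j = i + 1
        while j < len(peaks) and abs(peaks[j][0] - chain[-1][0] - spacing) <= tol:
            chain.append(peaks[j])
            j += 1
        if len(chain) >= min_peaks:
            top_mz = max(chain, key=lambda p: p[1])[0]
            result.append((top_mz, sum(p[1] for p in chain)))
            i = j                      # skip the whole chain
        else:
            result.append(peaks[i])    # keep the peak unchanged
            i += 1
    return result


spectrum = [(500.0, 10.0), (600.0, 40.0), (601.0, 60.0),
            (602.0, 30.0), (603.0, 10.0), (700.0, 5.0)]
united = deisotope(spectrum)
```

Because the sketch works on a plain peak list, it can be applied to the original measurements before any resampling.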
When trying to quantify the amount of a protein present in a given sample, one kind of protein may exist in various forms containing different isotopes. To determine the amount of protein, we should count all occurrences of the protein, not just the ones that contain exactly the same isotopes. Hence, when looking at the group of intensity peaks, the individual intensity peaks should not be considered separately. Instead, their intensities should be summed up to form one major intensity peak, which represents the number of all peptides of one kind. This step is called deisotoping.

Fig. 8. Visualization of LC-MS data exhibits characteristic patterns due to isotopes. Shown is a group of adjacent peaks forming a short chain.

Footnote 1: A movie showing features of the visual exploration tool accompanies the paper.

We perform the deisotoping step by scanning the MS spectrum for such patterns of intensity peaks. If we find three or more intensity peaks that show the described behavior and whose distances are 1 Da (with a small error tolerance), we classify them as candidates for deisotoping. We unite the intensity peaks at the position of their highest representative. The height of the united intensity peak is the sum of the intensities of the individual peaks. Figure 9 shows the effect of deisotoping. The MS spectrum shown in red exhibits the characteristic pattern. In the MS spectrum shown in blue, the group of peaks has been replaced by one major intensity peak. To ensure highest precision, all these calculations are conducted on the original data, i.e., before any resampling step is applied.

Fig. 9. Deisotoping: Intensity peaks forming the characteristic isotope pattern (shown in red) are united to form one major intensity peak each (shown in blue).

Various other approaches exist for deisotoping, including sophisticated ones using database queries [15] to retrieve mass information for isotopes. These approaches are computationally expensive but can deal with overlapping patterns. Our approach is a simple and fast one. We have chosen this approach, as it can be incorporated in our interactive visual system, providing an immediate and intuitive understanding of the applied modifications. Thus, potential misclassifications would immediately be noticeable.

B. Quantification

Since 2D gel electrophoresis is a widespread approach for proteome analysis, many existing quantification methods operate on 2D images. Since biologists are already familiar with these algorithms and have successfully applied them to their own data sets, we want our system to support their integration. Therefore, we export our data to 2D grey-scale images. We allow the user to pick any resolution up to the highest resolution supported by our hierarchical data representation for the image export. We compute the image from the underlying data representation and not from the visualization. The exported image can represent the entire data set or a selected region of interest.

Our visualization uses higher precision for storing the data values than is supported by the common image formats. Thus, during image export we are losing precision.
Since most of the intensity values fall into the low intensity range, while only a few intensity peaks exhibit large values, we use a logarithmic scale to map the high-precision values to the lower-precision ones. Using a logarithmic scale, we can still distinguish between the different low-intensity values. When using a linear scale, many low-intensity values would be mapped to the same exported value. Nevertheless, we support image export using both logarithmic and linear scales.

C. Visual Exploration of Tandem Mass Spectrometry Data

When using tandem mass spectrometry, a second MS spectrum is determined for some selected peptides, namely the peptides that exhibit high intensity peaks in the first MS spectrum. The visual exploration of tandem mass spectrometry data is based on our general framework visualizing intensities over time and m/z-ratios of the first MS step. In addition, we can switch on labeling of intensity peaks. We label those intensity peaks that correspond to peptides for which a second MS step has been executed. When clicking a label, a new window appears, which displays the second-order MS spectrum. Figure 10 illustrates this interactive exploration method (see footnote 2).

The second-order MS spectrum can be used to support the identification of the peptides and, thus, the proteins. The intensity peaks in the second-order MS spectrum differ by the masses of the amino acids of the peptides' sequences. By computing the mass differences of the peaks, the amino acid sequence can be determined. The determined amino acid sequence can then be matched against amino acid sequences retrieved from online databases, using analysis systems such as Mascot [22], to identify the peptide. The labels in Figure 10 show identity tags. After the identification step, they can be replaced by peptide names or other properties obtained from the database.

Our system also provides the option to display additional available information about intensity peaks.
Figure 11 shows a table with all properties next to the 3D visualization system. When selecting an intensity peak from the first- or the second-order MS spectrum by mouse click, the corresponding rows in the table are highlighted. This interaction mechanism allows the user to immediately obtain exact values for the measured quantities.

Fig. 10. Visual exploration of tandem mass spectrometry data. Intensity peaks of peptides for which a second-order MS spectrum is available are labeled. Clicking on the labels opens a new window showing the second-order MS spectrum.

Fig. 11. Tables next to the 3D visualization provide exact quantities and properties of interactively selected peaks.

Footnote 2: A movie showing interactive exploration of tandem mass spectrometry data accompanies the paper.

D. Registration

In order to get insight into multiple LC-MS experiments, we would like to display their visualization over the same domain and to allow for quantitative comparisons as well as interpolations and animations. The multiple experiments may have been taken under varying conditions (for example, test vs. control to compute differential protein expressions) or may be different fractions of a MudPIT experiment.

We have to solve several problems in this regard, which are imposed by the structure of LC-MS data. Not only do we get different sampling locations when executing two different experiments; the results also vary when executing the same experiment multiple times. LC runs are not reproducible with sufficient precision. We observed that the variation may be considerable, as intensity peaks for one peptide exhibit significant shifts from one output to another. The intensity peaks of Figure 12 illustrate the shift by visualizing test data (colored in red) and control data (colored in green) as two heightfields over the same domain. The positions of the most dominant peaks in the two heightfields are supposed to coincide, but they exhibit a severe shift. Thus, we need to register (align) the multiple experiments if we want to have a meaningful visual or quantitative comparison of multiple data sets.

Fig. 12. Visualization of two LC-MS data sets. The samples have been taken under different conditions. Left: Locations of the maximum intensity peaks (colored in red and green) are supposed to coincide but clearly exhibit shifts. Right: Registration by landmark-based domain warping corrects the local shift.

The registration step warps the domain of the test data set onto the domain of the control data set. The warping transformation is computed using a landmark-based approach. Obviously, the positions of the most significant intensity peaks make for good landmarks. While landmark-based approaches are known to produce good results, their drawback is the necessity of user interaction. Typically, substantial expertise is required to manually determine matching landmarks in two data sets. Moreover, for large data sets, this user interaction may become a rather tedious task. For our system, however, we can make use of the tandem mass spectrometry analysis described in the previous section. The second MS step allows us to determine which peptides are associated with which intensity peaks. When performing this peptide identification step for both data sets, we automatically obtain matching landmarks.
The landmarks are the positions of the intensity peaks to which the same identifications have been assigned. Once the landmarks are given, we can warp the domain of a test data set onto the domain of the control data set. We cannot use a rigid transformation, as the shifts may vary locally. Thus, we use a localized method: In a first step, we partition the domain of the control data set using a 2D Delaunay triangulation, where the landmarks (and the corners of the bounding box) are used as vertices. The blue lines on the left-hand side of Figure 13 illustrate the domain partitioning step. The same partition is induced onto the test data set using the matching landmarks as vertices (right-hand side of Figure 13). Based on the partition, we can warp every point p from the domain of the test to the domain of the control data set. We determine in which triangle point p lies and compute its barycentric coordinates within that triangle. The same barycentric coordinates are used within the matching triangle of the control data set to compute the warped position of p, see Figure 13.

Fig. 13. Domain warping for registration of two LC-MS experiments: Peptides with intensity peak locations p1, p2, and p3 are identified, matched, and used as landmarks. Delaunay triangulation is used to partition the 2D domain. Point p is warped to its new position by using barycentric coordinates.

The right-hand side of Figure 12 shows the result of our registration step applied to the two experiments of the example shown on the left-hand side of the figure. The matching intensity peaks have been moved to the same position.

E. Visual Differential Protein Expression Analysis

Differential protein expression analysis can be used to determine how the occurrences of proteins change when altering the conditions under which the samples are taken. Typically, one experimental setup serves as control data, while the other setups provide the test data. For example, the test and control data may have been taken from healthy and diseased organisms, respectively. The difference in expression can lead to hypotheses about what proteins are responsible for the occurrence of a disease. Another question is whether and how protein expression changes for different cell populations.

To answer such questions, LC-MS data from two or more experiments need to be compared. We provide means to visually explore the quantitative differences in the two data sets. In addition to exploring each data set on its own, we use an integrated view of both experiments by visualizing the differential expression. As we mentioned in the previous section, the two experiments need to be registered.
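The landmark-based warp used for this registration can be sketched as follows. This is a minimal illustration rather than our implementation: it assumes the triangulation has already been computed and is given as index triples into two matching landmark lists, that no triangle is degenerate, and the names `barycentric` and `warp_point` are hypothetical.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates (u, v, w) of 2D point p in triangle (a, b, c).

    The triangle must not be degenerate (non-zero area).
    """
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    u = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    v = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return u, v, 1.0 - u - v


def warp_point(p, triangles, test_pts, control_pts, eps=1e-9):
    """Warp p from the test domain to the control domain.

    triangles:   index triples (i, j, k) shared by both landmark lists
    test_pts:    landmark positions (time, m/z) in the test data set
    control_pts: matching landmark positions in the control data set
    Returns None if p lies in none of the triangles.
    """
    for i, j, k in triangles:
        u, v, w = barycentric(p, test_pts[i], test_pts[j], test_pts[k])
        if min(u, v, w) >= -eps:  # p is inside or on the boundary
            a, b, c = control_pts[i], control_pts[j], control_pts[k]
            return (u * a[0] + v * b[0] + w * c[0],
                    u * a[1] + v * b[1] + w * c[1])
    return None
```

Because landmarks map exactly onto their counterparts and the mapping is linear within each triangle, the warp is continuous across triangle edges. The linear search over all triangles is chosen for brevity; a spatial search structure would be preferable for many landmarks.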
After registration, differential expression can be computed by subtracting the intensity values for the test data set from the intensity values for the control data. The resulting differential expression is visualized as a heightfield over the 2D domain, i.e., over the time and m/z dimensions. Figure 14 shows an example of differential expression visualization. The visual exploration of the resulting data is supported just as for data from a single experiment.

Fig. 14. Visual differential protein analysis for a test vs. control experiment. Up-regulation is shown in red, down-regulation in green. The peak values display the quantitative difference of the two experiments over the dimensions time and m/z-ratio.

F. Visual Exploration of MudPIT Data

To visualize MudPIT data, we generate a heightfield for each fraction. We provide a slider for the user to switch between the heightfield renderings of the individual fractions. Figure 15 shows the visualization of three fractions exhibiting changes in intensity.

An integrated view of all or selected fractions of a MudPIT data set may be of interest. However, results obtained by the MudPIT process also suffer from not being (precisely) reproducible. Thus, the locations of intensity peaks representing peptides present in subsequent fractions do not coincide. When trying to integrate MudPIT data from various fractions into one setup, we again need to perform a registration step. We proceed as described above. Figure 16 shows the integrated visualization of various fractions after registration. Each fraction is assigned one color. Thus, colors indicate in which fraction each intensity peak is highest.

Fig. 16. Integrated visualization of various fractions from MudPIT: Colors indicate in which fraction each peak has maximum intensity. The color scheme assigns the colors cyan, white, magenta, yellow, red, green, and blue to fractions in the order of increasing ammonium chloride concentration.
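The color assignment of the integrated view can be sketched as a per-sample maximum over fractions. The sketch assumes, for illustration only, that the registered fractions have been resampled onto a common regular grid (the actual data are scattered); the function names are hypothetical, and the color order follows the caption of Figure 16.

```python
# Colors in the order of increasing ammonium chloride concentration
# (taken from the caption of Figure 16).
FRACTION_COLORS = ["cyan", "white", "magenta", "yellow", "red", "green", "blue"]


def dominant_fraction(fractions):
    """For each grid sample, return the index of the fraction with maximum intensity.

    fractions: list of equally sized 2D grids (lists of rows) of intensities,
               assumed to be registered to a common domain.
    Ties are resolved in favor of the fraction with the lowest index.
    """
    rows, cols = len(fractions[0]), len(fractions[0][0])
    return [
        [max(range(len(fractions)), key=lambda f: fractions[f][r][c])
         for c in range(cols)]
        for r in range(rows)
    ]


def color_map(fractions, colors=FRACTION_COLORS):
    """Translate the dominant fraction index of each sample into its color."""
    return [[colors[f] for f in row] for row in dominant_fraction(fractions)]
```

For example, two 2×2 fractions `[[1, 5], [0, 0]]` and `[[2, 1], [0, 3]]` yield the index grid `[[1, 0], [0, 1]]`.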
Using the registered fractions, we can also interpolate between the intensities of subsequent fractions. Instead of using a slider to switch between visualizations of different fractions as in Figure 15, we generate an animation over the fraction dimension, where intensity values are animated over fractions in the order of increasing ammonium chloride concentration.³

³ A movie of an animation accompanies the paper.
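A minimal sketch of such an animation is given below, assuming linear blending between consecutive registered fractions that have been resampled onto a common grid. The function names and the `steps` parameter are hypothetical and serve only to illustrate the idea.

```python
def lerp_heightfields(h0, h1, t):
    """Linearly blend two equally sized heightfields (lists of rows), 0 <= t <= 1."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [
        [(1.0 - t) * a + t * b for a, b in zip(row0, row1)]
        for row0, row1 in zip(h0, h1)
    ]


def animation_frames(fractions, steps):
    """Generate frames that move through the fractions in the given order.

    fractions: heightfields ordered by increasing salt concentration
    steps:     number of frames generated per pair of consecutive fractions
    """
    frames = []
    for h0, h1 in zip(fractions, fractions[1:]):
        for s in range(steps):
            frames.append(lerp_heightfields(h0, h1, s / steps))
    # Close the animation with an unblended copy of the last fraction.
    frames.append([row[:] for row in fractions[-1]])
    return frames
```

Interpolation of this kind is only meaningful after registration, since otherwise the peaks of one peptide sit at different positions in consecutive fractions and would fade in and out rather than change height.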

Fig. 15. Visualization of three fractions obtained by using MudPIT. A slider allows the user to switch interactively between fractions.

For smooth transitions we use linear interpolation of the heightfields. The animation allows for an even better perception of intensity changes with changing fractions.

VI. RESULTS AND EVALUATION

For the explanations of our visual exploration system in the previous sections, we have chiefly applied our methods to data acquired from the human cell line SiHa, which is used to model cervical cancer [24]. The cell line was grown under normal conditions and showed no perturbation. The MudPIT experiment was done using liquid chromatography (LC) with a reverse phase (RP) column and mass spectrometry (MS) with electrospray ionization (ESI). Eleven fractions with 0 mmol, 20 mmol, 40 mmol, 60 mmol, 80 mmol, 100 mmol, 150 mmol, 200 mmol, 300 mmol, 500 mmol, and 900 mmol ammonium chloride were used. The retention times during liquid chromatography were in the range from 0 to 85 minutes. The m/z-ratios measured by mass spectrometry were in the range of 300 to The measured intensities were as high as 109 counts per second.

Moreover, we applied our tools to our own data in order to identify and quantify proteins and their differential expression in Bacillus subtilis. The data acquired consist of seven LC-MS/MS cycles. The first cycle was a tandem mass spectrometry experiment as described in Section III, which is also referred to as flow through. Cycles two to seven describe a MudPIT (LC/LC-MS/MS) experiment (cf. Section III). Each cycle corresponds to one fraction. For the individual fractions we used 2.5 mmol, 7.5 mmol, 12.5 mmol, 17.5 mmol, 25 mmol, and 37.5 mmol ammonium chloride.

Figure 17 shows the 3D visualization of the flow through. Our first observation from the visualization is that the biological experiment was successful in terms of possible failures during data acquisition.
In a successful experiment, a large number of intensity peaks spread over the 2D domain without forming clusters or patterns. An intuitive and immediate check for errors is particularly important, as the data acquisition methods are not yet routinely conducted due to their novelty.

Fig. 17. 3D visualization for global understanding and error check.

For comparison, we show two examples where experiments failed. In Figure 18(a), we observe that the intensity peaks form a vertical band. Apparently, the LC step did not work properly, such that peptides have not been separated appropriately but agglomerate in this narrow band. In Figure 18(b), the sample is empty, i.e., only very few values have been detected. This may be a detection problem during the MS step. Thus, our visualization method not only provides an intuitive and immediate validation check but can even hint at what problem may have caused the experiment to fail.

Fig. 18. Two failed experiments (bird's eye view): (a) Measured intensities agglomerate in a narrow vertical band, possibly caused by an LC error. (b) Few peptides have been detected, possibly caused by an MS error.

Next, we want to look at our data set in more detail. We detect a very high intensity peak and turn on the labeling to obtain more information about the peak. This visual exploration step is shown in Figure 11. The high intensity peak is labeled with its identity tag. The corresponding spectrum of this MS scan is plotted in Figure 3. When selecting this peak we can retrieve further information. The quantities of the selected intensity peak are highlighted in the given table, see Figure 19. In the table we list the identity tag of the scan, the number of intensity peaks in the scan, the order of the MS scan, and the maximum intensity, which is the intensity of the selected peak. Moreover, the subsequent rows give the information of the second MS step. Five intensity peaks of scan 1109 have undergone further investigation (scans 1110 to 1114). In the last column the value of their m/z ratio is given, which allows us to find the second-order MS spectrum that corresponds to the selected intensity peak. We can use the second-order MS spectrum to determine the amino acid sequence of a peptide and make a database query to identify the corresponding peptide.

Fig. 19. Quantitative information about scan 1109 is shown and highlighted when the respective intensity peak is selected.

The feedback from the biologists in our group was that our tool provides intuitive visual exploration mechanisms, which allow them to quickly obtain precise quantitative and qualitative information from LC-MS runs. Our visual exploration tool helps to significantly accelerate their workflow and makes their analysis steps more intuitive, transparent, and comprehensible. Of great value to the experimenter is the possibility to immediately detect problems in the experimental protocol indicated by abnormal 3D charts.

A further impact of our work on the proteomics community is that our visual analysis system allows for differential expression analysis of the entire LC-MS data set, which previously had not been solved adequately. The consideration of deisotoping for the graphical and quantitative representation of LC-MS data gets us closer to a reliable quantitation of whole monoisotopic peak sets of the same peptide within an LC-MS proteome data set.

VII. CONCLUSIONS AND FUTURE WORK

We have presented a system for visual exploration of data from gel-free protein experiments. We provide interactive 3D visualizations of LC-MS data, tandem mass spectrometry data, and MudPIT data. Interactivity is achieved by using a hierarchical data representation scheme. Care has been taken to ensure that the biological data (in terms of maximum intensities and sample locations) are preserved with high precision. We provide methods for deisotoping and registration and support protein quantification and identification. These algorithms can be used for computing and visualizing differential protein expression. We have evaluated our approach by illustrating how protein analysis becomes much more targeted toward the retrieval of significant data when interacting with our visualization-based exploration tool.

In our current implementation we have not yet integrated the database queries and the intensity peak quantification step. Integrating the database search results would eliminate the manual database queries. Integrating the peak intensity step would also improve the accuracy of our system. By exporting the intensities to a 2D image, we not only lose precision in the intensity values, we also need to resample the time steps to obtain equidistantly sampled values, which fit the regular pattern of a pixel image. This resampling step may introduce some inaccuracies.

We also plan on coupling our deisotoping step with a deconvolution step. Currently, we are only detecting patterns with distances of 1 Da, i.e., with charge z = 1. During the MS ionization step, ions with higher charge may be generated such that the distances in the patterns would be proportionally smaller.

REFERENCES

[1] F. O. Andersson, R. Kaiser, and S. P. Jakobsson.
Data preprocessing by wavelets and genetic algorithms for enhanced multivariate analysis of LC peptide mapping. Journal of Pharmaceutical and Biomedical Analysis, 34: ,
[2] J. Bernhardt, K. Buttner, C. Scharf, and M. Hecker. Dual channel imaging of two-dimensional electropherograms in Bacillus subtilis. Electrophoresis, 20(11): ,
[3] J. Bernhardt, J. Weibezahn, C. Scharf, and M. Hecker. Bacillus subtilis during feast and famine: visualization of the overall regulation of protein synthesis during glucose starvation by proteome analysis. Genome Res., 13(2): ,
[4] P. Cignoni, C. Montani, E. Puppo, and R. Scopigno. Multiresolution modeling and visualization of volume data. IEEE Transactions on Visualization and Computer Graphics, 3(4): ,
[5] P. Cignoni, E. Puppo, and R. Scopigno. Representation and visualization of terrain surfaces at variable resolution. The Visual Computer, 13(5): ,
[6] D. Cohen-Or and Y. Levanoni. Temporal continuity of levels of detail in Delaunay triangulated terrain. In IEEE Visualization 1996, pages IEEE Computer Society Press,
[7] J. de Corral and H. Pfister. Hardware-accelerated 3D visualization of mass spectrometry data. In C. Silva, E. Gröller, and H. Rushmeier, editors, IEEE Visualization 2005, pages IEEE Computer Society Press,
[8] M. Duchaineau, M. Wolinski, D. E. Sigeti, M. Miller, C. Aldrich, and M. B. Mineev-Weinstein. Roaming terrain: Real-time optimally adapting meshes. In IEEE Visualization 1997, pages IEEE Computer Society Press,
[9] J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrospray ionization for mass spectrometry of large biomolecules. Science, 246(64), 1989.
[10] D. R. Gilbert, M. Schroeder, and J. van Helden. Interactive visualization and exploration of relationships between biological objects. Trends Biotechnol., 18(12): ,
[11] R. Grosso, C. Lürig, and T. Ertl. The multilevel finite element method for adaptive mesh optimization and visualization of volume data. In R. Yagel and H. Hagen, editors, Proceedings of IEEE Conference on Visualization 1997, pages IEEE, IEEE Computer Society Press,
[12] H. Hoppe. Smooth view-dependent level-of-detail control and its application to terrain rendering. In IEEE Visualization 1998, pages IEEE Computer Society Press,
[13] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding daltons. Anal Chem, 60: ,
[14] X.-J. Li, P. G. A. Pedrioli, J. E. J, D. Martin, E. C. Yi, H. Lee, and R. Aebersold. A tool to visualize and evaluate data obtained by liquid chromatography/electrospray ionization/mass spectrometry. Anal Chem, 76: ,
[15] X.-J. Li, H. Zhang, J. R. Ranish, and R. Aebersold. Automated statistical analysis of protein abundance ratios from data generated by stable isotope dilution and tandem mass spectrometry. Anal Chem, 75: ,
[16] P. Lindstrom, D. Koller, W. Ribarsky, L. Hodges, N. Faust, and G. Turner. Real-time continuous level of detail rendering of height fields. In SIGGRAPH 1996, pages ACM SIGGRAPH,
[17] L. Linsen, J. Löcherbach, M. Berth, J. Bernhardt, and D. Becher. Differential protein expression analysis via liquid-chromatography/mass-spectrometry data visualization. In C. Silva, E. Gröller, and H. Rushmeier, editors, IEEE Visualization 2005, pages IEEE Computer Society Press,
[18] L. Linsen, V. Pascucci, M. A. Duchaineau, B. Hamann, and K. I. Joy. Wavelet-based multiresolution with $\sqrt[n]{2}$ subdivision. Journal on Computing, 71(1+2),
[19] F. Losasso and H. Hoppe. Geometry clipmaps: Terrain rendering using nested regular grids. ACM Transaction on Graphics, 24(3): ,
[20] S. Luhn, M. Berth, M. Hecker, and J. Bernhardt. Using standard positions and image fusion to create proteome maps from collections of two-dimensional gel electrophoresis images. Proteomics, 3(7): ,
[21] D. M. Maynard, J. Masuda, X. Yang, J. A. Kowalak, and S. P. Markey. Characterizing complex peptide mixtures using a multi-dimensional liquid chromatography-mass spectrometry system: Saccharomyces cerevisiae as a model system. Journal of Chromatography B, 810(1):69-76,
[22] D. N. Perkins, D. J. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18): ,
[23] D. Pinskiy, E. Brugger, H. R. Childs, and B. Hamann. An octree-based multiresolution approach supporting interactive rendering of very large volume data sets. In H. Arabnia, R. Erbacher, X. He, C. Knight, B. Kovalerchuk, M. Lee, Y. Mun, M. Sarfraz, J. Schwing, and H. Tabrizi, editors, Proceedings of the 2001 International Conference on Imaging Science, Systems, and Technology (CISST 2001), Volume 1, pages Computer Science Research, Education, and Applications Press (CSREA), Athens, Georgia,
[24] J. T. Prince, M. W. Carlson, R. Wang, P. Lu, and E. M. Marcotte. The need for a public proteomics repository (commentary). Nature Biotechnology, 22: ,
[25] N. Shah, V. Filkov, B. Hamann, and K. I. Joy. GeneBox: interactive visualization of microarray data sets. In F. Valafar and H. Valafar, editors, International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS 03), pages Computer Science Research, Education, and Applications Press (CSREA), Athens, Georgia,
[26] Y. Shinagawa and T. L. Kunii. Unconstrained automatic image matching using multiresolutional critical-point filters. IEEE Trans. Pattern Anal. Mach. Intell., 20(9): ,
[27] E. J. Stollnitz, T. D. DeRose, and D. H. Salesin. Wavelets for Computer Graphics: Theory and Applications.
The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling, Brian A. Barsky (series editor), Morgan Kaufmann Publishers, San Francisco, U.S.A.,
[28] C. Tang, L. Zhang, and A. Zhang. Interactive visualization and analysis for gene expression data. In Hawaii International Conference on System Sciences,
[29] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422: ,
[30] M. P. Washburn, D. Wolters, and J. R. Yates III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 19: ,
[31] M. R. Wilkins, C. Pasquali, R. D. Appel, K. Ou, O. Golaz, J. C. Sanchez, J. X. Yan, A. A. Gooley, G. Hughes, I. Humphery-Smith, K. L. Williams, and D. F. Hochstrasser. From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology, 14:61-65,

Lars Linsen is an assistant professor of computer science at the Department of Mathematics and Computer Science of the Ernst-Moritz-Arndt-Universität Greifswald, Germany. He received a B.S. and an M.S. (Diplom) in computer science from the Universität Karlsruhe (TH), Germany, as well as a Ph.D. in He was awarded the 2002 Preis des Fördervereins des Forschungszentrum Informatik for an outstanding dissertation. He spent three years as a post-doctoral researcher and lecturer at the Institute for Data Analysis and Visualization (IDAV) and the Department of Computer Science of the University of California, Davis, U.S.A. He joined the Ernst-Moritz-Arndt-Universität Greifswald in October His research interests are in the areas of scientific and information visualization, multiresolution methods, computer graphics, and geometric modeling. He is a member of ACM and ACM SIGGRAPH.

Julia Löcherbach is a Ph.D. student at the Department of Mathematics and Computer Science of the Ernst-Moritz-Arndt-Universität Greifswald, Germany. She received a B.S. and an M.S.
(Diplom) in biomathematics from the Ernst-Moritz-Arndt-Universität. She spent one year at the Heriot-Watt University Edinburgh, Scotland. She also works as a part-time software developer for Decodon GmbH, Greifswald, Germany.

Matthias Berth received an M.S. (Diplom) in 1995 as well as a Ph.D. in Mathematics in 1999 from Ernst-Moritz-Arndt-Universität Greifswald, Germany. He is co-founder and CTO of DECODON GmbH, where he is responsible for research and development.

Dörte Becher is a postdoctoral research fellow at the Institute for Microbiology of the Ernst-Moritz-Arndt-Universität Greifswald, Germany. From she studied chemistry and received her M.S. (Diplom) in chemistry in 1992 and her Ph.D. in microbiology in During this time she worked as a guest student at the University of Aberdeen for some months. Since 1999 she has been one of those responsible for mass spectrometry at the Institute for Microbiology. Her research interests are in the areas of mass spectrometry in proteomics, in particular protein identification, investigation of post-translational modifications, methods of gel-free protein identification/quantitation, and characterization of protein complexes.

Jörg Bernhardt is a postdoctoral research fellow at the Institute for Microbiology of the Ernst-Moritz-Arndt-Universität Greifswald, Germany he received his M.S. (Diplom) in Microbial Physiology, and his Ph.D. in During this time (1999) he was awarded the poster prize of the Japanese Electrophoresis Society in Tokyo and the Research Award of the University of Greifswald for his dissertation on the proteome of Bacillus subtilis. In Dec 2000 he became co-founder and CSO of DECODON, a company engaged in the development of software tools for functional genomics. His research interests are in the areas of functional genomics, proteomics, bacterial physiology, and image analysis.