Peptide mapping by capillary/standard LC/MS and multivariate analysis



UPTEC X04 036    ISSN    AUG 2004

RAGNAR STOLT

Peptide mapping by capillary/standard LC/MS and multivariate analysis

Master's degree project

Molecular Biotechnology Programme, Uppsala University School of Engineering

UPTEC X04 036    Date of issue 2004-08
Author: Ragnar Stolt
Title (English): Peptide mapping by capillary/standard LC/MS and multivariate analysis
Title (Swedish):
Abstract: The potential of LC/MS peptide mapping combined with multivariate analysis was investigated using IgG1 as a model protein. Five batches of IgG1 were exposed to different levels of an oxidizing agent. A method to detect differences between the batches using solely MS data was developed and successfully applied. Four peptide fragments containing methionine residues were found to represent the most significant differences and were characterized using MS/MS. Principal Component Analysis (PCA) was used to evaluate different computational strategies. Attempts were also made to use the information from the whole LC/MS space.
Keywords: Peptide Mapping, LC/MS, PCA, PTM, IgG1, Genetic Algorithms, Matlab Programming
Supervisor: Rudolf Kaiser, AstraZeneca, Analytical Development, Södertälje
Scientific reviewer: Per Andrén, Uppsala University, Laboratory for Biological and Medical Mass Spectrometry
Project name:    Language: English    Sponsors:    Security:
ISSN:    Classification:
Supplementary bibliographical information:    Pages:
Biology Education Centre, Biomedical Center, Husargatan 3, Uppsala, Box 592, S Uppsala
Tel +46 ()    Fax +46 ()

Peptide mapping by capillary/standard LC/MS and multivariate analysis

Ragnar Stolt

Summary

In the pharmaceutical industry it is important to develop analytical methods that can find small differences between different samples of a protein drug. It must be possible to map which modifications are introduced into the protein when, for example, it is stored at room temperature for a long time. Such modifications can change the properties of the protein and may even lead to an immune response with serious consequences. Traditionally, proteins have been characterized in protein chemistry by, among other techniques, peptide mapping. In peptide mapping the protein is cleaved enzymatically and the resulting peptide fragments are analyzed by liquid chromatography. Each resulting chromatogram then corresponds to a fingerprint of the protein, and small differences between samples can be traced as small changes in the fingerprints. In this way it is possible to determine whether differences exist, but not what they consist of. This study aims at further extending the possibilities of peptide mapping by analyzing the peptide fragments with a mass spectrometer. Small changes, in the form of oxidation, were introduced into a model protein. Using traditional statistical methods, stochastic non-significant differences were filtered out, and the changes could be characterized by tandem mass spectrometry as oxidation of methionine. Considerable effort was put into developing algorithms that can handle the large and complex data sets that mass spectrometric data constitute.

Degree project (20 p) in the Molecular Biotechnology Programme, Uppsala University, August 2004

Table of contents:

1 INTRODUCTION
   Model Protein, Immunoglobulin G1
   Peptide Mapping
      Digestion
   Reversed Phase High Performance Liquid Chromatography (RP-HPLC)
   Mass Spectrometry (MS)
      Ion Source
      Time of Flight Analyzer
      Tandem Mass Spectrometry
      Hybrid Quadrupole Time of Flight
      The Detector
   Data Analysis
      Normalization
      Confidence Interval
      Principal Component Analysis (PCA)
      Genetic Algorithms
      Wavelet Transformation
2 MATERIAL AND METHODS
   Equipment and Chemicals
      Chemicals
      Equipment
   Methods
      Oxidation of Model Protein
      Digestion of Model Protein
      RP-HPLC
      LC/MS
      Design of Experiment
   Data Analysis
      Importing Data to Matlab
      Approach 1: Collapsed Time Scale
         Normalization
         Principal Component Analysis (PCA)
         Confidence Interval
         Finding Oxidized Fragments
      Approach 2: Timescale
         Wavelet Denoising Preprocessing
         Using Genetic Algorithms and Normalization
         Bucketing
         Confidence Interval
RESULTS
   Data Analysis
      Approach 1: Collapsed Time Scale
         Normalization
         Principal Component Analysis (PCA)
         Normalization with Normalization Parameter
         Evaluating Auto Scaling
         Comparing Normalization Techniques
         Confidence Interval
      Approach 2: Time Scale
         Wavelet Denoising Preprocessing
         Confidence Interval
         Bucketing
   Tandem Mass Spectrometry
DISCUSSION
ACKNOWLEDGEMENTS
REFERENCES

1 Introduction

Today a number of different recombinant proteins are available on the pharmaceutical market. The breakthrough for recombinant techniques is often associated with the release of insulin produced in E. coli in 1982 [1]. Right from the beginning it has been important to develop methods to characterize and analyze recombinant proteins. One problem with recombinant techniques is posttranslational modifications (PTMs). Eukaryotic organisms, and humans in particular, have developed a complex system of PTMs, and vital proteins will not function properly if these PTMs are missing. Prokaryotic organisms such as E. coli, on the other hand, do not perform these PTMs. Pharmaceutical companies therefore need to be able to detect differences between the product and the native form of the drug candidate protein. Differences from the native copy can lead to dysfunction of the protein drug and also to an unwanted immune response with hazardous consequences.

There is also a great need to investigate the quality of a protein drug. What kind of modifications will be introduced in the protein when it is, for example, stored at room temperature for an extended period? Some amino acids in the protein may be oxidized, while others may be exposed to deamidation or deglycosylation. These questions need to be answered before a new protein drug is commercialized.

A common method to detect differences between protein batches is peptide mapping using RP-HPLC [2]. A multivariate approach can facilitate the data analysis. Principal component analysis (PCA) is often used [3] to model variations in the data set, making it easier to detect e.g. outliers and to obtain information concerning system reproducibility. It is also important to minimize stochastic and system-drift variations, especially when looking for small differences in the data set; otherwise it can be difficult to separate non-chemical variations from true physical differences in the protein. The UV data collected from the HPLC are, however, often not sufficient to disclose small variations in the data set. Furthermore, the UV chromatogram does not give any qualitative information. It is not possible, using this kind of data, to answer the question "Where on the protein are the modifications located and what do they consist of?". To further extend the possibilities of peptide mapping, the univariate approach has to be abandoned and more physical information describing the properties of the protein needs to be gathered. One way to increase the amount of available information is to use an LC/MS system, gathering information not only in the time domain but also in the m/z domain, resulting in a bivariate peptide map instead of the traditional univariate UV map. Mass data (m/z) can also give qualitative information about the parts of the protein where the modifications are situated. Using MS/MS these parts can be analyzed further and, by comparison with a reference batch, individual differing amino acids can be detected.

This project focuses on studying LC/MS peptide maps and developing computational methods to separate true chemical differences from noise without any a priori information. Found differences will be characterized using MS/MS.

1.1 Model Protein, Immunoglobulin G1

As model protein, Immunoglobulin G (IgG1, κ) has been chosen. The IgG molecule is very important to the immune defense system and is the most abundant antibody, with approximately 13.5 mg/ml in serum [4]. IgG binds to foreign molecules and thereby activates other members of the immune defense system. The IgG molecule consists of two types of chains, a smaller light chain and a larger heavy chain, each represented twice (fig. 1). The chains are held together by a total of four disulfide bonds. The molecular mass of the IgG molecule used in this study is 145 kDa (without any PTMs) and there are 45 amino acids. There is an N-linked glycosylation site on each heavy chain.

Figure 1: Immunoglobulin G1.

To be able to evaluate the possibilities of an LC/MS peptide map, small chemical changes were introduced by oxidizing IgG. Comparing batches with different amounts of added oxidizing agent should reveal some information about the potential of the analytical LC/MS system. The amino acid most sensitive to oxidizing agents is methionine. Oxidation of methionine produces methionine sulfoxide [5] in a reversible reaction. This oxidation corresponds to the addition of an oxygen atom, resulting in a 16 Da increment of mass. Increasing the concentration of oxidizing agent further can irreversibly oxidize methionine sulfoxide to methionine sulfone. There are six methionine residues in the amino acid sequence.

1.2 Peptide Mapping

Peptide mapping is a method used to create a fingerprint specific for a certain protein. The protein is digested with a suitable enzyme and the peptide fragments are separated using e.g. Reversed Phase High Performance Liquid Chromatography (RP-HPLC). Traditionally a UV detector is often chosen for data collection. In this study a mass detector was used.

1.2.1 Digestion

The digestion method has to be compatible with the chemical conditions necessary for the HPLC system and the mass spectrometer. It is important to develop a digestion routine with high reproducibility in order to be able to compare the results from different runs. The enzyme used has to digest the protein into a suitable number of peptide fragments. Too many, too small fragments risk obstructing the data analysis, and the signal-to-noise ratio will decrease. Too few fragments decrease the amount of information that can be gathered from a peptide map.

1.3 Reversed Phase High Performance Liquid Chromatography (RP-HPLC)

RP-HPLC is a widely used and well-established tool for the analysis and purification of biomolecules, e.g. a protein digest. The system uses high pressure to force a mobile phase through a column packed with porous microparticles. Particle sizes typically range between 3 and 5 µm; the smaller the particle diameter, the more pressure is generated in the system. The particle pore size generally ranges between 100 and 1000 Å. Smaller-pore silicas may sometimes separate small or hydrophilic peptides better than larger-pore silicas [6]. The most common columns are packed with silica particles to which different alkylsilane chains are chemically attached. Butyl (C4), octyl (C8) and octadecyl (C18) silane chains are the most commonly used. C4 is generally used for proteins and C18 for small molecules; the idea is that large proteins with many hydrophobic moieties need shorter chains on the stationary phase for sufficient hydrophobic interaction. The choice of column diameter depends on the required sample load and the flow rate. Small-bore columns (1.0 and 2.1 mm i.d.) can improve sensitivity and reduce solvent usage. Column length does not significantly affect most polypeptide separations [6]. To speed up the analytical cycle time, short columns with high flow rates and fast gradients can be used at the expense of resolution.

An HPLC system optimized for columns with small inner diameters and low flows is called micro-HPLC. A micro-HPLC system has narrow capillaries, typically 5 µm i.d., and the pumps commonly work with a split flow, enabling low flow rates with high accuracy. The advantages of micro-HPLC are mainly the reduction in mobile phase solvent consumption and the high sensitivity, which makes it possible to load small amounts of sample and facilitates the connection to a mass spectrometer.

In this form of liquid chromatography the stationary phase is nonpolar and the mobile phase relatively polar. Analytes are thus separated mainly according to their hydrophobic properties. During a gradient separation two different solvents are used as mobile phase: one relatively hydrophilic and one relatively organic (hydrophobic). The two solvents are mixed together and the relative content of the organic solvent increases with time. At the beginning of the gradient the analytes are attached to the solid phase through hydrophobic interaction. When the organic content of the mobile phase reaches a critical value, desorption takes place and the analytes pass through the column. The majority of peptides (10 to 30 amino acid residues in length) have reached their critical value when the gradient reaches 30% organic content.

Figure 2: The idea behind gradient separation with RP-HPLC. The polypeptide enters the column at injection, adsorbs to the hydrophobic surface, and desorbs from the stationary phase when the organic solvent reaches a critical concentration.

The separation is, however, also influenced by molecular size. Smaller molecules move more slowly through the column than larger ones, since smaller molecules have access to a larger volume of the column. The analytes' partitioning

process between the mobile and the stationary phase also affects the separation. However, it is quite safe to say that polar analytes elute first and non-polar analytes last.

To obtain a separation based mainly on hydrophobic differences, an ion-pairing agent is often added to the mobile phase in order to serve one or more of the following functions: pH control, suppression of unwanted interactions between basic analytes and the silanol surface, suppression of unwanted interactions between analytes, or complexation with oppositely charged ionic groups. It has been shown [6] that addition of an ion-pairing agent has a dramatic beneficial effect on RP-HPLC, not only enhancing separation but also improving peak symmetry. Trifluoroacetic acid (TFA) is a widely used ion-pairing agent; it is volatile and has a long history of proven reliability. When the HPLC system is to be connected to a mass spectrometer it is important to choose the ion-pairing agent with care, since ion suppression reduces the sensitivity of the mass spectrometer system.

Another effect that can influence peptide separations is temperature. Higher temperature is associated with increased diffusion according to Einstein's expression for the diffusion constant D:

$D = \frac{k_B T}{6 \pi \eta r}$    (1)

where η is the viscosity, k_B is Boltzmann's constant, T is the temperature and r is the radius of the diffusing particle. It is, however, difficult to draw any general conclusions, because it has been shown that an increase in temperature increases the resolution between certain analytes and decreases the resolution between others [6]. For good reproducibility, firm temperature control has to be applied.

RP-HPLC is one of the most widely used forms of chromatography, mainly because of its high resolution. Chromatographic resolution is defined as the ratio of the difference in retention time between two neighboring peaks A and B and the mean of their base widths:

$R_S = \frac{2\,(t_{R,A} - t_{R,B})}{w_A + w_B} = \frac{\Delta t_R}{w_{av}}$    (2)

where t_R corresponds to retention time and w to base width. Using RP-HPLC it is possible to separate peptides whose sequences differ by only a single amino acid residue.

1.4 Mass Spectrometry (MS)

A mass spectrometer is an analytical instrument that determines the molecular weight of ions according to their mass-to-charge ratio m/z. The device consists of three basic components: the ionization source, the mass analyzer/filter and the detector.

1.4.1 Ion Source

Ionization is an essential part of the mass spectrometric process. The molecules have to be charged and in the gas phase in order to be accelerated in the electrical field inside the mass spectrometer. Today several different ionization techniques have been developed. The techniques most often used for the analysis of peptides and proteins are matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI). These two are so-called soft ionization techniques, which means that the molecules are ionized without fragmentation. In this study ESI was used.

ESI creates a fine spray of highly charged droplets in the presence of a strong electric field. The sample solution is injected at a constant flow, which makes ESI particularly useful when the sample is introduced by an LC system. If the LC flow is compatible with the mass spectrometer, an online LC/MS system is easily established. The charged droplets are introduced into the mass analyzer compartment together with dry gas, heat or both. This leads to solvent vaporization. When the droplets decrease in volume the electric field density increases, and eventually repulsion exceeds surface tension and charged molecules start to leave the droplet via a so-called Taylor cone [7]. This process is conducted at atmospheric pressure and is sometimes also called atmospheric pressure ionization (API). Using ESI it is possible to study molecules with very high masses, mainly because ESI generates multiply charged ions, which means that a low upper m/z limit is sufficient for the analysis of large biomolecules. A typical detection limit using ESI is in the femtomole range [7].

1.4.2 Time of Flight Analyzer

The most commonly used analyzers are quadrupoles, Fourier transform ion cyclotron resonance analyzers and time-of-flight (TOF) analyzers. In this study a TOF analyzer with a reflectron was used. The TOF analyzer has the simplest construction and is based on the idea that ions are accelerated through an electric field to the same kinetic energy, ½mv² = zU. The ions will therefore differ in velocity according to their mass-to-charge ratio:

$v = \sqrt{\frac{2 U z}{m}}$    (3)

where U corresponds to the accelerating voltage. The differences in velocity in turn lead to different flight times from the ion source to the detector. One advantage of TOF instruments is that no scanning of the m/z spectrum is necessary. Another advantage is that there is virtually no upper mass limit using TOF. However, the resolving power of TOF instruments is low. Resolving power, also called resolution, is defined as the ability of a mass spectrometer to distinguish between different m/z ratios at a certain peak height. Looking at a single peak in the mass spectrum, resolution is commonly defined as the ratio between the m/z value and the full width of the peak at half maximum. Analyzers with a reflectron can improve resolution. A reflectron is a device with a gradient electrostatic field strength. This so-called ion mirror redirects the ion beam towards the detector. Ions with greater kinetic energy will penetrate

deeper into the reflectron than ions with lower kinetic energy. This mechanism compensates for a wide distribution of initial kinetic energy and thus increases mass resolution.

1.4.3 Tandem Mass Spectrometry

The peptide has to be fragmented in order to determine its sequence. Fragmentation can be achieved by inducing ion-molecule collisions in a process called collision induced dissociation (CID). The idea behind CID is to select the peptide ion of interest and introduce it into a collision cell, with a collision gas (often argon), resulting in breakage of the peptide backbone. From the resulting daughter ion spectrum the m/z values of the constituent amino acids can be found. The peptide fragments can be divided into different series. When the charge is retained on the N-terminal fragment (fig. 3) the resulting series are called a_n, b_n and c_n. When the charge is retained on the C-terminal fragment, cleavage can likewise occur at three different positions, giving the x_n, y_n and z_n series.

Figure 3: Collision induced dissociation (CID). Peptide fragments are produced by cleavage of the backbone at the a/b/c (N-terminal) and x/y/z (C-terminal) positions. Ions of the b and y series often dominate the daughter ion spectrum.

1.4.4 Hybrid Quadrupole Time of Flight

To select the peptide ions that are to be investigated by CID, a quadrupole device can be used. A quadrupole consists of four parallel rods with an applied direct current and a radio-frequency electromagnetic field. When ions reach the quadrupole they start to oscillate depending on the radio-frequency field and their m/z value. Only ions with a particular m/z value are able to pass through the quadrupole; the rest collide with the quadrupole rods. Thus the quadrupole works as a mass filter. By scanning the radio-frequency field an entire mass spectrum can be obtained.
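To make the fragment-series nomenclature concrete, the short Matlab sketch below computes singly charged b- and y-ion m/z values for a hypothetical four-residue peptide; the residue, proton and water masses are standard monoisotopic values, and the peptide itself is an arbitrary example rather than a fragment from this study.

% Singly charged b- and y-ion m/z values for the hypothetical peptide A-S-M-G.
% Monoisotopic residue masses (Da); proton = 1.00728 Da, water = 18.01056 Da.
residues = [71.03711 87.03203 131.04049 57.02146];   % Ala, Ser, Met, Gly
proton = 1.00728;  water = 18.01056;

b = cumsum(residues) + proton;                        % b1, b2, b3, b4
y = cumsum(residues(end:-1:1)) + water + proton;      % y1, y2, y3, y4
% Oxidation of the methionine residue would shift every fragment containing it by +16 Da.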

The instrument used in this study is a quadrupole-TOF hybrid (fig. 4). The quadrupole is used to select an ion of interest, which is fragmented in the collision cell. The resulting daughter ions are analyzed using a TOF device and a detector.

Figure 4: The concept behind a quadrupole-TOF system. Ions from the ESI source pass the quadrupole, where a selected ion is introduced into the Ar collision cell; the resulting ion fragments are analyzed by the reflectron TOF and the detector.

1.4.5 The Detector

The detector (fig. 5) converts the kinetic energy of the arriving ions to an electrical current. The amplitude of the current is correlated with the number of ions reaching the detector. Most detectors available today are built on the principle of electron multiplication. The detector in this particular instrument is a microchannel plate (MCP). An MCP detector consists of a large number of electron multiplier tubes.

Figure 5: Microchannel plate detector - hollow glass capillaries with a secondary electron emission coating in which incoming ions give rise to secondary electrons.

When a charged particle collides with the tube wall, secondary electrons are emitted and reflected further down the tube, leading to a cascade of secondary electrons that remains well confined in space. The signal in the MCP detector is typically amplified by several orders of magnitude.

1.5 Data Analysis

1.5.1 Normalization

When using an LC/MS device, small differences in sample concentration, injection volume or loss of sensitivity will introduce variations in the data set that complicate the comparison between different batches. These variations can, however, be compensated for by normalizing the data set. Most normalization techniques, e.g. when treating HPLC data, are based on an

internal standard or an external standard. Normalization in this context means that the data set is divided by the area or height of the standard peak. As standard peak, the peak with the largest area in the chromatogram can often be used with good results. However, when analyzing MS data with a large number (up to thousands) of m/z values, normalization is not a trivial task. Which m/z value should be chosen as standard peak in order to produce the most accurate normalization? What if an m/z value with large variation or, equally bad, with too little variation is chosen? Under these conditions the normalized data set will poorly represent the true values. A better approach is to calculate intensity quotients between the m/z values of a reference sample and a target sample. The mean of these quotients can be used as a normalization parameter. Averaging the quotients works as a low-pass filter (fig. 6), so only significant trends in the data set are represented in the normalization parameter, minimizing the impact of m/z values with large variation. This normalization technique works well provided that the number of m/z values is fairly large and the chemical differences between the batches are fairly small. Large chemical differences will slip through the low-pass filter and give rise to a skewed normalization.

The mean of N consecutive values is a moving-average filter,

$\bar{x}(n) = \frac{x(n) + x(n-1) + x(n-2) + \dots + x(n-N+1)}{N}$

whose magnitude response, obtained via the z-transform with $z = e^{j\omega}$, is

$\left| H(e^{j\omega}) \right| = \frac{1}{N} \left| \frac{\sin(N\omega/2)}{\sin(\omega/2)} \right|$

Figure 6: The low-pass nature of a mean operation. ω corresponds to frequency (rad/s) and H to the transfer function; the transfer function shows that only low-frequency components slip through the filter.

1.5.2 Confidence Interval

A classic way of treating the problem of stochastic variation between different samples of the same batch is to estimate a confidence interval. Assuming that the observed variable is normally distributed, it is fairly easy to calculate the probability of finding the true mean value within the variation of the measured variable, or, the other way around, to calculate the limits within which the true mean value can be found with a certain probability. When comparing two batches, the confidence interval for the difference in mean intensity of the measured m/z values gives useful information. If the calculated confidence interval ranges from a negative value to a positive value, i.e. includes zero, it is not possible to

statistically declare the calculated difference to be a true difference. By using this approach, non-significant changes can be removed.

1.5.3 Principal Component Analysis (PCA)

Principal component analysis is a multivariate projection method designed to extract and display the systematic variation of a data set [8]. The data set is composed of N observations and K variables. Examples of observations are samples of different batches or time points in a continuous process. The variables are often different kinds of analytical results, e.g. UV data, NIR data or m/z data. Geometrically the data set can be interpreted by representing each observation as a point in the K-dimensional orthogonal variable space (fig. 7), where each axis constitutes a variable. A new set of orthogonal variables is introduced, where each new variable minimizes the residual variance of the observations by least-squares analysis. Minimizing the residual variance is equivalent to maximizing the variance of the observations along the new variable axis. This new set of variables is called principal components (PCs). It is possible to calculate as many PCs as there are variables. The Euclidean distance between the projection point of an observation on the PC and the PC center point is called the score value; each observation is represented by a single score value along each principal component.

Figure 7: Geometrical interpretation of PCA with only two variables. The cosine of the angle δ between variable 1 and PC 1 corresponds to the loading value of variable 1; the residual variance is the distance from the observation to its projection on the PC.

The PC space has the same number of dimensions as the original variable space. However, the dimensionality can be reduced by choosing for the PCA model the PCs that together describe most of the variance in the original data set. The degree of variance explained is called the cumulative variance. Two or three PCs are often sufficient, meaning that the original variable space with K dimensions has been reduced to a new variable space with two or three orthogonal axes without any significant loss of information. The eigenvalue of each PC is proportional to the variance explained by that particular PC and can thus be a useful tool when ranking the PCs.
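The two filtering steps described above - rescaling each sample with a quotient-based normalization parameter and then keeping only the m/z values whose difference in mean intensity is statistically significant - could be sketched in Matlab roughly as follows. The variable layout and the approximate 95% normal-distribution interval are assumptions for illustration, not details taken from the thesis.

% X: intensity matrix, one row per sample and one column per m/z value.
% groupA/groupB: row indices of the samples belonging to the two batches.
% refRow: row index of the reference sample used for normalization.
function significant = compareBatches(X, groupA, groupB, refRow)

% --- Normalization: mean of intensity quotients against a reference sample ---
ref = X(refRow, :);
for i = 1:size(X,1)
    idx    = ref > 0 & X(i,:) > 0;          % m/z values present in both samples
    f      = mean(X(i,idx) ./ ref(idx));    % normalization parameter (low-pass of quotients)
    X(i,:) = X(i,:) / f;
end

% --- Confidence interval for the difference in mean intensity per m/z value ---
A = X(groupA, :);   B = X(groupB, :);
d  = mean(A,1) - mean(B,1);                               % difference of batch means
se = sqrt(var(A,0,1)/size(A,1) + var(B,0,1)/size(B,1));   % standard error of the difference
lo = d - 1.96*se;   hi = d + 1.96*se;                     % approximate 95% interval
significant = (lo > 0) | (hi < 0);                        % keep only differences whose interval excludes zero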

Another important value besides the score value is the cosine of the angle between an original variable axis and the new PC axis. This value is proportional to the importance of the original variable for the direction of the PC and is called the loading. Each original variable gives rise to a loading value. In the resulting PCA score plot it is easy to find relations between observations, and it is also possible to collect information about e.g. outliers and classification. The loading plot, where the loadings are plotted in the PC space, reveals information about relationships between variables.

To facilitate the interpretation of the PCA plot, the data are often mean-centered and auto scaled. Mean centering means that the average value of each variable is subtracted from the data set; after mean centering the mean value of each variable is zero. Auto scaling means that the standard deviation is calculated for each variable and the resulting scaling factor (1/σ_i, where σ_i is the standard deviation of variable i = 1, 2, ..., K) is multiplied with each variable. By putting all variables on a comparable footing, no variable is allowed to dominate over another merely because of its variance. PCA is an efficient and nowadays common chemometric method for decomposing two-dimensional data sets; it is, however, important to emphasize that PCA poorly represents non-linear correlations.

1.5.4 Genetic Algorithms

Using RP-HPLC, subtle variations will be introduced in the chromatographic profiles despite identical experimental conditions. These variations can be due to e.g. small changes in TFA concentration (remember that TFA is volatile), column temperature, degradation of the column silica, etc. Since these variations do not represent a true change in the sample but still affect the chromatogram, they make it difficult to draw analytical conclusions. Peak shapes, retention times and baselines are all variables that are exposed to small, non-sample-related variations. To compensate for these subtle variations, different alignment algorithms have been developed [1,21] that try to optimize the alignment between chromatograms by slightly altering peak shape and baseline structure.

Today a lot of different mathematical techniques dealing with the optimization problem are described. If an explicit function describing the experimental system exists, optimization techniques such as Newton-Raphson or steepest descent can be used with success. These traditional iterative methods are, however, computationally demanding, and if the system is too complex to be described by an explicit function they will not be successful. The risk of finding a local optimum instead of the global one must also be considered when using these techniques. Another approach to the optimization problem is to ignore explicit relations and search the solution space with biased stochastic methods. A genetic algorithm (GA) is a typical example of such a stochastic optimization method that can handle fairly large and complex systems without enormous computational power [11].
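A minimal Matlab sketch of the PCA computation described above is given below. It assumes a modern Matlab with the 'econ' option of svd (the thesis itself used Matlab 6.1), and the variable names are illustrative.

% PCA of a data matrix X (N observations x K variables) by singular value
% decomposition, after the mean centering and auto scaling described above.
function [scores, loadings, explained] = simplePCA(X)
N  = size(X,1);
Xc = X - repmat(mean(X,1), N, 1);          % mean centering
s  = std(Xc,0,1);  s(s == 0) = 1;          % guard against zero-variance variables
Xa = Xc ./ repmat(s, N, 1);                % auto scaling (unit variance)

[U, S, V] = svd(Xa, 'econ');               % economy-size SVD
scores    = U*S;                           % score values, one row per observation
loadings  = V;                             % loading values, one column per PC
ev        = diag(S).^2;                    % proportional to the eigenvalue of each PC
explained = 100 * ev / sum(ev);            % percent of the variance explained per PC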

Genetic algorithms simulate biological evolution and consider populations of solutions rather than one solution at a time. A reproduction process that is biased towards better solutions forms the next population, and after a certain number of generations, or when a specific criterion is fulfilled, the optimum is hopefully found. The first step in a genetic algorithm is to create an initial population. This can be done by using a priori information or simply by random initialization. The created population of chromosomes can be, e.g. when studying energy minimization, coordinate vectors of the atoms involved. The next step is to evaluate the chromosomes and give each a specific fitness value. The chromosomes that produce the best solutions are given the highest fitness values. The next population of chromosomes will be a combination of the chromosomes in the preceding generation. The number of offspring each chromosome produces is proportional to its fitness value, i.e. chromosomes with higher fitness have greater impact on the qualities of the next generation than chromosomes with lower fitness. Mutations and cross-over effects are also introduced during the breeding process. These stochastic elements make it possible to escape a local optimum. Aligning target chromatograms against a reference is a typical problem that can be solved using genetic algorithms [12,3]. To evaluate the fitness of a chromosome, the Euclidean distance between the two chromatograms that are to be aligned can be used.

1.5.5 Wavelet Transformation

Normalization and genetic algorithms are not always sufficient when preprocessing LC/MS data. Stochastic noise often disturbs the interpretation of the chromatograms and introduces larger variations than acceptable. Furthermore, the genetic alignment algorithm produces a better result if the raw data are denoised. Traditionally, in the field of signal analysis, denoising and compression of time-dependent data are done using different methods of Fourier transformation, e.g. the Fast Fourier Transform (FFT) and the Discrete Fourier Transform (DFT) [13]. These methods transform the signal from the time-dependent space to a frequency-dependent space (Fourier space). Using information from Fourier space, a low-pass filter can be applied and high-frequency noise can easily be removed. The filtered signal can then be analyzed in the time-dependent space via an inverse Fourier transformation. Fourier space also reveals how the energy of the signal is distributed over different frequencies. Frequency components representing only a small part of the energy can be removed without losing any significant information, thus compressing the original signal. However, Fourier transformation is not capable of coping with non-stationary signals, where the nature of the signal's frequency components changes over time. Solving this problem by applying the Fourier transformation to small time portions of the signal is hazardous to resolution because of Heisenberg's uncertainty principle: good resolution in the time domain leads to poor resolution in the frequency domain. A better approach when studying this type of signal, e.g. chromatograms, is to use wavelet analysis.
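As a rough illustration of the idea (not the algorithm actually used in this work), the Matlab sketch below uses a tiny genetic algorithm to find a single integer time shift that aligns a target chromatogram to a reference. It uses truncation selection and mutation only, and a circular shift for simplicity, whereas a full alignment implementation would also include fitness-proportional reproduction, cross-over and local warping of peak shapes and baselines. All parameter values are assumptions.

% ref, target: row vectors of equal length (chromatographic traces).
% Fitness = negative squared Euclidean distance between reference and shifted target.
function bestShift = gaAlign(ref, target)
popSize = 20;  nGen = 50;  maxShift = 100;  mutProb = 0.2;
pop = round((rand(popSize,1)*2 - 1) * maxShift);   % initial population of candidate shifts
bestShift = 0;  bestFit = -Inf;

for g = 1:nGen
    fit = zeros(popSize,1);
    for i = 1:popSize
        shifted = circshift(target, [0 pop(i)]);   % shift the target along the time axis (circular)
        fit(i)  = -sum((ref - shifted).^2);        % fitness of chromosome i
    end
    [topFit, k] = max(fit);
    if topFit > bestFit, bestFit = topFit; bestShift = pop(k); end

    [dummy, order] = sort(-fit);                   % rank chromosomes, best first
    parents  = pop(order(1:ceil(popSize/2)));      % keep the better half (truncation selection)
    children = parents(ceil(rand(popSize,1)*length(parents)));      % breed by resampling parents
    mutate   = rand(popSize,1) < mutProb;          % random mutations allow escape from local optima
    children(mutate) = children(mutate) + round(randn(sum(mutate),1)*5);
    pop = children;
end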

Wavelet analysis is a technique that, in contrast to Fourier transformation, preserves time information and is capable of revealing aspects of data such as trends, discontinuities in higher derivatives and self-similarity. With wavelet transformation, Heisenberg's uncertainty principle does not cause any problems, since wavelet analysis is a multiresolution technique where resolution is proportional to frequency [14]. A wavelet is a waveform of effectively limited duration that has an average value of zero (fig. 8). Wavelets tend to be irregular and asymmetric. Wavelet analysis can be summarized as the process of describing a signal via a number of shifted and scaled versions of a so-called mother wavelet.

Figure 8: Example of a mother wavelet: Daubechies 2 (db2).

Mathematically, the wavelet transformation can be described as the inner product of the test signal with the basis functions:

$CWT_x^{\psi}(\tau, s) = \Psi_x^{\psi}(\tau, s) = \frac{1}{\sqrt{|s|}} \int x(t)\, \psi^{*}\!\left(\frac{t - \tau}{s}\right) dt$    (4)

where ψ corresponds to the mother wavelet, s is the scale, τ is the translation and x is the test signal. The basis functions are the scaled and translated versions of the mother wavelet. This definition shows that wavelet analysis is a measure of similarity, in the sense of frequency content, between the basis functions and the signal itself. The calculated wavelet coefficient refers to the closeness of the signal to the wavelet at the current scale. The resulting coefficients have a scale component and a translational component: the scale component describes the inverse of the frequency and the translational component describes the time domain of the signal, i.e. the coefficients describe the frequency components of the signal at all time points. Using discrete implementations of the CWT makes it easy to compress or denoise a signal via low-pass filtering [15].
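If the Matlab Wavelet Toolbox is available (an assumption - the thesis does not state which implementation was used), a single chromatographic trace can be denoised along these lines; the threshold rule, scaling option and decomposition level below are illustrative choices, not values from the study.

% Wavelet denoising of one chromatographic trace (row vector) using the db2
% mother wavelet of figure 8.
level    = 4;                                                  % decomposition level
denoised = wden(trace, 'sqtwolog', 's', 'mln', level, 'db2');  % soft universal thresholding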

2 Material and Methods

2.1 Equipment and Chemicals

2.1.1 Chemicals

IgG1 dissolved in 1 mM acetic acid¹. Guanidine-HCl, analytical grade, was purchased from ICN Biomedicals. NH4HCO3, analytical grade, was purchased from BDH Laboratory Supplies. Trypsin was purchased from Promega and H2O2 from Acros Organics.

2.1.2 Equipment

Agilent 1100 micro-HPLC system
Micromass LCT
Micromass Quattro Ultima

¹ For confidentiality reasons the name of the producing company cannot be mentioned. This clone of IgG is slightly modified compared to native clones.

2.2 Methods

2.2.1 Oxidation of Model Protein

Hydrogen peroxide (H2O2) was chosen as oxidizing agent. It has been shown [5] that hydrogen peroxide oxidizes proteins gently. Prior to oxidation the pH was adjusted with ammonium bicarbonate, NH4HCO3 (AmBic): 3 µl 1 M AmBic was added to 25 µl IgG solution (1.1 µg/µl) [16]. As oxidizing agent, 45.5 µl H2O2 (35% w/v) in 955 µl H2O was used. This agent was diluted by adding 1 µl to 3 µl H2O; the resulting reagent is called 1:1 ox-agent and was diluted even further according to table 1. Five batches with 5 µl pH-corrected IgG solution each were prepared according to the following scheme:

Batch nr   Added reagent
1          H2O
2          :3 ox-agent
3          :2 ox-agent
4          :1 ox-agent
5          :1 ox-agent

Table 1: Oxidation scheme.

The oxidized batches were incubated 1 min at 4 °C and evaporated with a SpeedVac (4 min, 32 °C). The remaining pellets were stored at -75 °C until further analysis.

2.2.2 Digestion of Model Protein

The pellets with more or less oxidized IgG were dissolved by thorough vortexing in 5 µl 6 M guanidine-HCl in order to denature the protein. To enhance denaturation the batches were pre-incubated for 75 min at 65 °C. Prior to digestion, 5 µl 1 M AmBic and 4 µl H2O were added. The final pH was approximately 8, which corresponds to the pH optimum of the enzyme. The final guanidine-HCl concentration was 0.6 M and the AmBic concentration 0.1 M. AmBic is volatile, and the remaining salt concentration should be low enough to avoid sensitivity loss during mass spectrometry. For the cleavage reaction, 1.25 µl of trypsin (1 µg/µl) was added, corresponding to an enzyme-to-substrate ratio of 1:4 (w/w). The batches were incubated for 15 h at 37 °C.

2.2.3 RP-HPLC

A systematic approach to the problem of optimizing an HPLC separation is to use factorial design and a suitable optimization algorithm, e.g. MultiSimplex [17]. However, complex systems such as peptide digests are not, using this approach, easily handled within a

limited period of time. In this study a more empirical trial-and-error method was used. The slope of the gradient, the flow rate and the column temperature are all variables that have to be considered when optimizing an HPLC separation. As organic mobile phase, acetonitrile with 0.5% TFA was used; as hydrophilic mobile phase, water with 0.5% TFA. The following conditions (tables 2 and 3) were found to give acceptable separation within a reasonable period of time and were used throughout the study:

Table 2: Gradient conditions (time in minutes vs. % organic mobile phase).

Flow rate: 56 µl/min
Column: Zorbax Extend-C18, 1 mm x 15 mm, 3.5 µm
Sample temperature: 8 °C
Column temperature: 32 °C

Table 3: System conditions.

2.2.4 LC/MS

The outlet capillary from the HPLC system was connected to the inlet of the mass spectrometer. The two systems were controlled independently from two different computers. The operator program controlling the mass spectrometer was MassLynx 4.0; the corresponding program for the HPLC system was ChemStation (2002). The mass spectrometer was tuned and calibrated with NaI following the standard procedure described in [18]. The resolution was found to be 37, which is regarded as low for this specific instrumental setup. Mass data were collected from m/z = 300 Th to m/z = 1500 Th at a rate of 3 centroid spectra per minute.

2.2.5 Design of Experiment

Three samples of each batch were analyzed. Due to a technical breakdown, only two samples of batch 1 and batch 5 were run. Theoretically the concentration of digested protein in the samples should be about 7 picomol/µl, i.e. 1 µl should be enough considering the sensitivity of the mass spectrometer. However, an injection volume of 1 µl turned out to give almost no result at all; instead 10 µl was used. In order to reduce the influence of systematic errors the samples were analyzed in a randomized order according to the following scheme:

Table 4: Injection scheme - batch and sample, randomized run order and injection volume; one sample each of batch 1 and batch 5 was not run.

2.3 Data Analysis

2.3.1 Importing Data to Matlab

All data processing was carried out using Matlab 6.1 (The MathWorks Inc., USA). Two-dimensional LC/MS data (fig. 9), with one m/z spectrum for each time point, were exported from MassLynx to ASCII format via a software tool called DataBridge. The ASCII file contains a large array where the m/z spectra are stored in ascending time order (fig. 10).

Figure 9: The nature of LC/MS data - intensity as a function of time and m/z.

However, since only m/z values with intensities above the detection limit are collected, the lengths of the spectra from different time points (fig. 10) will not be identical. This has to be compensated for in order to obtain a matrix that represents the whole LC/MS space, with columns of equal length. An algorithm with the following pseudo code was constructed to solve this problem.

Step 1: Find the m/z value with the highest intensity at time n.
Step 2: Collect intensities from all time points (including the present one) where this m/z value can be found within a small m/z window. If the m/z value cannot be found at a certain time point, insert zero intensity.
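A Matlab sketch of this matrix-building step could look as follows; the storage format (a cell array with one [m/z, intensity] matrix per time point) and the m/z window half-width tol are assumptions for illustration, not details given in the thesis.

% Build a matrix covering the whole LC/MS space: one row per m/z trace, one
% column per time point, with zero intensity where an m/z value was not detected.
% spectra{t} is an [m/z, intensity] matrix for time point t; tol is the m/z window half-width.
function [mzList, M] = buildMatrix(spectra, tol)
nTime  = length(spectra);
mzList = [];                              % m/z values already collected
M      = [];

for t = 1:nTime
    s = spectra{t};
    [dummy, order] = sort(-s(:,2));       % Step 1: take the most intense peaks first
    for k = order'
        mz = s(k,1);
        if isempty(mzList) | all(abs(mzList - mz) > tol)   % not collected yet
            row = zeros(1, nTime);
            for u = 1:nTime               % Step 2: collect this m/z at every time point
                su  = spectra{u};
                hit = find(abs(su(:,1) - mz) <= tol);
                if ~isempty(hit)
                    row(u) = max(su(hit,2));
                end                       % otherwise the intensity stays zero
            end
            mzList = [mzList; mz];
            M      = [M; row];
        end
    end
end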
