Genetic Sequence Matching Using D4M Big Data Approaches
|
|
|
- Elijah Morrison
- 10 years ago
- Views:
Transcription
1 Genetic Sequence Matching Using Big Data Approaches Stephanie Dodson, Darrell O. Ricke, and Jeremy Kepner MIT Lincoln Laboratory, Lexington, MA, U.S.A Abstract Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 6 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model () an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Acculumo database to accelerate computations one-hundred fold over other methods. Comparisons of the method with the current gold-standard for sequence analysis,, show the two are comparable in the alignments they find. This paper will present an overview of the genetic sequence algorithm and statistical comparisons with. I. INTRODUCTION New technologies are producing an ever-increasing volume of sequence data, but the inadequacy of current tools puts restrictions on large-scale analysis. At over 3 billion base pairs (bp) long, the human genome naturally falls into the category of Big Data, and techniques need to be developed to efficiently analyze and uncover novel features. Applications include sequencing individual genomes, cancer genomes, inherited diseases, infectious diseases, metagenomics, and zoonotic diseases. In 23, the cost of sequencing the first human genome was $3 billion. The cost for sequencing has declined steadily since then and is projected to drop to $ in several years. The dramatic decrease in the cost of obtaining genetic sequences has resulted in an explosion of data with a range of applications, including early identification and detection of infectious organisms from human samples. DNA sequences are highly redundant among organisms. For example, Homo sapiens share about 7% of genes with the zebrafish [1]. Despite the similarities, the discrepancies create unique sequences that act as fingerprints for organisms. The magnitude of sequence data makes correctly identifying organisms based on segments of genetic code a complex computational problem. Given one segment of DNA, current technologies can quickly determine what organism it likely belongs to. However, the speed rapidly diminishes as the number of segments This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA C-2. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government. increases. The time to identify all organisms present in a human sample (blood or oral swab), can be up to 45 days (Figure 1b) with major bottlenecks being (1) sample shipment to sequencing centers and (2) sequence analysis. Consider the specific case of an E. coli outbreak that took place in Germany in the summer of 211 (Figure 1a). In this instance, while genetic sequencing resolved how the virulence of this isolate was different with the insertion of a bacterial virus carrying a toxic gene from previously characterized isolates, it arrived too late to significantly impact the number of deaths. Figure 1b shows that there are a number of places in the Current time frames for the sequencing process where improved algorithms and computation could have a significant impact. Ultimately, with the appropriate combination of technologies, it should be possible to shorten the timeline for identification from weeks to less than one day. A shortened timeline could significantly reduce the number of deaths from such an outbreak. A. First developed in 199, the Basic Local Alignment Search Tool () is the current gold-standard method used by biologists for genetic sequence searches. Many versions are available on the National Center for Biotechnology Information (NCBI) website to match nucleotide or protein sequences [2]. In general, is a modification of the Smith Waterman algorithm [3] and works by locating short seeds where the reference and unknown samples have identical seed sequences [4] [5]. Seeds are expanded outwards, and with each added base pair, probabilities and scores are updated to signify how likely it is the two sequences are a match. Base pairs are added until the probabilities and scores reach threshold values [5] [6]. The score most highly used by biologists to distinguish true matches is the expect value, or E-value. The E-value uses distributions of random sequences and sequence lengths to measure the expected number of matches that will occur by chance. Similar to a statistical P-value, low E-values represent matches [4] [5] [6]. E-value thresholds depend on the dataset, with typical values ranging from 1E-6 to 1E-3. Despite the relation to the P-value, E-value calculations are complex and do not have a straightforward meaning. Additionally, the underlying distributions are known to break down in alignments with shorter sequences or numerous repeats (low-complexity sequences) [4]. Direct use of to compare a 6 Gb collection with a comparably sized reference set requires months on a /14/$31. c 214 IEEE
2 1. Complex background 2. collection and shipment 3. preparation and sequencing 4. Analysis of sample 5. Time to actionable data Current: Goal: 2-3 days Onsite 1-3 days 3 hours 7-14 days < 12 hours -45 days < 1 day Fig. 1. Example disease outbreak and processing pipeline. In the May to July 211 virulent E. coli outbreak in Germany, the identification of the E. coli source was too late to have substantial impact on illnesses. Improved computing and algorithms can play a significant role in reducing the current time of to 45 days to less than 1 day. Photo source: core system. The Dynamic Distributed Dimensional Data Model () developed at Lincoln Laboratory, has been used to accelerate DNA sequence comparison. II. A. Schema COMPUTATIONAL METHODS is an innovation in computer programming that combines the advantages of five processing technologies: triple store databases, associative arrays, distributed arrays, sparse linear algebra, and fuzzy algebra. Triple store databases are a key enabling technology for handling massive amounts of data and are used by many large Internet companies (e.g., Google Big Table) [7]. Triple stores are highly scalable and run on commodity computers, but lack interfaces to support rapid development of the mathematical algorithms used by signal processing experts. provides a parallel linear algebraic interface to triple stores. Using, developers can create composable analytics with significantly less effort than if they used traditional approaches. The central mathematical concept of is the associative array that combines spreadsheets, triple stores, and sparse linear algebra. Associative arrays are group theoretic constructs that use fuzzy algebra to extend linear algebra to words and strings [8]. Associative arrays provide an intuitive mechanism for representing and manipulating triples of data and are a natural way to interface with the new class of high performance NoSQL triple store databases (e.g., Google Big Table, Apache Accumulo, Apache HBase, NetFlix Cassandra, Amazon Dynamo) [9]. Because the results of all queries and functions are associative arrays, all expressions are composable and can be directly used in linear algebraic calculations. The composability of associative arrays stems from the ability to define fundamental mathematical operations whose results are also associative arrays [8]. Given two associative arrays A and B, the results of all the following operations will also be associative arrays: A + B A - B A & B A B A*B provides tools that enable the algorithm developer to implement a sequence alignment algorithm on par with in just a few lines of code. The direct interface to high performance triple store databases allows new database sequence alignment techniques to be explored quickly. B. Algorithm The presented algorithm is designed to couple linear algebra approaches implemented in with well-known statistical properties to simplify and accelerate current methods of genetic sequence analysis. The analysis pipeline can be broken into four key steps: collection, ingestion, comparison, and correlation (Figure 2). Collection Ingestion and Comparison Correlation FASTA files s Unknown s Good alignment IDs A₁ Unknown IDs A₂ words (mers) words (mers) IDs A₁ A₂ Poor alignment Unknown IDs Fig. 2. Pipeline for analysis of DNA sample using technologies. Data is received from the sequencers in FASTA format, and parsed into -mers. The row, column, and value triples are ingested into associative arrays, and matrix multiplication finds the common words between sample and reference sequences. Matches are tested for good alignment using the linear R-value correlation.
3 & Cut offs 1 1,553 1,576 3,584 R-Value Cutoff 2 Hit-count Cutoff R 1 3 3, , hit count Low-complexity Poor alignment GGTGATGTGGGCGAGTAGCTTGGTGATGTTGGAGAATAGCTA-GGAGATGTTGGTGAGTA GGTGACGTTGGCGAATATGATGGTGACGTTGGTGAATA-CGATGGTGATGTAGGTGAGTA -GCTGGGGGAGGTGGGAGAATA-GCTGGGTGACGTTGGTGAGTA-GCTGGGAGAGGTTGG CGAT-GGTGATGTTGGCGAGTACGAT-GGTGATGTTGGTGAGTACGAT-GGTGATGTTGG Misaligned words Misaligned words Matches: 8 Hits: 2 1 R: ï.1453 P: Matches: 3 Hits: R: ï P: TGAGTA-GCTGGGGAG 18 TGAGTACGCTGGAGAG G31WZIU2H4XAW G31WZIU2JNO, CCF , & Cutïoffs R-value: Hit count: 8 E-Value: CAA R-value:.3489 Hit count: 3 E-Value: 1 E -9 ï1 ï2 R ï1 ï3 R-value: Hit count: 7 E-Value: 4.7 R-value: Hit count: 65 E-Value: 3 E -35 ï4 Matches: 7 Hits: R: ï P: 2.463eï7 ï hit count G31WZIU2H8EKE > ADB /organism="colletotrichum liriopes" /gene="chs1" /locus_tag="" /product="chitin synthase" /protein_id="adb " 16 /taxon_id="78192" /note="" Length= Score = 32.5 bits (17), Expect = 4.7 Identities = 42/53 (79%), Gaps = 6/53 (11%) Strand=Plus/Minus ADB CTCTGGTTCTCGGGTCG-TCTCTGGCGGCGTACGTCGTCTGA-GAGAC-GACG CTCTGGTTCTCGGGTTGATCT-TGGCA-CGGCCGTCG-CTGACGACACAGACG Poor alignment Poor Alignment (c) Fig. 3. Quality of matches based on number of common words and R-values. The vertical axis displays the modified R-value ( R 1 ) in a log scale, and the horizontal is the hit count. Due to selection requirements (dashed lines), the strong matches lie in the bottom right quadrant of the graph. The arrangement of the figure mirrors and shows the number of matches found by and in each region. (c) Representatives matches from each quadrant show the relationship between R-values, hit counts, and alignments. Reliable matches have an E-score below 3. Low hit counts and R-values are the result of large stretches of poor alignment and low-complexity repeats (top-left). Clockwise to the top-right, hit counts improve with increased complexity, but misaligned words result in a scattering of uncorrelated points. The words highlighted in blue and red are misaligned. At the bottom-left, higher R-values are created by segments of good alignment, but low hit counts remain due to large portions of poor alignment. Strong matches (bottom-right) are the result of long stretches of good alignment. The straight green bands in the graph correspond to the correctly aligned highlighted lengths of DNA. In collection, unknown sample data is received from a sequencer in FASTA format, and parsed into a suitable form for. DNA sequences are split into k-mers of bases in length (words). Uniquely identifiable metadata is attached to the words and positions are stored for later use. Low complexity words (composed of only 1 or 2 DNA bases) are dropped. Very long sequences are segmented into groups of 1, k-mers to reduce non-specific k-mer matches.
4 & Cut offs ,553 2 R-Value Cutoff R Hit-count Cutoff 256 2,413 3, Hit Count Fig. 4. Distribution of modified R-values and words in common between two sequences for Dataset 2. The vertical axis displays R 1 in a log scale, and the horizontal is the hit count. Dashed lines are selection requirements, and true matches lie at the bottom right quadrant of the graph. Shown are all matches after top hit count selection and prior to final R-value cut. The arrangement of the figure mirrors and shows the number of matches found by and in each region. The ingestion step loads the sequence identifiers, words, and positions into associative arrays by creating unique rows for every identifier and columns for each word. The triple store architecture of associative arrays effortlessly handles the ingestion and organization of the data. During the process, redundant k-mers are removed from each 1, bp segment, and only the first occurrence of words are saved. The length and the four possible bases gives a total of 4 (just over onemillion) possible words, naturally leading to sparse matrices and operations of sparse linear algebra. Additionally, the segmentation into groups of 1, ensures sparse vectors/matrices with less than 1 in 1, values used for each sequence segment. A similar procedure is followed for all known reference data, and is also ingested into a matrix. Sequence similarity is computed in the comparison step. Using the k-mers as vector indices allows a vector crossproduct value of two sequences to approximate a pair-wise alignment of the sequences. Likewise, a matrix multiplication allows the comparison of multiple sequences to multiple sequences in a single mathematical operation. For each unknown sequence, only strong matches (those with greater than 2 words in common) are stored for further analysis. Computations are accelerated by the sparseness of the matrices. The redundant nature of DNA allows two unrelated sequences to have numerous words in common. Noise is removed by assuring the words of two matching sequences fall in the same order. The correlation step makes use of the - mer positions to check the alignments. When the positions in reference and unknown sequences are plotted against each other, true alignments are linearly correlated with an absolute correlation coefficient (R-value) close to one. Matches with over 2 words in common and absolute R-values greater than.9 are considered strong. After these initial constraints are applied, additional tests may be used for organism identification and with large datasets. III. RESULTS The algorithm was tested using two generated datasets (Datasets 1 and 2) []. The smaller Dataset 1 (72,877 genetic sequences) was first compared to a fungal dataset and used to examine the selection criteria. Results were compared with those found by running. Dataset 2 (323,28 sequences) was formed from a human sample and spiked with in silico bacteria organisms using FastqSim [11]. Bacterial results from Dataset 2 were again compared to and used to test for correct organism identification. In both comparisons, the reference sets were compiled from RNA present in GenBank. It is important to note that the reference sequences are unique to the taxonomic gene level. Therefore, each gene of an organism is represented by a unique sequence and metadata. A. Dataset 1 Dataset 1 was compared to the fungal RNA dataset and, was run using N and a MIT Lincoln Laboratory developed Java BlastParser program. found 6,924 total matches, while discovered 8,842. Examination demonstrates the quality of the matches varies based on the hit counts and correlation coefficients. To separate the minute differences in linear correlation, the R-values were modified to Log( R 1 ). Because of this choice, the strong R-values with absolute value close to one have a modified value near zero. Figure 3a displays the modified R-values plotted versus the hit counts. The dashed lines show the threshold values of 2 words in common and a R-value greater than.9. Strong matches lie in the bottom right of the graph. The unusually large void between 2 and 1 on the vertical axis (Rvalues of.9 to.99) serves as a clear distinction between regions of signal and noise for the and data. Together, the hit count and R-value thresholds greatly reduce the background noise. Figure 3b shows a full distribution of the number of matches satisfying the numerical requirements. Almost 63% of the total finds have hit counts less than 2, and about 18% fall below both thresholds. Before the correlation selection, identifies
5 Organism Spiked 1 Not in Family Level Genus Level Species Level Francisella philomiragia 3,718 3,55 3,939 2 Francisella tularensis R 1 3 Escherichia coli Hit Count Fig. 5. Results of Dataset 2 shown in tabular and graphical forms. Number of sequences and found matching to the in silico spiked data as well as a background organism (E. coli) detected by both algorithms. Matches are highlighted based on correct identification of genus, species, strain, or false positives. Higher hit counts and R-values tend to correctly identify the genus and species, with lower values matching to artifacts. 5,16 background matches, of which, finds 1,576. After all restrictions, and both identify 1,717 matches. Representative correlations from each region demonstrate the selection quality of the algorithm (Figure 3c). Also shown are the alignments, in which the top strand is the unknown sequence and the bottom is the matching reference. Vertical lines indicate an exact base pair alignment. Dashes in the sequences represent a gap that was added by to improve the arrangement. The alignments include the initial seeds and the expansion until threshold values were reached. The results demonstrate how hit counts below twenty are a result of poor alignments and low complexity repeats; these matches have few regions with at least a bp overlap. Strong alignments as seen in the bottom-right of Figure 3c are comprised of long, well aligned stretches. Segments of poor alignment still exist, but they are proportionally less, and offset the sample and reference sequences by equal amounts. It is worth noting that the E-values of the four examples coincide with the results. The results of Dataset 1 present comparable findings to run with default parameters. B. Dataset 2 Additional filters were used in the analysis of the larger Dataset 2. As in Dataset 1, alignments were first required to have at least 2 words in common. Additionally, for each unknown sequence, the R-value was only computed for matches within % of the maximum hit count. For example, unknown sequence A might have 22 words in common with reference B, 46 with reference C, and 5 with reference D. R-values were calculated for the matches with references C and D since the hit counts are within % of 5, the maximum value. Similar to Dataset 1, absolute R-values were thresholded at.9, but were also required to be within 1% of the maximum for each unknown sequence. The additional percentage threshold values were chosen to identify matches of similar strength and reduce the number of computations. Again, results were compared with, this time run with default parameters. Comparisons of and findings before R-value filters are displayed in Figure 4a. The stricter conditions eliminate many of the false positives with low hit counts and R-values as seen in Dataset 1. Before the R-value restrictions, finds significantly more alignments than (Figure 4b). These numbers are reduced with R-value filters, but during organism identification steps, the majority of the additional points mapped to the correct species (discussed below). Notice the majority of findings missed by lie in the lower hit count regime, all of which emerge from the second hit count filter (within % of maximum). After applying all filters, each sample matched either to one or multiple references. Unique matches were labeled as that reference. In the case of multiple alignments, the taxonomies were compared and the sample was classified as the lowest common taxonomic level. For example, if unknown sequence A maps equally to references B and C, both of which are different species within the same genus, sequence A is classified as the common genus. The number of sequences matching to each family, genus, and species is tallied to give the final results. In Figure 5a, and results are numerically compared with the spiked organisms. The numbers indicate how many sequences were classified as species. Both and correctly identified the species F. philomiragia and F. tularensis, with numbers close to the truth data. It is important to note that F. philomiragia and F. tularensis are very closely related. Studies show F. philomiragia and F. tularensis to have between 98.5% and 99.9% identity [13] which accounts for the slight difference in numbers of matching sequences. and identified nearly the same amount of E. coli presence. Interestingly, E. coli was not an in silica spiked organism, and is instead a background organism (present in the human sample) detected by both. As previously noted, the numbers in Figure 4b show identified significantly more matches than. The data in Figure 4a was color-coded based on the taxonomy of matches. Results are presented in Figure 5b and reveal
6 ACKNOWLEDGMENT The authors are indebted to the following individuals for their technical contributions to this work: Chansup Byun, Ashley Conard, Dylan Hutchinson, and Anna Shcherbina. REFERENCES Fig. 6. Sequence-alignment implementations. The implementation requires x less code than. + Triple Store (Accumulo) reduces run time by x compared to. the majority of points with high hit counts and R-values are matching to the truth data. Again, at this stage, no R-value filters have been applied, but the results clearly indicate how hit counts and R-values are appropriate selection parameters to correctly and efficiently identify organisms present in a sample. C. Computational Acceleration with Apache Accumulo In the analysis described, parallel processing using pmatlab [15] was heavily relied upon to increase computation speeds. The implementation of an Apache Accumulo database additionally accelerated the already rapid sparse linear algebra computations and comparison processes, but was not used to the full advantage. As shown in [16], the triple store database can be used to identify the most common -mers. The least popular words are then selected and used in comparisons, as these hold the most power to uniquely identify the sequences. Preliminary results show subsampling greatly reduces the number of direct comparisons, and increases the speed x. Figure 6 shows the relative performance and software size of sequence alignment implemented using, alone, and with triple store Accumulo. Future developments will merge the results discussed here with the subsampling acceleration techniques of the database. IV. CONCLUSION With Matlab and techniques, the described algorithm is implemented in less than 1, lines of code. That gives a x improvement over, and on comparable hardware the performance level is within a factor of 2. The precise code allows for straight forward debugging and comprehension. Results shown here with Datasets 1 and 2 demonstrate that findings are comparable to and possibly more accurate. The next steps are to integrate the Apache Accumulo capabilities and optimize the selection parameters over several known datasets. Additionally, the capabilities will be ported to the SciDB database. The benefit of using in this application is that it can significantly reduce programming time, increase performance, and simplify the current complex sequence matching algorithms. [1] K. Howe, et al., The zebrafish reference genome sequence and its relationship to the human genome, Nature, vol. 496, p. 498, 213. [2] NCBI. [Online]. Available: [3] T. F. Smith and S. M. Waterman, Identification of Common Molecular Subsequences, Journal of Molecular Biology, vol. 147, pp , [4] The Statistics of Sequence Similarity Scores, NCBI, [Online]. Available: [5] I. Lobo, Basic Local Alignment Search Tool (), Nature Education 1(1):215, 28. [6] S. F. Altschul, W. Gish, W. Miller, E. W. Meyers, and D. J. Lipman, Basic Local Alignment Search Tool, J. Mol. Bio, 215, p. 43, 199. [7] F. Chang, J. Dean, S. Ghernawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, Bigtable: A distributed storage system for structured data,, ACM Transactions of Computer Systems (TOCS), vol. 26, no. 2, p. 4, 28. [8] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz et al., Dynamic distributed dimensional data model () database and computation system, in Acoustics, Speech and Signal Processing (ICASSP), 212 IEEE International Conference on. IEEE, 212, p [9] J. Kepner, C. Anderson, W. Arcand, D. Bestor, B. Bergeron, C. Byun, M. Hubbell, P. Michaleas, J. Mullen, D. Gwynn, A. Prout, A. Reuther, A. Rosa, and C Yee, 2. Schema: A General Purpose High Performance Schema for the Accumulo Database,IEEE High Performance Extreme Computing (HPEC) conference, 213. [] Rosenzweig, et al. 24 in preparation [11] Shcherbina, FASTQSim: Platform-Independent Data Characterization and in Silico Read Generation for NGS Datasets, 214 submitted. [12] B. Huber, R. Escudero, H. J. Busse, E. Seibold, H. C. Scholz, P. Anda, P. Kämpfer, W. D. Splettstoesser, Description of Francisella hispaniensis sp. nov., isolated from human blood, reclassification of Francisella novicida (Larson et al. 1955) Olsufiev et al as Francisella tularensis subsp. novicida comb. nov. and emended description of the genus Francisella, Int. J. Syst. Evol. Microbiol., 2. [13] M. Forsman, G. Sandstrom, and A. Sjostedt, Analysis of 16S Ribosomal Sequences of Francisella Strains and Utilization for Determination of the Phylogeny of the Genus and for Identification of Strains by PCR, Int. J. Syst. Evol. Micorbiol., [14] D. G. Hollis, R. E. Weaver, A. G. Steigerwalt, J. D. Wenger, C. W. Moss, and D. J. Brenner, Francisella philomiragia comb. nov. (formerly Yersinia philomiragia) and Francisella tularensis biogroup novicida (formerly Francisella novicida) associated with human disease, J. Clin. Microbiol., [15] Kepner, Jeremy, Parallel MATLAB for Multicore and Multinode Computers, SIAM Book Series on Software, Environments, and Tools (Jack Dongarra, series editor) Philadelphia: SIAM Press, 29. [16] J. Kepner, D. O. Ricke and D. Hutchinson, Taming Biological Big Data with, Lincoln Laboratory Journal, vo.l 2. no. 2, p. 82, 213.
Taming Biological Big Data with D4M
TAMING BIOLOGICAL BIG DATA WITH D4M Taming Biological Big Data with D4M Jeremy Kepner, Darrell O. Ricke, and Dylan Hutchison The supercomputing community has taken up the challenge of taming the beast
Driving Big Data With Big Compute
Driving Big Data With Big Compute Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Jeremy Kepner, Andrew McCabe, Peter Michaleas, Julie Mullen, David O Gwynn, Andrew Prout, Albert
Achieving 100,000,000 database inserts per second using Accumulo and D4M
Achieving 100,000,000 database inserts per second using Accumulo and D4M Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Vijay Gadepally, Matthew Hubbell, Peter Michaleas, Julie
D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database
DM. Schema: A General Purpose High Performance Schema for the Accumulo Database Jeremy Kepner, Christian Anderson, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Matthew Hubbell, Peter Michaleas,
A Tutorial in Genetic Sequence Classification Tools and Techniques
A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?
Optimization 1 Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance? Where to begin? 2 Sequence Databases Swiss-prot MSDB, NCBI nr dbest Species specific ORFS
DNA Sequencing Overview
DNA Sequencing Overview DNA sequencing involves the determination of the sequence of nucleotides in a sample of DNA. It is presently conducted using a modified PCR reaction where both normal and labeled
Visualization of Phylogenetic Trees and Metadata
Visualization of Phylogenetic Trees and Metadata November 27, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com [email protected]
GenBank: A Database of Genetic Sequence Data
GenBank: A Database of Genetic Sequence Data Computer Science 105 Boston University David G. Sullivan, Ph.D. An Explosion of Scientific Data Scientists are generating ever increasing amounts of data. Relevant
APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
BLAST. Anders Gorm Pedersen & Rasmus Wernersson
BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
Analyzing A DNA Sequence Chromatogram
LESSON 9 HANDOUT Analyzing A DNA Sequence Chromatogram Student Researcher Background: DNA Analysis and FinchTV DNA sequence data can be used to answer many types of questions. Because DNA sequences differ
Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST
Rapid alignment methods: FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST 257 BLAST: Basic Local Alignment Search Tool p BLAST (Altschul et al., 1990) and its variants are some
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
INTRODUCTION TO CASSANDRA
INTRODUCTION TO CASSANDRA This ebook provides a high level overview of Cassandra and describes some of its key strengths and applications. WHAT IS CASSANDRA? Apache Cassandra is a high performance, open
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
Module 1. Sequence Formats and Retrieval. Charles Steward
The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.
Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M)
Large Data Analysis using the Dynamic Distributed Dimensional Data Model (D4M) Dr. Jeremy Kepner Computing & Analytics Group Dr. Christian Anderson Intelligence & Decision Technologies Group This work
Tutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
Bio-Informatics Lectures. A Short Introduction
Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively
MASCOT Search Results Interpretation
The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually
Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011
Sequence Formats and Sequence Database Searches Gloria Rendon SC11 Education June, 2011 Sequence A is the primary structure of a biological molecule. It is a chain of residues that form a precise linear
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
NATIONAL GENETICS REFERENCE LABORATORY (Manchester)
NATIONAL GENETICS REFERENCE LABORATORY (Manchester) MLPA analysis spreadsheets User Guide (updated October 2006) INTRODUCTION These spreadsheets are designed to assist with MLPA analysis using the kits
Intro to Bioinformatics
Intro to Bioinformatics Marylyn D Ritchie, PhD Professor, Biochemistry and Molecular Biology Director, Center for Systems Genomics The Pennsylvania State University Sarah A Pendergrass, PhD Research Associate
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Delivering the power of the world s most successful genomics platform
Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE
Pairwise Sequence Alignment
Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores
Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores Composite Software October 2010 TABLE OF CONTENTS INTRODUCTION... 3 BUSINESS AND IT DRIVERS... 4 NOSQL DATA STORES LANDSCAPE...
SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
TOWARD BIG DATA ANALYSIS WORKSHOP
TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: [email protected]
BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations
Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations SCENARIO You have responded, as a result of a call from the police to the Coroner s Office, to the scene of the death of
MultiQuant Software 2.0 for Targeted Protein / Peptide Quantification
MultiQuant Software 2.0 for Targeted Protein / Peptide Quantification Gold Standard for Quantitative Data Processing Because of the sensitivity, selectivity, speed and throughput at which MRM assays can
Large-Scale Test Mining
Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Comparative genomic hybridization Because arrays are more than just a tool for expression analysis
Microarray Data Analysis Workshop MedVetNet Workshop, DTU 2008 Comparative genomic hybridization Because arrays are more than just a tool for expression analysis Carsten Friis ( with several slides from
Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1
Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: [email protected]
DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky
DNA Mapping/Alignment Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky Overview Summary Research Paper 1 Research Paper 2 Research Paper 3 Current Progress Software Designs to
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment
Tutorial for Windows and Macintosh Preparing Your Data for NGS Alignment 2015 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) 1.734.769.7249
Molecular typing of VTEC: from PFGE to NGS-based phylogeny
Molecular typing of VTEC: from PFGE to NGS-based phylogeny Valeria Michelacci 10th Annual Workshop of the National Reference Laboratories for E. coli in the EU Rome, November 5 th 2015 Molecular typing
Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003
Similarity Searches on Sequence Databases: BLAST, FASTA Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003 Outline Importance of Similarity Heuristic Sequence Alignment:
Next Generation Sequencing: Technology, Mapping, and Analysis
Next Generation Sequencing: Technology, Mapping, and Analysis Gary Benson Computer Science, Biology, Bioinformatics Boston University [email protected] http://tandem.bu.edu/ The Human Genome Project took
Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools
A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions
BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES, Vol. 59, No. 1, 2011 DOI: 10.2478/v10175-011-0015-0 Varia A greedy algorithm for the DNA sequencing by hybridization with positive and negative
Cloud Computing Where ISR Data Will Go for Exploitation
Cloud Computing Where ISR Data Will Go for Exploitation 22 September 2009 Albert Reuther, Jeremy Kepner, Peter Michaleas, William Smith This work is sponsored by the Department of the Air Force under Air
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Genetic Technology. Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.
Name: Class: Date: Genetic Technology Multiple Choice Identify the choice that best completes the statement or answers the question. 1. An application of using DNA technology to help environmental scientists
Lecture 13: DNA Technology. DNA Sequencing. DNA Sequencing Genetic Markers - RFLPs polymerase chain reaction (PCR) products of biotechnology
Lecture 13: DNA Technology DNA Sequencing Genetic Markers - RFLPs polymerase chain reaction (PCR) products of biotechnology DNA Sequencing determine order of nucleotides in a strand of DNA > bases = A,
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Healthcare data analytics. Da-Wei Wang Institute of Information Science [email protected]
Healthcare data analytics Da-Wei Wang Institute of Information Science [email protected] Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics
Big Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/
CD-HIT User s Guide Last updated: April 5, 2010 http://cd-hit.org http://bioinformatics.org/cd-hit/ Program developed by Weizhong Li s lab at UCSD http://weizhong-lab.ucsd.edu [email protected] 1. Introduction
SESSION DEPENDENT DE-IDENTIFICATION OF ELECTRONIC MEDICAL RECORDS
SESSION DEPENDENT DE-IDENTIFICATION OF ELECTRONIC MEDICAL RECORDS A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Bachelor of Science with Honors Research Distinction in Electrical
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 [email protected] Genomics A genome is an organism s
Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
CCR Biology - Chapter 9 Practice Test - Summer 2012
Name: Class: Date: CCR Biology - Chapter 9 Practice Test - Summer 2012 Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Genetic engineering is possible
Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6
Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6 Overview This tutorial outlines how microrna data can be analyzed within Partek Genomics Suite. Additionally,
Guide for Data Visualization and Analysis using ACSN
Guide for Data Visualization and Analysis using ACSN ACSN contains the NaviCell tool box, the intuitive and user- friendly environment for data visualization and analysis. The tool is accessible from the
Lab 2/Phylogenetics/September 16, 2002 1 PHYLOGENETICS
Lab 2/Phylogenetics/September 16, 2002 1 Read: Tudge Chapter 2 PHYLOGENETICS Objective of the Lab: To understand how DNA and protein sequence information can be used to make comparisons and assess evolutionary
Network Protocol Analysis using Bioinformatics Algorithms
Network Protocol Analysis using Bioinformatics Algorithms Marshall A. Beddoe [email protected] ABSTRACT Network protocol analysis is currently performed by hand using only intuition and a protocol
Comparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
BIOLOMICS SOFTWARE & SERVICES GENERAL INFORMATION DOCUMENT
BIOLOMICS SOFTWARE & SERVICES GENERAL INFORMATION DOCUMENT BIOAWARE SA NV - VERSION 2.0 - AUGUST 2013 BIOLOMICS SOFTWARE DYNAMIC CREATION AND MODIFICATION OF DATABASES Create simple or complex databases
Hosting Transaction Based Applications on Cloud
Proc. of Int. Conf. on Multimedia Processing, Communication& Info. Tech., MPCIT Hosting Transaction Based Applications on Cloud A.N.Diggikar 1, Dr. D.H.Rao 2 1 Jain College of Engineering, Belgaum, India
Bioruptor NGS: Unbiased DNA shearing for Next-Generation Sequencing
STGAAC STGAACT GTGCACT GTGAACT STGAAC STGAACT GTGCACT GTGAACT STGAAC STGAAC GTGCAC GTGAAC Wouter Coppieters Head of the genomics core facility GIGA center, University of Liège Bioruptor NGS: Unbiased DNA
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
MUSICAL INSTRUMENT FAMILY CLASSIFICATION
MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
InfiniteGraph: The Distributed Graph Database
A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086
SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications
Product Bulletin Sequencing Software SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Comprehensive reference sequence handling Helps interpret the role of each
Web-Based Genomic Information Integration with Gene Ontology
Web-Based Genomic Information Integration with Gene Ontology Kai Xu 1 IMAGEN group, National ICT Australia, Sydney, Australia, [email protected] Abstract. Despite the dramatic growth of online genomic
Joining Cassandra. Luiz Fernando M. Schlindwein Computer Science Department University of Crete Heraklion, Greece [email protected].
Luiz Fernando M. Schlindwein Computer Science Department University of Crete Heraklion, Greece [email protected] Joining Cassandra Binjiang Tao Computer Science Department University of Crete Heraklion,
Biological Databases and Protein Sequence Analysis
Biological Databases and Protein Sequence Analysis Introduction M. Madan Babu, Center for Biotechnology, Anna University, Chennai 25, India Bioinformatics is the application of Information technology to
Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data
CS535 Big Data W1.A.1 CS535 BIG DATA W1.A.2 Let the data speak to you Medication Adherence Score How likely people are to take their medication, based on: How long people have lived at the same address
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
