MK Gibson et al. Training and Benchmarking of Resfams profile HMMs
|
|
|
- Edwina McKenzie
- 10 years ago
- Views:
Transcription
1 Training and Benchmarking of Resfams profile HMMs A curated list of antibiotic resistance (AR) proteins used to generate Resfams profile HMMs, along with associated reference gene sequences, is available at These proteins were compiled using the Comprehensive Antibiotic Resistance Database (CARD) (McArthur et al., 2013), the Antibiotic Resistance Database (ARDB) (Liu and Pop, 2009), the Lactamase Engineering Database (LacED) (Thai et al., 2009) and Jacoby and BushÕs collection of curated beta-lactamase proteins ( All proteins were handcurated to ensure functional homology using phylogenetic analysis and literature searches, eliminating all incorrectly annotated and truncated protein sequences. Resfams profile HMMs were trained using CARD (McArthur et al., 2013), LacED (Thai et al., 2009), and Jacoby & BushÕs collection of beta-lactamases while retaining ARDB (Liu and Pop, 2009) as an independent test set. Gathering thresholds were optimized such that precision and recall metrics on this independent set of known genes were as close to 1.0 as possible. Precision was calculated as (the total number of correct annotations)/(total number of annotations). A precision of 1.0 for a profile HMM indicates that all annotations of proteins recruited by that HMM in the ARDB test set matched the annotation of the profile HMM. Recall was calculated as (the total number of correct annotations)/(the total number of proteins in ARDB with annotations matching the annotation of the profile HMM). A recall of 1.0 for a profile HMM indicates that all proteins contained in the ARDB test set that matched the annotation of the profile HMM were correctly recruited by the profile HMM. Resfams profile HMM Precision and Recall Optimization Overview The full set of precision and recall metrics of Resfams profile HMMs can be found in Supplementary Figure S1. We first evaluated the ability of each profile HMM to recruit the sequences used to initially train the profile HMMs. Protein sequences were recruited into an AR family or sub-family if they covered greater than 80% of the profile HMM and had an e-value score of less than 1x These global coverage and e-value scores were optimized to achieve maximum precision and recall. The distribution of precision and recall across all Resfams using these universal significance thresholds shows low precision for individual protein families (Supplementary Fig. S1A). These results indicate that a large number of AR families are structurally similar, highlighting an important challenge in predicting AR functions from sequence. To improve Resfams prediction accuracy, we optimized profile specific gathering thresholds, which set an inclusion bit score cut-off for a protein sequence alignment on a profile-by-profile basis (Supplementary Fig. S1B). Finally, we tested the prediction accuracy of these optimized Resfams families on AR proteins from ARDB (Liu and Pop, 2009) not used in training of the original profile HMMs (Supplementary Fig. S1C). These recruited protein sequences were subsequently incorporated into the corresponding Resfams protein families, resulting in the final database of AR profile HMMs used for all further analysis in this study. Resfams profile HMM Annotation from Microbial Sequence Alone To test the ability of Resfams to accurately distinguish between AR and all other genomic functions, we used the well-curated UniProtKB/SwissProt database (Supplementary Fig. S7). The UniProtKB/Swiss-Prot protein database was downloaded on September 15, 2013, containing 540,732 reviewed proteins. The full set of reviewed proteins was aligned to the core Resfams database of profile HMMs (Resfams.hmm) using the hmmscan function of the HM- MER3 (Finn et al., 2011) software package using the following parameters: --cut_ga, --tblout. All proteins in the UniProtKB/Swiss-Prot database that were recruited to at least one Resfams AR protein family are represented on Supplementary Figure S7. Hits were designated as true positive if the annotation in the UniProtKB/Swiss-Prot database matched the Resfams AR family annotation for the top hit. In addition, for all efflux/transporter and quinolone resistance AR mechanisms, true positive hits required the protein to be designated as an Antibiotic Resistance protein in the UniProtKB/Swiss-Prot database as these proteins are often also associated with other functions beyond resistance. All other hits were designated as false positive hits. Page 1
2 Resfams AR protein families had less than 5% false discovery rate with the exception of ABC transporters, a class of transmembrane proteins with extremely diverse functions that are categorized together due to a common ATP binding domain. Because ABC transporters are difficult to predict a priori as resistance proteins, these Resfams annotations were excluded from analysis in the absence of functional confirmation. Excluding ABC transporters, Resfams very accurately predicted AR function from microbial genomes. Combined with its ability to annotate sequence-divergent AR proteins, this indicates that Resfams is adept at predicting AR without the need for a functional verification assay (e.g. functional metagenomics). The accuracy of Resfams profile HMMs for predicting AR function from microbial sequence alone is supported by the results obtained from functional metagenomic selections of the human gut and soil microbiotas. For example, we found that AR to tetracycline is mediated almost exclusively by tetracycline MFS efflux pumps in the soil microbiota and by ribosomal protection (TetM/TetO/TetW/TetS) in the human gut microbiota (Fig. 3). Resfams profile HMMs predict the same profile of tetracycline resistance mechanisms across habitats obtained from functional selections in bacterial genomes. Conversely, pairwise sequence alignment to AR specific databases incorrectly predicts enrichment of all tetracycline resistance mechanisms in the human gut versus soil (Fig. 5b). Comparison of Resfams HMMs to BLAST to AR-specific databases For comparison of Resfams HMMs to BLAST to AR-specific databases in functional metagenomic selections (Fig. 1), we used assembled contigs from functional metagenomics studies number 1 (MDR soil isolates) and 2 (pediatric gut resistome) described in Supplementary Table S2. A total of 161 assembled contigs from the MDR soil isolate resistome study and 3,692 assembled contigs from the pediatric gut resistome study were used for method comparison. Open reading frames were predicted in the assembled contigs using the stand-alone version of MetaGeneMark (Zhu et al., 2010) with default parameters. Within each functional selection investigated, all proteins 100% identical over the length of the shorter sequence were collapsed into a single sequence using CD-HIT with the following parameters: -c 1.0 -as 1.0 -g 1-d 0. The longest protein in the identified cluster was retained for downstream analysis. A total of 281 and 9,795 proteins were identified in the MDR soil isolates and pediatric gut resistome study, respectively. The same predicted protein sequences for each study were then annotated using either the full Resfams database of profile HMMs (Resfams-full.hmm) or BLAST to AR-specific databases as described above. The hand curated annotations for the MDR soil isolates study previously reported in Forsberg et al., 2012 were used as a gold standard for that study for comparison of BLAST and Resfams HMMs. A recent report using pairwise sequence alignment to the ARDB (Liu and Pop, 2009) concluded that AR is highly enriched in the human gut as compared to natural environments, such as the soil (Hu et al., 2013). In contrast, we find no statistical difference between the total AR using either functional selection data from metagenomes or sequenced isolate genomes. This emphasizes that studies of AR in microbial genomes and communities and comparisons across habitats requires functional or consensus-based annotation methods in order to provide a complete, unbiased representation of AR reservoirs. Resistome analysis using functional metagenomic selections Functional selections using antibiotics common to all three functional metagenomic selections described (see Methods) were used for comparative resistome analysis (Penicillin, Piperacillin, Tetracycline, Chloramphenicol, and Gentamicin). Contigs were assembled using PARFuMS (Forsberg et al., 2012) and open reading frames were predicted using MetaGeneMark (Zhu et al., 2010) as described above. Within each sample, all proteins 100% identical over the length of the shorter sequence were collapsed into a single sequence using CD-HIT with the following parameters: -c 1.0 ÐaS 1.0 Ðg 1-d 0. All proteins over 350bp and unique within sample were then used for downstream analysis. All proteins were then annotated using Resfams HMMs as described above, resulting in a total of 3,099 AR proteins used for comparative resistome analysis (64, MDR soil isolates; 1,082, pediatric gut resistome, 1,953; soil resistome). A count matrix of unique protein sequences per Resfams family annotation for each resistome sample was generated by summing unique annotation counts across all antibiotic selections for a sample and normalizing them by metagenomic library size (Supplementary Table S2). The normalized AR protein count table was then used to generate Bray-Curtis and binary Jaccard distance matrices and perform principal coordinate analysis (PCoA) using the beta_diversity.py Page 2
3 and principal_coordinates.py scripts (QIIME (Caporaso et al., 2010)). The significance of clusters by was determined using ANOSIM and performed using the compare_categories.py script (QIIME (Caporaso et al., 2010)). Random forests analysis was performed using the supervised_learning.py script (QIIME (Caporaso et al., 2010) and randomforest R package (Liaw and Wiener, 2002)) to determine the Resfams families that most discriminate resistomes between habitats. PCoA plots were plotted along with the six most discriminating Resfams families as determined by random forests analysis as biplots for bray-curtis distance matrix. Biplot positions were calculated as the weighted average of the coordinate positions of all samples along the first two PCoA axes, where the weights are the relative abundances of the Resfams family. The size of the biplot points represents the aggregate abundance of the Resfams family across all samples. A bipartite network (Fig. 3) was generated from the normalize AR protein count table using the make_bipartite_network.py script (QIIME (Caporaso et al., 2010)) and then visualized using Cytoscape (Shannon et al., 2003) version using the edge weighted spring embedded format. Generation of binary heatmaps for genome comparisons Binary heatmaps corresponding to genome annotations (e.g., Fig. 4) were created using the Heatplus package in R. Sections of the heatmap were colored and outlined if there was a significant enrichment of that AR mechanism (listed in Supplementary Table S1) as determined by Fisher s Exact Test (P < 0.01; Supplementary Table S5). If there was not a significant enrichment, the heatmap was colored black. References Caporaso, JG, Kuczynski, J, Stombaugh, J, Bittinger, K, Bushman, FD, Costello, EK, et al. (May 2010). QIIME allows analysis of high-throughput community sequencing data. Nature methods 7: Finn, RD, Clements, J, Eddy, SR. (July 2011). HMMER web server: interactive sequence similarity searching. Nucleic acids research 39: W Forsberg, KJ, Reyes, A, Wang, B, Selleck, EM, Sommer, MOA, Dantas, G. (Aug. 2012). The shared antibiotic resistome of soil bacteria and human pathogens. Science (New York, N.Y.) 337: Hu, Y, Yang, X, Qin, J, Lu, N, Cheng, G, Wu, N, et al. (2013). Metagenome-wide analysis of antibiotic resistance genes in a large cohort of human gut microbiota. Nature communications 4: Liaw, A, Wiener, M. (2002). Classification and regression by randomforest. R News 2: Liu, B, Pop, M. (Jan. 2009). ARDB Antibiotic Resistance Genes Database. Nucleic acids research 37: D McArthur, AG, Waglechner, N, Nizam, F, Yan, A, Azad, MA, Baylay, AJ, et al. (July 2013). The comprehensive antibiotic resistance database. Antimicrobial agents and chemotherapy 57: Shannon, P, Markiel, A, Ozier, O, Baliga, NS, Wang, JT, Ramage, D, et al. (Nov. 2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research 13: Thai, QK, Bös, F, Pleiss, J. (2009). The Lactamase Engineering Database: a critical survey of TEM sequences in public databases. BMC genomics 10: 390. Zhu, W, Lomsadze, A, Borodovsky, M. (July 2010). Ab initio gene identification in metagenomic sequences. Nucleic acids research 38: e132. Page 3
4 Supplementary Figure S1: Resfams family precision and recall distributions. The x-axis represents the precision or recall values while the y-axis represents the number of Resfams families of a given value. (A) Precision and recall distributions of Resfams families on the proteins used to train the original profile using universal thresholds. (B) Precision and recall distributions of Resfams families on the proteins used to train the original profile HMMs using defined gathering thresholds. (C) Precision and recall distributions of Resfams families on an independent validation set of antibiotic resistance proteins not used to train the Resfams profile HMMs using defined gathering thresholds. Page 4
5 Supplementary Figure S2: Phenotypic resistance profiles across ecologies. Binary heatmap (gray, resistance observed; white, no resistance) of phenotypic profiles of 16 soil microbiota (green) and 18 human gut microbiota (magenta) to 18 antibiotics. Samples are clustered using the Jaccard Index and resistance profiles are herarcically clustered. Page 5
6 Supplementary Figure S3: Principal coordinate analysis (PCoA) plots of functional metagenomic selections. PCoA plot depicting binary Jaccard distances between resistomes of soil microbiota (green), human gut microbiota (magenta), and MDR soil isolates (blue), calculated using unique ARG counts generated by Resfams annotations. Resistomes of different ecologies cluster separately (P < 0.001, ANOSIM). Page 6
7 Supplementary Figure S4: Distribution of total antibiotic resistance (AR) genome composition across habitat and bacteria phyla. The total AR genome composition was calculated by taking the total number of genes annotated by Resfams profile HMMs as AR genes in the genome divided by the total number of genes in the genome. The total AR genome composition was averaged across (A) habitats and (C) bacterial phyla. Error bars represent one standard deviation. The raw number of total AR genes per genome was averaged across all genomes by (B) habitat. Error bars represent one standard deviation. Page 7
8 Supplementary Figure S5: Enrichment of antibiotic resistance (AR) mechanisms in functional metagenomic selections. Distribution of Resfams family AR genes identified in functional metagenomic selections of the soil (green) and human gut (red), including (A) all mechanisms, (B) β-lactamase Ambler classes, and (C) tetracycline resistance mechanisms. The normalized count of unique antibiotic resistance genes per metagenome investigated in each mechanistic category along the x-axis is depicted along the y-axis (*P < 0.05, **P < 0.001; Student s t-test). Page 8
9 Supplementary Figure S6: Enrichment of antibiotic resistance functions in pathogens. (A) Binary heatmap of resistomes organized by pathogen status of 2,966 genome sequenced bacterial isolates. The heatmap is colored by enrichment of a particular AR mechanism within pathogenic or non-pathogenic organisms (P < 0.01, Fisher s exact). Enrichment of (B) β-lactamase ambler class and (C) tetracycline resistance functions within pathogenic or non-pathogenic organisms (*P < 0.01, Fisher s exact). Page 9
10 Supplementary Figure S7: Resfams profile HMM annotation of 540,732 proteins in the UniProtKB/Swiss-Prot curated database. The y-axis represents the total number of proteins recruited to Resfams familes in each of the antibiotic resistance mechanism categories represented along the x-axis. True positive hits (blue) both match the annotation category and are flagged as antibiotic resistant protein forms in the UniProtKB/Swiss-Prot Database. (A) Recruitment distribution of all Resfams families. (B) ABC Transporter mechanistic class broken down into the individual Resfams families, showing that the majority of false positive arise from the general ABC Transporter profile HMM. (C) Recruitment distribution of all Resfams families used in annotation of genomes in absence of functional assay. Page 10
Visualizing Networks: Cytoscape. Prat Thiru
Visualizing Networks: Cytoscape Prat Thiru Outline Introduction to Networks Network Basics Visualization Inferences Cytoscape Demo 2 Why (Biological) Networks? 3 Networks: An Integrative Approach Zvelebil,
T cell Epitope Prediction
Institute for Immunology and Informatics T cell Epitope Prediction EpiMatrix Eric Gustafson January 6, 2011 Overview Gathering raw data Popular sources Data Management Conservation Analysis Multiple Alignments
Tutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
TOOLS FOR T-RFLP DATA ANALYSIS USING EXCEL
TOOLS FOR T-RFLP DATA ANALYSIS USING EXCEL A collection of Visual Basic macros for the analysis of terminal restriction fragment length polymorphism data Nils Johan Fredriksson TOOLS FOR T-RFLP DATA ANALYSIS
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-17-AUC
Guide for Bioinformatics Project Module 3
Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
A Primer of Genome Science THIRD
A Primer of Genome Science THIRD EDITION GREG GIBSON-SPENCER V. MUSE North Carolina State University Sinauer Associates, Inc. Publishers Sunderland, Massachusetts USA Contents Preface xi 1 Genome Projects:
Twincore - Zentrum für Experimentelle und Klinische Infektionsforschung Institut für Molekulare Bakteriologie
Twincore - Zentrum für Experimentelle und Klinische Infektionsforschung Institut für Molekulare Bakteriologie 0 HELMHOLTZ I ZENTRUM FÜR INFEKTIONSFORSCHUNG Technische Universität Braunschweig Institut
MultiExperiment Viewer Quickstart Guide
MultiExperiment Viewer Quickstart Guide Table of Contents: I. Preface - 2 II. Installing MeV - 2 III. Opening a Data Set - 2 IV. Filtering - 6 V. Clustering a. HCL - 8 b. K-means - 11 VI. Modules a. T-test
AmphoraNet: Taxonomic Composition Analysis of Metagenomic Shotgun Sequencing Data
Csaba Kerepesi, Dániel Bánky, Vince Grolmusz: AmphoraNet: Taxonomic Composition Analysis of Metagenomic Shotgun Sequencing Data http://pitgroup.org/amphoranet/ PIT Bioinformatics Group, Department of Computer
Current Motif Discovery Tools and their Limitations
Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011
Sequence Formats and Sequence Database Searches Gloria Rendon SC11 Education June, 2011 Sequence A is the primary structure of a biological molecule. It is a chain of residues that form a precise linear
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Course on Functional Analysis. ::: Gene Set Enrichment Analysis - GSEA -
Course on Functional Analysis ::: Madrid, June 31st, 2007. Gonzalo Gómez, PhD. [email protected] Bioinformatics Unit CNIO ::: Contents. 1. Introduction. 2. GSEA Software 3. Data Formats 4. Using GSEA 5. GSEA
Real-time PCR: Understanding C t
APPLICATION NOTE Real-Time PCR Real-time PCR: Understanding C t Real-time PCR, also called quantitative PCR or qpcr, can provide a simple and elegant method for determining the amount of a target sequence
ProteinQuest user guide
ProteinQuest user guide 1. Introduction... 3 1.1 With ProteinQuest you can... 3 1.2 ProteinQuest basic version 4 1.3 ProteinQuest extended version... 5 2. ProteinQuest dictionaries... 6 3. Directions for
Guide for Data Visualization and Analysis using ACSN
Guide for Data Visualization and Analysis using ACSN ACSN contains the NaviCell tool box, the intuitive and user- friendly environment for data visualization and analysis. The tool is accessible from the
OD-seq: outlier detection in multiple sequence alignments
Jehl et al. BMC Bioinformatics (2015) 16:269 DOI 10.1186/s12859-015-0702-1 RESEARCH ARTICLE Open Access OD-seq: outlier detection in multiple sequence alignments Peter Jehl, Fabian Sievers * and Desmond
MASCOT Search Results Interpretation
The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually
Data Analysis on the ABI PRISM 7700 Sequence Detection System: Setting Baselines and Thresholds. Overview. Data Analysis Tutorial
Data Analysis on the ABI PRISM 7700 Sequence Detection System: Setting Baselines and Thresholds Overview In order for accuracy and precision to be optimal, the assay must be properly evaluated and a few
UGENE Quick Start Guide
Quick Start Guide This document contains a quick introduction to UGENE. For more detailed information, you can find the UGENE User Manual and other special manuals in project website: http://ugene.unipro.ru.
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE
ACCELERATING PROGRESS IS IN OUR GENES AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE GENESPRING GENE EXPRESSION (GX) MASS PROFILER PROFESSIONAL (MPP) PATHWAY ARCHITECT (PA) See Deeper. Reach Further. BIOINFORMATICS
UCHIME in practice Single-region sequencing Reference database mode
UCHIME in practice Single-region sequencing UCHIME is designed for experiments that perform community sequencing of a single region such as the 16S rrna gene or fungal ITS region. While UCHIME may prove
Analysis of FFPE DNA Data in CNAG 2.0 A Manual
Analysis of FFPE DNA Data in CNAG 2.0 A Manual Table of Contents: I. Background P.2 II. Installation and Setup a. Download/Install CNAG 2.0 P.3 b. Setup P.4 III. Extract Mapping 500K FFPE Data P.7 IV.
BLAST. Anders Gorm Pedersen & Rasmus Wernersson
BLAST Anders Gorm Pedersen & Rasmus Wernersson Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database Database searching Most common use of pairwise
Introduction to Bioinformatics 3. DNA editing and contig assembly
Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
A Tutorial in Genetic Sequence Classification Tools and Techniques
A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide
NORTH PACIFIC RESEARCH BOARD SEMIANNUAL PROGRESS REPORT
1. PROJECT INFORMATION NPRB Project Number: 1303 Title: Assessing benthic meiofaunal community structure in the Alaskan Arctic: A high-throughput DNA sequencing approach Subaward period July 1, 2013 Jun
Clone Manager. Getting Started
Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software
Sonatype CLM Server - Dashboard. Sonatype CLM Server - Dashboard
Sonatype CLM Server - Dashboard i Sonatype CLM Server - Dashboard Sonatype CLM Server - Dashboard ii Contents 1 Introduction 1 2 Accessing the Dashboard 3 3 Viewing CLM Data in the Dashboard 4 3.1 Filters............................................
Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions
SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0
Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?
Optimization 1 Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance? Where to begin? 2 Sequence Databases Swiss-prot MSDB, NCBI nr dbest Species specific ORFS
Metagenomic Assembly
Metagenomic Assembly Successes, Validation, And Challenges Matthew Scholz Los Alamos National Laboratory Genome Sciences Division Metagenomic Assembly is Complex Complexity dictates method of assembly
Unraveling protein networks with Power Graph Analysis
Unraveling protein networks with Power Graph Analysis PLoS Computational Biology, 2008 Loic Royer Matthias Reimann Bill Andreopoulos Michael Schroeder Schroeder Group Bioinformatics 1 Complex Networks
Delivering the power of the world s most successful genomics platform
Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE
CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.
: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data. Nicolas Philippe and Mikael Salson and Thérèse Commes and Eric Rivals February 13, 2013 1 Results
Hierarchical Clustering Analysis
Hierarchical Clustering Analysis What is Hierarchical Clustering? Hierarchical clustering is used to group similar objects into clusters. In the beginning, each row and/or column is considered a cluster.
Mass Spectrometry Signal Calibration for Protein Quantitation
Cambridge Isotope Laboratories, Inc. www.isotope.com Proteomics Mass Spectrometry Signal Calibration for Protein Quantitation Michael J. MacCoss, PhD Associate Professor of Genome Sciences University of
Introduction to Genome Annotation
Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT
An introduction to bioinformatic tools for population genomic and metagenetic data analysis, 2.5 higher education credits Third Cycle
An introduction to bioinformatic tools for population genomic and metagenetic data analysis, 2.5 higher education credits Third Cycle Faculty of Science; Department of Marine Sciences The Swedish Royal
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data The Illumina TopHat Alignment and Cufflinks Assembly and Differential Expression apps make RNA data analysis accessible to any user, regardless
Data mining on the EPIRARE survey data
Deliverable D1.5 Data mining on the EPIRARE survey data A. Coi 1, M. Santoro 2, M. Lipucci 2, A.M. Bianucci 1, F. Bianchi 2 1 Department of Pharmacy, Unit of Research of Bioinformatic and Computational
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
3. About R2oDNA Designer
3. About R2oDNA Designer Please read these publications for more details: Casini A, Christodoulou G, Freemont PS, Baldwin GS, Ellis T, MacDonald JT. R2oDNA Designer: Computational design of biologically-neutral
Methods for network visualization and gene enrichment analysis July 17, 2013. Jeremy Miller Scientist I [email protected]
Methods for network visualization and gene enrichment analysis July 17, 2013 Jeremy Miller Scientist I [email protected] Outline Visualizing networks using R Visualizing networks using outside
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation
Identification of rheumatoid arthritis and osterthritis patients by transcriptome-based rule set generation Bering Limited Report generated on September 19, 2014 Contents 1 Dataset summary 2 1.1 Project
Big Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER
Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER JMP Genomics Step-by-Step Guide to Bi-Parental Linkage Mapping Introduction JMP Genomics offers several tools for the creation of linkage maps
Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center
Computational Challenges in Storage, Analysis and Interpretation of Next-Generation Sequencing Data Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center Next Generation Sequencing
Data Mining with SQL Server Data Tools
Data Mining with SQL Server Data Tools Data mining tasks include classification (directed/supervised) models as well as (undirected/unsupervised) models of association analysis and clustering. 1 Data Mining
Consensus alignment server for reliable comparative modeling with distant templates
W50 W54 Nucleic Acids Research, 2004, Vol. 32, Web Server issue DOI: 10.1093/nar/gkh456 Consensus alignment server for reliable comparative modeling with distant templates Jahnavi C. Prasad 1, Sandor Vajda
CD-HIT User s Guide. Last updated: April 5, 2010. http://cd-hit.org http://bioinformatics.org/cd-hit/
CD-HIT User s Guide Last updated: April 5, 2010 http://cd-hit.org http://bioinformatics.org/cd-hit/ Program developed by Weizhong Li s lab at UCSD http://weizhong-lab.ucsd.edu [email protected] 1. Introduction
Next Generation Sequencing Technologies in Microbial Ecology. Frank Oliver Glöckner
Next Generation Sequencing Technologies in Microbial Ecology Frank Oliver Glöckner 1 Max Planck Institute for Marine Microbiology Investigation of the role, diversity and features of microorganisms Interactions
How Sequencing Experiments Fail
How Sequencing Experiments Fail v1.0 Simon Andrews [email protected] Classes of Failure Technical Tracking Library Contamination Biological Interpretation Something went wrong with a machine
Vaxign Reverse Vaccinology Software Demo Introduction Zhuoshuang Allen Xiang, Yongqun Oliver He
Vaxign Reverse Vaccinology Software Demo Introduction Zhuoshuang Allen Xiang, Yongqun Oliver He Unit for Laboratory Animal Medicine Department of Microbiology and Immunology Center for Computational Medicine
SUBJECT: New Features in Version 5.3
User Bulletin Sequencing Analysis Software v5.3 July 2007 SUBJECT: New Features in Version 5.3 This user bulletin includes the following topics: New Features in v5.3.....................................
Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon
Integrating DNA Motif Discovery and Genome-Wide Expression Analysis Department of Mathematics and Statistics University of Massachusetts Amherst Statistics in Functional Genomics Workshop Ascona, Switzerland
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation
PN 100-9879 A1 TECHNICAL NOTE Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation Introduction Cancer is a dynamic evolutionary process of which intratumor genetic and phenotypic
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
Blog Post Extraction Using Title Finding
Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School
DeCyder Extended Data Analysis module Version 1.0
GE Healthcare DeCyder Extended Data Analysis module Version 1.0 Module for DeCyder 2D version 6.5 User Manual Contents 1 Introduction 1.1 Introduction... 7 1.2 The DeCyder EDA User Manual... 9 1.3 Getting
Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6
Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6 Overview This tutorial outlines how microrna data can be analyzed within Partek Genomics Suite. Additionally,
Use of Whole Genome Sequencing (WGS) of food-borne pathogens for public health protection
EFSA Scientific Colloquium n 20 Use of Whole Genome Sequencing (WGS) of food-borne pathogens for public health protection Parma, Italy, 16-17 June 2014 Why WGS based approach Infectious diseases are responsible
Florida International University - University of Miami TRECVID 2014
Florida International University - University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, Mei-Ling Shyu 1, Shu-Ching Chen 2, Hsin-Yu Ha 2, Ming Ma 1, Winnie Chen 4,
A Streamlined Workflow for Untargeted Metabolomics
A Streamlined Workflow for Untargeted Metabolomics Employing XCMS plus, a Simultaneous Data Processing and Metabolite Identification Software Package for Rapid Untargeted Metabolite Screening Baljit K.
What s new in Teamcenter Service Pack 10.1.4
Siemens PLM Software What s new in Teamcenter Service Pack 10.1.4 Benefits Streamlined collaboration between mechanical and electronic design teams Improved software, development and delivery with integration
G E N OM I C S S E RV I C ES
GENOMICS SERVICES THE NEW YORK GENOME CENTER NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. capabilities. N E X T- G E
Frequently Asked Questions Next Generation Sequencing
Frequently Asked Questions Next Generation Sequencing Import These Frequently Asked Questions for Next Generation Sequencing are some of the more common questions our customers ask. Questions are divided
A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 402 A Multiple DNA Sequence Translation Tool Incorporating Web
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community
Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Ntinos Krampis Asst. Professor J. Craig Venter Institute [email protected] http://www.jcvi.org/cms/about/bios/kkrampis/
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
MiSeq: Imaging and Base Calling
MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please
Version 5.0 Release Notes
Version 5.0 Release Notes 2011 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) +1.734.769.7249 (elsewhere) +1.734.769.7074 (fax) www.genecodes.com
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
Data Analysis for Ion Torrent Sequencing
IFU022 v140202 Research Use Only Instructions For Use Part III Data Analysis for Ion Torrent Sequencing MANUFACTURER: Multiplicom N.V. Galileilaan 18 2845 Niel Belgium Revision date: August 21, 2014 Page
