Bioinformatics using Python for Biologists
- Ashlynn Fox
- 10 years ago
10.1 The SeqIO module

Many file formats are employed by the most popular databases to store information in ways that should be easily interpreted by a computer program. Here, interpreting means extracting information (i.e. parsing) and converting it into formats appropriate for further processing and analysis. Parsing such files is a common and important task that the bioinformatician must perform accurately. However, it can be frustrated by the fact that formats change quite regularly and may contain small subtleties that break even the most well-designed parsers.

The Biopython SeqIO module provides parsers for many common file formats, which generally extract information from the input file and convert it into SeqRecord objects. There are two methods for sequence file parsing: SeqIO.parse() and SeqIO.read(). Both take two mandatory arguments and one optional argument:

- a handle that specifies where the data must be read from (a file name, a file opened for reading, data downloaded from a database using a script, or the output of another piece of code);
- a flag indicating the format of the data (a full list of supported formats is available in the Biopython documentation);
- an optional argument that specifies the alphabet of the sequence data.

The difference between the two is that SeqIO.parse() returns an iterator that goes through all records in the input handle, to be used in for or while loops. SeqIO.read(), on the other hand, must be used on files containing exactly one record.
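The contrast between the two methods can be illustrated without Biopython. The sketch below is written for Python 3 and uses only the standard library; parse_fasta and read_fasta are hypothetical helper names invented for this illustration, not Biopython functions, and they return plain (identifier, sequence) tuples instead of SeqRecord objects. It only mimics the behaviour described above: one function yields records lazily, the other insists on exactly one record.

```python
import io


def parse_fasta(handle):
    """Lazily yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            header = line[1:].split()
            identifier, chunks = (header[0] if header else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def read_fasta(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = parse_fasta(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(records)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")


# Toy data invented for this sketch.
two_records = ">seq1 demo\nACGT\nAC\n>seq2\nGGCC\n"

parsed = list(parse_fasta(io.StringIO(two_records)))
print(parsed)  # [('seq1', 'ACGTAC'), ('seq2', 'GGCC')]

single = read_fasta(io.StringIO(">only\nTTAA\n"))
print(single)  # ('only', 'TTAA')
```

Calling read_fasta on the two-record text raises ValueError, which is the same kind of failure this module shows later for SeqIO.read() on a multi-record file.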
SeqIO.parse() and SeqIO.read() take the same arguments, and both produce SeqRecord objects.

10.2 Reading local files

Let's read the file D.rerio_calcineurin.fa, containing FASTA format records of all entries matching the keyword calcineurin in the zebrafish (Danio rerio) genome obtained from the NCBI. The SeqIO.parse() method will generate an iterator over SeqRecord objects; features can then be extracted from each SeqRecord object as described in Module 9:
>>> import Bio
>>> from Bio import SeqIO
>>> handle = open("D.rerio_calcineurin.fa", "r")
>>> type(handle)
<type 'file'>
>>> for seq_record in SeqIO.parse(handle, "fasta"):
...     print seq_record.id
...     print repr(seq_record.seq)
...     print len(seq_record)
...
gi ref XM_ 
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTATAG', SingleLetterAlphabet())
2808
gi ref XM_ 
Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGCTAA', SingleLetterAlphabet())
1035
>>> handle.close()

Since the handle is a file, it is good practice to close it when the processing is done. Remember that the iterator consumes the file: to scan the records another time, the file must be closed, then opened again, and then passed again as the handle argument to SeqIO.parse().

In a similar way, we can parse an equivalent file in GenBank format. This time we also omit the explicit creation of the handle and pass the file name (or its complete path) directly to SeqIO.parse():

>>> for seq_record in SeqIO.parse("D.rerio_calcineurin.gb", "genbank"):
...     print seq_record.id
...     print repr(seq_record.seq)
...     print len(seq_record)
...
XM_ 
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTATAG', IUPACAmbiguousDNA())
2808
XM_ 
Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGCTAA', IUPACAmbiguousDNA())
1035

Two things should be noted. First, the GenBank parser is able to assign the correct alphabet to the sequence records in the input file, while the FASTA parser assigns a generic SingleLetterAlphabet(). Second, the GenBank SeqRecord stores a more compact id attribute for the sequence records.

As mentioned before, SeqIO.parse() can process any number of records in the input handle. SeqIO.read() instead checks whether there is exactly one record in the handle, raising an exception if this condition is not met:

>>> handle = open("D.rerio_calcineurin.gb", "r")
>>> SeqIO.read(handle, "genbank")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SeqIO/__init__.py", line 614, in read
ValueError: More than one record found in handle

Using an iterator is a way to parse large files without consuming large amounts of memory. On the other hand, as mentioned above, each record can be accessed only once in the for loop. The iterator also provides a method to access records step by step:

>>> handle = open("D.rerio_calcineurin.gb")
>>> iterator = SeqIO.parse(handle, "genbank")
>>> first_record = iterator.next()
>>> type(first_record)
<class 'Bio.SeqRecord.SeqRecord'>
>>> first_record.id
'XM_ '
>>> first_record.seq
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTATAG', IUPACAmbiguousDNA())
>>> first_record.description
'PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC ), mRNA.'
>>> second_record = iterator.next()
>>> second_record.id
'XM_ '

When there are no more records in the file, the .next() method will either return the special Python object None or raise a StopIteration exception (depending on which Biopython release you have installed on your system).

Using this approach you could in principle assign each record to a different variable, if you need to keep these records at hand. This is impractical if the number of records is high or unknown beforehand. It is however possible to store all SeqRecord objects returned by SeqIO in a data structure such as a list:
>>> records = list(SeqIO.parse("D.rerio_calcineurin.gb", "genbank"))
>>> len(records)
61
>>> records[0]  # the first record
SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTATAG', IUPACAmbiguousDNA()), id='XM_ ', name='XM_ ', description='PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC ), mRNA.', dbxrefs=[])
>>> records[0].id
'XM_ '
>>> records[0].seq
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTATAG', IUPACAmbiguousDNA())
>>> for key, value in records[0].annotations.items():
...     print key, value
...
comment MODEL REFSEQ: This record is predicted by automated computational analysis. This record is derived from a genomic sequence (NW_ ) annotated using gene prediction method: GNOMON, supported by EST evidence. Also see: Documentation of NCBI's Annotation Process
sequence_version 1
source Danio rerio (zebrafish)
taxonomy ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
keywords ['']
accessions ['XM_ ']
data_file_division VRT
date 23-MAR-2011
organism Danio rerio
gi 
>>> records[-1]  # the last record
SeqRecord(seq=Seq('GCAGCAATTTGAGGAAGAAGCGCAAACAGACAGGTCAGGTGTGGCGATGGCAGCAAA', IUPACAmbiguousDNA()), id='BC ', name='BC139891', description='Danio rerio zgc:162913, mRNA (cDNA clone MGC: IMAGE: ), complete cds.', dbxrefs=[])

SeqIO also provides a method, SeqIO.to_dict(), that turns the SeqRecord objects produced by the iterator into the values of a dictionary whose keys are the SeqRecord.id attributes:
>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
>>> for key, value in records.items():
...     print key, value.id, value.description
...
BC BC Danio rerio zgc:112142, mRNA (cDNA clone MGC: IMAGE: ), complete cds.
XM_ XM_ PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 (nfatc3), mRNA.
BC BC Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347 IMAGE: ), complete cds.

Note that if duplicate keys are found, an exception will be raised.

For a very large number of records there is another method, Bio.SeqIO.index(), which creates a dictionary-like object without keeping all the data in memory. Instead, the dictionary values correspond to the position of each record in the file, and the content of a record is parsed on the fly when that record is accessed. This allows the handling of a huge number of records, at a small cost in flexibility and speed. Moreover, these dictionary-like objects are read-only: once created, data cannot be inserted or removed. Note that in this case the first argument cannot be an open file handle; it must be a file name.

>>> records = SeqIO.index("D.rerio_calcineurin.gb", "genbank")
>>> records.keys()
['BC ', 'XM_ ', 'BC ', 'BC ', 'BC ', 'NM_ ', 'NM_ ', 'BC ', 'NM_ ', 'BC ', 'NM_ ', 'BC ', 'NM_ ', 'BC ', 'BC ', 'BC ', 'NM_ ', 'NM_ ', 'BC ', 'BC ', 'XM_ ', 'BC ', 'BC ', 'NM_ ', 'XM_ ', 'XM_ ', 'XM_ ', 'XM_ ', 'BC ', 'XM_ ', 'BC ', 'NM_ ', 'NM_ ', 'XM_ ', 'BC ', 'XM_ ', 'BC ', 'NM_ ', 'NM_ ', 'NM_ ', 'NM_ ', 'BC ', 'BC ', 'BC ', 'GU ', 'XM_ ', 'NM_ ', 'NM_ ', 'BC ', 'AY ', 'BC ', 'BC ', 'NM_ ', 'BC ', 'NM_ ', 'NM_ ', 'BC ', 'BC ', 'XM_ ', 'BC ', 'BC ']
>>> print records["BC "].description
Danio rerio zgc:112142, mRNA (cDNA clone MGC: IMAGE: ), complete cds

10.3 Reading files from the web

As we stated before, a handle can also be used to fetch data from web databases.
Since parsing a file with an iterator consumes the handle itself, it is good practice to store the downloaded file locally. Nevertheless, it can sometimes be easier to perform the parsing on the fly using web handles. To download files from the NCBI, we will use the Entrez.efetch interface, which takes as arguments the database where the file should be found, the file format, and the database identifier:

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", rettype="fasta", id="XM_ ")
>>> record = SeqIO.read(handle, "fasta")
>>> record
SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTATAG', SingleLetterAlphabet()), id='gi ref XM_ ', name='gi ref XM_ ', description='gi ref XM_ PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC ), mRNA', dbxrefs=[])
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="XM_ ")
>>> record = SeqIO.read(handle, "genbank")
>>> print record
ID: XM_ 
Name: XM_ 
Description: PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC ), mRNA.
Number of features: 4
/comment=MODEL REFSEQ: This record is predicted by automated computational analysis. This record is derived from a genomic sequence (NW_ ) annotated using gene prediction method: GNOMON, supported by EST evidence. Also see: Documentation of NCBI's Annotation Process
/sequence_version=1
/source=Danio rerio (zebrafish)
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
/keywords=['']
/accessions=['XM_ ']
/data_file_division=VRT
/date=23-MAR-2011
/organism=Danio rerio
/gi= 
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTATAG', IUPACAmbiguousDNA())

It is possible to download multiple files by writing a string containing all their identifiers separated by commas:
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="XM_ ,BC ,BC ")
>>> records = SeqIO.parse(handle, "genbank")
>>> for seq_record in records:
...     print seq_record.id, seq_record.description[:50]
...     print "Sequence length %i," % len(seq_record),
...     print "%i features," % len(seq_record.features),
...     print "from: %s" % seq_record.annotations["source"]
...
XM_ PREDICTED: Danio rerio nuclear factor of activated
Sequence length 2808, 4 features, from: Danio rerio (zebrafish)
BC Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347
Sequence length 1188, 3 features, from: Danio rerio (zebrafish)
BC Danio rerio zgc:113352, mRNA (cDNA clone MGC:11335
Sequence length 1660, 3 features, from: Danio rerio (zebrafish)

10.4 Writing sequence files

The SeqIO.write() method writes SeqRecord objects to a file in the format specified by the user, chosen from a list of popular sequence file formats. The method requires three arguments:

- one or more SeqRecord objects;
- a handle or a file name to write to;
- a sequence format.

The Biopython session for this section manually creates three SeqRecord objects for three (very short) proteins. The three objects are put into a list, which is used as the first argument of SeqIO.write() to specify which objects to write. Next, we create a handle, which is a file opened for writing, and pass it to the method as the second argument. Finally, we specify that we want the output file to be written in FASTA format. The Bio.SeqIO.write() function returns the number of SeqRecord objects written to the file.
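Before the Biopython session, the contract just described (records in, a handle to write to, a count returned) can be sketched in plain Python 3. This is not how Biopython implements SeqIO.write(): write_fasta is a hypothetical helper invented for this illustration, it accepts (identifier, sequence) tuples in place of SeqRecord objects, and the 60-character line width is an assumption chosen for the sketch.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs as FASTA and return how many were written."""
    count = 0
    for identifier, sequence in records:
        handle.write(">%s\n" % identifier)
        # Wrap long sequences over several lines of at most `width` characters.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        count += 1
    return count


# The same three toy proteins used in the Biopython session of this section.
my_records = [("1", "ACCA"), ("2", "CDFAA"), ("3", "GRKLM")]

buffer = io.StringIO()  # an in-memory stand-in for a file opened for writing
written = write_fasta(my_records, buffer)
print(written)  # 3
print(buffer.getvalue(), end="")
```

Because the function only iterates over its first argument, it accepts a list or a generator alike, which mirrors the flexibility of the first argument of SeqIO.write().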
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import generic_protein
>>> Rec1 = SeqRecord(Seq("ACCA", generic_protein), id="1", description="")
>>> Rec2 = SeqRecord(Seq("CDFAA", generic_protein), id="2", description="")
>>> Rec3 = SeqRecord(Seq("GRKLM", generic_protein), id="3", description="")
>>> My_records = [Rec1, Rec2, Rec3]
>>> from Bio import SeqIO
>>> handle_w = open("MySeqs.fa", "w")
>>> SeqIO.write(My_records, handle_w, "fasta")
3
>>> handle_w.close()

The input SeqRecord objects can be given as a list, as in the example above, as an iterator, or as an individual SeqRecord:
>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.parse(handle, "genbank")
>>> handle_w = open("all_records_in_fasta.fa", "w")
>>> SeqIO.write(records, handle_w, "fasta")
60
>>> handle.close()
>>> handle_w.close()

>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.parse(handle, "genbank")
>>> first_record = records.next()
>>> handle_w = open("only_the_first_record.fa", "w")
>>> SeqIO.write(first_record, handle_w, "fasta")
1
>>> handle.close()
>>> handle_w.close()

10.5 Parsing Multiple Alignments

Biopython provides a data structure to store multiple alignments (the MultipleSeqAlignment class) and the Bio.AlignIO module for reading and writing them in various file formats. Let's open the seed multiple sequence alignment of the calcineurin-like phosphoesterases from the Pfam family Metallophos (PF00149), containing 330 protein sequences. The file is in the Stockholm format, which is one of the most popular formats for multiple alignment handling.

The Bio.AlignIO module provides two methods to parse multiple alignments: .parse() for files containing many alignments and .read() for files containing exactly one, following the usual Biopython convention. Both methods require the same arguments:

- a handle to the multiple alignment, either an open file or a file name;
- the format of the multiple alignment (a full list of available formats can be found in the Biopython documentation);
- the alphabet used by the alignment (optional).
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF00149.sth", "stockholm")
>>> dir(alignment)
['__add__', '__doc__', '__format__', '__getitem__', '__init__', '__iter__', '__len__', '__module__', '__repr__', '__str__', '_alphabet', '_annotations', '_append', '_records', '_str_line', 'add_sequence', 'append', 'extend', 'format', 'get_alignment_length', 'get_all_seqs', 'get_column', 'get_seq_by_num', 'sort']
>>> print alignment
SingleLetterAlphabet() alignment with 330 rows and 477 columns
FKIVQFSDAHLSDYFTLE HGG YKUE_BACSU/
LRVLHISDLHMLPNQHR HGG O69651_MYCTU/
LRVLQVSDIHMVGGQRK HGG Q9X935_STRCO/
LNILHLSDLHLENISVS HGG YKOQ_BACSU/
LPYGVISDPHYHRWDAFATTNA DGLN-SRLE--HNH Q9R2P6_YERPE/3-205
LRFVQLSDIHLGTVRSAG HGG O27247_METTH/
LRIVQISDLHLNHSTPDA HGP Y461_CHLTR/
LRIAQISDLHFHKRVPEK HGP Y578_CHLPN/
>>>

AlignIO.parse() returns an iterator over the alignments in a file. Each alignment, including the one returned by AlignIO.read(), can itself be iterated to obtain a SeqRecord object for each sequence it contains:
>>> for record in alignment:
...     print record.id, record.annotations
...
YKUE_BACSU/ {'start': 58, 'end': 225, 'accession': 'O '}
O69651_MYCTU/ {'start': 51, 'end': 235, 'accession': 'O '}
Q9X935_STRCO/ {'start': 47, 'end': 241, 'accession': 'Q9X935.1'}
YKOQ_BACSU/ {'start': 46, 'end': 211, 'accession': 'O '}
Q9R2P6_YERPE/3-205 {'start': 3, 'end': 205, 'accession': 'Q9R2P6.1'}
O27247_METTH/ {'start': 130, 'end': 285, 'accession': 'O '}
Y461_CHLTR/ {'start': 52, 'end': 261, 'accession': 'O '}
Y578_CHLPN/ {'start': 45, 'end': 254, 'accession': 'Q9Z7X6.1'}
O03968_9CAUD/ {'start': 269, 'end': 543, 'accession': 'O '}
ASM3A_MOUSE/ {'start': 35, 'end': 294, 'accession': 'P '}
ASM3B_HUMAN/ {'start': 21, 'end': 281, 'accession': 'Q '}

As with other modules, AlignIO provides functions to write alignments to file in several formats, to convert between formats, and so on.

You can also perform slicing operations, which can be thought of as accessing the alignment as a matrix. The standard slicing operator [i:j] returns the alignment rows from row i to row j-1. To select an alignment column, use the operator [:,k], which selects the column at index k (counting from 0):
>>> print "Number of rows: %i" % len(alignment)
Number of rows: 330
>>> print alignment[3:7]
SingleLetterAlphabet() alignment with 4 rows and 477 columns
LNILHLSDLHLENISVS HGG YKOQ_BACSU/
LPYGVISDPHYHRWDAFATTNA DGLN-SRLE--HNH Q9R2P6_YERPE/3-205
LRFVQLSDIHLGTVRSAG HGG O27247_METTH/
LRIVQISDLHLNHSTPDA HGP Y461_CHLTR/
>>> print alignment[:,6]
SSSSSSSSSTATTSTSAAATSSSTSASSTAPATTTTTTTSASAAAAASSGSSSASAAASGGGGGG
GNNGGGGSGGGGGGGGSGCGGGGGGSNNNNNNNNNNNNNNNNNNNSSTTTTTTNNGGGGGGTTTG
GGGGSSSSASSTSSSSASSSSGGGGGSASSGSASAASAAAAATSTTSSSSSSASSSSSSSAAAGG
GGGGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGSGGGGGGGGPGGGGSSASSGSTSGASSSSSTTSSSSSSSSSSSSSAAAAA
GGGST
>>> print alignment[2,6]
S
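The matrix view of an alignment can be made concrete with a small plain-Python sketch (Python 3, standard library only). It is not the MultipleSeqAlignment implementation: the alignment is modelled as a list of equal-length strings, select_rows and select_column are hypothetical helper names, and the four rows are toy data made from the first ten characters of the first four sequences printed above.

```python
def select_rows(alignment, i, j):
    """Return rows i..j-1, the plain-list analogue of alignment[i:j]."""
    return alignment[i:j]


def select_column(alignment, k):
    """Return the column at index k as a string, the analogue of alignment[:, k]."""
    return "".join(row[k] for row in alignment)


# Toy alignment: each string is one row, each character position one column.
rows = [
    "FKIVQFSDAH",
    "LRVLHISDLH",
    "LRVLQVSDIH",
    "LNILHLSDLH",
]

print(len(rows))               # 4
print(select_rows(rows, 1, 3))  # ['LRVLHISDLH', 'LRVLQVSDIH']
print(select_column(rows, 6))   # SSSS
print(rows[2][6])              # S
```

Row and column indices both start at 0, so index 6 refers to the seventh column; this is why the column printed in the session above begins with the S of FKIVQFS.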
Biopython Tutorial and Cookbook
Biopython Tutorial and Cookbook Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck, Michiel de Hoon, Peter Cock Last Update September 2008 Contents 1 Introduction 5 1.1 What is Biopython?.........................................
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
Biopython Tutorial and Cookbook
Biopython Tutorial and Cookbook Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck, Michiel de Hoon, Peter Cock, Tiago Antao, Eric Talevich, Bartek Wilczyński Last Update 21 October 2015 (Biopython
org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
org.rn.eg.db December 16, 2015 org.rn.egaccnum Map Entrez Gene identifiers to GenBank Accession Numbers org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank
Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr
Introduction to Databases Shifra Ben-Dor Irit Orr Lecture Outline Introduction Data and Database types Database components Data Formats Sample databases How to text search databases What units of information
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
Module 1. Sequence Formats and Retrieval. Charles Steward
The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.
Module 3. Genome Browsing. Using Web Browsers to View Genome Annota4on. Kers4n Howe Wellcome Trust Sanger Ins4tute zfish- [email protected].
Module 3 Genome Browsing Using Web Browsers to View Genome Annota4on Kers4n Howe Wellcome Trust Sanger Ins4tute zfish- [email protected] Introduc.on Genome browsing The Ensembl gene set Guided examples
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
A Tutorial in Genetic Sequence Classification Tools and Techniques
A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide
How To Use The Assembly Database In A Microarray (Perl) With A Microarcode) (Perperl 2) (For Macrogenome) (Genome 2)
The Ensembl Core databases and API Useful links Installation instructions: http://www.ensembl.org/info/docs/api/api_installation.html Schema description: http://www.ensembl.org/info/docs/api/core/core_schema.html
Exercise 4 Learning Python language fundamentals
Exercise 4 Learning Python language fundamentals Work with numbers Python can be used as a powerful calculator. Practicing math calculations in Python will help you not only perform these tasks, but also
17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg ([email protected])
WEB-SERVER MANUAL Contact: Michael Hackenberg ([email protected]) 1 1 Introduction srnabench is a free web-server tool and standalone application for processing small- RNA data obtained from next generation
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
Biological Sequence Data Formats
Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA
Package GEOquery. August 18, 2015
Type Package Package GEOquery August 18, 2015 Title Get data from NCBI Gene Expression Omnibus (GEO) Version 2.34.0 Date 2014-09-28 Author Maintainer BugReports
Exercise 1: Python Language Basics
Exercise 1: Python Language Basics In this exercise we will cover the basic principles of the Python language. All languages have a standard set of functionality including the ability to comment code,
Useful Scripting for Biologists
Useful Scripting for Biologists Brad Chapman 23 Jan 2003 Objectives Explain what scripting languages are Describe some of the things you can do with a scripting language Show some tools to use once you
Name Spaces. Introduction into Python Python 5: Classes, Exceptions, Generators and more. Classes: Example. Classes: Briefest Introduction
Name Spaces Introduction into Python Python 5: Classes, Exceptions, Generators and more Daniel Polani Concept: There are three different types of name spaces: 1. built-in names (such as abs()) 2. global
Sequence Database Administration
Sequence Database Administration 1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases
Tutorial. Reference Genome Tracks. Sample to Insight. November 27, 2015
Reference Genome Tracks November 27, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com [email protected] Reference
Getting started in Bio::Perl 1) Simple script to get a sequence by Id and write to specified format
BIOPERL TUTORIAL (ABREV.) Getting started in Bio::Perl 1) Simple script to get a sequence by Id and write to specified format use Bio::Perl; # this script will only work if you have an internet connection
Database manager does something that sounds trivial. It makes it easy to setup a new database for searching with Mascot. It also makes it easy to
1 Database manager does something that sounds trivial. It makes it easy to setup a new database for searching with Mascot. It also makes it easy to automate regular updates of these databases. 2 However,
Python Loops and String Manipulation
WEEK TWO Python Loops and String Manipulation Last week, we showed you some basic Python programming and gave you some intriguing problems to solve. But it is hard to do anything really exciting until
Technical Report. The KNIME Text Processing Feature:
Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold [email protected] [email protected] Copyright 2012 by KNIME.com AG
netflow-indexer Documentation
netflow-indexer Documentation Release 0.1.28 Justin Azoff May 02, 2012 CONTENTS 1 Installation 2 1.1 Install prerequisites............................................ 2 1.2 Install netflow-indexer..........................................
Chapter 3 Writing Simple Programs. What Is Programming? Internet. Witin the web server we set lots and lots of requests which we need to respond to
Chapter 3 Writing Simple Programs Charles Severance Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License. http://creativecommons.org/licenses/by/3.0/.
Python Lists and Loops
WEEK THREE Python Lists and Loops You ve made it to Week 3, well done! Most programs need to keep track of a list (or collection) of things (e.g. names) at one time or another, and this week we ll show
DNA Sequence formats
DNA Sequence formats [Plain] [EMBL] [FASTA] [GCG] [GenBank] [IG] [IUPAC] [How Genomatix represents sequence annotation] Plain sequence format A sequence in plain format may contain only IUPAC characters
Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011
Sequence Formats and Sequence Database Searches Gloria Rendon SC11 Education June, 2011 Sequence A is the primary structure of a biological molecule. It is a chain of residues that form a precise linear
NaviCell Data Visualization Python API
NaviCell Data Visualization Python API Tutorial - Version 1.0 The NaviCell Data Visualization Python API is a Python module that let computational biologists write programs to interact with the molecular
A skip list container class in Python
A skip list container class in Python Abstract An alternative to balanced trees John W. Shipman 2012-11-29 13:23 Describes a module in the Python programming language that implements a skip list, a data
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
BioJava In Anger. A Tutorial and Recipe Book for Those in a Hurry
BioJava In Anger BioJava In Anger A Tutorial and Recipe Book for Those in a Hurry Introduction: BioJava can be both big and intimidating. For those of us who are in a hurry there really is a whole lot
Genome Explorer For Comparative Genome Analysis
Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence
Job Cost Report JOB COST REPORT
JOB COST REPORT Job costing is included for those companies that need to apply a portion of payroll to different jobs. The report groups individual pay line items by job and generates subtotals for each
Data formats and file conversions
Building Excellence in Genomics and Computational Bioscience s Richard Leggett (TGAC) John Walshaw (IFR) Common file formats FASTQ FASTA BAM SAM Raw sequence Alignments MSF EMBL UniProt BED WIG Databases
Python course in Bioinformatics. by Katja Schuerer and Catherine Letondal
Python course in Bioinformatics by Katja Schuerer and Catherine Letondal Python course in Bioinformatics by Katja Schuerer and Catherine Letondal Copyright 2004 Pasteur Institute [http://www.pasteur.fr/]
FWG Management System Manual
FWG Management System Manual Last Updated: December 2014 Written by: Donna Clark, EAIT/ITIG Table of Contents Introduction... 3 MSM Menu & Displays... 3 By Title Display... 3 Recent Updates Display...
Writing Control Structures
Writing Control Structures Copyright 2006, Oracle. All rights reserved. Oracle Database 10g: PL/SQL Fundamentals 5-1 Objectives After completing this lesson, you should be able to do the following: Identify
Assignment 2: More MapReduce with Hadoop
Assignment 2: More MapReduce with Hadoop Jean-Pierre Lozi February 5, 2015 Provided files following URL: An archive that contains all files you will need for this assignment can be found at the http://sfu.ca/~jlozi/cmpt732/assignment2.tar.gz
InstallShield Tip: Accessing the MSI Database at Run Time
InstallShield Tip: Accessing the MSI Database at Run Time Robert Dickau Senior Techincal Trainer Flexera Software Abstract In some cases, it can be useful for a running installation to access the tables
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
Acronis Backup & Recovery: Events in Application Event Log of Windows http://kb.acronis.com/content/38327
Acronis Backup & Recovery: Events in Application Event Log of Windows http://kb.acronis.com/content/38327 Mod ule_i D Error _Cod e Error Description 1 1 PROCESSOR_NULLREF_ERROR 1 100 ERROR_PARSE_PAIR Failed
Recovering Business Rules from Legacy Source Code for System Modernization
Recovering Business Rules from Legacy Source Code for System Modernization Erik Putrycz, Ph.D. Anatol W. Kark Software Engineering Group National Research Council, Canada Introduction Legacy software 000009*
Forensic Analysis of Internet Explorer Activity Files