Bioinformatics using Python for Biologists




10.1 The SeqIO module

The most popular databases store their information in a variety of file formats designed to be easily interpreted by a computer program. Here, interpreting means extracting information (i.e. parsing) and converting it into formats appropriate for further processing and analysis. Parsing such files is one of the most common tasks in bioinformatics, and it must be done accurately. The task is complicated by the fact that formats change quite regularly and may contain small subtleties that can break even the most well-designed parsers.

Biopython's SeqIO module provides parsers for many common file formats; they generally extract the information from the input file and convert it into SeqRecord objects. There are two methods for parsing sequence files, SeqIO.parse() and SeqIO.read(). Both take two mandatory arguments and one optional argument:

- a handle that specifies where the data must be read from (a file name, a file opened for reading, data downloaded from a database using a script, or the output of another piece of code);
- a flag indicating the format of the data (a full list of supported formats is available at http://biopython.org/wiki/seqio);
- an optional argument that specifies the alphabet of the sequence data.

The difference between the two is that SeqIO.parse() returns an iterator that goes through all the records in the input handle, to be used in for or while loops, and yields one SeqRecord object at a time. SeqIO.read(), on the other hand, must be used on input containing exactly one record, which it returns as a single SeqRecord object.

10.2 Reading local files

Let's read the file D.rerio_calcineurin.fasta, which contains the FASTA-format records of all entries matching the keyword calcineurin in the zebrafish (Danio rerio) genome, obtained from the NCBI (http://www.ncbi.nlm.nih.gov/nuccore).
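
Before turning to that file, the difference between the two calling conventions described in section 10.1 can be illustrated without Biopython. The following is a minimal sketch in plain Python 3 (the sessions in this module use Python 2), not Biopython's implementation: parse_fasta() and read_fasta() are hypothetical helpers, records are reduced to (identifier, sequence) tuples instead of SeqRecord objects, and the two toy records are invented for the example.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) tuples one at a time from a FASTA handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            fields = line[1:].split()
            identifier, chunks = (fields[0] if fields else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def read_fasta(handle):
    """Return the only record in the handle; refuse input with 0 or 2+ records."""
    records = parse_fasta(handle)
    first = next(records, None)
    if first is None:
        raise ValueError("No records found in handle")
    if next(records, None) is not None:
        raise ValueError("More than one record found in handle")
    return first


# Invented toy data, held in memory so that the sketch is self-contained.
toy_fasta = ">seq1 toy record\nACGT\nTTGA\n>seq2 another toy record\nGGCC\n"

# The "parse" style: loop over however many records the handle contains.
for identifier, sequence in parse_fasta(io.StringIO(toy_fasta)):
    print(identifier, len(sequence))

# The "read" style: exactly one record is expected, anything else is an error.
try:
    read_fasta(io.StringIO(toy_fasta))
except ValueError as error:
    print("read_fasta refused the input:", error)
```

Because parse_fasta() is a generator, it holds only one record in memory at a time, which is the property that makes the iterator style suitable for large files.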
The SeqIO.parse() method generates an iterator over SeqRecord objects; features can then be extracted from each SeqRecord object as described in Module 9:

>>> import Bio
>>> from Bio import SeqIO
>>> handle = open("D.rerio_calcineurin.fasta", "r")
>>> type(handle)
<type 'file'>
>>> for seq_record in SeqIO.parse(handle, "fasta"):
...     print seq_record.id
...     print repr(seq_record.seq)
...     print len(seq_record)
...
gi|326679292|ref|XM_003201225.1|
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', SingleLetterAlphabet())
2808
gi|326677866|ref|XM_003200885.1|
Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGC...TAA', SingleLetterAlphabet())
1035
>>> handle.close()

Since the handle is a file, it is good practice to close it once processing is done. Remember that the iterator consumes the file: to scan the records a second time, the file must be closed, then opened again, and the new handle passed to SeqIO.parse().

In a similar way, we can parse an equivalent file in GenBank format. This time we also skip the explicit creation of the handle and pass the file name (or its complete path) directly to SeqIO.parse():

>>> for seq_record in SeqIO.parse("D.rerio_calcineurin.gb", "genbank"):
...     print seq_record.id
...     print repr(seq_record.seq)
...     print len(seq_record)
...
XM_003201225.1
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
2808
XM_003200885.1
Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGC...TAA', IUPACAmbiguousDNA())
1035

Two things should be noted. First, the GenBank parser is able to assign the correct alphabet to the sequence records in the input file, while the FASTA parser assigns a generic SingleLetterAlphabet(). Second, the SeqRecord objects built from the GenBank file have a more compact id attribute.

As mentioned before, SeqIO.parse() can process any number of records in the input handle. SeqIO.read() instead checks that there is exactly one record in the handle, raising an exception if this condition is not met:

>>> handle = open("D.rerio_calcineurin.gb", "r")
>>> SeqIO.read(handle, "genbank")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/SeqIO/__init__.py", line 614, in read
ValueError: More than one record found in handle

Using an iterator is a way to parse large files without consuming large amounts of memory. On the other hand, as mentioned above, each record can be accessed only once in the for loop. The iterator also provides a method to access the records step by step:

>>> handle = open("D.rerio_calcineurin.gb")
>>> iterator = SeqIO.parse(handle, "genbank")
>>> first_record = iterator.next()
>>> type(first_record)
<class 'Bio.SeqRecord.SeqRecord'>
>>> first_record.id
'XM_003201225.1'
>>> first_record.seq
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
>>> first_record.description
'PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.'
>>> second_record = iterator.next()
>>> second_record.id
'XM_003200885.1'

Once all the records in the file have been read, the .next() method either returns the special Python object None or raises a StopIteration exception (depending on which Biopython release is installed on your system). With this approach you could in principle assign each record to a different variable, if you need to keep the records at hand. This is impractical if the number of records is large or unknown beforehand. It is, however, possible to store all the SeqRecord objects returned by SeqIO in a data structure such as a list:

>>> records = list(SeqIO.parse("D.rerio_calcineurin.gb", "genbank"))
>>> len(records)
61
>>> records[0] # the first record
SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA()), id='XM_003201225.1', name='XM_003201225', description='PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.', dbxrefs=[])
>>> records[0].id
'XM_003201225.1'
>>> records[0].seq
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
>>> for key, value in records[0].annotations.items():
...     print key, value
...
comment MODEL REFSEQ: This record is predicted by automated computational analysis. This record is derived from a genomic sequence (NW_003336048) annotated using gene prediction method: GNOMON, supported by EST evidence. Also see: Documentation of NCBI's Annotation Process
sequence_version 1
source Danio rerio (zebrafish)
taxonomy ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
keywords ['']
accessions ['XM_003201225']
data_file_division VRT
date 23-MAR-2011
organism Danio rerio
gi 326679292
>>> records[-1] # the last record
SeqRecord(seq=Seq('GCAGCAATTTGAGGAAGAAGCGCAAACAGACAGGTCAGGTGTGGCGATGGCAGC...AAA', IUPACAmbiguousDNA()), id='BC139891.1', name='BC139891', description='Danio rerio zgc:162913, mRNA (cDNA clone MGC:162913 IMAGE:7401269), complete cds.', dbxrefs=[])

SeqIO also provides a method, SeqIO.to_dict(), that turns the SeqRecord objects produced by the iterator into the values of a dictionary whose keys are the SeqRecord.id attributes:

>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
>>> for key, value in records.items():
...     print key, value.id, value.description
...
BC093219.1 BC093219.1 Danio rerio zgc:112142, mRNA (cDNA clone MGC:112142 IMAGE:7428541), complete cds.
XM_685181.5 XM_685181.5 PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 (nfatc3), mRNA.
BC076024.1 BC076024.1 Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347 IMAGE:7055812), complete cds.

Note that an exception is raised if duplicate keys are found.

For a very large number of records there is another method, Bio.SeqIO.index(), which creates a dictionary-like object without keeping all the data in memory. Instead, the object records the position of each record in the file, and the content of a record is parsed on the fly when that record is accessed. This allows a huge number of records to be handled, at a small cost in flexibility and speed. Moreover, these dictionary-like objects are read-only: once created, records cannot be inserted or removed. Note that in this case the first argument cannot be an open file handle; it must be a file name.
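
The idea of storing positions instead of records can be sketched in a few lines of plain Python 3. This is only an illustration of the principle under simplifying assumptions (FASTA input only, byte offsets kept in an ordinary dictionary), not how Bio.SeqIO.index() is implemented; build_offset_index() and fetch_sequence() are hypothetical helpers and the toy records are invented.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each record identifier to the byte offset of its '>' header line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                key = fields[0].decode("ascii") if fields else ""
                if key in offsets:
                    raise ValueError("Duplicate key %r" % key)
                offsets[key] = position
    return offsets


def fetch_sequence(path, offsets, key):
    """Parse a single record on demand by jumping to its stored offset."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(offsets[key])
        handle.readline()  # skip the header line of the requested record
        for line in handle:
            if line.startswith(b">"):
                break  # the next record starts here
            chunks.append(line.strip().decode("ascii"))
    return "".join(chunks)


# Invented toy data, written to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n>seq3\nTTTT\nAA\n")
    toy_path = tmp.name

toy_index = build_offset_index(toy_path)        # only identifiers and offsets are kept
toy_keys = sorted(toy_index)
toy_seq2 = fetch_sequence(toy_path, toy_index, "seq2")  # parsed only when requested
toy_seq3 = fetch_sequence(toy_path, toy_index, "seq3")
os.remove(toy_path)

print(toy_keys)
print(toy_seq2)
```

The index costs one pass over the file and a small amount of memory per record; every later lookup reopens the file and parses a single record, which is where the small loss of speed mentioned above comes from. The Biopython session that follows uses the real Bio.SeqIO.index() on the GenBank file.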
>>> records = SeqIO.index("D.rerio_calcineurin.gb", "genbank")
>>> records.keys()
['BC093219.1', 'XM_685181.5', 'BC076024.1', 'BC091833.1', 'BC152175.1',
'NM_200899.1', 'NM_001005392.1', 'BC062840.1', 'NM_199836.1', 'BC154648.1',
'NM_001099250.1', 'BC076019.1', 'NM_001002452.1', 'BC064307.1', 'BC153488.1',
'BC122248.1', 'NM_001007413.1', 'NM_001044758.1', 'BC065451.1', 'BC093272.1',
'XM_001922343.4', 'BC065972.1', 'BC090735.1', 'NM_001017701.1', 'XM_002664259.1',
'XM_001923726.2', 'XM_678815.5', 'XM_694965.5', 'BC163337.1', 'XM_001923264.3',
'BC139891.1', 'NM_205678.1', 'NM_200854.1', 'XM_687678.4', 'BC076439.1',
'XM_001339606.4', 'BC058868.1', 'NM_214773.1', 'NM_199653.1', 'NM_001017735.1',
'NM_200042.1', 'BC071331.1', 'BC129492.1', 'BC055256.1', 'GU733827.1',
'XM_003200885.1', 'NM_200037.1', 'NM_199895.1', 'BC076514.1', 'AY639016.1',
'BC049341.1', 'BC150441.1', 'NM_001002447.1', 'BC163350.1', 'NM_001014338.1',
'NM_001045159.1', 'BC155186.1', 'BC045981.1', 'XM_003201225.1', 'BC142750.1',
'BC053153.1']
>>> print records["BC093219.1"].description
Danio rerio zgc:112142, mRNA (cDNA clone MGC:112142 IMAGE:7428541), complete cds.

10.3 Reading files from the web

As we stated before, a handle can also be used to fetch data from web databases. Since parsing with an iterator consumes the handle, it is good practice to store the downloaded file locally. Nevertheless, it is sometimes easier to parse on the fly using web handles. To download files from the NCBI we will use the Entrez.efetch interface, which takes as arguments the database where the record should be found, the file format, and the database identifier:

>>> from Bio import Entrez
>>> Entrez.email = "your.name@example.org" # NCBI asks for a contact address; use your own
>>> handle = Entrez.efetch(db="nucleotide", rettype="fasta", id="XM_003201225.1")
>>> record = SeqIO.read(handle, "fasta")
>>> record
SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', SingleLetterAlphabet()), id='gi|326679292|ref|XM_003201225.1|', name='gi|326679292|ref|XM_003201225.1|', description='gi|326679292|ref|XM_003201225.1| PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA', dbxrefs=[])
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="XM_003201225.1")
>>> record = SeqIO.read(handle, "genbank")
>>> print record
ID: XM_003201225.1
Name: XM_003201225
Description: PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC100333254), mRNA.
Number of features: 4
/comment=MODEL REFSEQ: This record is predicted by automated computational analysis. This record is derived from a genomic sequence (NW_003336048) annotated using gene prediction method: GNOMON, supported by EST evidence. Also see: Documentation of NCBI's Annotation Process
/sequence_version=1
/source=Danio rerio (zebrafish)
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
/keywords=['']
/accessions=['XM_003201225']
/data_file_division=VRT
/date=23-MAR-2011
/organism=Danio rerio
/gi=326679292
Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())

It is possible to download multiple records at once by passing a string containing all their identifiers separated by commas:

>>> handle = Entrez.efetch(db="nucleotide", rettype="gb",
...                        id="XM_003201225.1,BC076024.1,BC091833.1")
>>> record = SeqIO.parse(handle, "genbank")
>>> for seq_record in record:
...     print seq_record.id, seq_record.description[:50]
...     print "Sequence length %i," % len(seq_record),
...     print "%i features," % len(seq_record.features),
...     print "from: %s" % seq_record.annotations["source"]
...
XM_003201225.1 PREDICTED: Danio rerio nuclear factor of activated
Sequence length 2808, 4 features, from: Danio rerio (zebrafish)
BC076024.1 Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347
Sequence length 1188, 3 features, from: Danio rerio (zebrafish)
BC091833.1 Danio rerio zgc:113352, mRNA (cDNA clone MGC:11335
Sequence length 1660, 3 features, from: Danio rerio (zebrafish)

10.4 Writing sequence files

The SeqIO.write() method writes SeqRecord objects to a file in a format chosen by the user from a list of popular sequence file formats. The method requires three arguments:

- one or more SeqRecord objects;
- a handle or a file name to write to;
- a sequence format.

In the following example we manually create three SeqRecord objects for three (very short) proteins. The three objects are then put into a list, which is passed as the first argument to SeqIO.write() to specify which objects to write. Next, we create a handle, a file opened for writing, and pass it as the second argument. Finally, we specify that the output file should be written in FASTA format. The Bio.SeqIO.write() function returns the number of SeqRecord objects written to the file.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import generic_protein
>>> Rec1 = SeqRecord(Seq("ACCA", generic_protein), id="1", description="")
>>> Rec2 = SeqRecord(Seq("CDFAA", generic_protein), id="2", description="")
>>> Rec3 = SeqRecord(Seq("GRKLM", generic_protein), id="3", description="")
>>> My_records = [Rec1, Rec2, Rec3]
>>> from Bio import SeqIO
>>> handle_w = open("MySeqs.fa", "w")
>>> SeqIO.write(My_records, handle_w, "fasta")
3
>>> handle_w.close()

The input SeqRecord objects can be given as a list, as in the example above, as an iterator, or as an individual SeqRecord:

>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.parse(handle, "genbank")
>>> handle_w = open("all_records_in_fasta.fa", "w")
>>> SeqIO.write(records, handle_w, "fasta")
60
>>> handle.close()
>>> handle_w.close()
>>> handle = open("D.rerio_calcineurin.gb")
>>> records = SeqIO.parse(handle, "genbank")
>>> first_record = records.next()
>>> handle_w = open("only_the_first_record.fa", "w")
>>> SeqIO.write(first_record, handle_w, "fasta")
1
>>> handle.close()
>>> handle_w.close()

10.5 Parsing Multiple Alignments

Biopython provides a data structure to store multiple alignments (the MultipleSeqAlignment class) and the Bio.AlignIO module for reading and writing them in various file formats. Let's open the seed multiple sequence alignment of the calcineurin-like phosphoesterases from the Pfam family Metallophos (PF00149), which contains 330 protein sequences. The file is in Stockholm format, one of the most popular formats for handling multiple alignments.

The Bio.AlignIO module provides two methods to parse multiple alignments, .parse() and .read(), for files containing many alignments or exactly one, following the usual Biopython convention. Both methods require the same arguments:

- a handle to the multiple alignment, either an open file or a file name;
- the format of the multiple alignment (a full list of available formats can be found at http://biopython.org/wiki/alignio);
- the alphabet used by the alignment (optional).
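
The Stockholm layout is not described in this module, so a rough orientation may help. The sketch below is plain Python 3 and is not Biopython's parser: it assumes the usual layout of a Stockholm file (a "# STOCKHOLM 1.0" header, markup lines starting with "#", one "name sequence" line per row, and "//" closing each alignment), ignores all markup, and uses hypothetical helpers parse_stockholm() and read_stockholm() on an invented three-row alignment. It also mirrors the many-versus-one convention of .parse() and .read().

```python
import io


def parse_stockholm(handle):
    """Yield one {name: aligned_sequence} dict per alignment in the handle."""
    rows = {}
    for line in handle:
        line = line.strip()
        if line == "//":
            # "//" closes the current alignment.
            yield rows
            rows = {}
        elif not line or line.startswith("#"):
            # Blank line, "# STOCKHOLM 1.0" header or "#=..." markup: ignored here.
            continue
        else:
            # Raises ValueError if the line is not exactly "name sequence".
            name, fragment = line.split()
            # Interleaved files repeat a name; fragments are concatenated.
            rows[name] = rows.get(name, "") + fragment


def read_stockholm(handle):
    """Return the only alignment in the handle, or raise ValueError."""
    alignments = list(parse_stockholm(handle))
    if len(alignments) != 1:
        raise ValueError("Expected exactly one alignment, found %i" % len(alignments))
    return alignments[0]


# Invented toy alignment, not taken from Pfam.
toy_stockholm = """# STOCKHOLM 1.0
#=GF ID toy_family
seqA  ACD-EF
seqB  AC--EF
seqC  ACDWEF
//
"""

toy_alignment = read_stockholm(io.StringIO(toy_stockholm))
print(len(toy_alignment), "rows,", len(toy_alignment["seqA"]), "columns")
```

A real parser also keeps the markup lines, which carry per-family and per-sequence annotation. The session that follows uses the real Bio.AlignIO reader on the Pfam file.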

>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF00149.sth", "stockholm")
>>> dir(alignment)
['__add__', '__doc__', '__format__', '__getitem__', '__init__', '__iter__', '__len__', '__module__', '__repr__', '__str__', '_alphabet', '_annotations', '_append', '_records', '_str_line', 'add_sequence', 'append', 'extend', 'format', 'get_alignment_length', 'get_all_seqs', 'get_column', 'get_seq_by_num', 'sort']
>>> print alignment
SingleLetterAlphabet() alignment with 330 rows and 477 columns
FKIVQFSDAHLSDYFTLE--------------------------...HGG YKUE_BACSU/58-225
LRVLHISDLHMLPNQHR---------------------------...HGG O69651_MYCTU/51-235
LRVLQVSDIHMVGGQRK---------------------------...HGG Q9X935_STRCO/47-241
LNILHLSDLHLENISVS---------------------------...HGG YKOQ_BACSU/46-211
LPYGVISDPHYHRWDAFATTNA-----------DGLN-SRLE--...HNH Q9R2P6_YERPE/3-205
LRFVQLSDIHLGTVRSAG--------------------------...HGG O27247_METTH/130-285
LRIVQISDLHLNHSTPDA--------------------------...HGP Y461_CHLTR/52-261
LRIAQISDLHFHKRVPEK--------------------------...HGP Y578_CHLPN/45-254
>>>

Iterating over the alignment yields a SeqRecord object for each sequence it contains:

>>> for record in alignment:
...     print record.id, record.annotations
...
YKUE_BACSU/58-225 {'start': 58, 'end': 225, 'accession': 'O34870.2'}
O69651_MYCTU/51-235 {'start': 51, 'end': 235, 'accession': 'O69651.1'}
Q9X935_STRCO/47-241 {'start': 47, 'end': 241, 'accession': 'Q9X935.1'}
YKOQ_BACSU/46-211 {'start': 46, 'end': 211, 'accession': 'O35040.1'}
Q9R2P6_YERPE/3-205 {'start': 3, 'end': 205, 'accession': 'Q9R2P6.1'}
O27247_METTH/130-285 {'start': 130, 'end': 285, 'accession': 'O27247.1'}
Y461_CHLTR/52-261 {'start': 52, 'end': 261, 'accession': 'O84467.1'}
Y578_CHLPN/45-254 {'start': 45, 'end': 254, 'accession': 'Q9Z7X6.1'}
O03968_9CAUD/269-543 {'start': 269, 'end': 543, 'accession': 'O03968.1'}
ASM3A_MOUSE/35-294 {'start': 35, 'end': 294, 'accession': 'P70158.1'}
ASM3B_HUMAN/21-281 {'start': 21, 'end': 281, 'accession': 'Q92485.2'}

As with other Biopython modules, AlignIO can also write alignments to files in several formats, convert between formats, and so on. You can also perform slicing operations, which can be thought of as accessing the alignment as a matrix. The standard slicing operator [i:j] returns the alignment rows from row i to row j-1. To select alignment columns, you can use the operator [:,k], which selects the column with index k (columns, like rows, are numbered from 0):

>>> print "Number of rows: %i" % len(alignment)
Number of rows: 330
>>> print alignment[3:7]
SingleLetterAlphabet() alignment with 4 rows and 477 columns
LNILHLSDLHLENISVS---------------------------...HGG YKOQ_BACSU/46-211
LPYGVISDPHYHRWDAFATTNA-----------DGLN-SRLE--...HNH Q9R2P6_YERPE/3-205
LRFVQLSDIHLGTVRSAG--------------------------...HGG O27247_METTH/130-285
LRIVQISDLHLNHSTPDA--------------------------...HGP Y461_CHLTR/52-261
>>> print alignment[:,6]
SSSSSSSSSTATTSTSAAATSSSTSASSTAPATTTTTTTSASAAAAASSGSSSASAAASGGGGGG
GNNGGGGSGGGGGGGGSGCGGGGGGSNNNNNNNNNNNNNNNNNNNSSTTTTTTNNGGGGGGTTTG
GGGGSSSSASSTSSSSASSSSGGGGGSASSGSASAASAAAAATSTTSSSSSSASSSSSSSAAAGG
GGGGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGSGGGGGGGGPGGGGSSASSGSTSGASSSSSTTSSSSSSSSSSSSSAAAAA
GGGST
>>> print alignment[2,6]
S
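
The matrix view of an alignment can be made concrete with a small plain-Python 3 sketch. It does not use the MultipleSeqAlignment class: the alignment is reduced to a list of (identifier, aligned sequence) pairs, the helpers row_slice(), column() and cell() are hypothetical names chosen for the illustration, and the four toy rows are invented.

```python
# Invented toy alignment: every row has the same number of columns.
toy_rows = [
    ("seqA", "ACD-EF"),
    ("seqB", "AC--EF"),
    ("seqC", "ACDWEF"),
    ("seqD", "GCD-EF"),
]


def row_slice(rows, i, j):
    """Rows i to j-1, the counterpart of alignment[i:j]."""
    return rows[i:j]


def column(rows, k):
    """Column k read from top to bottom, the counterpart of alignment[:,k]."""
    return "".join(sequence[k] for _, sequence in rows)


def cell(rows, r, k):
    """Single character at row r, column k, the counterpart of alignment[r,k]."""
    return rows[r][1][k]


print(row_slice(toy_rows, 1, 3))  # the rows for seqB and seqC
print(column(toy_rows, 3))        # one character per row
print(cell(toy_rows, 2, 3))       # row index 2, column index 3
```

Rows and columns are both counted from 0, so column(toy_rows, 3) is the fourth column of the toy alignment, just as alignment[:,6] above is the seventh column of the Pfam alignment.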