Bioinformatics using Python for Biologists

10.1 The SeqIO module

The most popular databases store information in a variety of file formats designed to be easily interpreted by a computer program. Here, interpreting means extracting information (i.e. parsing) and converting it into formats appropriate for further processing and analysis. Parsing such files is a task that the bioinformatician must perform often, and perform accurately. It can be frustrating, however, because formats change quite regularly and may contain small subtleties that break even the most well-designed parsers.

The Biopython SeqIO module provides parsers for many common file formats; they generally extract information from the input file and convert it into SeqRecord objects. There are two methods for parsing sequence files, SeqIO.parse() and SeqIO.read(). Both take two mandatory arguments and one optional argument:

- a handle that specifies where the data must be read from (a file name, a file opened for reading, data downloaded from a database using a script, or the output of another piece of code);
- a flag indicating the format of the data (a full list of supported formats is available in the Biopython documentation);
- an optional argument that specifies the alphabet of the sequence data.

The difference between the two is that SeqIO.parse() returns an iterator that goes through all the records in the input handle, to be used in for or while loops, whereas SeqIO.read() must be used on files containing a single record. The arguments are the same, and both methods produce SeqRecord objects.

Reading local files

Let's read the file D.rerio_calcineurin.fa, containing the fasta format records of all entries matching the keyword calcineurin in the zebrafish (Danio rerio) genome, obtained from the NCBI. The SeqIO.parse() method generates an iterator over SeqRecord objects; features can then be extracted from each SeqRecord object as described in Module 9:

    >>> import Bio
    >>> from Bio import SeqIO
    >>> handle = open("D.rerio_calcineurin.fa","r")
    >>> type(handle)
    <type 'file'>
    >>> for seq_record in SeqIO.parse(handle,"fasta"):
    ...     print seq_record.id
    ...     print repr(seq_record.seq)
    ...     print len(seq_record)
    ...
    gi ref XM_ 
    Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', SingleLetterAlphabet())
    2808
    gi ref XM_ 
    Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGC...TAA', SingleLetterAlphabet())
    1035
    >>> handle.close()

Since the handle is a file, it is a good habit to close it when the processing is done. Remember that the iterator empties the file: to scan the records another time, the file must be closed, then opened again, and then passed again as the handle argument to SeqIO.parse().

In a similar way, we can parse an equivalent file, this time in genbank format. Here we also omit the explicit creation of the handle and pass the file name (or its complete path) directly to SeqIO.parse():

    >>> for seq_record in SeqIO.parse("D.rerio_calcineurin.gb","genbank"):
    ...     print seq_record.id
    ...     print repr(seq_record.seq)
    ...     print len(seq_record)
    ...
    XM_ 
    Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
    2808
    XM_ 
    Seq('ATGCCTGTTCCACATACTGAAGTATCCAGGGAAAAAGAGGAACAGCAGCCTGGC...TAA', IUPACAmbiguousDNA())
    1035

A few things must be noted. First, the genbank parser is able to assign the correct alphabet to the sequence records in the input file, while the fasta parser assigns a generic SingleLetterAlphabet(). Second, the genbank SeqRecord stores a more compact id attribute for the sequence records.

As mentioned before, SeqIO.parse() can process any number of records in the input handle. SeqIO.read() instead checks whether there is only one record in the handle, raising an exception if this condition is not met:

    >>> handle = open("D.rerio_calcineurin.gb","r")
    >>> SeqIO.read(handle,"genbank")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "Bio/SeqIO/__init__.py", line 614, in read
    ValueError: More than one record found in handle

Using an iterator is a way to parse large files without consuming large amounts of memory. On the other hand, as mentioned above, each record can be accessed only once in the for loop. The iterator also provides a method to access the records step by step:

    >>> handle = open("D.rerio_calcineurin.gb")
    >>> iterator = SeqIO.parse(handle,"genbank")
    >>> first_record = iterator.next()
    >>> type(first_record)
    <class 'Bio.SeqRecord.SeqRecord'>
    >>> first_record.id
    'XM_ '
    >>> first_record.seq
    Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
    >>> first_record.description
    'PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC ), mRNA.'
    >>> second_record = iterator.next()
    >>> second_record.id
    'XM_ '

Once the file has no more records, the .next() method will either return the special Python object None or raise a StopIteration exception (depending on which Biopython release you have installed on your system).

Using this approach you could in principle assign each record to a different variable, if you need to keep these records at hand. This is impractical if the number of records is high or unknown beforehand. It is however possible to store all the SeqRecord objects returned by SeqIO in a data structure such as a list:

    >>> records = list(SeqIO.parse("D.rerio_calcineurin.gb", "genbank"))
    >>> len(records)
    61
    >>> records[0] # the first record
    SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA()), id='XM_ ', name='XM_ ', description='PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC ), mRNA.', dbxrefs=[])
    >>> records[0].id
    'XM_ '
    >>> records[0].seq
    Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())
    >>> for key,value in records[0].annotations.items():
    ...     print key,value
    ...
    comment MODEL REFSEQ: This record is predicted by automated computational analysis. This record is derived from a genomic sequence (NW_ ) annotated using gene prediction method: GNOMON, supported by EST evidence. Also see: Documentation of NCBI's Annotation Process
    sequence_version 1
    source Danio rerio (zebrafish)
    taxonomy ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
    keywords ['']
    accessions ['XM_ ']
    data_file_division VRT
    date 23-MAR-2011
    organism Danio rerio
    gi 
    >>> records[-1] # the last record
    SeqRecord(seq=Seq('GCAGCAATTTGAGGAAGAAGCGCAAACAGACAGGTCAGGTGTGGCGATGGCAGC...AAA', IUPACAmbiguousDNA()), id='BC ', name='BC139891', description='Danio rerio zgc:162913, mRNA (cDNA clone MGC: IMAGE: ), complete cds.', dbxrefs=[])

SeqIO also provides a method, SeqIO.to_dict(), that converts the SeqRecord objects produced by the iterator into the values of a dictionary whose keys are the SeqRecord.id attributes:

    >>> handle = open("D.rerio_calcineurin.gb")
    >>> records = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
    >>> for key,value in records.items():
    ...     print key,value.id,value.description
    ...
    BC BC Danio rerio zgc:112142, mRNA (cDNA clone MGC: IMAGE: ), complete cds.
    XM_ XM_ PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic, calcineurin-dependent 3 (nfatc3), mRNA.
    BC BC Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347 IMAGE: ), complete cds.

Note that if duplicate keys are found, an exception will be raised.

For a very large number of records there is another method, Bio.SeqIO.index(), which creates a dictionary-like object without keeping all the data in memory. Instead, the dictionary values correspond to the position of each record in the file, and the content of a record is parsed on the fly when that record is accessed. This method allows the handling of a huge number of records, at a small cost in flexibility and speed. Moreover, these dictionary-like objects are read-only: once they are created, data cannot be inserted or removed. Note that in this case the first argument cannot be an open file handle; it must be a file name.

    >>> records = SeqIO.index("D.rerio_calcineurin.gb","genbank")
    >>> records.keys()
    ['BC ', 'XM_ ', 'BC ', 'BC ', 'BC ', 'NM_ ', 'NM_ ', 'BC ', 'NM_ ', 'BC ', 'NM_ ', 'BC ', 'NM_ ', 'BC ', 'BC ', 'BC ', 'NM_ ', 'NM_ ', 'BC ', 'BC ', 'XM_ ', 'BC ', 'BC ', 'NM_ ', 'XM_ ', 'XM_ ', 'XM_ ', 'XM_ ', 'BC ', 'XM_ ', 'BC ', 'NM_ ', 'NM_ ', 'XM_ ', 'BC ', 'XM_ ', 'BC ', 'NM_ ', 'NM_ ', 'NM_ ', 'NM_ ', 'BC ', 'BC ', 'BC ', 'GU ', 'XM_ ', 'NM_ ', 'NM_ ', 'BC ', 'AY ', 'BC ', 'BC ', 'NM_ ', 'BC ', 'NM_ ', 'NM_ ', 'BC ', 'BC ', 'XM_ ', 'BC ', 'BC ']
    >>> print records["BC "].description
    Danio rerio zgc:112142, mRNA (cDNA clone MGC: IMAGE: ), complete cds

Reading files from the web

As we stated before, a handle can also be used to fetch data from web databases.
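
The idea of a handle as "any source of data that can be read" can be tried out without Biopython or a network connection. The sketch below is a simplified stand-in, not Biopython's parser: iter_fasta is a hypothetical helper written for this illustration, the two toy records are invented, and an in-memory io.StringIO object plays the role of the stream returned by a web query. It assumes Python 3, unlike the Python 2 sessions in this text.

```python
import io


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle.

    Hypothetical helper for illustration only; SeqIO.parse() does far more
    (SeqRecord objects, many formats, validation).
    """
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first word after the '>' sign.
            header = line[1:].split()
            identifier, chunks = (header[0] if header else ""), []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Stand-in for data coming from a web database: any file-like object works.
fake_download = io.StringIO(
    ">seq1 first toy record\nACGT\nACGT\n>seq2 second toy record\nGGCC\n"
)

first_pass = list(iter_fasta(fake_download))
second_pass = list(iter_fasta(fake_download))  # the handle is already at its end

print(first_pass)
print(second_pass)
```

The second pass finds nothing, because reading the records moved the handle to its end.
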
Since parsing the file with an iterator using a handle consumes the handle itself, it is good practice to store the downloaded file locally. Nevertheless, it can sometimes be easier to perform the parsing on the fly using web handles. To download files from the NCBI, we will use the Entrez.efetch interface, which takes as arguments the database where the file should be found, the file format, and the database identifier:

    >>> from Bio import Entrez
    >>> handle = Entrez.efetch(db="nucleotide", rettype="fasta", id="XM_ ")
    >>> record = SeqIO.read(handle,"fasta")
    >>> record
    SeqRecord(seq=Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', SingleLetterAlphabet()), id='gi ref XM_ ', name='gi ref XM_ ', description='gi ref XM_ PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC ), mRNA', dbxrefs=[])
    >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="XM_ ")
    >>> record = SeqIO.read(handle,"genbank")
    >>> print record
    ID: XM_ 
    Name: XM_ 
    Description: PREDICTED: Danio rerio nuclear factor of activated T-cells, cytoplasmic 2-like (LOC ), mRNA.
    Number of features: 4
    /comment=MODEL REFSEQ: This record is predicted by automated computational analysis. This record is derived from a genomic sequence (NW_ ) annotated using gene prediction method: GNOMON, supported by EST evidence. Also see: Documentation of NCBI's Annotation Process
    /sequence_version=1
    /source=Danio rerio (zebrafish)
    /taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Ostariophysi', 'Cypriniformes', 'Cyprinidae', 'Danio']
    /keywords=['']
    /accessions=['XM_ ']
    /data_file_division=VRT
    /date=23-MAR-2011
    /organism=Danio rerio
    /gi= 
    Seq('GGAAGCCGCTCTTGATACTCCAGTCAGTCTTCAGAGCAGTCTTCGGAGTATTTA...TAG', IUPACAmbiguousDNA())

It is possible to download multiple files by writing a string containing all their identifiers separated by commas:

    >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="XM_ ,BC ,BC ")
    >>> record = SeqIO.parse(handle,"genbank")
    >>> for seq_record in record:
    ...     print seq_record.id, seq_record.description[:50]
    ...     print "Sequence length %i," % len(seq_record),
    ...     print "%i features," % len(seq_record.features),
    ...     print "from: %s" % seq_record.annotations["source"]
    ...
    XM_ PREDICTED: Danio rerio nuclear factor of activated
    Sequence length 2808, 4 features, from: Danio rerio (zebrafish)
    BC Danio rerio zgc:92347, mRNA (cDNA clone MGC:92347
    Sequence length 1188, 3 features, from: Danio rerio (zebrafish)
    BC Danio rerio zgc:113352, mRNA (cDNA clone MGC:11335
    Sequence length 1660, 3 features, from: Danio rerio (zebrafish)

10.4 Writing sequence files

The SeqIO.write() method writes SeqRecord objects into a file in the format specified by the user, chosen from a list of popular sequence file formats. The method requires three arguments:

- one or more SeqRecord objects;
- a handle or a filename to write to;
- a sequence format.

In the following example, we manually create three SeqRecord objects for three (very short) proteins. Then, the three objects are put into a list, which is used as the first argument of the SeqIO.write() method to specify which objects to write into the file. Next, we create a handle, which is a file opened for writing, and pass it to the method as the second argument. Finally, we specify that we want the output file to be written in fasta format. The Bio.SeqIO.write() function returns the number of SeqRecord objects written to the file.

    >>> from Bio.Seq import Seq
    >>> from Bio.SeqRecord import SeqRecord
    >>> from Bio.Alphabet import generic_protein
    >>> Rec1 = SeqRecord(Seq("ACCA",generic_protein), id="1", description="")
    >>> Rec2 = SeqRecord(Seq("CDFAA",generic_protein), id="2", description="")
    >>> Rec3 = SeqRecord(Seq("GRKLM",generic_protein), id="3", description="")
    >>> My_records = [Rec1, Rec2, Rec3]
    >>> from Bio import SeqIO
    >>> handle_w = open("MySeqs.fa","w")
    >>> SeqIO.write(My_records, handle_w, "fasta")
    3
    >>> handle_w.close()

The input SeqRecord objects can be in the form of a list, as in the example above, an iterator, or an individual SeqRecord:

    >>> handle = open("D.rerio_calcineurin.gb")
    >>> records = SeqIO.parse(handle,"genbank")
    >>> handle_w = open("all_records_in_fasta.fa","w")
    >>> SeqIO.write(records, handle_w, "fasta")
    60
    >>> handle.close()
    >>> handle_w.close()
    >>> handle = open("D.rerio_calcineurin.gb")
    >>> records = SeqIO.parse(handle,"genbank")
    >>> first_record = records.next()
    >>> handle_w = open("only_the_first_record.fa","w")
    >>> SeqIO.write(first_record, handle_w, "fasta")
    1
    >>> handle.close()
    >>> handle_w.close()

10.5 Parsing Multiple Alignments

Biopython provides a data structure to store multiple alignments (the MultipleSeqAlignment class), and the Bio.AlignIO module for reading and writing them in various file formats. Let's open the seed multiple sequence alignment of the calcineurin-like phosphoesterases from the Pfam family Metallophos (PF00149), containing 330 protein sequences. The file is in the Stockholm format, one of the most popular formats for handling multiple alignments.

The Bio.AlignIO module provides two methods to parse multiple alignments, .parse() and .read(), which parse files containing many alignments or just one, following the usual Biopython convention. Both methods require the same arguments:

- a handle to the multiple alignment, either an open file or a filename;
- the format of the multiple alignment (a full list of available formats can be found in the Biopython documentation);
- the alphabet used by the alignment (optional).
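
For readers who have not met the Stockholm format mentioned above, the following sketch shows its basic layout and how the aligned rows could be pulled out by hand. It is only an illustration under stated assumptions: the toy alignment (names and residues) is invented, read_stockholm_rows is a hypothetical helper that ignores all markup lines, and the code is written for Python 3. For real files, Bio.AlignIO remains the tool to use.

```python
import io

# A toy alignment in Stockholm layout (names and residues are invented).
TOY_STOCKHOLM = """# STOCKHOLM 1.0
#=GF ID   Toy_family
seqA/1-7     MKV-LSDA
seqB/3-9     MRVALSD-
seqC/2-9     MKIALTDA
//
"""


def read_stockholm_rows(handle):
    """Return {name: aligned_sequence} for the first alignment in the handle.

    Hypothetical helper: it skips the header and the '#=' markup lines,
    and stops at the '//' terminator.
    """
    rows = {}
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, sequence = parts
        # Interleaved files repeat each name once per block, so append.
        rows[name] = rows.get(name, "") + sequence.strip()
    return rows


rows = read_stockholm_rows(io.StringIO(TOY_STOCKHOLM))
lengths = {len(seq) for seq in rows.values()}

for name, seq in rows.items():
    print(name, seq)
print("distinct row lengths:", lengths)
```

All rows of an alignment have the same length, which is why the set of lengths contains a single value.
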

    >>> from Bio import AlignIO
    >>> alignment = AlignIO.read("PF00149.sth", "stockholm")
    >>> dir(alignment)
    ['__add__', '__doc__', '__format__', '__getitem__', '__init__', '__iter__', '__len__', '__module__', '__repr__', '__str__', '_alphabet', '_annotations', '_append', '_records', '_str_line', 'add_sequence', 'append', 'extend', 'format', 'get_alignment_length', 'get_all_seqs', 'get_column', 'get_seq_by_num', 'sort']
    >>> print alignment
    SingleLetterAlphabet() alignment with 330 rows and 477 columns
    FKIVQFSDAHLSDYFTLE HGG YKUE_BACSU/
    LRVLHISDLHMLPNQHR HGG O69651_MYCTU/
    LRVLQVSDIHMVGGQRK HGG Q9X935_STRCO/
    LNILHLSDLHLENISVS HGG YKOQ_BACSU/
    LPYGVISDPHYHRWDAFATTNA DGLN-SRLE--HNH Q9R2P6_YERPE/3-205
    LRFVQLSDIHLGTVRSAG HGG O27247_METTH/
    LRIVQISDLHLNHSTPDA HGP Y461_CHLTR/
    LRIAQISDLHFHKRVPEK HGP Y578_CHLPN/
    >>>

AlignIO.parse() returns an iterator over the alignments contained in a file, while iterating over a single alignment goes through it row by row, providing a SeqRecord object for each sequence in the alignment:

    >>> for record in alignment:
    ...     print record.id,record.annotations
    ...
    YKUE_BACSU/ {'start': 58, 'end': 225, 'accession': 'O '}
    O69651_MYCTU/ {'start': 51, 'end': 235, 'accession': 'O '}
    Q9X935_STRCO/ {'start': 47, 'end': 241, 'accession': 'Q9X935.1'}
    YKOQ_BACSU/ {'start': 46, 'end': 211, 'accession': 'O '}
    Q9R2P6_YERPE/3-205 {'start': 3, 'end': 205, 'accession': 'Q9R2P6.1'}
    O27247_METTH/ {'start': 130, 'end': 285, 'accession': 'O '}
    Y461_CHLTR/ {'start': 52, 'end': 261, 'accession': 'O '}
    Y578_CHLPN/ {'start': 45, 'end': 254, 'accession': 'Q9Z7X6.1'}
    O03968_9CAUD/ {'start': 269, 'end': 543, 'accession': 'O '}
    ASM3A_MOUSE/ {'start': 35, 'end': 294, 'accession': 'P '}
    ASM3B_HUMAN/ {'start': 21, 'end': 281, 'accession': 'Q '}

Like the other modules, AlignIO provides methods to write alignments to files in several formats, to convert between formats, and so on. You can also perform slicing operations, which can be thought of as accessing the alignment as a matrix. The standard slicing operator [i:j] returns the alignment rows from row i to row j-1. To select an alignment column, you can use the operator [:,k], which selects the column with index k (counting from zero):

    >>> print "Number of rows: %i" % len(alignment)
    Number of rows: 330
    >>> print alignment[3:7]
    SingleLetterAlphabet() alignment with 4 rows and 477 columns
    LNILHLSDLHLENISVS HGG YKOQ_BACSU/
    LPYGVISDPHYHRWDAFATTNA DGLN-SRLE--HNH Q9R2P6_YERPE/3-205
    LRFVQLSDIHLGTVRSAG HGG O27247_METTH/
    LRIVQISDLHLNHSTPDA HGP Y461_CHLTR/
    >>> print alignment[:,6]
    SSSSSSSSSTATTSTSAAATSSSTSASSTAPATTTTTTTSASAAAAASSGSSSASAAASGGGGGG
    GNNGGGGSGGGGGGGGSGCGGGGGGSNNNNNNNNNNNNNNNNNNNSSTTTTTTNNGGGGGGTTTG
    GGGGSSSSASSTSSSSASSSSGGGGGSASSGSASAASAAAAATSTTSSSSSSASSSSSSSAAAGG
    GGGGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
    GGGGGGGGGGGGGGGSGGGGGGGGPGGGGSSASSGSTSGASSSSSTTSSSSSSSSSSSSSAAAAA
    GGGST
    >>> print alignment[2,6]
    S
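
The matrix view of an alignment used in the slicing examples above can be mimicked with plain Python strings. This is a sketch for building intuition, not a description of how MultipleSeqAlignment is implemented: the four toy rows are invented, and the code assumes Python 3. Each statement is annotated with the alignment expression it corresponds to.

```python
# Toy alignment: one string per row, all of the same length (invented data).
rows = [
    "MKV-LSDA",
    "MRVALSD-",
    "MKIALTDA",
    "MKVALSDA",
]

# Rows 1 and 2 (the end index is excluded), like alignment[1:3].
row_slice = rows[1:3]

# Column with index 2, read from top to bottom, like alignment[:,2].
column = "".join(row[2] for row in rows)

# Single cell at row 2, column 5, like alignment[2,5].
cell = rows[2][5]

print(row_slice)
print(column)
print(cell)
```

With a real alignment object, the row slice is itself an alignment, the column is returned as a string, and the single cell is a one-letter string.
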