BINF 3360, Introduction to Computational Biology
Lecture 2, Introduction to Python

Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University

Python Programming Language

Script Language
    General-purpose script language
    Broad applications (web, bioinformatics, network programming, graphics, software engineering)

Features
    Object-oriented
    Extension with modules
    Database integration
    Embeddable
    Web frameworks / Web modules

Getting Started

Download & Installation
    http://www.python.org/download/ (the most recent version: Python 3.3)

Edit & Run
    Create a file named test.py
    Edit the code
        # This is a test.
        dna = 'ATCGATGA'
        print(dna, '\n')
    Run the code
        > python test.py

Primitives

Primitive Data Types
    Numbers or Strings
        num = 1234
        st = '1234'
        num_1 = num + int(st)
        st_1 = str(num) + st
    Substring
        dna1 = 'ACGTGAACT'
        dna2 = dna1[0:4]
        length = len(dna2)
    Reversing
        dna1 = 'ACGTGAACT'
        dna2 = dna1[::-1]
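
The primitives above can be checked with a short sketch. The variable names below (mixed_ok, total, joined, first_four, last_three, reversed_dna) are illustrative and not part of the slides. It shows that numbers and strings must be converted explicitly before they are combined, and that a slice excludes its end index.

```python
num = 1234
st = '1234'

# Mixing a number and a string directly raises TypeError; convert first.
try:
    num + st
    mixed_ok = True
except TypeError:
    mixed_ok = False

total = num + int(st)        # numeric addition
joined = str(num) + st       # string concatenation

dna = 'ACGTGAACT'
first_four = dna[0:4]        # indices 0, 1, 2, 3 (the end index is excluded)
last_three = dna[-3:]        # negative indices count from the end
reversed_dna = dna[::-1]     # a step of -1 walks the string backwards
```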

Lists

List Variables
    A list of comma-separated values
        lst1 = ['A', 'C', 'G']
        lst2 = ['T']
        lst1 = lst1 + lst2
    Variable-length list

Insert, Delete, Append, Reverse, and Sort
    With list methods
        lst = ['A', 'T', 'G']
        lst.insert(1, 'C')
        del lst[2]
        lst.append('T')
        lst.extend(['A', 'C'])
        lst.reverse()
        lst.sort()
    With slices
        lst = ['A', 'T', 'G']
        lst[1:2] = 'C'
        lst[1:1] = 'T'
        lst[2:3] = []
        lst[len(lst):len(lst)] = 'T'
        lst[len(lst):len(lst)] = ['A', 'C']
        lst[::-1]

Sets

Set Variables
    DNAbases = {'A', 'C', 'G', 'T'}
    RNAbases = {'A', 'C', 'G', 'U'}
    DNAbases | RNAbases
    DNAbases & RNAbases
    DNAbases - RNAbases

Add and Remove
    bases = {'A', 'D', 'G'}
    bases.add('T')
    bases.remove('D')
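
A minimal sketch of what the set operations above produce. The lowercase variable names are illustrative. It also uses discard(), which, unlike remove(), does not raise an error when the element is absent.

```python
dna_bases = {'A', 'C', 'G', 'T'}
rna_bases = {'A', 'C', 'G', 'U'}

all_bases = dna_bases | rna_bases    # union: in either set
shared = dna_bases & rna_bases       # intersection: in both sets
dna_only = dna_bases - rna_bases     # difference: in DNA but not in RNA

bases = {'A', 'D', 'G'}
bases.add('T')
bases.remove('D')     # raises KeyError if 'D' were missing
bases.discard('D')    # safe: 'D' is already gone, nothing happens
```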

Dictionaries

Initialization
    d = {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}

    d = dict()
    d['key1'] = 'value1'
    k2, v2 = 'key2', 'value2'
    d[k2] = v2

Mapping
    d['key1']
    d.get('key1')
    d.keys()
    d.values()

Delete
    del d['key1']

Input / Output

Standard Input
    import sys
    data = sys.stdin.readline().replace('\n', '')

Reading Files
    File name given in the code
        name = 'myfilename.txt'
        with open(name) as file:
            data = file.read()
    File name read from standard input (strip the trailing newline first)
        name = sys.stdin.readline().strip()
        with open(name) as file:
            data = file.read()
    File name given as a command-line argument
        name = sys.argv[1]
        with open(name) as file:
            data = file.read()

Writing Files
    name = 'output.txt'
    with open(name, 'w') as file:
        file.write('ATCGATG')
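
The writing and reading patterns above can be combined into one round trip. This is a sketch only: the temporary directory is an assumption made here so the example does not touch real files, and it is not part of the slides.

```python
import os
import tempfile

# A temporary directory keeps the example away from real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    name = os.path.join(tmp_dir, 'output.txt')

    # Write a sequence followed by a newline.
    with open(name, 'w') as file:
        file.write('ATCGATG\n')

    # Read the whole file back and drop the newline.
    with open(name) as file:
        data = file.read().replace('\n', '')
```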

Functions

Types
    Built-in system functions
    User-defined functions

Defining Function
    def function_name(parameter_list):
        statement
        statement
        return value

Function Call

Iteration

Iterative Process
    def find_max(lst):
        max_so_far = lst[0]
        for item in lst[1:]:
            if item > max_so_far:
                max_so_far = item
        return max_so_far

    lst1 = [3, 5, 10, 4, 6]
    maximum = find_max(lst1)
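
As a sketch, find_max can be compared with the built-in max() to connect the two function types listed above. The empty-list check is an addition made here, not part of the slides. Without it, lst[0] would raise IndexError on an empty list.

```python
def find_max(lst):
    """Return the largest item of a non-empty list."""
    if not lst:
        # Added guard (assumption): fail with a clear message on empty input.
        raise ValueError('find_max() needs at least one item')
    max_so_far = lst[0]
    for item in lst[1:]:
        if item > max_so_far:
            max_so_far = item
    return max_so_far

lst1 = [3, 5, 10, 4, 6]
maximum = find_max(lst1)                  # user-defined function
same_as_builtin = (maximum == max(lst1))  # built-in function
```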

Recursion

Recursive Call
    def print_tree(tree, level):
        print(' ' * 4 * level, tree[0])
        for subtree in tree[1:]:
            print_tree(subtree, level+1)

    t1 = ['A', ['T', ['A'], ['T']], ['G', ['G'], ['C']]]
    print_tree(t1, 0)

Modules

Module
    A collection of functions
    A module is a Python (.py) file in a library directory

Module Call
    import random
    seq = 'ATCGATAGCTA'
    random_base = seq[random.randint(0, len(seq)-1)]

    from random import *
    seq = 'ATCGATAGCTA'
    random_base = seq[randint(0, len(seq)-1)]
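
The two import styles above can be put side by side in a short sketch. The seed value is arbitrary and only makes repeated runs reproducible. random.choice() is shown as a shorter way to pick one item, and the names base_a, base_b, and base_c are illustrative.

```python
import random
from random import randint

seq = 'ATCGATAGCTA'
random.seed(0)   # arbitrary seed so repeated runs give the same picks

base_a = seq[random.randint(0, len(seq) - 1)]   # name qualified by the module
base_b = seq[randint(0, len(seq) - 1)]          # name imported directly
base_c = random.choice(seq)                     # picks one item of a sequence
```

Importing only the names that are needed (from random import randint) avoids the name clashes that from random import * can cause.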

Regular Expressions

Special Languages
    Metacharacters
    Quantifiers
    Alternatives
    Character Set
    Similar to the regular expressions in Perl

Usage
    import re
    if re.match('TATA.*|AA', seq):
        print('It matched!')

    import re
    matches = re.findall('TATA.*|AA', seq)
    print(matches)

Biological Applications

Parsing Sequences
Base Frequency Counting
Motif (Substring) Search
Sequence Transformation
DNA Replication
Transcription from DNA to RNA
Translating RNA into Protein
DNA Sequence Mutation
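
Returning to the regular-expression usage, here is a sketch of how the main re functions differ. The sequence and the pattern are made up for illustration. re.match() only tests the start of the string, re.search() scans the whole string, and re.finditer() also reports where each hit begins.

```python
import re

seq = 'GGCTATAAAGGCCTATATAGC'   # hypothetical sequence
pattern = 'TATA[AT]A'           # hypothetical motif using a character set

starts_with = re.match(pattern, seq) is not None   # anchored at position 0
first_hit = re.search(pattern, seq)                # first hit anywhere
all_hits = re.findall(pattern, seq)                # every non-overlapping hit
positions = [m.start() for m in re.finditer(pattern, seq)]
```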

Parsing Sequences (1)

Single Sequence in FASTA Format
    >gi|5524211|gb|AAD44166.1| cytochrome b
    LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIP
    YIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDK
    IPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
    VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYP
    YTIIGQMASILYFSIILAFLPIAGXIENY

Parsing
    Make a function to return the sequence from the FASTA format
    def read_fasta_seq(filename):
        with open(filename) as f:
            return f.read().partition('\n')[2].replace('\n', '')

Parsing Sequences (2)

Multiple Sequences in FASTA Format
    >SEQUENCE_1
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
    LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIP
    QFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFY
    VMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
    >SEQUENCE_2
    SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGE
    NLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Parsing?
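
One possible answer to the multiple-sequence question, as a sketch rather than the course solution. The function name read_fasta_records and the choice of a dictionary keyed by header are assumptions made here. The example input uses only the first few residues of the two sequences above.

```python
def read_fasta_records(lines):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):
            header = line[1:]             # a new record starts here
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence line of the current record
    # Join the collected lines of each record into one string.
    return {name: ''.join(parts) for name, parts in records.items()}

# Truncated example input, split over several lines as in a real file.
example = """>SEQUENCE_1
MTEITAAMVK
ELRESTGAGM
>SEQUENCE_2
SATVSEINSE
"""
records = read_fasta_records(example.splitlines())
```

Because an open file is also an iterable of lines, the same function can be called as read_fasta_records(f) inside a with open(filename) as f: block.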

Frequency Counting

DNA Sequence Validation
    def validate_dna(base_sequence):
        seq = base_sequence.upper()
        return len(seq) == (seq.count('T') + seq.count('C') +
                            seq.count('A') + seq.count('G'))

    def validate_dna(base_sequence):
        seq = base_sequence.upper()
        for base in seq:
            if base not in 'ACGT':
                return False
        return True

Counting Base Frequency
    Make a function to calculate the percent of C and G in a DNA sequence
    def percent_of_gc(base_sequence):
        seq = base_sequence.upper()
        # Multiply by 100 so the result is a percent rather than a fraction.
        return 100 * (seq.count('G') + seq.count('C')) / len(seq)

Motif Search

Searching Substring
    Make a function to take a sequence and a motif and return the position(s) of matching in the sequence
    def motif_search(seq, motif):
        return seq.find(motif)

    def all_motif_search(seq, motif):
        # Positions of all non-overlapping matches; assumes a non-empty motif.
        pos = []
        idx = seq.find(motif)
        while idx >= 0:
            pos.append(idx)
            idx = seq.find(motif, idx + len(motif))
        return pos
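
Motifs can overlap in a sequence, and a search that resumes after the end of each match skips such hits. A sketch of an overlapping variant follows. The function name all_overlapping_motif_search and the example sequence are illustrative. The only change is that the search resumes one position after the start of the previous match.

```python
def all_overlapping_motif_search(seq, motif):
    """Return the start positions of all matches, including overlapping ones."""
    if not motif:
        return []            # an empty motif would match everywhere
    pos = []
    idx = seq.find(motif)
    while idx >= 0:
        pos.append(idx)
        idx = seq.find(motif, idx + 1)   # resume one position further on
    return pos

hits = all_overlapping_motif_search('ATATATGCATAT', 'ATAT')
```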

Transcription

Simulating Transcription
    Make a function to transcribe a DNA into an RNA
    def transcription(dna):
        return dna.replace('T', 'U')

Translation (1)

Making Genetic Code
    Make a function to translate a codon to an amino acid
    def codon2aa(codon):
        genetic_code = {
            'UUU': 'F', 'UUC': 'F', 'UUA': 'L',
            # ... remaining codons
        }
        if codon in genetic_code.keys():
            return genetic_code[codon]
        else:
            return 'Error'
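
Typing all 64 codons by hand is error-prone. As a sketch, the full table can be generated from a compact string. The assumption here is the standard genetic code, written in the conventional U, C, A, G order for each codon position, with '*' marking a stop codon. The uppercase constant names are illustrative.

```python
# Assumption: standard genetic code; '*' marks a stop codon.
BASES = 'UCAG'
AMINO_ACIDS = ('FFLLSSSSYY**CC*W'    # codons starting with U
               'LLLLPPPPHHQQRRRR'    # codons starting with C
               'IIIMTTTTNNKKSSRR'    # codons starting with A
               'VVVVAAAADDEEGGGG')   # codons starting with G

# Enumerate codons in the same order as the amino-acid string.
CODONS = [b1 + b2 + b3 for b1 in BASES for b2 in BASES for b3 in BASES]
GENETIC_CODE = dict(zip(CODONS, AMINO_ACIDS))

def codon2aa(codon):
    """Translate one RNA codon; return 'Error' for an unknown codon."""
    return GENETIC_CODE.get(codon.upper(), 'Error')
```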

Translation (2)

Simulating Translation
    Make a function to translate an RNA into a protein sequence
    def translation(rna):
        protein = ''
        for n in range(0, len(rna), 3):
            protein += codon2aa(rna[n:n+3])
        return protein

Translation (3)

Simulating Translation (cont.)
    Make a generator - an object that returns values from a series it computes
    def aa_generator(rna):
        return (codon2aa(rna[n:n+3]) for n in range(0, len(rna), 3))

    def translation(rna):
        gen = aa_generator(rna)
        protein = ''
        # A default of None avoids StopIteration when the generator is exhausted.
        aa = next(gen, None)
        while aa:
            protein += aa
            aa = next(gen, None)
        return protein
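
A generator can also be consumed directly by ''.join(), which removes the explicit next() loop. The sketch below makes two additions that are not in the slides: translation stops at the first stop codon, and a trailing partial codon is ignored. The small table is an illustrative subset of the genetic code, not the full one.

```python
# Illustrative subset of the genetic code; '*' marks a stop codon.
GENETIC_CODE = {'AUG': 'M', 'UUU': 'F', 'UUC': 'F', 'UUA': 'L',
                'GGC': 'G', 'UAA': '*'}

def codon2aa(codon):
    return GENETIC_CODE.get(codon, 'Error')

def aa_generator(rna):
    """Yield amino acids codon by codon until a stop codon is reached."""
    usable = len(rna) - len(rna) % 3      # ignore a trailing partial codon
    for n in range(0, usable, 3):
        aa = codon2aa(rna[n:n+3])
        if aa == '*':
            return                        # stop codon ends the protein
        yield aa

def translation(rna):
    return ''.join(aa_generator(rna))

protein = translation('AUGUUUGGCUAAUUC')
```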

Mutation

Simulating Mutation
    Make a function to simulate single point mutations in a DNA sequence
    import random
    def mutation(dna):
        position = random.randint(0, len(dna)-1)
        bases = 'ACGT'
        new_base = bases[random.randint(0, 3)]
        # Strings are immutable, so build a new string around the new base.
        dna = dna[:position] + new_base + dna[position+1:]
        return dna

    To make sure the new base differs from the original one, choose it from the three other bases:
        bases = bases.replace(dna[position], '')
        new_base = bases[random.randint(0, 2)]

Questions?

Lecture Slides are found on the Course Website, web.ecs.baylor.edu/faculty/cho/3360