Lecture 2, Introduction to Python. Python Programming Language




BINF 3360, Introduction to Computational Biology
Lecture 2, Introduction to Python

Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University

Python Programming Language

Script Language
    General-purpose script language
    Broad applications (web, bioinformatics, network programming, graphics, software engineering)

Features
    Object-oriented
    Extension with modules
    Database integration
    Embeddable
    Web frameworks / Web modules

Getting Started

Download & Installation
    http://www.python.org/download/ (the most recent version: Python 3.3)

Edit & Run
    Create a file named test.py
    Edit the code

        # This is a test.
        dna = "ATCGATGA"
        print(dna, "\n")

    Run the code

        > python test.py

Primitives

Primitive Data Types
    Numbers or Strings

        num = 1234
        st = "1234"
        num_1 = num + int(st)     # integer addition: 2468
        st_1 = str(num) + st      # string concatenation: "12341234"

    Substring

        dna1 = "ACGTGAACT"
        dna2 = dna1[0:4]          # "ACGT"
        length = len(dna2)        # 4

    Reversing

        dna1 = "ACGTGAACT"
        dna2 = dna1[::-1]         # "TCAAGTGCA"

Lists

List Variables
    A list of comma-separated values

        lst1 = ["A", "C", "G"]
        lst2 = ["T"]
        lst1 = lst1 + lst2

Variable-length list
    Insert, Delete, Append, Reverse, and Sort

        lst = ["A", "T", "G"]
        lst.insert(1, "C")
        del lst[2]
        lst.append("T")
        lst.extend(["A", "C"])
        lst.reverse()
        lst.sort()

    The same operations written with slices

        lst = ["A", "T", "G"]
        lst[1:2] = "C"                          # replace the element at index 1
        lst[1:1] = "T"                          # insert before index 1
        lst[2:3] = []                           # delete the element at index 2
        lst[len(lst):len(lst)] = "T"            # append
        lst[len(lst):len(lst)] = ["A", "C"]     # extend
        lst[::-1]                               # reversed copy

Sets

Set Variables

        DNAbases = {"A", "C", "G", "T"}
        RNAbases = {"A", "C", "G", "U"}
        DNAbases | RNAbases       # union
        DNAbases & RNAbases       # intersection
        DNAbases - RNAbases       # difference

Add and Remove

        bases = {"A", "D", "G"}
        bases.add("T")
        bases.remove("D")
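To make the set operators above concrete, here is a minimal sketch that evaluates each one on the two base alphabets. The variable names `all_bases`, `shared`, and `dna_only` are illustrative and do not come from the slides.

```python
# Set operators applied to the DNA and RNA base alphabets.
dna_bases = {"A", "C", "G", "T"}
rna_bases = {"A", "C", "G", "U"}

all_bases = dna_bases | rna_bases   # union: bases found in either alphabet
shared = dna_bases & rna_bases      # intersection: bases found in both
dna_only = dna_bases - rna_bases    # difference: bases found only in DNA

# Sets are unordered, so sort them before printing to get a stable display.
print(sorted(all_bases))   # ['A', 'C', 'G', 'T', 'U']
print(sorted(shared))      # ['A', 'C', 'G']
print(sorted(dna_only))    # ['T']
```

Because a set has no order and no duplicates, it suits alphabet membership tests better than a list does.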

Dictionaries

Initialization

        d = {"key1": "value1", "key2": "value2", "key3": "value3"}

        d = dict()
        d["key1"] = "value1"
        k2, v2 = "key2", "value2"
        d[k2] = v2

Mapping

        d["key1"]
        d.get("key1")
        d.keys()
        d.values()

Delete

        del d["key1"]

Input / Output

Standard Input

        import sys
        data = sys.stdin.readline().replace("\n", "")

Reading Files

        name = "myfilename.txt"
        with open(name) as file:
            data = file.read()

        # readline() keeps the trailing newline, so strip it before opening the file
        name = sys.stdin.readline().rstrip("\n")
        with open(name) as file:
            data = file.read()

        name = sys.argv[1]
        with open(name) as file:
            data = file.read()

Writing Files

        name = "output.txt"
        with open(name, "w") as file:
            file.write("ATCGATG")
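As an illustration of how the dictionary operations above combine, the following sketch counts how often each base occurs in a sequence. The function name `count_bases` is hypothetical, and the input is the test sequence used earlier in the lecture.

```python
def count_bases(seq):
    """Return a dictionary mapping each character of seq to its count."""
    counts = dict()
    for base in seq.upper():
        # get() returns 0 for a base that has not been seen yet
        counts[base] = counts.get(base, 0) + 1
    return counts


counts = count_bases("ATCGATGA")
print(counts["A"])          # 3
print(counts.get("N", 0))   # 0, because "N" never occurs in the sequence
```

Using `d.get(key, default)` avoids the `KeyError` that `d[key]` raises for a missing key.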

Functions

Types
    Built-in system functions
    User-defined functions

Defining Function

        def function_name(parameter_list):
            statement
            statement
            return value

Function Call

Iteration

Iterative Process

        def find_max(lst):
            max_so_far = lst[0]
            for item in lst[1:]:
                if item > max_so_far:
                    max_so_far = item
            return max_so_far

        lst1 = [3, 5, 10, 4, 6]
        maximum = find_max(lst1)

Recursion

Recursive Call

        def print_tree(tree, level):
            # Indent each node by four spaces per level
            print(" " * 4 * level, tree[0])
            for subtree in tree[1:]:
                print_tree(subtree, level + 1)

        t1 = ["A", ["T", ["A"], ["T"]], ["G", ["G"], ["C"]]]
        print_tree(t1, 0)

Modules

Module
    A collection of functions
    Modules are Python (.py) files in a library directory

Module Call

        import random
        seq = 'ATCGATAGCTA'
        random_base = seq[random.randint(0, len(seq) - 1)]

        from random import *
        seq = 'ATCGATAGCTA'
        random_base = seq[randint(0, len(seq) - 1)]
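The nested-list tree used with print_tree above lends itself to other recursive functions as well. The sketch below assumes the same representation (the first element is the node label and the remaining elements are subtrees). The helper names `count_nodes` and `tree_depth` are hypothetical.

```python
def count_nodes(tree):
    """Count this node plus all nodes in its subtrees."""
    return 1 + sum(count_nodes(subtree) for subtree in tree[1:])


def tree_depth(tree):
    """Return the number of levels in the tree (a single node has depth 1)."""
    # default=0 covers a leaf, which has no subtrees to take the maximum of
    return 1 + max((tree_depth(subtree) for subtree in tree[1:]), default=0)


t1 = ["A", ["T", ["A"], ["T"]], ["G", ["G"], ["C"]]]
print(count_nodes(t1))   # 7
print(tree_depth(t1))    # 3
```

Each function follows the same pattern as print_tree: handle the current node, then recurse into `tree[1:]`.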

Regular Expressions

Special Languages
    Metacharacters
    Quantifiers
    Alternatives
    Character Set
    Same as the regular expressions in Perl

Usage

        import re
        if re.match("TATA.*AA", seq):
            print("It matched!")

        import re
        matches = re.findall("TATA.*AA", seq)
        print(matches)

Biological Applications

    Parsing Sequences
    Base Frequency Counting
    Motif (Substring) Search
    Sequence Transformation
        DNA Replication
        Transcription from DNA to RNA
        Translating RNA into Protein
    DNA Sequence Mutation
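For the regular-expression usage shown above, it helps to see how `re.match`, `re.search`, and `re.findall` differ on a concrete input. The sequence below is made up for illustration. The non-greedy pattern `TATA.*?AA` is an addition not found in the slides.

```python
import re

seq = "GGTATACCGAATTAAC"   # hypothetical example sequence

# re.match only tries to match at the very start of the string.
print(re.match("TATA.*AA", seq))            # None, because seq starts with "GG"

# re.search scans the whole string; ".*" is greedy and runs to the last "AA".
print(re.search("TATA.*AA", seq).group())   # TATACCGAATTAA

# ".*?" is non-greedy and stops at the first "AA" after "TATA".
print(re.search("TATA.*?AA", seq).group())  # TATACCGAA

# re.findall returns every non-overlapping match as a list of strings.
print(re.findall("TATA.*AA", seq))          # ['TATACCGAATTAA']
```

When a motif may occur anywhere in a sequence, `re.search` or `re.findall` is usually the better fit, since `re.match` succeeds only at position 0.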

Parsing Sequences (1)

Single Sequence in FASTA Format

        >gi|5524211|gb|AAD44166.1| cytochrome b
        LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIP
        YIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDK
        IPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
        VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYP
        YTIIGQMASILYFSIILAFLPIAGXIENY

Parsing
    Make a function to return the sequence from the FASTA format

        def read_fasta_seq(filename):
            with open(filename) as f:
                # Drop the header line, then join the sequence lines
                return f.read().partition("\n")[2].replace("\n", "")

Parsing Sequences (2)

Multiple Sequences in FASTA Format

        >SEQUENCE_1
        MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
        LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIP
        QFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFY
        VMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
        >SEQUENCE_2
        SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGE
        NLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Parsing?
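One possible answer to the multi-sequence parsing question above is sketched here. It is not taken from the slides. The function name `parse_fasta` and the choice of a dictionary keyed by header line are assumptions, and the example records are shortened versions of the two sequences shown.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # a new record begins
            records[header] = ""
        elif header is not None:
            records[header] += line       # append to the current record
    return records


example = """>SEQUENCE_1
MTEITAAMVK
ELRESTGAGM
>SEQUENCE_2
SATVSEINSE
"""

records = parse_fasta(example)
print(records["SEQUENCE_1"])   # MTEITAAMVKELRESTGAGM
print(records["SEQUENCE_2"])   # SATVSEINSE
```

To parse a file, the same function can be applied to the result of `f.read()` inside a `with open(filename) as f:` block.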

Frequency Counting

DNA Sequence Validation

        def validate_dna(base_sequence):
            seq = base_sequence.upper()
            return len(seq) == (seq.count("T") + seq.count("C") +
                                seq.count("A") + seq.count("G"))

        def validate_dna(base_sequence):
            seq = base_sequence.upper()
            for base in seq:
                if base not in "ACGT":
                    return False
            return True

Counting Base Frequency
    Make a function to calculate the percent of C and G in a DNA sequence

        def percent_of_gc(base_sequence):
            seq = base_sequence.upper()
            # Multiply by 100 so the result is a percentage rather than a fraction
            return 100 * (seq.count("G") + seq.count("C")) / len(seq)

Motif Search

Searching Substring
    Make a function to take a sequence and a motif and return the position(s) of matching in the sequence

        def motif_search(seq, motif):
            return seq.find(motif)

        def all_motif_search(seq, motif):
            pos = []
            idx = seq.find(motif)
            if idx < 0:
                # The motif does not occur at all
                return pos
            pos.append(idx)
            seq = seq.partition(motif)[2]
            while seq.find(motif) >= 0:
                # Offset within the remainder plus the part already consumed
                idx += seq.find(motif) + len(motif)
                pos.append(idx)
                seq = seq.partition(motif)[2]
            return pos
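The partition-based search above skips past each match, so it reports only non-overlapping occurrences. A minimal alternative sketch uses the optional start argument of `str.find` to report overlapping matches too. The function name `find_all` and its `overlapping` parameter are hypothetical.

```python
def find_all(seq, motif, overlapping=True):
    """Return the start positions of motif in seq."""
    if not motif:
        return []                          # an empty motif has no meaningful positions
    positions = []
    # Advance by one to allow overlaps, or by the motif length to forbid them.
    step = 1 if overlapping else len(motif)
    idx = seq.find(motif)
    while idx >= 0:
        positions.append(idx)
        idx = seq.find(motif, idx + step)  # resume the search after this match
    return positions


print(find_all("ATATAT", "ATA"))                     # [0, 2]
print(find_all("ATATAT", "ATA", overlapping=False))  # [0]
print(find_all("ATATAT", "GG"))                      # []
```

Searching with a start index avoids rebuilding the remaining string on every iteration and keeps positions relative to the original sequence.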

Transcription

Simulating Transcription
    Make a function to transcribe a DNA into an RNA

        def transcription(dna):
            return dna.replace("T", "U")

Translation (1)

Making Genetic Code
    Make a function to translate a codon to an amino acid

        def codon2aa(codon):
            genetic_code = {
                "UUU": "F", "UUC": "F", "UUA": "L",
                # remaining codons are omitted on the slide
            }
            if codon in genetic_code.keys():
                return genetic_code[codon]
            else:
                return "Error"
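The slide lists only the first few entries of the codon table. As a sketch under the assumption that the standard genetic code is intended, the full 64-entry table can be built from a compact string of one-letter amino acid codes ordered by the bases U, C, A, G, with `*` marking stop codons. The names `BASES`, `AMINO_ACIDS`, and `GENETIC_CODE` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for UUU, UUC, UUA, UUG, UCU, ... GGG ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest, matching the order of AMINO_ACIDS.
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon2aa(codon):
    """Translate one RNA codon, returning "Error" for anything unrecognized."""
    return GENETIC_CODE.get(codon, "Error")


print(len(GENETIC_CODE))   # 64
print(codon2aa("AUG"))     # M
print(codon2aa("UAA"))     # *
print(codon2aa("AU"))      # Error
```

Building the table once at module level avoids recreating the dictionary on every call, which the slide's version does.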

Translation (2)

Simulating Translation
    Make a function to translate an RNA into a protein sequence

        def translation(rna):
            protein = ""
            for n in range(0, len(rna), 3):
                protein += codon2aa(rna[n:n+3])
            return protein

Translation (3)

Simulating Translation cont
    Make a generator - an object that returns values from a series it computes

        def aa_generator(rna):
            return (codon2aa(rna[n:n+3]) for n in range(0, len(rna), 3))

        def translation(rna):
            gen = aa_generator(rna)
            protein = ""
            # next(gen, None) returns None when the generator is exhausted
            # instead of raising StopIteration
            aa = next(gen, None)
            while aa:
                protein += aa
                aa = next(gen, None)
            return protein
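To show the generator behavior described above in isolation, the sketch below yields the codons of an RNA string one at a time. The function name `codons` is hypothetical, and discarding an incomplete trailing codon is a design choice made here, not something the slides specify.

```python
def codons(rna):
    """Yield successive complete codons of rna, ignoring any leftover bases."""
    usable = len(rna) - len(rna) % 3       # length rounded down to a multiple of 3
    for n in range(0, usable, 3):
        yield rna[n:n + 3]


# A for loop (or list()) consumes the generator and stops cleanly at the end.
print(list(codons("AUGGCCUAAGG")))   # ['AUG', 'GCC', 'UAA']

# Calling next() by hand needs a default to avoid StopIteration at the end.
gen = codons("AUG")
print(next(gen, None))   # AUG
print(next(gen, None))   # None
```

In most code a `for` loop over the generator is simpler than calling `next` in a `while` loop, because the loop handles exhaustion automatically.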

Mutation

Simulating Mutation
    Make a function to simulate single point mutations in a DNA sequence

        import random

        def mutation(dna):
            position = random.randint(0, len(dna) - 1)
            bases = "ACGT"
            new_base = bases[random.randint(0, 3)]
            # Strings are immutable, so build a new string instead of assigning to a slice
            return dna[:position] + new_base + dna[position+1:]

    To guarantee that the selected base actually changes, remove the current base from the candidates before drawing:

            bases = "ACGT".replace(dna[position], "")
            new_base = bases[random.randint(0, 2)]

Questions?

Lecture Slides are found on the Course Website, web.ecs.baylor.edu/faculty/cho/3360
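As a closing sketch for the mutation exercise, the version below combines both ideas from the slide (pick a random position, then pick a base different from the current one) and checks that exactly one position changes. The function name `point_mutation`, the use of `random.randrange` and `random.choice`, and the assumption that the input contains only A, C, G, and T are choices made here, not the lecture's.

```python
import random


def point_mutation(dna):
    """Return a copy of dna with exactly one base replaced by a different base."""
    position = random.randrange(len(dna))            # index of the base to mutate
    candidates = "ACGT".replace(dna[position], "")   # the three other bases
    new_base = random.choice(candidates)
    return dna[:position] + new_base + dna[position + 1:]


original = "ATCGATGA"
mutant = point_mutation(original)

# Count the positions at which the two sequences differ.
differences = sum(1 for a, b in zip(original, mutant) if a != b)
print(original, mutant, differences)
```

The printed mutant varies from run to run, but the difference count is always 1 because the replacement base is drawn from the bases other than the current one.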