BIOL 75302 (phytoinformatics)
Dr. Damon P. Little
City University of New York, Lehman College & The New York Botanical Garden
dlittle@nybg.org; 718-817-8521
http://www.nybg.org/files/scientists/dlittle/phytoinformatics.html
[office hours by appointment]

Mondays & Wednesdays 2:00–5:00 PM
Pfizer conference room, The New York Botanical Garden

Objectives

This course provides students of plant organismal biology with the computational tools needed to process and extract data from text and image files. Topics include basic UNIX command line tools, relational database structure, introductory Structured Query Language (SQL), and introductory AWK and PERL programming. Techniques for querying and managing DNA sequence databases will also be covered. By the end of the course you should be:

1. comfortable using the BASH command line interface
2. able to extract and manipulate data in text files/streams using text processing tools and pipes
3. able to run programs in batch mode in a single user environment as well as a high performance computing environment
4. able to write basic SQL queries for MySQL
5. able to design a relational MySQL database
6. able to write basic AWK and PERL scripts
7. able to assemble sequencing reads into useful contigs
8. able to conduct basic sequence analyses including similarity and feature searches
9. able to extract data from images
Texts

Abascal, F., R. Zardoya & M. J. Telford. 2010. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Research 38: W7–W13.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers & D. J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.
Arbuthnott, J. 1710. An argument for divine providence, taken from the constant regularity observ'd in the births of both sexes. Philosophical Transactions 27: 186–190.
Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin & G. Sherlock. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25–29.
Caporaso, J. G., K. Bittinger, F. D. Bushman, T. Z. DeSantis, G. L. Andersen & R. Knight. 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26: 266–267.
Codd, E. F. 1970. A relational model of data for large shared data banks. Communications of the ACM 13: 377–387.
Conesa, A., S. Götz, J. M. García-Gómez, J. Terol, M. Talón & M. Robles. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674–3676.
Cozens, S. 2000. Beginning Perl. 1st ed. Wrox Press (http://www.perl.org/books/beginningperl/).
Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32: 1792–1797.
Eitner, K., U. Koch, T. Gawȩda & J. Marciniak. 2010. Statistical distribution of amino acid sequences: a proof of Darwinian evolution. Bioinformatics 26: 2933–2935.
Ewing, B. & P. Green. 1998. Base calling of automated sequencer traces using Phred II: error probabilities. Genome Research 8: 186–194.
Ewing, B., L. Hillier, M. C. Wendl & P. Green. 1998. Base calling of automated sequencer traces using Phred I: accuracy assessment. Genome Research 8: 175–185.
Hall, G. S. & D. P. Little. 2007. Relative quantitation of virus population size in mixed genotype infections using sequencing chromatograms. Journal of Virological Methods 146: 22–28.
Katoh, K., K. Misawa, K. Kuma & T. Miyata. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30: 3059–3066.
Lassmann, T. & E. L. Sonnhammer. 2005. Kalign – an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6: 298.
Pertsemlidis, A. & J. W. Fondon III. 2001. Having a BLAST with bioinformatics (and avoiding BLASTphemy). Genome Biology 2: 1–10.
Phillips, A., D. Janies & W. Wheeler. 2000. Multiple sequence alignment in phylogenetic analysis. Molecular Phylogenetics and Evolution 16: 317–330.
Schuler, G. D. 1997. Sequence mapping by electronic PCR. Genome Research 7: 541–550.
Simpson, J. T., K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones & İ. Birol. 2009. ABySS: a parallel assembler for short read sequence data. Genome Research 19: 1117–1123.
Sobell, M. G. 2013. A practical guide to LINUX commands, editors, and shell programming. 3rd ed. Prentice Hall, Upper Saddle River.
Warren, R. L., G. G. Sutton, S. J. M. Jones & R. A. Holt. 2007. Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23: 500–501.
Wu, S. & U. Manber. 1992. Fast text searching: allowing errors. Communications of the ACM 35: 83–91.

Grading

laboratory exercises (1 per week, 2% each, 30% total)
1 take-home midterm exam (20%)
1 take-home final exam (20%)
1 term project (5% project proposal, 15% written, 10% oral presentation)

Exam questions will be based on the laboratory exercises, so it is very important to complete them. Assignments are due at the beginning of class on the date specified. No late assignments will be accepted.

Term project

The term project is an attempt to reproduce a peer reviewed bioinformatics publication that is no more than 10 years old and for which the data and software are available to you. There are three components:

1. A project proposal consisting of a one page outline that describes the data and the analyses that will be conducted (due October 6). Please include a copy of the publication with your proposal.
2. An oral presentation, with slides, describing the original publication, data, and analyses, followed by a description of your attempts to reproduce the original results (in class December 15). Consideration should be given to alternative analyses that may be more appropriate for the aims of the publication and data.
3. An 8–16 page written version of the oral presentation (due December 19).
Course schedule

WEEK 1 LECTURE (SEPTEMBER 3). Overview of grading, exams, and other logistics; bioinformatics defined; and overview of LINUX systems and distributions. Readings: Arbuthnott (1710); Eitner et al. (2010); Sobell (2013: chapters 1 & 2).
WEEK 1 LABORATORY (SEPTEMBER 3). Installing Ubuntu LINUX 14.04.

WEEK 2 LECTURE (SEPTEMBER 8 & 10). BASH shell, software installation, moving data, files, and streams. Readings: Sobell (2013: chapters 4, 8, & 17).
WEEK 2 LABORATORY (SEPTEMBER 8 & 10). Basic BASH (cd, ls, pwd, <tab>, apropos, man, find, less, mkdir, file, and PATH), file permissions (chmod, chown, and sudo), installing software (apt-get, gzip, tar, and make), and moving data (cp, mv, ssh, sftp, wget, and rm).

WEEK 3 LECTURE (SEPTEMBER 15 & 17; September 17 location TBA). The power of command line text tools, pipes, and job control. Readings: Sobell (2013: chapters 3, 5, & 14).
WEEK 3 LABORATORY (SEPTEMBER 15 & 17; September 17 location TBA). Basic UNIX text tools (grep, awk, tr, sort, uniq, sed, wc, cat, head, tail, split, join, diff, and tre-agrep), pipes, and redirects.

WEEK 4 LECTURE (SEPTEMBER 22 & 24). An overview of database types; the structure of relational databases; and relational database table and field structure. Readings: Codd (1970).
WEEK 4 LABORATORY (SEPTEMBER 22 & 24). Job control in a single user environment (&, ./, nice, nohup, top, ps, and scripts) and a high performance computing environment (qhost, qsub, qstat, and qdel).

WEEK 5 LECTURE (SEPTEMBER 29 & OCTOBER 1; time and location TBA for both dates). SQL queries of relational databases. Readings: Sobell (2013: chapter 13).
WEEK 5 LABORATORY (SEPTEMBER 29 & OCTOBER 1; time and location TBA for both dates). Manual database queries.

WEEK 6 LECTURE (OCTOBER 6 & 8). Efficient SQL queries of relational databases. Readings: the MySQL manual (http://dev.mysql.com/doc/refman/5.5/en/). Term project proposal due October 6.
WEEK 6 LABORATORY (OCTOBER 6 & 8). MySQL (CREATE, SELECT, INSERT, UPDATE, DELETE, LIKE, DISTINCT, and mysqlimport).

WEEK 7 LECTURE (OCTOBER 15). Intermediate SQL queries of relational databases. Readings: the MySQL manual (http://dev.mysql.com/doc/refman/5.5/en/).
WEEK 7 LABORATORY (OCTOBER 15). MySQL (AS, JOIN).

WEEK 8 LECTURE (OCTOBER 20 & 22). Text editors, basic PERL data structures, and PERL operators. Readings: Cozens (2000: chapters 1, 2, & 9); Sobell (2013: chapter 11).
WEEK 8 LABORATORY (OCTOBER 20 & 22). MySQL (CONCAT, JOIN, ORDER, COUNT, GROUP, DROP, and mysqldump).

WEEK 9 LECTURE (OCTOBER 27 & 29). PERL regexp, arrays, and hashes. Readings: Cozens (2000: chapters 3, 5, & 6; Appendix A). Take-home midterm exam distributed October 22.
WEEK 9 LABORATORY (OCTOBER 27 & 29). PERL (open, close, unlink, qx, print, m, s, tr, reverse, split, and join).

WEEK 10 LECTURE (NOVEMBER 3 & 5). PERL conditionals (if), loops (for and while), and CPAN. Readings: Cozens (2000: chapters 4, 7, & 13; Appendix C). Take-home midterm exam due October 29.
WEEK 10 LABORATORY (NOVEMBER 3 & 5). The PERL and MySQL interface.

WEEK 11 LECTURE (NOVEMBER 10 & 12). PERL (my and sub) and CGI programming. Readings: Cozens (2000: chapters 8 & 12).
WEEK 11 LABORATORY (NOVEMBER 10 & 12). PERL and SQL CGI programming.

WEEK 12 LECTURE (NOVEMBER 17 & 19). DNA/RNA/protein sequence searches, open reading frame identification, and GO. Readings: Altschul et al. (1990); Ashburner et al. (2000); Conesa et al. (2005); Pertsemlidis & Fondon III (2001); Schuler (1997); Wu & Manber (1992).
WEEK 12 LABORATORY (NOVEMBER 17 & 19). BLAST, tre-agrep, e-pcr, and BLAST2GO.

WEEK 13 LECTURE (NOVEMBER 24 & 26). DNA/RNA/protein sequence alignment. Readings: Abascal et al. (2010); Caporaso et al. (2010); Edgar (2004); Katoh et al. (2002); Lassmann & Sonnhammer (2005); Phillips et al. (2000).
WEEK 13 LABORATORY (NOVEMBER 24 & 26). BLAST, MUSCLE, MAFFT, KALIGN, translatorx, and PYNAST.

WEEK 14 LECTURE (DECEMBER 1 & 3). DNA sequence processing, assembly, and quantitative sequencing. Readings: Ewing et al. (1998); Ewing & Green (1998); Hall & Little (2007); Simpson et al. (2009); Warren et al. (2007).
WEEK 14 LABORATORY (DECEMBER 1 & 3). PHRED, PHRAP, polysnp, ABySS, and SSAKE.

WEEK 15 LECTURE (DECEMBER 8 & 10). Extraction of data from images.
WEEK 15 LABORATORY (DECEMBER 8 & 10). ImageMagick and Fiji.

WEEK 16 LECTURE & LABORATORY (DECEMBER 15). Term project presentations. Take-home final exam distributed December 15, due December 23. Term project due December 19.