BIOL 75302 (phytoinformatics)



Dr. Damon P. Little
City University of New York, Lehman College & The New York Botanical Garden
dlittle@nybg.org; 718-817-8521
http://www.nybg.org/files/scientists/dlittle/phytoinformatics.html
[office hours by appointment]

Mondays & Wednesdays 2:00–5:00 PM
Pfizer conference room, The New York Botanical Garden

Objectives

This course will provide students of plant organismal biology with the computational tools needed to process and extract data from text and image files. Topics include basic UNIX command line tools, relational database structure, introductory Structured Query Language (SQL), and introductory AWK and PERL programming. Techniques for querying and managing DNA sequence databases will also be covered. By the end of the course you should be:

1. comfortable using the BASH command line interface
2. able to extract and manipulate data in text files/streams using text processing tools and pipes
3. able to run programs in batch mode in a single user environment as well as a high performance computing environment
4. able to write basic SQL queries for MySQL
5. able to design a relational MySQL database
6. able to write basic AWK and PERL scripts
7. able to assemble sequencing reads into useful contigs
8. able to conduct basic sequence analyses including similarity and feature searches
9. able to extract data from images
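As a rough illustration of the second objective (manipulating text streams with text processing tools and pipes), the sketch below counts records per genus in a small tab-delimited stream. The function names `sample_records` and `count_by_genus` and the specimen records are hypothetical, invented for this example; they are not course material.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: one specimen per line as "genus<TAB>species".
# These records are made up for illustration only.
sample_records() {
  printf 'Quercus\talba\n'
  printf 'Pinus\tstrobus\n'
  printf 'Quercus\trubra\n'
  printf 'Acer\trubrum\n'
  printf 'Quercus\tvelutina\n'
}

# Count records per genus, most frequent first.
# cut extracts column 1, sort groups identical names together,
# uniq -c counts each group, the second sort orders by count
# (ties broken by name), and awk prints "genus<TAB>count".
count_by_genus() {
  cut -f 1 | sort | uniq -c | sort -k1,1nr -k2,2 | awk '{ print $2 "\t" $1 }'
}

sample_records | count_by_genus
```

With the made-up records above, the first output line should be Quercus with a count of 3, followed by Acer and Pinus with a count of 1 each. Each stage reads the previous stage's output, so the same pipeline works unchanged on a file redirected into `count_by_genus`.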

Texts

Abascal, F., R. Zardoya & M. J. Telford. 2010. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Research 38: W7–W13.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers & D. J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.
Arbuthnott, J. 1710. An argument for divine providence, taken from the constant regularity observ'd in the births of both sexes. Philosophical Transactions 27: 186–190.
Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin & G. Sherlock. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25–29.
Caporaso, J. G., K. Bittinger, F. D. Bushman, T. Z. DeSantis, G. L. Andersen & R. Knight. 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26: 266–267.
Codd, E. F. 1970. A relational model of data for large shared data banks. Communications of the ACM 13: 377–387.
Conesa, A., S. Götz, J. M. García-Gómez, J. Terol, M. Talón & M. Robles. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674–3676.
Cozens, S. 2000. Beginning Perl. 1st ed. Wrox Press (http://www.perl.org/books/beginningperl/).
Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32: 1792–1797.
Eitner, K., U. Koch, T. Gawęda & J. Marciniak. 2010. Statistical distribution of amino acid sequences: a proof of Darwinian evolution. Bioinformatics 26: 2933–2935.
Ewing, B. & P. Green. 1998. Base calling of automated sequencer traces using Phred II: error probabilities. Genome Research 8: 186–194.
Ewing, B., L. Hillier, M. C. Wendl & P. Green. 1998. Base calling of automated sequencer traces using Phred I: accuracy assessment. Genome Research 8: 175–185.
Hall, G. S. & D. P. Little. 2007. Relative quantitation of virus population size in mixed genotype infections using sequencing chromatograms. Journal of Virological Methods 146: 22–28.
Katoh, K., K. Misawa, K. Kuma & T. Miyata. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30: 3059–3066.
Lassmann, T. & E. L. Sonnhammer. 2005. Kalign – an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6: 298.
Pertsemlidis, A. & J. W. Fondon III. 2001. Having a BLAST with bioinformatics (and avoiding BLASTphemy). Genome Biology 2: 1–10.
Phillips, A., D. Janies & W. Wheeler. 2000. Multiple sequence alignment in phylogenetic analysis. Molecular Phylogenetics and Evolution 16: 317–330.
Schuler, G. D. 1997. Sequence mapping by electronic PCR. Genome Research 7: 541–550.
Simpson, J. T., K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones & İ. Birol. 2009. ABySS: a parallel assembler for short read sequence data. Genome Research 19: 1117–1123.
Sobell, M. G. 2013. A practical guide to LINUX commands, editors, and shell programming. 3rd ed. Prentice Hall, Upper Saddle River.
Warren, R. L., G. G. Sutton, S. J. M. Jones & R. A. Holt. 2007. Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23: 500–501.
Wu, S. & U. Manber. 1992. Fast text searching: allowing errors. Communications of the ACM 35: 83–91.

Grading

laboratory exercises (1 per week, 2% each, 30% total)
1 take-home midterm exam (20%)
1 take-home final exam (20%)
1 term project (5% project proposal, 15% written, 10% oral presentation)

Exam questions will be based on the laboratory exercises, so it is very important that the laboratory exercises be completed. Assignments are due at the beginning of class on the date specified. No late assignments will be accepted.

Term project

The term project is an attempt to reproduce a peer-reviewed bioinformatics publication that is no more than 10 years old and for which the data and software are available to you. There are three components:

1. A project proposal consisting of a one-page outline that describes the data and the analyses that will be conducted (due October 6). Please include a copy of the publication with your proposal.
2. An oral presentation, with slides, describing the original publication, data, and analyses, followed by a description of your attempts to reproduce the original results (in class December 15). Consideration should be given to alternative analyses that may be more appropriate for the aims of the publication and data.
3. An 8–16 page written version of the oral presentation (due December 19).

Course schedule

WEEK 1 LECTURE (SEPTEMBER 3). Overview of grading, exams, and other logistics; bioinformatics defined; and overview of LINUX systems and distributions. Readings: Arbuthnott (1710); Eitner et al. (2010); Sobell (2013: chapters 1 & 2).
WEEK 1 LABORATORY (SEPTEMBER 3). Installing Ubuntu LINUX 14.04.

WEEK 2 LECTURE (SEPTEMBER 8 & 10). BASH shell, software installation, moving data, files, and streams. Readings: Sobell (2013: chapters 4, 8, & 17).
WEEK 2 LABORATORY (SEPTEMBER 8 & 10). Basic BASH (cd, ls, pwd, <tab>, apropos, man, find, less, mkdir, file, and PATH), file permissions (chmod, chown, and sudo), installing software (apt-get, gzip, tar, and make), and moving data (cp, mv, ssh, sftp, wget, and rm).

WEEK 3 LECTURE (SEPTEMBER 15 & 17 [1]). The power of command line text tools, pipes, and job control. Readings: Sobell (2013: chapters 3, 5, & 14).
WEEK 3 LABORATORY (SEPTEMBER 15 & 17 [1]). Basic UNIX text tools (grep, awk, tr, sort, uniq, sed, wc, cat, head, tail, split, join, diff, and tre-agrep), pipes, and redirects.

WEEK 4 LECTURE (SEPTEMBER 22 & 24). An overview of database types; the structure of relational databases; and relational database table and field structure. Readings: Codd (1970).
WEEK 4 LABORATORY (SEPTEMBER 22 & 24). Job control in a single user environment (&, ./, nice, nohup, top, ps, and scripts) and a high performance computing environment (qhost, qsub, qstat, and qdel).

WEEK 5 LECTURE (SEPTEMBER 29 [2] & OCTOBER 1 [2]). SQL queries of relational databases. Readings: Sobell (2013: chapter 13).
WEEK 5 LABORATORY (SEPTEMBER 29 [2] & OCTOBER 1 [2]). Manual database queries.

WEEK 6 LECTURE (OCTOBER 6 & 8). Efficient SQL queries of relational databases. Readings: the MySQL manual (http://dev.mysql.com/doc/refman/5.5/en/). Term project proposal due October 6.
WEEK 6 LABORATORY (OCTOBER 6 & 8). MySQL (CREATE, SELECT, INSERT, UPDATE, DELETE, LIKE, DISTINCT, and mysqlimport).

WEEK 7 LECTURE (OCTOBER 15). Intermediate SQL queries of relational databases. Readings: the MySQL manual (http://dev.mysql.com/doc/refman/5.5/en/).
WEEK 7 LABORATORY (OCTOBER 15). MySQL (AS, JOIN).

WEEK 8 LECTURE (OCTOBER 20 & 22). Text editors, basic PERL data structures, and PERL operators. Readings: Cozens (2000: chapters 1, 2, & 9); Sobell (2013: chapter 11).
WEEK 8 LABORATORY (OCTOBER 20 & 22). MySQL (CONCAT, JOIN, ORDER, COUNT, GROUP, DROP, and mysqldump).

WEEK 9 LECTURE (OCTOBER 27 & 29). PERL regexp, arrays, and hashes. Readings: Cozens (2000: chapters 3, 5, & 6; Appendix A). Take-home midterm exam distributed October 22.
WEEK 9 LABORATORY (OCTOBER 27 & 29). PERL (open, close, unlink, qx, print, m, s, tr, reverse, split, and join).

WEEK 10 LECTURE (NOVEMBER 3 & 5). PERL conditionals (if), loops (for and while), and CPAN. Readings: Cozens (2000: chapters 4, 7, & 13; Appendix C). Take-home midterm exam due October 29.
WEEK 10 LABORATORY (NOVEMBER 3 & 5). The PERL and MySQL interface.

WEEK 11 LECTURE (NOVEMBER 10 & 12). PERL (my and sub) and CGI programming. Readings: Cozens (2000: chapters 8 & 12).
WEEK 11 LABORATORY (NOVEMBER 10 & 12). PERL and SQL CGI programming.

WEEK 12 LECTURE (NOVEMBER 17 & 19). DNA/RNA/protein sequence searches, open reading frame identification, and GO. Readings: Altschul et al. (1990); Ashburner et al. (2000); Conesa et al. (2005); Pertsemlidis & Fondon III (2001); Schuler (1997); Wu & Manber (1992).
WEEK 12 LABORATORY (NOVEMBER 17 & 19). BLAST, tre-agrep, e-pcr, and BLAST2GO.

WEEK 13 LECTURE (NOVEMBER 24 & 26). DNA/RNA/protein sequence alignment. Readings: Abascal et al. (2010); Caporaso et al. (2010); Edgar (2004); Katoh et al. (2002); Lassmann & Sonnhammer (2005); Phillips et al. (2000).
WEEK 13 LABORATORY (NOVEMBER 24 & 26). BLAST, MUSCLE, MAFFT, KALIGN, translatorx, and PYNAST.

WEEK 14 LECTURE (DECEMBER 1 & 3). DNA sequence processing, assembly, and quantitative sequencing. Readings: Ewing et al. (1998); Ewing & Green (1998); Hall & Little (2007); Simpson et al. (2009); Warren et al. (2007).
WEEK 14 LABORATORY (DECEMBER 1 & 3). PHRED, PHRAP, polysnp, ABySS, and SSAKE.

WEEK 15 LECTURE (DECEMBER 8 & 10). Extraction of data from images.
WEEK 15 LABORATORY (DECEMBER 8 & 10). ImageMagick and Fiji.

WEEK 16 LECTURE & LABORATORY (DECEMBER 15). Term project presentations. Take-home final exam distributed December 15, due December 23. Term project due December 19.

[1] Location TBA
[2] Time and location TBA
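To give a sense of how the AWK material from Week 3 connects to the sequence processing in Weeks 12 through 14, here is a minimal sketch that reports the length of each record in a FASTA stream. The function names `sample_fasta` and `fasta_lengths` and the sequences themselves are hypothetical, invented for illustration; this is not one of the laboratory exercises.

```shell
#!/usr/bin/env bash
# Hypothetical FASTA records; the sequences are invented and
# the first one is deliberately wrapped over two lines.
sample_fasta() {
  cat <<'EOF'
>seq1 hypothetical record
ACGTACGT
ACGT
>seq2 hypothetical record
GGCC
EOF
}

# Print "identifier<TAB>length" for each FASTA record read from stdin.
fasta_lengths() {
  awk '
    /^>/ {                            # a header line starts a new record
      if (id != "") print id "\t" len # flush the previous record, if any
      id = substr($1, 2)              # identifier: first word without ">"
      len = 0
      next
    }
    { len += length($0) }             # add residues from each sequence line
    END { if (id != "") print id "\t" len }
  '
}

sample_fasta | fasta_lengths
```

For the invented input above this should report a length of 12 for seq1 and 4 for seq2. Accumulating the length line by line, rather than assuming one sequence line per record, keeps the sketch correct for wrapped FASTA files.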