Bioinformatics: course introduction



Similar documents
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

A Primer of Genome Science THIRD

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

Introduction to Bioinformatics 3. DNA editing and contig assembly

Teaching Bioinformatics to Undergraduates

Protein Protein Interaction Networks

Integrating Bioinformatics, Medical Sciences and Drug Discovery

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

How many of you have checked out the web site on protein-dna interactions?

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])

Next Generation Sequencing: Technology, Mapping, and Analysis

Bioinformatics Grid - Enabled Tools For Biologists.

CCR Biology - Chapter 9 Practice Test - Summer 2012

Module 1. Sequence Formats and Retrieval. Charles Steward

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Guide for Bioinformatics Project Module 3

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Intro to Bioinformatics

Linear Sequence Analysis. 3-D Structure Analysis

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

Core Facility Genomics

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

On Covert Data Communication Channels Employing DNA Steganography with Application in Massive Data Storage

Translation Study Guide

Biological Sciences Initiative. Human Genome

Biological Sequence Data Formats

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks

Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives

BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS

Dr Alexander Henzing

Human Genome and Human Genome Project. Louxin Zhang

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Pairwise Sequence Alignment

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

Biological Databases and Protein Sequence Analysis

Lab 2/Phylogenetics/September 16, PHYLOGENETICS

Bioinformatics Resources at a Glance

Doctor of Philosophy in Computer Science

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Introduction to bioinformatics

Gene mutation and molecular medicine Chapter 15

Searching Nucleotide Databases

Distributed Data Mining in Discovery Net. Dr. Moustafa Ghanem Department of Computing Imperial College London

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Biology & Big Data. Debasis Mitra Professor, Computer Science, FIT

Human-Mouse Synteny in Functional Genomics Experiment

Genetomic Promototypes

Healthcare Analytics. Aryya Gangopadhyay UMBC

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Basic Concepts of DNA, Proteins, Genes and Genomes

LESSON 9. Analyzing DNA Sequences and DNA Barcoding. Introduction. Learning Objectives

Genomes and SNPs in Malaria and Sickle Cell Anemia

Big Data Healthcare. Fei Wang Associate Professor Department of Computer Science and Engineering School of Engineering University of Connecticut

Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison

Bio-Informatics Lectures. A Short Introduction

Appendix 2 Molecular Biology Core Curriculum. Websites and Other Resources

Final Project Report

THE UNIVERSITY OF MANCHESTER Unit Specification

SNP Essentials The same SNP story

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

T cell Epitope Prediction

An Introduction to Genomics and SAS Scientific Discovery Solutions

AP Biology Essential Knowledge Student Diagnostic

Analyzing A DNA Sequence Chromatogram

Genome Explorer For Comparative Genome Analysis

Big Data Visualization for Genomics. Luca Vezzadini Kairos3D

Learning outcomes. Knowledge and understanding. Competence and skills

Statistics Graduate Courses

BioBoot Camp Genetics

escience and Post-Genome Biomedical Research

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Protein Sequence Analysis - Overview -

1. Introduction Gene regulation Genomics and genome analyses Hidden markov model (HMM)

How To Understand The Science Of Genomics

Sequencing the Human Genome

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

Biomedical Big Data and Precision Medicine

Using MATLAB: Bioinformatics Toolbox for Life Sciences

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

InSyBio BioNets: Utmost efficiency in gene expression data and biological networks analysis

UGENE Quick Start Guide

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

Databases and platforms for data analysis from NGS of MTB

Unraveling protein networks with Power Graph Analysis

Factors for success in big data science

Transcription:

Bioinformatics: course introduction Filip Železný Czech Technical University in Prague Faculty of Electrical Engineering Department of Cybernetics Intelligent Data Analysis lab http://ida.felk.cvut.cz Filip Železný (ČVUT) Bioinformatics - intro 1 / 38

A6M33BIN Purpose of this course: Understand the computational problems in bioinformatics, the available types of data and databases, and the algorithms that solve the problems. Methods/Prerequisities mainly: probability and statistics, algorithms (complexity classes), programming skills also: discrete math topics (graphs, automata), relational databases Lectures by FZ may be held in English (pending your consensus) Purpose of this lecture Sneak informal preview of the major bioinformatics topics Filip Železný (ČVUT) Bioinformatics - intro 2 / 38

Teachers Doc. Filip Železný CTU Prague, Dept. of Cybernetics zelezny@fel.cvut.cz Dr. Jiří Kléma CTU Prague, Dept. of Cybernetics klema@labe.felk.cvut.cz Prof. Zdeněk Sedláček Charles Univ., Dept. of Biology and Medical Genetics Ing. Ondřej Kuželka CTU Prague, Dept. of Cybernetics kuzelon2@fel.cvut.cz Filip Železný (ČVUT) Bioinformatics - intro 3 / 38

Course materials Main page find a6m33bin on department s courseware page http://cw.felk.cvut.cz Course largely based on Mark Craven s bioinformatics class page at UW Wisconsin Contains a lot of links to useful materials in English Links will be also continually added to our CW The only Czech bioinformatics book Fatima Cvrčková: Úvod do praktické bioinformatiky (Academia, 2006) user-oriented, for biologists/medics, not informaticians Filip Železný (ČVUT) Bioinformatics - intro 4 / 38

Bioinformatics Bioinformatics representation storage retrieval analysis of gene- and protein-centric biological data Not just bio databases! Also: computational biology Related: systems biology, structural biology Filip Železný (ČVUT) Bioinformatics - intro 5 / 38

Bioinformatics: Main sources of data Information processes inside each cell which govern the entire organism. Filip Železný (ČVUT) Bioinformatics - intro 6 / 38

Bioinformatics vs. Biomedical Informatics Biomedical informatics includes Bioinformatics but also other fields such as signal analysis image analysis healthcare informatics not usually associated with bioinformatics. Filip Železný (ČVUT) Bioinformatics - intro 7 / 38

Bioinformatics vs. Bio-Inspired Computing Artificial neural networks Swarm intelligence Genetic algorithms DNA computing Also computers + biology but not bioinformatics Filip Železný (ČVUT) Bioinformatics - intro 8 / 38

Bioinformatics vs. Bioinformatics http://www.esoterika.cz/clanek/2992- mimosmyslova spionaz dalkove pozorovani i.htm Podle definičního třídění ruských vědců rozlišujeme dva obory paranormálních jevů: bioinformatika a bioenergetika. Bioinformatika (tzn. mimosmyslové vnímání, ESP) zahrnuje získávání a výměnu informací mimosmyslovou cestou (nikoli normálními smyslovými orgány). V podstatě rozlišujeme následující formy bioinformace: hypnózu (kontrolu vědomí), telepatii, dálkové vnímání, prekognici, retrokognici, mimotělní zkušenost, vidění rukama nebo jinými částmi těla, inspiraci a zjevení. not bioinformatics Filip Železný (ČVUT) Bioinformatics - intro 9 / 38

Bioinformatics: Impact Worldwide Basic biological research Personalized health care Gene-therapy Drug discovery etc. Czech landscape Small community (FEL, VSCHT, MMF,...) High demand (IKEM, IEM, IMB,...) come to see our projects Filip Železný (ČVUT) Bioinformatics - intro 10 / 38

Bioinformatics: origins 1950 s: Fred Sanger deciphers the sequence of letters (amino acids) in the insulin protein 51 letters Filip Železný (ČVUT) Bioinformatics - intro 11 / 38

Bioinformatics: origins 2004: Human Genome (DNA) deciphered billions of letters (nucleic acids) Filip Železný (ČVUT) Bioinformatics - intro 12 / 38

Progress in Sequencing Sequencing: reading the letters in the macromolecules of interest Work continues: population sequencing (not just 1 individual), variation analysis Extinct species (Neandertal genome sequenced in 2010) Filip Železný (ČVUT) Bioinformatics - intro 13 / 38

Shotgun sequencing DNA letters can be read only small sequences Shotgun approach: first shatter DNA into fragments Classical bioinformatics problem: assemble a genome from the read sequence fragments Shortest superstring problem Graph-theoretical formulations (Hamiltonian / Eulerian path finding) Filip Železný (ČVUT) Bioinformatics - intro 14 / 38

Databases Read bio sequences are stored in public databases Main umbrella institutes European Bioinformatics US National Center for Institute (EBI) Biotechnology Information (NCBI) Protein databases: Protein Data Bank (PDB), SWISS-PROT,... Gene databases: EMBL, GenBank, Entrez,... Many more Mutually interlinked Filip Železný (ČVUT) Bioinformatics - intro 15 / 38

Database Retrieval by Similarity Typical biologist s problem: retrieve sequences similar to one I have (protein, DNA fragment,..) Sequence similarity may imply homology (descent from a common ancestor) and similar functions Similarity is tricky: insertions and deletions must be considered Bioinformatics problem: find and score the best possible alignment Dynamic programming, heuristic methods,... Filip Železný (ČVUT) Bioinformatics - intro 16 / 38

Whole Genome Similarity Entire genomes (not just fragments) may be aligned Reveal relatedness between organisms Further complications come into play variations in repeat numbers inversions etc. Filip Železný (ČVUT) Bioinformatics - intro 17 / 38

Inference of Phylogenetic Trees Given a pairwise similarity function, and a set of genomes, infer the optimal phylogenetic tree of the corresponding organisms Application of hierarchical clustering A modern approach to replace phenotype-based taxonomy Filip Železný (ČVUT) Bioinformatics - intro 18 / 38

Multiple Sequence Alignment Aligning more than two sequences Reveal shared evolutionary origins (conserved domains) NP-complete problem (exp time in the number of aligned sequences) Filip Železný (ČVUT) Bioinformatics - intro 19 / 38

Probabilistic Sequence Models specific sites (substrings) on a sequence have specific roles e.g. genes or promoters on DNA, active sites on proteins How to tell them apart? Markov Chain Model Each type of site has a different probabilistic model Filip Železný (ČVUT) Bioinformatics - intro 20 / 38

Protein Spatial Structure From the DNA nucleic-acid sequence, the protein amino-acid sequence is constructed by cell machinery The protein folds into a complex spatial conformation Spatial conformation can be determined at high cost e.g. X-ray crystallography Determined structures are deposited in public protein data bases Filip Železný (ČVUT) Bioinformatics - intro 21 / 38

Protein Structure Prediction Can we compute protein structure from sequence? At least distinguish α-helices from β-sheets Very difficult, not yet solved problem Approches include machine learning Filip Železný (ČVUT) Bioinformatics - intro 22 / 38

Protein Function Prediction Protein function is given by its geometrical conformation E.g., ability to bind to DNA or to other proteins The active site (shown in purple) is most important Important machine-learning tasks: prediction of function from structure detection of active sites within structure Filip Železný (ČVUT) Bioinformatics - intro 23 / 38

Protein Docking Problem Proteins interact by docking Will a protein dock into another protein? Optimization problem in a geometrical setting Important for novel drug discovery e.g: green - receptor, red - drug the trouble is, the protein may dock also in many unwanted receptors immensely hard computational problems under uncertainty Filip Železný (ČVUT) Bioinformatics - intro 24 / 38

Gene Expression Analysis A gene is expressed if the cell produces proteins according to it Rate of expression can be measured for thousands of genes simultaneously by microarrays Can we predict phenotype (e.g. diseases) by gene expression profiling? Filip Železný (ČVUT) Bioinformatics - intro 25 / 38

High-throughput data analysis Gene expression data are called high-troughput since lots of measurements (thousands of genes) are produced in a single experiment Puts biologists in a new, difficult situation: how to interpret such data? Example problems: Too many suspects (genes), multiple hypothesis testing How to spot functional patterns among so many variables? How to construct multi-factorial predictive models? Wide opportunities for novel data analysis methods, incl. machine learning Filip Železný (ČVUT) Bioinformatics - intro 26 / 38

Other high-throughput technologies Methylation arrays (epigenetics) Chip-on-chip (protein X DNA interactions) mass spectrometry (presence of proteins)..and more Filip Železný (ČVUT) Bioinformatics - intro 27 / 38

Genome-wide association studies Correlates traits (e.g. susceptibility to disease) to genetic variations variations : single nucleotide polymorphisms (SNP) in DNA sequence involves a population of people X: SNP s, Y: level of association Filip Železný (ČVUT) Bioinformatics - intro 28 / 38

Gene Regulatory Networks Feedback loops in expression: (a protein coded by) a gene influences the expression of another gene positively (transcription factor) or negatively (inhibitor) Results in extremly complex networks with intricate dynamics Most of regulatory networks are unknown or only partially known. Can we infer such networks from time-stamped gene expression data? Filip Železný (ČVUT) Bioinformatics - intro 29 / 38

Metabolic Networks Capture metabolism (energy processing) in cells Involves gene/proteins but also other molecules Computational problems similar as in gene regulation networks Filip Železný (ČVUT) Bioinformatics - intro 30 / 38

Exploiting Background Knowledge The bioinformatics tasks exemplified so far followed the pattern Data Genomic knowledge A lot of relevant formal (computer-understandable) knowledge available so the equation should be Data + Current Genomic Knowledge New Genomic Knowledge for example: Gene expression data + Known functions of genes Phenotype linked to a gene function But how to represent backround knowledge and use it systematically in data analysis? Important bioinformatics problem Filip Železný (ČVUT) Bioinformatics - intro 31 / 38

Examples of Genomic Background Knowledge scientific abstracts gene ontology interaction networks and many other kinds Filip Železný (ČVUT) Bioinformatics - intro 32 / 38

Bioinformatics at the IDA lab Protein structure analysis with machine learning Prodigy Software Filip Železný (ČVUT) Bioinformatics - intro 33 / 38

Bioinformatics at the IDA lab Gene expression analysis with machine learning Filip Železný (ČVUT) Bioinformatics - intro 34 / 38

Bioinformatics at the IDA lab Gene expression analysis with machine learning Filip Železný (ČVUT) Bioinformatics - intro 35 / 38

Bioinformatics at the IDA lab Applications in medical studies Filip Železný (ČVUT) Bioinformatics - intro 36 / 38

Bioinformatics at the IDA lab If you find this course interesting, you can take part in IDA s research! Filip Železný (ČVUT) Bioinformatics - intro 37 / 38