Databases indexation

Laurent Falquet, Basel, October 2006
Swiss Institute of Bioinformatics, Swiss EMBnet node

Overview
  Data access concept
    sequential
    direct
  Indexing
    EMBOSS
    Fetch
    Other
  BLAST
    Why indexing?
    formatdb
    Parsing output
  Excel import/export
    Tab delimited
    Comma delimited

Why indexing?
  Human tendency to classify and group
  Examples:
    Dictionary
    Book
    Library
    DVD chapters
    iPod playlists
  Advantages:
    Fast access
    Easy data finding
  Disadvantages:
    Time to prepare indices

Data access: sequential vs direct
  Sequential access
    Access time varies from very short to very long
  Direct access
    Very small variations in access time
  [Figure: disk layout with track, sector and head]

Similar concept for databases
  Flat files = sequential
  Indexing = simulated direct

  Flat file:
    >seq1
    cgatgtcatgtg
    >seq2
    cgatcgtagctgtagctgtag
    >seq3
    catgtgcatgcgacgt

  Index:
    ID     Position (byte)   Length (byte)
    seq1   0                 19
    seq2   19                28
    seq3   47                23

Tools
  EMBOSS
    dbxflat
    dbxfasta
    dbiblast
    seqret
    seqretsplit
    entret

Other examples
  SRS (Icarus language)
    http://srs.ebi.ac.uk
    http://www.lionbioscience.com/
  indexer & fetch (warning: local SIB tool)
  Relational (MySQL, Oracle)
  Web (Google!!)
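The position/length table for the three-sequence flat file can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the way EMBOSS or SRS build their indices: the function names build_index and fetch are hypothetical, and the flat file is held in memory (io.BytesIO) so the example is self-contained. It records the byte offset and byte length of each FASTA entry, then uses seek() to jump straight to one entry instead of reading the file from the top.

```python
import io

def build_index(handle):
    """Scan a FASTA file (opened in binary mode) once and return
    {sequence ID: (byte position, byte length)}."""
    index = {}
    current_id = None
    start = 0
    offset = 0
    for line in handle:
        if line.startswith(b">"):
            # A new header closes the previous entry
            if current_id is not None:
                index[current_id] = (start, offset - start)
            current_id = line[1:].split()[0].decode("ascii")
            start = offset
        offset += len(line)
    # Close the last entry at end of file
    if current_id is not None:
        index[current_id] = (start, offset - start)
    return index

def fetch(handle, index, seq_id):
    """Simulated direct access: jump to the entry and read only its bytes."""
    start, length = index[seq_id]
    handle.seek(start)
    return handle.read(length).decode("ascii")

# The flat file from the slide, kept in memory for the example
FASTA = (b">seq1\ncgatgtcatgtg\n"
         b">seq2\ncgatcgtagctgtagctgtag\n"
         b">seq3\ncatgtgcatgcgacgt\n")

flat_file = io.BytesIO(FASTA)
index = build_index(flat_file)
record = fetch(flat_file, index, "seq2")

for seq_id, (position, length) in index.items():
    print(seq_id, position, length)
```

The index costs one sequential pass to prepare, which is the "time to prepare indices" disadvantage; every later lookup is a single seek and read.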

EMBOSS: how to index?
  Where is your file?
  What is the format?
  Where should the indices be?
  Where is the emboss.default file? (.embossrc)
  Other EMBOSS tools
    textsearch
    whichdb
  More details
    www.emboss.org

EMBOSS example
  Input file and directory
    ~/embossidx/ecoli.dat
    cd embossidx
  Index creation
    dbxflat -idformat swiss -dbname ecoli -filenames '*.dat' -dbresource swiss -directory . -release 1.0 -date 26/09/06 -fields id,acc
  Generates 5 files (default)
    ECOLI.ent
    ECOLI.pxac
    ECOLI.pxid
    ECOLI.xac
    ECOLI.xid
  Don't forget to modify ~/.embossrc

.embossrc
    set emboss_filter 1
    # Ecoli
    DB ecoli [
      type: P
      comment: "e.coli proteome"
      method: emboss
      format: swiss
      dir: "{path}/embossidx"
      file: "ecoli.dat"
      release: "1.0"
      indexdir: "{path}/embossidx"
    ]
  Where {path} is the path to your home directory
  Example of queries
    seqret ecoli:thio_ecoli
    seqret ecoli:p00274
    entret ecoli:thio_ecoli
  and even
    seqret ecoli:*_ecoli

Indexer & fetch
  Warning: this is a local SIB tool!!
  Input file and directory
    ~/embossidx/ecoli.dat
    cd embossidx
  Index creation
    indexer -h '^ID' -t '^//' -i -p '^ID\s+(\S+)' ECOLI.dat ecoli.idx
  Generates 1 file
    ecoli.idx
  Don't forget to modify the config file

Config file: fetch.conf
  fetch.conf
    #dbkey   format   indexfile                datafile
    ecoli    sp       ~/embossidx/ecoli.idx    ~/embossidx/ecoli.dat
  Example of queries
    fetch -c fetch.conf ecoli:thio_ecoli
    fetch -c fetch.conf -f ecoli:thio_ecoli[20..50]

BLAST
  Maintained at NCBI
  Source distributed freely with several accessory tools
    ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
  May require compilation to install on your local computer
  blastall contains
    blastp
    blastn
    blastx
    tblastn
    tblastx
  Other tools
    blastpgp
    megablast
    formatdb

Available Blast programs
  Program   Query                      Database
  blastp    protein                VS  protein
  blastn    nucleotide             VS  nucleotide
  blastx    nucleotide (translated) VS  protein
  tblastn   protein                VS  nucleotide (translated)
  tblastx   nucleotide (translated) VS  nucleotide (translated)

What makes BLAST so fast?
  Indexing all words of 3 aa or 11 bp in the sequence database
  Searching the query for all words with a score > T
  Searching the indexed database for all perfect matches
  Trying to align matches that are on the same diagonal

Indexing for Blast (1)
  A substitution matrix is used to compute the word scores
  Query words: REL, LKP
  List of all possible words with 3 amino acid residues (8000):
    AAA, AAC, AAD, ... YYY
  Words with score < T are discarded
  List of words matching the query with a score > T:
    LKP, ACT, TVF, ...

Indexing for Blast (2)
  Database sequences, indexed by word:
    ACT, ... TVF
  List of words matching the query with a score > T
  Search for exact matches (here ACT, TVF)
  List of sequences containing words similar to the query (hits)
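The two slides above (score every possible 3-residue word against the query, keep those above T, then look them up as exact matches in the word-indexed database) can be sketched as follows. This is an illustrative Python sketch, not the NCBI implementation. The scoring values (5 for identity, 2 for a handful of "similar" pairs, -2 otherwise) are an assumption standing in for a real substitution matrix, and the database sequences, the threshold and all function names are made up for the example.

```python
from collections import defaultdict
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WORD_SIZE = 3

# Toy scores, NOT a real substitution matrix (assumption for illustration)
SIMILAR = {frozenset(p) for p in ("IL", "IV", "LV", "KR", "DE", "ST", "FY")}

def pair_score(a, b):
    if a == b:
        return 5
    return 2 if frozenset((a, b)) in SIMILAR else -2

def word_score(w1, w2):
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def neighborhood(query, threshold):
    """Map each of the 8000 possible words that scores > threshold
    against some query word to the query positions it matches."""
    all_words = ["".join(p) for p in product(AMINO_ACIDS, repeat=WORD_SIZE)]
    neighbors = defaultdict(list)
    for qpos in range(len(query) - WORD_SIZE + 1):
        qword = query[qpos:qpos + WORD_SIZE]
        for word in all_words:
            if word_score(qword, word) > threshold:
                neighbors[word].append(qpos)
    return neighbors

def index_database(sequences):
    """Index every word of the database: word -> [(sequence ID, position)]."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - WORD_SIZE + 1):
            index[seq[pos:pos + WORD_SIZE]].append((seq_id, pos))
    return index

def find_hits(query, db_index, threshold):
    """Exact lookup of each neighborhood word in the database index."""
    hits = []
    for word, qpositions in neighborhood(query, threshold).items():
        for seq_id, dpos in db_index.get(word, []):
            for qpos in qpositions:
                hits.append((seq_id, qpos, dpos, word))
    return sorted(hits)

# Hypothetical database and query
database = {"db1": "MACTRELLKP", "db2": "GGTVFGG", "db3": "RDLKPW"}
db_index = index_database(database)
hits = find_hits("RELLKP", db_index, threshold=11)

for seq_id, qpos, dpos, word in hits:
    print(seq_id, "query pos", qpos, "db pos", dpos, word)
```

With these toy scores the word RDL in db3 is reported as a hit for the query word REL even though it is not identical, which is the point of scoring against a threshold rather than requiring identity. The database lookup itself is always an exact dictionary match.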

Indexing for Blast (3)
  [Figure: query plotted against database sequence; A marks the distance between two hits]
  Ungapped extension if:
    2 "hits" are on the same diagonal but at a distance less than A
  [Figure: query plotted against database sequence; extension region]
  Extension using dynamic programming
    limited to a restricted region
    limited through a score drop-off threshold

BLAST indexing with formatdb
  Formatdb
    mydb.seq must contain sequences in FASTA format
    formatdb -i mydb.seq -p T -n mydb
  Generates 3 files
    mydb.psq
    mydb.pin
    mydb.phr
  Then start a Blast:
    blastall -p blastp -d mydb -i myseq (-optional parameters)
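Returning to the two-hit condition of "Indexing for Blast (3)": two hits lie on the same diagonal when the difference between database position and query position is the same for both. The sketch below, in Python, groups hits by that difference and keeps neighbouring pairs closer than A. It is a simplified illustration under stated assumptions: the function name two_hit_seeds, the hit coordinates and the value of A are hypothetical, and details of the real program (such as how overlapping hits are handled) are ignored.

```python
from collections import defaultdict

def two_hit_seeds(hits, max_distance):
    """hits: iterable of (query position, database position).
    Return (diagonal, first query pos, second query pos) for each pair of
    neighbouring hits on one diagonal that are less than max_distance apart."""
    by_diagonal = defaultdict(list)
    for qpos, dpos in hits:
        # Same diagonal <=> same offset between database and query position
        by_diagonal[dpos - qpos].append(qpos)

    seeds = []
    for diagonal, qpositions in by_diagonal.items():
        qpositions.sort()
        for first, second in zip(qpositions, qpositions[1:]):
            if 0 < second - first < max_distance:
                seeds.append((diagonal, first, second))
    return sorted(seeds)

# Hypothetical hits: four on diagonal 4, one isolated hit on diagonal 20
hits = [(0, 4), (3, 7), (10, 30), (12, 16), (40, 44)]
seeds = two_hit_seeds(hits, max_distance=20)
print(seeds)
```

Only the pairs returned here would trigger an ungapped extension; the isolated hit on diagonal 20 and the distant hit at query position 40 are ignored, which saves most of the alignment work.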

Blast local vs remote
  blastall
    Executed locally
    Slow
    No need to transfer the db
  blastall.remote
    Executed remotely
    Fast
    Requires special privileges and db transfer
  Using BioPerl (remoteblast.pm)
    Blast at NCBI
    No user db
    See www.bioperl.org

Multiple Blasts?
  1 seq vs db seq
    1 FASTA seq as input
  db seq vs db seq
    Several single FASTA seq files as input
    or 1 multiple FASTA seq file as input
  Possibility to export results as XML
  Use Perl to automate the queries and parse the output

Parsing Blast output

    BLASTP 2.2.10 [Oct-19-2004]

    Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
    Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
    "Gapped BLAST and PSI-BLAST: a new generation of protein database search
    programs", Nucleic Acids Res. 25:3389-3402.

    Query= ACCA_BACSU O34847 Acetyl-coenzyme A carboxylase carboxyl
    transferase subunit alpha (EC 6.4.1.2).
             (325 letters)

    Database: ecoli_blast
               4339 sequences; 1,373,039 total letters

    Searching...done

                                                                     Score    E
    Sequences producing significant alignments:                      (bits) Value

    ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl transfe...   266   1e-72

Parsing Blast output (2)

    >ACCA_ECOLI P30867 Acetyl-coenzyme A carboxylase carboxyl
    transferase subunit alpha (EC 6.4.1.2).
              Length = 318

     Score = 266 bits (681), Expect = 1e-72
     Identities = 143/312 (45%), Positives = 188/312 (60%), Gaps = 3/312 (0%)

    Query: 5   LEFEKPVIELQTKIAELKKFTQDS---DMDLSAEIERLEDRLAKLQDDIYKNLKPWDRVQ 61
               L+FE+P+ EL+ KI  L   ++     D+++  E+ RL ++  +L   I+ +L  W   Q
    Sbjct: 5   LDFEQPIAELEAKIDSLTAVSRQDEKLDINIDEEVHRLREKSVELTRKIFADLGAWQIAQ 64

    Query: 62  IARLADRPTTLDYIEHLFTDFFECHGDRAYGDDEAIVGGIAKFHGLPVTVIGHQRGKDTK 121
               +AR   RP TLDY+   F +F E  GDRAY DD+AIVGGIA+  G PV +IGHQ+G++TK
    Sbjct: 65  LARHPQRPYTLDYVRLAFDEFDELAGDRAYADDKAIVGGIARLDGRPVMIIGHQKGRETK 124

    Query: 122 ENLVRNFGMPHPEGYRKALRLMKQADKFNRPIICFIDTKGAYPGRAAEERGQSEAIAKNL 181
               E + RNFGMP PEGYRKALRLM+ A++F  PII FIDT GAYPG  AEERGQSEAIA+NL
    Sbjct: 125 EKIRRNFGMPAPEGYRKALRLMQMAERFKMPIITFIDTPGAYPGVGAEERGQSEAIARNL 184

    Query: 182 FEMAGLRVPXXXXXXXXXXXXXXXXXXXXXXXHMLENSTYSVISPEGAAALLWKDSSLAK 241
                EM+ L VP                       +ML+ STYSVISPEG A++LWK +  A
    Sbjct: 185 REMSRLGVPVVCTVIGEGGSGGALAIGVGDKVNMLQYSTYSVISPEGCASILWKSADKAP 244

    Query: 242 KAAETMKITAPDLKELGIIDHMIKEVKGGAHHDVKLQASYMDXXXXXXXXXXXXXXXXXX 301
                AAE M I AP LKEL +ID +I E  GGAH + +  A+ +
    Sbjct: 245 LAAEAMGIIAPRLKELKLIDSIIPEPLGGAHRNPEAMAASLKAQLLADLADLDVLSTEDL 304

    Query: 302 VQQRYEKYKAIG 313
                 +RY++  + G
    Sbjct: 305 KNRRYQRLMSYG 316

Parsing Blast output (3)
  With BioPerl:

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Bio::SearchIO;

    # Open the BLAST report given as the first command-line argument
    my $blast_report = Bio::SearchIO->new(-format => 'blast', -file => $ARGV[0]);

    # Column titles
    print "Query name:\tquery description:\thitname:\thitdescription:\te-value\tscore\n";

    # One result per query in the report
    while (my $result = $blast_report->next_result) {
        print $result->query_name(), "\t", $result->query_description(), "\n";
        # One hit per matching database sequence
        while (my $hit = $result->next_hit()) {
            print "\t\t", $hit->name(), "\t", $hit->description();
            # One HSP (high-scoring segment pair) per local alignment
            while (my $hsp = $hit->next_hsp()) {
                print "\t", $hsp->evalue(), "\t", $hsp->score();
            }
            print "\n";
        }
    }
    exit 0;

MS-Excel import/export
  Excel can import
    Tab delimited
    Comma delimited
  Excel can export
    Tab delimited
    Space delimited

  AC/ID        desc                           score   e-value
  THIO_ECOLI   thioredoxin Escherichia coli   234     2.1e-5
  THIO_HUMAN   thioredoxin Homo sapiens       120     0.001

MS-Excel import/export
  Tab delimited file:
    \t delimits the columns
    \n delimits the lines
    Optional first line contains the column titles
  Example:
    AC/ID\tdesc\tscore\te-value\n
    THIO_ECOLI\tthioredoxin Escherichia coli\t234\t2.1e-5\n
    THIO_HUMAN\tthioredoxin Homo sapiens\t120\t0.001\n

MS-Excel import/export
  Comma delimited file:
    , delimits the columns, each value is surrounded by " "
    \n delimits the lines
    Optional first line contains the column titles
  Example:
    "AC/ID", "desc", "score", "e-value" \n
    "THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5" \n
    "THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001" \n
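Both file layouts can be written and read back with a standard library, which avoids hand-building the delimiters. The sketch below uses Python's csv module on the thioredoxin table from the slides. The helper names to_delimited and from_delimited are hypothetical, the data are kept in memory rather than written to disk, and quoting every value in the comma-delimited case is an assumption about what the spreadsheet expects.

```python
import csv
import io

HEADER = ["AC/ID", "desc", "score", "e-value"]
ROWS = [
    ["THIO_ECOLI", "thioredoxin Escherichia coli", "234", "2.1e-5"],
    ["THIO_HUMAN", "thioredoxin Homo sapiens", "120", "0.001"],
]

def to_delimited(header, rows, delimiter, quote_all=False):
    """Serialise a table as text: one line per row, columns split by delimiter."""
    buffer = io.StringIO()
    quoting = csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL
    writer = csv.writer(buffer, delimiter=delimiter, quoting=quoting,
                        lineterminator="\n")
    writer.writerow(header)   # optional first line: column titles
    writer.writerows(rows)
    return buffer.getvalue()

def from_delimited(text, delimiter):
    """Parse delimited text back into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

tab_text = to_delimited(HEADER, ROWS, "\t")
csv_text = to_delimited(HEADER, ROWS, ",", quote_all=True)

print(tab_text)
print(csv_text)
```

Values are kept as strings on purpose, so that an e-value such as 2.1e-5 is written exactly as it appeared in the BLAST report; conversion to numbers is left to the spreadsheet.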