Bioinformatics Grid-Enabled Tools For Biologists.

Transcription

1 Bioinformatics Grid-Enabled Tools For Biologists.

2 What is Grid-Enabled Tools (GET)? As the volume of data from genomics and proteomics experiments increases, problems arise for current sequence analysis technology, mainly slower processing. With GET, the submitted sequences are cut into batches and distributed to different computers in the cluster for processing. After computation, the results are sent back to the head node, recombined, and made ready for collection by the user. Analyzing data this way reduces the total time required.
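The split, process and recombine cycle described on this slide can be sketched in Python. This is a minimal illustration, not GET's actual code: the function names are hypothetical, the batches are processed one after another in a single process instead of on cluster nodes, and the "analysis" is a placeholder that only measures sequence length.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (header, sequence); a hypothetical record type


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_batches(records: List[Record], n_nodes: int) -> List[List[Record]]:
    """Deal records out to n_nodes batches in turn; empty batches are dropped."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    batches = [records[i::n_nodes] for i in range(n_nodes)]
    return [batch for batch in batches if batch]


def process_batch(batch: List[Record]) -> Dict[str, int]:
    """Placeholder for the real analysis: report each sequence's length."""
    return {header: len(seq) for header, seq in batch}


def recombine(partials: List[Dict[str, int]]) -> Dict[str, int]:
    """Merge per-batch results into one result set, as the head node would."""
    merged: Dict[str, int] = {}
    for partial in partials:
        merged.update(partial)
    return merged


def run(text: str, n_nodes: int) -> Dict[str, int]:
    batches = split_into_batches(parse_fasta(text), n_nodes)
    # Sequential stand-in for work that would run on separate cluster nodes.
    partials = [process_batch(batch) for batch in batches]
    return recombine(partials)
```

Calling `run(fasta_text, 4)` on a multi-sequence FASTA string returns one merged dictionary, regardless of how the records were dealt out.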

3 GET Flowchart: Login. Submit sequence in FASTA format. Choose a tool: GetANNO (Blast), GetEMBOSS (Emboss) or GetMSA (Clustalw & Hmmer). Choose to perform either DNA or Protein analysis and choose your parameters. Results: the result is sent as a zip file; download the zip file.

4 GET Click Here to register

5 Registration Type in your name and password. Then go to your email to activate your account.

6 Login Page Type in your email address and password to log in.

7 GetANNO GetANNO performs annotation: it adds information associated with a particular point in a piece of data, here a sequence. Many proteins are modular in nature, and many contain small conserved regions called motifs, which tend to correspond to core structural and functional elements of the protein. Motifs are surrounded by divergent regions that show a high degree of mutational change among members of the same protein family.

8 GetANNO Protein annotation compares the user input with databases to determine the family of the protein. Computation takes a long time because the databases are large, covering many protein classes and long protein sequences. GetANNO splits the user input into parts and sends them to different computers holding the databases, reducing the time taken to analyze the proteins.

9 GetANNO GetANNO enables users to: - Perform sequence similarity searches against databases such as RefSeq, SwissProt, Pfam and Gene Ontology. - Obtain a description of the results as an Excel spreadsheet.

10 GetANNO Click here to start GetANNO. Type in your title. Choose the type: DNA or Protein. Paste in the sequence. Choose the E-value. Choose the type of matrix. Choose the parameter. Load sequence from file. Start the annotation.

11 GetANNO Parameter There are 4 databases available to BLAST against. There are also parameters for choosing the E-value and scoring matrix. In addition, a check box restricts the results to the top 10 hits.
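The effect of the E-value parameter and the top-10 check box can be sketched as a filter over search hits. This is a hedged illustration only: the `Hit` record, the `select_hits` function and the tie-break on bit score are assumptions made for the example, not GET's or BLAST's actual data structures.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical search hit; lower E-values mean more significant matches."""
    subject_id: str
    e_value: float
    bit_score: float


def select_hits(hits: List[Hit], max_e_value: float,
                top_only: bool, top_n: int = 10) -> List[Hit]:
    """Keep hits at or below the E-value cutoff, best first.

    When top_only is set (the check box), only the top_n best hits are returned.
    """
    kept = [hit for hit in hits if hit.e_value <= max_e_value]
    # Best hit first: smallest E-value, then highest bit score as a tie-break.
    kept.sort(key=lambda hit: (hit.e_value, -hit.bit_score))
    return kept[:top_n] if top_only else kept
```

With the check box off, every hit that passes the cutoff is reported; with it on, the list is cut to ten entries after sorting.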

12 Database There are 4 databases to check against: RefSeq, Gene Ontology, Pfam and SwissProt. All of them are accurate and reliable, since their information is frequently updated.

13 Database RefSeq: provides a comprehensive, integrated and non-redundant set of sequences, including genomic DNA, transcript (RNA) and protein products. Gene Ontology: provides structured, controlled vocabularies and classifications that cover molecular and cellular biology. It is often used in the annotation of genes, gene products and sequences.

14 Database Pfam: a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. SwissProt: provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.

15 GetEMBOSS EMBOSS collectively contains the processes of: * Sequence alignment * Rapid database searching with sequence patterns * Protein motif identification, including domain analysis * Nucleotide sequence pattern analysis * Codon usage analysis for small genomes * Rapid identification of sequence patterns in large scale sequence sets

16 GetEMBOSS GetEMBOSS saves time by splitting up jobs and sending them to different computers in the cluster, which increases the available computational power. GetEMBOSS lets users run several sequence analysis options on a batch of submitted sequences.

17 GetEMBOSS Click here to start GetEMBOSS. Type in your title. Paste your FASTA sequence. Choose the type of analysis and parameter. Load sequence from file. Click here to start analysis.

18 GetEMBOSS Parameter Finds and extracts open reading frames. Picks PCR primers and hybridization oligos. Finds restriction enzyme cleavage sites. Translates nucleic acid sequences. Predicts protein secondary structure. Calculates protein statistics. Calculates the isoelectric point of a protein. Predicts transmembrane proteins. Predicts coiled-coil regions.

19 GetMSA Multiple Sequence Alignment compares multiple DNA or amino acid sequences and aligns them to highlight their similarities. GetMSA helps to shorten the computation time needed. It allows users to align multiple sequences for comparison and to select further analysis options, predicting secondary structure and finding domains, for regions of interest.

20 GetMSA Click here to start GetMSA. Type in your title. Choose DNA or Protein sequence. Pairwise Alignment options. Multiple Alignment options. Type in sequence. Load sequence from file. Click here to start analysis.

21 Search History The Search History page stores the data from past analyses. Results of submitted jobs are found here.

22 Search History Click here to view the results and search history. Click here to view the sequence you entered and the result of the analysis.

23 Our Project Plans Original Plan [Diagram: Users, NGO, BII, TP, LSF, SGE, Database] This system has limited capacity. Information in transit often collides, since transmission uses a single line.

24 Linux Virtual Server (LVS) The Linux Virtual Server, or LVS, is a piece of software that is used to balance loads on clusters. The architecture of the whole cluster is transparent to the end user, thus the LVS cluster acts as a single high performance virtual server. LVS is commonly used to build highly scalable services on the internet such as HTTP, FTP, VoIP and so on.

25 Linux Virtual Server (LVS)

26 How LVS Works [Diagram: User -> Internet -> Load Balancer -> LAN/WAN -> Real Servers]

27 How LVS Works LVS works by having a load balancer connected to a cluster. The real servers and the load balancer may be interconnected by either a high-speed LAN or a geographically dispersed WAN. The load balancer dispatches requests to the different servers and makes the parallel services of the cluster appear as a single virtual service on one IP address. Request dispatching can use IP load balancing technologies or application-level load balancing technologies.
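Request dispatching can be illustrated with round-robin scheduling, one of the scheduling methods LVS offers. The sketch below is plain Python written for this example, not LVS code or configuration; the class name and the server labels are hypothetical.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinBalancer:
    """Toy load balancer: hands each request to the next real server in turn."""

    def __init__(self, real_servers: Iterable[str]) -> None:
        servers = list(real_servers)
        if not servers:
            raise ValueError("at least one real server is required")
        self._servers = cycle(servers)

    def dispatch(self, request: object) -> str:
        """Return the real server chosen for this request."""
        return next(self._servers)


balancer = RoundRobinBalancer(["rs1", "rs2", "rs3"])
chosen = [balancer.dispatch(n) for n in range(5)]
```

Clients only ever address the balancer, which is what lets the cluster appear as one virtual service.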

28 How LVS Works Scalability of the system is achieved by transparently adding or removing nodes in the cluster. High availability is provided by detecting node or daemon failures and reconfiguring the system appropriately. Thus, the service continues to function even if one real server is taken down for maintenance. A backup load balancer can be connected to the network to take over if the primary load balancer goes down for maintenance or because of a service failure.
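The availability behaviour on this slide can be sketched as a server pool that skips real servers marked as down. This is an illustration under stated assumptions: the class and method names are hypothetical, and failure detection itself is left to the caller rather than modelled here.

```python
from typing import Iterable, List, Set


class FailoverPool:
    """Toy pool: rotates through real servers, skipping any marked as down."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.healthy: Set[str] = set(self.servers)
        self._next = 0

    def mark_down(self, server: str) -> None:
        """Record a detected failure or a maintenance shutdown."""
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        """Return a known server to service."""
        if server in self.servers:
            self.healthy.add(server)

    def dispatch(self) -> str:
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no real server available")
```

With `rs2` marked down in a three-server pool, requests alternate between `rs1` and `rs3` until `mark_up("rs2")` is called.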

29 How LVS Works

30 How LVS Works LVS can handle more than 1 million simultaneous connections. Each connection uses 128 bytes of memory, so a computer with 1 gigabyte of memory can handle more than 8 million simultaneous connections. LVS can also produce statistics for each real server (number of connections, packets, bytes and so on), from which graphs can be created using other software.
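The 8 million figure follows from dividing the memory by the per-connection cost. A quick check, assuming a binary gigabyte (1024^3 bytes) and that all of the memory is available for connection entries:

```python
BYTES_PER_CONNECTION = 128        # from the slide
MEMORY_BYTES = 1 * 1024 ** 3      # assumption: 1 gigabyte = 1024^3 bytes

# Upper bound on connections if all memory held connection entries.
max_connections = MEMORY_BYTES // BYTES_PER_CONNECTION
print(max_connections)  # 8388608
```

In practice the operating system and other data also need memory, so this is an upper bound and not a measured capacity.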

31 Our Project Plans [Diagram: Users, LVS, NGO, BII, TP, Database synchronized] This method uses software known as LVS to act as a router linking all the clusters together. It is more efficient.

32 Conventional Methods vs GET

33 Conventional Blast: Analysis of 394 Sequences. Start. Select Blast parameters. Can only submit 1 query sequence at a time; does not allow file upload. Repeat the same process for the other 393 sequences. Obtain results.

34 GetANNO: Start. The 394 sequences are combined into a single FASTA-format text file. Select Blast parameters. Can submit more than 1 query sequence at a time; allows file upload. Obtain results.
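Combining many query sequences into one FASTA-format text file, as this slide describes, takes only a few lines. The helper below is a hypothetical sketch for illustration; the function name and the default line width of 60 characters are assumptions, not part of GET.

```python
from typing import List, Tuple


def to_multi_fasta(records: List[Tuple[str, str]], width: int = 60) -> str:
    """Join (header, sequence) pairs into one multi-sequence FASTA string."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines: List[str] = []
    for header, sequence in records:
        lines.append(">" + header)
        # Wrap each sequence so that no line exceeds `width` characters.
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string can be written to a single text file and uploaded in one submission.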

35 Conventional Blast vs GetANNO [Chart: Time (hr), GET vs Conventional] For 394 sequences, the normal protein Blast takes about 18 hours, while GetANNO takes only 2 hours.

36 Conventional Emboss: Start. Analysis of 10 sequences. Can only select 1 Emboss program and submit 1 query sequence at a time. Repeat the same process for the other 9 sequences, and again for each other program. Obtain results [results are not compiled].

37 GetEmboss: Start. The 10 sequences are combined into a single FASTA-format file. Select Emboss programs [how many depends on user preference]. Can submit more than 1 query sequence at a time, e.g. all 10 query sequences. Programs run in parallel, e.g. Restrict and Eprimer3. Results are compiled into 1 result text file.

38 Conventional Emboss vs GetEmboss [Chart: Time (mins), GET vs Conventional] For a DNA analysis of 10 sequences with 2 programs, Institute Pasteur Web takes 30 minutes, while GetEmboss takes 2 minutes.
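The timings quoted on slides 35 and 38 translate into speedup factors as follows. The `speedup` helper is a hypothetical name used only for this calculation; the input figures are the ones stated on the slides.

```python
def speedup(conventional_time: float, get_time: float) -> float:
    """Ratio of conventional run time to GET run time (same units for both)."""
    if get_time <= 0:
        raise ValueError("get_time must be positive")
    return conventional_time / get_time


blast_speedup = speedup(18, 2)    # hours, 394-sequence protein Blast
emboss_speedup = speedup(30, 2)   # minutes, 10 sequences with 2 Emboss programs
print(blast_speedup, emboss_speedup)  # 9.0 15.0
```

These ratios describe the two reported runs only; other workloads may scale differently.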

39 Conventional MSA: Start. Upload a file that contains more than 1 sequence. Choose parameters, e.g. window size, k-tuple. Obtain results [Jalview, alignment, phylogenetic tree] in individual files.

40 GetMSA: Start. Upload a file that contains more than 1 sequence. Choose parameters, e.g. window size, k-tuple. GetMSA gives users the option to build an hmm profile. Obtain results [Jalview, alignment, phylogenetic tree, hmmbuild] in 1 text profile.

41 Conventional MSA vs GetMSA GetMSA offers users the additional option of building an hmm profile for their sequences, saving them an extra step.

42 Why use our program? GET completes a process faster than the conventional method. GET provides multiple options for analysis. It is more user-friendly than the conventional method.

43 Target Audiences Biologists, students, teachers, and anyone who needs information on DNA or Protein sequencing.

44 Summary The Grid Enabled Tools suite is developed for biologists to access computing resources via a user-friendly web interface for high-throughput bioinformatics analysis. It provides a convenient resource for annotation extraction and sequence analysis, and capitalizes on the availability of cluster and grid computing to speed up the process.

45 THANK YOU for listening!