Data Mining: ANALYSIS OF A SINGLE DATASET
- Visualization Techniques
- Combined Graph
- Charts and Pies
- Search for specific functions

2 Data Mining on the DAG
- When working with large datasets, annotation results need to be summarized
- The DAG (directed acyclic graph) provides visualization of annotation data within its biological context
- In Blast2GO: Combined Graph function

3 Combined Graph
- Each term has a number of sequences associated with it
- Node shape differentiates between direct and indirect annotation
- Each term is displayed within its biological context
- Nodes can be coloured to indicate relevance

4 Combined Graph
- Different GO branches
- Reduce nodes by number of annotated sequences
- Node data to be displayed
- Criterion for highlighting and filtering nodes

5 Combined Graph
Let's paint the DAG of the dataset analyzed yesterday (1000 sequences).
Too many nodes! We need a way to find the relevant information.

6 Node Information Content
- Accumulated by node (Sequence Count)
- Incoming information (Node Score)

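7 Node score
We compute a node score that reflects how much annotation information reaches the node: each annotated sequence contributes to the term it is annotated to and to that term's ancestors, down-weighted by a factor α for every step of distance.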

8 Node score
Example with α = 0.6. GO1 has 1 directly annotated sequence and GO2 has 3; dist is the number of steps between the annotated term and the node being scored.
NodeScore(GO1) = 1 * 0.6^0 = 1
NodeScore(GO2) = 3 * 0.6^0 = 3
NodeScore(GO3) = 1 * 0.6^1 + 3 * 0.6^1 = 0.6 + 1.8 = 2.4
NodeScore(GO4) = ... = 2.5
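The node score arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not Blast2GO's implementation: the function name `node_scores`, the use of the shortest path as the distance, and the toy graph (GO1 and GO2 as children of GO3) are assumptions made for the example.

```python
from collections import deque

def node_scores(parents, direct_counts, alpha=0.6):
    """Sum, for every node, count * alpha**dist over the annotated terms below it.

    parents: dict mapping a term to the list of its parent terms (assumed layout).
    direct_counts: dict mapping a term to its number of directly annotated sequences.
    """
    scores = {}
    for term, count in direct_counts.items():
        # Breadth-first walk upwards: shortest distance from `term` to each ancestor.
        dist = {term: 0}
        queue = deque([term])
        while queue:
            node = queue.popleft()
            for parent in parents.get(node, ()):
                if parent not in dist:
                    dist[parent] = dist[node] + 1
                    queue.append(parent)
        # The term contributes to itself (dist 0) and to each ancestor.
        for node, d in dist.items():
            scores[node] = scores.get(node, 0.0) + count * alpha ** d
    return scores

# Toy graph (assumed): GO1 and GO2 are both children of GO3.
parents = {"GO1": ["GO3"], "GO2": ["GO3"]}
direct_counts = {"GO1": 1, "GO2": 3}
scores = node_scores(parents, direct_counts, alpha=0.6)
```

With these inputs the sketch gives 1 for GO1, 3 for GO2 and 2.4 for GO3, the same values as in the worked example.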

9 Node score vs Annotation score
DO NOT MIX THEM UP!
[Figure: a GO graph with node scores (ROOT 2.5, GO1 1, GO2 3) next to one sequence with its Blast hits (hit1, hit2, hit3) and their GO terms]
Annotation Score:
- In annotation context
- Relates to Blast results of ONE sequence
- AS = max{%sim * ECw} + (#TPR_GOs - 1) * GOw
Node Score:
- In data-mining context
- Relates to analysis of a GROUP of sequences

10 Filtered Graph # Filtered Nodes Transition nodes Direct annotations

11 Compacting Graphs by GOSlim

12 Show node content

13 Save as picture and as txt Saving Options

14 Graph Charts

15 Graph Charts
- Sequence Distribution/GO as Bar-Chart
- Sequence Distribution/GO as Level-Pie (level selection)
- Sequence Distribution/GO as Multilevel-Pie (#score or #seq cutoff)

16 Multilevel vs. GO-Slim Chart
- Multi-level Pie with a sequence filter of 20
- GO-Slim: handy to summarize functional content

17 Use DAG to analyze a function
The DAG can be used to make queries on general concepts that have no direct annotations.
How many sequences are annotated to the function photosynthesis?
- Option 1: Find in the GO graph → direct & indirect annotation
- Option 2: Find through the Select function. Two sub-options:
  - Option 2.1: Direct annotation (use GO id or description)
  - Option 2.2: Direct & indirect annotation (use GO id and include GO parents)
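The difference between a direct query and a direct & indirect query can be sketched as follows. This is an illustration only: the term labels, the `children` map and the sequence names are hypothetical toy data, and the helper names are not Blast2GO functions.

```python
def descendants(children, term):
    """Return every term below `term` in the DAG (children, grandchildren, ...)."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def sequences_for(term, children, annotations, include_descendants=True):
    """Sequences annotated to `term`, optionally also via more specific terms."""
    terms = {term}
    if include_descendants:
        terms |= descendants(children, term)
    return {seq for seq, gos in annotations.items() if terms & set(gos)}

# Hypothetical toy data: term labels stand in for GO ids.
children = {"photosynthesis": ["light reaction", "dark reaction"]}
annotations = {
    "seq1": ["photosynthesis"],
    "seq2": ["light reaction"],
    "seq3": ["dark reaction"],
    "seq4": ["sporulation"],
}

direct = sequences_for("photosynthesis", children, annotations, include_descendants=False)
direct_and_indirect = sequences_for("photosynthesis", children, annotations)
```

Here `direct` holds only seq1, while `direct_and_indirect` also picks up seq2 and seq3 through the more specific terms.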

18 Example: analyze a specific function export search Find a function on the graph

19 Example: analyze a specific function
Select all sequences annotated to this function and its descendants

20 Example: analyze a specific function Locate these sequences

21 Example: analyze a specific function
- By exporting the sequence table you can see all sequences annotated to a given function (GO)
- Explore the annotation diversity of a given function within the graph

22 Conclusions
- DAGs are interesting for browsing functional annotation but can be too large
- With filtering and pruning options you can create more navigable DAGs
- Pies are good for compacting information: try out different levels
- GO-Slim compacts the graph into more comparable terms than filtering the full GO does
- You can use the DAG to query on general terms

23 Data Mining: ANALYSIS OF SEVERAL DATASETS
- Functional Enrichment
- Enriched Graphs
- Meta-analysis

24 Enrichment Analysis
Interpretation of a large list of genes: which functions are relevant?
Are these two groups of genes carrying out different biological roles?
- Biosynthesis: 54% in one gene list (A), 18% in the other list (B)
- Sporulation: 18% in list A, 27% in list B
Are these differences statistically significant?

25 Fisher's Exact Test
One gene list (A) versus the other list (B): Biosynthesis 54% vs 18%, Sporulation 18% vs 27%.
Contingency tables:
                   A   B
Biosynthesis       6   2
No biosynthesis    5   9
                   A   B
Sporulation        2   3
No sporulation     9   8
p-value for biosynthesis < 0.05
p-value for sporulation > 0.05
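A two-sided Fisher's exact test on a 2x2 table can be sketched with the standard library alone (`math.comb`, Python 3.8+). The function name and the convention of summing all tables no more probable than the observed one are choices made for this example, not a description of Blast2GO's internals.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    low = max(0, col1 - row2)
    high = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    p_value = sum(prob(x) for x in range(low, high + 1)
                  if prob(x) <= p_observed * tolerance)
    return min(1.0, p_value)

# Counts from the contingency tables: rows = with/without the term, columns = A, B.
p_biosynthesis = fisher_exact_two_sided(6, 2, 5, 9)
p_sporulation = fisher_exact_two_sided(2, 3, 9, 8)
```

On these small illustrative counts the sketch returns about 0.18 for biosynthesis and 1.0 for sporulation, so the "< 0.05" shown for biosynthesis is best read as schematic: with only 11 genes per list, a 54% versus 18% difference does not reach significance.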

26 Multiple testing correction
We do this for every GO term in our dataset. Many tests => many false positives => we need a correction!
- FDR control: a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses.
- FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses when performing multiple tests (more conservative).
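To make the two kinds of correction concrete, here is a short sketch of one common method for each: Bonferroni for FWER control and Benjamini-Hochberg for FDR control. The slides do not say which procedures Blast2GO applies, so these two are chosen purely for illustration, and the p-values are made up.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values, one per tested GO term.
pvalues = [0.01, 0.04, 0.03, 0.5]
fwer_adjusted = bonferroni(pvalues)
fdr_adjusted = benjamini_hochberg(pvalues)
```

Every Benjamini-Hochberg adjusted value is at most the corresponding Bonferroni value, which is the sense in which FWER control is the more conservative of the two.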

27 Fisher's Exact Test in Blast2GO
                GO   No GO
Test-set (A)     2     9
Ref-set (B)      3     8
Three files:
- Blast2GO project with annotations (.dat/.annot)
- One txt file with IDs: Test-set (.txt)
- Another txt file with IDs: Ref-set (.txt)

28 Different types of comparisons
Compare one condition against another:
- Remove common IDs
- Test-Set and Ref-Set are interchangeable
Compare a subset against the total:
- Gossip default setting
- Test-Set and Ref-Set are NOT interchangeable
[Diagrams: Set 1 and Set 2 with their common IDs; Test-Set and Ref-Set with their common IDs]

29 FET in Blast2GO
- The two-tailed test identifies not only over-represented but also under-represented functions
- If no Ref-Set is chosen, all annotations are used as reference

30 Enrichment Results Result table with link outs to sequence lists

31 Most specific terms Retains only the lowest, most specific enriched term per GO branch
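One way to read "most specific" is: keep an enriched term only if none of its descendants is also enriched. The sketch below implements that reading; it is an assumption about the filter, not a description of Blast2GO's code, and the term labels and `children` map are hypothetical.

```python
def descendants(children, term):
    """Return every term below `term` in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def most_specific(enriched, children):
    """Keep enriched terms that have no enriched term below them."""
    enriched = set(enriched)
    return {term for term in enriched
            if not (descendants(children, term) & enriched)}

# Hypothetical toy DAG: labels stand in for GO ids.
children = {
    "metabolic process": ["biosynthesis"],
    "biosynthesis": ["amino acid biosynthesis"],
}
enriched_terms = {"metabolic process", "biosynthesis", "sporulation"}
specific_terms = most_specific(enriched_terms, children)
```

In this toy case "metabolic process" is dropped because its child "biosynthesis" is also enriched, leaving "biosynthesis" and "sporulation".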

32 Enriched Graph
- View enriched terms data as DAG graphs
- Reduce the graph with the node filter => to draw all nodes, set the filter to 1

33 Bar-Chart
Export enriched terms as a chart => filter results
- % of sequences in Test group
- % of sequences in Ref group
- If Test > Ref: over-represented
- If Ref > Test: under-represented

34 Meta-analysis in Blast2GO
Annotation Result (.annot), one sequence/GO pair per line:
Sequence_1 GO:
Sequence_1 GO:
Sequence_1 GO:
Sequence_2 GO:
Sequence_2 GO:
Sequence_2 GO:
Equivalent formats <=> Enrichment Result (.annot), one treatment/GO pair per line:
Treatment_1 GO:
Treatment_1 GO:
Treatment_1 GO:
Treatment_2 GO:
Treatment_2 GO:
Treatment_2 GO:
By joining different functional enrichment results we can create an annotation file of conditions that captures their functional profile.
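Joining enrichment results into one annotation-style file might look like the sketch below. It assumes a two-column, tab-separated layout (name, then GO id), as the slide suggests; the function name is hypothetical and the GO ids are arbitrary placeholders, not results from any analysis.

```python
def enrichment_to_annot(enriched_by_treatment):
    """Write one 'treatment<TAB>GO id' line per enriched term.

    enriched_by_treatment: dict mapping a treatment name to the set of
    GO ids found enriched for it (assumed input structure).
    """
    lines = []
    for treatment in sorted(enriched_by_treatment):
        for go_id in sorted(enriched_by_treatment[treatment]):
            lines.append(f"{treatment}\t{go_id}")
    return "\n".join(lines) + "\n"

# Placeholder GO ids for illustration only.
results = {
    "Treatment_1": {"GO:0000001", "GO:0000002"},
    "Treatment_2": {"GO:0000002", "GO:0000003"},
}
annot_text = enrichment_to_annot(results)
```

Each treatment then plays the role that a sequence plays in a normal annotation file, so the same graph and chart functions can be applied to compare treatments.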

35 Meta-analysis in Blast2GO FIND SIMILARITIES BETWEEN TREATMENTS Use seq names to see treatments Use color by SeqCount

36 Meta-analysis in Blast2GO DISPLAY FUNCTIONAL DISSIMILARITIES ON DAG Use second column number for color

37 Exercises: Data Mining
