Data Mining: ANALYSIS OF A SINGLE DATASET
- Visualization techniques
- Combined Graph
- Charts and pies
- Search for specific functions
Data Mining on the DAG
- When working with large datasets, annotation results need to be summarized
- The DAG provides visualization of annotation data within its biological context
- In Blast2GO: the Combined Graph function
Combined Graph
- Each term has a number of sequences associated with it
- Node shape differentiates between direct and indirect annotation
- Each term is displayed within its biological context
- Nodes can be coloured to indicate relevance
Combined Graph (options)
- Choice of GO branches
- Reduction of nodes by number of annotated sequences
- Node data to be displayed
- Criterion for highlighting and filtering nodes
Combined Graph
- Let's paint the DAG of the dataset analyzed yesterday (1000 sequences)
- Too many nodes! We need a way to find the relevant information
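Node Information Content
- Accumulated by node (Sequence Count)
- Incoming information (Node Score)
[Figure: the same example DAG drawn twice. Labelled with sequence counts, the two upper nodes show 4 and 5; labelled with node scores, they show 2.4 and 2.5. The lower nodes carry 1 and 3 in both versions.]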
Node score
We compute a node score that reflects the amount of direct information at the node.
[Figure: example DAG with node scores 2.5 and 2.4 at the upper nodes and 1 and 3 at the lower nodes.]
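Node score (worked example, α = 0.6)
Each node sums the sequences annotated at the node itself and at its descendants, and each contribution is weighted by α^dist, where dist is the number of edges between the two nodes.
[Figure: DAG with GO4 on top, GO3 below it, and GO1 (1 sequence) and GO2 (3 sequences) below GO3; edges are labelled dist=0, dist=1 and dist=2.]
NodeScore(GO1) = 1 × 0.6^0 = 1
NodeScore(GO2) = 3 × 0.6^0 = 3
NodeScore(GO3) = 1 × 0.6^1 + 3 × 0.6^1 = 0.6 + 1.8 = 2.4
NodeScore(GO4) = 1 × 0.6^2 + 3 × 0.6^2 + 1 × 0.6^0 = 0.36 + 1.08 + 1 = 2.44 (labelled 2.5 in the figure)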
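The worked example above can be reproduced with a few lines of Python. This is only a sketch of the weighting scheme shown on the slide, not Blast2GO's own code: the function name `node_scores`, the dictionary layout, and the use of the shortest edge distance when a descendant can be reached by several paths are all assumptions made for illustration.

```python
from collections import deque

def node_scores(children, direct_counts, alpha=0.6):
    """Score every node as the sum of direct_count * alpha**distance
    over the node itself and all of its descendants."""
    scores = {}
    for node in children:
        # Breadth-first search yields the shortest edge distance
        # from this node to each of its descendants (assumption).
        dist = {node: 0}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for child in children.get(current, []):
                if child not in dist:
                    dist[child] = dist[current] + 1
                    queue.append(child)
        scores[node] = sum(
            direct_counts.get(n, 0) * alpha ** d for n, d in dist.items()
        )
    return scores

# Toy DAG from the slide: GO4 -> GO3 -> {GO1, GO2}
children = {"GO4": ["GO3"], "GO3": ["GO1", "GO2"], "GO1": [], "GO2": []}
# Sequences annotated directly to each term
direct_counts = {"GO1": 1, "GO2": 3, "GO3": 0, "GO4": 1}

scores = node_scores(children, direct_counts)
for term in ("GO1", "GO2", "GO3", "GO4"):
    print(term, round(scores[term], 2))
```

With α = 0.6 the sketch returns 1, 3, 2.4 and 2.44 for GO1 to GO4, matching the hand calculation.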
Node score vs Annotation score: DO NOT MIX THEM UP!
Annotation Score:
- Used in the annotation context
- Relates to the Blast results of ONE sequence
- AS = max{%sim × ECw} + (#TPR_GOs − 1) × GOw
Node Score:
- Used in the data-mining context
- Relates to the analysis of a GROUP of sequences
[Figure: an annotation-score example (one sequence with hits hit1, hit2, hit3 and terms GO1 to GO4 under ROOT, with values 60, 55, 52, 50) next to the node-score DAG (2.5, 2.4, 1, 3).]
Filtered Graph
[Figure: filtered DAG; labels: number of filtered nodes, transition nodes, direct annotations.]
Compacting Graphs by GOSlim
Show node content
Save as picture and as txt Saving Options
Graph Charts
Graph Charts
- Sequence distribution per GO term as a bar chart
- Sequence distribution per GO term as a level pie (level selection)
- Sequence distribution per GO term as a multilevel pie (#score or #seq cutoff)
Multilevel vs. GO-Slim Chart
- Multilevel pie with a sequence filter of 20
- GO-Slim: handy for summarizing functional content
Use the DAG to analyze a function
The DAG can be used to make queries on general concepts that have no direct annotations.
How many sequences are annotated to the function photosynthesis?
- Option 1: Find it in the GO graph → direct and indirect annotation
- Option 2: Find it through the Select function, with two sub-options:
  - Option 2.1: Direct annotation (use the GO id or description)
  - Option 2.2: Direct and indirect annotation (use the GO id and include GO parents)
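The difference between direct and direct-plus-indirect counts can be illustrated outside Blast2GO with a toy graph. The following is a minimal sketch, assuming a hypothetical `children` map (term to child terms) and a hypothetical `annotations` map (sequence to its directly annotated terms); the term names and sequence IDs are invented for the example.

```python
def descendants(term, children):
    """Return the term itself plus every term reachable below it."""
    seen = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def sequences_for(term, children, annotations, include_descendants=True):
    """Sequences annotated to the term (direct) or, optionally,
    to the term or any of its descendants (direct and indirect)."""
    terms = descendants(term, children) if include_descendants else {term}
    return {seq for seq, gos in annotations.items() if gos & terms}

# Hypothetical toy data
children = {"photosynthesis": ["light reaction", "dark reaction"]}
annotations = {
    "seq1": {"photosynthesis"},
    "seq2": {"light reaction"},
    "seq3": {"dark reaction"},
    "seq4": {"translation"},
}

direct = sequences_for("photosynthesis", children, annotations,
                       include_descendants=False)
direct_and_indirect = sequences_for("photosynthesis", children, annotations)
print(len(direct), len(direct_and_indirect))
```

In this toy case only one sequence is annotated directly to the general term, while three are found once the more specific child terms are included, which is why the indirect query matters for general concepts.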
Example: analyze a specific function
Find a function on the graph. [Screenshot labels: export, search.]
Example: analyze a specific function
Select all sequences annotated to this function and its descendants.
Example: analyze a specific function Locate these sequences
Example: analyze a specific function
- By exporting the sequence table you can see all sequences annotated to a given function (GO)
- Explore the annotation diversity of a given function within the graph
Conclusions
- DAGs are useful for browsing functional annotation but can be too large
- With filtering and pruning options you can create more navigable DAGs
- Pies are good for compacting information: try out different levels
- GO-Slim compacts to more equivalent terms than filtering the GO does
- You can use the DAG to query on general terms
Data Mining: ANALYSIS OF SEVERAL DATASETS
- Functional enrichment
- Enriched graphs
- Meta-analysis
Enrichment Analysis
Interpretation of a large list of genes: which functions are relevant?
Are these two groups of genes carrying out different biological roles?
                One gene list (A)   The other list (B)
Biosynthesis    54%                 18%
Sporulation     18%                 27%
Are these differences statistically significant?
Fisher's Exact Test
                One gene list (A)   The other list (B)
Biosynthesis    54%                 18%
Sporulation     18%                 27%
Contingency tables:
                  A   B                        A   B
Biosynthesis      6   2      Sporulation       2   3
No biosynthesis   5   9      No sporulation    9   8
Outcome as stated on the slide: p-value for biosynthesis < 0.05; p-value for sporulation > 0.05.
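To make the test concrete, here is a minimal pure-Python sketch of a two-sided Fisher's exact test applied to the two contingency tables above. It is written from the textbook definition (hypergeometric probabilities with fixed margins) and is not taken from Blast2GO; the function name `fisher_exact_two_sided` is an assumption for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell,
        # keeping all row and column totals fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Sum all tables that are no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )

# Rows: with / without the function; columns: list A, list B
p_biosynthesis = fisher_exact_two_sided(6, 2, 5, 9)
p_sporulation = fisher_exact_two_sided(2, 3, 9, 8)
print(round(p_biosynthesis, 3), round(p_sporulation, 3))
```

Run on the slide's toy counts, this sketch gives a two-sided p-value of about 0.18 for biosynthesis and 1.0 for sporulation. With only 11 genes per list, the "< 0.05" for biosynthesis is therefore best read as an illustration of the decision rule rather than the exact result for these counts.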
Multiple testing correction
We do this for all GO terms of our dataset. Many tests => many false positives => we need a correction!
- FDR control: a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses.
- FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses when performing multiple pairwise tests (more conservative).
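The two kinds of correction can be compared on a small list of p-values. The slide does not say which procedures are used, so the sketch below picks two standard ones as an assumption: Bonferroni for FWER control and Benjamini-Hochberg for FDR control. The p-values are invented for the example.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, one per GO term tested
pvals = [0.001, 0.01, 0.02, 0.04, 0.5]
fwer = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
print([round(p, 4) for p in fwer])
print([round(p, 4) for p in fdr])
```

At a 0.05 threshold, Bonferroni keeps two of the five hypothetical terms, while Benjamini-Hochberg keeps four, which illustrates why FWER control is described as the more conservative option.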
Fisher's Exact Test in Blast2GO
                GO   No GO
A (Test-set)    2    9
B (Ref-set)     3    8
Three files:
- Blast2GO project with annotations (.dat/.annot)
- One txt file with IDs: Test-set (.txt)
- Another txt file with IDs: Ref-set (.txt)
Different types of comparisons
Compare one condition against another:
- Remove common IDs
- Test-Set and Ref-Set are interchangeable
Compare a subset against the total:
- Gossip default setting
- Test-Set and Ref-Set are NOT interchangeable
[Figure: two diagrams of overlapping sets (Set 1 / Set 2, Test-Set / Ref-Set) with the shared region labelled "Common IDs".]
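For the first type of comparison, removing the common IDs before testing is a plain set operation on the two ID lists. The sketch below is only an illustration with invented IDs; the helper name `remove_common_ids` is hypothetical and does not correspond to a Blast2GO function.

```python
def remove_common_ids(set1_ids, set2_ids):
    """Drop IDs present in both lists so the two groups are disjoint."""
    common = set1_ids & set2_ids
    return set1_ids - common, set2_ids - common, common

# Hypothetical ID lists, as they might be read from the two txt files
condition_a = {"gene1", "gene2", "gene3", "gene4"}
condition_b = {"gene3", "gene4", "gene5"}

test_set, ref_set, common = remove_common_ids(condition_a, condition_b)
print(sorted(test_set), sorted(ref_set), sorted(common))
```

Because both groups are treated identically, swapping `condition_a` and `condition_b` only swaps the roles of the two outputs, which is what makes Test-Set and Ref-Set interchangeable in this kind of comparison.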
FET in Blast2GO
- The two-tailed test identifies not only over-represented but also under-represented functions
- If no Ref-Set is chosen, all annotations are used as the reference
Enrichment Results Result table with link outs to sequence lists
Most specific terms Retains only the lowest, most specific enriched term per GO branch
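One way to picture this filter is to drop every enriched term that still has an enriched descendant. The sketch below follows that reading; it is an assumption about the behaviour described on the slide, not Blast2GO's implementation, and the `children` map and term names are toy data.

```python
def most_specific(enriched, children):
    """Keep only enriched terms that have no enriched descendant."""
    def has_enriched_descendant(term):
        stack = list(children.get(term, []))
        seen = set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in enriched:
                return True
            stack.extend(children.get(current, []))
        return False

    return {t for t in enriched if not has_enriched_descendant(t)}

# Hypothetical GO fragment: term -> child terms
children = {
    "metabolic process": ["biosynthetic process"],
    "biosynthetic process": ["lipid biosynthetic process"],
    "reproduction": ["sporulation"],
}
enriched = {"metabolic process", "biosynthetic process", "sporulation"}

specific = most_specific(enriched, children)
print(sorted(specific))
```

Here "metabolic process" is dropped because its child "biosynthetic process" is also enriched, while "biosynthetic process" and "sporulation" remain as the lowest enriched terms of their branches.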
Enriched Graph
- View enriched-term data as DAG graphs
- A filter reduces the number of nodes drawn => to draw all nodes, set the filter to 1
Bar-Chart
- Export enriched terms as a chart => filter the results first
- Bars show the % of sequences in the Test group and the % of sequences in the Ref group
- If Test > Ref: over-represented
- If Ref > Test: under-represented
Meta-analysis in Blast2GO
Annotation Result (.annot):
Sequence_1 GO:0005792
Sequence_1 GO:0006412
Sequence_1 GO:0003735
Sequence_2 GO:0016705
Sequence_2 GO:0005840
Sequence_2 GO:0005506
Equivalent format <=> Enrichment Result:
Treatment_1 GO:0005792
Treatment_1 GO:0006412
Treatment_1 GO:0003735
By joining different functional enrichment results we can create an annotation file of conditions that captures their functional profile.
Enrichment Result (.annot):
Treatment_1 GO:0005792
Treatment_1 GO:0006412
Treatment_1 GO:0003735
Treatment_2 GO:0016705
Treatment_2 GO:0005840
Treatment_2 GO:0005506
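Joining enrichment results into one annotation-style file is a simple reshaping step. The sketch below assumes a two-column, tab-separated layout (name, then GO id), as in the listing above; the helper name `enrichment_to_annot` is hypothetical, and any additional columns a real .annot file may carry are ignored here.

```python
def enrichment_to_annot(results):
    """Turn {treatment: [GO ids]} into annotation-style lines,
    one 'treatment<TAB>GO id' pair per line."""
    lines = []
    for treatment, go_ids in results.items():
        for go_id in go_ids:
            lines.append(f"{treatment}\t{go_id}")
    return "\n".join(lines) + "\n"

# Enriched GO terms per treatment, taken from the slide's example
results = {
    "Treatment_1": ["GO:0005792", "GO:0006412", "GO:0003735"],
    "Treatment_2": ["GO:0016705", "GO:0005840", "GO:0005506"],
}

annot_text = enrichment_to_annot(results)
print(annot_text, end="")
```

Each treatment then plays the role that a sequence plays in a regular annotation file, so the same graph and chart functions can be applied to conditions instead of sequences.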
Meta-analysis in Blast2GO: FIND SIMILARITIES BETWEEN TREATMENTS
- Use seq names to see the treatments
- Use colour by SeqCount
Meta-analysis in Blast2GO: DISPLAY FUNCTIONAL DISSIMILARITIES ON THE DAG
- Use the number in the second column for colour
Exercises: Data Mining