Data Mining: ANALYSIS OF A SINGLE DATASET
- Visualization Techniques
- Combined Graph
- Charts and Pies
- Search for specific functions





Data Mining on the DAG
- When working with large datasets, annotation results need to be summarized
- The DAG provides visualization of annotation data within its biological context
- In Blast2GO: Combined Graph function

Combined Graph
- Each term has a number of sequences associated with it
- Node shape differentiates between direct and indirect annotation
- Each term is displayed within its biological context
- Nodes can be coloured to indicate relevance

Combined Graph (options)
- Different GO branches
- Reduce nodes by number of annotated sequences
- Node data to be displayed
- Criterion for highlighting and filtering nodes

Combined Graph
Let's draw the DAG of the dataset analyzed yesterday (1000 sequences).
Too many nodes! We need a way to find the relevant information.

Node Information Content
- Accumulated by node (Sequence Count); figure values: 4, 5, 1, 1, 3, 1, 3
- Incoming information (Node Score); figure values: 2.4, 2.5, 1, 1, 3, 1, 3

Node score
We compute a node score that reflects the amount of direct information at the node.

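Node score (worked example)
α = 0.6; dist is the number of edges between the node and the annotated term. GO1 and GO2 are children of GO3, and GO3 is a child of GO4.

$$
\begin{aligned}
\mathrm{NodeScore}(\mathrm{GO1}) &= 1 \cdot 0.6^{0} = 1 \\
\mathrm{NodeScore}(\mathrm{GO2}) &= 3 \cdot 0.6^{0} = 3 \\
\mathrm{NodeScore}(\mathrm{GO3}) &= 1 \cdot 0.6^{1} + 3 \cdot 0.6^{1} = 0.6 + 1.8 = 2.4 \\
\mathrm{NodeScore}(\mathrm{GO4}) &= 1 \cdot 0.6^{2} + 3 \cdot 0.6^{2} + 1 \cdot 0.6^{0} = 0.36 + 1.08 + 1 = 2.44
\end{aligned}
$$

The slide shows the last value as 2.5.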
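The worked example can be reproduced with a few lines of Python. This is only a sketch: the toy topology (GO4 above GO3, GO3 above GO1 and GO2) and the direct annotation counts are inferred from the numbers on the slide, and the function names are made up for illustration. They are not Blast2GO's own code.

```python
from collections import deque

ALPHA = 0.6  # distance weight used on the slide

# Toy DAG inferred from the worked example (parent -> children); an assumption.
children = {"GO4": ["GO3"], "GO3": ["GO1", "GO2"], "GO1": [], "GO2": []}
# Number of sequences annotated directly to each term (from the slide).
direct = {"GO1": 1, "GO2": 3, "GO3": 0, "GO4": 1}


def distances_below(node):
    """Shortest edge distance from `node` to itself and to each descendant."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in dist:
                dist[child] = dist[current] + 1
                queue.append(child)
    return dist


def node_score(node, alpha=ALPHA):
    """Sum of direct counts at or below `node`, each weighted by alpha**dist."""
    return sum(direct[t] * alpha ** d for t, d in distances_below(node).items())


def sequence_count(node):
    """Accumulated count at or below `node`.

    Simplification: assumes each sequence is annotated to a single term;
    with real data, sets of sequence IDs would be merged instead.
    """
    return sum(direct[t] for t in distances_below(node))


for term in ("GO1", "GO2", "GO3", "GO4"):
    print(term, sequence_count(term), round(node_score(term), 2))
```

The sequence counts come out as 1, 3, 4 and 5, matching the "Sequence Count" values of the previous slide, while the node scores decay with distance.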

Node score vs Annotation score: DO NOT MIX THEM UP!
Annotation Score:
- In annotation context
- Relates to Blast results of ONE sequence
- AS = max{%sim × ECw} + (#TPR_GOs - 1) × GOw
Node Score:
- In data-mining context
- Relates to analysis of a GROUP of sequences
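To make the contrast concrete, the annotation score formula can be written as a small function. This is a sketch under assumptions: the hit similarities, evidence-code weights, GO weight and the value used for #TPR_GOs below are hypothetical numbers chosen for illustration, and the function name is invented here.

```python
def annotation_score(hits, n_tpr_gos, go_weight):
    """AS = max(%sim * ECw) + (#TPR_GOs - 1) * GOw, as written on the slide.

    hits       -- list of (percent_similarity, evidence_code_weight) pairs
    n_tpr_gos  -- the slide's #TPR_GOs value for the candidate term
    go_weight  -- the slide's GOw parameter
    """
    direct_term = max(sim * ec_w for sim, ec_w in hits)
    abstraction_term = (n_tpr_gos - 1) * go_weight
    return direct_term + abstraction_term


# Hypothetical inputs: two Blast hits for ONE sequence.
example_hits = [(60, 1.0), (55, 0.7)]
print(annotation_score(example_hits, n_tpr_gos=3, go_weight=5))
```

Unlike the node score, every input here belongs to a single query sequence; nothing is summed across the dataset.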

Filtered Graph
- Number of filtered nodes
- Transition nodes
- Direct annotations

Compacting Graphs by GOSlim

Show node content

Saving Options
Save as picture and as txt

Graph Charts

Graph Charts
- Sequence Distribution/GO as Bar-Chart
- Sequence Distribution/GO as Level-Pie (level selection)
- Sequence Distribution/GO as Multilevel-Pie (#score or #seq cutoff)

Multilevel vs. GO-Slim Chart
- Multilevel pie with a sequence filter of 20
- GO-Slim: handy to summarize functional content

Use DAG to analyze a function
The DAG can be used to make queries on general concepts without direct annotations.
How many sequences are annotated to the function photosynthesis?
- Option 1: Find in the GO graph: direct and indirect annotation
- Option 2: Find through the Select function. Two sub-options:
  - Option 2.1: Direct annotation (use GO ID or description)
  - Option 2.2: Direct and indirect annotation (use GO ID and include GO parents)
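The difference between a direct query and a direct-plus-indirect query can be sketched in Python. Everything below is illustrative: the child term names, the sequence IDs and the helper functions are invented, and the lookup is not Blast2GO's Select function.

```python
# Hypothetical mini-hierarchy: parent term -> child terms.
children = {
    "photosynthesis": ["child_term_a", "child_term_b"],
    "child_term_a": [],
    "child_term_b": [],
}

# Hypothetical annotations: sequence ID -> set of directly annotated terms.
annotations = {
    "seq1": {"photosynthesis"},
    "seq2": {"child_term_a"},
    "seq3": {"child_term_a", "child_term_b"},
    "seq4": {"unrelated_term"},
}


def descendants(term):
    """All terms reachable below `term` in the hierarchy."""
    found, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def direct_hits(term):
    """Sequences annotated to exactly this term (Option 2.1)."""
    return {seq for seq, terms in annotations.items() if term in terms}


def direct_and_indirect_hits(term):
    """Sequences annotated to the term or any descendant (Options 1 and 2.2)."""
    wanted = {term} | descendants(term)
    return {seq for seq, terms in annotations.items() if terms & wanted}


print(sorted(direct_hits("photosynthesis")))
print(sorted(direct_and_indirect_hits("photosynthesis")))
```

In this toy data only one sequence carries the general term directly, but three are found once annotations to more specific terms are included.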

Example: analyze a specific function
Find a function on the graph (slide callouts: export, search)

Example: analyze a specific function
Select all sequences annotated to this function and its descendants

Example: analyze a specific function
Locate these sequences

Example: analyze a specific function
- By exporting the sequence table you can see all sequences annotated to a given function (GO)
- Explore the annotation diversity of a given function within the graph

Conclusions
- DAGs are interesting for browsing functional annotation but can be too large
- With filtering and pruning options you can create more navigable DAGs
- Pies are good for compacting information: try out different levels
- GO-Slim compacts to more equivalent terms than filtering the GO
- You can use the DAG to query on general terms

Data Mining: ANALYSIS OF SEVERAL DATASETS
- Functional Enrichment
- Enriched Graphs
- Meta-analysis

Enrichment Analysis
Interpretation of a large list of genes: which functions are relevant?
- One gene list (A): Biosynthesis 54%, Sporulation 18%
- The other list (B): Biosynthesis 18%, Sporulation 27%
Are these two groups of genes carrying out different biological roles?
Are these differences statistically significant?

Fisher's Exact Test
- One gene list (A): Biosynthesis 54%, Sporulation 18%
- The other list (B): Biosynthesis 18%, Sporulation 27%

Contingency tables:

                    A    B
Biosynthesis        6    2
No biosynthesis     5    9

                    A    B
Sporulation         2    3
No sporulation      9    8

Result as stated on the slide (schematic): p-value for biosynthesis < 0.05; p-value for sporulation > 0.05
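The test itself is short enough to write out. The sketch below implements a two-sided Fisher's exact test directly from the hypergeometric distribution using only the standard library; the function name is made up, and the two-sided rule used (sum the probabilities of all tables no more likely than the observed one) is one common convention, not necessarily the one Blast2GO applies. With the tiny counts of this toy example the exact two-sided p-values come out at roughly 0.18 for biosynthesis and 1.0 for sporulation, so the "< 0.05" on the slide is best read as schematic rather than as the result for these exact numbers.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: term / no term. Columns: list A / list B.
    Returns the sum of hypergeometric probabilities of every table with the
    same margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties are not lost to rounding.
    return sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


p_biosynthesis = fisher_exact_two_sided(6, 2, 5, 9)
p_sporulation = fisher_exact_two_sided(2, 3, 9, 8)
print(round(p_biosynthesis, 4), round(p_sporulation, 4))
```

Real enrichment analyses run on far larger lists, where the same proportions would yield much smaller p-values.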

Multiple testing correction
We do this for all GO terms of our dataset. Many tests => many false positives => we need a correction!
- FDR control: a statistical method used in multiple hypothesis testing to correct for multiple comparisons. Among the rejected hypotheses, FDR control bounds the expected proportion of incorrectly rejected null hypotheses.
- FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses tested. Controlling it is more conservative.
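The slide does not name a specific procedure. As a sketch, the code below uses two widely used ones: Bonferroni for FWER control and Benjamini-Hochberg for FDR control. The input p-values are invented for illustration and the function names are assumptions of this example.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per GO term tested.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in bonferroni(raw)])
print([round(p, 4) for p in benjamini_hochberg(raw)])
```

At a 0.05 threshold, Bonferroni keeps two of these four terms while Benjamini-Hochberg keeps all four, which illustrates why FWER control is described as more conservative.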

Fisher's Exact Test in Blast2GO

                 GO    No GO
Test-set (A)      2      9
Ref-set (B)       3      8

Three files:
- Blast2GO project with annotations (.dat/.annot)
- One txt file with IDs: Test-set (.txt)
- Another txt file with IDs: Ref-set (.txt)

Different types of comparisons
- Compare one condition against another: remove common IDs; Test-Set and Ref-Set are interchangeable
- Compare a subset against the total: Gossip default setting; Test-Set and Ref-Set are NOT interchangeable
[Figure: Set 1 and Set 2 with their common IDs; Test-Set and Ref-Set with their common IDs]

FET in Blast2GO
- The two-tailed test identifies not only over-represented but also under-represented functions
- If no Ref-Set is chosen, all annotations are used as reference

Enrichment Results
Result table with link-outs to sequence lists

Most specific terms
Retains only the lowest, most specific enriched term per GO branch
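One way to picture this reduction is to drop every enriched term that is an ancestor of another enriched term. The sketch below does exactly that on an invented mini-hierarchy; the term names and helper functions are hypothetical, and Blast2GO's actual rule may differ in its details.

```python
# Hypothetical hierarchy: term -> its parent terms.
parents = {
    "T_root": [],
    "T_mid": ["T_root"],
    "T_leaf": ["T_mid"],
    "T_other": ["T_root"],
}


def ancestors(term):
    """All terms above `term` in the hierarchy."""
    found, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def most_specific(enriched_terms):
    """Keep only enriched terms that have no enriched descendant."""
    enriched = set(enriched_terms)
    covered = set()
    for term in enriched:
        covered |= ancestors(term)
    return enriched - covered


print(sorted(most_specific(["T_root", "T_mid", "T_leaf"])))
print(sorted(most_specific(["T_root", "T_mid", "T_other"])))
```

In the first call a single branch collapses to its lowest term; in the second, two separate branches each keep their own most specific term.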

Enriched Graph
View enriched-term data as DAG graphs. Use the filter to reduce the graph; to draw all nodes, set the filter to 1.

Bar-Chart
Export enriched terms as a chart; filter the results first.
- Bars: % of sequences in the Test group and % of sequences in the Ref group
- If Test > Ref: over-represented
- If Ref > Test: under-represented

Meta-analysis in Blast2GO
Annotation Result (.annot):
Sequence_1 GO:0005792
Sequence_1 GO:0006412
Sequence_1 GO:0003735
Sequence_2 GO:0016705
Sequence_2 GO:0005840
Sequence_2 GO:0005506

Enrichment Result, in the equivalent format:
Treatment_1 GO:0005792
Treatment_1 GO:0006412
Treatment_1 GO:0003735

By joining different functional enrichment results we can create an annotation file of conditions that captures their functional profile.

Enrichment Result (.annot):
Treatment_1 GO:0005792
Treatment_1 GO:0006412
Treatment_1 GO:0003735
Treatment_2 GO:0016705
Treatment_2 GO:0005840
Treatment_2 GO:0005506
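Building such a file from several enrichment results takes only a few lines. This is a minimal sketch: the dictionary layout and the function name are invented for illustration, and the tab separator is an assumption about the .annot layout that should be checked against the Blast2GO documentation before use.

```python
# Enriched GO IDs per treatment, taken from the slide.
enrichment_results = {
    "Treatment_1": ["GO:0005792", "GO:0006412", "GO:0003735"],
    "Treatment_2": ["GO:0016705", "GO:0005840", "GO:0005506"],
}


def to_annot_lines(results, separator="\t"):
    """One 'name<separator>GO ID' line per enriched term.

    The treatment name takes the place of the sequence name.
    """
    return [
        f"{treatment}{separator}{go_id}"
        for treatment, go_ids in results.items()
        for go_id in go_ids
    ]


annot_text = "\n".join(to_annot_lines(enrichment_results))
print(annot_text)
```

Each treatment then behaves like a "sequence" whose annotations are its enriched functions, so the single-dataset graph and chart tools can be reused on it.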

Meta-analysis in Blast2GO: FIND SIMILARITIES BETWEEN TREATMENTS
- Use seq names to see treatments
- Use color by SeqCount

Meta-analysis in Blast2GO: DISPLAY FUNCTIONAL DISSIMILARITIES ON DAG
- Use second column number for color

Exercises: Data Mining