FishGraph: A Network-Driven Data Analysis

Patrícia Cavoto*, Victor Cardoso*, Régine Vignes Lebbe, André Santanchè*

* UNICAMP, University of Campinas, São Paulo, Brazil
ISYEB - UMR 7205 CNRS, MNHN, UPMC, EPHE; UPMC Univ. Paris 06, Sorbonne Universités, Paris, France

Outline
- Motivation
- Goal
- ReGraph: from FishBase to FishGraph
- Data Experiments
- Conclusions

Motivation
Collaborative research involving:
- LIS - Laboratory of Information Systems, UNICAMP, Brazil
- MNHN - National Museum of Natural History and Sorbonne Universités, Paris, France
- FishBase Consortium

Motivation
FishBase: a relational database and information system that stores biological data on fish species, with millions of records covering:
- Species, taxonomic classification and predators
- Locations (country and ecosystem)
- Identification keys
- Food
- Behavior
- etc.

Motivation
Identification key:
- A mechanism used in biology to identify a specimen
- Composed of a set of questions that guide scientists through the identification
- Has one or more species associated with it
- Similar to a decision tree

Identification Key Example
6 - Freshwater fishes of Africa
- Five pairs of external gill slits / Single, or single pair of gill openings
- Head without extended rostrum, gill slits lateral / Head with extended rostrum, gill slits ventral
- Body without scales, or scales small and not clearly visible / Body with clearly visible scales
- Body slender, elongate and eel-like / Body not eel-like
- Carcharhinus leucas
adapted from: http://fishbase.org/keys/description.php?keycode=6
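
Since a key behaves like a decision tree, it can be sketched as nested question pairs. The snippet below is only an illustration: the `Couplet` class, the `identify` helper and the way the two question pairs are wired together are assumptions made for the example, not the structure stored in FishBase.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Couplet:
    """One question of a key: two contrasting statements and where each leads."""
    first: str
    second: str
    if_first: Union["Couplet", str]
    if_second: Union["Couplet", str]


# Placeholder for branches that are not spelled out in this example.
UNRESOLVED = "(continue with the rest of the key)"

# Assumed wiring of two question pairs from key 6 (hypothetical layout).
head = Couplet(
    first="Head without extended rostrum, gill slits lateral",
    second="Head with extended rostrum, gill slits ventral",
    if_first="Carcharhinus leucas",
    if_second=UNRESOLVED,
)
root = Couplet(
    first="Five pairs of external gill slits",
    second="Single, or single pair of gill openings",
    if_first=head,
    if_second=UNRESOLVED,
)


def identify(node, answers):
    """Follow answers down the key (True means the first statement holds)."""
    for answer in answers:
        if isinstance(node, str):  # already reached a leaf
            break
        node = node.if_first if answer else node.if_second
    return node


print(identify(root, [True, True]))
```

Each answer moves one level down the tree, so an identification is a path from the root question to a species.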

Identification Key Problem
- 6 - Freshwater fishes of Africa
- 1419 - Species of Schilbe of Africa
- ?
adapted from: http://fishbase.org/keys/description.php?keycode=6 http://fishbase.org/keys/description.php?keycode=1419

Identification Key Problem
- 6 - Freshwater fishes of Africa: Adipose fin present / Adipose fin absent
adapted from: http://fishbase.org/keys/description.php?keycode=6 http://fishbase.org/keys/description.php?keycode=1419

Identification Key Problem
- 6 - Freshwater fishes of Africa: Adipose fin present / Adipose fin absent
- 1419 - Species of Schilbe of Africa: Adipose fin present / Adipose fin absent
adapted from: http://fishbase.org/keys/description.php?keycode=6 http://fishbase.org/keys/description.php?keycode=1419

Motivation
- Biological data (as in FishBase) form a big network
- Biologists need network analysis to:
  - Identify the most important species in a specific food chain
  - Define areas (or species) for preservation
  - Find relations in a network of identification keys
- But the data are mainly stored in a relational database

Motivation
How to support biologists in network-driven analysis?

Goal
Build a network database for analysis from a relational database

ReGraph: from FishBase to FishGraph
Graph databases:
- Very effective in network analysis
- Flexible structure
- Easy to query transitive relationships

ReGraph: from FishBase to FishGraph
ReGraph: a framework that generates a graph database from a relational database.
[Figure label: mapped subgraph]

ReGraph: from FishBase to FishGraph
ReGraph: keeps the graph database synchronized with the relational database (one-way synchronization).

ReGraph: from FishBase to FishGraph
ReGraph: relational and graph databases keep their native form.
[Figure labels: Current systems; New system]

ReGraph: from FishBase to FishGraph
ReGraph: allows new data to be added to the graph database.
[Figure label: annotated subgraph]

ReGraph: from FishBase to FishGraph
ReGraph: mapped and annotated subgraphs are integrated and available for running analyses.

ReGraph: from FishBase to FishGraph
ReGraph: connects data in the local graph with global graphs on the web.
[Figure label: Semantic Web]

ReGraph: from FishBase to FishGraph
STEPS:
1. Map data from the relational database to the graph
2. Run the ETL process to load the initial data
3. The synchronization process starts running after the initial load
4. Add new information as annotations (optional)
5. Perform the analysis
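
Steps 1 and 2 can be pictured with a small, self-contained sketch. Everything below is an assumption made for illustration: the two toy tables, their column names, the `NODE_MAPPING` and `EDGE_MAPPING` structures and the `load_graph` function are hypothetical and do not reproduce ReGraph's mapping format or the FishBase schema.

```python
import sqlite3

# Hypothetical stand-in for a relational source (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genera  (gen_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE species (spec_id INTEGER PRIMARY KEY, name TEXT,
                          gen_id INTEGER REFERENCES genera(gen_id));
    INSERT INTO genera  VALUES (1, 'Carcharhinus');
    INSERT INTO species VALUES (10, 'Carcharhinus leucas', 1);
""")

# Step 1 (assumed form): which table becomes which node label, and which
# foreign key becomes which edge.
NODE_MAPPING = {"genera": ("GENUS", "gen_id"), "species": ("SPECIES", "spec_id")}
EDGE_MAPPING = [("species", "spec_id", "gen_id", "SPECIES", "GENUS", "belongs_to")]


def load_graph(connection):
    """Step 2 (sketch): read every mapped row and emit nodes and edges."""
    nodes, edges = {}, set()
    for table, (label, key) in NODE_MAPPING.items():
        for row_id, name in connection.execute(f"SELECT {key}, name FROM {table}"):
            nodes[(label, row_id)] = {"name": name}
    for table, src_key, dst_key, src_label, dst_label, edge_type in EDGE_MAPPING:
        query = f"SELECT {src_key}, {dst_key} FROM {table} WHERE {dst_key} IS NOT NULL"
        for src, dst in connection.execute(query):
            edges.add(((src_label, src), edge_type, (dst_label, dst)))
    return nodes, edges


nodes, edges = load_graph(conn)
print(len(nodes), "nodes,", len(edges), "edges")
```

Keeping the mapping as data rather than code is one way to let the same loader serve other relational sources; whether ReGraph does this is not stated in the slides.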

ReGraph: from FishBase to FishGraph
ReGraph: used to generate FishGraph (graph database) from FishBase (relational database).
[Figure label: FishGraph]

ReGraph: from FishBase to FishGraph
[Figure: relational tables FAMILIES, CLASSES, GENERA, ORDERS, SPECIES, COUNTREF, KEYQUESTIONS, ECOSYSTEMREF, FAOARREF, KEYS]

ReGraph: from FishBase to FishGraph
[Figure: graph model with node types FAMILY, SPECIES, CLASS, GENUS, COUNTRY, ORDER, KEY, ECOSYSTEM and edge label belongs_to]
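
With belongs_to edges linking each taxon to its parent, a transitive question such as "which higher taxa does this species belong to?" becomes a simple walk along the edges. The sketch below assumes a child-to-parent dictionary as the edge store; that layout and the `lineage` function are hypothetical, and the class level is left out.

```python
# Assumed in-memory form of belongs_to edges (child -> parent).
BELONGS_TO = {
    "Carcharhinus leucas": "Carcharhinus",   # species -> genus
    "Carcharhinus": "Carcharhinidae",        # genus -> family
    "Carcharhinidae": "Carcharhiniformes",   # family -> order
}


def lineage(taxon):
    """Follow belongs_to edges upward until no parent is recorded."""
    path = []
    while taxon in BELONGS_TO:
        taxon = BELONGS_TO[taxon]
        path.append(taxon)
    return path


print(lineage("Carcharhinus leucas"))
```

In a relational schema the same question typically needs one join per taxonomic level, which is the contrast the graph model is meant to remove.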

Experiments: Identification Key Analysis
Data used:
- Identification keys
- Species
- Geographic locations (countries and ecosystems)
- ~ 86,500 edges and 10,500 nodes

Experiments: Identification Key Analysis
Long-term goal:
- Start the identification process from any node across several identification keys.
Goals in this analysis:
- Find similarities between keys
- Find differences between keys
- Analyze groups of keys

Experiments: Identification Keys Analysis
The annotated share edge connects keys that share at least one species.
[Figure: graph model with node types FAMILY, SPECIES, CLASS, GENUS, COUNTRY, ORDER, KEY, ECOSYSTEM and edge label belongs_to]
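
The share annotation can be derived from key-to-species associations by intersecting species sets. The sketch below is a minimal illustration under assumptions: the key identifiers, the species placeholders and the `share_edges` function are all hypothetical.

```python
from itertools import combinations

# Hypothetical key -> species associations (placeholders, not FishBase data).
KEY_SPECIES = {
    "K1": {"sp_a", "sp_b", "sp_c"},
    "K2": {"sp_b", "sp_c", "sp_d"},
    "K3": {"sp_x"},
}


def share_edges(key_species):
    """Return {(key1, key2): shared species} for every pair sharing at least one."""
    edges = {}
    for k1, k2 in combinations(sorted(key_species), 2):
        shared = key_species[k1] & key_species[k2]
        if shared:
            edges[(k1, k2)] = shared
    return edges


print(share_edges(KEY_SPECIES))
```

Comparing every pair is quadratic in the number of keys; for larger inputs, grouping keys by species first would avoid testing pairs that cannot share anything.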

Experiments: Identification Keys Analysis
- Components are built from the share edge, which connects two or more distinct keys with their associated species.
- A component is a subgraph in which there is a path from any node to any other node.
* Each color represents an independent component.
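
Components in this sense can be found with a breadth-first search over the share edges, treated as undirected. This is a generic sketch of that technique, not the tool used in the experiments; the `components` function and the key identifiers are hypothetical.

```python
from collections import deque


def components(edges, nodes=()):
    """Group nodes into connected components of an undirected graph."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), set()
        while queue:
            node = queue.popleft()
            group.add(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        result.append(group)
    return result


# Hypothetical share edges between keys; K6 shares nothing with any other key.
share_pairs = [("K1", "K2"), ("K2", "K3"), ("K4", "K5")]
groups = components(share_pairs, nodes=["K6"])
linked = [g for g in groups if len(g) > 1]  # keep groups of two or more keys
print(linked)
```

Filtering out single-node groups matches the slide's focus on components that connect two or more distinct keys.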

Experiments: Identification Keys Analysis
[Figure: identification keys 205 and 316 with their species; species colored by family (same order and class); labels: identification key, shared species]
- 205 - Key to the species of scorpionfishes occurring in the Western Central Pacific.
- 316 - Key to the species of Indo-Pacific Scorpionfish (Genus Scorpaenopsis).
- Both keys can be unified!

Experiments: Taxonomic Classification Analysis
Data used:
- Class
- Order
- Family
- Genus
- Species
- ~ 163,000 edges and 44,500 nodes

Experiments: Taxonomic Classification Analysis
Goals:
- Compare data in FishGraph with data in a global graph (DBpedia)
- Find divergences
- Propose reviews

Experiments: Taxonomic Classification
Comparison of taxonomic classification between FishGraph and DBpedia
Total of species = 32,957
- Equal = 15.58% (5,136)
- Inconsistent = 22.62% (7,456)
- Not found = 61.79% (20,365)
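
The three categories can be read as the outcome of a per-species lookup, and the percentages follow directly from the counts. The sketch below assumes a simple dictionary comparison; the `compare` function and its toy inputs are hypothetical, while the three counts are the ones reported on the slide.

```python
from collections import Counter


def compare(local, remote):
    """Classify each local species as equal, inconsistent or not_found."""
    outcome = Counter()
    for species, classification in local.items():
        if species not in remote:
            outcome["not_found"] += 1
        elif remote[species] == classification:
            outcome["equal"] += 1
        else:
            outcome["inconsistent"] += 1
    return outcome


# Toy input with placeholder names, only to show the three outcomes.
local = {"sp_a": "family_1", "sp_b": "family_2", "sp_c": "family_3"}
remote = {"sp_a": "family_1", "sp_b": "family_9"}
print(compare(local, remote))

# Shares recomputed from the counts reported on the slide.
reported = {"equal": 5136, "inconsistent": 7456, "not_found": 20365}
total = sum(reported.values())
shares = {name: round(100 * count / total, 2) for name, count in reported.items()}
print(total, shares)
```

The counts add up to the stated total of 32,957 species, so each share is simply the count divided by that total.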

Conclusions
- Graph databases to perform network analyses
- One-way synchronization
- Annotations and connection with other sources on the Web
- Network-driven data analysis for knowledge discovery:
  - Identification keys
  - Taxonomic classification

Conclusions
Future work:
- Record the provenance of data obtained from web sources
- Organize "same as" nodes in the local graph
- Enable distinct graph mappings from one relational model

Acknowledgments
Unicamp, LIS members, FishBase Consortium, FAPESP, CNPq, CAPES

Thank you!
Patrícia Cavoto
UNICAMP, University of Campinas, São Paulo, Brazil
patricia.cavoto@gmail.com