FishGraph: A Network-Driven Data Analysis
Patrícia Cavoto*, Victor Cardoso*, Régine Vignes Lebbe, André Santanchè*
*UNICAMP University of Campinas, São Paulo, Brazil
ISYEB - UMR 7205 CNRS, MNHN, UPMC, EPHE
UPMC Univ. Paris 06, Sorbonne Universités, Paris, France
Outline
- Motivation
- Goal
- ReGraph: from FishBase to FishGraph
- Data Experiments
- Conclusions
Motivation
Collaborative research involving:
- LIS - Laboratory of Information Systems, UNICAMP, Brazil
- MNHN - National Museum of Natural History and Sorbonne Universités, Paris, France
- FishBase Consortium
Motivation
FishBase: a relational database and information system that stores biological data on fish species, with millions of records covering:
- Species, taxonomic classification and predators
- Locations (country and ecosystem)
- Identification keys
- Food
- Behavior
- etc.
Motivation
Identification key: a mechanism used in biology to identify a specimen
- Composed of a set of questions that guide scientists through the identification
- Has one or more associated species
- Similar to a decision tree
Identification Key Example
Key 6 - Freshwater fishes of Africa
Figure (decision tree): paired alternatives from the key
- Five pairs of external gill slits / Single, or single pair of gill openings
- Head without extended rostrum, gill slits lateral / Head with extended rostrum, gill slits ventral
- Body without scales, or scales small and not clearly visible / Body with clearly visible scales
- Body slender, elongate and eel-like / Body not eel-like
Species shown: Carcharhinus leucas
Adapted from: http://fishbase.org/keys/description.php?keycode=6
Identification Key Problem
Figure: two keys, 6 - Freshwater fishes of Africa and 1419 - Species of Schilbe of Africa, each containing the same pair of alternatives:
- Adipose fin present
- Adipose fin absent
Adapted from:
http://fishbase.org/keys/description.php?keycode=6
http://fishbase.org/keys/description.php?keycode=1419
Motivation
- Biological data (as in FishBase) form a large network
- Biologists need network analysis to:
  - Identify the most important species in a specific food chain
  - Define areas (or species) for preservation
  - Find relations in a network of identification keys
- But the data are mainly stored in a relational database
Motivation
How to support biologists in network-driven analysis?
Goal
Build a network database for analysis from a relational database
ReGraph: from FishBase to FishGraph
Graph databases:
- Very effective for network analysis
- Flexible structure
- Easy to traverse transitive relationships
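As a rough illustration of why transitive relationships are easy to follow in a graph, the sketch below walks belongs_to edges from a species up through its higher taxa. The adjacency dictionary, the node identifiers, and the `ancestors` helper are illustrative assumptions, not FishBase data or part of ReGraph.

```python
# Hypothetical belongs_to edges, child -> parent (made-up identifiers).
BELONGS_TO = {
    "species:A": "genus:G",
    "genus:G": "family:F",
    "family:F": "order:O",
    "order:O": "class:C",
}

def ancestors(node, edges):
    """Follow belongs_to edges transitively and return every ancestor in order."""
    path = []
    seen = {node}
    while node in edges:
        node = edges[node]
        if node in seen:  # guard against accidental cycles in the data
            break
        seen.add(node)
        path.append(node)
    return path

lineage = ancestors("species:A", BELONGS_TO)
print(lineage)
```

In a relational database the same question typically needs one join per taxonomic level (or a recursive query); in a graph it is a single traversal.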
ReGraph: from FishBase to FishGraph
ReGraph: a framework that generates a graph database from a relational database.
Figure label: mapped subgraph
ReGraph: from FishBase to FishGraph
ReGraph: keeps the graph database synchronized with the relational database (one-way synchronization).
ReGraph: from FishBase to FishGraph
ReGraph: relational and graph databases keep their native form.
Figure labels: Current systems, New system
ReGraph: from FishBase to FishGraph
ReGraph: allows adding new data to the graph database.
Figure label: annotated subgraph
ReGraph: from FishBase to FishGraph
ReGraph: mapped and annotated subgraphs are integrated and available for analysis.
ReGraph: from FishBase to FishGraph
ReGraph: connects data in the local graph with global graphs on the web.
Figure label: Semantic Web
ReGraph: from FishBase to FishGraph
Steps:
1. Map data from the relational database to the graph
2. Run the ETL process to load the initial data
3. The synchronization process starts running after the initial load
4. Add new information as annotations (optional)
5. Perform the analysis
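Steps 1 and 2 can be pictured with a minimal sketch, assuming a toy two-table schema in an in-memory SQLite database. The table names, the `NODE_MAP` and `EDGE_MAP` structures, and the `load_graph` function are hypothetical and only illustrate the idea of mapping rows to nodes and foreign keys to edges; they are not the ReGraph implementation or the real FishBase schema.

```python
import sqlite3

# Hypothetical two-table relational schema; the real FishBase tables differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genera  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT,
                          genus_id INTEGER REFERENCES genera(id));
    INSERT INTO genera  VALUES (1, 'GenusA');
    INSERT INTO species VALUES (10, 'GenusA alpha', 1), (11, 'GenusA beta', 1);
""")

# Step 1 (assumed form): map tables to node labels and foreign keys to edge types.
NODE_MAP = {"genera": "GENUS", "species": "SPECIES"}
EDGE_MAP = [("species", "genus_id", "genera", "belongs_to")]

def load_graph(conn):
    """Step 2: one ETL pass that copies rows to nodes and foreign keys to edges."""
    nodes, edges = {}, []
    for table, label in NODE_MAP.items():
        for row_id, name in conn.execute(f"SELECT id, name FROM {table}"):
            nodes[(label, row_id)] = {"name": name}
    for src_table, fk, dst_table, edge_type in EDGE_MAP:
        query = f"SELECT id, {fk} FROM {src_table} WHERE {fk} IS NOT NULL"
        for src_id, dst_id in conn.execute(query):
            edges.append(((NODE_MAP[src_table], src_id), edge_type,
                          (NODE_MAP[dst_table], dst_id)))
    return nodes, edges

nodes, edges = load_graph(conn)
print(len(nodes), "nodes,", len(edges), "edges")
```

Keeping the mapping as data rather than code is one way to let the same loader serve different relational schemas.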
ReGraph: from FishBase to FishGraph
ReGraph: used to generate FishGraph (graph database) from FishBase (relational database)
Figure label: FishGraph
ReGraph: from FishBase to FishGraph
Figure (relational tables): FAMILIES, CLASSES, GENERA, ORDERS, SPECIES, COUNTREF, KEYQUESTIONS, ECOSYSTEMREF, FAOARREF, KEYS
ReGraph: from FishBase to FishGraph
Figure (graph model): nodes FAMILY, SPECIES, CLASS, GENUS, COUNTRY, ORDER, KEY, ECOSYSTEM; edge label belongs_to
Experiments: Identification Key Analysis
Data used:
- Identification keys
- Species
- Geographic locations (countries and ecosystems)
~ 86,500 edges and 10,500 nodes
Experiments: Identification Key Analysis
Long-term goal: start the identification process from any node across several identification keys.
Goals in this analysis:
- Find similarities between keys
- Find differences between keys
- Analyze groups of keys
Experiments: Identification Keys Analysis
The annotated share edge connects keys that share at least one species.
Figure (graph model): nodes FAMILY, SPECIES, CLASS, GENUS, COUNTRY, ORDER, KEY, ECOSYSTEM; edge label belongs_to
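A minimal sketch of how a share edge could be derived, assuming each key is represented by the set of species it identifies. The `KEY_SPECIES` mapping and the `share_edges` function are hypothetical names with made-up identifiers, not the annotation mechanism used in FishGraph.

```python
from itertools import combinations

# Hypothetical key -> species mapping (identifiers are made up for illustration).
KEY_SPECIES = {
    "key:1": {"sp:a", "sp:b"},
    "key:2": {"sp:b", "sp:c"},
    "key:3": {"sp:d"},
}

def share_edges(key_species):
    """Return {(key1, key2): shared species} for each pair sharing at least one species."""
    edges = {}
    for k1, k2 in combinations(sorted(key_species), 2):
        shared = key_species[k1] & key_species[k2]
        if shared:  # only pairs with a non-empty intersection get a share edge
            edges[(k1, k2)] = shared
    return edges

edges = share_edges(KEY_SPECIES)
print(edges)
```

The pairwise comparison is quadratic in the number of keys, which is acceptable for a sketch; a larger dataset might instead invert the mapping (species to keys) first.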
Experiments: Identification Keys Analysis
Components are based on the share edge, which connects two or more distinct keys with their associated species.
A component is a subgraph in which there is a path between every pair of nodes.
Figure note: each color represents an independent component.
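The definition of a component above can be sketched with a breadth-first search over an undirected edge list. The edge list, the identifiers, and the function name are illustrative assumptions; the last step keeps only components that contain two or more keys, matching the description on this slide.

```python
from collections import defaultdict, deque

# Hypothetical undirected edges between keys and species (made-up identifiers).
EDGES = [("key:1", "sp:a"), ("key:1", "sp:b"), ("key:2", "sp:b"), ("key:3", "sp:d")]

def connected_components(edges):
    """Group nodes so that any two nodes in a group are joined by some path."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

components = connected_components(EDGES)
# Keep only components that connect two or more distinct keys.
multi_key = [c for c in components if sum(n.startswith("key:") for n in c) >= 2]
print(multi_key)
```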
Experiments: Identification Keys Analysis
Figure: identification keys 205 and 316 with their species, colored by family (same order and class); labels mark an identification key and the shared species.
- 205 - Key to the species of scorpionfishes occurring in the Western Central Pacific.
- 316 - Key to the species of Indo-Pacific Scorpionfish (Genus Scorpaenopsis).
Both keys can be unified!
Experiments: Taxonomic Classification Analysis
Data used:
- Class
- Order
- Family
- Genus
- Species
~ 163,000 edges and 44,500 nodes
Experiments: Taxonomic Classification Analysis
Goals:
- Compare data in FishGraph with data in a global graph (DBpedia)
- Find divergences
- Propose reviews
Experiments: Taxonomic Classification
Comparison of taxonomic classification between FishGraph and DBpedia
Total of species = 32,957
- Equal = 15.58% (5,136)
- Inconsistent = 22.62% (7,456)
- Not found = 61.79% (20,365)
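The three categories above can be illustrated with a small sketch, assuming each source is reduced to a dictionary from species to its (genus, family, order, class) tuple. The `LOCAL` and `REMOTE` dictionaries hold toy data and `compare` is a hypothetical helper; the actual comparison against DBpedia is not shown in the slides.

```python
from collections import Counter

# Hypothetical classifications: species -> (genus, family, order, class).
LOCAL = {
    "sp1": ("G1", "F1", "O1", "C1"),
    "sp2": ("G2", "F2", "O2", "C1"),
    "sp3": ("G3", "F3", "O3", "C2"),
    "sp4": ("G4", "F4", "O4", "C2"),
}
REMOTE = {
    "sp1": ("G1", "F1", "O1", "C1"),  # same classification in both sources
    "sp2": ("G2", "F9", "O2", "C1"),  # family differs between the sources
}

def compare(local, remote):
    """Count local species that are equal, inconsistent, or not found remotely."""
    counts = Counter()
    for species, classification in local.items():
        if species not in remote:
            counts["not found"] += 1
        elif remote[species] == classification:
            counts["equal"] += 1
        else:
            counts["inconsistent"] += 1
    return counts

counts = compare(LOCAL, REMOTE)
total = sum(counts.values())
for status in ("equal", "inconsistent", "not found"):
    print(f"{status}: {counts[status]} ({100 * counts[status] / total:.2f}%)")
```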
Conclusions
- Graph databases to perform network analyses
- One-way synchronization
- Annotations and connection with other sources on the Web
- Network-driven data analysis for knowledge discovery:
  - Identification keys
  - Taxonomic classification
Conclusions
Future work:
- Record the provenance of data obtained from web sources
- Organize "same as" nodes in the local graph
- Enable distinct graph mappings from one relational model
FishGraph: A Network-Driven Data Analysis
Acknowledgments: Unicamp, LIS members, FishBase Consortium, FAPESP, CNPq, CAPES
Thank you!
Patrícia Cavoto
UNICAMP University of Campinas, São Paulo, Brazil
patricia.cavoto@gmail.com