Knowledge discovery from biological Big Data: scalability issues
Marie-Dominique Devignes, Malika Smaïl, Emmanuel Bresso, Adrien Coulet, Chedy Raïssi, Amedeo Napoli
Université de Lorraine, LORIA laboratory and INRIA Nancy Grand-Est, Orpailleur team, Nancy, France
http://www.loria.fr
http://orpailleur.loria.fr/
From (big) data to knowledge
[Figure: Raw Data -> Information -> Knowledge (K) through KDD, leading to problem solving and decision making]
KDD: Knowledge Discovery from Databases, an iterative and interactive process
6/12/2013 Big Data Challenges and Opportunities
A «big data» story in the life sciences
Presented by Russ Altman (PharmGKB) in a YouTube EngX webinar at Stanford Engineering School, Nov 12, 2013
Data -> Information -> K (knowledge)
1. FDA Adverse Event Reporting System (FAERS)
2. Data subset related to eight classes of side effects
3. Supervised statistical machine learning
4. Correlation models:
  - Adverse reactions due to drug-drug interactions (DDI)
  - Adverse event interpretation in electronic medical records
Tatonetti et al. (2012) A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. JAMIA, 19:79-85
The KDD bottlenecks in the life sciences
- Data and data sources
  - Noisy, complex, heterogeneous, distributed, dynamic, etc.
  - Need for «knowledge/model-driven» data integration
- Data selection
  - Example and feature selection for machine learning
  - Need for guidelines
- Parameters of data mining programs
  - Experimental approach
  - Need for efficient execution platforms
- Pattern evaluation and interpretation
  - Big data mining can yield a big volume of patterns!
  - How to evaluate novelty, significance and consistency of a pattern at large scale?
Objectives of the talk
1. How do big data and biological databases cooperate?
2. How can bio-ontologies help in knowledge discovery?
3. Big data opportunities for the knowledge discovery process
Biological databases are Big Data
- More than 1500 biological databases today
- Curated data (not always)
- Complex schema
- Time-consuming update and integration
- UniProt statistics, Nov 2013:
  - SwissProt: > 542,000 sequences, 192 million amino acids
  - TrEMBL: > 48 million sequences, 15 billion amino acids
Biological data has been «big data» for years!
Semantic web* as emerging biological Big Data
- Linked Open Data (LOD)
  - Interconnected data
  - Freely accessible on the web
- RDF: Resource Description Framework
  - Triples {Subject, Property, Object}
  - URI (Uniform Resource Identifier)
- Bio2RDF project: 1 Tera triple graph in July 2013
[Figure: example RDF graph linking KeggPathway hsa:nnn, KeggGene hsa:ggg, Uniprot sp:ppp and Interpro ipr:ddd through the properties Has_gene, See_Also/Xref and Has_domain]
*The Semantic Web is "a group of technologies to allow computers to autonomously process information resources without human intervention by annotating the meaning or 'semantics' to them" (term coined by Tim Berners-Lee in 1998).
From databases to RDF triples
- «RDFization» of database contents
  - Database fact -> RDF triple
  - Database -> Graph
- e.g. a protein P:pppp containing a domain D:ddd = {Uniprot sp:ppp, Has_domain, Interpro ipr:ddd}
- «EBI SPARQL endpoint»
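The «RDFization» step above can be illustrated with a minimal sketch that turns (protein, domain) database rows into N-Triples lines. The namespace URIs and the `rdfize` helper are hypothetical placeholders chosen for illustration; they are not the URIs actually used by Bio2RDF or the EBI RDF platform.

```python
# Hypothetical namespaces for illustration only; real Bio2RDF/EBI URIs differ.
UNIPROT = "http://example.org/uniprot/"
INTERPRO = "http://example.org/interpro/"
HAS_DOMAIN = "http://example.org/vocab/Has_domain"


def rdfize(rows):
    """Turn (protein_id, domain_id) database rows into N-Triples lines.

    Each database fact becomes one {Subject, Property, Object} triple.
    """
    triples = []
    for protein_id, domain_id in rows:
        subject = f"<{UNIPROT}{protein_id}>"
        obj = f"<{INTERPRO}{domain_id}>"
        triples.append(f"{subject} <{HAS_DOMAIN}> {obj} .")
    return triples


# One database fact: protein ppp contains domain ddd.
rows = [("ppp", "ddd")]
triples = rdfize(rows)
print(triples[0])
```

A full conversion would also need typed literals and cross-reference properties, which this sketch leaves out.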
Cooperation between LOD and databases
- Classical databases can provide reliable curated information to complement and enrich information extracted from LOD
- Project EXPLOD-BioMed (Adrien Coulet): exploring LOD for the purpose of mining biomedical data
  - Collect data about the genes responsible for intellectual disability
  - Use Bio2RDF or EBI/RDF SPARQL endpoints
  - Incomplete «RDFization» -> complete the datasets by querying classical databases + RDF representation of results
  - Store retrieved RDF triples in a triple store
  - Or go back to a relational DB (!) for easy design of KDD workflows using Knowledge Discovery Environments (such as KNIME)
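As a rough sketch of the «back to a relational DB» option, retrieved triples can be loaded into a single three-column table and queried with plain SQL. The table name `triple`, the helper functions and the sample identifiers are assumptions made for this example; the project's actual schema is not described in the slides.

```python
import sqlite3


def load_triples(conn, triples):
    """Store (subject, property, object) tuples in one relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triple "
        "(subject TEXT, property TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)
    conn.commit()


def objects_of(conn, subject, prop):
    """Return all objects linked to a subject through a given property."""
    cur = conn.execute(
        "SELECT object FROM triple "
        "WHERE subject = ? AND property = ? ORDER BY object",
        (subject, prop),
    )
    return [row[0] for row in cur]


# In-memory database with two illustrative triples.
conn = sqlite3.connect(":memory:")
load_triples(
    conn,
    [
        ("sp:ppp", "Has_domain", "ipr:ddd"),
        ("hsa:nnn", "Has_gene", "hsa:ggg"),
    ],
)
domains = objects_of(conn, "sp:ppp", "Has_domain")
print(domains)
```

A table of this shape can then be read by a workflow environment through a standard database connector.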
Flexibility versus Semantics: research opportunities
- Moving from relational DB to NoSQL storage systems
  - Schema-less data -> lack of documentation, loss of semantics
  - New management systems to be invented
- Analytic tools need to be adapted to such systems
  - e.g. Mahout, MOA, PEGASUS
Fayyad UM (2012) Big data everywhere and No SQL in sight. SIGKDD Explorations, 14:i-ii
Fan W and Bifet A (2012) Mining big data: current status and forecast to the future. SIGKDD Explorations, 14:1-5
Objectives of the talk
1. How do big data and biological databases cooperate?
2. How can bio-ontologies help in knowledge discovery?
3. Big data opportunities for the knowledge discovery process
KDDK: Knowledge Discovery guided by Domain Knowledge in the Orpailleur team
[Figure: KDDK loop. Data from DB1, DB2, DB3, etc. go through (1) data extraction and formatting (data integration), (2) data mining, and (3) result interpretation, with domain knowledge held in a Knowledge Base (KB)]
Coulet et al. (2011) Ontology-based knowledge discovery in pharmacogenomics. Adv Exp Med Biol. 696:357-66
Bio-ontologies, an asset in the life sciences
- Ontologies = knowledge representation
  - From hierarchical vocabularies, e.g. MeSH, MedDRA, GO, SNOMED, ICD
  - To logical representation of concepts and relationships, e.g. SIO (Semanticscience Integrated Ontology), UMLS Semantic Types, SOPharm
- Usages (semantic web technologies)
  - Model layer of knowledge bases
  - Semantic enrichment, e.g. Onto-Tools, IntelliGO
  - Cross-resource data retrieval, e.g. NCBO Resource Index
National Center for Biomedical Ontology (NCBO): BioPortal
Bio-ontologies and LOD exploration
- 366 bio-ontologies at the NCBO BioPortal: 6 million concepts
- 39 biological resources (UniProt, GO, ArrayExpress, GEO, PharmGKB, etc.): 5 million records
- 24.8 billion annotations
(Jonquet C et al. (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324)
Statistics updated November 2013
Exploring resources with the Resource Index
Bio-ontologies and dimension reduction
- Big data often mean high-dimensional data
- Statistical methods for feature selection
  - Many possible methods
- Clustering similar features using a terminology and a semantic similarity measure
  - E.g. semantic clustering of 1288 MedDRA adverse effect terms -> 112 term clusters
  - Enables execution of symbolic data mining methods such as frequent itemset search
Bresso et al. (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics, 14(1):207
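The idea of replacing individual terms by term clusters before frequent itemset search can be sketched as follows. The term-to-cluster mapping, the cluster names and the toy side-effect profiles are invented for illustration; in the cited work the clusters come from a semantic similarity measure over MedDRA, which is not reproduced here. The itemset search is a naive enumeration, not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping from adverse effect terms to term clusters.
TERM_TO_CLUSTER = {
    "nausea": "GI",
    "vomiting": "GI",
    "rash": "SKIN",
    "pruritus": "SKIN",
    "headache": "NEURO",
}


def reduce_profile(terms):
    """Replace each term by its cluster, shrinking the feature space."""
    return frozenset(TERM_TO_CLUSTER[t] for t in terms if t in TERM_TO_CLUSTER)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those reaching min_support."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for subset in combinations(sorted(items), size):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Toy side-effect profiles of three drugs (illustrative data).
profiles = [["nausea", "rash"], ["vomiting", "pruritus"], ["nausea", "headache"]]
reduced = [reduce_profile(p) for p in profiles]
frequent = frequent_itemsets(reduced, min_support=2)
print(frequent)
```

On the raw terms no pair of side effects occurs twice in this toy data, whereas at the cluster level the pair (GI, SKIN) does, which is the effect the dimension reduction is meant to achieve.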
Objectives of the talk
1. How do big data and biological databases cooperate?
2. How can bio-ontologies help in knowledge discovery?
3. Big data opportunities for the knowledge discovery process
Big data as a reservoir of data for validating hypotheses and models
- Huge data sets become available for mining
  - The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference (Edd Dumbill, Big Data Now, 2012)
  - Adverse events -> grouping medical records from different hospitals is useful to enlarge the dataset
- Data mining often generates more than one model, sometimes a huge amount of patterns
  - Training set requires integrated curated data
  - Test set can be extracted from big data -> statistical evaluation
The critical «Vs» of Big Data in the Life Sciences
- Variety and variability
  - New data types provided by high-throughput technologies (OMICS data but also images from microscopy devices)
- Value
  - FAERS and drug-drug interactions -> better control of drug treatments
  - Individual genomes -> personalized medicine
- Veracity
  - Multiple source integration means detecting and managing possible inconsistencies
  - Quality and provenance metadata in the LOD
  - Bio2RDF uses Dublin Core metadata triples and calculates 9 metrics for each dataset
  - Popularity ranking, cross-reference degree
New paradigms for knowledge discovery
- Cooperation between symbolic and statistical methods
  - Statistical feature selection before symbolic data mining
  - Automatic filtering and/or ranking of patterns using statistical significance measurements before expert interpretation
- Adaptive learning systems
  - Label propagation on big data objects
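One possible way to filter and rank patterns by statistical significance before expert interpretation is a one-sided hypergeometric test on the co-occurrence of two items. This is only a sketch under that assumption; the slides do not say which significance measure is used, and the pattern names and counts below are invented.

```python
from math import comb


def hypergeom_tail(k, n_total, n_a, n_b):
    """P(X >= k), where X is the number of records containing both A and B
    if A (n_a records) and B (n_b records) were independent among n_total."""
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, i) * comb(n_total - n_a, n_b - i) for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_b)


def rank_patterns(patterns, n_total, alpha=0.05):
    """Keep patterns with p-value <= alpha, most significant first.

    Each pattern is (name, co-occurrence count, count of A, count of B).
    """
    scored = [
        (name, hypergeom_tail(k, n_total, n_a, n_b))
        for name, k, n_a, n_b in patterns
    ]
    return sorted((s for s in scored if s[1] <= alpha), key=lambda s: s[1])


# Two illustrative patterns mined from 10 records.
patterns = [("A&B", 3, 5, 5), ("C&D", 5, 5, 5)]
ranked = rank_patterns(patterns, n_total=10)
print(ranked)
```

With many patterns tested at once, a multiple-testing correction would normally be applied to the p-values before filtering; it is omitted here for brevity.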
Other projects in the Orpailleur team
- Research projects
  - Parallelization of CORON tools (http://coron.loria.fr), a suite of tools for symbolic data mining and formal concept analysis
  - Text mining (ANR Hybride: http://hybride.loria.fr/), collaboration with Orphanet
  - Graph mining for chemical reactions: Pennerath F, Niel G, Vismara P, Jauffret P, Laurenço C, Napoli A (2010) A graph-mining algorithm for the evaluation of bond formability. Journal of Chemical Information and Modeling, 50:221-239
  - Spatio-temporal mining of agronomical data: Mari JF, Lazrak E-G, Benoît M (2013) Time space stochastic modelling of agricultural landscapes for environmental issues. Environmental Modelling and Software 46:219-227
- Education: TELECOM Nancy (http://www.telecomnancy.eu/)
  - Training engineers as «Data Scientists», Masters level: Ingénierie et Applications des Masses de Données (IAMD)
Conclusion
- LOD and biological databases can cooperate in the KDD process
- Bio-ontologies are a major asset in the Life Sciences
  - For data exploration
  - For dimension reduction
- Semantic web technology scales up at the RDF level
  - But not yet at the OWL and reasoning level
- HPC platforms and programs can process big data
  - But Knowledge Discovery remains a human-guided process
References
Big Data Now. O'Reilly Media Inc., 1st edition, October 2012, www.it-ebooks.info (123 p.)
Bresso E, Grisoni R, Marchetti G, Karaboga AS, Souchet M, Smaïl-Tabbone M (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics. 14:207
Callahan A, Cruz-Toledo J, Dumontier M (2013) Ontology-Based Querying with Bio2RDF's Linked Open Data. J Biomed Semantics. 15:4
Coakley MF, Leerkes MR, Barnett J, Gabrielian AE, Noble K, Weber MN and Huyen Y (2013) Unlocking the power of big data at the NIH (Meeting Report). Big Data, September 2013, 183-186
Coulet A, Smaïl-Tabbone M, Napoli A, Devignes MD (2011) Ontology-based knowledge discovery in pharmacogenomics. Adv Exp Med Biol. 696:357-66
Fan W and Bifet A (2012) Mining big data: current status and forecast to the future. SIGKDD Explorations, 14:1-5
Fayyad U (2012) Big data everywhere and No SQL in sight. SIGKDD Explorations, 14:i-ii
Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N and Kolker E (2013) Unraveling the complexities of life sciences data. Big Data, March 2013, 42-50
Hoehndorf R, Dumontier M and Gkoutos G (2012) Evaluation of research in biomedical ontologies. Briefings in Bioinformatics. Sept 8, 2012, 1-17
Jonquet C, Lependu P, Falconer S, Coulet A, Noy NF, Musen MA, Shah NH (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324
Tatonetti NP, Fernald GH, Altman RB (2012) A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc. 19:79-85
Thank you for your attention!