Knowledge discovery from biological Big Data : scalability issues




Knowledge discovery from biological Big Data: scalability issues
Marie-Dominique Devignes, Malika Smaïl, Emmanuel Bresso, Adrien Coulet, Chedy Raïssi, Amedeo Napoli
Université de Lorraine, LORIA laboratory and INRIA Nancy Grand-Est, Orpailleur team, Nancy, France
http://www.loria.fr
http://orpailleur.loria.fr/

From (big) data to knowledge
[Diagram: Raw Data, Information, Knowledge (K); KDD leads from data to knowledge, which supports problem solving and decision making]
KDD: Knowledge Discovery from Databases, an iterative and interactive process
6/12/2013 Big Data Challenges and Opportunities

A «big data» story in the life sciences
Presented by Russ Altman (PharmGKB) in a YouTube EngX webinar at Stanford Engineering School, Nov 12, 2013
1. FDA Adverse Event Reporting System (FAERS)
2. Data subset related to eight classes of side effects
3. Supervised statistical machine learning
4. Correlation models
  Adverse reactions due to Drug-Drug Interactions (DDI)
  Adverse event interpretation in electronic medical records
[Diagram labels: Data, Information, K]
Tatonetti et al., A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. JAMIA, 19:79-85, 2012

The KDD bottlenecks in the life sciences
Data and data sources
  Noisy, complex, heterogeneous, distributed, dynamic, etc.
  Need for «knowledge/model-driven» data integration
Data selection
  Example and feature selection for machine learning
  Need for guidelines
Parameters of data mining programs
  Experimental approach
  Need for efficient execution platforms
Pattern evaluation and interpretation
  Big data mining can yield a big volume of patterns!
  How to evaluate the novelty, significance and consistency of a pattern at large scale?

Objectives of the talk
1. How do big data and biological databases cooperate?
2. How can bio-ontologies help in knowledge discovery?
3. Big data opportunities for the knowledge discovery process

Biological databases are Big Data
More than 1500 biological databases today
  Curated data (not always)
  Complex schema
  Time-consuming update and integration
UniProt statistics, Nov 2013
  SwissProt: > 542 KiloSeq for 192 MegaAA
  TrEMBL: > 48 MegaSeq for 15 GigaAA
Biological data has been «big data» for years!

Semantic web* as emerging biological Big Data
Linked Open Data (LOD)
  Interconnected data
  Freely accessible on the web
RDF: Resource Description Framework
  {Subject, Property, Object}
  URI (Uniform Resource Identifier)
Bio2RDF project
  1 Tera triple graph in July 2013
[Graph figure: nodes KeggPathway hsa:nnn, KeggGene hsa:ggg, Uniprot sp:ppp, Interpro ipr:ddd; edges Has_gene, See_Also/Xref, Has_domain]
*The Semantic Web is a group of technologies that allow computers to process information resources autonomously, without human intervention, by annotating them with their meaning or "semantics" (term coined by Tim Berners-Lee in 1998).
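The graph on this slide can be followed by matching triple patterns, which is the basic idea behind a SPARQL query. The sketch below is a minimal in-memory illustration, not the Bio2RDF interface: the edge directions (pathway to gene to protein to domain), the helper names `match` and `domains_of_pathway`, and the use of `Xref` as the gene-to-protein link are assumptions made for the example.

```python
# A tiny in-memory triple store: each fact is (subject, property, object).
# Identifiers reuse the placeholders from the slide; edge directions are assumed.
GRAPH = [
    ("hsa:nnn", "Has_gene", "hsa:ggg"),    # KeggPathway -> KeggGene
    ("hsa:ggg", "Xref", "sp:ppp"),         # KeggGene -> Uniprot
    ("sp:ppp", "Has_domain", "ipr:ddd"),   # Uniprot -> Interpro
]

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def domains_of_pathway(triples, pathway):
    """Follow pathway -> gene -> protein -> domain links."""
    domains = set()
    for _, _, gene in match(triples, s=pathway, p="Has_gene"):
        for _, _, protein in match(triples, s=gene, p="Xref"):
            for _, _, domain in match(triples, s=protein, p="Has_domain"):
                domains.add(domain)
    return domains

result = domains_of_pathway(GRAPH, "hsa:nnn")
```

Each nested loop corresponds to one triple pattern; a real triple store would evaluate the same join with indexes rather than list scans.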

From databases to RDF triples
«RDFization» of database contents
  Database fact -> RDF triple
  Database -> Graph
e.g. a protein P:pppp containing a domain D:ddd
  {Uniprot sp:ppp, Has_domain, Interpro ipr:ddd}
(figure label: «EBI SPARQL endpoint»)
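As a rough illustration of «RDFization», the sketch below turns relational rows into {Subject, Property, Object} triples. The function `rdfize`, the column names and the `sp`/`ipr` prefixes are hypothetical choices for this example; real RDFization pipelines also mint full URIs, which is omitted here.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    prop: str
    obj: str

def rdfize(rows, subject_col, subject_prefix, mappings):
    """Turn relational rows into triples.

    mappings maps a column name to (property, object_prefix).
    Missing (None) values produce no triple.
    """
    triples = []
    for row in rows:
        subject = f"{subject_prefix}:{row[subject_col]}"
        for column, (prop, object_prefix) in mappings.items():
            value = row.get(column)
            if value is None:
                continue
            triples.append(Triple(subject, prop, f"{object_prefix}:{value}"))
    return triples

# Hypothetical table of protein/domain facts, using the slide's placeholders.
rows = [
    {"protein": "ppp", "domain": "ddd"},
    {"protein": "qqq", "domain": None},   # no domain recorded
]
triples = rdfize(rows, "protein", "sp", {"domain": ("Has_domain", "ipr")})
```

One database fact becomes one triple, and the whole table becomes a graph whose nodes are the prefixed identifiers.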

Cooperation between LOD and databases
Classical databases can provide reliable curated information to complement and enrich information extracted from LOD
Project EXPLOD-BioMed (Adrien Coulet)
  Exploring LOD for the purpose of mining biomedical data
  Collect data about the genes responsible for intellectual disability
  Use Bio2RDF or EBI/RDF SPARQL endpoints
  Incomplete «RDFization» -> complete the datasets by querying classical databases + RDF representation of results
  Store the retrieved RDF triples in a triple store
  Or go back to a relational DB (!) for easy design of KDD workflows using Knowledge Discovery Environments (such as KNIME)
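The two steps described above (completing an incomplete set of triples from a curated source, then going back to a relational layout) can be sketched as follows. This is a simplified illustration under stated assumptions: the curated database is stood in for by a plain dictionary, and the function names, gene identifiers and property names are invented for the example.

```python
from collections import defaultdict

def complete_triples(triples, curated, prop):
    """Add (subject, prop, value) for subjects that lack prop in the LOD
    triples but have a value in the curated source (here a dict)."""
    subjects = {s for s, _, _ in triples}
    have = {s for s, p, _ in triples if p == prop}
    extra = [(s, prop, curated[s]) for s in sorted(subjects - have) if s in curated]
    return list(triples) + extra

def triples_to_rows(triples, columns):
    """Pivot triples into one row per subject, one column per property.
    Multiple values for the same property are joined with ';'."""
    table = defaultdict(dict)
    for s, p, o in triples:
        if p in columns:
            table[s].setdefault(p, []).append(o)
    return [
        {"subject": s, **{c: ";".join(sorted(table[s].get(c, []))) for c in columns}}
        for s in sorted(table)
    ]

# Hypothetical triples retrieved from a SPARQL endpoint; gene:g2 lacks a domain.
lod = [
    ("gene:g1", "Has_pathway", "hsa:001"),
    ("gene:g1", "Has_domain", "ipr:ddd"),
    ("gene:g2", "Has_pathway", "hsa:002"),
]
curated = {"gene:g2": "ipr:eee"}   # stand-in for a classical database lookup

completed = complete_triples(lod, curated, "Has_domain")
rows = triples_to_rows(completed, ["Has_pathway", "Has_domain"])
```

The resulting rows have a fixed set of columns, which is the shape that table-oriented workflow tools expect as input.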

Flexibility versus Semantics: research opportunities
Moving from relational DB to NoSQL storage systems
  Schema-less data -> lack of documentation, loss of semantics
  New management systems to be invented
Analytic tools need to be adapted to such systems
  Mahout, MOA, PEGASUS
Fayyad UM (2012) Big data everywhere and No SQL in sight. SIGKDD explorations, 14: i-ii
Fan W and Bifet A (2012) Mining big data: current status and forecast to the future. SIGKDD explorations, 14: 1-5

Objectives of the talk (part 2)
2. How can bio-ontologies help in knowledge discovery?

KDDK: Knowledge Discovery guided by Domain Knowledge in the Orpailleur team
[Diagram: databases DB1, DB2, DB3, etc. feed a three-step process guided by domain knowledge held in a Knowledge Base (KB): 1. Data extraction and formatting (data integration), 2. Data mining, 3. Result interpretation]
Coulet et al. Ontology-based knowledge discovery in pharmacogenomics. Adv Exp Med Biol. 2011;696:357-66

Bio-ontologies, an asset in the life sciences
Ontologies = knowledge representation
  From hierarchical vocabularies, e.g. MeSH, MedDRA, GO, SNOMED, ICD
  To logical representation of concepts and relationships, e.g. SIO (Semanticscience Integrated Ontology), UMLS Semantic Types, SOPharm
Usages (semantic web technologies)
  Model layer of knowledge bases
  Semantic enrichment, e.g. Onto-Tools, IntelliGO
  Cross-resource data retrieval, e.g. NCBO Resource Index

National Center for Biomedical Ontology (NCBO): BioPortal

Bio-ontologies and LOD exploration
366 bio-ontologies at the NCBO BioPortal: 6 Mega concepts
39 biological resources (UniProt, GO, ArrayExpress, GEO, PharmGKB, etc.): 5 Mega records
24.8 Giga annotations
(Jonquet C et al. (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324)
Statistics updated November 2013

Exploring resources with the Resource Index

Bio-ontologies and dimension reduction
Big data often mean high-dimensional data
Statistical methods for feature selection
  Many possible methods
Clustering similar features using a terminology and a semantic similarity measure
  E.g. semantic clustering of 1288 MedDRA adverse effect terms -> 112 term clusters
  Enables execution of symbolic data mining methods such as frequent itemset search
Bresso et al. (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics, 14(1):207.
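The following toy sketch shows why clustering terms before mining helps: patterns that are too rare at the term level become frequent at the cluster level. It is not the method of Bresso et al.; the similarity measure (Jaccard overlap of ancestor sets), the single-link merging, the brute-force itemset counting and the four-term hierarchy are all assumptions chosen to keep the example short.

```python
from itertools import combinations

def ancestors(term, parent):
    """Term plus all its ancestors in a child -> parent hierarchy."""
    seen = {term}
    while term in parent:
        term = parent[term]
        seen.add(term)
    return seen

def similarity(a, b, parent):
    """Jaccard overlap of ancestor sets (one simple semantic similarity)."""
    sa, sb = ancestors(a, parent), ancestors(b, parent)
    return len(sa & sb) / len(sa | sb)

def cluster_terms(terms, parent, threshold):
    """Single-link clustering: merge two clusters whenever any pair of
    their terms is at least `threshold` similar. Returns term -> representative."""
    rep = {t: t for t in terms}
    def find(t):
        while rep[t] != t:
            t = rep[t]
        return t
    for a, b in combinations(sorted(terms), 2):
        if similarity(a, b, parent) >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                rep[max(ra, rb)] = min(ra, rb)
    return {t: find(t) for t in terms}

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force count of itemsets up to max_size (fine for toy data only)."""
    counts = {}
    for items in transactions:
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(items), k):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {i: c for i, c in counts.items() if c >= min_support}

# Hypothetical mini-hierarchy and drug side-effect profiles (not real MedDRA data).
parent = {"nausea": "gastro", "vomiting": "gastro", "rash": "skin",
          "itching": "skin", "gastro": "root", "skin": "root"}
profiles = {"drugA": {"nausea", "rash"},
            "drugB": {"vomiting", "itching"},
            "drugC": {"nausea", "itching"}}

mapping = cluster_terms({"nausea", "vomiting", "rash", "itching"}, parent, 0.5)
reduced = [{mapping[t] for t in terms} for terms in profiles.values()]

raw_patterns = frequent_itemsets(list(profiles.values()), min_support=3)
cluster_patterns = frequent_itemsets(reduced, min_support=3)
```

With a support threshold of 3, no itemset is frequent on the raw terms, while the pair of clusters (gastro-intestinal, skin) is shared by all three profiles after reduction.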

Objectives of the talk (part 3)
3. Big data opportunities for the knowledge discovery process

Big data as a reservoir of data for validating hypotheses and models
Huge data sets become available for mining
  «The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference» (Edd Dumbill, Big Data Now, 2012)
  Adverse events -> grouping medical records from different hospitals is useful to enlarge the dataset
Data mining often generates more than one model, sometimes a huge number of patterns
  Training set requires integrated curated data
  Test set can be extracted from big data -> statistical evaluation

The critical «Vs» of Big Data in the Life Sciences
Variety and variability
  New data types provided by high-throughput technologies (OMICS data but also images from microscopy devices)
Value
  FAERS and drug-drug interaction -> better control of drug treatments
  Individual genomes -> personalized medicine
Veracity
  Multiple source integration means detecting and managing possible inconsistencies
  Quality and provenance metadata in the LOD
    Bio2RDF uses Dublin Core metadata triples and calculates 9 metrics for each dataset
    Popularity ranking, cross-reference degree

New paradigms for knowledge discovery
Cooperation between symbolic and statistical methods
  Statistical feature selection before symbolic data mining
  Automatic filtering and/or ranking of patterns using statistical significance measurements before expert interpretation
Adaptive learning systems
  Label propagation on big data objects
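One possible way to filter and rank patterns before expert interpretation is sketched below. The slide does not name a specific significance measure; this example assumes a one-sided hypergeometric test on the co-occurrence of two items, with a Bonferroni correction for the number of patterns tested. The pattern names and counts are invented.

```python
from math import comb

def hypergeom_pvalue(n_total, n_a, n_b, n_ab):
    """Probability of observing at least n_ab co-occurrences of A and B
    by chance, given n_a records with A and n_b records with B
    among n_total records (one-sided hypergeometric tail)."""
    upper = min(n_a, n_b)
    denom = comb(n_total, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / denom

def rank_patterns(patterns, n_total, alpha=0.01):
    """Keep patterns significant after Bonferroni correction,
    sorted from most to least significant."""
    threshold = alpha / len(patterns)
    scored = [
        (name, hypergeom_pvalue(n_total, n_a, n_b, n_ab))
        for name, n_a, n_b, n_ab in patterns
    ]
    kept = [(name, p) for name, p in scored if p <= threshold]
    return sorted(kept, key=lambda item: item[1])

# Hypothetical patterns: (name, records with A, records with B, records with both).
patterns = [
    ("drugX+drugY -> eventZ", 10, 10, 8),    # far more overlap than chance
    ("drugU -> eventV", 50, 50, 25),         # overlap equal to the chance expectation
]
ranked = rank_patterns(patterns, n_total=100)
```

Only the first pattern survives the filter, so the expert sees a short ranked list instead of every pattern the miner produced.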

Other projects in the Orpailleur team
Research projects
  Parallelization of CORON tools (http://coron.loria.fr)
    A suite of tools for symbolic data mining and formal concept analysis
  Text mining (ANR Hybride: http://hybride.loria.fr/)
    Collaboration with Orphanet
  Graph mining for chemical reactions
    Pennerath F, Niel G, Vismara P, Jauffret P, Laurenço C, Napoli A (2010) A graph-mining algorithm for the evaluation of bond formability. Journal of Chemical Information and Modeling, 50:221-239.
  Spatio-temporal mining of agronomical data
    Mari JF, Lazrak E-G, Benoît M (2013) Time space stochastic modelling of agricultural landscapes for environmental issues. Environmental Modelling and Software 46:219-227
Education: TELECOM Nancy (http://www.telecomnancy.eu/)
  Training engineers as «Data Scientists», Masters level
  Ingénierie et Applications des Masses de Données (IAMD)

Conclusion
LOD and biological databases can cooperate in the KDD process
Bio-ontologies are a major asset in the Life Sciences
  For data exploration
  For dimension reduction
Semantic web technology scales up at the RDF level
  But not yet at the OWL and reasoning level
HPC computing and programs can process big data
  But Knowledge Discovery remains a human-guided process

References
Big Data Now. O'Reilly Media Inc. 1st edition, October 2012, www.it-ebooks.info (123 p.)
Bresso E, Grisoni R, Marchetti G, Karaboga AS, Souchet M, Smaïl-Tabbone M (2013) Integrative relational Machine-Learning Approach for Understanding Drug Side-Effect Profiles. BMC Bioinformatics. 14:207.
Callahan A, Cruz-Toledo J, Dumontier M (2013) Ontology-Based Querying with Bio2RDF's Linked Open Data. J Biomed Semantics. 15:4
Coakley MF, Leerkes MR, Barnett J, Gabrielian AE, Noble K, Weber MN and Huyen Y. Unlocking the power of big data at the NIH (Meeting Report). Big Data September 2013 183-186.
Coulet A, Smaïl-Tabbone M, Napoli A, Devignes MD (2011) Ontology-based knowledge discovery in pharmacogenomics. Adv Exp Med Biol. 696:357-66
Fan W and Bifet A (2012) Mining big data: current status and forecast to the future. SIGKDD explorations, 14:1-5
Fayyad U (2012) Big data everywhere and No SQL in sight. SIGKDD explorations, 14: i-ii
Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N and Kolker E (2013) Unraveling the complexities of life sciences data. Big Data March 2013 42-50
Hoehndorf R, Dumontier M and Gkoutos G (2012) Evaluation of research in biomedical ontologies. Briefings in Bioinformatics. Sept 8, 2012, 1-17.
Jonquet C, Lependu P, Falconer S, Coulet A, Noy NF, Musen MA, Shah NH (2011) NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources. Web Semantics 9:316-324.
Tatonetti NP, Fernald GH, Altman RB (2012) A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc. 19:79-85

Thank you for your attention!