BIG DATA EUROPE Integrating Big Data, Software & Communities for Addressing Europe's Societal Challenges
Partners
Mission
Lower the barrier for using big data technologies
o Required effort and resources
o Required data science skills
Assist in establishing cross-lingual/organizational/domain Data Value Chains
Show societal value of Big Data
www.big-data-europe.eu 16 March 2015
cross-lingual / cross-organizational / cross-domain
Table columns: Societal Domain; Preliminary Big Data Focus area; Selected Key Data assets

Societal Domain: Life Sciences & Health
Focus area: Heterogeneous data linking & integration; Biomedical Semantic Indexing & QA
Key data assets: ACD Labs / ChemSpider, ChEBI, ChEMBL, ConceptWiki, DrugBank, ENZYME, Gene Ontology, GO Annotation, SwissProt, UniProt, WikiPathways, PubMed, MeSH, Disease Ontology (DO), Joint Chemical Dictionary (Jochem), BioASQ datasets

Societal Domain: Food & Agriculture
Focus area: Large-scale distributed data integration
Key data assets: INFOODS, AQUASTAT, Green Learning Network (GLN), Agricultural Bibliography Network (ABN), AGRIS, AquaMaps, Fishbase

Societal Domain: Energy
Focus area: Real-time monitoring, stream processing, data analytics, and decision support
Key data assets: European Energy Exchange Data, smart meter measurement data, gas/fuels/energy market/price data, consumption statistics, equipment condition monitoring data

Societal Domain: Transport
Focus area: Streaming sensor network & geo-spatial data integration
Key data assets: GTFS data, OSM / LinkedGeoData, MobilityMaps, transport sensor data, ROSATTE road safety attributes, European Road Data Infrastructure (EuroRoadS)

Societal Domain: Climate
Focus area: Real-time monitoring, stream processing, and data analytics
Key data assets: European Grid Infrastructure (EGI), databases hosting atmospheric data, several software frameworks for simulation, calibration and reconstruction

Societal Domain: Social Sciences
Focus area: Statistical and research data linking & integration
Key data assets: Federated social sciences data catalogs, statistical data from public data portals and statistical offices (e.g. EuroStats, UNESCO, WorldBank)

Societal Domain: Security
Focus area: Real-time monitoring, stream processing, and data analytics; image data analysis
Key data assets: Earth Observation data (e.g. Very High Resolution Satellite Imagery acquired from commercial providers and governmental systems) and collateral data for supporting CFSP/CSDP missions and operations, databases hosting atmospheric data, experimental and simulation data concerning dispersion of hazardous substances
Project Summary
Two clearly defined coordination and support measures:
Coordination:
o Engaging with a diverse range of stakeholder groups, particularly those representing the Horizon 2020 societal challenges Health, Food & Agriculture, Energy, Transport, Climate, Social Sciences and Security;
o Collecting requirements for the ICT infrastructure needed by data-intensive science practitioners tackling a wide range of societal challenges, covering all aspects of publishing and consuming semantically interoperable, large-scale data and knowledge assets;
Support:
o Designing, realizing and evaluating a Big Data Aggregator platform infrastructure that meets requirements, minimises disruption to current workflows, and maximises the opportunities to take advantage of the latest European RTD developments (incl. multilingual data harvesting, data analytics & visualisation).
BigDataEurope will implement and apply two main instruments to successfully realize these measures:
o Build Societal Big Data Interest Groups in the W3C interest group scheme, involving a large number of stakeholders from the Horizon 2020 societal challenges as well as technical Big Data experts;
o Design, integrate and deploy a cloud-deployment-ready Big Data aggregator platform comprising key open-source Big Data technologies for real-time and batch processing, such as Hadoop, Cassandra and Storm.
Orthogonal Dimensions of Big Data Ecosystems
[Figure: two orthogonal dimensions.
Societal Challenges (Domain Specific Data Assets & Technology): Healthcare, Food Security, Energy, Intelligent Transport, Climate & Environment, Inclusive & Reflective Societies, Secure Societies.
Data Value Chain (Generic Big Data Enabling Technologies): Data Generation & Acquisition, Data Analysis & Processing, Data Storage & Curation, Data Visualization & Usage, Data-driven Services.]
BigDataEurope Platform
Work Packages & Implementation Phases
Timeline: M1-M12, M13-M24, M25-M36
Community Building: WP2 Community Building & Requirements
Enabling Technologies: WP3 Big Data Generic Enabling Technologies & Architecture
Component Integration: WP4 Big Data Integrator Platform
Integrator Deployment: WP5 Big Data Integrator Instances
Community Assessment: WP6 Real-life Deployment & User Evaluation
Uptake: WP7 Dissemination & Communication
BDE platform covers the complete data landscape
o Data processing with human-organized information
o Similar data processing steps applied to a large quantity of data
o Similar data processing steps applied to a stream of data
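The idea of applying the same processing step to a large quantity of data and to a stream can be sketched in a few lines. This is a minimal illustration only: the record layout and the names `normalise`, `process_batch` and `process_stream` are hypothetical and not part of the BDE platform, which the slides describe as building on technologies such as Hadoop and Storm.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def normalise(record: dict) -> dict:
    """One processing step (hypothetical): tidy up the 'name' field."""
    return {**record, "name": record["name"].strip().lower()}


def process_batch(records: list[dict]) -> list[dict]:
    """Bulk mode: the whole collection is available up front."""
    return [normalise(r) for r in records]


def process_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Streaming mode: records are handled one at a time as they arrive."""
    for r in records:
        yield normalise(r)
```

Both modes share `normalise`, so the step is written once and only the way records are delivered differs.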
Blueprint BDE platform
[Figure: architecture diagram. Components shown: Background know-how, Background aggregator, Bulk database, Bulk data aggregator, Real time aggregator, aggregated data, Dataset Meta data, Dissemination storage, Search index, Dissemination API, Reporting API. Interfaces and formats shown: SPARQL, JSON-LD, JSON, LOD search.]
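Blueprint BDE platform: Deployment
[Figure: the same architecture diagram, with a Deployment element added. Components shown: Background know-how, Background aggregator, Bulk database, Bulk data aggregator, Real time aggregator, aggregated data, Dataset Meta data, Dissemination storage, Search index, Dissemination API, Reporting API. Interfaces and formats shown: SPARQL, JSON-LD, JSON, LOD search.]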
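The blueprint names JSON-LD and dataset metadata among the platform outputs, but the slides do not say which vocabulary is used. As a sketch under that caveat, the function below builds a dataset description as JSON-LD using the W3C DCAT and Dublin Core Terms vocabularies. The choice of vocabulary, the function name `dataset_metadata` and the example.org identifier are assumptions made for illustration.

```python
import json


def dataset_metadata(identifier: str, title: str, keywords: list) -> dict:
    """Return a JSON-LD description of one dataset (assumed DCAT vocabulary)."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": identifier,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:keyword": list(keywords),
    }


# Hypothetical dataset; the identifier is a placeholder, not a real resource.
record = dataset_metadata(
    "http://example.org/dataset/smart-meter-readings",
    "Smart meter measurement data",
    ["energy", "smart meter"],
)
serialised = json.dumps(record, indent=2)
```

Because JSON-LD is plain JSON with a `@context`, the same record can be served to clients that only understand JSON and to clients that interpret it as linked data.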
Coordination
Networking partners
o Health, demographic change and wellbeing
o Food, Agriculture, Forestry, Water and Bioeconomy
o Inclusive, innovative and Reflective Societies
o Secure, clean and efficient energy
o Climate, environment, resource efficiency and raw materials
o Smart, green and integrated transport
o Secure Societies
Envisioned societal stakeholder engagement cycle
Community building and supporting
Establish 7 Societal Big Data Interest Groups
o modelled after the W3C interest groups
o involving a large number of stakeholders from the H2020 societal challenges as well as technical Big Data experts
o each group has a domain and a technical chair
Building a European network and multiplier organization per societal challenge to
o engage with stakeholders in the particular societal challenge area and raise awareness
o support the requirements elicitation, definition and prioritization
o assemble a library of data sources and datasets
o provide a comprehensive test bed for the evaluation of the BDE Aggregator Platform
o select pilot use cases, across different domains
o promote the showcase developed for the societal domain and support the dissemination of the BDE results
o provide appropriate academic and training curricula for training future researchers and practitioners
27 February 2015 www.big-data-europe.eu
Workshops
7 x 3 workshops (at least 3 per Societal Challenge)
First series of workshops, in the coming months, will focus on requirements definition
o analyse workshop results and create a first draft per societal challenge
o also examine the use of other tools such as surveys (asking a broad audience about their (big) data management needs)
Manage expert interviews
o interviews with Big Data experts
o interviews with an EC representative per societal challenge
Second series of workshops, in the 2nd year, will focus on a review of the architecture and first prototype implementation
Third series of workshops, in the 3rd year, will focus on the platform evaluation and showcases for the societal domains
Big Data Europe
Bert.Van.Nuffelen@tenforce.com
y.barnard@mail.ertico.com
www.big-data-europe.eu
OPEN PHACTS - BIG DATA AND DRUG DISCOVERY
BRYN WILLIAMS-JONES, CEO, THE OPEN PHACTS FOUNDATION
Big Data Europe
Pre-competitive Informatics
Pharma companies are all accessing, processing, storing & re-processing external open research data
[Figure: Literature, Patents, PubChem, GenBank, Databases and Downloads feed Data Integration and Data Analysis behind Firewalled Databases; labelled "x Repeat @ each company".]
Reference: Lowering industry firewalls: pre-competitive informatics in drug discovery. Nature Reviews Drug Discovery (2009) 8, 701-708. doi:10.1038/nrd2944
The Innovative Medicines Initiative
o EC funded public-private partnership for pharmaceutical research
o Focus on key problems: Efficacy, Safety, Education & Training, Knowledge Management
The Open PHACTS Project
o Create a semantic integration hub ("Open Pharmacological Space")
o Runs 2011-2014, ENSO till 2016
o Deliver services to support on-going drug discovery programs in pharma and public domain
o Leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements
o 10 EFPIA companies, 15 academics, 6 SMEs
o Focus on sustainability and long term impact of the Open PHACTS infrastructure
Open PHACTS Mission Integrate Multiple Research Biomedical Data Resources Into A Single Open & Free Access Point
What do research scientists want to know?
[Figure: data sources shown: ChEMBL, DrugBank, Gene Ontology, WikiPathways, GeneGo, ChEBI, UniProt, UMLS, GVKBio, ConceptWiki, ChemSpider, TrialTrove, TR Integrity.]
Business Questions
Number | Sum | Nr of 1 | Question
15 | 12 | 9 | All oxidoreductase inhibitors active <100 nM in both human and mouse.
18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24 | 13 | 8 | Given a target, find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? I.e. return all compounds active in assays where the resolution is at least at the level of the target family (i.e. PKC), both from structured assay databases and the literature.
44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data.
46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease).
59 | 14 | 8 | Identify all known protein-protein interaction inhibitors.
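To make question 15 concrete, the sketch below answers "all oxidoreductase inhibitors active <100 nM in both human and mouse" over a small in-memory list. It is an illustration only: the `Activity` record, its field names and the function `active_in_all_species` are hypothetical, and the slides do not describe how Open PHACTS itself evaluates the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One measured activity (hypothetical layout, potency in nM)."""
    compound: str
    target_class: str
    species: str
    potency_nm: float


def active_in_all_species(activities, target_class, species, threshold_nm=100.0):
    """Compounds below the threshold against the target class in every listed species."""
    seen = {}
    for a in activities:
        if a.target_class == target_class and a.potency_nm < threshold_nm:
            seen.setdefault(a.compound, set()).add(a.species)
    required = set(species)
    # Keep a compound only if its qualifying species cover all required ones.
    return sorted(c for c, found in seen.items() if required <= found)
```

The hard part in practice is not the filter but getting potency values, species and target classes from different sources into one comparable form, which is the integration problem the talk is about.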
The Open PHACTS Discovery Platform
[Figure: architecture diagram. Elements shown: Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Core Platform; Identity Resolution Service (example: "Adenosine receptor 2a"); Identifier Management Service (examples: P12374, EC2.43.4, CS4532); Semantic Workflow Engine; Data Cache (Virtuoso Triple Store); Chemistry Registration Normalisation & Q/C; Indexing; source databases (Db) with VoID and Nanopub descriptions; Public Ontologies; User Annotations.]
Reference: http://dx.doi.org/10.1016/j.websem.2014.03.003
Sustaining Impact
Software is free like puppies are free: they both need money for maintenance and more resource for future development
How do we move data about and integrate it?
Data Standardisation is vital http://imgs.xkcd.com/comics/standards.png
Yet the bioscience world really struggles to agree on names GB:29384 P12047 X31045
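One common way to cope with competing names is to record which identifiers are known to refer to the same entity and resolve any of them to a shared group. The sketch below does this with a small union-find structure. It is not the Open PHACTS identity mapping implementation: the class `IdentifierMap` is hypothetical, and treating the three identifiers from the slide as names for one entity is an assumption made for illustration.

```python
class IdentifierMap:
    """Groups identifiers that have been declared equivalent (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        # Unknown identifiers start out as their own group.
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[ident] != root:
            self._parent[ident], ident = root, self._parent[ident]
        return root

    def link(self, a, b):
        """Record that identifiers a and b name the same entity."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self._parent[rb] = ra

    def same_entity(self, a, b):
        return self._find(a) == self._find(b)

    def synonyms(self, ident):
        """All identifiers currently grouped with ident, sorted."""
        root = self._find(ident)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


names = IdentifierMap()
names.link("GB:29384", "P12047")  # assumed equivalence, for illustration
names.link("P12047", "X31045")    # assumed equivalence, for illustration
```

Equivalence is transitive here, so linking A to B and B to C also groups A with C. Whether that is desirable depends on the question being asked, since two identifiers can be interchangeable for one purpose and distinct for another.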
Acknowledgements
Open PHACTS Practical Semantics
bryn@openphactsfoundation.org
info@openphactsfoundation.org
@Open_PHACTS
GlaxoSmithKline (Coordinator)
Universität Wien (Managing entity)
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
Esteve
Almirall
OpenLink
Scibite
The Open PHACTS Foundation
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
Pfizer