Ernestina Menasalvas, Universidad Politécnica de Madrid
EECA Cluster networking event, RITA, 12 November 2014, Baku
Sector/Domain | Big Data Value | Source
Public administration | EUR 150 billion to EUR 300 billion in new value (considering the EU's 23 larger governments) | OECD, 2013
Healthcare & Social Care | EUR 90 billion considering only the reduction of national healthcare expenditure in the EU | McKinsey Global Institute, 2011
Utilities | Reduce CO2 emissions by more than 2 gigatonnes, equivalent to EUR 79 billion (global figure) | OECD, 2013
Transport and logistics | USD 500 billion in value worldwide in the form of time and fuel savings, or 380 megatonnes of CO2 emissions saved | OECD, 2013
Retail & Trade | 60% potential increase in retailers' operating margins possible with Big Data | McKinsey Global Institute, 2011
Geospatial | USD 800 billion in revenue to service providers and value to consumer and business end users | McKinsey Global Institute, 2011
Applications & Services | USD 51 billion worldwide directly associated with the Big Data market (services and applications) | Various
Motivation
  In 2012, worldwide digital healthcare data was estimated at 500 petabytes; it is expected to reach 25,000 petabytes in 2020.
  Can we learn from the past to become better in the future?
  Healthcare data is becoming more complex.
  The problem:
    Millions of reports, tasks, incidents, events, images, DNA
    Complete availability
    Lack of protocols and structure
    Organization-oriented processes
    Need for patient-oriented processes and information
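To put the growth figures above in perspective, here is a minimal sketch of the yearly growth rate they imply. It assumes constant compound growth between 2012 and 2020, which the slide does not state; the function name is illustrative only. Under that assumption the volume grows by roughly 63% per year.

```python
def compound_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Return the constant yearly growth rate that turns start_value into end_value."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures from the slide: 500 PB in 2012, 25,000 PB expected in 2020.
# Assumption: growth compounds at a constant rate over the 8 years.
rate = compound_annual_growth(500.0, 25_000.0, 2020 - 2012)
print(f"Implied growth: about {rate:.0%} per year")
```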
From McKinsey: big data in health report, 2013
  From physicians' judgment to evidence-based medicine
  Standard medical practice is moving from relatively ad hoc and subjective decision making to evidence-based healthcare.
  Is the health-care industry prepared to capture big data's full potential, or are there roadblocks that will hamper its use?
  A holistic, patient-centered approach to value, one that focuses equally on health-care spending and treatment outcomes.
EHR adoption
  http://www.accenture.com/sitecollectiondocuments/pdf/accenture_emr_markets_whitepaper_vfinal.pdf
BIG DATA IN THE HEALTH DOMAIN
The average hospital (300 beds)
  500,000 patients (reference population)
  1,300 users (250 physicians, 900 nurses and technicians, 150 in administrative tasks)
  Monthly activity:
    20,000 consultations, 1,300 admissions, 800 interventions
    10,000 emergencies
    75,000 annotations
    25,000 reports
    90,000 interdepartmental orders
    450,000 lab results (analytical)
    13,000 image analyses
    24,000 pharmacological prescriptions
Hospital Management
  They require solutions for:
    cost-reduction policies
    efficiency procedures
    establishing shared-risk policies
  Alarms
  Early prognosis and diagnosis
  Environmental and sensor data integration
  Use of cloud data and services to compare against data from other hospitals, countries, .. in support of efficiency policies...
Government
  support for cost-reduction policies
  analysis of early detection of chronic diseases
  analysis of diseases and the elderly
  prediction of the evolution of diseases depending on clinical and societal factors
  sentiment analysis (user satisfaction) of policies, health care
  impact of environmental factors on the evolution, prevalence and.. of diseases
  impact of people's socio-economic situation on disease evolution and on health costs
  cloud-based services for analysis of all the data generated in different hospitals
Clinicians: evidence-based medicine
  correlations and associations among symptoms, family history, habits, diseases
  impact of certain biomedical factors (genome structure, clinical variables) on the evolution of certain diseases
  automatic classification of images (prioritization of X-ray images to help diagnosis)
  automatic annotation of images
  natural-language (Google-style) diagnostic aid tools
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS INFORMATICOS
Process
Data Acquisition
  Data silos
  Standardization
  Privacy
  Structured data:
    Diverse numeric scales across different labs
    Missing data
    Clinical and demographic data (ICD): medium recall and medium precision for characterizing patients
  Non-structured data:
    Images
    Clinical reports
Data processing
Modelling
Image annotation
NLP
Integration
Deep analysis
Visualization
Validation
Apply
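Two of the structured-data issues above, diverse numeric scales across labs and missing data, can be illustrated with a small sketch. This is not the method used in the talk: the records are invented, glucose is chosen only as an example analyte (its mmol/L to mg/dL factor follows from a molar mass of about 180.16 g/mol), and median imputation is just one simple option among many.

```python
from statistics import median
from typing import Optional

# Conversion factors to a common unit (mg/dL) for glucose readings.
# Assumption: 1 mmol/L of glucose is about 18.016 mg/dL.
TO_MG_PER_DL = {"mg/dL": 1.0, "mmol/L": 18.016}


def harmonize(value: Optional[float], unit: str) -> Optional[float]:
    """Convert a glucose reading to mg/dL, keeping missing values as None."""
    if value is None:
        return None
    return value * TO_MG_PER_DL[unit]


def impute_median(values: list[Optional[float]]) -> list[float]:
    """Replace missing readings with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Invented records from two hypothetical labs reporting in different units.
records = [
    {"lab": "A", "glucose": 99.0, "unit": "mg/dL"},
    {"lab": "B", "glucose": 5.5, "unit": "mmol/L"},
    {"lab": "B", "glucose": None, "unit": "mmol/L"},
    {"lab": "A", "glucose": 110.0, "unit": "mg/dL"},
]
harmonized = [harmonize(r["glucose"], r["unit"]) for r in records]
complete = impute_median(harmonized)
print([round(v, 1) for v in complete])
```

In practice the choice of imputation strategy depends on why the value is missing, so a real pipeline would likely record that a value was imputed rather than silently filling it.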
By 2015, the average hospital will have two-thirds of a petabyte of patient data, 80% of which will be unstructured image data like CT scans and X-rays.
  http://medcitynews.com/2013/03/the-body-in-bytesmedical-images-as-a-source-of-healthcare-big-datainfographic/
Most frequent: Computed Tomography (CT), X-ray, Positron Emission Tomography (PET)
  The main challenge with image data is that it is not only huge but also high-dimensional and complex.
  Extracting the important and relevant features is a daunting task.
Methodology for image processing
  Overall process of image mining:
    Data preprocessing
    Extracting multidimensional feature vectors
    Mining the vectors to acquire high-level knowledge
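The three steps above can be sketched end to end on toy data. Everything here is an assumption made for illustration: images are tiny grey-level grids, the feature vector is just mean, spread, and a coarse histogram, and the mining step is a nearest-reference lookup standing in for whatever mining technique a real system would use.

```python
from statistics import mean, pstdev

Image = list[list[int]]  # grey-level pixels in the range 0..255


def preprocess(image: Image) -> list[float]:
    """Step 1: flatten the image and rescale intensities to the range 0..1."""
    return [pixel / 255.0 for row in image for pixel in row]


def extract_features(pixels: list[float], bins: int = 4) -> list[float]:
    """Step 2: build a feature vector from mean, spread, and a normalised histogram."""
    histogram = [0] * bins
    for p in pixels:
        histogram[min(int(p * bins), bins - 1)] += 1
    return [mean(pixels), pstdev(pixels)] + [count / len(pixels) for count in histogram]


def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_label(features: list[float], labelled: dict[str, list[float]]) -> str:
    """Step 3 (stand-in for mining): return the label of the closest reference vector."""
    return min(labelled, key=lambda name: distance(features, labelled[name]))


# Invented 2x2 "images"; real CT or X-ray data would be far larger.
dark = [[10, 20], [15, 5]]
bright = [[240, 250], [230, 245]]
references = {
    "dark": extract_features(preprocess(dark)),
    "bright": extract_features(preprocess(bright)),
}
query = [[200, 220], [210, 255]]
print(nearest_label(extract_features(preprocess(query)), references))  # prints "bright"
```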
NLP applied to EHR
  Analysis of free-text input from clinical reports and the patient's history would improve healthcare.
  There are several English-centric tools working towards that goal:
    Mayo's cTAKES
    SNOMED-CT
    MetaMap
    UMLS
    MedLEE
    LOINC
    HITEx
Natural Language Processing
  Sentence detector
  Tokenizer
  Part of speech
  Chunker
  Named entity
  Negation detection
  Negation
  Hypothesis
  Historical event
  Subject recognition
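A minimal sketch of three of the stages listed above (sentence detector, tokenizer, negation detection) is shown here. It is a rule-based toy, not the implementation of any of the tools named in the talk: the cue words, the target terms, and the five-token negation window are all assumptions chosen for the example.

```python
import re

# Illustrative word lists only; not a clinical lexicon.
NEGATION_CUES = {"no", "not", "without", "denies", "denied"}
CONCEPTS = ("fever", "cough", "pneumonia", "chest pain")
CONCEPT_WORDS = [concept.split() for concept in CONCEPTS]
WINDOW = 5  # assumption: tokens this close after a cue count as negated


def split_sentences(text: str) -> list[str]:
    """Sentence detector: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> list[str]:
    """Tokenizer: lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def find_concepts(text: str) -> list[tuple[str, bool]]:
    """Return (concept, is_negated) pairs in order of appearance."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        negated_positions = set()
        for i, token in enumerate(tokens):
            if token in NEGATION_CUES:
                negated_positions.update(range(i + 1, i + 1 + WINDOW))
        for i in range(len(tokens)):
            for words in CONCEPT_WORDS:
                if tokens[i:i + len(words)] == words:
                    results.append((" ".join(words), i in negated_positions))
    return results


note = "Patient reports cough and fever. No chest pain. Denies pneumonia in the past."
for concept, negated in find_concepts(note):
    print(concept, "NEGATED" if negated else "AFFIRMED")
```

A fixed token window is a crude scope rule; the hypothesis, historical-event, and subject attributes on the slide would need additional cues that this sketch does not attempt.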
NESSI cPPP: BIG DATA VALUE
Goal of the cPPP
  Ensure Europe's leading role in the data-driven world, addressing competitiveness, innovation, and society
  Covering the dimensions of Big Data Value: data, skills, legal, technical, application, business, social
Multiple views of Big Data
Technical and non-technical aspects
  Data: Data is at the centre of the Big Data Value activities, including making data sets and assets accessible
    Private and open data sources: ensure their availability, integrity, and confidentiality
    Data ownership
  Technology: Technologies and tools needed to support data-driven applications
    Non-structured data
    Algorithms for text, image
    Anonymization
  Legal, Policy and Privacy: European-wide legislation, regulation
  Social: Acquiring early insights into the social impact of new technologies and data-driven applications and how they will change the behaviour of individuals
Technical issues
  Harmonization across different sources: standardized modelling, integration of heterogeneous data sources
  Low-latency and real-time data processing
  Advanced data mining: predictive analytics, graph mining, semantic analysis
  Image and text processing
  Data protection and privacy technologies
  Advanced visualization, user experience and usability
Technical priorities: Data Management
  Define, interoperate, openly share, access, transform, link, syndicate, and manage data:
    Annotation: Data needs to be semantically annotated in digital formats, without imposing extra effort on data producers
    Unstructured data
    Semantic Interoperability: Data silos have to be unlocked
    Legal Frameworks: Technical means have to be backed by legal frameworks to ensure the transparent sharing and exchange of data
    Quality
Technical priorities: Deep analytics
  Event Space: Move beyond the limited samples used so far in statistical analytics to samples covering the whole or the largest part of an event space
  Model Accuracy: Improve the accuracy of statistical models by enabling fast nonlinear approximations in very large datasets
  Event Discovery: Discover rare events that are hard to identify because they have a small probability of occurrence but great significance (such as rare diseases and treatments)
  Real Time: Enable real-time analytics capable of analysing large amounts of data-in-motion and data-at-rest by updating the analysis results as the information content changes
  Semantic Analysis: Deep learning, contextualization based on AI, machine learning, semantic analysis in near-real time, graph mining
  Unstructured Data: Processing of unstructured data (multimedia, text). Linking and cross-analysis algorithms to deliver cross-domain and cross-sector intelligence
  Canonical Forms: Provide canonical paths so that data can be aggregated and shared easily without dependency on technicians or domain experts, and provide a path for the smart analysis of data across and within domains
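The Real Time and Event Discovery priorities above can be illustrated together with a small sketch: statistics are updated one value at a time (Welford's online algorithm), and a value is flagged as a candidate rare event when it lies far from the running mean. The threshold of 3 standard deviations, the warm-up length, and the readings are assumptions for the example, not values from the talk.

```python
from math import sqrt


class RunningStats:
    """Welford's online algorithm: mean and variance updated one value at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self) -> float:
        """Sample standard deviation of the values seen so far."""
        return sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0


def rare_events(stream, threshold: float = 3.0, warmup: int = 5) -> list[int]:
    """Return indices of values more than `threshold` deviations from the running mean."""
    stats = RunningStats()
    flagged = []
    for index, value in enumerate(stream):
        # Compare against statistics of earlier values only, then absorb the new one.
        if stats.count >= warmup and stats.std > 0:
            if abs(value - stats.mean) / stats.std > threshold:
                flagged.append(index)
        stats.update(value)
    return flagged


# Invented readings with one obvious outlier at position 6.
readings = [72, 75, 71, 74, 73, 72, 140, 74, 73]
print(rare_events(readings))  # prints [6]
```

Note that the flagged value is still absorbed into the running statistics here; a production system might exclude or down-weight it so that one outlier does not mask the next.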
Technical priorities: Privacy and Anonymisation Mechanisms
  Cloud Data Protection: Protect the cloud infrastructure, analytics applications, and the data from leakage and threats
  Data Minimisation: Methods for secure deletion of data and data minimisation
  Algorithms: Robust anonymisation algorithms
  Reversibility: Risk assessment tools to evaluate the reversibility of the anonymisation mechanisms
  Mining Algorithms: Develop privacy-preserving data mining algorithms
  Privacy Preservation: Mechanisms for privacy-preserving data publishing and data computations
  Pattern Hiding: Design of mechanisms for pattern hiding, so that data is transformed in a way that certain patterns cannot be derived (via mining) while others can
  Multiparty Mining: Secure multiparty mining mechanisms over distributed datasets
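As one concrete illustration of the anonymisation and reversibility items above, the sketch below measures k-anonymity before and after generalising quasi-identifiers. k-anonymity is used here only as a familiar example of a risk measure; the slides do not name a specific technique. The patient records, the choice of age and postcode as quasi-identifiers, and the generalisation rules are all invented.

```python
from collections import Counter


def raw_key(record: dict) -> tuple:
    """Quasi-identifiers exactly as recorded."""
    return (record["age"], record["postcode"])


def generalize(record: dict) -> tuple:
    """Coarsen quasi-identifiers: 10-year age band and 3-character postcode prefix."""
    decade = record["age"] // 10 * 10
    return (f"{decade}-{decade + 9}", record["postcode"][:3] + "**")


def k_anonymity(records: list[dict], key=raw_key) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("no records to assess")
    groups = Counter(key(r) for r in records)
    return min(groups.values())


# Invented records; a larger k means each person hides in a larger group.
patients = [
    {"age": 34, "postcode": "28040", "diagnosis": "asthma"},
    {"age": 37, "postcode": "28045", "diagnosis": "diabetes"},
    {"age": 52, "postcode": "28660", "diagnosis": "asthma"},
    {"age": 58, "postcode": "28661", "diagnosis": "hypertension"},
]
print(k_anonymity(patients))              # prints 1: every record is unique
print(k_anonymity(patients, generalize))  # prints 2 after generalisation
```

A low k signals that records could be re-identified by linking with outside data, which is the kind of reversibility risk an assessment tool would need to quantify.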
Technical priorities: Advanced Visualisation and User Experience
  End User Centric: Adaptation to the needs of end users rather than predefined visualization and analytics. User feedback
  Scale: Handle extremely large volumes of data; aggregate data at different scales, with interaction techniques that enable easy transitions from one scale or form of aggregation to another while supporting aggregation and comparisons among different scales
  Clusters: Dynamic clustering of information based on similarity or relatedness to the problem rather than on individual categories
  Geospatial: New visualisation for data with geo-locations, distances, and space/time correlations (e.g. sensor data, event data)
  Interrelated Data: Rather than data islands, visual interfaces must take account of spatial and semantic relationships, such as positions, distances, space/time correlations
  Qualitative Analysis
  Time
  Plug and Play
Priority, Year 1 to Year 5
  Data Management
    Mechanisms for integration of heterogeneous data sources
    Semantic-based data and content interoperability
    Generalisation of secure remote data access techniques
    Collaborative tools and techniques for data quality (including integrity and veracity check)
    Harmonized description format for metadata and for data reduction
    Methodology, models and tools for data lifecycle management
    Data management as a service
  Deep analytics
    Improved statistical models by enabling fast non-linear approximations in very large datasets
    Real-time analytics
    Predictive modelling and graph mining techniques applied on extremely large graphs
    Semantic analysis in near-real-time
    Algorithms for multimedia data mining
    Descriptive language for deep analytics
    Deep learning techniques
  Privacy and Anonymisation
    Complete data protection framework
    Method for deletion of data and data minimization
    Robust anonymisation algorithms
  Advanced Visualisation and User Experience
    End-user centric data search and solutions paradigms
    Semantic-driven data visualisation
    Integration of analytics and visualization
    Contextualisation
    Collaborative real-time, dynamic 3D solutions
Mechanisms
  In order to implement the research and innovation strategy and to align technical and non-technical aspects, four major kinds of mechanisms are recommended:
    Innovation Spaces (i-spaces): Cross-organisational and cross-sector environments will allow challenges to be addressed in an interdisciplinary way and will serve as a hub for other research and innovation activities.
    Lighthouse Projects: These will help raise awareness about the opportunities offered by Big Data and the value of data-driven applications for different sectors, and they will be an incubator for data-driven ecosystems.
    Technical Projects: These will take up specific Big Data issues, addressing targeted aspects of the technical priorities.
    Non-technical Projects: These projects will foster international cooperation for efficient information exchange and coordination of activities.
Main components and research priorities of the cPPP
  Innovation Spaces serve as hubs for bringing the technology and application developments together and cater for the development of skills, competence, and best practices.
  Improving understanding of data by deep analytics (e.g. predictive modelling, graph mining,...)
  Architectures for analysing data including real-time data (e.g. recommendation engines,...)
  Visualization and user experience (e.g. user-adaptive systems, search capabilities,...)
  Lighthouse Projects: Large-scale demonstrations focusing on certain sectors and domains
  Data management engineering (e.g. data integration, data integrity,...)
  Privacy and anonymisation mechanisms
Implementation Timeline
THANKS!
Ernestina Menasalvas, Universidad Politécnica de Madrid