Astrid Genet astrid.genet@hs-furtwangen.de 24 Oct, 2014
The origins of Big Data in biomedicine As in many other fields, recently emerged state-of-the-art biomedical technologies generate huge and heterogeneous amounts of digital health-care information of all types, accumulated from patients. These large quantities of data are referred to as "Big Data".
What makes Big Data special? Such data cannot be managed and processed by conventional methods, for the following reasons: Volume: Big Data implies enormous volumes of data; Variety: data typically come from multiple sources (images, free text, measurements from monitoring devices, audio records, etc.) and therefore can have various formats; Velocity: the speed at which the data are generated is massive and sometimes continuous ("real-time data"); Variability: refers to possible biases, noise, abnormalities or time inconsistencies in the data (volume and velocity make this variability even harder to handle).
Some recent technologies that come with large collections of experimental data: Microarrays: gene expression data are being generated by the gigabyte all over the world; Next-generation sequencing (NGS) has exponentially increased the rate of biological data generation in the last 4 years: a (comparatively small) project with 10 to 20 whole-genome sequencing samples can generate about 4 TB of raw data; Mass spectrometry also generates massive amounts of complex proteomic data: the ProteomicsDB database [1], a mass-spectrometry-based draft of the human proteome, represents terabytes of data; Medical imaging: in just 20 years, MRI has revolutionised medical imaging by producing diagnostic images of photographic quality; one year of imaging amounts to over 15 TB (with a very low acquisition-to-analysis ratio); Patient data management systems (PDMS): comprehensive software recording measurements from ICUs; static and temporal data are stored from one of the most data-intensive environments in medicine (admission data, monitoring devices, laboratory analyses, annotations from the medical staff, etc.). [1] Mathias Wilhelm, Judith Schlegl, Hannes Hahne, Amin Moghaddas Gholami, Marcus Lieberenz, Mikhail M. Savitski, Emanuel Ziegler, Lars Butzmann, Siegfried Gessulat, Harald Marx, et al. Mass-spectrometry-based draft of the human proteome. Nature, 509(7502):582-587, 2014.
What do we do with it all? Modern technologies make it possible to generate huge quantities of complex, high-quality data at a reasonable price. But do they really make it possible to get more for less in terms of disease classifiers, shape analyses and improved diagnostic accuracy? Emerging challenges have to be faced: storage and organisation of the volume of information (which requires hardware, maintenance and physical space); concerns over the privacy and security of patient data; bioinformatics and biostatistics processing tools that must be adapted to the size and complexity of the data.
Storage, maintenance and organisation The situation is fairly well organised for microarray gene expression data. Measurement data are stored in (public or subscription-based) repositories called microarray databases, which also maintain a searchable index and make the data available for analysis and interpretation (see the short R sketch below). Standards have also been created for reporting microarray experiments in a reliable form: the MIAME (Minimum Information About a Microarray Experiment) standard; the MAQC (MicroArray Quality Control) project.
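A minimal R sketch of how data could be pulled from one such repository (NCBI GEO), assuming the Bioconductor package GEOquery and an arbitrary public accession number:

library(GEOquery)
gse  <- getGEO("GSE2553", GSEMatrix = TRUE)   # download a series as ExpressionSet objects
eset <- gse[[1]]
dim(exprs(eset))    # expression matrix: probes x samples
head(pData(eset))   # MIAME-style sample annotations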
Storage, maintenance and organisation Storage remains a major challenge for NGS, medical imaging and mass spectrometry data, which represent larger amounts of data (by the terabyte). There is not yet an established standard for storing and exchanging them. Centralised storage would allow: everything to be in one place; everything to be in one format; analysis tools to read and interact directly with the data. Given the size of the data, the concerns over time, expense and security that arise from these requirements remain an issue.
Challenges in biostatistics and bioinformatics The ultimate goal of clever storage and organisation of biological data is to turn them into usable information for mining, and into real knowledge. Challenges faced by traditional biostatistics and bioinformatics: exploration and cleaning of large and incomplete datasets (variable transformations, relationships among variables, verification and quality control) is time-consuming and difficult or impossible to complete fully, with a risk of overlooked relationships and a likelihood of errors or omissions; traditional statistical models, software programs and visualisation tools do not scale to large-scale data; insufficient computer processing power causes extreme time delays when running complex models; interpretation of analytical results and of their clinical applications is hard: analysts may need effective clinical support to guide them.
Emerging solutions: computational facilities for analysing Big Data New tools are continually emerging and solutions appear, related to computing performance, computing environments and analysis algorithms. High-performance computing solutions include: highly optimised multicore CPU workstations; Graphics Processing Units (GPUs), which significantly speed up the processing of mining algorithms on workstations; parallel processing on multiple processor cores, also an option to reduce computation time (see the sketch below); cloud-based computing, which moves computation to resources delivered over the Internet ("renting" computational power by the hour and saving the acquisition of expensive resources).
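A minimal sketch of the multicore option in R, assuming only the base 'parallel' package; the repeated model fit is a toy stand-in for an expensive mining step:

library(parallel)

n_cores <- max(1, detectCores() - 1)        # leave one core for the system
cl      <- makeCluster(n_cores)

fit_one <- function(i) {
  d <- data.frame(x = rnorm(1000), y = rnorm(1000))
  coef(lm(y ~ x, data = d))                 # stand-in for an expensive model fit
}

results <- parLapply(cl, 1:1000, fit_one)   # the 1000 fits are spread over the cores
stopCluster(cl)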
Emerging solutions: environments for statistical computing Specific extensions of computing environments enable the handling of large datasets. For example, R (64-bit) offers the following facilities: high-performance CPU and GPU parallel computing (doMC, gputools); options for file-based access to data sets that are too large to be loaded into R's internal memory (RAM) (ff, bigmemory); easy transfer of R objects to efficient C or C++ functions via dynamically loaded libraries (.C(), Rcpp); flexible and fast visualisation methods to explore and analyse large multivariate datasets (bigvis). Equivalent possibilities also exist in similar environments such as Perl, Python and Matlab.
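Two of these facilities, sketched under the assumption that the bigmemory and Rcpp packages are installed; file names and the matrix size are illustrative:

library(bigmemory)
library(Rcpp)

# a 10-million-row matrix backed by a file on disk, used like an ordinary
# matrix but never loaded entirely into R's internal memory
X <- filebacked.big.matrix(nrow = 1e7, ncol = 5,
                           backingfile = "measurements.bin",
                           descriptorfile = "measurements.desc")
X[1:3, ] <- rnorm(15)        # write a few rows; the rest stays on disk

# a small C++ function compiled on the fly and callable from R
cppFunction('
  double sumCpp(NumericVector x) {
    double s = 0;
    for (int i = 0; i < x.size(); ++i) s += x[i];
    return s;
  }')
sumCpp(X[1:3, 1])            # subsets of the big matrix are plain R vectors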
Data analysis of Big Data Regarding the processing of the data (if supported by adequate computational resources), flexible models from the machine learning field adapt better to large datasets than statistical models with highly structured forms (linear regression, logistic regression, discriminant analysis), because they enable inference in non-standard situations: non-i.i.d. data; semi-supervised learning; learning with structured data; etc. Examples of machine learning algorithms suitable for mining large and complex datasets: neural networks, classification and regression trees (decision trees), naive Bayes, k-nearest neighbours, support vector machines, etc.
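A minimal example of two of these learners in R, assuming the rpart and e1071 packages and a standard built-in dataset rather than clinical data:

library(rpart)
library(e1071)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

tree_fit <- rpart(Species ~ ., data = train, method = "class")   # decision tree
svm_fit  <- svm(Species ~ ., data = train)                       # support vector machine

mean(predict(tree_fit, test, type = "class") == test$Species)    # tree accuracy on held-out data
mean(predict(svm_fit, test) == test$Species)                     # SVM accuracy on held-out data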
Dealing with the complexity of biomedical data: still an open issue Pre-processing the data often requires the biggest effort in a data-mining study. Most issues concern either: the structure of the data: missing meta-information (meaning of fields, keys, units), class label imbalance (controls/cases), repeated measurements, etc.; or the quality of the data: typos, multiple formats, changes in scale, gaps in time series, missing values, duplicated measurements, etc. Some tools exist that help deal with these situations (multiple imputation, resampling techniques, etc.; see the example below), but the volume and velocity of the data hamper the solutions available for smaller datasets and make them highly resource-consuming. So far there is no consensus regarding the right way (and order) to deal with the complexity of medical datasets. Care must be taken: inappropriate pre-processing can destroy or mutilate information and lead to misleading results or bias.
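For instance, multiple imputation can be sketched in R with the mice package; the example dataset ships with the package, and the number of imputations is an assumption:

library(mice)

data(nhanes)                                      # small example dataset with missing values
imp  <- mice(nhanes, m = 10, printFlag = FALSE)   # 10 imputed versions of the dataset
fits <- with(imp, lm(bmi ~ age + chl))            # analyse each imputed dataset
summary(pool(fits))                               # pool the estimates across imputations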
Patient-specific modelling Large healthcare datasets also open new avenues for the development of personalised diagnostics and therapeutics: if data are mined from billions of persons, each patient can be surrounded by a "virtual cloud" of cases matching his or her own health status (sketched below). Patient-specific modelling (PSM) is an emerging field in biostatistics. Modelling techniques are rather specific to the medical field (bones, heart and circulation, brain, diagnostics, surgical planning, etc.) [1]; the common goal, however, is to develop computational models that are influenced by the particular history, symptoms, laboratory results, etc. of the patient and perform better than population-wide learners [2]. [1] Amit Gefen. Patient-Specific Modeling in Tomorrow's Medicine, volume 9. Springer, 2012. [2] Shyam Visweswaran, Derek C. Angus, Margaret Hsieh, Lisa Weissfeld, Donald Yealy, and Gregory F. Cooper. Learning patient-specific predictive models from clinical data. Journal of Biomedical Informatics, 43(5):669-685, 2010.
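A minimal sketch of the "virtual cloud" idea on simulated data; variable names, cohort size and the simple risk summary are illustrative assumptions:

set.seed(1)
history <- data.frame(age        = rnorm(5000, 60, 12),
                      creatinine = rnorm(5000, 1.1, 0.4),
                      failure    = rbinom(5000, 1, 0.1))
new_patient <- c(age = 72, creatinine = 1.8)

# distance from the new patient to every historical case, on standardised predictors
z <- scale(history[, c("age", "creatinine")])
p <- (new_patient - attr(z, "scaled:center")) / attr(z, "scaled:scale")
d <- sqrt(rowSums((z - matrix(p, nrow(z), 2, byrow = TRUE))^2))

cloud <- history[order(d)[1:200], ]   # the 200 most similar cases: the patient's "virtual cloud"
mean(cloud$failure)                   # local, patient-specific risk estimate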
Project PATIENTS Goal: help understand and diagnose early postoperative liver and kidney failure, a leading cause of death in surgical ICUs whose risk factors, causes and prognosis are not fully understood. Efforts are devoted to establishing a reliable methodology, using state-of-the-art mining procedures, for the development of accurate and robust clinical prediction models meant to complement medical reasoning in decision-making tasks.
PATIENTS database Learning algorithms are developed and tested on clinical data from the PDMS of the COPRA System company, installed at the intensive care unit of the University Hospital of Rostock (Germany). The data consist of: almost 7,000 cases admitted for major surgery between 2008 and 2011; up to 4,679 parameters measured at different time intervals; parameters including demographic, clinical and laboratory information.
Clinical translation of data mining results In line with the currently increasing demand for translational medicine, biomedical research results must be conveyed to health-care providers in a manner that is fast and easy to understand and to apply to patient care. The way we want to make this happen is by developing a further module for integration into the COPRA PDMS, thereby turning it into a clinical decision support system (CDSS) that would: carry out predefined decision rules based on data in the PDMS (see the illustrative rule below); help predict potential events; assist the nurse or physician in their diagnosis and in the choice of an appropriate course of treatment; increase ICU patient safety and compliance.
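A hedged sketch of what one predefined decision rule in such a module might look like; the thresholds and field names are illustrative assumptions, not validated clinical criteria:

check_kidney_alert <- function(patient) {
  # flag a possible acute kidney injury if creatinine rises sharply
  # or urine output has stayed low over the last 6 hours
  rising_creatinine <- patient$creatinine_now >= 1.5 * patient$creatinine_baseline
  low_urine_output  <- patient$urine_ml_per_kg_h_6h < 0.5
  if (rising_creatinine || low_urine_output)
    "ALERT: possible postoperative kidney failure, review patient"
  else
    "no alert"
}

check_kidney_alert(list(creatinine_now = 2.1, creatinine_baseline = 1.0,
                        urine_ml_per_kg_h_6h = 0.3))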
Multi-disciplinary collaborations Reaching this ultimate goal requires expertise in many disciplines and multi-disciplinary collaborations: medical research and clinical care, to gather data from clinical studies and offer guidance from biological reasoning; mathematical and statistical expertise of data analysts, for methods of analysis and interpretation of results; computer science and developers, to make the results available and usable to health-care providers.
Experimental methodology: development of clinical prediction models Benchmarking of classification models: flexible models from machine learning (trees, SVM, BN); statistical models (LDA, LR), which are easier to interpret. Special care will be taken to: avoid over-fitting, through the production of a combined model averaged over a large number of single classifiers (reduced variance); get accurate performance estimates, using resampling methods to inject variation into the system and better approximate performance on future samples. Adapted pre-processing of the data: the class imbalance problem is addressed at the data level using resampling (SMOTE algorithm); missing values in the data are filled in with plausible values using multiple imputation (10-20 repetitions); all pre-processing steps are included within the resampling loop to ensure fair performance estimates (a sketch of such a loop follows).
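A minimal sketch of such a resampling loop, with imputation and SMOTE applied inside each iteration; the packages (mice, DMwR, rpart), the toy data and the bootstrap scheme are assumptions, not the project's actual pipeline:

library(mice)    # multiple imputation
library(DMwR)    # SMOTE
library(rpart)   # classification trees

set.seed(42)
n <- 500
toy <- data.frame(lactate    = rnorm(n, 2, 1),
                  creatinine = rnorm(n, 1, 0.3),
                  outcome    = factor(rbinom(n, 1, 0.15),
                                      labels = c("no_failure", "failure")))
toy$lactate[sample(n, 50)] <- NA          # introduce some missing values

B   <- 25
acc <- numeric(B)
for (b in seq_len(B)) {
  idx   <- sample(n, n, replace = TRUE)   # bootstrap resample
  train <- toy[idx, ]
  test  <- toy[-unique(idx), ]            # out-of-bag cases

  # pre-processing inside the loop: impute missing values, then re-balance the classes
  train <- complete(mice(train, m = 1, printFlag = FALSE))
  test  <- complete(mice(test,  m = 1, printFlag = FALSE))
  train <- SMOTE(outcome ~ ., data = train)

  # fit one single classifier and evaluate it on the out-of-bag cases
  fit    <- rpart(outcome ~ ., data = train, method = "class")
  acc[b] <- mean(predict(fit, test, type = "class") == test$outcome)
}
mean(acc)                                 # resampled performance estimate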
Experimental methodology Development of patient-specific algorithms; a well-suited methodology: selection of the subset of single learners most relevant for the patient at hand; patient prediction averaged over the optimised subset (sketched below). This is a way to make use of the population-wide methodology and limit the computational burden while improving the accuracy of the outcome. Computational solutions: workstation with multicore CPU and extensions to the R software environment for applications on large data.
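A sketch of this patient-specific step, assuming a pool of single classifiers (e.g. the rpart fits produced by a resampling loop like the one above) and a pre-computed local cohort ("cloud") of similar cases; all names are illustrative:

library(rpart)

patient_specific_predict <- function(learners, cloud, new_patient, top = 10) {
  # rank every single learner by its accuracy on the cases most similar
  # to the patient at hand
  local_acc <- sapply(learners, function(fit)
    mean(predict(fit, cloud, type = "class") == cloud$outcome))

  # keep only the 'top' learners most relevant for this particular patient
  best <- learners[order(local_acc, decreasing = TRUE)[1:top]]

  # average their predicted probabilities of the adverse outcome
  mean(sapply(best, function(fit)
    predict(fit, new_patient, type = "prob")[, "failure"]))
}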
Thank you for your attention