Integration of biospecimen data with clinical data mining




Astrid Genet astrid.genet@hs-furtwangen.de 24 Oct, 2014

The origins of Big Data in biomedicine
As in many other fields, recently emerged state-of-the-art biomedical technologies generate huge and heterogeneous amounts of digital health-care information of all types, accumulated from patients. These large quantities of data are referred to as "Big Data".

What makes Big Data special?
Such data cannot be managed and processed by conventional methods, for the following reasons:
- Volume: Big Data implies enormous volumes of data;
- Variety: the data typically come from multiple sources (images, free text, measurements from monitoring devices, audio records, etc.) and can therefore have various formats;
- Velocity: the speed at which the data are generated is massive, and sometimes continuous ("real-time data");
- Variability: the possible biases, noise, abnormalities or time inconsistencies in the data (high volume and velocity make variability even harder to handle).

Some recent technologies that come with an associated large collection of experimental data:
- Microarrays: gene expression data are being generated by the gigabyte all over the world;
- Next-generation sequencing (NGS) has exponentially increased the rate of biological data generation over the last 4 years; a (considered small) project with 10 to 20 whole-genome sequencing samples can generate about 4 TB of raw data;
- Mass spectrometry also generates massive amounts of complex proteomic data; the ProteomicsDB database [1], a mass-spectrometry-based draft of the human proteome, represents terabytes of big data;
- Medical imaging: in just 20 years, MRI has revolutionised medical imaging by producing diagnostic images of photographic quality; one year of imaging amounts to over 15 TB (with a very low acquisition-to-analysis ratio);
- Patient data management systems (PDMS): comprehensive software recording measurements from ICUs; static and temporal data are stored from one of the most data-intensive environments in medicine (admission data, monitoring devices, laboratory analyses, annotations from the medical staff, etc.).

[1] Mathias Wilhelm, Judith Schlegl, Hannes Hahne, Amin Moghaddas Gholami, Marcus Lieberenz, Mikhail M. Savitski, Emanuel Ziegler, Lars Butzmann, Siegfried Gessulat, Harald Marx, et al. Mass-spectrometry-based draft of the human proteome. Nature, 509(7502):582-587, 2014.

What do we do with it all?
Modern technologies make it possible to generate huge quantities of complex, high-quality data at a reasonable price. But do they really make it possible to get more for less in terms of disease classifiers, shape analyses and improved diagnostic accuracy? Emerging challenges have to be faced:
- storage and organisation of the volume of information (requiring hardware, maintenance and physical space);
- concerns over the privacy and security of patient data;
- bioinformatics and biostatistics processing tools that must be adapted to the size and complexity of the data.

Storage, maintenance and organisation
The question is fairly well settled for microarray gene expression data. Measurement data are stored in (public or subscription-based) repositories called microarray databases, which also maintain a searchable index and make the data available for analysis and interpretation. Standards have also been created for reporting microarray experiments in a reliable form:
- the MIAME (Minimum Information About a Microarray Experiment) standard;
- the MAQC (MicroArray Quality Control) project.

Storage, maintenance and organisation
Storage remains a major challenge for NGS, medical imaging and mass spectrometry data, which represent larger amounts of data (by the terabyte). There is not yet an established standard for storing and exchanging them. Centralised storage would allow:
- everything to be in one place;
- everything to be in one format;
- analysis tools to read and interact directly with the data.
Given the size of the data, the concerns over time, expense and security that arise from these requirements remain an issue.

Challenges in biostatistics and bioinformatics
The ultimate goal of clever storage and organisation of biological data is to turn them into usable information for mining, and ultimately into real knowledge. Challenges faced by traditional biostatistics and bioinformatics:
- exploration and cleaning of large and incomplete datasets (variable transformations, relationships among variables, verification and quality control) is time-consuming and difficult or impossible to fully complete, with a risk of overlooked relationships and a higher likelihood of errors or omissions;
- traditional statistical models, software programs and visualisation tools do not scale to large-scale data;
- insufficient computer processing power causes extreme time delays when running complex models;
- interpretation of analytical results and their clinical application: analysts may need effective clinical support to guide them.

Emerging solutions: computational facilities for analysing Big Data
New tools are continually emerging and solutions appear, related to computing performance, computing environments and analysis algorithms. High-performance computing solutions include:
- highly optimised multicore CPU workstations;
- Graphics Processing Units (GPUs), which significantly speed up the processing of mining algorithms on workstations;
- parallel processing on multiple processor cores, also an option to reduce computation time;
- cloud-based computing, moving computation to resources delivered over the Internet ("renting" computational power by the hour and saving the acquisition of expensive resources).

Emerging solutions: environments for statistical computing
Specific extensions of computing environments enable them to handle large datasets. For example, R (64-bit) offers the following facilities:
- high-performance CPU and GPU parallel computing (doMC, gputools);
- file-based access to datasets that are too large to be loaded into R's internal memory (RAM) (ff, bigmemory);
- easy transfer of R objects to efficient C or C++ functions via dynamically loaded libraries (.C(), Rcpp);
- flexible and fast visualisation methods to explore and analyse large multivariate datasets (bigvis).
Equivalent possibilities also exist in similar environments such as Perl, Python and Matlab.
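As a concrete illustration, here is a minimal sketch of two of these facilities, a file-backed matrix (ff) and a multicore loop (doMC/foreach), assuming these packages are installed; the data are simulated:

```r
## Minimal sketch: file-based storage (ff) plus multicore parallelism
## (doMC/foreach). Data are simulated for illustration only.
library(ff)        # file-based access to data too large for RAM
library(foreach)   # generic looping construct
library(doMC)      # multicore parallel backend for foreach

registerDoMC(cores = 4)                        # use 4 CPU cores

# A file-backed matrix: the data live on disk and are paged in on demand
x <- ff(vmode = "double", dim = c(1e6, 10))
for (j in 1:10) x[, j] <- rnorm(1e6)

# Parallel column means over the file-backed matrix
col_means <- foreach(j = 1:10, .combine = c) %dopar% mean(x[, j])
col_means
```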

Data analysis of Big Data
Regarding the processing of the data (if supported by adequate computational resources), flexible models from the machine learning field adapt better to large datasets than statistical models with highly structured forms (linear regression, logistic regression, discriminant analysis), because they enable inference in non-standard situations: non-i.i.d. data, semi-supervised learning, learning with structured data, etc. Examples of machine learning algorithms suitable for mining large and complex datasets: neural networks, classification and regression trees (decision trees), naive Bayes, k-nearest neighbour, support vector machines, etc.
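As a toy illustration (not the project's actual pipeline), two of the classifiers named above can be fitted in a few lines of R; this assumes the rpart and e1071 packages, with the built-in iris data standing in for a clinical dataset:

```r
## Toy sketch: a decision tree (rpart) and a naive Bayes classifier (e1071).
library(rpart)   # classification and regression trees
library(e1071)   # naiveBayes(), svm()

set.seed(42)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

tree_fit <- rpart(Species ~ ., data = train)        # decision tree
nb_fit   <- naiveBayes(Species ~ ., data = train)   # naive Bayes

# Held-out accuracy of each model
mean(predict(tree_fit, test, type = "class") == test$Species)
mean(predict(nb_fit, test) == test$Species)
```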

Dealing with the complexity of biomedical data: still an open issue
Pre-processing the data often requires the biggest effort in a data-mining study. Most issues concern either:
- the structure of the data: missing meta-information (field meanings, keys, units), class-label imbalance (controls/cases), repeated measurements, etc.;
- the quality of the data: typos, multiple formats, changes in scale, gaps in time series, missing values, duplicated measurements, etc.
Some tools exist that help deal with these situations (multiple imputation, resampling techniques, etc.), but the volume and velocity of the data hamper the solutions available for smaller datasets and make them highly resource-consuming. So far there is no consensus on the right way (and order) to deal with the complexity of medical datasets. Care must be taken: inappropriate pre-processing can destroy or mutilate information and lead to misleading or biased results.
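For example, a minimal sketch of multiple imputation with the R package mice, using the small nhanes example dataset shipped with the package; the regression model is purely illustrative:

```r
## Minimal multiple-imputation sketch; assumes the mice package is installed.
library(mice)

data(nhanes)                                     # 25 cases with missing values
imp  <- mice(nhanes, m = 10, printFlag = FALSE)  # 10 imputed datasets
fits <- with(imp, lm(chl ~ age + bmi))           # fit the model on each one
pool(fits)                                       # combine via Rubin's rules
```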

Patient-specific modelling
Large healthcare datasets also open new avenues for the development of personalised diagnostics and therapeutics: if data are mined from billions of persons, each patient can be surrounded by a "virtual cloud" of cases matching his or her own health status. Patient-specific modelling (PSM) is an emerging field in biostatistics. Modelling techniques are rather specific to the medical field (bones, heart and circulation, brain, diagnostics, surgical planning, etc.) [1]; the common goal, however, is to develop computational models that are influenced by the particular history, symptoms, laboratory results, etc. of the patient and perform better than population-wide learners [2].

[1] Amit Gefen. Patient-Specific Modeling in Tomorrow's Medicine, volume 9. Springer, 2012.
[2] Shyam Visweswaran, Derek C. Angus, Margaret Hsieh, Lisa Weissfeld, Donald Yealy, and Gregory F. Cooper. Learning patient-specific predictive models from clinical data. Journal of Biomedical Informatics, 43(5):669-685, 2010.
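A toy sketch of the "virtual cloud" idea: the prediction for a new patient is taken from a vote among the k most similar historical cases, here via k-nearest neighbours on simulated data (all names and labels are illustrative):

```r
## Toy "virtual cloud": classify a new patient from the 25 most similar
## historical cases. Data are simulated; in practice the features would
## be clinical variables.
library(class)   # knn(), part of R's recommended packages

set.seed(1)
history <- matrix(rnorm(1000 * 5), ncol = 5)   # 1000 past cases, 5 features
outcome <- factor(sample(c("failure", "no failure"), 1000, replace = TRUE))
patient <- matrix(rnorm(5), nrow = 1)          # the patient at hand

knn(train = history, test = patient, cl = outcome, k = 25)
```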

Project PATIENTS
Goal: help understand and diagnose early postoperative liver and kidney failure, a leading cause of death in surgical ICUs whose risk factors, causes and prognosis are not fully understood. Efforts are devoted to the adjustment of a reliable methodology, using state-of-the-art mining procedures, for the development of accurate and robust clinical prediction models meant to complement medical reasoning in decision-making tasks.

PATIENTS database
Learning algorithms are developed and tested on clinical data from the PDMS of the COPRA System company, installed at the intensive care unit of the University Hospital of Rostock (Germany). The data consist of:
- almost 7,000 cases;
- patients admitted for major surgery between 2008 and 2011;
- up to 4,679 parameters measured at different time intervals;
- parameters including demographic, clinical and laboratory information.

Clinical translation of data mining results
In line with the currently increasing demand for translational medicine, biomedical research results must be conveyed to health-care providers in a manner that is fast and easy to understand and to apply to patient care. The way we want to make this happen is by developing a further module for integration into the COPRA PDMS, thereby turning it into a clinical decision support system (CDSS) that can:
- carry out predefined decision rules based on data in the PDMS;
- help predict potential events;
- assist the nurse or physician in their diagnosis and in the choice of an appropriate course of treatment;
- increase ICU patient safety and compliance.

Multi-disciplinary collaborations
Reaching this ultimate goal requires expertise in many disciplines and multi-disciplinary collaborations:
- medical research and clinical care: gathering data from clinical studies and offering guidance from biological reasoning;
- mathematical and statistical expertise of data analysts: methods for analysis and interpretation of results;
- computer science and developers: making the results available and usable to healthcare providers.

Experimental methodology: development of clinical prediction models
Benchmarking classification models:
- flexible models from machine learning (trees, SVMs, BNs);
- statistical models (LDA, LR), which are easier to interpret.
Special care will be taken to:
- avoid over-fitting: production of a combined model, averaged over a large number of single classifiers (reduced variance);
- get accurate performance estimates: using resampling methods to inject variation into the system and better approximate performance on future samples.
Adapted pre-processing of the data:
- the class imbalance problem is addressed at the data level using resampling (the SMOTE algorithm);
- missing values in the data are filled in with plausible values using multiple imputation (10-20 repetitions);
- all pre-processing steps are included within the resampling loop to ensure fair performance estimates; the sketch after this list illustrates that point.
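A hedged sketch of that last point: a cross-validation loop in which imputation (and, where needed, SMOTE-style rebalancing) is applied to the training fold only, so information from the test fold never leaks into pre-processing. The data frame dat and its binary factor column outcome are hypothetical:

```r
## Sketch: all pre-processing inside the resampling loop. `dat` is a
## hypothetical data frame with a binary factor column `outcome`.
library(mice)    # multiple imputation
library(rpart)   # stand-in classifier
# a SMOTE implementation (e.g. from the smotefamily package) would be
# applied to the training fold at the step marked below

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(dat)))
acc   <- numeric(10)

for (k in 1:10) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]

  # 1. impute missing values using ONLY the training fold
  #    (the text uses 10-20 imputations; m = 1 keeps the sketch short)
  train <- complete(mice(train, m = 1, printFlag = FALSE))

  # 2. rebalance classes on the training fold only (SMOTE would go here)

  # 3. fit on the training fold, evaluate on the untouched test fold
  fit    <- rpart(outcome ~ ., data = train)
  pred   <- predict(fit, test, type = "class")
  acc[k] <- mean(pred == test$outcome)
}
mean(acc)   # honest performance estimate
```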

Experimental methodology
Development of patient-specific algorithms, a well-suited methodology:
- selection of the subset of single learners that is most relevant for the patient at hand;
- patient prediction averaged over that optimised subset.
This is a way to reuse the population-wide methodology and limit the computational burden while improving the accuracy of the outcome; a sketch of the averaging step follows below. Computational solutions: a multicore CPU workstation and extensions to the R software environment for applications on large data.
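A hedged sketch of that patient-specific step: from a pool of fitted single classifiers, keep those judged most relevant to the patient at hand and average their predicted probabilities. The objects models, relevance and patient are hypothetical, as is the "failure" class label:

```r
## Sketch of patient-specific averaging over an optimised subset of learners.
## `models` is a list of fitted classifiers, `relevance` a numeric score of
## each learner's relevance to this patient, `patient` a one-row data frame
## of the patient's measurements -- all hypothetical.
predict_patient <- function(models, relevance, patient, top = 50) {
  keep  <- order(relevance, decreasing = TRUE)[seq_len(top)]  # best learners
  probs <- vapply(models[keep],
                  function(m) predict(m, patient, type = "prob")[, "failure"],
                  numeric(1))
  mean(probs)   # averaged patient-specific prediction
}
```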

Thank you for your attention