Selected Big Data Analytics Methods in Health Research

1 Selected Big Data Analytics Methods in Health Research. Research Field Key Technologies, Jülich Supercomputing Centre, Supercomputing & Big Data. Dr.-Ing. Morris Riedel et al., Adjunct Associate Professor, University of Iceland; Jülich Supercomputing Centre, Germany; Head of Research Group High Productivity Data Processing. Peking, 10th-12th September 2014

2 Outline 2/ 42

3 Outline. Research Group High Productivity Data Processing: From Big Data Analytics to Smart Data Analytics; Research Data Alliance Activities. LBSN Analytics [Changes 2013 Seed Project Follow-Up NCSA/JSC]: Scientific Case & Visualization; Clustering using Parallel DBSCAN. Big Brain Data Analytics [Collaboration INM/JSC]: Scientific Case; (Supervised) Learning from Data 101; Classification of Brain Big Data. Selected Lessons Learned. References. Note: the Changes 2013 activities are presented with selected contents of the (dropped) talk from Shaowen Wang et al. (Tue).

4 Demand for Generic Data Methods in Science. Advancing user-centered data mining methods & tools (e.g. outlier detection, support vector machines, etc.). Algorithms and Data Structures: enhancing/parallelizing generic techniques and algorithmic cores (e.g. indexing, sorting). Visualization: scaling visualization techniques for high-performance graphics and data visualization. Security: efficient security and privacy solutions applied to real-world big data use-cases.

5 Research Group High Productivity Data Processing

6 Research Group High Productivity Data Processing. [Diagram labels: Data Science as a combination of different approaches (Data Mining, Applied Statistics, Machine Learning) and Scientific Computing. Scientific Applications using Big Data: Traditional Scientific Computing Methods; HPC and HTC Paradigms & Parallelization; Smart Data Analytics Approaches & Techniques; Optimized Data Access & Management; Statistical Data Mining & Machine Learning; Big Data Methods & (Parallel) Technologies; Scientific Applications; Research Data; MLlib.]

7 Juelich Supercomputing Centre Context. Support data-intensive science and engineering applications. Explore computing that is more intertwined with data analysis. [Diagram labels: Big Data; Data Management, Security & Access; Clustering; Classification; Regression.]

8 Research Data Alliance Activities. Steering the Big Data Analytics Interest Group. Members: ~60; founded ~03/2013 at the 1st RDA Plenary (Goeteborg). Telcons: 1-2x / month; Secretary: M. Goetz (JSC). Co-chairs: M. Riedel (JSC), K. Kwo-Sen (NASA), P. Baumann (UoBremen). Systematic way: Cross Industry Standard Process for Data Mining (CRISP-DM). Big Data Analytics Group Process: concrete datasets (& source/sensor); algorithms & methods; technologies & resources; scientific data applications. Best Practices: community-based practice & recommendations. Reference Data Analytics for reusability & learning: CRISP-DM report; openly shared datasets; running analytics code. [4] Research Data Alliance. Join us in Amsterdam: 4th RDA Plenary, September 2014, two analytics sessions!

9 Research Data Alliance Application Example. Big Data Analytics Group Process ([3] P. Chapman et al., CRISP-DM Guide): satellite data (Quickbird); parallel Support Vector Machines (SVM); HPC/MPI, Map-Reduce & GPGPUs (JUDGE, JSC); classification study of land cover types. Best Practices: community-based practice (Classification++). Reference Data Analytics for reusability & learning: CRISP-DM report; openly shared datasets; running analytics code. Further application: Parallel Brain Data Analytics. [2] G. Cavallaro and M. Riedel, Smart Data Analytics Methods for Remote Sensing Applications, IGARSS 2014. [1] EUDAT B2SHARE.

10 Understanding Industry vs H1N1 Virus Made Headlines. A Nature paper from Google employees explains how Google is able to predict winter flus, not only on a national scale but down to regions; this is possible via logged big data search queries. [5] Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature 457, 2009. Big data is not always better data: think causality vs. correlation. 2014: The Parable of Google Flu, with large errors in flu prediction & lessons learned: (1) Dataset: transparency & replicability impossible; (2) Study the algorithm, since they keep changing; (3) It's not just about the size of the data. [6] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, The Parable of Google Flu: Traps in Big Data Analysis, Science Vol. 343, 2014.

11 LBSN Analytics

12 Scientific Case & Realization [Clustering]. Towards interactive visual analytics using parallel methods. Goal: answer health-related questions from publicly available data, e.g. estimated emissions per region correlated with measurement stations, or estimated pollution/emissions breathed per person per region. Visualization pipeline design: open data source OpenStreetMap (OSM) for maps/streets. Approach: towards interactive data exploration; click-free visualization, i.e. no GUI applications; support for typical overlays (density maps, polygon creation); all of the above is possible with browser programming APIs (e.g. OpenLayers). Pipeline components: parallel MPI/OpenMP trajectories (statistics) with the OSRM library; parallel MPI/OpenMP clustering (DBSCAN). Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA).

13 Approach: LBSN Trajectory Mining. Scientific domain area: Smart Cities approaches combined with Health Analytics research. Location-based Social Networks (LBSN) data: open data sources Twitter (tweets) & Foursquare (check-ins); data collection and storage: NCSA; initial computation: JSC. Scientific outcome: traffic density estimation; network emission model. Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA).

14 Example: Cluster Visualization [Clustering]. London clusters, 6/1/2014, 1h time slice beginning at 18:00 UTC, using Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA).
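The density-based clustering named on this slide can be illustrated with a minimal serial sketch. It is not the parallel implementation discussed in the talk; it assumes planar coordinates with Euclidean distance, and the parameter names `eps` and `min_pts` as well as the helper `region_query` are illustrative choices.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may still become a border point later
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


# Two dense groups and one isolated point (made-up coordinates).
demo_points = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 10), (10, 11), (11, 10), (50, 50)]
demo_labels = dbscan(demo_points, eps=1.5, min_pts=3)
```

The brute-force neighbourhood search is quadratic in the number of points, which is one reason large check-in datasets call for spatial indexing and parallelization.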

15 Data Preparation & Visualization Implementation [Clustering]. The source code is available in our Bitbucket repository on invite: Changes2013. Outcome: selected work has recently been accepted for inclusion in the Cartography and Geographic Information Science journal special issue (guest editors: Xinyue Ye, Qunying Huang, Wenwen Li). Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA).

16 Scalable Parallel Clustering Implementation [Clustering]. Existing parallel implementations did not scale for the problem domain: PDSDBSCAN, a parallel DBSCAN algorithm [8] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, Alok Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, in Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012. The new implementation is work-in-progress, but shows promising initial results: HPDBSCAN. [Figure label: DBSCAN noise ignored & separately reported.] Slide material and performance comparisons courtesy of Markus Goetz and Christian Bodenstein.

17 Further use of HPDBSCAN: data of Koljoefjords, Sweden. Data of the PANGAEA data collection [9]. Example: one month of measurements (03/2012); reality: years of data are available. Goal: automatic outlier detection using the parallel DBSCAN algorithm. Better measurement devices produce orders of magnitude more data; manual quality control becomes impossible and error-prone, so the quality control process should be automated (parallelization). [Figure labels: DBSCAN noise ignored & separately reported; outlier; e.g. boxplots not feasible.] Data courtesy of Robert Huber, Marum Center for Marine Environmental Sciences, Bremen.

18 Big Brain Data Analytics

19 JSC/INM Jointly Tackled Case. Slide courtesy of Dr. M. Axer, from the talk "New insights into the fiber architecture of the brain".

20 Big Brain Data Analytics: Scientific Case [Classification]. Goal: build a reconstructed brain (one 3D volume) that matches with section & block images; understand the sectioning of the brain and support automation of the reconstruction. 1. Some pattern exists (CRISP-DM phase [1], problem understanding): image content classification into class [1] brain part and class [2] not brain part. 2. No exact mathematical formula exists: there is no precise formula for the contour of the brain; the class boundary is non-linear. 3. Data (CRISP-DM phase [2], data understanding): block face images (of frozen tissue), one every 20 micron (cut size); resolution 3272 x 2469; ~14 MB / RGB image; ~8 MB / corresponding mask image (ground truth); ~700 images; ~40 GB dataset. We can apply supervised learning because the data is labelled (class labels [1] and [2]).

21 Supervised Learning: Mathematical Building Blocks (1). Unknown Target Function (ideal function): elements we do not exactly (need to) know. Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage. Learning Algorithm ("train a system"; set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive. Hypothesis Set (set of candidate formulas): elements that we derive from our skillset. Two solution tools, the learning algorithm and the hypothesis set, form the learning model.

22 Statistical Learning Theory: Probability Distribution on X. Question: can we really learn a function from data? Learning is only possible in a probabilistic sense. Probability distribution on X [no restriction]: in-sample data tracks out-of-sample data created from the same distribution. Given large N, there is a probability of picking one point or another; data points are created independently from each other. Mathematically established via Hoeffding's inequality: the in-sample frequency is likely close to the out-of-sample frequency (out of sample: future brains we don't know yet; use for prediction). M is the number of hypotheses. Practice: we need enough samples N to learn. Problem: M is infinite, but we can reduce it to ~N. Solution: VC dimension (large overlaps).
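For reference, the bound the slide refers to can be written as follows. The notation is an assumption taken from the standard learning-from-data setting, not from the slide itself: $E_{\mathrm{in}}$ is the in-sample error, $E_{\mathrm{out}}$ the out-of-sample error, $g$ the final hypothesis, $\epsilon$ a tolerance, $N$ the number of independent samples, and $M$ the number of hypotheses.

```latex
\[
  \mathbb{P}\bigl[\,\lvert E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g) \rvert > \epsilon \,\bigr]
  \;\le\; 2\,M\,e^{-2\epsilon^{2}N}
\]
```

For fixed $M$ and $\epsilon$ the right-hand side shrinks exponentially in $N$, which is why enough samples are needed; for infinite $M$ the bound is vacuous, which motivates replacing $M$ with a quantity governed by the VC dimension.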

23 Supervised Learning: Mathematical Building Blocks (2). Unknown Target Function (ideal function) and Probability Distribution: elements we do not exactly (need to) know; constants in learning. Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage. Learning Algorithm ("train a system"; set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive. Hypothesis Set (set of candidate formulas): elements that we derive from our skillset.

24 Statistical Learning Theory: Error Measure & Noisy Targets. Question: how can we learn a function from (noisy) data? Error measures quantify our progress towards the goal. The error measure is often user-defined; if not, often the squared error is used. It is also known as point-wise error measure. (Noisy) target: the target is not a (deterministic) function; getting the same y out for the same x is not always given in practice. Problem: noise in the data hinders us from learning. Idea: use a target distribution instead of a target function.
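The point-wise squared error and the noisy-target idea can be sketched in formulas. The symbols are assumptions in standard notation and do not appear on the slide: $h$ is a hypothesis, $f$ the target function, $(x_n, y_n)$ the $N$ training examples.

```latex
\[
  e\bigl(h(x), f(x)\bigr) = \bigl(h(x) - f(x)\bigr)^{2},
  \qquad
  E_{\mathrm{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} e\bigl(h(x_n), y_n\bigr)
\]
\[
  \text{noisy target:}\quad y \sim P(y \mid x)
  \quad\text{instead of}\quad y = f(x)
\]
```

With a target distribution, each label $y_n$ is a draw from $P(y \mid x_n)$, so two identical inputs may legitimately carry different labels.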

25 Supervised Learning: Mathematical Building Blocks (3). Unknown Target Distribution (target function plus noise; ideal function) and Probability Distribution: elements we do not exactly (need to) know; constants in learning. Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage. Error Measure. Learning Algorithm ("train a system"; set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive. Hypothesis Set (set of candidate formulas): elements that we derive from our skillset.

26 Supervised Learning: Selected Classification Method [Classification]. Initial approach: Support Vector Machines (SVMs) [7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3). One of the best out-of-the-box / robust classification methods ([4] Modelling). Binary classifier that separates two classes: [1] brain; [2] non-brain. Parameters after cross-validation; radial basis function (RBF) kernel; C-SVC type. Uses quadratic programming & the Lagrangian method with N x N quadratic coefficients. Cross-validation (grid search) is nicely parallel (high-throughput computing). [Figure labels: linear example; maximal margin classifier example; maximizing the margin of the hyperplane turned into an optimization problem (minimization, dual problem); maximal-margin hyperplane dual problem, solved using a quadratic programming method; quadratic coefficients.]
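The dual problem mentioned in the figure labels can be stated in its textbook soft-margin form. The notation is an assumption: $\alpha_n$ are the Lagrange multipliers, $y_n \in \{-1, +1\}$ the class labels, $K$ the kernel, $C$ and $\gamma$ the two parameters tuned by cross-validation.

```latex
\[
  \min_{\alpha}\; \tfrac{1}{2}\,\alpha^{\top} Q\,\alpha - \mathbf{1}^{\top}\alpha
  \quad\text{with}\quad Q_{nm} = y_n\,y_m\,K(x_n, x_m)
\]
\[
  \text{subject to}\quad 0 \le \alpha_n \le C \;\; (n = 1,\dots,N),
  \qquad \sum_{n=1}^{N} \alpha_n\,y_n = 0
\]
\[
  K(x, x') = \exp\bigl(-\gamma\,\lVert x - x' \rVert^{2}\bigr),
  \qquad
  g(x) = \operatorname{sign}\Bigl(\sum_{n=1}^{N} \alpha_n\,y_n\,K(x_n, x) + b\Bigr)
\]
```

The matrix $Q$ holds the $N \times N$ quadratic coefficients, so memory and time grow at least quadratically with the number of training samples $N$.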

27 Big Brain Data Analytics: Data Preprocessing (1) [Classification]. Feature selection/extraction/reduction ([3] Data Preparation). Labelled dataset with only three features (RGB channels). E.g. principal component analysis (PCA) is not very helpful, since the RGB channels are orthogonal. [Figure labels: x pixel location [1D]; y pixel location [2D]; RGB (red, green, blue) [3D], sum = 3 features; 3D brain [4D].]

28 Big Brain Data Analytics: Data Preprocessing (2) [Classification]. Transform images to the LibSVM format ([3] Data Preparation). Labelled dataset with only three features (RGB channels). Each line is a training vector for one pixel: the class label (black/white mask value) followed by the red, green, and blue levels as index:value pairs. [Example rows: label 1:red 2:green 3:blue.] Samples: a subset of all instances (we have to pick the samples randomly).
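A minimal sketch of this transformation, assuming the pixels are already available as plain tuples. The function names and the mapping of mask values to class labels (black mask pixel to class 0, anything else to class 1) are illustrative assumptions, not taken from the slides.

```python
def pixel_to_libsvm(label, rgb):
    """Format one pixel as a LibSVM line: '<label> 1:<R> 2:<G> 3:<B>'."""
    features = " ".join(f"{i}:{v}" for i, v in enumerate(rgb, start=1))
    return f"{label} {features}"


def image_to_libsvm(rgb_pixels, mask_pixels):
    """Pair every RGB pixel with its mask pixel and yield LibSVM lines.

    Assumption: mask value 0 (black) means class 0, any other value class 1.
    """
    for rgb, mask in zip(rgb_pixels, mask_pixels):
        yield pixel_to_libsvm(0 if mask == 0 else 1, rgb)


# Two made-up pixels: one background pixel and one tissue-coloured pixel.
demo_lines = list(image_to_libsvm([(0, 0, 0), (200, 180, 170)], [0, 255]))
```

LibSVM feature indices start at 1 and must be ascending, which `enumerate(..., start=1)` guarantees here.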

29 Big Brain Data Analytics: Data Preprocessing (3) [Classification]. Smart data sampling ([3] Data Preparation). The data bears the potential of sampling bias (here: many more black than white pixels). Solution: create samples equally per class ([1] brain; [2] non-brain). Create different datasets for training & testing (same structure, but avoid data snooping!): r_msa_ _dxxxx-xx-xx_all_train and r_msa_ _dxxxx-xx-xx_all_test.svm.
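The balanced sampling idea can be sketched as follows. The function name `balanced_split`, its parameters, and the fixed seed are illustrative assumptions; the sketch draws the same number of samples per class and keeps training and test indices disjoint.

```python
import random


def balanced_split(labels, per_class, seed=0):
    """Draw per_class training and per_class test indices for every class.

    Training and test indices never overlap, so test data stays unseen.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for label, indices in sorted(by_class.items()):
        if len(indices) < 2 * per_class:
            raise ValueError(f"class {label} has too few samples")
        picked = rng.sample(indices, 2 * per_class)  # without replacement
        train.extend(picked[:per_class])
        test.extend(picked[per_class:])
    return train, test


# Made-up, heavily imbalanced labels: 1000 of class 0 and 50 of class 1.
demo_labels = [0] * 1000 + [1] * 50
demo_train, demo_test = balanced_split(demo_labels, per_class=20)
```

Sampling without replacement from one pool per class is what keeps the two datasets structurally identical yet non-overlapping.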

30 Big Brain Data Analytics: Approach Overview. Sampling, training and testing. Parameters: C (allowing error) and gamma (RBF kernel).

31 Big Brain Data Analytics: Initial Results [Classification]. Approach: cross-validation for model selection ([4] Modelling). (Skipped for simplicity: essentially a grid search to get the two parameters C and gamma; C relates to the sum of errors, i.e. the slack variables of the soft-margin SVM model, and thus determines the number and severity of violations.) Approach: serial SVM implementations ([4] Modelling). Data: nearly full dataset, but equally balanced classes. Stopped after using three different serial implementations. "Big data problem": the plain balanced dataset is too large to be properly processed in serial. Potential solution: create smaller samples from the plain balanced dataset.
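The grid search with cross-validation that the slide skips can be sketched generically. The training itself is left to a user-supplied callback `fit_and_score`, a hypothetical stand-in for training an RBF-kernel SVM on the training folds and returning its validation accuracy; all names here are illustrative assumptions.

```python
from itertools import product


def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds; yield (train_idx, val_idx)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def grid_search(fit_and_score, n, k, c_values, gamma_values):
    """Return (best_mean_score, best_C, best_gamma) over the parameter grid.

    fit_and_score(train_idx, val_idx, C, gamma) is a hypothetical callback
    that trains on train_idx and returns the accuracy on val_idx.
    """
    best = None
    for c, gamma in product(c_values, gamma_values):
        scores = [fit_and_score(tr, va, c, gamma)
                  for tr, va in k_fold_indices(n, k)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[0]:
            best = (mean, c, gamma)
    return best


def fake_score(train_idx, val_idx, c, gamma):
    """Synthetic score that peaks at C=10, gamma=0.1 (for demonstration)."""
    return 1.0 - abs(c - 10) / 100 - abs(gamma - 0.1)


demo_best = grid_search(fake_score, n=10, k=5,
                        c_values=[1, 10, 100], gamma_values=[0.01, 0.1, 1])
```

Every (C, gamma, fold) combination is independent of the others, which is why this step maps well onto high-throughput computing. The interleaved folds assume the samples are already shuffled.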

32 Big Brain Data Analytics: Data Preprocessing Revisited [Classification]. Smart data sampling ([3] Data Preparation). Create very small samples, still with balanced classes. Approach: 0.01 % of the data per class. Create different datasets for training & testing: r_msa_ _dxxxx-xx-xx_all_train and r_msa_ _dxxxx-xx-xx_all_test.svm.

33 Big Brain Data Analytics: Selected Serial Results [Classification]. Approach: serial SVM implementations ([4] Modelling). Data: sampled very small dataset, but equally balanced classes. [We reached a limit: the approach is not scalable for larger quantities of data.] Scikit-learn (Python) example ([5] Evaluation): training time ~39 minutes on JUDGE; testing time ~2 min; accuracy ~91%. Matlab example ([5] Evaluation): training time ~3 hours on a laptop; testing time ~27 min; accuracy ~90.1%. Using the small sample size worked to train with some serial implementations. Potential improvement: training-time reduction while maintaining classification accuracy.

34 Big Brain Data Analytics: Selected Parallel Results [Classification]. Approach: parallel SVM implementations ([4] Modelling). Data: sampled very small dataset, but equally balanced classes. No limit: theoretically scalable, but usefulness depends on the datasets. Incremented the number of datasets, towards a sample size of 0.1%. Twister (iterative Map-Reduce) example ([5] Evaluation): challenge: data distribution across the parallel infrastructure on FutureGrid; training time ~7 minutes on JUDGE; testing time ~7 min; accuracy ~96%. The parallel version with Twister (iterative Map-Reduce) is working with a growing dataset. Work-in-progress: work on other parallel implementations. [Figure label: MLlib.]

35 Selected Lessons Learned

36 Big Brain Data Analytics: Parallel SVM Technologies.
Tool | Platform | Approach (Parallel Support Vector Machine)
Apache Mahout | Java; Apache Hadoop 1.0 (map-reduce); HTC | No strategy for implementation (website), serial SVM in code
Apache Spark/MLlib | Apache Spark; HTC | Only linear SVM; no multi-class implementation
Twister/ParallelSVM | Java; Apache Hadoop 1.0 (map-reduce); Twister (iterations); HTC | Many dependencies on other software: Hadoop, messaging, etc.
Scikit-Learn | Python; HPC/HTC | Multi-class implementations of SVM, but not fully parallelized
piSVM | C code; Message Passing Interface (MPI); HPC | Simple multi-class parallel SVM implementation, outdated (~2011)
GPU accelerated LIBSVM | CUDA language | Multi-class parallel SVM, relatively hard to program, no std. (CUDA)
pSVM | C code; Message Passing Interface (MPI); HPC | Unstable beta, SVM implementation outdated (~2011)
Journal paper in preparation. [Diagram labels: Clustering++, Classification++, Regression++. Algorithm A implementation: closed/old source, also after asking the paper authors. Algorithm Extension A' implementation; parallelization of Algorithm Extension A'. Implementations available vs. implementations rare and/or not stable.]

37 Big Brain Data Analytics: Classifier Challenges [Classification]. Sampling, training and testing. Checking out-of-sample performance ([5] Evaluation), e.g. using two different images and comparing with the masks [figure labels: issue; ~ok]. Check out-of-sample performance & better data understanding. Problem: 2D cut classification on 3D brain color data (i.e. background color). Solution: e.g. use neighbouring methods. Color histograms: e.g samples of class 0 & only 567 of class 1 (for G).

38 Big Brain Data Analytics: Potential Next Approaches. Approach to the 2D/3D problem: apply Self-Dual Attribute Profiles (SDAP); increase the number of dimensions, using different threshold values; takes advantage of neighbouring pixels and cuts trees at certain thresholds; good experience in land cover classification: e.g. ~70% to ~90% accuracy. [9] G. Cavallaro, M. Mura, J.A. Benediktsson, L. Bruzzone, A Comparison of Self-Dual Attribute Profiles based on different filter rules for classification, IEEE IGARSS 2014, Quebec, Canada. [Figure labels: area; std dev; moment of inertia; example: std dev (channel blue).] Approach: increase the number of training samples (no more serial); very small sampling may violate the generalization capability of the classifier. Approach: compare with other (parallel) classification methods, e.g. Naive Bayes classifier, decision trees, random forests, etc.

39 Big Brain Data Analytics: Much More Interesting Challenges! 3D reconstruction of high-resolution images; data: brain slices (microscopic measurements). Mapping of cell densities and cortical areas (in 3D); data: ~1 PB/brain. Analyse differences of brains evolving over time (longitudinal studies); data: e.g. Kohorte project (in-vivo human studies); e.g. Vervet monkey, 20 brains (in-vivo and post-mortem monkeys). Close collaboration between JSC and INM bears lots of potential for tackling research challenges.

40 Acknowledgements & References

41 Acknowledgements. Gabriele Cavallaro, University of Iceland; Tomas Philipp Runarsson, University of Iceland; Shaowen Wang, National Center for Supercomputing Applications; Junjun Yin, National Center for Supercomputing Applications; Markus Axer, Stefan Köhnen, Tim Hütz, Institute of Neuroscience & Medicine, Juelich. Selected members of the Research Group on High Productivity Data Processing: Ahmed Shiraz Memon, Mohammad Shahbaz Memon, Markus Goetz, Christian Bodenstein, Philipp Glock, Matthias Richerzhagen.

42 References
[1] EUDAT European Data Infrastructure, B2SHARE Tool, Online:
[2] G. Cavallaro & M. Riedel et al., Smart Data Analytics Methods for Remote Sensing Applications, IEEE IGARSS, Quebec, Canada
[3] P. Chapman et al., CRISP-DM Guide
[4] Research Data Alliance, online:
[5] Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature 457, 2009
[6] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, The Parable of Google Flu: Traps in Big Data Analysis, Science Vol. 343, 2014
[7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3)
[8] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, Alok Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, in Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012
[9] PANGAEA data collection, Talk available at