Selected Big Data Analytics Methods in Health Research

Research Field Key Technologies, Jülich Supercomputing Centre: Supercomputing & Big Data
  Dr.-Ing. Morris Riedel et al.
  Adjunct Associate Professor, University of Iceland
  Head of Research Group High Productivity Data Processing, Jülich Supercomputing Centre, Germany
  Peking, 10th-12th September 2014

Outline
  Research Group High Productivity Data Processing
    From Big Data Analytics to Smart Data Analytics
    Research Data Alliance Activities
  LBSN Analytics [Changes 2013 Seed Project Follow-Up NCSA/JSC]
    Scientific Case & Visualization
    Clustering using Parallel DBSCAN
  Big Brain Data Analytics [Collaboration INM/JSC]
    Scientific Case
    (Supervised) Learning from Data 101
    Classification of Brain Big Data
  Selected Lessons Learned
  References
  Note: the Changes 2013 activities include selected contents of the (dropped) Tuesday talk by Shaowen Wang et al.
3 / 42

Demand for Generic Data Methods in Science
  Advancing user-centered data mining methods & tools (e.g. outlier detection, support vector machines, etc.)
  Algorithms and Data Structures: enhancing/parallelizing generic techniques and algorithmic cores (e.g. indexing, sorting)
  Visualization: scaling visualization techniques for high-performance graphics and data visualization
  Security: efficient security and privacy solutions applied to real-world big data use cases
4 / 42

Research Group High Productivity Data Processing
  Diagram labels (data science): Data Mining, Applied Statistics, Machine Learning; Different Approaches; Scientific Computing
  Scientific Applications using Big Data:
    Traditional Scientific Computing Methods
    HPC and HTC Paradigms & Parallelization
    Smart Data Analytics Approaches & Techniques
    Optimized Data Access & Management
    Statistical Data Mining & Machine Learning
  Diagram labels (layers): Big Data Methods; (Parallel) Technologies; Scientific Applications; Research Data; MLlib
6 / 42

Juelich Supercomputing Centre Context
  Support data-intensive science and engineering applications
  Explore computing that is more intertwined with data analysis
  Big Data: Data Management, Security & Access; Clustering; Classification; Regression
7 / 42

Research Data Alliance Activities
  Steering the Big Data Analytics Interest Group
    Members: ~60; founded ~03/2013 at the 1st RDA Plenary (Göteborg)
    Telcons: 1-2x / month; Secretary: M. Goetz (JSC)
    Co-chairs: M. Riedel (JSC), K. Kwo-Sen (NASA), P. Baumann (UoBremen)
  Systematic way: Cross Industry Standard Process for Data Mining (CRISP-DM)
  Big Data Analytics Group process: concrete datasets (& source/sensor); algorithms & methods; technologies & resources; scientific data applications
  Best practices: community-based practice & recommendations
  Reference data analytics for reusability & learning: CRISP-DM report; openly shared datasets; running analytics code
  [4] Research Data Alliance
  Join us in Amsterdam: 4th RDA Plenary, 22-24 September 2014. Two analytics sessions!
8 / 42

Research Data Alliance Application Example
  Big Data Analytics Group process, following [3] P. Chapman et al., CRISP-DM Guide
  Example: satellite data (Quickbird); parallel Support Vector Machines (SVM); HPC/MPI, Map-Reduce & GPGPUs; classification study of land cover types
  Best practices: community-based practice; JUDGE system @ JSC; Classification++
  Reference data analytics for reusability & learning: CRISP-DM report; parallel brain data analytics; openly shared datasets; running analytics code
  [1] EUDAT B2SHARE
  [2] G. Cavallaro and M. Riedel, Smart Data Analytics Methods for Remote Sensing Applications, IGARSS 2014
9 / 42

Understanding Industry vs.
  2009: H1N1 virus made headlines
    Nature paper from Google employees
    Explains how Google is able to predict winter flus
    Not only on a national scale, but down to regions
    Possible via logged big data search queries
    [5] Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature 457, 2009
  Big data is not always better data: think causality vs. correlation
  2014: The Parable of Google Flu
    Large errors in flu prediction & lessons learned
    (1) Dataset: transparency & replicability impossible
    (2) Study the algorithm, since it keeps changing
    (3) It's not just about the size of the data
    [6] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, The Parable of Google Flu: Traps in Big Data Analysis, Science, vol. 343, 2014
10 / 42

LBSN Analytics 11 / 42

Scientific Case & Realization (Clustering)
  Towards interactive visual analytics using parallel methods
  Goal: answer health-related questions from publicly available data
    E.g. estimated emissions per region correlated with measurement stations
    E.g. estimated pollution/emissions breathing per person per region
  Visualization pipeline design:
    Open data source: OpenStreetMap (OSM) maps/streets
    Approach: towards interactive data exploration
    Click-free visualization, i.e. no GUI applications
    Support for typical overlays (density maps, polygon creation)
    All of the above is possible with browser programming APIs (e.g. OpenLayers)
  Pipeline components: parallel MPI/OpenMP trajectories (statistics) with the OSRM library; parallel MPI/OpenMP clustering (DBSCAN)
  Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA)
12 / 42

Approach: LBSN Trajectory Mining
  Scientific domain area: Smart Cities approaches combined with health analytics research
  Scientific outcome: traffic density estimation; network emission model
  Location-based Social Networks (LBSN) data
    Open data sources: Twitter (tweets) & Foursquare (check-ins)
    Data collection and storage: NCSA
    Initial computation: JSC
  Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA)
13 / 42

Example: Cluster Visualization (Clustering)
  London clusters, 6/1/2014, 1h time slice beginning at 18:00 UTC, computed using Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA)
14 / 42
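To make the clustering step concrete, here is a minimal serial sketch of DBSCAN in plain Python. It is not the parallel implementation used for the London data: the function name `dbscan`, the toy coordinates and the values of `eps` and `min_pts` are illustrative assumptions, and the brute-force neighbour search would not scale to real LBSN data.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Brute-force region query: indices of all points within eps of point i.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Toy data (made up): two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point ends up with the noise label, which is the property the later slides rely on when noise is reported separately.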

Data Preparation & Visualization Implementation (Clustering)
  The source code is available in our Bitbucket repository on invitation: https://bitbucket.org/markus.goetz/changes
  Changes2013 outcome: selected work has recently been accepted for inclusion in the Cartography and Geographic Information Science journal special issue (guest editors: Xinyue Ye, Qunying Huang, Wenwen Li)
  Slide courtesy of Markus Goetz in close collaboration with Junjun Yin & Shaowen Wang (NCSA)
15 / 42

Scalable Parallel Clustering Implementation (Clustering)
  Existing parallel implementations did not scale for the problem domain
  PDSDBSCAN: parallel DBSCAN algorithm [8]
  DBSCAN noise ignored & separately reported
  New implementation is work in progress, but shows promising initial results: HPDBSCAN
  [8] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, Alok Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, in Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012
  Slide material and performance comparisons courtesy of Markus Goetz and Christian Bodenstein
16 / 42

Further use of HPDBSCAN: Data of Koljoefjords, Sweden
  Data of the PANGAEA data collection [9]
  Example: one month of measurements (03/2012); in reality, years of data are available
  Goal: automatic outlier detection using the parallel DBSCAN algorithm
    Better measurement devices produce orders of magnitude more data
    Manual quality control becomes impossible and error-prone
    Automate the quality control process (parallelization)
  Figure annotations: DBSCAN noise ignored & separately reported; outlier marked; e.g. boxplots not feasible
  [9] PANGAEA data collection
  Data courtesy of Robert Huber, MARUM Center for Marine Environmental Sciences, Bremen
17 / 42
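The idea of treating DBSCAN noise as outliers can be sketched for a single measured variable as follows. This is a simplified serial illustration, not HPDBSCAN: the helper `dbscan_noise_1d`, the readings and the thresholds are all made up for the example and are not taken from the PANGAEA data.

```python
def dbscan_noise_1d(values, eps, min_pts):
    """Return the values that DBSCAN would label as noise in one dimension."""
    # Core points have at least min_pts values (including themselves) within eps.
    core = [v for v in values
            if sum(abs(v - w) <= eps for w in values) >= min_pts]
    # Noise points are neither core points nor within eps of any core point.
    return [v for v in values if not any(abs(v - c) <= eps for c in core)]


# Hypothetical sensor readings with one implausible spike.
readings = [7.1, 7.2, 7.0, 7.3, 7.2, 12.9, 7.1, 7.0]
outliers = dbscan_noise_1d(readings, eps=0.3, min_pts=3)
print(outliers)
```

In an automated quality-control pipeline the flagged values would be reported separately rather than silently dropped, in line with the "noise ignored & separately reported" annotation above.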

Big Brain Data Analytics 18 / 42

JSC/INM Jointly Tackled Case
  Slide courtesy of Dr. M. Axer, from the talk "New insights into the fiber architecture of the brain"
19 / 42

Big Brain Data Analytics: Scientific Case (Classification)
  Build a reconstructed brain (one 3D volume) that matches the sections & block images
  Understand the sectioning of the brain and support automation of the reconstruction
  1. Some pattern exists ([1] Problem understanding phase)
     Image content classification: [1] brain part; [2] not brain part
  2. No exact mathematical formula exists
     No precise formula for the contour of the brain; a non-linear class boundary
  3. Data ([2] Data understanding phase)
     Block face images (of frozen tissue), every 20 micron (cut size)
     Resolution: 3272 x 2469
     ~14 MB / RGB image; ~8 MB / corresponding mask image (ground truth)
     ~700 images, ~40 GB dataset
     We can apply supervised learning because the data is labelled
  Figure annotations: class labels [1] and [2] marked on example images
20 / 42

Supervised Learning: Mathematical Building Blocks (1)
  Unknown Target Function (ideal function): elements we do not exactly (need to) know
  Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage
  Learning Algorithm ("train a system"; set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive
  Hypothesis Set (set of candidate formulas): elements that we derive from our skillset
  Two solution tools, the learning algorithm and the hypothesis set, form the learning model
21 / 42

Statistical Learning Theory: Probability Distribution on X
  Question: can we really learn a function from data?
  Learning is only possible in a probabilistic sense [no restriction]
    In-sample data tracks out-of-sample data created from the same distribution
    Given large N: there is a probability of picking one point or another
    Data points are created by the probability distribution independently of each other
  Mathematically established via Hoeffding's inequality
    The in-sample frequency is likely close to the out-of-sample frequency
    Out of sample (e.g. future brains we do not know yet): use the final hypothesis to predict
    M is the number of hypotheses
  Practice: need enough samples N to learn
  Problem: M is infinite, but we can reduce it to ~N (large overlaps)
  Solution: VC dimension
22 / 42
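The formula itself did not survive in this transcript. A common textbook statement of the bound for a finite hypothesis set of size M is P[|E_in(g) - E_out(g)| > eps] <= 2 M exp(-2 eps^2 N); the notation E_in / E_out and the function names below are assumptions made for this sketch, which simply evaluates that bound and solves it for N.

```python
from math import ceil, exp, log


def hoeffding_bound(n, eps, m=1):
    """Upper bound on P[|E_in - E_out| > eps] for m hypotheses and n samples."""
    return min(1.0, 2 * m * exp(-2 * eps ** 2 * n))


def samples_needed(eps, delta, m=1):
    """Smallest n for which the bound above is at most delta."""
    return ceil(log(2 * m / delta) / (2 * eps ** 2))


# How many samples guarantee |E_in - E_out| <= 0.05 with probability >= 95 %?
n = samples_needed(eps=0.05, delta=0.05, m=1)
print(n, hoeffding_bound(n, eps=0.05, m=1))

# The requirement grows only logarithmically with the number of hypotheses.
print(samples_needed(eps=0.05, delta=0.05, m=1000))
```

This is why "need enough samples N to learn" is stated as a practical condition: the bound is vacuous for small N and only becomes useful once N is large relative to log(M) / eps^2.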

Supervised Learning: Mathematical Building Blocks (2)
  Unknown Target Function (ideal function) and Probability Distribution: elements we do not exactly (need to) know; constants in learning
  Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage
  Learning Algorithm ("train a system"; set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive
  Hypothesis Set (set of candidate formulas): elements that we derive from our skillset
23 / 42

Statistical Learning Theory: Error Measure & Noisy Targets
  Question: how can we learn a function from (noisy) data?
  Error measure
    Error measures quantify our progress towards the goal
    Often user-defined; if not, often the squared error
    Also known as a point-wise error measure
  (Noisy) target
    The target function is not a (deterministic) function
    Getting the same y out for the same x is not always given in practice
    Problem: noise in the data hinders us from learning
    Idea: use a target distribution instead of a target function
24 / 42
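As a small illustration of a point-wise error measure, the sketch below defines the squared error for one example and averages it over a set of training examples. The function names, the toy examples and the identity hypothesis are assumptions for the illustration only.

```python
def squared_error(prediction, target):
    """Point-wise squared error between a prediction and the observed target."""
    return (prediction - target) ** 2


def in_sample_error(hypothesis, examples, error=squared_error):
    """Average point-wise error of a hypothesis over (x, y) training examples."""
    return sum(error(hypothesis(x), y) for x, y in examples) / len(examples)


def identity(x):
    """Toy hypothesis h(x) = x."""
    return x


# Noisy observations of a target that is roughly y = x (made-up numbers).
examples = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
e_in = in_sample_error(identity, examples)
print(e_in)
```

Passing the error function as a parameter mirrors the "often user-defined" remark: the same averaging works for any other point-wise measure.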

Supervised Learning: Mathematical Building Blocks (3)
  Unknown Target Distribution (target function plus noise; ideal function) and Probability Distribution: elements we do not exactly (need to) know; constants in learning
  Training Examples (historical records, ground truth data, examples): elements we must and/or should have and that might raise huge demands for storage
  Error Measure
  Learning Algorithm ("train a system"; set of known algorithms) and Final Hypothesis (final formula): elements that we derive from our skillset and that can be computationally intensive
  Hypothesis Set (set of candidate formulas): elements that we derive from our skillset
25 / 42

Supervised Learning: Selected Classification Method (Classification)
  Initial approach: Support Vector Machines (SVMs) [7]
  One of the best out-of-the-box / robust classification methods ([4] Modelling)
  Binary classifier separates two classes: [1] brain; [2] non-brain
  Parameters after cross-validation; radial basis function (RBF) kernel; C-SVC type
  Uses quadratic programming & the Lagrangian method with N x N quadratic coefficients
  Cross-validation (grid search) is nicely parallel (high throughput computing)
  Figure captions: linear example; maximal margin classifier example; maximizing hyperplane turned into an optimization problem (minimization, dual problem); max. hyperplane dual problem, using a quadratic programming method; quadratic coefficients
  [7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3), pp. 273-297, 1995.
26 / 42
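The optimization problem shown on the slide is not preserved in this transcript. For orientation, the standard textbook form of the soft-margin (C-SVC) dual with an RBF kernel is given below; the symbols (alpha_n, y_n, K, gamma) follow common convention and are not taken from the slide.

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{n=1}^{N} \alpha_n
  - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \alpha_n \alpha_m K(x_n, x_m) \\
\text{subject to}\quad & 0 \le \alpha_n \le C \quad (n = 1, \dots, N),
  \qquad \sum_{n=1}^{N} \alpha_n y_n = 0, \\
\text{with}\quad & K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right).
\end{aligned}
```

The N x N matrix with entries y_n y_m K(x_n, x_m) holds the quadratic coefficients handed to the quadratic programming solver, which is why memory and run time grow quickly with the number of training samples N.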

Big Brain Data Analytics: Data Preprocessing (1) (Classification)
  Feature selection/extraction/reduction ([3] Data Preparation)
  Labelled dataset with only three features (RGB channels)
  E.g. principal component analysis (PCA) is not very helpful, as RGB is orthogonal
  Figure labels: x pixel location [1D]; y pixel location [2D]; RGB (red, green, blue) [3D]; 3D brain [4D]; sum = 3 features
27 / 42

Big Brain Data Analytics: Data Preprocessing (2) (Classification)
  Transform images to the LibSVM format ([3] Data Preparation)
  Labelled dataset with only three features (RGB channels)
  Each line is a training vector for one pixel: class label (B/W), followed by the red, green and blue levels
    0 1:0.105882 2:0.109804 3:0.101961
    1 1:0.364706 2:0.360784 3:0.356863
    0 1:0.152941 2:0.34902 3:0.454902
    ...
    1 1:0.247059 2:0.247059 3:0.227451
    0 1:0.411765 2:0.411765 3:0.415686
  3 features; ~#XYZ out of instances: samples (we have to pick the samples randomly)
28 / 42
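A minimal sketch of how one labelled pixel could be turned into such a line. It assumes 8-bit channel values scaled by 1/255, which is consistent with the sample lines above but not stated on the slide; the function name and the example pixels are hypothetical.

```python
def to_libsvm_line(label, rgb):
    """Format one labelled pixel as '<label> 1:<r> 2:<g> 3:<b>'."""
    # Assumption: 8-bit channels, scaled to [0, 1]; feature indices start at 1.
    features = " ".join(
        f"{index}:{value / 255:.6g}" for index, value in enumerate(rgb, start=1)
    )
    return f"{label} {features}"


# Hypothetical (label, (R, G, B)) pairs taken from an image and its mask.
pixels = [(0, (27, 28, 26)), (1, (93, 92, 91)), (0, (39, 89, 116))]
lines = [to_libsvm_line(label, rgb) for label, rgb in pixels]
print("\n".join(lines))
```

Writing one pixel per line keeps the conversion trivially parallel across images, at the cost of very large text files for the full dataset.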

Big Brain Data Analytics: Data Preprocessing (3) (Classification)
  Smart data sampling ([3] Data Preparation)
  The data bears the potential of sampling bias (here: many more black than white pixels)
  Solution: create samples equally per class ([1] brain; [2] non-brain)
  Create different datasets for training & testing (same structure, but avoid data snooping!)
    r_msa_03-2009_dxxxx-xx-xx_all_train
    r_msa_03-2009_dxxxx-xx-xx_all_test.svm
29 / 42
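One way to draw balanced, non-overlapping training and test samples is sketched below. The function `balanced_split`, the class sizes and the sample size per class are assumptions for the illustration; the slides do not describe the actual sampling code.

```python
import random


def balanced_split(labels, per_class, seed=0):
    """Pick per_class training and per_class test indices for every class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        # Sampling without replacement keeps training and test sets disjoint.
        picked = rng.sample(indices, 2 * per_class)
        train.extend(picked[:per_class])
        test.extend(picked[per_class:])
    return sorted(train), sorted(test)


# Imbalanced toy labels: many more pixels of class 0 than of class 1.
labels = [0] * 900 + [1] * 100
train, test = balanced_split(labels, per_class=25)
print(len(train), len(test))
```

Keeping the test indices disjoint from the training indices is what avoids the data snooping mentioned above; each class needs at least `2 * per_class` pixels for the draw to succeed.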

Big Brain Data Analytics: Approach Overview
  Sampling, training and testing
  Parameters: C (allowing error) and gamma (RBF kernel)
30 / 42

Big Brain Data Analytics: Initial Results (Classification)
  Approach: cross-validation for model selection ([4] Modelling)
    Skipped for simplicity: essentially a grid search for the two parameters C and gamma
    C bounds the sum of errors and so determines the number and severity of violations (soft-margin SVM model; the errors are also called slack variables)
  Approach: serial SVM implementations ([4] Modelling)
    Data: nearly full dataset, but equally balanced classes
    Stopped after using three different serial implementations
  "Big data problem": the plain balanced dataset is too large to be properly processed in serial
  Potential solution: create smaller samples from the plain balanced dataset
31 / 42
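The grid search over C and gamma can be sketched with scikit-learn, one of the serial tools named later in the talk. The synthetic data, the grid values and the number of folds are assumptions for this illustration and are not the settings used for the brain images.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for sampled pixels: class 0 is dark, class 1 is bright.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.uniform(0.0, 0.4, size=(100, 3)),
    rng.uniform(0.6, 1.0, size=(100, 3)),
])
y = np.array([0] * 100 + [1] * 100)

# Every (C, gamma) pair is scored by 5-fold cross-validation; the fits are
# independent of each other, which is what makes the search easy to parallelize.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

On real data the grid is usually spaced logarithmically and refined around the best cell; each cell requires training several SVMs, which is where serial implementations run out of steam on large samples.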

Big Brain Data Analytics: Data Preprocessing Revisited (Classification)
  Smart data sampling ([3] Data Preparation)
  Create very small samples, still with balanced classes
  Approach: 0.01 % of the data per class
  Create different datasets for training & testing
    r_msa_03-2009_dxxxx-xx-xx_all_train
    r_msa_03-2009_dxxxx-xx-xx_all_test.svm
32 / 42

Big Brain Data Analytics: Selected Serial Results (Classification)
  Approach: serial SVM implementations ([4] Modelling)
  Data: very small sampled dataset, but equally balanced classes
  We reached a limit: the approach is not scalable to larger quantities of data
  Scikit-learn (Python) example ([5] Evaluation)
    Training time: ~39 minutes on JUDGE; testing time: ~2 minutes; accuracy: ~91%
  Matlab example ([5] Evaluation)
    Training time: ~3 hours on a laptop; testing time: ~27 minutes; accuracy: ~90.1%
  Using the small sample size made training feasible with some serial implementations
  Potential improvement: reduce training time while maintaining classification accuracy
33 / 42

Big Brain Data Analytics: Selected Parallel Results (Classification)
  Approach: parallel SVM implementations ([4] Modelling)
  Data: very small sampled dataset, but equally balanced classes
  No limit: theoretically scalable, but usefulness depends on the datasets
  Incremented the number of datasets, towards a sample size of 0.1%
  Twister (iterative Map-Reduce) example ([5] Evaluation)
    Challenge: data distribution across the parallel infrastructure on FutureGrid
    Training time: ~7 minutes on JUDGE; testing time: ~7 minutes; accuracy: ~96%
  The parallel version with Twister (iterative Map-Reduce) is working with a growing dataset
  Work in progress: work on other parallel implementations (MLlib)
34 / 42

Selected Lessons Learned 35 / 42

Big Brain Data Analytics: Parallel SVM Technologies
  Parallel Support Vector Machine implementations
    Tool | Platform | Approach
    Apache Mahout | Java; Apache Hadoop 1.0 (map-reduce); HTC | No strategy for implementation (website), serial SVM in code
    Apache Spark/MLlib | Apache Spark; HTC | Only linear SVM; no multi-class implementation
    Twister/ParallelSVM | Java; Apache Hadoop 1.0 (map-reduce); Twister (iterations); HTC | Many dependencies on other software: Hadoop, messaging, etc.
    Scikit-Learn | Python; HPC/HTC | Multi-class implementations of SVM, but not fully parallelized
    pisvm | C code; Message Passing Interface (MPI); HPC | Simple multi-class parallel SVM implementation, outdated (~2011)
    GPU accelerated LIBSVM | CUDA language | Multi-class parallel SVM, relatively hard to program, no std. (CUDA)
    psvm | C code; Message Passing Interface (MPI); HPC | Unstable beta, SVM implementation outdated (~2011)
  Journal paper in preparation
  Diagram labels (Clustering++, Classification++, Regression++): Algorithm A; Algorithm Extension A; Implementation; Parallelization of Algorithm Extension A; implementations available; implementations rare and/or not stable; closed/old source, also after asking paper authors
36 / 42

Big Brain Data Analytics: Classifier Challenges (Classification)
  Sampling, training and testing
  Checking out-of-sample performance ([5] Evaluation)
    E.g. using two different images and comparing with masks: issue
    Check out-of-sample performance & better data understanding: ~ok
  Problem: 2D cut classification on 3D brain color data (i.e. background color)
  Solution: e.g. use neighbouring methods
  Color histograms: e.g. 410,125 samples of class 0 & only 567 of class 1 (for G)
37 / 42

Big Brain Data Analytics: Potential Next Approaches
  Approach to the 2D/3D problem: apply Self-Dual Attribute Profiles (SDAP)
    Increase the number of dimensions, using different threshold values
    Takes advantage of neighbouring pixels and cuts trees at certain thresholds
    Good experience in land cover classification: e.g. ~70% to ~90% accuracy
    Attributes: area, standard deviation, moment of inertia; example: standard deviation (blue channel)
    [9] G. Cavallaro, M. Mura, J.A. Benediktsson, L. Bruzzone, A Comparison of Self-Dual Attribute Profiles based on different filter rules for classification, IEEE IGARSS 2014, Quebec, Canada
  Approach: increase the number of training samples (no more serial)
    Very small samples may compromise the generalization capability of the classifier
  Approach: compare with other (parallel) classification methods
    E.g. Naive Bayes classifier, decision trees, random forests, etc.
38 / 42

Big Brain Data Analytics: Much More Interesting Challenges!
  3D reconstruction of high-resolution images
    Data: brain slices (microscopic measurements)
  Mapping of cell densities and cortical areas (in 3D)
    Data: ~1 PB/brain
  Analyse differences of brains evolving over time (longitudinal studies)
    Data: e.g. 1000 Kohorte project (in-vivo human studies)
    Data: e.g. Vervet monkey, 20 brains (in-vivo and post-mortem monkeys)
  Close collaboration between JSC and INM bears lots of potential for tackling research challenges
39 / 42

Acknowledgements & References 40 / 42

Acknowledgements
  Gabriele Cavallaro, University of Iceland
  Tomas Philipp Runarsson, University of Iceland
  Shaowen Wang, National Center for Supercomputing Applications
  Junjun Yin, National Center for Supercomputing Applications
  Markus Axer, Stefan Köhnen, Tim Hütz, Institute of Neuroscience & Medicine, Juelich
  Selected members of the Research Group on High Productivity Data Processing: Ahmed Shiraz Memon, Mohammad Shahbaz Memon, Markus Goetz, Christian Bodenstein, Philipp Glock, Matthias Richerzhagen
41 / 42

References
  [1] EUDAT European Data Infrastructure, B2SHARE Tool, online: https://b2share.eudat.eu/
  [2] G. Cavallaro & M. Riedel et al., Smart Data Analytics Methods for Remote Sensing Applications, IEEE IGARSS 2014, Quebec, Canada
  [3] P. Chapman et al., CRISP-DM Guide
  [4] Research Data Alliance, online: https://www.rd-alliance.org/
  [5] Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature 457, 2009
  [6] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, The Parable of Google Flu: Traps in Big Data Analysis, Science, vol. 343, 2014
  [7] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3), pp. 273-297, 1995
  [8] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, Alok Choudhary, A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, in Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012
  [9] PANGAEA data collection, http://www.pangaea.de
  Talk available at http://www.morrisriedel.de/talks
42 / 42