Scalable Developments for Big Data Analytics in Remote Sensing



Similar documents
High Productivity Data Processing Analytics Methods with Applications

Selected Parallel and Scalable Methods for Scientific Big Data Analytics

Selected Parallel and Scalable Methods for Scientific Big Data Analytics

Classification Techniques in Remote Sensing Research using Smart Data Analytics

Introduction to Big Data in HPC, Hadoop and HDFS Part One

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING 1

Analysis Tools and Libraries for BigData

Selected Big Data Analytics Methods in Health Research

Knowledge Discovery from patents using KMX Text Analytics

Machine Learning in Spam Filtering

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

A MapReduce based distributed SVM algorithm for binary classification

Supervised Learning (Big Data Analytics)

European Data Infrastructure - EUDAT Data Services & Tools

Big Data Analytics. Tools and Techniques

An Introduction to Data Mining

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Maschinelles Lernen mit MATLAB

Support Vector Machines with Clustering for Training with Very Large Datasets

NEURAL NETWORKS A Comprehensive Foundation

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Employer Health Insurance Premium Prediction Elliott Lui

Support Vector Machines

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Machine learning for algo trading

A Content based Spam Filtering Using Optical Back Propagation Technique

Support Vector Machine (SVM)

Data Mining in Education: Data Classification and Decision Tree Approach

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Steven C.H. Hoi School of Information Systems Singapore Management University

Evaluation of Machine Learning Techniques for Green Energy Prediction

Data Mining Analysis (breast-cancer data)

Neural Networks and Support Vector Machines

Predict Influencers in the Social Network

International Journal of Software and Web Sciences (IJSWS)

Analecta Vol. 8, No. 2 ISSN

Sanjeev Kumar. contribute

Operations Research and Knowledge Modeling in Data Mining

Introduction to Data Mining

On Establishing Big Data Breakwaters

How To Use Neural Networks In Data Mining

Application of Event Based Decision Tree and Ensemble of Data Driven Methods for Maintenance Action Recommendation

Car Insurance. Prvák, Tomi, Havri

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Machine Learning at DIKU

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Data Mining. Nonlinear Classification

Data Mining Practical Machine Learning Tools and Techniques

The Artificial Prediction Market

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Spark: Cluster Computing with Working Sets

Map-Reduce for Machine Learning on Multicore

Predicting Flight Delays

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Active Learning SVM for Blogs recommendation

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Learning to Process Natural Language in Big Data Environment

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

An Embedded Hardware-Efficient Architecture for Real-Time Cascade Support Vector Machine Classification

Simple and efficient online algorithms for real world applications

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

Cross-Validation. Synonyms Rotation estimation

Chapter 6. The stacking ensemble approach

SURVIVABILITY OF COMPLEX SYSTEM SUPPORT VECTOR MACHINE BASED APPROACH

Classification algorithm in Data mining: An Overview

Introduction to Support Vector Machines. Colin Campbell, Bristol University

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Statistical Machine Learning

An Overview of Knowledge Discovery Database and Data mining Techniques

Using artificial intelligence for data reduction in mechanical engineering

Analytics on Big Data

A Map Reduce based Support Vector Machine for Big Data Classification

Hong Kong Stock Index Forecasting

Processing of Big Data. Nelson L. S. da Fonseca IEEE ComSoc Summer Scool Trento, July 9 th, 2015

Support Vector Machines Explained

Comparison of K-means and Backpropagation Data Mining Algorithms

Data Mining - Evaluation of Classifiers

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

E-commerce Transaction Anomaly Classification

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

Bayesian networks - Time-series models - Apache Spark & Scala

Big Data: Image & Video Analytics

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Optimizing content delivery through machine learning. James Schneider Anton DeFrancesco

SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Spark and the Big Data Library

Applications of Deep Learning to the GEOINT mission. June 2015

Transcription:

Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader, Juelich Supercomputing Centre Adjunct Associated Professor, University of Iceland Gabriele Cavallaro, Jon Atli Benediktsson University of Iceland Philipp Glock, Christian Bodenstein, Markus Goetz Juelich Supercomputing Centre

Outline 2/ 25

Outline Big Data Analytics Driven by Scientific & Engineering Demands Supervised Learning Classification Method Scalable & Parallel Developments Short Survey of Related Work Enable Direct Parallelization & Cross-validation with pisvm Scale Parallelization with the Cascade SVM Approach & GPGPUs Remote Sensing Applications & Data in Context Recent Research Directions Parallel Clustering & Brain Data Analytics Conclusions References 3/ 25

Big Data Analytics 4/ 25

Big Data Analytics Big Data Analytics is an interesting mix of distinct approaches Analytics: Whole methodology; Analysis: data investigation process itself Big : scalable processing methods and underlying parallel infrastructure Concrete big data : large medical data Concrete big data : large earth science data Clustering++ Regression++ Classification++ 5/ 25

Learning From Data Classification Technique Classification? Clustering Regression Groups of data exist New data classified to existing groups No groups of data exist Create groups from data close to each other Identify a line with a certain slope describing the data 6/ 25

Selected Classification Method Perceptron Learning Algorithm simple linear classification Enables binary classification with a line between classes of seperable data Support Vector Machines (SVMs) non-linear ( kernel ) classification Enables non-linear classification with maximum margin (best out-of-the-box ) Reasoning: achieves often better results than other methods in remote sensing application Decision Trees & Ensemble Methods tree-based classification Grows trees for class decisions, ensemble methods average n trees Artificial Neural Networks (ANNs) brain-inspired classification Combine multiple linear perceptrons to a strong network for non-linear tasks Naive Bayes Classifier probabilistic classification Use of the Bayes theorem with strong/naive independence between features 7/ 25

Supervised Learning SVM Classification Method [1] C. Cortes and V. Vapnik et al. SVM Algorithm Approach Introduced 1995 by C.Cortes & V. Vapnik et al. Creates a maximal margin classifier to get future points ( more often ) right Uses quadratic programming & Lagrangian method with N x N (use of soft-margin approach for better generalization ) (linear example) ( maximal margin clasifier example) (maximizing hyperplane turned into optimization problem, minimization, dual problem) (max. hyperplane dual problem, using quadratic programming method) (kernel trick, quadratic coefficients Computational Complexity & Big Data Impact) 8/ 25

Scalable and Parallel Developments 9/ 25

Short Survey of Related Work [2] M. Goetz, M. Riedel et al., On Parallel and Scalable Classification and Clustering Techniques for Earth Science Datasets, 6 th Workshop on Data Mining in Earth System Science, International Conference of Computational Science (ICCS), Reykjavik 10 / 25

Parallel & Scalable pisvm MPI Tool Tunings Original parallel pisvm tool 1.2 Open-source and based on libsvm library, C, 2011 Message Passing Interface (MPI) [3] pisvm Website, 2011/2014 code New version appeared 2014-10 v. 1.3 (no major improvements) Lack of big data support (memory, layout, etc.) Tuned scalable parallel pisvm tool 1.2.1 Open-source (repository to be created) Based on pisvm tool 1.2 Optimizations: load balancing; MPI collectives Contact: m.richerzhagen@fz-juelich.de 11 / 25

Parallel & Scalable pisvm MPI Tool Dataset Another more challenging dataset: high number of classes Parallelization challenges: unbalanced class representations, mixed pixels remote sensing cube & ground reference [4] G. Cavallaro, M. Riedel et al., IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, to be published challenges in automation [5] Indian pines dataset, processed and raw 12 / 25

Parallel & Scalable pisvm MPI Tool Parallel Results High number of classes, different scenarios important to check Parallelization benefits: major speed-ups, ~interactive (<1 min) possible manual & serial activities (in minutes) big data is not always better data Manual WORK Can we automate feature extraction mechanism to some degree? TALK AT IGARSS 2015 Wednesday, July 29, 10:30-12:10 (Student Paper Competition) [6] Analytics Results (raw) [7] Analytics Results (processed) GABRIELE CAVALLARO et al. AUTOMATIC MORPHOLOGICAL ATTRIBUTE PROFILES 13 / 25

Parallel & Scalable pisvm MPI Tool Most Impact 2x benefits of parallelization (shown in n-fold cross validation) (1) Compute parallel; (2) Do cross-validation runs in parallel Evaluation between Matlab (aka serial) and parallel pisvm 10x cross-validation (RBF kernel parameter and C, gridsearch) raw dataset (serial) processed dataset (serial) raw dataset (parallel, 80 cores) processed dataset (parallel, 80 cores) [8] Analytics 10 fold cross-validation Results (raw) [9] Analytics 10 fold cross-validation Results (processed) 14 / 25

Parallel & Scalable Cascade SVM MPI Tool Approach Cascade Approach: Filtering the support vectors out Theoretically unlimited parallelization (shared nothing or MPI) Various extensions: cross feedback (implemented) Parallel application: using parallel I/O & MPI [10] Graf et al., Parallel Support Vector Machines: The Cascade SVM, 2005 15 / 25

Parallel & Scalable Cascade SVM MPI Tool - Results Results after various iterations Good for real-time: trade-off between accuracy & speed-up Comparisons with pisvm: less accuracy, but faster 16 / 25

Parallel & Scalable Cascade GPGPU Tool Results Promising area of parallelization: GPGPUs Not many implementations Still proprietary (e.g. Nvidea/CUDA) Open source implementation Selected evaluations [11] V. Meyaris I. Kompatsiaris A. Athanasopoulos, A. Dimou, Gpu acceleration for support vector machines, 12th International Workshop on Image Analzsis for Multimedia Interactive Services, 2011. Still faster then serial (libsvm) version Performs not as good as the parallel pisvm implementation Disadvantages explained: e.g. overhead in copying data cpu/gpu Still promising area for the future: more expected to come! 17 / 25

Recent Research Directions 18 / 25

Parallel & Scalable DBSCAN MPI Tool Brain Analytics Neuroscience Application Cell nuclei detection and tissue clustering Scientific Case: Detect various layers (colored) Layers seem to have different density distribution of cells Extract cell nuclei into 2D/3D point cloud Cluster different brain areas by cell density # Use of HPBSCAN algorithm Parallel & Scalable DBSCAN algorithm Developed in Juelich, open source First 2d results detect various clusters Approach: Several iterations (with 3D) with potentially different parameter values Research activities jointly with T. Dickscheid et al. (Juelich Institute of Neuroscience & Medicine) 19 / 25

Conclusions 20 / 25

Conclusions Scientific Peer Review is essential to progress in the field Work in the field needs to be guided & steered by communities NIC Scientific Big Data Analytics (SBDA) first step (learn from HPC) Towards enabling reproducability by uploading runs and datasets Selected SBDA benefit from parallelization Statistical data mining techniques able to reduce big data (e.g. PCA, etc.) Benefits in n-fold cross-validation & raw data, less on preprocessed data Two codes available to use and maintained @JSC: HPDBSCAN, pisvm Number of Data Analytics et al. Technologies incredible high Thorough analysis and evaluation hard (needs different infrastructures) (Less) open source & working versions available, often paper studies Still evaluating approaches: HPC, map-reduce, Spark, SciDB, MaTex, 21 / 25

References 22 / 25

References [1] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20(3), pp. 273 297, 1995 [2] M. Goetz, M. Riedel et al., On Parallel and Scalable Classification and Clustering Techniques for Earth Science Datasets, 6 th Workshop on Data Mining in Earth System Science, International Conference of Computational Science (ICCS), Reykjavik [3] Original pisvm tool, online: http://pisvm.sourceforge.net/ [4] G. Cavallaro, M. Riedel et al., IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, to be published [5] B2SHARE data collection, remote sensing indian pines images, Online: http://hdl.handle.net/11304/7e8eec8e-ad61-11e4-ac7e-860aa0063d1f [6] B2SHARE data collection, pisvm remote sensing indian pines analytics results (raw), Online: http://hdl.handle.net/11304/c06a8c7e-fe6c-11e4-8a18-f31aa6f4d448 [7] B2SHARE data collection, pisvm remote sensing indian pines analytics results (processed), Online: http://hdl.handle.net/11304/c528998e-ff7c-11e4-8a18-f31aa6f4d448 [8] B2SHARE data collection, Analytics 10 fold cross-validation (raw), Online: http://hdl.handle.net/11304/163ba8e8-fe60-11e4-8a18-f31aa6f4d448 [9] B2SHARE data collection, Analytics 10 fold cross-validation (processed), Online: http://hdl.handle.net/11304/5bba8e36-fe63-11e4-8a18-f31aa6f4d448 [10] Graf, Hans P., et al. "Parallel support vector machines: The cascade svm." Advances in neural information processing systems. 2004. [11] V. Meyaris I. Kompatsiaris A. Athanasopoulos, A. Dimou, Gpu acceleration for support vector machines, 12th International Workshop on Image Analzsis for Multimedia Interactive Services, 2011. 23 / 25

Acknowledgements Team behind the Paper PhD Student Gabriele Cavallaro, University of Iceland Selected Members of the Research Group on High Productivity Data Processing Ahmed Shiraz Memon Mohammad Shahbaz Memon Markus Goetz Christian Bodenstein Philipp Glock Matthias Richerzhagen 24 / 25

Thanks Slides available at http://www.morrisriedel.de/talks 25 / 25