High Productivity Data Processing Analytics Methods with Applications



Similar documents
Classification Techniques in Remote Sensing Research using Smart Data Analytics

Scalable Developments for Big Data Analytics in Remote Sensing

Understanding Big Data Analytics Applications in Earth Science Morris Riedel, Rahul Ramachandran/Kuo Kwo-Sen, Peter Baumann Big Data Analytics

European Data Infrastructure - EUDAT Data Services & Tools

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING 1

Azure Machine Learning, SQL Data Mining and R

Selected Parallel and Scalable Methods for Scientific Big Data Analytics

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Selected Parallel and Scalable Methods for Scientific Big Data Analytics

On Establishing Big Data Breakwaters

Introduction to Big Data in HPC, Hadoop and HDFS Part One

Big Data Analytics. Tools and Techniques

Selected Big Data Analytics Methods in Health Research

Maschinelles Lernen mit MATLAB

Machine Learning Capacity and Performance Analysis and R

The Matsu Wheel: A Cloud-based Scanning Framework for Analyzing Large Volumes of Hyperspectral Data

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Map-Reduce for Machine Learning on Multicore

The Data Mining Process

Analysis Tools and Libraries for BigData

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

HPC ABDS: The Case for an Integrating Apache Big Data Stack

The Need for Training in Big Data: Experiences and Case Studies

A MapReduce based distributed SVM algorithm for binary classification

Big Data: Rethinking Text Visualization

DATA MINING TECHNIQUES AND APPLICATIONS

Data Mining Part 5. Prediction

Predicting Flight Delays

Analytics on Big Data

Machine Learning in Spam Filtering

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Big Data and Analytics: Challenges and Opportunities

BUSINESS ANALYTICS. Overview. Lecture 0. Information Systems and Machine Learning Lab. University of Hildesheim. Germany

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Social Media Mining. Data Mining Essentials

Architectures for Big Data Analytics A database perspective

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Predict Influencers in the Social Network

Decision Trees from large Databases: SLIQ

Prerequisites. Course Outline

EDA and ML A Perfect Pair for Large-Scale Data Analysis

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

An Introduction to Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Chapter 6. The stacking ensemble approach

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Introduction to Big Data in HPC, Hadoop and HDFS Part Two

The? Data: Introduction and Future

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Database Marketing, Business Intelligence and Knowledge Discovery

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Knowledge Discovery from patents using KMX Text Analytics

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Cafcam: Crisp And Fuzzy Classification Accuracy Measurement Software

A Map Reduce based Support Vector Machine for Big Data Classification

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Role of Social Networking in Marketing using Data Mining

Cell Phone based Activity Detection using Markov Logic Network

COMP9321 Web Application Engineering

Big Data Mining Services and Knowledge Discovery Applications on Clouds

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

Distributed forests for MapReduce-based machine learning

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Spark: Cluster Computing with Working Sets

Keywords data mining, prediction techniques, decision making.

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

not possible or was possible at a high cost for collecting the data.

How To Cluster

Advanced In-Database Analytics

Visualization methods for patent data

Transforming the Telecoms Business using Big Data and Analytics

How To Identify A Churner

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining

Pixel-based and object-oriented change detection analysis using high-resolution imagery

A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode

BIG DATA What it is and how to use?

Employer Health Insurance Premium Prediction Elliott Lui

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Business Process Services. White Paper. Price Elasticity using Distributed Computing for Big Data

Microsoft Azure Machine learning Algorithms

Machine learning for algo trading

Introduction to Data Mining

CUSTOMER Presentation of SAP Predictive Analytics

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

How To Make Visual Analytics With Big Data Visual

MLlib: Scalable Machine Learning on Spark

L3: Statistical Modeling with Hadoop

Machine Learning at DIKU

Applying Scalable Machine Learning Techniques on Big Data using a Computational Cluster

Journée Thématique Big Data 13/03/2015

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Transcription:

High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research Group Leader, Juelich Supercomputing Centre Forschungszentrum Juelich, Germany

Outline Page 2

Outline Introduction Big Data Analytics Smart Data Analytics Methods Systematic Analytics with CRISP-DM Support Vector Machines Analytics Example Analytics Application Classification of Buildings in Images Conclusions & References Page 3

Introduction Page 4

Big Data Analytics (Automatically) examine large quantities of scientific ( big ) data Uncover hidden patterns Reveal unknown correlations Extract information in cases where there is no exact formula Intersection of traditional methods from a wide variety of fields Use of parallelization techniques (MPI, Map-reduce, GPGPUs) offers scalability to big data sets Page 5

Smart Data Analytics Methods Page 6

Systematic Analytics with CRISP - DM Performed survey of reference models that enable data analytics in structured way Cross Industry Standard Process for Data Mining Used in Research Data Alliance BigData Analytics Interest Group [7] P. Chapman et al., CRISP-DM Guide Big Data Analytics [10] RDA Big Data Analytics IG Page 7

Support Vector Machines Analytics Classification Support Vector Machines? Clustering Regression Support Vector Machines Quadratic Programming (big data challenge) (e.g. all N datasets vs. sampling) (quadratic coefficients N x N dense matrix) Page 8

Example Analytics Application Page 9

Classification of Buildings in Images (Big) Data Problem: Multi-Class Classification of buildings from hyper- / multi-spectral images panchromatic image (972 1188 pixels); high resolution; 0.6m multispectral image with the four bands; low-resolution; 2.4m Classification of buildings from multi-spectral data N profiles further improve classification feature vector (area, std deviation, moment of inertia, ) 1st Principle Component Analysis (PCA) Classify building classes using image data & attribute filters to increase the accuracy Multi-spectral images can become very large Labelled data with groundtruth data exists Use parallel Support Vector Machines (SVMs) since it is known as good classification method today Page 10

Classification of Buildings in Images Toolset (1) Performed large survey of parallel SVM implementations (map-reduce) Classification++ Spark/MLlib (Map-Reduce) only binary classification, linear SVM Mahout (Map-Reduce) no strategy for implementation ParallelSVM on Twister (Iterative Map-Reduce) received beta code per email [3] Sun Z. & Fox. G et al. [1] Spark Website [2] Mahout Website Parallel implementations based on Map-Reduce are emerging but stabilility needs to be improved Page 11

Classification of Buildings in Images Toolset (2) Performed large survey of parallel SVM implementations (MPI & GPGPUs) pisvm Open source code, scalability limits psvm Open source code, beta quality GPULibSVM, Open source code, beta quality Classification++ [4] pisvm Website [5] psvm Website [8] GPU LibSVM Parallel implementations based on MPI + GPGPUs are openly available, but show scalability limits Page 12

Building Classification in Images Some Results Serial Matlab scripts used before Not scalable to big data sets parallelization E.g. pisvm Speed-up, but also shows limits [6] G. Cavallaro & M. Riedel et al., 2014 Reproducable Findings: Data is publicly available [9] Rome Dataset Code is publicly available [4] pisvm Website Take away message from applications: Mostly multi-class SVMs used in science & engineering Page 13

Conclusions Page 14

Conclusions Big Data Analytics Requires smart parallel data analytics methods Enables high productivity (big) data processing Apply (existing) or research parallel methods Methods Reviewed & Applied CRISP-DM guides well the systematic analytics Availability of parallel implementations of analytic algorithms rare, simple, or non existent SVM: Map-Reduce less stable, MPI / GPGPUs ok Page 15

References Page 16

References [1] Spark/Mllib, Online: http://spark.apache.org/docs/0.9.0/mllib-guide.html [2] Apache Mahout, Online: https://mahout.apache.org/users/classification/support-vector-machines.html [3] Sun Z., and Fox G., Study on Parallel SVM Based on MapReduce, In Proceedings of the international conference on parallel and distributed processing techniques and applications, 2012. [4] pisvm, Online: http://pisvm.sourceforge.net/ [5] Online: https://code.google.com/p/psvm/ [6] G. Cavallaro and M. Riedel, Smart Data Analytics Methods for Remote Sensing Applications, 35th Canadian Symposium on Remote Sensing (IGARSS), 2014, Quebec, Canada, to appear [7] Pete Chapman, CRISP-DM User Guide, 1999, Online: http://lyle.smu.edu/~mhd/8331f03/crisp.pdf [8] GPU-LibSVM, Online: http://mklab.iti.gr/project/gpu-libsvm [9] B2SHARE Rome Dataset, Online: http://hdl.handle.net/11304/4615928c-e1a5-11e3-8cd7-14feb57d12b9 [10] Research Data Alliance, Big Data Analytics IG, Online: https://rd-alliance.org/internal-groups/big-data-analytics-ig.html Page 17

Thanks Talk available at: www.morrisriedel.de/talks Contact: m.riedel@fz-juelich.de Acknowledgements Parts of the presentation have been created in close collaboration with scientific domain remote sensing scientists Gabriele Cavallaro, Jon Atli Benediktsson University of Iceland Page 18