Big Data Analytics. Tools and Techniques



Similar documents
High Productivity Data Processing Analytics Methods with Applications

Scalable Developments for Big Data Analytics in Remote Sensing

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Comparison of K-means and Backpropagation Data Mining Algorithms

An Introduction to Data Mining

Statistics for BIG data

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Gerard Mc Nulty Systems Optimisation Ltd BA.,B.A.I.,C.Eng.,F.I.E.I

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

On Establishing Big Data Breakwaters

Using Data Mining for Mobile Communication Clustering and Characterization

Package empiricalfdr.deseq2

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

NEURAL NETWORKS IN DATA MINING

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo Database And Data Mining Research Group

D A T A M I N I N G C L A S S I F I C A T I O N

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Data Science at U of U

European Data Infrastructure - EUDAT Data Services & Tools

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

DATA MINING - SELECTED TOPICS

Predictive Modeling and Big Data

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Azure Machine Learning, SQL Data Mining and R

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

R s and Predictive Modeling Boot Camp Nov. 8-9, Session #1: Predictive Modeling: An Overview Syed Muzayan Mehmud, ASA, FCA, MAAA

How To Use Neural Networks In Data Mining

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Automatic Inventory Control: A Neural Network Approach. Nicholas Hall

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Sanjeev Kumar. contribute

8. Machine Learning Applied Artificial Intelligence

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Information and Decision Sciences (IDS)

Data Mining Algorithms Part 1. Dejan Sarka

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.

An Overview of Knowledge Discovery Database and Data mining Techniques

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

MSCA Introduction to Statistical Concepts

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

An Introduction to Advanced Analytics and Data Mining

TDS - Socio-Environmental Data Science

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

The Predictive Data Mining Revolution in Scorecards:

1. Classification problems

Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney

Audit Analytics. --An innovative course at Rutgers. Qi Liu. Roman Chinchila

Overlapping Data Transfer With Application Execution on Clusters

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Protein Protein Interaction Networks

Model Deployment. Dr. Saed Sayad. University of Toronto

Visualization methods for patent data

Random forest algorithm in big data environment

Keywords data mining, prediction techniques, decision making.

not possible or was possible at a high cost for collecting the data.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Predictive Analytics Certificate Program

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Machine Learning with MATLAB David Willingham Application Engineer

Football Match Winner Prediction

Ensembles and PMML in KNIME

Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

Course Requirements for the Ph.D., M.S. and Certificate Programs

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Principles of Data Mining by Hand&Mannila&Smyth

The University of Jordan

Advanced Operating Systems CS428

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data Mining Solutions for the Business Environment

Data Mining and Neural Networks in Stata

Chapter 6. The stacking ensemble approach

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Big Data: Rethinking Text Visualization

Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept

Online Ensembles for Financial Trading

DATA MINING TECHNIQUES AND APPLICATIONS

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

How To Get A Masters Degree In Logistics And Supply Chain Management

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management

Course Description This course will change the way you think about data and its role in business.

Industrial and Systems Engineering Master of Science Program Data Analytics and Optimization

Cloud Computing Simulation Using CloudSim

AASPI SOFTWARE PARALLELIZATION

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp

Data Mining Analysis of HIV-1 Protease Crystal Structures

Parallel Multi-Swarm Optimization Framework for Search Problems in Water Distribution Systems

Statistics Graduate Courses

Data Centric Systems (DCS)

Knowledge Discovery from Data Bases Proposal for a MAP-I UC

Transcription:

Big Data Analytics Basic concepts of analyzing very large amounts of data Dr. Ing. Morris Riedel Adjunct Associated Professor School of Engineering and Natural Sciences, University of Iceland Research Group Leader, Juelich Supercomputing Centre, Germany LECTURE BDA5 Tools and Techniques 2014-02-08 (v2) Online Material

Outline Lecture BDA5 Tools and Techniques 2/ 14

Outline Tools Overview and Selected Use Cases RMPI R interface to MPI Methods Statistical Techniques Data Mining Machine Learning Evolutionary Optimization Lecture BDA5 Tools and Techniques 3/ 14

RMPI Lecture BDA5 Tools and Techniques 4/ 14

RMPI Overview The tool R is a free software environment for statistical computing and graphics RMPI is an R package and an interface (wrapper) to the Message Passing Interface (MPI) MPI is a standardized and portable parallel programming model implemented via libraries RMPI ports low level MPI functions into R so that users do not have to know C or Fortran MPI libraries supported: LAM-MPI, MPICH(2), and OpenMPI Write R programs using certain RMPI functions: Startup and Shutdow: e.g. mpi.spawn.rslaves([nslaves=#]) Cluster Information: e.g. mpi.comm.rank() Use sending/receiving data/functions to send an R object like a number, string, or a list between different slaves or to all slaves (e.g. mpi.bcast.robj2slave(object)) [2] RMPI Web Page Enables inter process communication Lecture BDA5 Tools and Techniques 5/ 14

RMPI Selected Use Cases Precision Medicine [1] Bottomly et al., 2013 Simulations and calculation of P-values and False Discovery Rate (FDR) Parallel computed on Beowulf-style cluster using RMPI R-2.15.1 Use of parallel random number generation using L Ecuyer s method (R rlecuyer package) Plotting and summaries for the simulations with ggplot2 on R-3.0.1 Lecture BDA5 Tools and Techniques 6/ 14

Methods Lecture BDA5 Tools and Techniques 7/ 14

Statistical Techniques The main task of statistical techniques is to test the likelihood of a given hypothesis Limits Impossible to identify new relationships in data Discovery of new relationships is a very important chance to realize additional unexpected benefits from the large amounts of data Example: data available from High Throughput Experimentation (HTE) [3] Ohrenberg et al., 2005 Lecture BDA5 Tools and Techniques 8/ 14

Data Mining Techniques Data mining stands for algorithms and techniques which transfer data into information Provides methods that extract rules, patterns, or other structural information from data Identify in the data new relationships that were previously undiscovered Typical data-mining techniques Clustering, separation methods, decision trees, association rules, etc. Limits Data mining does not assess the validity of extracted rules Making data-mining results reliable, it is essential to apply different approaches and check the resulting rules for mutual compatibility [3] Ohrenberg et al., 2005 Lecture BDA5 Tools and Techniques 9/ 14

Machine Learning Techniques Machine learning techniques are able to learn from data and train a system (model) Wide variety of machine learning algorithms exist Artificial Neural Networks, Support Vector Machines, etc. Artificial Neural Networks (ANN) A kind of extremely powerful regression model (rough perspective) Input-output relationships can be generated by ANN as black box model Black box model: function cannot be interpreted physically the parameters in the ANN have generally no explicit physical meaning ANN need less data then other black-box methods to generate an input output relationship of comparible quality (given a wide range of conditions) [3] Ohrenberg et al., 2005 Lecture BDA5 Tools and Techniques 10 / 14

Evolutionary Optimization Evolutionary optimization techniques are characterized by a stochastic approach Advantages Compared to statistical experiment design, evolutionary algorithms are very flexible and adaptable Limits In most cases the optimization speed is rather slow [3] Ohrenberg et al., 2005 Lecture BDA5 Tools and Techniques 11 / 14

Lecture Bibliography Lecture BDA5 Tools and Techniques 12 / 14

Lecture Bibliography [1] Bottomly et al.: Comparison of methods to identify aberrant expression patterns in individual patients: augmenting our toolkit for precision medicine. Genome Medicine 2013 5:103., Online: http://genomemedicine.com/content/pdf/gm509.pdf [2] RMPI Web Page, Online: http://www.stats.uwo.ca/faculty/yu/rmpi/ [3] Ohrenberg, Arne, et al. "Application of Data Mining and Evolutionary Optimization in Catalyst Discovery and High-Throughput Experimentation Techniques, Strategies, and Software." QSAR & Combinatorial Science 24.1 (2005): 29-37. Lecture BDA5 Tools and Techniques 13 / 14

Lecture BDA5 Tools and Techniques 14 / 14