Scalable Machine Learning to Exploit Big Data for Knowledge Discovery

Similar documents
Supervised Learning (Big Data Analytics)

Data Mining Algorithms Part 1. Dejan Sarka

Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends

CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION

Chapter 6. The stacking ensemble approach

Machine Learning with MATLAB David Willingham Application Engineer

Statistics for BIG data

Advanced In-Database Analytics

not possible or was possible at a high cost for collecting the data.

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Data Mining Part 5. Prediction

Distributed forests for MapReduce-based machine learning

An Overview of Knowledge Discovery Database and Data mining Techniques

Knowledge Discovery and Data Mining

Cloud Computing, Virtualization & Green IT

The Scientific Data Mining Process

Data Mining. Nonlinear Classification

Azure Machine Learning, SQL Data Mining and R

Microsoft Azure Machine learning Algorithms

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

Data Mining - Evaluation of Classifiers

Social Media Mining. Data Mining Essentials

High Productivity Data Processing Analytics Methods with Applications

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Machine Learning Capacity and Performance Analysis and R

D A T A M I N I N G C L A S S I F I C A T I O N

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

DATA MINING TECHNIQUES AND APPLICATIONS

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Sanjeev Kumar. contribute

Software Development Training Camp 1 (0-3) Prerequisite : Program development skill enhancement camp, at least 48 person-hours.

Sibyl: a system for large scale machine learning

Machine learning for algo trading

How To Solve The Kd Cup 2010 Challenge

Predicting outcome of soccer matches using machine learning

Maschinelles Lernen mit MATLAB

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

The Data Mining Process

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Feature Factory: A Crowd Sourced Approach to Variable Discovery From Linked Data

Augmented Search for IT Data Analytics. New frontier in big log data analysis and application intelligence

Scalable Developments for Big Data Analytics in Remote Sensing

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Ensuring Collective Availability in Volatile Resource Pools via Forecasting

A Content based Spam Filtering Using Optical Back Propagation Technique

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

Learning is a very general term denoting the way in which agents:

MHI3000 Big Data Analytics for Health Care Final Project Report

Data, Measurements, Features

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Technological advances in storage paired with the more

Random forest algorithm in big data environment

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

Data Analytics at NICTA. Stephen Hardy National ICT Australia (NICTA)

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

KEITH LEHNERT AND ERIC FRIEDRICH

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

PhysioMiner: A Scalable Cloud Based Framework for Physiological Waveform Mining

The Big Data Paradigm Shift. Insight Through Automation

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Challenges for Data Driven Systems

Towards Elastic Application Model for Augmenting Computing Capabilities of Mobile Platforms. Mobilware 2010

Radoop: Analyzing Big Data with RapidMiner and Hadoop

Data Mining for Knowledge Management. Classification

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Classification algorithm in Data mining: An Overview

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Server Load Prediction

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Car Insurance. Prvák, Tomi, Havri

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

VMware DRS: Why You Still Need Assured Application Delivery and Application Delivery Networking

Active Learning SVM for Blogs recommendation

Tax Fraud in Increasing

Data Mining Solutions for the Business Environment

Pentaho Data Mining Last Modified on January 22, 2007

The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS?

Inciting Cloud Virtual Machine Reallocation With Supervised Machine Learning and Time Series Forecasts. Eli M. Dow IBM Research, Yorktown NY

STeP-IN SUMMIT June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Predicting Flight Delays

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

The Next Phase of Datacenter Network Resource Management and Automation March 2011

Data Isn't Everything

Analysis Tools and Libraries for BigData

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

Analysis of Denial of Service Attack Using Proposed Model

Machine Learning using MapReduce

Towards applying Data Mining Techniques for Talent Mangement

Payment minimization and Error-tolerant Resource Allocation for Cloud System Using equally spread current execution load

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

RevoScaleR Speed and Scalability

Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence

Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning

Transcription:

Scalable Machine Learning to Exploit Big Data for Knowledge Discovery Una-May O Reilly MIT MIT ILP-EPOCH Taiwan Symposium Big Data: Technologies and Applications

Lots of Data Everywhere Knowledge Mining Opportunities

The GigaBeats Project In an ICU environment, physiologic data is collected at high frequency but is ignored because of the need for immediate focus of abencon. manual MIMIC Waveform database Feature Value Total size 4 Terra Bytes Waveform types 22 ICU Signal sampling frequency Number of samples 125 samples/sec 500m 3

GigaBeats Project Scalable Machine Learning

Machine Learning Primer Testing set Exemplars new observation Training set prediction class label pattern

Agenda Distributed Computation + Scalable Machine Learning SCALE FlexGP EC-Star

SCALE

SCALE System Layer Networked Task Management (client side) Algorithm unification Constrained Parameter Sweep Networked Task Management (server side)

DCAP Protocol Request for Task Task Results!!! Client!

SCALE Demonstration: WD-1 Forecasting ABP 10 levels 95K exemplars 67K/28K training/test split 7 dimensions Stats, trends on MAP 1225 learner tasks (10 fold cross validation) 80 nodes, ~2 days, ~4000 node hours

SCALE: WD-1 Running time Elapsed Time Plot SVM Time (s) Neural Network Decision Tree Discrete Bayes Naive Bayes

SCALE: WD-1 Accuracy results F1 Score over all Classifiers, All Classes F1 Score Class

SCALE: WD-1 Comparison SVM v DT SVM Decision Tree F1 Score F1 Score class class

Scaling up: from SCALE to FlexGP SCALE Modest 10 s of features Assumes all training data fits into RAM FlexGP 100 s of features Big Data Big Data requires multidimensional factoring, filtering then fusion

FlexGP ML Layer FACTOR FACTORING Π: Probability of feature, objective function, operator D: factoring of the data

FlexGP Learner Goal: Model y = f(x 1,x 2, x n ) Genetic Programming Symbolic Regression

FlexGP Filter and Fusion Filter to select diverse models X y Fusion to derive an ensemble prediction Adaptive Regression Mixing FlexGP Overview

Factoring is Better - Each learning from 100% of data - 412,00 exemplars - 21 hours each Unfactored GP Factored Models Each learning from 10% of data - 2 hours each FlexGP Fused factored models

FlexGP: Data Factoring Size Study Sweet Spot 36000 exemplars 0.1% 1% 10% 25% 50% 100%

Resource and System Management Layer Completely decentralized launch Start up Cascading launch is decentralized and allows elastic retraction and expansion Completely decentralized network support IP discovery and launch of gossip protocol added to start up cascade Launch Ip discovery and ongoing gossip protocol enables communication network among nodes at algorithm level Support for Monitoring/reporting/harvesting current best Remarks Essential design for resilience to node failure Launch or running node loss will not halt computation, lost launch branch or node can be integrated seamlessly Node loss in a communication network won t break the network Statistical oversight of learning algorithm s execution parms and data distribution for parms and data added to start up cascade, negotiated between parent and child at launch of child FlexGP

System Layer: SCALE vs FlexGP SCALE: LAMPs pre-defines the tasks Every learner has to know IP of task handler Task handler is a bottleneck and central point of failure FlexGP Autonomous task specification Learners gossip to learn each others IP No central task handler or point of failure

FlexGP System Layer Introduction

FlexGP Launch

ECStar Goal: compute very cost effectively on *VAST* number of nodes with a lot of training data Runs on thousand to 10 Ks 100K s million nodes Vast requires cost effective -> volunteer Domain: learn from time series Finance, medical signals domain Solution is strategy or classifier expressed as rule sets

EC-Star Paradigm Digital Directed Evolution of Models current models new objectives updated variables data selection Use Case #1

EC-Star - dedicated - replicable - dedicated - replicable - randomly sampled BigData factoring strategy Use fewer training samples and cull poor models Increase training samples on better models Local and distributed randomization Oversampling on superior models EC-Star Divide and Conquer

System Layer Comparison Scale FlexGP EC-Star ML domain Classification Regression Classification Rule Learning Resource Scale 10 s to 100 100 s to 1000 10^3 to 10^6 Resource Type Cloud Cloud Volunteer and Dedicated Fusion External External Integrated Local Algorithm Different Same Same Server:Client ratio 1: many Decentralized Few: many

Scalable Machine Learning Comparison SCALE FlexGP EC-Star Factor Algorithm Algorithm and Data Filter Fuse Correlation Accuracy Nonparametric output space approaches Data: Under to oversampling Layered competition Migration and ancestral properties

Automation "In the end, the biggest bottleneck is not data or CPU cycles, but human cycles." Looking Forward

ML requires a lot of Human Effort Domain Knowledge Analysis and Transfer Problem Definition Data Preconditioning Feature Identification Feature Extraction Algorithm Selection Algorithm Customization Parameter Selection Training and Test Data Selection Results Evaluation Solution Deployment Looking Forward

Compressing the ML Endeavor Specify ML system parms Run ML system From desktop Multi-core To GPU To Cloud Seamlessly! Rapidly! Flexibly! Scalably!

Wrap Up DCAP is open source: https://github.com/byterial/dcap FlexGP is documented in publications and thesis Owen Derby, MEng, 2013 Thanks to ALFA group members Large team of students Postdoc: Dr. Erik Hemberg Research Scientist: Dr. Kalyan Veeramachaneni CSAIL Quanta cloud Our collaborators and sponsors