Intelligent Heuristic Construction with Active Learning



Intelligent Heuristic Construction with Active Learning
William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather
The University of Edinburgh

Space is BIG!
The Hubble Ultra-Deep Field shows only a tiny region of space
Despite this, it contains many galaxies, and each galaxy holds billions of stars
What is the relevance to heuristics?

Optimisation spaces are MUCH BIGGER!!!
There are 10^400 combinations of GCC optimisations, versus roughly 10^82 atoms in the Universe
We can't pick the best from 10^400, so we use rough heuristics instead
Traditionally these are hard-coded, and can take a year to perfect
As if that wasn't bad enough...

The problem is even worse than that!
Heuristics are inherently tied to the underlying hardware, so each architectural change requires them to be re-tuned
Most compilers support many different platforms
It is very difficult to keep up, and getting harder: we already have out-of-date compilers

Machine Learning to the rescue?
We can leverage machine learning techniques to create heuristics
Well suited to the problem, with lots of interesting research; the results can be better than humans'
But it is also incredibly slow to learn
We demonstrate how to accelerate training by creating a heuristic which maps a workload to the best processor

Quick Detour: Machine Learning 101
Classification involves forming a correlation between the features of an object and its label
[diagram: labelled examples (feature values, best heuristic value) feed a Machine Learning Algorithm, which produces a Model]

Training a Heuristic
[diagram, built up over three slides: thousands of labelled examples, plotted against two input values, feed a Machine Learning Algorithm, which produces a mathematical model separating GPU from CPU regions]

Using a Heuristic
[diagram: the unseen features of a new workload are fed to the mathematical model, which predicts the processor to use, CPU or GPU]
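The train-then-predict flow of the last few slides can be sketched in a few lines. A hand-rolled 1-nearest-neighbour model stands in for the learned mathematical model here, and the feature values and labels are illustrative, not taken from the paper.

```python
# Sketch of training and using a device-mapping heuristic.
# A hand-rolled 1-nearest-neighbour model stands in for the learned
# "mathematical model"; the feature values below are illustrative.

# Labelled training examples: (input value 1, input value 2) -> best device
train = [((10, 5), "CPU"), ((12, 8), "CPU"),
         ((90, 80), "GPU"), ((95, 70), "GPU")]

def predict(features):
    """Return the label of the nearest training example (1-NN)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda ex: dist(ex[0], features))[1]

# Unseen features: the nearest labelled example is (90, 80), so "GPU"
print(predict((88, 75)))
```

Any classifier could replace the 1-NN rule; the point is only that prediction is cheap once the (expensively labelled) training set exists.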

So what's wrong with this?
[scatter plot of training examples over feature 1 and feature 2]
This traditional approach is almost universally adopted

Well, we actually only needed these!
[scatter plot highlighting the subset of examples that was actually needed]

So this was a complete waste of time!
[scatter plot highlighting the remaining, redundant examples]
Random sampling inevitably leads to redundancy

How much time was wasted?
The correctness of the labels is tied to heuristic quality, i.e. consistently wrong labels lead to a wrong model
Sound data is essential, but very expensive
E.g. are inputs X, Y, Z faster on the CPU or the GPU?
1. Run the program on the CPU using X, Y, Z
2. Run the program on the GPU using X, Y, Z
3. GOTO 1 until a statistical difference is observed
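The labelling loop above can be sketched as follows. `run_on_cpu` and `run_on_gpu` are hypothetical stand-ins that simulate noisy runtimes rather than launching a real program, and the separation check is a deliberately crude placeholder for a proper statistical test.

```python
# Sketch of the expensive labelling loop: time the workload on each device
# repeatedly until the two timing samples are clearly distinct, then label
# with the faster device. run_on_cpu/run_on_gpu are hypothetical stand-ins
# that simulate noisy runtimes; real labelling would run the program itself.
import random
import statistics

random.seed(0)

def run_on_cpu(inputs): return random.gauss(1.0, 0.05)  # simulated seconds
def run_on_gpu(inputs): return random.gauss(0.7, 0.05)

def label(inputs, min_runs=5, max_runs=50):
    cpu, gpu = [], []
    for _ in range(max_runs):
        cpu.append(run_on_cpu(inputs))
        gpu.append(run_on_gpu(inputs))
        if len(cpu) >= min_runs:
            # crude separation check, a placeholder for a real statistical test
            spread = statistics.stdev(cpu) + statistics.stdev(gpu)
            if abs(statistics.mean(cpu) - statistics.mean(gpu)) > 2 * spread:
                break
    return "CPU" if statistics.mean(cpu) < statistics.mean(gpu) else "GPU"

print(label((10, 20, 30)))   # the simulated GPU is faster here
```

Even in this toy form, one label costs many program runs, which is why every redundant training example hurts.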

Compile-time Heuristics are Even Slower
Labelling one single example requires iterative compilation:
compile the code using different optimisation values
profile repeatedly to make a statistically sound determination
only then, associate the best optimisation with the code features
[diagram: one .c file compiled into several .exe binaries; the best optimisation wins]

What do we do about it?
We cannot know where the informative examples lie, but we can let the algorithm make an educated guess
You and I do not learn in a random, unstructured way; we build up our knowledge gradually and iteratively
Perhaps we should let the algorithm do the same?

Active Supervised Learning
[diagram, passive (random): thousands of random examples feed a Machine Learning Algorithm, which produces the final model]
[diagram, active (iterative): a few random examples feed an ML Algorithm, which produces an intermediate model; if completion is not reached, carefully select an example and repeat; otherwise the intermediate model becomes the final model]

How do we know when it's complete?
Many criteria, including: time elapsed, loop iterations, cross-validation
[diagram: the active-learning loop, with the completion check highlighted]
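The active-learning loop with a loop-iteration stopping criterion can be sketched like this. `train`, `select_example`, and `get_label` are hypothetical placeholders for the real ML algorithm, the selection strategy, and the expensive benchmarking step.

```python
# Skeleton of the active-learning loop from the flowchart: start from a few
# random examples, train an intermediate model, and keep labelling carefully
# chosen examples until a completion criterion is met (here, a fixed number
# of loop iterations). train/select_example/get_label are hypothetical
# placeholders for the real ML algorithm, selection strategy, and benchmarking.
import random

random.seed(0)

def train(examples):              # stand-in for any ML algorithm
    return {"trained_on": len(examples)}

def select_example(model, pool):  # stand-in for e.g. query by committee
    return random.choice(pool)

def get_label(x):                 # stand-in for expensive benchmarking runs
    return "GPU" if sum(x) > 100 else "CPU"

pool = [(random.randint(0, 100), random.randint(0, 100)) for _ in range(1000)]
examples = [(x, get_label(x)) for x in random.sample(pool, 3)]  # few random seeds

for _ in range(200):              # loop-iteration completion criterion
    model = train(examples)
    x = select_example(model, pool)
    examples.append((x, get_label(x)))  # label only the selected example

print(len(examples))              # 3 seed examples + 200 queried labels
```

Swapping the `range(200)` budget for elapsed time or a cross-validation check gives the other stopping criteria mentioned on the slide.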

What about selecting examples?
Many algorithms are available; we used Query by Committee
Easier to show than to tell
[diagram: the active-learning loop, with the example-selection step highlighted]

We start with a few random examples
[scatter plot over feature 1 and feature 2]

We form multiple intermediate models

Each built with a distinct algorithm

A committee of different models

Here the committee disagrees, but we use this to our advantage
Disagreement regions hold the greatest potential to improve the collective knowledge: learn from there!

So what example do we learn from next?
We ask each model to predict the label of random unseen examples drawn from the feature space

Broadly, the committee will agree

but we're interested in disagreement!
Disagreement inevitably occurs around class boundaries

We select one of these examples to label properly

Then we rebuild the intermediate models
Notice the region of disagreement has shrunk
Eventually the distinct models will converge
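The query-by-committee walkthrough above can be condensed into a minimal sketch. The committee members here are hand-written threshold rules, purely illustrative, and disagreement is measured simply as the number of minority votes.

```python
# Minimal query-by-committee sketch: several distinct models each predict a
# label for random candidate points drawn from the feature space; the point
# the committee disagrees on most is the one labelled next. The committee
# members here are hand-written threshold rules, purely illustrative.
import random

random.seed(1)

committee = [
    lambda p: "GPU" if p[0] > 50 else "CPU",          # thresholds feature 1
    lambda p: "GPU" if p[1] > 50 else "CPU",          # thresholds feature 2
    lambda p: "GPU" if p[0] + p[1] > 100 else "CPU",  # thresholds their sum
]

def disagreement(point):
    """Minority-vote count: 0 = full agreement, higher = more conflict."""
    votes = [member(point) for member in committee]
    return min(votes.count("CPU"), votes.count("GPU"))

# Random unseen candidates drawn from the feature space
candidates = [(random.randint(0, 100), random.randint(0, 100))
              for _ in range(500)]

# The next example to label properly is the most contested candidate,
# which inevitably lies near a class boundary
query = max(candidates, key=disagreement)
print(disagreement(query))
```

In a real system the committee members would be retrained after each new label, shrinking the disagreement region until the models converge.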

Experimental Setup
We demonstrate the technique by creating an important heuristic: mapping a workload to the fastest device, CPU or GPU
This is a much-studied problem, and choosing poorly can drastically degrade performance
Specifically, given inputs for Rodinia HotSpot, PathFinder, SRAD, and Matrix Multiplication, is it faster to use OpenMP (CPU) or OpenCL (GPU)?
We compared the number of training examples required to get a high-accuracy heuristic using passive versus active learning

A few gory details (most are in the paper)
Measured accuracy of randomly-trained vs. QBC-trained classifiers using 500 test examples
Intel Core i7 7770 @ 3.4GHz (8 HW threads); NVIDIA GeForce GTX Titan (6GB)
12 distinct committee members; 1 random example to begin; 10,000 candidate examples; 200 loop iterations

Random Training Examples
[scatter plot: sample points labelled CPU or GPU, plotted over two program input parameters]

QBC-Chosen Training Examples
[scatter plot: sample points labelled CPU or GPU, over the same two program input parameters]
Same accuracy, but quicker

Lights, Camera, Action...
[video: region of disagreement over time; shape of the model over time]
Shows the IB1 algorithm refining a HotSpot model over time, using training examples chosen by a committee

It works: 3x faster on average!

Summary
We desperately need a fast, reliable method to generate heuristics
Current implementations rely on learning from random examples
Randomness is problematic because of labelling costs
We show active learning is much more efficient: 3x faster at creating heuristics that map program inputs to the best processor in a heterogeneous system