Bringing Big Data Modelling into the Hands of Domain Experts

Similar documents
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

2015 The MathWorks, Inc. 1

What s New in MATLAB and Simulink

MATLAB in Business Critical Applications Arvind Hosagrahara Principal Technical Consultant

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Solving Big Data Problems in Computer Vision with MATLAB Loren Shure

Origins, Evolution, and Future Directions of MATLAB Loren Shure

Parallel Computing with MATLAB

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Outils pour l'analyse prédictive parallèle de multiples sources de données non structurées

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

RevoScaleR Speed and Scalability

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Introduction to MATLAB Gergely Somlay Application Engineer

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

CSE-E5430 Scalable Cloud Computing Lecture 2

Big Fast Data Hadoop acceleration with Flash. June 2013

Distributed Computing and Big Data: Hadoop and MapReduce

How To Use Hadoop For Gis

Advanced Big Data Analytics with R and Hadoop

Challenges for Data Driven Systems

Implement Hadoop jobs to extract business value from large and varied data sets

Architectures for Big Data Analytics A database perspective

Oracle Big Data SQL Technical Update

Deploying MATLAB -based Applications David Willingham Senior Application Engineer

Using In-Memory Computing to Simplify Big Data Analytics

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Matlab on a Supercomputer

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

SEAIP 2009 Presentation

Speeding up MATLAB and Simulink Applications

BIG DATA What it is and how to use?

Data Analysis with MATLAB The MathWorks, Inc. 1

BIG DATA TRENDS AND TECHNOLOGIES

Scaling Out With Apache Spark. DTL Meeting Slides based on

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Developing MapReduce Programs

Journée Thématique Big Data 13/03/2015

Big Data Processing with Google s MapReduce. Alexandru Costan

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Chapter 7. Using Hadoop Cluster and MapReduce

Big Data on Microsoft Platform

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Fast Analytics on Big Data with H20

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Introducing Oracle Exalytics In-Memory Machine

Data Mining in the Swamp

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

HADOOP PERFORMANCE TUNING

Rackspace Cloud Databases and Container-based Virtualization

Hadoop Ecosystem B Y R A H I M A.

Hadoop Architecture. Part 1

How To Scale Out Of A Nosql Database

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Large-Scale Data Processing

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Understanding the Value of In-Memory in the IT Landscape

Understanding the Benefits of IBM SPSS Statistics Server

SQL Server 2012 Performance White Paper

Big Data With Hadoop

R at the front end and

On Demand Satellite Image Processing

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Introduction to Cloud Computing

Cloud Computing at Google. Architecture

Big Data Are You Ready? Thomas Kyte

Performance and Scalability Overview

Chapter 18: Database System Architectures. Centralized Systems

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Credit Risk Modeling with MATLAB

Turn Big Data to Small Data

HPC technology and future architecture

Introduction to MATLAB for Data Analysis and Visualization

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

Lecture 10 - Functional programming: Hadoop and MapReduce

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Performance and Scalability Overview

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Assignment # 1 (Cloud Computing Security)

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

GraySort on Apache Spark by Databricks

MapReduce and Hadoop Distributed File System

ANALYTICS IN BIG DATA ERA

Hadoop & its Usage at Facebook

Hadoop & SAS Data Loader for Hadoop

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Machine Learning with MATLAB David Willingham Application Engineer

Architecture & Experience

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Transcription:

Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1

Data is the sword of the 21st century, those who wield it the samurai. Big data how to create it, manipulate it, and put it to good use. If you want to work at Google, make sure you can use MATLAB. Google s Former SVP - Jonathan Rosenberg 2

Who is looking to do Big Data Analytics now? Software Developers Programmers using languages such as Java /.NET They may not be domain experts End Users (non technical) Non programmers They may or may not be domain experts Domain Users Scientists, Engineers, Analysts, etc.. Have some programming skills Might use MATLAB to protype ideas, algorithms models They want their ideas to scale with the size of data easily 3

Key Industries Aerospace and defense Automotive Biotech and pharmaceutical Communications Education Electronics and semiconductors Energy production Financial services Industrial automation and machinery Medical devices 4

Tesco uses supply chain analytics to save 100m a year. 4 years of sales data held in a Teradata data warehouse. MATLAB is used to forecast the optimum stock levels. We can run 100 stores for 100 days in about half an hour. We can figure out quickly whether what we are doing is right and we can optimise that. 5

Preventing lethal ship collisions with whales Cornell University collected terabytes of ocean acoustic data. Crowdsourced algorithms to detect and classify animal signals in the presence of noise. A data set that would have taken months to process can now be processed multiple times in just a few days using different detection algorithms. 6

MathWorks Today Earth s topography on a Miller cylindrical projection, created with MATLAB and Mapping Toolbox. More than 1 million users worldwide. 3500 universities 21 AU universities have MATLAB campuswide license. 7

Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. (Wikipedia) Getting started Rapid data exploration Development of scalable algorithms Ease of deployment 8

Big Data Capabilities in MATLAB Memory and Data Access 64-bit processors Memory Mapped Variables Disk Variables Databases Datastores Programming Constructs Streaming Block Processing Parallel-for loops GPU Arrays SPMD and Distributed Arrays MapReduce Platforms Desktop (Multicore, GPU) Clusters Cloud Computing (MDCS on EC2) Hadoop 9

MATLAB and Memory Best Practices for Memory Usage Use the appropriate data storage Use only the precision your need Sparse Matrices Categorical Arrays Be aware of overhead of cells and structures Minimize Data Copies In place operations, if possible Use nested functions If using objects, consider handle classes 10

Considerations for Choosing an Approach Data characteristics Size, type and location of your data Compute platform Single desktop machine or cluster Analysis Characteristics Embarrassingly Parallel Analyze sub-segments of data and aggregate results Operate on entire dataset 11

Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Non- Partitionable Complexity 12

Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Non- Partitionable Complexity 13

Parallel Computing with MATLAB MATLAB Desktop (Client) Worker Worker Worker Worker Worker Worker 14

Demo: Determining Land Use Using Parallel for-loops (parfor) Data Arial images of agriculture land 24 TIF files Analysis Find and measure irrigation fields Determine which irrigation circles are in use (by color) Calculate area under irrigation 15

When to Use parfor Data Characteristics Can be of any format (i.e. text, images) as long as it can be broken into pieces The data for each iteration must fit in memory Compute Platform Desktop (Parallel Computing Toolbox) Cluster (MATLAB Distributed Computing Server) Analysis Characteristics Each iteration of your loop must be independent 16

Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Non- Partitionable Complexity 17

Access Big Data datastore Easily specify data set Single text file (or collection of text files) Preview data structure and format Select data to import using column names Incrementally read subsets of the data airdata = datastore('*.csv'); airdata.selectedvariables = {'Distance', 'ArrDelay }; data = read(airdata); 18

Demo: Vehicle Registry Analysis Using a DataStore Data Massachusetts Vehicle Registration Data from 2008-2011 16M records, 45 fields Analysis Examine hybrid adoptions Calculate % of hybrids registered by quarter Fit growth to predict further adoption 19

When to Use datastore Data Characteristics Text data in files, databases or stored in the Hadoop Distributed File System (HDFS) Compute Platform Desktop Analysis Characteristics Supports Load, Analyze, Discard workflows Incrementally read chunks of data, process within a while loop 20

Big Data Capabilities in MATLAB Memory and Data Access 64-bit processors Memory Mapped Variables Disk Variables Databases Datastores Programming Constructs Streaming Block Processing Parallel-for loops GPU Arrays SPMD and Distributed Arrays MapReduce Platforms Desktop (Multicore, GPU) Clusters Cloud Computing (MDCS on EC2) Hadoop 21

Reading in Part of a Dataset from Files Text file, ASCII file datastore MAT file Load and save part of a variable using the matfile Binary file Read and write directly to/from file using memmapfile Maps address space to file Databases ODBC and JDBC-compliant (e.g. Oracle, MySQL, Microsoft, SQL Server) 22

Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Complexity Non- Partitionable 23

Analyze Big Data mapreduce Use the powerful MapReduce programming technique to analyze big data mapreduce uses a datastore to process data in small chunks that individually fit into memory Useful for processing multiple keys, or when Intermediate results do not fit in memory ******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 20% Reduce 0% Map 40% Reduce 0% Map 60% Reduce 0% Map 80% Reduce 0% Map 100% Reduce 25% Map 100% Reduce 50% Map 100% Reduce 75% Map 100% Reduce 100% mapreduce on the desktop Increase compute capacity (Parallel Computing Toolbox) Access data on HDFS to develop algorithms for use on Hadoop mapreduce with Hadoop Run on Hadoop using MATLAB Distributed Computing Server Deploy applications and libraries for Hadoop using MATLAB Compiler 24

mapreduce Data Store Map Shuffle Reduce and Sort Veh_typ Q3_08 Q4_08 Q1_09 Hybrid Car 1 1 1 0 SUV 0 1 1 1 Car 1 1 1 1 Car 0 0 1 1 Car 0 1 1 1 Car 1 1 1 1 Car 0 0 1 1 SUV 0 1 1 0 Car 1 1 1 0 SUV 1 1 1 1 Car 0 1 1 1 Car 1 0 0 0 Hybrid 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 1 Key: Q3_08 Key: Q4_08 Key: Q1_09 Key: Q3_08 Key: Q4_08 Key: Q1_09 Hybrid 0 1 1 0 0 0 1 1 1 0 1 0 1 1 1 1 0 1 Key: Q3_08 Key: Q4_08 Key: Q1_09 Key % Hybrid (Value) Q3_08 0.4 Q4_08 0.67 Q1_09 0.75 25

Demo: Vehicle Registry Analysis Using MapReduce Data Massachusetts Vehicle Registration Data from 2008-2011 16M records, 45 fields Analysis Examine hybrid adoptions Calculate % of hybrids registered By Quarter By Regional Area Create map of results 26

The Big Data Platform Datastore HDFS Node Data Map Reduce Node Data Map Reduce Node Data Map Reduce 27

Explore and Analyze Data on Hadoop Datastore HDFS MATLAB Distributed Computing Server Node Data Map Reduce Node Data Map Reduce MATLAB MapReduce Code Node Data Map Reduce 28

Deployed Applications with Hadoop Datastore HDFS MATLAB runtime Node Data Map Reduce Node Data Map Reduce Node Data Map Reduce MATLAB MapReduce Code 29

When to Use mapreduce Data Characteristics Text data in files, databases or stored in the Hadoop Distributed File System (HDFS) Dataset will not fit into memory Compute Platform Desktop Scales to run within Hadoop MapReduce on data in HDFS Analysis Characteristics Must be able to be Partitioned into two phases 1. Map: filter or process sub-segments of data 2. Reduce: aggregate interim results and calculate final answer 30

Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Complexity Non- Partitionable 31

Parallel Computing Distributed Memory Using More Computers (RAM) Core 1 Core 2 Core 1 Core 2 Core 3 Core 4 Core 3 Core 4 RAM RAM 32

Distributed Arrays Available from Parallel Computing Toolbox MATLAB Distributed Computing Server 11 26 41 12 27 42 13 28 43 14 29 44 15 30 45 16 31 46 TOOLBOXES BLOCKSETS 17 32 47 17 33 48 19 34 49 20 35 50 21 36 51 22 37 52 Remotely Manipulate Array from Desktop Distributed Array Lives on the Cluster 33

spmd blocks spmd % single program across workers end Mix parallel and serial code in the same function Run on a pool of MATLAB resources Single Program runs simultaneously across workers Multiple Data spread across multiple workers 34

Example: Airline Delay Analysis Data BTS/RITA Airline On-Time Statistics 123.5M records, 29 fields Analysis Calculate delay patterns Visualize summaries Estimate & evaluate predictive models 35

When to Use Distributed Memory Data Characteristics Data must be fit in collective memory across machines Compute Platform Prototype (subset of data) on desktop Run on a cluster or cloud Analysis Characteristics Consists of: Parts that can be run on data in memory (spmd) Supported functions for distributed arrays 36

Big Data on the Desktop Expand workspace 64 bit processor support increased in-memory data set handling Access portions of data too big to fit into memory Memory mapped variables huge binary file Datastore huge text file or collections of text files Database query portion of a big database table Variety of programming constructs System Objects analyze streaming data MapReduce process text files that won t fit into memory Increase analysis speed Parallel for-loops with multicore/multi-process machines GPU Arrays 37

Further Scaling Big Data Capacity MATLAB supports a number of programming constructs for use with clusters General compute clusters Parallel for-loops embarrassingly parallel algorithms SPMD and distributed arrays distributed memory Hadoop clusters MapReduce analyze data stored in the Hadoop Distributed File System Use these constructs on the desktop to develop your algorithms Migrate to a cluster without algorithm changes 38

Learn More MATLAB Documentation Strategies for Efficient Use of Memory Resolving "Out of Memory" Errors Big Data with MATLAB www.mathworks.com/discovery/big-data-matlab.html MATLAB MapReduce and Hadoop www.mathworks.com/discovery/matlab-mapreduce-hadoop.html 39

EXTRA SLIDES 40

Running into Big Data Issues? Performance Takes too long to process all of your data Out of memory Running out of address space Slow processing (swapping) Data too large to be efficiently managed between RAM and virtual memory 41

MATLAB and Memory Memory and Your Operating System 32-bit operating systems 4GB of addressable memory per process Part of it is reserved by the OS, leaving the application < 4GB 64-bit operating systems 8TB of addressable memory per process Limited by the amount of memory available on the computer Memory per Process Reserved by Operating System Available for Process Use 64-bit OS, if possible 42

MATLAB and Memory Best Practices for Memory Usage Use the appropriate data storage Use only the precision your need Sparse Matrices Categorical Arrays Be aware of overhead of cells and structures Minimize Data Copies In place operations, if possible Use nested functions If using objects, consider handle classes 43

Independent Tasks or Iterations No dependencies or communications between tasks Ideal problem for parallel computing (parfor) Time Time 44

Block Processing Images Available from Image Processing Toolbox blockproc automatically divides an image into blocks for processing Reduces memory usage Read and write block directly from image file Processes arbitrarily large images 45

System Objects Available from DSP System Toolbox Communications System Toolbox Computer Vision System Toolbox Phased Array System Toolbox Image Acquisition Toolbox A class of MATLAB objects that support streaming workflows Simplifies data access for streaming applications Manages flow of data from files or network Handles data indexing and buffering Contain algorithms to work with streaming data Manages algorithm state Available for Signal Processing, Communications, Video Processing, and Phased Array Applications 46

Summary: Options for Handling Large Data Platform Desktop Only Desktop + Cluster Desktop + Hadoop Cluster Hadoop Cluster MATLAB Desktop (Client) MATLAB Desktop (Client) MATLAB Desktop (Client)............ Scheduler Hadoop Scheduler Data Size 100 s MB -10 s GB 100 s MB -100 s GB 100 s GB PBs Techniques parfor datastore mapreduce parfor distributed data spmd mapreduce 47

Performance Gain with More Hardware Using More Cores (CPUs) Using GPUs Core 1 Core 2 GPU cores Core 3 Core 4 Cache Device Memory 48

Scaling Up Your Computations MATLAB Distributed Computing Server Transition from desktop to cluster with minimal code changes Computer Cluster Cluster Supports both interactive and batch workflows Integrates with commonly used schedulers Scheduler 49

MapReduce Programming Model Input Files Output Files Execution Environment MAP REDUCE mapreducer datastore mapreduce 50

mapreducer Input Files Output Files Execution Environment MAP REDUCE mapreducer: Control the Execution Environment of the MapReduce operation 51

mapreducer Serial Local Pool Hadoop Cluster Multi-core CPU 52

datastore Input Files Output Files Execution Environment MAP REDUCE datastore: Stores information regarding input files and output files 53

mapreduce Input Files Output Files Execution Environment MAP REDUCE mapreduce: Executes the MapReduce algorithm 54

mapreduce Input Files Output Files Execution Environment MAP REDUCE mapreduce: - Input datastore object - Map function - Reduce function 55