Bringing Big Data Modelling into the Hands of Domain Experts
David Willingham, Senior Application Engineer, MathWorks
david.willingham@mathworks.com.au
© 2015 The MathWorks, Inc.
"Data is the sword of the 21st century, those who wield it the samurai. Big data: how to create it, manipulate it, and put it to good use. If you want to work at Google, make sure you can use MATLAB." (Jonathan Rosenberg, former Google SVP)
Who is looking to do Big Data Analytics now?
- Software Developers: programmers using languages such as Java or .NET; they may not be domain experts
- End Users (non-technical): non-programmers; they may or may not be domain experts
- Domain Users: scientists, engineers, analysts, etc.; have some programming skills; might use MATLAB to prototype ideas, algorithms, and models; they want their ideas to scale easily with the size of the data
Key Industries
- Aerospace and defense
- Automotive
- Biotech and pharmaceutical
- Communications
- Education
- Electronics and semiconductors
- Energy production
- Financial services
- Industrial automation and machinery
- Medical devices
Tesco uses supply chain analytics to save £100m a year
- 4 years of sales data held in a Teradata data warehouse
- MATLAB is used to forecast optimum stock levels
- "We can run 100 stores for 100 days in about half an hour. We can figure out quickly whether what we are doing is right, and we can optimise that."
Preventing lethal ship collisions with whales
- Cornell University collected terabytes of ocean acoustic data
- Crowdsourced algorithms to detect and classify animal signals in the presence of noise
- A data set that would have taken months to process can now be processed multiple times in just a few days using different detection algorithms
MathWorks Today
- More than 1 million users worldwide
- 3500 universities
- 21 AU universities have a MATLAB campus-wide license
(Image: Earth's topography on a Miller cylindrical projection, created with MATLAB and Mapping Toolbox.)
Challenges of Big Data
"Any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications." (Wikipedia)
- Getting started
- Rapid data exploration
- Development of scalable algorithms
- Ease of deployment
Big Data Capabilities in MATLAB
- Memory and Data Access: 64-bit processors; memory-mapped variables; disk variables; databases; datastores
- Programming Constructs: streaming; block processing; parallel for-loops; GPU arrays; SPMD and distributed arrays; MapReduce
- Platforms: desktop (multicore, GPU); clusters; cloud computing (MDCS on EC2); Hadoop
MATLAB and Memory: Best Practices for Memory Usage
- Use the appropriate data storage: use only the precision you need; sparse matrices; categorical arrays; be aware of the overhead of cells and structures
- Minimize data copies: in-place operations where possible; use nested functions; if using objects, consider handle classes
Considerations for Choosing an Approach
- Data characteristics: size, type, and location of your data
- Compute platform: single desktop machine or cluster
- Analysis characteristics: embarrassingly parallel (analyze sub-segments of data and aggregate results) or operating on the entire dataset
Techniques for Big Data in MATLAB
[Diagram: techniques arranged by complexity, from in-memory and embarrassingly parallel to out-of-memory and non-partitionable: Load, Analyze, Discard (parfor, datastore, MapReduce); Distributed Memory (SPMD and distributed arrays)]
Techniques for Big Data in MATLAB
[Diagram: techniques arranged by complexity, from in-memory and embarrassingly parallel to out-of-memory and non-partitionable: Load, Analyze, Discard (parfor, datastore, MapReduce); Distributed Memory (SPMD and distributed arrays)]
Parallel Computing with MATLAB
[Diagram: the MATLAB desktop (client) dispatching work to a pool of workers]
Demo: Determining Land Use Using Parallel for-loops (parfor)
- Data: aerial images of agricultural land; 24 TIF files
- Analysis: find and measure irrigation fields; determine which irrigation circles are in use (by color); calculate the area under irrigation
When to Use parfor
- Data characteristics: can be of any format (e.g. text, images) as long as it can be broken into pieces; the data for each iteration must fit in memory
- Compute platform: desktop (Parallel Computing Toolbox); cluster (MATLAB Distributed Computing Server)
- Analysis characteristics: each iteration of your loop must be independent
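As a minimal sketch of the parfor pattern above, assuming a pool is already open, hypothetical file names (fields_*.tif), and Image Processing Toolbox; the thresholding step is a crude stand-in for the real irrigation-circle detection:

```matlab
% Each iteration reads and processes one image independently.
files = dir('fields_*.tif');            % hypothetical aerial images
areas = zeros(numel(files), 1);
parfor k = 1:numel(files)
    img = imread(files(k).name);        % per-iteration data fits in memory
    bw  = im2bw(img, graythresh(img));  % stand-in for field detection
    areas(k) = sum(bw(:));              % pixel area "under irrigation"
end
totalArea = sum(areas);
```

Because each pass touches only its own file and its own element of areas, the iterations are independent and MATLAB can run them on any available worker.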
Techniques for Big Data in MATLAB
[Diagram: techniques arranged by complexity, from in-memory and embarrassingly parallel to out-of-memory and non-partitionable: Load, Analyze, Discard (parfor, datastore, MapReduce); Distributed Memory (SPMD and distributed arrays)]
Access Big Data: datastore
- Easily specify a data set: a single text file or a collection of text files
- Preview data structure and format
- Select data to import using column names
- Incrementally read subsets of the data

airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
data = read(airdata);
Demo: Vehicle Registry Analysis Using a datastore
- Data: Massachusetts vehicle registration data from 2008-2011; 16M records, 45 fields
- Analysis: examine hybrid adoption; calculate the % of hybrids registered by quarter; fit growth to predict further adoption
When to Use datastore
- Data characteristics: text data in files, in databases, or stored in the Hadoop Distributed File System (HDFS)
- Compute platform: desktop
- Analysis characteristics: supports Load, Analyze, Discard workflows; incrementally read chunks of data and process them within a while loop
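A sketch of the Load, Analyze, Discard loop described above, assuming hypothetical registration files with a Veh_typ column as in the demo:

```matlab
ds = datastore('vehicles_*.csv');
ds.SelectedVariableNames = {'Veh_typ'};    % read only the column we need
nHybrid = 0; nTotal = 0;
while hasdata(ds)
    t = read(ds);                          % load one in-memory chunk
    nHybrid = nHybrid + sum(strcmp(t.Veh_typ, 'Hybrid'));  % analyze
    nTotal  = nTotal + height(t);
end                                        % chunk is discarded each pass
pctHybrid = 100 * nHybrid / nTotal;
```

Only the running totals persist between iterations, so the full 16M-record dataset never needs to fit in memory at once.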
Big Data Capabilities in MATLAB
- Memory and Data Access: 64-bit processors; memory-mapped variables; disk variables; databases; datastores
- Programming Constructs: streaming; block processing; parallel for-loops; GPU arrays; SPMD and distributed arrays; MapReduce
- Platforms: desktop (multicore, GPU); clusters; cloud computing (MDCS on EC2); Hadoop
Reading in Part of a Dataset from Files
- Text file, ASCII file: datastore
- MAT-file: load and save part of a variable using matfile
- Binary file: read and write directly to/from the file using memmapfile, which maps the file into the address space
- Databases: ODBC- and JDBC-compliant (e.g. Oracle, MySQL, Microsoft SQL Server)
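Two minimal sketches of the partial-read approaches above, with hypothetical file and variable names:

```matlab
% MAT-file: read a slice of a large saved variable X without loading it all.
m = matfile('bigdata.mat');
block = m.X(1:1000, :);            % only these rows come off disk

% Binary file: map a file of doubles into the address space.
mm = memmapfile('signal.bin', 'Format', 'double');
firstSamples = mm.Data(1:100);     % only the accessed pages are read
```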
Techniques for Big Data in MATLAB
[Diagram: techniques arranged by complexity, from in-memory and embarrassingly parallel to out-of-memory and non-partitionable: Load, Analyze, Discard (parfor, datastore, MapReduce); Distributed Memory (SPMD and distributed arrays)]
Analyze Big Data: mapreduce
- Use the powerful MapReduce programming technique to analyze big data
- mapreduce uses a datastore to process data in small chunks that individually fit into memory
- Useful for processing multiple keys, or when intermediate results do not fit in memory
- mapreduce on the desktop: increase compute capacity (Parallel Computing Toolbox); access data on HDFS to develop algorithms for use on Hadoop
- mapreduce with Hadoop: run on Hadoop using MATLAB Distributed Computing Server; deploy applications and libraries for Hadoop using MATLAB Compiler
[Console output: MAPREDUCE PROGRESS, with Map advancing 0% to 100%, followed by Reduce advancing 0% to 100%]
mapreduce: Datastore, Map, Shuffle and Sort, Reduce
[Diagram: vehicle registration records (Veh_typ by quarter: Q3_08, Q4_08, Q1_09) are read from the datastore in chunks; the map step emits, under each quarter key, whether each registered vehicle is a hybrid; the shuffle-and-sort step groups values by key; the reduce step computes the final % hybrid per key: Q3_08 0.4, Q4_08 0.67, Q1_09 0.75]
Demo: Vehicle Registry Analysis Using MapReduce
- Data: Massachusetts vehicle registration data from 2008-2011; 16M records, 45 fields
- Analysis: examine hybrid adoption; calculate the % of hybrids registered by quarter and by regional area; create a map of the results
The Big Data Platform
[Diagram: a datastore over HDFS, with Map and Reduce tasks running on the data local to each node]
Explore and Analyze Data on Hadoop
[Diagram: MATLAB MapReduce code submitted via MATLAB Distributed Computing Server to a Hadoop cluster, with Map and Reduce tasks running against a datastore over HDFS on each node]
Deployed Applications with Hadoop
[Diagram: compiled MATLAB MapReduce code running via the MATLAB runtime as Map and Reduce tasks against a datastore over HDFS on each node]
When to Use mapreduce
- Data characteristics: text data in files, in databases, or stored in the Hadoop Distributed File System (HDFS); the dataset will not fit into memory
- Compute platform: desktop; scales to run within Hadoop MapReduce on data in HDFS
- Analysis characteristics: must be partitionable into two phases:
  1. Map: filter or process sub-segments of data
  2. Reduce: aggregate interim results and calculate the final answer
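A sketch of the two phases for the hybrid-percentage demo, assuming hypothetical file and column names (vehicles_*.csv with Veh_typ and Quarter columns):

```matlab
ds = datastore('vehicles_*.csv', ...
    'SelectedVariableNames', {'Veh_typ', 'Quarter'});
result = mapreduce(ds, @hybridMapper, @hybridReducer);
pct = readall(result);                 % one % hybrid value per quarter

function hybridMapper(data, ~, intermKVStore)
% Map: emit [hybridCount, totalCount] per quarter for this chunk
quarters = unique(data.Quarter);
for k = 1:numel(quarters)
    rows = strcmp(data.Quarter, quarters{k});
    add(intermKVStore, quarters{k}, ...
        [sum(strcmp(data.Veh_typ(rows), 'Hybrid')), sum(rows)]);
end
end

function hybridReducer(key, intermValIter, outKVStore)
% Reduce: aggregate the partial counts into one percentage per key
counts = [0 0];
while hasnext(intermValIter)
    counts = counts + getnext(intermValIter);
end
add(outKVStore, key, 100 * counts(1) / counts(2));
end
```

Because each mapper only sees one chunk and each reducer only sees the partial counts for one key, neither phase ever needs the whole dataset in memory.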
Techniques for Big Data in MATLAB
[Diagram: techniques arranged by complexity, from in-memory and embarrassingly parallel to out-of-memory and non-partitionable: Load, Analyze, Discard (parfor, datastore, MapReduce); Distributed Memory (SPMD and distributed arrays)]
Parallel Computing: Distributed Memory
[Diagram: using more computers (RAM): two four-core machines, each with its own RAM]
Distributed Arrays
- Available from Parallel Computing Toolbox and MATLAB Distributed Computing Server
[Diagram: a distributed array lives on the cluster, partitioned across worker memory, and is manipulated remotely from the desktop alongside toolboxes and blocksets]
spmd blocks

spmd
    % single program across workers
end

- Mix parallel and serial code in the same function
- Run on a pool of MATLAB resources
- Single Program: runs simultaneously across workers
- Multiple Data: spread across multiple workers
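A sketch mixing serial and spmd code with a distributed array, assuming a parallel pool is already open:

```matlab
D = distributed.rand(8000);        % array partitioned across worker memory
spmd
    localPart = getLocalPart(D);   % each worker sees only its portion
    s = sum(localPart(:));         % partial result per worker
end
total = sum([s{:}]);               % gather the Composite on the client
```

The serial code before and after the spmd block runs on the client; inside the block, every worker runs the same program against its own slice of the data.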
Example: Airline Delay Analysis
- Data: BTS/RITA Airline On-Time Statistics; 123.5M records, 29 fields
- Analysis: calculate delay patterns; visualize summaries; estimate and evaluate predictive models
When to Use Distributed Memory
- Data characteristics: the data must fit in the collective memory across machines
- Compute platform: prototype (on a subset of the data) on the desktop; run on a cluster or cloud
- Analysis characteristics: consists of parts that can be run on data in memory (spmd) and supported functions for distributed arrays
Big Data on the Desktop
- Expand the workspace: 64-bit processor support increases in-memory data set handling
- Access portions of data too big to fit into memory: memory-mapped variables (huge binary file); datastore (huge text file or collections of text files); database (query a portion of a big database table)
- Variety of programming constructs: System objects (analyze streaming data); MapReduce (process text files that won't fit into memory)
- Increase analysis speed: parallel for-loops on multicore/multiprocessor machines; GPU arrays
Further Scaling Big Data Capacity
MATLAB supports a number of programming constructs for use with clusters:
- General compute clusters: parallel for-loops (embarrassingly parallel algorithms); SPMD and distributed arrays (distributed memory)
- Hadoop clusters: MapReduce (analyze data stored in the Hadoop Distributed File System)
Use these constructs on the desktop to develop your algorithms, then migrate to a cluster without algorithm changes.
Learn More
- MATLAB Documentation: Strategies for Efficient Use of Memory; Resolving "Out of Memory" Errors
- Big Data with MATLAB: www.mathworks.com/discovery/big-data-matlab.html
- MATLAB MapReduce and Hadoop: www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
EXTRA SLIDES
Running into Big Data Issues?
- Performance: takes too long to process all of your data
- Out of memory: running out of address space
- Slow processing (swapping): data too large to be efficiently managed between RAM and virtual memory
MATLAB and Memory: Memory and Your Operating System
- 32-bit operating systems: 4 GB of addressable memory per process; part of it is reserved by the OS, leaving the application less than 4 GB
- 64-bit operating systems: 8 TB of addressable memory per process; limited by the amount of memory available on the computer
Use a 64-bit OS, if possible.
MATLAB and Memory: Best Practices for Memory Usage
- Use the appropriate data storage: use only the precision you need; sparse matrices; categorical arrays; be aware of the overhead of cells and structures
- Minimize data copies: in-place operations where possible; use nested functions; if using objects, consider handle classes
Independent Tasks or Iterations
- No dependencies or communications between tasks
- Ideal problem for parallel computing (parfor)
Block Processing Images
- Available from Image Processing Toolbox
- blockproc automatically divides an image into blocks for processing
- Reduces memory usage
- Reads and writes blocks directly from the image file
- Processes arbitrarily large images
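A minimal blockproc sketch (hypothetical file names; requires Image Processing Toolbox):

```matlab
fun = @(block) imadjust(block.data);               % process one block
blockproc('huge_aerial.tif', [1024 1024], fun, ... % 1024x1024 tiles
    'Destination', 'huge_aerial_adjusted.tif');    % stream result to disk
```

With a Destination file, blockproc reads, processes, and writes one tile at a time, so the full image is never held in memory.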
System Objects
- Available from DSP System Toolbox, Communications System Toolbox, Computer Vision System Toolbox, Phased Array System Toolbox, and Image Acquisition Toolbox
- A class of MATLAB objects that support streaming workflows
- Simplify data access for streaming applications: manage the flow of data from files or the network; handle data indexing and buffering
- Contain algorithms to work with streaming data: manage algorithm state
- Available for signal processing, communications, video processing, and phased array applications
Summary: Options for Handling Large Data
- Desktop only (MATLAB client): 100s of MB to 10s of GB (parfor, datastore, mapreduce)
- Desktop + cluster (via a scheduler): 100s of MB to 100s of GB (parfor, distributed data, spmd)
- Desktop + Hadoop cluster, or Hadoop cluster alone (via the Hadoop scheduler): 100s of GB to PBs (mapreduce)
Performance Gain with More Hardware
[Diagram: using more cores (CPUs): a four-core CPU with shared cache; using GPUs: many GPU cores with device memory]
Scaling Up Your Computations
- MATLAB Distributed Computing Server: transition from the desktop to a computer cluster with minimal code changes
- Supports both interactive and batch workflows
- Integrates with commonly used schedulers
MapReduce Programming Model
[Diagram: input files flow through MAP and REDUCE to output files; three components: mapreducer (execution environment), datastore (input and output files), mapreduce (the operation itself)]
mapreducer
- Controls the execution environment of the MapReduce operation
mapreducer execution environments
- Serial
- Local pool (multicore CPU)
- Hadoop cluster
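A sketch of switching between these execution environments; the Hadoop install path is a hypothetical example:

```matlab
mapreducer(0);          % run mapreduce serially in the client MATLAB
mapreducer(gcp);        % run against the current parallel pool
% cluster = parallel.cluster.Hadoop( ...
%     'HadoopInstallFolder', '/usr/local/hadoop');  % hypothetical path
% mapreducer(cluster);  % run on a Hadoop cluster
```

The same map and reduce functions run unchanged in all three environments; only the mapreducer configuration differs.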
datastore
- Stores information regarding the input files and output files
mapreduce
- Executes the MapReduce algorithm
- Inputs: a datastore object, a map function, and a reduce function