2015 The MathWorks, Inc. 1

Similar documents
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Bringing Big Data Modelling into the Hands of Domain Experts

What s New in MATLAB and Simulink

Solving Big Data Problems in Computer Vision with MATLAB Loren Shure

MATLAB in Business Critical Applications Arvind Hosagrahara Principal Technical Consultant

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Origins, Evolution, and Future Directions of MATLAB Loren Shure

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Implement Hadoop jobs to extract business value from large and varied data sets

Introduction to MATLAB for Data Analysis and Visualization

Speeding up MATLAB and Simulink Applications

Hadoop Cluster Applications

Parallel Computing with MATLAB

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Data Analysis with MATLAB The MathWorks, Inc. 1

Chapter 7. Using Hadoop Cluster and MapReduce

Massive Cloud Auditing using Data Mining on Hadoop

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Developing MapReduce Programs

Data Mining in the Swamp

Big Fast Data Hadoop acceleration with Flash. June 2013

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Fast Analytics on Big Data with H20

Hadoop Architecture. Part 1

Architectures for Big Data Analytics A database perspective

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Challenges for Data Driven Systems

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Research of Railway Wagon Flow Forecast System Based on Hadoop-Hazelcast

1 st Symposium on Colossal Data and Networking (CDAN-2016) March 18-19, 2016 Medicaps Group of Institutions, Indore, India

Map-Reduce for Machine Learning on Multicore

Matlab on a Supercomputer

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Hadoop & SAS Data Loader for Hadoop

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

RevoScaleR Speed and Scalability

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

BIG DATA TRENDS AND TECHNOLOGIES

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

MapReduce and Hadoop Distributed File System

Big Data on Microsoft Platform

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

Introduction to Cloud Computing

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

MapReduce and Hadoop Distributed File System V I J A Y R A O

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Hadoop Scheduler w i t h Deadline Constraint

Cloud Computing at Google. Architecture

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

Distributed Computing and Big Data: Hadoop and MapReduce

Machine Learning with MATLAB David Willingham Application Engineer

3rd International Symposium on Big Data and Cloud Computing Challenges (ISBCC-2016) March 10-11, 2016 VIT University, Chennai, India

Data processing goes big

How To Use Hadoop For Gis

Large-Scale Data Processing

Introduction to MATLAB Gergely Somlay Application Engineer

GraySort on Apache Spark by Databricks

Using In-Memory Computing to Simplify Big Data Analytics

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

In-Database Analytics

Big Data Processing with Google s MapReduce. Alexandru Costan

Workshop on Hadoop with Big Data

Oracle Big Data SQL Technical Update

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

On Demand Satellite Image Processing

10- High Performance Compu5ng

Data Management in the Cloud

Hadoop Ecosystem B Y R A H I M A.

How To Program With Adaptive Vision Studio

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

HPC technology and future architecture

Oracle Database 12c Plug In. Switch On. Get SMART.

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Introduction to DISC and Hadoop

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Apache Flink Next-gen data analysis. Kostas

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Journée Thématique Big Data 13/03/2015

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

ITG Software Engineering

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

Distributed Apriori in Hadoop MapReduce Framework

Assignment # 1 (Cloud Computing Security)

Transcription:

25 The MathWorks, Inc.

빅 데이터 및 다양한 데이터 처리 위한 MATLAB의 인터페이스 환경 및 새로운 기능 엄준상 대리 Application Engineer MathWorks 25 The MathWorks, Inc. 2

Challenges of Data Any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. (Wikipedia) Various Data Sources Rapid data exploration Development of scalable algorithms Ease of deployment 3

Big Data Capabilities in MATLAB Memory and Data Access 64-bit processors Memory Mapped Variables Disk Variables Databases Datastores Programming Constructs Streaming Block Processing Parallel-for loops GPU Arrays SPMD and Distributed Arrays MapReduce Platforms Desktop (Multicore, GPU) Clusters Cloud Computing (MDCS on EC2) Hadoop 4

MATLAB Connects to Your Hardware Devices Data Acquisition Toolbox Plug in data acquisition boards Instrument Control Toolbox Electronic and scientific instrumentation Image Acquisition Toolbox Video devices MATLAB Interfaces for communicating with everything 5

Considerations for Choosing an Approach Data characteristics Size, type and location of your data Compute platform Single desktop machine or cluster Analysis Characteristics Embarrassingly Parallel Analyze sub-segments of data and aggregate results Operate on entire dataset 7

Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Non- Partitionable Complexity 8

Demo: Determining Land Use Using Parallel for-loops (parfor) Data Arial images of agriculture land 24 TIF files Analysis Find and measure irrigation fields Determine which irrigation circles are in use (by color) Calculate area under irrigation 9

Access Big Data datastore Easily specify data set Single text file (or collection of text files) Preview data structure and format Select data to import using column names Incrementally read subsets of the data airdata = datastore('*.csv'); airdata.selectedvariables = {'Distance', 'ArrDelay }; data = read(airdata);

Demo: Vehicle Registry Analysis Using a DataStore Data Massachusetts Vehicle Registration Data from 28-2 6M records, 45 fields Analysis Examine hybrid adoptions Calculate % of hybrids registered by quarter Fit growth to predict further adoption

When to Use datastore Data Characteristics Text data in files, databases or stored in the Hadoop Distributed File System (HDFS) Compute Platform Desktop Analysis Characteristics Supports Load, Analyze, Discard workflows Incrementally read chunks of data, process within a while loop 2

Analyze Big Data mapreduce Use the powerful MapReduce programming technique to analyze big data mapreduce uses a datastore to process data in small chunks that individually fit into memory Useful for processing multiple keys, or when Intermediate results do not fit in memory ******************************** * MAPREDUCE PROGRESS * ******************************** Map % Reduce % Map 2% Reduce % Map 4% Reduce % Map 6% Reduce % Map 8% Reduce % Map % Reduce 25% Map % Reduce 5% Map % Reduce 75% Map % Reduce % mapreduce on the desktop Increase compute capacity (Parallel Computing Toolbox) Access data on HDFS to develop algorithms for use on Hadoop mapreduce with Hadoop Run on Hadoop using MATLAB Distributed Computing Server Deploy applications and libraries for Hadoop using MATLAB Compiler 3

mapreduce Data Store Map Shuffle Reduce and Sort Veh_typ Q3_8 Q4_8 Q_9 Hybrid Car SUV Car Car Car Car Car SUV Car SUV Car Car Hybrid Key: Q3_8 Key: Q4_8 Key: Q_9 Key: Q3_8 Key: Q4_8 Key: Q_9 Hybrid Key: Q3_8 Key: Q4_8 Key: Q_9 Key % Hybrid (Value) Q3_8.4 Q4_8.67 Q_9.75 4

Demo: Vehicle Registry Analysis Using MapReduce Data Massachusetts Vehicle Registration Data from 28-2 6M records, 45 fields Analysis Examine hybrid adoptions Calculate % of hybrids registered By Quarter By Regional Area Create map of results 5

Explore and Analyze Data on Hadoop Datastore HDFS MATLAB Distributed Computing Server Node Data Map Reduce Node Data Map Reduce MATLAB MapReduce Code Node Data Map Reduce 6

Deployed Applications with Hadoop Datastore HDFS MATLAB runtime Node Data Map Reduce Node Data Map Reduce Node Data Map Reduce MATLAB MapReduce Code 7

When to Use mapreduce Data Characteristics Text data in files, databases or stored in the Hadoop Distributed File System (HDFS) Dataset will not fit into memory Compute Platform Desktop Scales to run within Hadoop MapReduce on data in HDFS Analysis Characteristics Must be able to be Partitioned into two phases. Map: filter or process sub-segments of data 2. Reduce: aggregate interim results and calculate final answer 8

Example: Airline Delay Analysis Data BTS/RITA Airline On-Time Statistics 23.5M records, 29 fields Analysis Calculate delay patterns Visualize summaries Estimate & evaluate predictive models 9

When to Use Distributed Memory Data Characteristics Data must be fit in collective memory across machines Compute Platform Prototype (subset of data) on desktop Run on a cluster or cloud Analysis Characteristics Consists of: Parts that can be run on data in memory (spmd) Supported functions for distributed arrays 2

Big Data on the Desktop Expand workspace 64 bit processor support increased in-memory data set handling Access portions of data too big to fit into memory Memory mapped variables huge binary file Datastore huge text file or collections of text files Database query portion of a big database table Variety of programming constructs System Objects analyze streaming data MapReduce process text files that won t fit into memory Increase analysis speed Parallel for-loops with multicore/multi-process machines GPU Arrays 2

Further Scaling Big Data Capacity MATLAB supports a number of programming constructs for use with clusters General compute clusters Parallel for-loops embarrassingly parallel algorithms SPMD and distributed arrays distributed memory Hadoop clusters MapReduce analyze data stored in the Hadoop Distributed File System Use these constructs on the desktop to develop your algorithms Migrate to a cluster without algorithm changes 22

Learn More MATLAB Documentation Strategies for Efficient Use of Memory Resolving "Out of Memory" Errors Big Data with MATLAB www.mathworks.com/discovery/big-data-matlab.html MATLAB MapReduce and Hadoop www.mathworks.com/discovery/matlab-mapreduce-hadoop.html 23

Questions? 25 The MathWorks, Inc. 24