Outils pour l'analyse prédictive parallèle de multiples sources de données non structurées

Similar documents
Origins, Evolution, and Future Directions of MATLAB Loren Shure

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

2015 The MathWorks, Inc. 1

Solving Big Data Problems in Computer Vision with MATLAB Loren Shure

Bringing Big Data Modelling into the Hands of Domain Experts

What s New in MATLAB and Simulink

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

MATLAB in Business Critical Applications Arvind Hosagrahara Principal Technical Consultant

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Using Big Data and GIS to Model Aviation Fuel Burn

Industry 4.0 and Big Data

SAP and Hortonworks Reference Architecture

Integrating a Big Data Platform into Government:

Challenges for Data Driven Systems

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Hadoop Cluster Applications

An Integrated Big Data & Analytics Infrastructure June 14, 2012 Robert Stackowiak, VP Oracle ESG Data Systems Architecture

Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

How To Handle Big Data With A Data Scientist

Deploying MATLAB -based Applications David Willingham Senior Application Engineer

Architecture & Experience

EMC Greenplum Driving the Future of Data Warehousing and Analytics. Tools and Technologies for Big Data

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Big Data on Microsoft Platform

Using In-Memory Computing to Simplify Big Data Analytics

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Massive Cloud Auditing using Data Mining on Hadoop

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

2015 MATLAB Conference Perth 21 st May 2015 Nicholas Brown. Deploying Electricity Load Forecasts on MATLAB Production Server.

In-Database Analytics

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Dell* In-Memory Appliance for Cloudera* Enterprise

ERM Technology for Insurance Companies

Implement Hadoop jobs to extract business value from large and varied data sets

Machine Learning with MATLAB David Willingham Application Engineer

Cloud Computing Capstone Task 2 Report

Unlocking the Intelligence in. Big Data. Ron Kasabian General Manager Big Data Solutions Intel Corporation

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Integrate Big Data into Business Processes and Enterprise Systems. solution white paper

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Energy Efficient MapReduce

BIG DATA What it is and how to use?

Hadoop & SAS Data Loader for Hadoop

Parallel Computing using MATLAB Distributed Compute Server ZORRO HPC

Introducing Oracle Exalytics In-Memory Machine

Enabling High performance Big Data platform with RDMA

Big Data Analytics - Accelerated. stream-horizon.com

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Speeding up MATLAB and Simulink Applications

Machine Learning for Data-Driven Discovery

Turning Data into Actionable Insights: Predictive Analytics with MATLAB WHITE PAPER

Databricks. A Primer

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Laurence Liew General Manager, APAC. Economics Is Driving Big Data Analytics to the Cloud

MATLAB in Production Systems, Database Integration, and Big Data Eugene McGoldrick

Big Data Cloud Services

Apache Hama Design Document v0.6

From GWS to MapReduce: Google s Cloud Technology in the Early Days

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

Massive Predictive Modeling using Oracle R Technologies Mark Hornick, Director, Oracle Advanced Analytics

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Neptune. A Domain Specific Language for Deploying HPC Software on Cloud Platforms. Chris Bunch Navraj Chohan Chandra Krintz Khawaja Shams

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Algorithmic Trading with MATLAB Martin Demel, Application Engineer

SAP Predictive Analytics: An Overview and Roadmap. Charles Gadalla, SESSION CODE: 603

Scaling Out With Apache Spark. DTL Meeting Slides based on

The Stratosphere Big Data Analytics Platform

Enabling the Use of Data

Capacity Planning Fundamentals. Support Business Growth with a Better Approach to Scaling Your Data Center

High Performance Computing with Hadoop WV HPC Summer Institute 2014

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

TRAINING PROGRAM ON BIGDATA/HADOOP

Databricks. A Primer

Oracle Database 12c Plug In. Switch On. Get SMART.

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

Make the Most of Big Data to Drive Innovation Through Reseach

Modernizing Your Data Warehouse for Hadoop

RevoScaleR Speed and Scalability

Customized Report- Big Data

GC3 Use cases for the Cloud

Big Data and Data Science: Behind the Buzz Words

MATLAB as a Collaboration Platform Marta Wilczkowiak Senior Applications Engineer MathWorks

Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Hadoop & Spark Using Amazon EMR

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce

SURVEY REPORT DATA SCIENCE SOCIETY 2014

Extend your analytic capabilities with SAP Predictive Analysis

Big Data, SAP HANA. SUSE Linux Enterprise Server for SAP Applications. Kim Aaltonen

Data Management in SAP Environments

CRITEO INTERNSHIP PROGRAM 2015/2016

Transcription:

Outils pour l'analyse prédictive parallèle de multiples sources de données non structurées Forum Ter@tec Mercredi 25 juin 2015 Marc Wolff Application Engineer HPC & Big Data 2015 The MathWorks, Inc. 1

Challenges Businesses Are Facing Today Big Data for evidence-based decision making Goal Large (and increasing) amount of available data Leverage data to make better decision Organizational issues Rapid evolution Data scientist need to share their algorithms and results efficiently Technical issues Datasets do not fit in the memory of a single computer Processing these data requires huge computing resources 2

How MATLAB Helps Tackling Big Data Data More data, faster access Complex / incomplete / changing formats handling Algorithms for complex problems People & Systems Share algorithms & protect IP Real-time analytics Compute Power Leverage HPC facilities & clouds Distributed memory framework 3

Scaling Out Calculations MATLAB Distributed Computing Server (MDCS) Parallel Computing Toolbox (PCT) 5

MATLAB & High Performance Computing? Cleve Moler Inventor of MATLAB Co-Founder of MathWorks One of the authors of the LINPACK library 6

Data Analytics Workflow Data Access Data Processing Data Analytics Today s presentation Descriptive Statistics Machine Learning Neural Networks MapReduce 7

Access Large Datasets datastore Data container that allows to easily read data that are too large to fit in the computer s memory Incremental read: data loaded in memory by parts Data sources of various natures Database (using Database Toolbox) Single text file or collection of text files MATLAB is generally able to read and write data directly from/to HDFS datastore can be partitioned using the partition function Allows parallel data access Take advantage of parallel file systems 8

MapReduce Programming Model Example with air traffic dataset: find the longest flight for each carrier Datastore Map Reduce 1503 UA LAX -5-10 2356 540 PS BUR 13 5 186 1920 DL BOS 10 32 1876 1840 DL SFO 0 13 568 272 US BWI 4-2 359 UA PS DL DL US 2356 186 1876 568 359 UA 2356 UA 1867 UA 1365 PS 186 784 PS SEA 7 3 176 PS 176 PS 237 796 PS LAX -2 2 237 PS 237 1525 UA SFO 3-5 1867 UA 1867 DL 1876 632 US PS SJC 2-4 245 1610 UA MIA 60 34 1365 2032 DL EWR 10 16 789 2134 DL DFW -2 6 914 US UA DL DL 245 1365 789 914 DL 914 US 359 US 245 9

MapReduce Programming Model Strengths and limitations mapreduce has been introduced in MATLAB Strengths Analytics are made easy when they fit in the MapReduce framework MapReduce on Hadoop can take advantage of data locality in HDFS Shared filesystem High-speed network Limitations Subset-by-subset data processing, no vision of the whole dataset Scalability issues in some cases 10

A Parallel Programming Model for Predictive Analytics Parallel computing implementation in MATLAB Capabilities all based on MPI MATLAB offers a transparently distributed data structure: distributed Dataset #1 Dataset #2 datastore Process #1 Process #2 Process #N distributed array Run in-memory parallel analytics 11

A Parallel Programming Model for Predictive Analytics Scalability analysis Does it scale? Does it require to store the complete dataset on a single process? 1. Read data from one or multiple datasets and preprocess it 2. Store the preprocessed data in a distributed array (up to 100s of processes) 3. Run in-memory analytics on the distributed array (depends on the algorithm) (depends on the algorithm) 12

A Parallel Programming Model for Predictive Analytics Example: power load forecasting http://ec2-54-165-201-58.compute-1.amazonaws.com:8080/demandforecastweb/ 13

A Parallel Programming Model for Predictive Analytics Example: power load forecasting Goal Develop a predictive model to forecast electrical power consumption Deploy the prediction tool in power plants to adjust production Predictors from different datasets Power consumption over the previous days Calendar information: day of the week? holiday period? Climate data Challenges Multiple data sources with different formatting Datasets have different samplings Final result Predictive model based on Neural Networks Deployed in production using MATLAB Production Server 14

Key Takeaways Access large datasets with MATLAB datastore allows to read datasets that do not fit in memory partition allows parallel data access Tools for easily developing algorithms and scaling out computations No need to be an expert in computer science No need to be an expert in parallel computing MathWorks development teams heavily involved Continuous development and improvement New features in each release Thank you! Q & A session 15