Bringing Big Data Modelling into the Hands of Domain Experts
|
|
- Blake Whitehead
- 8 years ago
- Views:
Transcription
1 Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks 2015 The MathWorks, Inc. 1
2 Data is the sword of the 21st century, those who wield it the samurai. Big data how to create it, manipulate it, and put it to good use. If you want to work at Google, make sure you can use MATLAB. Google s Former SVP - Jonathan Rosenberg 2
3 Who is looking to do Big Data Analytics now? Software Developers Programmers using languages such as Java /.NET They may not be domain experts End Users (non technical) Non programmers They may or may not be domain experts Domain Users Scientists, Engineers, Analysts, etc.. Have some programming skills Might use MATLAB to protype ideas, algorithms models They want their ideas to scale with the size of data easily 3
4 Key Industries Aerospace and defense Automotive Biotech and pharmaceutical Communications Education Electronics and semiconductors Energy production Financial services Industrial automation and machinery Medical devices 4
5 Tesco uses supply chain analytics to save 100m a year. 4 years of sales data held in a Teradata data warehouse. MATLAB is used to forecast the optimum stock levels. We can run 100 stores for 100 days in about half an hour. We can figure out quickly whether what we are doing is right and we can optimise that. 5
6 Preventing lethal ship collisions with whales Cornell University collected terabytes of ocean acoustic data. Crowdsourced algorithms to detect and classify animal signals in the presence of noise. A data set that would have taken months to process can now be processed multiple times in just a few days using different detection algorithms. 6
7 MathWorks Today Earth s topography on a Miller cylindrical projection, created with MATLAB and Mapping Toolbox. More than 1 million users worldwide universities 21 AU universities have MATLAB campuswide license. 7
8 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. (Wikipedia) Getting started Rapid data exploration Development of scalable algorithms Ease of deployment 8
9 Big Data Capabilities in MATLAB Memory and Data Access 64-bit processors Memory Mapped Variables Disk Variables Databases Datastores Programming Constructs Streaming Block Processing Parallel-for loops GPU Arrays SPMD and Distributed Arrays MapReduce Platforms Desktop (Multicore, GPU) Clusters Cloud Computing (MDCS on EC2) Hadoop 9
10 MATLAB and Memory Best Practices for Memory Usage Use the appropriate data storage Use only the precision your need Sparse Matrices Categorical Arrays Be aware of overhead of cells and structures Minimize Data Copies In place operations, if possible Use nested functions If using objects, consider handle classes 10
11 Considerations for Choosing an Approach Data characteristics Size, type and location of your data Compute platform Single desktop machine or cluster Analysis Characteristics Embarrassingly Parallel Analyze sub-segments of data and aggregate results Operate on entire dataset 11
12 Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Non- Partitionable Complexity 12
13 Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Non- Partitionable Complexity 13
14 Parallel Computing with MATLAB MATLAB Desktop (Client) Worker Worker Worker Worker Worker Worker 14
15 Demo: Determining Land Use Using Parallel for-loops (parfor) Data Arial images of agriculture land 24 TIF files Analysis Find and measure irrigation fields Determine which irrigation circles are in use (by color) Calculate area under irrigation 15
16 When to Use parfor Data Characteristics Can be of any format (i.e. text, images) as long as it can be broken into pieces The data for each iteration must fit in memory Compute Platform Desktop (Parallel Computing Toolbox) Cluster (MATLAB Distributed Computing Server) Analysis Characteristics Each iteration of your loop must be independent 16
17 Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Non- Partitionable Complexity 17
18 Access Big Data datastore Easily specify data set Single text file (or collection of text files) Preview data structure and format Select data to import using column names Incrementally read subsets of the data airdata = datastore('*.csv'); airdata.selectedvariables = {'Distance', 'ArrDelay }; data = read(airdata); 18
19 Demo: Vehicle Registry Analysis Using a DataStore Data Massachusetts Vehicle Registration Data from M records, 45 fields Analysis Examine hybrid adoptions Calculate % of hybrids registered by quarter Fit growth to predict further adoption 19
20 When to Use datastore Data Characteristics Text data in files, databases or stored in the Hadoop Distributed File System (HDFS) Compute Platform Desktop Analysis Characteristics Supports Load, Analyze, Discard workflows Incrementally read chunks of data, process within a while loop 20
21 Big Data Capabilities in MATLAB Memory and Data Access 64-bit processors Memory Mapped Variables Disk Variables Databases Datastores Programming Constructs Streaming Block Processing Parallel-for loops GPU Arrays SPMD and Distributed Arrays MapReduce Platforms Desktop (Multicore, GPU) Clusters Cloud Computing (MDCS on EC2) Hadoop 21
22 Reading in Part of a Dataset from Files Text file, ASCII file datastore MAT file Load and save part of a variable using the matfile Binary file Read and write directly to/from file using memmapfile Maps address space to file Databases ODBC and JDBC-compliant (e.g. Oracle, MySQL, Microsoft, SQL Server) 22
23 Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Complexity Non- Partitionable 23
24 Analyze Big Data mapreduce Use the powerful MapReduce programming technique to analyze big data mapreduce uses a datastore to process data in small chunks that individually fit into memory Useful for processing multiple keys, or when Intermediate results do not fit in memory ******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 20% Reduce 0% Map 40% Reduce 0% Map 60% Reduce 0% Map 80% Reduce 0% Map 100% Reduce 25% Map 100% Reduce 50% Map 100% Reduce 75% Map 100% Reduce 100% mapreduce on the desktop Increase compute capacity (Parallel Computing Toolbox) Access data on HDFS to develop algorithms for use on Hadoop mapreduce with Hadoop Run on Hadoop using MATLAB Distributed Computing Server Deploy applications and libraries for Hadoop using MATLAB Compiler 24
25 mapreduce Data Store Map Shuffle Reduce and Sort Veh_typ Q3_08 Q4_08 Q1_09 Hybrid Car SUV Car Car Car Car Car SUV Car SUV Car Car Hybrid Key: Q3_08 Key: Q4_08 Key: Q1_09 Key: Q3_08 Key: Q4_08 Key: Q1_09 Hybrid Key: Q3_08 Key: Q4_08 Key: Q1_09 Key % Hybrid (Value) Q3_ Q4_ Q1_
26 Demo: Vehicle Registry Analysis Using MapReduce Data Massachusetts Vehicle Registration Data from M records, 45 fields Analysis Examine hybrid adoptions Calculate % of hybrids registered By Quarter By Regional Area Create map of results 26
27 The Big Data Platform Datastore HDFS Node Data Map Reduce Node Data Map Reduce Node Data Map Reduce 27
28 Explore and Analyze Data on Hadoop Datastore HDFS MATLAB Distributed Computing Server Node Data Map Reduce Node Data Map Reduce MATLAB MapReduce Code Node Data Map Reduce 28
29 Deployed Applications with Hadoop Datastore HDFS MATLAB runtime Node Data Map Reduce Node Data Map Reduce Node Data Map Reduce MATLAB MapReduce Code 29
30 When to Use mapreduce Data Characteristics Text data in files, databases or stored in the Hadoop Distributed File System (HDFS) Dataset will not fit into memory Compute Platform Desktop Scales to run within Hadoop MapReduce on data in HDFS Analysis Characteristics Must be able to be Partitioned into two phases 1. Map: filter or process sub-segments of data 2. Reduce: aggregate interim results and calculate final answer 30
31 Techniques for Big Data in MATLAB Load, Analyze, Discard parfor, datastore, MapReduce Distributed Memory SPMD and distributed arrays out-of-memory in-memory Embarrassingly Parallel Complexity Non- Partitionable 31
32 Parallel Computing Distributed Memory Using More Computers (RAM) Core 1 Core 2 Core 1 Core 2 Core 3 Core 4 Core 3 Core 4 RAM RAM 32
33 Distributed Arrays Available from Parallel Computing Toolbox MATLAB Distributed Computing Server TOOLBOXES BLOCKSETS Remotely Manipulate Array from Desktop Distributed Array Lives on the Cluster 33
34 spmd blocks spmd % single program across workers end Mix parallel and serial code in the same function Run on a pool of MATLAB resources Single Program runs simultaneously across workers Multiple Data spread across multiple workers 34
35 Example: Airline Delay Analysis Data BTS/RITA Airline On-Time Statistics 123.5M records, 29 fields Analysis Calculate delay patterns Visualize summaries Estimate & evaluate predictive models 35
36 When to Use Distributed Memory Data Characteristics Data must be fit in collective memory across machines Compute Platform Prototype (subset of data) on desktop Run on a cluster or cloud Analysis Characteristics Consists of: Parts that can be run on data in memory (spmd) Supported functions for distributed arrays 36
37 Big Data on the Desktop Expand workspace 64 bit processor support increased in-memory data set handling Access portions of data too big to fit into memory Memory mapped variables huge binary file Datastore huge text file or collections of text files Database query portion of a big database table Variety of programming constructs System Objects analyze streaming data MapReduce process text files that won t fit into memory Increase analysis speed Parallel for-loops with multicore/multi-process machines GPU Arrays 37
38 Further Scaling Big Data Capacity MATLAB supports a number of programming constructs for use with clusters General compute clusters Parallel for-loops embarrassingly parallel algorithms SPMD and distributed arrays distributed memory Hadoop clusters MapReduce analyze data stored in the Hadoop Distributed File System Use these constructs on the desktop to develop your algorithms Migrate to a cluster without algorithm changes 38
39 Learn More MATLAB Documentation Strategies for Efficient Use of Memory Resolving "Out of Memory" Errors Big Data with MATLAB MATLAB MapReduce and Hadoop 39
40 EXTRA SLIDES 40
41 Running into Big Data Issues? Performance Takes too long to process all of your data Out of memory Running out of address space Slow processing (swapping) Data too large to be efficiently managed between RAM and virtual memory 41
42 MATLAB and Memory Memory and Your Operating System 32-bit operating systems 4GB of addressable memory per process Part of it is reserved by the OS, leaving the application < 4GB 64-bit operating systems 8TB of addressable memory per process Limited by the amount of memory available on the computer Memory per Process Reserved by Operating System Available for Process Use 64-bit OS, if possible 42
43 MATLAB and Memory Best Practices for Memory Usage Use the appropriate data storage Use only the precision your need Sparse Matrices Categorical Arrays Be aware of overhead of cells and structures Minimize Data Copies In place operations, if possible Use nested functions If using objects, consider handle classes 43
44 Independent Tasks or Iterations No dependencies or communications between tasks Ideal problem for parallel computing (parfor) Time Time 44
45 Block Processing Images Available from Image Processing Toolbox blockproc automatically divides an image into blocks for processing Reduces memory usage Read and write block directly from image file Processes arbitrarily large images 45
46 System Objects Available from DSP System Toolbox Communications System Toolbox Computer Vision System Toolbox Phased Array System Toolbox Image Acquisition Toolbox A class of MATLAB objects that support streaming workflows Simplifies data access for streaming applications Manages flow of data from files or network Handles data indexing and buffering Contain algorithms to work with streaming data Manages algorithm state Available for Signal Processing, Communications, Video Processing, and Phased Array Applications 46
47 Summary: Options for Handling Large Data Platform Desktop Only Desktop + Cluster Desktop + Hadoop Cluster Hadoop Cluster MATLAB Desktop (Client) MATLAB Desktop (Client) MATLAB Desktop (Client) Scheduler Hadoop Scheduler Data Size 100 s MB -10 s GB 100 s MB -100 s GB 100 s GB PBs Techniques parfor datastore mapreduce parfor distributed data spmd mapreduce 47
48 Performance Gain with More Hardware Using More Cores (CPUs) Using GPUs Core 1 Core 2 GPU cores Core 3 Core 4 Cache Device Memory 48
49 Scaling Up Your Computations MATLAB Distributed Computing Server Transition from desktop to cluster with minimal code changes Computer Cluster Cluster Supports both interactive and batch workflows Integrates with commonly used schedulers Scheduler 49
50 MapReduce Programming Model Input Files Output Files Execution Environment MAP REDUCE mapreducer datastore mapreduce 50
51 mapreducer Input Files Output Files Execution Environment MAP REDUCE mapreducer: Control the Execution Environment of the MapReduce operation 51
52 mapreducer Serial Local Pool Hadoop Cluster Multi-core CPU 52
53 datastore Input Files Output Files Execution Environment MAP REDUCE datastore: Stores information regarding input files and output files 53
54 mapreduce Input Files Output Files Execution Environment MAP REDUCE mapreduce: Executes the MapReduce algorithm 54
55 mapreduce Input Files Output Files Execution Environment MAP REDUCE mapreduce: - Input datastore object - Map function - Reduce function 55
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult
More information2015 The MathWorks, Inc. 1
25 The MathWorks, Inc. 빅 데이터 및 다양한 데이터 처리 위한 MATLAB의 인터페이스 환경 및 새로운 기능 엄준상 대리 Application Engineer MathWorks 25 The MathWorks, Inc. 2 Challenges of Data Any collection of data sets so large and complex
More informationWhat s New in MATLAB and Simulink
What s New in MATLAB and Simulink Kevin Cohan Product Marketing, MATLAB Michael Carone Product Marketing, Simulink 2015 The MathWorks, Inc. 1 What was new for Simulink in R2012b? 2 What Was New for MATLAB
More informationMATLAB in Business Critical Applications Arvind Hosagrahara Principal Technical Consultant Arvind.Hosagrahara@mathworks.
MATLAB in Business Critical Applications Arvind Hosagrahara Principal Technical Consultant Arvind.Hosagrahara@mathworks.com 310-819-3970 2014 The MathWorks, Inc. 1 Outline Problem Statement The Big Picture
More informationIs a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
More informationSolving Big Data Problems in Computer Vision with MATLAB Loren Shure
Solving Big Data Problems in Computer Vision with MATLAB Loren Shure 2015 The MathWorks, Inc. 1 Why Are We Talking About Big Data? 100 hours of video uploaded to YouTube per minute 1 Explosive increase
More informationOrigins, Evolution, and Future Directions of MATLAB Loren Shure
Origins, Evolution, and Future Directions of MATLAB Loren Shure 2015 The MathWorks, Inc. 1 Agenda Origins Peaks 5 Evolution 0-5 Tomorrow 2 0 y -2-3 -2-1 x 0 1 2 3 2 Computational Finance Workflow Access
More informationParallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationOutils pour l'analyse prédictive parallèle de multiples sources de données non structurées
Outils pour l'analyse prédictive parallèle de multiples sources de données non structurées Forum Ter@tec Mercredi 25 juin 2015 Marc Wolff Application Engineer HPC & Big Data 2015 The MathWorks, Inc. 1
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationRevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationIntroduction to MATLAB Gergely Somlay Application Engineer gergely.somlay@gamax.hu
Introduction to MATLAB Gergely Somlay Application Engineer gergely.somlay@gamax.hu 2012 The MathWorks, Inc. 1 What is MATLAB? High-level language Interactive development environment Used for: Numerical
More informationNews and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren
News and trends in Data Warehouse Automation, Big Data and BI Johan Hendrickx & Dirk Vermeiren Extreme Agility from Source to Analysis DWH Appliances & DWH Automation Typical Architecture 3 What Business
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationHow To Use Hadoop For Gis
2013 Esri International User Conference July 8 12, 2013 San Diego, California Technical Workshop Big Data: Using ArcGIS with Apache Hadoop David Kaiser Erik Hoel Offering 1330 Esri UC2013. Technical Workshop.
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationDeploying MATLAB -based Applications David Willingham Senior Application Engineer
Deploying MATLAB -based Applications David Willingham Senior Application Engineer 2014 The MathWorks, Inc. 1 Data Analytics Workflow Access Files Explore & Discover Data Analysis & Modeling Share Reporting
More informationUsing In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationOracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya
Oracle Database - Engineered for Innovation Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya Oracle Database 11g Release 2 Shipping since September 2009 11.2.0.3 Patch Set now
More informationMatlab on a Supercomputer
Matlab on a Supercomputer Shelley L. Knuth Research Computing April 9, 2015 Outline Description of Matlab and supercomputing Interactive Matlab jobs Non-interactive Matlab jobs Parallel Computing Slides
More informationAccelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
More informationSEAIP 2009 Presentation
SEAIP 2009 Presentation By David Tan Chair of Yahoo! Hadoop SIG, 2008-2009,Singapore EXCO Member of SGF SIG Imperial College (UK), Institute of Fluid Science (Japan) & Chicago BOOTH GSB (USA) Alumni Email:
More informationSpeeding up MATLAB and Simulink Applications
Speeding up MATLAB and Simulink Applications 2009 The MathWorks, Inc. Customer Tour 2009 Today s Schedule Introduction to Parallel Computing with MATLAB and Simulink Break Master Class on Speeding Up MATLAB
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationData Analysis with MATLAB. 2013 The MathWorks, Inc. 1
Data Analysis with MATLAB 2013 The MathWorks, Inc. 1 Agenda Introduction Data analysis with MATLAB and Excel Break Developing applications with MATLAB Solving larger problems Summary 2 Modeling the Solar
More informationBIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationWhere We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL
More informationHow In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationJournée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More informationCIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensie Computing Uniersity of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed
More informationArchitectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationPart V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts
Part V Applications Cloud Computing: General concepts Copyright K.Goseva 2010 CS 736 Software Performance Engineering Slide 1 What is cloud computing? SaaS: Software as a Service Cloud: Datacenters hardware
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationFast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationIntroducing Oracle Exalytics In-Memory Machine
Introducing Oracle Exalytics In-Memory Machine Jon Ainsworth Director of Business Development Oracle EMEA Business Analytics 1 Copyright 2011, Oracle and/or its affiliates. All rights Agenda Topics Oracle
More informationData Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
More informationData Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
More informationHADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
More informationRackspace Cloud Databases and Container-based Virtualization
Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationLarge-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki eiko.yoneki@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationUnderstanding the Value of In-Memory in the IT Landscape
February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to
More informationUnderstanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
More informationSQL Server 2012 Performance White Paper
Published: April 2012 Applies to: SQL Server 2012 Copyright The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication.
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationR at the front end and
Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation
More informationOn Demand Satellite Image Processing
On Demand Satellite Image Processing Next generation technology for processing Terabytes of imagery on the Cloud WHITEPAPER MARCH 2015 Introduction Profound changes are happening with computing hardware
More informationSpark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationBig Data Are You Ready? Thomas Kyte http://asktom.oracle.com
Big Data Are You Ready? Thomas Kyte http://asktom.oracle.com The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated
More informationPerformance and Scalability Overview
Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics platform. PENTAHO PERFORMANCE ENGINEERING
More informationChapter 18: Database System Architectures. Centralized Systems
Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and
More informationGraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
More informationCredit Risk Modeling with MATLAB
Credit Risk Modeling with MATLAB Martin Demel, Application Engineer 95% VaR: $798232. 95% CVaR: $1336167. AAA 93.68% 5.55% 0.59% 0.18% AA 2.44% 92.60% 4.03% 0.73% 0.15% 0.06% -1 0 1 2 3 4 A5 0.14% 6 4.18%
More informationTurn Big Data to Small Data
Turn Big Data to Small Data Use Qlik to Utilize Distributed Systems and Document Databases October, 2014 Stig Magne Henriksen Image: kdnuggets.com From Big Data to Small Data Agenda When do we have a Big
More informationHPC technology and future architecture
HPC technology and future architecture Visual Analysis for Extremely Large-Scale Scientific Computing KGT2 Internal Meeting INRIA France Benoit Lange benoit.lange@inria.fr Toàn Nguyên toan.nguyen@inria.fr
More informationIntroduction to MATLAB for Data Analysis and Visualization
Introduction to MATLAB for Data Analysis and Visualization Sean de Wolski Application Engineer 2014 The MathWorks, Inc. 1 Data Analysis Tasks Files Data Analysis & Modeling Reporting and Documentation
More informationFrom Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data
100 001 010 111 From Raw Data to 10011100 Actionable Insights with 00100111 MATLAB Analytics 01011100 11100001 1 Access and Explore Data For scientists the problem is not a lack of available but a deluge.
More informationLecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationPerformance and Scalability Overview
Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and
More informationSurfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,
More informationAssignment # 1 (Cloud Computing Security)
Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual
More informationDeveloping Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationANALYTICS IN BIG DATA ERA
ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
More informationHadoop & SAS Data Loader for Hadoop
Turning Data into Value Hadoop & SAS Data Loader for Hadoop Sebastiaan Schaap Frederik Vandenberghe Agenda What s Hadoop SAS Data management: Traditional In-Database In-Memory The Hadoop analytics lifecycle
More informationHadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
More informationBig Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
More informationMachine Learning with MATLAB David Willingham Application Engineer
Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the
More informationArchitecture & Experience
Architecture & Experience Data Mining - Combination from SAP HANA, R & Hadoop Markus Severin, Solution Principal Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein
More informationProgramming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
More information