Scalable Data Analysis in R. Lee E. Edlefsen, Chief Scientist. useR! 2011
1 Scalable Data Analysis in R
Lee E. Edlefsen, Chief Scientist, useR!
2 Introduction
- Our ability to collect and store data has rapidly been outpacing our ability to analyze it
- We need scalable data analysis software
- R is the ideal platform for such software: universal data language, easy to add new functionality, powerful, flexible, forgiving
- I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR
3 RevoScaleR package
- Part of Revolution R Enterprise
- Provides data management and data analysis functionality
- Scales from small to huge data
- Is internally threaded to use multiple cores
- Distributes computations across multiple computers (in version 5.0 beta on Windows)
- Revolution R Enterprise is free for academic use
4 Overview of my talk
- What do I mean by scalability?
- Why bother?
- Revolution's approach to scalability:
  - R code stays the same as much as possible
  - Chunking: operate on chunks of data
  - Parallel external memory algorithms
- Implementation issues
- Benchmarks
5 Scalability
- From small in-memory data.frames to multi-terabyte data sets distributed across space and even time
- From single cores on a single computer to multiple cores on multiple computers/nodes
- From local hardware to remote clouds
- For both data management and data analysis
6 Keys to scalability
- Most importantly, must be able to process more data than can fit into memory on one computer at one time
- Process data in chunks
- Up to a point, the bigger the chunk the better
  - Allows faster sequential reads from disk
  - Minimizes thread overhead in memory
  - Takes maximum advantage of R's vectorized code
- Requires external memory algorithms
7 Huge data sets are becoming common
- Digital data is not only cheaper to store, it is cheaper to collect
- More data is collected digitally
- Increasingly, data of all types is being put directly into databases
- It is easier now to merge and clean data, and there is a big and growing market for such databases
8 Huge benefits to huge data
- More information, more to be learned
- Variables and relationships can be visualized and analyzed in much greater detail
- Can allow the data to speak for itself; can relax or eliminate assumptions
- Can get better predictions and better understandings of effects
9 Two huge problems: capacity and speed
- Capacity: problems handling the size of data sets or models
  - Data too big to fit into memory
  - Even if it can fit, there are limits on what can be done
  - Even simple data management can be extremely challenging
- Speed: even without a capacity limit, computation may be too slow to be useful
10 We are currently incapable of analyzing a lot of the data we have
- The most commonly used statistical software tools either fail completely or are too slow to be useful on huge data sets
- In many ways we are back where we were in the '70s and '80s in terms of ability to handle common data sizes
- Fortunately, this is changing
11 Requires software solutions
- For decades the rising tide of technology has allowed the same data analysis code to run faster and on bigger data sets
- That happy era has ended
- The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM
- Need software that can use multiple cores, multiple hard drives, and multiple computers
12 High Performance Analytics: HPA
- HPA is HPC + Data
- High Performance Computing is CPU centric
  - Lots of processing on small amounts of data
  - Focus is on cores
- High Performance Analytics is data centric
  - Less processing per amount of data
  - Focus is on feeding data to the cores
    - On disk I/O
    - On efficient threading, data management in RAM
13 Revolution's approach to HPA and scalability in the RevoScaleR package
- User interface
  - Set of R functions (but mostly implemented in C++)
  - Keep things as familiar as possible
- Capacity and speed
  - Handle data in large chunks: blocks of rows and columns
  - Use parallel external memory algorithms
14 Some interface design goals
- Should be able to run the same analysis code on a small data.frame in memory and on a huge distributed data file on a remote cluster
- Should be able to do data transformations using standard R language on standard R objects
- Should be able to use the R formula language for analysis
15 Sample code for logit on laptop
    # Standard formula language
    # Allow transformations in the formula or
    # in a function or a list of expressions
    # Row selections also allowed (not shown)
    rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
            UniqueCarrier + F(CRSDepTime), data = airdata)
16 Sample code for logit on a cluster
    # Just change the compute context
    rxOptions(computeContext = myclust)
    # Otherwise, the code is the same
    rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
            UniqueCarrier + F(CRSDepTime), data = airdata)
17 Sample data transformation code
    # Standard R code on standard R objects
    # Function or a list of expressions
    # In separate data step or in analysis call
    transforms <- list(
      DepTime = ConvertToDecimalTime(DepTime),
      Late = ArrDelay > 15
    )
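As a rough base-R illustration of the idea on this slide (not RevoScaleR's actual mechanism), a list of unevaluated expressions can be applied to one chunk at a time. Everything below is an assumption made for the sketch: `convert_to_decimal_time()` is a hypothetical stand-in for the slide's `ConvertToDecimalTime()`, assumed to turn an hhmm integer such as 1430 into 14.5, and `apply_transforms()` is a hypothetical helper.

```r
# Hypothetical stand-in for ConvertToDecimalTime(): hhmm integer -> decimal hours
convert_to_decimal_time <- function(hhmm) (hhmm %/% 100) + (hhmm %% 100) / 60

# alist() keeps the expressions unevaluated until a chunk is available
transforms <- alist(
  DepTime = convert_to_decimal_time(DepTime),
  Late = ArrDelay > 15
)

# Hypothetical helper: evaluate each expression with the chunk's columns in scope
apply_transforms <- function(chunk, transforms) {
  for (name in names(transforms)) {
    chunk[[name]] <- eval(transforms[[name]], chunk, parent.frame())
  }
  chunk
}

chunk <- data.frame(DepTime = c(930, 1430, 2215), ArrDelay = c(5, 20, 16))
out <- apply_transforms(chunk, transforms)
print(out)
```

Because the expressions only refer to column names, the same list could be reused unchanged for every chunk of a larger file.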
18 Chunking
- Operate on blocks of rows for selected columns
- Huge data can be processed a chunk at a time in a fixed amount of RAM
- Bigger is better up to a point
  - Makes it possible to take maximum advantage of disk bandwidth
  - Minimizes threading overhead
  - Can take maximum advantage of R's vectorized code
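A minimal base-R sketch of chunked processing, assuming a plain CSV file rather than RevoScaleR's own data sources. The helper `process_in_chunks()` and its arguments are hypothetical names introduced here; only one block of rows is held in memory at a time.

```r
# Hypothetical helper: apply `fun` to successive blocks of rows of a CSV file
process_in_chunks <- function(path, chunk_rows, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header once; scan() strips any quotes around column names
  header <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)
  results <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break          # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    results[[length(results) + 1]] <- fun(chunk)
  }
  results
}

# Demo on a small temporary file: sum a column block by block
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10, y = (1:10)^2), tmp, row.names = FALSE)
block_sums <- process_in_chunks(tmp, chunk_rows = 4, function(d) sum(d$y))
total <- Reduce(`+`, block_sums)
print(total)
```

The chunk size is the tuning knob the slide refers to: larger blocks mean fewer, longer sequential reads and more work per vectorized call, up to the point where a block no longer fits comfortably in RAM.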
19 The basis for a solution for capacity, speed, distributed and streaming data: PEMAs
- Parallel external memory algorithms (PEMAs) allow solution of both capacity and speed problems, and can deal with distributed and streaming data
- External memory algorithms are those that allow computations to be split into pieces so that not all data has to be in memory at one time
- It is possible to automatically parallelize and distribute such algorithms
20 External memory algorithms are widely available
- Some statistics packages originating in the '70s and '80s, such as SAS and Gauss, were based almost entirely on external memory algorithms
- Examples of external memory algorithms:
  - data management (transformations, new variables, subsets, sorts, merges, converting to factors)
  - descriptive statistics, cross tabulations
  - linear models, glm, mixed effects
  - prediction/scoring
  - many maximum likelihood and other optimization algorithms
  - many clustering procedures, tree models
21 Parallel algorithms are widely available
- Parallel algorithms are those that allow different portions of a computation to be done at the same time using different cores
- There has been a lot of research on parallel computing over the past years, and many helpful tools are now available
22 Not much literature on PEMAs
- Unfortunately, there is not much literature on Parallel External Memory Algorithms, especially for statistical computations
- In what follows, I outline an approach to PEMAs for doing statistical computations
23 Parallelizing external memory algorithms
- Parallel algorithms split a job into tasks that can be run simultaneously
- External memory algorithms split a job into tasks that operate on separate blocks of data
- External memory algorithms usually are run sequentially, but with proper care most can be parallelized efficiently and automatically
- The key is to split the tasks appropriately and to minimize inter-thread communication and synchronization
24 Parallel Computations with Synchronization and Communication
25 Embarrassingly Parallel Computations
26 Sequential External Memory Computations
27 Parallel External Memory Computations
28 Example external memory algorithm for the mean of a variable
- Initialization task: total = 0, count = 0
- Process data task: for each block of x, total = sum(x), count = length(x)
- Update results task: combined total = sum(all totals), combined count = sum(all counts)
- Process results task: mean = combined total / combined count
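The four tasks on this slide can be written out directly in base R. This is a sketch, not RevoScaleR code: the function names are hypothetical, and the blocks are produced by splitting an in-memory vector purely for demonstration.

```r
# The four tasks from the slide, as plain R functions (names are illustrative)
initialize      <- function() list(total = 0, count = 0)
process_data    <- function(x) list(total = sum(x), count = length(x))
update_results  <- function(a, b) list(total = a$total + b$total,
                                       count = a$count + b$count)
process_results <- function(r) r$total / r$count

# Pretend x arrives in blocks of three values
x <- c(2, 4, 6, 8, 10, 12, 14)
blocks <- split(x, ceiling(seq_along(x) / 3))

# Each block is summarized independently, then the summaries are combined
partials <- lapply(blocks, process_data)
combined <- Reduce(update_results, partials, initialize())
chunked_mean <- process_results(combined)
print(chunked_mean)
```

Since each call to `process_data()` depends only on its own block, the `lapply()` step is the part that could run on different cores or nodes.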
29 PEMAs in RevoScaleR
- Analytics algorithms are implemented in a framework that automatically and efficiently parallelizes code
- Key is arranging code in virtual functions:
  - Initialize: allows initializations
  - ProcessData: process a chunk of data at a time
  - UpdateResults: updates one set of results from another
  - ProcessResults: any final processing
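To make the four-function arrangement concrete, here is a hedged base-R sketch in which a generic driver knows nothing about the analysis and an algorithm is just a list of four functions. The slide describes C++ virtual functions; the R list, the driver `run_pema()`, and the least-squares example are all assumptions made for illustration.

```r
# Hypothetical driver: feeds blocks to an algorithm and merges the results
run_pema <- function(blocks, algo) {
  state <- algo$initialize()
  for (block in blocks) {
    state <- algo$update_results(state, algo$process_data(block))
  }
  algo$process_results(state)
}

# Least squares for y ~ x, expressed as the four steps.
# Only the small matrices X'X and X'y ever need to be kept or exchanged.
ols <- list(
  initialize = function() list(xtx = matrix(0, 2, 2), xty = matrix(0, 2, 1)),
  process_data = function(d) {
    X <- cbind(1, d$x)
    list(xtx = crossprod(X), xty = crossprod(X, d$y))
  },
  update_results = function(a, b) list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty),
  process_results = function(s) drop(solve(s$xtx, s$xty))
)

set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)
blocks <- split(d, rep(1:4, each = 25))   # four blocks of 25 rows

beta <- run_pema(blocks, ols)
print(beta)
```

The driver could be replaced by a threaded or distributed one without touching the `ols` list, which is the separation of concerns the slide is pointing at.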
30 PEMAs in RevoScaleR (2)
- Analytic code is completely independent of:
  - Data path code that feeds data to ProcessData
  - Threading code for using multiple cores
  - Inter-computer communication code for distributing across nodes
- Communication among machines is highly abstracted (currently can use MPI, RPC)
- Implemented in C++ now, but we plan to add an implementation in R
31 Storing and reading data
- ScaleR can process data from a variety of sources
- Has its own optimized format (XDF) that is especially suitable for chunking
- Allows rapid access to blocks of rows for selected columns
- The time it takes to read a block of rows is essentially independent of the total number of rows and columns in the file
32 The XDF file format
- Data is stored in blocks of rows per column
- Header information is stored at the end of the file
- Allows sequential reads; tens to hundreds of thousands of times faster than random reads
- Essentially unlimited in size
  - 64 bit row indexing per column
  - 64 bit column indexing (but the practical limit is much smaller)
33 The XDF file format (2)
- Allows wide range of storage formats (1 byte to 8 byte signed and unsigned integers; 4 and 8 byte floats; variable length strings)
- Both new rows and new columns can be added to a file without having to rewrite the file
- Changes to header information (variable names, descriptions, factor levels, and so on) are extremely fast
34 Overview of data path on a computer
- DataSource object reads a chunk of data into a DataSet object on I/O thread
- DataSet is given to transformation code (data copied to R for R transformations); variables and rows may be created, removed
- Transformed DataSet is virtually split across computational cores and passed to ProcessData methods on different threads
- Any disk output is stored until I/O thread can write it to output DataSource
35 Handling data in memory
- Use of appropriate storage format reduces space and reduces time to move data in memory
- Data copying and conversion is minimized
- For instance, when adding a vector of unsigned shorts to a vector of doubles, the smaller type is not converted until loaded into the CPU
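The effect of storage format on space can be seen even in base R, with the caveat that R has no unsigned short type. As an assumption for this sketch, R's 1-byte `raw`, 4-byte `integer`, and 8-byte `double` vectors stand in for the narrower and wider formats the slide mentions; this says nothing about how RevoScaleR itself lays out memory.

```r
n <- 1e6

# Bytes used by one million values in each of R's basic storage modes
sizes <- c(
  raw     = as.numeric(object.size(raw(n))),
  integer = as.numeric(object.size(integer(n))),
  double  = as.numeric(object.size(double(n)))
)
print(sizes)

# Approximate bytes per element (the small fixed vector header is negligible)
print(round(sizes / n, 2))
```

The same ratios apply to the time needed to move a block through memory, which is why a narrow on-disk and in-memory type matters for large chunks.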
36 Use of multiple cores per computer
- Code is internally threaded so that interprocess communication and data transfer is not required
- One core (typically) handles I/O, while the other cores process data from the previous read
- Data is virtually split across computational cores; each core thinks it has its own private copy
37 RevoScaleR Multi-Threaded Processing
[Diagram: data blocks flow from disk into shared memory and are handled by Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), ..., Core n (Thread n) of a multicore processor (4, 8, 16+ cores)]
- A RevoScaleR algorithm is provided a data source as input
- The algorithm loops over data, reading a block at a time. Blocks of data are read by a separate worker thread (Thread 0).
- Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory
- When all of the data is processed, a master results object is created from the intermediate results objects
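The "virtual split" and the merge into a master results object can be sketched in base R as follows. This is an illustration under stated assumptions: workers are simulated sequentially with `lapply()` rather than real threads, `worker()` is a hypothetical function, and `parallel::splitIndices()` is used only to divide the row indices evenly.

```r
library(parallel)  # used only for splitIndices()

# One block of data, as if it had just been read from disk
chunk <- data.frame(x = 1:12, y = c(5, 3, 8, 1, 9, 2, 7, 4, 6, 10, 12, 11))

# Virtual split: each worker gets a set of row indices, not a copy of the data
n_workers <- 3
row_sets <- splitIndices(nrow(chunk), n_workers)

# Hypothetical worker: computes an intermediate result for its rows only
worker <- function(rows) list(sum_y = sum(chunk$y[rows]), n = length(rows))
intermediate <- lapply(row_sets, worker)

# Master results object built from the intermediate results
master <- Reduce(function(a, b) list(sum_y = a$sum_y + b$sum_y, n = a$n + b$n),
                 intermediate)
print(master$sum_y / master$n)
```

Because the workers never write to shared state, no locking is needed until the final merge, which is one way to keep inter-thread synchronization to a minimum.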
38 Use of multiple computers
- Key to efficiency is minimizing data transfer and communication
- Locality of data!
- For PEMAs, the master node controls computations, telling workers where to get data and what computations to do
- Intermediate results on each node are aggregated across cores
- Master node gathers all results, checks for convergence, and repeats if necessary
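The gather, check, and repeat loop can be illustrated with a logistic regression fitted by iteratively reweighted least squares. This is a base-R sketch under assumptions, not the RevoScaleR implementation: the "nodes" are simulated by list elements, `node_task()` is a hypothetical worker function, and each node returns only small summary matrices so the raw data stays where it is.

```r
# Simulated data, partitioned across three "nodes"
set.seed(42)
n <- 600
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
partitions <- split(data.frame(x = x, y = y), rep(1:3, each = n / 3))

# Worker task: one weighted least squares pass over the node's local data
node_task <- function(d, beta) {
  X   <- cbind(1, d$x)
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)              # IRLS weights
  z   <- eta + (d$y - mu) / w       # working response
  list(xtwx = crossprod(X, w * X), xtwz = crossprod(X, w * z))
}

# Master loop: gather node results, solve, check convergence, repeat
beta <- c(0, 0)
for (iter in 1:25) {
  parts <- lapply(partitions, node_task, beta = beta)
  xtwx  <- Reduce(`+`, lapply(parts, `[[`, "xtwx"))
  xtwz  <- Reduce(`+`, lapply(parts, `[[`, "xtwz"))
  new_beta  <- drop(solve(xtwx, xtwz))
  converged <- max(abs(new_beta - beta)) < 1e-8
  beta <- new_beta
  if (converged) break
}
print(beta)
```

Per iteration, each node sends back a 2 x 2 matrix and a 2 x 1 vector regardless of how many rows it holds, which is the kind of small, fixed-size communication the slide's emphasis on data locality suggests.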
39 RevoScaleR Distributed Computing
[Diagram: four data partitions, each attached to a compute node running RevoScaleR, all connected to a master node running RevoScaleR]
- Portions of the data source are made available to each compute node
- RevoScaleR on the master node assigns a task to each compute node
- Each compute node independently processes its data and returns its intermediate results to the master node
- The master node aggregates all of the intermediate results from each compute node and produces the final result
40 Data management capabilities
- Import from external sources
- Transform and clean variables
- Code and recode factors
- Missing values
- Validation
- Sort (huge data, but not distributed)
- Merge
- Aggregate, summarize
41 Analysis algorithms
- Descriptive statistics (rxSummary)
- Tables and cubes (rxCube, rxCrossTabs)
- Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)
- Linear regressions (rxLinMod)
- Logistic regressions (rxLogit)
- K means clustering (rxKmeans)
- Predictions (scoring) (rxPredict)
- Other algorithms are being developed
42 Benchmarks using the airline data
- Airline on-time performance data produced by U.S. Department of Transportation; used in the ASA Data Expo
- CSV files for the years million observations, 29 variables, about 13 GB
- For benchmarks, sliced and replicated to get files with 1 million to about 1.25 billion rows
- 5 node commodity cluster (4 cores at 3.2 GHz and 16 GB RAM per node); Windows HPC Server 2008
43 Distributed Import of Airline Data
- Copy the 22 CSV files to the 5 nodes, keeping rough balance in sizes
- Two pass import process:
  - Pass 1: Import/clean/transform/append
  - Pass 2: Recode factors whose levels differ
- Import time: about 3 min 20 secs on 5 node cluster (about 17 minutes on single node)
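Why a second pass is needed can be shown with a small base-R sketch. The data and the column name `carrier` are made up for illustration, and this is not the RevoScaleR import code: files imported independently end up with different factor level sets, so the levels are reconciled afterwards.

```r
# Pass 1: each file is imported on its own, so factor levels can differ
chunks <- list(
  data.frame(carrier = factor(c("AA", "UA", "AA"))),
  data.frame(carrier = factor(c("DL", "UA"))),
  data.frame(carrier = factor(c("WN", "AA", "DL")))
)

# Collect the union of levels seen across all chunks
all_levels <- sort(unique(unlist(lapply(chunks, function(d) levels(d$carrier)))))

# Pass 2: recode every chunk to the common level set
recoded <- lapply(chunks, function(d) {
  d$carrier <- factor(as.character(d$carrier), levels = all_levels)
  d
})

# With identical levels, the integer codes now mean the same thing everywhere
combined <- do.call(rbind, recoded)
print(table(combined$carrier))
```

Only the level sets, not the data, have to be exchanged between the two passes, so the reconciliation step is cheap even when the files live on different nodes.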
44 Scalability of RevoScaleR with Rows
Regression, 1 million billion rows, 443 betas (4 core laptop)
45 Scalability of RevoScaleR with Nodes
Regression, 1 billion rows, 443 betas (1 to 5 nodes, 4 cores per node)
46 Scalability of RevoScaleR with Nodes
Logit, million rows, 443 betas, 5 iterations (1 to 5 nodes, 4 cores per node)
47 Scalability of RevoScaleR with Nodes
Logit, 1 billion rows, 443 betas, 5 iterations (1 to 5 nodes, 4 cores per node)
48 Comparative benchmarks
SAS HPA benchmarks:
- Logit, billion rows, 32 nodes, 384 cores, in-memory, just a few parameters: 80 secs
- Regression, 50 million rows, 24 nodes, in-memory, 1,800 parameters: 42 secs
ScaleR: 1 billion rows, 5 nodes, 20 cores ($5K)
- rxLogit, 7 parameters: 43.4 secs
- rxLinMod, 7 parameters: 4.1 secs
- rxLinMod, 1,848 params: 11.9 secs
- rxLogit, 1,848 params: secs
- rxLinMod, 13,537 params: 89.8 secs
49 Contact information
Lee Edlefsen
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
High Performance Predictive Analytics in R and Hadoop:
High Performance Predictive Analytics in R and Hadoop: Achieving Big Data Big Analytics Presented by: Mario E. Inchiosa, Ph.D. US Chief Scientist August 27, 2013 1 Polling Questions 1 & 2 2 Agenda Revolution
How To Test The Performance Of An Ass 9.4 And Sas 7.4 On A Test On A Powerpoint Powerpoint 9.2 (Powerpoint) On A Microsoft Powerpoint 8.4 (Powerprobe) (
White Paper Revolution R Enterprise: Faster Than SAS Benchmarking Results by Thomas W. Dinsmore and Derek McCrae Norton In analytics, speed matters. How much? We asked the director of analytics from a
Technical Paper. Performance of SAS In-Memory Statistics for Hadoop. A Benchmark Study. Allison Jennifer Ames Xiangxiang Meng Wayne Thompson
Technical Paper Performance of SAS In-Memory Statistics for Hadoop A Benchmark Study Allison Jennifer Ames Xiangxiang Meng Wayne Thompson Release Information Content Version: 1.0 May 20, 2014 Trademarks
Understanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
RevoScaleR 7.3 HPC Server Getting Started Guide
RevoScaleR 7.3 HPC Server Getting Started Guide The correct bibliographic citation for this manual is as follows: Revolution Analytics, Inc. 2014. RevoScaleR 7.3 HPC Server Getting Started Guide. Revolution
Big Data Analysis with Revolution R Enterprise
Big Data Analysis with Revolution R Enterprise August 2010 Joseph B. Rickert Copyright 2010 Revolution Analytics, Inc. All Rights Reserved. 1 Background The R language is well established as the language
Big Data Analysis with Revolution R Enterprise
REVOLUTION WHITE PAPER Big Data Analysis with Revolution R Enterprise By Joseph Rickert January 2011 Background The R language is well established as the language for doing statistics, data analysis, data-mining
2015 The MathWorks, Inc. 1
25 The MathWorks, Inc. 빅 데이터 및 다양한 데이터 처리 위한 MATLAB의 인터페이스 환경 및 새로운 기능 엄준상 대리 Application Engineer MathWorks 25 The MathWorks, Inc. 2 Challenges of Data Any collection of data sets so large and complex
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise
Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise Revolution Webinar April 17, 2014 Mario Inchiosa, US Chief Scientist [email protected] All
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: ([email protected]) TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
Big Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC
Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC ABSTRACT As data sets continue to grow, it is important for programs to be written very efficiently to make sure no time
HPC performance applications on Virtual Clusters
Panagiotis Kritikakos EPCC, School of Physics & Astronomy, University of Edinburgh, Scotland - UK [email protected] 4 th IC-SCCE, Athens 7 th July 2010 This work investigates the performance of (Java)
Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution Analytics @bill_jacobs
R and Hadoop: Architectural Options Bill Jacobs VP Product Marketing & Field CTO, Revolution Analytics @bill_jacobs Polling Question #1: Who Are You? (choose one) Statistician or modeler who uses R Other
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
Benchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
Whitepaper. Innovations in Business Intelligence Database Technology. www.sisense.com
Whitepaper Innovations in Business Intelligence Database Technology The State of Database Technology in 2015 Database technology has seen rapid developments in the past two decades. Online Analytical Processing
Cloud Computing through Virtualization and HPC technologies
Cloud Computing through Virtualization and HPC technologies William Lu, Ph.D. 1 Agenda Cloud Computing & HPC A Case of HPC Implementation Application Performance in VM Summary 2 Cloud Computing & HPC HPC
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Laurence Liew General Manager, APAC. Economics Is Driving Big Data Analytics to the Cloud
Laurence Liew General Manager, APAC Economics Is Driving Big Data Analytics to the Cloud Big Data 101 The Analytics Stack Economics of Big Data Convergence of the 3 forces Big Data Analytics in the Cloud
Parallel Data Preparation with the DS2 Programming Language
ABSTRACT Paper SAS329-2014 Parallel Data Preparation with the DS2 Programming Language Jason Secosky and Robert Ray, SAS Institute Inc., Cary, NC and Greg Otto, Teradata Corporation, Dayton, OH A time-consuming
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Can High-Performance Interconnects Benefit Memcached and Hadoop?
Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,
Healthcare Big Data Exploration in Real-Time
Healthcare Big Data Exploration in Real-Time Muaz A Mian A Project Submitted in partial fulfillment of the requirements for degree of Masters of Science in Computer Science and Systems University of Washington
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
Hadoop Cluster Applications
Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday
Scalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
What s New in MATLAB and Simulink
What s New in MATLAB and Simulink Kevin Cohan Product Marketing, MATLAB Michael Carone Product Marketing, Simulink 2015 The MathWorks, Inc. 1 What was new for Simulink in R2012b? 2 What Was New for MATLAB
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card
Big Data in Stata. Paulo Guimaraes 1,2. Portuguese Stata UGM - Sept 18, 2015. Big Data in Stata. Paulo Guimaraes. Motivation
Big in Analysis Big in 1,2 1 Banco de Portugal 2 Universidade do Porto Portuguese UGM - Sept 18, 2015 Big in What do I mean by big data? Big in Analysis big data has several meanings the Vs of big data
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata
Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata Today s Speakers James Taylor Bill Franks CEO Chief Analytics Officer Decision Management Solutions Teradata 7/28/14 3 Polling
Apache Hama Design Document v0.6
Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault
Comparison of Windows IaaS Environments
Comparison of Windows IaaS Environments Comparison of Amazon Web Services, Expedient, Microsoft, and Rackspace Public Clouds January 5, 215 TABLE OF CONTENTS Executive Summary 2 vcpu Performance Summary
Big-data Analytics: Challenges and Opportunities
Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,
Delivering Value from Big Data with Revolution R Enterprise and Hadoop
Executive White Paper Delivering Value from Big Data with Revolution R Enterprise and Hadoop Bill Jacobs, Director of Product Marketing Thomas W. Dinsmore, Director of Product Management October 2013 Abstract
Classroom Demonstrations of Big Data
Classroom Demonstrations of Big Data Eric A. Suess Abstract We present examples of accessing and analyzing large data sets for use in a classroom at the first year graduate level or senior undergraduate
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
MapReduce and Hadoop Distributed File System V I J A Y R A O
MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
Parallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
How To Store Data On An Oracle NoSQL Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)
WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Deliverable 2.1.4. 150 Billion Triple dataset hosted on the LOD2 Knowledge Store Cluster. LOD2 Creating Knowledge out of Interlinked Data
Collaborative Project LOD2 Creating Knowledge out of Interlinked Data Project Number: 257943 Start Date of Project: 01/09/2010 Duration: 48 months Deliverable 2.1.4 150 Billion Triple dataset hosted on
Preview of Oracle Database 12c In-Memory Option. Copyright 2013, Oracle and/or its affiliates. All rights reserved.
Preview of Oracle Database 12c In-Memory Option 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any
QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE
QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE QlikView Technical Brief April 2011 www.qlikview.com Introduction This technical brief covers an overview of the QlikView product components and architecture
Big data in R EPIC 2015
Big data in R EPIC 2015 Big Data: the new 'The Future' In which Forbes magazine finds common ground with Nancy Krieger (for the first time ever?), by arguing the need for theory-driven analysis This future
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
Find the Hidden Signal in Market Data Noise
Find the Hidden Signal in Market Data Noise Revolution Analytics Webinar, 13 March 2013 Andrie de Vries Business Services Director (Europe) @RevoAndrie [email protected] Agenda Find the Hidden
Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics
Paper 1828-2014 Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics John Cunningham, Teradata Corporation, Danville, CA ABSTRACT SAS High Performance Analytics (HPA) is a
SQL Server 2012 Performance White Paper
Published: April 2012 Applies to: SQL Server 2012 Copyright The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication.
External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13
External Sorting Chapter 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Why Sort? A classic problem in computer science! Data requested in sorted order e.g., find students in increasing
Google File System. Web and scalability
Google File System Web and scalability The web: - How big is the Web right now? No one knows. - Number of pages that are crawled: o 100,000 pages in 1994 o 8 million pages in 2005 - Crawlable pages might
Load balancing in a heterogeneous computer system by self-organizing Kohonen network
Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.
SAS Data Set Encryption Options
Technical Paper SAS Data Set Encryption Options SAS product interaction with encrypted data storage Table of Contents Introduction: What Is Encryption?... 1 Test Configuration... 1 Data... 1 Code... 2
An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing
An Alternative Storage Solution for MapReduce Eric Lomascolo Director, Solutions Marketing MapReduce Breaks the Problem Down Data Analysis Distributes processing work (Map) across compute nodes and accumulates
Big Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Platforms Jordi Torres [email protected] Talk outline! We talk about Petabyte?
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Understanding the Value of In-Memory in the IT Landscape
February 2012 Understanding the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to
Distributed R for Big Data
Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for
IBM SPSS Statistics Performance Best Practices
IBM SPSS Statistics IBM SPSS Statistics Performance Best Practices Contents Overview... 3 Target User... 3 Introduction... 3 Methods of Problem Diagnosis... 3 Performance Logging for Statistics Server...
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada [email protected]
White Paper The Numascale Solution: Extreme BIG DATA Computing
White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad ABOUT THE AUTHOR Einar Rustad is CTO of Numascale and has a background as CPU, Computer Systems and HPC Systems Designer
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1 Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume
Hadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
numascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT
numascale Hardware Accellerated Data Intensive Computing White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad www.numascale.com Supermicro delivers 108 node system with Numascale
Oracle Database In-Memory The Next Big Thing
Oracle Database In-Memory The Next Big Thing Maria Colgan Master Product Manager #DBIM12c Why is Oracle doing this Oracle Database In-Memory Goals Real Time Analytics Accelerate Mixed Workload OLTP No Changes
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing University of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed
Memory-Centric Database Acceleration
Memory-Centric Database Acceleration Achieving an Order of Magnitude Increase in Database Performance A FedCentric Technologies White Paper September 2007 Executive Summary Businesses are facing daunting
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
