RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER

By Lee Edlefsen, Ph.D., Chief Scientist, Revolution Analytics

Abstract

RevoScaleR, the Big Data predictive analytics library included with Revolution R Enterprise, is designed from the ground up to be fast and scalable. Consideration has been given to all of the components involved in performing large-scale statistical analysis: data storage, the use of a computing infrastructure's resources (RAM, CPUs, cores, and computers), and the algorithms themselves. Its extreme speed and scalability are the result of careful, innovative engineering at every stage. This white paper describes the design and implementation considerations that are the foundation of the high-performance Big Data capabilities of Revolution R Enterprise.

Executive Summary

Analytics-driven breakthroughs in every field, from healthcare to financial services, have put demand for advanced analytics front and center for large and small organizations. As in any IT deployment, IT leaders supporting analytics environments have been challenged by tradeoffs among cost, performance and functionality. These tradeoffs are becoming more problematic due to exploding data volumes and the increasing number of people who recognize the potential impact of advanced analytics and are requesting analytics solutions that exceed the capabilities of existing tools. How can IT create an analytics infrastructure that will grow with the organization's needs?

For the past several decades, the rising tide of technology, especially the increasing speed of single processors, has allowed the same data analysis code from legacy analytics software vendors to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM, and legacy code can't keep up.

To allow analytics to realize its potential for organizational improvements and to handle very large and growing data sets, IT leaders need scalable data analysis software that is able to run on newer hardware paradigms, specifically using multiple cores, multiple hard drives, and multiple computers. The data analysis software needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.

Revolution Analytics offers enterprise-grade, terabyte-class software based on the open source project R. This white paper discusses the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR, specifically exploring:

- Storing data
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers

Please share this white paper with the people on your team who are responsible for collecting, storing, managing, analyzing and extracting value from data.

1. Storing Data

One of the keys to being scalable is the ability to process more data than can fit into memory at one time. This essentially equates to being able to work with chunks of data instead of requiring the entire data set to be resident in memory at once. In the context of RevoScaleR, chunks are defined as sequential blocks of rows for a given selection of columns.

Although RevoScaleR can process data from a wide variety of sources, it has its own highly optimized file format (the "XDF" format) that is especially suitable for chunking. Data in an XDF file can be accessed rapidly by row or by column. In addition, blocks of contiguous rows for selected columns can be read sequentially, rather than randomly. Sequential reads can be tens to hundreds of thousands of times faster than random reads.

Furthermore, in an XDF file the time it takes to read a block of rows for a variable is essentially independent of the total number of variables and rows in the file. This means that even in terabyte-sized files, only the data for the variables actually required for an analysis needs to be read and processed, and this may be only a few hundred megabytes. The time it takes to do that is essentially the same as if only that data were stored in the file; storing the additional unused data does not add to the processing time.

Data in an XDF file is stored in the same binary format that is used in memory, so no conversion is required when it is brought into memory. To minimize wasted space, each variable can also be stored using an appropriately sized data type. For instance, a variable that takes no more than 256 distinct values can be stored in one byte per value, rather than in 8 bytes as is the case with some data analysis programs. Floating point values with a precision of less than 6 or 7 decimal digits, which is commonly the case, can be stored in 4 bytes per number, not 8.

New variables and new rows can be added to the file without having to rewrite the entire file. Thus, the cost of creating new variables and of adding more observations is greatly reduced.
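The storage ideas above (appropriately sized data types, and sequential reads of a block of rows for one variable) can be illustrated with a minimal Python sketch. It does not use the XDF format, whose layout is not described here. It assumes a hypothetical layout of one fixed-width binary file per variable, and the helper names `column_bytes`, `write_column` and `read_block` are invented for illustration.

```python
import os
import tempfile
from array import array


def column_bytes(values, typecode):
    """Return the storage size in bytes of `values` held as a typed column."""
    col = array(typecode, values)
    return len(col) * col.itemsize


def write_column(path, values, typecode):
    """Store one variable as a contiguous run of fixed-width binary values.

    Assumption: one file per variable. This is NOT the XDF layout.
    """
    with open(path, "wb") as f:
        array(typecode, values).tofile(f)


def read_block(path, typecode, start_row, n_rows):
    """Read rows [start_row, start_row + n_rows) of one variable sequentially."""
    block = array(typecode)
    with open(path, "rb") as f:
        f.seek(start_row * block.itemsize)  # jump straight to the block
        block.fromfile(f, n_rows)           # same binary form on disk and in memory
    return block


# A categorical variable with at most 256 distinct codes fits in one byte each.
codes = [i % 200 for i in range(100_000)]
print(column_bytes(codes, "B"), "bytes as 1-byte codes")
print(column_bytes(codes, "d"), "bytes as 8-byte doubles")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "codes.bin")
    write_column(path, codes, "B")
    # Only the requested rows of this one variable are touched.
    print(list(read_block(path, "B", start_row=50_000, n_rows=5)))
```

Because each value has a fixed width, the position of any block of rows can be computed directly, so the cost of reading a block does not depend on how many other rows or variables are stored.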

2. Reading Data

When data is read in "chunks," the optimal chunk size depends upon a variety of factors, such as the speed of the disk, the speed of RAM, the number and speed of cores, and the types of computations being done. RevoScaleR allows the size of chunks to vary depending upon conditions.

A bottleneck for data processing is data I/O: reading the data from disk. RevoScaleR dedicates one core to reading data from disk, to avoid disk contention and optimize bandwidth for data I/O. Meanwhile, the remaining cores are assigned to processing the chunk of data read into memory by the previous read.

Of course, when it is possible to fit all data into memory, RevoScaleR allows that. It then assigns all cores to process that data.

3. Handling Data in Memory

As on disk, use of the appropriately sized data type in memory reduces the space required and also reduces the time it takes to move the data around in memory.

In RevoScaleR, the amount of data conversion and copying is minimized, to save time and space. In almost all other data-oriented programs, before an array of integers and an array of double precision floating point numbers are added together, the array of integers is first converted and copied into an array of doubles. This takes time and space. In RevoScaleR, that is almost never necessary, regardless of the type of operation and the sizes of the data types. No conversion or copying is done until the values are actually loaded into the CPU.

4. Use of Multiple Cores on a Single Computer

Nearly all computations that involve data in RevoScaleR are automatically "threaded"; that is, they use multiple cores on a machine when they are available.
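The reading scheme described under "Reading Data," in which one thread is dedicated to I/O while the remaining threads process the chunk from the previous read, can be sketched in Python as follows. This is an illustrative sketch only, not RevoScaleR's implementation: the "disk read" is simulated by slicing a list, and the names `read_chunks`, `partial_sum` and `chunked_sum` are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def read_chunks(data, chunk_size, out_queue):
    """I/O thread: fetch sequential chunks and hand them to the compute side."""
    for start in range(0, len(data), chunk_size):
        out_queue.put(data[start:start + chunk_size])  # stands in for a disk read
    out_queue.put(None)  # sentinel: no more data


def partial_sum(part):
    """Worker task: sum and count for one share of a chunk."""
    return sum(part), len(part)


def chunked_sum(data, chunk_size=1000, n_workers=3):
    """Sum `data` chunk by chunk while a separate thread reads ahead."""
    # maxsize=1 lets the reader fetch the next chunk while the
    # current one is being processed, without reading further ahead.
    chunks = queue.Queue(maxsize=1)
    reader = threading.Thread(target=read_chunks, args=(data, chunk_size, chunks))
    reader.start()

    total, count = 0, 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while True:
            chunk = chunks.get()
            if chunk is None:
                break
            # Split the chunk into roughly equal shares, one per worker.
            step = -(-len(chunk) // n_workers)  # ceiling division
            parts = [chunk[i:i + step] for i in range(0, len(chunk), step)]
            for s, n in pool.map(partial_sum, parts):
                total += s
                count += n
    reader.join()
    return total, count


print(chunked_sum(list(range(10_000))))
```

In standard CPython builds the global interpreter lock prevents pure-Python threads from computing in parallel, so this sketch shows the structure of the pipeline rather than its speed benefit. The bounded queue is the design choice that matters: it keeps the reader exactly one chunk ahead of the workers.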
This threading is made efficient by minimizing the overhead of transferring the computations to multiple threads, by minimizing the amount of data that must be copied, by doing as much work as possible on each thread to amortize the cost of initializing the computations, and by minimizing inter-thread communication and synchronization.

Feeding large chunks of data to each of the multiple cores is important for efficiency. For analytic routines such as descriptive statistics, crosstabs, linear regression, logistic regression, and K-means clustering (in which several variables are typically used), a large chunk of observations, perhaps millions, for all of the variables is read into memory by one core. Simultaneously, the data chunk from the previous read is "virtually" split among the remaining cores for the required processing. The code doing the processing on each core (thread) only needs to know what its assigned task is, and no inter-thread communication or synchronization is needed.

As a simple example, consider computing the mean of several variables. Millions of observations of each of those variables might be read by the I/O thread, and then each of the other threads is given a proportionate share of the observations. Each computational thread just needs to compute and store the sum of each of the variables for its share of the observations, and to record how many total observations it used. To get the means for the entire data set, the partial sums and partial observation counts are aggregated, and the grand sums are divided by the total number of observations.

Figure 1. RevoScaleR on Single Computer

- A RevoScaleR algorithm is provided a data source as input.
- The algorithm loops over the data, reading a block at a time. Blocks of data are read by a separate worker thread (Thread 0).
- Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory.
- When all of the data is processed, a master results object is created from the intermediate results objects.

5. Use of Multiple Computers

A key to efficiently using multiple computers is to minimize the amount of information, including data, that must be communicated among the computers. In RevoScaleR, one of the computers (the master node) controls the computations on all of the other computers. It first sends a message to each compute node telling it where to find the data to use and what types of computations to do. On each computer, multiple cores are used as described above, to maximize the efficiency of the node. The intermediate results from all cores are aggregated on that node, and only that information is sent back to the master node. The master node monitors the status of the compute nodes, aggregates the overall results sent back by those nodes, and then processes those results to get overall estimates. The final processing often involves compute-intensive operations such as solving large sets of equations.

RevoScaleR allows several options for getting data to the cores on each node, including reading data from a common data server, but it is generally most efficient to have the portion of data needed by each node stored locally.
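The two levels of aggregation described above (partial sums and counts per thread combined on each node, then per-node results combined on the master) can be sketched for the mean example as follows. This is a minimal single-process illustration: the "nodes" are simulated by separate lists of rows, and the function names `partial_stats`, `combine`, `node_stats` and `cluster_means` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(rows):
    """Per-thread work: column sums and row count for one share of the rows."""
    if not rows:
        return [], 0
    sums = [0.0] * len(rows[0])
    for row in rows:
        for j, value in enumerate(row):
            sums[j] += value
    return sums, len(rows)


def combine(partials):
    """Aggregate (sums, count) pairs into a single (sums, count) pair."""
    total_sums, total_n = None, 0
    for sums, n in partials:
        if n == 0:
            continue
        if total_sums is None:
            total_sums = list(sums)
        else:
            total_sums = [a + b for a, b in zip(total_sums, sums)]
        total_n += n
    return (total_sums or []), total_n


def node_stats(rows, n_threads=4):
    """One compute node: split its rows among threads, return one partial."""
    step = max(1, -(-len(rows) // n_threads))  # ceiling division
    shares = [rows[i:i + step] for i in range(0, len(rows), step)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return combine(pool.map(partial_stats, shares))


def cluster_means(node_data):
    """Master node: combine per-node partials, then divide sums by the count."""
    sums, n = combine(node_stats(rows) for rows in node_data)
    return [s / n for s in sums]


# Three simulated nodes, each holding its own rows of two variables.
node_data = [
    [(1.0, 10.0), (2.0, 20.0)],
    [(3.0, 30.0)],
    [(4.0, 40.0), (5.0, 50.0), (6.0, 60.0)],
]
print(cluster_means(node_data))
```

Each node returns only one list of sums and one count, regardless of how many rows it holds, which is why the amount of information sent back to the master stays small.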
For iterative algorithms that require many passes through the data, such as logistic regression and K-means clustering, the master node controls the iterations. This is done by repeating the steps described above: each iteration is initialized by a message from the master node, which aggregates the results that come back, computes the next set of estimated parameters, and decides whether the algorithm has converged. If not, another iteration is started.

Figure 2. RevoScaleR on Multiple Computers

- Portions of the data source are made available to each compute node.
- RevoScaleR on the master node assigns a task to each compute node, and the sleeping instance of RevoScaleR on the compute node wakes up.
- RevoScaleR on each compute node independently processes its data and returns its intermediate results to RevoScaleR on the master node.
- RevoScaleR on the master node aggregates all of the intermediate results from each compute node and produces the final result.

6. Efficient Parallelization of Statistical and Data Mining Algorithms

RevoScaleR is built upon a platform designed to automatically and efficiently parallelize "external memory" algorithms. This is the class of algorithms that do not require all data to be in memory at one time, and such algorithms are available for a wide range of statistical and data mining routines.

The way in which these algorithms are automatically parallelized is such that, in general, the fastest algorithms per core are also the fastest when parallelized. (This happy situation is not the case for some other types of parallel algorithms.) Since the burden of worrying about parallelization is removed from the engineers implementing these algorithms, they can focus on getting optimal speed on each core.

This involves several things. Most obviously, it involves using fast algorithms and carefully coding them using C++ templates, which can produce very fast code. Other issues are important as well. Categorical data is very common in statistical computations, and it is handled in ways that save memory, increase speed, and increase computational precision as well.
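The iterative scheme described under "Use of Multiple Computers," combined with the chunk-at-a-time style of an "external memory" algorithm, can be sketched with a one-dimensional K-means. This is a simplified illustration under stated assumptions, not RevoScaleR's algorithm: each chunk stands in for the data held by one compute node, the chunks are processed in turn rather than in parallel, and the names `assign_and_accumulate` and `kmeans_1d` are hypothetical.

```python
def assign_and_accumulate(chunk, centers):
    """Compute-node work for one pass: per-cluster sums and counts for a chunk."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in chunk:
        # Index of the nearest center (ties go to the lower index).
        k = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        sums[k] += x
        counts[k] += 1
    return sums, counts


def kmeans_1d(chunks, centers, tol=1e-9, max_iter=100):
    """Master loop: send out centers, merge partial results, test convergence."""
    centers = list(centers)
    n_clusters = len(centers)
    for iteration in range(1, max_iter + 1):
        # In a cluster, each chunk would be processed on its own node;
        # only the small (sums, counts) pairs would travel back.
        partials = [assign_and_accumulate(chunk, centers) for chunk in chunks]
        sums = [sum(p[0][k] for p in partials) for k in range(n_clusters)]
        counts = [sum(p[1][k] for p in partials) for k in range(n_clusters)]
        # Next parameter estimates; an empty cluster keeps its old center.
        new_centers = [
            sums[k] / counts[k] if counts[k] else centers[k]
            for k in range(n_clusters)
        ]
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift <= tol:  # converged: no further pass is started
            return centers, iteration
    return centers, max_iter


chunks = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
print(kmeans_1d(chunks, centers=[0.0, 5.0]))
```

The data itself never moves between passes. Each iteration exchanges only the current centers and the per-cluster sums and counts, which mirrors the master-controlled iteration described above.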

It is often the case in statistical models that the same values are required in different parts of the computation. RevoScaleR has a sophisticated algorithm for pre-analyzing models to detect such duplication, so that the number of computations can be minimized. Multiple models can be analyzed jointly. This algorithm can also detect collinearities in models, which can lead to wasted computations or even computational failures, and can remove them prior to doing any computations.

Conclusion

RevoScaleR is a library included in Revolution R Enterprise that provides extremely fast statistical analysis on terabyte-class data sets, without needing specialized hardware. Using only a commodity multi-processor computer with modest amounts of RAM, data processing and predictive modeling techniques can easily be performed on data sets with hundreds of millions of rows and hundreds of variables, at speeds suitable for interactive processing. Extending the system to a small cluster of similar computers commensurately reduces processing time.

These achievements are the result of the design of the RevoScaleR platform, constructed from the ground up for speed and scalability.
Specifically:

- Efficient storage of data on local disk, in the high-performance XDF file format optimized for block-reads of data;
- A high-performance strategy for streaming data from disk to memory, optimizing throughput by dedicating one core to I/O while the remaining cores process buffered data;
- Optimized data formats for storing data in memory;
- Parallelized algorithms that exploit multiple cores to perform analytic processing on chunks of data held temporarily in memory;
- The ability to exploit the processing power of multiple nodes in a cluster, to further reduce processing times; and
- An architectural platform to implement parallel, streaming algorithms that efficiently combine the partial results from optimized algorithms running on multiple cores and multiple machines, to provide fast statistical data analyses on extremely large data sets.

The RevoScaleR library is included with Revolution R Enterprise, available for Windows and Linux systems from Revolution Analytics. For more information, please contact Revolution Analytics at GET-REVO.

About Revolution Analytics

Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. Led by predictive analytics pioneer and SPSS co-founder Norman Nie, the company brings high performance, productivity, and enterprise readiness to open source R, the most powerful statistics language in the world.

In the last 10 years, R has exploded in popularity and functionality and has emerged as the data scientist's tool of choice. Today R is used by over 2 million analysts worldwide in academia and at cutting-edge analytics-driven companies such as Google, Facebook, and LinkedIn. To equip R for the demands and requirements of all business environments, Revolution R Enterprise builds on open source R with innovations in big data analysis, integration and user experience.

The company's flagship Revolution R product is available both as a workstation and as a server-based offering. Revolution R Enterprise Server is designed to scale and meet the mission-critical production needs of large organizations such as Merck, Bank of America and Mu Sigma, while Revolution R Workstation offers productivity and development tools for individuals and small teams that need to build applications and analyze data.

Revolution Analytics is committed to fostering the growth of the R community. The company sponsors the Inside-R.org community site and local user groups worldwide, and offers free licenses of Revolution R Enterprise to everyone in academia to broaden adoption by the next generation of data scientists.

Revolution Analytics is headquartered in Palo Alto, Calif. and backed by North Bridge Venture Partners and Intel Capital.

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

How To Test The Performance Of An Ass 9.4 And Sas 7.4 On A Test On A Powerpoint Powerpoint 9.2 (Powerpoint) On A Microsoft Powerpoint 8.4 (Powerprobe) (

How To Test The Performance Of An Ass 9.4 And Sas 7.4 On A Test On A Powerpoint Powerpoint 9.2 (Powerpoint) On A Microsoft Powerpoint 8.4 (Powerprobe) ( White Paper Revolution R Enterprise: Faster Than SAS Benchmarking Results by Thomas W. Dinsmore and Derek McCrae Norton In analytics, speed matters. How much? We asked the director of analytics from a

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Driving Value from Big Data

Driving Value from Big Data Executive White Paper Driving Value from Big Data Bill Jacobs, Director of Product Marketing & Thomas W. Dinsmore, Director of Product Management Abstract Businesses are rapidly investing in Hadoop to

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Delivering Value from Big Data with Revolution R Enterprise and Hadoop

Delivering Value from Big Data with Revolution R Enterprise and Hadoop Executive White Paper Delivering Value from Big Data with Revolution R Enterprise and Hadoop Bill Jacobs, Director of Product Marketing Thomas W. Dinsmore, Director of Product Management October 2013 Abstract

More information

High Performance Predictive Analytics in R and Hadoop:

High Performance Predictive Analytics in R and Hadoop: High Performance Predictive Analytics in R and Hadoop: Achieving Big Data Big Analytics Presented by: Mario E. Inchiosa, Ph.D. US Chief Scientist August 27, 2013 1 Polling Questions 1 & 2 2 Agenda Revolution

More information

Informatica Ultra Messaging SMX Shared-Memory Transport

Informatica Ultra Messaging SMX Shared-Memory Transport White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade

More information

UpStream Software s Big Data Analytics Platform for Marketing Optimization Helps Clients Understand Buying Behavior and Improve Customer Targeting

UpStream Software s Big Data Analytics Platform for Marketing Optimization Helps Clients Understand Buying Behavior and Improve Customer Targeting CASE STUDY UpStream Software s Big Data Analytics Platform for Marketing Optimization Helps Clients Understand Buying Behavior and Improve Customer Targeting Company: Industry: Challenge: Solution: Results:

More information

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010 Flash Memory Arrays Enabling the Virtualized Data Center July 2010 2 Flash Memory Arrays Enabling the Virtualized Data Center This White Paper describes a new product category, the flash Memory Array,

More information

The Rise of Big Data Spurs a Revolution in Big Analytics

The Rise of Big Data Spurs a Revolution in Big Analytics REVOLUTION ANALYTICS EXECUTIVE BRIEFING The Rise of Big Data Spurs a Revolution in Big Analytics By Norman H. Nie, CEO Revolution Analytics The enormous growth in the amount of data that the global economy

More information

Big Data Analysis with Revolution R Enterprise

Big Data Analysis with Revolution R Enterprise REVOLUTION WHITE PAPER Big Data Analysis with Revolution R Enterprise By Joseph Rickert January 2011 Background The R language is well established as the language for doing statistics, data analysis, data-mining

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

Infrastructure Matters: POWER8 vs. Xeon x86

Infrastructure Matters: POWER8 vs. Xeon x86 Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

RAID for the 21st Century. A White Paper Prepared for Panasas October 2007

RAID for the 21st Century. A White Paper Prepared for Panasas October 2007 A White Paper Prepared for Panasas October 2007 Table of Contents RAID in the 21 st Century...1 RAID 5 and RAID 6...1 Penalties Associated with RAID 5 and RAID 6...1 How the Vendors Compensate...2 EMA

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts Part V Applications Cloud Computing: General concepts Copyright K.Goseva 2010 CS 736 Software Performance Engineering Slide 1 What is cloud computing? SaaS: Software as a Service Cloud: Datacenters hardware

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

PARALLELS CLOUD SERVER

PARALLELS CLOUD SERVER PARALLELS CLOUD SERVER An Introduction to Operating System Virtualization and Parallels Cloud Server 1 Table of Contents Introduction... 3 Hardware Virtualization... 3 Operating System Virtualization...

More information

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory) WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

Actian Vector in Hadoop

Actian Vector in Hadoop Actian Vector in Hadoop Industrialized, High-Performance SQL in Hadoop A Technical Overview Contents Introduction...3 Actian Vector in Hadoop - Uniquely Fast...5 Exploiting the CPU...5 Exploiting Single

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Cloud Server. Parallels. An Introduction to Operating System Virtualization and Parallels Cloud Server. White Paper. www.parallels.

Cloud Server. Parallels. An Introduction to Operating System Virtualization and Parallels Cloud Server. White Paper. www.parallels. Parallels Cloud Server White Paper An Introduction to Operating System Virtualization and Parallels Cloud Server www.parallels.com Table of Contents Introduction... 3 Hardware Virtualization... 3 Operating

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

Table of Contents. June 2010

Table of Contents. June 2010 June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
