RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER

By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics

Abstract

RevoScaleR, the Big Data predictive analytics library included with Revolution R Enterprise, is designed from the ground up to be fast and scalable. Consideration has been given to all of the components involved in performing large-scale statistical analysis: data storage, use of a computing infrastructure's resources (RAM, CPUs, cores, and computers), and the algorithms themselves. Its extreme speed and scalability are the result of careful, innovative engineering at every stage. This white paper describes the design and implementation considerations that are the foundation of the high-performance Big Data capabilities of Revolution R Enterprise.

Executive Summary

Analytics-driven breakthroughs in every field, from healthcare to financial services, have put demand for advanced analytics front and center for large and small organizations. As in any IT deployment, IT leaders supporting analytics environments have been challenged by tradeoffs among cost, performance and functionality. These tradeoffs are becoming more problematic because data volumes are exploding and because growing numbers of people recognize the potential impact of advanced analytics and are requesting solutions that exceed the capabilities of existing tools. How can IT create an analytics infrastructure that will grow with the organization's needs?

For the past several decades, the rising tide of technology, especially the increasing speed of single processors, has allowed the same data analysis code from legacy analytics software vendors to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM, and legacy code can't keep up.

To allow analytics to realize its potential for organizational improvements and handle very large and growing data sets, IT leaders need scalable data analysis software that is able to run on newer hardware paradigms, specifically using multiple cores, multiple hard drives, and multiple computers. The data analysis software needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.
Revolution Analytics offers enterprise-grade, terabyte-class software based on the open source project R. This white paper discusses the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR, specifically exploring:

- Storing data
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers

Please share this white paper with the people on your team who are responsible for collecting, storing, managing, analyzing and extracting value from data.

1. Storing Data

One of the keys to being scalable is the ability to process more data than can fit into memory at one time. This essentially equates to being able to work with chunks of data instead of requiring the entire data set to be resident in memory at once. In the context of RevoScaleR, chunks are defined as sequential blocks of rows for a given selection of columns.

Although RevoScaleR can process data from a wide variety of sources, it has its own highly optimized file format (the "XDF" format) that is especially suitable for chunking. Data in an XDF file can be accessed rapidly by row or by column. In addition, blocks of contiguous rows for selected columns can be read sequentially, rather than randomly. Sequential reads can be tens to hundreds of thousands of times faster than random reads.

Furthermore, in an XDF file the time it takes to read a block of rows for a variable is essentially independent of the total number of variables and rows in the file. This means that even in terabyte-sized files, only the data for the variables actually required for an analysis needs to be read and processed, and this may be only a few hundred megabytes. The time it takes to do that is essentially the same as if only that data were stored in the file; storing the additional unused data does not add to the processing time.
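To make the idea of reading a block of rows for one selected column concrete, here is a minimal sketch in Python. It does not describe the XDF format, whose layout this paper does not specify. It assumes a toy file in which each column is stored contiguously as 8-byte doubles, and the helper names write_columnar and read_block are hypothetical.

```python
import os
import tempfile
from array import array


def write_columnar(path, columns):
    """Write equal-length columns of doubles one after another (toy layout)."""
    with open(path, "wb") as f:
        for col in columns:
            array("d", col).tofile(f)


def read_block(path, n_rows, col_index, start_row, block_rows):
    """Read rows [start_row, start_row + block_rows) of a single column.

    Only the requested bytes are touched: one seek, then one sequential read.
    """
    item = array("d").itemsize
    count = min(block_rows, n_rows - start_row)
    out = array("d")
    with open(path, "rb") as f:
        f.seek((col_index * n_rows + start_row) * item)
        out.fromfile(f, count)
    return list(out)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_columns.bin")
    n_rows = 10
    write_columnar(path, [
        [float(r) for r in range(n_rows)],   # column 0
        [100.0 + r for r in range(n_rows)],  # column 1
        [200.0 + r for r in range(n_rows)],  # column 2
    ])
    # Rows 4..6 of column 1; columns 0 and 2 are never read.
    print(read_block(path, n_rows, col_index=1, start_row=4, block_rows=3))
    # [104.0, 105.0, 106.0]
```

In this toy layout the cost of read_block depends on the size of the requested block, not on how many other columns the file holds, which is the property the text attributes to XDF.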
Data in an XDF file is stored in the same binary format that is used in memory, so no conversion is required when it is brought into memory. To minimize wasted space, each variable can also be stored in an appropriately sized data type. For instance, a variable that takes no more than 256 distinct values can be stored in one byte per value, rather than in 8 bytes as is the case with some data analysis programs. Floating point values that need no more than 6 or 7 significant decimal digits of precision, which is commonly the case, can be stored in 4 bytes per number, not 8.

New variables and new rows can be added to the file without having to rewrite the entire file. Thus, the cost of creating new variables and of adding more observations is greatly reduced.
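The space savings from appropriately sized types can be illustrated with Python's standard array module. This is only a sketch of the general idea, not of RevoScaleR's internals, and encode_levels is a hypothetical helper introduced for the example.

```python
from array import array


def encode_levels(values):
    """Map each distinct value to a one-byte code (hypothetical helper)."""
    levels = sorted(set(values))
    if len(levels) > 256:
        raise ValueError("more than 256 distinct values; one byte is not enough")
    code_of = {level: code for code, level in enumerate(levels)}
    return levels, array("B", (code_of[v] for v in values))


levels, codes = encode_levels(["low", "high", "mid", "low"])
print(levels, list(codes))       # ['high', 'low', 'mid'] [1, 0, 2, 1]

# Bytes per stored value for each representation.
print(array("B").itemsize)       # 1 byte: unsigned 8-bit codes
print(array("f").itemsize)       # single precision, 4 bytes on common platforms
print(array("d").itemsize)       # double precision, 8 bytes on common platforms

# Single precision keeps roughly 6 to 7 significant decimal digits.
print(array("f", [3.14159])[0])  # close to 3.14159, but not bit-identical
```

On common platforms, a million observations of such a variable take about 1 MB as byte codes instead of 8 MB as doubles.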
2. Reading Data

When data is read in "chunks," the optimal chunk size depends upon a variety of factors, such as the speed of the disk, the speed of RAM, the number and speed of cores, and the types of computations being done. RevoScaleR allows the size of chunks to vary depending upon conditions.

A bottleneck for data processing is data I/O: reading the data from disk. RevoScaleR dedicates one core to reading data from disk, to avoid disk contention and optimize bandwidth for data I/O. Meanwhile, the remaining cores are assigned to processing the chunk of data read into memory by the previous read.

Of course, when it is possible to fit all data into memory, RevoScaleR allows that. It then assigns all cores to process that data.

3. Handling Data in Memory

As on disk, use of the appropriately sized data type in memory reduces the space required and also reduces the time it takes to move the data around in memory.

In RevoScaleR, the amount of data conversion and copying is minimized, to save both time and space. In almost all other data-oriented programs, before an array of integers and an array of double precision floating point numbers are added together, the array of integers is first converted and copied into an array of doubles. This takes time and space. In RevoScaleR, that is almost never necessary, regardless of the type of operation and the sizes of the data types. No conversion or copying is done until the values are actually loaded into the CPU.

4. Use of Multiple Cores on a Single Computer

Nearly all computations that involve data in RevoScaleR are automatically "threaded"; that is, they use multiple cores on a machine when they are available.
This is done efficiently by minimizing the overhead of transferring the computations to multiple threads, by minimizing the amount of data that must be copied, by doing as much work as possible on each thread to amortize the cost of initializing the computations, and by minimizing inter-thread communication and synchronization.

Feeding large chunks of data to each of the multiple cores is important for efficiency. For analytic routines such as descriptive statistics, crosstabs, linear regression, logistic regression, and K-means clustering (in which several variables are typically used), a large chunk of observations, perhaps millions, for all of the variables is read into memory by one core. Simultaneously, the data chunk from the previous read is "virtually" split among the remaining cores for the required processing. The code doing the processing on each core (thread) only needs to know what its assigned task is, and no inter-thread communication or synchronization is needed.

As a simple example, consider computing the mean of several variables. Millions of observations of each of those variables might be read by the I/O thread, and then each of the other threads is given a proportionate share of the observations. Each computational thread just needs to compute and store the sum of each of the variables for its share of the observations, and to record how many observations it used. To get the means for the entire data set, the partial sums and partial observation counts are aggregated, and the grand sums are divided by the total number of observations.

Figure 1. RevoScaleR on Single Computer

- A RevoScaleR algorithm is provided a data source as input.
- The algorithm loops over the data, reading a block at a time. Blocks of data are read by a separate worker thread (Thread 0).
- Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory.
- When all of the data is processed, a master results object is created from the intermediate results objects.

5. Use of Multiple Computers

A key to efficiently using multiple computers is to minimize the amount of information, including data, that must be communicated among the computers. In RevoScaleR, one of the computers (the master node) controls the computations on all of the other computers. It first sends a message to each compute node telling it where to find the data to use and what types of computations to do. On each computer, multiple cores are used as described above, to maximize the efficiency of the node. The intermediate results from all cores are aggregated on that node, and only that information is sent back to the master node.

The master node monitors the status of the compute nodes, aggregates the overall results sent back by those nodes, and then processes those results to get overall estimates. The final processing often involves compute-intensive operations such as solving large sets of equations.

RevoScaleR allows several options for getting data to the cores on each node, including reading data from a common data server, but it is generally most efficient to have the portion of data needed by each node stored locally.
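The mean example from Section 4 can be sketched in Python as follows: one thread reads chunks while worker threads compute partial sums and counts for the previously read chunk, and the partial results are combined at the end. This is an illustration of the pattern only, not RevoScaleR's implementation. The function names, the in-memory list standing in for a file, and the chunk and worker counts are all assumptions, and CPython's global interpreter lock means this particular sketch shows the structure rather than a real speedup.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def read_chunks(source, chunk_rows, out_queue):
    """I/O thread: push consecutive blocks of rows, then a None sentinel."""
    for start in range(0, len(source), chunk_rows):
        out_queue.put(source[start:start + chunk_rows])
    out_queue.put(None)


def partial_sum(rows):
    """Worker task: sum and count for its own share; no shared state."""
    return sum(rows), len(rows)


def split(chunk, parts):
    """Divide a chunk into roughly equal contiguous shares."""
    size = max(1, -(-len(chunk) // parts))  # ceiling division
    return [chunk[i:i + size] for i in range(0, len(chunk), size)]


def chunked_mean(source, chunk_rows=1000, workers=3):
    if not source:
        raise ValueError("source is empty")
    chunks = queue.Queue(maxsize=1)  # lets the reader stay one chunk ahead
    reader = threading.Thread(target=read_chunks,
                              args=(source, chunk_rows, chunks))
    reader.start()
    total, count = 0.0, 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            chunk = chunks.get()
            if chunk is None:
                break
            # Aggregate the partial sums and counts returned by the workers.
            for s, n in pool.map(partial_sum, split(chunk, workers)):
                total += s
                count += n
    reader.join()
    return total / count


print(chunked_mean(list(range(1, 101)), chunk_rows=30, workers=4))  # 50.5
```

The same aggregation applies across computers as described in Section 5: each node would send back only its sums and counts, never the rows themselves.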
For iterative algorithms that require many passes through the data, such as logistic regression and K-means clustering, the master node controls the iterations. This is done by repeating the steps described above: each iteration is initialized by a message from the master node, which aggregates the results that come back, computes the next set of estimated parameters, and decides whether the algorithm has converged. If not, another iteration is started.

Figure 2. RevoScaleR on Multiple Computers

- Portions of the data source are made available to each compute node.
- RevoScaleR on the master node assigns a task to each compute node, and the sleeping instance of RevoScaleR on the compute node wakes up.
- RevoScaleR on each compute node independently processes its data and returns its intermediate results to RevoScaleR on the master node.
- RevoScaleR on the master node aggregates all of the intermediate results from each compute node and produces the final result.

6. Efficient Parallelization of Statistical and Data Mining Algorithms

RevoScaleR is built upon a platform designed to automatically and efficiently parallelize "external memory" algorithms. This is the class of algorithms that do not require all data to be in memory at one time, and such algorithms are available for a wide range of statistical and data mining routines.

The way in which these algorithms are automatically parallelized is such that, in general, the fastest algorithms per core are also the fastest when parallelized. (This happy situation is not the case for some other types of parallel algorithms.) Since the burden of worrying about parallelization is removed from the engineers implementing these algorithms, they can focus on getting optimal speed on each core.

This involves several things. Most obviously, it involves using fast algorithms and carefully coding them using C++ templates, which can produce very fast code. Other issues are important as well. Categorical data is very common in statistical computations, and it is handled in ways that save memory, increase speed, and increase computational precision.
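The master-controlled iteration described in Section 5 can be sketched with a one-dimensional K-means. Each "node" holds its own partition of the data and returns only per-cluster sums and counts; the master aggregates them, computes the next centers, and decides whether to start another iteration. This is a single-process illustration under stated assumptions: node_task and master_kmeans are hypothetical names, and a plain function call stands in for the message sent to each compute node.

```python
def node_task(local_data, centers):
    """Compute node: assign local points to the nearest center and
    return only per-cluster partial sums and counts."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in local_data:
        k = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        sums[k] += x
        counts[k] += 1
    return sums, counts


def master_kmeans(partitions, centers, tol=1e-9, max_iter=100):
    """Master node: start each iteration, aggregate, test convergence."""
    for iteration in range(1, max_iter + 1):
        # Stand-in for messaging each compute node and collecting replies.
        results = [node_task(part, centers) for part in partitions]
        new_centers = []
        for k, old in enumerate(centers):
            s = sum(r[0][k] for r in results)
            n = sum(r[1][k] for r in results)
            new_centers.append(s / n if n else old)  # keep empty clusters
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift <= tol:
            return centers, iteration
    return centers, max_iter


partitions = [[1.0, 2.0, 10.0], [3.0, 11.0, 12.0]]  # data local to two nodes
print(master_kmeans(partitions, centers=[0.0, 5.0]))  # ([2.0, 11.0], 3)
```

The traffic per iteration is one small message to each node and one small result back, regardless of how many rows each node holds.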
It is often the case in statistical models that the same values are required in different parts of the computation. RevoScaleR has a sophisticated algorithm for pre-analyzing models to detect such duplication, so that the number of computations can be minimized. Multiple models can be analyzed jointly. This algorithm can also detect collinearities in models, which can lead to wasted computations or even computational failures, and can remove them prior to doing any computations.

Conclusion

RevoScaleR is a library included in Revolution R Enterprise that provides extremely fast statistical analysis on terabyte-class data sets, without needing specialized hardware. Using only a commodity multi-processor computer with modest amounts of RAM, data processing and predictive modeling techniques can easily be performed on data sets with hundreds of millions of rows and hundreds of variables, at speeds suitable for interactive processing. Extending the system to a small cluster of similar computers commensurately reduces processing time.

These achievements are the result of the design of the RevoScaleR platform, constructed from the ground up for speed and scalability.
Specifically:

- Efficient storage of data on local disk, in the high-performance XDF file format optimized for block reads of data;
- A high-performance strategy for streaming data from disk to memory, optimizing throughput by dedicating one core to I/O while the remaining cores process buffered data;
- Optimized data formats for storing data in memory;
- Parallelized algorithms that exploit multiple cores to perform analytic processing on chunks of data held temporarily in memory;
- The ability to exploit the processing power of multiple nodes in a cluster, to further reduce processing times; and
- An architectural platform to implement parallel, streaming algorithms that efficiently combine the partial results from optimized algorithms running on multiple cores and multiple machines, to provide fast statistical data analyses on extremely large data sets.

The RevoScaleR library is included with Revolution R Enterprise, available for Windows and Linux systems from Revolution Analytics. For more information, please contact Revolution Analytics at GET-REVO ( ) or at
About Revolution Analytics

Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. Led by predictive analytics pioneer and SPSS co-founder Norman Nie, the company brings high performance, productivity, and enterprise readiness to open source R, the most powerful statistics language in the world. In the last 10 years, R has exploded in popularity and functionality and has emerged as the data scientist's tool of choice. Today R is used by over 2 million analysts worldwide in academia and at cutting-edge analytics-driven companies such as Google, Facebook, and LinkedIn.

To equip R for the demands and requirements of all business environments, Revolution R Enterprise builds on open source R with innovations in big data analysis, integration and user experience. The company's flagship Revolution R product is available as both a workstation and a server-based offering. Revolution R Enterprise Server is designed to scale and meet the mission-critical production needs of large organizations such as Merck, Bank of America and Mu Sigma, while Revolution R Workstation offers productivity and development tools for individuals and small teams that need to build applications and analyze data.

Revolution Analytics is committed to fostering the growth of the R community. The company sponsors the Inside-R.org community site and local user groups worldwide, and offers free licenses of Revolution R Enterprise to everyone in academia to broaden adoption by the next generation of data scientists. Revolution Analytics is headquartered in Palo Alto, Calif. and backed by North Bridge Venture Partners and Intel Capital. Please visit us at
More informationModernizing Hadoop Architecture for Superior Scalability, Efficiency & Productive Throughput. ddn.com
DDN Technical Brief Modernizing Hadoop Architecture for Superior Scalability, Efficiency & Productive Throughput. A Fundamentally Different Approach To Enterprise Analytics Architecture: A Scalable Unit
More informationBig-data Analytics: Challenges and Opportunities
Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)
More informationR at the front end and
Divide & Recombine for Large Complex Data (a.k.a. Big Data) 1 Statistical framework requiring research in statistical theory and methods to make it work optimally Framework is designed to make computation
More informationScaling Web Applications on Server-Farms Requires Distributed Caching
Scaling Web Applications on Server-Farms Requires Distributed Caching A White Paper from ScaleOut Software Dr. William L. Bain Founder & CEO Spurred by the growth of Web-based applications running on server-farms,
More informationCluster Computing at HRI
Cluster Computing at HRI J.S.Bagla Harish-Chandra Research Institute, Chhatnag Road, Jhunsi, Allahabad 211019. E-mail: jasjeet@mri.ernet.in 1 Introduction and some local history High performance computing
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationFrom Big Data, Data Mining, and Machine Learning. Full book available for purchase here.
From Big Data, Data Mining, and Machine Learning. Full book available for purchase here. Contents Forward xiii Preface xv Acknowledgments xix Introduction 1 Big Data Timeline 5 Why This Topic Is Relevant
More informationAn Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
More informationRecommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
More informationLaurence Liew General Manager, APAC. Economics Is Driving Big Data Analytics to the Cloud
Laurence Liew General Manager, APAC Economics Is Driving Big Data Analytics to the Cloud Big Data 101 The Analytics Stack Economics of Big Data Convergence of the 3 forces Big Data Analytics in the Cloud
More informationReducing Storage TCO With Private Cloud Storage
Prepared by: Colm Keegan, Senior Analyst Prepared: October 2014 With the burgeoning growth of data, many legacy storage systems simply struggle to keep the total cost of ownership (TCO) in check. This
More informationSAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform
SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform David Lawler, Oracle Senior Vice President, Product Management and Strategy Paul Kent, SAS Vice President, Big Data What
More informationThe Power of Predictive Analytics
The Power of Predictive Analytics Derive real-time insights with accuracy and ease SOLUTION OVERVIEW www.sybase.com KXEN S INFINITEINSIGHT AND SYBASE IQ FEATURES & BENEFITS AT A GLANCE Ensure greater accuracy
More informationThe Mainframe Virtualization Advantage: How to Save Over Million Dollars Using an IBM System z as a Linux Cloud Server
Research Report The Mainframe Virtualization Advantage: How to Save Over Million Dollars Using an IBM System z as a Linux Cloud Server Executive Summary Information technology (IT) executives should be
More information16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
More informationIn-Memory Analytics for Big Data
In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...
More informationWHITE PAPER Improving Storage Efficiencies with Data Deduplication and Compression
WHITE PAPER Improving Storage Efficiencies with Data Deduplication and Compression Sponsored by: Oracle Steven Scully May 2010 Benjamin Woo IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA
More informationBig Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
More informationWide-area Network Acceleration for the Developing World. Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton)
Wide-area Network Acceleration for the Developing World Sunghwan Ihm (Princeton) KyoungSoo Park (KAIST) Vivek S. Pai (Princeton) POOR INTERNET ACCESS IN THE DEVELOPING WORLD Internet access is a scarce
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationA survey on platforms for big data analytics
Singh and Reddy Journal of Big Data 2014, 1:8 SURVEY PAPER Open Access A survey on platforms for big data analytics Dilpreet Singh and Chandan K Reddy * * Correspondence: reddy@cs.wayne.edu Department
More informationMicrosoft Windows Server Hyper-V in a Flash
Microsoft Windows Server Hyper-V in a Flash Combine Violin s enterprise-class storage arrays with the ease and flexibility of Windows Storage Server in an integrated solution to achieve higher density,
More informationHadoop Cluster Applications
Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday
More informationAchieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks
WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationHow To Speed Up A Flash Flash Storage System With The Hyperq Memory Router
HyperQ Hybrid Flash Storage Made Easy White Paper Parsec Labs, LLC. 7101 Northland Circle North, Suite 105 Brooklyn Park, MN 55428 USA 1-763-219-8811 www.parseclabs.com info@parseclabs.com sales@parseclabs.com
More informationDelivering value from big data with Microsoft R Server and Hadoop
EXECUTIVE WHITE PAPER Delivering value from big data with Microsoft R Server and Hadoop Microsoft Advanced Analytics Team April 2016 ABSTRACT Businesses are continuing to invest in Hadoop to manage analytic
More informationANALYTICS IN BIG DATA ERA
ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut
More informationHow to Choose your Red Hat Enterprise Linux Filesystem
How to Choose your Red Hat Enterprise Linux Filesystem EXECUTIVE SUMMARY Choosing the Red Hat Enterprise Linux filesystem that is appropriate for your application is often a non-trivial decision due to
More informationSQL Server Virtualization
The Essential Guide to SQL Server Virtualization S p o n s o r e d b y Virtualization in the Enterprise Today most organizations understand the importance of implementing virtualization. Virtualization
More informationHadoopTM Analytics DDN
DDN Solution Brief Accelerate> HadoopTM Analytics with the SFA Big Data Platform Organizations that need to extract value from all data can leverage the award winning SFA platform to really accelerate
More informationTop 10 Reasons why MySQL Experts Switch to SchoonerSQL - Solving the common problems users face with MySQL
SCHOONER WHITE PAPER Top 10 Reasons why MySQL Experts Switch to SchoonerSQL - Solving the common problems users face with MySQL About Schooner Information Technology Schooner Information Technology provides
More informationWhy Big Data in the Cloud?
Have 40 Why Big Data in the Cloud? Colin White, BI Research January 2014 Sponsored by Treasure Data TABLE OF CONTENTS Introduction The Importance of Big Data The Role of Cloud Computing Using Big Data
More informationQLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE
QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE QlikView Technical Brief April 2011 www.qlikview.com Introduction This technical brief covers an overview of the QlikView product components and architecture
More informationWhat Is In-Memory Computing and What Does It Mean to U.S. Leaders? EXECUTIVE WHITE PAPER
What Is In-Memory Computing and What Does It Mean to U.S. Leaders? EXECUTIVE WHITE PAPER A NEW PARADIGM IN INFORMATION TECHNOLOGY There is a revolution happening in information technology, and it s not
More informationNetworking in the Hadoop Cluster
Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More information