# How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis

Save this PDF as:

Size: px
Start display at page:

Download "How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis"

## Transcription

3 TABLE I SUMMARY OF PERFORMANCE METRICS. Metric How measured? Derived Relevant aspect (use) job execution Time the full - Raw processing power time (T ) execution (Figure 1, 3, 4) Edges per - #E/T Raw processing power second (EPS) (Figure 2) Vertices per - #V/T Raw processing power second (VPS) (Figure 2) CPU, memory, Monitoring - Resource utilization network sampled each second (Technical report [33]) Horizontal T of different - Scalability scalability cluster size (N) (Figure 5) Vertical T of different - Scalability scalability cores per node (Figure 6) Normalized edges - #E/T/N Scalability per second (NEPS) (Figure 5, 6) Computation Time actual - Raw processing power time (T c) for calculating (Figure 7) Overhead - T T c Overhead time (T o) (Figure 7) #V and #E are the number of vertices and the number of edges of graphs, respectively. B. Selection of graphs and algorithms This section presents a selection of graphs and algorithms, which is akin to identifying some of the main functional requirements of graph-processing systems. We further discuss the representativeness of our selection in Section V. 1) Graph selection: The main goal of the graph selection step is to select graphs with different characteristics but with comparable representation. We use the classic graph formalism [34]: a graph is a collection of vertices V (also called nodes) and edges E (also called arcs or links) which connect the vertices. A single edge is described by the two vertices it connects: e = (u,v). A graph is represented by G = (V,E). We consider both directed and undirected graphs. We do not use other graph models (e.g., hypergraphs). Regarding the graph characteristics, we select graphs with a variety of values for the number of nodes and edges, and with different structures (see Table II). We store the graphs in plain text with a processing-friendly format but without indexes. In our format, vertices have integers as identifiers. Each vertex is stored in an individual line, which for undirected graphs, includes the identifier of the vertex and a commaseparated list of neighbors; for directed graphs, each vertex line includes the vertex identifier and two comma-separated lists of neighbors, corresponding to the incoming and to the outgoing edges. Thus, we do not consider other data models proposed for exchanging and using graphs such as complex plain-text representations, universal data formats (e.g. XML), relational databases, relationship formalisms (e.g., RDF), etc. Why these datasets? We select seven graphs which could match, in scale and diversity, the datasets used by SMEs. Table II shows the characteristics of the selected graph datasets. The graphs have diverse sources (e-business, social network, online gaming, citation links, and synthetic graph), and a wide range of different sizes and graph metrics (e.g., high vs. low degree, 1,663 vs. 2, respectively, directed and undirected graphs, etc.). The synthetic graph ( Synth in Table II) is TABLE II SUMMARY OF DATASETS. Graphs #V #E d D Directivity G1 Amazon 262,111 1,234, directed G2 WikiTalk 2,388,953 5,018, directed G3 KGS 293,290 16,558, undirected G4 Citation 3,764,117 16,511, directed G5 DotaLeague 61,171 50,870,316 2, ,663 undirected G6 Synth 2,394,536 64,152, undirected G7 Friendster 65,608,366 1,806,067, undirected d is the link density of the graphs ( 10 5 ). D is the average vertex degree of undirected graphs and the average vertex in-degree (or average vertex out-degree) of directed graphs. TABLE III SUMMARY OF ALGORITHMS. Algorithm Main features Use A1 STATS single step, low processing decision-making A2 BFS iterative, low processing building block A3 CONN iterative, medium processing building block A4 CD iterative, medium or high processing social network A5 EVO iterative (multi-level), high processing prediction produced by the generator described in Graph500 [31]. The other graphs have been extracted from real-world use, and have been shared through the Stanford Network Analysis Project (SNAP) [35]) and the Game Trace Archive (GTA) [2]. 2) Algorithm selection: Why these algorithms? We have conducted a comprehensive survey of graph-processing articles published in 10 representative conferences, in recent years; in total, over 100 articles (see technical report [33]). We found that a large variety of graph processing algorithms exist in practice [36] and are likely used by SMEs. The algorithms can be categorized into several groups by functionality, consumption of resources, etc. We focus on algorithm functionality and select one exemplar of each of the following five algorithmic classes, which are common in our survey: general statistics, graph traversal (used in Graph500), connected components, community detection, and graph evolution. We describe in the following the five selected algorithms and summarize their characteristics in Table III. The General statistics (STATS) algorithm computes the number of vertices and edges, and the average of the local clustering coefficient of all vertices. The results obtained with STATS can provide the graph analyst with an overview of the structure of the graph. Breadth-first search (BFS) is a widely used algorithm in graph processing, which is often a building block for more complex algorithms, such as item search, distance calculation, diameter calculation, shortest path, longest path, etc. BFS allows us to understand how the tested platforms cope with lightweight iterative jobs. Connected component (CONN) is an algorithm for extracting groups of vertices that can reach each other via graph edges. This algorithm produces a large amount of output, as in many graphs the largest connected component includes a majority of the vertices. Community detection (CD) is important for social network applications, as users of these networks tends to form communities, that is, groups whose constituent nodes form more

7 Execution time [s] Amazon WikiTalk KGS Citation DotaLeague Friendster 1 hour 15 mins 1 min Execution time [s] YARN Neo4j 1 hour 15 mins 1 min STATS BFS CONN CD EVO Algorithms CONN() Fig. 3. The execution time of all algorithms for all datasets running on, and for CONN running on (right-most bars). to process only directed graphs, which has required the conversion of the undirected KGS to a directed version. This operation lead to a doubling in the number of edges, resulting in a proportional increase of the execution time. 2) Results for two selected platforms: We focus in this section on the graph-specific platforms ( and ) and discuss their performance for all the algorithms and datasets, as depicted in Figure 3. As is an in-memory-only platform, its performance is not affected by the large penalties of I/O operations. Figure 3 shows that the execution time for all experiments is below 100 seconds. However, when the amount of messages between computing nodes becomes extremely large (tens of gigabytes), crashes. For example, crashes for the STATS algorithm running on the WikiTalk dataset; for Friendster, the largest of our datasets, completes only the EVO algorithm, for which our graph evolution algorithm generates relatively few messages. From the selected results, performs better than for the CONN algorithm for most graphs. Moreover, is able to process even the largest graph in this study. 3) Results for two selected datasets: Finally, to understand the impact of algorithm complexity on each platform, we focus now on two interesting datasets DotaLeague and Citation. We depict their performance, for all algorithms running on all platforms, in Figure 4. Because Friendster is too large for some platforms, we present here the results for graphs that all platforms can process: the second-largest real graph, DotaLeague, and the small Citation graph. Even for the second-largest graph,, and YARN crashed when running STATS; we also had to terminate after running STATS for nearly 4 hours without success; similarly, STATS and CD run for more than 20 hours in Neo4j and are not shown in Figure 4. For the other algorithms, BFS, CONN, CD, and EVO, the number of iterations is between 4 and 6. From Figure 4, the execution time of BFS is lower than the execution time of CONN and CD, on all platforms. In each iteration of CONN and CD, many more vertices will be active, in comparison to BFS. Furthermore, in CONN, the number of active vertices stays relatively constant in each iteration, while CD is more compute-intensive and variable. For EVO, takes advantage of its programming model, as it can represent one BFS CONN CD EVO CONN(Citation) Algorithms Fig. 4. The execution time for all platforms, running all algorithms for the DotaLeague dataset, and CONN for the Citation dataset (right-most bars). EVO iteration by a single map-reduce-reduce procedure; in contrast, and YARN need to run two MapReduce jobs per iteration and thus their execution time increases. Citation is much smaller and sparser than DotaLeague. The CONN of Citation takes 20 iterations. The execution time of CONN of Citation on, YARN, and increases compared with 6-iteration CONN of DotaLeague. As we explained for the analysis of BFS (Section IV-A1), more iterations result in higher I/O and other overheads. B. Evaluation of scalability In this section, we evaluate the horizontal and vertical scalability of the distributed platforms. Besides the job execution time, we also report the NEPS for comparing the performance per computing unit. To allow a comparison with the previous experiments, we use BFS results. To test scalability, we use the two largest real graphs in our study, Friendster and DotaLeague (the results of DotaLeague are in our technical report [33] ). For testing horizontal scalability, we increase the number of machines from 20 to 50 by a step of 5, and keep using a single computing core per machine. For testing vertical scalability, we keep the cluster size at 20 computing machines and increase the number of computing cores per machines from 1 to 7. We step up the number of map (reduce) tasks and parallelization degree equally to the available computing cores. Key findings: Some platforms can scale up reasonably with cluster size (horizontally) or number of cores (vertically). Increasing the number of computing cores may lead to worse performance, especially for small graphs. The normalized performance per computing unit mostly decreases with the increase of cluster size and with the number of computing cores per node. 1) Horizontal scalability: Figure 5 shows the horizontal scalability of BFS for Friendster (G7). Most of the platforms presents significant horizontal scalability, except for, which exhibits little scalability. The horizontal scalability of is constrained by the graph loading phase using one single file. We thus explore tuning : for (mp) we split the input file into multiple separate pieces, as many as the MPI processes. (mp) has much lower execution time than. Moreover,

### Graphalytics: A Big Data Benchmark for Graph-Processing Platforms

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms Mihai Capotă, Tim Hegeman, Alexandru Iosup, Arnau Prat-Pérez, Orri Erling, Peter Boncz Delft University of Technology Universitat Politècnica

### GRAPHALYTICS http://bl.ocks.org/mbostock/4062045 A Big Data Benchmark for Graph-Processing Platforms

GRAPHALYTICS http://bl.ocks.org/mbostock/4062045 A Big Data Benchmark for Graph-Processing Platforms Mihai Capotã, Yong Guo, Ana Lucia Varbanescu, Tim Hegeman, Jorai Rijsdijk, Alexandru Iosup, GRAPHALYTICS

### Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction

### A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

### Graph Processing and Social Networks

Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph

### Big Graph Processing: Some Background

Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

### Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

### Map-Based Graph Analysis on MapReduce

2013 IEEE International Conference on Big Data Map-Based Graph Analysis on MapReduce Upa Gupta, Leonidas Fegaras University of Texas at Arlington, CSE Arlington, TX 76019 {upa.gupta,fegaras}@uta.edu Abstract

### Fast Iterative Graph Computation with Resource Aware Graph Parallel Abstraction

Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011 2 NA = 6.022 1023 mol 1 Paul Burkhardt, Chris Waring An NSA Big Graph experiment Fast Iterative Graph Computation with Resource

### Evaluating partitioning of big graphs

Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed

### An Empirical Performance Evaluation of GPU-Enabled Graph-Processing Systems

An Empirical Performance Evaluation of GPU-Enabled Graph-Processing Systems Yong Guo, Ana Lucia Varbanescu, Alexandru Iosup and Dick Epema TU Delft, The Netherlands, Email: {Yong.Guo, A.Iosup, D.H.J.Epema}@tudelft.nl

### A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

### LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,

### A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key

### Energy Efficient MapReduce

Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

### Big Data Analytics. Lucas Rego Drumond

Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 33 Outline

### A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward

### Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden

### Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012

Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships

### Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

### Virtuoso and Database Scalability

Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

### Resource Scalability for Efficient Parallel Processing in Cloud

Resource Scalability for Efficient Parallel Processing in Cloud ABSTRACT Govinda.K #1, Abirami.M #2, Divya Mercy Silva.J #3 #1 SCSE, VIT University #2 SITE, VIT University #3 SITE, VIT University In the

### Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

### BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

### Graph Database Proof of Concept Report

Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

### Tableau Server Scalability Explained

Tableau Server Scalability Explained Author: Neelesh Kamkolkar Tableau Software July 2013 p2 Executive Summary In March 2013, we ran scalability tests to understand the scalability of Tableau 8.0. We wanted

### The big data revolution

The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing

### Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

### Towards an Optimized Big Data Processing System

Towards an Optimized Big Data Processing System The Doctoral Symposium of the IEEE/ACM CCGrid 2013 Delft, The Netherlands Bogdan Ghiţ, Alexandru Iosup, and Dick Epema Parallel and Distributed Systems Group

Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday

### CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

### Large-Scale Data Processing

Large-Scale Data Processing Eiko Yoneki eiko.yoneki@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### InfiniteGraph: The Distributed Graph Database

A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

### Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

### Benchmarking Hadoop & HBase on Violin

Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

### The Evolvement of Big Data Systems

The Evolvement of Big Data Systems From the Perspective of an Information Security Application 2015 by Gang Chen, Sai Wu, Yuan Wang presented by Slavik Derevyanko Outline Authors and Netease Introduction

### Optimizing Dell PowerEdge Configurations for Hadoop

Optimizing Dell PowerEdge Configurations for Hadoop Understanding how to get the most out of Hadoop running on Dell hardware A Dell technical white paper July 2013 Michael Pittaro Principal Architect,

### SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large

PerfAccel (TM) Performance Benchmark on Amazon: DataStax Enterprise, powered by Apache Cassandra (TM) Disclaimer: All of the documentation provided in this document, is copyright Datagres Technologies

### Using an In-Memory Data Grid for Near Real-Time Data Analysis

SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 IN today s competitive world, businesses

### Apache Hama Design Document v0.6

Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault

### A Middleware Strategy to Survive Compute Peak Loads in Cloud

A Middleware Strategy to Survive Compute Peak Loads in Cloud Sasko Ristov Ss. Cyril and Methodius University Faculty of Information Sciences and Computer Engineering Skopje, Macedonia Email: sashko.ristov@finki.ukim.mk

### Concept and Project Objectives

3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the

### Big Data and Scripting Systems beyond Hadoop

Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid

### Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,

### A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

### Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

/35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

### Oracle Database Scalability in VMware ESX VMware ESX 3.5

Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

### MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT

MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT 1 SARIKA K B, 2 S SUBASREE 1 Department of Computer Science, Nehru College of Engineering and Research Centre, Thrissur, Kerala 2 Professor and Head,

### Cloud Based Application Architectures using Smart Computing

Cloud Based Application Architectures using Smart Computing How to Use this Guide Joyent Smart Technology represents a sophisticated evolution in cloud computing infrastructure. Most cloud computing products

### Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data

WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...

### An Application of Hadoop and Horizontal Scaling to Conjunction Assessment. Mike Prausa The MITRE Corporation Norman Facas The MITRE Corporation

An Application of Hadoop and Horizontal Scaling to Conjunction Assessment Mike Prausa The MITRE Corporation Norman Facas The MITRE Corporation ABSTRACT This paper examines a horizontal scaling approach

### Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance

### Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

### MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

### Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

### Recommendations for Performance Benchmarking

Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best

### - An Essential Building Block for Stable and Reliable Compute Clusters

Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

### Accelerating Web-Based SQL Server Applications with SafePeak Plug and Play Dynamic Database Caching

Accelerating Web-Based SQL Server Applications with SafePeak Plug and Play Dynamic Database Caching A SafePeak Whitepaper February 2014 www.safepeak.com Copyright. SafePeak Technologies 2014 Contents Objective...

### Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling

### Benchmarking Cassandra on Violin

Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

### Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

### Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

### Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

### Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7

Introduction 1 Performance on Hosted Server 1 Figure 1: Real World Performance 1 Benchmarks 2 System configuration used for benchmarks 2 Figure 2a: New tickets per minute on E5440 processors 3 Figure 2b:

### Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014 Scale, Security, Schema Scale to scale 1 - (vt) to change the size of something let s scale the

### Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

### Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

### Networking in the Hadoop Cluster

Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop

### GeoGrid Project and Experiences with Hadoop

GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology

### An NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013

U.S. National Security Agency Research Directorate - R6 Technical Report NSA-RD-2013-056002v1 May 20, 2013 Graphs are everywhere! A graph is a collection of binary relationships, i.e. networks of pairwise

### Big Data Systems CS 5965/6965 FALL 2015

Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html

### W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

### Cray: Enabling Real-Time Discovery in Big Data

Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects

### Condusiv s V-locity Server Boosts Performance of SQL Server 2012 by 55%

openbench Labs Executive Briefing: April 19, 2013 Condusiv s Server Boosts Performance of SQL Server 2012 by 55% Optimizing I/O for Increased Throughput and Reduced Latency on Physical Servers 01 Executive

### Linux Performance Optimizations for Big Data Environments

Linux Performance Optimizations for Big Data Environments Dominique A. Heger Ph.D. DHTechnologies (Performance, Capacity, Scalability) www.dhtusa.com Data Nubes (Big Data, Hadoop, ML) www.datanubes.com

### Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

### Challenges for Data Driven Systems

Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

### Best Practices Guide. Sendmail Servers: Scaling Performance through Solid State File Caching. March 2000

Best Practices Guide Sendmail Servers: Scaling Performance through Solid State File Caching March 2000 Solid Data Systems, Inc. K 2945 Oakmead Village Court K Santa Clara, CA 95051 K Tel +1.408.845.5700

### Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

### From GWS to MapReduce: Google s Cloud Technology in the Early Days

Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

### The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

### Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

### 16.1 MAPREDUCE. For personal use only, not for distribution. 333

For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

### Figure 1. The cloud scales: Amazon EC2 growth [2].

- Chung-Cheng Li and Kuochen Wang Department of Computer Science National Chiao Tung University Hsinchu, Taiwan 300 shinji10343@hotmail.com, kwang@cs.nctu.edu.tw Abstract One of the most important issues

### IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud February 25, 2014 1 Agenda v Mapping clients needs to cloud technologies v Addressing your pain

### Understanding the Benefits of IBM SPSS Statistics Server

IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

### CiteSeer x in the Cloud

Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar

### Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions

### Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

### III Big Data Technologies

III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

### Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang University of Waterloo qzhang@uwaterloo.ca Joseph L. Hellerstein Google Inc. jlh@google.com Raouf Boutaba University of Waterloo rboutaba@uwaterloo.ca

### Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform