Algorithmic Challenges and Opportunities for Data Analysis and Visualization in the Co-design Process


1 Algorithmic Challenges and Opportunities for Data Analysis and Visualization in the Co-design Process Hasan Abbasi, Janine Bennett, Peer-Timo Bremer, Varis Carey, Greg Eisenhauer, Attila Gyulassy, Scott Klasky, Robert Moser, Todd Oliver, Manish Parashar, Valerio Pascucci, Karsten Schwan, Hongfeng Yu, and Matthew Wolf

2 SDMA challenges in extreme-scale computing

3 Combustion Workflow
RHS of S3D solver at each stage of an explicit time step
Asynchronous movement of data, or sharing of data in memory (different levels)
In situ, in transit data analysis/viz workflow via hybrid staging

4 We are Building Proxy and Skeletal Apps that Enable Empirical Evaluation of Codesign Choices
Proxy app for topology-driven feature extraction
Proxy app for topology-driven feature tracking
Proxy app for statistical analysis
Proxy app for visualization
Proxy app for uncertainty quantification
Skeletal apps for staging and data movement

5 Codesign Questions for Data Analysis and Visualization Algorithms
How much memory will be available in situ, and with what characteristics?
Will we have hardware and runtime support for asynchronous computation in situ?
Will performance for small messages be reduced?
What is the ratio of network bandwidth for in situ vs. in transit communication?
How well will modern processors support code that is branch-heavy and FLOP-free?

6 A Wide Range of Analysis and Visualization Algorithms Are Needed for Combustion Applications
In situ multi-variate volume and particle rendering
Lagrangian particle querying and analysis
Topological segmentation: contour trees, Morse-Smale complex, time tracking
Scalar field comparison
Distance field (level set)
Filtering and averaging (spatial and temporal)
Shape analysis
Statistical moments (conditional)
Statistical dimensionality reduction (joint PDFs)
Spectra (scalar, velocity, coherency)
Flame-centric control volume analysis

7 We Build Reduced Topological Models for Characterization and Tracking of Combustion Features
[Diagram: domain hierarchy; feature-tracking events between t and t + Δt: birth, death, split, continuation]

8 Merge Trees Represent Feature Extraction at Different Scales with Thresholds for Noise Removal

9 Visualization and Analysis Bottlenecks May Differ from Simulation Bottlenecks
Typically I/O bound: limited by the rate at which data can be accessed
Memory layout may significantly impact efficiency, beyond the traditional cache effects seen in solvers
FLOP vs. branch ratios can vary dramatically
Algorithms with many branches are highly data-dependent:
- Feature density: how much of the data is relevant
- Feature distribution: how well the workload can be balanced
It is difficult to reliably predict generally expected behavior

10 Execution Models and Data Movements Depend on Different Flavors of Hybrid Staging
In situ:
- Co-located: complete resource sharing/contention with the simulation
- Partial sharing: different processor/core, less resource contention
- Out of band: on node with minimal resource usage (e.g. use of out-of-core techniques, low priority)
In transit:
- Local: sending data to different nodes of the same machine
- Remote: sending data to a separate machine

11 The Codesign Process Provides a Unique Opportunity to Study the Algorithmic Design Space

12 Multi-scale Design Patterns and Execution Models Lead to Local and Global Algorithm Design Parameters
Global design parameters include:
- Number of execution units
- Data aggregation patterns
- N-to-1 communication patterns
- Use of pair-wise data exchange schemes
Local design parameters are algorithm-specific:
- Sort first, filter last
- Filter first, sort last
- Filter first, traverse last

13 We Can Project Behaviors for Different Design Parameters and Execution Models onto Prospective Hardware Configurations
SST and spreadsheet models provide a projection of communication and wall time
Prospective configurations:
- PIM (processing in memory, cache-less architecture)
- exanode1 (commodity NIC and memory)
- exanode2 (custom on-board NIC + faster memory)
ExaCT tools: performance analysis spreadsheet

14 We Are Exploring 3 Use Cases that Cover a Range of Characteristic Behaviors

                                   Statistics               Visualization            Topology
Local compute behavior             Two phases: all FLOPs    Two phases: some FLOPs   Three phases: 2 FLOP-free, 1 FLOP-heavy
Complexity is data-dependent?      No                       Yes                      Yes
Data transferred for agg./gather   Constant (small)         Can be data-dependent    Data-dependent
Scatter required?                  Sometimes (small data)   No                       Yes (small data)

15 Topology Algorithm
Computation of local features:
- Process vertices in sorted order, detecting the joining of iso-surface components
- Uses a Union-Find data structure
Communication to resolve features spanning multiple blocks:
- Local merge trees are communicated in N-to-1 merges
- Corrections to local trees are re-broadcast to local compute nodes
Feature-based statistics computation:
- The segmentation stored with the corrected local merge trees is used to compute per-feature statistics (averages, size, shape)
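The local pass above can be sketched in a few lines: vertices are visited in sorted (descending) order and a union-find structure detects when superlevel-set components join. The 1D field and neighbor stencil below are illustrative assumptions, not the proxy app's actual code; zero-persistence pairs produced by regular vertices are exactly the noise that the threshold-based simplification on the merge-tree slide would remove.

```python
def find(parent, i):
    # Path-halving find.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def merge_tree_1d(values):
    """Return (birth_value, join_value) pairs for superlevel-set components."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    parent, birth, joins = {}, {}, []
    for i in order:
        parent[i] = i
        birth[i] = values[i]          # a new component is born at this vertex
        for j in (i - 1, i + 1):      # 1D neighbor stencil (assumption)
            if j in parent:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    # Two components meet: the younger one merges into the older.
                    younger = ri if birth[ri] < birth[rj] else rj
                    survivor = rj if younger is ri else ri
                    joins.append((birth[younger], values[i]))
                    parent[younger] = survivor
    return joins

print(merge_tree_1d([0, 3, 1, 4, 0]))  # peaks born at 3 and 4 join at value 1
```

The same union-find pass generalizes to 3D blocks by replacing the neighbor stencil; the resulting local trees are then merged N-to-1 as described above.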

16 Basic Topology Measurements Obtained with Byfl
Basic analysis for on-node computation (data size 560x560x560)
[Table: num cores, num points, points per core, total loads (MB), total stores (MB), total FLOPs, loads/core (MB)]

17 Topology Design Space Exploration
Computation of local features, alternatives (advantage / disadvantage):
- Sort-first: efficient (possible GPU) / O(n) memory for sorted indices
- Progressive-sort: smaller memory footprint / extra pass over data
- Union-Find: efficient / O(n) Union-Find data structure
- Streaming computation: re-use of memory for tree / significantly slower
Communication alternatives:
- N-to-1 gather then 1-to-N scatter: simple comm. model / higher latency
- N-to-1 gather interleaved with 1-to-N scatter: less idle time on compute node / requires interleaved compute/comm, streaming compute, asynchronous comm
- N-to-1 gather interleaved with 1-to-leaves scatter: less idle time on compute node
Feature-based statistics alternatives:
- Wait for corrections, then compute: simple / higher latency
- Compute-and-correct: lower latency / re-do some work, more complex communicators

18 Example Execution Model 1: In Situ Data Transfers on Node or on Network
Topology computation directly integrated with the solver; K rounds of merges on in situ processes
Factors analyzed:
- Number of nodes
- Number of cores per node
- Number of merge operations per stage
- Initial data access through a shared pointer
Free parameters:
- Communicate on network first
- Communicate on network last
- Size of messages
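The "network first vs. network last" free parameter can be illustrated with a toy schedule model (not ExaCT's SST or spreadsheet model): assuming ranks are packed consecutively onto nodes and that binary-merge stage s pairs survivors at stride 2^s, each stage's merges are either on-node (shared memory) or off-node (network), and reordering stages moves network traffic earlier or later.

```python
def merge_schedule(n_nodes, cores_per_node):
    """For each binary-merge stage, count on-node vs. network merges."""
    total = n_nodes * cores_per_node
    rows, stage, stride = [], 0, 1
    while stride < total:
        on_node = network = 0
        # Survivors at this stage sit at multiples of 2*stride.
        for r in range(0, total, 2 * stride):
            partner = r + stride
            if r // cores_per_node == partner // cores_per_node:
                on_node += 1      # same node: shared-memory merge
            else:
                network += 1      # different node: network message
        rows.append((stage, on_node, network))
        stage += 1
        stride *= 2
    return rows

# 8 nodes x 4 cores: early stages stay on-node, later stages hit the network.
for row in merge_schedule(8, 4):
    print(row)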

19 Communication Loads for Different Data Transfer Patterns
[Charts: message size and message count vs. merge stages for 2-, 4-, and 8-cores-per-node configurations; data size 2025x1600x400, Kay=0.31, binary merges]

20 Communication Loads for Different Data Transfer Patterns
[Charts: message size and message count vs. merge stages for 2-, 4-, and 8-cores-per-node configurations; data size 2025x1600x400, Kay=0.31, 8-way merges]

21 Example Execution Model 2: Part In Situ Plus Part In Transit
Initial local compute directly integrated with the solver; K rounds of merges on in situ processes; (N-K) rounds of merging in the staging area
Predicted cost factors:
- Initial data access through a shared pointer
- K rounds of merges in blocking mode as part of the solver code, including shared-memory (on-node) and MPI-message (off-node) communication
- Data transfer to the staging area
- Asynchronous computation in the staging area
Free parameters: merging strategy and staging-area break
Things to watch out for:
- Initial data transfer might pollute the cache
- The in situ merge becomes sparse quickly

22 Communication Loads for Different Data Transfer Patterns
[Charts: total communication vs. merge stages for 2-, 4-, and 8-cores-per-node configurations, binary merge vs. 8-way merge; data size 2025x1600x400, Kay=]

23 Statistics Algorithm
1st-4th order moments (variance, skewness, kurtosis) along with minimum and maximum values are commonly computed by physics codes
Pair-wise update formulas for the 1st-4th order moments allow a single-pass distributed implementation: given moment(A) and moment(B), compute moment(A U B)
The global model can optionally be scattered to all processes to allow assessment of observations (e.g. to determine outliers)
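The pair-wise update idea can be sketched with the well-known one-pass merge formulas (Chan et al. / Pébay). Shown only up to the second moment plus min/max to stay short; the 3rd- and 4th-order updates follow the same pair-wise pattern. Function and field names are illustrative, not the production API.

```python
def local_moments(xs):
    """Moments of one local block: count, mean, sum of squared deviations, extrema."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return {"n": n, "mean": mean, "m2": m2, "min": min(xs), "max": max(xs)}

def merge(a, b):
    """moment(A) and moment(B) -> moment(A u B), for disjoint sets A and B."""
    n = a["n"] + b["n"]
    delta = b["mean"] - a["mean"]
    mean = a["mean"] + delta * b["n"] / n
    m2 = a["m2"] + b["m2"] + delta ** 2 * a["n"] * b["n"] / n
    return {"n": n, "mean": mean, "m2": m2,
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

# Merging two blocks reproduces the whole-data statistics exactly.
data = list(range(10))
merged = merge(local_moments(data[:4]), local_moments(data[4:]))
assert abs(merged["mean"] - 4.5) < 1e-12
assert abs(merged["m2"] / merged["n"] - 8.25) < 1e-12  # population variance of 0..9
```

Because `merge` is associative, any gather-tree depth or width works, which is exactly the communication-pattern flexibility the next slide relies on.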

24 Statistics Design Space Exploration
The algorithm is mostly FLOPs, is not data-dependent, and requires only small amounts of data to be communicated
Small-scale, algorithm-specific design parameters:
- None: local computations are a straightforward implementation of the update formulas
Large-scale design parameters:
- Update formulas provide complete flexibility in communication patterns
- Support arbitrary depth/width of the compute tree
Execution model:
- The initial local compute level is a good candidate for in situ (all data must be transferred otherwise)
- Later local compute levels could be placed anywhere (only very small data sizes are transferred: moments, minima, and maxima)

25 Measurements Obtained with Byfl Confirm the Data-Parallel, Scalable Nature of Statistics Algorithms
[Plot: ALU and FLOP operations per core vs. points per core for the LEJ and HCCI data sets]
[Table: data set, dimensions, num cores, points/core, loads/core (MB), stores/core (B), FLOPs/core, ALU ops/core, and mem ops/core for LEJ (2,025x1,600x400) and HCCI runs at increasing core counts]
Data movement options: all-gather, or gather of 3 KB per processor

26 Visualization Algorithm
Volume rendering of local data to generate partial images:
- Cast a ray from the eye through each pixel of an image
- For each ray, sample the local volume, map data into color values via a transfer function, and accumulate color values
Parallel image compositing to combine the partial images into a final global image:
- Build a communication schedule according to the distribution of pixel data
- Exchange pixel data via communication
- Blend the exchanged pixel data
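The per-ray loop above can be sketched as front-to-back accumulation with the standard "over" operator. The scalar transfer function and the sample values here are stand-ins for the real multi-variate data; real renderers use RGBA colors and opacity correction, but the accumulation pattern is the same.

```python
def transfer_function(v):
    # Hypothetical map: scalar value -> (color intensity, opacity).
    return (v, min(1.0, 0.5 * v))

def cast_ray(samples):
    """Accumulate color/opacity along one ray, front to back."""
    color, alpha = 0.0, 0.0
    for v in samples:                     # samples along the ray, front to back
        c, a = transfer_function(v)
        color += (1.0 - alpha) * a * c    # "over" compositing
        alpha += (1.0 - alpha) * a
        if alpha >= 0.99:                 # early ray termination
            break
    return color, alpha

# A nearly opaque sample near the front dominates the accumulated color.
print(cast_ray([0.2, 0.8, 0.1]))
```

Compositing the partial images across processes applies the same "over" blend per pixel, in depth order, which is why the communication schedule on this slide must respect the data distribution.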

27 Visualization Design Space Exploration
The algorithm is only marginally FLOP-heavy, can be data-dependent, and requires a potentially large number of messages to be exchanged
Small-scale, algorithm-specific design parameters:
- Adaptive workload of local rendering: features specified by users in transfer-function space, features identified by analysis algorithms, data resolution for different exploration purposes
- Optimization and acceleration on CPUs and/or GPUs
Large-scale design parameters:
- Optimize the communication schedule according to the pixel data distribution
- Minimize link contention, pixel data exchanged, and blending cost
- Exploration using MPI and/or OpenMP
Execution model:
- Local rendering of simulation data could be performed in situ to minimize data movement
- Local rendering of feature data could be placed anywhere

28 Visualization Analysis Results
Measurements obtained with Byfl confirm the data-parallel, nearly scalable nature of local rendering. The number of operations varies marginally across cores depending on data features and rendering parameters.
[Table, middle image quality: data set, dimensions, num cores, points/core, loads/core (MB), stores/core (B), FLOPs/core, ALU ops/core, and mem ops/core for the LEJ and HCCI data sets, with per-core variations of roughly 1-6%]
For un-optimized image compositing, the size of messages exchanged across cores depends on the image resolution. For the same image resolution, the message size is nearly the same for different numbers of cores. The messages can be reduced via optimization (measured at high image quality).

29 Successful Execution Hinges on Tight Integration with All the Co-design Components
Separate proxy apps for data analysis and visualization:
- Integration to understand combined behavior
- Further reduction to build skeletal apps
Coordination with data management:
- Efficient data transfer strategies
- Where are the biggest reserves in performance, energy, wall time?
Coordination with DSL:
- Improve local compute patterns
- Fast index computation
- Elimination of unnecessary branches (boundary cases)
UQ analysis:
- How much persistent memory is required?
Modeling capabilities:
- Compilers (Byfl, ROSE)
- Simulation (SST, spreadsheet model)
Solvers:
- Study tradeoffs and cache pollution effects (possible sharing of data structures)

30 Uncertainty Quantification within the SDMA workflow

31 UQ and Data Management
The problem: evaluate sensitivities of quantities of interest (QoIs) with respect to chemistry model parameters, or modeled fields (e.g. reaction rate fields)
The classical approach to solving this problem requires solving P+1 forward simulations, where P is the number of sensitivity evaluations
P can be very large (>> 1000), which makes the classical approach infeasible
Instead, solve one auxiliary problem, the adjoint problem, which is linearized about the primal solution
The primal (forward) solution is needed to solve the adjoint problem

32 Solving the Adjoint Problem
The challenge: solving the adjoint problem requires the primal state
The adjoint problem must be solved backwards in time
The primal state must be stored at every time substep!

33 More Sophisticated Adjoint Solution
To reduce storage requirements:
- Store a limited number of primal states (checkpoints)
- Use the checkpoints to recompute the primal state when needed
Example with two checkpoints:
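The checkpoint/recompute scheme can be sketched as follows: the forward sweep keeps only checkpointed states, and the backward adjoint sweep recomputes each segment forward from the nearest checkpoint. The `step`/`adjoint_step` kernels are placeholder dynamics standing in for the real solver; the data-flow pattern is the point.

```python
def step(u):
    # Placeholder primal time step (stand-in for the solver kernel).
    return u * 0.9 + 1.0

def adjoint_step(lam, u):
    # Placeholder adjoint step, linearized about the primal state u.
    return lam * 0.9 + u

def solve_with_checkpoints(u0, n_steps, checkpoint_every):
    # Forward sweep: keep only checkpointed primal states.
    checkpoints = {0: u0}
    u = u0
    for t in range(n_steps):
        u = step(u)
        if (t + 1) % checkpoint_every == 0:
            checkpoints[t + 1] = u

    # Backward sweep: recompute each segment from its checkpoint as needed.
    lam = 0.0
    for t in reversed(range(n_steps)):
        base = (t // checkpoint_every) * checkpoint_every
        u = checkpoints[base]
        for _ in range(t - base):      # recompute primal state at time t
            u = step(u)
        lam = adjoint_step(lam, u)
    return lam
```

Storage drops from n_steps states to n_steps/checkpoint_every, at the cost of re-running forward segments, which is exactly the storage-vs-recompute tradeoff quantified on the combustion slide.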

34 Storing the full primal solution state is prohibitive (e.g. 1 PB/state)
We are only interested in sensitivities in a limited region in space & time (RoI, e.g. an extinction event)
RoIs are not known a priori:
- Solve the primal problem & identify RoIs using in situ analysis
- Re-solve the primal problem, checkpointing state only from the RoIs
- Only solve the adjoint problem in the RoIs
Example with one region of interest

35 Example from Combustion Use Case
Naïve adjoint solution (store full state in space & time): storage 5 ZB; compute 2x primal problem
One-level checkpointing: storage 4 EB; compute 3x primal problem
Six-level checkpointing: storage 50 PB; compute 8x primal problem
One-level checkpointing on N RoIs: storage (1+1.3N) PB; compute ( N) x primal problem
Important trade-off between storage and recomputing
A proxy app capturing the adjoint solution data flow is being developed

36 Data Management

37 Goals for SDMA in ExaCT
1. Explore data staging techniques to deal with exascale data
2. Design questions for data staging:
   - Where should data for A&V be stored?
   - Where should the A&V operations be executed?
   - How should SDMA integrate with the solver?
   - What architectural features should be leveraged for SDMA?
3. Evaluate the design space

38 The Meta-Skeleton

39 Design Space
[Diagram: solver proxy feeding descriptive stats, visualization, and topological analysis]
Data questions:
1. Where do we move the data to?
2. How do we extract data from the solver?
3. What hardware features can be exploited?
Execution questions:
1. What processing resources are allocated?
2. How do we schedule the execution of these tasks?

40 Storage Scalability and the Power Wall
Disk sizes of TB
Single-disk bandwidth of MB/s
Power consumption of ~45 W/disk
System memory of 32 PB [3]
A full checkpoint every hour => TB/s I/O bandwidth => 277,633 disks => 13 MW of power [1,2]
Without even considering RAID!
1. Power use of disk subsystems in supercomputers. M. L. Curry, H. L. Ward, G. Grider, J. Gemmill, J. Harris, and D. Martinez. Proceedings of the Sixth Workshop on Parallel Data Storage, Nov.
2. G. Grider. Exa-scale FSIO. HEC-FSIO workshop presentation, August 2010.
3. R. Stevens and A. White. A DOE laboratory plan for providing exascale applications and technologies for critical DOE mission needs. SciDAC Workshop, July 2010.
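The disk-count and power figures on this slide can be re-derived. The per-disk sustained bandwidth below is an assumption chosen to approximately reproduce the slide's 277,633-disk figure; the 32 PB system memory and 45 W/disk numbers are the slide's own.

```python
memory_bytes   = 32e15    # 32 PB system memory (slide's figure)
checkpoint_s   = 3600     # one full checkpoint every hour
disk_bw        = 32e6     # assumed ~32 MB/s sustained bandwidth per disk
watts_per_disk = 45       # slide's figure

io_bandwidth = memory_bytes / checkpoint_s      # bytes/s required
disks = io_bandwidth / disk_bw                  # disks needed to sustain it
power_mw = disks * watts_per_disk / 1e6         # aggregate disk power

print(f"{io_bandwidth/1e12:.1f} TB/s, {disks:,.0f} disks, {power_mw:.1f} MW")
```

This yields roughly 8.9 TB/s, ~278k disks, and ~12.5 MW, consistent with the slide's "277,633 disks => 13 MW" conclusion.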

41 Synchronous I/O is Not the Solution
S3D simulation: O(400 PB)/run, O(1M) cores, 1 PB/dump every 30 minutes
Storage space requirements:
- 35 disks for each dump (no RAID)
- 1.5 kW per live dump
Performance requirements:
- 5% overhead: ~31k disks, >1.4 MW
- 10% overhead: ~15k disks, >0.65 MW
- 50% overhead: ~3k disks, >0.14 MW
[Diagram: synchronous I/O feeding downstream analysis: MS-Complex, visualization (volume, surface, particle rendering), Isomap]

42 What EXACTly is Staging?
Extra stage(s) in the data pipeline:
- Use available memory resources to serve as a buffer
- Use available compute resources to serve as an execution target
Traditional staging used discrete nodes, used for:
- High-performance I/O, managing storage variability
- Application coupling
- In transit workflows

43 Keep Data in a Shared Data Space
Maintain data in a shared space in staging
The shared space can share the same memory as the simulation
Multiple analysis and visualization services can access the data

44 Managing Data Movement
Data movement is expected to be a bottleneck at exascale
Use flexible resource allocation to optimize data movement

45 Hybrid Staging
[Diagram: multiple S3D-Box processes, each with in situ analysis and visualization, connected through ADIOS and asynchronous data transfer to a parallel data staging area (coupling/analytics/viz) and to in transit analysis and visualization; compute cores run statistics and topology, in transit resources run visualization]
Use compute and deep-memory hierarchies to optimize the overall workflow for power vs. performance tradeoffs
Utilize hybrid staging for analytics and visualization
Abstract complex/deep memory hierarchy access

46 Hybrid Staging
Hybrid staging is a combination of the available solutions
Classify data processing actions into:
- In line / in situ
- Asynchronous / in situ
- Asynchronous / in staging
- Asynchronous / on disk
What about tasks that span multiple classes?
- Partition the algorithms
- Place tasks to take advantage of data management

47 Resources in Hybrid Staging
Placement of analysis and visualization tasks in a complex system:
- Leverage fast/slow DRAM and local NVRAM/SSD
- Impact of network data movement compared to memory movement
- Network topology impact on performance and power

48 Tradeoffs for Hybrid Staging
Going to disk is slow even for small application sizes
The in line approach adds more overhead to application runtime
The in transit approach gives better overall performance, at the additional cost of data movement
[Chart: normalized total CPU seconds per approach]
- Offline: process data after writing to disk
- In line: process data in place, synchronously with the application
- Staging: move data to staging resources for processing

49 Impact of Task Mapping
Data-centric task mapping: significant savings in the amount of data transferred by co-locating data producers and consumers
[Diagram: timelines for concurrent coupling (CAP1: 512, CAP2: 64 cores) and sequential coupling (SAP1: 512, SAP2: 128, SAP3: 384 cores)]

50 Complex Memory Hierarchy
The workflow integrates knowledge of complex memory hierarchies
Placement decisions are important factors for evaluation
Local NVRAM must be leveraged
Fast-small memory vs. slow-big memory

51

52 Impact of NVRAM
Study how a deep memory hierarchy can be used for an end-to-end I/O analytic pipeline:
- How can NVRAM be used as a staging area?
- How much of each level of the memory hierarchy should be used for the staging area?
- Where to move data (RAM, NVRAM, SSD, disk, network)?
- When (and how frequently) to move the data over the hierarchy?

53 Tradeoff between Frequency and Costs
The frequency of analysis impacts the energy and performance of analysis
NVRAM/disk gap
Not a linear function
Sweet spot is case-dependent
Experiments in collaboration with Steve Poole, ORNL

54 Asynchronous Workflow Impact
The frequency of analysis impacts the energy and performance of analysis
Not a linear function
Bitter spot is case-dependent

55 NVRAM for C/R
Optimizing by hiding data movement to NVM
[Chart: execution time vs. NVM bandwidth per core (MB): run time (sec) and optimized run time (sec)]
NVM per-core data copy bandwidth was assumed to be 450 MB/sec (compared to 2 GB device B/W). The experiment was conducted on a 12-core GHz Intel Xeon node with 48 GB DDR3 memory; to emulate NVM, 24 GB of memory was used.
Experiments in collaboration with HP

56 Deep Memory Tradeoffs
Driving synthetic application benchmark:
- Generates data: allocates two matrices in DRAM and fills them with random data
- Operates on the generated data: runs a kernel over the generated matrices, either multiplication (MUL), addition (ADD), or read access (NOP)
- Manages the generated data: keeps it in RAM or in a staging area (local Fusion-io, remote Fusion-io, HDD, SSD, etc.)
- Asynchronously runs data analysis (reads data from the staging area)
Quality of solutions:
- Frequency of data analysis
- Resources for analysis (accuracy)
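The benchmark's structure can be sketched with stdlib pieces: a producer generates data, runs a kernel, and pushes selected results into a bounded "staging area" (a queue standing in for NVRAM/SSD staging), while an analysis thread drains it asynchronously. Names, sizes, and the toy statistic are illustrative assumptions, not the actual benchmark code.

```python
import queue
import random
import threading

def kernel(a, b, op):
    # Kernel choices mirror the slide's MUL/ADD/NOP options.
    if op == "MUL":
        return [x * y for x, y in zip(a, b)]
    if op == "ADD":
        return [x + y for x, y in zip(a, b)]
    return a                              # NOP: read access only

def run_benchmark(n_steps=8, size=64, op="ADD", analyze_every=2):
    staging = queue.Queue(maxsize=4)      # bounded buffer = staging capacity
    results = []

    def analysis():                       # asynchronous consumer
        while True:
            data = staging.get()
            if data is None:              # sentinel: producer is done
                return
            results.append(sum(data) / len(data))   # a toy statistic

    t = threading.Thread(target=analysis)
    t.start()
    for step in range(n_steps):
        a = [random.random() for _ in range(size)]
        b = [random.random() for _ in range(size)]
        out = kernel(a, b, op)
        if step % analyze_every == 0:     # frequency of analysis is the knob
            staging.put(out)              # blocks when staging is full
    staging.put(None)
    t.join()
    return results

print(len(run_benchmark()))  # 4 analysis outputs: every 2nd of 8 steps staged
```

The `analyze_every` and `maxsize` parameters correspond directly to the frequency-of-analysis and staging-capacity tradeoffs discussed on the surrounding slides.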

57 Co-Design Decisions for Complex Memory
Impact of using slow memory for SDMA processing:
- Power vs. performance tradeoffs
- Size of memory vs. speed of memory
Using a combination of fast memory and slow memory:
- Fast memory size and speed compared to slow memory size
Managing performance at the application level:
- Trade off frequency of analysis against memory usage
- Use a combination of asynchronous and synchronous computation
- Use knowledge of the workflow to study tradeoffs (write vs. read ratios)

58 Dealing with UQ
Our next big target for data management:
- Use analysis to select only the RoI
- Use the feature detection algorithms and their optimization
An ideal candidate for deep memory:
- Data is not used for a long time after output
- Data access is regular and predictable
- Move data to fat/slow memory

59 The Proxy App
[Diagram: original application -> tracing tools -> tracing records -> pattern analyzer -> SKEL code generator (Skel xml files for I/O and the communication phase, plus user's changes) -> desired benchmark]
Skel-2.0: a child of Skel that creates proxies for synchronous I/O
Magically generates a workflow through traced patterns
Under development

60 X-Stack Interactions
Increase engagement with X-Stack projects:
- Dynamic task scheduling: resource allocation, memory and task dependencies
Bring Fast Forwards into the conversation:
- Storage
- Complex memory hierarchies: DRAM, NVRAM, scratchpads, SSDs
- New processing elements: GPUs, MICs

61 Outstanding Questions
GPU/accelerators for SDMA tasks?
- Staging nodes can be customized with additional resources
- Analysis tasks can be split similarly to in situ/in transit
Integration with the solver:
- Data extraction impact on solver performance
- Inline processing can impact the cache; data copies pollute the cache
Exploring asynchronicity in algorithms

62

63 What about UQ?
[Diagram: solver -> store solution -> compute adjoint]
Feed the adjoint computation in reverse order
Need to store the ENTIRE data set

64 UQ v2.0
[Diagram: solver -> store solution -> compute adjoint]
Store a smaller subset of the solution: u(t), from t_i to t_(i-1)
Recompute u(t)

65 UQ v3.0
[Diagram: solver -> filter domains -> store solution -> compute adjoint]
Identify the interesting events
Store an even smaller subset of the solution: u(t), from t_i to t_(i-1)
Recompute u(t)
Compute the adjoint

66


More information

Introduction to Dataflow Computing

Introduction to Dataflow Computing Introduction to Dataflow Computing Maxeler Dataflow Computing Workshop STFC Hartree Centre, June 2013 Programmable Spectrum Control-flow processors Dataflow processor GK110 Single-Core CPU Multi-Core Several-Cores

More information

Accelerating Server Storage Performance on Lenovo ThinkServer

Accelerating Server Storage Performance on Lenovo ThinkServer Accelerating Server Storage Performance on Lenovo ThinkServer Lenovo Enterprise Product Group April 214 Copyright Lenovo 214 LENOVO PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER

More information

The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000

The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000 The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000 Summary: This document describes how to analyze performance on an IBM Storwize V7000. IntelliMagic 2012 Page 1 This

More information

Everything you need to know about flash storage performance

Everything you need to know about flash storage performance Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices

More information

The Design and Implement of Ultra-scale Data Parallel. In-situ Visualization System

The Design and Implement of Ultra-scale Data Parallel. In-situ Visualization System The Design and Implement of Ultra-scale Data Parallel In-situ Visualization System Liu Ning liuning01@ict.ac.cn Gao Guoxian gaoguoxian@ict.ac.cn Zhang Yingping zhangyingping@ict.ac.cn Zhu Dengming mdzhu@ict.ac.cn

More information

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps Yu Su, Yi Wang, Gagan Agrawal The Ohio State University Motivation HPC Trends Huge performance gap CPU: extremely fast for generating

More information

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC Alan Gara Intel Fellow Exascale Chief Architect Legal Disclaimer Today s presentations contain forward-looking

More information

QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE

QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE QLIKVIEW ARCHITECTURE AND SYSTEM RESOURCE USAGE QlikView Technical Brief April 2011 www.qlikview.com Introduction This technical brief covers an overview of the QlikView product components and architecture

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC

More information

Power-Aware High-Performance Scientific Computing

Power-Aware High-Performance Scientific Computing Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

NV-DIMM: Fastest Tier in Your Storage Strategy

NV-DIMM: Fastest Tier in Your Storage Strategy NV-DIMM: Fastest Tier in Your Storage Strategy Introducing ArxCis-NV, a Non-Volatile DIMM Author: Adrian Proctor, Viking Technology [email: adrian.proctor@vikingtechnology.com] This paper reviews how Non-Volatile

More information

LSI MegaRAID CacheCade Performance Evaluation in a Web Server Environment

LSI MegaRAID CacheCade Performance Evaluation in a Web Server Environment LSI MegaRAID CacheCade Performance Evaluation in a Web Server Environment Evaluation report prepared under contract with LSI Corporation Introduction Interest in solid-state storage (SSS) is high, and

More information

Challenges to Obtaining Good Parallel Processing Performance

Challenges to Obtaining Good Parallel Processing Performance Outline: Challenges to Obtaining Good Parallel Processing Performance Coverage: The Parallel Processing Challenge of Finding Enough Parallelism Amdahl s Law: o The parallel speedup of any program is limited

More information

MS SQL Performance (Tuning) Best Practices:

MS SQL Performance (Tuning) Best Practices: MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware

More information

Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information

Performance Monitoring of Parallel Scientific Applications

Performance Monitoring of Parallel Scientific Applications Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Maximizing SQL Server Virtualization Performance

Maximizing SQL Server Virtualization Performance Maximizing SQL Server Virtualization Performance Michael Otey Senior Technical Director Windows IT Pro SQL Server Pro 1 What this presentation covers Host configuration guidelines CPU, RAM, networking

More information

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer Disks and RAID Profs. Bracy and Van Renesse based on slides by Prof. Sirer 50 Years Old! 13th September 1956 The IBM RAMAC 350 Stored less than 5 MByte Reading from a Disk Must specify: cylinder # (distance

More information

Beyond Embarrassingly Parallel Big Data. William Gropp www.cs.illinois.edu/~wgropp

Beyond Embarrassingly Parallel Big Data. William Gropp www.cs.illinois.edu/~wgropp Beyond Embarrassingly Parallel Big Data William Gropp www.cs.illinois.edu/~wgropp Messages Big is big Data driven is an important area, but not all data driven problems are big data (despite current hype).

More information

Boost Database Performance with the Cisco UCS Storage Accelerator

Boost Database Performance with the Cisco UCS Storage Accelerator Boost Database Performance with the Cisco UCS Storage Accelerator Performance Brief February 213 Highlights Industry-leading Performance and Scalability Offloading full or partial database structures to

More information

Speeding Up Cloud/Server Applications Using Flash Memory

Speeding Up Cloud/Server Applications Using Flash Memory Speeding Up Cloud/Server Applications Using Flash Memory Sudipta Sengupta Microsoft Research, Redmond, WA, USA Contains work that is joint with B. Debnath (Univ. of Minnesota) and J. Li (Microsoft Research,

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

Full and Para Virtualization

Full and Para Virtualization Full and Para Virtualization Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF x86 Hardware Virtualization The x86 architecture offers four levels

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Enterprise Applications

Enterprise Applications Enterprise Applications Chi Ho Yue Sorav Bansal Shivnath Babu Amin Firoozshahian EE392C Emerging Applications Study Spring 2003 Functionality Online Transaction Processing (OLTP) Users/apps interacting

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

Optimizing the Performance of Your Longview Application

Optimizing the Performance of Your Longview Application Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC Agenda Quick Overview of Impala Design Challenges of an Impala Deployment Case Study: Use Simulation-Based Approach to Design

More information

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Table of Contents Introduction.... 3 vsphere Architectural Overview... 4 SAN Backup

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

A Close Look at PCI Express SSDs. Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011

A Close Look at PCI Express SSDs. Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011 A Close Look at PCI Express SSDs Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011 Macro Datacenter Trends Key driver: Information Processing Data Footprint (PB) CAGR: 100%

More information

itransformer: Using SSD to Improve Disk

itransformer: Using SSD to Improve Disk itransformer: Using SSD to Improve Disk Scheduling for High performance I/O Xuechen Zhang Song Jiang Wayne State t University it Kei Davis Los Alamos National Laboratory Challenges of data management using

More information

The Fusion of Supercomputing and Big Data. Peter Ungaro President & CEO

The Fusion of Supercomputing and Big Data. Peter Ungaro President & CEO The Fusion of Supercomputing and Big Data Peter Ungaro President & CEO The Supercomputing Company Supercomputing Big Data Because some great things never change One other thing that hasn t changed. Cray

More information

OpenMP Programming on ScaleMP

OpenMP Programming on ScaleMP OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign

More information

A Deduplication File System & Course Review

A Deduplication File System & Course Review A Deduplication File System & Course Review Kai Li 12/13/12 Topics A Deduplication File System Review 12/13/12 2 Traditional Data Center Storage Hierarchy Clients Network Server SAN Storage Remote mirror

More information

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

Intel Data Direct I/O Technology (Intel DDIO): A Primer > Intel Data Direct I/O Technology (Intel DDIO): A Primer > Technical Brief February 2012 Revision 1.0 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2 Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...

More information

Low-Power Amdahl-Balanced Blades for Data-Intensive Computing

Low-Power Amdahl-Balanced Blades for Data-Intensive Computing Thanks to NVIDIA, Microsoft External Research, NSF, Moore Foundation, OCZ Technology Low-Power Amdahl-Balanced Blades for Data-Intensive Computing Alex Szalay, Andreas Terzis, Alainna White, Howie Huang,

More information

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

Interactive Level-Set Deformation On the GPU

Interactive Level-Set Deformation On the GPU Interactive Level-Set Deformation On the GPU Institute for Data Analysis and Visualization University of California, Davis Problem Statement Goal Interactive system for deformable surface manipulation

More information

The IntelliMagic White Paper on: Storage Performance Analysis for an IBM San Volume Controller (SVC) (IBM V7000)

The IntelliMagic White Paper on: Storage Performance Analysis for an IBM San Volume Controller (SVC) (IBM V7000) The IntelliMagic White Paper on: Storage Performance Analysis for an IBM San Volume Controller (SVC) (IBM V7000) IntelliMagic, Inc. 558 Silicon Drive Ste 101 Southlake, Texas 76092 USA Tel: 214-432-7920

More information

Scaling from Datacenter to Client

Scaling from Datacenter to Client Scaling from Datacenter to Client KeunSoo Jo Sr. Manager Memory Product Planning Samsung Semiconductor Audio-Visual Sponsor Outline SSD Market Overview & Trends - Enterprise What brought us to NVMe Technology

More information

Hybrid parallelism for Weather Research and Forecasting Model on Intel platforms (performance evaluation)

Hybrid parallelism for Weather Research and Forecasting Model on Intel platforms (performance evaluation) Hybrid parallelism for Weather Research and Forecasting Model on Intel platforms (performance evaluation) Roman Dubtsov*, Mark Lubin, Alexander Semenov {roman.s.dubtsov,mark.lubin,alexander.l.semenov}@intel.com

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

EMC XTREMIO EXECUTIVE OVERVIEW

EMC XTREMIO EXECUTIVE OVERVIEW EMC XTREMIO EXECUTIVE OVERVIEW COMPANY BACKGROUND XtremIO develops enterprise data storage systems based completely on random access media such as flash solid-state drives (SSDs). By leveraging the underlying

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

Performance and scalability of a large OLTP workload

Performance and scalability of a large OLTP workload Performance and scalability of a large OLTP workload ii Performance and scalability of a large OLTP workload Contents Performance and scalability of a large OLTP workload with DB2 9 for System z on Linux..............

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

More information

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are

More information

ioscale: The Holy Grail for Hyperscale

ioscale: The Holy Grail for Hyperscale ioscale: The Holy Grail for Hyperscale The New World of Hyperscale Hyperscale describes new cloud computing deployments where hundreds or thousands of distributed servers support millions of remote, often

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

Parallel Processing and Software Performance. Lukáš Marek

Parallel Processing and Software Performance. Lukáš Marek Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel

More information

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications

More information

ICRI-CI Retreat Architecture track

ICRI-CI Retreat Architecture track ICRI-CI Retreat Architecture track Uri Weiser June 5 th 2015 - Funnel: Memory Traffic Reduction for Big Data & Machine Learning (Uri) - Accelerators for Big Data & Machine Learning (Ran) - Machine Learning

More information

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010 Flash Memory Arrays Enabling the Virtualized Data Center July 2010 2 Flash Memory Arrays Enabling the Virtualized Data Center This White Paper describes a new product category, the flash Memory Array,

More information

PowerVault MD3 SSD Cache Overview. White Paper

PowerVault MD3 SSD Cache Overview. White Paper PowerVault MD3 SSD Cache Overview White Paper 2012 Dell Inc. All Rights Reserved. PowerVault is a trademark of Dell Inc. 2 Dell PowerVault MD3 SSD Cache Overview Table of contents 1 Overview... 4 2 Architecture...

More information

VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS

VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli Department of Electrical and Computer Engineering Northeastern University,

More information