Application-Specific Architecture: Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning. Lisa Wu, R.J. Barker, Martha Kim, and Ken Ross (Columbia University). Presented by Xiaowei Wang and Rui Chen.
Outline 1. Introduction 2. Background & Motivation 3. HARP System 4. Evaluation 5. Conclusion 6. Discussion
Data Partitioning
Current Situation Data is growing very fast: almost 10^18 bytes per day as of 2013 (even more today). Databases such as Oracle, DB2, and SQL Server support partitioning to speed up subsequent processing. Partitioning enables partitions to be processed independently and more efficiently (parallelism, cache locality).
Software-Based Partitioning Hash: a multiplicative hash of each record's key determines its partition. Direct: like hash partitioning, but eliminates the hashing cost by treating the key itself as the hash value (e.g., city name). Range: equality range partitioning (e.g., date); software-based range partitioning is the comparison baseline in this paper. Software implementations of data partitioning have fundamental performance limitations that make them compute-bound, even after parallelization.
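A minimal sketch of software range partitioning (the baseline the paper compares against), assuming in-memory records as (key, payload) tuples; the per-record splitter search is where the compute-bound cost comes from:

```python
import bisect

def range_partition(records, splitters):
    """Software range partitioning: each record is routed to the
    partition whose key range contains its key. With k splitters
    there are k + 1 output ranges."""
    partitions = [[] for _ in range(len(splitters) + 1)]
    for key, payload in records:
        # Binary search over the sorted splitters picks the partition.
        idx = bisect.bisect_right(splitters, key)
        partitions[idx].append((key, payload))
    return partitions

# Example: splitters 10, 20, 30 define 4 ranges.
parts = range_partition([(8, 'a'), (15, 'b'), (33, 'c')], [10, 20, 30])
```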
Low Memory Utilization 1. Modern databases are not typically disk-I/O bound. 2. Some databases have memory-resident working sets (e.g., 28 TB of data at Facebook). 3. Servers running Bing, Hotmail, and Cosmos show 67%-97% processor utilization but only 2%-6% memory bandwidth utilization. 4. Google's BigTable and Content Analyzer use only a couple of percent of memory bandwidth (<10%).
Solution Low memory bandwidth utilization plus compute-bound partitioning motivate new hardware: 1. the Hardware Accelerated Range Partitioner, or HARP; 2. a high-bandwidth hardware-software streaming framework that transfers data to and from HARP.
Join (figure from Lisa Wu's slides)
Join: Thrashing the Cache (figure from Lisa Wu's slides)
Partitioning Half of join time is spent on partitioning. DB: MonetDB-11.11.5; time: median of ten runs of each query.
Partitioning (figure from Lisa Wu's slides)
Outline 1. Introduction 2. Background & Motivation 3. HARP System 4. Evaluation 5. Conclusion 6. Discussion
Hardware Accelerated Range Partitioning System Overview of the HARP system Design of the HARP Accelerator Design of the streaming framework
HARP System Overview New components coupled with each CPU core: the HARP accelerator and a pair of stream buffers (SB_in, SB_out), sitting alongside the existing core, L1, L2, memory controller, and memory.
Hardware Accelerated Range Partitioning System Overview of the HARP system Design of the HARP Accelerator Design of the streaming framework
HARP Accelerator Linear Pipelining Design
HARP Accelerator ISA HARP Instructions: set_splitter <splitter number> <value> partition_start partition_stop
HARP Implementation Details HARP Instructions: set_splitter partition_start partition_stop
Set Splitters The CPU programs the splitter values (e.g., 10, 20, 30) with set_splitter <splitter number> <value>. (Other HARP instructions: partition_start, partition_stop.)
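The set_splitter values define the partition boundaries. HARP performs equality range partitioning, so each of the k splitter values also gets its own partition, giving 2k + 1 partitions in total (which is why 127 splitters yield 255-way partitioning in the evaluation). A sketch of the key-to-partition mapping, using the example splitters 10, 20, 30:

```python
def partition_index(key, splitters):
    """Equality range partitioning: even indices are the open
    ranges between splitters, odd indices are the equality bins
    for the splitter values themselves. With k splitters this
    yields 2k + 1 partitions."""
    for i, s in enumerate(splitters):
        if key < s:
            return 2 * i        # range below splitter s
        if key == s:
            return 2 * i + 1    # equality bin for s
    return 2 * len(splitters)   # final open range

# Splitters 10, 20, 30 -> 7 partitions (indices 0..6).
```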
Pull Data from Stream Buffer With the splitters (10, 20, 30) set, an FSM converts each burst pulled from SB_in into a stream of individual records.
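The serializing FSM above can be sketched as unpacking each 64-byte DRAM burst into fixed-width records (16-byte records, so 4 records per burst, matching the evaluation parameters):

```python
RECORD_BYTES = 16
BURST_BYTES = 64

def burst_to_records(burst):
    """Unpack one DRAM burst into a stream of fixed-width records,
    as the FSM between SB_in and the comparator pipeline does."""
    assert len(burst) == BURST_BYTES
    return [burst[i:i + RECORD_BYTES]
            for i in range(0, BURST_BYTES, RECORD_BYTES)]

recs = burst_to_records(bytes(range(64)))  # 4 records of 16 bytes
```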
Compare on Conveyor: Pipelining Records (e.g., 8, 15, 33) move along the conveyor past the splitters 10, 20, 30; at each stage a record is compared against one splitter and dropped into the matching partition buffer. (HARP instructions: set_splitter, partition_start, partition_stop.)
Merge Data to Output Stream Buffer When a partition buffer fills, an FSM pulls a burst of records from the full buffer and merges it into the output stream buffer, one record per cycle. (HARP instructions: set_splitter, partition_start, partition_stop.)
Signal to Stop & Drain In-flight Data HARP Instructions: set_splitter partition_start partition_stop
Technical Insights of the HARP Accelerator The pipeline is always full, sustaining one record per cycle. Hardwired comparisons in the pipeline eliminate control instructions and reduce instruction fetch overhead.
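A behavioral sketch of the linear conveyor: each stage holds one record and compares it against one splitter, so one record enters per cycle and, once the pipeline has filled, one record exits per cycle (stage count and routing details are simplified here):

```python
def pipeline_cycles(records, num_stages):
    """One record enters per cycle and takes num_stages cycles to
    traverse the comparator conveyor, so total latency is the
    fill time plus one exit per cycle thereafter."""
    n = len(records)
    return num_stages + n - 1 if n else 0

# 1,000 records through a 127-stage conveyor: the fill latency is
# amortized, sustaining ~1 record per cycle in steady state.
cycles = pipeline_cycles(list(range(1000)), 127)
```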
Hardware Accelerated Range Partitioning System Overview of the HARP system Design of the HARP Accelerator Design of the streaming framework (Datapaths between HARP accelerator and memory)
Stream Buffer Moves data between memory and the HARP accelerator at or above HARP's partitioning rate, with minimal modifications to the existing processor architecture. (Components: SB_in, SB_out, HARP, core, L1, L2, memory controller, memory.)
Stream Buffer ISA Stream Buffer Instructions: sbload sbid, [mem addr] sbstore [mem addr], sbid sbsave sbid sbrestore sbid
Load Data from Memory to Stream Buffer sbload sbid, [mem addr]. Steps: 1. Issue sbload from the core. 2. Send sbload from the request buffer to memory. 3. Data returns from memory to SB_in. Uses the existing microarchitectural vector-load request path (the request buffer).
Store Data to Memory sbstore [mem addr], sbid. Steps: 1. Issue sbstore from the core. 2. Copy data from SB_out to the store buffer. 3. Write the data back to memory via the existing datapath (write-combining buffers).
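Putting the load and store paths together, the software side of the streaming framework reduces to a simple driver loop. This is a hedged sketch with sbload/sbstore modeled as plain queue operations (the real instructions move whole bursts between memory and the stream buffers, and harp_step is a stand-in for the accelerator consuming one burst):

```python
from collections import deque

def stream_partition(table_bursts, harp_step):
    """Driver loop: sbload fills SB_in, HARP consumes from it,
    and sbstore drains SB_out back to memory."""
    sb_in, sb_out = deque(), deque()
    output = []
    for burst in table_bursts:
        sb_in.append(burst)                    # sbload sbid, [mem addr]
        sb_out.append(harp_step(sb_in.popleft()))  # HARP consumes SB_in
        output.append(sb_out.popleft())        # sbstore [mem addr], sbid
    return output

out = stream_partition([[1, 2], [3, 4]], lambda burst: burst)
```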
Seamless Execution after Interrupts Instructions: partition_stop stops the HARP accelerator and drains in-flight data; sbsave saves the contents of the stream buffers; sbrestore ensures the streaming state is identical before and after the interrupt. This improves robustness.
Compatibility with Current Software
Outline 1. Introduction 2. Background & Motivation 3. HARP System 4. Evaluation 5. Conclusion 6. Discussion
Methodology Baseline HARP parameters: 127 splitters for 255-way partitioning; 16-byte records with 4-byte keys; 64-byte DRAM bursts (4 records/burst). HARP throughput: convert cycle counts and cycle time into bandwidth (GB/sec) over 1 million random records. Streaming instruction throughput: the rate of copying a 1 GB table from one memory location to another, compared against the built-in C library, hand-optimized x86 scalar assembly, and hand-optimized x86 vector assembly.
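The cycle-count-to-bandwidth conversion in the methodology can be illustrated with the stated record size; the cycle count and clock frequency below are hypothetical example values, not figures from the paper:

```python
def throughput_gb_per_sec(num_records, record_bytes, cycles, clock_ghz):
    """Convert a cycle count into bandwidth: bytes moved divided
    by elapsed time at the given clock rate."""
    bytes_moved = num_records * record_bytes
    seconds = cycles / (clock_ghz * 1e9)
    return bytes_moved / seconds / 1e9

# Hypothetical: 1M 16-byte records in 5.1M cycles at 1 GHz.
bw = throughput_gb_per_sec(1_000_000, 16, 5_100_000, 1.0)
```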
HARP Throughput Baseline HARP configuration: 3.13 GB/sec, up to 7.8x the throughput of software.
Streaming Throughput 4.6 GB/sec, slightly higher than HARP's partitioning throughput (3.13 GB/sec), so streaming keeps the accelerator fed.
Area & Energy Relative to the baseline core, HARP adds 6.9% area and 4.3% power in total.
Energy Efficient when compared with software implementations (both serial and parallel).
Skew Tolerance At most 19% degradation in throughput when records are not well distributed (worst case: all records routed to one partition).
Some Other Observations 1. Number of partitions: throughput is highly sensitive to the number of partitions below 63; area and power grow linearly with the number of partitions. 2. Record width: throughput scales linearly with record width. 3. Key width: has only a slight impact on throughput.
Outline 1. Introduction 2. Background & Motivation 3. HARP System 4. Evaluation 5. Conclusion 6. Discussion
Conclusion Advantages of partitioning: broadly applicable in the database domain (join, sort, aggregation) and in non-database domains (e.g., MapReduce). Design of HARP: throughput and power-efficiency advantages over software; a flexible yet modular data-centric acceleration framework. Strategy: more partitions means lower throughput but better cache locality; fewer partitions means higher throughput but worse cache locality; the balance between efficiency and throughput must be found.
Outline 1. Introduction 2. Background & Motivation 3. HARP System 4. Evaluation 5. Conclusion 6. Discussion
Discussion Points 1. Is multi-threading plus HARP helpful for further improving the throughput of data partitioning? 2. Is this hardware design too inflexible? Can HARP efficiently deal with records of larger sizes? Is HARP extensible to applications that require partitioning over records of different, or unknown, formats? 3. The authors demonstrate that partitioning is important in large SQL databases. Is this application domain broad enough to justify the design of a custom accelerator? 4. In general, for a specific application, is it more profitable to integrate the accelerator with a general-purpose CPU, or to design independent hardware?
Oracle's SPARC M7 Application Accelerator The M7's most significant innovations revolve around what is known as "software in silicon," a design approach that places software functions directly into the processor.