Identifying Performance Bottlenecks in Hive: Use of Processor Counters
- Lee Briggs
- 7 years ago
Transcription
1 Identifying Performance Bottlenecks in Hive: Use of Processor Counters
Alexander C Shulyak, Lizy K John
Presented by: Shuang Song
2 Problem
- Businesses and online services increasingly rely on insights derived from data analytics applications
  - Targeted promotional advertising
  - Personalized content and experiences
  - Streamlining business operations
  - Sales and market analysis
- Amount of data being collected is increasing exponentially
- Distributed SQL Query Engines (DSQEs) are increasingly used as Decision Support Systems (DSS) to process large amounts of data at scale: Hive, Shark, Impala, etc.
- What are the performance bottlenecks of a DSQE running DSS queries?
- What are the performance trade-offs of a DSQE over a traditional Relational Database Management System?
3 Hive Overview
[Figure: Hadoop architecture; from [1]]
- Data warehousing and query processing framework for large databases
- Uses the Hadoop HDFS and MapReduce frameworks
- Hive converts SQL-like queries to a set of MapReduce jobs for Hadoop, monitors progress, and returns the result
- YARN Resource Manager allows custom execution engines
- Tez execution engine improves performance by modeling the query as a DAG of MapReduce jobs and optimizing execution of the entire DAG [1]
[1] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and Carlo Curino. Apache Tez: A unifying framework for modeling and building data processing applications. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages , New York, NY, USA, ACM.
4 Previous Work
- Panda et al.; SBAC-PAD 2015; Performance Characterization of Modern Databases on Out-of-Order CPUs
  - Performance analysis of MySQL, Cassandra, MongoDB, VoltDB
  - Instruction/data caches and TLBs stressed significantly
  - MySQL: comparatively best throughput and latency for all workloads
- Wouw et al.; ICPE 2015; An Empirical Performance Evaluation of Distributed SQL Query Engines
  - Analyzed Shark, Hive with MapReduce, and Impala
  - Propose a micro-benchmarking suite and an empirical method for evaluating the performance of Distributed SQL Query Engines (DSQEs)
  - Hive with MapReduce is outperformed by all other database options; it experiences high network I/O and framework overhead
5 Previous Work (Cont.)
- Floratou et al.; VLDB 2014; SQL-on-Hadoop: Full circle back to shared-nothing database architectures
  - Study included Hive with MapReduce, Hive with Tez, and Impala on a 21-node cluster running a 1TB TPC-H database
  - Only Hive with MapReduce impacted by startup and scheduling overheads
  - Impala, a shared-nothing SQL-on-Hadoop database, vastly improved performance when workloads fit in memory
  - Hive with MapReduce and Hive with Tez are CPU bound, especially on scan operations
6 Motivation: Hive struggles to beat MySQL despite 6X available cores
- Hive still needed due to its scale-out potential
- Algorithm, code base, OS, and CPU resources could cause computational performance differences between Hive and MySQL
- Root cause analysis needed
- Time = (1/f * CPI * Total Ins) / TLP
  - f: clock frequency; CPI: cycles per instruction; Total Ins: total instruction count; TLP: thread-level parallelism
[Figure: Execution time for Query 1, Query 3, Query 6, Query 14, Query 19, and the average; series: MySQL 10GB, Hive Tez 10GB]
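The execution-time relation on the Motivation slide can be sketched in a few lines of Python. This is only an illustration: the function name `execution_time`, the 2 GHz clock, the CPI values, and the baseline instruction count are hypothetical and are not measurements from the study. The 1.7X instruction ratio is the average reported on the Instruction Count slide.

```python
def execution_time(freq_hz, cpi, total_instructions, tlp):
    """Time = (1/f * CPI * Total Ins) / TLP, returned in seconds."""
    if freq_hz <= 0 or tlp <= 0:
        raise ValueError("freq_hz and tlp must be positive")
    return (cpi * total_instructions / freq_hz) / tlp


# Hypothetical inputs chosen only to illustrate the relation.
single = execution_time(freq_hz=2.0e9, cpi=1.0,
                        total_instructions=1.0e12, tlp=1)

# Six-way parallelism, but 1.7X the instructions and an assumed higher CPI.
parallel = execution_time(freq_hz=2.0e9, cpi=1.5,
                          total_instructions=1.7e12, tlp=6)

print(f"single-threaded: {single:.1f} s")
print(f"six threads:     {parallel:.1f} s")
print(f"speedup:         {single / parallel:.2f}x")
```

With these assumed numbers, six-way parallelism yields a speedup of only about 2.35x, because the extra instructions and the higher CPI eat into the TLP term.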
7 Experimental Setup
- Database details
  - Hive; Hadoop 2.7.0; Tez
  - MySQL
- Benchmark details
  - TPC-H 10GB database, queries 1, 3, 6, 14, 19
- Server setup
  - Single-node 6 (12) core Intel Xeon
  - 32KB private L1i and L1d, 256KB combined L2, 15MB shared LLC
  - 64GB 1600MHz; 750GB
  - OpenJDK
8 Methodology
- Processor performance counters: Perf 3.19
  - IPC, CPI, MPKI, CS PKI
- Huge number of counters, hundreds of factors: need for an empirical method to identify the bottleneck
- Top-down SW performance analysis
  1. Total instruction count and average thread count
  2. Query plan analysis
  3. OS events and statistics
  4. Instruction-stream statistics (instruction mix, instruction/data footprint, instruction/data access patterns)
- Top-down HW performance analysis
  1. System-level utilization statistics (CPU, memory, network I/O, disk I/O)
  2. Instructions per cycle (IPC)
  3. Top-Down Microarchitectural Analysis Method (TMAM)
- Aggregate statistics may hide phase behavior responsible for performance loss
  - Counter statistics collected at 1-second intervals
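The derived metrics named on this slide (IPC, CPI, MPKI, CS PKI) follow directly from raw counter values. The sketch below assumes the usual definitions (misses or context switches per 1,000 retired instructions). The `CounterSample` type, its field names, and the sample values are hypothetical and do not reflect how the authors processed their perf output.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counter totals for one measurement interval (hypothetical layout)."""
    cycles: int
    instructions: int
    llc_misses: int
    context_switches: int


def derived_metrics(s: CounterSample) -> dict:
    """Compute IPC, CPI, LLC MPKI, and CS PKI from raw counters."""
    if s.cycles <= 0 or s.instructions <= 0:
        raise ValueError("cycles and instructions must be positive")
    kilo_instructions = s.instructions / 1000.0
    return {
        "ipc": s.instructions / s.cycles,
        "cpi": s.cycles / s.instructions,
        "llc_mpki": s.llc_misses / kilo_instructions,
        "cs_pki": s.context_switches / kilo_instructions,
    }


# Made-up values for one 1-second interval.
sample = CounterSample(cycles=2_000_000_000, instructions=1_600_000_000,
                       llc_misses=3_200_000, context_switches=800)
metrics = derived_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.6g}")
```

Normalizing by instruction count rather than by time makes the miss and context-switch rates comparable between runs that execute very different amounts of work.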
9 Insights
1. MySQL: relatively efficient DSS query execution engine
2. Hive shows difficulty converting SQL queries into a set of MapReduce jobs
3. Hive's framework layers add code bloat and slow database traversal
4. Hive with Tez amortizes JVM startup cost effectively across MapReduce tasks; the initial start-up period for Hive is costly for queries under 300 seconds
5. Hive's highly parallel execution increases context switch rates, which stresses the memory hierarchy
10 Instruction Count
- Hive must execute 1.7X more instructions on average
  - Hadoop, MapReduce, distributed execution: large overhead
  - Abstract query from flow of execution
  - Overhead slightly lower due to vector execution
- Variation in overhead across queries due to query plan inefficiencies
[Figure: Total instructions, normalized to MySQL, for Query 1, Query 3, Query 6, Query 14, Query 19, and the average; series: MySQL, Hive]
11 Query 1 SQL and Query 19 SQL
[Figure: SQL text of Query 1 and Query 19]
- Query 1 reports the amount of business that was billed, shipped, and returned
- Query 19 reports the gross discounted revenue attributed to the sale of selected parts handled in a particular manner
12 Startup Overhead
- Start-up period
  - Static period of time observed in all Hive queries, noted by poor microarchitectural performance
  - 2 parts: initialization period, warm-up period
- Initialization period
  - Hive query plan is generated and sent to Hadoop; Hadoop is initialized
  - Low instruction count, low IPC
- Warm-up period
  - Hadoop Tez containers are initialized, but JVM processes and CPU must warm up to reach peak IPC
  - High instruction count, low IPC
- Because the period is static, its effect on execution time and average IPC depends on total execution time
  - Query 3's IPC increases by 9% with the startup period ignored
  - Query 6's IPC increases by 45% with the startup period ignored
- Improvement over Hive with built-in Hadoop MapReduce
- Use of Tez's online or batch query processing mode can further amortize the start-up period costs
[Figures: IPC and total instruction count over time for Query 3 and a second query, with average IPC and average post-startup IPC marked]
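The effect of ignoring a static start-up period on average IPC can be illustrated with per-interval counter samples, in the spirit of the 1-second sampling described under Methodology. The sketch below is an assumption-laden illustration: the function name `average_ipc`, the four-interval start-up length, and all sample values are made up and do not reproduce the 9% or 45% figures above.

```python
def average_ipc(samples, skip_intervals=0):
    """Average IPC over (instructions, cycles) samples.

    The first `skip_intervals` samples are dropped to exclude start-up.
    IPC is computed as total instructions over total cycles, so each
    interval is weighted by the cycles it actually consumed.
    """
    kept = samples[skip_intervals:]
    total_instructions = sum(ins for ins, _ in kept)
    total_cycles = sum(cyc for _, cyc in kept)
    if total_cycles == 0:
        raise ValueError("no cycles in the selected intervals")
    return total_instructions / total_cycles


# Hypothetical 1-second samples: (instructions, cycles).
startup = [(1_000_000_000, 4_000_000_000)] * 2   # initialization: low count, low IPC
startup += [(2_000_000_000, 4_000_000_000)] * 2  # warm-up: IPC still below peak
steady = [(4_000_000_000, 4_000_000_000)] * 6    # steady state

samples = startup + steady
overall = average_ipc(samples)
post_startup = average_ipc(samples, skip_intervals=len(startup))
gain = (post_startup - overall) / overall

print(f"average IPC:              {overall:.2f}")
print(f"average IPC post-startup: {post_startup:.2f}")
print(f"increase:                 {gain:.0%}")
```

Because the start-up length is fixed, appending more steady-state intervals to `samples` shrinks the gap between the two averages, which is the amortization effect the slide describes for longer-running queries.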
13 IPC
[Figure: IPC for Query 1, Query 3, Query 6, Query 14, Query 19, and the average; series: MySQL, Hive 6 Thread, Hive 6 Thread Post-Startup]
- Hive instruction execution rate (IPC) improves based on execution time
  - When the start-up period is ignored, Query 6 and Query 14 outperform their MySQL counterparts
- Relative IPC across queries does not correlate between MySQL and Hive
  - Each database has different inherent microarchitectural bottlenecks
14 LLC Misses: MySQL Performance Bottleneck
[Figure: LLC MPKI vs. CPI; series: MySQL, Hive 1 Thread, Hive 6 Threads, Linear (MySQL)]
- MySQL microarchitectural performance correlated to LLC MPKI
  - Data-driven performance
  - Fewer instructions between unique entries in database
- Hive has little variation in LLC MPKI
  - Data traversal indicative of underlying frameworks
- Little difference in LLC MPKI rates between 1- and 6-threaded Hive setups (despite differences in CPI); therefore, context switches affect first-level caches far more than the LLC
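One simple way to check whether CPI tracks LLC MPKI, as this slide reports for MySQL, is a Pearson correlation coefficient over per-query measurements. The sketch below is not the authors' analysis: the `pearson` helper and both data sets are hypothetical, built only to contrast a metric that explains CPI with one that barely varies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-query points: CPI rises with LLC MPKI.
engine_a_mpki = [1.0, 2.0, 3.0, 4.0, 5.0]
engine_a_cpi = [0.8, 1.0, 1.2, 1.4, 1.6]

# Hypothetical per-query points: LLC MPKI barely varies while CPI does.
engine_b_mpki = [2.0, 2.1, 1.9, 2.0, 2.1]
engine_b_cpi = [1.0, 1.6, 1.3, 2.0, 1.1]

r_a = pearson(engine_a_mpki, engine_a_cpi)
r_b = pearson(engine_b_mpki, engine_b_cpi)
print(f"engine A: r = {r_a:.2f}")
print(f"engine B: r = {r_b:.2f}")
```

A coefficient near 1 suggests that LLC misses explain most of the CPI variation, while a coefficient near 0 points to a different bottleneck, such as the context switches discussed on the next slide.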
15 Context Switches: Hive Performance Bottleneck
- CPI increases as the number of threads, and subsequently the number of context switches, increases
  - Indicates bottleneck of system
- MySQL has far fewer context switches than Hive, even with 1 thread
  - MySQL is single-threaded, and there is no apparent correlation between context switches and CPI
[Figure: CS PKI vs. CPI with different numbers of threads; series: Hive Query 1, Hive Query 3, Hive Query 6, Hive Query 14, Hive Query 19, MySQL Queries]
16 Conclusion
- MySQL executes queries as efficiently as possible
  - Low instruction count
  - Microarchitectural performance differences depend on how the data is traversed and how the memory hierarchy is stressed by that algorithm
- Hive's large code base and generic execution framework are the primary performance bottleneck
  - Query plan inefficiency and increased instruction count
  - Hive's startup period hurts the performance of short-running queries
  - Hive's higher context switch rates directly impact microarchitectural performance
- Improvements
  - Amortize startup costs over more queries with batch or online execution
  - Decrease parallelization per node to improve microarchitectural performance
  - Resort to a Distributed SQL Query Engine only if the database size is too large for 1 node
17 QUESTIONS?
18 Overview
- Objective: root cause analysis of performance discrepancies between MySQL and Hive
  - MySQL: traditional Relational Database Management System (RDBMS), compared to a scale-out approach like Hive
  - Hive: Hadoop-based DSQE
- Use a Decision Support Benchmark (DSB), TPC-H, as the database benchmark
- Identify computational overheads associated with distributed set-ups
19 Previous Work (Cont.)
- Floratou et al.; VLDB 2014; SQL-on-Hadoop: Full circle back to shared-nothing database architectures
  - Hive without Tez impacted by startup and scheduling overheads
  - Impala, a shared-nothing SQL-on-Hadoop database, vastly improved performance when workloads fit in memory
  - Hive on MapReduce and Hive on Tez are CPU bound, especially on scan operations
- Jia et al.; IISWC 2013; Characterizing data analysis workloads in data centers
  - Data analysis workloads have higher IPC than data serving workloads, but lower IPC than computation-intensive HPCC workloads
  - Both data analysis workloads and data serving workloads suffer from noticeable front-end stalls, which the authors attribute to a larger code footprint causing inefficient L1I cache and iTLB performance
20 TPC-H Database
21 MySQL Query Plans
Query     Table      Type    Clauses
Query 1   lineitem   ALL     Using where; Using temporary; Using filesort
Query 3   orders     ALL     Using where; Using temporary; Using filesort
          customers  eq_ref  Using where
          lineitem   ref     Using where
Query 6   lineitem   ALL     Using where
Query 14  lineitem   ALL     Using where
          part       eq_ref
Query 19  lineitem   ALL     Using where
          part       eq_ref  Using where
22 Hive Query Plans
Query     Vertex    Source           Type                                   Partitions
Query 1   Map 1     lineitem         Filter, Select, Group By               12
          Reduce 2  Map 1            Group By                               11
          Reduce 3  Reduce 2         Select                                 1
Query 3   Map 1     customers        Filter                                 1
          Map 6     orders           Filter                                 4
          Map 7     lineitem         Filter                                 15
          Reduce 2  Map 1, Map 6     Merge/Join                             1
          Reduce 3  Map 7, Reduce 2  Merge/Join, Select, Group By           2
          Reduce 4  Reduce 3         Group By, Select                       2
          Reduce 5  Reduce 4         Select                                 1
Query 6   Map 1     lineitem         Filter, Select, Group By               12
          Reduce 2  Map 1            Group By                               1
Query 14  Map 1     part             Filter                                 1
          Map 4     lineitem         Filter                                 15
          Reduce 2  Map 1, Map 4     Merge/Join, Group By                   1
          Reduce 3  Reduce 2         Group By, Select                       1
Query 19  Map 1     lineitem         Filter                                 15
          Map 4     parts            Filter                                 1
          Reduce 2  Map 1, Map 4     Merge/Join, Filter, Select, Group By   4
          Reduce 3  Reduce 2         Group By                               1
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationHP reference configuration for entry-level SAS Grid Manager solutions
HP reference configuration for entry-level SAS Grid Manager solutions Up to 864 simultaneous SAS jobs and more than 3 GB/s I/O throughput Technical white paper Table of contents Executive summary... 2
More informationCloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
More informationIntel RAID Performance 12Gb/s SAS RAID Controllers
Intel RAID Performance 12Gb/s SAS RAID Controllers WHITE PAPER September 2014 DB10-000023-00 For more information on the Intel RAID visit: www.intel.com/go/raid Intel, the Intel logo, Intel Inside, Xeon
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationConquering Big Data with BDAS (Berkeley Data Analytics)
UC BERKELEY Conquering Big Data with BDAS (Berkeley Data Analytics) Ion Stoica UC Berkeley / Databricks / Conviva Extracting Value from Big Data Insights, diagnosis, e.g.,» Why is user engagement dropping?»
More informationPractical Performance Understanding the Performance of Your Application
Neil Masson IBM Java Service Technical Lead 25 th September 2012 Practical Performance Understanding the Performance of Your Application 1 WebSphere User Group: Practical Performance Understand the Performance
More informationSQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures
SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures Avrilia Floratou IBM Almaden Research Center aflorat@us.ibm.com Umar Farooq Minhas IBM Almaden Research Center ufminhas@us.ibm.com
More informationFlash Performance in Storage Systems. Bill Moore Chief Engineer, Storage Systems Sun Microsystems
Flash Performance in Storage Systems Bill Moore Chief Engineer, Storage Systems Sun Microsystems 1 Disk to CPU Discontinuity Moore s Law is out-stripping disk drive performance (rotational speed) As a
More informationAn Oracle White Paper August 2012. Oracle WebCenter Content 11gR1 Performance Testing Results
An Oracle White Paper August 2012 Oracle WebCenter Content 11gR1 Performance Testing Results Introduction... 2 Oracle WebCenter Content Architecture... 2 High Volume Content & Imaging Application Characteristics...
More informationAutomating Big Data Benchmarking for Different Architectures with ALOJA
www.bsc.es Jan 2016 Automating Big Data Benchmarking for Different Architectures with ALOJA Nicolas Poggi, Postdoc Researcher Agenda 1. Intro on Hadoop performance 1. Current scenario and problematic 2.
More informationBusiness white paper. HP Process Automation. Version 7.0. Server performance
Business white paper HP Process Automation Version 7.0 Server performance Table of contents 3 Summary of results 4 Benchmark profile 5 Benchmark environmant 6 Performance metrics 6 Process throughput 6
More informationCharacterizing Task Usage Shapes in Google s Compute Clusters
Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key
More informationHow To Write A Bigbench Benchmark For A Retailer
BigBench Overview Towards a Comprehensive End-to-End Benchmark for Big Data - bankmark UG (haftungsbeschränkt) 02/04/2015 @ SPEC RG Big Data The BigBench Proposal End to end benchmark Application level
More informationQuantcast Petabyte Storage at Half Price with QFS!
9-131 Quantcast Petabyte Storage at Half Price with QFS Presented by Silvius Rus, Director, Big Data Platforms September 2013 Quantcast File System (QFS) A high performance alternative to the Hadoop Distributed
More informationSAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,
More informationCS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
More informationUnified Big Data Analytics Pipeline. 连 城 lian@databricks.com
Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
More informationBest Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
More informationRecommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
More information