# Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions

Save this PDF as:

Size: px
Start display at page:

## Transcription

6 input array A vector register A (128 bit) vector register B (128 bit) input array A A[i] A[i+1] A[i+2] A[i+3] B[j] B[j+1] B[j+2] B[j+3] A[i] vector register A (128 bit) vector register B (128 bit) B[j] compare each byte pair A[i+1] compare each word pair B[j+1] Byte wise check check four elements in parallel gather one byte from each element as illustrated and compare each byte pair store result in a vector register as bitmask Word wise check check two elements in parallel gather one word from each element as illustrated and compare each word pair store result in a vector register as bitmask Figure 4. Overview of byte-wise check and word-wise check. S a = 4 and S b = 8. As discussed in Section 3.2, the block size of 3 is best for our scalar (non-simd) algorithm. However we can reduce the number of comparisons by using SIMD instructions, which justifies using a larger block size than used in the scalar algorithm. The vector sizes of the SIMD instruction sets of today s processors, such as SSE/AVX of Xeon or VSX of POWER7+, are limited to 128 bits or 256 bits. This means we can execute only two or four comparisons of 64-bit elements in parallel. This parallelism is insufficient for an all-pairs comparison of large blocks in one step. To execute the all-pairs comparisons for larger blocks efficiently by increasing the parallelism available in one SIMD instruction, we use a parallel SIMD comparison, which compares only a part of each element, to filter out all of the values with no outputs before executing the all-pairs comparison using scalar comparisons. Unlike the previous SIMD approaches [8, 9], we did not fully replace the scalar comparisons with parallel SIMD comparisons. Because the number of matching pairs are typically much smaller than the number of input elements in practice, our filtering technique is effective to avoid most of the scalar comparisons and hence achieves higher overall performance. Zhou et al. [12] also used a similar idea of comparing only a part of each element to increase the data parallelism with one SIMD instruction for the nested loop join for unsorted arrays. We use two different types of checks to find a matching pair in the all-pairs comparison hierarchically. Figure 4 shows an overview of our byte-wise check and word-wise check using SIMD parallel compare instructions. In this example, we assume 64-bit integers as the data type and an SIMD instruction set with 128-bit registers. These checks execute only a partial comparison of each element. It means that if the check does not find any matching byte or word pair, there cannot be any matching element pairs (no false negatives). However, if the check finds a matching byte or word pair, the matching pair may still be a false positive. To reduce the number of false-positive matches, we hierarchically do two different types of checks. If the data type of each element is a 64-bit integer and the block size S is 4, our hierarchical filtering uses these steps: while (1) { // do byte-wise check: step (1)-(3) of the hierarchical check compare_result = bytewise_check1(&a[apos], &B[Bpos]) & bytewise_check2(&a[apos], &B[Bpos]); if (!is_all_bit_zero(compare_result)) { // step (4) // found a potential matching value // do word-wise check and scalar check: step (5) - (7)... } else if (A[Apos+3] > B[Bpos+3]) goto advanceb; else goto advancea; advancea: Apos+=4; if (Apos >= Aend) { break; } else { continue; } advanceb: Bpos+=4; if (Bpos >= Bend) { break; } else { continue; } advanceab: Apos+=4; Bpos+=4; if (Apos >= Aend Bpos >= Bend) { break; } } Figure 5. Pseudocode of our SIMD algorithm for block size of 4x4. (1) Do a byte-wise check for A[i.. i+3] and B[j.. j+3] using the least significant byte, (2) Do a byte-wise check for A[i.. i+3] and B[j.. j+3] using the second-least significant byte, (3) Do a bit-wise AND operation for the results of Steps 1 and 2, (4) If every bit is zero in Step 3, then skip further checks because there is no matching pair (most frequent case), (5) Do a word-wise check for A[i.. i+1] and B[j.. j+1] using the third to sixth bytes, (6) Do a scalar check for Step-5 matches, and (7) Repeat Steps 5 and 6 for {A[i.. i+1] and B[j+2.. j+3]}, {A[i+2.. i+3] and B[j.. j+1]}, and {A[i+2.. i+3] and B[j+2.. j+3]}. Alternatively, we can replace Steps 5 to 7 with a countleading-zero instruction to identify the location of the matching pairs found in Step 3. When we use a 32-bit integer data type, we use the first to fourth bytes in Step 5. Figure 5 shows the pseudocode for the set intersection algorithm with our hierarchical filtering with SIMD. On Xeon, we can use the STTNI (pcmpestrm) instruction, which is unique to the Xeon processor, to execute the all-pair comparison efficiently. This instruction can execute the all-pair comparisons of eight 2-byte characters in one vector register against eight characters in another vector register with only one instruction. Thus we can implement Steps 1 to 3 with a block size of 8 by using the STTNI instruction very efficiently. Unlike Schlegel s algorithm [8], our algorithm uses the STTNI instruction to filter out redundant comparisons and thus we do not need to limit the data types to the 8- or 16-bit integer supported by this instruction. When we use the STTNI in our SIMD algorithm, we use the popcnt instruction to identify the position of matching pair efficiently because the processor does not support the count-leading-zero instruction. We can get the position of the least significant non-zero bit in the result of STTNI, x, by popcnt((~x) & (x-1)). When the SIMD instruction set supports a wider vector, such as a 256-bit vector in AVX, one way to exploit the wider vector is doing multiple byte-wise checks at in once step. For example, we could do Steps 1 and 2 with just one parallel comparison using 256-bit vector registers, with 16 pairs of 2-byte elements. By using our hierarchical filtering with SIMD instructions, we can avoid increasing the number of instructions and still gain the benefits of the reduced branch mispredictions. 298

7 1,8 1,6 1,4 1,2 1, Xeon (for 32-bit integers) using STTNI instruction unique to Xeon POWER7+ (for 32-bit integers) POWER7+ (for 64-bit integers) Naive (block size = 1) Our scalar algorithm (block size = 2) Our scalar algorithm (block size = 3) Our scalar algorithm (block size = 4) Our scalar algorithm (block size = 5) Our SIMD algorithm (block size = 4) Branchless algorithm V1 SIMD algorithm (Lemire et al. [9]) Our algorithm w/ STTNI (block size = 8) Schlegel's algorithm w/ STTNI [8] Figure 6. Performance for set intersection of 32-bit and 64-bit random integer arrays of 256k elements on Xeon and POWER7+. The selectivity was set to (as the best case). The error bars show 95% confidence intervals. 4. EXPERIMENTAL RESULTS We implemented and evaluated our algorithm on Intel Xeon and IBM POWER7+ processors with and without using SIMD instructions. On Xeon, we also evaluated our SIMD algorithm implemented using the Xeon-only STTNI instruction. We implemented the program in C++ using SSE intrinsics on Xeon and Altivec intrinsics on POWER7+, but the algorithm is the same for both platforms. The POWER7+ system used for our evaluation was equipped with a 4.1-GHz POWER7+ processor. Redhat Enterprise Linux 6.4 was running on the system. We compiled all of the programs using gcc included in the IBM Advance Toolkit with the O3 option. We also evaluated the performance of our algorithm on a system equipped with two 2.9- GHz Xeon E5-269 (SandyBridge-EP), also with Redhat Enterprise Linux 6.4 as the OS, but the compiler on this system was gcc (still with the O3 option) on this Xeon system. We disabled the dynamic frequency scaling on both systems for more reproducible results. In the evaluation, we used both artificial and more realistic datasets. With the artificial datasets generated using a random number generator, we assessed the characteristics of our algorithm for three key parameters, (1) the ratio of the number of output elements over the input elements (selectivity), (2) the difference in the sizes of the two input arrays and (3) the total sizes of the input arrays. We define the selectivity as the number of output elements over the number of elements in the smaller input array. To create input datasets with a specified selectivity, we first generate a long enough array of (unsorted) random numbers without duplicates. We then trim two input arrays from this long array with the specified number of elements included in both arrays. Each array is then sorted and the pair is used as an input for experiments. We executed the measurements 16 times using different seeds for the random number generator and averaged the results. For the realistic datasets, we used arrays generated from Wikipedia data. We generated a list of document IDs for the articles that include a specified word. Then we executed the set intersection operations for up to eight arrays to emulate the set intersection operation for the multi-word queries in a query serving system. We also averaged the results from 16 measurements for the real-world data. We show the performance of our block-based algorithm with and without SIMD instructions and with various block sizes. The results shown as naive in the figures are the performance of the code shown in Figure 1(a). As already discussed, our algorithm is equivalent to the naive algorithm when the block size is 1. We also evaluated and compared the performances of the existing algorithms including the widely used std::set_intersection library method in the delivered with gcc, the branchless algorithm shown in Figure 2, a galloping algorithm [1] (as a popular binary-search-based algorithm), the two SIMD algorithms by Lemire et al. [9] (which are the V1 SIMD algorithm and an SIMD galloping algorithm), and Schlegel s algorithm that uses the STTNI instruction of Xeon [8]. We picked these algorithms for the comparisons because, like our algorithm, they need no preprocessing. Among these evaluated algorithms, the two galloping algorithms based on binary searches are tailored for paired arrays of very different sizes. The other algorithms, including ours, are merge-based algorithms, which are known to work better when the sizes of the two inputs are similar. 4.1 Performance Improvements from Our Algorithm Figure 6 compares the performance of the set intersection algorithms for two datasets of 256k integers based on a random number generator. The selectivity was set to zero. We used 32-bit integer as the data type for both Xeon and POWER7+ and also tested 64-bit integers for POWER7+. The results show that our block-based algorithm improved the performance over the naive merge-based algorithms ( and naive) on both platforms even without using the SIMD instructions. When comparing how the block size affected the performance of our scalar algorithm on these two platforms, the best performance was when the block size was set to 3. On both platforms, the block sizes of 3 and 4 gave almost comparable performance. Our prediction based on the simple model discussed in Section 3.2, which predicts the block size of 3 is the best and 4 is a close second best, seems reasonably accurate for both processors, although they have totally different instruction sets and implementations. The performance gains over the widely used were 2.1x on Xeon and 1.8x on POWER7+ with the block size of 3. Compared to our algorithm, the branchless algorithm did not yield large performance gains over, although it caused a smaller number of branch mispredictions than our algorithm (as shown later). 299

8 When we used the SIMD instructions, there were additional performance improvements of about 2.5x over our scalar algorithm on both platforms, where the total improvement was 4.8x to 5.2x better than for 32-bit integers and 4.2x better for 64-bit integers. Although the also achieved performance improvements over the using SIMD instructions, the performance of our SIMD algorithm was 2.9x and 3.x better than V1 algorithm on Xeon and POWER7+ respectively. Also, the V1 algorithm cannot support 64-bit integer on POWER7+ because POWER7+ does not have SIMD comparisons for 64-bit integers, while our SIMD algorithm achieved good performance improvements even for 64-bit integers on POWER7+. This is because the V1 algorithm uses the SIMD comparison for the entire elements to find the matching pairs, but our algorithm uses the SIMD comparisons to filter out unnecessary scalar comparisons by comparing only a part of each element. Schlegel's algorithm [8] did not achieve good performance for the artificial datasets generated by the random number generator. As the authors noted, Schlegel's algorithm is efficient only when the value domain is limited, so that a sufficient number of elements share matching values in their upper 16 bits, and this is not true for our artificial datasets. 4.2 Microarchitectural Statistics For more insight into the improvements from our algorithm with and without SIMD instructions, Figure 7 displays some microarchitectural statistics of each algorithm for the artificial datasets in the 32-bit integer arrays as measured by the hardware performance monitors of the processors. We studied the branch misprediction rate (the number of branch mispredictions per input element), the CPI (cycles per instruction), and the path length (the number of instructions executed per input element). We begin with the microarchitectural statistics of our algorithm when not using the SIMD instructions. When comparing the statistics for our scalar algorithm and the naive algorithm, which is equivalent to our algorithm with a block size of 1, the branch mispredictions are reduced as intended by using the larger block sizes. The reduction in the branch mispredictions directly affected the overall CPI, which was improved when we used the larger block sizes. The improvements in CPI were especially significant when we enlarged the block size from 1 to 2 and from 2 to 3. By using our scalar algorithm with the block size of 3, the branch mispredictions were reduced by more than 75% compared to the naive algorithm on both platforms, which was higher than the predicted reduction of 66%. In contrast, the path lengths increased steadily with the increasing block sizes. As a result of the reduced CPI and the increased path length, our best performance without SIMD instructions was with the block sizes of 3 and 4. When the block size increased beyond 4, the benefits of reduced branch mispredictions were not significant enough to compensate for the increased path length. This supports our belief that the best block size might be larger on processors with larger branch misprediction overhead. Since most of today s high performance processors use pipelined execution units and typically have large branch misprediction overhead, we expect that our algorithm would be generally effective for most of the modern processors, not just the two tested processors. The branchless algorithm showed the smallest number of the branch mispredictions by totally replacing the hard-to-predict conditional branches with arithmetic operations. However, as shown in Figure 7, the path length of the branchless algorithm was larger than our algorithm. Due to this long path length, the branchless algorithm did not outperform our scalar algorithm in Branch mispredictions per input element branch mispredictions per input element Xeon Cycles Per instruction (CPI) cycles per instruction (CPI) Xeon POWER7+ POWER7+ Path length (Instructions executed per input element) 18. instructions executed per input element Xeon POWER7+ Naive (scalar, block size = 1) Ours (scalar, block size = 2) Ours (scalar, block size = 3) Ours (scalar, block size = 4) Ours (scalar, block size = 5) Ours (SIMD, block size = 4) Branchless (scalar) Ours (STTNI, block size = 8) Figure 7. Branch misprediction rate, CPI, and path length. spite of its small number of branch mispredictions. Our algorithm achieved comparable or even better CPI than the branchless algorithm even with the larger numbers of branch mispredictions. This is due to our algorithm s higher instruction-level parallelism, since all of the comparisons in the all-pair comparisons can be done in parallel on the hyperscalar processors. For our algorithm with the SIMD instructions, we observed significant improvements in the path lengths. Because the parallel comparisons of the SIMD instructions make the all-pairs comparisons of the costly scalar comparisons unnecessary in most cases, this greatly reduced the number of instructions executed, even with the large block size of 4. When we use STTNI on Xeon, our algorithm achieved further reductions in the path lengths. Due to the shorter path lengths, the SIMD instructions showed huge boosts in the overall performance. Shorter is better Shorter is better Shorter is better 3

9 Scalar algorithms on Xeon Our scalar algorithm (block size = 3 x 3) Our scalar algorithm (block size = 2 x 4) branchless algorithm galloping algorithm Scalar algorithms on POWER7+ Our scalar algorithm (block size = 3 x 3) Our scalar algorithm (block size = 2 x 4) branchless algorithm galloping algorithm 4k/4k 4k/8k 4k/16k 4k/32k 4k/64k 4k/128k 4k/256k 4k/512k 4k/1M 4K/4K 4K/8K 4K/16K 4K/32K 4K/64K 4K/128K 4K/256K 4K/512K 4K/1M input array sizes (number of elements) input array sizes (number of elements) SIMD algorithms on Xeon Our SIMD algorithm (block size = 4 x 4) Our SIMD algorithm (block size = 4 x 8) Our SIMD algorithm w/ STTNI (block size = 8 x 8) SIMD galloping algorithm [9] SIMD algorithms on POWER7+ Our SIMD algorithm (block size = 4 x 4) Our SIMD algorithm (block size = 4 x 8) SIMD galloping algorithm [9] 5 5 4k/4k 4k/8k 4k/16k 4k/32k 4k/64k 4k/128k 4k/256k 4k/512k 4k/1M 4K/4K 4K/8K 4K/16K 4K/32K 4K/64K 4K/128K 4K/256K 4K/512K 4K/1M input array sizes (number of elements) input array sizes (number of elements) Figure 8. Performance of scalar and SIMD algorithms for intersecting 32-bit integer arrays on Xeon and POWER7+ when the sizes of the two input arrays are different. The selectivity was set to. 4.3 Performance For Two Arrays of Various Sizes In this section, we show how the differences in the sizes of the two input arrays and the total sizes of the input arrays affect the performance of each algorithm. Figure 8 compares the performances of scalar and SIMD algorithms for 32-bit integer arrays with changing ratios between the sizes of the two input arrays. When comparing our scalar algorithm with different block sizes, it worked best with the block size of S a = S b = 3 (we denote this block size as 3x3) when the sizes of two input arrays are the same (the leftmost point in the figure), while S a = 2 and S b = 4 (block size 2x4) worked better than the block size of 3x3 when the larger of the two input arrays was at least twice as large as the smaller array, as predicted in Section 3.2. For two input arrays with very different sizes, the numbers of branch mispredictions with merge-based algorithms, and ours, became much smaller than for the two arrays of the same size. When the sizes of the two input arrays are different, the hard-to-predict conditional branches to select which array s pointer to advance, e.g. the branches shown in bold in Figure 1, become relatively easier to predict because the frequency of one branch direction (taken or not-taken) becomes much higher than the other direction on average. This means there were fewer opportunities to improve the performance with our scalar algorithm. As a result, the absolute performances became higher for these algorithms and also the benefits of the reduced branch mispredictions with our algorithm became smaller with the larger gaps between the sizes of the two arrays. However, even for the largest differences between the sizes of the two arrays, our scalar algorithm with the block size of 2x4 achieved higher performance than. The branchless algorithm does not incur the branch misprediction overhead and hence its performance was not affected by the size ratio of the two arrays. As shown in many previous studies, when the ratio of the input sizes exceeds an order of magnitude, binary-search-based algorithms, such as the galloping algorithm in the figure, outperform the merge-based algorithms, including our algorithm. For our SIMD algorithms, the block size of 4x8 yielded better performance than the block size of 4x4 (or the block size of 8x8 with STTNI on Xeon) when the sizes of the two arrays are significantly different, while the block size of 4x4 gave the best performance when the two arrays are of the same size (the leftmost point in the figure). When the ratio of the input array sizes is very large, the V1 SIMD algorithm and the SIMD galloping algorithm [9] had better performances than our SIMD algorithm with the block size of 4x4 or 4x8. The V1 algorithm is a merge-based algorithm and is very similar to our algorithm with a block size of 1x8 implemented with SIMD instructions. Although this block size gave better performance for two arrays with very different sizes than 4x4 or 4x8, the binary-search-based galloping algorithm implemented with SIMD outperformed any mergebased algorithms we tested with such inputs. Based on these observations, we combined our block-based algorithm with the galloping algorithm, so as to improve the performance for datasets with very different sizes. We select the best algorithm based on the ratio of the sizes of the two input arrays. When using SIMD, we use the SIMD galloping algorithm when the ratio of input sizes is larger than 32. Otherwise we use our new SIMD algorithm. We use a block size setting of 4x8 when the ratio of input sizes is larger than 2., and a block size setting of 4x4 when the difference is smaller than this threshold. 31

### Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

### Techniques Covered in this Class. Inverted List Compression Techniques. Recap: Search Engine Query Processing. Recap: Search Engine Query Processing

Recap: Search Engine Query Processing Basically, to process a query we need to traverse the inverted lists of the query terms Lists are very long and are stored on disks Challenge: traverse lists as quickly

### EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

### Rethinking SIMD Vectorization for In-Memory Databases

SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest

### Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

Adaptive Stable Additive Methods for Linear Algebraic Calculations József Smidla, Péter Tar, István Maros University of Pannonia Veszprém, Hungary 4 th of July 204. / 2 József Smidla, Péter Tar, István

### character E T A S R I O D frequency

Data Compression Data compression is any process by which a digital (e.g. electronic) file may be transformed to another ( compressed ) file, such that the original file may be fully recovered from the

### The mathematics of RAID-6

The mathematics of RAID-6 H. Peter Anvin 1 December 2004 RAID-6 supports losing any two drives. The way this is done is by computing two syndromes, generally referred P and Q. 1 A quick

### FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

### NATIONAL UNIVERSITY OF SINGAPORE

NATIONAL UNIVERSITY OF SINGAPORE SCHOOL OF COMPUTING EXAMINATION FOR Semester 2 AY212/213 CS21 COMPUTER ORGANISATION April 213 Time allowed: 2 hours INSTRUCTIONS TO CANDIDATES 1. This examination paper

### Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

### Performance Basics; Computer Architectures

8 Performance Basics; Computer Architectures 8.1 Speed and limiting factors of computations Basic floating-point operations, such as addition and multiplication, are carried out directly on the central

### FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

### In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

### ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit

### Experimental Comparison of Set Intersection Algorithms for Inverted Indexing

ITAT 213 Proceedings, CEUR Workshop Proceedings Vol. 13, pp. 58 64 http://ceur-ws.org/vol-13, Series ISSN 1613-73, c 213 V. Boža Experimental Comparison of Set Intersection Algorithms for Inverted Indexing

### Boosting Long Term Evolution (LTE) Application Performance with Intel System Studio

Case Study Intel Boosting Long Term Evolution (LTE) Application Performance with Intel System Studio Challenge: Deliver high performance code for time-critical tasks in LTE wireless communication applications.

### Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation

Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder Computer Science and

### Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0

Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without

### IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

### 2

1 2 3 4 5 For Description of these Features see http://download.intel.com/products/processor/corei7/prod_brief.pdf The following Features Greatly affect Performance Monitoring The New Performance Monitoring

### IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

### Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus A simple C/C++ language extension construct for data parallel operations Robert Geva robert.geva@intel.com Introduction Intel

### CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

CS 2112 Spring 2014 Assignment 3 Data Structures and Web Filtering Due: March 4, 2014 11:59 PM Implementing spam blacklists and web filters requires matching candidate domain names and URLs very rapidly

### CPU-specific optimization. Example of a target CPU core: ARM Cortex-M4F core inside LM4F120H5QR microcontroller in Stellaris LM4F120 Launchpad.

CPU-specific optimization 1 Example of a target CPU core: ARM Cortex-M4F core inside LM4F120H5QR microcontroller in Stellaris LM4F120 Launchpad. Example of a function that we want to optimize: adding 1000

### Can you design an algorithm that searches for the maximum value in the list?

The Science of Computing I Lesson 2: Searching and Sorting Living with Cyber Pillar: Algorithms Searching One of the most common tasks that is undertaken in Computer Science is the task of searching. We

### Efficient Processing of Joins on Set-valued Attributes

Efficient Processing of Joins on Set-valued Attributes Nikos Mamoulis Department of Computer Science and Information Systems University of Hong Kong Pokfulam Road Hong Kong nikos@csis.hku.hk Abstract Object-oriented

### Benchmarking Hadoop & HBase on Violin

Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

### EMC DATA DOMAIN SISL SCALING ARCHITECTURE

EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily losing ground

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set

### COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING 2013/2014 1 st Semester Sample Exam January 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc.

### GMP implementation on CUDA - A Backward Compatible Design With Performance Tuning

1 GMP implementation on CUDA - A Backward Compatible Design With Performance Tuning Hao Jun Liu, Chu Tong Edward S. Rogers Sr. Department of Electrical and Computer Engineering University of Toronto haojun.liu@utoronto.ca,

### ProTrack: A Simple Provenance-tracking Filesystem

ProTrack: A Simple Provenance-tracking Filesystem Somak Das Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology das@mit.edu Abstract Provenance describes a file

### Information Retrieval

Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann LECTURE 4 INDEX COMPRESSION 10.10.2012 Information Retrieval, ETHZ 2012 1 Overview 1. Dictionary Compression 2. Zipf s Law 3. Posting List Compression

### Query Processing C H A P T E R12. Practice Exercises

C H A P T E R12 Query Processing Practice Exercises 12.1 Assume (for simplicity in this exercise) that only one tuple fits in a block and memory holds at most 3 blocks. Show the runs created on each pass

### Qlik Sense scalability

Qlik Sense scalability Visual analytics platform Qlik Sense is a visual analytics platform powered by an associative, in-memory data indexing engine. Based on users selections, calculations are computed

### APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented

### VLIW Processors. VLIW Processors

1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

### Jonathan Worthington Scarborough Linux User Group

Jonathan Worthington Scarborough Linux User Group Introduction What does a Virtual Machine do? Hides away the details of the hardware platform and operating system. Defines a common set of instructions.

### PERFORMANCE TOOLS DEVELOPMENTS

PERFORMANCE TOOLS DEVELOPMENTS Roberto A. Vitillo presented by Paolo Calafiura & Wim Lavrijsen Lawrence Berkeley National Laboratory Future computing in particle physics, 16 June 2011 1 LINUX PERFORMANCE

### Fast Arithmetic Coding (FastAC) Implementations

Fast Arithmetic Coding (FastAC) Implementations Amir Said 1 Introduction This document describes our fast implementations of arithmetic coding, which achieve optimal compression and higher throughput by

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE

### Architecture Sensitive Database Design: Examples from the Columbia Group

Architecture Sensitive Database Design: Examples from the Columbia Group Kenneth A. Ross Columbia University John Cieslewicz Columbia University Jun Rao IBM Research Jingren Zhou Microsoft Research In

### Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

### Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

### The mathematics of RAID-6

The mathematics of RAID-6 H. Peter Anvin First version 20 January 2004 Last updated 20 December 2011 RAID-6 supports losing any two drives. syndromes, generally referred P and Q. The way

### The programming language C. sws1 1

The programming language C sws1 1 The programming language C invented by Dennis Ritchie in early 1970s who used it to write the first Hello World program C was used to write UNIX Standardised as K&C (Kernighan

### Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

### Raima Database Manager Version 14.0 In-memory Database Engine

+ Raima Database Manager Version 14.0 In-memory Database Engine By Jeffrey R. Parsons, Senior Engineer January 2016 Abstract Raima Database Manager (RDM) v14.0 contains an all new data storage engine optimized

### Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

### Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

### arxiv: v1 [hep-lat] 5 Nov 2012

arxiv:12110820v1 [hep-lat] 5 Nov 2012, and Weonjong Lee Lattice Gauge Theory Research Center, CTP, and FPRD, Department of Physics and Astronomy, Seoul National University, Seoul, 151-747, South Korea

### Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Software Pipelining for (i=1, i

### Lecture 3: Single processor architecture and memory

Lecture 3: Single processor architecture and memory David Bindel 30 Jan 2014 Logistics Raised enrollment from 75 to 94 last Friday. Current enrollment is 90; C4 and CMS should be current? HW 0 (getting

### Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

### Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

### ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout

: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout Ziqiang Feng Eric Lo Ben Kao Wenjian Xu Department of Computing, The Hong Kong Polytechnic University Department of Computer

### Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

### Performance Monitoring of the Software Frameworks for LHC Experiments

Proceedings of the First EELA-2 Conference R. mayo et al. (Eds.) CIEMAT 2009 2009 The authors. All rights reserved Performance Monitoring of the Software Frameworks for LHC Experiments William A. Romero

### Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

### Benchmarking Cassandra on Violin

Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

### Algorithms and Data Structures

Algorithm Analysis Page 1 BFH-TI: Softwareschule Schweiz Algorithm Analysis Dr. CAS SD01 Algorithm Analysis Page 2 Outline Course and Textbook Overview Analysis of Algorithm Pseudo-Code and Primitive Operations

### FLIX: Fast Relief for Performance-Hungry Embedded Applications

FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...

### ! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions Basic Steps in Query

### 2. Compressing data to reduce the amount of transmitted data (e.g., to save money).

Presentation Layer The presentation layer is concerned with preserving the meaning of information sent across a network. The presentation layer may represent (encode) the data in various ways (e.g., data

### Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'

### Control 2004, University of Bath, UK, September 2004

Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

### Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems Ingo Müller #, Cornelius Ratsch #, Franz Faerber # ingo.mueller@kit.edu, cornelius.ratsch@sap.com, franz.faerber@sap.com

### HyperThreading Support in VMware ESX Server 2.1

HyperThreading Support in VMware ESX Server 2.1 Summary VMware ESX Server 2.1 now fully supports Intel s new Hyper-Threading Technology (HT). This paper explains the changes that an administrator can expect

### 7.1 Our Current Model

Chapter 7 The Stack In this chapter we examine what is arguably the most important abstract data type in computer science, the stack. We will see that the stack ADT and its implementation are very simple.

### NetFlow Performance Analysis

NetFlow Performance Analysis Last Updated: May, 2007 The Cisco IOS NetFlow feature set allows for the tracking of individual IP flows as they are received at a Cisco router or switching device. Network

### Oracle EXAM - 1Z0-117. Oracle Database 11g Release 2: SQL Tuning. Buy Full Product. http://www.examskey.com/1z0-117.html

Oracle EXAM - 1Z0-117 Oracle Database 11g Release 2: SQL Tuning Buy Full Product http://www.examskey.com/1z0-117.html Examskey Oracle 1Z0-117 exam demo product is here for you to test the quality of the

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Optimizing Java* and Apache Hadoop* for Intel Architecture

WHITE PAPER Intel Xeon Processors Optimizing and Apache Hadoop* Optimizing and Apache Hadoop* for Intel Architecture With the ability to analyze virtually unlimited amounts of unstructured and semistructured

### CIS371 Computer Organization and Design Final Exam Solutions Prof. Martin Wednesday, May 2nd, 2012

1 CIS371 Computer Organization and Design Final Exam Solutions Prof. Martin Wednesday, May 2nd, 2012 1. [ 12 Points ] Datapath & Pipelining. (a) Consider a simple in-order five-stage pipeline with a two-cycle

### Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

### Representation of Data

Representation of Data In contrast with higher-level programming languages, C does not provide strong abstractions for representing data. Indeed, while languages like Racket has a rich notion of data type

### Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

### Symbol Tables. Introduction

Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

### Computer Organization and Components

Computer Organization and Components IS5, fall 25 Lecture : Pipelined Processors ssociate Professor, KTH Royal Institute of Technology ssistant Research ngineer, University of California, Berkeley Slides

### SPARC64 VIIIfx: CPU for the K computer

SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS

### Why? A central concept in Computer Science. Algorithms are ubiquitous.

Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

### CS4617 Computer Architecture

1/39 CS4617 Computer Architecture Lecture 20: Pipelining Reference: Appendix C, Hennessy & Patterson Reference: Hamacher et al. Dr J Vaughan November 2013 2/39 5-stage pipeline Clock cycle Instr No 1 2

### Oracle Solaris Studio Code Analyzer

Oracle Solaris Studio Code Analyzer The Oracle Solaris Studio Code Analyzer ensures application reliability and security by detecting application vulnerabilities, including memory leaks and memory access

### Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

### Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of

### A Time Efficient Algorithm for Web Log Analysis

A Time Efficient Algorithm for Web Log Analysis Santosh Shakya Anju Singh Divakar Singh Student [M.Tech.6 th sem (CSE)] Asst.Proff, Dept. of CSE BU HOD (CSE), BUIT, BUIT,BU Bhopal Barkatullah University,

### Operating Systems. Memory Management. Lecture 9 Michael O Boyle

Operating Systems Memory Management Lecture 9 Michael O Boyle 1 Chapter 8: Memory Management Background Logical/Virtual Address Space vs Physical Address Space Swapping Contiguous Memory Allocation Segmentation

### Handling large amount of data efficiently

Handling large amount of data efficiently 1. Storage media and its constraints (magnetic disks, buffering) 2. Algorithms for large inputs (sorting) 3. Data structures (trees, hashes and bitmaps) Lecture

### Some Computer Organizations and Their Effectiveness. Michael J Flynn. IEEE Transactions on Computers. Vol. c-21, No.

Some Computer Organizations and Their Effectiveness Michael J Flynn IEEE Transactions on Computers. Vol. c-21, No.9, September 1972 Introduction Attempts to codify a computer have been from three points

### Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software

### FCE: A Fast Content Expression for Server-based Computing

FCE: A Fast Content Expression for Server-based Computing Qiao Li Mentor Graphics Corporation 11 Ridder Park Drive San Jose, CA 95131, U.S.A. Email: qiao li@mentor.com Fei Li Department of Computer Science

### Memory Management. Main memory Virtual memory

Memory Management Main memory Virtual memory Main memory Background (1) Processes need to share memory Instruction execution cycle leads to a stream of memory addresses Basic hardware CPU can only access

### Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs)

Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs) 1. Foreword Magento is a PHP/Zend application which intensively uses the CPU. Since version 1.1.6, each new version includes some

### File System Implementation II

Introduction to Operating Systems File System Implementation II Performance, Recovery, Network File System John Franco Electrical Engineering and Computing Systems University of Cincinnati Review Block