DFT Performance Prediction in FFTW
Liang Gu and Xiaoming Li
University of Delaware

Abstract. Fastest Fourier Transform in the West (FFTW) is an adaptive FFT library that generates highly efficient Discrete Fourier Transform (DFT) implementations. It is one of the fastest FFT libraries available, and it outperforms many adaptive or hand-tuned DFT libraries. Its success largely relies on the huge search space spanned by several FFT algorithms and a set of compiler-generated C routines (called codelets) for small-size DFTs. FFTW empirically finds the best algorithm by measuring the performance of different algorithm combinations. Although the empirical search works very well for FFTW, the search process does not explain why the best plan found performs best, and the search overhead grows polynomially as the DFT size increases. The opposite of empirical search is model-driven optimization. However, it is widely believed that model-driven optimization is inferior to empirical search and is particularly powerless to solve problems as complex as the optimization of DFT. In this paper, we propose a model-driven DFT performance predictor that can replace the empirical search engine in FFTW. Our technique adapts to different architectures and automatically predicts the performance of DFT algorithms and codelets (including SIMD codelets). Our experiments show that this technique renders DFT implementations that achieve more than 95% of the performance of the original FFTW while using less than 5% of the search overhead on four test platforms. More importantly, our models give insight into why different combinations of DFT algorithms perform differently on a processor given its architectural features.

1 Introduction

Adaptive libraries usually apply application-specific knowledge to define a search space of potential implementations of their target routines, such as BLAS and DFT. Empirical search is then used to find the best-performing version in that search space.
Well-known adaptive libraries include FFTW [8,0], ATLAS [8] generating the Basic Linear Algebra Subprograms (BLAS), SPIRAL [4] performing linear digital signal processing transforms, as well as Sparskit [6] and Sparsity [] focusing on sparse numerical computation. These libraries usually outperform the best hand-tuned implementations, and sometimes achieve peak performance on some platforms. It is generally acknowledged that the good performance of these libraries depends on the generators' extensive search processes.

G.R. Gao et al. (Eds.): LCPC 2009, LNCS 5898. © Springer-Verlag Berlin Heidelberg 2010

A typical example is SPIRAL,
which employs empirical search for both algorithm and implementation optimizations. However, there has been a long-standing question: why does empirical search seem to be indispensable for the generation of high-quality code? In some cases, empirical search explores implementation forms that are hard to deduce from models. Other times, it selects better values for the parameters of the transformations that are applied to the routine. Finally, the advantage of empirical search might be its capability of applying optimizing transformations in different orders, so that it can solve the so-called phase-ordering problem [2].

It is helpful to compare empirical search with its opposite, model-driven optimization. There have been some previous efforts to compare these two approaches in ATLAS. Yotov et al. [9] show that the empirical search in ATLAS can be successfully replaced by an architectural model while still maintaining its good performance. Furthermore, researchers in the signal processing domain have accumulated a set of heuristics for the selection of the best DFT algorithms for a particular problem []. More recently, Fraguela et al. [6] show that a properly built memory model can successfully select the best-performing DFT algorithms in SPIRAL even though its runtime prediction is far from accurate. Overall, it is still unclear how architectural features of a processor, such as instruction latency and SIMD (Single-Instruction Multiple-Data) instructions, affect the performance of a DFT.

This paper presents a quantitative evaluation of the empirical search strategy of FFTW and a new model-driven optimization that replaces the original FFTW empirical search engine and produces equally high-quality code. The performance modeling of the empirical search in FFTW differs from that of many other library generators, like ATLAS, which have relatively fixed implementation formulas and a limited number of parameters to be tuned.
FFTW, however, has some pre-generated codelets together with other algorithms that can potentially compose the best implementation for a DFT problem. The empirical search engine in FFTW tries to find the best combination of problem decomposition strategies, which is called a plan in FFTW. Each of the decomposition strategies is a complex manipulation of the target FFT problem using codelets or other algorithms. Therefore, the fundamental degree of freedom in the FFTW search space is not parameter values but the algorithmic structure of a solution.

This paper makes two major contributions. The first is the construction of performance prediction models for several frequently used DFT algorithms and DFT codelets. In particular, we present how to automatically determine the parameter values of these models on different computer architectures. The second contribution is the composition of these individual DFT models into a model-driven optimization engine. It produces plans that are comparable in quality to those generated by exhaustive search in FFTW while eliminating most of the overhead of empirical search.

The rest of this paper is organized as follows. Section 2 describes the existing empirical search engine of FFTW. In Section 3, we quantitatively analyze and model the performance of several DFT algorithms and DFT codelets used in FFTW. Section 4 presents the model-driven optimization engine for FFTW. In Section 5, we comprehensively compare a modified FFTW using our model-driven optimization
engine with the original FFTW on four different platforms. Finally, we conclude and suggest future directions of research in Section 6.

2 Empirical Searching in FFTW

FFTW has a code generation phase and a runtime phase. In its code generation phase, FFTW generates and compiles C subroutines for small DFTs, while in its runtime phase, FFTW conducts all of its empirical search. All of FFTW's capability to adapt to different computer architectures is the result of its runtime empirical search. As a result, the efficiency and accuracy of FFTW's runtime empirical search are relatively more important than those of other libraries. In this section, we overview these two phases and some important factors that contribute to the high performance of FFTW.

2.1 FFTW Code Generation Phase

FFTW's low-level optimization is performed within codelets [7]. Some of the codelets are architecture independent (scalar codelets), while others specifically take advantage of data-level parallelism by using SIMD instruction extensions such as SSE, SSE2, Altivec, etc. All these highly optimized straight-line codelets are generated using a special-purpose compiler called genfft, written in OCaml. For each small DFT size, genfft implements only one DFT algorithm, which is expected to be efficient on most architectures. DFT algorithms used in codelets include the Cooley-Tukey algorithm [4], the Prime Factor algorithm [3, p69], the Generic DFT algorithm and the Split-radix algorithm [5]. For each small-size DFT codelet, genfft first implements one of the above DFT algorithms. Then it simplifies the initial code with a handful of common compiler optimizations such as constant folding and common subexpression elimination. Two DFT-specific optimizations, i.e., making all numerical constants positive and applying network transposition, are also used. genfft schedules the instructions using a cache-oblivious algorithm to achieve the asymptotic optimum for register spilling.
Finally, the internal representation of the algorithm, written in OCaml, is unparsed to straight-line C code. Most users do not need to use the codelet generator during installation. The downloaded FFTW package includes pre-generated codelets for DFT problem sizes from 2 to 64. Users can generate codelets of other sizes according to their needs using genfft.

2.2 FFTW Runtime Phase

FFTW performs empirical search in the runtime phase. When the FFTW library is invoked by a user program, the FFTW search engine applies a wide range of DFT algorithms to the target problem, so as to span a search space large enough to cover the best solution. Among all O(n log n) FFT algorithms, FFTW implements those that are efficient and widely used on modern architectures,
including Cooley-Tukey for composite-size DFTs, and Rader's [5] and Bluestein's [2] for prime-size DFT problems. The Generic DFT (or Direct DFT) O(n²) algorithm is the most straightforward solution to DFT problems and is able to handle all DFT sizes. The Generic DFT algorithm is generally much slower than FFT algorithms. However, FFTW still implements it because it may be beneficial for some small-size problems. Each of the above algorithms may have several variants (referred to as different solvers). For instance, Cooley-Tukey has Decimation In Time (DIT) and Decimation In Frequency (DIF), buffered and non-buffered solvers. By using these different solvers, FFTW can generally decompose a large-size DFT problem into smaller sub-problems and recursively solve those sub-problems. One thing worth noting is that even a prime-size DFT can be decomposed into a composite-size one by applying Rader or Bluestein. This decomposition process stops when the problem size is small enough that FFTW can solve it directly with a codelet or a Generic DFT solver.

For a specific DFT problem, the search space of FFTW grows like a tree, and it can become extremely large for large DFT sizes. The decision of which solver to choose at each level generally relies on both that solver's performance and its child solvers' performance. A fundamental question raised by this scenario is how we should traverse this huge search space to find the best solution. An easy answer is exhaustive empirical search, which guarantees the best solution within the search space. Indeed, FFTW applies four strategies in its search engine, namely Exhaustive, Patient, Measure and Estimate, in order of decreasing size of the search space explored. The first three strategies apply different possible solvers, run each version of the code, measure the performance and select the best, which generally takes a long time to return the best solution for a large DFT problem.
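As a toy illustration of how quickly this plan tree grows, the following sketch counts the distinct Cooley-Tukey decomposition trees of a size-n DFT, assuming a hypothetical set of codelet sizes. The real FFTW space is far larger, since each split also varies DIT/DIF, buffering and strides.

```python
from functools import lru_cache

# Hypothetical subset of directly solvable codelet sizes.
CODELET_SIZES = {2, 3, 4, 5, 8, 16, 32, 64}

@lru_cache(maxsize=None)
def count_plans(n):
    """Count distinct decomposition trees for a size-n DFT (toy model).

    A size may be solved directly by a codelet, or split as n = r * m
    via Cooley-Tukey, with each child planned recursively. The order
    of the factors matters, as it does for DIT plans in FFTW.
    """
    total = 1 if n in CODELET_SIZES else 0
    for r in range(2, n):
        if n % r == 0:
            total += count_plans(r) * count_plans(n // r)
    return total
```

Even this stripped-down model gives 5 plans for n = 8, 15 for n = 16, and 51 for n = 32; memoizing the per-size counts mirrors the dynamic programming that FFTW itself uses to keep the traversal tractable.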
The Estimate mode uses simple heuristics to estimate performance and hence predict the best solver combination; it is often inaccurate but is the fastest strategy. An ideal optimization technique for FFTW should generate plans that perform comparably with the plans generated by the Exhaustive strategy but with much less overhead.

3 Program Analysis and Prediction Models

A specific DFT implementation, i.e., a plan, is a hierarchical combination of individual solvers. In order to predict the combination's performance, we start from individual DFT algorithms. FFTW employs both the Generic DFT algorithm and several FFT algorithms that divide a DFT into child DFT problems. A recursive implementation of an FFT algorithm has a complexity of O(n log n), but the kernel part has a complexity of O(n) if child problems are excluded. An example of such a kernel part is the twiddle factor multiplication in Cooley-Tukey. Our models predict the performance of FFT kernels recursively and the performance of Generic DFT as a whole. Codelets are highly efficient components in FFTW, and the correct choice of them is crucial to performance. Since each of them has a fixed number of instructions but is invoked with frequently different strides, we predict its performance with respect to those strides.
Generally, it is very hard to predict the performance of a program on any modern architecture. However, some efforts have succeeded for special classes of programs; for example, the performance of various benchmarks was predicted in [7] using an abstract machine model. Models of the memory hierarchy can be combined with empirical search to improve the performance of dense-matrix computations, as shown in [3]. More recently, a highly effective model-driven optimization engine [9] was developed for ATLAS to predict the relative performance between code versions that have different values for transformation parameters. These works inspired us to propose an adaptive model-driven DFT performance prediction technique for FFTW. Our model-driven search engine is developed in three steps: (1) program analysis and performance modeling using a fractional abstract machine model and a codelet model; (2) training the models on the target computer to determine their architecture-dependent parameters using regression; (3) recursive performance prediction to choose the optimum. We do not claim that these models always give a very accurate runtime prediction for complex DFT plans on all architectures, nor do we need them to in FFTW. What we really care about is the relative performance between different solvers, and we can tolerate some performance prediction error while still being able to pick a good solution.

3.1 Fractional Abstract Machine Model

As mentioned before, FFTW has four major O(n log n) FFT algorithms: Cooley-Tukey, Bluestein, Rader and the Prime Factor algorithm. Among them, Prime Factor, which involves less computation but more memory operations compared to Cooley-Tukey, is not beneficial on modern architectures and is not implemented in the high-level optimization. The Cooley-Tukey implementation is limited to the case where both factors are larger than 16. Otherwise, its kernel, i.e.,
the twiddle factor multiplication, is incorporated into special codelets used in one of its child problems. We use a fractional abstract machine model to predict the performance of the three FFT kernels and of Generic DFT. We start modeling the algorithms' performance by putting them in an ideal environment where all data are in the L1 cache (i.e., there are no memory stalls) and hence the performance is determined only by the type and the number of operations. This assumption follows the work of Saavedra et al. [7], in which an abstract machine model based on the Fortran language was used to predict the performance of benchmarks. Under the no-stall and no-parallelism assumption, the runtime of a program is a weighted linear combination of operation counts, as shown in (1):

    T_{A,M} = sum_{i=1}^{n} C_{A,i} * P_{M,i} = C_A . P_M.    (1)

C_A is the program characteristic vector, and C_{A,i} is the number of instructions of type i in algorithm A. A complete C_A should include all instruction types that appear in the code. The vector C_A is obtained from a static analysis of the source code, and part of it is already provided by FFTW. P_M is the machine performance vector that describes the cost of each instruction type on a specific computer architecture.
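Equation (1) is just a dot product. A minimal sketch, with made-up instruction counts and made-up per-operation effective latencies, looks like:

```python
# Instruction categories kept in the model (an illustrative subset).
INSTRUCTION_TYPES = ["fp_add", "fp_mul", "fp_div", "load", "store"]

def predict_runtime(char_vec, perf_vec):
    """Equation (1): T_{A,M} = sum_i C_{A,i} * P_{M,i} = C_A . P_M."""
    return sum(c * p for c, p in zip(char_vec, perf_vec))

# Hypothetical counts for one kernel invocation (C_A) and effective
# per-operation latencies in cycles learned by regression (P_M).
C_A = [3400, 2200, 0, 1800, 1800]
P_M = [0.5, 0.5, 20.0, 1.2, 1.0]

cycles = predict_runtime(C_A, P_M)
```

In the paper's setting, C_A comes from static analysis of the kernel source, while P_M is what the per-algorithm regression described next has to recover.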
This model is an explicit model, where each element of the machine performance vector is the latency of the corresponding instruction, and the machine performance vector can be explicitly measured with a set of micro-benchmarks or collected from the processor manuals. The model is simplified by including only the most expensive instructions, such as floating-point addition, multiplication, division and memory accesses, which take the majority of the execution time. Furthermore, we do not consider the very limited number of function call and branch instructions in the kernel code. As for loop condition branches, most processors predict them quite well, and they incur almost no more overhead than an integer operation. Next we want to remove the no-parallelism and no-stall assumption. To do so, we need to calculate the effective machine performance vector, that is, the effective latencies of the various operations. The effective latency depends on the schedule of instructions and might differ from the absolute operation delay defined in the processor manual. However, it is usually very difficult to deduce effective operation latencies from source code given the intrinsic latencies of instructions, because of differences between instruction set architectures (ISAs) and the complex interactions between instructions, such as instruction-level parallelism (ILP). Therefore, instead of treating the machine performance vector as a global invariant, we make it local to each FFT algorithm, because it is determined by the algorithm's unique instruction schedule, stalls and ILP. As a result, each algorithm has its own local performance vector on a particular computer architecture, and it is determined empirically using regression methods. The empirical determination of algorithm-specific performance vectors accounts for both the intrinsic latency of the corresponding instructions and the unique instruction mix and schedule of an FFT algorithm kernel.
In summary, the effective performance vector of an algorithm is determined by the tuple (algorithm, architecture) and can be learned by first measuring a number of values T_{A,M}, each being the runtime of a specific application of algorithm A on machine M. The program characteristic vector C_A can be obtained by analyzing the code using common compiler parsing techniques. Finally, the performance vector P_M is regressed from T_{A,M} and C_A using (1). The details of the exact regression method are discussed in Section 4. One thing worth noting is that all loop bounds in the algorithms can be statically determined, which makes it easy to compute C_A.

Finally, we take into consideration the different memory access latencies in our model. The typical two-level cache hierarchy of modern architectures has a very important impact on the performance of large-size DFTs. A careful study of the relation between cache misses and runtime reveals that we need to instantiate the above model in three DFT size segments separately, according to the L2 cache size. Figure 1(a) shows the profile of average stalled cycles per memory access against the total number of memory accesses in the Bluestein kernel. These numbers are collected using hardware counters. The L1 and L2 caches of the tested platform are 64 KB and 6 MB. When Bluestein's memory footprint is less than roughly half the L2 size, the average stalled cycles per memory access is
relatively small, because most stalls are caused by L1 cache misses. With the increase of the total number of memory accesses, and hence the L2 miss rate, there is a significant increase in the average stalled cycles per memory access. The increase stops when the memory footprint is larger than roughly four times the L2 cache size. Profiles of other FFT solver kernels show a similar change of average memory access cost on different architectures.

Fig. 1. Profile of Bluestein and Generic DFT on Intel Xeon. (a) Average stalled cycles per memory access vs. number of memory accesses for different DFT sizes (Bluestein); (b) L1 cache misses, L2 cache misses and runtime, each divided by n², vs. DFT size (Generic DFT).

Since the cost of the other instructions remains unchanged across DFT sizes, we only need to adjust the cost of memory accesses according to the DFT size. Unlike the work by Fraguela et al. [6], which uses an analytic memory model, we develop an empirical model to compute the cost of memory accesses in the algorithms. For the Bluestein, Cooley-Tukey and Rader kernels, the average cost of a memory access can be treated as a constant for DFT sizes that are much smaller or much larger than the L2 cache, where L1 or L2 cache misses, respectively, dominate the cost. A linear interpolation of the average memory access cost is used for the transition segment where the DFT memory footprint is around the L2 cache size. The L2 transition region can be determined using hardware counters, as shown in Figure 1, or empirically, i.e., by capturing the nonlinear turning point in the runtime-over-DFT-size data sets. Our experimental results in Section 5 demonstrate that our implicit abstract machine models achieve on average less than 10% error in predicting the performance of FFT algorithms.

The only O(n²) algorithm used in FFTW is the Generic DFT algorithm. It follows the direct definition of the DFT.
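Because every major operation in Generic DFT scales with n², its model reduces (as developed below) to fitting a single quadratic coefficient per cache segment, t = q·n². A minimal sketch of that fit, on synthetic measurements:

```python
def fit_quadratic_coeff(sizes, runtimes):
    """Least-squares fit of q in t = q * n**2 within one cache segment.

    Minimizing sum_j (q * n_j**2 - t_j)**2 over q gives the closed form
    q = sum(t_j * n_j**2) / sum(n_j**4).
    """
    num = sum(t * n**2 for n, t in zip(sizes, runtimes))
    den = sum(n**4 for n in sizes)
    return num / den

# Synthetic measurements generated from t = 3.0 * n**2 exactly,
# so the fit should recover q = 3.0.
sizes = [50, 100, 150]
times = [3.0 * n**2 for n in sizes]
q = fit_quadratic_coeff(sizes, times)
```

In practice each cache segment (L1-dominated, L2-dominated) gets its own q, and the transition segment is interpolated rather than fitted.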
For a size-n DFT problem, Generic DFT works n times on size-2n complex input/output data and accesses n² complex twiddle factors. Figure 1(b) shows the L1 and L2 cache misses and the runtime, each divided by n², of Generic DFT on Intel Xeon for different problem sizes n. As indicated by the analysis of Generic DFT and verified in Figure 1(b), all major arithmetic operations and memory accesses grow quadratically with n. With this observation, our fractional abstract machine model simplifies to a fractional quadratic model: t = q·n². The quadratic coefficient q is determined separately on three regions: q is obtained by regression when the memory footprint is smaller than half of, or larger than four times, the L2 cache, and interpolation is used for the transition region around the L2 cache size. Actually, our fractional
performance model is overkill for the application of Generic DFT in FFTW. Because of the algorithm's quadratic complexity, it can only be beneficial for small problem sizes, in which case all data fits in the L2 cache. FFTW hardcodes a constraint so that Generic DFT is applicable only for problem sizes n <= 73. Although the first segment of our fractional quadratic model is enough for FFTW, we still present the full version of this Generic DFT performance prediction model for completeness.

3.2 Performance Model for Codelets

As described before, codelets are highly optimized straight-line C code for DFTs of small sizes or sizes with small factors. There are primarily two kinds of codelets: direct codelets (n codelets), which solve a size-n DFT directly, and twiddle codelets (t codelets), which solve DFTs of size n·m following Cooley-Tukey. FFTW version 3.2.1 includes codelets for sizes 2-16, 20, 25, 32 and 64. A direct codelet, and each of the m iterations of a twiddle codelet, has a fixed number of instructions. Unlike the other FFT algorithms, which are more likely to be applied at high levels of a DFT decomposition and thus have smaller strides, codelets can appear at any level of the decomposition and may have large strides because of Cooley-Tukey's shuffle effect. Figure 2(a) shows the performance of the direct codelet n_25 with different strides. We can make a couple of observations from the figure. First, as the stride increases, the runtime of the codelet rises rapidly in a particular stride region; in this region, the memory footprint of the codelet is roughly between half and four times the L2 cache. Similar to the other FFT algorithm kernels, the performance of codelets degrades in this L2 transition region, as the ratio of L2 cache misses per memory access increases.
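The segment-plus-interpolation scheme used throughout these models (a constant cost below roughly half the L2 size, a constant cost above roughly four times the L2 size, and linear interpolation in between) can be sketched as follows; the per-access costs here are made up:

```python
def mem_access_cost(footprint_bytes, l2_bytes, cost_small, cost_large):
    """Average cost per memory access in the three-segment model.

    Below half the L2 size the cost is dominated by L1 misses
    (cost_small); above four times the L2 size it is dominated by L2
    misses (cost_large); in the transition region we interpolate
    linearly between the two.
    """
    lo, hi = l2_bytes / 2.0, 4.0 * l2_bytes
    if footprint_bytes <= lo:
        return cost_small
    if footprint_bytes >= hi:
        return cost_large
    frac = (footprint_bytes - lo) / (hi - lo)
    return cost_small + frac * (cost_large - cost_small)

L2 = 6 * 2**20  # 6 MB, matching the Xeon test platform's L2
small = mem_access_cost(1 * 2**20, L2, 2.0, 30.0)     # L1-dominated
mid = mem_access_cost(13.5 * 2**20, L2, 2.0, 30.0)    # transition region
large = mem_access_cost(64 * 2**20, L2, 2.0, 30.0)    # L2-dominated
```

The two plateau costs are what the regression actually fits; only the transition segment relies on interpolation.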
Performance of codelets is relatively stable in the regions where the memory footprint is smaller than half of the L2 cache or larger than four times the L2 cache, because the ratio of L1/L2 cache misses per memory access is relatively fixed there. The second observation is that a codelet with power-of-two strides performs up to 100% worse than with other strides of similar sizes. This is not hard to understand, because all cache feature sizes are powers of two, and such a stride easily results in extra conflict cache misses. This effect is not obvious in the stride regions where the memory footprint is smaller than the L1 cache: power-of-two strides that are smaller than L1 will not be mapped to the same L1 cache set, causing fewer conflict misses, and the cost of an L1 cache miss is relatively small. Finally, even but non-power-of-two strides perform better than similar odd strides; this is probably because an even stride has better data alignment, resulting in fewer physical memory accesses at the different memory levels.

Fig. 2. Runtime of two codelets, (a) n_25 and (b) nfv25, with power-of-two, odd and other even strides on Xeon.

Besides what we discussed above, strides with large power-of-two factors also affect the performance of codelets. For example, if stride_a = n = L2_size/4, where n is a power of two, every 4th element accessed will be mapped to the same L2 cache set. If stride_b = 5n/4, then stride_b has a large power-of-two factor, n/4, and every 16th element accessed will be mapped to the same L2 cache set. With similar compulsory cache misses but fewer conflict misses, we can expect the performance of even strides with a large power-of-two factor to be better than that of power-of-two strides but worse than the other cases. Accordingly, the performance model for codelets divides codelet strides into three cases: power-of-two, odd and even strides. If the memory access region is smaller than half of L2 or larger than four times L2, an averaged runtime is used for that segment and stride type. Interpolation is used in the L2 cache transition segment for each stride type. However, when an even stride n has a large power-of-two factor, it is treated differently. Assume stride n = 2^p · m, where m is an odd number and 2^q < m < 2^(q+1); then a coefficient α = p/(p + q) is adopted to represent the power-of-two fraction of n. The predicted runtime is adjusted according to (2):

    T = T_e + α · (T_p - T_e),    (2)

where T_p and T_e are the runtimes of the codelet with close power-of-two and even non-power-of-two strides, respectively.
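A sketch of the stride adjustment in equation (2), factoring the stride as 2^p · m and blending the measured even-stride and power-of-two-stride runtimes (all runtime values below are made up):

```python
def power_of_two_fraction(stride):
    """alpha = p / (p + q) for stride = 2**p * m, m odd, 2**q <= m < 2**(q+1)."""
    p, m = 0, stride
    while m % 2 == 0:
        m //= 2
        p += 1
    q = m.bit_length() - 1  # 2**q <= m < 2**(q+1)
    if p + q == 0:          # stride == 1: no power-of-two penalty
        return 0.0
    return p / (p + q)

def adjusted_runtime(t_even, t_pow2, stride):
    """Equation (2): T = T_e + alpha * (T_p - T_e)."""
    alpha = power_of_two_fraction(stride)
    return t_even + alpha * (t_pow2 - t_even)
```

For stride 5120 = 2^10 · 5 this gives α = 10/12, so the prediction sits close to the power-of-two runtime; a pure power of two gives α = 1 and recovers T_p exactly.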
When the stride is not close to a power of two or to a number with a large power-of-two factor, and it is smaller than the L1 cache or much larger than the L2 cache, the averaged runtime of that stride type is used; otherwise, interpolation is applied on the odd/even strides depending on the stride type.

3.3 Performance Model for SIMD Codelets

The current version of FFTW, fftw-3.2.1, supports SIMD instruction extensions such as SSE, SSE2 and Altivec. FFTW takes advantage of SIMD instructions at the codelet level by including direct and twiddle SIMD codelets. FFTW uses two implementation schemes: SIMD with vector length two and SIMD with vector length four. SIMD with vector length four exploits parallelism between different iterations of the DFT problem. On the other hand, SIMD with vector length two relies on the natural parallelism between the real and imaginary parts of the complex data, i.e., DFT(A + iB) = DFT(A) + i·DFT(B), and is applicable to more problems. The input/output data in memory is required to be correctly aligned for SIMD instructions before computation. Figure 2(b) shows the performance of the SIMD direct codelet nfv25 on an Intel Xeon processor with the SIMD extension SSE2 enabled. Although SIMD
codelets perform better than the corresponding scalar versions, the performance pattern of a SIMD codelet across stride types is similar. Therefore, we still measure the performance of each SIMD codelet with power-of-two, odd and even strides separately. When the memory footprint of a codelet with a certain stride is smaller than half of L2 or larger than four times L2, an averaged measurement is used directly. Otherwise, the same interpolation method as for scalar codelets is applied to estimate the runtime of codelets with arbitrary strides.

4 Model-Driven Optimization Engine for FFTW

In this section, we describe the training of the performance prediction models and the replacement of the original empirical search engine in FFTW with our model-driven optimization engine.

4.1 Training of DFT Performance Models

The integration of the model-driven optimization engine begins with the training of the performance models for a specific processor, so that the values of the parameters that depend on architectural features can be determined. The training is done only once for a computer, when each solver is registered into the FFTW planner during installation. We measure the performance of each solver for different sizes or strides and obtain the model parameters using linear regression. For the abstract machine model of each FFT algorithm kernel, we train on 10-20 randomly selected sizes, which is more than the number of independent parameters in the model, because some instruction counts are linearly dependent (e.g., 2n multiplications and 3n additions in the Cooley-Tukey algorithm), which decreases the degrees of freedom of the model. For each training size, we estimate the number of instructions of each type and measure the execution time of the algorithm kernel. Then we use weighted linear least-squares regression to determine the optimal model parameters. Given n instruction types and m training points, the j-th actual runtime is T_{A,M,j}.
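The weighted least-squares fit used here (developed in equations (3) and (4) below) can be sketched with NumPy on synthetic data; the instruction counts and latencies are invented, and the weights 1/T_j² make the relative runtime error the quantity being minimized:

```python
import numpy as np

def fit_performance_vector(C, T):
    """Weighted linear least squares: P = (C^T W C)^{-1} C^T W T,
    with diagonal weights w_jj = 1 / T_j**2 so that the *relative*
    runtime error is what gets minimized."""
    C = np.asarray(C, dtype=float)
    T = np.asarray(T, dtype=float)
    W = np.diag(1.0 / T**2)
    return np.linalg.solve(C.T @ W @ C, C.T @ W @ T)

# Synthetic training data: runtimes generated from a known P_true,
# so the regression should recover it exactly.
P_true = np.array([0.5, 1.2])          # per-op effective latencies
C = np.array([[100.0, 40.0],           # instruction counts per run
              [250.0, 90.0],
              [400.0, 160.0]])
T = C @ P_true                          # noiseless measured runtimes
P_fit = fit_performance_vector(C, T)
```

With real, noisy measurements the recovered P is only an effective latency vector for that algorithm's instruction mix, which is exactly the locality the paper argues for.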
Since we care about relative runtime error rather than absolute error, the weighted residual in (3) is minimized to obtain the optimal solution:

    sum_{j=1}^{m} ( (sum_{i=1}^{n} C_{A,i,j} * P_{M,i} - T_{A,M,j}) / T_{A,M,j} )².    (3)

The solution to this weighted linear least-squares problem is given in matrix form in (4), where W is a diagonal matrix with w_{j,j} = 1 / T_{A,M,j}²:

    P = (C^T W C)^{-1} C^T W T.    (4)

For the fractional quadratic model of Generic DFT and the codelet performance model, a similar weighted linear least-squares regression is used for the size or
stride segments of the model that are dominated by L1 or L2 misses. Only one parameter, i.e., the quadratic coefficient q or the runtime, is extracted in each case by the regression. Linear interpolation of q or of the runtime is used in the transition segment of each model, where the regression does not apply. One last thing to mention is the variance of measurements between multiple executions of a kernel or a codelet. The variance is typically less than 5% of the total runtime, and we minimize this effect by repeating each measurement several times and picking the minimum. The training of all models typically takes just a few seconds on each experimental platform we use.

4.2 Replacement of the Original Empirical Search Engine

Our model-driven optimization engine still follows the workflow of recursive search in FFTW; only the performance measurement part is replaced by performance prediction using our models. For each solver, given the specific DFT size and stride, an estimated cost is generated by the performance models. FFT solvers having child DFT problems return the sum of their kernel performance prediction and that of the child problems, which is obtained recursively. One thing to note is that solvers derived from the same FFT algorithm but with different implementation details, like buffered or non-buffered, have separate models. Like the original empirical search engine, dynamic programming is still used to avoid redundant performance predictions and solution searches. At the end of the search, the plan with the least predicted cost is returned as the optimum. The FFTW search engine has some internal search flags that constrain the search space. These flags are mapped from the user-interface patience levels, namely Exhaustive, Patient, Measure and Estimate [9]. Exhaustive mode traverses the whole FFTW search space and takes the longest time to return a plan, e.g., 503 seconds for a large DFT size on a 2 GHz Athlon desktop.
Patient and Measure modes exclude solvers that are unlikely to lead to the optimal solution, and hence reduce the search space. They generally find plans that perform worse than those of Exhaustive mode while spending less search time. Their search overhead, however, is still huge for large problems and increases polynomially with the problem size. Estimate mode reduces the search space further and uses the estimated total number of floating-point operations as the metric for the best plan. Clearly, such a strategy oversimplifies the machine model by assigning all operations the same delay and neglecting other factors such as ILP and memory access stalls. In each of the above cases, if a SIMD extension is enabled, the search space becomes larger because extra SIMD codelets are included, and therefore it takes longer to return a plan. We use the model-driven search engine on the search space of Exhaustive mode, and it greatly reduces the search time. Similarly, the search time would be reduced proportionally if we used our model in Patient or Measure mode.
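As a toy sketch of this replacement, the following code picks the least-cost decomposition by dynamic programming, using made-up codelet costs and a stand-in kernel cost model instead of measured runtimes:

```python
from functools import lru_cache

# Hypothetical model costs (e.g., in cycles) of solving a size
# directly with a codelet; absence means no codelet for that size.
CODELET_COST = {2: 10.0, 4: 24.0, 8: 60.0, 16: 150.0}

def kernel_cost(n):
    """Stand-in for the O(n) Cooley-Tukey kernel model (twiddle multiplies)."""
    return 2.0 * n

@lru_cache(maxsize=None)
def best_cost(n):
    """Least predicted cost over all decompositions of a size-n DFT."""
    best = CODELET_COST.get(n, float("inf"))
    for r in range(2, n):
        if n % r == 0:
            m = n // r
            # Splitting n = r * m yields r child DFTs of size m and
            # m child DFTs of size r, plus the kernel's own work.
            cost = r * best_cost(m) + m * best_cost(r) + kernel_cost(n)
            best = min(best, cost)
    return best
```

The memoization plays the role of FFTW's dynamic programming: each sub-size is evaluated once, and the returned minimum corresponds to the plan with the least predicted cost.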
5 Evaluation

In this section, we present the results of our experiments, comparing FFTW version 3.2 with a modified version that uses our model-driven search engine. We conduct the comparison on four platforms: AMD Athlon, Intel Xeon, IBM PowerPC and Sun SPARC. The configurations of the four architectures are shown in Table 1. The comparison is made in three aspects. First, we show the accuracy of the performance prediction of three FFT algorithm kernels: the Generic DFT algorithm, a scalar codelet and a SIMD codelet. Second, the performance of the best DFT plans found for different sizes is compared among the different search strategies. Finally, the search time spent by the different strategies for different DFT sizes is compared.

Table 1. Test Platforms

Configuration   Athlon64 X2             Xeon 5405   PowerPC 970   UltraSparc IIIi
Frequency       2 GHz                   2 GHz       2.3 GHz       1.06 GHz
L1 Data Cache   64 KB                   64 KB       32 KB         64 KB
L2 Cache        1 MB                    6 MB        512 KB        1 MB
OS              Linux                   Linux       Linux         SunOS 5.10
SIMD            3DNow! (not supported)  SSE2        Altivec       N.A.

5.1 Performance Prediction of Individual Algorithms

In this section, we show the accuracy of the fractional abstract machine model, the fractional quadratic model and the codelet performance model by comparing the actual runtime and the predicted runtime of the Bluestein, Rader, Cooley-Tukey and Generic DFT algorithms and of scalar/SIMD codelets.

Fig. 3. Performance prediction of individual algorithms: (a) Bluestein, (b) Rader, (c) Cooley-Tukey, (d) Generic DFT; each panel plots actual and predicted execution time (normalized) against DFT size.

Fig. 4. Performance prediction of scalar and SIMD codelets: (a) n1_25, (b) n1fv_25; each panel plots actual and predicted execution time (normalized) against stride.

Figure 3 shows the performance prediction for the four algorithms on Xeon. All predicted runtimes are normalized with respect to the corresponding actual runtimes. We compare the performance prediction for DFTs and codelets with sizes/strides up to 10^6. The plots show that these performance prediction models are very accurate. On Xeon, the average relative errors are less than 10%, and similar results are obtained on the other platforms. Figure 4 shows the performance prediction for scalar and SIMD codelets on Xeon with different strides. Generally, we see about 10% average prediction error, with a worst case around 25%. As we will show later, the accuracy of our prediction models is good enough to distinguish good plans from bad ones in most cases.

5.2 Overall DFT Plan Performance

The performance comparison is made among our model-driven optimization engine, FFTW Exhaustive mode and Estimate mode. We compare the runtime of the best DFT plans found by these optimization methods. Because we only care about relative performance, all runtimes are normalized with respect to the runtime of Exhaustive mode. Figure 5 shows the performance comparison among the different DFT plans on the four test architectures with SIMD disabled. All three search engines perform well (at most 10% to 20% slower than the best) for most small DFT sizes. For large problem sizes on Xeon, however, Estimate plans generally run 10% to 20% slower than Exhaustive plans, with occasional 50% slowdowns. Large-size DFT plans found by Estimate mode on AMD run 20% to 30% slower than the plans found by Exhaustive mode.
Our model-driven optimization engine achieves performance comparable to Exhaustive mode. On average, it achieves 94.4%, 94.8%, 93.6% and 94% of the performance of FFTW Exhaustive mode on the four platforms. For the cases where our search engine does not perform well, it is either because our models fail to give an accurate runtime prediction or because our actual search space is slightly smaller than that of Exhaustive mode. We have not extended our work to real (non-complex) DFT solvers, which are sometimes used inside complex DFTs.
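The model training and prediction accuracy discussed above can be illustrated with a small sketch: a least-squares fit of a quadratic runtime coefficient, followed by the average relative error of the fitted model's predictions. The pure-quadratic model form, the synthetic "measured" timings and all constants are assumptions for illustration only.

```python
# Least-squares extraction of the quadratic coefficient q from noisy runtime
# samples, then the average relative error of the resulting predictions.
# The synthetic data below stands in for real codelet timings.

def fit_quadratic_coeff(sizes, runtimes):
    """Fit t ~ q * n^2 by least squares: q = sum(t * n^2) / sum(n^4)."""
    num = sum(t * n * n for n, t in zip(sizes, runtimes))
    den = sum(n ** 4 for n in sizes)
    return num / den

def avg_relative_error(actual, predicted):
    """Mean of |predicted - actual| / actual over all samples."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)

sizes = [4, 8, 16, 32, 64]
true_q = 3.0e-3
# "Measured" runtimes with a small deterministic +/-2% perturbation.
measured = [true_q * n * n * (1.0 + 0.02 * (-1) ** i)
            for i, n in enumerate(sizes)]

q = fit_quadratic_coeff(sizes, measured)
predicted = [q * n * n for n in sizes]
err = avg_relative_error(measured, predicted)
# For mildly noisy data both the fitted q and the average relative error
# stay within a few percent of the ideal.
```

Taking the minimum of several repeated timings before fitting, as the paper does, would shrink `err` further by suppressing measurement variance.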
Fig. 5. Scalar performance comparison among FFTW Exhaustive mode, Estimate mode and model-driven prediction: (a) AMD Athlon, (b) Intel Xeon, (c) PowerPC 970MP, (d) Sun UltraSparc IIIi; each panel plots normalized execution time against DFT size.

Figure 6 shows the performance of the three search strategies with the SIMD extension enabled. Among the four platforms we tested, Intel Xeon supports SSE2 for double-precision DFTs and PowerPC 970 supports Altivec for single precision. 3DNow! on AMD is no longer supported in the current version of FFTW. Our performance model outperforms the Estimate mode over the whole test region. The model-driven strategy performs comparably with Exhaustive mode for most sizes, but the performance of Estimate mode is about 20% to 30% slower than the best. This is because the Estimate mode relies on instruction counts to optimize the performance of codelets, which is inapplicable in the case of SIMD.

Fig. 6. SIMD performance comparison among FFTW Exhaustive mode, Estimate mode and model-driven prediction: (a) Intel Xeon with SSE2, (b) PowerPC 970MP with Altivec; each panel plots normalized execution time against DFT size.

5.3 Optimization Time

In this section, we compare the search time that the three optimization engines use to find the best DFT plan. Again, the time is normalized with respect to the search time of Exhaustive mode. This search time excludes the initialization and training time of our models, which amounts to several seconds and is spent only once per platform rather than once per DFT problem.

Fig. 7. Comparison of optimization time using FFTW Exhaustive mode, Estimate mode and model-driven optimization: (a) AMD Athlon, (b) Intel Xeon; each panel plots normalized search time (log scale) against DFT size.

Figure 7 shows the search time spent by the model-driven, Exhaustive and Estimate optimization engines on two architectures. For lack of space, we do not show the similar results on the other two platforms. Our model-driven optimization engine spends only about 0.1% to 10% of the time spent by Exhaustive mode. On the other hand, compared with the Estimate mode, our model-driven engine spends about 100 times more on average. One of the main reasons is that the Estimate engine explores a much smaller search space than the Exhaustive search space, which our optimization engine needs to walk through. However, the absolute search time of our strategy is small: within several seconds even for extremely large DFT sizes. In summary, our model-driven search engine achieves about 94% of the performance of the Exhaustive search engine while using less than 5% of its search time.
Our model-driven optimization engine achieves the goal of model-based optimization: delivering performance comparable to that of exhaustive search while using a much smaller amount of search time.

6 Conclusion

In this paper, we propose a model-driven optimization engine for FFTW. This work has successfully reduced the empirical search time by developing performance prediction models for several DFT algorithms and codelets used in
FFTW and integrating them into a model-driven search engine. The most important conclusion we can draw from this work is that model-driven optimization can be effectively applied to a problem as complex as the generation of highly efficient DFT implementations. This work also provides insight into why the DFT solution found by FFTW is the fastest, by breaking down the overall performance into architecture-dependent components. Moreover, reducing search time means more for FFTW than for other libraries, because in FFTW search time is spent on every DFT size and at every run. Our model-driven optimization engine achieves more than 95% of the performance of the code found by exhaustive search while using less than 5% of the optimization time. This work would be more complete if it were extended to real (non-complex) DFT algorithms. It would also be interesting if the choice of the best solver at each search node could be given directly by rules learned from one-time performance training. Furthermore, it remains a challenge to generalize this work and apply model-driven optimization to other complicated scientific libraries.

References

1. Discussion with Franz Franchetti (2009)
2. Bluestein, L.: A linear filtering approach to the computation of the discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics 18(4) (1970)
3. Chen, C., Chame, J., et al.: Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In: Proceedings of CGO, Washington, DC, USA. IEEE Computer Society (2005)
4. Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19(90) (1965)
5. Duhamel, P., Vetterli, M.: Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing 19(4) (1990)
6. Fraguela, B.B., Voronenko, Y., et al.: Automatic tuning of discrete Fourier transforms driven by analytical modeling. To appear in PACT (2009)
7. Frigo, M.: A fast Fourier transform compiler. ACM SIGPLAN Notices 34(5) (1999)
8. Frigo, M., Johnson, S.G.: The Fastest Fourier Transform in the West (1997)
9. Frigo, M., Johnson, S.G.: FFTW manual version 3. The Fastest Fourier Transform in the West. Massachusetts Institute of Technology, Massachusetts (2004)
10. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proceedings of the IEEE 93(2) (2005)
11. Im, E.-J.: Optimizing the performance of sparse matrix-vector multiplication. PhD thesis (2000); Chair: Katherine A. Yelick
12. Kulkarni, P.A., Whalley, D.B., et al.: In search of near-optimal optimization phase orderings. SIGPLAN Notices 41(7) (2006)
13. Oppenheim, A.V., Schafer, R.W., et al.: Discrete-Time Signal Processing (1999)
14. Püschel, M., Moura, J.M.F., et al.: SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on Program Generation, Optimization, and Adaptation 93(2) (2005)
15. Rader, C.M.: Discrete Fourier transforms when the number of data samples is prime. Proceedings of the IEEE 56(6) (1968)
16. Saad, Y.: SPARSKIT: A basic tool kit for sparse matrix computations. Research Institute for Advanced Computer Science (1994)
17. Saavedra, R.H., Smith, A.J.: Analysis of benchmark characteristics and benchmark performance prediction. ACM Transactions on Computer Systems 14(4) (1996)
18. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27(1-2), 3-35 (2001)
19. Yotov, K., Li, X., et al.: Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE 93(2) (2005)
Quiz for Chapter 6 Storage and Other I/O Topics 3.10
Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [6 points] Give a concise answer to each
