Maximizing Hardware Prefetch Effectiveness with Machine Learning
Saami Rahman, Martin Burtscher, Ziliang Zong, and Apan Qasem
Department of Computer Science, Texas State University, San Marcos, TX

Abstract

Modern processors are equipped with multiple hardware prefetchers, each of which targets a distinct level in the memory hierarchy and employs a separate prefetching algorithm. However, different programs require different subsets of these prefetchers to maximize their performance. Turning on all available prefetchers rarely yields the best performance and, in some cases, prefetching even hurts performance. This paper studies the effect of hardware prefetching on multithreaded code and presents a machine-learning technique to predict the optimal combination of prefetchers for a given application. This technique is based on program characterization and utilizes hardware performance events in conjunction with a pruning algorithm to obtain a concise and expressive feature set. The resulting feature set is used in three different learning models. All necessary steps are implemented in a framework that reaches, on average, 96% of the best possible prefetcher speedup. The framework is built from open-source tools, making it easy to extend and port to other architectures.

I. INTRODUCTION

A prefetcher monitors and extrapolates streaming access patterns in applications and preemptively brings data into higher levels of cache to hide memory latency. Because of the disparity between CPU and memory speeds, prefetching has always been an important technique for code optimization. The significance of prefetching has increased on current multicore architectures. As an increasing number of cores are attached to the same memory system, the number of candidate streams for prefetching, often from different threads in the same program, has increased as well.
Effective coordination of multiple prefetch streams can not only hide memory latency but also directly affect parallel efficiency. Although prefetching is a critical transformation on current architectures, determining optimal prefetch parameters poses several challenges. First, a prefetcher must be accurate in predicting memory access patterns. If the prefetcher is incorrect, it can increase memory traffic and, more importantly, cause contention for space in the small and valuable cache. Second, a prefetch must be timely. If a prefetch causes data to be present in a higher level of memory earlier than required, the data may be evicted to accommodate more urgently needed data. These challenges are amplified further in multithreaded programs. Since lower levels of memory such as the L2 cache can be shared across multiple threads, each of which may potentially request data from different parts of memory, it can be difficult to correctly identify memory access patterns. (The authors acknowledge support from the National Science Foundation through awards CNS and CNS.)

In this paper, we address these challenges and devise a strategy for effective prefetching based on machine-learning techniques. We study the effect of prefetching on the PARSEC programs [1] as well as on two idealized programs that exhibit sequential and randomized access patterns. We examine the performance when enabling individual hardware prefetchers and combinations thereof on an Intel processor with four built-in prefetchers. We train several machine-learning algorithms to obtain recommendations on which prefetchers should be used for a given program. A key step in successfully using a machine-learning-based approach is to characterize programs in a quantitative manner that captures their essential differences. We employ hardware performance events for this purpose.
Since there are hundreds of such events available, we designed a simple yet effective procedure to prune the number of events needed to characterize a program. We demonstrate the efficacy of our pruning algorithm by testing different feature sets on a decision tree. Finally, we develop a framework that recommends hardware prefetching configurations for unseen programs on the basis of previously seen training programs. On average, this framework helps the user gain 96% of the achievable speedup provided by the hardware prefetchers. There are several advantages to using such a recommendation system to adjust the hardware configuration. First, it enables us to use existing hardware more effectively. Second, it does not require any source-code changes to the program being optimized; rather, it works directly on the executable. Lastly, the framework relies on open-source technologies, making it easy to extend and port to other architectures.

II. RELATED WORK

Prefetching is a widely explored area, and its benefits are broadly documented and studied. Lee et al. [2] were among the first to introduce the concept of data prefetching in hardware to hide memory-access latency. Chen et al. [3], in their early work, studied the effects of data prefetching in both hardware and software. A large body of work on data prefetching has appeared since then. With the introduction of multicore processors, the effects of prefetching have become more interesting. Prefetching can
hurt performance, and the ill effects of prefetching have been studied [4], [5]. Puzak et al. [6] demonstrate cases where prefetching hurts and propose a characterization of prefetching based on timeliness, coverage, and accuracy. Lee et al. [7] conducted an in-depth investigation of when and why prefetching works. They performed an extensive analysis of both software and hardware prefetching performance on the SPEC CPU2006 benchmark programs, which are serial workloads. Jayasena et al. [8] demonstrated the effectiveness of a decision tree for pruning the performance events used as features for detecting false sharing. Their approach motivated us to include decision trees in our work. Cavazos et al. [9] developed a machine-learning model to find the best optimization configuration for the SPEC CPU2006 benchmark suite. The goal of their work was to select a set of compiler optimizations that yields a good performance improvement. They report high classification accuracies when using performance events as features for a learning algorithm. Milepost GCC [10] is a large project to advance crowd-sourced optimization learning. It uses static features to characterize programs. Similar work by Demme et al. [11] uses graph clustering based on the data and control flow of programs to characterize program behavior. However, such characterizations are difficult to perform, as they involve modifying the compiler or working on an intermediate representation of a program, making them hard to port. Among the works we have studied, the following two are the most similar to ours. McCurdy et al. [12] characterized the impact of prefetching on scientific applications using performance events. They experimented with a combination of several benchmark programs on AMD processors. Their work primarily focuses on serial workloads, but they also studied the effects of running multiple serial programs simultaneously.
Their work hinges on successfully isolating performance events that are expressive enough to capture the effects of prefetching. They hand-picked the performance events for the AMD architectures they worked on and would have to repeat the process of studying all available performance events for any new architecture. Liao et al. [13] presented a machine-learning-based approach to selecting the optimal prefetch configuration in a way similar to ours. However, their work also hinges on identifying architecture-specific performance events, which they did by hand. In addition, their work focuses on serial workloads whereas we are interested in parallel programs.

III. MACHINE-LEARNING FRAMEWORK

Figure 1 shows the major tasks that our machine-learning framework performs. There are two key phases: training and testing. In the training phase, sample programs are run to generate features. Moreover, the programs are run with all possible prefetcher settings to determine the performance and identify the optimal hardware configuration. This information is used to train the learning models. The testing phase is simpler. First, an unseen program is run once to collect its features. Second, these features are fed into the trained model to obtain a recommended prefetch configuration.

Fig. 1. Operation of the machine-learning framework: (a) training phase; (b) testing phase.

A. Prefetcher Configurations

Many modern processors implement hardware prefetchers. For example, our Intel Core2 processor is equipped with the following four prefetchers:

1) Data cache unit (DCU) prefetcher: attempts to recognize a streaming access pattern and prefetches the next line into the L1 data cache.

2) Instruction pointer (IP) based stride prefetcher: tracks individual load instructions and attempts to detect strided accesses. It can detect strides of up to 2 KB and prefetches into the L1 cache.
3) Spatial prefetcher (CL): attempts to complete a cache line brought into the L2 by fetching the paired line to form a 128-byte aligned chunk.

4) Stream prefetcher (HW): attempts to detect streaming requests from the L1 cache and brings anticipated cache lines into the L2 and LLC.

It should be noted that more recent Intel architectures, such as Sandy Bridge and Haswell, include the same prefetchers. Interestingly, it is not defined which of these four prefetchers are enabled or disabled by default [13]. In this work, we define a configuration to be the enabled/disabled state of the prefetchers, expressed as a four-bit bitmask, where each bit represents a prefetcher. A value of 1 means the corresponding prefetcher is enabled and a 0 means it is disabled. From the most significant bit to the least, the mapping of bits to prefetchers is: HW, CL, DCU, and IP. We controlled the prefetcher configuration of the processor using the open-source tool Likwid [14].

It may seem that turning on all prefetchers, i.e., configuration 1111, should result in the best performance. However, other configurations can be superior. In fact, Figure 2 shows that, in some cases, configuration 1111 hurts performance. The reported speedup values are runtime improvements over our baseline configuration 0000, i.e., no prefetching. For each configuration, we carried out three runs and used the best runtime to shield the results somewhat from operating-system jitter. Since the run yielding the best performance is not only the least interrupted but also the run that finished the task in the shortest amount of time, we strive to improve upon this value. Several observations can be made from Figure 2. First, within the same program, changing the prefetcher configuration can have a significant impact on performance. For example, freqmine and streamcluster show large variances in performance when the configuration is altered.
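As a concrete illustration of this encoding, the following Python sketch (not part of the original framework; the helper names are our own) converts between the four-bit bitmask and the configuration names used in Figure 2, e.g. 1010 corresponds to "hw_dcu":

```python
# Bit order from most to least significant, as defined in the text.
PREFETCHERS = ("hw", "cl", "dcu", "ip")

def config_to_name(config: int) -> str:
    """Map a 4-bit configuration to a readable name, e.g. 0b1010 -> 'hw_dcu'."""
    enabled = [p for bit, p in enumerate(PREFETCHERS)
               if config & (1 << (3 - bit))]
    return "_".join(enabled) if enabled else "none"

def name_to_config(name: str) -> int:
    """Inverse mapping, e.g. 'hw_dcu' -> 0b1010."""
    parts = set(name.split("_")) if name != "none" else set()
    return sum(1 << (3 - bit) for bit, p in enumerate(PREFETCHERS) if p in parts)
```

The two helpers are mutual inverses over all 16 configurations, which makes it easy to move between the bitmask written to the hardware and the labels used in the figures.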
Second, configuration 1111 is not always the best option, as can be seen in the cases of random, stream, and streamcluster, among others.

Fig. 2. Performance of all prefetcher configurations relative to no prefetching.

Third, configuration 0000 can be the best configuration, that is, prefetching can hurt performance, as in the case of fluidanimate and several configurations of random. Finally, in several cases there are multiple good configurations, that is, several configurations perform almost as well as the best.

B. Feature Extraction

We measured all performance events supported by our processor and used them as the initial feature set to characterize programs. Counting events such as L2 cache misses and stalled CPU cycles provides quantified insight into a program's behavior and has previously been employed to characterize programs [8], [9], [12], [13]. However, using all available events is problematic, as it takes a long time to measure them and the supported events vary across processors. We use the pruning algorithm described below to address these problems. Once generated, the features are normalized to event counts per million executed instructions. This scaling is necessary to be able to compare different programs and different runs of the same program. For instance, one hundred page faults may not be significant for a program that executes one billion instructions, but they will be significant for a program that only runs ten thousand instructions. For the program with ten thousand instructions, the normalized value will be much higher than for the program with one billion instructions, thus correctly capturing the difference in significance. Figure 3 shows the counts for all events for our 14 programs. The x-axis represents the events.
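The normalization step described above can be sketched as follows; the function name and dictionary layout are our own illustrative choices:

```python
def normalize_events(raw_counts: dict, instructions: int) -> dict:
    """Convert raw event counts to events per million executed
    instructions, making runs of different lengths comparable."""
    return {event: count * 1e6 / instructions
            for event, count in raw_counts.items()}

# One hundred page faults are insignificant over a billion instructions
# (0.1 per million) but significant over ten thousand (10,000 per million).
```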
The axes are removed because this figure serves only as a visualization. Two key observations can be made from the figure. First, many events have similar values and therefore carry no discriminatory information about program behavior. Second, many events have very high counts and cause other events with smaller ranges to disappear. We address this issue with feature scaling.

C. Feature Selection

We used Algorithm 1 to prune the feature set. It consists of two subroutines: FindEvents and EventsUnion. FindEvents takes the event counts from two programs as input. For every distinct event, it checks whether the two counts differ by at least φ, where φ is the relative difference expressed in the interval (0.0, 1.0). If they do, the event is included in a set. At the end of the subroutine, the set contains all events that sufficiently capture the distinctiveness between the two programs. The Diff function returns the absolute difference of its arguments divided by their maximum. We experimentally found φ = 0.95 to work well and exclusively use this threshold. The EventsUnion subroutine takes a list of programs as input and, for all possible pairs of these programs, calls the FindEvents subroutine. It includes the returned sets in a multiset, which eventually contains the expressive events for all program pairs. Next, the EventsUnion subroutine employs a map to count the number of times each event occurs in the multiset. Finally, the map is sorted by the counts. This algorithm is driven by the idea that, if an event has highly distinct counts for two programs, it captures an important aspect in which the two programs differ, and we should consider that event. However, many events differ significantly for any two programs, so we would end up with a large number of events. This is why we sort the map: to prioritize the events by how often they are useful for distinguishing between programs.
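A minimal Python rendition of this pruning procedure might look as follows, assuming each program is represented by a dictionary mapping event names to normalized counts. The function names mirror Algorithm 1; treating two zero counts as identical in diff is our own assumption, made to avoid division by zero:

```python
from collections import Counter
from itertools import combinations

def diff(a: float, b: float) -> float:
    """Relative difference: |a - b| / max(a, b); defined as 0 when
    both counts are 0 (our assumption)."""
    m = max(a, b)
    return abs(a - b) / m if m else 0.0

def find_events(ea: dict, eb: dict, phi: float) -> set:
    """FindEvents: events whose counts differ by at least phi."""
    return {name for name in ea if diff(ea[name], eb[name]) >= phi}

def events_union(programs: list, phi: float = 0.95) -> list:
    """EventsUnion: rank events by how many program pairs they
    distinguish, most frequently useful first."""
    counts = Counter()
    for pa, pb in combinations(programs, 2):
        counts.update(find_events(pa, pb, phi))
    return [event for event, _ in counts.most_common()]
```

The Counter plays the role of the map in Algorithm 1, and most_common() performs the final sort by occurrence count.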
From the sorted map, we can easily select the top most-important events to build our ultimate feature set. We use a final preprocessing step, called feature scaling, before the learning stage. Learning models are known to struggle when there is a large variance in the numeric ranges of the features. For example, if one feature has a range of 1 to 10 and another feature in the same feature set has a range of -10,000 to +10,000, the learning model might struggle to draw accurate decision boundaries. As a remedy, we scale all features in feature vector x using the following formula:

    x'_i = (x_i − µ(x)) / (max(x) − min(x))    for i = 1 to len(x)

where µ(x) is the average of the values in feature vector x. After scaling, all features lie within the range -1 to 1. Figure 4 shows the effect of feature scaling. The plots are significantly different from those in Figure 3, and the differences in program behavior are now much more apparent. For example, without scaling, the plot for blackscholes is hard to distinguish from those of other programs, and many event counts that appear to be 0 in Figure 3 are in fact nonzero. Similarly, the plots for several programs look very similar without
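The scaling formula can be sketched as below; returning zeros for a constant feature vector is our own guard, since the formula would otherwise divide by zero:

```python
def scale_features(x: list) -> list:
    """Scale a feature vector: x'_i = (x_i - mean(x)) / (max(x) - min(x)).
    All results lie within [-1, 1]; a constant vector maps to zeros
    (our assumption, to avoid division by zero)."""
    mu = sum(x) / len(x)
    span = max(x) - min(x)
    if span == 0:
        return [0.0] * len(x)
    return [(xi - mu) / span for xi in x]
```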
scaling but are quite different once the features are scaled. The same is true for the other programs. The plots of the unscaled features are sparse, whereas the plots of the scaled features are denser and differ more substantially between programs.

Fig. 3. Event-count visualization before feature scaling.

D. Learning Models

The final component of our framework is a learning model. We evaluated two traditional learning models, logistic regression and decision trees, and designed a Euclidean-distance-based classifier that is tailored to our needs and captures the available information well.

1) Training Labels: The learning target is formulated in two ways for two different purposes. In the first formulation, we use a binary classifier that specifies whether a program benefits from prefetching or not. If the best prefetching configuration results in a speedup of at least 10% over configuration 0000, the program is said to benefit from prefetching. According to this metric, ferret, fluidanimate, and two other programs do not benefit. We use this formulation to decide how many of the top events found by Algorithm 1 to use (cf. Section IV.C). In the second formulation, we inspect the effect of each prefetcher in isolation. If enabling a specific prefetcher for a given program results in a speedup of at least 2% over configuration 0000, we classify the prefetcher as useful on that program. This yields four instances of learning models, one per prefetcher. Table I shows the resulting training labels. It also lists the best configuration for each program. Note that learning models should be trained on multiple examples for each class label. This is why we use four instances rather than a single instance that includes all combinations of prefetchers, i.e., 16 distinct class labels, which would require far more than our 14 programs to train.
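A sketch of how the two label formulations might be derived from measured speedups; the dictionary layout and helper names are our illustrative choices, with all speedups expressed relative to configuration 0000:

```python
# speedup[cfg] = runtime(configuration 0000) / runtime(cfg),
# so speedup[0b0000] == 1.0 by construction.

SOLO = {"hw": 0b1000, "cl": 0b0100, "dcu": 0b0010, "ip": 0b0001}

def benefits_from_prefetching(speedup: dict) -> bool:
    """First formulation: some configuration is at least 10% faster
    than configuration 0000."""
    return max(speedup.values()) >= 1.10

def useful_prefetchers(speedup: dict) -> dict:
    """Second formulation: a prefetcher is labeled useful if enabling
    it alone yields at least a 2% speedup over configuration 0000."""
    return {name: speedup[cfg] >= 1.02 for name, cfg in SOLO.items()}
```

The second function produces one boolean per prefetcher, matching the four independent classifier instances described above.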
2) Euclidean-Distance-Based Model: We use the Euclidean distance as a similarity metric for predicting good prefetcher configurations for previously unseen programs. The motivation behind this approach is that the four independent
classifiers do not capture the interaction between the prefetchers. Comparing the classifiers to the best configuration in Table I illustrates this problem: the combination of the four classifiers often is not the configuration that results in the maximum speedup.

Fig. 4. Event-count visualization after feature scaling.

TABLE I. Training labels for our four independent classifiers (hw, dcu, cl, and ip) and the actual best configuration for each program.

For example, consider raytrace, where the individually recommended prefetchers are hw and dcu, which corresponds to configuration 1010. However, the combination of these two prefetchers does not yield the best performance. To counteract this problem, our Euclidean-distance-based model calculates the distance between the unseen program and every known program in the training set. It then uses these distances as weights to compute a score for each of the 16 possible prefetcher configurations. Concretely, the model computes the following 16-element vector:

    Score = [ Σ_{i=1..m} P_i[1]/d_i²,  Σ_{i=1..m} P_i[2]/d_i²,  ...,  Σ_{i=1..m} P_i[16]/d_i² ]
Algorithm 1 Finding events that differ between two programs
Require: E_a and E_b contain measurements of the same events in the same order
 1: function FINDEVENTS(E_a, E_b, φ)
 2:   set ← {}
 3:   for i ← 1 to E_a.size do
 4:     if DIFF(E_a[i].value, E_b[i].value) ≥ φ then    ▷ relative difference
 5:       set.include(E_a[i].name)
 6:     end if
 7:   end for
 8:   return set
 9: end function
10: function EVENTSUNION(programs, φ)
11:   multiset ← {}
12:   map ← MAP()    ▷ key: event name, value: occurrence count
13:   numProgs ← programs.length
14:   for i ← 1 to numProgs − 1 do
15:     for j ← i + 1 to numProgs do
16:       set ← FINDEVENTS(programs[i], programs[j], φ)
17:       multiset.include(set)
18:     end for
19:   end for
20:   for set in multiset do
21:     for event in set do
22:       if map.contains(event) then
23:         map[event] += 1
24:       else
25:         map[event] = 1
26:       end if
27:     end for
28:   end for
29:   return SORT(map)    ▷ sorts by values
30: end function

In the Score formula, m is the number of programs in the training set and P_i is a 16-element vector associated with training program i. Each element of this vector corresponds to a configuration, namely the element's index in binary notation. The element's value is the fraction of the achievable speedup that was reached when using the corresponding configuration on program i. For example, P_2[3] is the fraction of the achievable speedup that was obtained on the 2nd training program using configuration 0011, since 3 in decimal is 0011 in binary. Finally, d_i denotes the Euclidean distance between the unseen program and the i-th training program. The resulting vector contains a weighted score for each prefetching configuration. The recommended configuration is the binary representation of the index of the maximum element. This approach combines the effect of every prefetching configuration on all training programs and weights the scores by the inverse of the squared distance. Thus, the prefetching performance of similar programs carries a greater weight than that of dissimilar programs.
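Putting the pieces together, a plausible Python sketch of the Euclidean-distance-based recommender follows. The epsilon guard for an exact feature match is our own addition; everything else mirrors the Score formula, with configurations indexed 0..15 rather than 1..16:

```python
def recommend(test_feats, train_feats, train_p):
    """Distance-weighted recommendation over the 16 configurations.

    train_p[i][c] is the fraction of the achievable speedup that
    configuration c (indexed 0..15 here; the text uses 1..16) reached
    on training program i; test_feats and train_feats[i] are scaled
    feature vectors of equal length.
    """
    score = [0.0] * 16
    for feats, p in zip(train_feats, train_p):
        # squared Euclidean distance to this training program
        d2 = sum((a - b) ** 2 for a, b in zip(test_feats, feats))
        d2 = max(d2, 1e-12)  # guard: an identical program would give d = 0
        for c in range(16):
            score[c] += p[c] / d2
    # the recommended configuration is the index of the maximum score
    return max(range(16), key=lambda c: score[c])
```

Because each contribution is divided by the squared distance, a training program close to the test program dominates the score, while distant programs contribute only marginally.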
Moreover, if an unseen program is close to multiple training programs, our approach ensures that the most similar program is not the sole basis for the recommendation.

IV. RESULTS AND ANALYSIS

A. Experimental Environment

We performed our measurements on a 2.4 GHz Intel Core2 Quad Q6600 processor with eight 32 KB L1 caches and two 4 MB L2 caches. All programs were compiled using GCC with optimization level -O2 on an Ubuntu operating system. The PARSEC programs were invoked using the parsecmgmt script that ships with the suite and were run with the native input on eight threads. We repeated the experiments with different numbers of threads, but observed no significant change in prefetching effectiveness.

TABLE II. Effect of varying the number of features on precision, recall, and accuracy.

B. Evaluation Method

We used (k − 1) cross-validation, i.e., training on all but one program and testing on the held-out program, to assess our learning models. Initially, we focused on the prediction accuracy. However, accuracy does not capture the quality of a recommendation because multiple prefetching configurations may yield near-optimal performance. For instance, the best configuration for streamcluster is 0110, giving a speedup of 1.37, but 0111 results in nearly the same speedup. Therefore, we found it more useful to look at what fraction of the achievable speedup can be obtained using the recommendations from the learning model.

C. Choosing the Optimal Number of Features

To test the efficacy of the feature-selection procedure in Algorithm 1, we trained and tested a decision tree, starting with the top two events and feeding the model an increasing number of events to identify the saturation point. This point is reached when we use the top eight events, as shown in Table II. Using more than eight events degrades the learning performance. This demonstrates that our pruning algorithm is able to identify events that truly capture the differences between programs.
Henceforth, we use the top eight events as our feature set, except in the case of the Euclidean-distance-based model, which we found to work best with just the top six events.

D. Recommending a Good Prefetcher Configuration

Figure 5 shows the performance of the three learning models in terms of how close the recommended configuration's speedup is to that of the best possible configuration. The logistic regression model reaches, on average, 92.4% of the achievable speedup. It performs poorly on freqmine because of how the training labels are derived. Applying the cl and dcu prefetchers separately to this program does not yield a speedup, which is why the models are trained to turn them off. However, good configurations for freqmine have both prefetchers turned on. Hence, this is a case where two prefetchers that do not produce significant speedups individually deliver a combined speedup that is greater than the sum of their individual effects. The decision-tree-based model performs better than the logistic regression model on average, reaching 95.3% of the
achievable speedup.

Fig. 5. Fraction of achievable speedup reached by the three learning models.

For freqmine, the decision tree is able to suggest the best configuration. However, this has likely occurred by chance, as information about the interaction between prefetchers is not carried into the independent models. The Euclidean distance model reaches 96.1% of the achievable speedup on average. It is able to suggest a good configuration for freqmine because it takes into account the similarity between the test program and multiple training programs. Moreover, it utilizes information about the performance of every possible prefetcher configuration on all training programs. In contrast, the other classifiers attempt to predict the best configuration based on the impact of the individual prefetchers. The Euclidean distance model performs substantially worse than the other models on facesim. This is because it determines random to be the most similar program, i.e., well-performing configurations for random greatly influence its decision. When we inspected the model internally, we found the second-highest recommendation to be 0110, which results in 98% of the achievable speedup for facesim. Clearly, the similarity metric turned out to be imperfect in this case. All models perform poorly on streamcluster. The logistic regression and decision tree classifiers suffer from the aforementioned problem because the training label is 1111 whereas the actual best configuration is 0110. The Euclidean distance model suffers because streamcluster is significantly different from every other program, i.e., there is no similar program in the training set. It is closest to stream, but applying the best configuration from stream to streamcluster results in a speedup of 1.07, which is only 78% of the best possible speedup of 1.37 using 0110.
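As a closing note on methodology, the fraction-of-achievable-speedup metric used throughout this section can be written as below; reading it as the ratio of the recommended configuration's speedup to the best configuration's speedup is our interpretation, consistent with the streamcluster example of 1.07 / 1.37 ≈ 78%:

```python
def achievable_fraction(recommended_speedup: float, best_speedup: float) -> float:
    """Fraction of the achievable speedup reached by a recommendation,
    computed as the ratio of the two speedups over configuration 0000."""
    return recommended_speedup / best_speedup
```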
V. SUMMARY

This paper presents a framework to help users maximize the effectiveness of the hardware prefetchers in their systems. The processor on which we conducted our experiments contains four such prefetchers, resulting in 16 possible prefetching configurations, as each prefetcher can be enabled or disabled individually. Our framework performs an exhaustive search of these 16 combinations and records the performance of each configuration on a set of training programs. It uses hardware performance events to characterize codes for the purpose of determining similarities between a previously unseen application and the training programs. Since modern processors support hundreds of performance events, our framework employs a pruning algorithm to construct a concise yet expressive feature set. This feature set, combined with the performance of the various prefetcher configurations, is used to train three machine-learning models. One of them is a Euclidean-distance-based model that we designed specifically for recommending prefetcher configurations. On average, its recommendations deliver 96% of the possible prefetching speedup on the PARSEC benchmark suite and two additional programs.

REFERENCES

[1] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008.
[2] R. L. Lee, P.-C. Yew, and D. H. Lawrie, "Multiprocessor cache design considerations," in Proceedings of the 14th Annual International Symposium on Computer Architecture. ACM, 1987.
[3] T. Chen and J. Baer, "A performance study of software and hardware data prefetching schemes," in Proceedings of the 21st Annual International Symposium on Computer Architecture. IEEE, 1994.
[4] H. Kang and J. L. Wong, "To hardware prefetch or not to prefetch?: A virtualized environment study and core binding approach," ACM SIGPLAN Notices, vol. 48, no. 4. ACM, 2013.
[5] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, "Prefetch-aware shared resource management for multi-core systems," ACM SIGARCH Computer Architecture News, vol. 39, no. 3.
[6] T. R. Puzak, A. Hartstein, P. G. Emma, and V. Srinivasan, "When prefetching improves/degrades performance," in Proceedings of the 2nd Conference on Computing Frontiers. ACM, 2005.
[7] J. Lee, H. Kim, and R. Vuduc, "When prefetching works, when it doesn't, and why," ACM Transactions on Architecture and Code Optimization (TACO), vol. 9, no. 1.
[8] S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe, H. De Silva, S. Rathnayake, X. Meng, and Y. Liu, "Detection of false sharing using machine learning," in Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2013.
[9] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. F. O'Boyle, and O. Temam, "Rapidly selecting good compiler optimizations using performance counters," in International Symposium on Code Generation and Optimization (CGO '07). IEEE, 2007.
[10] G. Fursin, C. Miranda, O. Temam, M. Namolaru, E. Yom-Tov, A. Zaks, B. Mendelson, E. Bonilla, J. Thomson, H. Leather et al., "Milepost GCC: Machine learning based research compiler," in GCC Summit.
[11] J. Demme and S. Sethumadhavan, "Approximate graph clustering for program characterization," ACM Transactions on Architecture and Code Optimization (TACO), vol. 8, no. 4.
[12] C. McCurdy, G. Marin, and J. Vetter, "Characterizing the impact of prefetching on scientific application performance," in International Workshop on Performance Modeling, Benchmarking and Simulation of HPC Systems (PMBS13).
[13] S. Liao, T.-H. Hung, D. Nguyen, C. Chou, C. Tu, and H. Zhou, "Machine learning-based prefetch optimization for data center applications," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 2009.
[14] J. Treibig, G. Hager, and G. Wellein, "Likwid: A lightweight performance-oriented tool suite for x86 multicore environments," in International Conference on Parallel Processing Workshops (ICPPW). IEEE, 2010.
