Cache Replacement Policies for Embedded Mobile Systems: Performance and Power Consumption
First Author, Henan University of Technology, China, [email protected]

Abstract

With fast advances in processor technology, the imbalance between the relative speeds of processors and main memory has rapidly grown. To mitigate it, an increasingly large portion of the chip transistor budget is dedicated to on-chip caches, leading to an ever-deepening memory hierarchy. As a key component of the cache system, the replacement policy has a great impact on both performance and overall energy consumption. Given the spread of embedded mobile systems, in this paper we examine how nine selected last-level cache replacement policies behave in terms of performance, energy consumption, and hardware cost. We find that for popular mobile applications, most of the policies behave better than the traditional LRU algorithm in both performance and energy cost: the performance speedup reaches up to 8%, and LLC misses can be reduced by up to 35%. This suggests it is promising to provide more replacement policy diversity for embedded mobile systems.

Keywords: Embedded System, Cache Replacement, Power Consumption

1. Introduction

With each access to main memory taking hundreds of processor cycles, more and more on-chip resources are devoted to the cache system, and its hierarchy has grown deeper to tolerate long-latency memory accesses. Many speculative approaches, such as prefetching and value prediction [1][8][10], have also been proposed to bridge this performance gap. However, prior studies have shown that only a small fraction of cache blocks actually hold live data that will be referenced again before being evicted from the cache (called live blocks [4]). Most cache blocks are wasted on dead data that is used only once. This inadequate use of the cache hierarchy leads to application performance loss and extra energy consumption.
In the extreme case, a streaming application takes over all the cache space, leaving little room for locality-friendly data and resulting in a huge number of memory accesses. The traditional LRU policy inserts new data at the MRU position and evicts the block in the LRU position; it therefore only suits applications that fit in the cache or have good locality. With embedded mobile systems becoming popular, we should study how replacement policies work for them and whether a best candidate policy exists. To this end, we select nine replacement policies that improve on the traditional LRU algorithm, apply them to the last-level cache of a three-layer cache hierarchy, and study their performance and energy consumption behavior. The rest of the paper is organized as follows: Section 2 reviews the background of our study; Section 3 introduces our methodology; Sections 4 and 5 give the performance and energy consumption analyses, respectively; finally, we conclude.

2. Background

In this section, we first describe existing cache replacement algorithms and then present the nine selected improved replacement algorithms. With them, we can study how different cache replacement policies impact the performance and energy consumption of the system. Belady introduced an optimal offline replacement policy in the 1960s [2]. However, it is not practical, since it relies on knowledge of future address references. Therefore, practical replacement policies such as LRU are used instead. LRU can be as good as the optimal policy, yet its success depends heavily on application characteristics such as locality and working set size. In some cases its performance is far from what the optimal policy can provide, because LRU only inserts the incoming line at the MRU position of the stack and removes the cache line in the LRU position.
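As a concrete illustration of the LRU behavior just described, the following sketch models a single set-associative cache set with MRU insertion and LRU eviction. It is a toy model for illustration only, not the simulator used in this paper:

```python
from collections import OrderedDict

class LRUSet:
    """Model of one cache set: new lines enter at MRU; evictions take the LRU line."""
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # keys ordered from LRU (front) to MRU (back)

    def access(self, tag):
        """Return True on a hit. On a miss, insert at MRU, evicting the LRU line if full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)       # promote to MRU position
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)    # evict the line in the LRU position
        self.lines[tag] = True                # insert at MRU position
        return False

# A streaming pattern thrashes the set: 5 distinct tags cycling through a 4-way set.
s = LRUSet(4)
hits = [s.access(t) for t in [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]]
# Every access misses because the working set exceeds the associativity.
```

Under a streaming access pattern whose footprint exceeds the associativity, every access misses; this is exactly the thrashing behavior that motivates the LRU extensions below.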
To overcome this drawback, researchers introduced three adaptive extensions of LRU to prevent cache thrashing for workloads with a large memory footprint. LIP (LRU Insertion Policy) [5] places all incoming lines in the LRU position and moves them to the MRU position only if they are referenced while still in the LRU position. BIP (Bimodal Insertion Policy) [3] improves on LIP by placing only some of the incoming lines directly in the MRU position; it can therefore adapt to changes in the working set during execution. However, since neither LIP nor BIP outperforms LRU on all benchmarks, a dynamic policy called DIP (Dynamic Insertion Policy) [7] was proposed to choose dynamically between the traditional LRU policy and BIP, depending on the performance improvement each can provide.

(Research Notes in Information Science (RNIS), Volume 11, January 2013)

In our study, we select nine replacement policies from [11]:

N1 is a novel adaptive cache replacement policy named SCORE, which uses a score system to select a cache line to replace.
N2 is a high-performance cache replacement algorithm called Dueling Segmented LRU with adaptive Bypassing (DSB). It randomly provides better protection for a newly allocated line and uses an aging algorithm to remove stale cache lines.
N3 is named MadCache. It is a cache insertion policy that uses memory access history based on the Program Counter (PC) to determine the appropriate policy for the L3 cache: either LRU for locality-friendly applications or bypass for streaming memory accesses.
N4 is the ASRP policy: each set in the Last-Level Cache (LLC) is divided into multiple subsets, with one subset active and the others inactive at any given time. The victim block for a miss is chosen only from the active subset using the LRU policy.
N5 estimates the reuse probability of data on the basis of its reuse history and removes the line with the least reuse probability.
N6 is a combination of a cache replacement and bypass policy driven by dead-block prediction.
N7 uses Decision Tree Analysis (DTA)-based insertion policy selection to find the best insertion position for a newly loaded data block.
N8 is the 3P policy, which uses bimodal insertion to improve memory-level parallelism and reduce the impact of long-latency memory accesses.
N9 is a protected LRU algorithm; its Competitor component provides access to the class structure holding the metadata and the functions responsible for selecting the evicted line.

3. Methodology

In this section, we describe our experimental methodology, including the simulation platform and benchmarks. We used a simulation framework based on CMP$im [6], a binary-instrumentation-based cache simulator from Intel. It can be used to study the cache performance of various kinds of workloads, including single-threaded, multi-threaded, and multi-programmed applications. To model modern computing systems, we use the parameters listed in Table 1 in our simulation: an 8-stage, 4-wide pipelined processor coupled with a four-level memory subsystem and a perfect branch predictor.

Table 1. Simulation Parameters
Processor: 8-stage, 4-wide pipeline
Instruction window size: 128 entries
Branch predictor: perfect
L1 inst cache: 32KB, 64B line size, 4-way, LRU, 1-cycle hit
L1 data cache: 32KB, 64B line size, 8-way, LRU, 1-cycle hit
L2 cache: 256KB, 64B line size, 8-way, LRU, 10-cycle hit
L3 cache: 4MB, 64B line size, 16-way, 30-cycle hit
Main memory: 200-cycle latency

A wide variety of workloads, including entertainment, image processing, and data processing, run on modern embedded mobile systems. To represent these workloads, we select a set of benchmark programs as follows. SAX and DOM are XML data processing benchmarks taken from Xerces-C++ [16]. SAX implements the push XML parsing model, in which the parser sends (pushes) XML data to the client as it encounters elements in an XML infoset; DOM implements the tree-based XML parsing model, in which the parser reads the entire content of an XML document into memory and creates in-memory objects to represent it.
In the experiment, we use both the SAX and DOM models to parse four different XML files of varying sizes and complexities, and then take their average performance results for measurement.
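The two parsing models can be contrasted with a minimal sketch built on Python's standard library; the two-element document here is an illustrative stand-in for the benchmark inputs, not part of the actual workload:

```python
import xml.sax
import xml.dom.minidom
from io import StringIO

DOC = "<root><item>a</item><item>b</item></root>"

# Push (SAX) model: the parser pushes events to a handler as it scans the input,
# so the whole document never has to be materialized in memory.
class CountHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.items = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.items += 1

handler = CountHandler()
xml.sax.parse(StringIO(DOC), handler)

# Tree (DOM) model: the whole document is first built as in-memory objects,
# which can then be traversed freely.
dom = xml.dom.minidom.parseString(DOC)
dom_items = len(dom.getElementsByTagName("item"))
```

Both approaches count the same elements, but the DOM model's in-memory object tree gives it a noticeably larger memory footprint, which is why the two benchmarks stress the cache differently.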
JPEG2000 Encode and JPEG2000 Decode are taken from MediaBench II, which represents multimedia and entertainment workloads [17]. They implement, respectively, an encoder and a decoder based on the ISO JPEG-2000 standard for wavelet-based image compression. Fluidanimate and Freqmine are taken from the PARSEC benchmark suite [18]. Fluidanimate simulates an incompressible fluid for interactive animation purposes and is commonly used in gaming applications, whereas Freqmine employs an array-based version of the FP-growth method for frequent item set mining, which is often used to mine multimedia content. In this paper, we have also used analytical modeling techniques [23-26] to model the cache energy consumption behavior; the details appear in Section 5.

4. Performance analysis

With CMP$im, we evaluate how different cache replacement algorithms impact system performance and list the results in Figure 1 and Figure 2. Figure 1 compares Last-Level Cache (LLC) misses, and Figure 2 shows how the performance speedup varies. In both figures, we use performance under LRU replacement as our baseline, so that in Figure 1 a positive number means an LLC miss reduction and in Figure 2 a positive number means a performance improvement, and vice versa. As can be seen in Figure 1, for benchmarks with streaming characteristics, such as JPEG2000 Encode, JPEG2000 Decode, and SAX, almost all the selected replacement algorithms can greatly alleviate the performance loss from long-latency memory accesses. For JPEG2000 Encode, there is an average 25% LLC miss reduction, and in the best case up to 35% of memory accesses can be removed. That is because a workload with a streaming nature usually has a large working set and quite low locality. Under the LRU algorithm, most cache blocks are stuffed with one-time-used data, leaving little room for locality-friendly data and producing more long-latency memory accesses.
However, all the selected algorithms can avoid this situation by limiting the space streaming data occupies and leaving more room for locality-friendly data. As a result, the number of unnecessary long-latency accesses is greatly reduced. Among the selected algorithms, N3, N6, and N1 provide the most LLC miss reduction. In particular, N3 gives the top result on five of the six benchmarks; on JPEG2000 Encode, it achieves a 35% LLC miss reduction. That is because N3 is a PC-aware cache insertion policy: it uses a PC-based history table to store information on cache accesses and determines whether to apply LRU for workloads that exhibit good locality or to bypass streaming memory accesses. We can also observe a miss increase when executing the benchmark freqmine, which benefits least from all the non-LRU algorithms. As can be seen in Figure 2, most variations follow a trend similar to those in Figure 1. For benchmarks JPEG2000 Encode and JPEG2000 Decode, almost all the selected replacement algorithms provide performance gains in terms of CPI reduction. For JPEG2000 Encode, there is an average 6% performance improvement and, in the best case, up to 8%. Recalling Figure 1, we can deduce that this is the result of the up-to-35% LLC miss reductions: the selected replacement algorithms reclaim the space streaming data occupied and leave more room for locality-friendly data, so the considerable number of eliminated unnecessary long-latency accesses leads to up to 8% CPI reduction. We can also conclude that benchmarks like JPEG2000 Encode are cache-replacement-sensitive, so careful selection of the replacement policy determines how they perform.
On the other hand, recalling Figure 1, the LLC miss increase on benchmark freqmine produces some performance degradation; for algorithm N1 there is nearly 1% performance loss. On average, however, its performance variation is not obvious. We can say that freqmine is a cache-replacement-insensitive benchmark, whose performance remains steady no matter what replacement policy is applied.
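The link between LLC miss reduction and speedup discussed above can be made concrete with a simple CPI decomposition. The only number below taken from the paper is the 200-cycle memory latency of Table 1; the core CPI and miss rate are assumed placeholders chosen purely to illustrate the arithmetic:

```python
# Simple CPI model: CPI = CPI_core + (misses per instruction) * miss penalty.
# MISS_PENALTY comes from Table 1; all other inputs are illustrative assumptions.
MISS_PENALTY = 200  # cycles of main-memory latency

def speedup(cpi_core, mpki, miss_reduction):
    """Speedup of a policy that removes a fraction `miss_reduction` of LLC misses
    relative to the LRU baseline (mpki = LLC misses per kilo-instruction)."""
    base_cpi = cpi_core + (mpki / 1000) * MISS_PENALTY
    new_cpi = cpi_core + (mpki / 1000) * (1 - miss_reduction) * MISS_PENALTY
    return base_cpi / new_cpi

# e.g. an assumed core CPI of 1.0 and 2 LLC misses per kilo-instruction,
# with the paper's best-case 35% miss reduction:
s = speedup(1.0, 2.0, 0.35)
```

The model shows why the speedup is smaller than the miss reduction: only the memory-stall component of CPI shrinks, so a 35% miss cut translates into a single-digit-percent speedup when misses are a modest share of total cycles.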
Figure 1. Miss Reduction of LLC

Figure 2. Performance Speedup

5. Analysis of energy consumption

In this study, we aim to find out how different replacement policies applied in the LLC impact system performance and energy consumption. In this section, we turn to energy consumption; specifically, the energy cost of the memory subsystem, since more and more on-chip transistors belong to the memory subsystem and its energy dissipation has become the major part of the whole system's. Since only the last-level policy changes, we also make some assumptions to simplify the discussion. As we studied in previous work [9], the overall energy consumption consists of two parts: the dynamic energy (E_dynamic) and the static energy (E_static). As Equation 1 shows, the dynamic part is the number of times the memory subsystem is accessed (n_m) multiplied by the dynamic energy consumed per access (E'_m); the static part is the product of the static power (P_static) and the overall execution time (t):

E = E_static + E_dynamic = P_static × t + n_m × E'_m    (1)

To discuss the energy consumption under different policies in detail, note that across policies the invariant parts are the numbers of accesses to the L1, L2, and L3 caches and their static power consumption, while the variable parts are the number of memory accesses and the total execution time. Therefore, when discussing energy consumption variation across policies, we only need to focus on the variable components. The memory accesses correspond to the last-level cache misses, i.e., the L3 cache misses. Recalling Figure 1, for most benchmarks the nine selected replacement policies are able to reduce the number of memory accesses, leading to a reduction in dynamic energy consumption. Recalling Figure 2, most benchmarks are sped up by the selected policies, and the reduced execution time results in a reduction in static energy. We can conclude that applying a different replacement policy brings a two-fold benefit to the system. On the one hand, a proper policy keeps useful data resident longer in the cache, reducing the number of memory accesses and the dynamic energy consumption. On the other hand, the overall execution time is shortened by eliminating long-latency memory accesses, reducing the static energy consumption.

Table 2. Hardware cost of selected policies
Cache replacement policy    Hardware cost (K bits)
N1                          69
N2
N3                          159
N4                          45
N5
N6                          129
N7
N8
N9                          37
LRU                         65

The only overhead we have to consider is that introduced by the implementation of the replacement policy itself, including its hardware cost and the energy its hardware implementation consumes. In Table 2, we list the hardware cost needed by each policy under our experimental configuration. All policies require fewer than 160K bits, which is fairly reasonable considering that the last-level cache is 4MB with 16-way associativity and a 64B line size. Compared to the LRU policy with its 65K-bit overhead, half of the policies require less hardware, so their energy consumption is below what the LRU implementation needs. For the remaining policies that need more hardware, the hardware and energy costs are still tolerable.

6. Conclusion

In this paper, we study nine selected cache replacement policies, including their performance, energy consumption, and hardware cost. We find that in embedded mobile systems, the traditional LRU algorithm is not the best candidate for last-level cache replacement. For most mobile applications, the nine selected policies provide speedups as well as reductions in overall energy consumption, with tolerable hardware cost.
This suggests that more replacement policies can be made available for embedded mobile systems, dynamically serving different applications. In future work, since we have already done research on the energy efficiency of mobile computing [9][22], garbage collection in the Java virtual machine [12][13][15][21], and XML parsing acceleration [14][19], we hope to provide a complete, powerful, and energy-efficient mechanism for mobile systems, equipped with well-built cache replacement, speculative memory assistance, and hardware-assisted middleware including garbage collection and XML parsing.

7. References

[1] D.G. Perez, G. Mouchard, and O. Temam, "Microlib: A case for the quantitative comparison of micro-architecture mechanisms," in Proceedings of the International Symposium on Microarchitecture (MICRO), 2007.
[2] L. A. Belady, "A study of replacement algorithms for a virtual-storage computer," IBM Syst. J., 5(2):78-101, 1966.
[3] M. Qureshi, A. Jaleel, Y.N. Patt, S.C. Steely Jr., and J. Emer, "Adaptive insertion policies for high performance caching," in Proc. of the 34th Int. Symp. on Computer Architecture.
[4] S. Kaxiras, Z. Hu, and M. Martonosi, "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power," in Proc. of the International Symposium on Computer Architecture.
[5] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, "Adaptive insertion policies for high performance caching," in ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, New York, NY, USA, ACM.
[6] A. Jaleel, R. S. Cohn, C. K. Luk, and B. Jacob, "CMP$im: A Pin-Based On-The-Fly Multi-Core Cache Simulator," in MoBS.
[7] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer, "Adaptive insertion policies for high-performance caching," in Proc. of the International Symposium on Computer Architecture.
[8] Shaoshan Liu and Jean-Luc Gaudiot, "Value Prediction in Modern Many-Core Systems," in Proceedings of the 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS 2009), TCPP Ph.D. Forum, Rome, Italy, May 25-29, 2009.
[9] Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot, "Prefetching in mobile embedded system can be energy efficient," IEEE Computer Architecture Letters, Volume 10, Issue 1 (2011), pp. 8-11.
[10] Shaoshan Liu, Christine Eisenbeis, and Jean-Luc Gaudiot, "A Theoretical Framework for Value Prediction in Parallel Systems," in Proceedings of the 39th International Conference on Parallel Processing (ICPP 2010), San Diego, California, September 13-16.
[11] JWAC-1: Cache Replacement Championship Program, June 20, 2010.
[12] Jie Tang, Shaoshan Liu, Zhimin Gu, Xiao-Feng Li, and Jean-Luc Gaudiot, "Achieving middleware execution efficiency: Hardware-assisted Garbage Collection Operations," Journal of Supercomputing, Volume 59, Issue 3 (2012).
[13] Jie Tang, Shaoshan Liu, Zhimin Gu, Xiao-Feng Li, and Jean-Luc Gaudiot, "Hardware-Assisted Middleware: Acceleration of Garbage Collection Operations," in Proceedings of the 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP 2010), Rennes, France, 2010.
[14] Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot, "Memory-Side Acceleration for XML Parsing," in Proceedings of the 8th IFIP International Conference on Network and Parallel Computing (NPC 2011), Changsha, China.
[15] Shaoshan Liu, Jie Tang, Ligang Wang, Xiao-Feng Li, and Jean-Luc Gaudiot, "Packer: Parallel Garbage Collection Based on Virtual Spaces," IEEE Transactions on Computers.
[16] Xerces-C++ XML Parser.
[17] C. Lee, M. Potkonjak, and W.H. Mangione-Smith, "MediaBench: a tool for evaluating and synthesizing multimedia and communications systems," in Proceedings of the International Symposium on Microarchitecture (MICRO), 1997.
[18] C. Bienia, S. Kumar, J.P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," Princeton University Technical Report.
[19] Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot, "Acceleration of XML Parsing Through Prefetching," IEEE Transactions on Computers.
[20] Jie Tang, Pollawat Thanarungroj, Chen Liu, Shaoshan Liu, Zhimin Gu, and Jean-Luc Gaudiot, "Pinned OS/Services: A Case Study of XML Parsing on Intel SCC," Journal of Computer Science and Technology.
[21] Shaoshan Liu, Jie Tang, Chengrui Deng, Xiao-Feng Li, and Jean-Luc Gaudiot, "RHE: A JVM Courseware," IEEE Transactions on Education, Volume 54, Issue 1 (2011).
[22] Huang, Jie Tang, Zhimin Gu, Min Cai, Jianxun Zhang, and Ninghan Zheng, "The Performance Optimization of Threaded Prefetching for Linked Data Structures," International Journal of Parallel Programming, Vol. 40, No. 2.
[23] Zhefu Shi, Cory Beard, and Ken Mitchell, "Analytical Models for Understanding Misbehavior and MAC Friendliness in CSMA Networks," Performance Evaluation, Volume 66, Issue 9-10 (2009).
[24] Zhefu Shi, Cory Beard, and Ken Mitchell, "Misbehavior and MAC Friendliness in CSMA Networks," IEEE Wireless Communications and Networking Conference (WCNC 2007), March 2007.
[25] Zhefu Shi, Cory Beard, and Ken Mitchell, "Tunable Traffic Control for Multihop CSMA Networks," IEEE Military Communications Conference (MILCOM), pp. 1-7.
[26] Zhefu Shi, Cory Beard, and Ken Mitchell, "Competition, Cooperation, and Optimization in Multi-Hop CSMA Networks," in PE-WASUN '11: Proceedings of the 8th ACM Symposium on Performance Evaluation of Wireless Ad Hoc, Sensor, and Ubiquitous Networks.
How To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
Operating Systems. Virtual Memory
Operating Systems Virtual Memory Virtual Memory Topics. Memory Hierarchy. Why Virtual Memory. Virtual Memory Issues. Virtual Memory Solutions. Locality of Reference. Virtual Memory with Segmentation. Page
Accelerating Business Intelligence with Large-Scale System Memory
Accelerating Business Intelligence with Large-Scale System Memory A Proof of Concept by Intel, Samsung, and SAP Executive Summary Real-time business intelligence (BI) plays a vital role in driving competitiveness
IMPROVING QUALITY OF VIDEOS IN VIDEO STREAMING USING FRAMEWORK IN THE CLOUD
IMPROVING QUALITY OF VIDEOS IN VIDEO STREAMING USING FRAMEWORK IN THE CLOUD R.Dhanya 1, Mr. G.R.Anantha Raman 2 1. Department of Computer Science and Engineering, Adhiyamaan college of Engineering(Hosur).
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
Intel Pentium 4 Processor on 90nm Technology
Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended
In-network Monitoring and Control Policy for DVFS of CMP Networkson-Chip and Last Level Caches
In-network Monitoring and Control Policy for DVFS of CMP Networkson-Chip and Last Level Caches Xi Chen 1, Zheng Xu 1, Hyungjun Kim 1, Paul V. Gratz 1, Jiang Hu 1, Michael Kishinevsky 2 and Umit Ogras 2
Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters
Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.
The assignment of chunk size according to the target data characteristics in deduplication backup system
The assignment of chunk size according to the target data characteristics in deduplication backup system Mikito Ogata Norihisa Komoda Hitachi Information and Telecommunication Engineering, Ltd. 781 Sakai,
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27
Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.
MAGENTO HOSTING Progressive Server Performance Improvements
MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 [email protected] 1.866.963.0424 www.simplehelix.com 2 Table of Contents
Integrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
Multithreading Lin Gao cs9244 report, 2006
Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............
AN EFFICIENT STRATEGY OF AGGREGATE SECURE DATA TRANSMISSION
INTERNATIONAL JOURNAL OF REVIEWS ON RECENT ELECTRONICS AND COMPUTER SCIENCE AN EFFICIENT STRATEGY OF AGGREGATE SECURE DATA TRANSMISSION K.Anusha 1, K.Sudha 2 1 M.Tech Student, Dept of CSE, Aurora's Technological
A Hybrid Load Balancing Policy underlying Cloud Computing Environment
A Hybrid Load Balancing Policy underlying Cloud Computing Environment S.C. WANG, S.C. TSENG, S.S. WANG*, K.Q. YAN* Chaoyang University of Technology 168, Jifeng E. Rd., Wufeng District, Taichung 41349
The Reduced Address Space (RAS) for Application Memory Authentication
The Reduced Address Space (RAS) for Application Memory Authentication David Champagne, Reouven Elbaz and Ruby B. Lee Princeton University, USA Introduction Background: TPM, XOM, AEGIS, SP, SecureBlue want
Delivering Quality in Software Performance and Scalability Testing
Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,
SRAM Scaling Limit: Its Circuit & Architecture Solutions
SRAM Scaling Limit: Its Circuit & Architecture Solutions Nam Sung Kim, Ph.D. Assistant Professor Department of Electrical and Computer Engineering University of Wisconsin - Madison SRAM VCC min Challenges
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
SPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey
A Survey on ARM Cortex A Processors Wei Wang Tanima Dey 1 Overview of ARM Processors Focusing on Cortex A9 & Cortex A15 ARM ships no processors but only IP cores For SoC integration Targeting markets:
Masters Project Proposal
Masters Project Proposal Virtual Machine Storage Performance Using SR-IOV by Michael J. Kopps Committee Members and Signatures Approved By Date Advisor: Dr. Jia Rao Committee Member: Dr. Xiabo Zhou Committee
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 Load Balancing Heterogeneous Request in DHT-based P2P Systems Mrs. Yogita A. Dalvi Dr. R. Shankar Mr. Atesh
Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup
Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to
MAXIMIZING RESTORABLE THROUGHPUT IN MPLS NETWORKS
MAXIMIZING RESTORABLE THROUGHPUT IN MPLS NETWORKS 1 M.LAKSHMI, 2 N.LAKSHMI 1 Assitant Professor, Dept.of.Computer science, MCC college.pattukottai. 2 Research Scholar, Dept.of.Computer science, MCC college.pattukottai.
A PPM-like, tag-based branch predictor
Journal of Instruction-Level Parallelism 7 (25) 1-1 Submitted 1/5; published 4/5 A PPM-like, tag-based branch predictor Pierre Michaud IRISA/INRIA Campus de Beaulieu, Rennes 35, France [email protected]
Maximizing Hardware Prefetch Effectiveness with Machine Learning
Maximizing Hardware Prefetch Effectiveness with Machine Learning Saami Rahman, Martin Burtscher, Ziliang Zong, and Apan Qasem Department of Computer Science Texas State University San Marcos, TX 78666
Study Plan Masters of Science in Computer Engineering and Networks (Thesis Track)
Plan Number 2009 Study Plan Masters of Science in Computer Engineering and Networks (Thesis Track) I. General Rules and Conditions 1. This plan conforms to the regulations of the general frame of programs
A Data De-duplication Access Framework for Solid State Drives
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 28, 941-954 (2012) A Data De-duplication Access Framework for Solid State Drives Department of Electronic Engineering National Taiwan University of Science
hybridfs: Integrating NAND Flash-Based SSD and HDD for Hybrid File System
hybridfs: Integrating NAND Flash-Based SSD and HDD for Hybrid File System Jinsun Suk and Jaechun No College of Electronics and Information Engineering Sejong University 98 Gunja-dong, Gwangjin-gu, Seoul
Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano
Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview
A STUDY OF THE BEHAVIOUR OF THE MOBILE AGENT IN THE NETWORK MANAGEMENT SYSTEMS
A STUDY OF THE BEHAVIOUR OF THE MOBILE AGENT IN THE NETWORK MANAGEMENT SYSTEMS Tarag Fahad, Sufian Yousef & Caroline Strange School of Design and Communication Systems, Anglia Polytechnic University Victoria
Accelerating Business Intelligence with Large-Scale System Memory
Accelerating Business Intelligence with Large-Scale System Memory A Proof of Concept by Intel, Samsung, and SAP Executive Summary Real-time business intelligence (BI) plays a vital role in driving competitiveness
Quiz for Chapter 1 Computer Abstractions and Technology 3.10
Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
Capstone Overview Architecture for Big Data & Machine Learning. Debbie Marr ICRI-CI 2015 Retreat, May 5, 2015
Capstone Overview Architecture for Big Data & Machine Learning Debbie Marr ICRI-CI 2015 Retreat, May 5, 2015 Accelerators Memory Traffic Reduction Memory Intensive Arch. Context-based Prefetching Deep
