How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures
|
|
|
- Ross Whitehead
- 9 years ago
- Views:
Transcription
1 White Paper Gabriele Paoloni Linux Kernel/Embedded Software Engineer Intel Corporation How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures September
2 Abstract This paper provides precise methods to measure the clock cycles spent when executing a certain C code on a Linux* environment with a generic Intel architecture processor (either 32 bits or 64 bits). 2
3 Contents 1 0BIntroduction Purpose/Scope Assumptions Terminology Problem Description Introduction Problems with RDTSC Instruction in C Inline Assembly Variance Introduced by CPUID and Improvements with RTDSCP Instruction Problems with the CPUID Instruction Code Analysis Evaluation of the First Benchmarking Method Improvements Using RDTSCP Instruction The Improved Benchmarking Method Evaluation of the Improved Benchmarking Method An Alternative Method for Architecture Not Supporting RDTSCP Evaluation of the Alternative Method Resolution of the Benchmarking Methodologies Resolution with RDTSCP Resolution with the Alternative Method BSummary Appendix Reference List Figures Figure 1. Minimum Value Behavior Graph Figure 2. Variance Behavior Graph Figure 3. Minimum Value Behavior Graph Figure 4. Variance Behavior Graph Figure 5. Minimum Value Behavior Graph Figure 6. Variance Behavior Graph Figure 7. Minimum Value Behavior Graph Figure 8. Variance Behavior Graph Figure 9. Variance Behavior Graph
4 Tables Figure 10. Variance Behavior Graph Table 1. List of Terms
5 1 0BIntroduction 1.1 Purpose/Scope The purpose of this document is to provide software developers with precise methods to measure the clock cycles required to execute specific C code in a Linux environment running on a generic Intel architecture processor. These methods can be very useful in a CPU-benchmarking context, in a code-optimization context, and also in an OS-tuning context. In all these cases, the developer is interested in knowing exactly how many clock cycles are elapsed while executing code. At the time of this writing, the best description of how to benchmark code execution can be found in [1]. Unfortunately, many problems were encountered while using this method. This paper describes the problems and proposes two separate solutions. 1.2 Assumptions In this paper, all the results shown were obtained by running tests on a platform whose BIOS was optimized by removing every factor that could cause indeterminism. All power optimization, Intel Hyper-Threading technology, frequency scaling and turbo mode functionalities were turned off. The OS used was opensuse* 11.2 (linux ). 1.3 Terminology Table 1 lists the terms used in this document. Table 1. List of Terms Term Description CPU Central Processing Unit IA32 Intel 32-bit Architecture IA64 Intel 64-bit Architecture GCC GNU* Compiler Collection ICC Intel C/C++ Compiler 5
6 Term RDTSCP Description Read Time-Stamp Counter and Processor ID IA assembly instruction RTDSC Read Time-Stamp Counter and Processor ID IA assembly instruction 6
7 2 Problem Description This section explains the issues involved in reading the timestamp register and discusses the correct methodology to return precise and reliable clock cycles measurements. It is expected that readers have knowledge of basic GCC and ICC compiling techniques, basic Intel assembly syntax, and AT&T* assembly syntax. Those not interested in the problem description and method justification can skip this section and go to Section (if their platform supports the RDTSCP instruction) or Section (if not) to acquire the code. 2.1 Introduction Intel CPUs have a timestamp counter to keep track of every cycle that occurs on the CPU. Starting with the Intel Pentium processor, the devices have included a per-core timestamp register that stores the value of the timestamp counter and that can be accessed by the RDTSC and RDTSCP assembly instructions. When running a Linux OS, the developer can check if his CPU supports the RDTSCP instruction by looking at the flags field of /proc/cpuinfo ; if rdtscp is one of the flags, then it is supported. 2.2 Problems with RDTSC Instruction in C Inline Assembly Assume that you are working in a Linux environment, and are compiling by using GCC. You have C code and want to know how many clock cycles are spent to execute the code itself or just a part of it. To make sure that our measurements are not tainted by any kind of interrupt (including scheduling preemption), we are going to write a kernel module where we guarantee the exclusive ownership of the CPU when executing the code that we want to benchmark. To understand the practical implementation, let s consider the following dummy kernel module; it simply calls a function that is taking a pointer as input and is setting the pointed value to 1. We want to measure how many clock cycles it takes to call such a function: #include <linux/module.h> #include <linux/kernel.h> #include <linux/init.h> #include <linux/hardirq.h> #include <linux/preempt.h> #include <linux/sched.h> void inline measured_function(volatile int *var) { (*var) = 1; 7
8 } static int init hello_start(void) { unsigned long flags; uint64_t start, end; int variable = 0; unsigned cycles_low, cycles_high, cycles_low1, cycles_high1; printk(kern_info "Loading test module...\n"); preempt_disable(); /*we disable preemption on our CPU*/ raw_local_irq_save(flags); /*we disable hard interrupts on our CPU*/ /*at this stage we exclusively own the CPU*/ asm volatile ( "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)); measured_function(&variable); asm volatile ( "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1)); raw_local_irq_restore(flags); /*we enable hard interrupts on our CPU*/ preempt_enable();/*we enable preemption*/ start = ( ((uint64_t)cycles_high << 32) cycles_low ); end = ( ((uint64_t)cycles_high1 << 32) cycles_low1 ); printk(kern_info "\n function execution time is %llu clock cycles", (endstart)); } return 0; static void exit hello_end(void) { printk(kern_info "Goodbye Mr.\n"); } module_init(hello_start); module_exit(hello_end); The RDTSC instruction loads the high-order 32 bits of the timestamp register into EDX, and the low-order 32 bits into EAX. A bitwise OR is performed to reconstruct and store the register value into a local variable. In the code above, to guarantee the exclusive ownership of the CPU before performing the measure, we disable the preemption (preempt_disable()) and we disable the hard interrupts (raw_local_irq_save()). Then we call the RDTSC assembly instruction to read the timestamp register. We call our function (measured_function()), and we read the timestamp register again (RDTSC) to see how many clock cycles have been elapsed since the first read. The two variables 8
9 start and end store the timestamp register values at the respective times of the RDTSC calls. Finally, we print the measurement on the screen. Logically the code above makes sense, but if we try to compile it, we could get segmentation faults or some weird results. This is because we didn t consider a few issues that are related to the RDTSC instruction itself and to the Intel Architecture: Register Overwriting RDTSC instruction, once called, overwrites the EAX and EDX registers. In the inline assembly that we presented above, we didn t declare any clobbered register. Basically we have to push those register statuses onto the stack before calling RDTSC and popping them afterwards. The practical solution for that is to write the inline assembly as follows (note bold items): asm volatile ("RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: %eax, %edx ); In case we are using an IA64 platform rather than an IA32, in the list of clobbered registers we have to replace %eax, %edx with %rax, %rdx. In fact, in the Intel 64 and IA-32 Architectures Software Developer s Manual Volume 2B ([3]), it states that On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared. Out of Order Execution Starting with the Intel Pentium processor, most Intel CPUs support out-of-order execution of the code. The purpose is to optimize the penalties due to the different instruction latencies. Unfortunately this feature does not guarantee that the temporal sequence of the single compiled C instructions will respect the sequence of the instruction themselves as written in the source C file. When we call the RDTSC instruction, we pretend that that instruction will be executed exactly at the beginning and at the end of code being measured (i.e., we don t want to measure compiled code executed outside of the RDTSC calls or executed in between the calls themselves). The solution is to call a serializing instruction before calling the RDTSC one. A serializing instruction is an instruction that forces the CPU to complete every preceding instruction of the C code before continuing the program execution. By doing so we guarantee that only the code that is under measurement will be executed in between the RDTSC calls and that no part of that code will be executed outside the calls. The complete list of available serializing instructions on IA64 and IA32 can be found in the Intel 64 and IA-32 Architectures Software Developer s Manual Volume 3A [4]. Reading this manual, we find that CPUID can be executed at any privilege level to serialize instruction execution with no effect on program flow, except that the EAX, EBX, ECX and EDX registers are modified. Accordingly, the 9
10 natural choice to avoid out of order execution would be to call CPUID just before both RTDSC calls; this method works but there is a lot of variance (in terms of clock cycles) that is intrinsically associated with the CPUID instruction execution itself. This means that to guarantee serialization of instructions, we lose in terms of measurement resolution when using CPUID. A quantitative analysis about this is presented in Section An important consideration that we have to make is that the CPUID instruction overwrites EAX, EBX, ECX, and EDX registers. So we have to add EBX and ECX to the list of clobbered registers mentioned in Register Overwriting above. If we are using an IA64 rather than an IA32 platform, in the list of clobbered registers we have to replace "%eax", "%ebx", "%ecx", "%edx" with "%rax", "%rbx", "%rcx", "%rdx". In fact, in the Intel 64 and IA-32 Architectures Software Developer s Manual Volume 2A ([3]), it states that On Intel 64 processors, CPUID clears the high 32 bits of the RAX/RBX/RCX/RDX registers in all modes. Overhead in Calling CPUID and RDTSC When we call the instructions to capture the clock (the serializing one plus RDTSC) an overhead (in terms of clock cycles) is associated with the calls themselves; such overhead has to be measured and subtracted from the measurement of the code we are interested in. Later in this paper, we show how to properly measure the overhead involved in taking the measurement itself. 10
11 3 Variance Introduced by CPUID and Improvements with RTDSCP Instruction This section shows that if, from one side, the CPUID instruction guarantees no code cross-contamination, then, from a measurement perspective, the other can introduce a variance in terms of clock cycles that is too high to guarantee an acceptable measurement resolution. To solve this issue, we use an alternative implementation using the RTDSCP instruction. 3.1 Problems with the CPUID Instruction Let s consider the code shown in the Appendix. Later in this paper we reference numbered code lines in the appendix to help avoid duplication of code. Also, in this case, the code has been written in kernel space to guarantee the exclusive ownership of the CPU Code Analysis Ln98: Init function of the kernel module. Ln101: Here we declare **times double pointer. This pointer is allocated with a double array of memory (ln108 to ln122) of size BOUND_OF_LOOP*SIZE_OF_STAT: the meaning of these two values is explained later in this paper. The purpose of **times is to store all the time measurements (clock cycles). Ln102/ln103: The pointers *variances and *min_values are declared. Those pointers are used to respectively store the array of the variances and the array of minimum values of different ensembles of measures. The memory for both arrays is allocated at lines 124 to 134. Ln137: Filltimes function is called. Such function is defined at ln12; it is the core function of our code. Its purpose is to calculate the execution times of the code under measurement and to fill accordingly the **times double array. Ln19 to Ln30: In these lines we are consecutively calling the inline assembly instructions used just afterwards in the code to calculate the times. The purpose of this is to warm up the instruction cache to avoid spurious measurements due to cache effects in the first iterations of the following loop. 11
12 Ln33/34: Here we have two nested loops inside which the measurements take place. There are two reasons for having two loops for the following scenarios: When there is no function to be measured - in this case we are evaluating the statistical nature of the offset associated with the process of taking the measure itself. The inner loop is used to calculate statistic values such as minimum, maximum deviation from the minimum, variance; the outer loop is used to evaluate the ergodicity of the method taking the measures. When evaluating a function duration - the outer loop is used to increase step by step the complexity of the function itself in such a way to evaluate the goodness of the measuring method itself (in terms of clock cycles resolution). Ln38/39: Here we get the exclusive ownership of the CPU (see Section 2.2). Ln41 to Ln51: Here we implement the inline assembly code used to take the measurement. This is the part that along with this paper can change evaluation techniques and introduce improvements in the method. Ln53/54: We release the ownership of the CPU (see Section 2.2). Ln68: We fill the times array with the measured time. Ln139: At this stage the **times array is entirely filled with the calculated values. Following the array there are two nested loops: the inner one (ln145 to ln150) goes from zero to (SIZE_OF_STAT-1) and is used to calculate the minimum value and the maximum deviation from the minimum (max - min) for a certain ensemble of measures; the external one (ln139) is used to go across different ensembles. On the same ensemble the variance is calculated (ln160) and is stored in the array of variances. An accumulator (tot_var) is used to calculate the total variance (calculated also on the outer loop) of all the measurements. spurious (ln156) is a counter that is increased whenever between contiguous ensembles the minimum value of the previous is bigger than the one that follows. It is a useful index in case we are evaluating a function whose complexity is increasing along the external loop: a more complex function has to take more cycles to be executed; if the minimum measured value is smaller than the one measured on the ensemble for the less complex function, there is something wrong (we will see later how this index is useful). Finally, at ln168/169, the variance of the variances is calculated, and the variance of the minimum values. Both are needed to evaluate the ergodicity of the measurement process (if the process is ergodic the variance of variances tends to zero and, in this specific case, also the variance of the minimum value) Evaluation of the First Benchmarking Method Having built the kernel module using the code in the Appendix, we load this module and look at the kernel log ( dmesg ). The output is as follows: Loading hello module... loop_size:0 >>>> variance(cycles): 85; max_deviation: 80 ;min time:
13 loop_size:1 >>>> variance(cycles): 85; max_deviation: 56 ;min time: 456 loop_size:2 >>>> variance(cycles): 85; max_deviation: 96 ;min time: 456 loop_size:997 >>>> variance(cycles): 85; max_deviation: 92 ;min time: 456 loop_size:998 >>>> variance(cycles): 85; max_deviation: 100 ;min time: 452 loop_size:999 >>>> variance(cycles): 85; max_deviation: 60 ;min time: 452 total number of spurious min values = 262 total variance = 48 absolute max deviation = 636 variance of variances = 2306 variance of minimum values = 118 The loop_size index refers to the external loop (ln33); accordingly each row of the log shows, for a certain ensemble of measures, the variance, the maximum deviation and the minimum measured time (all of them in clock cycles). At the end of the log there are: the number of spurious minimum values (that in this case is meaningless and can be neglected): the total variance (the average of the variances in each row); the absolute maximum deviation (the maximum value amongst the max deviations of all the rows); the variance of variances and the variance of the minimum values. Looking at the results, it is clear that this method is not reliable for benchmarking. There are different reasons for this: The minimum value is not constant between different ensembles (the variance of the minimum values is 118 cycles). This means that we cannot evaluate the cost of calling the benchmarking function itself. When we are benchmarking a function we want to be able to subtract the cost of calling the benchmarking function itself from the measurement of the function to be benchmarked. This cost is the minimum possible number of cycles that it takes to call the benchmarking function (i.e., the min times in the rows of the kernel log above). Basically, in this case, each statistic is performed over 100,000 samples. The fact that over 100,000 samples an absolute minimum cannot be determined means that we cannot calculate the cost to be subtracted when benchmarking any function. A solution could be to increase the number of samples till we always get the same minimum value, but this is practically too costly since the developer would have to wait quite a lot for orders of magnitude greater than 10^5 samples. 13
14 The total variance is 48 cycles. This means that this method would introduce an uncertainty on the measure (standard deviation) of 6.9 cycles. If the developer wanted to have an average error on the measure less than 5%, it would mean that he cannot benchmark functions whose execution is shorter than 139 clock cycles. If the average desired error was less than 1%, he couldn t benchmark functions that take less than 690 cycles! The variance itself is not constant between different ensembles: the variance of the variances is 2306 cycles (i.e., a standard error on the variance of 48 cycles that is as big as the total variance itself!). This means that the standard deviation varies between measurements (i.e., the measuring process itself is not ergodic) and the error on the measure cannot be identified. A graphic view of both variances and minimum values behavior between different ensembles is shown in Figure 1 and Figure 2: Figure 1. Minimum Value Behavior Graph 1 graph clock cycles Series ensembles 14
15 Figure 2. Variance Behavior Graph 2 graph clock cycles Series ensembles 3.2 Improvements Using RDTSCP Instruction The RDTSCP instruction is described in the Intel 64 and IA-32 Architectures Software Developer s Manual Volume 2B ([3]) as an assembly instruction that, at the same time, reads the timestamp register and the CPU identifier. The value of the timestamp register is stored into the EDX and EAX registers; the value of the CPU id is stored into the ECX register ( On processors that support the Intel 64 architecture, the high order 32 bits of each of RAX, RDX, and RCX are cleared ). What is interesting in this case is the pseudo serializing property of RDTSCP. The manual states: The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed. This means that this instruction guarantees that everything that is above its call in the source code is executed before the instruction itself is called. It cannot, however, guarantee that for optimization purposes the CPU will not execute, before the RDTSCP call, instructions that, in the source code, are placed after the RDTSCP function call itself. If this happens, a contamination caused by instructions in the source code that come after the RDTSCP will occur in the code under measurement.. 15
16 The problem is graphically described as follows: asm volatile ( "CPUID\n\t"/*serialize*/ "RDTSC\n\t"/*read the clock*/ "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); /* Call the function to benchmark */ This out of order execution corrupts the*** asm volatile( "RDTSCP\n\t"/*read the clock*/ "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rcx", "%rdx"); /*do other things*/ If we find a way to avoid the undesired behavior described above we can avoid calling the serializing CPUID instruction between the two timestamp register reads The Improved Benchmarking Method The solution to the problem presented in Section 0 is to add a CPUID instruction just after the RDTPSCP and the two mov instructions (to store in memory the value of edx and eax). The implementation is as follows: asm volatile ("CPUID\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); /***********************************/ /*call the function to measure here*/ /***********************************/ asm volatile("rdtscp\n\t" "mov %%eax, %1\n\t" "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rbx", "%rcx", "%rdx"); In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. Nevertheless, this call does not affect the measurement since it comes before the RDTSC (i.e., before the timestamp register is read). The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. If the code is a call to a function, it is recommended to declare such function as inline so that from an assembly perspective there is no overhead in calling the function itself. 16
17 The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed. The two mov instructions coming afterwards store the edx and eax registers values into memory. Both instructions are guaranteed to be executed after RDTSC (i.e., they don t corrupt the measure) since there is a logical dependency between RDTSCP and the register edx and eax (RDTSCP is writing those registers and the CPU is obliged to wait for RDTSCP to be finished before executing the two mov ). Finally a CPUID call guarantees that a barrier is implemented again so that it is impossible that any instruction coming afterwards is executed before CPUID itself (and logically also before RDTSCP). With this method we avoid to call a CPUID instruction in between the reads of the real-time registers (avoiding all the problems described in Section 3.1) Evaluation of the Improved Benchmarking Method In reference to the code presented in Section 3.1, we replace the previous benchmarking method with the new one, i.e., we replace ln19 to ln54 in the Appendix with the following code: asm volatile ("CPUID\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); asm volatile("rdtscp\n\t" "mov %%eax, %1\n\t" "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rbx", "%rcx", "%rdx"); asm volatile ("CPUID\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); asm volatile("rdtscp\n\t" "mov %%eax, %1\n\t" "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rbx", "%rcx", "%rdx"); for (j=0; j<bound_of_loop; j++) { for (i =0; i<size_of_stat; i++) { variable = 0; preempt_disable(); raw_local_irq_save(flags); asm volatile ("CPUID\n\t" "RDTSC\n\t" 17
18 "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); /***********************************/ /*call the function to measure here*/ /***********************************/ asm volatile("rdtscp\n\t" "mov %%eax, %1\n\t" "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rbx", "%rcx", "%rdx"); raw_local_irq_restore(flags); preempt_enable(); If we perform the same analysis as in Section 3.2.1, we obtain a kernel log as follows: Loading hello module... loop_size:0 >>>> variance(cycles): 2; max_deviation: 4 ;min time: 44 loop_size:1 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 44 loop_size:2 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 44 loop_size:997 >>>> variance(cycles): 2; max_deviation: 4 ;min time: 44 loop_size:998 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 44 loop_size:999 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 44 total number of spurious min values = 0 total variance = 2 absolute max deviation = 104 variance of variances = 0 variance of minimum values = 0 In this case, the minimum time does not change between different ensembles (it is always the same along all the 1000 repetitions of each ensemble); this means that the overhead of calling the benchmarking function itself can be exactly determined. The total variance is 2 cycles, i.e., the standard error on the measure is 1,414 cycles (before it was 6,9 cycles). 18
19 Both the variance of variances and the variance of the minimum values are zero. This means that this improved benchmarking method is completely ergodic (between different ensembles the maximum fluctuation of the variance is 1 clock cycle and the minimum value is perfectly constant). This is the most important characteristic that we need for a method to be suitable for benchmarking purposes. For completeness, Figure 3 and Figure 4 show the same graphic analysis as done above in Section Figure 3. Minimum Value Behavior Graph 3 graph clock cycles minimum value ensembles 19
20 Figure 4. Variance Behavior Graph 4 graph4 clock cycles ensembles Variance In Figure 3 we can see that the minimum value is perfectly constant between ensembles; in Figure 4 the variance is either equal to 2 or 3 clock cycles An Alternative Method for Architecture Not Supporting RDTSCP This section presents an alternative method to benchmark code execution cycles for architectures that do not support the RDTSCP instruction. Such a method is not as good as the one presented in Section 3.2.1, but it is still much better than the one using CPUID to serialize code execution. In this method between the two timestamp register reads we serialize the code execution by writing the control register CR0. Regarding the code in the Appendix, the developer should replace ln19 to ln54 with the following: asm volatile( "CPUID\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); asm volatile( "mov %%cr0, %%rax\n\t" "mov %%rax, %%cr0\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rdx"); 20
21 asm volatile( "CPUID\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); asm volatile( "mov %%cr0, %%rax\n\t" "mov %%rax, %%cr0\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rdx"); asm volatile( "CPUID\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); asm volatile( "mov %%cr0, %%rax\n\t" "mov %%rax, %%cr0\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rdx"); for (j=0; j<bound_of_loop; j++) { for (i =0; i<size_of_stat; i++) { variable = 0; preempt_disable(); raw_local_irq_save(flags); asm volatile ("CPUID\n\t"::: "%rax", "%rbx", "%rcx", "%rdx"); asm volatile ("RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rdx"); /*call the function to measure here*/ asm volatile("mov %%cr0, %%rax\n\t" "mov %%rax, %%cr0\n\t" "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rdx"); raw_local_irq_restore(flags); preempt_enable(); In the code above, first we have the repetition (three times) of the instructions called in the body of the following nested loops; this is just to warm up the instructions cache. Then the body of the loop is executed. We: First take the exclusive ownership of the CPU (preempt_disable(), raw_local_irq_save()) Call CPUID to serialize 21
22 Read the timestamp register the first time by RDTSC and store the value in memory Read the value of the control register CR0 into RAX register Write the value of RAX back to CR0 (this instruction serializes) Read the timestamp register the second time by RDTSC and store the value in memory Release the CPU ownership (raw_local_irq_restore, preempt_enable) Evaluation of the Alternative Method As done in Section and Section 3.2.2, hereafter we present the statistical analysis of the method. The resulting kernel log is as follows: loop_size:2 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 208 loop_size:3 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 208 loop_size:4 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 208 loop_size:998 >>>> variance(cycles): 4; max_deviation: 4 ;min time: 208 loop_size:999 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 208 total number of spurious min values = 0 total variance = 3 absolute max deviation = 176 variance of variances = 0 variance of minimum values = 0 In the log we see that the total variance of this method is 3 cycles rather than 2 and the absolute max deviation is 176 cycles rather than 104 cycles. This means that the standard error on the measure is 1,73 rather than 1,414 and the maximum error is increased as well. Nevertheless, we still have met the ergodicity requirements since the variance does not change between different ensembles (the maximum fluctuation is 1 clock cycle) as well as the minimum value; both the variance of the variances and the variance of the minimum values are zero. Such a method may be suitable for benchmarking whenever the RDTSCP instruction is not available on the CPU. As done previously, the following graphs show the behavior of the minimum values and of the variances between different ensembles. 22
23 Figure 5. Minimum Value Behavior Graph 5 graph clock cycles minimum value ensembles Figure 6. Variance Behavior Graph 6 graph6 clock cycles ensembles variance In Figure 5, we can see how the minimum value is perfectly constant between ensembles. In Figure 6, we have the variance being either equal to 3 or 4 clock cycles. 23
24 3.3 Resolution of the Benchmarking Methodologies In this section we analyze the resolution of our benchmarking methods, focusing on an Intel Core i7 processor-based platform, which is representative of a highend server solution and supports the RDTSCP instruction and out of order execution. For resolution, we mean the minimum number of sequential assembly instructions that a method is able to benchmark (appreciate). The purpose of this evaluation is mostly intended to define the methodology to calculate the resolution rather than to analyze the resolutions themselves. The resolution itself is, in fact, strictly dependant on the target machine and the developer is advised to run the following test before starting to benchmark to evaluate the benchmarking limits of the platform. With reference to the code presented in the Appendix, the developer should make the proper replacements according to the benchmarking method he intends to use (Section if RDTSCP is supported, Section otherwise). Also, the following code should be added at ln11: void inline measured_loop(unsigned int n, volatile int *var) { int k; for (k=0; k<n; k++) (*var)= 1; } and the following in place of /*call the function to measure here*/ : measured_loop(j, &variable); By doing so, the external loop of the two nested ones (j index) is intended to increase step by step the complexity of the function to benchmark; more in details between two consecutive ensembles it increases the parameter that determines the size of the measured loop, thus adding exactly one assembly instruction to the code under benchmark Resolution with RDTSCP According to the guidelines in Section 3.3, we effect the recommended replacements in the code for the method using RDTSCP instruction, we build the kernel module, and we call insmod from the shell. The resulting kernel log is as follows: loop_size:0 >>>> variance(cycles): 3; max_deviation: 16 ;min time: 44 loop_size:1 >>>> variance(cycles): 3; max_deviation: 16 ;min time: 44 loop_size:2 >>>> variance(cycles): 3; max_deviation: 48 ;min time: 44 loop_size:3 >>>> variance(cycles): 4; max_deviation: 16 ;min time: 44 24
25 loop_size:4 >>>> variance(cycles): 3; max_deviation: 32 ;min time: 44 loop_size:5 >>>> variance(cycles): 0; max_deviation: 36 ;min time: 44 loop_size:6 >>>> variance(cycles): 3; max_deviation: 28 ;min time: 48 loop_size:7 >>>> variance(cycles): 4; max_deviation: 32 ;min time: 48 loop_size:8 >>>> variance(cycles): 3; max_deviation: 16 ;min time: 48 loop_size:9 >>>> variance(cycles): 2; max_deviation: 48 ;min time: 48 loop_size:10 >>>> variance(cycles): 0; max_deviation: 28 ;min time: 48 loop_size:11 >>>> variance(cycles): 3; max_deviation: 64 ;min time: 52 loop_size:994 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 2036 loop_size:995 >>>> variance(cycles): 0; max_deviation: 4 ;min time: 2036 loop_size:996 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 2040 loop_size:997 >>>> variance(cycles): 21; max_deviation: 4 ;min time: 2044 loop_size:998 >>>> variance(cycles): 22; max_deviation: 112 ;min time: 2048 loop_size:999 >>>> variance(cycles): 23; max_deviation: 160 ;min time: 2048 total number of spurious min values = 0 total variance = 1 absolute max deviation = 176 variance of variances = 2 variance of minimum values = As seen, each row is an ensemble of measures for a specific size of measured_loop ; going down row by row, the complexity of the measured_loop function is increased exactly by one assembly instruction. Accordingly we see that min time increases as we scroll down the rows. Now let s look at the final part of the log: the total number of spurious min values is zero: this means that as we increase the complexity of the measured loop, the minimum measured time monotonically increases (if we perform enough measures we can exactly determine the minimum number of clock cycles that it takes to run a certain function) the total variance is 1 cycle: that means that the overall standard error is 1 cycle (that is very good) 25
26 the absolute max deviation is 176 cycles: from a benchmarking perspective it is not very important; instead this parameter is fundamental to evaluate the capabilities of the system to meet real-time constraints (we will not pursue this further since it is out of the scope of this paper). The variance of the variances is 2 cycles: this index is very representative of how reliable our benchmarking method is (i.e., the variance of the measures does not vary according to the complexity of the function under benchmark). Finally the variance of the minimum values is completely meaningless and useless in this context and can be neglected (this index was crucial for Section 0) From a resolution perspective we can see that we have the min value that is constant (44 cycles) between 0 and 5 measured assembly instructions and between 6 and 10 assembly instructions (48 clock cycles). Then it increases very regularly by four cycles every two assembly instructions. So unless the function under benchmark is very small, in this case the resolution of this benchmarking method is two assembly instructions (that is the minimum variation in the code complexity that can be revealed). For completeness, the following graph (Graph 7) shows the minimum values, and the next graph (Graph 8) shows the variances. Figure 7. Minimum Value Behavior Graph 7 graph clock cycles min value ensembles 26
27 Figure 8. Variance Behavior Graph 8 graph8 clock cycles ensembles variance Resolution with the Alternative Method According to what we did in Section 3.3.1, we run the same test using the alternative benchmarking method presented in Section The resulting kernel log is as follows: loop_size:0 >>>> variance(cycles): 3; max_deviation: 88 ;min time: 208 loop_size:1 >>>> variance(cycles): 0; max_deviation: 16 ;min time: 208 loop_size:2 >>>> variance(cycles): 4; max_deviation: 56 ;min time: 208 loop_size:3 >>>> variance(cycles): 0; max_deviation: 20 ;min time: 212 loop_size:4 >>>> variance(cycles): 3; max_deviation: 36 ;min time: 212 loop_size:5 >>>> variance(cycles): 3; max_deviation: 36 ;min time: 216 loop_size:6 >>>> variance(cycles): 4; max_deviation: 36 ;min time: 216 loop_size:7 >>>> variance(cycles): 0; max_deviation: 68 ;min time: 220 loop_size:994 >>>> variance(cycles): 28; max_deviation: 112 ;min time:
28 loop_size:995 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2216 loop_size:996 >>>> variance(cycles): 28; max_deviation: 4 ;min time: 2216 loop_size:997 >>>> variance(cycles): 0; max_deviation: 112 ;min time: 2216 loop_size:998 >>>> variance(cycles): 28; max_deviation: 116 ;min time: 2220 loop_size:999 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2224 total number of spurious min values = 0 total variance = 1 absolute max deviation = 220 variance of variances = 2 variance of minimum values = With this method we achieved results as good as the previous ones. The only difference is the absolute maximum deviation that here is slightly higher; this does not affect the quality of the method from a benchmarking perspective. For completeness, the following graphs present behaviors of the variance and the minimum value. Figure 9. Variance Behavior Graph 9 graph clock cycles min value ensembles 28
29 Figure 10. Variance Behavior Graph 10 graph10 clock cycles ensembles variance 29
30 4 5BSummary In Section and Section we showed two suitable methods for benchmarking the execution time of a generic C/C++ function running on an IA32/IA64 platform. The former should be chosen if the RDTSCP instruction is available; if not, the other one can be used. Whenever taking a measurement, the developer should perform the following steps: 1. Run the tests in Section or Section (according to the platform). 2. Analyze the variance of the variances and the variance of the minimum values to validate the method on his platform. If the values that the user obtains are not satisfactory, he may have to change the BIOS settings or the BIOS itself. 3. Calculate the resolution that the method is able to guarantee. 4. Make the measurement and subtract the offset (additional cost of calling the measuring function itself) that the user will have calculated before (minimum value from Section or Section 3.2.4). A couple final considerations should be made: Counter Overflow: The timestamp register is 64 bit. On a single overflow, we encounter no problems since we are making a difference between unsigned int and the results would be still correct. The problem arises if the duration of the code under measurement takes longer than 2^64 cycles. For a 1-GHz CPU, that would mean that your code should take longer than (2^64)/(10^9) = seconds ~ 585 years So it shouldn t be a problem or, if it is, the developer won t still be alive to see it! 32- vs. 64-Bit Architectures: Particular attention must be paid to the 64-bit registers used in the code presented in this paper. Whenever working with a 32b platform, the code presented is still valid, but whatever occurrence of rax, rbx, rcx, rdx has to be replaced respectively with eax, ebx, ecx, edx. The Intel Embedded Design Center provides qualified developers with web-based access to technical resources. Access Intel Confidential design materials, step-by-step guidance, application reference solutions, training, Intel s tool loaner program, and connect with an e-help desk and the embedded community. Design Fast. Design Smart. Get started today. 43Hhttp://intel.com/embedded/edc. 30
31 5 Appendix 1 #include <linux/module.h> 2 #include <linux/kernel.h> 3 #include <linux/init.h> 4 #include <linux/hardirq.h> 5 #include <linux/preempt.h> 6 #include <linux/sched.h> 7 8 #define SIZE_OF_STAT #define BOUND_OF_LOOP #define UINT64_MAX ( ULL) void inline Filltimes(uint64_t **times) { 13 unsigned long flags; 14 int i, j; 15 uint64_t start, end; 16 unsigned cycles_low, cycles_high, cycles_low1, cycles_high1; 17 volatile int variable = 0; asm volatile ("CPUID\n\t" 20 "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); 23 asm volatile ("CPUID\n\t" 24 "RDTSC\n\t" 25 "CPUID\n\t" 26 "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); 29 asm volatile ("CPUID\n\t" 30 "RDTSC\n\t"::: "%rax", "%rbx", "%rcx", "%rdx"); for (j=0; j<bound_of_loop; j++) { 34 for (i =0; i<size_of_stat; i++) { variable = 0; preempt_disable(); 39 raw_local_irq_save(flags); asm volatile ( 42 "CPUID\n\t" 43 "RDTSC\n\t" "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax", "%rbx", "%rcx", "%rdx"); 46 /*call the function to measure here*/ 47 asm volatile( 48 "CPUID\n\t" 49 "RDTSC\n\t" 50 31
32 51 "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rbx", "%rcx", "%rdx"); raw_local_irq_restore(flags); 54 preempt_enable(); start = ( ((uint64_t)cycles_high << 32) cycles_low ); end = ( ((uint64_t)cycles_high1 << 32) cycles_low1 ); if ( (end - start) < 0) { 62 printk(kern_err "\n\n>>>>>>>>>>>>>> CRITICAL ERROR IN TAKING THE TIME!!!!!!\n loop(%d) stat(%d) start = %llu, end = %llu, variable = %u\n", j, i, start, end, variable); 63 times[j][i] = 0; 64 } 65 else 66 { 67 times[j][i] = end - start; 68 } 69 } 70 } 71 return; 72} 73uint64_t var_calc(uint64_t *inputs, int size) 74{ 75 int i; 76 uint64_t acc = 0, previous = 0, temp_var = 0; 77 for (i=0; i< size; i++) { 78 if (acc < previous) goto overflow; 79 previous = acc; 80 acc += inputs[i]; 81 } 82 acc = acc * acc; 83 if (acc < previous) goto overflow; 84 previous = 0; 85 for (i=0; i< size; i++){ 86 if (temp_var < previous) goto overflow; 87 previous = temp_var; 88 temp_var+= (inputs[i]*inputs[i]); 89 } 90 temp_var = temp_var * size; 91 if (temp_var < previous) goto overflow; 92 temp_var =(temp_var - acc)/(((uint64_t)(size))*((uint64_t)(size))); 93 return (temp_var); 94overflow: 95 printk(kern_err "\n\n>>>>>>>>>>>>>> CRITICAL OVERFLOW ERROR IN var_calc!!!!!!\n\n"); 96 return -EINVAL; 97} 98static int init hello_start(void) 99{ 100 int i = 0, j = 0, spurious = 0, k =0; 101 uint64_t **times; 102 uint64_t *variances; 32
33 103 uint64_t *min_values; 104 uint64_t max_dev = 0, min_time = 0, max_time = 0, prev_min =0, tot_var=0, max_dev_all=0, var_of_vars=0, var_of_mins=0; printk(kern_info "Loading hello module...\n"); times = kmalloc(bound_of_loop*sizeof(uint64_t*), GFP_KERNEL); 109 if (!times) { 110 printk(kern_err "unable to allocate memory for times\n"); 111 return 0; 112 } for (j=0; j<bound_of_loop; j++) { 115 times[j] = kmalloc(size_of_stat*sizeof(uint64_t), GFP_KERNEL); 116 if (!times[j]) { 117 printk(kern_err "unable to allocate memory for times[%d]\n", j); 118 for (k=0; k<j; k++) 119 kfree(times[k]); 120 return 0; 121 } 122 } variances = kmalloc(bound_of_loop*sizeof(uint64_t), GFP_KERNEL); 125 if (!variances) { 126 printk(kern_err "unable to allocate memory for variances\n"); 127 return 0; 128 } min_values = kmalloc(bound_of_loop*sizeof(uint64_t), GFP_KERNEL); 131 if (!min_values) { 132 printk(kern_err "unable to allocate memory for min_values\n"); 133 return 0; 134 } Filltimes(times); for (j=0; j<bound_of_loop; j++) { max_dev = 0; 142 min_time = 0; 143 max_time = 0; for (i =0; i<size_of_stat; i++) { 146 if ((min_time == 0) (min_time > times[j][i])) 147 min_time = times[j][i]; 148 if (max_time < times[j][i]) 149 max_time = times[j][i]; 150 } max_dev = max_time - min_time; 153 min_values[j] = min_time; if ((prev_min!= 0) && (prev_min > min_time)) 156 spurious++; 157 if (max_dev > max_dev_all) 158 max_dev_all = max_dev; variances[j] = var_calc(times[j], SIZE_OF_STAT); 33
34 161 tot_var += variances[j]; printk(kern_err "loop_size:%d >>>> variance(cycles): %llu; max_deviation: %llu ;min time: %llu", j, variances[j], max_dev, min_time); prev_min = min_time; 166 } var_of_vars = var_calc(variances, BOUND_OF_LOOP); 169 var_of_mins = var_calc(min_values, BOUND_OF_LOOP); printk(kern_err "\n total number of spurious min values = %d", spurious); 172 printk(kern_err "\n total variance = %llu", (tot_var/bound_of_loop)); 173 printk(kern_err "\n absolute max deviation = %llu", max_dev_all); 174 printk(kern_err "\n variance of variances = %llu", var_of_vars); 175 printk(kern_err "\n variance of minimum values = %llu", var_of_mins); for (j=0; j<bound_of_loop; j++) { 178 kfree(times[j]); 179 } 180 kfree(times); 181 kfree(variances); 182 kfree(min_values); 183 return 0; 184} static void exit hello_end(void) 187{ 188 printk(kern_info "Goodbye Mr.\n"); 189} module_init(hello_start); 192module_exit(hello_end); 34
35 6 Reference List [1] Using the RDTSC Instruction for Performance Monitoring [2] GCC Inline Assembly HowTo [3] Intel 64 and IA-32 Architectures Software Developer s Manual Volume 2B [4] Intel 64 and IA-32 Architectures Software Developer s Manual Volume 2A [5] The Linux Kernel Module Programming Guide 35
36 Author Gabriele Paoloni is an Embedded/Linux software engineer in the Intel Architecture Group. 36
37 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. This paper is for informational purposes only. THIS DOCUMENT IS PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including liability for infringement of any proprietary rights, relating to use of information in this specification. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted herein. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, Go to: 4Hhttp:// BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Core Inside, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel Viiv, Intel vpro, Intel XScale, InTru, the InTru logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Moblin, Pentium, Pentium Inside, skoool, the skoool logo, Sound Mark, The Journey Inside, vpro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2010 Intel Corporation. All rights reserved. 37
Using the RDTSC Instruction for Performance Monitoring
Using the Instruction for Performance Monitoring http://developer.intel.com/drg/pentiumii/appnotes/pm1.htm Using the Instruction for Performance Monitoring Information in this document is provided in connection
Upgrading Intel AMT 5.0 drivers to Linux kernel v2.6.31
White Paper Zerene Sangma Platform Application Engineer Intel Corporation Upgrading Intel AMT 5.0 drivers to Linux kernel v2.6.31 For Intel Q45 and Intel GM45 based embedded platforms June 2010 323961
ARM* to Intel Atom Microarchitecture - A Migration Study
White Paper Mark Oliver Senior Systems Engineer Intel Corporation ARM* to Intel Atom Microarchitecture - A Migration Study November 2011 326398-001 1 Introduction At Intel, our engineers do not perform
CPU performance monitoring using the Time-Stamp Counter register
CPU performance monitoring using the Time-Stamp Counter register This laboratory work introduces basic information on the Time-Stamp Counter CPU register, which is used for performance monitoring. The
Internal LVDS Dynamic Backlight Brightness Control
White Paper Ho Nee Shen Senior Software Engineer Intel Corporation Chan Swee Tat System Engineer Intel Corporation Internal LVDS Dynamic Backlight Brightness Control A platform and software design using
Intel Platform Controller Hub EG20T
Intel Platform Controller Hub EG20T General Purpose Input Output (GPIO) Driver for Windows* Order Number: 324257-002US Legal Lines and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION
Using GStreamer for hardware accelerated video decoding on Intel Atom Processor E6xx series
White Paper Abhishek Girotra Graphics SW TME Intel Corporation Using GStreamer for hardware accelerated video decoding on Intel Atom Processor E6xx series September 2010 324294 Contents Executive Summary...3
DDR2 x16 Hardware Implementation Utilizing the Intel EP80579 Integrated Processor Product Line
Utilizing the Intel EP80579 Integrated Processor Product Line Order Number: 320296-002US Legal Lines and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
PHYSICAL CORES V. ENHANCED THREADING SOFTWARE: PERFORMANCE EVALUATION WHITEPAPER
PHYSICAL CORES V. ENHANCED THREADING SOFTWARE: PERFORMANCE EVALUATION WHITEPAPER Preface Today s world is ripe with computing technology. Computing technology is all around us and it s often difficult
Cyber Security Framework: Intel s Implementation Tools & Approach
Cyber Security Framework: Intel s Implementation Tools & Approach Tim Casey Senior Strategic Risk Analyst @timcaseycyber NIST Workshop #6 October 29, 2014 Intel s Goals in Using the CSF Establish alignment
Debugging Machine Check Exceptions on Embedded IA Platforms
White Paper Ashley Montgomery Platform Application Engineer Intel Corporation Tian Tian Platform Application Engineer Intel Corporation Debugging Machine Check Exceptions on Embedded IA Platforms July
Processor Reorder Buffer (ROB) Timeout
White Paper Ai Bee Lim Senior Platform Application Engineer Embedded Communications Group Performance Products Division Intel Corporation Jack R Johnson Senior Platform Application Engineer Embedded Communications
Accessing the Real Time Clock Registers and the NMI Enable Bit
White Paper Sam Fleming Technical Marketing Engineer Intel Corporation Accessing the Real Time Clock Registers and the NMI Enable Bit A Study in I/O Locations 0x70-0x77 January 2009 321088 1 Executive
White Paper David Hibler Jr Platform Solutions Engineer Intel Corporation. Considerations for designing an Embedded IA System with DDR3 ECC SO-DIMMs
White Paper David Hibler Jr Platform Solutions Engineer Intel Corporation Considerations for designing an Embedded IA System with DDR3 ECC SO-DIMMs September 2012 326491 Executive Summary What are ECC
Intel EP80579 Software for Security Applications on Intel QuickAssist Technology Cryptographic API Reference
Intel EP80579 Software for Security Applications on Intel QuickAssist Technology Cryptographic API Reference Automatically generated from sources, May 19, 2009. Reference Number: 320184, Revision -003
Intel IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor: Spread-Spectrum Clocking to Reduce EMI
Intel IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor: Spread-Spectrum Clocking to Reduce EMI Application Note July 2004 Document Number: 254028-002 INFORMATION IN THIS DOCUMENT
Embedded Controller Usage in Low Power Embedded Designs
White Paper Matthew Lee Platform Application Engineer Intel Corporation Embedded Controller Usage in Low Power Embedded Designs An Overview September 2011 326133 Executive Summary Embedded Controllers
CPU Monitoring With DTS/PECI
White Paper Michael Berktold Thermal/Mechanical Engineer Tian Tian Technical Marketing Engineer Intel Corporation CPU Monitoring With DTS/PECI September 2010 322683 Executive Summary This document describes
IT@Intel. Memory Sizing for Server Virtualization. White Paper Intel Information Technology Computer Manufacturing Server Virtualization
White Paper Intel Information Technology Computer Manufacturing Server Virtualization Memory Sizing for Server Virtualization Intel IT has standardized on 16 gigabytes (GB) of memory for dual-socket virtualization
CT Bus Clock Fallback for Linux Operating Systems
CT Bus Clock Fallback for Linux Operating Systems Demo Guide August 2005 05-1900-002 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL
Enabling new usage models for Intel Embedded Platforms
White Paper David Galus Graphics Platform Application Engineer Kirk Blum Graphics Solution Architect Intel Corporation Hybrid Multimonitor Support Enabling new usage models for Intel Embedded Platforms
Intel Rapid Storage Technology (Intel RST) in Linux*
White Paper Ramu Ramakesavan (Technical Marketing Engineer) Patrick Thomson (Software Engineer) Intel Corporation Intel Rapid Storage Technology (Intel RST) in Linux* August 2011 326024 Executive Summary
Performance monitoring with Intel Architecture
Performance monitoring with Intel Architecture CSCE 351: Operating System Kernels Lecture 5.2 Why performance monitoring? Fine-tune software Book-keeping Locating bottlenecks Explore potential problems
Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor
Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor White Paper March 2004 Order Number: 301170-001 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
Considerations for Designing an Embedded Intel Architecture System with System Memory Down
White Paper David Hibler Jr Technical Marketing Engineer Intel Corporation Considerations for Designing an Embedded Intel Architecture System with System Memory Down August 2009 322506 Executive Summary
Software PBX Performance on Intel Multi- Core Platforms - a Study of Asterisk* White Paper
Software PBX Performance on Intel Multi- Core Platforms - a Study of Asterisk* Order Number: 318862-001US Legal Lines and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.
Developing secure software A practical approach
Developing secure software A practical approach Juan Marcelo da Cruz Pinto Security Architect Legal notice Intel Active Management Technology requires the computer system to have an Intel(R) AMT-enabled
Intel Embedded Virtualization Manager
White Paper Kelvin Lum Fee Foon Kong Platform Application Engineer, ECG Penang Intel Corporation Kam Boon Hee (Thomas) Marketing Development Manager, ECG Penang Intel Corporation Intel Embedded Virtualization
White Paper Amy Chong Yew Ee Online Sales Account Manager APAC Online Sales Center Intel Corporation. BOM Cost Reduction by Removing S3 State
White Paper Amy Chong Yew Ee Online Sales Account Manager APAC Online Sales Center Intel Corporation BOM Cost Reduction by Removing S3 State May 2011 325448 Executive Summary In today s embedded design,
Page Modification Logging for Virtual Machine Monitor White Paper
Page Modification Logging for Virtual Machine Monitor White Paper This document is intended only for VMM or hypervisor software developers and not for application developers or end-customers. Readers are
Video Encoding on Intel Atom Processor E38XX Series using Intel EMGD and GStreamer
White Paper Lim Siew Hoon Graphics Software Engineer Intel Corporation Kumaran Kalaiyappan Graphics Software Engineer Intel Corporation Tay Boon Wooi Graphics Software Engineer Intel Corporation Video
Implementing Multiple Displays with IEGD Multi-GPU - Multi-Monitor Mode on Intel Atom Processor with Intel System Controller Hub US15W Chipset
White Paper Chris Wojslaw Lead Technical Marketing Engineer Kirk Blum Graphics Solution Architect Intel Corporation Implementing Multiple Displays with IEGD Multi-GPU - Multi-Monitor Mode on Intel Atom
Intel Core TM i7-660ue, i7-620le/ue, i7-610e, i5-520e, i3-330e and Intel Celeron Processor P4505, U3405 Series
Intel Core TM i7-660ue, i7-620le/ue, i7-610e, i5-520e, i3-330e and Intel Celeron Processor P4505, U3405 Series Datasheet Addendum Specification Update Document Number: 323179 Legal Lines and Disclaimers
Contents -------- Overview and Product Contents -----------------------------
------------------------------------------------------------------------ Intel(R) Threading Building Blocks - Release Notes Version 2.0 ------------------------------------------------------------------------
Intel(R) IT Director User's Guide
Intel(R) IT Director User's Guide Table of Contents Disclaimer and Legal Information... 1 Introduction... 3 To set up Intel IT Director:... 3... 3 System Configuration... 5... 5 Settings Page: Overview...
Development for Mobile Devices Tools from Intel, Platform of Your Choice!
Development for Mobile Devices Tools from Intel, Platform of Your Choice! Sergey Lunev, Intel Corporation HTML5 Tools Development Manager Optional: Download App Preview Android bit.ly/1i8vegl ios bit.ly/1a3w7bk
Inside Linux* graphics
White Paper Ragav Gopalan Platform Applications Engineer Intel Corporation Inside Linux* graphics Understanding the components, upgrading drivers, and advanced use cases April 2011 325350 Inside Linux*
Intel Processor Serial Number
APPLICATION NOTE Intel Processor Serial Number March 1999 ORDER NUMBER: 245125-001 Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel
64-Bit NASM Notes. Invoking 64-Bit NASM
64-Bit NASM Notes The transition from 32- to 64-bit architectures is no joke, as anyone who has wrestled with 32/64 bit incompatibilities will attest We note here some key differences between 32- and 64-bit
Keil C51 Cross Compiler
Keil C51 Cross Compiler ANSI C Compiler Generates fast compact code for the 8051 and it s derivatives Advantages of C over Assembler Do not need to know the microcontroller instruction set Register allocation
Device Management API for Windows* and Linux* Operating Systems
Device Management API for Windows* and Linux* Operating Systems Library Reference September 2004 05-2222-002 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
Dual-Core Processors on Dell-Supported Operating Systems
Dual-Core Processors on Dell-Supported Operating Systems With the advent of Intel s dual-core, in addition to existing Hyper-Threading capability, there is confusion regarding the number of that software
Xen in Embedded Systems. Ray Kinsella Senior Software Engineer Embedded and Communications Group Intel Corporation
Xen in Embedded Systems Ray Kinsella Senior Software Engineer Embedded and Communications Group Intel Corporation Legal Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.
Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction
White Paper Vinodh Gopal Erdinc Ozturk Jim Guilford Gil Wolrich Wajdi Feghali Martin Dixon IA Architects Intel Corporation Deniz Karakoyunlu PhD Student Worcester Polytechnic Institute Fast CRC Computation
How To Install An Intel System Studio 2015 For Windows* For Free On A Computer Or Mac Or Ipa (For Free)
Intel System Studio 2015 for Windows* Installation Guide and Release Notes Installation Guide and Release Notes for Windows* Host and Windows* target Document number: 331182-002US 8 October 2014 Contents
Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual
Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Overview Metrics Monitor is part of Intel Media Server Studio 2015 for Linux Server. Metrics Monitor is a user space shared library
Processing Multiple Buffers in Parallel to Increase Performance on Intel Architecture Processors
White Paper Vinodh Gopal Jim Guilford Wajdi Feghali Erdinc Ozturk Gil Wolrich Martin Dixon IA Architects Intel Corporation Processing Multiple Buffers in Parallel to Increase Performance on Intel Architecture
Intel Server Raid Controller. RAID Configuration Utility (RCU)
Intel Server Raid Controller RAID Configuration Utility (RCU) Revision 1.1 July 2000 Revision History Date Rev Modifications 02/13/00 1.0 Initial Release 07/20/00 1.1 Update to include general instructions
Lecture 7: Machine-Level Programming I: Basics Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-UA.0201-003 Computer Systems Organization Lecture 7: Machine-Level Programming I: Basics Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Some slides adapted (and slightly modified)
Measuring Cache and Memory Latency and CPU to Memory Bandwidth
White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary
Using Windows* 7/Windows Embedded Standard 7* with Platforms Based on the Intel Atom Processor Z670/Z650 and Intel SM35 Express Chipset
Using Windows* 7/Windows Embedded Standard 7* with Platforms Based on the Intel Atom Processor Z670/Z650 and Intel SM35 Express Chipset Quick Start Guide December 2011 Document Number: 326555-001 Introduction
Intel Virtualization Technology FlexMigration Application Note
Intel Virtualization Technology FlexMigration Application Note This document is intended only for VMM or hypervisor software developers and not for application developers or end-customers. Readers are
Software Solutions for Multi-Display Setups
White Paper Bruce Bao Graphics Application Engineer Intel Corporation Software Solutions for Multi-Display Setups January 2013 328563-001 Executive Summary Multi-display systems are growing in popularity.
IT@Intel. Comparing Multi-Core Processors for Server Virtualization
White Paper Intel Information Technology Computer Manufacturing Server Virtualization Comparing Multi-Core Processors for Server Virtualization Intel IT tested servers based on select Intel multi-core
White Paper Amit Aneja Platform Architect Intel Corporation. Xen* Hypervisor Case Study - Designing Embedded Virtualized Intel Architecture Platforms
White Paper Amit Aneja Platform Architect Intel Corporation Xen* Hypervisor Case Study - Designing Embedded Virtualized Intel Architecture Platforms March 2011 325258 Executive Summary Virtualization technology
Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze
Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze Whitepaper December 2012 Anita Banerjee Contents Introduction... 3 Sorenson Squeeze... 4 Intel QSV H.264... 5 Power Performance...
An Architecture to Deliver a Healthcare Dial-tone
An Architecture to Deliver a Healthcare Dial-tone Using SOA for Healthcare Data Interoperability Joe Natoli Platform Architect Intel SOA Products Division April 2008 Legal Notices This presentation is
Intel Media SDK Library Distribution and Dispatching Process
Intel Media SDK Library Distribution and Dispatching Process Overview Dispatching Procedure Software Libraries Platform-Specific Libraries Legal Information Overview This document describes the Intel Media
Embedded Programming in C/C++: Lesson-1: Programming Elements and Programming in C
Embedded Programming in C/C++: Lesson-1: Programming Elements and Programming in C 1 An essential part of any embedded system design Programming 2 Programming in Assembly or HLL Processor and memory-sensitive
Intel Virtualization Technology FlexMigration Application Note
Intel Virtualization Technology FlexMigration Application Note This document is intended only for VMM or hypervisor software developers and not for application developers or end-customers. Readers are
COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service
COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service Eddie Dong, Yunhong Jiang 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
Intel Media SDK Features in Microsoft Windows 7* Multi- Monitor Configurations on 2 nd Generation Intel Core Processor-Based Platforms
Intel Media SDK Features in Microsoft Windows 7* Multi- Monitor Configurations on 2 nd Generation Intel Core Processor-Based Platforms Technical Advisory December 2010 Version 1.0 Document Number: 29437
Hexadecimal Object File Format Specification
Hexadecimal Object File Format Specification Revision A January 6, 1988 This specification is provided "as is" with no warranties whatsoever, including any warranty of merchantability, noninfringement,
Breakthrough AES Performance with. Intel AES New Instructions
White Paper Breakthrough AES Performance with Intel AES New Instructions Kahraman Akdemir, Martin Dixon, Wajdi Feghali, Patrick Fay, Vinodh Gopal, Jim Guilford, Erdinc Ozturk, Gil Wolrich, Ronen Zohar
Intel RAID Controller Troubleshooting Guide
Intel RAID Controller Troubleshooting Guide A Guide for Technically Qualified Assemblers of Intel Identified Subassemblies/Products Intel order number C18781-001 September 2, 2002 Revision History Troubleshooting
Intel EP80579 Software for IP Telephony Applications on Intel QuickAssist Technology
Intel EP80579 Software for IP Telephony Applications on Intel QuickAssist Technology Linux* Tuning Guide September 2008 Order Number: 320524-001US Legal Lines and Disclaimers INFORMATION IN THIS DOCUMENT
Intel Data Direct I/O Technology (Intel DDIO): A Primer >
Intel Data Direct I/O Technology (Intel DDIO): A Primer > Technical Brief February 2012 Revision 1.0 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
Intel Server Board S3420GPRX Intel Server System SR1630GPRX Intel Server System SR1630HGPRX
Server WHQL Testing Services Enterprise Platforms and Services Division Intel Server Board S3420GPRX Intel Server System SR1630GPRX Intel Server System SR1630HGPRX Rev 1.0 Server Test Submission (STS)
Intel Dialogic System Software for PCI Products on Windows
Intel Dialogic System Software for PCI Products on Windows Administration Guide November 2003 05-1910-001 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
Performance Evaluation and Optimization of A Custom Native Linux Threads Library
Center for Embedded Computer Systems University of California, Irvine Performance Evaluation and Optimization of A Custom Native Linux Threads Library Guantao Liu and Rainer Dömer Technical Report CECS-12-11
Intel Virtualization Technology (VT) in Converged Application Platforms
Intel Virtualization Technology (VT) in Converged Application Platforms Enabling Improved Utilization, Change Management, and Cost Reduction through Hardware Assisted Virtualization White Paper January
Specification Update. January 2014
Intel Embedded Media and Graphics Driver v36.15.0 (32-bit) & v3.15.0 (64-bit) for Intel Processor E3800 Product Family/Intel Celeron Processor * Release Specification Update January 2014 Notice: The Intel
How To Get A Client Side Virtualization Solution For Your Financial Services Business
SOLUTION BRIEF Financial Services Industry 2nd Generation Intel Core i5 vpro and Core i7 vpro Processors Benefits of Client-Side Virtualization A Flexible, New Solution for Improving Manageability, Security,
Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture
White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive
Haswell Cryptographic Performance
White Paper Sean Gulley Vinodh Gopal IA Architects Intel Corporation Haswell Cryptographic Performance July 2013 329282-001 Executive Summary The new Haswell microarchitecture featured in the 4 th generation
Introduction to PCI Express Positioning Information
Introduction to PCI Express Positioning Information Main PCI Express is the latest development in PCI to support adapters and devices. The technology is aimed at multiple market segments, meaning that
Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms
EXECUTIVE SUMMARY Intel Cloud Builder Guide Intel Xeon Processor-based Servers Red Hat* Cloud Foundations Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms Red Hat* Cloud Foundations
Jorix kernel: real-time scheduling
Jorix kernel: real-time scheduling Joris Huizer Kwie Min Wong May 16, 2007 1 Introduction As a specialized part of the kernel, we implemented two real-time scheduling algorithms: RM (rate monotonic) and
MCA Enhancements in Future Intel Xeon Processors June 2013
MCA Enhancements in Future Intel Xeon Processors June 2013 Reference Number: 329176-001, Revision: 1.0 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR
EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi
EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Applied Technology Abstract Microsoft SQL Server includes a powerful capability to protect active databases by using either
(Refer Slide Time: 00:01:16 min)
Digital Computer Organization Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture No. # 04 CPU Design: Tirning & Control
An Implementation Of Multiprocessor Linux
An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than
Intel 8086 architecture
Intel 8086 architecture Today we ll take a look at Intel s 8086, which is one of the oldest and yet most prevalent processor architectures around. We ll make many comparisons between the MIPS and 8086
Memory Management Outline. Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging
Memory Management Outline Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging 1 Background Memory is a large array of bytes memory and registers are only storage CPU can
RAID Basics Training Guide
RAID Basics Training Guide Discover a Higher Level of Performance RAID matters. Rely on Intel RAID. Table of Contents 1. What is RAID? 2. RAID Levels RAID 0 RAID 1 RAID 5 RAID 6 RAID 10 RAID 0+1 RAID 1E
A Study of Performance Monitoring Unit, perf and perf_events subsystem
A Study of Performance Monitoring Unit, perf and perf_events subsystem Team Aman Singh Anup Buchke Mentor Dr. Yann-Hang Lee Summary Performance Monitoring Unit, or the PMU, is found in all high end processors
Intel Desktop Board DG31GL
Intel Desktop Board DG31GL Basic Certified Motherboard Logo Program (MLP) Report 4/1/2008 Purpose: This report describes the DG31GL Motherboard Logo Program testing run conducted by Intel Corporation.
Intel Server Board S3420GPV
Server WHQL Testing Services Enterprise Platforms and Services Division Intel Server Board S3420GPV Rev 1.0 Server Test Submission (STS) Report For the Microsoft Windows Logo Program (WLP) Dec. 30 th,
Intel Storage System SSR212CC Enclosure Management Software Installation Guide For Red Hat* Enterprise Linux
Intel Storage System SSR212CC Enclosure Management Software Installation Guide For Red Hat* Enterprise Linux Order Number: D58855-002 Disclaimer Information in this document is provided in connection with
Atmel AVR4027: Tips and Tricks to Optimize Your C Code for 8-bit AVR Microcontrollers. 8-bit Atmel Microcontrollers. Application Note.
Atmel AVR4027: Tips and Tricks to Optimize Your C Code for 8-bit AVR Microcontrollers Features Atmel AVR core and Atmel AVR GCC introduction Tips and tricks to reduce code size Tips and tricks to reduce
Intel X38 Express Chipset Memory Technology and Configuration Guide
Intel X38 Express Chipset Memory Technology and Configuration Guide White Paper January 2008 Document Number: 318469-002 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
Programing the Microprocessor in C Microprocessor System Design and Interfacing ECE 362
PURDUE UNIVERSITY Programing the Microprocessor in C Microprocessor System Design and Interfacing ECE 362 Course Staff 1/31/2012 1 Introduction This tutorial is made to help the student use C language
Intel Desktop Board DG43RK
Intel Desktop Board DG43RK Specification Update December 2010 Order Number: E92421-003US The Intel Desktop Board DG43RK may contain design defects or errors known as errata, which may cause the product
IA-32 Intel Architecture Software Developer s Manual
IA-32 Intel Architecture Software Developer s Manual Volume 1: Basic Architecture NOTE: The IA-32 Intel Architecture Software Developer s Manual consists of three volumes: Basic Architecture, Order Number
Computer Organization and Architecture
Computer Organization and Architecture Chapter 11 Instruction Sets: Addressing Modes and Formats Instruction Set Design One goal of instruction set design is to minimize instruction length Another goal
RAID and Storage Options Available on Intel Server Boards and Systems
and Storage Options Available on Intel Server Boards and Systems Revision 1.0 March, 009 Revision History and Storage Options Available on Intel Server Boards and Systems Revision History Date Revision
Steps to Migrating to a Private Cloud
Deploying and Managing Private Clouds The Essentials Series Steps to Migrating to a Private Cloud sponsored by Introduction to Realtime Publishers by Don Jones, Series Editor For several years now, Realtime
