Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus
|
|
|
- Rosemary West
- 10 years ago
- Views:
Transcription
1 Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus A simple C/C++ language extension construct for data parallel operations Robert Geva [email protected] Introduction Intel Cilk Plus provides simple to use language extensions to express data and task-parallelism to the C and C++ language implemented by the Intel C++ Compiler, which is part of Intel Parallel Composer and Intel Parallel Studio. This article describes one of these programming constructs: elemental functions. There are cases in which the algorithm performs operations on large collection of data, with no dependence among the operations being performed on the various data item. For example, the programmer may write, at a certain point of the algorithm: add arrays a1 and a2 and store the result in array r. When thinking at that level, the programmer is thinking in terms of a single operation that needs to be performed on many elements, independently of each other. Unfortunately, the C/C++ programming languages do not provide constructs that allow expressing the operation as such. Instead, they force the programmer to elaborate the operation in terms of a loop that operates on each array element. The end result, in terms of the values being stored in the result array, would be the same. However, the introduction of the loop introduces an unintended order of operations: The implementation has to add the first two array elements, store the results in the first location of the result array, move on to perform the same operation on the second set of elements, and so on. An example later will show what is expected to be a typical use of elemental function: use a single line to mark a standard C/C++ function as elemental, change the invocation loop from serial to parallel (in this example, it is a change of one keyword) and that is it. In the example below, there is a measured speed up of more than 15X 1. Limitations of serial programming The serial program under utilizes the parallel HW resources and therefore in most cases performs slower than an equivalent parallel program. Moreover, in many cases, the C/C++ language semantics do enforce the serial execution of the program and do not allow an optimizing compiler to implement a parallel version of the serial program. A compiler can only provide a parallel implementation of a 1 Compiler: Intel Parallel Composer 2011 with Microsoft* Visual Studio* 2008 Compiler options: /c /Oi /Qip /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /fp:precise /Fo"Release/" /Fd"Release/vc90.pdb" /W3 /nologo /Zi System Specifications: Intel Xeon E5462 processor, 2.8 GHz, 8 cores, 8 GB RAM, Microsoft Windows Server 2003* x64 Edition. Sample code available at i
2 portion of a program if it can be proven in compile time that the parallel implementation and the serial implementation would be equivalent. The success of these proofs depends on many variables, for example the language construct being chosen to implement the code. In the sample code below, the compiler cannot prove that the arrays pointed- to by the 3 pointers which are provided as function argument do not point into the same arrays. void do_add_doubles(double * src1, double * src2, double * r, int length) for (i = 0; i < length; ++i) *r = *src1 + *src2; Array notations Intel Cilk Plus provide simple language constructs that operate on arrays, instead of requiring the programmer to express the algorithm in terms of loops that operate on elements of arrays. The example of adding two arrays can be expressed in Intel Cilk plus as r[0:n] = src1[0:n] + src2[0:n]; In this example, the elements of the arrays src1 and src2 are added and the results are assigned to array r. The usual subscript syntax in C/C++ is replaced by an array section descriptor. In this example, 0:N means that the addition uses N elements of the array, starting from index number 0. The important difference between using this language construct and the standard C/C++ loop is that here, there is no serial ordering implied. Therefore, the expected code generation strategy from the compiler is vector parallelism, i.e. use the SSE instructions to implement the additions in a SIMD fashion. The compiler generates vector, SIMD code for operations on arrays including all the language built- in operators, such as +, *, &, && etc. However, the most general operation is when the programmer describes the operation using a function. Elemental functions An elemental function is a regular function, which can be invoked either on scalar arguments or on array elements in parallel. So far, the Intel compilers have been providing a limited form of this functionality, some of the language standard library functions. For example, the programmer can write code in a loop that calls certain math library functions such as a sin() function, and the compiler can automatically vectorize the loop. ii
3 for (i = 0; i < length; ++i) a[i] = b[i] * c[i]; d[i] = sin(a[i]); e[i] = d[i] * d[i]; The implication is that the code can execute a short vector version of the sin function that takes a number of arguments which depends on the vector length of the HW XMM registers (e.g. 4) and compute 4 results in one invocation, at about the same time it would take to compute a single result. The new elemental functions feature extends the capability from built in functions to application code functions. The programmer can now provide a function and use a declspec (vector) annotation to indicate to the compiler to generate a short vector version of the implementation of the function. Note that the compiler also generates a scalar implementation of the function, and the program can invoke the function either on single arguments or arrays of argument, at different places in the program. Lastly, the program needs to invoke the elemental function in a parallel context, i.e. provide multiple arguments instead of a single one. The parallel invocation can be done in a number of ways. 1. Invoke the function from a C/C++ standard for loop. With this invocation, the compiler will replace the function call with a call to the short vector version within the loop. for (i = 0; i < length; ++i) res[i] = elemental_func(a[i],b[i]); With this invocation, the instances of the elemental function will all execute on a single thread. Note that as a programmer, you do not need to take care of the case where the number of loop iterations is not evenly divided by the number of arguments handled by the short vector version. The compiler takes care of the remainder automatically. 2. Invoke the function from a cilk_for loop: _Cilk_for (i = 0; i < length; ++i) res[i] = elemental_func(a[i],b[i]); With this invocation, multiple instances of the elemental functions will be spawned into execution cores and execute in parallel. Therefore, with a single expression of the parallel work, you get the benefit of both multi core parallelism and vector parallelism. 3. Invoke the function from array notation syntax. res[:] = elemental_func(a[:], b[:]); iii
4 With this invocation syntax, the the current compiler implementation provides the same behavior as the serial loop in case 1 above. However, this syntax is reserved for future implementations to provide an optimal mix of both behaviors. iv
5 The Task Scheduler When instances of an elemental function are spawned, the compiler is using the run time task scheduler, which is the same task scheduler that managers other work spawned with the _Cilk_spawn and _Cilk_for constructs. The scheduler is a user level (not OS level) work stealing task scheduler that does parent stealing. Work stealing means that when a code stream executes a task spawn, it queues work in its own work deque. Workers do not push work onto other workers. A worker only steals work from another worker if its own work deque is empty. This policy ensures load balancing and prevents over subscription of workers. Parent stealing means that when a code sequence executes the statement _Cilk_spawn func(y); then the same worker continues to execute the function func, and the continuation of the spawning function is what is detached, placed on the deque and made available for stealing. The worker executes func, and if no other worker stole it, then it returns from the execution of func and pops the continuation from its own work deque. The end result is that the behavior matches that of a regular function call in serial execution. The task scheduler s policy guarantees that the program produces the same result if executing on a single or multiple cores. The combination of very low overhead during spawning a task with the automatic load balancing suggests a more natural approach to the overall SW architecture of a parallel program. Some parallel programs are architected with a monolithic perspective in which an architect carefully orchestrates the work being done at each point of the program by each thread. The program only allows as much parallelism as the target platforms supports. By contrast, with the work stealing task scheduling approach, parallelism does not have to be orchestrated globally throughout the program. It can be originated independently from multiple modules of the program with low risk of overwhelming execution resources. Conversely, making more parallelism available than what the current platform can benefit from, may lead to a scalable program that can be moved over time to platforms with more execution resources and benefit from the additional parallelism. Performance With Intel Cilk Plus Elemental Functions With this introduction of elemental function, let s take a look at the Black-Scholes example 2 using elemental functions. Black-Scholes is a well known short program used for pricing of European style options. In this example, the main loop is in the function test_option_price_call_and_put_black_scholes_serial() which, in turn, invokes the call and put options calculations using the original or serial version of the routines: 2 Sample code available at This sample is derived from code published by Bernt Arne Odegaard, v
6 double option_price_call_black_scholes( ) double option_price_put_black_scholes( ) void test_option_price_call_and_put_black_scholes_serial( double S[NUM_OPTIONS], double K[NUM_OPTIONS], double r, double sigma,double time[num_options], double call_serial[num_options], double put_serial[num_options]) for (int i=0; i < NUM_OPTIONS; i++) // invoke calculations for call-options Turning call_serial[i] this program = from option_price_call_black_scholes(s[i], the original serial program into a parallel K[i], program r, sigma, does time[i]); not require a change to the implementation of the two functions that do the work, // invoke calculations for put-options option_price_call_black_scholes() put_serial[i] = option_price_put_black_scholes(s[i], and option_price_put_black_scholes(). K[i], r, The sigma, only time[i]); change required is the addition of declspec(vector) to their declaration. This will trigger the compiler to generate short vector forms of the two functions. The other modification to the program is in the way these function are called. An array- notation call syntax is below: declspec(vector) double option_price_call_black_scholes( ) declspec(vector) double option_price_put_black_scholes( ) void test_option_price_call_and_put_black_scholes_arraynotations( double S[NUM_OPTIONS], double K[NUM_OPTIONS], double r, double sigma,double time[num_options], double call_arraynotations[num_options], double put_arraynotations[num_options]) // invoking vectoried version of the call functions optimized for given core's SIMD call_arraynotations[:] = option_price_call_black_scholes_arraynotations(s[:], K[:], r, sigma, time[:]); // invoking vectoried version of the put functions optimized for given core's SIMD put_arraynotations[:] = option_price_put_black_scholes_arraynotations(s[:], K[:], r, sigma, time[:]); As noted before, the current Intel compiler does not yet implement the capability to spawn multiple calls to the elemental function into cores. That can be accomplished using the _Cilk_for loop calling syntax, as shown below: void test_option_price_call_and_put_black_scholes_cilkplus( double S[NUM_OPTIONS], double K[NUM_OPTIONS], double r, double sigma,double time[num_options], double call_cilkplus[num_options], double put_cilkplus[num_options]) // invoking vectoried version of the call and put functions optimized for given core's SIMD // and then spreading it across mutiple cores using cilk_for for optimized multi-core performance cilk_for (int i=0; i<num_options; i++) call_cilkplus[i] = option_price_call_black_scholes_arraynotations(s[i], K[i], r, sigma, time[i]); put_cilkplus[i] = option_price_put_black_scholes_arraynotations(s[i], K[i], r, sigma, time[i]); vi
7 This simple, yet powerful, change automatically parallelizes the loop and executes the function calls on multiple cores. Note that the Cilk Plus runtime determines the number of cores available on the system and distributes the load accordingly. Running this optimized parallel binary on a 8-core machine gets us a combined speedup of 15.31x 3 over the original serial version: 3 See footnote 1 for configuration information. vii
8 Optimizations In some cases, when the function is applied to successive elements of an array, one or more of its arguments takes the same value across all invocations. We call this a scalar argument. An example is that if instead of adding two arrays and storing the result in a third array, you want to add one value to all elements of an array, and store the results in a result array. You can write an elemental function to do that using the syntax shown above. declspec (vector) double ef_add_doubles(double x, double a) return x + a; //invoke the function for (i = 0; i < n; ++i) y[i] = ef_add_doubles(x[i],42); While this will work and produce the expected results, it is a somewhat wasteful. If the function is designed to add the same value to all elements of an array, then there is no need to pass the value multiple times. The above implementation will pass the value 42 along to ef_add_doubles together which each value of an element of the array x, whereas passing it only once would be sufficient. The way to indicate to the compiler that passing the value once instead of multiple time is sufficient is by using a the scalar clause in the declspec vector: //each invocation passes a short vector of elements of x and a single value of a declspec (vector scalar(a)) double ef_add_doubles(double x, double a) return x + a; //invoke the function for (i = 0; i < n; ++i) y[i] = ef_add_doubles(x[i],42); Another optimization opportunity is when the value of an argument is not the same for each invocation, but instead changes by a constant increment each time. For example, you can add two arrays and store the result into a result array by passing the start addresses of the arrays to the elemental function, along with an index variable. In this case, the value of the variable i is incremented consistently each time, and therefore does not need to be passed multiple times to the function. declspec (vector (linear (r,a,b)) void ef_add_doubles(int i) r[i] = a[i] + b[i]; //invoke the function for (i = 0; i < n; ++i) ef_add_doubles(&r[0],&a[0],&b[0],i); Related constructs Functions calls can be executed in parallel without being elemental. With the current Intel compiler, you can invoke a function call from a _Cilk_for loop, and the iterations will execute in parallel. You can also use the array notations to map a scalar function onto elements of an array, for example: res[:] = some_func(a[:], b[:]); viii
9 This construct applies the function to elements of the input array and assigns the results to the corresponding elements of the output array. This construct is defined such that there is no enforced ordering between the execution of the function on the array elements, and therefore multiple instances can be invoked in parallel. The advantage of using the elemental function construct over the above is that you benefit from both types of parallel HW resources, the cores and the vectors. Language Rules The compiler imposes several limitations on the language constructs that are allowed to be used inside elemental functions. The purpose of these rules is to guarantee that the compiler can generate a short vector implementation of the function. Some of these rules are artifacts of the current state of the technology and will be removed in future implementation, as the capability improved. Currently, the following are disallowed inside elemental functions: Loops, including the keywords for, while, do and overloaded operators Switch statements Goto, setjmp, longjmp Calls to functions other than other elemental functions and the intrinsic short vector math libraries provided with the Intel compilers Operations on structs, other than member selection _Cilk_spawn Array notations C++ exceptions Elemental functions cannot be virtual, and can only be called directly, not through a function pointer Note, however, that elemental functions do not have to be pure functions. They are allowed to modify global state and have side effects. They can make recursive function calls. Summary When your algorithm requires performing the same operation on multiple data items, and the data items can be organized in an array, consider using the new language construct provided with Intel Cilk Plus - elemental functions. Writing an elemental function is simply writing a scalar function, followed by invoking it in a parallel context as shown above. The Intel compiler will provide an implementation that gives you the benefits of maximizing both the SIMD vector capabilities of your processor and the multiple cores with little effort to develop your parallel code and maintain it. ix
10 Optimization Notice Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the Intel Compiler User and Reference Guides under Compiler Options." Many library routines that are part of Intel compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel Streaming SIMD Extensions 2 (Intel SSE2), Intel Streaming SIMD Extensions 3 (Intel SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. Notice revision # x
INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism
Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and
INTEL PARALLEL STUDIO XE EVALUATION GUIDE
Introduction This guide will illustrate how you use Intel Parallel Studio XE to find the hotspots (areas that are taking a lot of time) in your application and then recompiling those parts to improve overall
Eliminate Memory Errors and Improve Program Stability
Eliminate Memory Errors and Improve Program Stability with Intel Parallel Studio XE Can running one simple tool make a difference? Yes, in many cases. You can find errors that cause complex, intermittent
Towards OpenMP Support in LLVM
Towards OpenMP Support in LLVM Alexey Bataev, Andrey Bokhanko, James Cownie Intel 1 Agenda What is the OpenMP * language? Who Can Benefit from the OpenMP language? OpenMP Language Support Early / Late
Pexip Speeds Videoconferencing with Intel Parallel Studio XE
1 Pexip Speeds Videoconferencing with Intel Parallel Studio XE by Stephen Blair-Chappell, Technical Consulting Engineer, Intel Over the last 18 months, Pexip s software engineers have been optimizing Pexip
Get an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*
Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling
OpenMP* 4.0 for HPC in a Nutshell
OpenMP* 4.0 for HPC in a Nutshell Dr.-Ing. Michael Klemm Senior Application Engineer Software and Services Group ([email protected]) *Other brands and names are the property of their respective owners.
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
Intel Compiler Code Coverage Tool
Intel Compiler Code Coverage Tool Table of Contents Overview................................................ 3 Features...3 Default Colors for the Code Coverage Tool...4 Compatibility.............................................
Improve Fortran Code Quality with Static Analysis
Improve Fortran Code Quality with Static Analysis This document is an introductory tutorial describing how to use static analysis on Fortran code to improve software quality, either by eliminating bugs
Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual
Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Overview Metrics Monitor is part of Intel Media Server Studio 2015 for Linux Server. Metrics Monitor is a user space shared library
Glossary of Object Oriented Terms
Appendix E Glossary of Object Oriented Terms abstract class: A class primarily intended to define an instance, but can not be instantiated without additional methods. abstract data type: An abstraction
The ROI from Optimizing Software Performance with Intel Parallel Studio XE
The ROI from Optimizing Software Performance with Intel Parallel Studio XE Intel Parallel Studio XE delivers ROI solutions to development organizations. This comprehensive tool offering for the entire
Course MS10975A Introduction to Programming. Length: 5 Days
3 Riverchase Office Plaza Hoover, Alabama 35244 Phone: 205.989.4944 Fax: 855.317.2187 E-Mail: [email protected] Web: www.discoveritt.com Course MS10975A Introduction to Programming Length: 5 Days
Intel Perceptual Computing SDK My First C++ Application
Intel Perceptual Computing SDK My First C++ Application LEGAL DISCLAIMER THIS DOCUMENT CONTAINS INFORMATION ON PRODUCTS IN THE DESIGN PHASE OF DEVELOPMENT. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION
1 The Java Virtual Machine
1 The Java Virtual Machine About the Spec Format This document describes the Java virtual machine and the instruction set. In this introduction, each component of the machine is briefly described. This
Hardware Assisted Virtualization
Hardware Assisted Virtualization G. Lettieri 21 Oct. 2015 1 Introduction In the hardware-assisted virtualization technique we try to execute the instructions of the target machine directly on the host
Keys to node-level performance analysis and threading in HPC applications
Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION
Scaling up to Production
1 Scaling up to Production Overview Productionize then Scale Building Production Systems Scaling Production Systems Use Case: Scaling a Production Galaxy Instance Infrastructure Advice 2 PRODUCTIONIZE
The C Programming Language course syllabus associate level
TECHNOLOGIES The C Programming Language course syllabus associate level Course description The course fully covers the basics of programming in the C programming language and demonstrates fundamental programming
Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture
Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts
Monte Carlo European Options Pricing Implementation Using Various Industry Library Solutions. By Sergey A. Maidanov
Monte Carlo European Options Pricing Implementation Using Various Industry Library Solutions By Sergey A. Maidanov Monte Carlo simulation is one of the recognized numerical tools for pricing derivative
Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters
Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.
Chapter 6, The Operating System Machine Level
Chapter 6, The Operating System Machine Level 6.1 Virtual Memory 6.2 Virtual I/O Instructions 6.3 Virtual Instructions For Parallel Processing 6.4 Example Operating Systems 6.5 Summary Virtual Memory General
White Paper. Intel Xeon Phi Coprocessor DEVELOPER S QUICK START GUIDE. Version 1.5
White Paper Intel Xeon Phi Coprocessor DEVELOPER S QUICK START GUIDE Version 1.5 Contents Introduction... 4 Goals... 4 This document does:... 4 This document does not:... 4 Terminology... 4 System Configuration...
Parallelization: Binary Tree Traversal
By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First
Week 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
Inside the Erlang VM
Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for
KITES TECHNOLOGY COURSE MODULE (C, C++, DS)
KITES TECHNOLOGY 360 Degree Solution www.kitestechnology.com/academy.php [email protected] [email protected] Contact: - 8961334776 9433759247 9830639522.NET JAVA WEB DESIGN PHP SQL, PL/SQL
Multi-GPU Load Balancing for Simulation and Rendering
Multi- Load Balancing for Simulation and Rendering Yong Cao Computer Science Department, Virginia Tech, USA In-situ ualization and ual Analytics Instant visualization and interaction of computing tasks
How To Program With Adaptive Vision Studio
Studio 4 intuitive powerful adaptable software for machine vision engineers Introduction Adaptive Vision Studio Adaptive Vision Studio software is the most powerful graphical environment for machine vision
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer?
Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer? Software Solutions Group Intel Corporation 2012 *Other brands and names are the
Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format
, pp.91-100 http://dx.doi.org/10.14257/ijhit.2014.7.4.09 Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format Jingjing Zheng 1,* and Ting Wang 1, 2 1,* Parallel Software and Computational
-------- Overview --------
------------------------------------------------------------------- Intel(R) Trace Analyzer and Collector 9.1 Update 1 for Windows* OS Release Notes -------------------------------------------------------------------
Habanero Extreme Scale Software Research Project
Habanero Extreme Scale Software Research Project Comp215: Java Method Dispatch Zoran Budimlić (Rice University) Always remember that you are absolutely unique. Just like everyone else. - Margaret Mead
Freescale Semiconductor, I
nc. Application Note 6/2002 8-Bit Software Development Kit By Jiri Ryba Introduction 8-Bit SDK Overview This application note describes the features and advantages of the 8-bit SDK (software development
Improve Fortran Code Quality with Static Security Analysis (SSA)
Improve Fortran Code Quality with Static Security Analysis (SSA) with Intel Parallel Studio XE This document is an introductory tutorial describing how to use static security analysis (SSA) on C++ code
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
HPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
RARITAN VALLEY COMMUNITY COLLEGE ACADEMIC COURSE OUTLINE. CISY 105 Foundations of Computer Science
I. Basic Course Information RARITAN VALLEY COMMUNITY COLLEGE ACADEMIC COURSE OUTLINE CISY 105 Foundations of Computer Science A. Course Number and Title: CISY-105, Foundations of Computer Science B. New
Moving from CS 61A Scheme to CS 61B Java
Moving from CS 61A Scheme to CS 61B Java Introduction Java is an object-oriented language. This document describes some of the differences between object-oriented programming in Scheme (which we hope you
CS 377: Operating Systems. Outline. A review of what you ve learned, and how it applies to a real operating system. Lecture 25 - Linux Case Study
CS 377: Operating Systems Lecture 25 - Linux Case Study Guest Lecturer: Tim Wood Outline Linux History Design Principles System Overview Process Scheduling Memory Management File Systems A review of what
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Scheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
JOURNAL OF OBJECT TECHNOLOGY
JOURNAL OF OBJECT TECHNOLOGY Online at http://www.jot.fm. Published by ETH Zurich, Chair of Software Engineering JOT, 2006 Vol. 5, No. 6, July - August 2006 On Assuring Software Quality and Curbing Software
Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri
SWARM: A Parallel Programming Framework for Multicore Processors David A. Bader, Varun N. Kanade and Kamesh Madduri Our Contributions SWARM: SoftWare and Algorithms for Running on Multicore, a portable
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
Kernel Types System Calls. Operating Systems. Autumn 2013 CS4023
Operating Systems Autumn 2013 Outline 1 2 Types of 2.4, SGG The OS Kernel The kernel is the central component of an OS It has complete control over everything that occurs in the system Kernel overview
C++ INTERVIEW QUESTIONS
C++ INTERVIEW QUESTIONS http://www.tutorialspoint.com/cplusplus/cpp_interview_questions.htm Copyright tutorialspoint.com Dear readers, these C++ Interview Questions have been designed specially to get
2) Write in detail the issues in the design of code generator.
COMPUTER SCIENCE AND ENGINEERING VI SEM CSE Principles of Compiler Design Unit-IV Question and answers UNIT IV CODE GENERATION 9 Issues in the design of code generator The target machine Runtime Storage
Application Note: AN00141 xcore-xa - Application Development
Application Note: AN00141 xcore-xa - Application Development This application note shows how to create a simple example which targets the XMOS xcore-xa device and demonstrates how to build and run this
Instruction Set Architecture
Instruction Set Architecture Consider x := y+z. (x, y, z are memory variables) 1-address instructions 2-address instructions LOAD y (r :=y) ADD y,z (y := y+z) ADD z (r:=r+z) MOVE x,y (x := y) STORE x (x:=r)
Enterprise Deployment: Laserfiche 8 in a Virtual Environment. White Paper
Enterprise Deployment: Laserfiche 8 in a Virtual Environment White Paper August 2008 The information contained in this document represents the current view of Compulink Management Center, Inc on the issues
Introduction to Multicore Programming
Introduction to Multicore Programming Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 (Moreno Maza) Introduction to Multicore Programming CS 433 - CS 9624 1 /
A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin
A Comparison Of Shared Memory Parallel Programming Models Jace A Mogill David Haglin 1 Parallel Programming Gap Not many innovations... Memory semantics unchanged for over 50 years 2010 Multi-Core x86
Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
1 External Model Access
1 External Model Access Function List The EMA package contains the following functions. Ema_Init() on page MFA-1-110 Ema_Model_Attr_Add() on page MFA-1-114 Ema_Model_Attr_Get() on page MFA-1-115 Ema_Model_Attr_Nth()
Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003
Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, [email protected] Abstract 1 Interconnect quality
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
Xeon Phi Application Development on Windows OS
Chapter 12 Xeon Phi Application Development on Windows OS So far we have looked at application development on the Linux OS for the Xeon Phi coprocessor. This chapter looks at what types of support are
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
The programming language C. sws1 1
The programming language C sws1 1 The programming language C invented by Dennis Ritchie in early 1970s who used it to write the first Hello World program C was used to write UNIX Standardised as K&C (Kernighan
OpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group [email protected] This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
Parallelization of video compressing with FFmpeg and OpenMP in supercomputing environment
Proceedings of the 9 th International Conference on Applied Informatics Eger, Hungary, January 29 February 1, 2014. Vol. 1. pp. 231 237 doi: 10.14794/ICAI.9.2014.1.231 Parallelization of video compressing
1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++
Answer the following 1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++ 2) Which data structure is needed to convert infix notations to postfix notations? Stack 3) The
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
Fundamentals of Java Programming
Fundamentals of Java Programming This document is exclusive property of Cisco Systems, Inc. Permission is granted to print and copy this document for non-commercial distribution and exclusive use by instructors
CS Standards Crosswalk: CSTA K-12 Computer Science Standards and Oracle Java Programming (2014)
CS Standards Crosswalk: CSTA K-12 Computer Science Standards and Oracle Java Programming (2014) CSTA Website Oracle Website Oracle Contact http://csta.acm.org/curriculum/sub/k12standards.html https://academy.oracle.com/oa-web-introcs-curriculum.html
Very Large Enterprise Network, Deployment, 25000+ Users
Very Large Enterprise Network, Deployment, 25000+ Users Websense software can be deployed in different configurations, depending on the size and characteristics of the network, and the organization s filtering
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
ADVANCED SCHOOL OF SYSTEMS AND DATA STUDIES (ASSDAS) PROGRAM: CTech in Computer Science
ADVANCED SCHOOL OF SYSTEMS AND DATA STUDIES (ASSDAS) PROGRAM: CTech in Computer Science Program Schedule CTech Computer Science Credits CS101 Computer Science I 3 MATH100 Foundations of Mathematics and
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
A deeper look at Inline functions
A deeper look at Inline functions I think it s safe to say that all Overload readers know what C++ inline functions are. When we declare a function or member function as inline we are trying to avoid the
Java the UML Way: Integrating Object-Oriented Design and Programming
Java the UML Way: Integrating Object-Oriented Design and Programming by Else Lervik and Vegard B. Havdal ISBN 0-470-84386-1 John Wiley & Sons, Ltd. Table of Contents Preface xi 1 Introduction 1 1.1 Preliminaries
White Paper. Recording Server Virtualization
White Paper Recording Server Virtualization Prepared by: Mike Sherwood, Senior Solutions Engineer Milestone Systems 23 March 2011 Table of Contents Introduction... 3 Target audience and white paper purpose...
Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C
Embedded Systems A Review of ANSI C and Considerations for Embedded C Programming Dr. Jeff Jackson Lecture 2-1 Review of ANSI C Topics Basic features of C C fundamentals Basic data types Expressions Selection
MAQAO Performance Analysis and Optimization Tool
MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL [email protected] Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22
Parallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises
Parallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises Pierre-Yves Taunay Research Computing and Cyberinfrastructure 224A Computer Building The Pennsylvania State University University
Java Interview Questions and Answers
1. What is the most important feature of Java? Java is a platform independent language. 2. What do you mean by platform independence? Platform independence means that we can write and compile the java
Informatica e Sistemi in Tempo Reale
Informatica e Sistemi in Tempo Reale Introduction to C programming Giuseppe Lipari http://retis.sssup.it/~lipari Scuola Superiore Sant Anna Pisa October 25, 2010 G. Lipari (Scuola Superiore Sant Anna)
Parallel Computing. Parallel shared memory computing with OpenMP
Parallel Computing Parallel shared memory computing with OpenMP Thorsten Grahs, 14.07.2014 Table of contents Introduction Directives Scope of data Synchronization OpenMP vs. MPI OpenMP & MPI 14.07.2014
language 1 (source) compiler language 2 (target) Figure 1: Compiling a program
CS 2112 Lecture 27 Interpreters, compilers, and the Java Virtual Machine 1 May 2012 Lecturer: Andrew Myers 1 Interpreters vs. compilers There are two strategies for obtaining runnable code from a program
Performance Analysis of Web based Applications on Single and Multi Core Servers
Performance Analysis of Web based Applications on Single and Multi Core Servers Gitika Khare, Diptikant Pathy, Alpana Rajan, Alok Jain, Anil Rawat Raja Ramanna Centre for Advanced Technology Department
Course Title: Software Development
Course Title: Software Development Unit: Customer Service Content Standard(s) and Depth of 1. Analyze customer software needs and system requirements to design an information technology-based project plan.
How To Install An Intel System Studio 2015 For Windows* For Free On A Computer Or Mac Or Ipa (For Free)
Intel System Studio 2015 for Windows* Installation Guide and Release Notes Installation Guide and Release Notes for Windows* Host and Windows* target Document number: 331182-002US 8 October 2014 Contents
Intel Software Guard Extensions(Intel SGX) Carlos Rozas Intel Labs November 6, 2013
Intel Software Guard Extensions(Intel SGX) Carlos Rozas Intel Labs November 6, 2013 Legal Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult
An Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
Debugging Multi-threaded Applications in Windows
Debugging Multi-threaded Applications in Windows Abstract One of the most complex aspects of software development is the process of debugging. This process becomes especially challenging with the increased
