Tutorial: Finding Hotspots on a Remote Linux* System
Intel VTune Amplifier for Systems
Linux* OS
C++ Sample Application Code
Document Number: 330219-001
Contents
  Legal Information
  Overview
  Chapter 1: Navigation Quick Start
  Chapter 2: Finding Hotspots
    Prepare Your Target Device
    Cross Build and Load the Sampling Drivers
    Prepare Your Sample Application
    Run Advanced Hotspot Analysis
    View Your Results
  Chapter 3: Summary
  Chapter 4: Key Terms
Legal Information

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products and services described may contain defects or errors which may cause deviations from published specifications.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: Learn About Intel Processor Numbers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Cilk, Intel, the Intel logo, Intel Atom, Intel Core, Intel Inside, Intel NetBurst, Intel SpeedStep, Intel vPro, Intel Xeon Phi, Intel XScale, Itanium, MMX, Pentium, Thunderbolt, Ultrabook, VTune and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

© 2015 Intel Corporation.
Overview

Discover how to use the Advanced Hotspots Analysis of the Intel VTune Amplifier for Systems to understand where your embedded application spends its time by identifying hotspots, the most time-consuming program units. Advanced Hotspots Analysis is useful for analyzing the performance of both serial and parallel applications. The Intel VTune Amplifier for Systems supports analysis of remote Linux* applications running on regular or embedded Linux systems; this tutorial focuses on embedded platforms.

About This Tutorial
This tutorial uses the tachyon sample application and guides you through the basic steps required to use the GUI to analyze the code for hotspots by means of remote data collection.

Estimated Duration
  20 minutes: Preparing your host and target device for use
  15 minutes: Preparing your sample application and analyzing it

Learning Objectives
After you complete this tutorial, you will be able to find hotspots by:
  Preparing your target device
  Cross building and loading the sampling drivers
  Preparing your sample application, tachyon
  Running an Advanced Hotspot Analysis
  Viewing your results

More Resources
  The Intel Developer Zone is a site devoted to software development tools, resources, forums, blogs, and knowledge bases; see http://software.intel.com
  The Intel Software Documentation Library is part of the Intel Developer Zone and is an online collection of Release Notes, User and Reference Guides, White Papers, Help, and Tutorials for Intel software products: http://software.intel.com/en-us/intel-software-technical-documentation
  For troubleshooting the creation and installation of the sep drivers, see http://software.intel.com/en-us/articles/troubleshooting-issues-with-sep-in-the-embedded-tool-suite-intel-system-studio
Chapter 1: Navigation Quick Start

The Intel VTune Amplifier for Systems provides information on code performance for users developing serial and multithreaded applications on supported embedded platforms. VTune Amplifier helps you analyze algorithm choices and identify where and how your application can benefit from available hardware resources. It reports your most significant performance problems first, so that you can spend your available optimization time and resources where they matter most.

VTune Amplifier for Systems Graphical User Interface (GUI) Access
The VTune Amplifier installation includes shell scripts that you can run in your terminal window to set up the required environment variables:

1. From the installation directory, enter source amplxe-vars.sh. This script sets the PATH environment variable that specifies the locations of the product's graphical user interface and command line utilities.

   NOTE
   For the VTune Amplifier for Systems installed as part of Intel System Studio, the default <install_dir> is:
     For super-users: /opt/intel/system_studio_<version>/vtune_amplifier_<version>_for_systems
     For ordinary users: $HOME/intel/system_studio_<version>/vtune_amplifier_<version>_for_systems
   For the standalone VTune Amplifier for Systems installed without Intel System Studio, the default <install_dir> is:
     For super-users: /opt/intel/vtune_amplifier_for_systems_<version>
     For ordinary users: $HOME/intel/vtune_amplifier_for_systems_<version>

2. You can modify your login shell to set these variables automatically. For example, if you use the bash shell, you can add this line to your $HOME/.bashrc:
   source /opt/intel/vtune_amplifier_<version>_for_systems/amplxe-vars.sh

3. Enter amplxe-gui to launch the product graphical interface.
The main elements of the GUI are:

  Primary toolbar: Configure and manage projects and results, and launch new analyses from the primary toolbar. Click the Project Properties button on this toolbar to manage result file locations. Newly completed and opened analysis results, along with result comparisons, appear in the results tab for easy navigation.

  Menu: Use the VTune Amplifier menu to control result collection, define and view project properties, and set various options.

  Project Navigator: The Project Navigator provides an iconic representation of your projects and analysis results. Click the Project Navigator button on the toolbar to enable or disable the Project Navigator.

  Viewpoint: Click the (change) link to select a viewpoint, a preset configuration of windows/panes for an analysis result. For each analysis type, you can switch among several viewpoints to focus on particular performance metrics. Click the yellow question mark icon to read the viewpoint description.

  Window tabs: Switch between window tabs to explore the analysis type configuration options and the collected data provided by the selected viewpoint.

  Grouping: Use the Grouping drop-down menu to choose a granularity level for grouping data in the grid.

  Filter toolbar: Use the filter toolbar to filter the result data according to the selected categories.

Next step: Finding Hotspots
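The environment setup described in this chapter can be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name find_amplxe_vars is hypothetical, and the candidate directories in the usage comment are the default standalone install locations listed in the NOTE, which may differ on your system.

```shell
#!/bin/sh
# Sketch only: find_amplxe_vars is a hypothetical helper, not part of the product.
# It prints the path of the first amplxe-vars.sh found in the given candidate
# install directories and returns non-zero if none of them contains the script.
find_amplxe_vars() {
    for dir in "$@"; do
        if [ -f "$dir/amplxe-vars.sh" ]; then
            printf '%s\n' "$dir/amplxe-vars.sh"
            return 0
        fi
    done
    return 1
}

# Possible use (assumed default locations; adjust to your installation):
#   vars=$(find_amplxe_vars /opt/intel/vtune_amplifier_for_systems_* \
#                           "$HOME"/intel/vtune_amplifier_for_systems_*) \
#     && . "$vars" && amplxe-gui
```

Searching a list of candidates instead of hard-coding one path keeps the same snippet usable for both super-user and ordinary-user installations.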
Chapter 2: Finding Hotspots

Use the Intel VTune Amplifier for Systems to identify and analyze hotspot functions in your serial or parallel embedded application by performing a series of steps in a workflow. This tutorial guides you through these workflow steps while using a sample ray-tracer application named tachyon that runs on your embedded device.

To optimize the performance of your embedded application, you must first understand its current performance characteristics as it runs on the embedded device. You then modify the application based on that performance data and check the new performance metrics to compare the results. You can repeat this cycle until the results match your performance goals.

When you check the performance of your application, you run an advanced sampling-based performance analysis on the application as it runs. These analyses help you identify performance hotspots and bottlenecks. If the hotspots are not where you expect them, you can rewrite your code accordingly and test again. Running an analysis after each change allows you to verify that the change results in the desired improvement. Running multiple analyses also allows a comparison with the initial unchanged run to determine a point of diminishing returns.

To obtain this sampling-based performance data, you compile and run your application in a supported embedded development environment. Then you define and launch a profiling agent, called a remote data collector, which runs on the embedded device. The remote data collector records the specified performance data from your running application. This performance information is then automatically transferred to a host system, where you can view and analyze it and plan your optimization strategy and its implementation based on your available time and resources.
Your embedded application must be cross compiled and present on this host system as well, so that your results accurately reflect the function names and line numbers in your code.

While there are several supported embedded OS versions, this tutorial focuses on the Yocto Project* 1.* environment. The tachyon sample code has been optimized for the Yocto Project environment. Additional information can be found at https://www.yoctoproject.org/about. If you choose to run this tutorial on an embedded system with a different Linux* OS distribution, you will need to provide your own sample application, kernel version, and kernel source directory.

To summarize, for this tutorial you collect data on your embedded system using the VTune Amplifier GUI (amplxe-gui) and SSH communication, both started from the host system. Copying the kernel and drivers from your host to your target system is a one-time setup procedure, after which you can run multiple data collection sessions and view and compare the results.
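Comparing a run against the unchanged baseline comes down to simple arithmetic. The following sketch, assuming a POSIX shell with awk, computes the relative improvement of a new run over a baseline; the function name percent_improvement and the timings in the comment are hypothetical, not measurements from the tachyon sample.

```shell
#!/bin/sh
# Sketch only: percent_improvement is a hypothetical helper.
# Usage: percent_improvement BASELINE_SECONDS NEW_SECONDS
# Prints (baseline - new) / baseline * 100 with one decimal place.
# A negative result means the new run is slower than the baseline.
percent_improvement() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        if (base <= 0) {
            print "baseline must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.1f\n", (base - new) / base * 100
    }'
}

# Example with made-up timings: a run that drops from 12.0 s to 9.0 s
#   percent_improvement 12.0 9.0    # prints 25.0
```

When successive changes yield only a fraction of a percent each, that is one practical sign that the point of diminishing returns has been reached.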
Once you have collected performance data, you make modifications to your code to improve its performance profile, and test again.

NOTE
This tutorial focuses on obtaining the baseline results for Advanced Hotspots Analysis and the tachyon sample application. For more information on the iterative process of testing, modifying, improving, and retesting your code for comparative analysis, see Tutorial: Finding Hotspots - C++ Sample Code.

To find hotspots in your application, complete these activities:

Step 1: Prepare your target device
  Build a Yocto* Project kernel
  Install a target package including remote collectors
  Configure ssh for a no-password connection
Step 2: Cross build and load sampling drivers
  Cross build and load the sampling driver (sep)
Step 3: Prepare your sample application
  Cross compile tachyon for use
  Copy tachyon to your Yocto Project target
Step 4: Run Advanced Hotspot Analysis
  Use the Intel VTune Amplifier for Systems GUI to set up your remote configuration
  Run Advanced Hotspot Analysis
Step 5: View your results
  View your results in Intel VTune Amplifier for Systems

Next step: Prepare Your Target Device
Prepare Your Target Device

Use the following steps to set up your target device after you have installed the VTune Amplifier for Systems on your host.

NOTE
Advanced Hotspots Analysis cannot identify time-consuming code in your application if the nmi_watchdog interrupt capability is enabled on your target system, because it prevents the collection of accurate event-based sampling data. Disable the nmi_watchdog interrupt, or see the "Troubleshooting" section of the product documentation for details.

1. If you have not yet done so, download the Yocto Project* version appropriate for your system. A list of supported host system distributions and the required packages for each distribution is available here: http://www.yoctoproject.org/docs/current/ref-manual/ref-manual.html#intro-requirements. For a list of all Yocto Project versions, see https://www.yoctoproject.org/downloads/yocto-project. The Yocto Project Quick Start document for your selected version provides detailed installation and configuration steps. The Quick Start document and all other documentation are available from https://www.yoctoproject.org/documentation.

   This tutorial uses the Yocto Project version 1.2.1. The kernel version and source directory information provided in the examples are specific to this version.

   NOTE
   The tachyon sample code used in this tutorial has been optimized for the Yocto Project environment. Other Linux distributions can be analyzed using VTune Amplifier, but you will need to provide your own application. To run the tutorial using a different Linux distribution, be sure to note the kernel version and kernel source directory for use in building the VTune Amplifier drivers.

2. Copy the required package archive located at /opt/intel/vtune_amplifier_for_systems/target on your host system to the /opt/intel directory on your target system and unzip it.
     linux32/vtune_amplifier_target_x86.tgz for x86 systems
     linux64/vtune_amplifier_target_x86_64.tgz for 64-bit systems

   NOTE
   Unzip both the x86 and x86-64 packages if you plan to run and analyze 32-bit processes on 64-bit systems.

   a. Copy the file to the target system using the following command:
      scp -r <filename> root@<ip address>:/opt/intel/
   b. Extract the file on the target system using the following command:
      tar -xvzf <filename>

   NOTE
   You can find detailed instructions for setting up your target Linux system in the Preparing a Target Linux* System for Remote Analysis online help topic at https://software.intel.com/en-us/linux_target_setup.

3. Configure ssh to work in password-less mode so that it does not prompt for a password on each invocation. To do this, use the key generation utility on the host system.
   a. Generate the key with an empty passphrase:
      host> ssh-keygen
   b. Copy the key to the target system:
      host> cat ~/.ssh/id_dsa.pub | ssh user@target "cat >> ~/.ssh/authorized_keys"
      If ssh-keygen created a key of a different type, substitute that public key file (for example, ~/.ssh/id_rsa.pub). You need the target user password to complete this operation. If this command completes successfully, you will not need the password afterwards. Make sure that the $HOME/.ssh/ directory exists on the target and that only its owner (root) has read/write/execute permissions to it. In these examples, target can be a hostname or IP address.
   c. After you set the password-less mode, run a command to verify that a password is no longer required. For example:
      host> ssh user@target ls

NOTE
An example of building a Yocto project and installing it is available at the Intel Developer Zone: https://software.intel.com/en-us/forums/topic/507002.

Next step: Cross Build and Load Sampling Drivers

Cross Build and Load the Sampling Drivers

Build the sampling drivers for your target environment on your Linux* host and transfer them to the target, where you load them into the kernel you customized for this purpose.

The drivers must be built against the exact kernel version and build that runs on your device. Otherwise the driver will either fail to load or load and not work. To find the kernel version, see $KERNEL-SRC-DIR/include/generated/utsrelease.h. To find which version of the kernel is currently running, use the uname -a command on the target.

1. Change into the source directory:
   cd /opt/intel/vtune_amplifier_for_systems/sepdk/src

2. Build the sampling driver using the following command:
   ./build-driver -ni --c-compiler=<compiler> \
     --kernel-src-dir=<kernel source location> --kernel-version=<kernel version> \
     --make-args="platform=x32 ARITY=smp" --install-dir=<install target location>

   For example:
   ./build-driver -ni --c-compiler=i586-poky-linux-gcc \
     --kernel-src-dir=~/yocto/poky-denzil-7.0/build/tmp/work/fri2_noemgd-poky-linux/linuxyocto3.2.11+git1+5b4c9dc78b5ae607173cc3ddab9bce1b5f78129b_1+76dc683eccc4680729a76b9d2fd425ba540a483-r1/linux-fri2-noemgd-standard-build \
     --kernel-version=3.0.24-yocto-standard \
     --make-args="platform=x32 ARITY=smp" --install-dir=../prebuilt

3. Once the driver files are built, copy them from your host to your target machine using the following commands:
   host> cd /opt/intel/vtune_amplifier_for_systems
   host> scp -r sepdk root@<ip address>:/home/root

4. Load the sampling drivers on your target machine using the following commands:
   target> cd /home/root/sepdk/src
   target> ./insmod-sep3 -re

   For example, the command output could look like the following:
   Checking for PMU arbitration service(pax)...detected.
   PAX service is accessible to users in group "0"
   Executing: insmod ./sep3_15-x32-3.0.24-yocto-standardsmp.ko
   Creating /dev/sep3_15 base devices with major number 251...done.
   Creating /dev/sep3_15 percpu devices with major number 250... done.
   The sep3_15 drivers has been successfully loaded.
   Checking for vtsspp driver... not detected.
   Executing: insmod ./vtsspp/vtsspp-x32-3.0.24-yocto-standardsmp.ko gid=0 mode=0666
   The vtsspp driver has been successfully loaded.

   For some embedded Linux systems the insmod-vtsspp command may not work. In that event, you can load the kernel module directly by using insmod:
   ./insmod-sep3 -re
   cd /home/root/sep/sepdk/src/vtsspp
   insmod vtsspp.ko

5. Confirm that the drivers have been installed:
   lsmod | grep sep
   sep3_10 80108 0
   lsmod | grep vtsspp
   vtsspp 295740 0

NOTE
You can find detailed instructions for installing your Linux target drivers in the online documentation: Preparing a Target Linux* System for Remote Analysis at https://software.intel.com/en-us/linux_target_setup.

Next step: Prepare Your Sample Application

Prepare Your Sample Application

The Intel VTune Amplifier for Systems release includes sample code called tachyon for you to compile and use on the target system. The tachyon sample code included with your distribution is modified for the Yocto* environment. The Makefile changes described in this section have already been made in the makefiles shipped with your distribution; they are listed here as examples. After compiling tachyon, copy the application to your target.

1. On the host Linux* system, change directories so you can untar the sample code:
   cd ~/yocto

2. Unarchive (untar) the tachyon sample application:
   tar xvzf /opt/intel/vtune_amplifier_for_systems/samples/en/c++/tachyon_vtune_amp_xe.tgz

3. Open the top-level Makefile. The line containing CXX has been commented out. In the lower-level tachyon/common/gui/Makefile.gmake file, additional lines have been added; see the Makefile.gmake file shipped with the sample for the exact changes.

4. If the host system is x86_64, you must comment out some lines in the Makefile:
   #ifeq ($(shell uname -m),x86_64)
   #Arch=intel64
   #CXXFLAGS+= -m64
   #else
   Arch=ia32
   CXXFLAGS+= -m32
   #endif
5. Source the cross-compilation environment variables:
   source /opt/poky/1.2/environment-setup-i586-poky-linux

6. Compile the tachyon code:
   make

7. Copy the tachyon binary, the dat folder and the libtbb.so library to an appropriate location on your target system where the executable can find them:
   scp -r tachyon_find_hotspots dat libtbb.so root@<ip address>:<target location>
   For example:
   scp -r tachyon_find_hotspots dat libtbb.so root@target_ip:/usr/local/sbin

Next step: Run Advanced Hotspot Analysis

Run Advanced Hotspot Analysis

The following steps show you how to launch the Intel VTune Amplifier for Systems GUI and create a new project.

1. Run amplxe-gui. Refer to the steps in Navigation Quick Start to set the appropriate environment variables if you have not already done so.

2. Click New Project and enter an identifying project name, such as tachyon1, so that you can distinguish this project from other projects. Keep or change the default project file Location and click Create Project.

3. Set up the analysis target.
   a. Select remote Linux (SSH) for the target system.
   b. Specify the user name and the host name or IP address of the remote system you are profiling via SSH.
   c. Enter the full path to the target binary in the Application field. In this example, the path is /home/root/tachyon_find_hotspots.
   d. Enter the path to the data file in the Application parameters field. In this example, the path is /home/root/dat/balls.dat.

   When collecting data remotely, the VTune Amplifier looks for the collectors on the target device in the default location: /opt/intel/vtune_amplifier_201x_for_systems.<package_num>. It also temporarily stores performance results on the target system in the /tmp directory. If you followed the steps detailed in Prepare Your Target Device, the collectors were installed in the default location.

   If you installed the target package to a different location or need to specify another temporary directory, configure these settings on the Analysis Target tab for your project:
     Use the VTune Amplifier installation directory on the remote system option to specify the path to the VTune Amplifier on the remote system. If the default location is used, the path is provided automatically.
     Use the Temporary directory on the remote system option to specify a non-default temporary directory.
   Alternatively, use the -target-install-dir and -target-temp-dir options from the command line.

4. Click Choose Analysis to switch to the Analysis Type tab.

5. Select the Advanced Hotspots analysis type. You will notice communication with the remote system before the Analysis Type screen appears.
6. Click the Start button to launch the Advanced Hotspots Analysis session. The VTune Amplifier uses the password-less SSH connection to your target device and launches the target application. It collects Advanced Hotspots data with default settings, and then copies the results back to the host.

Next step: View Your Results

View Your Results

After the target device sends the tachyon results, usually within a minute or two, the results appear in the VTune Amplifier GUI on your host.

Next step: Prepare your own embedded applications for analysis using the VTune Amplifier to view hotspots.
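Several prerequisites from this chapter (nmi_watchdog disabled, drivers built for the running kernel, drivers loaded) can be checked before starting a collection. The following is a minimal sketch, assuming a POSIX shell; the function names are hypothetical, and the sketch relies on the standard Linux file /proc/sys/kernel/nmi_watchdog and on the usual `#define UTS_RELEASE "..."` layout of utsrelease.h.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of the product.

# Succeeds if the given watchdog file holds a non-zero value.
# On the target the file is normally /proc/sys/kernel/nmi_watchdog.
nmi_watchdog_enabled() {
    [ -r "$1" ] && [ "$(cat "$1")" != "0" ]
}

# Prints the kernel release string recorded in a utsrelease.h file.
uts_release() {
    sed -n 's/^#define UTS_RELEASE "\(.*\)"$/\1/p' "$1"
}

# Reads lsmod output on stdin; succeeds if a module name starts with $1.
module_loaded() {
    awk -v prefix="$1" 'index($1, prefix) == 1 { found = 1 } END { exit !found }'
}

# Possible use on the target:
#   nmi_watchdog_enabled /proc/sys/kernel/nmi_watchdog && echo "disable nmi_watchdog"
#   lsmod | module_loaded sep3   || echo "sep driver not loaded"
#   lsmod | module_loaded vtsspp || echo "vtsspp driver not loaded"
# Possible use on the host (KERNEL_SRC_DIR is an assumed variable name):
#   [ "$(uts_release "$KERNEL_SRC_DIR/include/generated/utsrelease.h")" = \
#     "$(ssh root@target uname -r)" ] || echo "kernel version mismatch"
#   ssh -o BatchMode=yes root@target true || echo "password-less ssh not working"
```

The ssh BatchMode option makes ssh fail instead of prompting, so the last check reports a broken key setup rather than silently waiting for a password.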
Chapter 3: Summary

You have completed the Finding Hotspots tutorial. Here are some important things to remember when using the Intel VTune Amplifier for Systems to analyze your code for hotspots:

Step 1. Prepare your target device
  Tutorial recap: You installed a stable Yocto Project kernel, copied the appropriate VTune Amplifier for Systems files to your target system, and set up a password-less connection.
  Key tutorial take-aways:
    Download and extract an appropriate toolchain from the Yocto Project web site and create an installation area.
    Build a Yocto Project kernel for your target.
    Configure ssh so there is no password request for file transfers between your server and target.

Step 2. Build and load the sampling drivers
  Tutorial recap: You compiled the sampling drivers on your host system and loaded them on your target system.
  Key tutorial take-aways:
    Compile sampling drivers and transfer them to your target for use.

Step 3. Prepare your sample application
  Tutorial recap: You extracted the tachyon code and, if necessary, modified it for use in your specific embedded environment.
  Key tutorial take-aways:
    Unarchive tachyon in the ~/yocto directory.
    View the necessary changes to the top-level Makefile.
    View the necessary changes to the lower-level Makefile.gmake.

Step 4. Run Advanced Hotspot Analysis
  Tutorial recap: You ran the VTune Amplifier GUI to configure and launch Advanced Hotspot Analysis on the tachyon code on your target device. It ran on your target and the results were sent via ssh back to your server.
  Key tutorial take-aways:
    Launch the GUI using the amplxe-gui command.
    Use the Analysis Target tab to choose and configure your analysis target.
    Use the Analysis Type tab to choose, configure, and run the Advanced Hotspot Analysis.

Step 5. View your results
  Tutorial recap: You viewed the Advanced Hotspots analysis on the tachyon application in the VTune Amplifier for Systems GUI.
  Key tutorial take-aways:
    You can also use the VTune Amplifier command-line interface by running the amplxe-cl command to test your code for hotspots and regressions. For details see the Command-line Interface Support section in the VTune Amplifier online help.
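As a rough illustration of the command-line route, the sketch below only assembles and prints an amplxe-cl command line; it does not run a collection. The function name build_collect_cmd is hypothetical, and the options shown (-collect advanced-hotspots, -target-system=ssh:user@host, -result-dir) are assumptions based on common amplxe-cl usage that should be checked against the Command-line Interface Support help for your product version.

```shell
#!/bin/sh
# Sketch only: build_collect_cmd is a hypothetical helper, and the amplxe-cl
# options below are assumptions to verify with the online help.
# Usage: build_collect_cmd USER HOST RESULT_DIR APPLICATION [APP_ARGS...]
build_collect_cmd() {
    user=$1
    host=$2
    result_dir=$3
    shift 3
    printf 'amplxe-cl -collect advanced-hotspots -target-system=ssh:%s@%s -result-dir %s --' \
        "$user" "$host" "$result_dir"
    # Append the application path and its parameters after the "--" separator.
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Example using the paths from this tutorial and a placeholder address:
#   build_collect_cmd root 192.0.2.10 r000ah \
#       /home/root/tachyon_find_hotspots /home/root/dat/balls.dat
```

Printing the command first makes it easy to review the target address and paths before launching the collection by hand.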
Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
Chapter 4: Key Terms

baseline: A performance metric used as a basis for comparison of the application versions before and after optimization. A baseline should be measurable and reproducible.

CPU time: The amount of time a thread spends executing on a logical processor. For multiple threads, the CPU time of the threads is summed. The application CPU time is the sum of the CPU time of all the threads that run the application.

CPU usage: A performance metric for which the VTune Amplifier identifies a processor utilization scale, calculates the target CPU usage, and defines default utilization ranges depending on the number of processor cores.

  Utilization Type   Description
  Idle               All CPUs are waiting - no threads are running.
  Poor               Poor usage. By default, poor usage is when the number of simultaneously running CPUs is less than or equal to 50% of the target CPU usage.
  OK                 Acceptable (OK) usage. By default, OK usage is when the number of simultaneously running CPUs is between 51-85% of the target CPU usage.
  Ideal              Ideal usage. By default, Ideal usage is when the number of simultaneously running CPUs is between 86-100% of the target CPU usage.

Elapsed time: The total time your target ran, calculated as follows: Wall clock time at end of application - Wall clock time at start of application.

finalization: A process during which the Intel VTune Amplifier converts the collected data to a database, resolves symbol information, and pre-computes data to make further analysis more efficient and responsive.

hotspot: A section of code that took a long time to execute. Some hotspots may indicate bottlenecks and can be removed, while other hotspots inevitably take a long time to execute due to their nature.

Advanced Hotspots Analysis: A non-default analysis type used to understand the application flow of control and to identify hotspots, that works directly with the CPU without the influence of the booted operating system. VTune Amplifier creates a list of functions in your application ordered by the amount of time spent in each function. It also detects the call stacks for each of these functions so you can see how the hot functions are called. VTune Amplifier uses a low overhead (about 5%) user-mode sampling and tracing collection that gets you the information you need without slowing down the application execution significantly.

target: An executable file you analyze using the Intel VTune Amplifier.

host system: The Linux* server on which you install amplxe-gui and from which you launch your application analysis and view the results.

target system: The supported embedded device on which you install the sampling drivers and run the application whose performance you are analyzing.
viewpoint: A preset result tab configuration that filters the data collected during a performance analysis and enables you to focus on specific performance problems. When you select a viewpoint, you select a set of performance metrics that the VTune Amplifier shows in the windows/panes of the result tab. To select the required viewpoint, click the (change) link and use the drop-down menu at the top of the result tab.
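The elapsed time formula and the default CPU usage ranges defined in the key terms can be worked through in a few lines. The following is a sketch, assuming a POSIX shell and integer inputs; the function names are hypothetical, and truncating the percentage to a whole number is an assumption made here because the definitions do not say how values between 85% and 86% are classified.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers illustrating the definitions above.

# Usage: elapsed_seconds START END
# Elapsed time = wall clock time at end - wall clock time at start.
elapsed_seconds() {
    echo $(( $2 - $1 ))
}

# Usage: utilization_type RUNNING_CPUS TARGET_CPU_USAGE
# Classifies the number of simultaneously running CPUs using the default
# ranges: Idle (none), Poor (<= 50%), OK (51-85%), Ideal (86% and above).
# Assumption: the percentage is truncated to an integer before comparing.
utilization_type() {
    running=$1
    target=$2
    if [ "$running" -eq 0 ]; then
        echo Idle
        return
    fi
    pct=$(( running * 100 / target ))
    if [ "$pct" -le 50 ]; then
        echo Poor
    elif [ "$pct" -le 85 ]; then
        echo OK
    else
        echo Ideal
    fi
}

# Example: with a target CPU usage of 4, three running CPUs is 75%
#   utilization_type 3 4    # prints OK
```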