Analyse de performances pour les systèmes intégrés multi-cœurs
|
|
- Lewis Whitehead
- 8 years ago
- Views:
Transcription
1 Ben Salma SANA Laboratoire TIMA SLS Analyse de performances pour les systèmes intégrés multi-cœurs Encadrants: Frederic Petrot Nicolas Fournel Page 1
2 1 - Introduction: Computer processor design has evolved at a constant pace for the last 20 years. The proliferation of computers into the mass market and the tasks we ask of them continue to push the need for more powerful processors. The market requirement for higher performing processors is linked to the demand for more sophisticated software applications. , for instance, which is now used globally, was only a limited and expensive technology 10 years ago. Today, software applications span everything from helping large corporations better manage and protect their business-critical data and networks to allowing PCs in the home to edit home videos, manipulate digital photographs, and burn downloaded music to CDs. Multi-core processors represent a major evolution in computing technology. This important development is coming at a time when businesses and consumers are beginning to require the benefits offered by these processors due to the exponential growth of digital data and the globalization of the Internet. Multi-core processors will eventually become the pervasive computing model because they offer performance and productivity benefits beyond the capabilities of today s single-core processors. 2 Context: Benchmarks are necessary to experimentally determine the benefits of new designs. The presence of parallel machines is now ubiquitous. Nevertheless, existing benchmarks have a number of limitations and will not be very useful to evaluate future CMPs. The lack of benchmark suites can hamper parallel architecture research and reduce its impact. The Princeton Application Repository for Shared-Memory Computers (PARSEC) was created to be useful for the design of next generation of processors. Parsec is benchmark suite for chip multiprocessor that started as cooperation between Intel and Princeton University. It offers new application areas for parallel machines through proliferation of CMPs. Parsec was used with many processors such as i386, x86_64 and Sparc but not with the ARM processors. During this semester, my TER project was in the research laboratory TIMA carrying out basic research in the field of Electronics: researches aim at the design, architecture, test, CAD, quality of integrated circuits and by means of data processing and microelectronics. My research project topics were to adapt Parsec benchmarks to ARM architecture throw a Cross Compilation and execute them on ARM. Page 2
3 3- What is Cross Compilation? A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler is run. Cross compiler tools are used to generate executables for embedded system or multiple platforms. It is used to compile for a platform upon which it is not feasible to do the compiling, like microcontrollers that don't support an operating system. It has become more common to use this tool for paravirtualization where a system may have one or more platforms in use. Before doing the cross compilation I started with compiling and executing Benchmark Parsec on the local machine (Linux Intel) using the Gnu compiler gcc. Then I compiled the code with arm-none-linux-gnueabi-gcc to execute it finally on the ARM machine. 4 The benchmarks Parsec: 4.1. General Presentation: The Princeton Application Repository for Shared-Memory Computers is a benchmark suite composed of multithreaded programs. It differs from other benchmark suites by different characteristics: Multithreaded: PARSEC uses different ways of parallelization (threads, pipeline ). Diverse: PARSEC does try to explore different application domains such as financial analysis, Computer vision, animation, similarity search Not HPC-Focused: The PARSEC suite anticipates the development of parallel programs in the domain of High-Performance Computing and tries to avoid a program selection which is skewed towards HPC. It focuses on programs from all domains, such as desktop and server applications. Research: The suite is primarily intended for research. It can also be used for performance measurements of real machines, but its original purpose is insight, not numbers. Benchmark suites intended for research usually go beyond pure scoring systems and provide infrastructure to instrument and manipulate included programs in an efficient manner. Page 3
4 In the section bellow I will give deeper explanation of Parsec application starting with Blackscholes application Parsec applications: Blackscholes: The Black-Scholes equation is a differential equation that describes how, under a certain set of assumptions, the value of an option changes as the price of the underlying asset changes. Parallelization: The parallelization algorithms are very simple: they simply price multiple options in parallel using Black-Scholes formula. Each thread prices an individual option. In practice financial houses price 10.s to 100.s of thousands of options using Black-Scholes. Caractérisation des Blackscholes Bodytrack : The bodytrack computer vision application is an Intel RMS workload which tracks a 3D pose of a marker-less human body with multiple cameras through an image sequence. Bodytrack employs an annealed particle filter to track the pose using edges and the foreground silhouette as image features, based on a 10 segment 3D kinematic tree body model. These two image features were chosen because they exhibit a high degree of invariance under a wide range of conditions and because they are easy to extract. An annealed particle filter was employed in order to be able to search high dimensional configuration spaces without having to rely on any assumptions of the tracked body such Page 4
5 as the existence of markers or constrained movements. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces. This work is based upon the following works : -Articulated Body Motion Capture by Stochastic Search Jonathan Deutscher and Ian Reid -A. Balan, L. Sigal and M. Black. A Quantitative Evaluation of video-based 3D Person Tracking. Output of Bodytrack (Frame 1) Parallelization: The program has three parallel kernels: Edge detection: bodytrack employs a gradient based edge detection mask to find edges. The result is compared against a threshold to eliminate spurious edges. Edge detection is implemented in function GradientMagThreshold. The output of this kernel will be further refined before it is used to compute the particle weights. Page 5
6 Edge smoothing: A separable Gaussian filter of size 7 7 pixels is used to smooth the edges in function GaussianBlur. The result is remapped between 0 and 1 to produce a pixel map in which the value of each pixel is related to its distance from an edge. The kernel has two parallel phases, one to filter image rows and one to filter image columns. Calculate particle weights: This kernel evaluates the foreground silhouette and the image edges produced earlier to compute the weights for the particles. This kernel is executed once for every annealing layer during every time step, making it the computationally most intensive part of the body tracker. The parallel kernels use tickets to distribute the work among threads balance the load dynamically. The ticketing mechanism is implemented in class TicketDispenser and behaves like a shared counter. Bodytrack characteristics Page 6
7 Canneal : This kernel was developed by Princeton University. It uses cache-aware simulated annealing to minimize the routing cost of a chip design. Simulated annealing is a common method to approximate the global optimum in a large search space. Canneal pseudo-randomly picks pairs of elements and tries to swap them. The program was included in the PARSEC program selection to represent engineering workloads, for the fine-grained parallelism with its lock-free synchronization techniques and due to its pseudo-random worst-case memory access pattern. Canneal uses a very aggressive synchronization strategy that is based on data race recovery instead of avoidance. Pointers to the elements are dereferenced and swapped atomically, but no locks are held while a potential swap is evaluated. This can cause disadvantageous swaps if one of the relevant elements has been replaced by another thread during that time. This equals a higher effective probability to accept swaps which increase the routing cost, and the simulated annealing method automatically recovers from it. The swap operation employs lock-free synchronization which is implemented with atomic instructions. An alternative implementation which relied on conventional locks turned out to be too inefficient due to excessive locking overhead. The synchronization routines with the atomic instructions are taken from the BSD kernel. Support for most new architectures can be added easily by copying the correct header file from the BSD kernel sources. Canneal characteristics Page 7
8 Dedup: The dedup kernel was developed by Princeton University. It compresses a data stream with a combination of global compression and local compression in order to achieve high compression ratios. Such a compression is called deduplication. The reason for the inclusion of this kernel is that deduplication has become a mainstream method to compress storage footprints for new-generation backup storage systems and to compress communication data for newgeneration bandwidth optimized networking appliances. The kernel uses a pipelined programming model to parallelize the compression to mimic real-world implementations. Parallelization: There are five pipeline stages the intermediate three of which are parallel. In the first stage, dedup reads the input stream and breaks it up into coarse-grained chunks to get independent work units for the threads. The second stage anchors each chunk into fine-grained small segments with rolling fingerprinting. The third pipeline stage computes a hash value for each data segment. The fourth stage compresses each data segment with the Ziv-Lempel algorithm and builds a global hash table that maps hash values to data. The final stage assembles the deduplicated output stream consisting of hash values and compressed data segments. Dedup uses a separate thread pool for each parallel pipeline stage. Each thread pool should at least have a number of threads equal to the number of available cores to allow the system to fully work on any stage should the need arise. The operating system scheduler is responsible for a thread schedule which will maximize the overall throughput of the pipeline. In order to avoid lock contention, the number of queues is scaled with the number of threads, with a small group of threads sharing an input and output queue at a time. Dedup characteristics Page 8
9 Facesim: This Intel RMS application was originally developed by Stanford University. It takes a model of a human face and a time sequence of muscle activations and computes a visually realistic animation of the modeled face by simulating the underlying physics. The goal is to create a visually realistic result. Certain effects such as inertial movements would have only a small visible effect and are not simulated. The workload was included in the benchmark suite because an increasing number of computer games and other forms of animation employ physical simulation to create more realistic virtual environment. Human faces in particular are observed with more attention from users than other details of a virtual world, making their realistic presentation a key element for animations. The parallelization uses a static partitioning of the mesh. Data that spans nodes belonging to more than one partition is replicated. Every time step the partitions process all elements that contain at least one node owned by the particle, but only results for nodes which are owned by the partition are written. Facesim characteristics Page 9
10 Ferret: This application is based on the Ferret toolkit which is used for content-based similarity search of feature-rich data such as audio, images, video, 3D shapes and so on. It was developed by Princeton University. The reason for the inclusion in the benchmark is that it represents emerging next generation desktop and Internet search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model with six stages. The first and the last stage are for input and output. The middle four stages are for query image segmentation, feature extraction, indexing of candidate sets with multi-probe Locality Sensitive Hashing (LSH) and ranking. Each stage has its own thread pool and the basic work unit of the pipeline is a query image. Ferret characteristics Page 10
11 Fluidanimate: This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. Its output can be visualized by detecting and rendering the surface of the fluid. The force density fields are derived directly from the Navier-Stokes equation. Fluidanimate uses special-purpose kernels to increase stability and speed. Fluidanimate was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for computer games and other forms of real-time animations. The parallelization algorithm provided is based on spatial partitioning. As mentioned above, particles are sorted spatially into a uniform grid which covers the entire simulation domain. This grid is evenly partitioned along cell boundaries using axis-aligned splitting planes to produce subgrids, one per thread. Since particles which reside in adjacent cells may interact with each other, multiple threads may need to update the cells which lie on sub-grid boundaries. Page 11
12 Each grid cell has a flag which indicates whether or not it lies on a subgrid boundary. Whenever a particle in a boundary cell needs to be updated, fine-grain locks are used for that update. Locks are not used when updating particles away from boundaries since these updates will never occur concurrently. Fluidanimate characteristics Freqmine: Frequent Itemsets Mining (FIM) is the basis of Association Rule Mining. Association Rule Mining is the process of analyzing a set of transactions to extract association rules. Association Rule Mining is a very common used and well-studied data mining problem. The mining is applicable to any sequential and time series data via discretization. Example domains are protein sequences, market data, web logs, text, music, stock market, etc. To mine Association Rule Mining is converted to mine the frequent itemsets Lk, which contains the frequent itemsets of length k. Many FIMI (FIM Implementation) algorithms have been proposed in the literature, including FP-growth and Apriori based approaches. Page 12
13 Researches showed that the FP-growth can get much faster than some old algorithms like the Apriori based approaches except in some cases the FP-tree can be too large to be stored in memory when the database size is so large or the database is too sparse. Freqmine has been parallelized with OpenMP. It employs three parallel kernels: Build FP-tree header, Construct prefix tree and Mine data. Freqmine characteristics Page 13
14 Streamcluster: This RMS kernel was developed by Princeton University and solves the online clustering problem: For a stream of input points, it finds a predetermined number of medians so that each point is assigned to its nearest center. The quality of the clustering is measured by the sum of squared distances (SSQ) metric. Stream clustering is a common operation where large amounts or continuously produced data has to be organized under real-time conditions, for example network intrusion detection, pattern recognition and data mining. The program spends most of its time evaluating the gain of opening a new center. This operation uses a parallelization scheme which employs static partitioning of data points. The program is memory bound for low-dimensional data and becomes increasingly computationally intensive as the dimensionality increases. Due to its online character the working set size of the algorithm can be chosen independently from the input data. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics. Streamcluster characteristics Page 14
15 Swaption: The swaptions application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. The HJM framework describes how interest rates evolve for risk management and asset liability management for a class of models. Its central insight is that there is an explicit relationship between the drift and volatility parameters of the forward-rate dynamics in a noarbitrage market. Because HJM models are non-markovian the analytic approach of solving the PDE to price a derivative cannot be used. Swaptions therefore employs Monte Carlo (MC) simulation to compute the prices. The workload was included in the benchmark suite because of the significance of PDEs and the wide use of Monte Carlo simulation. The program stores the portfolio in the swaptions array. Each entry corresponds to one derivative. Swaptions partitions the array into a number of blocks equal to the number of threads and assigns one block to every thread. Each thread iterates through all swaptions in the work unit it was assigned and calls the function HJM Swaption Blocking for every entry in order to compute the price. Swaption characteristics Vips: This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. Page 15
16 The benchmark includes fundamental image operations such as an affine transformation and a convolution. It was chosen because image transformations are a common task on desktop computers and for the ability of the VASARI system to construct multi-threaded image processing pipelines transparently on the fly. Future libraries might use concepts such as the ones employed by VIPS to make multi-threaded functionality available to the user. The image transformation pipeline of the vips benchmark has 18 stages. It is implemented in the VIPS operation in benchmark. Vips characteristics Page 16
17 x264: The x264 application is an H.264/AVC (Advanced Video Coding) video encoder. In the 4th annual video codec comparison it was ranked 2nd best codec for its high encoding quality. It is based on the ITU-T H.264 standard which was completed in May 2003 and which is now also part of ISO/IEC MPEG-4. In that context the standard is also known as MPEG-4 Part 10. H.264 describes the lossy compression of a video stream. It improves over previous video encoding standards with new features such as increased sample bit depth precision, higherresolution color information, variable block-size motion compensation (VBSMC) or contextadaptive binary arithmetic coding (CABAC). These advancements allow H.264 encoders to achieve a higher output quality with a lower bit-rate at the expense of a significantly increased encoding and decoding time. The flexibility of H.264 allows its use in a wide range of contexts with different requirements, from video conferencing solutions to high-definition (HD) movie distribution. Next generation HD DVD or Blu-ray video players already require H.264/AVC encoding. The flexibility and wide range of application of the H.264 standard and its ubiquity in next generation video systems are the reasons for the inclusion of x264 in the PARSEC benchmark suite. The parallel algorithm of x264 uses the pipeline model with one stage per input video frame. This results in a virtual pipeline with as many stages as there are input frames. x264 processes a number of pipeline stages equal to the number of encoder threads in parallel, resulting in a sliding window which moves from the beginning of the pipeline to its end. For P- and B-Frames the encoder requires the image data and motion vectors from the relevant region of the reference frames in order to encode the current frame, and so each stage makes this information available as it is calculated during the encoding process. Fast upward movements can thus cause delays which can limit the achievable speedup of x264 in practice. In order to compensate for this effect, the parallelization model requires that x264 is executed with a number of threads greater than the number of cores to achieve maximum performance. Page 17
18 X264 characteristics Raytrace: Raytracing is a technique that generates a visually realistic image by tracing the path of light through a scene. It leverages the physical property that the path of light is always reversible to reduce the computational requirements by following the light rays from the eye point through the image plane to the source of the light. This way only light rays that contribute to the image are considered. The raytrace benchmark program uses a variety of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene. For more details about the specifics of benchmarks Parsec: Page 18
19 5. Project Milestones: In the beginning of my TER I started with reading documentations and trying to understand the goals of the subject. It was the most difficult part of the work. The subject treats new concepts that I didn t know and which are varied and complex. First the understanding of what is a benchmark and why they are useful then I started to explore the Parsec applications and that was very difficult because they treat different domains and use varied types of parallelization. Trying to understand Parsec took the biggest part of my project and I know that I didn t really understand the real use of each benchmark it is so complicated and it takes more then a half day per week. Then I started to discover the cross compilation. I first began with compiling and executing the code on the local machine (Intel Linux) and add the necessary libraries. Finally I made the cross compilation of the code using arm-none-linux-gnueabi-gcc and then simulate it on a four core ARM machine. I tried to execute all the benchmarks but I still have some benchmarks that didn t yet work on the ARM machine. The last step of the project was to analyze performance of ARM with benchmarks Parsec but I didn t have time to make this step. Analyze needs a good comprehension of each benchmark and its inputs (the size of benchmarks parsec is about 6 Go). 6. Conclusion: This project was very useful for me. I learned a lot of new concepts: what is Cross Compilation why we use it and how to cross compile a program for a specific architecture; it also improved my knowledge about scripting. Then I learned about the use of benchmarks and how important they are for the analysis of the performance of an architecture and how it gives the possibility of comparing the performance of different architecture. It was also very hard because benchmark Parsec use many technologies in many domains such as financial analysis, media processing, animation, visualization I find a big difficulty in understanding the use of each benchmark so I spent a lot of time on reading specifications. References: The site of Parsec: Page 19
PARSEC vs. SPLASH 2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip Multiprocessors
PARSEC vs. SPLASH 2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip Multiprocessors ChristianBienia,SanjeevKumar andkaili DepartmentofComputerScience,PrincetonUniversity MicroprocessorTechnologyLabs,Intel
More information18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationPerformance Analysis and Comparison of JM 15.1 and Intel IPP H.264 Encoder and Decoder
Performance Analysis and Comparison of 15.1 and H.264 Encoder and Decoder K.V.Suchethan Swaroop and K.R.Rao, IEEE Fellow Department of Electrical Engineering, University of Texas at Arlington Arlington,
More informationBachelor of Games and Virtual Worlds (Programming) Subject and Course Summaries
First Semester Development 1A On completion of this subject students will be able to apply basic programming and problem solving skills in a 3 rd generation object-oriented programming language (such as
More informationEverything you need to know about flash storage performance
Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices
More informationParallel Ray Tracing using MPI: A Dynamic Load-balancing Approach
Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden smakadir@csc.kth.se,
More informationIntroduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
More informationManaging Performance vs. Accuracy Trade-offs With Loop Perforation
Managing Performance vs. Accuracy Trade-offs With Loop Perforation Stelios Sidiroglou Sasa Misailovic Henry Hoffmann Martin Rinard Computer Science and Artificial Intelligence Laboratory Massachusetts
More informationEmerging Markets for H.264 Video Encoding
Leveraging High Definition and Efficient IP Networking WHITE PAPER Introduction Already dominant in traditional applications such as video conferencing and TV broadcasting, H.264 Advanced Video Coding
More informationChapter 11 I/O Management and Disk Scheduling
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization
More informationHow To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
More informationImage Compression through DCT and Huffman Coding Technique
International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul
More informationMulti-GPU Load Balancing for Simulation and Rendering
Multi- Load Balancing for Simulation and Rendering Yong Cao Computer Science Department, Virginia Tech, USA In-situ ualization and ual Analytics Instant visualization and interaction of computing tasks
More informationScaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
More informationPART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General
More informationWhite paper. H.264 video compression standard. New possibilities within video surveillance.
White paper H.264 video compression standard. New possibilities within video surveillance. Table of contents 1. Introduction 3 2. Development of H.264 3 3. How video compression works 4 4. H.264 profiles
More informationParallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
More informationA Deduplication File System & Course Review
A Deduplication File System & Course Review Kai Li 12/13/12 Topics A Deduplication File System Review 12/13/12 2 Traditional Data Center Storage Hierarchy Clients Network Server SAN Storage Remote mirror
More informationDevelop Computer Animation
Name: Block: A. Introduction 1. Animation simulation of movement created by rapidly displaying images or frames. Relies on persistence of vision the way our eyes retain images for a split second longer
More informationGEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102
More informationImplementation of Canny Edge Detector of color images on CELL/B.E. Architecture.
Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
More informationReducing Backups with Data Deduplication
The Essentials Series: New Techniques for Creating Better Backups Reducing Backups with Data Deduplication sponsored by by Eric Beehler Reducing Backups with Data Deduplication... 1 Explaining Data Deduplication...
More informationThe Data Center as a Grid Load Stabilizer
The Data Center as a Grid Load Stabilizer Hao Chen *, Michael C. Caramanis ** and Ayse K. Coskun * * Department of Electrical and Computer Engineering ** Division of Systems Engineering Boston University
More informationDesign and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
More informationManual for Human Tracking Software
Manual for Human Tracking Software Alexandru Balan Department of Computer Science Brown University Providence, Rhode Island 02912 alb@cs.brown.edu Version 1.0 November 26, 2005 1. Abstract A novel human
More informationSQL Server 2012 Performance White Paper
Published: April 2012 Applies to: SQL Server 2012 Copyright The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication.
More informationArchitectures and Platforms
Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation
More informationParallel Compression and Decompression of DNA Sequence Reads in FASTQ Format
, pp.91-100 http://dx.doi.org/10.14257/ijhit.2014.7.4.09 Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format Jingjing Zheng 1,* and Ting Wang 1, 2 1,* Parallel Software and Computational
More information22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
More informationCS231M Project Report - Automated Real-Time Face Tracking and Blending
CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, slee2010@stanford.edu June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android
More informationHardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
More informationVideo Analytics A New Standard
Benefits The system offers the following overall benefits: Tracker High quality tracking engine UDP s embedded intelligent Video Analytics software is fast becoming the standard for all surveillance and
More informationVisualizing molecular simulations
Visualizing molecular simulations ChE210D Overview Visualization plays a very important role in molecular simulations: it enables us to develop physical intuition about the behavior of a system that is
More informationMaster of Mathematical Finance: Course Descriptions
Master of Mathematical Finance: Course Descriptions CS 522 Data Mining Computer Science This course provides continued exploration of data mining algorithms. More sophisticated algorithms such as support
More informationMMGD0203 Multimedia Design MMGD0203 MULTIMEDIA DESIGN. Chapter 3 Graphics and Animations
MMGD0203 MULTIMEDIA DESIGN Chapter 3 Graphics and Animations 1 Topics: Definition of Graphics Why use Graphics? Graphics Categories Graphics Qualities File Formats Types of Graphics Graphic File Size Introduction
More informationHow To Test Video Quality With Real Time Monitor
White Paper Real Time Monitoring Explained Video Clarity, Inc. 1566 La Pradera Dr Campbell, CA 95008 www.videoclarity.com 408-379-6952 Version 1.0 A Video Clarity White Paper page 1 of 7 Real Time Monitor
More informationStream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
More informationData-parallel Acceleration of PARSEC Black-Scholes Benchmark
Data-parallel Acceleration of PARSEC Black-Scholes Benchmark AUGUST ANDRÉN and PATRIK HAGERNÄS KTH Information and Communication Technology Bachelor of Science Thesis Stockholm, Sweden 2013 TRITA-ICT-EX-2013:158
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationKey Components of WAN Optimization Controller Functionality
Key Components of WAN Optimization Controller Functionality Introduction and Goals One of the key challenges facing IT organizations relative to application and service delivery is ensuring that the applications
More information: Application Layer. Factor the Content. Bernd Paysan. EuroForth 2011, Vienna
net : Application Layer Factor the Content Bernd Paysan EuroForth 2011, Vienna Outline Motivation Requirements Solutions Some Basic Insights Factor Data Distribute the Code net2o Recap: Lower Level Parts
More informationFPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
More informationPERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. achrosny@treeage.com Andrew Munzer Director, Training and Customer
More informationMulti-core and Linux* Kernel
Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores
More informationVideo Coding Technologies and Standards: Now and Beyond
Hitachi Review Vol. 55 (Mar. 2006) 11 Video Coding Technologies and Standards: Now and Beyond Tomokazu Murakami Hiroaki Ito Muneaki Yamaguchi Yuichiro Nakaya, Ph.D. OVERVIEW: Video coding technology compresses
More informationPARALLEL PROGRAMMING
PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS 2nd Edition BARRY WILKINSON University of North Carolina at Charlotte Western Carolina University MICHAEL
More informationMulti-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007
Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer
More informationRevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
More informationParametric Comparison of H.264 with Existing Video Standards
Parametric Comparison of H.264 with Existing Video Standards Sumit Bhardwaj Department of Electronics and Communication Engineering Amity School of Engineering, Noida, Uttar Pradesh,INDIA Jyoti Bhardwaj
More informationFAST 11. Yongseok Oh <ysoh@uos.ac.kr> University of Seoul. Mobile Embedded System Laboratory
CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of flash Memory based Solid State Drives FAST 11 Yongseok Oh University of Seoul Mobile Embedded System Laboratory
More informationHPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
More informationThe Feasibility of Supporting Large-Scale Live Streaming Applications with Dynamic Application End-Points
The Feasibility of Supporting Large-Scale Live Streaming Applications with Dynamic Application End-Points Kay Sripanidkulchai, Aditya Ganjam, Bruce Maggs, and Hui Zhang Instructor: Fabian Bustamante Presented
More informationDemystifying Deduplication for Backup with the Dell DR4000
Demystifying Deduplication for Backup with the Dell DR4000 This Dell Technical White Paper explains how deduplication with the DR4000 can help your organization save time, space, and money. John Bassett
More informationA Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow
, pp.233-237 http://dx.doi.org/10.14257/astl.2014.51.53 A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow Giwoo Kim 1, Hye-Youn Lim 1 and Dae-Seong Kang 1, 1 Department of electronices
More informationH 261. Video Compression 1: H 261 Multimedia Systems (Module 4 Lesson 2) H 261 Coding Basics. Sources: Summary:
Video Compression : 6 Multimedia Systems (Module Lesson ) Summary: 6 Coding Compress color motion video into a low-rate bit stream at following resolutions: QCIF (76 x ) CIF ( x 88) Inter and Intra Frame
More informationwhat operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
More informationTPCalc : a throughput calculator for computer architecture studies
TPCalc : a throughput calculator for computer architecture studies Pierre Michaud Stijn Eyerman Wouter Rogiest IRISA/INRIA Ghent University Ghent University pierre.michaud@inria.fr Stijn.Eyerman@elis.UGent.be
More informationApplication Performance Analysis of the Cortex-A9 MPCore
This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (7FP) for research and technological development Application Performance Analysis
More informationMulticore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
More informationGrid Computing for Artificial Intelligence
Grid Computing for Artificial Intelligence J.M.P. van Waveren May 25th 2007 2007, Id Software, Inc. Abstract To show intelligent behavior in a First Person Shooter (FPS) game an Artificial Intelligence
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
More informationSCALABILITY AND AVAILABILITY
SCALABILITY AND AVAILABILITY Real Systems must be Scalable fast enough to handle the expected load and grow easily when the load grows Available available enough of the time Scalable Scale-up increase
More informationMean-Shift Tracking with Random Sampling
1 Mean-Shift Tracking with Random Sampling Alex Po Leung, Shaogang Gong Department of Computer Science Queen Mary, University of London, London, E1 4NS Abstract In this work, boosting the efficiency of
More informationRemote Graphical Visualization of Large Interactive Spatial Data
Remote Graphical Visualization of Large Interactive Spatial Data ComplexHPC Spring School 2011 International ComplexHPC Challenge Cristinel Mihai Mocan Computer Science Department Technical University
More informationDeploying De-Duplication on Ext4 File System
Deploying De-Duplication on Ext4 File System Usha A. Joglekar 1, Bhushan M. Jagtap 2, Koninika B. Patil 3, 1. Asst. Prof., 2, 3 Students Department of Computer Engineering Smt. Kashibai Navale College
More informationReSense: Mapping Dynamic Workloads of Colocated Multithreaded Applications Using Resource Sensitivity
ReSense: Mapping Dynamic Workloads of Colocated Multithreaded Applications Using Resource Sensitivity TANIMA DEY, WEI WANG, JACK W. DAVIDSON, and MARY LOU SOFFA, University of Virginia To utilize the full
More informationINTEL IPP REALISTIC RENDERING MOBILE PLATFORM SOFTWARE DEVELOPMENT KIT
INTEL IPP REALISTIC RENDERING MOBILE PLATFORM SOFTWARE DEVELOPMENT KIT Department of computer science and engineering, Sogang university 2008. 7. 22 Deukhyun Cha INTEL PERFORMANCE LIBRARY: INTEGRATED PERFORMANCE
More informationBringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationDoctor of Philosophy in Computer Science
Doctor of Philosophy in Computer Science Background/Rationale The program aims to develop computer scientists who are armed with methods, tools and techniques from both theoretical and systems aspects
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationInternet Firewall CSIS 4222. Packet Filtering. Internet Firewall. Examples. Spring 2011 CSIS 4222. net15 1. Routers can implement packet filtering
Internet Firewall CSIS 4222 A combination of hardware and software that isolates an organization s internal network from the Internet at large Ch 27: Internet Routing Ch 30: Packet filtering & firewalls
More informationSo today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
More informationLONG BEACH CITY COLLEGE MEMORANDUM
LONG BEACH CITY COLLEGE MEMORANDUM DATE: May 5, 2000 TO: Academic Senate Equivalency Committee FROM: John Hugunin Department Head for CBIS SUBJECT: Equivalency statement for Computer Science Instructor
More informationWhite Paper. How Streaming Data Analytics Enables Real-Time Decisions
White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream
More informationHardware design for ray tracing
Hardware design for ray tracing Jae-sung Yoon Introduction Realtime ray tracing performance has recently been achieved even on single CPU. [Wald et al. 2001, 2002, 2004] However, higher resolutions, complex
More informationTitolo del paragrafo. Titolo del documento - Sottotitolo documento The Benefits of Pushing Real-Time Market Data via a Web Infrastructure
1 Alessandro Alinone Agenda Introduction Push Technology: definition, typology, history, early failures Lightstreamer: 3rd Generation architecture, true-push Client-side push technology (Browser client,
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
More informationA Comparative Study on Vega-HTTP & Popular Open-source Web-servers
A Comparative Study on Vega-HTTP & Popular Open-source Web-servers Happiest People. Happiest Customers Contents Abstract... 3 Introduction... 3 Performance Comparison... 4 Architecture... 5 Diagram...
More informationNAPCS Product List for NAICS 51219: Post Production Services and Other Motion Picture and Video Industries
National 51219 1 Postproduction Providing computerized and electronic image and sound processing (film, video, digital media, etc.). Includes editing, transfer, color correction, digital restoration, visual
More informationBroadband Networks. Prof. Karandikar. Department of Electrical Engineering. Indian Institute of Technology, Bombay. Lecture - 26
Broadband Networks Prof. Karandikar Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 26 Optical Network &MPLS So, as you were discussing in the previous lectures, next
More informationAnalecta Vol. 8, No. 2 ISSN 2064-7964
EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,
More informationAchieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
More informationPERFORMANCE MODELS FOR APACHE ACCUMULO:
Securely explore your data PERFORMANCE MODELS FOR APACHE ACCUMULO: THE HEAVY TAIL OF A SHAREDNOTHING ARCHITECTURE Chris McCubbin Director of Data Science Sqrrl Data, Inc. I M NOT ADAM FUCHS But perhaps
More informationA Deduplication-based Data Archiving System
2012 International Conference on Image, Vision and Computing (ICIVC 2012) IPCSIT vol. 50 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V50.20 A Deduplication-based Data Archiving System
More informationVideo compression: Performance of available codec software
Video compression: Performance of available codec software Introduction. Digital Video A digital video is a collection of images presented sequentially to produce the effect of continuous motion. It takes
More informationUnderstanding Compression Technologies for HD and Megapixel Surveillance
When the security industry began the transition from using VHS tapes to hard disks for video surveillance storage, the question of how to compress and store video became a top consideration for video surveillance
More informationevm Virtualization Platform for Windows
B A C K G R O U N D E R evm Virtualization Platform for Windows Host your Embedded OS and Windows on a Single Hardware Platform using Intel Virtualization Technology April, 2008 TenAsys Corporation 1400
More informationCSE 237A Final Project Final Report
CSE 237A Final Project Final Report Multi-way video conferencing system over 802.11 wireless network Motivation Yanhua Mao and Shan Yan The latest technology trends in personal mobile computing are towards
More informationInteractive Multimedia Courses-1
Interactive Multimedia Courses-1 IMM 110/Introduction to Digital Media An introduction to digital media for interactive multimedia through the study of state-of-the-art methods of creating digital media:
More informationDiscovering Computers 2008. Chapter 3 Application Software
Discovering Computers 2008 Chapter 3 Application Software Chapter 3 Objectives Identify the categories of application software Explain ways software is distributed Explain how to work with application
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationLoad Balancing on a Non-dedicated Heterogeneous Network of Workstations
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department
More informationPerformance Analysis of Thread Mappings with a Holistic View of the Hardware Resources
Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack Davidson, Mary Lou Soffa Department of Computer Science University
More informationWhite Paper. Recording Server Virtualization
White Paper Recording Server Virtualization Prepared by: Mike Sherwood, Senior Solutions Engineer Milestone Systems 23 March 2011 Table of Contents Introduction... 3 Target audience and white paper purpose...
More informationDistributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More information