NOISE REDUCTION WITH USING PARALLEL ALGORITHMS
Maciej WALCZYŃSKI (1), Wojciech BOŻEJKO (2)
Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław, Poland
(1) Institute of Telecommunications, Teleinformatics and Acoustics
(2) Institute of Computer Engineering, Control and Robotics
maciej.walczynski@pwr.wroc.pl, wojciech.bozejko@pwr.wroc.pl

Zamek Książ - Wałbrzych, Poland, 6-9 June 2010

ABSTRACT

In this paper we propose a parallel version of LMS, an algorithm used in digital signal processing for tasks such as echo elimination and noise reduction. The parallel approach decomposes the problem into a number of smaller ones, which can be computed faster. The results obtained, in particular the speedup and efficiency, show that the parallel method implemented on a GPU is much faster than other existing procedures and that it can be used in real-time systems.

INTRODUCTION

LMS (Least Mean Square) filters are based on minimization of the mean square error. These filters are stable and easy to implement. Unfortunately, parallelizing the algorithm, especially on distributed-memory parallel computing systems, is not straightforward. The main disadvantage of the LMS algorithm is its slow convergence. A number of LMS variants, including PNLMS (Proportionate Normalized Least Mean Square), focus on improving the weak convergence of the original LMS method.

Filter adaptation carries a significant computational and time cost, which has to be minimized. Faster convergence of the algorithm requires longer vectors inside the filter (thousands of elements). The most expensive element of the computation is matrix multiplication. By parallelizing it we obtain a concurrent algorithm that works like the sequential one, but much faster (so-called single-walk parallelization) [1].
The proposed algorithm was implemented in C++ using CUDA and executed on a 128-processor NVIDIA Tesla GPU. The application of LMS filters in ANC was considered by Akhtar et al. [2], Koike [3], Elliott et al. [4] and Eriksson et al. [5].

THE PROBLEM

Active noise control has been known for many years. Its advantage is the ability to adjust to time-varying characteristics of the disturbance. A construction using a microphone and a flexibly controlled loudspeaker was first proposed in 1936 by Lueg [???]. Nowadays most solutions are based on adaptive filter algorithms. The basic idea behind ANC (Active Noise Control) systems is to generate an estimate of the disturbing signal. The estimate should have an amplitude and a frequency spectrum similar to those of the disturbance, but the opposite phase. The signal without disturbances is then obtained by combining the primary signal (with disturbances) with the estimate of the disturbing signal. Adaptive filtering relies on continuous adaptation to disturbances that change over time, so that they are eliminated quickly and efficiently.

Consider a person using the speakerphone of a mobile phone inside a moving car. The signal d(n) recorded by the device microphone contains, apart from speech, disturbances such as engine sound and noise from outside. We want the signal received by the other party to be free of these disturbances. Using a second, reference microphone, placed elsewhere and recording only the disturbances x(n), we want to remove the unwanted component. In general the signals recorded by the two microphones differ in amplitude and phase, but they are correlated, because both contain the disturbance component. This correlation allows an adaptive filter to eliminate the disturbances.
We expect the system to work in real time and the adaptation time (that is, the time after which the estimate of the disturbance signal becomes similar to the reference signal) to be as short as possible.

SEQUENTIAL LMS FILTERS

One of the most popular algorithms for adaptive filtering is the LMS (Least Mean Square) algorithm. LMS belongs to the class of gradient adaptive filters. In these filters we assume that the modification of the vector h(n) of filter parameters should, at each time instant n, be proportional to the gradient vector of the cost function J(n), which can be written as equation (1), where µ(n) is a scaling factor that controls the speed of the filter modification. In general it depends on time. To speed up the adaptation process, an additional weight matrix W(n) is introduced, and the modified equation (1) takes the form (2).

In LMS the instantaneous error value is minimized, so the error criterion takes the form (3). From this, the derivative of the cost function is given by (4), where M denotes the filter dimension. In turn, the error is given by (5), where y(n) is an estimate of the reference signal d(n). Finally, equation (1) takes the form (6), which can be written in matrix form as (7).

There are many kinds of LMS filters. In the simplest form we assume that the scaling factor is constant in time, that is µ(n) = µ, and that the matrix W(n) is the identity matrix I, which leads to formula (8).

The general scheme of a sequential LMS filter is shown in Figure 1.
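As an illustration of the simplest LMS variant described above (constant step size µ and W(n) = I), the recursion is usually written in textbooks as h(n+1) = h(n) + µ e(n) x(n) with e(n) = d(n) - y(n). The sketch below assumes that standard form; it is not taken from the authors' implementation, and the names lms_filter, taps and mu are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential LMS noise canceller (illustrative sketch, not the authors' code).
// d: primary microphone signal d(n); x: reference microphone signal x(n).
// taps (filter length) and mu (constant step size) are assumed parameter names.
// Returns the error signal e(n) = d(n) - y(n), which is the cleaned output.
std::vector<double> lms_filter(const std::vector<double>& d,
                               const std::vector<double>& x,
                               std::size_t taps, double mu)
{
    assert(d.size() == x.size());
    std::vector<double> h(taps, 0.0);      // filter coefficients h(n)
    std::vector<double> e(d.size(), 0.0);  // error signal e(n)

    for (std::size_t n = 0; n < d.size(); ++n) {
        // FIR output y(n) = sum_k h_k(n) * x(n - k); samples before n = 0 count as zero.
        double y = 0.0;
        for (std::size_t k = 0; k < taps && k <= n; ++k)
            y += h[k] * x[n - k];

        e[n] = d[n] - y;

        // Coefficient update h_k(n+1) = h_k(n) + mu * e(n) * x(n - k).
        for (std::size_t k = 0; k < taps && k <= n; ++k)
            h[k] += mu * e[n] * x[n - k];
    }
    return e;
}
```

With a single tap, x(n) = 1, d(n) = 0.5 and mu = 0.5, the error halves at every step, which shows the gradual adaptation the section refers to. The two inner loops are vector products of length taps, which is the part that grows expensive for filters with thousands of coefficients.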
Figure 1. Active noise control system using an LMS filter. (Diagram blocks: signal source, noise source, main microphone, 2nd microphone, delay z^-k, FIR, LMS; signals: d(n), x(n), y(n), e(n).)

LMS PARALLELIZATION

The parallel algorithm was written in C++ using the CUDA library for execution on NVIDIA GPUs. The general outline of the main element of the LMS algorithm (matrix multiplication) is shown in Figures 2 and 3. This way of using concurrency is called single-walk parallelization: a parallel computing environment is used to speed up the most computationally expensive element of the method. In the LMS filter this element is the computation of vector (matrix) products.

    // One thread computes one element of c = a * b.
    // a is n x n, b and c are n x m, all stored row-major;
    // the experiments use square matrices (n == m).
    __global__ void matrix_mult(const int *a, const int *b, int *c, int n, int m)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // row of c
        int idy = blockIdx.y * blockDim.y + threadIdx.y;  // column of c
        if (idx < n && idy < m) {                         // skip threads outside the matrix
            int temp = 0;
            for (int k = 0; k < n; k++)
                temp += a[n * idx + k] * b[m * k + idy];
            c[m * idx + idy] = temp;
        }
    }

Figure 2. Parallel algorithm.
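When experimenting with a kernel like the one in Figure 2, a plain sequential version is useful both as a timing baseline and for checking the GPU result. The following host-side sketch is one possible reference; the function name matrix_mult_host and the row-major layout (a of size n x n, b and c of size n x m) are assumptions made for this illustration, not details confirmed by the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference: c = a * b with a (n x n), b and c (n x m), row-major.
// The two outer loops are what the GPU version distributes over threads:
// each (row, col) pair corresponds to the work of one thread.
std::vector<int> matrix_mult_host(const std::vector<int>& a,
                                  const std::vector<int>& b,
                                  std::size_t n, std::size_t m)
{
    assert(a.size() == n * n && b.size() == n * m);
    std::vector<int> c(n * m, 0);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < m; ++col) {
            int temp = 0;
            for (std::size_t k = 0; k < n; ++k)
                temp += a[n * row + k] * b[m * k + col];
            c[m * row + col] = temp;
        }
    }
    return c;
}
```

Copying the device result back to the host and comparing it element by element with this output is a simple way to catch indexing mistakes in the kernel.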
    int main() {
        // Kernel invocation. deva, devb and devc are device buffers and
        // N, M the matrix dimensions, all set up beforehand (not shown).
        // N and M are assumed to be multiples of 10.
        dim3 grid(N / 10, M / 10);
        dim3 threads(10, 10);
        matrix_mult<<<grid, threads>>>(deva, devb, devc, N, M);
    }

Figure 3. Kernel invocation.

NUMERICAL EXPERIMENTS

The parallel algorithm for the considered problem of LMS parallelization was coded in C (CUDA) for GPU and run on three GPUs:

1. NVIDIA GeForce 9600M GS with 32 streaming processors, installed in a Lenovo Y530 with an Intel Pentium Dual-Core 2 GHz CPU and 3 GB RAM, under the 32-bit Windows Vista Home Premium operating system,
2. NVIDIA GeForce GTX 295 with 480 streaming processors, installed in a machine with an Intel Core 2 Duo 2.4 GHz CPU and 2 GB RAM, under the 32-bit Windows Vista Business operating system,
3. NVIDIA Tesla C870 GPU (512 GFLOPS) with 128 streaming processor cores, installed in a Hewlett-Packard server based on 2 Dual-Core AMD 1 GHz Opteron processors with 1 MB cache memory and 8 GB RAM, working under the 64-bit Linux Debian 5.0 operating system.

On the NVIDIA Tesla architecture, a thread block has 16 kB of shared memory visible to all threads of the block. All threads have access to the same global memory. Shared memory is much faster than global memory: when there are no bank conflicts, accessing shared memory is as fast as accessing a register. By comparison, an access to global memory takes hundreds of cycles. Shared memory can be used only for the smallest test instances (small matrices).

Table 1. Runtime comparison on the NVIDIA GeForce GTX 295 GPU.

               parallel on GeForce GTX 295    sequential on GeForce GTX 295
  n x n        min.     max.     average      min.      max.      average     speedup
  10 x 10      0.15     1.92     0.42         0.65      2.54      0.98        2.32
  20 x 20      0.12     2.91     0.45         4.37      10.38     5.17        11.54
  30 x 30      0.17     2.89     0.56         15.32     17.05     16.01       28.79
  40 x 40      0.17     2.02     0.49         34.06     40.24     35.66       72.93
  50 x 50      0.20     2.00     0.51         67.59     72.40     68.68       135.??
  100 x 100    0.43     2.29     0.75         527.97    529.63    528.63      702.52
Table 2. Runtime comparison on the NVIDIA GeForce 9600M GS.

               parallel on GeForce 9600M GS   sequential on GeForce 9600M GS
  n x n        min.     max.     average      min.      max.      average     speedup
  10 x 10      0.18     1.23     0.67         0.87      1.86      1.18        1.76
  20 x 20      0.28     1.22     0.75         5.88      8.29      6.62        8.82
  30 x 30      0.39     1.32     0.84         19.26     21.51     20.53       24.48
  40 x 40      0.62     1.37     1.10         45.90     48.46     46.39       42.05
  50 x 50      0.98     1.95     1.52         90.41     92.38     91.28       59.??
  100 x 100    6.14     7.70     6.67         715.72    722.64    718.09      107.68

Table 3. Parallel runtimes on the NVIDIA Tesla C870 GPU (512 GFLOPS).

  n x n        min.     max.     average
  10 x 10      0.04     0.08     0.06
  20 x 20      0.06     0.10     0.08
  30 x 30      0.09     0.13     0.10
  40 x 40      0.12     0.22     0.13
  50 x 50      0.20     0.25     0.??
  100 x 100    1.13     1.19     1.??
  ?? x ??      ??.95    12.78    11.??
  ?? x ??      ??.42    30.21    29.??
  ?? x ??      ??.12    75.36    73.??
  ?? x ??      ??.33    138.24   136.??
  ?? x ??      ??       ??       ??
  ?? x ??      ??       ??       ??.85

From Tables 1 and 2 it follows that the parallel results computed on the NVIDIA GeForce GTX 295 and GeForce 9600M GS cards are on average 99 times faster than the sequential results obtained on those cards. The speedup ranges from 1.72 to 107.68 on the GeForce 9600M GS (the upper bound is the 100 x 100 row of Table 2) and from 2.32 to 702.52 on the GeForce GTX 295 (the 100 x 100 row of Table 1). Table 3 shows the results of computations on the NVIDIA Tesla C870 GPU. A comparison is also given in Figure 4 (times) and Figure 5 (speedups). The results show that the proposed method can be used in real-time systems.
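The speedup column of Tables 1 and 2 appears to be the ratio of sequential to parallel runtime. The sketch below assumes it is the ratio of the two average columns; the function name speedup is illustrative. For the 10 x 10 rows this gives 0.98 / 0.42 ≈ 2.33 (tabulated 2.32) and 1.18 / 0.67 ≈ 1.76 (tabulated 1.76). Other rows differ slightly from the tabulated values, so the authors may have computed the figure in a somewhat different way, for example per run before averaging.

```cpp
#include <cassert>

// Speedup of a parallel run relative to a sequential run of the same task.
// Both times must be given in the same unit and t_parallel must be positive.
double speedup(double t_sequential, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_sequential / t_parallel;
}
```

For example, speedup(528.63, 0.75) evaluates the 100 x 100 row of Table 1 and yields a value close to 705, against a tabulated 702.52.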
Figure 4. Matrix multiplication time as a function of matrix dimension.

Figure 5. Speedup as a function of matrix dimension.
CONCLUSION

A method of single-walk parallelization of the LMS filter used for disturbance elimination is proposed here. It consists in parallelizing the matrix multiplication module. We obtain a very fast algorithm which can be used in real-time systems in a GPGPU multithreaded computing environment.

REFERENCES

[1] W. Bożejko, M. Walczyński, M. Wodecki, Zastosowanie algorytmu poszukiwania snopowego opartego na szybkiej transformacie Fouriera do cyfrowej analizy sygnałów, Automatyka, Zeszyty Naukowe Politechniki Śląskiej, Gliwice 2008, z. 150.
[2] M. T. Akhtar, M. Abe, M. Kawamata, Modified-Filtered-X LMS Algorithm Based Active Noise Control System with Improved Online Secondary-Path Modeling, The 47th IEEE International Midwest Symposium on Circuits and Systems.
[3] S. Koike, A class of adaptive step-size control algorithms for adaptive filters, IEEE Trans. Signal Processing, vol. 50, no. 6, June 2002.
[4] S. J. Elliott, I. M. Stothers, P. A. Nelson, A multiple error LMS algorithm and its application to the active control of sound and vibration, IEEE Trans. Acoustics, Speech, Signal Processing, ASSP-35, Oct.
[5] L. J. Eriksson, M. C. Allie, C. D. Bremigan, Active noise control using adaptive digital signal processing, in Proc. ICASSP, 1988.
More informationInfluence of Load Balancing on Quality of Real Time Data Transmission*
SERBIAN JOURNAL OF ELECTRICAL ENGINEERING Vol. 6, No. 3, December 2009, 515-524 UDK: 004.738.2 Influence of Load Balancing on Quality of Real Time Data Transmission* Nataša Maksić 1,a, Petar Knežević 2,
More informationGPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
More informationSeveral tips on how to choose a suitable computer
Several tips on how to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and postprocessing of your data with Artec
More informationThe Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens
More informationANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,
More informationFor designers and engineers, Autodesk Product Design Suite Standard provides a foundational 3D design and drafting solution.
Autodesk Product Design Suite Standard 2013 System Requirements Typical Persona and Workflow For designers and engineers, Autodesk Product Design Suite Standard provides a foundational 3D design and drafting
More information4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department Email : sss40@eng.cam.ac.
4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department Email : sss40@eng.cam.ac.uk 1 1 Outline The LMS algorithm Overview of LMS issues
More informationThe Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
More informationA STUDY OF ECHO IN VOIP SYSTEMS AND SYNCHRONOUS CONVERGENCE OF
A STUDY OF ECHO IN VOIP SYSTEMS AND SYNCHRONOUS CONVERGENCE OF THE µ-law PNLMS ALGORITHM Laura Mintandjian and Patrick A. Naylor 2 TSS Departement, Nortel Parc d activites de Chateaufort, 78 Chateaufort-France
More informationFinal Year Project Progress Report. Frequency-Domain Adaptive Filtering. Myles Friel. Supervisor: Dr.Edward Jones
Final Year Project Progress Report Frequency-Domain Adaptive Filtering Myles Friel 01510401 Supervisor: Dr.Edward Jones Abstract The Final Year Project is an important part of the final year of the Electronic
More informationInteractive Level-Set Deformation On the GPU
Interactive Level-Set Deformation On the GPU Institute for Data Analysis and Visualization University of California, Davis Problem Statement Goal Interactive system for deformable surface manipulation
More informationProgramming GPUs with CUDA
Programming GPUs with CUDA Max Grossman Department of Computer Science Rice University johnmc@rice.edu COMP 422 Lecture 23 12 April 2016 Why GPUs? Two major trends GPU performance is pulling away from
More informationLMS is a simple but powerful algorithm and can be implemented to take advantage of the Lattice FPGA architecture.
February 2012 Introduction Reference Design RD1031 Adaptive algorithms have become a mainstay in DSP. They are used in wide ranging applications including wireless channel estimation, radar guidance systems,
More information2020 Design Update 11.3. Release Notes November 10, 2015
2020 Design Update 11.3 Release Notes November 10, 2015 Contents Introduction... 1 System Requirements... 2 Actively Supported Operating Systems... 2 Hardware Requirements (Minimum)... 2 Hardware Requirements
More informationTurbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
More informationHigh Performance Matrix Inversion with Several GPUs
High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República
More informationHardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
More informationPorting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka
Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka René Widera1, Erik Zenker1,2, Guido Juckeland1, Benjamin Worpitz1,2, Axel Huebl1,2, Andreas Knüpfer2, Wolfgang E. Nagel2,
More informationGPGPU Computing. Yong Cao
GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power
More information