NOISE REDUCTION WITH USING PARALLEL ALGORITHMS

Maciej WALCZYŃSKI 1, Wojciech BOŻEJKO 2
Wrocław University of Technology
Wybrzeże Wyspiańskiego 27, Wrocław, Poland
1 Institute of Telecommunications, Teleinformatics and Acoustics
2 Institute of Computer Engineering, Control and Robotics
maciej.walczynski@pwr.wroc.pl, wojciech.bozejko@pwr.wroc.pl

Zamek Książ - Wałbrzych, Poland, 6-9 June 2010

ABSTRACT

In this paper we propose a parallel version of LMS, an algorithm used in digital signal processing for tasks such as echo elimination and noise reduction. The parallel approach decomposes the problem into a number of smaller ones, which can be computed faster. The obtained results, especially the gains in speed and efficiency, show that the parallel method implemented on a GPU is much faster than other existing procedures and that it can be used in real-time systems.

INTRODUCTION

LMS (Least Mean Square) filters are based on the minimization of the mean square error. These filters are stable and easy to implement. Unfortunately, parallelization of this algorithm, especially in distributed-memory parallel computing systems, is not obvious. The main disadvantage of the LMS algorithm is its slow convergence. There are a number of LMS variants, including PNLMS (Proportional Normalized Least Mean Square), which focus on improving the weak convergence of the original LMS method. The filter adaptation procedure carries a significant computational and time cost, which has to be minimized. Faster convergence of the algorithm requires longer vectors inside the filter (thousands of elements). The most expensive element of the computational process is the matrix multiplication procedure. By parallelizing it we obtain a concurrent algorithm which works exactly as the sequential one, but much faster (so-called single-walk parallelization) [1].

The proposed algorithm was implemented in C++ using CUDA and executed on an NVIDIA Tesla GPU with 128 processors. The application of LMS filters in active noise control (ANC) was considered by Akhtar et al. [2], Koike [3], Elliott et al. [4] and Eriksson et al. [5].

THE PROBLEM

The problem of active noise control has been known for many years. Its advantage is the ability to adjust to a time-varying characteristic of the disturbances. A construction using a microphone and a flexibly controlled loudspeaker was first proposed in 1936 by Lueg [???]. Nowadays most solutions are based on adaptive filter algorithms.

The basic idea on which ANC systems are built is generating an estimate of the disturbing signal. The estimate should have an amplitude and a frequency spectrum similar to those of the disturbing signal, but an opposite phase. The disturbance-free signal is then obtained by combining the primary signal (which contains the disturbances) with the estimate of the disturbing signal. Adaptive filtering consists in permanently adapting to disturbances that change in time, so that they are eliminated quickly and efficiently.

Let us consider the example of a person who speaks through the speakerphone of a mobile phone inside a moving car. The signal d(n) recorded by the device microphone contains, apart from the speech, also disturbances (such as the engine sound, noises from the outside, etc.). We would like the signal received by the other person to be free of these disturbances. By using a second, reference microphone, placed in another location and recording only the disturbances x(n), we want to remove the unwanted component. In the general case the signals recorded by the two microphones differ in amplitude and phase, but they are also correlated (because both contain the disturbance component). This correlation, exploited by an adaptive filter, allows us to eliminate the disturbances.
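We expect the system to work in real time, and the adaptation time (that is, the time after which the estimate of the disturbance signal becomes similar to the reference signal) to be as short as possible.

SEQUENTIAL LMS FILTERS

One of the most popular algorithms used for adaptive filtering is the LMS (Least Mean Square) algorithm. LMS belongs to the class of gradient adaptive filters. In these filters we assume that the modification of the vector h(n) of filter parameters should be proportional, at each time instant n, to the gradient vector of the cost function J(n), which can be written as:

$$\mathbf{h}(n+1) = \mathbf{h}(n) - \frac{1}{2}\,\mu(n)\,\nabla J(n) \qquad (1)$$

where µ(n) is a scaling variable which controls the speed of the filter modification. In the general case it depends on time. To speed up the adaptation process, an additional weight matrix W(n) is introduced. The modified equation (1) then takes the form:

$$\mathbf{h}(n+1) = \mathbf{h}(n) - \frac{1}{2}\,\mu(n)\,\mathbf{W}(n)\,\nabla J(n) \qquad (2)$$

In the case of LMS the instantaneous error value is minimized. Therefore the error criterion takes the form:

$$J(n) = e^{2}(n) \qquad (3)$$

From this, the gradient of the cost function is given by:

$$\nabla J(n) = \frac{\partial e^{2}(n)}{\partial \mathbf{h}(n)} = 2\,e(n)\left[\frac{\partial e(n)}{\partial h_{0}(n)},\ \ldots,\ \frac{\partial e(n)}{\partial h_{M-1}(n)}\right]^{T} \qquad (4)$$

where M denotes the filter dimension. In turn:

$$e(n) = d(n) - y(n) = d(n) - \sum_{k=0}^{M-1} h_{k}(n)\,x(n-k) \qquad (5)$$

where y(n) is an estimator of the reference signal d(n), so that $\partial e(n)/\partial h_{k}(n) = -x(n-k)$. Finally, equation (1) takes the form:

$$h_{k}(n+1) = h_{k}(n) + \mu(n)\,e(n)\,x(n-k), \qquad k = 0, \ldots, M-1 \qquad (6)$$

which can be formulated in matrix form, with $\mathbf{x}(n) = [x(n),\ x(n-1),\ \ldots,\ x(n-M+1)]^{T}$, as:

$$\mathbf{h}(n+1) = \mathbf{h}(n) + \mu(n)\,e(n)\,\mathbf{x}(n) \qquad (7)$$

There exist many kinds of LMS filters. In the simplest form we assume that the scaling component is constant in time, that is µ(n) = µ, and that the matrix W(n) is the identity matrix I, which leads to the formula:

$$\mathbf{h}(n+1) = \mathbf{h}(n) + \mu\,e(n)\,\mathbf{x}(n) \qquad (8)$$

The general scheme of a sequential LMS filter is shown in Figure 1.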
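The constant-step update described above, in which the weight vector is corrected in each step by µ e(n) x(n), can be illustrated with a minimal sequential sketch in C++. This is not the authors' implementation: the function name lms_filter, the zero initial weights and the convention that samples before n = 0 are treated as zero are assumptions made only for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): sequential LMS with a constant
// step size, h(n+1) = h(n) + mu * e(n) * x(n).
//   d  - primary signal (speech plus disturbance)
//   x  - reference signal (disturbance only), at least as long as d
//   M  - filter length, mu - constant step size
// Returns the error signal e(n) = d(n) - y(n).
std::vector<double> lms_filter(const std::vector<double>& d,
                               const std::vector<double>& x,
                               std::size_t M, double mu)
{
    std::vector<double> h(M, 0.0);         // filter weights, assumed to start at zero
    std::vector<double> e(d.size(), 0.0);  // error signal

    for (std::size_t n = 0; n < d.size(); ++n) {
        // Filter output y(n) = sum_k h_k(n) * x(n - k);
        // samples before n = 0 are treated as zero (assumption).
        double y = 0.0;
        for (std::size_t k = 0; k < M && k <= n; ++k)
            y += h[k] * x[n - k];

        e[n] = d[n] - y;

        // Weight update: h_k(n+1) = h_k(n) + mu * e(n) * x(n - k).
        for (std::size_t k = 0; k < M && k <= n; ++k)
            h[k] += mu * e[n] * x[n - k];
    }
    return e;
}
```

The returned error signal is the output of the noise-reduction system: once the weights have adapted, it contains the primary signal with the correlated disturbance removed. The two inner loops of length M are the vector products whose cost grows with the filter length, which is why they are the natural candidates for parallel execution.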

[Block diagram with the elements: signal source, main microphone, delay z^-k, primary signal d(n), noise source, second microphone, reference signal x(n), FIR filter with output y(n), LMS block, error signal e(n).]
Figure 1. Active noise control system using LMS filter.

LMS PARALLELIZATION

The parallel algorithm was designed in C++ using the CUDA library, to be executed on NVIDIA GPUs. The general outline of the main element of the LMS algorithm (matrix multiplication) is shown in Figures 2 and 3. Such a method of using concurrency is called single-walk parallelization; it consists in using a parallel computing environment to speed up the most computationally expensive element of the method. In the LMS filter this element is the calculation of vector (matrix) products.

    // Each thread computes one element of c = a * b.
    // a is n x n, b is n x m, c is n x m; all row-major with 0-based indices.
    __global__ void matrix_mult(const int *a, const int *b, int *c, int n, int m)
    {
        // Row (idx) and column (idy) of the output element handled by this thread.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int idy = blockIdx.y * blockDim.y + threadIdx.y;

        // Threads that fall outside the matrix do nothing.
        if (idx < n && idy < m) {
            int temp = 0;
            for (int k = 0; k < n; k++)
                temp += a[n * idx + k] * b[m * k + idy];
            c[m * idx + idy] = temp;
        }
    }

Figure 2. Parallel algorithm
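For comparison with the per-element computation of Figure 2, a sequential counterpart can be sketched in C++ as below. It is only an illustration, not the sequential code measured by the authors: the function name matrix_mult_seq, the use of std::vector and the 0-based row-major layout (A of size n x n, B of size n x m) are assumptions. Every iteration of the two outer loops is independent of the others, which is what allows each output element to be assigned to a separate GPU thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sequential reference (not from the paper): C = A * B for
// row-major integer matrices, A is n x n, B is n x m, C is n x m.
std::vector<int> matrix_mult_seq(const std::vector<int>& a,
                                 const std::vector<int>& b,
                                 int n, int m)
{
    std::vector<int> c(static_cast<std::size_t>(n) * m, 0);

    for (int row = 0; row < n; ++row) {
        for (int col = 0; col < m; ++col) {
            // Dot product of row `row` of A with column `col` of B.
            int temp = 0;
            for (int k = 0; k < n; ++k)
                temp += a[row * n + k] * b[k * m + col];
            c[row * m + col] = temp;
        }
    }
    return c;
}
```

Such a reference is also convenient for checking the output of a GPU kernel on small matrices before timing larger ones.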

    int main()
    {
        // Kernel invocation. deva, devb and devc are device buffers holding
        // the input matrices and the result (allocation and host-device
        // copies omitted); N and M are assumed to be multiples of 10.
        dim3 grid(N / 10, M / 10);   // number of thread blocks in x and y
        dim3 threads(10, 10);        // 10 x 10 threads per block
        matrix_mult<<<grid, threads>>>(deva, devb, devc, N, M);
    }

Figure 3. Kernel invocation

NUMERICAL EXPERIMENTS

The parallel algorithm for the considered problem of LMS parallelization was coded in C (CUDA) for GPU and run on three GPUs:

1. NVIDIA GeForce 9600M GS with 32 streaming processors, installed in a Lenovo Y530 with an Intel Pentium Dual-Core 2 GHz CPU and 3 GB RAM, under the 32-bit Windows Vista Home Premium operating system,
2. NVIDIA GeForce GTX 295 with 480 streaming processors, installed in a machine with an Intel Core 2 Duo 2.4 GHz CPU and 2 GB RAM, under the 32-bit Windows Vista Business operating system,
3. NVIDIA Tesla C870 GPU (512 GFLOPS) with 128 streaming processor cores. This GPU was installed in a Hewlett-Packard server based on two dual-core AMD Opteron 1 GHz processors with 1 MB cache memory and 8 GB RAM, working under the 64-bit Linux Debian 5.0 operating system.

On the NVIDIA Tesla architecture a thread block has 16 kB of shared memory visible to all threads of the block. All threads have access to the same global memory. Shared memory is much faster than global memory: when there are no bank conflicts, accessing shared memory is as fast as accessing a register. For comparison, an access to global memory takes hundreds of clock cycles. Shared memory can be used only for the smallest test instances (small matrices).

Table 1. Parallel runtimes comparison on NVIDIA GeForce GTX 295 GPU.

n x n        NVIDIA GeForce GTX 295        sequential on GeForce GTX 295      speedup
             min.     max.     average     min.       max.       average
10 x 10      0,15     1,92     0,42        0,65       2,54       0,98         2,32
20 x 20      0,12     2,91     0,45        4,37       10,38      5,17         11,54
30 x 30      0,17     2,89     0,56        15,32      17,05      16,01        28,79
40 x 40      0,17     2,02     0,49        34,06      40,24      35,66        72,93
50 x 50      0,20     2,00     0,51        67,59      72,40      68,68        135,…
100 x 100    0,43     2,29     0,75        527,97     529,63     528,63       702,52

Table 2. Parallel runtimes comparison on NVIDIA GeForce 9600M GS.

n x n        NVIDIA GeForce 9600M GS       sequential on GeForce 9600M GS     speedup
             min.     max.     average     min.       max.       average
10 x 10      0,18     1,23     0,67        0,87       1,86       1,18         1,76
20 x 20      0,28     1,22     0,75        5,88       8,29       6,62         8,82
30 x 30      0,39     1,32     0,84        19,26      21,51      20,53        24,48
40 x 40      0,62     1,37     1,10        45,90      48,46      46,39        42,05
50 x 50      0,98     1,95     1,52        90,41      92,38      91,28        59,…
100 x 100    6,14     7,70     6,67        715,72     722,64     718,09       107,68

Table 3. Parallel runtimes on NVIDIA Tesla C870 GPU (512 GFLOPS).

n x n        NVIDIA Tesla C870
             min.     max.     average
10 x 10      0,04     0,08     0,06
20 x 20      0,06     0,10     0,08
30 x 30      0,09     0,13     0,10
40 x 40      0,12     0,22     0,13
50 x 50      0,20     0,25     0,…
100 x 100    1,13     1,19     1,…
… x …        …,95     12,78    11,…
… x …        …,42     30,21    29,…
… x …        …,12     75,36    73,…
… x …        …,33     138,24   136,…
… x …        …,…      …,…      …,…
… x …        …,…      …,…      …,85

From Tables 1 and 2 it follows that the computations on the NVIDIA GeForce GTX 295 and GeForce 9600M GS cards are on average 99 times faster than the sequential algorithm run on the same machines. The speedup ranges from 1.72 to … on the GeForce 9600M GS card, and from 2.32 to … on the NVIDIA GeForce GTX 295; the largest speedups listed in Tables 2 and 1 are 107,68 and 702,52, respectively. Table 3 shows the results of computations on the NVIDIA Tesla C870 GPU. A comparison is also given in Figure 4 (times) and Figure 5 (speedups). The obtained results show that the proposed method can be used in a real-time system.

Figure 4. Time of matrix multiplication as a function of matrix dimension.

Figure 5. Speedup as a function of matrix dimension.

CONCLUSION

A method of single-walk parallelization of the LMS filter used for disturbance elimination is proposed here. It consists in parallelizing the matrix multiplication module. We obtain a very fast algorithm which can be used in real-time systems in a GPGPU multithreaded computing environment.

REFERENCES

[1] W. Bożejko, M. Walczyński, M. Wodecki, Zastosowanie algorytmu poszukiwania snopowego opartego na szybkiej transformacie Fouriera do cyfrowej analizy sygnałów, Automatyka, Zeszyty Naukowe Politechniki Śląskiej, Gliwice 2008, z. 150, pp. …
[2] M. T. Akhtar, M. Abe, M. Kawamata, Modified-Filtered-X LMS Algorithm Based Active Noise Control System with Improved Online Secondary-Path Modeling, The 47th IEEE International Midwest Symposium on Circuits and Systems.
[3] S. Koike, A class of adaptive step-size control algorithms for adaptive filters, IEEE Trans. Signal Processing, vol. 50, no. 6, pp. …, June 2002.
[4] S. J. Elliott, I. M. Stothers, P. A. Nelson, A multiple error LMS algorithm and its application to the active control of sound and vibration, IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-35, …, Oct. …
[5] L. J. Eriksson, M. C. Allie, C. D. Bremigan, Active noise control using adaptive digital signal processing, in Proc. ICASSP, 1988, pp. …