
Using a Graphics Processor to Accelerate DSP Calculations

J.R. Blough Michigan Technological University, Department of Mechanical Engineering-Engineering Mechanics 1400 Townsend Drive, Houghton, MI, USA email: jrblough@mtu.edu

Abstract
Recent advances in the graphics processing unit (GPU) power available in personal computers have led to an extremely large amount of processing power that typically goes unused on PCs for non-graphics applications. This processing power has become available for general computing applications through the release of the Nvidia CUDA programming language as well as several MATLAB toolboxes. This paper seeks to exploit and measure the advantages that this processing capability offers for digital signal processing applications such as Fast Fourier Transforms, digital filtering, adaptive resampling, Vold-Kalman filtering, and various other DSP applications. It is shown that utilizing the GPU, or a specialized GPU-based processor card, can accelerate some applications tremendously. For each application, details are presented on the most efficient implementations found, both for long time histories and for large channel-count datasets. Finally, suggestions are presented for other applications of this processing power that may benefit the structural dynamics community.

1 Introduction

In the last couple of years there has been a growing realization that many personal computers contain a large untapped computing resource: the graphics card or adapter. While many of the latest PC CPUs have at least two processor cores, and many of the higher end processors have four, the graphics processing unit (GPU) may have from 64 up to 240 processor cores depending on the level of the card. There is in fact so much interest in this technology that a very comprehensive website, General-Purpose Computation on Graphics Hardware, has been created with links to papers, conferences, and discussions about how to best benefit from this technology [1]. One of the largest GPU manufacturers, Nvidia, has created the Compute Unified Device Architecture (CUDA) programming language to tap this resource for general purpose computing. CUDA is a very low level programming language that requires a great amount of talent and skill to utilize to its fullest extent for improving the speed of scientific calculations; most often the programmer is a dedicated computer scientist. CUDA has nevertheless been broadly adopted across many industries, either to increase computational speed or to enhance the visualization of data, including many real-time imaging applications, and Nvidia hosts a yearly conference at which CUDA programmers meet to discuss and showcase their work. Nvidia has also created a line of GPU based computing cards, called Tesla, which contain from 240 up to 512 processor cores and up to 6 Gigabytes of on-board memory; the latest iteration, the M2070, boasts up to 515 Gigaflops of performance for double precision calculations and 1.03 Teraflops of single precision performance [2]. CUDA also supports using multiple GPUs to further enhance the computing capability available on a PC/Mac/Linux platform. The approximate cost of an M2050 is $2277.00; this was the highest performance GPU available at the time this paper was written [3].


To facilitate the adoption of GPU based data processing and visualization, the Accelereyes Jacket package has been developed to interface MATLAB to the GPU through CUDA. Jacket is a layer that sits between MATLAB and CUDA and allows m-files and functions written in MATLAB to use the GPU for improved computing performance; in many instances it is not necessary to rewrite significant portions of the code. It should also be noted that while this paper investigates the combination of Nvidia hardware, CUDA, and the Jacket platform, there are other options for each of these components: ATI has developed a general purpose GPU computing language for its GPUs [4], and there are other packages similar to Jacket, such as GPULib, which interfaces between MATLAB and CUDA [5]. The choices made for this paper were based on the author's opinion that the tools chosen were the most stable and best integrated of the solutions available. This paper builds on this introduction to understand how this additional computing resource might benefit the noise and vibrations industry, in particular in the implementation of digital signal processing algorithms in a post-processing environment. It is shown that significant advantages can be gained for some types of applications, but not all applications benefit from this technology.

2 GPU vs CPU Data Processing

2.1 GPU vs CPU Architectures

There are significant differences between the architecture of a GPU processing card and that of the CPU and its supporting hardware. The central processing unit (CPU) has the job of running the operating system, and therefore coordinates many activities on a computer while simultaneously performing the desired scientific calculations. The CPU has access to all of the main memory (RAM) on the computer and can therefore handle large datasets, with today's 64 bit PCs having as much as 12 GB of RAM. This allows all of the data to be processed to be loaded into memory, so that processing occurs without use of the swap space on the computer's hard drive, which slows computations considerably. The CPU operates at a much higher clock speed and with many fewer cores than a typical GPU; the ramification is that the CPU can execute serial instructions much faster than a GPU. The GPU, with its many more processor cores, can execute many more instructions simultaneously than the CPU, and hence works very well for data processing that can take advantage of parallel processing. The GPU typically has access to much less memory than the CPU, with available memory ranging from 128 MB up to 4-6 GB on a dedicated GPU processor card; this can lead to memory limitations that restrict the size of matrix calculations that can be performed.

The CPU solves computational problems in double precision by default in most cases, and hence gives maximum accuracy in its solutions. The GPU solves computations in single precision by default. While most GPUs are speed optimized for single precision solutions, the new Nvidia GPU processor code-named Fermi is optimized for double precision; it is interesting to note, however, that it is still much faster in single precision than in double precision. This double precision optimization of a GPU processor is a first and will open even more opportunities for using the GPU for general purpose computing when very high accuracy is a requirement.

Due to the architecture of a computer, there is a bottleneck in GPU processing which can be significant. Any processing that is to take place on the GPU requires that the data first be loaded into the main memory of the computer, then transferred to the GPU for analysis, and then transferred back to the CPU for storage to disk or other types of processing. This movement of data from main memory to GPU memory and back again can take a significant amount of computational time and be a choke point for some processes.


In summary, the CPU is still the preferred processing unit for problems that require serial calculations, while the GPU is developing into the preferred processor for large parallel problems. The GPU handles large parallel problems by allocating the individual parallel steps to its different cores; for instance, when performing an FFT, each GPU core can be used to solve for a single frequency line of the result, solving for up to as many frequency lines simultaneously as there are cores, as opposed to the CPU, which solves for the result one frequency line at a time.

2.2 GPU Programming using Accelereyes Jacket

The software package used in the development of this paper was the Accelereyes Jacket platform coupled with MATLAB. This environment was chosen because MATLAB is very widely used, and to encourage consideration of the GPU it was decided to make the GPU programming as easy and straightforward as possible, which is exactly the intent of the Jacket platform. Jacket operates by creating a software layer between CUDA, which communicates directly with the GPU, and MATLAB, in which the user typically programs [6]. This layer is nearly transparent to the programmer for most functions, apart from the requirement to declare variables. Jacket supports many of MATLAB's native commands and effectively works by hijacking a MATLAB command when the data has been declared as a GPU variable, sending the data and the command to the GPU for computation. In this way many parts of an m-file or function written to run on a CPU can be run on the GPU without modification. Jacket will support this process on multiple GPUs simultaneously if it is licensed appropriately; it can also work with MATLAB's Parallel Computing Toolbox to use multiple CPU cores coupled with multiple GPU cards, creating a true supercomputing platform from a typical high end PC. Jacket also supports accelerated graphics in MATLAB by more fully utilizing the GPU.
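As a minimal sketch of this workflow (the variable names here are hypothetical, not taken from the paper's listings), a GPU variable is declared with gsingle, a native MATLAB command is intercepted by Jacket and executed on the GPU, and the result is returned to CPU memory with double:

x = randn(16384, 8);     % ordinary double precision data in CPU memory
x_g = gsingle(x);        % declare a GPU variable; the data is transferred to the GPU
X_g = fft(x_g);          % Jacket hijacks fft and executes it on the GPU cores
X = double(X_g);         % transfer the result back to CPU memory for storage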
2.2.1 Optimizing MATLAB code for Jacket
The first step in converting MATLAB code for use with Jacket is typically to profile the current code using MATLAB's profiler; programming time and effort should first be spent on the areas of the code that will benefit most from an improvement in execution speed. While Jacket supports many of MATLAB's commands, it does not support all of them, so it may not be possible to convert all MATLAB code to run on the GPU. If the commands that demand the most computation time are supported by Jacket in the context in which they are used, no modification of the code is necessary other than declaring the variables to be GPU variables. Only variables needed in the portion of the code executed on the GPU should be declared GPU variables, as the programmer should avoid all unnecessary data transfers to and from the GPU. Most programs will use a combination of CPU and GPU calculations; this is valid and efficient as long as unnecessary transfers between the CPU and the GPU are avoided.

Once the portions of the code with the longest execution times are identified, the next step is to determine whether that code is fully vectorized. Like MATLAB, Jacket prefers vectorized solutions: once a variable is declared a GPU variable, Jacket automatically executes vectorized code across multiple GPU cores with no code modifications required. If the code of interest is not vectorized but can be, vectorizing it is the easiest and most efficient way to take advantage of the GPU's parallel processing capability.

If the computationally demanding portion of the code cannot be vectorized but is coded as a for loop, a gfor loop may decrease the computation time. The gfor command deals each iteration of a for loop to a different core on the GPU and hence parallelizes the solution of the loop. There are, however, significant limitations on gfor loops at this time. First, they cannot be nested inside each other, although they can be nested inside a traditional for loop. Second, Jacket does not currently support conditional statements inside a gfor loop, which can severely limit its application. Finally, an iterative solution cannot be computed with gfor, since previous iterations are not available when the loop is solved in parallel.
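A minimal sketch of the gfor construct is shown below (the matrix and scale factors are hypothetical examples, not from the paper's algorithms); the loop scales each column of a matrix, with each iteration dealt to a different GPU core, and respects the restrictions just listed: no conditionals, no nested gfor, and no dependence between iterations:

A_g = gsingle(randn(4096, 64));   % hypothetical data, declared as a GPU variable
s_g = gsingle(randn(1, 64));      % per-column scale factors
B_g = gzeros(4096, 64);           % pre-allocate the GPU result
gfor ii = 1:64                    % each iteration is dealt to its own GPU core
    B_g(:,ii) = s_g(ii).*A_g(:,ii);
gend
B = double(B_g);                  % return the result to CPU memory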


Finally, as mentioned above, GPUs are at present much more efficient at single precision calculations than double precision calculations, and many GPUs support only single precision; users must therefore find single precision solutions acceptable to get the maximum benefit from their GPU. The best way to assess the impact of this limitation is to compute the results of a problem in double precision on the CPU or GPU and in single precision on the GPU, and compare the results. In the author's experience the differences are in most cases on the order of seven orders of magnitude below the amplitudes of the data being processed, which would normally be considered within the noise floor of the data; however, there were several instances involving a cumulative sum where this error was large enough to be unacceptable.
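A sketch of this verification procedure is given below; my_dsp_calc is a hypothetical placeholder for whatever calculation is being ported:

x = randn(1e5, 8);                         % hypothetical input data
y_cpu = my_dsp_calc(x);                    % double precision reference on the CPU
y_gpu = double(my_dsp_calc(gsingle(x)));   % single precision result on the GPU
rel_err = max(abs(y_cpu(:)-y_gpu(:)))/max(abs(y_cpu(:)))
% as noted above, rel_err is typically on the order of 1e-7, i.e. in the
% noise floor of the data; cumulative sums are a notable exception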

2.2.2 Differences between MATLAB and Jacket Programming

In the course of converting several different programs from CPU-based MATLAB programs to GPU-based Jacket programs, several differences in programming requirements were encountered. While MATLAB does not require that variables be declared, Jacket requires that all GPU variables be declared with their type and size: whether they are real or complex, what size they are, and whether they are single or double precision. Fortran and C programmers have always worked under these requirements, but those used to MATLAB programming tend to relax them since they are not normally necessary. At times it is challenging to determine what size a matrix will be; in an order tracking application, for instance, the blocksize of a dataset may change with each change in rpm. There are instances where modified MATLAB code also becomes more efficient on the CPU, because it was not fully optimized to start with. There are also cases where a change that makes code run faster on the GPU slows it down when run only on the CPU; this is because the CPU performs better with serialized implementations and the GPU with parallelized ones. The level of Jacket's support for some commands also forces the programmer to find different ways to program some algorithms so that they can run on the GPU. Taking these differences in programming techniques into account, what initially seems a very straightforward programming task can become quite time consuming, though in some cases very rewarding, as significant decreases in computation time can be achieved for some applications.

2.3 DSP Calculations on the GPU

To understand which DSP calculations might benefit from using the GPU as a processing unit, several different algorithms were converted from running on the CPU in MATLAB to running on the GPU using Jacket in MATLAB. The following sections discuss which parts of the commonly used algorithms were re-coded and what aspects were investigated in attempting to lower computation times. Each re-coded algorithm was then benchmarked against its CPU equivalent on two different computers with four different graphics cards in total. For each of these benchmarks several aspects of performance were compared, including the effects of blocksize, time history length, and number of channels, as well as other relevant parameters. The results are then presented and discussed to illustrate the effectiveness of the GPU as a processing device.


2.3.1 FFT Calculations

The first algorithms re-coded and investigated were those based on the FFT, including both autopower computations and frequency response function (FRF) calculations. In this case the modifications required to obtain a performance improvement from the GPU program were non-intuitive. The portions of the two programs that are not identical are shown below in Table 1.
CPU Program:

% Set up the loops to calculate the FFT's
Gxx=zeros(size(data,2),size(data,2),blocksize/2);
Gxx_i=zeros(size(data,2),size(data,2),blocksize/2);
for ii=1:numavg;
    xxx=xx_window.*data((ii-1)*(blocksize-olpnt)+1:ii*(blocksize-olpnt)+olpnt,:);
    xxfft=fft(xxx);
    % Only keep the first numpnt/2 points from the FFT
    xxfft2=xxfft(1:blocksize/2,:);
    % Correct for using window and number of points in the FFT
    xxfft2=xxfft2/(blocksize/2)*wincorfact;
    % Correct DC value
    xxfft2(1,:)=2*xxfft2(1,:);
    % Calculate the auto and cross powers from the FFT's
    % Loop through for each frequency
    for jj=1:size(xxfft2,1);
        Gxx_i(:,:,jj)=conj(xxfft2(jj,:).')*xxfft2(jj,:);
    end
    % Average the crosspower matrices together
    Gxx=Gxx+Gxx_i;
end
% Actually compute the average value, divide by the number of averages
Gxx=Gxx/ii;

GPU Program:

% Set up the loops to calculate the FFT's
xx_window_g=gsingle(xx_window);
data_g=gsingle(data);
Gxx_g=complex(gzeros(size(data,2),size(data,2),blocksize/2),gzeros(size(data,2),size(data,2),blocksize/2));
Gxx_i_g=complex(gzeros(size(data,2),size(data,2),blocksize/2),gzeros(size(data,2),size(data,2),blocksize/2));
blocksize_g=gsingle(blocksize);
olpnt_g=gsingle(olpnt);
wincorfact_g=gsingle(wincorfact);
xxx_g=zeros(blocksize,num_channels);
xxfft_g=complex(gzeros(blocksize,num_channels),gones(blocksize,num_channels));
for ii=(1:numavg);
    xxx_g=(xx_window_g.*data_g((ii-1)*(blocksize_g-olpnt_g)+1:ii*(blocksize_g-olpnt_g)+olpnt_g,:));
    xxfft_g=fft(xxx_g);
    % Only keep the first numpnt/2 points from the FFT
    xxfft2_g=gsingle(xxfft_g(1:blocksize_g/2,:));
    % Correct for using window and number of points in the FFT
    xxfft2_g=xxfft2_g/(blocksize_g/2)*wincorfact_g;
    % Correct DC value
    xxfft2_g(1,:)=2*xxfft2_g(1,:);
    % Calculate the auto and cross powers from the FFT's
    % Process frequency vector in parallel
    if(num_channels==1),
        Gxx_i_g(:,:,:)=dot(xxfft2_g,xxfft2_g,2)';
    else
        gfor jj=1:size(xxfft2_g,1),
            Gxx_i_g(:,:,jj)=kron(xxfft2_g(jj,:)',xxfft2_g(jj,:));
        gend
    end
    % Average the crosspower matrices together
    Gxx_g=Gxx_g+Gxx_i_g;
    geval(Gxx_g);
end
% Actually compute the average value, divide by the number of averages
Gxx=double(Gxx_g/ii);

Table 1: CPU vs. GPU Autopower Algorithms


The first difference in the code is the mandatory declaration of variables in the GPU algorithm; the CPU algorithm declares only the two large matrices, as this is a very efficient way to manage memory in MATLAB and decreases execution time considerably. Note that all variables to be used on the GPU must be appropriately declared as GPU variables; they have been denoted with a _g at the end of their names to identify them to the programmer as GPU variables. The other major difference between the two programs is in the nested for loop, where the CPU uses a standard for loop to calculate the autopower matrix for each frequency. The GPU uses a gfor loop, which deals a different frequency to each core of the GPU and therefore performs the calculation in parallel. The calculations are also done using the dot product for a single input and the kron function for multiple inputs; both functions are supported on the GPU by Jacket and hence are much faster there than the conventional matrix multiplication approach. Both approaches, matrix multiplication and kron, were tried on both the CPU and the GPU, with matrix multiplication being faster on the CPU and the kron command faster on the GPU; a short sketch of this equivalence follows. This was a case where the changes between the two algorithms were not minor from a computational perspective, yet, as can be seen, not many lines changed between the CPU and GPU algorithms. The final difference is that results computed on the GPU must be returned to CPU memory if they are to be stored or used in further processing; the last line of the GPU algorithm accomplishes this.
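The equivalence of the two formulations can be seen in a small sketch (illustrative three-channel values, not the paper's data): for a row vector of single-frequency spectra, the kron of its conjugate transpose with itself equals the outer product used in the CPU algorithm:

X = [1+2i, 3-1i, 0.5i];     % hypothetical spectra of 3 channels at one frequency line
Gxx_mult = conj(X.')*X;     % crosspower matrix via matrix multiplication (CPU form)
Gxx_kron = kron(X', X);     % identical 3x3 matrix via kron (form favored on the GPU)
max(abs(Gxx_mult(:)-Gxx_kron(:)))   % zero to machine precision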
2.3.2 Digital Filtering and Decimation
The performance of the GPU for digital filtering and decimation was investigated because digital filtering is used in numerous other analysis applications. In this case the code changes between the CPU and GPU versions were quite minimal, since MATLAB's filter command is also supported on the GPU through Jacket; there is, however, a limitation at the time of writing of a maximum of 50 filter coefficients in Jacket, so the command is not fully supported. The portions of the algorithms required to decimate a dataset are shown below in Table 2. In both the CPU and GPU algorithms two different methods of decimation were evaluated to find the most efficient implementation on each processor. On the CPU, the decimate command from MATLAB's Signal Processing Toolbox was used, as well as a filter command followed by manually discarding the unneeded data points; evaluated on a dataset of 10 channels, the decimate command was actually slower than the two-step approach. Both approaches are shown in the CPU Algorithm column. On the GPU, the two approaches evaluated were a gfor loop over the individual channels and the same two-step approach used on the CPU; again, both approaches are shown in the GPU Algorithm column. As before, the last line required when using the GPU retrieves the data from GPU memory and returns it to CPU memory, as shown. Note that whenever a performance analysis was done on the GPU algorithms, the timing included the time necessary to transfer the data to and from the GPU, as this is the total time required to actually employ the GPU in the calculations.


CPU Algorithm:

decimate_amount=floor(Fsample/(max_freq*2.56));
% Must step through and decimate one channel at a time in MATLAB
for ii = 1:size(resp,2);
    resp2(:,ii) = decimate(resp(:,ii),decimate_amount);
end
% Decimate using filter command instead of decimate command
% Create filter
[filt_b] = fir1(49,1/decimate_amount);
% Filter all channels at one time in MATLAB
resp2 = filter(filt_b,1,resp);
resp2 = resp2(1:decimate_amount:length(time),:);

GPU Algorithm:

decimate_amount=floor(Fsample/(max_freq*2.56));
% Create filter
[filt_b] = fir1(49,1/decimate_amount);
% Copy data to the GPU
resp_g = gsingle(resp);
decimate_amount_g = gsingle(decimate_amount);
filt_b_g = gsingle(filt_b);
resp2_g = gzeros(size(resp_g,1),size(resp_g,2));
resp2_g1 = gzeros(size(resp_g,1),size(resp_g,2));
% Decimate using gfor loop for each channel
gfor ii = 1:size(resp_g,2)
    resp2_g(:,ii) = filter(filt_b_g,1,resp_g(:,ii));
    resp2_g = resp2_g(1:decimate_amount_g:length(resp),:);
gend
% Decimate without using gfor loop
resp2_g = filter(filt_b_g,1,resp_g);
resp2_g = resp2_g(1:decimate_amount_g:length(resp),:);
% Copy data back from GPU
resp3 = double(resp2_g);

Table 2: CPU vs. GPU Decimation Algorithms

2.3.3 Adaptive Resampling for Order Tracking

The adaptive resampling algorithm was also evaluated for its potential deployment to the GPU. At this point in time, however, the author's implementation, which uses an upsampled interpolation filter, requires some logic to minimize the calculations needed to perform the interpolation. It is believed that the most efficient way to implement this algorithm on the GPU would be a gfor loop that deals each resampled angle domain point to a GPU core, allowing as many resampled angle domain points to be computed in parallel as there are GPU cores. It is hypothesized that this would be considerably faster than the current CPU implementation; however, it cannot be evaluated until the logic is eliminated from the algorithm, as the gfor loop in its current implementation does not allow logic inside the loop.
2.3.4 Time Variant Discrete Fourier Transform Order Tracking
The Time Variant Discrete Fourier Transform (TVDFT) algorithm was modified to allow computation on the GPU; the two versions of the algorithm are shown below in Table 3. The line that computes the cumulative sum of the instantaneous rpm, which drives the creation of the order kernels, was evaluated on both the CPU and the GPU, and it was found that this calculation cannot be done in single precision on the GPU without introducing large errors; a short illustration follows. Again there are additional lines of code in the GPU algorithm that declare the GPU variables and transfer the results back from the GPU to the CPU. In this particular case, other than these declaration lines, the only other modification was to use a gfor in place of a standard for to calculate the order kernels.
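The following sketch (illustrative values, not the paper's test data) shows why this cumulative sum is precision-critical: the running sum grows to the order of 1e9 while the order kernel uses small differences of its values, so the coarse single precision spacing at that magnitude corrupts the result:

irpm = 3000 + randn(1e6,1);          % hypothetical instantaneous rpm trace
pk_d = cumsum(double(irpm));         % double precision reference
pk_s = cumsum(single(irpm));         % single precision: the sum reaches ~3e9,
                                     % where eps(single(3e9)) = 256, so rounding
                                     % error accumulates with every addition
max(abs(double(pk_s)-pk_d))          % absolute error reaches thousands of counts,
                                     % swamping the differences used in the kernel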

CPU Algorithm:

% Pre-allocate space for final result
order_amp=zeros(num_channels,num_orders,length(start_index));
order_amp_phase=zeros(length(start_index),num_orders);
pre_kernel=cumsum(irpm);
% Calculate the order kernels
for ii=1:length(start_index);
    % Pre-allocate the vectors for the order kernel
    order_kernel=zeros(end_index(ii)-start_index(ii)+1,num_orders);
    % Create kernel of transform
    order_points = [start_index(ii):end_index(ii)];
    % Loop through and calculate order estimates
    for mm=1:num_orders;
        order_kernel(:,mm) = exp(2.*pi.*1i.*delta_t.*(orders(mm).*(pre_kernel(order_points)-pre_kernel(start_index(ii)))/60));
        % Calculate window and apply to data
        [window]=pwindow_c(window_type,length(order_kernel(:,1)));
        order_kernel(:,mm)=order_kernel(:,mm).*window;
    end
    % Compute amplitude of orders
    order_amp(:,:,ii)=resp(:,start_index(ii):end_index(ii))*order_kernel/length(window);
    % Compute amplitude of order that the others are phase matched to
    order_amp_phase(ii,:)=rresp(start_index(ii):end_index(ii))*order_kernel/length(window);
    % Compute average rpm of order estimate
    avg_rpm(ii)=mean(irpm(start_index(ii):end_index(ii)));
    % Compute average time of order estimate
    avg_time(ii)=mean(time(start_index(ii):end_index(ii)));
end

GPU Algorithm:

% Pre-allocate space for final result
order_amp_g = gzeros(num_channels,num_orders,length(start_index))+0i;
order_amp_phase_g = gzeros(length(start_index),num_orders)+0i;
pre_kernel=(cumsum(irpm)); % if this is single precision then error goes up by four orders of magnitude!
% Calculate the order kernels
for ii = 1:length(start_index);
    % Pre-allocate the vectors for the order kernel
    order_kernel_g = complex(gzeros(end_index(ii)-start_index(ii)+1,num_orders));
    % Create kernel of transform
    order_points_g = [start_index_g(ii):end_index_g(ii)];
    % Loop through and calculate order estimates
    gfor mm = 1:num_orders;
        order_kernel_g(:,mm) = exp(2.*pi.*1i.*delta_t_g.*(orders_g(mm).*(pre_kernel(order_points_g)-pre_kernel(start_index_g(ii)))/60));
        % Calculate window and apply to data
        [window] = pwindow_c(window_type,size(order_kernel_g,1));
        order_kernel_g(:,mm)=order_kernel_g(:,mm).*window;
    gend
    % Compute amplitude of orders
    order_amp_g(:,:,ii)=resp_g(:,start_index_g(ii):end_index_g(ii))*order_kernel_g/size(order_kernel_g,1);
    % Compute amplitude of order that the others are phase matched to
    order_amp_phase_g(ii,:)=rresp_g(order_points_g)*order_kernel_g/size(order_kernel_g,1);
    % Compute average rpm of order estimate
    avg_rpm_g(ii)=mean(irpm(order_points_g));
    % Compute average time of order estimate
    avg_time_g(ii)=mean(time(order_points_g));
end
% Get the data back from the GPU
avg_rpm = double(avg_rpm_g);
avg_time = double(avg_time_g);
order_amp = double(order_amp_g);
order_amp_phase = double(order_amp_phase_g);
window = double(window);

Table 3: CPU vs. GPU Algorithms for TVDFT Order Tracking


2.3.5 Vold-Kalman Filtering

Like the adaptive resampling algorithm, Vold-Kalman filtering could not easily be adapted to run on the GPU at this point in time. The major limitation is that the Jacket software does not yet support sparse matrix calculations, which are the most efficient way to solve the Vold-Kalman filtering equations; the sketch below indicates why dense storage is not a practical substitute.
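The system below is only illustrative of the banded structure of Vold-Kalman-style equations (the weighting factor r and the use of a second-difference operator are assumptions for illustration, not the paper's implementation), but it shows why sparse support is essential:

N = 1e5;                         % number of time samples
r = 0.01;                        % illustrative weighting factor
D2 = diff(speye(N), 2);          % sparse (N-2) x N second-difference operator
B = speye(N) + (1/r^2)*(D2'*D2); % banded sparse system matrix
y = randn(N,1);                  % illustrative measured data
x = B \ y;                       % sparse banded solve: fast and memory-light
% stored densely, B alone would require N^2 doubles (~80 GB for N = 1e5),
% so without sparse types on the GPU the problem simply does not fit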

3 CPU vs. GPU Performance Comparison

The three processes that were successfully recoded to run on the GPU were evaluated from a performance perspective on two computers. The computers and graphics adapters used for the performance evaluations are detailed below in Table 4. The CUDA cores identified in the table are the number of parallel cores available to CUDA, and therefore to Jacket, for parallelizing calculations.
Computer            Main Memory  CPU Processor            CPU Speed  GPU Processor            CUDA Cores  GPU Memory  Operating System
Apple MacBook Pro   3 GB         Intel Core 2 Duo         2.93 GHz   NVIDIA GeForce 9400M     16          256 MB      OS X Snow Leopard
Apple MacBook Pro   3 GB         Intel Core 2 Duo         2.93 GHz   NVIDIA GeForce 9600M GT  32          512 MB      OS X Snow Leopard
Colfax CXT3000      12 GB        Intel Core i7 Model 950  3.06 GHz   NVIDIA Quadro FX1700     32          512 MB      Windows XP 64
Colfax CXT3000      12 GB        Intel Core i7 Model 950  3.06 GHz   NVIDIA Tesla C1060       240         4 GB        Windows XP 64

Table 4: Computer Specifications

All performance numbers reported are for equivalent operations on all platforms, with the GPU computing times including the time to transfer the necessary data to and from the GPU, so that they are an accurate estimate of the true computation time required to employ the GPU in a solution.

3.1 FFT Calculations

The simulations run to investigate the performance of the CPUs and GPUs when performing FFTs and computing crosspower matrices included 150 different combinations of blocksize, number of averages, and number of channels. The blocksize ranged from 1024 to 16384 by powers of 2, the number of averages ranged from 25 to 150, and the number of channels ranged from 2 to 16. No overlap processing was used, a window was applied to all channels, and all FFTs were properly scaled for blocksize and window correction factor; the negative frequencies were discarded and the appropriate scaling correction applied. The full crosspower matrix was computed between all channels, and accuracy was verified because the GPU used single precision calculations while the CPU used double precision. The maximum error was on the order of seven orders of magnitude below the peak values of the data and four orders of magnitude below the smallest data values.


This level of error seems acceptable, as it corresponds to 140 dB of dynamic range in the data; most experimental datasets possess only approximately 70-80 dB of dynamic range at best. The calculations included in the timing analysis were the data transfer to the GPU, the FFTs with scaling, the calculation of the full crosspower matrix, and the transfer of the crosspower matrix back to the CPU. The metric used in all cases to compare compute speeds was the percentage by which the GPU was faster than the CPU, as shown in Equation 1. A positive value indicates that the GPU was faster than the CPU, while a negative value indicates that the CPU was faster than the GPU.

(CPUtime - GPUtime) / CPUtime * 100%     (1)
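A sketch of this benchmarking convention is shown below; autopower_cpu and autopower_gpu are hypothetical wrappers around the Table 1 listings:

tic;
Gxx = autopower_cpu(data);               % CPU implementation, double precision
cpu_time = toc;

tic;                                     % GPU timing deliberately includes both transfers
data_g = gsingle(data);                  % CPU -> GPU transfer
Gxx = double(autopower_gpu(data_g));     % compute, then GPU -> CPU transfer
gpu_time = toc;

pct_faster = (cpu_time - gpu_time)/cpu_time*100;   % Equation 1; positive = GPU wins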

Close inspection of the CPU and GPU timing data showed a couple of interesting facts. First, the GPUs outperformed the CPUs by at least 30% in all cases except where the number of channels being processed was greater than the number of GPU CUDA cores available. In those instances the CPU was usually faster than the GPU by up to 40%, with the margin increasing with the overall size of the data sent to the GPU. Since the Tesla processor always had more cores than the number of channels evaluated, and enough memory to handle every dataset evaluated, it always outperformed the CPU by at least 40% and in most cases was closer to 85% faster. It should be noted that, based on how these algorithms were coded, the parallelization occurred through Jacket's parallelized FFT calculation and through computing the crosspower matrix with a gfor loop that parallelizes the computation for each frequency, followed by the kron command, which Jacket also supports on the GPU. The kron command computes the Kronecker product, which for a column vector and a row vector contains all pairwise products of their elements, and hence yields the full crosspower matrix at each frequency, as illustrated by the sketch in Section 2.3.1.

3.2 Digital Filtering and Decimation

In all simulation cases the CPUs outperformed the GPUs, regardless of the length of the time histories, the number of channels, or the amount of decimation. Several different formulations were evaluated for the GPU implementation, but in all cases the CPU was significantly faster: at least three orders of magnitude faster than the GPU using the algorithms shown in Table 2. The Colfax CPU was approximately twice as fast as the MacBook CPU in most cases. The difference between the four GPU calculation times was as high as 50%, with the Tesla on the Colfax and the 9600M GT on the MacBook performing very similarly in most instances, as long as the dataset did not push the memory limits of the 9600M GT. The algorithms evaluated here attempted to parallelize the calculations by dealing each channel of data to a different GPU core, either explicitly using the gfor loop or through Jacket's implementation of the filter command. It is believed that the CPUs are faster because filtering a time history with a generic filter algorithm is essentially a serial calculation, and is necessarily serial for an IIR filter. There does, however, appear to be potential for a filtering routine based on a polyphase FIR filter, which could be implemented with parallel processing and perform the filtering and decimation simultaneously; a sketch of the idea follows. It is not clear that this would be faster than the CPU algorithms for rather low decimation amounts, but it is certainly an opportunity to be explored.
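A minimal single-channel sketch of the proposed polyphase idea is given below (CPU MATLAB only, with an assumed decimation factor M; this is not the paper's tested code). The FIR filter is split into M sub-filters, each of which runs at the decimated rate on its own phase of the input; the M branches are independent and could therefore be dealt to separate GPU cores before being summed:

M = 4;                                   % assumed decimation factor
h = fir1(49, 1/M);                       % low-pass FIR, as in Table 2
x = randn(1e5, 1);                       % hypothetical single-channel time history

% Reference: the two-step approach of Table 2 (filter, then keep every M-th point)
y_ref = filter(h, 1, x);
y_ref = y_ref(1:M:end);

% Polyphase: pad filter and signal to multiples of M, then filter each branch
h = [h(:); zeros(M*ceil(length(h)/M) - length(h), 1)];
x = [x; zeros(M*ceil(length(x)/M) - length(x), 1)];
P = length(x)/M;                         % output (decimated) length
y = zeros(P, 1);
for k = 1:M                              % branches are independent -> parallelizable
    hk = h(k:M:end);                     % k-th polyphase component of the filter
    if k == 1
        xk = x(1:M:end);                 % phase 0 of the input
    else
        xk = [0; x(M-k+2:M:end)];        % later phases need one sample of delay
    end
    y = y + filter(hk, 1, xk(1:P));      % filter at the low rate and accumulate
end
% y now matches y_ref to machine precision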


3.3 TVDFT Order Tracking

The TVDFT algorithms shown in Table 3 were also evaluated from a performance perspective. The variables varied included the order resolution, the number of channels, and the number of orders tracked. The order resolution was varied from 0.001 up to 1.0, the number of channels from 1 up to 16, and the number of orders tracked from 1 up to 25; in total, 190 combinations of these variables were simulated to assess which aspects most strongly impacted calculation times. In all cases 100 estimates of each order's amplitude were made from each data trace. Close analysis of this data again showed some interesting trends. The CPUs outperformed the GPUs on datasets where the number of orders tracked was below 16 and the order resolution was greater than 0.01; the CPU was up to 200% faster when only two orders were tracked with a relatively large order resolution on a small number of channels. The CPU and GPU computation times approached each other as the number of orders tracked went up and the order resolution went down; the number of channels did not seem to play a significant role in this convergence. When the number of orders tracked was greater than 16 and the order resolution was 0.01 or smaller, the GPUs outperformed the CPUs. The exact crossover point where the GPUs began to outperform the CPUs was directly correlated with the GPU's processing capability, the crossover occurring first for the Tesla, then the 9600M GT, with the FX1700 and the 9400M being very similar to one another. In the test case with 25 orders tracked and an order resolution of 0.001, all of the GPUs outperformed the CPUs substantially, with the Tesla being on the order of 50% faster than the Colfax CPU. In this particular case the number of channels did not significantly change the performance; the computation times changed by less than 15% in all cases (both CPU and GPU) when comparing 1 channel to 15 channels of processing.

4 Conclusions

This paper introduced and discussed the capabilities and programming techniques of the graphics processing unit. The software used was MATLAB coupled with Accelereyes Jacket, with appropriate programming constructs presented. In conclusion, the performance tradeoff between the CPU of a computer and the new GPU based processing units is a complex relationship. Three different signal processing tasks were evaluated for performance differences when run on two different computers, each with two different GPU processors; in all cases the processing was completed in MATLAB using the Accelereyes Jacket platform to utilize the GPU. The results of the implementations used in this paper did not show a clear winner across all of the signal processing tasks evaluated. In both the FFT/crosspower task and the TVDFT order tracking task there were points at which the GPUs significantly outperformed the CPUs. However, there were also simulation cases where the CPUs significantly outperformed the GPUs, including the NVIDIA Tesla, a dedicated GPU based computational engine with 240 parallel cores. The conditions under which each processor type outperformed the others for each type of calculation were discussed in the appropriate sections. For all cases of digital filtering and decimation the CPUs significantly outperformed the GPUs. Two other algorithms, adaptive resampling and Vold-Kalman filtering, were also investigated for deployment on GPUs. In both cases the algorithms could not be successfully implemented, because the GPU processing library available at this time through Jacket and CUDA did not support all of the necessary programming constructs: in the Vold-Kalman case sparse matrices were not supported, and in the adaptive resampling case logic statements were not supported inside gfor loops. Based on the experiences of this paper it is clear that many different types of calculations will benefit from the use of GPU processing in the future. It is also clear that a careful analysis will need to be carried out for each algorithm to assess whether it will benefit from the use of the GPU.


In particular, the more an algorithm can be parallelized the more it will benefit; and where an algorithm can be parallelized in more than one way, all possible formulations should be evaluated, as the results are significantly sensitive to how the parallelization is done.

References
[1] General-Purpose Computation on Graphics Hardware website, June 2010, http://gpgpu.org/
[2] NVIDIA website, June 2010, http://www.nvidia.com/object/product_tesla_M2050_M2070_us.html
[3] Colfax International website, June 2010, http://www.colfax-intl.com/nvidiaGPU.html
[4] ATI Stream website, June 2010, http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM-TECHNOLOGY/Pages/stream-technology.aspx
[5] Tech-X Corporation website, June 2010, http://www.txcorp.com/products/GPULib/
[6] Accelereyes Jacket website, June 2010, http://www.accelereyes.com/
