
Implementation of Reconfigurable Adaptive Filtering Algorithms

P. Muralidhar 1, C. B. Rama Rao 2, K. S. Chaitanya 3


National Institute of Technology, Warangal, India
{ 1pmurali_nitw@yahoo.co.in, 2cbrr@nitw.ac.in, 3kodukula.chaitanya@gmail.com }

Abstract

Filtering data in real time requires dedicated hardware to meet demanding timing requirements. If the statistics of the signal are not known, adaptive filtering algorithms can be implemented to estimate the statistics of the signal iteratively. Modern field programmable gate arrays (FPGAs) include the resources needed to design efficient filtering structures. This paper aims to combine efficient filter structures with optimized code to create a system-on-programmable-chip (SoPC) solution for various adaptive filtering problems. The algorithms in this paper are implemented on the Cyclone II FPGA device of the Altera DE2 board, with the NIOS II soft-core processor instantiated in the FPGA acting as the processor for the application. The Least Mean Square (LMS) adaptive filtering algorithm and its variations have been implemented both in software and as a hardware/software co-design for the NIOS II processor, and the two implementations are compared. The results show an improvement in the number of clock cycles required when implementing the algorithms as a hardware/software co-design over a pure software implementation. A pure hardware implementation would result in much higher performance still, but with somewhat lower flexibility.

Key terms: Field programmable gate array (FPGA), system-on-programmable-chip (SoPC), Least Mean Square (LMS) algorithm, NIOS II processor
1. Introduction
In the last few decades the demand for portable and embedded digital signal processing (DSP) systems has increased dramatically. Applications such as cell phones, hearing aids, and digital audio devices come with stringent constraints on area, speed, and power consumption, and they require an implementation that meets these constraints with the shortest time to market. The possible implementation alternatives range from an ASIC custom chip and general purpose processors (GPP) to DSP processors. While the first choice could provide a solution that meets all the hard constraints, it lacks the flexibility that exists in the other two, and its design cycle is much longer. Reconfigurable computing is gaining much attention as a prototyping and implementation technology for digital systems. Using programmable devices (like FPGAs) for DSP applications could narrow the gap between the flexibility of GPPs and programmable DSP processors, and the high performance of dedicated hardware using ASIC technology.

Modern FPGAs contain many resources that support DSP applications, such as embedded multipliers, multiply-accumulate (MAC) units, and processor cores. These resources are implemented in the FPGA fabric and optimized for high performance and low power consumption. Many soft cores are also available from different vendors that provide support for the basic blocks in many DSP applications. The availability of hard/soft core processors in modern FPGAs allows moving DSP algorithms written for GPPs or DSP processors onto FPGAs using the core processors. An alternative approach is to move part of the algorithm into hardware (HW) to improve performance. This is a form of HW/SW co-design, which requires profiling the software to efficiently partition it between HW and SW. This solution could result in a more efficient implementation, as part of the algorithm is accelerated in HW while the flexibility of software is maintained. A third, more efficient, and more complex alternative is to convert the complete algorithm into hardware. Although this solution is attractive in terms of performance, area, and power consumption, the design cycle is much longer and more complex.

This paper first discusses the theory behind the adaptive filtering algorithms. Section 3 describes the implementation, section 4 presents the results obtained, and section 5 gives the conclusions drawn from those results.

2. Theory
Adaptive filters learn the statistics of their operating environment and continually adjust their parameters accordingly. Because of their ability to perform well in unknown environments and track statistical time variations, adaptive filters have been employed in a wide range of fields. The adjustable parameters that depend upon the application at hand are the number of filter taps, the choice of FIR or IIR, the choice of training algorithm, and the learning rate. Beyond these, the underlying architecture required for realization is independent of the application.

2.1 Adaptive Filtering Problem
The goal of any filter is to extract useful information from noisy data. Whereas a normal fixed filter is designed in advance with knowledge of the statistics of both the signal and the unwanted noise, the adaptive filter continuously adjusts itself to a changing environment through the use of recursive algorithms [1]. This is useful when the statistics of the signals are either not known beforehand or change with time.
Fig.2.1 Block diagram for the adaptive filtering problem

The discrete adaptive filter in fig.2.1 accepts an input u(n) and produces an output y(n) by convolution with the filter weights w(k). A desired reference signal, d(n), is compared to the output to obtain an estimation error e(n). This error signal is used to incrementally adjust the filter weights for the next time instant. Several algorithms exist for the weight adjustment, such as the Least Mean Square (LMS) and the Recursive Least Squares (RLS) algorithms. The choice of algorithm depends upon the required convergence time and the available computational resources, as well as on the statistics of the operating environment.
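For an M-tap FIR realization (the structure used in this paper), these two operations can be written explicitly. The lines below restate the standard definitions implied by fig.2.1; they are not reproduced from the original figure:

    y(n) = Σ_{k=0}^{M-1} w_k(n) u(n-k),        e(n) = d(n) - y(n)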
2.2 Adaptive Algorithms
There are numerous methods for performing the weight update of an adaptive filter. There is the Wiener filter, which is the optimum linear filter in terms of mean squared error, and several algorithms that attempt to approximate it, such as the method of steepest descent. There is also the least mean square algorithm, developed for use in artificial neural networks. Finally, there are other techniques such as the recursive least squares algorithm and the Kalman filter. The choice of algorithm is highly dependent on the signals of interest and the operating environment, as well as on the convergence time required and the computation power available.
2.2.1 Method of Steepest Descent: With the error performance surface defined previously, one can use the method of steepest descent to converge to the optimal filter weights for a given problem. Since the gradient of a surface (or hypersurface) points in the direction of maximum increase, the direction opposite the gradient (-∇) points towards the minimum point of the surface. One can adaptively reach the minimum by updating the weights at each time step using the equation

    w(n+1) = w(n) + µ(-∇(n))

where the constant µ is the step size parameter. The step size parameter determines how fast the algorithm converges to the optimal weights. A necessary and sufficient condition for the convergence or stability of the steepest descent algorithm is for µ to satisfy

    0 < µ < 2/λmax

where λmax is the largest eigenvalue of the correlation matrix R.
Although it is still less complex than solving the Wiener-Hopf equation, the method of steepest descent is rarely used in practice because of the high computation needed: calculating the gradient at each time step would involve calculating p and R, whereas the least mean square algorithm performs similarly using far fewer calculations.
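For reference, on the quadratic MSE surface the gradient that must be computed at each step is the standard Wiener-theory result (see [1]), added here to make the cost explicit:

    ∇(n) = 2 R w(n) - 2 p

where R is the autocorrelation matrix of the tap inputs and p is the cross-correlation vector between the tap inputs and the desired response. Estimating R and p at every iteration is precisely the computation that the LMS approximation below removes.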
2.2.2 Least Mean Square Algorithm: The least mean square (LMS) algorithm is similar to the method of steepest descent in that it adapts the weights by iteratively approaching the MSE minimum. Widrow and Hoff invented this technique in 1960 for use in training neural networks. The key idea is that instead of calculating the gradient at every time step, the LMS algorithm [2] uses a rough approximation to the gradient. The error at the output of the filter can be expressed as

    e(n) = d(n) - y(n)

which is simply the desired output minus the actual filter output. Using this definition of the error, an approximation of the gradient is found by

    ∇(n) ≈ -2 e(n) u(n)

Substituting this expression for the gradient into the weight update equation from the method of steepest descent gives

    w(n+1) = w(n) + 2µ e(n) u(n)

which is the Widrow-Hoff LMS algorithm. As with the steepest descent algorithm, it can be shown to converge for values of µ less than the reciprocal of λmax, but λmax may be time varying, and to avoid computing it another criterion can be used. This is

    0 < µ < 2/(M·Smax)

where M is the number of filter taps and Smax is the maximum value of the power spectral density of the tap inputs u.
The relatively good performance of the LMS algorithm, given its simplicity, has made it the most widely implemented algorithm in practice. For an N-tap filter, the number of operations has been reduced to 2N multiplications and N additions per coefficient update. This is suitable for real-time applications, and is the reason for the popularity of the LMS algorithm.
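As a concrete illustration, the following minimal C sketch performs one LMS iteration for an N-tap FIR filter. It mirrors the equations above; the function and variable names are illustrative and are not taken from the implementation described in section 3.3.

    #include <stddef.h>

    /* One LMS iteration for an N-tap FIR filter.
     * u[0..N-1] is the tap-delay line (u[0] is the newest sample u(n)),
     * d is the desired response d(n), mu is the step size.
     * Updates the weights w in place and returns the filter output y(n). */
    float lms_step(float *w, const float *u, float d, float mu, size_t N)
    {
        float y = 0.0f;
        for (size_t k = 0; k < N; k++)     /* y(n): N multiplications */
            y += w[k] * u[k];

        float e = d - y;                   /* e(n) = d(n) - y(n) */
        float g = 2.0f * mu * e;           /* common factor 2*mu*e(n) */

        for (size_t k = 0; k < N; k++)     /* w(n+1) = w(n) + 2*mu*e(n)*u(n) */
            w[k] += g * u[k];

        return y;
    }

The two loops account for the roughly 2N multiplications per coefficient update quoted above.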
2.2.3 Normalized Least Mean Square Algorithm: In normalized LMS, the gradient step factor µ is normalized by the energy of the data vector [3]:

    w(n+1) = w(n) + (µ̃ / (a + ||u(n)||^2)) e(n) u(n)

where 0 < µ̃ < 2 is the step size and a is a small positive constant. The normalization has several interpretations:
• it corresponds to the 2nd-order convergence bound;
• it makes the algorithm independent of signal scalings;
• it minimizes the mean output error at time n + 1.
NLMS usually converges much more quickly than LMS at very little extra cost, and is therefore very commonly used.
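A matching NLMS step only adds the energy normalization to the LMS sketch above; again, the names are illustrative assumptions:

    #include <stddef.h>

    /* One NLMS iteration: identical to lms_step() except that the step
     * size is divided by the instantaneous input energy a + ||u(n)||^2,
     * where a > 0 is the small regularization constant from the text. */
    float nlms_step(float *w, const float *u, float d,
                    float mu_tilde, float a, size_t N)
    {
        float y = 0.0f, energy = 0.0f;
        for (size_t k = 0; k < N; k++) {
            y += w[k] * u[k];              /* filter output y(n) */
            energy += u[k] * u[k];         /* ||u(n)||^2 */
        }

        float e = d - y;                   /* estimation error e(n) */
        float g = (mu_tilde / (a + energy)) * e;

        for (size_t k = 0; k < N; k++)     /* normalized weight update */
            w[k] += g * u[k];

        return y;
    }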

3. Implementation

3.1 Hardware Platform

The architecture for the LMS and NLMS algorithms shown in fig 3.1 is implemented on the Altera Cyclone II FPGA of the DE2 board. The NIOS II system built using the SOPC Builder tool is configured into the device. The device is configured in JTAG (Joint Test Action Group) passive configuration mode. The Quartus II software automatically generates .sof files that can be downloaded through a ByteBlaster II or USB-Blaster cable for JTAG configuration.

Fig.3.1 Architecture of LMS algorithm

3.1.1 Nios II System Design: The NIOS II is Altera's second-generation soft-core 32-bit RISC microprocessor. The NIOS II core and all its peripherals are written in HDL, can be targeted to all Altera FPGAs, and are synthesized using Quartus II integrated synthesis. The NIOS II system is designed using the SOPC Builder tool [6], and the implemented system is shown in fig.3.2. The main components of the system are:
• NIOS II processor
• Avalon Tristate Bridge: interfaces all peripherals, including parallel input/output (PIO), with the NIOS processor
• Parallel Input/Output peripherals
• 8 MB SDRAM [7]
• PLL: provides a delayed clock input to the SDRAM and timers
• JTAG UART
• Timer: issues interrupts to the processor to obtain timing information
• Performance Counter

Fig.3.2 Components of the implemented system

3.1.2 Custom Instruction [8]: NIOS II custom instructions are custom logic blocks adjacent to the ALU in the CPU data path. With custom instructions we can reduce a complex sequence of standard instructions to a single instruction implemented in hardware. The NIOS II CPU configuration wizard provides a facility to add up to 256 custom instructions to the processor. The custom instruction logic connects directly to the NIOS II processor ALU logic as shown in fig.3.3.

Fig.3.3 Custom instruction logic in NIOS
The algorithm is implemented in the C language, and the computationally intensive part of the algorithm, i.e. the multiply-and-accumulate (MAC) operation, is implemented as a custom instruction to improve performance by reducing the number of clock cycles required for the MAC operation. The hardware block diagram of the implementation is shown in fig.3.4.
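To illustrate how the MAC reaches the hardware, the sketch below uses the __builtin_custom_inii() intrinsic with which the NIOS II GCC toolchain exposes custom instructions to C code. The opcode value CI_MAC_N and the assumption that the custom logic returns the product of its two operands are ours; in a real build the opcode comes from the system.h generated by SOPC Builder.

    /* Hypothetical opcode of the MAC custom instruction; the real number
     * is assigned in SOPC Builder and published in the generated system.h. */
    #define CI_MAC_N 0

    /* __builtin_custom_inii(n, a, b) is the NIOS II GCC intrinsic for an
     * integer-result custom instruction n with two integer operands. */
    #define HW_MULT(a, b) __builtin_custom_inii(CI_MAC_N, (a), (b))

    /* Fixed-point filter output: the multiply that dominated the software
     * profile now executes in the custom logic, one instruction per tap;
     * accumulation stays in software in this sketch. */
    int fir_output_ci(const int *w, const int *u, int N)
    {
        int acc = 0;
        for (int k = 0; k < N; k++)
            acc += HW_MULT(w[k], u[k]);
        return acc;
    }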

Fig.3.4 Hardware block diagram of Nios II System

3.3 Software Implementation

The NIOS II Integrated Development Environment (IDE) is used to run the software on top of the NIOS II system. The base addresses of the system components are specified in the system.h header file, which must be included in the code to access the different components of the system. The operating flow chart and algorithm steps are given in fig.3.5.

Fig.3.5 Flow chart of LMS Algorithm

Algorithm Steps (a C sketch of this loop follows the list):
1. Read the input samples and the desired response from a file and store them in an array.
2. Calculate the filter output assuming arbitrary initial weights.
3. Check the output samples for convergence.
4. If the error is minimal or within the expected range (samples converged), output the filter output.
5. If the samples do not converge, update the weights and repeat from step 2.
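A minimal sketch of these steps, reusing the lms_step() routine from section 2.2.2, might look as follows; the tolerance, iteration cap, and array handling are our assumptions rather than the paper's actual code.

    #include <stddef.h>

    #define N_TAPS    4         /* 4-tap filter for the 8-sample test case */
    #define N_SAMPLES 8
    #define TOL       1e-4f     /* assumed convergence tolerance */
    #define MAX_ITERS 100000    /* assumed iteration cap */

    float lms_step(float *w, const float *u, float d, float mu, size_t N);

    /* Steps 2-5: sweep the stored samples repeatedly, updating the weights
     * with LMS, until the total squared error falls below the tolerance.
     * Returns the number of iterations used, or -1 if no convergence. */
    int train(float w[N_TAPS], const float u[N_SAMPLES],
              const float d[N_SAMPLES], float mu)
    {
        for (int iter = 0; iter < MAX_ITERS; iter++) {
            float err_sq = 0.0f;
            for (int n = N_TAPS - 1; n < N_SAMPLES; n++) {
                float taps[N_TAPS];
                for (int k = 0; k < N_TAPS; k++)   /* build tap-delay line */
                    taps[k] = u[n - k];
                float y = lms_step(w, taps, d[n], mu, N_TAPS);
                float e = d[n] - y;
                err_sq += e * e;                   /* accumulate error */
            }
            if (err_sq < TOL)                      /* steps 3-4: converged */
                return iter;
        }
        return -1;                                 /* step 5 budget exhausted */
    }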

4. Results
The LMS and normalized LMS algorithms are implemented and tested using input lengths of 8 samples and 100 samples. For the 8 sample input a 4-tap adaptive filter is selected, and for the 100 sample input a 25-tap adaptive filter is selected. The following screen shots give the results of building and running the application on the NIOS II processor.

Fig 4.1 Output console showing build of application

Fig 4.2 Output console showing actual output of the application
The desired response and the actual output obtained for an 8 sample input sequence are given in Table 4.1.

Table 4.1 Input, desired, and output samples

Input samples   Desired response   Output samples
1.0313          0.0331             0.031250
0.5881          0.0395             0.035156
0.2986          0.0438             0.039062
0.7021          0.0454             0.042969
0.1169          0.0456             0.042969
0.6005          0.0437             0.046875
0.3773          0.0410             0.040000
0.9747          0.0378             0.035156

The profiling results for the number of clock cycles required by the LMS and NLMS algorithms, with and without the custom instruction, for an 8 sample input sequence are given in Table 4.2.

Table 4.2 Performance comparison of the system with and without custom instruction (clock cycles)

Algorithm         With Custom Instr   Without Custom Instr   % Improvement
LMS with mu=0.8   686756              710864                 3.56
LMS with mu=0.2   2080936             2138283                2.96
NLMS              335132              364711                 8.11

The comparison of the resources utilized by the system with and without the custom instruction is given in Table 4.3.

Table 4.3 Comparison of resource utilization with and without custom instruction

Resource                      With Custom Instr   Without Custom Instr   % increase in utilization
No. of logic elements         4606                4541                   1.41
No. of embedded multipliers   10/70               4/70                   8.52

The comparison between the algorithms regarding the number of iterations taken for the samples to converge, i.e. to minimize the error, is given in Table 4.4.

Table 4.4 Comparison between algorithms regarding number of iterations for convergence

Algorithm         Iterations (N = 8 samples)   Iterations (N = 100 samples)
LMS with mu=0.8   660                          5813
LMS with mu=0.2   2199                         10289
NLMS              313                          6508

5. Conclusions
In this paper, LMS and NLMS algorithms for adaptive filtering applications are implemented using the NIOS processor. It is observed that the NLMS algorithm provides superior performance over the standard fixed step-size LMS algorithm. In addition, two different architectures were proposed to implement the adaptive filtering algorithms. A comparison between the two architectures shows that using a custom instruction (a HW accelerator) coupled with the processor in a co-design configuration reduces the number of cycles required to perform the most critical operations in the algorithm. By testing the algorithms with an 8 sample input sequence, we observed an average improvement of 6% in the number of clock cycles with just a 2 to 6 percent increase in the number of logic elements. In addition, when using the custom instruction we can efficiently utilize the embedded multipliers provided in the hardware. That is, adding a custom instruction to the processor architecture improves the performance of the processor. This improvement in performance is achieved at the cost of larger area and a lower level of design flexibility.

6. References
[1] S. Haykin, Adaptive Filter Theory, Prentice Hall, Upper Saddle River, NJ, 2002.
[2] "LMS Adaptive Filter", Lattice Semiconductor Corporation, 2006.
[3] M. Tarrab and A. Feuer, "Convergence and Performance Analysis of the Normalized LMS Algorithm with Uncorrelated Gaussian Data", IEEE Trans. on Inform. Theory, vol. 34, no. 4, pp. 680-691, 1988.
[4] A. I. Sulyman and A. Zerguine, "Echo Cancellation Using a Variable Step-Size NLMS Algorithm", Signal Processing, EURASIP, 2004.
[5] "Nios II Embedded Processor: Programmer's Reference Manual", Altera Corporation, San Jose, CA.
[6] "Nios II Tutorial", Altera Corporation, San Jose, CA.
[7] "Nios II Hardware Development Tutorial", Altera Corporation, San Jose, CA.
[8] "Nios II Embedded Processor: Peripherals Reference Manual", Altera Corporation, San Jose, CA.
[9] "Nios Custom Instructions Tutorial", Altera Corporation, San Jose, CA, 2002.
