In this chapter, we present a flexible and universal framework using the state-space
approach to blind separation and filtering. As a special case, we consider the standard
multichannel blind deconvolution problem with causal FIR filters.
The state-space description of dynamical systems [659, 896] is a powerful and flexible
generalized model for blind separation and deconvolution or more generally for filtering
and separation. There are several reasons why the state-space models are advantageous
for blind separation and filtering. Although transfer function models in the z-domain or
the frequency domain are equivalent to the state-space models in the time domain for any
linear, stable time-invariant dynamical system, it is difficult to exploit the internal
representation of real dynamical systems using the transfer function directly. The main advantage of the
state-space description is that it not only gives the internal description of a system, but
there are various equivalent canonical types of state-space realizations for a system, such as
balanced realization and observable canonical forms [659, 896]. In particular, it is possible
to parameterize some specific classes of models which are of interest in applications. In
addition, it is relatively easy to tackle the stability problem of state-space systems using
the Kalman filter. Moreover, the state-space model enables a much more general description
than the standard finite impulse response (FIR) convolutive filtering models discussed in
Chapter 9. In fact, all the known filtering models, such as the AR, MA, ARMA, ARMAX
and Gamma filtering models, can also be considered as special cases of flexible state-space
models [989, 1365].
BLIND FILTERING AND SEPARATION USING A STATE-SPACE APPROACH
Suppose that the unknown source signals s(k) are mixed by a stable but unknown causal,
linear time-invariant dynamical system described by the following set of matrix difference
equations (see Fig. 11.1)

ξ(k + 1) = Ā ξ(k) + B̄ s(k) + N̄ ν_ξ(k),   (11.1)
x(k) = C̄ ξ(k) + D̄ s(k) + ν(k),   (11.2)
where ξ(k) ∈ ℝ^d is the state vector of the system, s(k) ∈ ℝ^n is a vector of unknown source
signals (assumed to be zero-mean, i.i.d. and spatially independent), x(k) ∈ ℝ^m is
an available vector of sensor signals, Ā ∈ ℝ^{d×d} is a state matrix, B̄ ∈ ℝ^{d×n} is an input
mixing matrix, C̄ ∈ ℝ^{m×d} is an output mixing matrix, D̄ ∈ ℝ^{m×n} is an input-output
mixing matrix and N̄ ∈ ℝ^{d×d} is a noise matrix. The integer d is called the state dimension
or system order. In principle, there exists an infinite number of state-space realizations for
or system order. In principle, there exists an infinite number of state space realizations for
a given system. For example, the state vector ξ(k) might contain some states that are not
excited by the input or are not observed in the output. In practice, we consider models for
which the order d is minimal. Even for the minimal order (called canonical form), the state
representation is not unique. An equivalent system representation giving the same transfer
function is obtained by applying a nonsingular state transformation matrix T ∈ ℝ^{d×d} to
define a new state vector ξ′(k) = T ξ(k). The eigenvalues of the state matrix Ā are invariant
under this transformation. In order to ensure the stability of the system, the eigenvalues
of Ā must be smaller than 1 in absolute value, i.e., they must be located inside the
unit circle. In fact, the eigenvalues of Ā are directly related to the poles of the matrix transfer
function.
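As a simple numerical illustration, the mixing model (11.1)-(11.2) can be simulated directly. The sketch below (in Python with NumPy; all dimensions, the random seed and the Laplacian sources are arbitrary illustrative choices, and the noise terms are omitted) also checks the stability condition on the eigenvalues of Ā:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: n = 2 sources, m = 2 sensors, state order d = 4.
d, n, m = 4, 2, 2

# Random mixing system; A_bar is rescaled so all eigenvalues lie inside the unit circle.
A_bar = rng.standard_normal((d, d))
A_bar *= 0.9 / np.max(np.abs(np.linalg.eigvals(A_bar)))
B_bar = rng.standard_normal((d, n))
C_bar = rng.standard_normal((m, d))
D_bar = rng.standard_normal((m, n))

assert np.all(np.abs(np.linalg.eigvals(A_bar)) < 1.0)  # stability condition

# Run the state-space mixing model (11.1)-(11.2), noise-free.
N = 1000
s = rng.laplace(size=(N, n))           # i.i.d. zero-mean sources
xi = np.zeros(d)
x = np.empty((N, m))
for k in range(N):
    x[k] = C_bar @ xi + D_bar @ s[k]   # x(k) = C_bar xi(k) + D_bar s(k)
    xi = A_bar @ xi + B_bar @ s[k]     # xi(k+1) = A_bar xi(k) + B_bar s(k)
```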
Let us assume that there exists a stable inverse system (in some sense discussed later),
called a separating or demixing-filtering system. In other words, we assume that the demix-
ing/filtering model consists of another linear state-space system, described as (see Fig. 11.1)

ξ(k + 1) = A ξ(k) + B x(k) + L n(k),   (11.3)
y(k) = C ξ(k) + D x(k),   (11.4)
PROBLEM FORMULATION AND BASIC MODELS
Fig. 11.1 Conceptual block diagram illustrating the general linear state-space mixing and self-
adaptive demixing model for blind separation and filtering. The objective of learning algorithms
is the estimation of a set of matrices {A, B, C, D, L} [287, 289, 290, 1359, 1360, 1361, 1368].
where ξ(k) ∈ ℝ^M is the state vector of the separating system and the unknown state-space
matrices have dimensions: A ∈ ℝ^{M×M}, B ∈ ℝ^{M×m}, C ∈ ℝ^{m×M}, D ∈ ℝ^{m×m}, with
M ≥ d (i.e., the order of the demixing system should be at least the order of the mixing
system).
Since the mixing system is completely unknown (we do not know its parameters or
even its order), our objective is to identify this system or to estimate a demixing/filtering
system up to some kind of ambiguities. In the blind deconvolution problem, the estimation
of the order (dimension M) is a difficult task, and therefore we usually overestimate it, i.e.,
M ≫ d. The overestimation of the order M may produce auxiliary delays of the output signals
with respect to the original sources, but this is usually acceptable in blind deconvolution
since it is most important to recover the waveforms of the original signals.
The state-space description [659, 896] allows us to divide the variables into two types: the
internal state variables ξ(k), which produce the dynamics of the system, and the external
variables x(k) and y(k), which represent the input and output of the demixing/filtering
system, respectively. The vector ξ(k) is known as the state of the dynamical system;
it summarizes all the information about the past behavior of the system that is needed
to uniquely predict its future behavior. The linear state-space model plays a critical role
in the mathematical formulation of a dynamical system. It also allows us to describe the
internal structure of the system and to define the controllability and observability of the
system [659]. The parameters in the state equation of the separating/filtering system are
referred to as internal representation parameters (or simply internal parameters), and the
parameters in the output equation as external ones. Such a partition enables us to estimate
the demixing model in two stages: estimation of the internal dynamical representation, and
of the output (memoryless) demixing. The first stage involves a batch or adaptive on-line
estimation of the state-space and input parameters represented by the set of matrices A, B,
such that the system is stable and has possibly sparse matrices of the lowest possible dimensions.
The second stage involves fixing the internal parameters and estimating the output (external)
parameters represented by the set of matrices C, D by employing suitable batch or adaptive
algorithms. In general, the set of matrices Θ = {A, B, C, D, L} contains the parameters to
be determined in a learning process on the basis of the sequence of available sensor signals
x(k) and some a priori knowledge about the system and noise.
If we ignore the noise terms in the mixing model, its transfer function is an m × n matrix
of the form¹

H(z) = C̄(zI − Ā)⁻¹B̄ + D̄,   (11.5)

where z⁻¹ is a time delay operator.
We formulate the blind separation problem as the task of recovering the original signals from
the observations x(k) = H(z) s(k), without a priori knowledge of the source signals s(k) or the
state-space matrices [Ā, B̄, C̄, D̄], except for certain statistical features of the source signals.
For simplicity, we assume that the noise terms both in the mixing and demixing models
are negligibly small. The transfer function of the demixing model can be expressed as

W(z) = C(zI − A)⁻¹B + D.   (11.6)

The output signals y(k) should estimate the source signals in the following sense

y(k) = W(z) x(k) = P Λ(z) s(k),   (11.7)
¹In this chapter, we assume without loss of generality that the number of outputs is equal to the number
of sensors, and the number of sources is less than or equal to the number of sensors (please see the previous
chapters for an explanation and justification).
The source signals are recovered by the demixing model (11.3) and (11.4) if the global matrix
transfer function satisfies the following relationship

G(z) = W(z)H(z) = P Λ(z),   (11.8)

where P is a permutation matrix and Λ(z) is a nonsingular diagonal matrix of transfer functions.
Any nonsingular transformation matrix T does not change the transfer functions if the following
relations hold:

A = T(Ā − B̄D̄⁺C̄)T⁻¹,
B = TB̄D̄⁺,
C = −D̄⁺C̄T⁻¹,
D = D̄⁺.²
Therefore, the source signals can, in principle, be recovered by the linear state-space demix-
ing/filtering model (11.3) and (11.4). We summarize this feature in the form of the following
theorem [1365]:

Theorem 11.1 If the matrix D̄ in the mixing model is of full column rank, i.e., rank(D̄) = n,
then there exists a set of matrices W = {A, B, C, D} such that the output signals y of the
state-space system (11.3) and (11.4) recover the independent source signals in the sense of
(11.8).
²In practice, the matrix D̄ ∈ ℝ^{m×m} is a square nonsingular matrix, and thus D̄⁺ = D̄⁻¹.
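The inversion formulas above are easy to verify numerically. The following sketch (illustrative matrices only; D̄ is chosen square and nonsingular so that D̄⁺ = D̄⁻¹, and T is an arbitrary nonsingular transformation) builds the demixing matrices from a random mixing system and checks that G(z) = W(z)H(z) equals the identity at a few test points z:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 2

A_bar = rng.standard_normal((d, d))
A_bar *= 0.5 / np.max(np.abs(np.linalg.eigvals(A_bar)))   # stable mixing system
B_bar = rng.standard_normal((d, m))
C_bar = rng.standard_normal((m, d))
D_bar = np.array([[2.0, 0.5], [-0.3, 1.5]])               # square, nonsingular

Dp = np.linalg.inv(D_bar)      # D_bar^+ = D_bar^{-1} in the square case
T = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.2, 0.0, 1.0]])

# Inverse-system (demixing) matrices under the state transformation T.
A = T @ (A_bar - B_bar @ Dp @ C_bar) @ np.linalg.inv(T)
B = T @ B_bar @ Dp
C = -Dp @ C_bar @ np.linalg.inv(T)
D = Dp

def tf(A, B, C, D, z):
    """Transfer function C (zI - A)^{-1} B + D evaluated at a complex point z."""
    return C @ np.linalg.inv(z * np.eye(A.shape[0]) - A) @ B + D

# G(z) = W(z) H(z) should equal the identity away from the poles.
for z in (5.0, -7.0, 3.0 + 4.0j):
    G = tf(A, B, C, D, z) @ tf(A_bar, B_bar, C_bar, D_bar, z)
    assert np.allclose(G, np.eye(m), atol=1e-6)
```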
Chapter 3). However, the order estimation problem in blind demixing and filtering is a quite
difficult and challenging problem. It remains an open problem that is not discussed in this
book. Fortunately, if the dimension of the state vector in the demixing model is simply
overestimated, i.e., if we take it larger than necessary, we can still successfully recover the
original source signals.
The objective of this section is to derive basic adaptive learning algorithms that perform an
update of the external (output) parameters W = [C, D] in the demixing model. It should
be noted that for any feed-forward dynamical model, the internal parameters represented
by the set of matrices [A, B] are fixed and can be explicitly determined on the basis
of the assumed model. In such a case, the problem is reduced to the estimation of the external
(output) parameters, while the dynamical (internal) part of the system can be established and
fixed in advance.
In fact, the separating matrix W = [C, D] ∈ ℝ^{m×(M+m)} can be estimated by any
learning algorithm for ICA, BSS or BSE discussed in the previous chapters, assuming that
the sensor vector has the form x̄ = [ξᵀ, xᵀ]ᵀ = [ξ₁, ξ₂, …, ξ_M, x₁, x₂, …, x_m]ᵀ ∈ ℝ^{M+m}.
In other words, our objective in this section is to estimate the non-square full rank separating
matrix W for a memoryless system with (M + m) inputs and m outputs.
The basic idea is to use the gradient descent approach to minimize suitably designed cost
functions. We can use several of the criteria and cost functions discussed in the previous chapters.
To illustrate the approach, we use the minimization of the Kullback-Leibler divergence as
a measure of independence of the output signals. In order to obtain an improved
learning performance, we define a new search direction, which is related to the natural
gradient, developed by Amari [21] and Amari and Nagaoka [42].
DERIVATION OF BASIC LEARNING ALGORITHMS
11.2.1 Gradient Descent Algorithms for Estimation of Output Matrices W = [C, D]
Let us consider a basic cost (risk) function derived from the Kullback-Leibler divergence
and mutual information as [287, 1365]

R(y, W) = −(1/2) log |det(DDᵀ)| − Σ_{i=1}^{m} E{log q_i(y_i)},   (11.15)

where det(D) is the determinant of the matrix D. Each pdf q_i(y_i) is an approximation of
the true pdf r_i(y_i) of an estimated source signal.
The first term of the loss function prevents all the output signals from decaying to zero,
and the second term forces the output signals to be maximally statistically independent.
It should be noted that by minimizing such a loss function, we are not explicitly aiming
to recover the source signals, but only attempting to make the output signals mutually
independent. The true source signals may differ from the recovered independent output
signals by an arbitrary permutation and convolution.
In order to evaluate the gradient of the loss function ρ(y, W) with respect to W, we
calculate the total differential dρ(y, W) of ρ(y, W), where we take a differential on W,

dρ(y, W) = −tr(dD D⁻¹) + fᵀ(y) dy,   (11.16)

where tr(·) is the trace of a matrix and f(y) = [f₁(y₁), f₂(y₂), …, f_m(y_m)]ᵀ is a vector of
nonlinear activation functions

f_i(y_i) = −d log q_i(y_i)/dy_i = −q_i′(y_i)/q_i(y_i).   (11.17)

Since the differential of the output equation (11.4) is

dy = dC ξ + dD x + C dξ,   (11.18)

the gradient components of the loss function become

∂ρ/∂C = E{f(y) ξᵀ},   (11.19)
∂ρ/∂D = E{f(y) xᵀ} − D⁻ᵀ.   (11.20)
Let us apply now the standard gradient descent method for updating the matrix C and the
natural gradient rule for updating the matrix D. Then, we obtain on the basis of (11.19)
and (11.20) the adaptive learning algorithm

ΔC(l) = C(l+1) − C(l) = −η ∂ρ/∂C = −η ⟨f[y(k)] ξᵀ(k)⟩,   (11.21)

ΔD(l) = D(l+1) − D(l) = −η (∂ρ/∂D) DᵀD
      = η [I − ⟨f[y(k)] xᵀ(k) Dᵀ(l)⟩] D(l)
      = η [I − ⟨f[y(k)] y₁ᵀ(k)⟩] D(l),   (11.22)
where

y₁(k) = D(l) x(k)   (11.25)

is the output of the memoryless (instantaneous) part of the demixing model.
The above algorithm is computationally efficient because the matrix D does not need to
be inverted at each iteration step, but it does not dramatically improve the convergence
speed in comparison to the standard gradient descent method because the modified search
direction is not exactly the natural gradient one [288].
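A single iteration of the updates (11.21)-(11.22) can be sketched as follows (a single-sample, instantaneous version with the averages ⟨·⟩ replaced by one-sample estimates; the sizes, the seed, the learning rate η and the tanh activation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
m, M = 2, 4          # number of outputs and state dimension (illustrative)
eta = 0.01           # learning rate (illustrative)

C = 0.1 * rng.standard_normal((m, M))
D = np.eye(m)

def f(y):
    return np.tanh(y)          # a typical activation for super-Gaussian sources

def update(C, D, xi, x):
    """One step of (11.21)-(11.22) for a single sample."""
    y = C @ xi + D @ x                                     # output equation (11.4)
    y1 = D @ x                                             # eq. (11.25)
    C_new = C - eta * np.outer(f(y), xi)                   # eq. (11.21)
    D_new = D + eta * (np.eye(m) - np.outer(f(y), y1)) @ D # eq. (11.22)
    return C_new, D_new, y

xi = rng.standard_normal(M)
x = rng.standard_normal(m)
C, D, y = update(C, D, xi, x)
```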
Remark 11.2 The above algorithm uses the vector of activation functions f(y). The optimal
choice of the activation function is given by equation (11.17) with q_i(y_i) = r_i(y_i), if we
can adaptively estimate the true source probability distribution r_i(y_i). Another solution is to
choose a score function according to the statistics of the source signals. Typically, if a source signal
y_i is super-Gaussian, one can choose f_i(y_i) = tanh(y_i), and if it is sub-Gaussian, one
can choose f_i(y_i) = y_i³ [29, 30, 402]. A question can be raised whether the learning algorithm
will converge to a true solution if approximated activation functions are used. The theory
of the semi-parametric model for blind separation/deconvolution [28, 40, 41, 1356, 1358]
shows that even if a misspecified pdf is used in the learning algorithm, it can still converge
to the true solution if certain stability conditions are satisfied (see Chapter 10) [29].
The equilibrium points of the above learning algorithm satisfy the following equations

E{f(y(k)) ξᵀ(k)} = 0,   (11.27)
E{I − f(y(k)) y₁ᵀ(k)} = 0.   (11.28)

By post-multiplying equation (11.27) by Cᵀ from the right-hand side, and adding it to equation
(11.28), we obtain

E{I − f(y(k)) yᵀ(k)} = 0.   (11.29)

This means that the output signals y converge when the generalized covariance matrix

R_fy = E{f(y(k)) yᵀ(k)} = I   (11.30)

is the identity matrix (or, more generally, any positive definite diagonal matrix).
For the special case when the matrix D = 0, we can formulate the following alternative cost
function [288]

ρ(y, W) = −log |det(C₁)| − Σ_{i=1}^{m} E{log q_i(y_i)},   (11.31)

where C₁ is an m × m nonsingular matrix which is the first block of the output matrix
C = [C₁, C₂] ∈ ℝ^{m×M}. The first term in the cost function ensures that the trivial solution
y = 0 is avoided, while the second term ensures mutual independence of the outputs.

By using the combined natural-standard gradient approach, we obtain simple learning
rules for C₁ and C₂ [1360, 1361]

ΔC₁(l) = η [I − ⟨f(y(k)) y₁ᵀ(k)⟩] C₁(l),   (11.32)
ΔC₂(l) = −η ⟨f(y(k)) ξ₂ᵀ(k)⟩,   (11.33)

where y₁ = C₁ξ₁, ξ₁ = [ξ₁, ξ₂, …, ξ_m]ᵀ, and ξ₂ = [ξ_{m+1}, ξ_{m+2}, …, ξ_M]ᵀ.
A slightly different algorithm can be derived by applying the extended natural gradient
rule, described in Chapter 6, given in general form as [1361, 1363, 1364]

(11.34)

with Q = I, where Λ_M ∈ ℝ^{M×M} is a positive definite diagonal matrix.
11.2.2 Special Case - Multichannel Blind Deconvolution with Causal FIR Filters

Let us consider a special case of a simplified, moving average (convolutive) model described
as [287, 288]

y(k) = C ξ(k) + D x(k) = [W(z)] x(k) = Σ_{p=0}^{L} W_p z⁻ᵖ x(k) = Σ_{p=0}^{L} W_p x(k − p),   (11.37)
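The convolutive model (11.37) is simply a sum of delayed, matrix-weighted inputs and can be implemented directly. The following sketch (arbitrary random coefficient matrices W_p and inputs, with the convention x(k) = 0 for k < 0) checks a vectorized implementation against the definition at one time index:

```python
import numpy as np

rng = np.random.default_rng(3)
m, L, N = 2, 3, 200                      # channels, filter order, samples

W = rng.standard_normal((L + 1, m, m))   # taps W_0, ..., W_L
x = rng.standard_normal((N, m))

def fir_demix(W, x):
    """y(k) = sum_{p=0}^{L} W_p x(k-p), with x(k) = 0 for k < 0 (eq. 11.37)."""
    N, m = x.shape
    y = np.zeros((N, m))
    for p in range(W.shape[0]):
        y[p:] += x[: N - p] @ W[p].T     # adds W_p x(k-p) for all k >= p
    return y

y = fir_demix(W, x)

# Sanity check against the direct definition at one time index.
k = 10
yk = sum(W[p] @ x[k - p] for p in range(W.shape[0]))
assert np.allclose(y[k], yk)
```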
11.2.3 Derivation of the Natural Gradient Algorithm for the State-Space Model

The learning algorithms presented in the previous section, although of practical value, have
been derived in a heuristic or intuitive way. In this section, we derive the efficient natural
gradient learning algorithm in a mathematically rigorous way [1365].

Let us assume without loss of generality that the number of outputs is equal to the
number of sensors and that the matrix D ∈ ℝ^{m×m} is nonsingular. From the linear output
equation (11.4), we have

x(k) = D⁻¹ (y(k) − C ξ(k)).   (11.41)

Substituting (11.41) into (11.18), we obtain

dy = (dC − dD D⁻¹C) ξ + dD D⁻¹ y + C dξ.   (11.42)
In order to improve the computational efficiency of learning algorithms, we introduce a new
search direction defined as

dX₁ = dC − dD D⁻¹C,   (11.43)
dX₂ = dD D⁻¹.   (11.44)

In terms of these new variables, the gradient components of the loss function take the simple form

∂ρ(y, W)/∂X₁ = f(y(k)) ξᵀ(k),   (11.45)
∂ρ(y, W)/∂X₂ = f(y(k)) yᵀ(k) − I.   (11.46)
Using the standard gradient descent approach, we obtain the updates of X₁ and X₂

ΔX₁ = −η ∂ρ/∂X₁ = −η f(y(k)) ξᵀ(k),   (11.47)
ΔX₂ = −η ∂ρ/∂X₂ = η [I − f(y(k)) yᵀ(k)],   (11.48)

which, expressed in terms of the original parameters, give

ΔC = ΔX₁ + ΔX₂ C,   (11.49)
ΔD = ΔX₂ D.   (11.50)

The modified search direction is related to the standard gradient by

[ΔC, ΔD] = −η ∇R M,   (11.51)

where ∇R = [∂ρ/∂C, ∂ρ/∂D] is the standard gradient and M is a symmetric positive
definite pre-conditioning matrix

M = [ I + CᵀC   CᵀD
      DᵀC      DᵀD ].
The natural gradient learning algorithm can be rewritten equivalently in on-line form.
Using the moving-average method, the on-line version of the natural gradient algorithm
with nonholonomic constraints (see Chapter 6) can be expressed as

ΔC(k) = η(k) [(Λ(k) − R_fy(k)) C(k) − R_fξ(k)],
ΔD(k) = η(k) [Λ(k) − R_fy(k)] D(k),

where

R_fy(k) = (1 − η₀) R_fy(k − 1) + η₀ f(y(k)) yᵀ(k),   (11.55)
R_fξ(k) = (1 − η₀) R_fξ(k − 1) + η₀ f(y(k)) ξᵀ(k),   (11.56)

and the elements of the diagonal matrix Λ(k) = diag{λ₁(k), λ₂(k), …, λ_m(k)} are updated as

λ_i(k) = (1 − η₀) λ_i(k − 1) + η₀ f_i(y_i(k)) y_i(k).   (11.58)
For zero initial conditions, the state vector can be expressed explicitly as

ξ(k + 1) = Σ_{p=0}^{k−1} Aᵖ B y(k − p).   (11.60)
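The closed-form expression of the state as a weighted sum of past inputs is equivalent to running the recursion ξ(k+1) = A ξ(k) + B u(k) from a zero initial state. The sketch below (illustrative sizes and random data; the generic input is denoted u) verifies this equivalence numerically:

```python
import numpy as np

rng = np.random.default_rng(4)
M, m, K = 4, 2, 6

A = 0.3 * rng.standard_normal((M, M))
B = rng.standard_normal((M, m))
u = rng.standard_normal((K, m))

# Recursive state update xi(k+1) = A xi(k) + B u(k), starting from xi(0) = 0.
xi = np.zeros(M)
for k in range(K):
    xi = A @ xi + B @ u[k]

# Closed form (cf. eq. 11.60): xi(K) = sum_{p=0}^{K-1} A^p B u(K-1-p).
xi_closed = sum(
    np.linalg.matrix_power(A, p) @ B @ u[K - 1 - p] for p in range(K)
)
assert np.allclose(xi, xi_closed)
```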
Till now we have assumed that the matrices [A, B] are fixed and determined explicitly
from an assumed structure of the dynamical system. However, for recurrent dynamical
systems, these matrices are unknown and must be estimated by a suitably designed learning
algorithm. In order to develop a learning algorithm for the matrices A and B, we use in this
section the information back-propagation approach.
11.3 ESTIMATION OF MATRICES [A, B] BY INFORMATION BACK-PROPAGATION
The adaptive algorithms for the output matrices discussed above can be summarized as follows:

Reference                          Learning rule                                          Mixing model
Zhang and Cichocki (1998) [1359]   ΔC = −η ⟨f(y) ξᵀ⟩,  ΔD = η [I − ⟨f(y) xᵀ Dᵀ⟩] D        linear
Salam and Erten (1999) [1032]      Lagrange multiplier approach                           nonlinear
NG algorithm                       ΔC(k) = η(k) [(Λ(k) − R_fy(k)) C(k) − R_fξ(k)],        linear
                                   ΔD(k) = η(k) [Λ(k) − R_fy(k)] D(k)   (11.61)
Therefore, we can calculate the gradient of the loss function ρ(y) with respect to A ∈ ℝ^{M×M}
and B ∈ ℝ^{M×m} as follows

∂ρ/∂a_{ij} = fᵀ(y(k)) C (∂ξ(k)/∂a_{ij}),   (11.62)
∂ρ/∂b_{ij} = fᵀ(y(k)) C (∂ξ(k)/∂b_{ij}),   (11.63)

where the entries of the matrices ∂ξ/∂a_{ij} and ∂ξ/∂b_{ij} are obtained by the following
on-line iterations

∂ξ(k+1)/∂a_{ij} = A (∂ξ(k)/∂a_{ij}) + E_{ij} ξ(k),   (11.64)
∂ξ(k+1)/∂b_{ij} = A (∂ξ(k)/∂b_{ij}) + E_{ij} x(k),   (11.65)

where E_{ij} denotes the matrix with 1 at the (i, j)-th entry and zeros elsewhere.
Since the matrices A and B are sparse in the canonical forms, we do not need to update
all elements of these matrices. Here, we elaborate the learning algorithm for the controller
canonical form. In the controller canonical form, the matrix B is a constant matrix, and
only the first n rows of the matrix A are variable parameters. Denote the vector of the h-th
row of the matrix A by a_h, (h = 1, 2, …, M), and define a matrix

Ξ_h(k) = ∂ξ(k)/∂a_h,   (11.68)

so that the variable rows of A are updated by the gradient rule

Δa_h = −η fᵀ(y(k)) C Ξ_h(k).   (11.69)
The above learning algorithm updates on-line the internal parameters of the dynamical
system. The dynamical system (11.64) and (11.65) is the variational system of the demixing
model with respect to A and B. The purpose of the learning algorithm is to estimate on-
line the derivatives of ξ(k) with respect to A and B. It should be noted that we should
choose very carefully the initial values of the matrices A and B to ensure stability of the
system during the learning process³. However, there is no guarantee that the system will
remain stable during the learning process even if it was stable in the initial iterations. Stability is a
common problem in dynamical system identification. One possible solution is to formulate
the demixing model in the Lyapunov balanced canonical form [896].
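The variational (sensitivity) system can be simulated alongside the state equation and checked against a finite-difference approximation. The sketch below (illustrative sizes and data; a diagonal, stable initial A; E_ij denotes the single-entry matrix) propagates ∂ξ/∂a_ij on-line:

```python
import numpy as np

rng = np.random.default_rng(5)
M, m, K = 3, 2, 50

A = 0.4 * np.eye(M)              # stable initial A (eigenvalues inside the unit circle)
B = rng.standard_normal((M, m))
x = rng.standard_normal((K, m))

i, j = 0, 1                      # sensitivity w.r.t. the entry a_{ij}
E_ij = np.zeros((M, M))
E_ij[i, j] = 1.0

# Run the demixing state equation together with its variational system:
# d xi(k+1)/d a_ij = A (d xi(k)/d a_ij) + E_ij xi(k).
xi = np.zeros(M)
dxi = np.zeros(M)
for k in range(K):
    dxi = A @ dxi + E_ij @ xi    # uses xi(k) before it is updated
    xi = A @ xi + B @ x[k]

# Finite-difference check of the computed sensitivity.
eps = 1e-6
A2 = A.copy()
A2[i, j] += eps
xi2 = np.zeros(M)
for k in range(K):
    xi2 = A2 @ xi2 + B @ x[k]
assert np.allclose(dxi, (xi2 - xi) / eps, atol=1e-4)
```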
11.4 STATE ESTIMATOR - THE KALMAN FILTER

11.4.1 Kalman Filter
The Kalman filter is a powerful approach for estimating the state vector in state-space
models [119, 517, 1373]. The function of the Kalman filter is to generate on-line an estimate
of the state ξ(k). The Kalman filter dynamics is given as follows (see Fig. 11.2)

ξ̂(k+1|k) = A ξ̂(k|k−1) + B x(k) + L e(k),

where L is the Kalman filter gain matrix, and e(k) is called the innovation or residual, which
measures the error between the measured (or expected) output y(k) and the predicted
output ŷ(k|k−1). There are a variety of algorithms with which to update the Kalman filter
gain matrix L as well as the state estimate ξ̂(k|k−1); refer to [517] for more details. In this section,
we introduce a concept called the hidden innovation in order to implement the Kalman filter
for the blind separation and filtering problem [1373]. Since updating the matrices C and
D will produce an innovation in each learning step, we can define a hidden innovation as
follows.

³The system is stable if all eigenvalues of the state matrix A are located inside the unit circle in the complex
plane.
Fig. 11.2 Block diagram of the Kalman filter employed as the state estimator of the demixing model.
The hidden innovation is defined as

ē(k) = ΔC(k) ξ̂(k) + ΔD(k) x(k),

where ΔC(k) = C(k+1) − C(k) and ΔD(k) = D(k+1) − D(k). The hidden innovation
represents the adjusting direction of the output of the demixing system and is used to generate
an a posteriori state estimate. Once we define the hidden innovation, we can employ the
commonly used Kalman filter to estimate the state vector ξ(k) and to update the Kalman
gain matrix L (see Fig. 11.2). The updating rules are described as follows [517, 1373, 1365]:

1. Kalman gain

L(k) = R_ξ(k) Cᵀ(k) [C(k) R_ξ(k) Cᵀ(k) + R_ν(k)]⁻¹.

2. Update the estimate with the hidden innovation

ξ̂(k|k) = ξ̂(k|k−1) + L(k) ē(k).
where R_n(k) and R_ν(k) are the covariance matrices of the noise vector n(k) and the output
measurement noise ν(k), respectively.

One disadvantage of the above algorithm is that it requires knowledge of the covariance
matrices of the noise. Theoretical problems such as the convergence analysis and stability of
the above procedure remain open. However, extensive simulation experiments
show that this algorithm, based on the Kalman filter, can separate the convolved signals
efficiently [1373, 1365].
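A heavily simplified sketch of the Kalman-filter-based procedure is given below. It is not the full algorithm: the state covariance R_ξ is kept fixed instead of being propagated, the perturbations ΔC, ΔD stand in for actual learning steps, and all matrices and noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
M, m = 3, 2

A = 0.5 * np.eye(M)
B = rng.standard_normal((M, m))
C = rng.standard_normal((m, M))
D = np.eye(m)

R_xi = np.eye(M)         # state covariance (held fixed in this simplified sketch)
R_v = 0.1 * np.eye(m)    # measurement-noise covariance (assumed known)

xi_hat = np.zeros(M)
for k in range(100):
    x = rng.standard_normal(m)
    # Pretend one learning step changed C and D by small amounts dC, dD.
    dC = 1e-3 * rng.standard_normal((m, M))
    dD = 1e-3 * rng.standard_normal((m, m))
    e = dC @ xi_hat + dD @ x                               # hidden innovation
    L = R_xi @ C.T @ np.linalg.inv(C @ R_xi @ C.T + R_v)   # Kalman gain
    xi_hat = A @ xi_hat + B @ x + L @ e                    # state propagation + correction
    C, D = C + dC, D + dD

assert np.all(np.isfinite(xi_hat))
```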
11.5 TWO-STAGE SEPARATION ALGORITHM

In this section, we present a two-stage separation algorithm for state-space models. In this
approach, we decompose the separation problem into the following two stages. First, we
separate the mixed signals in the sense of (11.75) [287, 1365, 1374], using the fixed matrices

A = [ 0ᵀ_{M−1}   0
      I_{M−1}    0_{M−1} ],   (11.77)

B = [1, 0, …, 0]ᵀ.   (11.78)
Thus, we have

G(z) = W(z)H(z) = P Λ(z) Q(z⁻¹) = P D(z),   (11.80)

where D(z) = Λ(z) Q(z⁻¹) is a diagonal matrix with polynomials of z⁻¹ in its diagonal
entries. It is easily seen that both A and B are constant matrices. Therefore, we have only
to develop a learning algorithm to update C and D so as to obtain the separated signals in
the sense of (11.75).
On the other hand, we know that if the matrix D̄ in the mixing model satisfies rank(D̄) =
n, then there exists a set of matrices {A, B, C, D} such that the output signal y of the state-
space system (11.3) and (11.4) recovers the independent source signals in the sense of
(11.75). Therefore, we have the following theorem [1365]:

Theorem 11.2 If the matrix D̄ in the mixing model satisfies rank(D̄) = n, then for the given
specific matrices A and B as in (11.77), there exist matrices [C, D] such that the transfer
matrix W(z) of the system (11.3) and (11.4) satisfies equation (11.75).
The two-stage blind deconvolution can be realized in the following way: First, we construct
the matrices A and B of the state equation in the form (11.77), and then we employ the
natural gradient algorithm to update C and D. After the first stage, each output signal is a
filtered version of a distinct source. Then we can employ the blind equalization approach
discussed in Chapter 9 for doubly finite FIR filters to remove the distortion caused by
filtering or convolution of the signals. It should be noted that the two-stage approach also
enables us to recover source signals mixed by a non-minimum phase dynamical system [1367].
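The constant matrices A and B of the first stage act as a multichannel shift register: the state simply stacks delayed input vectors, so that y(k) = C ξ(k) + D x(k) realizes an FIR filter whose taps are the blocks of C together with D. The sketch below (illustrative sizes; a block form of (11.77)-(11.78) with m channels and a few lags, which is an assumption of this example) verifies this interpretation:

```python
import numpy as np

rng = np.random.default_rng(7)
m, Nlag = 2, 3                    # channels and number of stored delays (illustrative)
M = m * Nlag                      # state dimension

# Block shift-register realization: the state stacks delayed inputs.
A = np.zeros((M, M))
A[m:, :-m] = np.eye(M - m)        # shift each block down by one lag
B = np.zeros((M, m))
B[:m, :] = np.eye(m)              # feed x(k) into the first block

xi = np.zeros(M)
xs = []
for k in range(10):
    x = rng.standard_normal(m)
    xs.append(x)
    xi = A @ xi + B @ x

# After the loop, xi stacks [x(9); x(8); x(7)] from top to bottom.
expected = np.concatenate([xs[-1], xs[-2], xs[-3]])
assert np.allclose(xi, expected)
```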
Assuming i.i.d. sources, the joint pdf of the source signals factorizes as

p(s̄) = Π_{i=1}^{m} Π_{k=1}^{N} p_i(s_i(k)),

where {p_i(·)} is the probability density of the source signals. In order to measure the mutual
independence of the output signals, we employ the Kullback-Leibler divergence as a criterion,
which is an asymmetric measure of the distance between two different probability distributions,

KL(p‖q) = ∫ p(y) log (p(y)/q(y)) dy,

where we replace p_i(·) by certain approximate density functions q_i(·) for the estimated sources,
since we do not know the true probability distributions of the original source signals.
Provided that the initial conditions are set to ξ(1) = 0, we have the following relation [1365]

ȳ = W x̄,   (A.2)

where W is the block lower-triangular Toeplitz matrix built from the impulse response {H_p}
of the demixing filter W(z),

W = [ H₀         0        ⋯  0
      H₁         H₀       ⋯  0
      ⋮                   ⋱  ⋮
      H_{N−1}    H_{N−2}  ⋯  H₀ ].

Using the relation (A.2), we derive the loss function ρ(W(z)) as follows

ρ(W(z)) = −log |det H₀| − (1/N) Σ_{i=1}^{m} Σ_{k=1}^{N} log q_i(y_i(k)).   (A.6)

Note that p(x̄) was not included in (A.6) because it does not depend on the set of parameters {H_p}.