An approximate dynamic programming based approach to dual adaptive control

Jong Min Lee^a, Jay H. Lee^b,*

^a Chemical and Materials Engineering, University of Alberta, Edmonton, AB T6G 2G6, Canada
^b School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Dr. NW, Atlanta, GA 30332-0100, USA

* Corresponding author. E-mail addresses: jongmin.lee@ualberta.ca (J.M. Lee), jay.lee@chbe.gatech.edu (J.H. Lee).
Journal of Process Control 19 (2009) 859–864. doi:10.1016/j.jprocont.2008.11.009

Article history: Received 3 July 2008; received in revised form 3 November 2008; accepted 5 November 2008.

Keywords: Dual control; Stochastic optimal control; Adaptive control; Stochastic dynamic programming; Approximate dynamic programming

Abstract

In this paper, an approximate dynamic programming (ADP) based strategy is applied to the dual adaptive control problem. The ADP strategy provides a computationally amenable way to build a significantly improved policy by solving dynamic programming on only those points of the hyper-state space sampled during closed-loop Monte Carlo simulations performed under known suboptimal control policies. The potential of the ADP approach for generating a significantly improved policy is illustrated on an ARX process with unknown/varying parameters.

© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

Control of systems with unknown parameters offers an interesting challenge. The usual approach is to combine parameter estimation and deterministic control into an adaptive control strategy. This certainty equivalence (CE) principle disregards potentially significant parameter uncertainties, leading to severe robustness problems such as the bursting phenomenon [1]. The one-step-ahead optimal controller, referred to as the cautious controller [2], yields very small control signals when the variances of the unknown parameters become large. Both controllers learn about the parameters in a passive manner, meaning that they do not make exploratory moves to actively generate information about important parametric uncertainties.
The optimal controller for this problem has dual goals in that it should balance two competing objectives: control and exploration. By gaining more parameter information (i.e., exploration) when needed, better control performance can be achieved in the future. Feldbaum [3] showed that a dynamic program (DP) can be solved to derive an optimal solution to the dual control problem. Åström and Helmersson [4] solved a problem involving an integrator with an unknown gain by discretizing the hyper-state and numerically solving the DP on the discretized points. Despite several interesting insights, the approach is not generalizable to larger problems due to the exponential growth of the computational requirement with respect to the dimension of the hyper-state.
Given the computational intractability of the DP-based solution, most researchers have resorted to suboptimal solutions by incorporating cautious and active probing features into simple controllers. Examples include controllers derived from a one-step-ahead objective function augmented with a function of the estimation errors [5] or with perturbation signals [6,7]. Multi-stage control objective functions have also been used in several approximated forms, with parameter estimation based on a stochastic process model [8] or with an approximation of the open-loop control objective function [9]. Though these suboptimal designs are computationally simpler, they are problem-specific and have several disadvantages [10].
While application of dual control to industrial processes is still in its infancy, there are some precedents. For example, [11] reports a successful application of a suboptimal dual controller to an industrial paper coating process. The paper showed that the dual controller has a significant advantage over the standard CE controller in regulating processes whose gains drift over time and possibly switch sign.
In this work, a recently proposed method of approximate dynamic programming (ADP) [12,13] is applied to solving the dual control problem. A tailored ADP procedure for stochastic optimal control problems is presented. The ADP approach attempts to solve the stochastic optimal control problem through DP, but only approximately and within limited regions of the hyper-state space defined by Monte Carlo simulations with several known suboptimal controllers. This approach has been used to solve deterministic nonlinear control problems [14,12,13], but its applicability to stochastic control problems has not been examined yet. The major contribution of the paper is to show that the ADP procedure can provide a numerically tractable way of
solving practically-sized dual optimal control problems with a continuous hyper-state space. Similar approaches found in reinforcement learning (RL) [15] and neuro-dynamic programming (NDP) [16] are only applicable to systems with a discrete state space. See [13,17] for a review of the ADP approach and further discussion of its comparison with RL and NDP. In addition, previous approaches of approximately solving the stochastic DP have only been applied to the one-parameter case because of the exponential growth of computation with respect to the problem size [18,19]. We present an ARX example with two unknown parameters and show that the ADP approach derives a significantly improved policy with the desired dual feature.
2. Approximate dynamic programming for stochastic adaptive control
2.1. Stochastic adaptive control
We consider a discrete time system

x_{k+1} = f(x_k, u_k, \theta_k, e_k), \quad k \in \mathbb{N},    (1)

with a state vector x_k, which is assumed to be measured, and a manipulated input vector u_k. θ_k is a vector containing the unknown parameters of the model, and {e_k} is a sequence of exogenous noises. The structure of f is assumed to be known.
Let us consider an optimal control policy that minimizes the following infinite horizon cost at each time k:

E\left[\sum_{t=0}^{\infty} \alpha^t \phi(x_{k+t}, u_{k+t}) \,\middle|\, \xi_k\right],    (2)

where α ∈ (0, 1) is a discount factor, φ is a stage-wise cost, and the expectation operator E is taken over the distributions of e and θ. If θ and e are random, x_k is random. The random state, called the hyper-state (or information state), has a joint probability density function and will be denoted as ξ_k. For instance, if θ and e are Gaussian processes and f is linear, ξ_k is a Gaussian joint probability density function fully described by its mean and covariance matrix.
The optimal control policy can be derived by solving the following Bellman equation off-line [20]:

J^*(\xi_k) = \min_{u_k} E\left[\phi(x_k, u_k) + \alpha J^*(\xi_{k+1}) \,\middle|\, \xi_k\right],    (3)

where J^* denotes the optimal cost-to-go function, which maps the hyper-state ξ to the cost-to-go value under the optimal control. Once J^* is calculated, the optimal control policy, μ^*, can be derived from the following single-stage optimization problem:

u_k = \mu^*(\xi_k) = \arg\min_{u_k} E\left[\phi(x_k, u_k) + \alpha J^*(\xi_{k+1}) \,\middle|\, \xi_k\right].    (4)
Note that ξ_{k+1} is a stochastic variable and is affected by the choice of control action u_k. The expectation calculation for each evaluation of a candidate u_k requires integrating over all possible successor states ξ_{k+1} (generally over ℝ^n for the continuous hyper-state case) with the associated probability density function. For a numerical solution, one must densely grid the hyper-state space, which translates into an exponential growth in computation with respect to the hyper-state dimension. Though the optimal controller will have the desired dual feature, the formulation is computationally intractable except for very small problems.
2.2. Approximate dynamic programming

The ADP approach attempts to circumvent the curse of dimensionality by confining the solution to a very small fraction of the hyper-state space. This restricted space is usually defined by data sampled from Monte Carlo simulations of the closed-loop system with known suboptimal policies, subject to various possible disturbances and operating conditions. A function approximator is then built to provide a continuous cost-to-go function estimate based on the sampled data. The approximation is improved through iteration of the Bellman equation (as in value iteration) or iteration between the Bellman equation and the policy evaluation (as in policy iteration).

The proposed approach for dual control problems is outlined as follows. We note that Steps 1–4 are performed off-line, and the resulting cost-to-go function is used online in Step 5. In addition, it is a model-based approach.
Step 1: Perform stochastic simulations of the closed-loop system with a chosen suboptimal control policy (μ^0) under all possible operating conditions. The coverage of the hyper-state space by the visits during the stochastic simulation influences how close the converged cost-to-go function is to the optimal one. Thus, one may simulate multiple control policies that exhibit different characteristics sought in the final policy or that perform well in the different operating regimes of interest. In our problem, one may want to use control policies with different levels of cautiousness and active exploration.
Step 2: For each state visited during the simulation, we calculate the infinite horizon cost-to-go, J^0, using the simulation data according to

J^0(\xi_k) = \sum_{t=0}^{\infty} \alpha^t \phi(x_{k+t}, u_{k+t}).    (5)

The expectation operator is not explicitly evaluated in this step because the function approximator we employ subsequently smoothens the fluctuations in the cost-to-go values for the sampled data. We note that the cost-to-go values calculated from Eq. (5) can be different under different initial policies for the same state because the initial policies are not optimal in general. However, the optimal cost-to-go value for each state will be unique and satisfy the Bellman equation [20]. More importantly, the initial estimates are not so critical because the cost-to-go values are continually updated in the value iteration step by evaluating the expectation operator explicitly.
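To make Step 2 concrete, the discounted sum in Eq. (5) can be evaluated for every state visited along a simulated trajectory with a single backward pass. The sketch below is a minimal Python illustration; the function name and the truncation of the infinite-horizon tail at the end of the run are our own choices, not prescribed by the paper.

```python
import numpy as np

def cost_to_go(stage_costs, alpha=0.98):
    """Discounted cost-to-go J^0(xi_k) of Eq. (5) for every time step of one
    closed-loop run, via the backward recursion J_k = phi_k + alpha * J_{k+1}.
    The infinite-horizon tail beyond the end of the run is truncated."""
    J = np.zeros(len(stage_costs))
    tail = 0.0
    for k in reversed(range(len(stage_costs))):
        tail = stage_costs[k] + alpha * tail
        J[k] = tail
    return J

# Example: stage-wise cost phi = y^2 along one simulated closed-loop run
y = np.random.randn(500)            # stand-in for a simulated output trajectory
J0 = cost_to_go(y**2, alpha=0.98)
```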
Step 3: With the sampled points and the initial cost-to-go values, we construct a function approximator to approximate the cost-to-go as a function of the continuous hyper-state variables, denoted as J̃. In this work, we use a local averager of the following form:

\tilde{J}(\xi^0) = \beta_0 J_0 + \sum_{i=1}^{K} \beta_i J(\xi^i)    (6)

with

\sum_{i=0}^{K} \beta_i = 1, \quad \beta_i \geq 0, \quad i = 0, \ldots, K,    (7)

where ξ^0 is a query point, J_0 is a bias term, and K is the number of neighboring points in the data set. It can be proved that this local averaging scheme guarantees the convergence of the value iteration [13]. A quadratic penalty term based on an estimate of the local data density is added to discourage excursions into regions of low data density:
\tilde{J}(\xi^0) \leftarrow \tilde{J}(\xi^0) + J_{\mathrm{bias}}(\xi^0),    (8)

J_{\mathrm{bias}}(\xi^0) = A \, H\!\left(\frac{1}{f_X(\xi^0)} - \rho\right) \left(\frac{1/f_X(\xi^0) - \rho}{\rho}\right)^2,    (9)
where f_X(ξ^0) is a density estimate around ξ^0 ∈ ℝ^{m_0}, H is the Heaviside step function, A is a scaling parameter, and ρ is a threshold value, which is inversely proportional to the data density at a given distance from ξ^0. The density estimate is obtained as a sum of kernel functions placed at each sample:

f_X(\xi^0) = \frac{1}{N \sigma_B^{m_0}} \sum_{i=1}^{N} K\!\left(\frac{\xi^0 - \xi^i}{\sigma_B}\right),    (10)

where N is the number of neighbors used for estimating the cost-to-go, and K is the multivariate Gaussian kernel

K\!\left(\frac{\xi^0 - \xi^i}{\sigma_B}\right) = \frac{1}{(2\pi\sigma_B^2)^{m_0/2}} \exp\!\left(-\frac{\|\xi^0 - \xi^i\|_2^2}{2\sigma_B^2}\right).    (11)

The bandwidth parameter σ_B is determined by considering the distance range of the collected data. This also defines the buffer zone, ‖ξ^0 − ξ^i‖_2 ≤ σ_B, inside which no penalty term is assigned. A controls the rate of increase of the penalty term, and it is adjusted to give J_{bias}(ξ^0) = J_{max}. The penalty term does not affect the convergence of the off-line learning step since it simply adds constant values to the cost-to-go of states in regions with insufficient data points. The detailed procedure for designing the penalty term and the associated convergence property can be found in [13]. Though the optimality gap from the full DP solution is still an open question, the suggested scheme derives an improved control policy from the starting control policies based on the DP principle.
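The following is a minimal sketch of the k-nearest-neighbor averager of Eq. (6) together with the kernel density estimate of Eqs. (10)–(11) and the penalty of Eq. (9). The class name, the inverse-distance choice for the convex weights β_i, the use of all stored samples in the density estimate, and the omission of the bias term β_0 J_0 are our own simplifications; the paper leaves these design details to [13].

```python
import numpy as np

class LocalAverager:
    """k-nearest-neighbor cost-to-go averager (Eq. (6)) with a density-based penalty (Eqs. (9)-(11))."""

    def __init__(self, xi, J, K=4, sigma_B=0.35, rho=0.05, A=0.87, J_max=2500.0):
        self.xi = np.asarray(xi, dtype=float)   # sampled hyper-states, shape (N, m)
        self.J = np.asarray(J, dtype=float)     # cost-to-go values at the samples
        self.K, self.sigma_B, self.rho, self.A, self.J_max = K, sigma_B, rho, A, J_max

    def _density(self, q):
        """Gaussian kernel density estimate f_X(xi^0) of Eqs. (10)-(11) over the stored samples."""
        N, m = self.xi.shape
        d2 = np.sum((self.xi - q) ** 2, axis=1)
        kern = np.exp(-d2 / (2.0 * self.sigma_B ** 2)) / (2.0 * np.pi * self.sigma_B ** 2) ** (m / 2.0)
        return kern.sum() / (N * self.sigma_B ** m)

    def __call__(self, q):
        q = np.asarray(q, dtype=float)
        d = np.linalg.norm(self.xi - q, axis=1)
        nn = np.argsort(d)[: self.K]
        w = 1.0 / (d[nn] + 1e-8)                # one possible convex weighting (assumption)
        w /= w.sum()
        J_tilde = float(w @ self.J[nn])         # Eq. (6) with the bias term beta_0*J_0 omitted
        inv_f = 1.0 / max(self._density(q), 1e-12)
        if inv_f > self.rho:                    # Heaviside step of Eq. (9)
            J_tilde += self.A * ((inv_f - self.rho) / self.rho) ** 2
        return min(J_tilde, self.J_max)         # simple cap at J_max, as used off-line in Section 3
```

In the examples of Section 3, K = 4 and the penalty parameters (A, ρ, σ_B, J_max) take the values reported there.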
Step 4: We improve the cost-to-go approximation by performing the value iteration

J^{i+1}(\xi_k) = \min_{u_k} E\left[\phi(x_k, u_k) + \alpha \tilde{J}^i(\xi_{k+1}) \,\middle|\, \xi_k\right],    (12)

where the superscript i denotes the ith iteration step. J^{i+1} is calculated for every ξ_k in the data set obtained from the simulations, and ξ_{k+1} is a successor state after applying the control action u_k. This is repeated until convergence, i.e., until ‖J^{i+1}(ξ) − J^i(ξ)‖_∞ is negligibly small. We note that the initial cost-to-go estimates do not affect the quality of the final cost-to-go function [20], but their effect on the convergence rate is still an open question. We use value iteration because the policy evaluation step of policy iteration algorithms with local approximators involves adding data at each iteration, the storage requirement of which may become excessive. The expectation is evaluated by sampling the innovation term, which is affected by the control action. This explicit evaluation of the expectation operator makes the learning step less sensitive to the statistical fluctuations seen in the sample average approximations used in other ADP approaches (e.g., temporal-difference learning) [15]. Since the minimization is non-convex, we perform a global search over the input space by discretizing it. Each candidate input gives a probability distribution of the innovation term, from which the possible outcomes of the hyper-state are sampled and the current estimates of their cost-to-go values are averaged.
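The value iteration of Eq. (12) with the sampled expectation can be organized as sketched below. This is a minimal illustration under our own interface assumptions: the model-specific pieces (stage cost, successor map, innovation sampler, approximator) are passed in as callables, and the convergence test uses the relative criterion of Eq. (25) introduced in Section 3.

```python
import numpy as np

def value_iteration(xi_data, J0, stage_cost, successor, sample_innovations,
                    approximator_factory, u_grid, alpha=0.98, tol=0.03,
                    n_samples=50, max_iter=200):
    """Value-iteration sweep of Eq. (12) over the sampled hyper-states.

    xi_data               : list/array of sampled hyper-states
    J0                    : initial cost-to-go values from Eq. (5)
    stage_cost(xi, u)     : stage-wise cost phi
    successor(xi, u, eps) : next hyper-state for a sampled innovation eps
    sample_innovations(xi, u, n) : n samples of the innovation for (xi, u)
    approximator_factory(xi, J)  : builds a cost-to-go approximator (e.g., LocalAverager)
    u_grid                : discretized candidate inputs (global search)
    """
    J = np.asarray(J0, dtype=float).copy()
    for it in range(max_iter):
        J_tilde = approximator_factory(xi_data, J)
        J_new = np.empty_like(J)
        for i, xi in enumerate(xi_data):
            best = np.inf
            for u in u_grid:
                eps = sample_innovations(xi, u, n_samples)
                q = np.mean([stage_cost(xi, u) + alpha * J_tilde(successor(xi, u, e))
                             for e in eps])          # sampled expectation
                best = min(best, q)
            J_new[i] = best
        # relative convergence test in the spirit of Eq. (25)
        if np.max(np.abs((J_new - J) / np.maximum(np.abs(J_new), 1e-8))) < tol:
            return J_new, it + 1
        J = J_new
    return J, max_iter
```

The LocalAverager sketched above can be passed directly as approximator_factory.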
Step 5: The converged cost-to-go values, J̃^{N_c}, can be directly implemented as a control policy by solving the following minimization in real-time:

u_k = \arg\min_{u_k} E\left[\phi(\xi_k, u_k) + \alpha \tilde{J}^{N_c}(\xi_{k+1}) \,\middle|\, \xi_k\right],    (13)

where N_c is the iteration number at convergence. The single-stage optimization should be computationally simpler than the original infinite horizon problem.
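Online (Step 5), the same sampled expectation is used inside the one-stage minimization of Eq. (13). A minimal sketch reusing the interface assumed above:

```python
import numpy as np

def adp_control(xi_k, J_tilde, stage_cost, successor, sample_innovations,
                u_grid, alpha=0.98, n_samples=50):
    """One-stage minimization of Eq. (13) over a discretized candidate input set."""
    best_u, best_q = None, np.inf
    for u in u_grid:
        eps = sample_innovations(xi_k, u, n_samples)
        q = np.mean([stage_cost(xi_k, u) + alpha * J_tilde(successor(xi_k, u, e))
                     for e in eps])
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```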
3. Numerical examples

We illustrate the ADP approach with an Auto-Regressive eXogenous (ARX) SISO system. The first two examples have one unknown constant parameter and one time-varying parameter, respectively. The third example has two unknown time-varying parameters, for which no DP-based approach has been reported yet [9,19].
3.1. Problem statement

Let us consider an ARX SISO linear system described by

y_{k+1} = a_{k+1} y_k + b_{k+1} u_k + e_{k+1},    (14)

where {e_k} is an independent, identically distributed (i.i.d.) Gaussian noise sequence with mean zero and variance σ², and a and b are time-varying parameters. The stage-wise cost φ in Eq. (2) is defined as y_{k+1}^2.
Let θ_k = [a_k  b_k]^T and ϕ_k = [y_{k−1}  u_{k−1}]^T. Then the process can be written as

y_k = \varphi_k^T \theta_k + e_k,    (15)

where the parameter vector, θ, is changing according to a Gaussian Markov process

\theta_{k+1} = \Phi \theta_k + \Gamma \omega_k.    (16)
Given the observed outputs and inputs available at time k, the estimator generates the information on the conditional probability distribution of θ within the Kalman filter framework as follows:

\hat{\theta}_{k+1|k} = \Phi \hat{\theta}_{k|k-1} + K_k \left(y_k - \varphi_k^T \hat{\theta}_{k|k-1}\right),    (17)

K_k = \Phi P_{k|k-1} \varphi_k \left(\varphi_k^T P_{k|k-1} \varphi_k + \sigma^2\right)^{-1},    (18)

P_{k+1|k} = \left(\Phi - K_k \varphi_k^T\right) P_{k|k-1} \Phi^T + \Gamma R_\omega \Gamma^T,    (19)

where R_ω is the covariance of ω in Eq. (16).
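The estimator of Eqs. (17)–(19) is a standard Kalman filter on the parameter vector. The sketch below is a minimal NumPy version for the two-parameter case; the function and argument names are ours, and the example values at the end are purely illustrative.

```python
import numpy as np

def kf_param_update(theta_pred, P_pred, y_k, phi_k, Phi, Gamma, R_w, sigma2):
    """One step of the Kalman filter of Eqs. (17)-(19) for the ARX parameters.

    theta_pred, P_pred : predicted estimate theta_{k|k-1} and covariance P_{k|k-1}
    y_k, phi_k         : current output and regressor [y_{k-1}, u_{k-1}]
    Phi, Gamma, R_w    : parameter-transition matrices and noise covariance of Eq. (16)
    sigma2             : variance of the output noise e_k
    """
    innov = y_k - phi_k @ theta_pred                      # scalar innovation
    S = phi_k @ P_pred @ phi_k + sigma2                   # innovation variance
    K = Phi @ P_pred @ phi_k / S                          # Kalman gain, Eq. (18)
    theta_next = Phi @ theta_pred + K * innov             # Eq. (17)
    P_next = (Phi - np.outer(K, phi_k)) @ P_pred @ Phi.T + Gamma @ R_w @ Gamma.T  # Eq. (19)
    return theta_next, P_next

# One illustrative update with phi_k = [y_{k-1}, u_{k-1}] (values are arbitrary)
Phi, Gamma = 0.9 * np.eye(2), np.eye(2)
R_w = np.array([[0.1, 0.05], [0.05, 0.1]])
theta, P = np.zeros(2), np.eye(2)
theta, P = kf_param_update(theta, P, y_k=0.3, phi_k=np.array([0.2, -0.1]),
                           Phi=Phi, Gamma=Gamma, R_w=R_w, sigma2=0.25)
```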
3.2. Passive control policies

The CE policy calculates a control action at each sample time as if the estimate θ̂_k were exact:

u_k^{CE} = \mu^{CE}(x_k, \hat{\theta}_k).    (20)
The cautious controller minimizes the cost function of Eq. (2) only for a single step:

\min_{u_k} E\left[(y_{k+1} - r_{k+1})^2\right],    (21)

where r is the set-point (here set to 0). The above minimization yields

u_k = \frac{r_{k+1}\hat{b}_{k+1|k} - \left(\hat{a}_{k+1|k}\hat{b}_{k+1|k} + P^{12}_{k+1|k}\right) y_k}{\hat{b}^2_{k+1|k} + P^{22}_{k+1|k}},    (22)

where P^{ij} is the (i, j) element of the covariance matrix P.
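For the scalar ARX case of Eq. (14), the CE law obtained by making the predicted output equal the set-point, and the cautious law of Eq. (22), can be written in closed form. A minimal sketch (helper names are ours; the set-point is taken as zero, as in the paper):

```python
def ce_control(y_k, a_hat, b_hat, r=0.0):
    """Certainty-equivalence control in the sense of Eq. (20): treat the estimates as exact."""
    return (r - a_hat * y_k) / b_hat

def cautious_control(y_k, a_hat, b_hat, P, r=0.0):
    """Cautious (one-step-ahead) control of Eq. (22).
    P is the 2x2 parameter covariance with P[0][1] = cov(a, b) and P[1][1] = var(b)."""
    num = r * b_hat - (a_hat * b_hat + P[0][1]) * y_k
    den = b_hat ** 2 + P[1][1]
    return num / den
```

As Eq. (22) shows, a large var(b) in the denominator shrinks the cautious control action, which is the turn-off behavior discussed in the examples.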
3.3. Example 1: An integrator with a step change in an unknown gain

We first consider a simple case where a is unity and only b is an unknown constant that can jump from the initial value of 0.5 to a value between −15 and 15, except 0. It is a somewhat idealized scenario, in which the timing of the jump is known to the controller. Hence, the covariance of the estimator is reset to a high value of 200 at the time of the jump.

The initial parameter value is assumed to be known exactly (P_0 = 0), and the variance of the exogenous noise term is set to unity. In the particular realization we simulate, the noise is kept at zero up to some time point (e.g., k = 100). The parameter jump occurs in the middle of that quiescent period (e.g., at k = 10). The gain of the integrator is modeled with Φ = 1 and Γ = 1 in Eq. (16), and the hyper-state of the process is ξ_k = [y_k, b̂_{k+1|k}, P_{k+1|k}]^T.
3.3.1. Data generation

Closed-loop simulations were performed using the following control policies: (1) the one-step-ahead CE controller, (2) the cautious controller, and (3–4) the CE and the cautious controllers with dithered inputs.

We simulated parameter jumps from the nominal value to b = 5, 10, 15. The dither signals were randomly sampled from the uniform distribution on [−0.1, 0.1]. Three sets of dither signals, each lasting for four sample periods, were injected at regular intervals during quiescent periods. Three and five realizations of e were simulated for the non-dithered and dithered policies, respectively. Three different realizations of the input dither signals were also simulated for each realization of the dithered policies. Each simulation lasted for 500 sample times, and the total number of data points obtained from the simulations was 3849.
3.3.2. Value iteration

We solve Eq. (12) with φ(x_k, u_k) = y_{k+1}^2 and α = 0.98. The expectation operator was evaluated by sampling 50 innovation values (ε_{k+1}) for each candidate u_k evaluated in the optimization. Note that

\epsilon_{k+1} = y_{k+1} - y_k - \hat{b}_{k+1|k} u_k    (23)

and hence has the following distribution:

\epsilon_{k+1} \sim N\!\left(0, \; 1 + u_k^2 P_{k+1|k}\right).    (24)
We used a four-nearest-neighbor averager for the cost-to-go approximation, and the value iteration converged after 24 runs with

e_{\mathrm{rel}} = \left\| \frac{J^i(\xi_k) - J^{i+1}(\xi_k)}{J^{i+1}(\xi_k)} \right\|_\infty < 0.03.    (25)

The quadratic penalty, J_{bias} in Eq. (9), was designed with the parameter choices A = 0.87, ρ = 0.047, σ_B = 0.35, and J_{max} = 2500. To bound the cost-to-go in the off-line iteration steps, the additive penalty term was set to J_{max} whenever J̃(ξ^0) ≥ J_{max}.
3.3.3. On-line performance

Different parameter jump cases, including untrained cases, were tested to compare the performance of the resulting policy (labeled the ADP policy) with those of the initial suboptimal control policies. For each case, the total cost over 50 sample times, \sum_{k=1}^{50} y_k^2, was calculated. The total costs averaged over 10 separate realizations are compared in Table 1. Whereas the average performance of the ADP controller does not vary much with different parameter values, the other control policies suffer from bursting or turn-off phenomena, leading to poor average performance.

The performance disparities were observed during the transient period after the parameter jump, when exogenous noises started entering the system. Fig. 1 shows a sample result of the output regulation under three policies (CE, cautious, and ADP). At time 10, b jumps from 0.5 to 15 and the covariance is reset to 200. White noise e starts entering the system at time 15. The figure shows that the ADP controller injects a probing signal at time 10 and achieves the best overall regulation performance by actively reducing the uncertainty of b, whereas the passive policies do not move the control actions until time 15 and their performance is degraded either by bursting of the output or by turn-off of the control signal.

Table 1
Averaged cost over 50 sample steps based on 10 realizations of e.

b    CE      Cautious   Dithered CE   Dithered Cautious   ADP
15   630.5   152.3      63.1          79.8                52.9
10   184.1   163.9      156.0         64.3                56.7
12   630.1   109.3      60.0          52.3                51.2
12   401.5   85.8       875.0         51.7                46.6
7    125.9   126.8      60.0          83.8                46.6
7    345.1   167.4      84.1          65.2                60.7

Fig. 1. A sample run of the parameter jump to b = 15: y and u.
3.4. Example 2: Time-varying gain for the integrator process

We consider a more realistic scenario of nonzero R_ω and σ. To bound the parameter range, we use Φ = 0.9987 and Γ = 0.05, which gives b unit variance. The performance of the CE controller and the cautious controller deteriorates when the parameter estimate becomes close to zero, even with very small error. Hence, we also simulate the performance of the CE and cautious controllers with input dither signals (sampled from the uniform distribution on [−15, 15]), which enter the input channel whenever |b̂| < 0.2.

From 25 sets of simulations under the four different control policies (CE, cautious, and CE/cautious + input dither signals), 3323 data points (hyper-state vs. cost-to-go) were obtained. The value iteration step converged after 21 runs using the same convergence criterion as in the first example, and the quadratic penalty term was designed with K = 4, σ_B = 0.4, ρ = 0.1064, A = 0.8709, and J_{max} = 2500. Twenty new realizations were performed, and the average on-line performances are compared in Table 2. Though the performance of the dithered cautious controller appears comparable to that of the ADP-based dual controller, a systematic guideline on a proper magnitude and timing for injecting dither signals does not exist, and choosing them is generally very difficult in practice.
3.5. Example 3: ARX model with two time-varying parameters

We consider the ARX model of Eq. (14) with both a and b as time-varying parameters. Suppose that the unknown parameters can be modeled with Φ = 0.9I and Γ = I in Eq. (16). I is the 2-by-2 identity matrix, and {ω} in Eq. (16) is a sequence of i.i.d. Gaussian noise vectors, of which the mean and the covariance matrix are

\hat{\omega} = [0 \;\; 0]^T, \qquad R_\omega = \begin{bmatrix} 0.1 & 0.05 \\ 0.05 & 0.1 \end{bmatrix}.    (26)
3.5.1. Data generation

Four suboptimal controllers were used to generate the data: the multi-step CE controller with prediction and control horizons both equal to 2, the cautious controller, and the multi-step CE and the cautious controllers with input dither signals added at each sample time. The dither signals were sampled from the uniform distribution on [−1, 1], and {e_k} has mean 0 and standard deviation 0.5.
The hyper-state of the system, ξ_k, is

\xi_k = \left[y_k, \; \hat{a}_{k+1|k}, \; \hat{b}_{k+1|k}, \; P^{11}_{k+1|k}, \; P^{22}_{k+1|k}, \; P^{12}_{k+1|k}\right]^T.    (27)
We performed 50 runs of simulations with each suboptimal control
policy, 200 realizations in total. Each simulation run lasted 1500
sample times and a total of 10,114 data points were obtained after
removing similar points and outliers.
3.5.2. Value iteration

The value iteration was performed as in the previous examples with α = 0.98. For each data point we also constructed a candidate action set composed of the four different suboptimal control actions implemented in the simulation step as well as discretized action values, to avoid local optima. For each candidate action, 30 values of the innovation term were sampled for the purpose of evaluating the expectation operator:

\epsilon_{k+1} \sim N\!\left(0, \; \sigma^2 + P^{11}_{k+1|k} y_k^2 + P^{22}_{k+1|k} u_k^2 + 2 P^{12}_{k+1|k} y_k u_k\right).    (28)
With the convergence criterion e_{rel} < 0.05, the value iteration converged after 31 runs. The parameters of the penalty term for the 4-nearest-neighbor averager were chosen as A = 0.87, ρ = 170, σ_B = 0.9295, and J_{max} = 2500, where σ_B corresponds to 4% of the normalized data range.
3.5.3. On-line performance
In order to quantify how much performance improvement has
been achieved, we perform a couple of statistical tests using the
results from independent Monte Carlo runs, each containing
500 time steps. Two different control policies can be compared
by the following statistical test [21,8].
(1) Consider data sequences defined as

D_i^{AB} = \bar{\phi}_i^A - \bar{\phi}_i^B,    (29)

where A and B represent two different control policies, and \bar{\phi}_i is the average cost per time step for i = 1, …, N_{test}, with N_{test} being the number of Monte Carlo test runs. From the central limit theorem, D_i^{AB} follows a normal distribution. Hence, the statement D^{AB} > 0 implies that controller B is better than controller A. The test statistic \bar{D}^{AB}/\sigma^{AB} allows the evaluation of the probable error α^{AB}, which is the probability that D^{AB} is negative. Then the confidence (1 − α^{AB}) is the probability that D^{AB} is positive. The value of α^{AB} is taken from the normal distribution table [22].
(2) We also define the improvement of control policy B over A at each test run as

I_i = \frac{\bar{\phi}_i^A - \bar{\phi}_i^B}{\bar{\phi}_i^A}.    (30)

The following average improvement quantifies how much improvement is achieved at a given confidence:

\bar{I} = \frac{1}{N_{\mathrm{test}}} \sum_{i=1}^{N_{\mathrm{test}}} I_i.    (31)
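A minimal sketch of the two tests of Eqs. (29)–(31) (names are ours); phi_A and phi_B hold the per-run average costs \bar{\phi}_i under policies A and B, and the confidence is read from the normal distribution as in the paper:

```python
import numpy as np
from math import erf, sqrt

def compare_policies(phi_A, phi_B):
    """Confidence that policy B outperforms A (Eq. (29)) and the average
    relative improvement of B over A (Eqs. (30)-(31))."""
    phi_A, phi_B = np.asarray(phi_A, float), np.asarray(phi_B, float)
    D = phi_A - phi_B                               # Eq. (29)
    t = D.mean() / D.std(ddof=1)                    # test statistic D_bar / sigma
    confidence = 0.5 * (1.0 + erf(t / sqrt(2.0)))   # 1 - alpha = P(D > 0) under normality
    improvement = np.mean((phi_A - phi_B) / phi_A)  # Eqs. (30)-(31)
    return confidence, improvement
```

Applied to the last row of Table 3 (D̄ = 2.00, σ = 2.29), this gives a confidence of roughly 81%, consistent with the reported 81.1%.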
The comparisons between the controllers are presented in Table 3. We simulated two different realizations of {e} for 50 different sequences of {a, b}. This gives N_{test} = 100, with each simulation lasting 500 time steps (i.e., T = 500). The comparisons show that the ADP policy offers a significant performance improvement over the non-randomized control policies (CE and cautious). It is also better than the best starting controller (the cautious controller injected with dithering input signals) with 81.1% confidence. The improvement of 24.6% is statistically significant, which tells us that the ADP controller uses a probing signal judiciously when it is required. The hypothesis that the ADP policy is better than the starting controllers can be accepted, with the level of significance (α) being less than 20%.
Table 2
Averaged cost over 20 realizations of e: 500 sample times.

       CE        Cautious   Dithered CE   Dithered Cautious   ADP
Avg.   136,120   894        13,046        887                 838
Max.   387,999   1016       327,130       994                 906
Min.   7722      730        6708          690                 654
Table 3
Statistical comparison between different control policies of Example 3: 1 = CE, 2 = Cautious, 3 = Dithered CE, 4 = Dithered Cautious, and 5 = ADP.

Policies   D̄^{AB}   σ^{AB}   D̄^{AB}/σ^{AB}   Ī (%)   (1 − α) (%)
1–5        698       768.3    0.91             96.0     81.9
2–5        14.3      16.9     0.85             74.4     80.2
3–5        290       241      1.20             94.6     88.5
4–5        2.00      2.29     0.88             24.6     81.1
Fig. 2. Sample result of the output and the input: the ADP controller.
Fig. 3. Sample result of the output and the input: the dithered cautious controller.
Fig. 4. Comparison of parameter estimates between the ADP controller and the dithered cautious controller.
Figs. 2–4 show a typical result of the Monte Carlo simulations. In the figures we compare the ADP controller with the dithered cautious controller, which is the best among the starting controllers. Around time 168, the output regulation performance of the cautious controller becomes very poor, even though a dithering signal is added at each time step. This demonstrates that dithering signals of appropriate magnitudes must be injected for better estimation of the uncertain parameters. However, this is a very difficult design task, as the definition of "appropriate" varies from case to case and from period to period. Fig. 3 shows that the magnitudes of the dithering signals for the cautious controller around time 160 are insufficient when compared to the control actions of the ADP controller shown in Fig. 2. Fig. 4 shows that the resulting parameter estimates of the ADP controller are much closer to the real ones than those of the dithered cautious controller.
4. Conclusions

We proposed an approximate dynamic programming based strategy to solve a stochastic optimal control problem called the dual control problem. Starting from some known suboptimal control policies, including several passive and randomized policies, the ADP approach derived a superior control policy, which actively reduces the parameter uncertainty, leading to a significant performance improvement. The algorithm uses Monte Carlo simulations to define a relevant region of the hyper-state space in which the associated dynamic programming is solved. It also replaces the costly numerical integration for the evaluation of the expected cost-to-go with a sample average approximation scheme.

The key feature of defining working regions for the cost-to-go function through a penalty term allows the proposed ADP framework to be extended to practically-sized problems. However, increasing the coverage of the hyper-state space through safe exploration, as more information is gathered during online implementation, would facilitate deriving a further improved policy for systems with many unknown parameters [23]. A systematic exploration scheme under the guidance of cautious and/or CE policies in the unvisited regions of the hyper-state space may be incorporated for this purpose.
References

[1] B.D. Anderson, Adaptive systems, lack of persistency of excitation and bursting phenomena, Automatica 21 (3) (1985) 247–258.
[2] K.J. Åström, B. Wittenmark, Problems of identification and control, Journal of Mathematical Analysis and Applications 34 (1971) 90–113.
[3] A.A. Feldbaum, Dual control theory. I, Automation and Remote Control 21 (1960) 874–880; A.A. Feldbaum, Dual control theory. II, Automation and Remote Control 21 (1960) 1453–1464; A.A. Feldbaum, Dual control theory. III, Automation and Remote Control 22 (1961) 1–12; A.A. Feldbaum, Dual control theory. IV, Automation and Remote Control 22 (1961) 109–121.
[4] K.J. Åström, A. Helmersson, Dual control of an integrator with unknown gain, Computers and Mathematics with Applications 12A (6) (1986) 653–662.
[5] B. Wittenmark, An active suboptimal dual controller for systems with stochastic parameters, Automatic Control Theory and Applications 3 (1) (1975) 13–19.
[6] J. Wieslander, B. Wittenmark, An approach to adaptive control using real time identification, Automatica 7 (1971) 211–217.
[7] D.J. Hughes, O.L.R. Jacobs, Turn-off, escape and probing in nonlinear stochastic control, in: Preprints of the IFAC Symposium on Stochastic Control, Budapest, 1974.
[8] A.L. Maitelli, T. Yoneyama, A multistage suboptimal dual controller using optimal predictors, IEEE Transactions on Automatic Control 44 (5) (1999) 1002–1008.
[9] B. Lindoff, J. Holst, B. Wittenmark, Analysis of approximations of dual control, International Journal of Adaptive Control and Signal Processing 13 (1999) 593–620.
[10] N.M. Filatov, H. Unbehauen, Survey of adaptive dual control methods, IEE Proceedings: Control Theory and Applications 147 (2000) 118–128.
[11] A. Ismail, G.A. Dumont, J. Backstrom, Dual adaptive control of paper coating, IEEE Transactions on Control Systems Technology 11 (2003) 289–309.
[12] J.M. Lee, J.H. Lee, Approximate dynamic programming based approaches for input–output data-driven control of nonlinear processes, Automatica 41 (7) (2005) 1281–1288.
[13] J.M. Lee, N.S. Kaisare, J.H. Lee, Choice of approximator and design of penalty function for an approximate dynamic programming based control approach, Journal of Process Control 16 (2) (2006) 135–156.
[14] N.S. Kaisare, J.M. Lee, J.H. Lee, Simulation based strategy for nonlinear optimal control: application to a microbial cell reactor, International Journal of Robust and Nonlinear Control 13 (3–4) (2002) 347–363.
[15] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[16] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[17] J.M. Lee, J.H. Lee, Approximate dynamic programming strategies and their applicability for process control: a review and future directions, International Journal of Control, Automation, and Systems 2 (3) (2004) 263–278.
[18] M.S. Lobo, S. Boyd, Policies for simultaneous estimation and optimization, in: Proceedings of the American Control Conference, San Diego, CA, 1999, pp. 958–964.
[19] A.M. Thompson, W.R. Cluett, Stochastic iterative dynamic programming: a Monte Carlo approach to dual control, Automatica 41 (2005) 767–778.
[20] D.P. Bertsekas, Dynamic Programming and Optimal Control, second ed., Athena Scientific, Belmont, MA, 2000.
[21] C.J. Wenk, Y. Bar-Shalom, A multiple model adaptive dual control algorithm for stochastic systems with unknown parameters, IEEE Transactions on Automatic Control AC-25 (1980) 703–710.
[22] A. Jeffrey, Handbook of Mathematical Formulas and Integrals, Academic Press, San Diego, CA, 1995.
[23] W.B. Powell, Approximate Dynamic Programming, Wiley, 2007.