/
Text
Dynamic Programming
and Optimal Control
Volume I
Dimitri P. Bertsekas
Massachusetts Institute of Technology
Athena Scientific, Belmont, Massachusetts
'1
\
__J
Athena Scientific
Post Officc Dox 391
Belmont, Mass. 02178-9998
U.S.A.
Elnail: at.llcllasc({ilworl(l.st.d.cOlll
Cover Design: Ann Gallager
( ;;'\iii{j
........
/"
-'
.
'"
,. j
!!:/
Contents'
@ 1995 Dimitri P. Bertsekas
All rights rescrvcd. No part of this book may be reproduced in any form
by any elcctronic or mechanical means (including photocopying, recording,
or information storage and retrieval) without permission in writing from
the publishcr.
V 1. The Dynamic Programming Algorithm
1.1. Introduction ...........
1.2. The Basic Problem. . . . . . . . .
1.3. The Dynamic Programming Algorithm
1.4. State Augmentation
1.5. Somc Mathematical Issues
1.6. Notcs, Sources, and Exercises
Portions of this volunle arc adaptcd and reprinted from the author's Dy-
nmnic Pl"O!/mmmin!/: Dctenninistic and Stochastic Models, Prentice-Hall,
1987, by permission of Prentice-Hall, Inc.
2. Deterministic Systems and the Shortest Path Problem
2.1. Finite-State Systems and Shortest Paths
2.2. Some Shortest Path Applications
2.2.1. Critical Path Analysis . . . . .
2.2.2. Hidden Markov Models and the Vit.erbi Algorithm
2.3. Shortest Path Algorithms . . .
2.3.1. Label Correcting Methods
2.3.2. Auction Algorithms
2.4. Notes, Sources, and Exercises .
Publisher's Catalogillg-in-Publication Data
Dertsckas, Dimitri P.
Dynamic Programming and Optimal Control
Llcludes Bibliography and Index
1. Mat.hematical Optimization. 2. Dynamic Programming. 1. Title.
QA402.5 .D465 1995 519.703 95-075041
, 3. Deterministic Continuous-Time Optimal Control
3.1. Continuous-Time Optimal Control . . .
3.2. The Hamilton-Jacobi-Bellman Equation
3.3. The Pontryagin Minimum Principle . .
3.3.1. An Informal Derivation Using the HJD Equation
3.3.2. A Derivation Dascd on Variational Ideas
3.3.3. Minimum Principle for Discrete-Time Problems
3.4. Extensions of the Minimum Principle
3.4.1. Fixcd Terminal Stat.e
3.4.2. Free Initial State
3.4.3. Free Terminal Time
3.4.4. Time-Varying System and Cost
3.il.5. Singular Problems . . . . . .
ISDN 1-886G2U-12-4 (Vol. 1)
ISDN 1-886529-13-2 (Vol. II)
ISDN 1-886.529-11-6 (Vol. I ami II)
p. 2
p. 10
p. 16
p. 30
p.3G
p.37
p. GO
p. G4
p. 54
p. Sf)
p. G2
p.66
p.74
p. 81
p.88
p. 91
p.97
p.07
p. lOG
p. 110
p.112
p. 112
p. 116
p. 116
p. 119
p. 120
iii
_..1
iv
3.5. Notes, Sources, and Exercises . . . . . .
4. Problems with Perfect State Information
4.1. Linear Systems and Quadratic Cost
4.2. Inventory Control . . . .
4.3. Dynamic Portfolio Analysis . . . .
4.4. Optimal Stopping Problems . . . .
4.5. Scheduling and the Interchange Argument
4.6. Notes, Sources, and Exercises . . . . . .
, 5. Problems with Imperfect State Information
J 5.1. Reduction to the Perfect Information Case
\; 5.2. Linear Systems and Quadratic Cost
5.3. Minimum Variance Control of Linear Systems
5.4. Sufficient Statistics and Finite-State Markov Chains
5.5. Sequential Hypothesis Testing
5.6. Notes, Sources, and Exercises . .
6. Suboptimal and Adaptive Control
6.1. Certainty Equivalent and Adaptive Control
G.1.1. Caution, Probing, and Dual Control
6.1.2. Two-Phase Control and Identifiability
6.1.3. Certainty Equivalent Control and Identifiability
6.1.4. Self-Tuning Regulators . . . . . . .
6.2. Open-Loop Feedback Control . . . . . . .
6.3. Limited Lookahead Policies and Applications
6.3.1. Flexible Manufacturing . . . . . .
6.3.2. Computer Chess . . . . . . . . . .
6.4. Approximations in Dynamic Programming
6.4.1. Discretization of Optimal Control Problems
6.4.2. Cost-to-Go Approximation
6.4.3. Other Approximations
6.5. Notes, Sources, and Exercises . .
7. Introduction to Infinite Horizon Problems
7.1. An Overview . . . . . . . . .
7.2. Stochastic Shortest Path Problems
7.3. Discounted Problems . . . .
7.4. Average Cost Problems. . .
7.5. Notes, Sources, and Exercises
Contents
Contents
p. 123
Appendix A: Mathematical Review
A.1. Sets . . . . . .
A.2. Euclidean Space .
A.3. Matrices . . . .
A.4. Analysis . . . .
A.5. Convex Sets and Functions
p. 130
p.144
p. 152
p. 158
p. 168
p. 172
Appendix B: On Optimization Theory
B.1. Optimal Solutions . . . . . . .
B.2. Optimality Conditions . . . . .
B.3. Minimization of Quadratic Forms
p. 186
p. 1!J6
p.203
p. 21!J
p.232
p. 237
Appendix C: On Probability Theory
C.1. Probability Spaces. . .
C.2. Random Variables. . .
C.3. Conditional Probability
p. 246
p. 251
p. 252
p.254
p. 25!J
p.261
p.265
p.2G9
p. 272
p.280
p.280
p. 282
p. 286
p.287
Appendix D: On Finite-State Markov Chains
D.1. Stationary Markov Chains
D.2. Classification of States
D.3. Limiting Probabilities
D.4. First Passage Times .
Appendix E: Kalman Filtering
E.1. Least-Squares Estimation.
E.2. Linear Least-Squares Estimation
E.3. State Estimation - Kalman Filter
E.4. Stability Aspects . . . . . . .
E.5. Gauss-Markov Estimators
E.6. Deterministic Least-Squares Estimation
Appendix F: Modeling of Stochastic Linear Systems
F.1. Linear Systems with Stochastic Inputs
F.2. Processes with Rational Spectrum
F.3. The ARMAX Model . . . . . . . .
p. 2!J2
p. 2%
p.306
p.310
p. 324
References
Index . . . . . . . . . . . . . . . . . . . . . . . . . .
v
p. 329
p.330
p.331
p.334
p.337
p.338
p.340
p. 310
p.341
p. 342
p.344
p.346
p. 347
p.348
p.349
p.350
p. 352
p.360
p.365
p.367
p.370
p.371
p.372
p.374
p. 375
p.385
_. ..J
vi
CONTENTS OF VOLUME II
1. Infinite Horizon - Discounted Problems
1.1. Minimization of Total Cost - Introduction
1.2. Discounted Problems with Dounded Cost per Stage
1.3. Finitc-Statc Systcms - Computational Methods
1.3.1. Value Itcration and Error Bounds
1.3.2. Policy Itcration
1.3.3. Adaptivc Aggrcgation
1.3.4. Lincar Programming
1.4. Thc Role of Contraction Mappings
1.5. Schcduling and Multiarmed Bandit Problmns
1.6. Notcs, Sourccs, and Exercises
2. Stochastic Shortest Path Problems
2.1. Main Rcsults
2.2. Computational Mcthods
2.2.1. Valuc Iteration
2.2.2. Policy Itcration
2.3. Simulation-Bascd Mcthods
2.3.1. Policy Evaluation by Monte-Carlo Simulation
2.3.2. Q-Learning
2.3.3. Approximat.iuns
2.3.4. Extcnsions to Discounted Proulems
2.3.5. Thc Rolc of Parallel Computation
2.4. Notcs, Sourccs, and Excrciscs
3. Undiscoullted Problems
3.1. Unboundcd Custs per St.age
3.2. Linem' Systcms and Quadratic Cost
3.3. Invcntory Control
3.4. Opt.imal St.opping
3.G. Optimal Gambling SLrat.C'gies
3.6. Nonstationary and Periodic Problcms
3.7. Note,;, Sources, and Excrciscs
4. Average Cost per Stage Problems
4.1. Prclilninary An;llysis
4.2. Optimality Conditions
4.3. Computational Mct.hods
4.3.1. Value Iteration
4.3.2. Policy Iteration
nicnt.s
Conienis
vii
4.3.3. Lincar Programming
4.3.4. Simulation-Dased Mcthods
4.4. Infinitc State Space
4.5. Notes, Sources, and Excrcises
5. Continuous-Time Problems
5.1. U niformiz,ation
5.2. Qucueing Applications
5.3. Scmi-Markov Problems
5.4. Notcs, Sources, and Exercises
_. J
ABOUT THE AUTHOR
Dimit.ri Dert.sekas studied Mechanical and Electrical Engineering at.
the National 'l'eclmical University or At.hens, Greece, and obtained his
Ph.D. in system science from the Massachusetts Institute of Technology. He
has held faculty positions with the Engineering-Economic Systems Dcpt.,
Stanford University and the Electrical Engineering Dept. of the Univer-
sity of Illinois, Urbana. He is currently Professor of Electrical Engineering
and Computer Science at the Massachusetts Institute of Technology. He
consults regularly with private industry and has held editorial positions in
several journals. He has been elected Fellow of the IEEE.
Professor Bertsekas has done research in a broad variety of suujects
from control theory, optimization theory, parallel and distributed computa-
tion, data communication networks, and systems analysis. He has written
numerous papers in each of these areas. This book is his fourth on dynamic
programming and optimal control.
Preface
This two-volume book is based on a first-year graduate course on
dynamic programming and optimal control that I have taught for over
twenty years at Stanford University, the University of Illinois, and thc Mas-
sachusetts Institute of Technology. The course has been typically attended
by students from engineering, operations research, economics, ,111d applied
mathematics. Accordingly, a principal objective of the book has been to
provide a unified treatment of the subject, suitable for a broad audience.
In particular, problems with a continuous character, such as stochastic con-
trol problems, popular in modern control theory, arc simultaneously treated
with problems with a discrete character, such as Markovian decision prob-
lems, popular in operations research. Furthermore, many applications and
examples, drawn from a broad variety of fields, are discussed.
The book may be viewed as a greatly expanded and pedagogically
improved version of my 1987 book "Dynamic Programming: Det.erminist.ic
and Stochastic Models," published by Prentice-Hall. I have included much
new material on deterministic and stochastic shortest path problems, as
well as a new chapter on continuous-time optimal control proulcms and the
Pontryagin Maximum Principle, developed from a dynamic programming
viewpoint. I have also added a fairly extensive exposition of simulation-
based approximation techniques for dynamic programming. These tech-
niques, which are often referred to as "neuro-dynamic programming" or
"reinforcement learning," represent a breakthrough in the practical ap-
plication of dynamic programming to complex problems that involve the
dual curse of large dimension and lack of an accurate mathematical model.
Other material was also augmented, substantially modified, and updated.
With the new material, however, the book grew so much in size that it
became necessary to divide it into two volumes: one on finite horizon, and
the other on infinite horizon problems. This division was not only natural in
terms of size, but also in terms of style and orientation. The first volume is
more oriented towards modeling, and the second is more oriented towards
mathematical analysis and computation. To make the first volume self-
contained for instructors who wish to cover a modest amount of infinite
horizon material in a course that is primarily oriented towards modcling,
Other books by the author:
1) Dynamic Programming and Stochastic Control, Academic Press, 1976.
2) Stochastic Optimal Control: The Discrete-Time Case, Academic Press,
1978 (with S. E. Shreve; translated in Russian).
3) Constrained Optimization and Lagrange Multiplier Methods, Academic
Press, 1982 (t.ranslated in Russian).
4) Dynamic Programming: Deterministic and Stochastic Models, Prenti-
ce-Hall, 1987.
5) Data Nctwo1"ks, Prentice-Hall, 1987 (with R. G. Gallager; translated
in Russian and Japanese); 2nd Edition 1992.
6) Parallel and Distributed Computation: Numerical Methods, Prentice-
Hall, 1989 (with J. N. Tsitsiklis).
7) Linear Netwo1'k Optimization: Algorithms and Codes, M.1.T. Press
1991.
viii
ix
_.II
. J
x
Preface
Preface
xi
conceptualization, and finite horizon problems, I have added a final chapter
that provides an introductory treatment of infinite horizon problems,
Many topics in the book are relatively independent of the others. For
example Chapter 2 of Vol. I on shortest path problems can be skipped
without loss of continuity, and the same is true for Chapter 3 of Vol. I,
which deals with continuous-time optimal control. As a result, the book
can be used to teach several diIrerent types of courses.
(a) A two-semester course that covers both volumes.
(b) A one-semester course primarily focused on finite horizon problems
that covers most of the first volume.
(c) A one-semester cOlll'se focused on stochastic optimal control that. cov-
en; Chapters 1, 4, 5, and 6 of Vol. I, and Chapters 1, 2, and 4 of Vol.
II.
subjects developed somewhat infol"lually here.
Finally, I am thankful to a number of individuals and iustitutions
for their contributions to the book. My understanding of the subject was
sharpened while I worked with St.evell Shreve on our 1978 lnOilOgraph.
My iuteraction and collaboration with John Tsitsiklis on stochastic short-
est paths and approximate dynamic programming have bC(,lI most. valu-
able. Michael Caramauis, Emmanuel Hrnandez-Gaucherand, Pierre IIllln-
bid, Lennart Ljung, and John Tsitsiklis taught from versions of the book,
and contributed several substantive comments and homework problems. A
number of colleagues offered valuable insights and information, particularly
David Castanon, Eugene Feinberg, and Krishna Pattipati. NSF provided
research support. Prentice-Hall graciously allowed t.he use of material from
my 1987 book. Teaching ami interacting wit.h t.he st.udent.s aL MI'l' haV<'
kept up my interest and excitcment for the subject.
A one-semester course that covers Chapter 1, about 50% of Chapters
2 through 6 of Vol. I, and about 70% of Chapters 1, 2, and 4 of Vol.
II. This is the course I usually teach at MIT.
(d) A one-quarter engineering course that covers the first three chapters
and parts of Chapters 4 through G of Vol. 1.
(e) A one-quarter mathematically oriented course focused on infinite hori-
ZOIl problems t.hat covers Vol. II.
The mathematical prerequisit.e for the text is knowledge of advanced
cakulus, introductory prob,1bility theory, and matrix-vector algebra. A
SUlllnlary of this material is provided in the appendixes. Nat.urally, prior
exposure to dynamic system theory, control, optimization, or operations
research will be helpful t.o the reader, but based on my experiencc, the
material given here is reasonably self-contained.
The book contains a large number of exercises, and the serious reader
will benefit greatly by going through them. Solutions to all exercises are
compiled in a manual that is available to instructors from Athena Scientific
or from the author. Many thanks are due to the several people who spent
long hours contributing to this manual, particularly Steven Shreve, Eric
Lo;edcrman, Lakis Polymenakos, and Cynara \Vu.
Dynamic programming is a conceptually simple technique that can
be adequately explained using elementary analysis. Yet a mathematically
rigorous treatment of general dynamic programming requires the compli-
cat.ed machinery of mea.';ure-theoretic probability. My choice ha.,; becn t.o
bypass the complicated mat.hematics by developing the subject in general-
ity, while claiming rigor only when the underlying probability spaces are
countable. A mathematically rigorous treatment of the subject is carried
out in my monograph "Stocha.stic Opt.imal Control: The Discrete Time
Case," Academic Press, IU78, coauthored by Stevcn Shreve. This mono-
graph complements the present text and provides a solid foundation for the
(c)
Dimitri P. Bertsekas
bertsekas@lids.mit.edll
_.I
1
The Dynamic Programming
Algorithm
Contents
1.1. Introduction . . . . . . . . . .
1.2. The Basic Problem . . . . . . .
1.3. The Dynamic Programming Algorithm
1.4. State Augmentation. . . . .
1.5. Some Mathematical Issues . .
1.6. Notes, Sources, and Exercises.
p. 2
p. 10
p. 16
p. 30
p. 35
p. 37
1
_ J
2
Tile Dynamic Programming Algor
Chap. 1
Sec. 1.1
Introduct.
3
Life can only be understood going backwards,
but it must be lived going forwards.
Kierkegaard
The cost function is additive ill the sense th,lt the cost. in(,\II"I(.d at.
time k, denoted by gk(Xk, Uk, Wk), accumulates over time. The total r.ost
is
N-I
!/N(XN) + L .'Jk(Xk,Uk,Wk),
k=O
1.1 INTRODUCTION
where 9N(XN) is a terminal cost incurred at thc end of the process. How-
ever, because of the presence of Wk, cost is generally a random variable and
cannot be meaningfully optimized. We therefore formulate the problem as
an optimization of the expected cost
This book deals with situations where decisions arc made in stages.
The out.eonle of (ach decision is not. fully predidaulc but can h( iI.1lLici-
patcd to some extent before the next decision is made. The objective is to
minimize a certain cost - a mathematical expression of what is considNed
an unde::;irable out.collle.
A key aspect of such situations is that decisions cannot be viewed in
isolation since one must balance the desire for low present cost with the
undesirability of high future costs. The dynamic programming tedmique
captures this tradeoff. At each stage, decisions arc ranked based on the
sum of the present cost and the expected future cost, assuming optimal
decision making for subsequent stages.
There is a very broad variety of practical problems that can be treated
by dynamic programming. In this book, we try to keep the main ideas
uncluttered by irrelevant assumptions on problem structure. To this end,
we formulate in this section a broadly applicable model of optimal control
of a dynamic system over a finite number of stages (a finite horizon). This
1II0del will occupy us for the first six chapters; its infinite horizon version
will be the subject of the last chapter and Vol. II.
Our basic model has two principal features: (1) an underlying disc1-ete-
t'imc dynamic systcm, and (2) a cost function that is additivc ovcr I:imc.
The dynamic system is of the form
{ N-I }
}:; !/N(:J;N) + !/k(:l;k,'II.k,Wd ,
k=O
where the expectation is with respect to the joint distribution of the random
variables involved. The optimization is over the controls '11.0, '11.1, . . . , UN -I,
but some qualification is needed here; each control Uk is selected with sOllie
knowledge of the current state Xk (either its exact value or some other
information relating to it).
A more precise definition of the terminology just used will be given
shortly. We first provide some orientation by means of examples.
Example 1.1 (Inventory Control)
Xk+l = fk(Xk, Uk, Wk),
k = O,l,...,IV - 1,
Consider a problem of ordering a quantity of a certain item at each of N
periods so as to meet. a stochast.ic demand. Let. us denot.e
Xk stock available at the beginning of t.he kth period,
Uk stock ordered (and immediately delivered) at t.he beginning of t.he kth
period,
Wk demand during the kth period with given probability distribut.ion.
'vVe assume that WQ,Wl,...,WN_l are independent random variables,
and that. excess demand is backlogg<:d and filled iI. :.;oon as additional illv!'l1-
t.ory becomes available. Thus, st.ock evolves according t.o t.he discrete-t.ime
equation
where
k indexes discret.e time,
.7: k is the state of the system and summarizes past information that is
relevant for future optimization,
Uk is the control or decision variable to be selected at time k,
Xk+l = Xk + Uk -- Wk,
Wk
is a random parameter (also called disturbance or noise depending on
the context),
IV is the horizon or number of times control is applied.
where negative stock corrcspomls to backlogged demand (see Fig. 1.1.1).
The cost incurred in period k consists of two components:
(a) A cost r(xk) representing a penalt.y for either posit.ive stock Xk (holding
Cost for eXCess inventory) or negative stock Xk (shortage cost for unfilled
demand) .
(b) The purchasing cost. CUk, where C is Cost per unit ordered.
_ J
4
The Dynamic Programmiug Algorit.
Suc. 1.1
5
Chap. 1
lntroducUo
Wk Demand at Period k
and each possiblc value of :I:k,
/Lk(Xk) = amount. that should bc orderr'd at. time k if the st.oek is :q.
The sequence "Tr = {/Lo,.", /LN -I} will be referred to as a policy or
cont7'Ollaw. For cach "Tr, the corf('Aponding cost. for a fixed initial st.ock .DO IS
J,,(xo) = E { R(XN) + (r(xk) + C/Lk(Xk)) } ,
and we want t.o mllllluize J" (xu) for a given Xu ovcr all "Tr that satis(y thc
constraints of t.he problem. This is a typical dynamic programming problcm.
Wc will analyze this problem in various forms in subsequcnt sections. For
example, we will show in Section 4.2 that for a reasonable choice of t.hc cost.
function, the optimal ordcring policy is of thc form
( ) { Sk - Xk if Xk < .'h,
/Lk Xk = 0 otherwisc,
where Sk is a suit.able threshold level determincd by t.he data of the problcm.
In other words, whcn stock falls below thc threshold Sk, order just cnough to
bring stock up to Sk.
Stock at Period k
xk
Stock at Period k + 1
Inventory
System
Xk + 1 = xk + Uk - wk
Cost of Period k
r(xk) + cU k
Stock Ordered at
Period k
uk
Figure 1.1.1 Inventory control example. At period k, the current. st.ock
(state) Xk, thc stock ordered (control) Uk> and thc demand (random distur-
bancc) Wk det.ermine the cost "(Xk) -I- CUk and the st.ock Xk+l = Xk -I-Uk - Wk
at the next period.
Thcrc is also a tcrminal cost R(XN) for being left with inventory XN at the
end of N periods. Thus, the total cost over N periods is
The preceding example illustrates the main ingredients of the basic
problcm formulation:
(a) A discrete-time system of the form
Xk+l = h(:!:k,'1I.k,wd,
wherc fk is some function; in the inventory example h (.Tk, 'Uk, Wk) =
Xk + 'Uk - Wk.
(b) Independent random parameters Wk. This will be generalized by al-
lowing the probability distribution of 11Jk to depend on :l:k and 1Lk;
in the context of the inventory cxamplc, we can think of a situaLion
whcre the level of demand 11Jk is influenced by the current stock level
Xk.
(c) A mnt7'Ol constr'aint; in t.he (xalllple, we have Uk :::: O. Tn gPlwral,
the constraint set will dcpend on Xk and the time index k that is
Uk E Uk(Xk). To see how constraints dependent on Xk can arise in th
inventory context, think of a situation wherc there is an IIppcr bound
B on the level of stock that can be accommodated, so Uk ::; B - Xk'
(d) An additive cost of thc form
E {9N(XN) + 9k(Xk,'Uk,Wk)},
where gk arc some functions; in the example, we have gN (;l: N) =
R(XN), and 9k(Xk, 'Uk, 11Jk) = r(Xk) + C'Uk.
(e) Optimization over (closed-loop) policies, that is, rules for choosing 'Uk
for each k ami possiblc valuc of ;l:k.
E {R(XN) + (r(xk) -I- CUk)}'
We want to minimize this cost by proper choice of the orders uo, . . . , UN - I ,
subject to the natural const.raint Uk 2': 0 for all k.
At t.his point we need to distinguish bet.ween closed-loop and open-
loop minimizat.ion of thc cost. Tn opcn-loop minimization wc selcct all orders
'11.0,. .. ,UN -I at once at timc 0, without waiting to sce the subsequcnt demand
levels. In closed-loop minimization we postpone placing the order Uk until the
last possiblc moment (time k) when the current stock Xk will be known. The
idea is that since thcre is no penalty for delaying the order Uk up to time k,
we can take advantage of information that becomes available between times
o and k (the demand and stock level in past periods).
Closed-loop optimization is of central importance in dynamic program-
ming and is t.he type of optimization t.hat we will consider almost exclusively
in this book. Thus, in our basic formulation, decisions are made in stages
while gathering information between stages that will be used to enhanc.e the
qualit.y of the decisions. The effect of this on the structure of the rcsulting
optimization problem is quite profound. 1n particular, in closed-loop inven-
tory optimization we are not interested in finding optimal numerical values
of t.he orders but rather we want to find an optimal rule Jar selecting at each
period k an order Uk Jar each ]JOssible value oj stock Xk that can occur, This
is an "action versus strategy" distinction.
Mathematically, in closed-loop inventory optimization, we want to find
a sequence of functions /Lk, k = 0,..., N - 1, mapping stock Xk into order Uk
so as t.o minimize the expected cost. The meaning of /Lk is that, for each k
_.J
6
The Dynamic Programming Algl
,m
ClJap. 1
Sec. 1.1
Introduciic
7
Discrete-State and Finite-State Systems
Example 1.2 (Machine Replacement)
In the preceding example, the state Xk was a continuous real variaulc,
and it is easy to think of multidimensional generalizations where t.he st.ate
is an n-dimensional vector of real variables. It is also possible, however,
that the state takes values from a discrete set, such as t.he integcrs.
A version of the inventory problem where a discrcte viewpoint is more
natural arises when stock is measured in whole units (such as cars), each
of which is a significant fraction of Xk, Uk, or Wk. It is more appropriate
thcn to take as state space thc sct of all integers rather than the set of real
!lumbers. The form of the system cquation and the cost per pcriod will, of
course, stay the same.
In other systems the state is nat urally discrete and there is no contin-
uous counterpart of the problcm. Such systems are often convenicntly spec-
ified in terms of the probabilities of transition between the states. What. we
need to know is 11,j (u, k), which is the probability at time k that the next
state will be j, given that the currcnt state is i, and the control selected is
u, i.e.,
Consider a proulem of operating eflicienlly over N t.ime periolb a machine
t.hat. can be in anyone of n states, denot.ed 1,2".., n. The implication here
is t.hat state i is better than state i + 1, and state 1 corresponds t.o a machine
in perfect condition. In particular, we denote by g(i) the operating cost. per
period when the machine is in state i, and we ussume t.hat
g(l) ::; g(2) ::; .,. ::; g(n).
During a period of operation, the state of the machine can become worse
or it may stay unchanged. We thus assume that. the t.ransition probabilities
Pi) = P {next state will be j I current state is i}
satisfy
P'j = a
if j < i.
Xk+l = Wk,
\Ve assume that at t.he start of each period we know the state of t.he
machine and we must choose one of the following t.wo options:
(a) Let the machine operate one more period in t.he state it. currently is.
(b) Repair the machine and bring it to the perfect state 1 at a cost R.
\Ne assume that the machine, onc.e repaired, is guaranteed to st.ay iu st.ate
1 for one period. In subsequent periods, it. may deteriorate to states j > I
according t.o t.he transit.ion probabilitic:,; Plj.
Thus the objective here is to decide on the level of deteriorat.ion (st.ate)
at which it is worth paying the cost of machine repair, thereby obtaining the
benefit of smaller future operating Costs. Note that the decision should also
be affected by the period we are in. For example, we would be Icss inclined
t.o repair the machine when there are few periods left.
The system equation for this problem is represented by t.he graphs of
Fig. 1.1.2. These graphs depict the transition probabilities between various
pairs of states for each value of t.he control and arc known as tmnsition proba-
bility graphs or simply transition graphs. Note that there is a difl"erent. graph
for each control; in the present case t.here are t.wo cont.rols (rcpair or not
repair).
Pij(u,k) =P{:rk+l =j IXk =i,Uk =u}.
Such a system can be described alternatively in terms of the discrete-time
system equation
where the probabilit.y distribntion of the random parameter Wk is
P{Wk = j I Xk = i,Uk = u} = Pij(u,k).
Conversely, given a discret.e-state system in the form
Xk+l = fk(Xk, Uk, Wk),
together with the probability distribution Pk(Wk I Xk, Uk) of Wk, we can
provide an equivalent transition probability description. The corresponding
transition probabilitics arc given by
Pij(U, k) = pk{Wdi, u,j) I Xk = i,Uk = u},
where W (i, u, j) is thc set
Example 1.3 (Control of a Queue)
Wk(i,u,j) = {w I j = fk(i,u,w)}.
Consider a queueing system with room for n customers operat.ing over N
time periods. We assume that service of a customer can st.art. (cnd) only
at the beginning (end) of the period and that. the system can scrve only
one customer at. a time. The probability Pm of m customer arrivals during
a period is given, and the numbers of arrivals in two different periods are
independent. Customers finding the system full depart without attempting
to enter later. The system offers t.wo kinds of service, fast and slow, wit.h cost.
per period Cf and c., respectively. Service can be switched between fust and
Thus a discrete-state system can equivalently be described in terms
of a difference equation and in terms of transition probabilities. Depending
on the given problem, it may be notationally more convenient t.o use one
description over the other.
The following thrce examples illustrate discrete-state systems.
_..J
8
The Dynamic Programming Alg(
!)
P1n
Do not repair
m
Chap. 1
Sec. 1.1
1nirodud.,
and it equals t.hc probabilit.y of 1L or morc cllstomcr arrivals when j = n,
POn(Uf) = POn(U,,) = L Pm.
When there is at least one cust.omer in the syst.em (i > 0), we have
piJ(1I.f) = 0,
if j < i-I,
p.j(Uf) = qf]Jo,
ifj=i-l,
Poj (u f) = P {j - i + 1 arrivals, service complded}
+ P{j - i arrivals, service not complded}
= (jf]Jj-i+l + (1- qf)pj-.,
if i-I < j < n - 1,
Pi(n-l)(Uf) = qf L pm + (1 - IlJ)Pn-l-.,
fn=n-i
Pin(Uf)=(1-(jf) L Pm.
Repair
m=n-1.
The transition probabilities when slow service is provided are also given by
these formulas with U f and (jf replaced by Us and qs, respectively.
Figure 1.1.2 Machine replacement example. Transition probability graphs for
each of the two possible controls (repair or not repair). At each stage and state i,
the cost of repairing is R+g(l), and the cost of not repairing is g(i). The terminal
cost is O.
Example 1.4 (Optimizing a Chess Match Strategy)
A player is about to play a two-game chess match with an opponent., and
want.s t.o maximize his winlling chances. Each gamc can have one of t.wo
outcomes:
slow at the beginning of each period. With fust (slow) service, a customer in
service at the beginning of a period will terminate service at the end of the
period with probability qf (respectively, qs) independently of the number of
periods the customer has been in service and the number of customers in the
system (qf > qs). There is a cost r(i) for each period for which there are i
customers in t.he syst.em. There is also a t.erminal cost R(i) for i C1lst.omers
left in t.he syst.em at t.he end of the last period. The problem is to choose, at
each period, the type of service as a function of the number of customers in
the syst.em so as to minimize the expected total cost over N periods.
It is appropriate to take as state here the number i of customers in the
system at the st.art of a period and as control the type of scrvice provided.
The cost per period then is r(i) plus Cf or C s depending on whether fast or
slow service is provided. We derive the transition probabilities of the system.
When the system is empty at the start of the period, the probability
that the next state is j is independent of the type of service provided. It
equals the given probability of j customer arrivals when j < n,
pOj(Uf) = POj(u s ) = Pj,
j = 0,1,..., n - 1,
(a) A win by one of the players (1 point for the winner and 0 for the loser).
(b) A draw (1/2 point for each of the two players).
If the score is tied at. 1-1 at. the ('nd of tll( two games, t.h(' mat.ch gO('S int.o
sudden-death mode, whereby the players continue to play until the first time
one of them wins a game (and the match). The player has two playing styles
and he can choose one of the two at will in each game, independently of the
style he chose in previous games.
(1) Timid play with which he draws with probability ]Jd > 0, and he loses
with probability (1 - pd).
(2) Bold play with which he wins wit.h probability Pw, and he loses wit.h
probability (1 - Pw).
Thus, in a given game, timid play never wins, while bold play never draws.
The player wants to find a style selection strategy that. maximizes his proba-
bility of winning t.he match. Note that once the match gets into sudden deat.h,
the player should play bold, since wit.h timid play he can at best. prolong the
_....1
10
The Dynamic Programming Algon
Chap. 1
Sec. 1.2
The Basic 1'.
Cln
11
suudall deat.h play, while running the risk of losing. Therefore, t.here "H' only
two decisiolls for the player to make, the selection of the playing strat.egy in
the first two games. Thus, we can model the problem as one with t.wo stages,
and with st.ates the possible scorcs at the start of each of the first t.wo st.ages
(games), as shown in Fig. 1.1.3. The init.ial state is t.he init.ial score 0-0. The
t.ransition probabilitics for each of the two different. controls (playing styles)
arc also shown in Fig. 1.1.3. There is a cost at the terminal states: a cost of
-1 at t.he winning scorcs 2-0 anu 1.5-0.5, a cost of 0 at t.he losing scores 0-2
and 0.5-1.5, and a cost of -pw at the tied score 1-1 (since the probability of
winning in sudden death is Pw). Note that to maximize the probability P of
winning the mat.ch, we must minimize -Po
This problem has an interesting feature. One would think that if pw <
1/2, the player would have a less than 50-50 chance of winning the mat.ch,
even with opt.imal play, since his probability of losing is greater than his
probability of winning anyone game, regardless of his playing style. This is
not so, however, because the player can adapt his playing style to the current
score, but his opponent docs not have that option. In other words, the play<:'r
can use a closed-loop strategy, and it. will be seen later that with opt.imal play,
as det.ermined by t.he dynamic programming algorithm, he has a belter than
50-50 chance of winning the match provided pw is higher than a t.hr<:'shold
value 15, which, depending on the value of Pd, may satisfy 15 < 1/2.
1st Game I Timid Play
1st Game I Bold Play
e
1.2 THE BASIC PROBLEM
We now formulate a general problcm of decision under stochastic
uncertainty over a finite number of stages. This problem, which we call
basic:, is the cent.rn'! problem in this book. We will discuss solution methods
for this problem based all dynamic programming in the first six chapters,
and we will extend our allalysis to versiolls of this problem involvillg an
infinite number of stages in the last chapter and in Vol. II of this work.
The basic problem is very gencral. In particular, we willllot rcquire
that t.he st.ate, control, or random parameter take a finitc number of val-
UeS or belong to a space of n-dimensional vectors. A surprising aspect of
dynamic programming is that its applicability depends very little on t.he
nat.IIl"<' of t.he sLate, (,ollt.rol, and ralldom parameter spaces. For this rPH.o.;on
it is cOllvenicnt to proceed without allY assumptions on the structure of
these spaces; indeed such assumpt.ions would become a serious impediment
later.
2nd Game I Timid Play
2nd Game I Bold Play
Figure 1.1.3 Chess match example. Transil.ion probabilit.y graphs ror (';I("h or
t.he two possible cont.rols (t.imid or bold play). Not.e here t.hat. the stat.e space is
not the same at each stage. The terminal cost. is -1 at the winning final scores 2-0
and 1.5-0.5, 0 at. the losing final scores 0-2 and 0.5-1.5, and -1'", at. t.he t.ied s("ore
1-1.
XHI = !k(Xk, Uk, Wk),
k=O,I,...,N-1,
(2.1 )
whcre the statc Xk is an element of a spacc Sk, the control UI,; is an clement.
of a space Ck, and t.he random "disturbance" WI,; is an dement. of a spacp
Dk'
The control Uk is constrained to take values in a given nonempty
subset U(Xk) C Ck, which depends on the current state Xk; that is, Uk E
Uk(:Ck) for all Xk E Sk and k.
The random disturbance Wk is characterized by a probability distri-
bution Pk(' I xk, Uk) that may depend explicitly on :Ck and UI,; but not on
valucs of prior dist.urbanccs WI,;-l, . . . , Woo
We consider the class of policics (also called control laws) that consist
Basic Problem
We arc given a discrete-time dynamic system
12
The Dynamic Programming Alg( WI
Chap. I
See. 1.2
The l3asjc :JblclIJ
13
of a sequence of functions
The Rolc and Valuc of Informat.ion
Xk+1 = fk(xk>/lk(xk),wk),
k = 0,1,..., N - 1,
(2.2)
'vVe mentioned earlier the distinction between open-loop lIlIlIlIlllza-
tion, where we select all controls 'Uu,..., UN -1 at once at. tillle 0, alld
clo:,;ed-loop minimization, where we select 11 policy {JIU,..., IIN __[} (.11 at.
applies the control /lk(xd at time k with knowledge of the currellt state
Xk (see Fig. 1.2.1). With closed-loop policies, it i:,; po:,;sible to achieve lower
COHt, essentially by taking advantage of the extra illforlllat.ioll (Llw vallI<'
of the current state). The reduction in cost may be called the value of
the information and can be significant. indeed. If the informatioll is not
available, the controller cannot. adapt appropriately to unexpected values
of t.he state, and as a result the cost. can be adversely affect.ed. For cX<1l11ple,
in the invent.ory control example of the preceding section, the informat.ion
that becomes available at the beginning of ea.ch period k is the invcntory
stock Xk. Clearly, this information is very important to the inventory man-
ager, who will want to adjust the amount Uk t.o be purchased depending
on whether the current stock :Ck is running high or low.
n = {/lO,... ,/IN-I},
where /lk maps stat.es Xk into controls Uk = /lk(Xk) and is such that
ILk(Xk) E Uk(:cd for all Xk E Sk. Such policies will be called admissible.
Given an initial stat.e Xu and an admissible policy n = {IIO, . . . , II,N -1},
the sy:,;tem equation
makes Xk and Wk random variables wit.h well-defined distributions. Thus,
for given functions gk, k = 0, 1, . . . , N, the expected cost
J"/r(XO) = k=O'I'N-l {gN(:CN) + gk(:Ck, /lk(Xk), Wk) } (2.3)
is a well-defined quantit.y. For a given initial state Xo, an optimal policy n*
is one t.hat minimizes t.his cost.; t.hat is,
Wk
Uk= Jlk(xk)
System
Xk+ 1 = 'k(Xk,Uk,Wk)
Xk
J"/r* (xu) = minJ"/r(xu),
"/rEf!
where II is the sct. of all admissible policies.
Note that the optimal policy n* is associat.ed with a fixed initial state
xu. However, an interesting aspect of the basic problem and of dynamic
programming is that it. is typically possiule to find a policy 7r* that is
simultaneously optimal for all init.ial states.
The optimal cost depends on Xu and is denoted by J*(xo); that is,
J*(XO) = minJ"/r(xo).
"/rEf!
Figure 1.2.1 Information gathering in the basic problem. At each Lime k the
controller observes the current state Xk and applies a control Uk = /ldxk) thaI,
depends on that. state.
It is useful to view J* as a function that assigns to each initial state Xo the
optimal cost J*(xo) and call it the optimal cost function or optimal value
function.
For the benefit of the mathematically oriented reader we note that
in the preceding equat.ion, "min" denotes the greatest lower bound (or
infimum) of the set of numbers {J"/r (xo) In E II}. A notation more in line
with normal mathematical usage would be to write J* (xo) = inf"/rEIT J"/r (xo).
However (as discussed in Appendix B), we find it convenient to use "min" in
place of "inf" even when the infimum is not attained. It is less distracting,
and it will not lead to any confusion.
Example 2.1
To illustrate the benefits of the proper use of information, let us cOllsider
the chess match example of the preceding section. There, a player can select.
timid play (probabilitiPB Pd and 1 - Pd for a draw and a loss, respect.iwly)
or bold play (probabilities pw and 1 - pw for a win and a loss, respectively)
in each of the two games of the match. Suppose the player chooses a policy
of playing timid if and only if he is ahead in the score, as illustrated in Fig.
1.2.2; we will see in the next section that this policy is optimal, assuming
Pd > Pw. Then after the first game (in which he plays bold), the score is 1-0
with probability Pw and 0-1 with probability 1 - pw. In the second game, he
14
The Dynamic Programming AlgoJ
Chap. 1
n
plays timid in the former case and bold in the latter Ca.:l. Thus after two
games, t.he probability of a mat.ch win is P",Pd, the probabllIt.y of a mat.ch lo:s
is (1- ]1,.,)2, and tile probabilit.y of a tied score is ]1,.,(1 - ]1d) + (I - P,.,)Pw, 111
whieh case he has a probability p", of winning t.he subsequent sudden-death
game. Thus the probabilit.y of winning the mat.ch with the given st.rategy is
1)",]1d + Pw (Pw(1 - Pd) + (1 - Pw)Pw),
which, wit.h some rearrangement, gives
Probability of a match win = p(2 - pw) + p",(1 - )Jw)Pd.
(2A)
Figure 1.2.2 Illustration of the policy used in Example 2.1 to obtain a greater
than 50-50 chalice of winlling the chess match and associated transitIOn prob-
abilities. The player chooses a policy of playing timid if and only if he is ahead
in the score.
Suppmie 1101\' that Jlw ..... Ii J. rhen the player ha.:; a greater probabdJt)
of losing than winning anyone game, regardless of the type of play he uses.
From this we can infer that. no open-loop strat.egy can give the player a great.er
than 50-50 chance of winning the match. Yet from Eq. (2.4) it can be seen
that with the closed-loop st.rategy of playing timid if and only if the player
is ahead in the score, t.he chance of a mat.ch win can be greater than 50-50,
provided that pw is close enough to 1/2 and Pd is close enougl to 1. As .an
example, for Pw = OA5 and Pd = 0.9, Eq. (2..1) gives a match wm probabilIty
of ronghly 0.53.
'1'0 calcnlate the value of information, let US consider the four open-loop
policies, whereby we decide on the type of play to be used without waiting to
see the rcsult of the first game. These are:
Sec. 1.2
The Basic Pr,
rn
15
(I) Play timid in bot.h games; t.his ha.. a probability JIP,,' or \\'inllillg till'
match.
(2) Play bold ill both games; this has a probability Pv I- JI,(I . - f}".) cc
P (2 - ]1",) of winning t.he match.
(3) Play bold in thc first game and timid in the second game; this has a
probability P",Pd + p;;'( 1 - )Jd) of winning t.he mat.ch.
(4) Play timid in the first game and bold in the second game; this abo has
a probability ]JwPd + p;;,(1 - Pd) of winning the mat.ch.
The first policy is always dominated by the others, and the opt.imal
open-loop probability of winning the match is
Open-loop probabilit.y of win = max(p(2 - p",), P",Pd + p,(l -- Pd))
= P, + P", (1 - ]1",) max(]1w, )Jd).
]Jy making the reasonable assumption IJ,l > )Jw, we see that. the optilllal Opell-
loop policy is to play timid in one of the two games and play bold in the other,
and that
Open-loop probability of a mat.ch win = pw)Jd + P;v(1 -- Pd). (2.5)
For pw = 0.45 and Pd = 0.9, Eq. (2.5) gives an optimal open-loop match win
probability of roughly 0.425. Thus, the value of t.he information (the outcome
of t.he first game) is t.he diJTerence of the optimal closed-loop and open-loop
values, which is approximately 0.53 - 0.425 = 0.105. More generally, by
suut.racting Eqs, (2.4) and (2.5), we see that
V,due of inl'JI"lliatioli p;;,(2 - Pw) I-]Iw(l -- ]1w)]},j -- (P,,,P,j l]i:;,(1 /I,d)
=, p,(l - /iw).
\
It should be noted, however, that whereas availability of the state
information cannot hurt, it may not result in an advantage either. For
instance, in deterministic problems, where no random dist.urbances are
present, one can predict the future stat.es given the initial state and the se-
quence of controls. Thus, optimization over all sequenccs {1Lo, 1LI, . . . , 'lLN -1}
of controls leads to the same opt.imal cost ,J... optimization over all admis-
sible policics. The same can be true even in some stochastic problems (see
for example Exercise 1.13). This brings up a relat.ed issue. Assuming no
information is forgotten, the cont.roller actually knows the prior states and
controls :ro, 1Lo, . . . , Xk-I, 1Lk-1 as well as t.he current stat.e irk. Therefore,
the question arises whether policies that use the entire system history can
be superior to policies that use just the current state. The answer turns out
to be negativc although the proof is t.echnically complicated (see [DeS78]).
The intuitive reason is that, for a given time k and state Xk, all future
expected cost.s depcnd explicitly just on Xk and not on prior history.
_..J
16
The Dynamic Programming Alg
'!In
Chap. I
Sec. 1.3
The Dynam
mgralIlming Algorithm
17
1.3 THE DYNAMIC PROGRAMMING ALGORITHM
Thc dynamic programming (DP) tcchnique rests on a vcry simple
idea, thc principle of optimality. The name is due to Bellman, who con-
tributed a great deal to the popularization of DP and to its transformation
into a systematic tool. Roughly, the principle of optimality states the fol-
lowing rathcr obvious fact.
Period N - 1: Assume that at the beginning of period N - 1 the stuck
is x N -1. Clcarly, no matter what happened in the past, the inventory
manager should order the amount of invcntory that minimizes over "lLN -I 2:
o the sum of thc ordcring cost and the expect cd t.crlninal holding/short.age
cost C"lLN-l -I- E{R(XN)}, which can bc written as
C"lLN-l + E {R(XN-l -I- "lLN-l - WN-I)}.
WN_l
Principle of Optimality
Let 7r* = {/l'a, /1i, . . . , J.1.N _ d be an optimal policy for the basic prob-
lem, and assume that when using 7r*, a given state Xi occurs at time
i with positive probability. Consider the subproblem whercby we are
at Xi at time i and wish to minimize the "cost-ta-go" from time i to
time N
Adding thc holding/shortage cost of period N - 1, the optimal cost for the
last period (plus thc tcrminal cost) is
IN-1(XN-1) = T(J:N-l)
-I- min [ cnN-1+ E {R(:I:N-l-1-nN-1--WN-IJ} j .
UN_lO 1lJN-l
{ N-l }
E gN(XN) + f;9k(Xk,/1k(Xk),Wk) .
Naturally, J N -1 is a function of the stock :1: N -1' It is calculatcd either
analytically or numerically (in which case a table is used for computcr
storagc of the funcLion J N -1). In the process of calculating J N -1, we obtain
the optimal inventory policy J.1.?v -1 (XN -1) for thc last period; /1 ?v -I (XN - d
is the value of nN-l that minimizes thc right-hand sidc of the preccding
equation for a given value of XN-l.
Period N - 2: Assumc that at the beginning of period N - 2 the stock is
x N - 2. It is clear that the inventory manager should order the amount of
invcntory that minimizcs not just thc expccLed cost of period N - 2 hut
rather the
Then the truncated policy {JLi, /1i+l' . . , , J.1.?v -I} is optimal for this sub-
problem.
The intuitivc justification of the principle of optimality is very simple.
If thc truncat.ed policy V<, 1<+ l' . . . , It N -I} were not optimal as stated, wc
would be able to reduce the cost further by switching to an optimal policy
for thc subproblcm once we reach x,. For an auto travel analogy, supposc
that t.he fastest rout.c from Los Anl!;dcs to Doston pasSCI> through Chicago.
The principlc of optimality translates to the obvious facL that thc Chicago
to Boston portion of thc route is also the fastest route for a trip that starts
from Chicago and cnds in Boston.
The principlc of optimality suggests that an optimal policy can be
constructcd in piecemeal fashion, first constructing an optimal policy for
the "tail subproblem" involving the last stage, then extending the optimal
policy to the "tail subproblem" involving the last two stages, and continuing
in this manner until an optimal policy for the entire problem is constructed.
The DP algorithm is based on this idea. We introduce the algorithm by
means of an exam pie.
(XI)(('t(d ('ost nf' period N - 2) I (exp(.('ted ('ost of' lH'riod N I,
given that an optimal policy will be used at period N - 1),
which is equal to
r(xN-2) -I- CUN-2 -I- E{ .1N-l (.TN - J)}.
Using the systcm equation J:N-l = :£N-2 -I- "lLN-2 - 111N-2, t.he last t.enll is
also written as I N - 1 (XN-2 -I- nN-2 - WN-2).
Thus the optimal cost for the last two pcriods given that we at'( at.
statc XN-2, denoted IN-2(XN-2), is given by
Consider the inventory control cxample of the prcvious section and
the following procedurc for determining the optimal ordering policy starting
with the last period and proceeding backward in time.
I N - 2 (XN-2) = r(xN-2)
-I- min [ CUN-2 -I- E {1N-l(XN-2 -I- nN-2 - WN-2) } j
"N-2;::O WN-2
The DP Algorithm for the Inventory Control Example
Again I N - 2 (XN-2) is calculated for every :CN-2' At the samc time, the
optimal policy /1?v-2(XN-2) is also computed.
18
The Dynamic Programming
oriillln
Chilp. 1
Sec. 1.3
The DYIW
Programming Algorithm
19
Period k: Similarly, we have that at period k, when the stock is .Tk, the
inventory manager should order Uk to minimize
(,xpected (:ot of period k) + (expected cost of periods k -1- 1,..., N - 1,
given that an opt.imal policy will be used [or thcsc period).
",:,here the expectation is taken with respect to the probability distribu-
tIOn of Wk, which depends on Xk and 1L k Furthermore if u* - It * ( , c )
. , k - k' k
minimizes the right side of Eq. (3.3) for each Xk anel k, the policy
71"* = {/Lo, . . . , ILM _ J} is optimal.
Dy denoting by Jd.Tk) t.he optimal cost, we have
.h(.rk) = 1'(:Ck) + min [ ClLk + E {.h+l (:Ck + "Ilk - wd } ] ,
uk::O: 0 wk
(3.1)
Proof: t For any admissible policy 71" = {/LO, /Ll,..., 11N-J} and each k =
0,1,..., N -1, denote 71"k = {/Lk,/Lk+J,... ,/LN-J}. For k = 0, 1,..., N -1.
let Ji,(xd be the optimal cost for the (N- k)-stage problem that. tarts at.
state Xk anel time k, and ends at time Nj that is,
The DP Algorithm
Jj;(Xk) = ltn wk,...1fvN-l {.<IN(XN) + IYi('T"/L'(:Ci)'W')}'
For k = N, we define .lM(:CN) = .<IN(:r:N)' We will show by induction
that the functions .lie are equal to the [unctions Jk generat.ed l.Jy the DJ'
algorithm, so that for k = 0, we will obtain the elesired result.
Indeed, we have by definition J M = I N = .<IN. Assume t.hat for
some k and all Xk+1, we have Jk'+J(Xk+J) = J k +1(Xk+J). Then, since
71"k = (/Lk,71"k+I), we have for all Xk
Jk'(Xk) = mip I E { 9k(Xk,ltd.Tk),Wk)
(J1.k>1T + ) Wk'''',WN_J
+.<IN(XN) + Yi(.T,,/L,(Xi),W,) }
,=k+l
which is actually the dynamic programming cquation for this problem.
The functions .h(:I:d denote the optim,ll expected cost. [or the l'C-
maining periods when st.art.ing at period k and with initial invent.ory ;Z'k.
Thce funct.ions are computed recursively backward in timc, sLutiug al,
period N - 1 and ending at period O. The value Jo(xo) is the optimal
expected cost for the process when the initial stock at time 0 is .TO. During
the calculations, the optimal policy is simultaneously computed from the
minimization in thc right-hanel side of Eq. (3.1).
The example illustrates t.he main advantage offcred by DP. While
the original inventory problem requires an optimizat.ion over the set of
policies, the DP algorithm of Eq. (3.1) decomposes this problem into a
sequence of minimizat.ions carried out over the set. of controls. Each of
these minimizations is much simpler than the original problem.
Proposition 3.1: For every initial state Xo, the optimal cost J*(xo)
of the basic problem is equal to 1o(xo), where the function Jo is given
by the last step of the following algorithm, which proceeds backward
in timc from pcriod N - 1 to period 0:
= min E { .<Jk(:Ck,/LdXk),Wk)
JLk 11Jk
+ m1 [ E { YN(:CN) + t 1 gi(Xi, Il,(Xi), w,) }] }
1T 1Uk+l,...,tlJN_}
,=k+l
= min E {Yk (:Ck, /Ldxk), Wk) + Jk' + 1 (h (:rk, ILk(:rk), Wk ) ) }
Ilk Wk
= min E {9k(Xk,/Lk(Xk),Wk) + Jk+l(fk(Xk,/Lk(Xk),1lJk ))}
J1.k Wk
n ( I J il ( 1 ) E {.lJk(:Ck, 1J.k, Wk) + .1,.,1-1 (fk(Xk, "Ilk, wd)}
ll.kE k Xk IlIk
We now state t.he DP algorithm for the basic problem and show it.s
opt.imality by translating in mathematical terms the heurist.ic argument
givcn above for the inventory example.
IN(XN) = YN(XN),
(3.2)
= JdXk),
.h(:r;k) = min E {gk(:Ck,lLk, Wk) + Jk+l (Jk(Xk, 1J.k, Wk))},
UkEUk(Xk) Wk
k = O,l,...,fV - 1,
(3.3)
t For a strictly rigorous proof, some t.edlllical mathcmatical issues must ue
addressed; see Section 1.5. These issues do not. arise if t.he disturba.nce Wk Lakes
a finite or countable number of values and the expected values of all terms in
the expression of t.he cost function (2.2) arc well defined and finit.e for ev('rv
admissible policy 71". .
_..1
20
TIJC Dynamic Programming Alg_
Jln
Chap. 1
Sec. 1.3
The Dynal
PrugramlIIwg AlguriihlII
21
complet.ing the induction. In the second equation above, we moved the
minimum over 7r k + 1 inside the braced expression, using the assumption
that the probability distribution of Wi, i = k -I- 1,. . . , N - 1, depends only
on Xi and Ui. In the third equation, we used the definition of J;;+1> and in
the fourth equation we used the induction hypothesis. In the fifth equation,
we converted the minimization over P,k to a minimization over Uk, using
the fact that for any function F of X and u, we have
min F(X,IL(X)) = min F(x,u),
,<EM uEU(x)
'DO: illitial t.elllperat.m'(' or t.iI(' material,
Xk, k = 1,2: temperature of the mat.erial at the exit of oven k,
Uk-I, k = 1,2: prevailing temperat.ure in oven k.
We assume a model of t.he form
Xk+l = (1 - a)xk -I- aUk,
k = 0,1,
where M is the set of all functions IL(X) such that p,(x) E U(x) for all .T.
Q.E.D.
where a is a known scalar from the interval (0,1). The objective is to get.
t..he final temperatre X2 close to a given target '1', while expending relatively
httle energy. This IS exprcssed by a cost function of the form
r(X2 - '1')2 -I- u6 -I- ui,
The argument of the preceding proof provides an interpretation of
Jk(Xk) as the optimal cost for an (N - k)-stage problem starting at state
:Z;k and time k, and ending at time N. We consequently call Jk(J;k) the
cost-to-go at state Xk and time k, and refer to J k as the cost-to-go fundion
at time k.
Ideally, we would like to use the DP algorithm to obtain closed-form
expressions for Jk or an optimal policy. In this book, we will consider a large
number of models that admit analytical solution by DP. Even if such models
encompass oversimplified assumptions, they are often very useful. They
may provide valuable insights about the structure of the optimal solution of
more complex models, and they may form the basis for suboptimal control
schemes. Furthermore, the broad collection of analytically solvable models
provides helpful guidelines for modeling; when faced with a new problem it
is worth trying to pattern its model after one of the principal analytically
tractable models.
Unfortunately, in many practical ca.ses an analytical solution i" not
possible, and one has to resort to numerical execution of the DP algorithm.
This may be quite time-consuming since the minimization in the DP Eq.
(3.3) must be carried out for each value of Xk. This means that the state
space must be discretized in some way (if it is not already a finite set).
The computational requirements are proportional to the number of dis-
cretization points, so for complex problems the computational burden may
be excessive. Nonetheless, DP is the only general approach for sequential
optimilmtion under uncertainty, and even when it is computationally pro-
hibit.ive, it. can serve 0"; the ba.<;iH for more practical suboptimal approaches,
which will be discussed in Chapter 6.
The following examples illustrate some of the analytical and compu-
tat.ional aspects of DP.
where r > 0 is a iven scalar. We assume no constraints on Uk. (In reality,
t.he:e are constramts, but If we can solve the unconst.rained problem and
verify tht the solton satisfies the constraints, everything will be fine.) The
problem IS deterl1lnllst.IC; that IS, t.here is no stochastic uncertainty. However,
SUCl . problms can be placed within the basic framework by int.roducing a
fictItIOus disturbance t.aking a unique value with probability one.
Innial
Temperature Xo
Oven 1
Temperature
U o
Oven 2
Temperature
u j
Final
Temperature x 2
Figure 1.3.1 Problem of Example 3.1. The temperature of the material
evolves according to xk+l = (1 - a)xk + aUk, where a is SOBle scalar wit.h
O<a<1.
. e have N = 2 and a terminal cost 92(X2) = r(x2 - 7')2, so the initial
conditIOn for thc DP algorit.hm is [cf. Eq. (3.2)]
h(X2) = r(x2 - '1')2.
For thc next-to-Iast stage, we have [cf. Eq. (3.3)]
.l1(:Z;1) = rnill[1L -I- .h(:1:2)]
"I
= min[1Li -I- h((1- a)xI -I- a1Ll) ] '
Ul
Substit.uting the previous form of h, wc obtain
h(Xd = ln H -I- r((1- a)xl -I- aUI - '1')21.
(3.4)
Example 3.1
A certain mat.erial is passed through a sequence of two ovcns (see Fig, 1.3.1).
Dcnote
This minimization will be done by setting to zero t.he derivative with rcspect
t.o 1LI. This yields
0= 21Ll -I- 2ra( (1 - a)xl -I- a1Ll - '1'),
.J
22
The Dynamic Programming Algo
Chap. 1
23
1]
Sec. 1.3
The DynamJ :ugramming Algorithm
and by collect.ing terms and solving for 1Ll, we obtain t.he optimal temperature
for the last oven:
One not.cworthy feature in the precedillg eXHmple is t.he facilit.y with
which we obt.ained an analytical solut.ion. A little thought while tracing
the steps of the algorit.hm will convince the reader that what simplifies the
solution is the quadratic nature of the cost and the linearity of thc system
equation. In Section 4.1 wc will sec that, gcnerally, when the syst.(1Il is
linear and thc cost is quadratic thcn, regardless of the numuer of stages N,
the optimal policy is given by a closed-form expression.
Another noteworthy feature of the example is that the optimal policy
remains unaffected when a zero-mea.n stochastic disturbance is addcd in
the system equation. To see this, assume that the material's t.cmperature
evolves according to
. ra(T - (1 - a)xJ)
ILl (xl) = 1 + ra2
Not.e that. t.his is \lot. a single control hut rather a cont.rol function, a rule that.
tells liS the optimal oven t.emperature Ul = , (XI) for each possible st.at.c XI.
By subst.ituting the optimalul in the expression (3.4) for JI, we obt.a1l1
1.2a2({l-a)xl-T)2 ( ra2(T-(1-a):cl) _T ) 2
J ( ) - + r ( 1 - a ) xJ + 2
J Xl -- (1 + ra2)2 1 + ra
r 2 a 2 ((1- a)xl - '1')2 + l' ( _ _ 1 ) 2 ((1- a)Xl _ '1')2
(1 + ra 2 )2 1 + m 2
2
r((1- a)Xl - '1')
1 + ra 2
XkH = (1 - a)Xk + aUk + 1lJk,
k = 0, 1,
where 1lJo, 1lJl are independent random variables with given clistribution,
zcro mean
We now go back onc stage. We havc [ef. Eq. (3.3)]
Jo(xu) = min [u + J1(XI)] = min[u +.h ((1 - a)xo + auo)],
1'0 uo
E{ lIJ o} = E{lIJJ} = 0,
ancl finitc variance. Then the equatioll for J I [ef. Eq. (3.3)] becomcs
a\ld by subst.it.ut.ing the exprcssion already obtained for J l , we have
,ft(:q) = mill E { 1LI + r((l - a)xl + aUI + WI - T)2 }
1Ll WI
= min[ui + r((l - a)xl + aUI - T)2
Ul
+ 2r E{ wJ} ((1 - a)xI + aUI - T) + rE{wD].
. [ 2 r((1-a)2XU+(1-a)aUO-T)2 ] .
Jo(:cu) = lfiln Uo + 1 + ra2
Uo
Since E {1IJ I} = 0, we obtain
'vVe minimize wit.h respect. t.o .u,o by set.ting t.he corresponding derivat.ive t.o
zero. We obtain
Jl(Xl) = n [ui + r((l - a)xI + aUl - T)2] + rE{wn.
21"(1- a)a((1- u)2xo + (1 - a)au,o - T)
o = 2uo + 1 + ra 2 .
Comparing this cquation with Eq. (3.4), wc sec that the prcsenc( of WI
has resultcd in an additional inconsequential term, 1'E{wn. Therefore,
the optimal policy for the last stagc rcmains unaffected by the presence
of WI, while JI (XI) is increased by thc constant term I'E{wn. It can be
seen that a similar sit.uation also holds for the first stage. In particular,
the optimal cost is given by the same cxpression as beforc exccpt for an
additive constant that depends on E{w5} and E{wt}.
If the optimal policy is unaffectcd when thc disturbanccs are rcplaced
by their mcans, we say that certainty equivalence holds. We will derive
certainty equivalence rcsults for several types of problems involving a linear
system and a quadratic cost (see Sections 4.1, 5.2, and 5.3).
This yields, aft.er some calculation, the opt.imal temperature of t.he first oven:
. r(l - a)a(T - (1 - a)2xo)
ILo(Xo) = 1 + ra2(1 + (1 _ a)2) .
The opt.imal cost is obtained by substit.uting this exprcssio.n in the fomula
for Jo. This leads to a straightforward but le\lgthy calculatIOn, which 111 the
end yields t.he rather simple formula
r((1-a)2xo-T)2 .
Jo(xo) = 1 + ra2(1 + (1 - a)2)'
Example 3.2
This complet.es the solution of the problem.
To illustrate the computational aspects of DP, consider an invent.ory cont.rol
problem that is slightly different from the one of Sections 1.1 and 1.2. [n
_ J
24
The Dynamic Programming AlgG 1m
Chap. 1
Sec. 1.3
The DynaJ.
Programming Algorithm
25
particular, we assume that inventory Uk and t.he demand Wk are nonnegat.ive
integers, and that the excess demand (Wk - Xk - Uk) is lost. As a result, the
stock equation takes the form
Xk+l = 11Ia.-'«0,."tk + Uk - Wk).
Stock; 2 Stock; 2 Siock ; 2 0
Stock; 1 Stock; 1 Stock; 1
Stock; 1
Stock; 0
stock; 2
We also assume that t.here is an upper bound of 2 units on t.he stock that Call
he stored, i.e. t.herc is a constraint Xk + Uk 2. The holding/st.orage cost. for
the kt.h period is given by
Stock; 0
Slock ; 0
Stock; 0
Siock purchased; 0
Stock purchased; 1
(Xk + Uk - Wk)1,
implying a penalt.y bot.h for excess invent.ory and for unmct demand at. t.he
end of t.he kt.h period, The ordering cost. is 1 per unit st.ock ordered. Thus
t.he cost per period is
Stock;2 0
Slock ; 2
Stock; 1 0
Slock ; 1
Yk(;k, ltk, Wk) = Uk + (Xk + ltk - Wk)2.
The t.erminal cost is assumed t.o be 0,
Stock; 0
Stock; 0
YN(."CN) = O.
Slock purchased; 2
p(Wk = 0) = 0.1,
p(Wk = 1) = 0.7,
p(Wk = 2) = 0.2.
Stage 0 Stage 0 Stage 1 St.age 1 Stage 2 Stage 2
Stock Cost-to-go Optimal Cost-ta-go Optimal Cost-to-go Optimal
stock t.o stock to stock to
purchase purchase purchase
0 3.07 1 2.5 1 1.3 I
1 2.67 0 1.2 0 0.3 0
2 2.608 0 1.68 0 1.1 0
The planning horizon N is 3 periods, and t.he initial stock Xo is O. The demand
Wk has t.he same probability dist.ribution for all periods, given by
The system can also be reprc:,;ent.ed in terms of the t.ransition prohalJilities
P'J(u) between t.he three possible stat.es, for t.he different. values of the control
(see Fig. 1.3.2).
The st.arting equation for the DP algorithm is
h(X3) = 0,
Figure 1.3.2 System and OP results for Example 3.2. The trausition proba-
blhty diagrams for the different values of stock purchased (control) arc shown.
The numbers next to the arcs are the transition probabilities. The coutrol
U = 1 is not available at state 2 because of the limitation Xk + Uk < 2. Simi-
lady, the control U = 2 is not available at states 1 and 2. The res;;-lts uf the
OP algonthm are given in the table.
since t.he t.erminal state cost is 0 [ef. Eq. (3.2)]. The algorithm takes the form
[ef. Eq. (3.3)]
Jk(Xk) =
min 1<) {Uk + (Xk +Uk -Wk)1 + Jk+1 (max(O, Xk -I-Uk -Wk))},
osuk2-xk wk
uk=O,l,2
We calculate the expectatioll of the right side for each of the three possible
values of 112:
where k = 0,1,2, and Xk, l1k, Wk can take t.he values 0,1, and 2.
Period 2: 'We compute h(;2) for each of t.he three possible :>t.at.e:>. We have
h(O) = min, E{U1 -I- (U2 - W1)1}
1L'2=U,l,:.l tl'2
min [U2 -I- 0.1(U2)2 -I- 0.7(U2 - 1)2 + 0.2(U2 - 2)2].
1.£2=0,1,2
U2 = 0: E{.} = 0.7.1 -I- 0.2.4 = 1.5,
11.2 = 1 : E{.} = 1 + 0.1.1 + 0.2.1 = 1.3,
U2 = 2: E{.} = 2 -I- 0.1. 4 -I- 0.7.1 = 3.1.
_ J
26
The Dynamic Programming Algo,
Chap. 1
n
Hence we have, by selecting the minimizing U2,
h(O) = 1.3,
JL (0) = 1.
For X2 = 1, we have
h(l)= min E{u2+(I+U2- W 2)2}
1£2=0,1 1JI2
= min [U2 + 0.1(1 + U2)2 + 0.7(U2)2 + 0.2(U2 - 1)2].
H2=O,1
U2 = a : E{,} = 0.1 . 1 + 0.2.1 = 0.3,
U2 = 1 : E{'} = 1 + 0.1. 4 + 0.7.1 = 2.1.
II ence
h(l) = 0.3, JL(l) = 0,
For ;2 = 2, the only admissible control is U2 = 0, so we have
,h(2) = E{(2- wd} = 0.1,4 +0.7.1 = 1.1,
. '"2
h(2) = 1.1,
JL (2) = O.
. l' A T'tin we compute 1 1 (xl) for each of the t.hree possible st.at.es
Period . g,. I . I ,1 ( 0 ) J (1) h(2) obtained in t.he prevIOus
;2 = 0, 1,2, usmg t Ie va ues 2 , 2 ,
period. For XI = 0, we have
1 1 (0)= min E{Ul+(lL1-Wl)2+h(max(0,ul-Wl))},
1q =0,1,2 wI
III = 0 : J';{.} 0.1 . h(O) 1- 0.7( 1 + h(O)) + 0.2(4 + ,h(0)) = 2.8,
111 = 1 : 1':{.} = 1 + 0.1(1 + h(I)) + 0.7. h(O) + 0.2(1-1- h(O)) = 2.G,
Ul = 2: E{.} = 2 + 0.1(4 + h(2)) + 0.7(1 + h(l)) + 0,2. h(O) = 3.68,
Jl(O) = 2.G, (Li(O) = 1.
1'()1' XI = 1, we have
11(1)= min E{1LI+(1+UI-WI)2+h(max(0,1+UI-WI))},
1£1=0,1 wI
ILl = 0 : J{.} = 0.1 (1 + h( I)) + 0.7. h(O) + O.2( 1 + h(O)) = 1.2,
UI = 1 : E{.} = 1 + 0.1(4 + h(2)) + 0.7(1 + h(I)) + 0.2. J 2 (0) = 2,68,
.h(l) = 1.2, (Li(l) = O.
For .1;1 = 2, the only admissible cont.rol is IJ,I = 0, so we have
.Jl(2) E{(2-w1)2I-h(max(0,2-Wl))}
WI
-OI(II.l.,(:!))+O.7(1+.fc(I))+O.2..J 2 (O)
= 1.68,
Sec. 1.3
The Dynamic ogramming Algorithm
27
1 1 (2) = 1.68,
JLi (2) = O.
Period 0: Here we need to compute only Jo(O) since the init.ial st.at.C' is known
to be O. We have
10(0) = min 1': {uo + (uo - wof I- J l (max(O, Uo - wo))},
'UO=O,1,2 UJQ
Uo = 0: EO = 0.1. .1 1 (0) + 0.7(1 + 1 1 (0)) + 0.2(4 + .h(O)) = ;j.0.
Uo = 1 : E{.} = 1 + 0.1(1 + .11(1)) + 0.7. .h(O) + 0,2(1 + h(O)) = 3.67,
Uo = 2: E{.} = 2 + 0.1(4 + .11(2)) + 0.7(1 + J 1 (1)) I- 0.2. h(O) = 5.108,
.10(0) = 3.67,
II.(O) = I.
If the initial state were not known a priori, we would have to comput.e
in a similar manner .1 0 (1) and .1 0 (2), as well as the minimizing Uo. The reader
may verify (Exercise 1.2) t.hat these calculations yield
.1 0 (1) = 2.67,
(L(I) = 0,
1 0 (2) = 2.608,
JL(2) = O.
f:,:
Thus the optimal ordering policy for each period is t.o order one unit. if t.he
current stock is zero and order nothing ot.herwise. The results of t.h,' D1>
algorithm are given in tabular form in Fig. 1.3.1.
..
':"1
;'
t
k-
:
1.';;.'
t:i
Example 3.3 (Optimizing a Chess Match Strategy)
Consider the chess match example considered in Sedion 1.1. There, ,I player
can select timid play (probabilities Pd and 1 - Pd for a draw or loss, respec-
tively) or bold play (probabilities p," and 1- p," for a win or loss, respectively)
in each game of t.he match. We want to formulate a DP algorit.hm for linding
the policy that maximizes t.he player's probability of winning the mat.ch. Not.e
that here we are dealing with a maximization problem. We can cOnvert the
problem to a minimization problem by changing t.he sign of the Cost fundion,
but a simpler alternative, which we will generally adopt, is t.o replace t.he
minimizat.ion in t.he DP algorithm with Inaximizat.ioll.
Let us consider t.he general case of an N-game match, and let t.h(, st.ate
be the net score, that is, the difference bet.ween t.he point.s of the playC'r
minus the points of the opponent (so a st.ate of 0 corresponds t.o all eVC'n
score). The optimal cost.-ta-go function at t.he st.art. of t.he kth ganlf' is giV<'1I
by t.he dynamic programming recursion
.h(Xk) = max[p,t.lk11(:Ck) + (1- Pd)Jk1 I(J;k - I),
P,u.lk+l (Xk + 1) + (1 - Pw).Jk+l (.I;k - 1)].
(3.G)
The maximum above is t.akcn ovcr t.he two possible decisions:
_...1
28
The Dynamic Programming AlgG ifrJ
Chap. 1
(a) Timid play, which keeps the score at Xk wit.h probabilit.y pd, and changes
Xk t.o Xk - 1 with probability 1 - Pd.
(1)) Bold play, which changes Xk t.o Xk -I- 1 or to Xk - 1 with probabilities
P", or (1 - JI",), respectively.
It is optimal to play bold when
l'w.h+I(Xk -I- 1) -I- (1 - p",).h+l(Xk -1)::0: p,dk+l(Xk) -I- (l-l'd)J k + 1 (Xk - 1)
or cqnivalently, if
P Jk+l ( Xk) - Jk+l(Xk -1)
>
Pd - .h+l(Xk -I- 1) - Jk+I(Xk - 1)'
(3.6)
The dynamic programming recursion is started with
.IN(:I:N) = {w
if XN > 0,
if :I:N 0,
if XN < O.
(:\7)
We have IN(O) = p", because when the score is even aft.er N games (:rN ,O),
it is optimal to play bold in the first game of suden eat.h. . .
By executing the DP algorithm (3.5) startmg with the tcrmmal condl-
t.ion (3.7), and using the criterion (3.6) for optimality of bold play, \\"(' find
the following, assllming that Pd > pw:
IN-l(XN-l) = 1 for XN-l > 1; opt.imal play: eit.her
I N -I(l) = max[Pd -I- (1 - Pd)p", , P", -I- (1 - p",)pw]
= Pd -I- (1 - Pd)pw; optimal play: timid
IN-I(O) = ])",; optimal play: bold
IN-l(-I) =P; optimal play: bold
IN-l(XN-J) = 0 for XN-l < -1; optimal play: eit.her.
Also, given IN-I(XN-l), and Eqs. (3.5) and (3.6) we obtain
IN-Z(O) = max[pdp", -I- (1 - Pd)p, P",(Pd -I- (1- Pd)p",) -I- (1 - Pw)p]
= JI"'(p", -I- (p", -I- Pd)(1- p",»)
'uHI that if t.he score is even with 2 games remaining, it is opt.im,1 t l o ply
, I I t. . I olic y for both peno( S IS .0
b Id Thus for a 2-game malc I, tie op nna p .' .
a . timid if and only if the player is ahead in the score. The reg.lOn of p,alrs
P ( y P ) f or which the player has a better than 50-50 chance to wm a 2-game
PWI d
tHat.ch is
Rz = {(Pw,pd) I Jo(O) = pw(p", -I- (p", -I- Pd)(l - PW)) > I/2},
and, as noted in t.he preceding section, it includes points where P", < 1/2.
(
/
f'
:
If
,
r,
Ii
1,-.;
h
¥\
r
(,
"
h'
r
i\
r
t
i
Sec. 1.3
The DYIli
. rrogralllllling AIguriUllrJ
29
Example 3.4 (Finite-State Systems)
We mentioned earlier (d. the examples in Section 1.1) that syst.ems wit.h a
finite number of states can be represent.ed eit.her in terms of a discrete-t.ime
system equation or in terms of the probabilities of transit.ion between the
states. Let us work out the DP algorithm corresponding to the latter case.
We will assume for the sake of the following discussion that the problem is
stationary (i.e., the transition probabilities, the cost per stage, and t.he control
constraint sets do not change from one stage to the next). Then, if
p,;(U) = P{Xk+l = j I Xk = i, Uk = u}
are the transition probabilities, we can alternatively represent the system by
the system equation (d. the discussion of t.he previous section)
Xk+l = Wk,
where the probabilit.y dist.ribulion of the dist.urbance 1IIk is
P{Wk = j I Xk = i, Uk = u} = Pij(U).
Using t.his system equation and denoting by g(i, u) t.he expected cost per st.age
at state i when control U is applied, the D1' algorithm can be rewritten as
Jk(i) = min [g(i,u)-I-E{Jk+l(Wk)}]
uEU(i)
or equivalently (in view of the distribution of Wk given previously)
Jk(i) = min [ 9(i'U) -I- "" Pij(U)A+l(j) ] .
UEU(,)
j
As an illustration, in the machine replacement example of Section 1.1,
this algorithm t.akes t.he form
IN(i) =0, i=l,...,n,
A(i) = min [R -I- g(I) -I- Jk+l(I), g(i) -I- tP,jJk+l(j)]
The two expressions in the above minimization correspond to the two available
decisions (replace or not replace the machine).
In the queueing example of Section 1.1, the DP algorithm takes the
form
IN(i)=R(i), i=O,I,...,n,
Jk(i) = min [r(i) -I- Cf -I- Pij(lLf)Jk+I(j), rei) -I- c., -I- 1"J(1l,).hll(j)]
The two expressions in the above minimization correspond to the two possible
decisions (fast and slow service).
_ J
30
The Dynamic Programming AlgI
Chap. 1
'n
1.4 STATE AUGMENTATION
We now discuss how to deal with situations wherc somc of the as-
sumptions of thc basic problcm arc violatcd. Gcncrally, in such cases t.hc
problem can be reformulated into the basic problem format. This process is
called state auqmentatian bccause it typically involves the enlargement of
the statc spac. The gencral guidelinc in statc augmentation is to include
in the enlarged state at time k all the infarmatian that is kna"Wn to. the
controllcT a.t timc k and can be used with a.dvantage in sclcctiny Uk. U n-
fortunately, statc augmcntation oftcn comes at a price: thc refonnulaed
problcm may havc very complex state and/or control spaces. \Ve provide
somc examplcs.
Time Lags
In many applications thc systcm state XkH dcpcnds not only on thc
preceding state :Ck and control 'Uk but also on earlier stat.cs. and contr.ols.
In other words statcs and controls influence future states with somc tnnc
lag. Such situ:ttions can bc handled by state augmcntation; thc state is
expanded to includc an appropriate number of earlicr. statcs ald controls.
For simplicity, assumc that there is at most a smglc pel"lod t.unc lug
in t.he state and control; that is, t.hc systcm equation has the form
Xk+l = !k(:Ck,:rk-l,uk,U'k-I,"Wk),
k = 1,2,...,lY - 1,
(4.1 )
1:1 = /o(:co, UO, wo).
Time lags of morc than one pcriod can be handlcd similarly.
If wc introducc additional stat.e variables )lk and Sk, and we makc thc
idcntifications )lk = :Ck-l, 5k = Uk-- 1, the system cquation (4.1) yields
( :Z:k+l ) ( !k(:l:k, )lk: k, 5k, "Wk) )
'!hl.l .q, .
S k t I Uk
(lJ.2)
Dy dcfining :l:k = (:I:k, Yk, 5k) as the ncw state, we have
Xk+1 = 1k(xk, Uk, "Wk),
wherc thc systcm function1k is dcfincd from Eq. (4.2). By using the prec?d-
ing cquation a.<; thc system equation and by expressing the cot functIOn
in terms of the new statc, thc problem is reduced to the basic problem
without time lags. Naturally, thc control Uk should now dcpcnd on the
ncw state ik, or equivalently a policy should consist of functions iLk of thc
Sec. 1.4
State A ugn, "tion
31
current state :1:k, as well a.<; the preceding statc Xk-I and thl preceding
control Uk-I.
When the DP algorithm for thc reformulatcd problem is translated
in tcrms of the variables of the original problem, it takcs the form
IN(XN) = 9N(XN),
IN -I (XN -I, XN-2, 1LN-2)
= rnin E { 9N-I(J:N-I,lL N - I ,"W N -J)
"N-l EUN_)(xN_l) WN-l
+ IN(JN-I(XN-I,XN-2,lLN-I,lLN-2,11JN_J))},
Jk(Xk,:Ck-l, lL k-J) = min E { 9k(Xk, Uk, 11Jk)
"kEUkCxk) wk
+ Jk+J(h(J:k,J:k-I,"Il.k,"Il.k_I,W..),:q,Uk) }, k = 1,...,lY - 2,
Jo(xo) = min E {go(XO,lLo,11Jo) + .h(Jo(:l:o,lLo,'Wo),:eo,"Il.o)}.
uoEUo(xo) wo
Similar reformulations arc possible when timc lags appcar in the cost;
for cxample, in thc ca.<;e where the cost is of the form
{ N-l }
E 9N(XN,XN-I) +90(xo,lLo,11Jo)+ t;9k(:Z:k,Xk-I,lJ.k,1JJ d .
The extremc case of time lags in thc cost ariscs in the nonadditivc fol'lu
E {9 N (x N , X N - 1 , . . . , xo, 1L N - I , . . . , Uo, W N _ I , . . . , 11Jo) } .
Then, the problcm can be rcduccd to the basic problcm format, by taking
as augmentcd stat.c
Xk = (:l:k,Xk-I,... ,Xo, lLk-I,..., lLo, 'Wk-l,..., wo)
,,;
and E{gN(XN)} as reformulated cost. Policies consist of functions Ilk of the
prcscnt and past statcs Xk,. .. ,:ro, thc past controls lLk-I,... ,Uo, and the
past disturbances Wk-l,"', WOo Naturally, we must assumc that the past
disturbances are known to thc controller. Otherwise, we are fac:(d with
a problcm where the state is imprecisely known to thc controller. Such
problems arc known as problems with imperfcct state information and will
be discusscd in Chapt.cr 5.
f
-
f
"
_...1
32
The Dynamic Programming Algor. .1
Chap. 1
Correlat.ed Disturbances
Wc t.urn now t.o the case whcre t.he disturbances Wk are correlated
ovcr time. A COlIllllOII sit.uatioll that C;l1I be halldlcd cflicienUy by state
augmcntation arises when t.he proccss Wo,. .. , WN-l can bc represented as
the output of a lincar systcm driven by independcnt random variables. As
an example, supposc that by using statistical methods, we determine that
the evolution of Wk can be modeled by an cquation of the form
Wk = AWk-l + k,
whcrc A is a givcn scalar and {d is a scquencc of independcnt. random
vectors with given distribution. Then we can introduce an additional statc
variable
Yk = 11Jk-l
and obtain a ncw systcm cquation
( Xk+l ) = ( !k (Xk, Uk, AYk + k) ) ,
Yk+l AYk + k
whcre thc ncw state is the pair Xk = (Xk, Yk) and the new disturbance is
the vector k'
Morc gcncrally, supposc that Wk can bc modeled by
Wk = CkYk+l,
where
Yk+l = AkYk + k,
k = 0,. . . , N - 1,
A k , C k are known matriccs of appropriatc dimcnsion, and k arc indepen-
dent random vcctors with givcn distribution (sce Fig. 1.4.1). By viewing
Yk as an additional state variable, wc obtain the new system equation
( Xk+l ) = ( !k(.Tk,Uk,Ck(AkYk +d) ) .
Yk+l AkYk + k
Yk+ 1 = AkYk+ k C k
Figure 1.4.1 Representing correlated disturbances as the output of a linear sys-
tem driven by independent random vectors.
,."
Sec. 1.4
SLaLe Augn
,Lion
33
Note that in order to hav(, lH'rr('cl. stale inronna!.ion, 1.11(' ('olll.roll('r
must be able to observe Yk. Unfortunately, this is true only in the minority
of practical cases; for examplc when Ck is the idcntity matrix and "Uik-I is
observed before Uk is applied. In the ca.o.;e of jH,rfect. stat.e infonllaLion 1.11<'
DP algorithm take::; the form '
IN(XN,YN) = gN(XN),
,h(Xk, Yk) = min E{gk(xk. Uk, Ck(AkYk + d)
UkEUdxk) k
+ Jk+l (h(Xk,Uk, Ck(AkYk + k)), AkYk + k)}.
Forecasts
Finally, considcr the case wherc at time k the cont.rollcr has access t.o
a forccast Yk that re::;ults in a reasscssment of thc probability distribut.ion
of Wk and possibly of futurc disturbances. For exampl(', Uk limy he ;111 (xact
predict.ion of 11Jk or all exact prcdiction that the probability dbtribution of
Wk is a specific one out of a finite collcction of distributions. Forccasts of
intcrcst in practice arc, for examplc, probabilistic predictions on thc state
of thc weathcr, the interest ratc for moncy, and thc demand for invcnt.ory.
Gcnerally, forecasts can bc handlcd by statc augmentation although
the reformulation into the basic problem format may bc quitc complex. We
will treat here only a simple spccial casc.
Assume that at the beginning of each period k, thc controller rc-
ceives an accuratc prediction that the next disturbance 11Jk will be sdect.ed
according to a particular probability distribution out of a givcn collection
of distributions {Ql,..., Qm}; that is, if the forecast is i, thcn "Wk is sc-
lected according to Qi. The a priori probability that thc foreca::;t will be i
is denoted by Pi and is given.
As <In example, supposc that in our carlier invcntory example t.he de-
mand Wk is determined according to onc of three distributions Ql, Q2, and
Q3, corresponding to "small," "mcdium," and "largc" demand. Each of thc
thrce types of demand occurs with a given probabilit.y at each timc pcriod,
independently of the values of demand at prcvious time pcriods. Howevcr,
the inventory managcr, prior to ordering Uk, gcts to know through a forc-
ct .the .type of dcmand that will occur. (Note that it is the probnbility
dlstnbutlOn of demand that bccomes known through thc forecast, 1101, the
demand itself.)
The forecasting process can be represented by means of the equation
Yk+l = k,
where Yk+l can take the values 1,. . . , m, corresponding to the m possiblc
forecasts, and k is a random variable taking the value i with probability
_..1
34
Tlw Dynamic Programming Algc
Chap. 1
/II
pi. The interpret.ation here s tl.lat vhen k takes the value i, then Wk+l
will occur according to the dlstnbutlOn Qi. . . _
Dy combining the system equatin with the forecast equatIOn Yk+ 1 -
k, we obtain an augmented system given by
( .1:k+l ) = ( !k(Xk,Uk,Wk) ) .
Yk+l k
'I'll(' IlPW t.at.P i :1:,. = (.1:", yd, and because the forecast Yk s Inowli ;It tl\e
. ..... ..... Ii' ", ." .\ ,.!., rh.' ".'W ,JI:,turl':1IlCe I:' Ii't.. = (1/',. ,I; .
':11\';'1'1:': '1''';h::;q;I;:\' '\;:;,tl'II'\'I/h'.'ll '\;: dt'{\'llllllh'd b) IlH' dlt.rIbll{lUn (J{ dud
the probabilities Pi, and depends explicitly 011 Xk (via Yd but not on the
prior disturbances. Thus, by suitable rcformulatioll of the cost, the probl:m
can be cast into the basic problem format. Note that t.he control apphed
depellds on both the current state and the current forecast.
The DP algorithm takes the form
IN(XN,YN) = gN(:I;N), (4.3)
Jk(Xk, Yk) = mill E { gdXk' Uk, Wk)
UkEUk(Xk) Wk
(4.4)
+ f1J,.h+l (h(:I;k, 11'k, Wk), i) I Yk}'
i=l
where 1/k may t.ake the values 1,..., m, and the expect.ation over Wk is
taken wit.h respect to the distribut.ion QYk'
There is a lIice simplification of the above algorithm that allows DP
to be executed over a smaller spacc. In particular, define
m
.J k (:1:k) = LP;Jk(:rk, i),
i=1
k = 0, 1,..., N - 1,
and
.J N(:!:N) = gN (:!:N)'
Theil [rom Eq. (4.4), we obtain the algorithm
Jd:I;k) = f 11, min E {!Jk(:I;k' Uk, Wk)
,=1 "kEUk(xd wk
+ .J k +1Ud.1: k ,Uk,Wk))1 Yk = i},
which is executed over the space of Xk rather than Xk and Yk. This simpli-
fication arises also in other contexts where the state has a component that
cannot be affected by the choice of control (see Exercise 1.22).
It should be clear that the preceding formulation admits several ex-
tensions; one example is the case where forecasts can be influenced by
the cOIlt.rol action and involv() several future disturbances. However, the
price for these extensions is incre,lSed complexity of the correspolldillg D P
algorithm.
Sec. 1.5
Some Mathe, ..kal Issues
35
1.5 SOME MATHEMATICAL ISSUES
Let us now discuss some technical issues relating to the basic prob-
lem formulation. The reader who is not mathematically inclined lIeed 1I0t
be concerned about these issues and can skip this section without loss of
continuity.
Once an admissible policy {lLa,.. ., 11N-I} is adopted, the followillg
sequence of events is envisioned at the typical stage k:
1. The controller observes Xk and applies Uk = ILk(Xk).
2. The disturbance Wk is generated according to the given distribution
Pk(- I Xk,ILk(xk)).
3. The cost Yk(Xk,11k(Xk),Wk) is incurred and added to previous costs.
4. The next state Xk+l is generated according to the system equatioll
Xk+l = !k (Xk, l1.k(:1:k), Wk)'
If this is the last stage (k = N - 1), t.he terminal cost. YN(:rN) is
added to previous costs and the process terminates. Otherwise, k is
incremented, and the same sequence of events is repeated for the next
stage.
For each stage, the above process is well-defined and is couched ill pre-
cise probabilistic terms. Matters are, however, complicated by thc lIeed to
view the cost as a well-defined random variable with well-dcfined expect.ed
value. The framework of probability theory requires that for each policy
we define an underlyillg probability space, that is, a set n, a collection of
events in n, and a probability measure on these events. In addition, the
cost must be a well-defined random variable on this space ill t.he scnse of
Appelldix C (a measurable fUllction from the probability space illto the real
line in the termillology of measure-theoretic probability theory). For this
to be true, additional (mcasurability) ,lSsumptions on the functions !k, gk,
and ILk may be required, and it may be neccssary to introduce additional
structure on the spaces Sk, Ck, and Dk. Furthermore, these assumptiolls
may restrict the class of admissible policies, since the fUlictions Pk may bc
constrained to satisfy additional (measurability) requirements.
Thus, unless these additional assumptiolls and structure are specified,
the b,lSic problem is formulated inadequat.ely. Unfortunat.ely, a rigorous
formulation for general state, cOlltrol, alld disturbance spaces is well heyolld
the mathematical framework of this introductory book and will not be
undertaken here. Nonetheless, it turns out that these difficulties arc mainly
technical alld do not substantially affect the basic results to be obtaiJl(d.
l<or this reason, we find it convenient to proceed with informal derivations
and arguments; this is consistent with most of the literature on the subjed.
_. J
I;
.
f
. "
!
:
,
! "
.
f
I".
I.
I
r
"
36
T1JC DYllamic Programming Algor. .11
Chap. 1
We would likc to strcss, however, that undcr at least onc frcquently
satisfied assumption, thc mathematical difficulties mentioned above disap-
pear. In particular, let us assume that the disturbance spaces Dk arc all
countable and the expected valucs of all terms in the cost are finitc for
every admissible policy (this is true in particular if thc spaccs Dk arc finitc
sets). Then, for every admissible policy, the expected values of all the cost
terms can be written as (possibly infinite) sums involving thc probabilities
of the elements of the spaces Dk'
Alternatively, one lIlay writc the cost as
J1f(xo) = Xl,.I?xN {9N(XN) + 9k(Xk,J1.k(Xk))} , (5.1)
wherc
9k(:Ck,fld x d) = E{!/k(Xk,J1.k(Xk),Wk) I Xk,flk(:Ck)},
Wk
with the prcceding cxpectation taken with respect to the distribution Pk (- I
Xk, J1.k(Xk)) defined on the countable sct Dk. Then one ma,!' take as thc ua-
sic probability space the Cartesian product of the spaces 8 k , k = 1,..., N,
given for all k by
Sk+l = {Xk+1 E Sk+l I :Ck+1 = fk(Xk,J1.k(Xk),Wk), Xk E Sk, Wk E Dk},
where So = {xo}. The set Sk is the subset of all states that can be reached
at time k when the policy {J1.0"" ,J1.N-l} is used. Because the disturbance
spaces Dk are countable, the sets Sk are also countable (this is true since the
union of any countable collection of countable sets is a countable set). The
system equation Xk+l = h(XbJ1.k(Xk),Wk), the probability distributions
Pk('\ XbJ1.k(Xk)), the initial state XO, and the poli<:?, {J1.0,'" 'I!:N-d define
a probability distribution on the countable set 81 x .., X 8N, and the
expected value in the cost expression (5.1) is defined with respect to this
latter distribution.
In conclusion, the basic problem has been formulated rigorously only
when the disturbance spaces Do,..., D N - 1 arc countable sets. In the ab-
sence of countability of these spaces, the reader should interpret subsequent
results and conclusions as essentially correct but mathematically imprecise
statements. In fact, when discussing infinite horizon problems (where the
need for precision is greatcr), we will make the countability assumption
explicit.
We note, however, that the advanccd reader will have little difficulty
in establishing rigorously most of our subsequent results cOllceming f>pe-
cific finite horizon applications, even if the countability assumption is not
satisfied. This can be done by using the DP algorithm as a verification
Sec. 1.6
Notes, Source .lId Exercises
37
theorcm. In particular, if onc can find within a subset of policics CI (such
as thosc satisfying certain measurability restrictions) a policy that attains
thc minimum in thc D algorithm, thcn this policy can be rcadily shown
to be optimal within 11. This result is developed in Excrcbe 1.12, and
can be uscd by t.hc mathcmatically orient.ed rcadcr to establish rigorously
many of our subsequent results concerning specific applications. For cx-
ample, in linear-quadratic problcms (Section 4.1) one determincs from the
DP algorithm a policy in closcd form, which is linear in the currcnt stat.e.
When Wk can t.akc llncollnt,lbly many valucs, it. is nccessary t.hat. admis-
sible policies consist of Borel measurable functions J1.k. Since thc lincar
policy obtained from the DP algorithm belongs to this class, thc rcsult of
Exercise 1.12 guarantees that this policy is optimal. For a rigorous math-
emat.ical treat.ment of DP that resolves the associatcd measurability issucs
and supplcments thc prescnt text, see [BcS78].
1.6 NOTES, SOURCES, AND EXERCISES
Dynamic programming is a simplc mathcmatical tcchniquc that has
been used for many years by engineers, mathematicians, and social scien-
tists in a variety of contexts. It was Bellman, however, who fCalizcd in
the early fifties that DP could be developed (in conjunction with the thcn
appearing digital computer) into a systematic tool for optimization. In his
influential books [Bel57], [BeD62J, Bellman demonstrated the broad scopc
of DP and helpcd strcamline its thcory.
Following Bellman's works, therc was much rcsearch on DP. In par-
ticular, the mathematical and algorithmic issues associated with infinitc
horizon problems were extensively invcstigated, cxtcnsions to cOlltillUOllS-
time problcms were formulated and analyzed, and the mathemat.ical issucs
discussed in the preceding section relating to the formulation of DP werc
addressed. In addition, DP was used in a very broad variety of applica-
tions, ranging from many branches of engineering to statistics, economics,
finance, and some of the social scienccs. Samples of these applications will
be given in subsequent chapters.
EXERCISES
1.1
Consider t.he system
Xk+l = Xk + Uk + Wk,
k=O,1,2,3,
_.I
38
The Dynamk Programming Ah- vhm
Chap. 1
wit.h init.ial stat.e :I:u = ::i, and t.he cost. function
3
2 2
L)Xk + Uk)'
k=O
Apply the OJ> algorithm for the following three cases:
(a) The control constraint set Uk(xd is {u I 0 S; Xk + uS; ::i, u : int.eger}
for all Xk and k, and the disturbance Wk is equal to zero for all k.
(b) The control constraint and the disturbance Wk are as in part (a), but
there is in addit.ion a constraint X4 = 5 on the final state. llint: For
t.his problem you need to define a state space for X4 that consists of just
t.he value :1:4 = ::i, and also to redefine U 3 (X3)' Alternatively, YOIl lIIay
IIse a terminal cost. 94(X4) equal to a very large number for X4 f S.
(c) The control constraint. i", as in part (a) and the dist.urbanc(' 10,- t.ake",
the value", -1 and 1 wit.h equal prouauility 1/'2 for all Xk and Uk, except
if Xk + Uk is equal t.o 0 or 5, in which case Wk = 0 with probabilit.y 1.
1.2
Carry out the calculations needed to verify that J o (l)
2.608 in Example 3.2.
2.67 and Jo(2)
1.3
Suppose we have a machine that is either running or is broken down. If it
rnns thronghout. one we(,k, it. makes a gross profit of $100. If it. fails dming
the week, gross profit is zero. 1£ it is running at the start of the week and
we perform preventive maintenance, the probability that it will fail during
the week is 0.'1. If we do not perform such maintenance, the probability of
failure is 0.7. However, maintenance will cost $20. When the machine is
broken down at the start of the week, it may either be repaired at a co",t of
$40, in which case it will fail during the week with a probability of 0.4, or it
may be replaced at a cost of $150 by a new machine that is guaranteed t.o run
t.hrollgh it.s first. w('(,k of op('ratioll. Filld t.he opt.imal repair, n'plm'eIlH'lIt. alld
maint.enance policy t.hat maximi?e", total profit over four we(,ks, ,!.'i",uming a
nl'\\' machinl' at. th(' start of th(' first week.
1.4
A game of the blackjack variety is played by two players as follows: Both
players throw a die. The first player, knowing his opponent's result, may stop
or may throw the die again and add the result to the rC8ult of his previous
throw. He then may stop or throw again and add the result of the new throw
to the sum of his previons t.hrows. lie may repeat t.his process as many times
as he wishes. If his sum exceeds seven (i.e., he busts), he loses the game. If
,
r
r
[.
k
(,
f
Sec. 1.6
Notes, SOL
" and Exercises
39
he slops bcf()re l''-("l'el!ing Sl'\"<'II. Ihl' cl'l)lId pb,'r \;lkl' ,)\<'\" ,11\.1 t \11,'\" t h,'
die successively until the sum of his throws is four or higher. If I he slim of
the second player is over seven, he loses the game. Otherwise the player with
t.he larger sum wins, and in case of a tie the second player wins. The probl<'ln
is t.o determine a stopping strategy for the first player t.hat ma.ximizes his
probabilit.y of winning for each possible init.ial throw of t.he s('('ond plaVl'r.
Formulat.e the problem in tenll'" of OJ> and find an opt.imal stopping st.rat.l'g'
for the case where the second player's initial throw is three. llinl: Take N = ()
and a tate space consisting of the following 1::i st.ates:
Xl : busted
1+. I I .
x : a reae y stopped at sum 'l (1 S; i <; 7),
X 8 + i : current sum is i but the player has not. yet. st.opped (1 <; i :S 7).
The optimal strategy is t.o throw until the sum is four or higher.
1.5 (Min-Max Problems)
In the framework of the basic problem, consider the case where the dist.ur-
bances Wu, WI,. . . , WN-l do not have a probabilistic descript.ion but. rat.her
are known t.o belong to corresponding given sets Wk(Xk, Uk) C J)k, k =
0,1,..., N - 1, which may depend on the current state Xk and control Uk.
Consider the problem of finding a policy 7f = {Jlo,..., IlN- d with /1.;-(:1:..) E
Uk(Xk) for all Xk, k, which minimizes the cost function
[ N-l ]
J,,(XO) = max 9N(XN)+9k(Xk,Jlk(Xk),Wk)
wkEWk(xkol"k(xk)) L....
k=O.I.....N-l k=O
The 01' algorithm for this problem takes t.he form
IN(XN) = 9N(XN),
Jk(Xk)= min max [9k(Xk,Uk,Wk) + J k + I ( !k(Xk,1lk,Wk) )] .
"kEU(xk) wkEWk(xk'''k)
Assuming that Jk(Xk) > -00 for all Xk and k, show that the optimal cost
equals .7 o (.T.o). llint.: Imitate the proof for the sl.odla.'it.ic "a.'ic. I'rov,' alld liS"
the following fact.: 11" U, vV, X are three sets, 1 : W . X is a function and
M is the set of all functions Jl : X -> U, then for any funct.ions Go : W ->
( -00,00], G I : X X U -> (-00,00] such that.
mill G 1 (I(w), 1l) > -00,
ttEU
for all W E W
we have
min max [Go(w) + G 1 (1(111), JI.(J( w)))] = max [ G u ( w) + mill G l ( /(111), 11. )] .
p.ENI wEW wEW uEU
_. J
40
T1JC Dynamic l'rogralI1ll1ing Algo
C1Jilp. I
u
1.6 (Discounted Cost per Stage)
In the framework of t.he basic problem, consider the C;llie where the cost is of
t.he form
E
tvk
k=O,l....,N-1
{ N-l }
o:N 9N (J;N) + ak9k(xk,Uk,Wk) ,
where 0: is a discount fact.or with 0 < (X < 1. Show that an alt.ernate form of
the D1' algorit.hm is given by
VN(:rN) = 9N(."rN),
Vd.q.) = !llin E {9k(Xk' !Lk, Wk) + a V k + 1 (Jk(Xk' !Lk, Wk)) }.
t1kt:U1,;(xk) U'k
1.7 (Exponential Cost Function)
In t.he framework of the basic problem, consider the case where the cost. is of
tl1(. fl'rln
/ '
"
Wk
k=O.I....,N-l
{ex p (9N(XN) + 9k('"rk'Uk'Wk)) }.
(a) Show t.hat t.he optimal cost aud itn optimal policy can be obt.ained from
the 0 l' -Ii ke algori thm
IN(XN) =exp(9N(XN)),
Jk(Xk) = min E{Jk+l(Jk(Xk,Uk,wk))exp(9k(Xk,U",wd)}.
"kEU,,(xk) U'k
(b) Define the functions Vk(Xk) = In Jk(Xk). Assume also that gk is a
function of Xk and Uk only (and not of Wk)' Show that the above
algorithm can be rewritten as
VN(XN) =9N(XN),
Vk(:Ck) = min { 9k(Xk, Uk) + In E {exp(Vk+l (Jk(Xk, Uk, Wk))) } .
ukEUk(xk) wk
::ice. 1.6
Noies, ::ionfL . and Bxercisvs
41
1.8 (Terminating Process)
In the framework of the basic proulem, consider t.he ('a.e where the sys(elll
evolut.ion terminat.es at. time i when a given valne w, of t.he disturuance at.
t.ime i occurs, or when a terminat.ion decision u, is made by t.he cont.roll,.r. If
t.ermination occurs at time i, the rl'.sult.ing cost is
T+ Lgk(Xk,Uk'W,,),
k=O
where T is a termination cost.. If the process has not. t.el"lllinat.ed lip t.o t.he finitl
time N, the resulting cost is 9N(XN) + I:,::OI 9k(Xk, Uk, Wk). ncfol"lnulat.e t.he
problem into the framework of t.he basic problem. Hin/.: Augment t.he st.at.l'
space with a special t.ermination state.
1.9 (Multiplicative Cost)
In the framework of the basic problem, consider t.he case where t.he cost has
the multiplicative form
E {9N(XN) . 9N-.l(XN-l, UN-I, WN-l)' .. 90(XO, Uo, Wo)}.
wk
k=O,I,...,N-I
,
f
Develop a DP-like algorithm for this problem assuming that 9k(.Ck, Ilk, Wk) 2:
o for all Xk, Uk, Illk, and k.
1.10
t
f
r.
t
Assume that we have a vessel whose maximum weight capacit.y is z and whose
cargo is to consist. of different quantit.ies of N different items. Let Vi denot.e
the value of the ith type of item, Ill, the weight of ith type of item, and x,
the number of items of type i that arc loaded in the vessel. The problem is
to find the most valuable cargo, Le., to maximize I:,::l x,v, subject to the
constraints I:,::l Xillli z and x, = 0,1,2,... Formulatc t.his problem in
terms of DI'.
I.
L
I
1.11
"
I'
f.
;
:-
Consider a device consisting of N st.ages cOllnect.ed in series, where each st.age
consists of a particular component. The compollents are subject to failure,
and to increase the reliability of the device duplicate components are provided.
For j = 1,2,..., N, let (I + 1nJ) be t.he number of components for t.hl' jt.lt
stage, let pJ(m J ) be t.he probability of successful operation of the jth stage
when (I + 1nj) components are used, and let Cj denote the cost of a single
component. at the jt.h st.age. Formulat.e in terms of 01' t.he problem of finding
\
r:
_ J
42
Thc Dynamic Programming Alga.
n
Chap. 1
t.he number of component. at. each t.il.ge that maximize th,' reliabilit.y of the
device expresed by
]J1( 1H l)' ]J2( 1H 2)" 'PN(mN),
subject. t.o t.he cost. consl.raiut L.cl ('./1/1,./ -::: A, where A > ° is ).';iv('II.
1.12 (Minimization over a Subset of Policies)
This problem is primarily of theoretical interest (see the end of Section 1.5).
Consider a variation of the basic problem whereby we seek
mil J,,(Xo),
"Ell
where fl is some given subset of t.he set. or sequences {/l,n, 1/.1, . . . , I',N _ I} or
functions ILk: Sk -> Ck with ILk(:I:k) E Ud:I:k) for all Xk E ::h. Assume t.hat
Jr* = {IL,JL;,... ,Jt-l}
bclongs t.o 11 and attains the minimum in the DP algorithm; t.hat is, for all
k = 0, I,..., N - I and :J;k E ,(h,
JdXk) = E {9k (Xk' ILZ(Xk), Wk) + J k + 1 (Jk(Xk, ILZ(Xk), Wk)) }
Wk
min E {9k(Xk, Uk, Wk) + Jk+1 (!k(Xk, Uk, Wk)) } ,
"kEUk(xk) Wk
with IN (XN) = gN (XN). Assume further that t.he functions .h arc real-valued
and that the preceding expectations are well defined and finite. Show that
Jr' is optimal wit.hin Ii and that the correponding optimal cot is equal 1,0
Jo(xo).
1.13 (Semilinear Systems)
Consider a problem involving t.he yst.em
Xk+1 = Akxk + !k(Uk) + Wk,
where Xk E n", !k arc given functions, and Ak and Wk arc random n x n
matrices and n-vectors, respectively, with given probabilit.y distributions that.
do not. depend on Xk, Uk or prior values of Ak and Wk. Assume that the cOst
is of the form
{ N-l }
B C'r,XN + L (rxk + 9k(ILk(Xk))) ,
Ak,wk
k=O.I..".N-l k=O
where Ck are given vect.ors and gk arc given functions. Show t.hat. if the
optimal cost. for this proulem is finite and t.he control const.raint set.s Uk(Xk)
are illdepeudeut. of ;/:k, theu the cost-to-go fuuctious of lIw OJ> algorithm arc
alfine (linear plus constant). Assuming that there is at least one optimal
policy, show that there exists an optimal policy that consist.s of constant
functions ILk; t.hat is, ILZ(Xk) = constant for all Xk E n.
Sec. 1. G
Notes, SOUl
and Exercises
'13
1.14
A farmer annually producing Xk unit.s of a certain crop stores (1- Uk )Xk units
of his production, where ° ::; Uk ::; 1, and invests t.he remaiuillg lI.k./:k ullits.
thus increa:-;inJ.'; the u('xt year's production to a 1,'v('1 :q I I giv('11 hy
Xk+l = Xk + WkUkXk,
k = 0, 1,..., N - I.
The scalars Wk arc independent random variables with identical prohahilil.y
di5t.ributions that. do not depend either on Xk or Uk. Furt.hermore, E {/lid =
ill > O. The problem is to find the optimal investment policy that. Illaximiz('s
the t.otal expected product stored over N ycars
{ N-I }
k=O'/"N-l XN + (1 - UdXk .
Show t.he opt.imality of the following policy that consist.s of coust.,1Ilt runct iOllS:
(a) Ifw> 1,llo(xO)="'=IL?v_l(XN-I)=l.
(b) ffO < w < I/N, ILo(XO) =... = 1?v_,(J:N-l) = 0.
(c) If l/N::; w::; 1,
I"(xo) =... = I'N-k-,(:/;N-k_,) 1,
IL_k(J;N_k) =... = Il?v_/(XN_l) = 0,
where k i uch that 1/(k + 1) < ill ::; I/k.
1.15
Let Xk denote the number of educators in a cert.ain country at timc k and
let. Yk denot.e the number of research scient.ist at time k. New cicnt.it.s (po-
tential educators or research scientists) arc produced during t.he kt.h period
by educators at. a rat.c /k pcr cducat.or, while educator and rcscarch scicn-
tists leave the field due to death, retirement, and transfcr at a rate 8k. The
scalars /k, k = 0,1,..., N - 1, arc independent idcntically ditributed ran-
dom variables taking values within a closed and bounded interval of positive
numbers. Similarly 8k, k = 0,1,. . . ,N - 1, arc independent. ident.ically dis-
t.ributcd and t.ake value in an int.erval [8,8'] wit.h ° < b ::; 8' < 1. By means
of incentives, a cience policy maker can detcrmine t.he proportion Uk of new
scient.ists produced at. t.illle k who become educators. Thus, the nUllluer of
rcca.r("h Kcicntixt.s and pducal.orx evolves according t.o t.he cqllal.iOllx
Xk+l = (1 - 8k)Xk + Uk/kXk,
Yk+l = (I - 8k)Yk + (1 - ukhk.Tk.
,
J,
f
,:
_.I
44
Tlw Dynamic Programming AlgL
/lJ
Chap.
Scc. 1.6
Nail's, Sourc,
.nd Exercises
45
The init.ial numbers xo, yo are known, and it is required to find a. policy
{tL(XO'Yo)",' ,tL7v_l(XN-l,YN-l)}
1.18 (Computer Assignment)
Jk(Xk,Yk) = kXk + (kYk.
Iu the classical game of blackjack t.h(. player draws cards kuowiug ouly OU('
card of t.he dealer. The player loses llpon reaching a sum ,)f cards ('x("l'"ding 21.
If the player stops before exceeding 21, the dealer draws cards until reaching
17 or higher. The dealer loses upon reaching a sum exceeding 21 or stopping
at a lower sum than t.he player's. If player and dealer end up with an equal
sum no one wins. In all ot.her cases the dealer wins. An ace for t.he player
may be counted as a 1 or an 11 as the player chooses. An ace for tlw dealer is
counted as an 11 if t.his result.s in a sum from 17 t.o 21 and as a I ot.herwise.
Jacks, queens, and kings count as 10 for both dealer and player. We assume
an infinite card deck so the probability of a particular card showing up is
independent. of earlier cards.
(a) For every possible initial dealer card, calculate the prubauility t.hat t.he
dealer will reach a sum of 17,18,19,20,21, or over 21.
(b) Calculat.e the opt.imal choice of t.he player (draw or stop) for each uf
the possible combinations of dealer's card and player's sum of 12 t.o 20.
Assume t.hat. t.he player's cards do not include an ace.
(c) Repeat part (b) for the case where the player's cards include an ace.
with
0< (X tLk(Xk, Yk) (3 < 1, for all Xk, Yk, and k,
which maximizes E"Yk.bk {YN} (i.e., the expected final number of research sci-
entists after N periods). The scalars a and (3 are given.
(a) Show t.hat the cost-t.o-go functions JdXk, Yk) are linear; that is, for
some scalars k, (k,
(c)
Derive an optimal policy {tL, . . . ,tL7v- d under the assumption Ebd >
E{ od and show t.hat this optimal policy can consist of const.ant func-
tions.
Assume that the proportion of new scientists who become educators at
time k is Uk + Lk (rather than Uk), where <k are identically distributed
independent random variables that are also independent of "jk, Ok and
t.ake values in the int.erval [-a, 1- {3]. Derive the form of t.he cost-t.o-go
functions and the optimal policy.
(b)
1.16
1.19 [Shr81]
Given a sequence of matrix multiplications
M 1 M 2 ...AhAh/I...MN,
where each Mk is a matrix of dimension nk x nk+l, the order in which mul-
tiplications are carried out can make a difference. For example, if nl = 1,
71 2 = 10, 7Ia = 1, and n4 = 10, the calculation ((M 1 M 2 )!vh) requires 20
scalar multiplications, but the calculation (MI (M2M3)) requires 200 scalar
multiplications (multiplying an m x n matrix with an n x k matrix requires
mnk scalar multiplications). Derive a DP algorithm for finding the optimal
multiplication order. Solve the problem for N = 3, nl = 2, n2 = 10, n3 = 5,
and n4 = 1.
A decision maker must continually choose bet.weeu t.wo activit.ies over a time
illt.(rval [0, rrI. Clioosing activit.y i a.t t.itIl( /" wh(rc i = 1,2, ('arliS nward ;d,
a rate gi(t), and every switch between the two activities costs c> O. Thus,
for example, the reward for st.arting with activity 1, switching to 2 at time
1.1, ,md swi tching back to 1 at time t2 > h carns total reward
1 '1 1 '2 j T
gl(t)dt+ g2(t)dt+ gl(t)dt-2c.
o 'I '2
1.17
We want to find a set of switching times that. maximize t.he tot.al reward.
Assume that the function gl(t) - g2(t) changes sign a finite number of times
in the interval [0, '1']. Formulate the problem as a finite horizon problem and
write the corresponding DP algorithm.
The paragraphing problem deals with breaking up a sequence of N words of
given lengt.hs into lines of length A. Let Wl,..., WN be the words and let
Ll,' . . ,LN be their lengths. In a simple version of the problem, words are
separated by blanks whose ideal width is b, but blanks can stretch or shrink
if necessary, so that a line Wi, Wi+l,. . . ,Wi+k has length exactly A. The cost
associated with the line is (k+ 1)lb' -bl, where b' = (A- L, -. ..- LiH)/(k+ 1)
is the actual average width of the blanks, except. if we have the last. line (N =
'i + k), in which case the cost is zero when b' 2: b. Formulate a DP algorithm
for finding the minimum cost separation. Hint: Consider the subproblems of
optimally separating Wi,. . . ,WN for i = 1,. .. ,N.
1.20 (Games)
(a) Consider a smaller version of a popular puzzle game. Three square tiles
numbered 1, 2, and 3 arc placed in a 2 x 2 grid with one space left empty.
The two tiles adjacent to the empty space can be moved into that. space,
t.hereby ereat.illg new cOllfigurat.ions. Use a 01' argulllcnt to answer the
question whether it is possible to generate a given configurat.ion starting
from any other configuration.
(b) From a pile of eleven matchsticks, two players take turns removing one
or four sticks. The player who removes the last stick wins. Use a DP
_ J
46
The Dynamic Programming Algori,
Chap. 1
argumellt. to show that. there is a winning strategy for the player who
plays first.
1.21 (The Counterfeit Coin Problem)
We are given six coins, one of which is counterfeit and is known to have
different weight than the rest. Construct a strategy to find t.lm count.erfeit.
coin using a t.wo-pan scale iu a minimum average numbl'l' of tries. flint: There
arc t.wo initial dccisions that make sense: (I) test t.wo of the coins agaiust. t.wo
ot.hers, aud (2) test oue of t.he coins against oue other.
1.22 (DP for Uncontrollable State Components)
Consider the case where t.he stat.e is a composite (Xk, Yk) of two components
Xk and Yk. The evolut.ion of the main component is affect.ed by the control
Uk according to t.he equation
Xk+l = Jk(Xk,Yk,lLk,Wk),
where t.he probabilit.y distribution Pk(Wk I Xk, ?/k, Uk) is given. The evolu-
tion of the ot.her component. is governed by a given conditional dist.ribution
Pk(Yk I Xk) and cannot be al[ected by the control, except indirectly through
Xk. Defiue
Jk(:rk) = I'; {.h(:i:k, Yk) I :rd.
Yk
Show that. an equivalent DP algorit.hm is given by
.h(:rk) = I,,' { JIlin 1-: {!lk(Xk, Yk, Uk, Illk)
Yk "kEUk(xk.Yk) wk
+ Jk+1(Jdxk,Yk,Uk,1llk))}1 Xk}.
Discuss the advant.ages of t.his algorithm.
1.23 (Monotonicity Property of DP)
An evident, yet. very important propert.y of the DP algorithm is that if t.he
terminal cost. gN is changed to a uniformly larger cost. UN [i.e., !IN(XN) ::;
YN(XN) for all XN], then t.he last stage cost-to-go IN-l(XN-l) will be uni-
formly incrcased. More generally, giveu t.wo functions Jk+ 1 and .J k+ 1 with
,h+l(Xk+l)::; .!k+l(:rk+d for all Xk+l, we have, for all Xk and Uk E Uk(Xk),
J.; {Yk(Xk, Uk, Wk) + .h+l (JdXk' Uk, Wk))}
1J/k
::; R{gk(:rk,Uk,Wk) + jk+1(Jdxk,1Lk,11Ik))}.
wi.:
See. l.G
Notes, Source.'
,d bxercises
.17
Suppose now that in t.he basic problem t.he syst(,lll and cost are t.in\(' invariant.;
t.hat is, Eh == S, C k == C, Dk == D, !k == J, lh(xk) == U(Xk), and !Jk = !J
for some S, C, D, J, U(.), and g. Show t.hat if in the DP algorit.hm we have
IN-l(X) ::; IN(x) for all xES, then
.1.(:1:)::; .hll(:I:),
for all :1: C ,c..o' "".I 1.:.
Similarly, if we have IN-1(X) 2. IN(x) for aU XES, then
.h(x) 2. .1.+1 (x),
for all xES and k.
1.24
t
t:
An unscrupulous innkeeper charges a different. rat.e for a room as t.he day
progresses, depending on whether he has many or few vacancies. lIis objective
is t.o maximize his expected total iucome during t.he day. Let x be t.he number
of empty rooms at the start of t.he day, and let. Y be the number of customers
that will ask for a room in the course of the day. We assume (somewhat.
unrealist.ically) that. t.he iunkeeper knows Y wit.h certa.inty, and upon arrival
of a customer, quotes one of Tn prices I'i, i = 1,...,711., where 0 < 1'1 ::; "'2
. .. ::; I'm. A quote of a rat.e 1', is accept('d wit.h probabilit.y Pi and is rejected
with probability 1 - 11" in which case the cllstomer departs, never to return
during t.hat day.
(a) Formulat.e t.his as a problem wit.h Y stages and show that. the maximal
expected income as a function of x and Y satisfies the recursion
,
(
);"
If
t
:
.
t'
r:
.
r-;
I
I
J(X,y) = . milX [p,(I', + J(x - 1,y - 1)) + (1- p,)J(x,y - l)J,
t=l,...,rH
for all x :::: 1 and Y 2. 1, with initial conditions
./(J:,O) = ./(O,!/) = 0,
for all :/; and y.
(b)
Assuming that the product. p,l'; is monot.onically nondecreasing with t,
show that the innkeeper should always charge the highest rate 1"m.
Consider a variant of the problem where each arriving customer offers a
price 1', for a room, which the innkeeper may accept or reject., ill which
case the customer departs, never to return during that day. Show t.hat
an appropriate DP algorithm is
J(X,y) = I>;max[l'i + J(x -I,y -1), J(x,y-1)],
i=l
with initial condit.ions
J(X,O) = J(O, y) = 0,
for all x and y.
Show also that for given x and !/ it is opt.imal t.o accept a Clls(,Ol\l('r's
ofl"er if it. is larger t.h,lll >;ollle t.hrcshokl 1'(.1:, V). llinl.: This part. is rela(,cd
to DP for uncontrollable state component.s (cf. Exercise 1.22).
_ J
41:\
Tile I )YIJil.llIic J 'rogf"illJJ/llilJg IIlgori.
Clia". I
1.25 (Investing in a Stock)
An invf'sl.or obervC' at t.hC' jwginlling of each I)('riod k I.h" price:q of" to('k
and decides whether to bny I unit, sell I unit, or do not.hing. There is a
t.ransa.ction cost (' for bllying or sf'lling. The st.ock pricC' can t.ake 011(' of 11.
dilferent values Vi, . . . ,v" and t.he t.ransit.ion probabilities p = P{ Xk+l =
vi I :Ck = v'} are known. The invest.or want.s t.o maximize the t.ot.al wort.h of
his stock at a fixed final period N minus his investment. costs from period D
t.o period N 1 (revenue from a sale is viewed iJ.5 nC'gative cost). WC' asume
t.hat E{ :r:N I Xk = x} - x is monotonically nonincreasing as a functiou of :r;;
t.hat i, t.he <'xIH'ct.ed profit. frolll a pnrchase i a nonillnea«illg fnl1ction of t.h"
purch<1Se price.
(a) Assume t.hat t.he investor st.art.s with N or 1II0re units of st.ock and an
nnlimited ,unount of cash, so t.hat a purcha«e or sale decision i possible
at. ('ach period regardless of the past decisions and t.he current. price.
Show t.hat for every period k, there is a t.hreshold Xk such t.hat it is
optimal t.o buy if Xk ::; Xk - c, sell if Xk 2 Xk + c, and do nothing ot.h-
erwise. Assuming that PJ is independent of /';, under what conditious
can you argue t.hat. as /,; --> 00, Xk tends to a constant valuC', which can
be viewed as the long term average price of t.he stock? !lint.: Formulate
the problem iJ.5 one of maximizing E{:=-Ol (ukgdxk) - clukl)}, where
Uk, E {-I,D, l} and gdXk) = E{XN I xd - Xk.
(b) Formulate an eflicient. DP algorithm for t.he case whcre the invest.or
starts with less than N units of stock and an unlimited amount. of ciJ.5h.
Show t.hat. it is still optimal t.o buy if x" ::; x" - r: ami i t. i still not
optimal to sell if Xk < Xk + c. Could it. be optimal to buy at any prices
Xk great.er than Xk - c?
(c) Consider the situation wherc the invest.or initially has N or more units
of stock bllt a limited amount uf cash. 'vVe model this situation by
impotiing t.he constraint t.hat for any time k the number of purchase
decisions up to /,; should not exceed the number of sale decisions up
to k by more t.han a given fixed number m. Formulate an efficient DP
algorit.hm for this case. Show that it is still optimal to sell if Xk 2 Xk + c
and it is still not opt.imal to buy if Xk > Xk - c.
(d) Consider the situation where the invest.or init.ially has less than N unit.s
of stock as well as a limit.ed amount of ciJ.5h. Derive a DP algorit.hm for
t.his problem.
(e) How would t.he analysis of (a)-(d) be affected if cash is invested at a
givcn fixed interest. rat.e'?
(f) Formulate problems similar to the ones of parts (a)-(c), where the in-
vestor can buy or sell two or more types of stock. Derive the corre-
sponding DP algorit.hms.
2
Deterministic Systems and the
Shortest Path Problem
Contents
2.1. Finite-State Systems and Shortest Paths
2.2. Some Shortest Path Applications . . .
2.2.1. Critical Path Analysis . . . , .
2.2.2. Hidden Markov Models and the Viterbi Algorithm
2.3. Shortest Path Algorithms
2.3.1. Label Correcting Methods
2.3.2. Auction Algorithms
2.4. Notes, Sources, and Exercises
p. 50
p. 54
p. 54
p. 56
p. 62
p. 66
p.74
p. Sl
49
_.-1
50
Deterministic Sy:;tems and the Shortest Path Jblem
Chap. 2
III t.bis chapt.(r, w( foclls (ill d((,(nllillistic problems, t.bat. is, problems
whcrc cach disturbancc Wk can takc only onc valuc. Such problcms arise
in many important contexts and they also arise in cases where the problem
is really stochastic but, as an approximation, the disturbance is fixed at
somc typical valuc; see Chaptcr 6.
An important property of deterministic problems is that, in contrast
with stochastic problems, using feedback r-esults in no advantage in terms
of cost I'eduction. In other words, minimizing the cost ovcr admissible
policies VLo, . . . , fl N - d results in the same optimal cost as minimiziug over
sequences of control vcctors {uo,..., uN-d. This is true because given a
policy and the initial state, the future evolution of the system's state and
thc corresponding future coutrols arc perfectly predictable. Iu particular,
the cost achieved by an admissible policy {flO, . . . , flN -I} for a deterministic
problem is also achieved by the control sequence {uo,..., uN-d defined by
Uk = JLdxd,
k = 0,1,..., N - 1,
where the states XI, . . . , XN -I are determined from Xo aud {flo, . . . , JLN-d
by
Xk+1 = h(Xk,JLk(Xk)),
k = 0, 1,...,N - 1.
As a result, we lIlay restrict attention to sequences of controls without loss
of optimality.
The difference just discussed between deterministic aud stochastic
problems often has important computational implications. In particular,
in a deterministic problem with a "continuous space" character (states and
controls are Euclidean vectors), optimal control sequences may be found
by variational techniques to be discussed in Chapter 3, and by widely used
optimal control algorithms such as steepest descent, conjugate gradient,
and Newton's method (see e.g., [Der95], [Lue84], and [Pol71]). These algo-
rithms, when applicable, are usually more efficient than DP. On the other
hand, DP has a widcr scope of applicability since it can handle difficult
constraint sets such as integer or discrcte sets. Furthermore, DP leads to
a globally optimal solution as opposed to variational techniques, for which
this cannot be guaranteed in general.
In this chapter, we consider deterministic problems with a discrete
character for which variational optimal control techniques are inapplicable,
and specialized forms of DP are the principal solution methods.
2.1 FINITE-STATE SYSTEMS AND SHORTEST PATHS
Consider a deterministic problem where the state space Sk is a finite
set for each k. Then at any state Xk, a control Uk can be associated with
Sec. 2.1
Finite-State [
JInS and Shortest Paths
51
a transition from the state Xk to the state !k (:rk', 1/...), at a cost. rid:!:... /I..).
Thus a finite-state deterministic problem can be equivalently represented
by a graph such as the one of Fig. 2.1.1, where the arcs correspond t.o
transitions between states at successive stages and each arc has a cost
associated with it. To handle the final stage, we have also added an artificial
terminal node t. Each state XN at stage N is connected to the terminal
node t with an arc having cost gN(XN). Control sequences correspond t.o
paths originating at theiniUal state (node s at stage 0) allcl t.ermillat.illg
at one of the nodes corresponding to the final st.age N. If we view the cost
of an arc as its length, we see that a detcr-m'inistic finite-state p1'Oblcm is
equivalent to finding a minimum-length (or shortest) path f1'Om the initial
node s of the !lmph to the tenninal nodc t. Here, by a path we llWill\
a sequence of arcs of the form (jl,j2), (j2JI),..., (jk-l,jd, alld by t.he
length of a path we mean the sum of the lengths of its arcs.
Stage 0
Stage 1
Stage 2
. .. Stage N. 1
Stage N
Figure 2.1.1 Transition graph for a deterministic finite-state system. Nodes
correspond to states. An arc with start and end nodes Xk and xk+l, respectively,
corresponds to a transition of the form xk+l = fk(Xk, Uk)' The length of LIds arc
is equal to the cost of the corresponding transition flk(Xk, Uk)' The prol>lelll is
equivalent t.o finding a shortest path from t.he initial node s to the terminal nodc
t.
Let us denote
aj Cost of transition at time k from state i E Sk to state j E Sk+l,
aft. = Terminal cost of state i E SN [which is gN(i)],
where we a.dopt the convention at = 00 if there is no control that moves
the state from i to j at time k. The DP algorithm takcs the form
IN(i) = aft.,
Jk(i) = min [aj + Jk+I(j)],
JESk+l
i E SN, (1.1)
iESk, k=O,I,...,N-l. (1.2)
_ J
52
Dctcnnillistic Systems alld the Shortest Path
Chap. 2
blellJ
Thc optimal cost is J o (s) and is equal to thc length of the shortcst path
from s to t.
A Forward DP Algorithm
The prcceding algorithm proceeds backward in time. It is possible to
derive an equivalent algorithm that procceds forward in time by means of
the following simple obscrvation. An optimal path from s to t is also an
optimal path from t to s in a "reverse" shortest path problcm where the
direction of each arc is reversed and its length is left unchanged. The DP
algorithm corresponding to this "reverse" problem starts from the states
XI E SI of stage 1, proceeds to states X2 E 52 of stage 2, and continues all
the way to states XN E 5N of stage N. It is given by
IN(j) = aJ'
j E 51,
J - ( . ) . [ N -k J - ( . )]
k J =. mm aij + k+ I l ,
'ESN-k
j E SN-k+l,
k = 1,2,. . . ,N -1.
Thc optimal cost is
Jo(t) = min [aii' + Jl(i)].
'ESN
The backward algorithm (1.1 )-(1.2) and the forward algorithm (1.3)-(1.4)
yield the same result in the sense that
.10(05) = .!o(t),
and an optimal control scquencc (or shortest path) obtained from anyone
of the two is optimal for the original problem. We may view Jdj) in Eq.
(1.4) as an optimal cost-to-an'ive to state j from the initial state s. This
should bc contrastcd with Jk(i) in Eq. (1.2), which rcprcsents the optimal
cost-to-go from state i to the terminal state t.
An important use of thc forward DP algorithm arises in rcal-time
applications where the stage k problem data are unknown prior to stage
k. An example will be given in connection with the state estimation of .,
Hidden Markov Models in Section 2.2.2. Note that to derive the forward
DP algorithm, we used the short cst path formulation, which is available
only for deterministic problems. Indeed, for stochastic problcms, there is
no analog of thc forward DP algorithm.
In conclusion, a deterministic finite-state problcm is equivalent to a
special type of shortest path problem and can be solved by either the ordinary
(backward) DP algorithm or by an alterna.tive forward DP algorithm. It is
also interesting to note that any shortest path problem can be posed as a
deterministic finite-state DP problem, as we now show.
(1.3)
Sec. 2.1
Finite-State Sys. .s and Shortest Paths
53
Converting a Shortest Path Problem to a Deterministic Finite-
State Problem
(1.4)
Let {1, 2, . . . ,N, t} be the set of nodes of a graph, and lct a'J be the
cost of moving from Bode 'i 1.0 node j (also rd(rr(,d t.o a:-; t.he lcnylh of t.l".
arc joining i and j). Node t is a special node, which we call the destinatiun.
We allow the possibility aij = 00 to account for the casc where thcrc is no
arc joining nodes i and j. We want to find a shortest path from cach nodc
i to node t, that is, a sequence of moves that minimizes total cost to get to
t from each of the nodes 1,2,.. . , N.
For the problem to have a solution, it is necessary to make an assump-
tion relating to cycles, that is, paths of the form (i, j r), (j I, h), . . . , Ch, i)
that start and end at the same node. We must exclude the possibility that
a cycle has negative total length. Otherwise, it would be possible to de-
crease the length of some paths to arbitrarily small values simply by adding
more and more negative-length cycles. Wc thus assumc that all cycles have
nonnegative length. With this assumption, it is clear that an optimal path
need not takc more than N moves, so we may limit t.he number of nwves
.. to N. We formulate the problem as onc whcrc we require exactly N moves
but allow degenerate moves from a node i to itself with cost a,i = O. We
denote for i = 1,..., N, k = 0,..., N - 1,
Jk(i) = optimal cost of getting from i to t in N - k moves.
Then the cost of the optimal path from i to t is Jo(i).
It is possible to formulate this problcm within thc framework of the
basic problem and subsequently apply the DP algorithm. For sinlplicity,
however, we write directly the DP equation, which takcs the int.uitively
clear form
optimal cost from i to t in N - k moves
= . min [aij + (optimal cost from j to t in N - k - 1 moves)],
J=l,...,N
(1.5)
Jk(i) = . min [aij + Jk+l(j)],
)=l,...,N
k = 0, 1,...,N - 2,
IN-l(i)=ait, i=I,2,...,N.
The optimal policy when at node i after k moves is to move to a node j* that
minimizes aij + Jk+1 (j) over all j = 1,.. . , N. If the optimal path obtaincd
flam the algorithm contains dcgeneratc moves from a node to itself, this
. ply means that the path involves in reality less than N moves.
To demonstrate the algorithm, consider thc problcm shown in Fig.
.1.2(a) where the costs aij with i # j arc shown along thc conBccting line
. ents (we assumc aij = aji)' Figure 2.1.2(b) shows thc cost-to-go Jk(i)
at each node i and index k together with the optimal path. The optimal
paths are
1 -+ 5, 2 -+ 3 -+ 4 -+ 5, 3 -+ 4 -+ 5, 4 -+ 5.
_ J
54
Deterministic Systems and the Shortest Path PI Jm
Chap. 2
State i
5
4
3
2
o
2
3
4 Stage k
Figure 2.1.2 (a) Shorlest path problem data. The destination is 5. Arc lengths
are equal in bolh directions and are shown along t.he line segments connecting
nodes. (1)) Cosls-to-go generated by the DP algorithm. The number along stage
k and slale i is Jdi). Arrows indicate lhe optimal moves at each stage and node.
2.2 SOME SHORTEST PATH APPLICATIONS
The shortest path problem appears in many diverse contexts. We
provide somc examples.
2.2.1 Critical Path Analysis
Consider thc planning of a project involving several activities, some
of which must be completed before othcrs can begin. The duration of
each activity is known in advance. We want to find the time required to
complete the projcct, as well as the critical activities, those that even if
slightly delayed will result in a corresponding delay of completion of the
overall project.
The problem can be rcpresented by a graph with nodes 1,. .. , N such
as the one shown in Fig. 2.2.1. Herc nodes represent completion of some
phase of the project. An arc (i, j) represents an activity that starts once
phase i is completed and has known duration tij > O. A phase (node) j
is completed when all activities or arcs (i, j) that are incoming to j are
completed. The special nodes 1 and N represent the start and end of the
project. Node 1 has no incoming arcs, while node N has no outgoing arcs.
Furthermore, there is at least one path from node 1 to every other node.
An important characteristic of an activity network is that it is acyclic;
that is, it has no cycles. This is inherent in the problem formulation and
the interpretation of nodes as phase completions.
For any path p = {(I,jl),(jI,j2,),... , (jk,i)) from node 1 to a node
i let D be the duration of the P ath defined as the sum of durations of its
, p
t'
t
I
f,'.
I
I
Sec. 2.2
Some Shortest 1 . Applications
55
Figure 2.2.1 Graph of an activity network. Arcs represent activities and are
labeled by the correspodmg duration. Nodes represenl completion of some pha..,e
of lhe proJect. A phillie IS compleled if all activities associaled with incoming arcs
at the correspondmg node are completed. The project is completed when all
phases arc compleled. The projecl duration time is the length of t.he longest pat.h
from node 1 lo node 5, which is shown with lhick line.
activities; that is,
D p = tih + thh + . . . + thi.
Then the time T i required to complete phase i is
T i = max D p .
pa.ths l'
rrol1l 1 to 1
Te .maximum above is attained by some path because there can be only
a fimte number of paths from 1 to i, since the network is acyclic. Thus to
find i, w should find the longcst path from 1 to i. Because t.hc graph is
acychc, tillS problem may also be viewed as a shortest path problem with
the length. of e.ach ac (i, j) being -tij. In particular, finding the duration
of the project lS eqUlvalent to finding the shortest path from 1 to N.
. Let us denote by SI the set of phases that do not depend on comple-
tIon of any other phase, and more generally for k = 1 2 let S I tl
set ' , ,..., , k)C le
Sk = {i I all paths from 1 to i have k arcs or lcss} ,
with o = {I}.. Using_the acyclicity of the network, it can be shown that.
there IS some mteger k such tat all the sets Sk with k :::; k are nonelllpty
and all the sets Sk with k > k are empty. The sets Sk can be viewed as
the state s.pce.s fo.r the equivalent DP problem. Using maximization in
place. of mll1UlllzatlOn while changing the sign of the arc leugths, the DP
algorIthm for the shortest path problem is
Ti = max [ tji + '1' ]
(i,i) such that J ,
jESk_l
for all i E Sk with i 1- 8k-l.
_....1
56
Deterministic Systems and the Shortest Path PI', m
Chap. 2
Note that this is a forward algorithm; that is, it starts at the origin 1 and
procceds towards the dcstination N. An altcrnative backward algorithm,
which starts at Nand procecds towards 1, is also possible, as discussed
in the preceding scction. A shortest path is also callcd a critical path. A
delay by a given amount in the completion of one of the activities on thc
critical path will delay the completion of the overall project by t.hc samc
amount.
As an example, for the activity network of Fig. 2.2.1, we havc
So = {I}, Sl = {1, 2}, S2 = {1, 2, 3},
S:J = {I, 2, 3, 4}, S4 = {I, 2, 3, 4, 5}.
A calculation using the preceding formula yields
'1'1 = 0, '1'-2 = 3, '13 = 4, 14 = G, '1'" = 10.
The critical path is 1 -> 2 -> 3 -> 4 -> 5.
2.2.2 Hidden Markov Models and the Viterbi Algorithm
Consider a Markov chain with a finite number of statcs and given
statc transition probabilitics Ihj. Suppose that when a transition 0CCurS,
we obtain an observation that relates to that transition. Given a sequcnce
of obscrvations, we want to cstimate in some optimal sense the sequence
of corrcsponding transitions. Wc are given the probabilit.y r(z; 'i, j) of an
observation taking value Z when the state transition is from .i to j. We
assume independent observations; that is, an observation depends only on
its corrcsponding transition and not on other transitions. Wc arc also given
thc probability 11'i that the initial statc takcs value i. The probabilitics
Pij and 1'(z; i, j) arc assumcd to be independent of time for notational
convcnience. The methodology to be described admits a straightforward
cxtcnsion to the case of time-varying system and obscrvation statistics.
Markov chains whose statc transitions are imperfectly observed ac-
cording t.o t.hc probabilistic mcchanism just dcscribcd are callcd Hidden
Mar'kov Models (HMMs for short) or partially observable Mar'kov cha'ins.
In Chapter 5 we will discuss the control of such Markov chains in the
context of stochastic optimal control problems with imperfect state infor-
mation. In the present section, we will focus on the problem of estimating
the state sequence given a corresponding observation sequencc. This is an
important problem, which arises in a broad variety of practical contexts.
We use a "most likely state" estimation criterion, whereby givcn the
observation sequence ZN = {Zl, Z2,..., ZN}, we adopt as our estimate the
statc transition sequcncc X N = {xo, Xl,..., XN} that maximizes ovcr all
XN = {XO,Xl,...,XN} the conditional probability p(X N I ZN). It turns
f'
Sec. 2.2
Some Shortest ,I] ApplicaUolJs
57
out th,lt. .f(N can bc fOllnd by solving a shortest path problclU, as \V(' UOIV
dcscribe.
Wc have
F
U
!
t'
I
I
,
p( X I z ) = P(XN,ZN)
N N p(ZN)'
wherc p(XN, ZN) and p(ZN) arc the unconditional probabilitics of OCCIlr-
rence of (XN, ZN) and ZN, respectively. Since p(ZN) is a positive conijtant
?:lCC ZN is .kown';,e can maximize P(XN,ZN) in place of p(XN I ZN).
lhc probablhty p(XN,ZN) can be writtcn as
p(X N , ZN) = p(:I:O, Xl,. . . , .TN, Zl, Z2,. .. , ZN)
= 11':rop(.Tl,...,.TN,Zl,Z2,...,ZN I :DO)
= 11' x op(:Dl, ZI I J:O)P(X2,. . . , XN, Z2, . . . ,ZN I :1:0, :DI, ZI)
= 11'xOPxoxi r(zl; Xo, Xl)p(X2,. . . ,:DN, Z2,. . ., ZN I Xo, J:I, ZI).
Thiij calculat.ion can be cOIlLinued by writ.illg
p(J:2, . . . , J:N, Z2, . . . ,ZN I Xo, XI, Zl)
= p(:1:2,Z2 I XO,Xl,Zl)p(X3,... ,:I:N,Z:J,... ,ZN I :r:0,Xl,Zl,:1:2, Z2)
= PXj X 2 r '(Z2; Xl, .T2)p(X3,. . . , XN, Z3,. . ., ZN I Xo, :Dl, Zl, :1:2, Z2),
where to obtain thc last equation, wc used thc assumcd independencc of
the observations. Combining the abovc two relat.ions, we havc
P(XN,ZN) = 11'XOPXOXlr(zl;xo,:Dl)pxP:2T'(Z2;XI,X2)
. P(.'C3, . . . ,:CN, Z3, . . . , ZN I .TO, X], Zl, X2, Z'2),
and continuing in the samc manner, we obtain
N
p(XN,ZN) = 11'xo I1Pxk_l.l:kT'(Zk;.Tk-l,:J:k). (2.1)
k=1
We now show how thc ma.ximization of the abovc cxprcssion can bc
viewed as a shortest path problem. In particular, we conijtruct. a graph of
state-time pairs, called the trellis diagram, by concatcnating N + 1 copics
of the state space, and by preceding and following thcm with dummy nodcs
sand t, respectively, as shown in Fig. 2.2.2. Thc nodcs of the kth copy
correspond to thc :;tates Xk-I at time k - 1. An arc cOllllectij a lIode Xk-I
of the kth copy with a node Xk of the (k + l)st copy if the correspondinrr
transition probability PXk-l Xk is positive. Since maximizing a positive cos7
functio.n is equivalcnt to maximizing its logarithm, we sce from Eq. (2.1)
that, glvcn t.hc obscrvation sequcncc ZN = {Zl, Z2,..., ZN}, thc problcm of
maximizing P(XN, ZN) is cquivalcnt to thc problem
N
minimize -In(11'xo) - Lln( PX k_1 X kT'(Zk;Xk-I,.1:d)
k=1 (2.2)
over all possible sequences {xo, Xl, . . . , X N }.
_ J
58
J)ct(,l"ministic Systems and i1JC SllOrtcst Path Pr0 n
Chap. 2
Dy assigning to an arc (c, :1:0) the length -In(7fxo)' to an arc (:CN, t) the
length 0, and to an arc (Xk-I,Xk) the length -In(PXk_1Xkr(Zk;;1;k-l,xd),
we see that the above minimization problem is equivalent to the problem
of finding the shortest path from s to t in the trellis diagram. This shortest
path defines the estimated state sequence {:1;0, X I, . . . , j; N }.
s
Xo
Xj
X2
XN-l
X N
Figure 2.2.2 State estimation of an lIMM viewed as a problem of linding a.
shortest path from s to t. Length of arcs from s to states xo is -In(7f xo ), and
length of arcs from states XN to t is zero. Lengt.h of an arc from a state Xk-l to
Xk is -In(PXk_lxkT(Zk;Xk-l,XkJ), where Zk is the kth observation.
In practice, the shortest path is most conveniently constructed se-
qucntially by forward DP, that is, by first calculating the shortest distance
from s to each node x I, then using these distances to calculate the shortest
distance from s to each node X2, etc. In particular, suppose that wc have
computcd the shortest distanccs Ddxk) from s to all states Xk on the basis
of the observation sequence Zk, and suppose that the new observation Zk+l
is obtained. Then the shortest distanccs Dk+I(Xk+l) from s to any state
Xk+l can be computed by the DP recursion
Dk+l(Xk+l) = min [ Dk(Xk) -In(PXkXk+lr(zk+l;Xk, ;1;k+1))]'
all x k such that
PXkxk+l >0
The initial condition is Do(xo) = -In(7fxo)' The final estimated state
sequence XN corresponds to the shortest path from s to the final state
iN that minimizes DN(xN) over the finite set of possible states XN. An
advantage of this procedure is that it can be executed in real time, as soon
as cach new observation is obtained.
There arc a number of practical schcmes for estimating a portion of
the state sequence without waiting to receive the entire observation sc-
quence ZN, and this is useful if ZN is a long sequence. For example, one
can check fairly easily whether for some k, all shortest paths from s to
states Xk pass through a single node in the subgraph of states Xo,.. . ,Xk-l.
Sec. 2.2
Some Shortest Pat
pplications
59
If so, it can be secll from Pig. 2.2.:) that t.hc gllOrtcst. pat.h from s t.o t.hat.
node will not be affected by reception of additional observations, and there-
fore the subsequence of state estimates up to that node can be determined
without waiting for the rcmaining observations.
I
t
(
L
,
l
1(.
'
[
&
!
ft
,
I
!,
i
t
f
'
f
Shortest Path
from s to m
/
Xk xk + 1
,,0
,
,
,
':----0
,0
, ,
.
,,' "'0
"
Figure 2.2.3 Estimating a portion of the state sequence prior to receiving the
entire observation sequence. Suppose that the shortest paths from s to all states
Xk pass through a single node m. If an additional observation is received the
shortest paths from s to all states xk+l will continue to pass through m. TllCre-
fore, the portion of the state sequence up to node m can be safely estimated
because additional observations will not change the initial portion of the shortest
paths from s up to m.
The shortest path-based estimation procedure just described is known
as the Viterbi algorithm, and finds numerous applications in a variety of
contexts. An example is speech recognition, where the basic goal is to tran-
scribe a spoken word sequence in terms of elemcntary speech units called
phonemes. One possibility is to associate the states of the HHM with
pholemes, and given a sequenc,e of recorded phonemes ZN = {Zl,..., ZN},
to find a phonemic sequence XN = {XI,...,XN} that maximizes over all
X N = {Xl,..., XN} the conditional probability p(X N I ZN). The probabil-
ities r(zk; Xk-l, Xk) and PXk-1Xk can be experimentally obtained, if neccs-
sary by specialized "training" for each speaker that uses the speech recog-
nition system. The Viterbi algorithm can then be used to find the most
likely phonemic sequence. There are also other HMMs used for word and
sentence recognition, where only phonemic sequences that constitute words
from a given dictionary are considered. We refer the reader to [Pic90] and
[Rab89] for a general review of HMMs applied to speech recognition and for
further references to related work. It is also possible to use similar models
for computerized recognition of handwriting.
The Viterbi algorithm was originally developed as a scheme for de-
coding data after transmission over a noisy communication channel. The
_....1
60
DeierIninisiic ysiems and ilw IJUriesi Paih 1>, 'CIIl
Chap. 2
following example describes this context in some detail.
Example 2.1 (Convolutional Coding and Decoding)
When binary dat.a arc t.ransmitted over a noisy communicat.ion chanlld, it
is often essential t.o use coding as a means of enhancing reliability of com-
munication. A common type of coding method, called convolutional coding,
convcrts a source-{;enerat.ed binary data sequence
{1U I, 1V2, . . .},
Wk E {a, I},
k=1,2,...,
into a coded sequence {Yl, Y2, . . .}, where each Yk is an n-dimcnsional vector
with binary coordinates, called codeword,
( 1
Yk
,, J
yi E {a, 1},
i= 1,...,n, k= 1,2,....
The sequence {Yl, Y2, . . .} is then transmitted over a noisy channel and get.s
transformed into a sequence {Zl' Z2, . . . , }, which is then decoded to yield the
decoded data sequence {WI, W2, .. .}; see Fig. 2.2.4. The objective is to design
the encoder/decoder scheme so that the decoded sequence is as close to the
original as possible.
Data (0 or 1)
Z"Z2' .,"
Received
Sequence
Decoded
Sequence
Decoder
Encoder
A A
w 1 . w 2 ....
W I .W2''''
Y"Y2' ...
Figure 2.2.4 Encoder/decoder scheme.
The problem just discussed is central in information theory and can be
approadu.d in Heventl differellt waYH. Tn a particularly popular and effective
techlliqlw called couvolul.i.ou"l coding, the vedors Yk arc related to Wk via
equations of the fOrIn
Yk = CXk-l + dWk,
k=1,2,...,
Xk = AXk-l + bWk,
k = 1,2,...,
Xo: given,
where Xk is an m-dimensional vector wit.h binary coordinates, which we view
as state, and C, d, A, and bare n x m, n x 1, m x m, and m x 1 matrices,
respectively, with binary coordinates. The products and the sums involved in
the expressions CXk-l + dWk and AXk-l + bWk are calculated using modulo
2 arit.hmetic.
r
,
,
!
\'
l
t
J
t
"
(2.3)
(2.4)
cc. 2.1
Ol11C horic. . Jaih ApjJlical,ioIls
61
As an example, let. m = 2, n = ;, and
C=G D
A=( ),
d= (D,
b= (n.
Then the evoltiOl of the syst.em (2.3)-(2.4) can be represent.ed by the di-
agram shown m FIg. 2.2.5. Given the initial Xo, this diagram can be Ilsed
to generate t.he codeword sequence {YI, Y2,. . .} corresponding t.o a dat.a se-
quence {WI, 102, . . .}. For example, when thc initial state is Xo = 00 th(' data
sequence '
{WI, 102, W3,W4} = {l, 0, 0, I}
gencrates the state sequencc
{XO,Xl,X2,X3,X4} = {00,01, 11, 1O,00},
and the codeword seqUence
{YI,Y2'Y3,Y4} = {111,011,111,011}.
Old .State xk _ 1 New State xk
00 00
01 01
10 10
11 11
Figurc 2.2.5 State transition diagram for convolutional coding. The binary
?umber par on each arc is the data/codeword pair Wk/Yk for the correspond-
mg transltlolL Su for example, when Xk-I = 0], a zero data bit (llIk = 0)
effects a transition to Xk = 11 and generates the codeword 001.
Assume now that the characteristics of the noisy transmission channel
are such that a c?deword Y is actually received as Z with known probability
p(z I V), where Z IS allY n-bit binary number. We assume independent. errors
so that
N
p(ZN I Y N ) = II p(Zk I Yk), (2.5)
k=l
_.I
62
Deterministic Systcms and the Shortest Path Pro. .1
Chap. 2
h . e Z - {z Z N} is the received sequence and YN = {Yl,... ,YN} is
w el N - 1,.." . .'
the transmitted sequence. By associating the codewords Y with state t.ranSl-
tions, we formulate a maximum likelihood estimation problem, whereby we
W;1nt. t.o lind a sequence Y N = {Yl, Y2, . . . , Y N} such t.hat
p(ZN I YN) = maxp(ZN I Y N ).
YN
The const.raint. on YN is t.hat. it must be a feasible codeword sequence (i.e., it
must correspond to some initial state and data sequence, or eqUivalently, t.o
a sequence of arcs of the trellis diagram). . . "
Let us construct a trellis diagram by concatenatmg N state transI-
tion diagrams and appending dummy nodes sand t on its left and rIght.
sides, which are connected with zero-length arcs. to the state Xo and XN,
t . 1 Busin g E q ( 2 5) we see that given the recClved sequence
respec lve y. y ..," ( I y ) . . I t
ZN = {ZI,Z2,... ,ZN}, the problem of maximizing p ZN N IS eqUlva en
to the problem
N
minimize L -In(1'(zk I Yk))
k=l
over all binary sequences {Yl, Y2,... , YN}.
This is equivalent to a problcm of flnding;1 shortest path in t.he trellis diagral
from s to t, where the length of the arc associated with the codeword Yk s
_ In (1'( Zk I Yk)), and the lengths of each. ar incident to a dummy node IS
zero. From t.he shortest path and the trelhs diagram, we can then obtam the
corresponding dat.a sequence {WI, . . . , W N }, which is accepted as the decoded
data. ' 1 ' 11
The maximum likelihood estimate YN can be found by so vmg ,e cor-
responding shortest path problem using the Viterbi algorithm. In particular,
the shortest distances Dk+I(Xk+l) from s to any state Xk+1 are computed by
the D1> recursion
(2.6)
Dk+l(Xk+J) =
min [ Dk(Xk) -In(p(Zk+l I Yk+l))],
all "'k such that
(xk,xk+l) is an arc
where Yk+l is the codeword corresponding to t.h ac (Xk,Xk+J). The final
stat.e XN on the shortest path is the one t.hat mml1l11ZeS DN(XN) over XN.
2.3
SHORTEST PATH ALGORITHMS
We have seen that short.est path problems and deterministic finite-
state optimal control problems are equivalent. The computational impli-
cations of this are twofold.
Sec. 1.3
Slwrtest l'aUJ A
ithl11s
63
(a) One can use DP to solve general shortest path problems. Not.e that
there are several other shortest path methods, some of which have
superior theoretical worst-case performance t.o DP. IIowever, DP is
often preferred in practice, particularly for problems with an acyclic
graph st.ructure ,\.lid also whcn a parallel comput.er i" avail"hl,'.
r!:
j'.
(b) One can use general shortest path methods (other than DP) for de-
terministic finite-state optimal control problems. Even though DP is
preferable in most cases to other shortest path methods because it. b
tailored to the sequential uature of optimal control problems, there
arc important cases where other shortest path methods are preferable.
In this section we discuss several alternative shortest path m<'!,lwds.
We motivate these methods by focusing on shortest path problems with a
very large number of nodes. Suppose that there is only one origin and only
one destination, as in shortest path problems arising from deterministic
optimal control (d. Fig. 2.1.1). Then it is often true that most of the Ilodes
are not relevant to the shortest path problem in the sense thnt they arc
unlikely candidates for incl usioll ill a short.est path bctweeu the gi ven origin
and destination. Unfortunately, however, in the DP algorithm every node
and arc will participate in the computation, so there arises the possibility
of other more efficient methods.
A similar situation arises ill some search problems that are COllllllon
in artificial intelligence. We provide some examples.
k
t:,
,
f
Fe
:
t
!
j
Example 3.1 (The Four Queens Problem)
Four queens must be placed on a 4 x 4 portion of a chessboard so t.hat no
queen can attack another. In other words, the placement must be such t.hat
every row, column, or diagonal of the 4 x <1 board contains at most one queen.
Equivalently, we can view the problem as a sequence of problems; first, placing
a queen in one of the first t.wo squares in the top row, then placing another
queen in the second row so that it is not attacked by the first, and similarly
placing the third and fourth queens. (It is suJIicient to consider only t.he lirst.
two squares of the top row, since the other t.wo squares lead to symmetric
positions.) We can associate positions with nodes of an acyclic graph where
the root node s corresponds to the position with no queens and the terminal
nodes correspond to the positions where no additional queens call be placed
without some qUeen attacking another. Let us connect each terminal position
with an artificial node t by means of an arc. Let us also assign t.o all arcs
length zero except for the artificial arcs connecting terminal positions with
lcss than four queens with t.he artificial node t. These laUer arcs am w<sign"d
the length 00 (see Fig. 2.3.1) to exprcss t.he fact. that t.hey correspond t.o deaJ-
end positions that cannot lead to a solution. Then, the four queens problem
reduces to finding a shortest path from node s t.o node t. Not.e t.hat. once the
nodes of the graph are enumerat.ed the problem is essentially solved. llere t.he
number of nodes is small. However, we can think of similar problems with
_.I
64
DctcrllJinisiic SystclIJs and Uw SllOrtcst Path r
Chap. 2
.'CIll
much larger memory requirements. For example, there is an eight. queens
problem where the board is 8 x 8 instead of 4 x 4.
Starting Position
Root Node s
/
. .
ad-End Posilio
. .
. . .
Dead-End Position
. .
. .
. .
LengUl =
SoJuti
Lenglh = .
.
.
.
,/
ength = 0
Artificial
Terminal
Node t
De
on
Figure 2.3.1 Shortest path formulation of the four queens problem. Symmetric
positions resnlting from placing a queen in one of the rightmost squares in the top
row have been ignored. Squares containing a queen have been darkened. All arcs
have length zero except for those connecting dead-end positions to the artificial
terminal noele.
Example 3.2 (The Traveling Salesman Problem)
An import.ant. model for scheduling a sequence of operations is t.he classical
traveling salesman problem. lIere we are given N cities and the mileage
,
[
I
l',
,
.
',.
I';
t'
t,
.
I:
r
j
,
r
t
t:
t
j.
r
r
i
i
.
t
.
.
i
t:
r,
l;
.
.
!
I
I
:
t
Sec. 2.:3
G5
Slwrtcst 1)at. . Jgoritlnns
between each p,tir of cities. We wish to lind a minimum-mileage t.rip that.
visits each of the cities exactly once and returns to the origin node. To
convert this problem to a shortest path problem, we associat.e a node with
every sequence of n dist.inct. cit.ies, wl\('re " -S N. The const.ruction and arc
lengths of the corresponding graph are explained by means of an example
in Fig. 2.3.2. The origin node s consists of city A, taken as t.he start. A
sequence of n cities (n < N) yields a sequence of (n + 1) cit.ies by adding a
new city. Two such sequences are connected by an arc with length ('qual to
the mileage between the last t.wo of the n + 1 cities. Each sequence of N cities
is connected to an artificial terminal node t with an arc having lengt.h equal
to the distance from the last city of the sequence to the starting cit.y A. Note
that t.he number of nodes grows exponentially with the number of cities.
Artificial Terminal Node t
Figure 2.3.2 Example of a shortest path formulation of the traveling salesman
problem. The distance between the four cities A, I3, C, and D arc shown in t.he
table. The arc lcngt.hs arc shown next. to t.he arcs.
,
,
I
In the shortest path problem that we will consider ill this section,
there is a special node s, called the origin, and a special node t, called
the destination. Exercise 2.6 deals with the case where there are multiple
r..:o
_J
66
Deterministic Systems and the Shortest Pat]] .
Chap. 2
)lem
destinations. A node j is called a child of node i if there is an arc (i, j)
connecting i with j. The length of arc (i, j) is denoted byai) and we assume
that all arcs have nonnegative lcngth. Exercise 2.7 deals with the case where
all cycle lengths (rather than arc lengths) are assumed nonnegative. We
wish to find a shortest path from origin to destination.
2.3.1 Label Correcting Methods
We now discuss a general type of shortest path algorithm. The idea
is to progressively discover shorter paths from the origin s to every othcr
node i, and to maintain thc length of the shortest path found so far in
a variable d i called the label of i. Each time d i is reduced following the
discovcry of a shorter path to i, the algorithm checks to see if the labels
d j of the children j of i can be "corrccted," that is, they can be reduced
by setting them to d i + aij [the length of the shortest path to i found thus
far followed by arc (i, j)]. The label d t of the destination is maintained
in a variable called UPPER, which plays a special role in the algorithm.
Init.ially, d s = 0 and d i = 00 for all other nodes i.
The algorithm also lllakes use of a list of nodes called OPEN (another
name frequeutly used is candidate list). The list OPEN contains nodes
that are currently active in the sense that they are candidates for further
examination by the algorithm and possible inclusion in the shortest path.
Initially, OPEN contains just the origin node s. Each node that has ent.ercd
OPEN at least. once, except s, is assigned a "parent," which is some other
node. The. parent nodes are not necessary for the computation of the
shortest distance; they are needed for tracing the shortest path t.o the
origin aft.er the algorithm terminates. The steps of the algorithm are as
follows:
Label Correcting Algorithm
Step 1: Remove a node i from OPEN and for each child j of i, execut.e
step 2.
Step 2: H d i + aij < min{d j , UPPER}, set d j = d i + aij and set i to
be the parent of j. In addition, if j # t, place j in OPEN if it is not
already in OPEN, while if j = t, set UPPER t.o the new value di + ait
of d t .
Step 3: If OPEN is empty, terminate; else go to step 1.
It can be secn by induction that, throughout the algorithm, d j is
either 00 (if node j has not yet entered the OPEN list), or else it is the
length of some path from s to j consisting of nodes that have entered the
OI'GN list. at. lcast oncc. In t.he latter case, t.his pat.h can be construct.ed
by tracing backward the parent nodes starting with the parent of node j.
F'
Sec. 2.3
Shortest Path
orithms
67
t.
I
f.
r:'
t
.
I
1<0'
t
I,
1..;
f
,
t.
I
W
Ii
f
f
Furthermore, UPPER is either 00, or else it. is t.he lengt.h of sonll' path
from s to t, and consequently it is an upper bound of the shortest distance
from s to t. The idea in the algorithm is that when a path from s to j
is discovered that is shorter than those considered earlier (d, + aLj < d)
in step 2), the value of dj is accordingly reduced, and node j enters the
OPEN list so that paths passing through j and reaching the children of j
can be taken into account. It makes sense to do so, however, only whcn the
path considered has a chance of leading to a path from s to t with length
smaller than the upper bound UPPER of the shortest. distance from s to t.
In view of the non negativity of the arc lengths, this is possible only if the
path length d i + aij is smaller than UPPER. This provides the rationale
for entering j into OPEN in step 2 only if d i + aij is less than UPPER..
Tracing the steps of t.he algorithm, we see that it will first remove nodc
s from OPEN and sequentially examine its children. If t is not a child of s,
the algorithm will place all children j of s in OPEN after setting d j = asj.
If t is a child of s, then the algorithm will place all childrell j of s examined
before t in OPEN and will set their labels to asj; then it will examine t
and set UPPER to ast; finally, it will place each of the remaining children
j of s in OPEN only if asj is less than the current value of UPPER, which
is a.t. The algorithm will subsequently take a child i # t of s from OPEN,
and sequentially place in OPEN those of its children j # t that satisfy the
criterion of step 2; etc. Note that the origin s can never reenter OPEN
because d s cannot be reduced from its initial value of zero. Furt.hermore,
by the rules of the algorithm, the destination can never cnter OPEN. vVhen
the algorithm terminates, we will show shortly that a shortest pat.h can be
obtained by tracing backwards the parent nodes starting from t and going
towards s. Figure 2.3.3 illustrates the use of the algorithm to solve the
traveling salesman problem of Fig. 2.3.2.
The following proposition establishes the validity of the algorithm.
Proposition 3.1: If there exists at least one path from the origin to
the destination, the label correcting algorithm terminat.es with UPPER
equal to the shortest distance from the origin to the destination. Oth-
erwise the algorithm terminates with UPPER = 00.
Proof: We first show that the algorithm will terminate. Indeed, each time
a node j enters the OPEN list, its label is decreased and becomcs equal
to the length of some pat.h frolll oS to .i. On tlw ot.her hand, t.he nllln[H'r
of distinct lengths of paths from s to j that are smaller than any givell
number is finite. The reason is that each path can be decomposed into
a path with no repeated nodes (there is a finite numbcr of distinct such
paths), plus a (possibly enlpty) set. of cycles, ':ach having a nOIlIl<:gat.ivl:
length. Therefore, there can be only a finite number of label reductions,
_.J
68
Deterministic Sysicms iLnd ilw Shoricst Paih Jblcm
Chap. 2
:tJ
Artificial Terminal Node t
Iter. No. Node Exiting OPEN OPEN at the Elld of Iteration UPPER
0 - 1 00
1 1 2,7,10 00
2 2 3, 5, 7, 10 00
3 3 4, 5, 7, 10 00
4 4 5,7,10 43
5 5 6,7, 10 43
6 6 7,10 13
7 7 8,10 13
8 8 9,10 13
9 9 10 13
10 10 Empty 13
Figure 2.3.3 The algorithm applied to the traveling salesman problem of Fig.
2.3.2. The optimal solution ABDC is found after examining nodes 1 through 10
in the figure in that order. The table shows the successive contents of the OPEN
list.
implying that the algorithm will terminat.e.
Suppose that there is no path from s to t. Then a node i such that
(i, t) is an arc cannot enter the OPEN list, because as argued earlier, this
would establish that there is a path from 8 to i, and therefore also a path
from 8 to t. Thus, based on the rules of the algorithm, UPPER call never
SllOricsi Paih 1\., iillllls
uti
be reduced from its initial value of 00.
Suppose now that there is a path from s to t. Then, since there is a
finite number of distinct lengths of paths from 5 to t that are smaller than
any given number, there is also a shortest path. Let (5, jl, 12,.. . , j},;, t)
be a shortest path and let d* be the corresponding shortest dist.ance. We
will show that the value of UPPER llpon termination must be equal to
d*. Indeed, each subpath (8, jl, . . . , jm), Tn = 1,. . . , k, of the shortest path
,(9,jl,j2,'" ,jk, t) must be a shortest path from 5 to jm. If the value of
UPPER is larger than d* at termination, the same must be true throughout
the algorithm, and therefore UPPER will also be larger than the lengt.h of
all the paths (8,jl,...,jm), m = 1,...,k, throughout the algorithm, in
view of the nonnegative arc length assumption. It follows that node jk will
never enter the OPEN list with d h equal to the shortest distance from s
to jk, since in this case UPPER would be set. t.o d' in step 2 inllncdiaLdy
following the next time node j},; is examined by the algorit.hm in st.ep 2.
Similarly, and using also the nonnegative length assumption, this means
that node jk-l will never enter the OPEN list with d jk _ 1 equal to the
shortest distance from 8 to jk-l. Proceeding backward, we conclude that
;1 never enters the OPEN list with djJ equal to the shortest distance from 8
to jl [which is equal to t.he length of the arc (8, jl)]. This happens, however,
at the first iteration of the algorithm, obtaining a contradiction. It follows
that at termination, UPPER will be equal to the shortest distance from s
to t. Q.E.D.
From the preceding proof, it can also be seen that, upon termination
of the algorithm, the path constructed by tracing the parent nodes back-
ward from t to 5 has length equal to UPPER, so it is a shortest path from
s to t. Thus the algorithm yields not just the shortest. distance but. also a
shortest path, provided that we keep track of the parent of each node that.
enters OPEN.
An important property of the algorithm is that nodes j for which
c4 + aij 2:: UPPER in step 2 need not enter OPEN and be examined later.
As a result the number of nodes that enter OPEN may be much smaller
than the total number of nodes. Furthermore, if we know a good lower
bound to the shortest distance from s to t (or the shortest distance itself),
we can terminate the computation once UPPER reaches that bound within
an acceptable tolerance. This is useful, for example, in the four queens
problem, where the shortest distance is known to be zero or infinity. Theil
the algorithm will terminate once a solution is found.
Specific Label Correcting Methods
There is considerable freedom in selecting the node to be removed
from OPEN at each iteration. This gives rise to several different methods.
The following are some of the most important.
_..-1
70
Deterministic Systems and the Shortest Path Pn
n
C1Jap. 2
(a) Breadth-first sear'ch, also known as thc Bellman-Ford method, which
adopts a first-in/first-out policy; that is, the node is always removed
from the top of OPEN and each node entering OPEN is placed at the
bottom of OPEN. [Here and in the following methods except (c), we
assume that OPEN is structured as a qucue.]
(b) Depth-first search, which adopts a last-in/first-out policy; that i,
the node is always rcmoved from the top of OPEN and each node
entering OPEN is placed at the top of OPEN. One motivation for
this mcthod is that it oftcn rcquires relatively little mcmory. For
example, suppose that the graph has a tree-like structurc whcreby
there is a unique path from the origin node to every node othcr than
the destination as shown in Fig. 2.3.4. Then the nodes will enter
OPEN only once and in the order shown in Fig. 2.3.4. At any onc
time, it is only necessary to store a small portion of the graph as
shuwn in Fig. 2.3.5.
Origin Node 5
1
Figure 2.3.4 Searching a tree in depth-first
fashion. The numbers next to the nodes indi-
cate the order in which nodes exit the OPEN
list.
Deslinalion Node t
(c) Best-first scar'ch, which at cach itcration removes from OPEN a node
with minimum value of label, i.e., a node i with
d i = min d j .
j in OPEN
This method, also known as Dijkstra's method or label settin!l method,
has a part.icularly illt.ercst.ing propNty. II. can be shown t.hat in this
methud, a node will enter the OPEN lit at must once (ec Exercise
2.4). The drawback of this method is the overhead required to find at
1
Sec. 2.3
Shurtest Pa, llgorithms
71
>
t.
I
Figure 2.3.5 Memory requirements of depth-
first search for the graph of Fig. 2.3.4. At the
time the node marked by the checklllark ('xits
the OPEN list, only the solid-lille portion of
the tree is needed in memory. The dotted-lille
portion h:, beell gellerated a.nd pllrged fro III
memory based on the rule that for a graph
where there is only one path from the origin
to every node other than the destillation. it is
unnecessary to store a node once all of its suc-
cessors arc out of the OPEN list. The broken-
line portion of the tree has not yet been gl'n-
erated.
f
I.
,
f,
(
t
t.
f;
,\
{,
t
i
I
P.
. .
6 b
I \
I \
I \
0 Q
I \
I \
I \
0 0
v
cach iteration tlw node of OPEN t.hat. ha;; lIIinillllllll labd. S"v('ral so-
phisticated methods have been developed to carry out thi operation
efficiently (see e.g., [Der91a]).
(d) D 'Esopo-Pape method, which at cach itcration rcmovcs the node thaI,
is at the top of OPEN, but inserts a node at the top of OPEN if it has
already been in OPEN earlier, and inserts the node at the bottom of
OPEN otherwise.
(e) Small-Label-First (SLF) method, which at each itcration removcs the
nodc that is at the top of OPEN, but inserts a node i at the top
of OPEN if its label d i is less than or equal to the label d j of t.he
top node j of OPEN; otherwise it inserts i at the bottom of OPEN.
This is a low-overhead approximation to the best-first search strategy.
As a complemcnt to thc SLF strategy, one may also try to avoid
removing nodes with relatively large labels from OPEN using the
following device, known as the Large- Label- Last (LLL) strategy: at
the start of an iteration, the top node of OPEN is compared with the
average of the labels of the nodes in OPEN and if it is larger, the
top node is placed at the bottom of OPEN and the new top nodc of
OPEN is similarly tested against the average of the labels. III thi
way the removal of nodes with relatively large labels is postponed
in favor of nodes with relatively small labels. The extra ovcrhcad
rcquired in this method is to maintain the sum of the labels and the
number of nodes in OPEN. When starting a new iteration, the ratio
of these numbers gives the desired average. There are also several
other methods, which are based on the idea of examining nodes with
small labels first (sec [Bcrg3] and [DGM94] for dctailcd decriJ>tiowi
and computational studies).
GCllerally, it. appears t.hat. for nOllll(gat.ive arc lengt.hs, 1.1i" Illllltl)(r
of iterations is reduced as the method b more succcssful in rcmoving from
_.I
72
Deterministic Systems and the Shortest Path Pnwlem
Chap. 2
OPEN node with a relatively small label. For a upporting lleuritic ar-
gument, note that. for a node j to reenter OPEN, some node i such t.lw.t
(l, + aij < el j must first exit OPEN. Thus, the smaller el j wa. at. the previ-
ous exit of.i from OPEN, the less likely it is that d, -1- aij will subsequcllt.ly
become le thau el J for some nmlc -i ill OPEN anLi arc (i,j). In part.icular,
if dj <:: mini ill OPEN el" it is impossible that subsequent to the exit of j
from OPEN we will have d, + a'j < elj for some i ill OPEN, ince the ,m:
lcngths ail arc nonnegative. The SLF and ot.her relat.ed but more sophi-
ticated methods, often require a number of it.erations, which i close to the
minimum (the one required by the best-first method). However, they can
be much fa.ter t.han the best.-first. method, becaue they rcquire much less
ov(rhead for determining the node to be removed from OPEN.
Variations of Label Correcting Met.hods
We mentioned carlier that a key idea of the algorithm i t.o save
comput.ation by foregoing the examination of nodes j t.hat cannot lie on
a shortest path. This b based on the test d, + a,j < UPPER that node
j must pass before it can be placed in the OPEN list in step 2. We can
strengt.hen this test if we can find a positive underestimate It j of the shortest.
distance of node j to t.he destination. Such an estimate can be obtained
frolIl special knowledge about the problem at hand. We may then speed up
the computation substantially by placing a node j in OPEN in step 2 only
when d i + ail + It j < UPPER (instead of d i + aij < UPPER). In this way,
fewer nodes will potentially be placed in OPEN before termination. Using
the fact that hj is an underestimate of the true shortest distance from j to
the destination, the argument given in the proof of Prop. 3.1 shows that
the algorithm will terminate with a shortest path.
The idea just described is one way to sharpen the test el i + a,) <
UPPER for admission of node j into the OPEN list. An alternative idea
is to try to reduce the value of UPPER by obtaining for the node j in step
2 an upper' bound Tn) of the shortest distance from j to the destination.
Then if d j + Tnj < UPPER after step 2, we can reduce UPPER to el j + Tnj,
thereby making t.he test. for future admissibility into OPEN more stringent.
Thb idea is ued in some versions of the branch-and-bound algorithm, one
of which we now briefly describe (see also [PaS82] and [Pea84] for further
discussion) .
Example 3.3 (Branch-and-Bound Algorithm)
Consider a problem of minimizing a cost function f(x) over a finite set of
feasible olutions X. Thc branch-and-bound algorithm uses an acyclic graph
with nodes that correspond on a one-to-onc basis with a collectioll ,y of
subscts X. We require the following:
1. X E ,y (i.e., the ct of all solut.ions i a node).
r
"
!
t
r
t
r
E
i
t
Sec. 2.3
Shortest Path,. Jrjthms
73
2. If x i a Olllt.ioll, t.hcll {:r} E :t' (i.e., each :>olut.ioll viewcd ik'i a ill!;lct.ull
set i a node).
3. If a set Y E X cont.ains more t.han one solut.ion x EX, t.hcn t.hen' .'xist.
sets Y 1 ,..., ), EX slIch that Y; f Y ror all i and
Uy,=y.
i=l
The set Y i:> called the parent of Y j , . . . , Y n , and the sets Y l , . . . , y;, are
called the children of Y.
4. Each set. in ,y other than X has at. least one parent..
The collection of sets ,y defines an acyclic graph with root noue the
set of all feasible solutions X and terminal nodes t.he singleton solutions {x}
:l; c= X (S('" Fig. 2.:1.(;). The arcs or !.I,,; graph an, I.hose I.hal. COIIII('('1. pan'"I.
Y and their childrcn Y,. Suppose that for every nonterminal node }/ t.here
is an algorit.hm that calculates upper and lowcr bounds J and f for t.he
minimum cost over Y: -y y
f y <:: minf(x) <::fy.
- xEY
Asume further t.hat. thc upper and lowcr bonnLls arc exact. for each singleton
solution node {x}:
[{x} = J(x) = fix}'
for all :c E X.
Figure 2.3.6 A tree corresponding to a uranch-and-bound algorit.hm. Each
node (subset) except t.hose consisting of a single solution is pa,.t.itiolled into
several other nodes (subsets).
_..1
Pi = 0, for all i
74
Deterministic Systems and the Shortest Path Prob
75
Chap. 2
Shortest Path Alge lms
Defille now the length of an arc involving a parent Y and a child Y;, to
ue t.he lower uound difference
. destination becomes the tcrminal node of thc path, the ,dgorithm teruli-
nates.
To get an intuitive sense of the algorithm, think of a mouse moving in
a graph-like maze, trying to reach a destination. The mouse criss-crosses
the maze, either advancing or backtracking along its current path. Each
.time the mouse backtracks from a node, it records a measure of the desir-
ability of revisiting and advancing from that node in the future (this will
be represented by a suitable variable). The mouse revisits and procceds
forward from a node when the node's mcasure of desirability is judged
superior to those of other nodes. The algorithm emulates efficiently this
. search process using simple data structures.
The algorithm maintains a path P = ((S,il),(il,i2),...,(ik_l,ik))
with no cycles, and modifics P using two operations, extension and eon-
troction. If ikH is a node not on P and (ik, ik+d is an arc, an extension of
P by ikH replaces P by the path ((s, il), (iI, i2),. .. , (ik-b ik), (ik, ik+d).
If P does not consist of just the origin node s, a contraction of P replaces
P by the path ((s, il), (ii, i2),..., (ik-2, ik-l)).
We introduce a variable Pi for each node i, called the price of node i.
.y/e denote by P the price vector of all node prices. The algorithm maintains
a price vector P satisfying together with P the following property
t" -[v.
Then the lengt.h of any path from the origin node X to any node Y is Ly'
Since J = J(x) for all feasible solutions x EX, it is clear that minimizing
-{x}
J(:1;) over x E X is equivalent. t.o findillg a shortest path from the origin node
to one of the singleton nodes {x}.
Consider now a variation of the label correcting method where in ad-
dit.ion we use our kllowledge of the upper bounds 1 y to reduce t.he value of
UPPER. Initially, OPEN contains just. X, and UPPER equals Ix.
Branch-and-Bound Algorithm
Step 1: Remove a node Y from OPEN. For each child Y j of Y, do
the following: If J y ' < UPPER, then place Y j in OPEN. If in addition
- J
In < UPPER, then set UPPER = 1 yj' and if Y j consists of a single
::;olution, mark that ::;olution as being the best solution found so far.
Step 2: (Termination Test) If OPEN is nonempty, go to step 1.
Otherwise, terminate; the best solution found so far is optimal.
for all arcs (i, j) ,
(3.1a)
Pi :::: aij + Pj,
An alt.ernative t.ermination step 2 for the preceding algorithm i::; to set
it t.olerance '- > 0, and check whether UPPER and the minimum lower bound
J over all sets Y in the OPEN list differ by less t.han Co If so, the algorithm is
Illinated and some set in OPEN must contain a solution that is within € of
being oPtilal. There are a number of other variat.ions of the algorithm. For
example, if the upper bound I y at a node is actually the cost J(x) of some
element x E Y, then this element can be taken as the best solution found so
far whenever I y < UPPER in step 2. Other variations relate to the method
of selecting a node from OPEN in step 1. For example, two strategies of the
uest-first type are to select the node with minimal lower or upper bound. In
closing, we note that to apply branch-and-bound effectively it is important
t.o have algorit.hms for generating as sharp as practically possible upper and
lowcr hOIlI1<b at. cach 11<)(1<-. Then, fewe'l" nodes will be admitted into OPEN,
with attendallt. computational savings.
Pi = aij + Pj, for all pairs of successive nodes i and j of P, (3.1b)
referred to as complementary slackness (CS for short), bccause of its rc-
lation to a corresponding linear programming notion (see [Der91aJ). We
assume that the initial pair (P,p) satisfies CS. This is not restrictive, since
, the default pair
P = (s),
isfies es in view of the nonnegative arc lcngth assumption. To define
e algorithm we also need to assume that all cycles have positive length;
cise 2.9 indicates how this assumption can be relaxed.
It can be shown that if a pair (P,p) satisfies the CS conditions, t.hen
,the portion of P between nodc s and any nodc i E P is a shortest path
,from s to i, while ps - Pi is the corresponding shortest distance. To see this,
. te that by Eq. (3.lb), Pi - Pk is the length of the portion of P between
and k, and that every path connecting i to k must have length at least
ual to Pi - Pk [add Eq. (3.1a) along the arcs of the path].
The .algorithm proceeds in iterations, transforming a pair (P, p) sat-
es into another pair satisfying es. At each iteration, the path P
ther is extended by a new node or else is contracted by deleting its ter-
. al node. In the latter case the price of the terminal node is increased
ictly. A degenerate case occurs when the path consists of just the origin
e Sj in this case the path either is extended or is left unchanged with
e price Ps being strictly increased. The iteration is as follows.
2.3.2 Auction Algorithms
We now discuss anothcr class of algorithms for finding a shortest path
from the origin s to the destination t. These are callcd auction algorithms
because they can be shown to be closely related to an algorithm for opti-
mally assigning persons to objects, which resembles a real-life auction; see
[Dcr91a]. The main algorithm is very simple. It maintains a single path
starting at the origin. At each iteration, the path is either extended by
adding a new node, or contracted by delcting its terminal node. When the
_ J
76
Deterministic Systems and the Shortest Path Proolem
Chap. 2
Typical Iteration of the Auction Algorithm
Let i be the terminal node of P. If
Pi < min [aij + pj],
Uj(i,j) is all arc}
(3.2)
go to Step 1; else go to Step 2.
Step 1 (Contract path): Set
Pi := min [aij + Pj],
UICi,j) is all arc}
(3.3)
and if i f. s, contract P. Go to the next iteration.
Step 2 (Extcnd path): Extend P by node ji where
ji=arg min [ai.+p'],
UI(i,j) is an arc} J J
(3.4)
If ji is the destination t, stop; P is the desired shortest path. Other-
wise, go to the next iteration.
It is easily seen that the algorithm maintains CS. Furthermore, the
addition of the node ji to P following an extension does not create a cycle,
since otherwise, in view of the condition Pi S; aij + Pj, for every arc (i, j)
of the cycle we would have Pi = aij + pj. By adding this equality along
the cycle, we see that the length of the cycle must be zero, which is not
possible by our assumptions.
Figure 2.3.7 illustrates the algorithm. It can be seen from the exam-
ple of this figure that the terminal node traces the tree of shortest paths
from the origin to the nodes that are closer to the origin than the given
destination. This behavior is typical when the initial prices are all zero (see
Exercise 2.10).
There is an interesting interpretation of the CS conditions in terms
of a mechanical model. Think of cach node as a ball, and for every arc
(i,j), connect i and j with a string of length aij' (This requires that
aij = aji > 0, which we assume for the sake of the interpretation.) Let
the resulting balls-and-strings model be at an arbitrary position in three-
dimensional space, and let Pi be the vertical coordinate of node i. Thcn
the CS condition Pi - Pj S; aij clearly holds for all arcs (i,j), as illustrated
in Fig. 2.3.8(b). If the model is picked up and left to hang from the origin
node (by gravity - strings that are tight arc perfectly vertical), then for
all the tight strings (i,j) we have Pi - pj = aij, so any tight chain of
strings corresponds to a shortest path between the end nodes of the chain,
,1 _C .L.. ..:..1., ,.1,,;n
t
\
r
Sec. 2.3
Shortest . cdh Algorithms
77
P3 =3
Shortest path problem with arc
lengths as shown
Trajectory of terminal node
and final prices generated by
the algorithm
f
I
I
t
Itel'ation # Path P prior Price vector p prior Type of action
to iteration to iteration during iteration
1 (1) (0,0,0,0) contraction at 1
2 (1) (1,0,0,0) cxtcnsion to 2
3 (1,2) (1,0,0,0) contraction at 2
4 (J) (1,1.5,0,0) contraction at 1
5 (J) (2, 1.5,0,0) extcnsion to 3
6 (1,3) (2, 1.5,0,0) contraction at 3
7 (1) (2, 1.5,3,0) contraction at 1
8 (1) (2.5,1.5,3,0) cxtension to 2
9 (1,2) (2.5,1.5,3,0) extension to 4
10 (1,2,4) (2.5,1.5,3,0) stop
Figure 2.3.7 An example illustrating the auction algorithm starting with P = (1)
and p = 0.
connecting the origin node s to any other node i is ps - Pi and is also equal
to the shortest distance from s to i.
The algorithm can also be interpreted in terms of the balls-and-strings
model; it can be viewed as a process whereby nodes are raised in stages as
illustrated in Fig. 2.3.g. Initially all nodes are resting on a flat surface. At
each stage, we raise the last node in a tight cha.in that starts at t.he origin
to the level at which at least one more string becomes tight.
The following proposition establishes the validity of the auction algo-
rithm.
_..1
78
Deterministic Systems and the Shortest Path Prc Jm
Chap. 2
Shortest path prOblem with
arc lengths shown next to the arcs.
Node 1 is the origin.
Node 4 is the destination.
(a)
P1 =2.5
P1
P4 =0
P2 = 1.5 - - - -
P2 -
P3
P4
P3 = 0..5
(b)
(c)
Figure 2.3.8 Illustration of the CS conditions for the shortest path problem. If
each node is a ball, and for every arc (i, j), nodes i and j are connected with a
string of length aij, the vertical coordinates Pi of the nodes satisfy Pi - Pj :S aij,
as shown in (b) for the problem given in (a). If the model is picked up and left
to hang from the origin node s, then Ps - Pi gives the shortest distance to each
node i, as shown in (c).
Proposition 3.2: If there exists at least one path from the origin
to the destination, the auction algorithm terminates with a shortest
path from the origin to the destination. Otherwise the algorithm never
terminates and Ps -+ 00.
Proof: We first show by induction that (P, p) satisfies CS throughout the
algorithm. Indeed, the initial pair satisfies CS by assumption. Consider an
iteration that starts with a pair (P,p) satisfying CS and produces a pair
( P , p). Let i be the terminal node of P. If
Pi = min [aij + pj],
{jl(i,j) is an arc}
(3.5)
then P is the extension of P by a node ji and p = p, implying that the CS
condition (3.1h) holds for all ilrcs of P as well as arc (i,ji) [since)i atta,ins
i
Sec. 2.3
2.0
1.5
1.0
0.5
o
3.0
2.5
2.0
1.5
1.0
0.5
o
Shortest 1 _uh Algorithms
79
Shortest path problem with
arc lengths shown next to the arcs.
Node 1 is the origin.
Node 4 is the destination.
(a)
Initial position
After 1 st stage
After 2nd stage
4
4
After 3rd stage
After 5th stage
After 4th stage
The ball marked by gray (the terminal node of the
current path P) is raised at each stage.
(b)
Figure 2.3.9 Illustration of the auction algorithm in terms of the balls-and-
strings model for the problem shown in (a). The model initially rests on a flat
surface, and various balls are then raised in stages. At each stage we raise a
single ball i =i' t (marked by gray), which is at a lower level than the origin s
and can be reached from s through a sequence of tight strings; i should not have
any tight string connecting it to another ball, which is at a lower level, that. is, i
should be the last ball in a tight chain hanging from s. (If s does not have any
tight string connecting it to another ball, which is at a lower level, we use i = s.)
We then raise i to the first level at which one of the strings connecting it to a
ball at a lower level becomes tight. Each stage corresponds to a contraction plus
all the extensions up to the next contraction. The ball i, which is being raised,
_ J
80
DeienninisLie Syst.ems and ilw Slwriesi }Jaih Problcm
Chap. 2
Suppose next that
]h < min [11.'J \. JlJ J.
(jl(i,j) is all arc}
Then if P is the degenerat.e path (5), the CS condition holds vacuously.
Otherwise, P is obtained by contracting P, and for all nodes j E P, we
have Pj = PJ, implying conditions (3.1a) and (3.1b) for arcs outgoing from
nodes of P. Also, for the terminal node i, we have
Pi = min [aij + pj] ,
(jl("j) is an arc}
implying condition (3.Lt) for arcs outgoing from that node as well. Finally,
since p, > Pi and Pk = Pk for all k i= i, we have Pk ::; akj + Pj for all arcs
(k, j) out.going from nodes k f: P. This completes the induction proof that
(P, p) satisfies CS throughout the algorithm.
Assume first that there is a path from node 5 t.o the destination t.
Dy adding the CS condition (3.1a) along that path, we see that ps - PI is
an underestimate of the (finite) shortest distance from 5 to t. Since ps is
monotonically nondecreasing, and PI is fixed throughout the algorithm, it
follows that ps must. stay bounded.
We next claim that Pi must. stay bounded for all i. Indeed, in order t.o
have Pi --> 00, node i must become the terminal node of P infinitely often.
Each t.ime this happens, ps - Pi is equal to t.he shortest distance from 5 t.o
i, which is a contradiction since Ps is bounded.
We next show that the algorithm terminates. Indeed, it can be seen
with a straightforward induction argument that for every node i, Pi is eit.her
equal to its init.ial value, or else it is the length of somc path starting at i
plus t.he initial price of the final node of the path; we call this the modified
length of the path. Every path from 5 to i can be decomposed into a path
with no cycles toget.her with a finite number of cycles, each having positive
length by assumption, so the number of distinct modified path lengt.hs
within any bounded interval is bounded. Now Pi was shown earlier t.o be
bounded, and each time i becomes the terminal node by ext.ension of the
path P, Pi is strictly larger over the preceding time i became the terminal
node of P, corresponding to a strictly larger modified path length. It follows
that the number of times i can become a t.erminal node by extension of the
path P is bounded. Since the number of path contractions between two
consecutive path extensions is bounded by the number of nodes in t.he
graph, the number of it.erations of t.he algorithm is bounded, implying that
the algorithm terminates.
Assume now t.hat there is no path from node 5 to the destination.
Then, t.he algorithm will never terminate, so by t.he preceding argument,
some node i will become the terminal node by extension of the path P
infinitely oft.en and Pi --> 00. At the end of iterations where this happens,
Sec. 2..}
Noies, Sources, and L
.:lses
IH
ps - Pi must be equal to the shortest distance from s to i, implying that.
Ps --> 00. Q.E.D.
There are a number of variations of thc basic algorithm for which
we refer to the literature. The most significant of these relates to a for-
ward/reverse version of the algorithm that maint.ains, in addition to t.he
path P, another path R that cnds at tltc destination. Such an algorit.hm
can be interpreted in terms of the balls-and-strings model of Fig. 2.3.8 as
follows. \Vhen the forward part of the algorithm is used, we raise nodes in
stages as illustrat.ed in Fig. 2.3.9. When the reverse part of the algorithm
is used, we lower nodes in stages; at each stage, we lower the top node in a
tight chain that ends at the destination t.o the level at which at least one
more string becomes tight. The algorithm terminates when the origin and
the destination arc joined by a tight chain.
When implemented in its forward/reverse version, the auction algo-
rithm has been found to perform very well for many types of problems,
particularly those involving few destinations and a graph with relatively
small diameter. However, t.he algorithm as presented here, can perform
poorly in some specially constructed problems (in the language of COlll-
putational complexity theory, it is nOllpolynomial); for an example see
Exercise 2.8. There are a number of improved versions of the basic algo-
rithm [Ber91aJ, [Ber91 bJ, [CDP92J, including some polynomial variations
developed in [BPS92]; see also Exercise 2.9. The auction algorithm is also
well-suited for implementation in parallel computers; see [PoB94].
2.4 NOTES, SOURCES, AND EXERCISES
'.'
Work on the shortest path problem is very extensive. Literature sur-
veys are given in [Dre69J, [DePS4J, and [GaPS8]. For a detailed textbook
treatment of shortest paths and a variety of associated computer codes, see
[Ber91a]. For ,t discussion of application of shortest path methods in mti-
ficial intelligence, see [PeaS4]. For a t.reatment of critical path analysis, see
[Elm7S]. A tu..torial survey of Hidden Markov Models is given in [RabS9].
The Viterbi algorithm, first proposed in [Vit67J, is also discussed in [For73].
For applications in communicat.ion systems, see [PrS04] and [SkISS]. For
applications in speech recognition, see [RabS9] and [Pic90].
Label correcting methods draw their origin from the works of Dellman
[Be157] and Ford [For56]. The D'Esopo-Pape algorit.hm appeared ill [Pap7(t]
based on an earlier suggestion of D'Esopo. l:<or a discussion of various
implementations of Dijkstra's algorithm, see [Ber91a]. The SLF method
and some variations were proposed by the author [DerU3]; see also [DGM!:J4],
where the LLL strategy as well as implementations on a parallel computer
jt
",\"
f
'f"
_. J
82
Deterministic Systems and the Shortest jJath Problem
Cllaj). 2
of various label correcting methods arc discussed. The auction algorithm
for shortest paths is due to the author [Der!:J1a], [Der91b].
EXERCISES
2.1
Find a shortest path from each node to node 6 for t.he graph of Fig. 2.11.1 by
using the OP algorithm.
Figure 2.4.1 Graph for Exercise 2.1. The
arc lengths are shown next to the arcs.
2.2
Find a shortest path frolll node 1 to node 5 for the graph of Fig. 2.1.2 by
using the label correcting method of Section 2.3.1 and the auction algorithm
of Section 2.3.2.
Origin
Figure 2.4.2 Graph for Exercise 2.2. The arc lengths are shown next to the arcs.
Sec. 2.4
Notes, Sources, and L cises
83
2.3
Air transportation is available between n cities, in some cases directly and in
others through intermediate stops and change of carrier. The a.irfare between
cities i and j is denoted by aij. We assume that aij = aji, and for notational
convenience, we write aij = 00 if there is no direct flight between i and j. The
problem is to find the cheapest airfare for going between t.wo cit.ies perhaps
through intermediate stops. Let n = 6 and al2 = 30, al3 = 60, al4 = 25,
al5 = al6 = 00, an = a24 = a25 = 00, a26 = 50, a34 = 35, a35 = a36 = 00,
a45 = 15, a46 = 00, a56 = 15. Find the cheapest airfare from every cit.y t.o
every other city by using the OP algorithm.
2.4 (Dijkstra's Algorithm for Shortest Paths)
Consider the best-first version of the label correcting algorithm of Section
2.3.1. Here at each iteration we remove from OPEN a node that has minimum
label over all nodes in OPEN.
(a) Show that each node j will enter OPEN at most. once, and show that.
at the time it exits OPEN, its label d j is equal to the shortest distance
from s to j. Hint: Use the nonnegative arc length assumption t.o argue
that in the label correcting algorithm, in order for the node i tha.t.
exits OPEN to reenter, there must exist another node k in OPEN with
dk + aki < d i .
(b) Show t.hat the number of arithmetic operations required for terminatioH
is bounded by cN 2 where N is the number of nodes and c is some
constant.
2.5 (Label Correcting for Acyclic Graphs)
Consider a shortest path problem involving an acyclic graph. Let Sk be the
set of nodes i such that all paths from the origin to i have k arcs or less and
at least one such path has k arcs. Consider a label correcting algorithm that
removes from OPEN a node of Sk only if there are no nodes of Sl, . . . ,Sk-l in
OPEN. Show that each node with enter OPEN at most once. How docs this
result relate to the type of shortest path problem arising from deterministic
OP (d. Fig. 2.1.1)?
2.6 (Label Correcting with Multiple Destinations)
Consider the problem of finding a shortest path from Hode s to each node
in a subset T, assuming that all arc lengths arc nonnegative. Show t.hat. t.he
following modified version of the label correcting algorithm of Section 2.3.1
solves the problem. Initially, UPPER = 00, d s = 0, and d i = 00 for all i =I s.
Step 1: Remove a node i from OPEN and for each child j of i, execute step
2.
84
Deterministic Systems and the Shortest Path PI, .em
Notes, Sources, anu L. . ciscs
tlG
_. J
Step 2: If d, + ail < min{d j , UPPER}, set. d] = d, + a,], seL i to be the
parent of j, and place j in OPEN if it is not already in OPEN. In addition,
if JET, seL UPPER = maXtE1' dt.
Step 3: If OPEN is empt.y, t.erminate; else go to step 1.
Prove a termination property such as the one of Prop. 3.1 for this algorithm.
(c) Apply t.he algorit.hm to the probkm or l'xl'rl'isl' 2.K alld \"'riry th;1i it
performs much belter than the auction algorithm given in Sl'l't.ion 2.:).2.
2.7 (Label Correcting with Negative Arc Lengths)
Consider the auction algorithm wit.h all init.ial prices equal to 0. Sho\\" that
if node j becomes for the first. time the terminal node of ]J aft.er node idol's,
then the shortest distance from 1 to j is not less than the shortest dist.ance
from 1 to i.
Consider the problem of finding a shortest path from node s t.o node t,
assume that all cycle lengths are nonnegative (instead of all arc lengths
nonnegative). Suppose that a scalar Uj is known for each node j, which is
underestimat.e of t.he shortest distance from j to t (Uj can be taken -00
underestimate is known). Consider a modified version of the typical
of the label correcting algorithm of Section 2.3.1, where step 2 is replaced
the following:
Modified Step 2: If d i +a'j < min{dj, UPPER- Uj}, set d] = d, +aij
set i to be the parent of j. In addition, if j of t, place j in OPEN if it is
already in OPEN, while if j = t, set UPPER t.o the new value d, + ait of
(a) Show that the algorithm terminates.
(b) Why is the algorithm of Section 2.3.1 a special case of the one of
exercise?
11 (DP on Two Parallel Processors [Las85])
Formulate a DP algorithm t.o solvc the ddel"lilillistic prol'\"111 of" S"..t.ioll 2.1
. on a parallel computer with two processors. One processor should cxecutp a
forward algorithm and the other a backward algorithm.
.12 (Doubling Algorithms)
Consider a deterministic finite-st.ate problem that is time-invariant in t.he
5eIlSe that the state and control spaces, t.he cost per stage, aud t.h(' syst.l'm
equation are the same for each time period. Let. Jk(X, y) be t.he opt.imal cost
to reach state y at time k start.in! from stat.e x at. t.ime 0. Show that for all k
2.8
hk(X, y) = min{ Jk(X, z) + .h(z, y)}.
Consider the problem of finding a shortest path from node 1 to node 5 in
graph consisting of arcs (1,2), (2,3), (3,1), (4,2), and (3,5) with correspon .
lengths al2 = 1, a23 = 1, a34 = 1, a42 = 1, a35 = L, where L is a I
integer. Show that the auction algorithm requircs a number of iterations thai;
is proportional to L. .
Discuss how this equation may be used wit.h advantage to solve problems wit.h
a large number of stages.
,3 (Distributed Shortest Path Computation [Ber82a])
2.9 (An Auction Algorithm with Graph Reduction [BPS92])
Consider the problem of finding a shortest. path from nodes 1,2,..., N (,0
node t, and assume that all a.rc lengths are nonnegat.ive and all cycle lengt.hs
are positive. Consider the iteration
Consider the variant of the auction algorithm t.hat has the following add
feature: each time that a node j becomes the terminal node of the path P
through an extension using arc (i,j), all incoming arcs (k,j) of j with k of.
are deleted from the graph. Also, each time that a node j with no outgo'
arcs becomes the terminal node of P, the path P is contracted and the nod
j is deleted from the graph.
(a) Show t.hat t.he arc deletion process leaves the shortest distance from "0
to t IInafrcct.cd, and that t.11I' algorit.hm terminates either by finding a
shortest pat.h from s t.o t or by deleting s, depending on whether there'1
exists at least one pa.th from s to t or not.
(b) Show t.hat the conclusion of part (a) holds even if there are cycles
zero length.
d k + 1 . [ d k ]
i = IlIm aij + j ,
]
.i = 1,2,...,N,
(,1.1 )
d+l = 0.
It was shown in Section 2.1 t.hat. if t.he initial condition is d? = 00 for
i = 1,...,N and d = 0, then the iteration (4.1) yields the shortest.
distances in N st.eps. Show that if t.he init.ial condition is d;' = 0, for
all i = 1,..., N, t, t.hen the it"rat.ion (-1.1) yields LlII' shor(,,,(, dit.all('('s
in a finite number of steps.
Assume that the iteration
d i := min [aij + dj]
]
(4.2)
_ J
86
Determjnjstjc Systems and the Shortest Path PlObler,
Chap. 2
is executed at node i in parallel with the corresponding iteration for rlJ
at every other node j. However, the t.imes of execution of this iteration
at. the various nodes are not synchronized. Furt.hermore, each Hode i
communicates the results of its latest computation of rl, at. arbit.rary
times with potentially large communication delays. Therefore, there is
t.he possibilit.y of a node execut.ing it.erat.iou (1.2) several t.imes before
receiving a communication from every other neighboring node. Assume
t.hat each node never stops executing iteration (4.2) and transmit.t.ing
the result. t.o the ot.her nodes. Show that the values rl?' available at time
T at the corresponding nodes i are equal to the shortest distances for
all T great.er than a finite time T. lIint: Let. -;17 and Q be generat.ed
by it.eration (1.2) when st.arting from the lirst. and the se(;Qud iHil.ial
conditions in part (a), respectively. Show that for every k there exists
k T -k
a time Tk such that for all T 2: 1k and k, we have Q, ::; rl i ::; rl,.
For a detailed analysis of asynchronous iterative algorithms, including
algorithms for shortest paths and DP, see [BeT89].
3
p. 88
p.91
p. 97
p. 97
p.106
p. 110
p.112
p. 112
p. 116
p. 116
p. 119
p.120
p.123
Deterministic Continuous- Time
Optimal Control
.';
>j
.
Contents
3.1. Continuous-Time Optimal Control
3.2. The Hamilton-Jacobi-Dellman Equation
The Pontryagin Minimum Principle . .
3.3.1. An Informal Derivation Using the HJB Equation
3.3.2. A Derivation Based on Variational Ideas . . .
3.3.3. Minimum Principle for Discrete-Time Problems
Extensions of the Minimum Principle
3.4.1. Fixed Terminal State
3.4.2. Free Initial State
3.4.3. Free Terminal Time .
3.4.4. Time- Varying System and Cost
3.4.5. Singular Problems
3.5. Notes, Sources, and Exercises
87
_ J
88
Dcicnni"istic Cont.inuolls- Timc Optimal ContrL
Chap. 3
In this chapter, we provide an int.roduction to continuous-timc de-
tcrministic optimal control. 'vVe derive the analog of the DP algorithm,
which is the Hamilton-Jacobi-Bellman equation. Furthermore, we discuss
t.he Pontryagin Minimum Principle and its variations. We give two dift'erent
derivations of this result, one of whkh hi based on DP. We also illustrate
thc result by means of examples.
3.1 CONTINUOUS-TIME OPTIMAL CONTROL
We consider a continuous-time dynamic system
:i:(t) = I(x(t), u(t)),
(1.1)
o :S t :S T,
:/:(0) : given,
whcrc x(t) E n is thc state vector at time t, :i:(t) E 3t n is the vector of first
order time derivat.ives of the states at time t, u(t) E U C ffi'm is the control
vector at time t, U is the control constraint set, and T is the terminal time.
The components of I, :/:, :i:, and u will be denoted by Ii, :1':" :i:" and Ui,
respectively. Thus, t.he syst.<l1l (1.1) r"I)I'es(nt.s the "I/. first ordm' dirrnrent.ial
e(luations
d:r,(t) ( ( ) ( ))
- = Ii X t , nt,
dt
We view :i:(t), :r(t), and u(t) as column vectors. We assume that the system
function I, is continuously differentiable with respect to x and is continuous
with respect to u. The admissible control functions, also called control
I.mjectories, arc the piecewise continuous functions {u(t) I t E [0, Tn with
n(t) E U for all t E [0, TJ.
We should stress at the outset that the subject of this chapter is
highly sophisticated, and it is beyond our scope to develop it ttccording
to high standards of mathematical rigor. In particular, we assume that,
for any admissible control trajectory {u( t) I t E [0, Tn, the system of
differential equations (1.1) has a unique solution, which is denoted {x"(t) I
t E [0, Tn and is called the corresponding state tr-ujectory. In a more
rigorous treatment, the issue of existence and uniqueness of this solution
would have to be addressed more carefully.
We want to find an admissible control trajectory {u(t) I t E [0, Tn,
which, t.ogether with its corresponding state trajectory {xu(t) I t E [0, Tn,
minimizes a cost function of the form
i = 1,... ,no
T
h(xu(T)) + 1 .q(:ru(t), u(t))dt,
where the functions 9 and h arc continuously differentiable with respect to
:1' and 9 is continuous with respect to u..
:
Sec. 3.1
Continuous- Timc Optj...._J Control
8U
Example 1.1 (Motion Control)
A unit. mass moves on a line under the influence of a force 1/.. Let. J:l(t)
and x2(1.) be t.he posit.ion and velocity of t.he mass at. t.ime I., respcctively.
From a given (Xl(O),X2(O)) we want t.o bring the ltliJ.o.;S "ncar" a given lina.!
posit.ion-velocit.y pair (Xl, X2) at. t.ime T. In part.icular, wc w;mt to
minimize IX1(T) _x11 2 + IJ:2(T) _x21 2
subject. to the control constraint
lu(t)1 :s: I,
for all L E [0,'1'].
The corresponding continuous-t.ime system is
XI(L) = :r2(1.),
X2(t) = u(t.),
aJld the problem fit.s the general framework given earlier wit.h cost. fUIlct.ions
given by
h(J:(T)) = l:rj(T) _x11 2 + IX2('1') -x21",
q(:r(I), n(I.)) , 0, «)!' alii. r: [0, T].
There are many variations of t.he problem; for example, the fina.l posi-
tion and/or velocity may be fixed. These variations can be handled by various
reformulations of the general continuous-time opt.imal control problem, whi('.I,
will be given later.
Example 1.2 (Resource Allocation)
A producer with prouuct.ion rat.e x(t) at time t may allocat.e a port.iol1 n(l.)
of his/her product.ion rat.e to reinvestment and 1 - u(L) to production of a
storable good. Thus x(L) evolves according to
x(L) = ,u(L)x(t),
where I > 0 is a given constant. The producer wants t.o maximiw t.he I.otnl
amount of product st.ored
.']'
1 (1 - u(t) ):!:(t)dt
subject to
O:S: 1J.(L) <; I,
ror alii E [0,'1'].
The initial production rate x(O) is a given positive l1umber.
We will now derive informally a partial differential equation, which is
ed by the optimal cost-to-go function. This equation is the continuous-
e analog of the DP algorithm, and will be motivated by applying DP
a discrete-time approximation of the continuous-time optimal control
oblem.
Let us divide the time horizon [0, T] into N pieces using the discretiza-
tion interval
_ J
90
Deterministic Continuous-Time Optimal Con,,, 01
Chap.
Examplc 1.3 (Calculus of Variations Problems)
Calculus of variations problems involve liuJing (possibly multidimensional)'"
cmves x(t) with certain optimality properties. They are among the most}
celebrated problems of applied mat.hematics and have been worked on by'),
many of the illustrious mathematicians of the past 400 years (Euler, Lagrange,;
Bernoulli, Gauss, de,). We will see that calculus of variations problems can ;'
be reformulat.ed as optimal control problems. We illustrate this reformulation :,
by a simple example.
Suppose t.hat we want to find a curve from a given point to a givcnline,;,
t.hat has minimum length. The allSwer is of course evident, but we want to ;f
derive it. by using a continuous-time optimal control formulation. Without'
loss of generality, we let (0, a) be the given point, and we let the givenlinc be,.
the vertical line that passes t.hrough (7',0), as shown in Fig. :.:1.1.1. Let also.\
(t, x( i)) be the points of the curve (0 ::; t ::; T). The port.ion of the curveo
joining the points (t,x(t)) and (t + dt,x(i + dt)) can be approximated, for):
small dt, by the hypot.enuse of a right triangle with sides dt and x(t)dt. Thus.:
the length of this portion is .
/ ( dt)2 + (x(t)) 2 (dt)2.
which is equal to
\ /1 + (.i;(I)) 2 dt.
::c:- ;)rC'lf;j-': 1:; L'
minimize j V 1+ (i:(t)f dt
o'
We denote
subject to x(O) = a.
To reformulate t.he problem as a continuous-time optimal control problem,
we introduce a control u and the system equation
The Hamilton-Jacobi-Be."nan Equation
91
Given/
Point
Length = I 1 + (u(t))2dt
\ 0 I
I
I
I
I""-Given
I L .
I me
I
o
T
Igure 3.1.1 Problem of finding a curve of minimum length from a given point
a given line and its formulation as a calculus of variations problem.
HAMILTON-JACOBI-BELLMAN EQUATION
T
b = N'
Xk = x(kb),
Uk = u(kb),
k = 0, 1,.. . , N,
k = 0,1,. . . , N.
We approximate the continuous-time system by
Xk+l = Xk + f(Xk, Uk) . b
xCi) = u(t),
x(O). = a.
Our problem then becomes
T
minimize 1 V I + (u(t)f dt.
This is a problem that fits our continuous-time optimal control framework.
N-l
h(XN) + L g(Xk, Uk) .8.
k=O
We now apply DP to the discrete-timc approximation. Let.
J*(t, x) : Optimal cost-to-go at time t and state x
for the continuous-time problem,
_ J
92
DcicrmilJisiic CO/JiilJ[lOl1s-'1'ime Opiimal ContrL
Chap. 3
.l*(t, x) : Optimal cost-to-go at t.ime t and state x
for the discrete-time approximation.
The DP equat.ions arc
.l*(N15,x) = h(x),
.l*(ko,:c) = min[g(x,u).0-t-.l*((k-l-1).15,:c-l- f(x,u).15)], k = a,... ,N 1.
uEU
Assuming that .l* has the required differentiability properties, we expand
it int.o a first order Taylor series, obtaining
.l*((k -I- 1). O,X -I- f(:1;,u), b) = .l*(k15,x) -I- Vt.l*(kb,x). 15
-I- V x .l*(k15,x)'f(x,u). 0 -I- o(b),
whcre 0(15) rcpresents second order terms satisfying lim6o 0(15)/0 = 0,
V I. denotes partial derivative with rcspect to t, and V x denotcs thc n-
dimensional (column) vector of partial derivatives with respect to ;Z:. Sub-
stitut.ing in t.h( DP e(jlmtion, wc obtain
J*(kb,;i:) = minly(;/:,u)' b -I- J*(kb,x) -t Vt.J*(kb,x)' b
uEU
-I- V x.J*(kb, ;c)' f(:r, u) . (5 -I- 0(15)].
C<1ncding .l* (ko, :1:) from both sides, dividing by 0, and taking the limit as
15 ---> a, while assuming that thc discrete-time cost-to-go function yields in
the limit its continuous-time counterpart,
lim .l*(kb,;r) = J*(I,,;z;),
k--->O<J, bO, kb=t
for all t, ;r,
we obtain t.hc following cquation for thc cost-to-go function J*(t, x):
0= min[g(;r,u) -I- Vtl*(t,x) -I- V:rJ*(t,;c)'f(x,u)],
uEU
for all t, x,
wit.h the boundary condition J* ('1', :1;) = h(x).
This is the Hamilton-Jacobi-Bellman (HJB) equation. It b a partial
differcntial equation, which is satisfied for all time-state pairs (t, x) by the
cost-to-go function J* (t, x). Of course we do not know a priori that J*(t, ;c)
is differentiable, which was assumed in the preceding informal derivation.
Howcver, if we can solve thc HJB equation analytically or computationally,
t.heu we can obtain an optimal control policy by minimizing its right-hand-
side, as shown in the following proposit.ion. This situation is similar to
DP; if we can executc thc DP algorithm, which may not be possible due
to exccssive computational rcquircments, we can find an optimal policy by
minimization of the right-hand side.
Sec, 3.2
The lIamiltol1-Jac0 Jellma/J }<;ql1aiiol1
93
Proposition 2.1: (Sufficiency Theorem) Suppose V(t,;z:) is a so-
lution to thc HJB equation; that is, V is continuously differcntiable in
;c ancl t, aud is such that
a = min[g(x,u) -I- VtV(t,x) -I- VxV(t,x)'f(x,u)],
uEU
for all t.;z:,
V(T, x) = h(x),
(2.1)
(2.2)
for all x.
Suppose also that J-L*(t, x) attains the minimum in Eq. (2.1) for all t
and x. Lct {x* (t) I t E [0, TJ} be the state trajectory obtaincd from
the given initial condition x(O) when the control trajectory u*(t) =
J-L*(t,x*(t)), t E [a,T] is used [that is, x*(O) = x(O) and for all t E
[0,'1'], x*(t) = f(x*(t),J-L*(t,x*(t))); we assume that this differcntial
equation has a unique solution starting at any pair (t,x) and that the
control trajectory {J-L*(t,x*(t)) It E [O,TJ} is piecewise continuous as
a function of t]. Then V is the unique solution of the HJD equation
and is equal to the optimal cost-to-go function, i.e.,
V(t,x) = J*(t,x),
for all t, ;c.
Furthermore, the control trajcctory {u*(t) It E [0, TJ} is optimal.
Proof: Let {u( t) I t E [0, Tn bc any admissible control trajectory and let
{i:(t) It E [0, Tn be the corresponding state trajectory. From Eq. (2.1) wc
have for all t E [0, '1']
O:S y(.i:(t),u,(t)) -I- VtV(t,x(t)) -I- VxV(t,;i;(t))'f(:r(t),iL(t)).
Using the system cquation :i:(t) = f(x(t), u(t)), the right.-hand sidp of the
above inequality is equal to the cxpression
g(x(t), u(t)) -I- :t (V(t, x(t))),
where d/ dt(-) dcnotcs total derivative with rcspcct to t. Integrating t.his
expression over t E [0, '1'], and using the preccding incquality, we obtain
T
O:S 1 g(.i(t),u(t))dt -I- V(T,x(T)) - v(a,i:(a)).
Thus by using the tel'lninal condition V (1',:J:) = h(:c) of Eq. (2.2) and Llle
initial condition x(O) = x(a), we have
T
V(a,x(O)) :S h(i:(T)) -I- 1 g(.i(t),u(t))dt.
_ J
D4
Vcl.c/"Ininisiic GoniillllOlls-'l'i1lJe Opiimal Gontrv
Chap. :3
If we u:;e u*(t) and :c*(t) in place of 11(t) and i;(t), re:;pectively, tlw preceding
inequalities become:; equalitie:;, and we obtain
T
V(O,:c(O)) = h(:c*(T)) + 1 g(:c*(t),u*(t))dt.
Therefore the cost corresponding to {u*(t) It E [O,T]} is V(O,.1:(O)) and
i:; no larger than the co:;t corresponding to any other admissible control
trajectory {ii,(t) It E [O,1']}. It follows that {u*(t) I t E [0, T]} i:; optimal
and that
V(O,:r(O)) = J*(O,.1:(O)).
We now note that the preceding argument can be repeated with any initial
time t E [0, T] and any initial state .1:. We thus obtain
V(t,x) = J*(t,x),
for all t, x.
Q.E.D.
Example 2.1
To illu:;t.rate t.he IIJB equation, let. u:; consider a simple example involving
the scalar syst.em
x(t) = 11(1),
with the constraint lu(t)1 ::; 1 for all t E [0, TJ. The cost is
(X(T)t
The lIJB equation here is
0= min [\7,Y(L,x) + \7xV(L,x)u],
1"1::;1
for all L, x,
(2.3)
with the t.enninal condition
1 2
V(T,x) = -x .
2
(2.4)
There is an evident calldiJate for optimality, namely moving the state
t.owards 0 as quickly as possible, and keeping it at 0 once it is at O. The
corresponding cont.rol policy is
Il*(1-,X) = -sgn(x) = {
-1
if x < °
if x = °
if x> O.
(2.5)
For a given initial time t and initial state x, the cost associated with this
policy can be calculated to be
J*(I,x) = (lI1ax{O, Ixl- (1' - t)} (
(2.6)
Sec. :3.2
The llamiholJ-Jaco. .JellmalJ B<juaiiolJ
!J5
Thi:; funct.ion, which is illust.rated in Fig. 3.2.1, sat.islies t.he tel"lninal mndit i011
(2.4), since 1* (1', x) = (1/2)x 2 . Let us verify that this fUllctioll also satislies
the 11.113 Eq. (2.:1), and that. 11 = -sgn(:r) at.l.ains U", llIinillllnn in U", riglit-
hand side of the equat.ion for all Land x. Proposit.ion 2.1 will t.hen f!.lIaranll'l'
that t.he state and control trajectories corresponding to t.he policy II: (I., :1;) aI"('
optimal.
Indeed, we have
\7tof'(t,:c) = max{O, Ixl- (1' - I.)},
\7 x J*(t,x) = sgn(x). max{O, Ixl- (1' - I.)}.
Substituting these expressions, the HJB Eq. (2.3) becomes
0= min [1 + sgn(x). u] max{O, Ixl- (T - I.)},
1"19
(2.7)
which can be seen to hold as an ident.it.y for all (I., :1:). 1.'lIrth,'rnlol"(', the
minimum is attained for u = -sgn(x). 'vVe therefore conclude based on Pl'Op.
2.1 that .J' (t, x) as given by Eq. (2.6) is indeed the optimal cosHo-go function,
and that the policy defined by Eq. (2.5) is optimal. Note, however, t.hat the
optimal policy is not unique. Based on Prop. 2.1, a.ny policy for which th"
minimum is attained in Eq. (2.7) is opt.imal. In particular, when Ix(t)1 ::; 'l'-t,
applying any control from the range [-1, 1] is optimal.
The preceding derivation generalizes to the case of t.he cost
h(x(T)),
where h is a nonnegative differentiable convex function with h(O) = O. The
corresponding optimal cost-to-go function i:;
{ h(X-('1'-t))
J'(t,x)= (X+(T-t))
ifx>1'-t
if x < -('1' - t)
if Ixl::; 1'-1,
and can be similarly verified to be a solution of the I-IJB equat.ion.
J*(t,x)
..{
- (T - t)
x
o
T - t
-"'-
Figure 3.2.1 Optimal cost-to-go function J*(t,x) for Example 2.1.
};
_ J
96
Dct.onninisUc ConUnuous-Timc Opiimal ContrL
Chap. 3
Example 2.2 (Lincar-Quadratic Problems)
Consider the n-dimensional linear system
i:(t) = /h(t) + lJu(t),
where A and 13 arc given matrices, and t.he quadratic cost.
:I:(T)'Ch:r(T) + (T (:/;(t)'Q:I:(t) _I: n(t)' Ull(t))dt,
./0
wl"'l"l' t.he IlIat.rices Qr Hlld CJ are sY"III.tric positive l'llIidelillitl', .IlId the
matrix R is symmet.ric positive definite (Appendix A defines posit.ive dclillit.e
and semidefinite matrices). The HJB equation is
0= min [x'Qx+u'Ru+V'tV(t,:I:) +V'xV(t,x)'(Ax + lJu)], (2.8)
uE3FH
VCl',:I:) = "JQTX.
Let us t.ry a solution of t.he form
V(t,,/;) = x' K(t)x, K(t) : n x n symmetric,
and see if we can solve t.he II.JB equation. We have V'xV(t,x) = 2J\(t)x and
V'tV(t,x) = x'l«t)x, where k(t) is the matrix with clements t.he first order
derivatives of t.he l'kment.s of J\(t,) wit.h rcspect. t.o time. By snbstituting
t.hese expressions in Eq. (2.8), we obtain
0= min [x'Qx + u' Hu + x' k(t)x + 2x'I«t)Ax + 2x' I«t)Bu]. (2.!J)
u
The minimum is attained at. a u for which t.he gradient with respect to u is
zero, that. is,
2B'I«L)x + 2Ru = 0
or
1t=-R- 1 B'I«t)x. (2.10)
Substit.uting t.he minimizing value of u in Eq. (2.0), we obtain
0= x'(k(L) + I«t)A + A'I«t) - I«L)JJJC I ]3'K(L) + Q)x, for all (L,x).
Therefore, in order [or V(t, x) = x'I«t)x t.o solve the IIJ8 equation,
I< (L) must satis[y t.he following matrix diUerential equation (knowll as the
continuous-time Hiccati equation)
k(t) = -K(t)A - A'I«t) + K(L)BIC 1 ]3' I«t) - Q (2.11)
wit.h the terminal condition
K(T) = QT.
(2.12)
Reversing the argument., we see t.hat. if j((t) is a solution o[ the Riccati
equation (2.11) with the boundary condition (2.12), t.hen V(t,x) = x'I«L)x
is a solut.ion of the IIJB equation. Thus, by using Prop. 2.1, we conclude that
the optimal cost-t.o-go function is
.J'(t,x) = x'J{(L)x.
Furthel"lnow, ill vi"w of the exprcssion d"rivl'd for the control that. minimizes
in the right-hand side of the llJB equat.ion [ef. Eq. (2.10)], an optimal policy
is
Jl*(t,X) = -R-1B'K(t)x.
i)'I
Sec. 3.3
The Poniryagin 1\[1.. ..1I1m Principle
97
3.3 THE PONTRYAGIN MINIMUM PRINCIPLE
In this section wc discuss the continuous-time and the Jiserctc-tilllc
versions of the Minimum Principle, starting with a DP-ba.'ied informal ar-
gument.
3.3.1 An Illfol'lllal Derivation U::;illg l,he H.ID Eqllat.ioll
Recall the HJD equation
0= min[g(:r,u) + 'Vd*(t,:r) + 'VxJ*(t,:r)'f(:r,u)], for all t,:t, (:).1)
uEU
J*(T,x) = h(x),
for all :c.
(3.2)
We argued that the optimal cost-to-go function J*(t, :r) satisfies t.his equa-
tion under somc conditions. Furthermore, the sufficiency theorem of thc
preccding section suggests t.hat. if for a given initial st.at< :r(O), t.he nJllt.rol
trajectory {u*(t) I t E [0, T]} is optimal with corresponding state t.rajectory
{x*(t) It E [O,T]}, then for all t E [0,1'],
u*(I) = arg min[g(x*(t), u) + 'V xl* (t, .1;*(1))' f(x*(t), u)]. (3.3)
uEU
Note that to obtain the optimal control trajectory via this equation, we
do not need to know \7 xJ* at all values of x and t; it is sufficient to know
'Vxl* at only one value of x for each t, that is, to know only 'V x.J*(t,x*(t)).
The Minimum Principle is basically Eq. (3.3) above. Its application is
facilitated by streamlining the computation of the derivative 'V x J* (t, x* (t)).
It turns out. that we can often calculate 'VxJ*(t,x*(t)) along the opt.imal
state trajectory far more easily than we can solve the HJD equation. In
particular, 'V xl* (t, x* (t)) satisfies a certain differential equation, called the
adjoint equation. We will derive this equation informally by differentiating
the HJB equation (3.1). We first need the following lemma, which indicatef
how to differentiate functions involving minima.
Lemma 3.1: Let F(t, x, u) be a continuously differentiable function
of t E R, x ERn, andu E Rm, and let U be a convex subset of Rm.
Assume that 11* (t, x) is a continuously differentiable function such that
f-L*(t,x) = argminF(t,x,u),
uEU
for all t, x.
Then
_ J
98
Del,efminist.ic ConiIJIlOlls-Time Opi[[Jal Conrol
Chap. 3
Vt { minF(t,x,u) } = VtF(t,x,J-L*(t,x)),
uEU
for all t, x,
V" { min F(t, J;, U) } = V "F(t, .x, J-L*(t, x)), for all t, :r.
uEU
[Note: On the left-hand side, V t {-} and V x{.} denote the gradients
of the function G(t, x) = minuEu F(t, x, u) with respect to t and .x,
respectively. On the right-hand side, Vt and V x denote the vectors of
partial derivatives of F with respect to t and x, respectively, evaluated
at (x,p*(t,x)).]
Proof: For notational simplicity, denote y = (t, .x), F(y, u) ,= F(t, .x, u),
anrlll*(Y) = Il*(t,x). Since miuuEu F(y,u) = F(Y,ll*(Y)),
V {illF(Y,u)} = VyF(Y,ll*(Y)) + Vp*(y)V u F(y,J1*(lI))'
We will prove the result by showing that the second term in the right-
hand side above is 7.cro. This is true wheu U = 3m, because then Il*(Y)
is an unconstrained minimum of F(y,u) and \1uF(y,p*(y)) = O. More
generally, for every fixed y, we have
(u -It*(lI))'\1uF(y,ll*(y)) :::: 0,
for a1l1l E U,
[see Eq. (D.2) in Appendix DJ. Now by Taylor's Theorem, we have that
when y chauges to 11 + !:!.y, the minimi7.ing Il* (y) changes [rom JL* (y) to
sOllie vector 1[* (y + !:!.y) = II.*(Y) + V p* (y)'!:!.y + o(ll!:!.yll) of U, so
(Vp*(Y)'!:!.Y + u(II!:!.YII))'VuF(y,ll*(y)) :::: 0,
for all !:!.y,
im plying that
\1p*(y)V u F(y,p*(y)) = o.
Q.E.D.
Consider the HJB equation (3.1), and for any (t,x), suppose that
I[*(t,:c) is a coutrol attaiuing the minimum in the right-hand side. We
make the restrictive assumptions that U is a convex set, and that p* (t, x)
is continuously differentiable in (t, x), so that we can use Lemma 3.1. (We
note, however, that alternative derivations of the Minimum Principle do
not require these a.sumptions; see Section 3.3.2.)
'!
..-"-
,
,
.
'.j
;Ji;
"M
I.
Sec. 3.3
The l'onryagin Mim,...l[[J Prindple
uu
vVe diflerentiate the HJD equatiou with rC'spect to :1: <1ud with re-
spect to t, and we rely on Lemma 3.1 to disregard the terms involving the
derivatives of Il*(I., :1:) wiLh respect t.o t ,lull :c. \\le uutain !()l" aU (1" .1:),
0= \1xg(x, J1*(t, x)) + VJ*(t,.x) + VxJ*(t,:I:)f(:l:,IJ,*(t,:c))
+ V x f (;z;, Il* (t, .x)) \1 x .1 * (t, :1:),
0= V;tJ*(t,x) + \1tJ*(t,.x)'f(x,p*(t,:I:)),
where Vxf(.X,ll*(t,x)) is the matrix
( !!..ll
aXI
V f = .
x .
!.!h.
BXn
!!...bl. )
aXI
d&
BX n
(3.4)
(:3.5)
with the partial derivatives evaluated at the argument (;Z;,ll*(t,:c)).
The above equations hold for all (t, x). Let \IS specialize them aloug
an optimal state and control trajectory {(x*(t),u*(t)) It E [Q,'1'J}, wl\(re
u*(t) = p*(t,x*(t)) for all t E [0,'1']. We have for aU t,
j;*(t) = /(x*(t), u*(t)),
so that the term
';-
\1tJ*(t,x*(t)) + \1xJ*(t,x*(t))f(x*(t),u*(t))
in Eq. (3.4) is equal to the foUowiug total derivative with respect to /,
d
dt (VxJ*(t,x*(t))).
Similarly, the term
V;t J * (t, x*(t)) + V;'tJ* (t, .x* (t))' f (.x* (t), u*(t))
in Eq. (3.5) is equal to the total derivative
d
dt (VtJ*(t,x*(t))).
Thus, by denoting
p(t) = VxJ*(t,x*(t)),
po(t) = VtJ*(t,.x*(t)),
Eq. (3.4) becomes
]J(t) = -\1,,/(;z;*(t), lI,*(t))p(t,) - \1.I:!l(x*U),u*(t))
(3.G)
(3.7)
(:1.8)
]jo(t) = 0
The Pontryagin Mini!. .,n Principle
101
_. J
100
Deterministic Continuous-Time Optimal Ce,. Jol
and Eq. (3.5) becomes
condition
or equivalently,
p(T) = V'h(x*(T)),
(3.17)
po(t) = constant, for all t E [O,TJ.
Equation (3.8) is a system of n first order differential equations kno
as the adjoint equation. From the boundary condition
J*(1',x) = h(.r), for all x, (3.10)
we have, by differentiation with respect to x, the relation V' xJ*(T, x) ';!i
V'h(:c), and by using the definition V'xJ*(t,x*(t)) = p(t), we obtain
p(1') = V'h(x*(T)). (3.11)7
Thus, we have a terminal boundary condition for the adjoint equation (3.8).
To summarize, along optimal state and control trajectories x*(t)/,
u*(t), t E [0, T], the adjoint equation (3.8) holds together with the bOillldi'
ary condition (3.11), while Eq. (3.3) and the definition of p(t) imply th
u * (t) satisfies
1/.*(1.) = nrglllillLq(:r*(t),u) + p(t)'J(:r*(t),lI.)], for all t E [0,1'].
HEll
(3.12);\
cost function. Then, for ail t E [0,1'],
= arg min H(x* (t), u, p(t)).
uEU
(3,15)
a constant C such that
for all t E [0, TJ.
; All the assertions of the Minimum Principle have been (informally)
'ved earlier except for the last assertion. To see why the Hamiltonian
ion is constant for t E [0,1'] along the optimal st.ate and cont.rol tra-
ries, note that by Eqs. (3.1), (3.6), and (3.7), we have for all t E [0,1']
Hamiltonian Formulation
H(x*(t),u*(t),p(t)) = -V'tl*(t,x*(t)) = -po(t),
x*(t) = J(x*(t),u*(t))
])o(t) is constant by Eq. (3.9). We should note here that the Hamil-
function need not be constant along the optimal trajectory if the
and cost are not time-independent, contrary to our assumption thus
(see Section 3.4.4).
It is important to note that the Minimum Principle provides neces-
optimality conditions, so all optimal control trajectories satisfy these
'tions, but if a control trajectory satisfies these conditions, it is not
ily optimal. Further analysis is needed to guarantee optimality.
_ ,method that often works is to prove that an optimal control trajec-
."::', exists, and to verify that there is only one control trajectory satisfying
"'conditions of the Minimum Principle (or that all control trajectories
. . g these conditions have equal cost). Another possibility to con-
.optimality arises when the system function J is linear in (x, u), the
aint set U is convex, and the cost functions hand g arc convex.
it can be shown that the conditions of the Minimum Principle are
necessary and sufficient for optimality.
.The Minimum Principle can often be used as the basis of a numerical
,tion. One possibility is the two-point boundary p1'Oblem method. In
method, we use the minimum condition
Motivated by the condition (3.12), we introduce the Hamiltonian T
function lIlapping triplets (:r:,u,p) E 3n x 3(m X 3(n to real numbers an"
given by
H(.r, u,p) = g(x, u) + p' J(x, u).
Note that the adjoint equation can be compactly written in terms of
Hamiltonian as
p(t) = -V'xH(x*(t),u*(t),p(t)).
We state the Minimum Principle in terms of the Hamiltonian function.
Proposition 3.1: (Minimum Principle) Let {u*(t) I t E [0, Tn;,'
be an optimal control trajectory and let {.r*(t) I t E [O,TJ} be the:
corresponding state trajectory, i.e.,
x*(O) = x(O) : given.
Let also p(t) be the solution of the adjoint equation
p(t) = -V'xH(x*(t),u*(t),p(t)),
u*(t) = argminH(x*(t),u,p(t)),
uEU
ress u*(t)in terms of x*(t) and p(t). We then substitute the result
the system and the adjoint equations, to obtain a set of 2n first order
_ J
102
Deterministic Cuntinuuus- Time Optimal Control
Chap. 3
differential equations in the components of x* (t) and p( t). Thesc cq \lations
can be solved using the split boundary conditions
x' (0) = x(O),
p(T) = V'h(x'(T)).
The number of boundary conditiolls (which is 2n) is cqual to the numuer
of differential cquations, so that we gencrally expect to be able to solve
these dHrerential cq uations numcrically.
Using the Minimum Principle to obtain an analytical solutioll b pos-
sible in many interesting problems, but typically requires considerable crc-
ativity. Wc give somc simple examples.
Examplc 3.1 (Calculus of Variations Continued)
Consider the problem of finding t.he curve of minimum lengt.h from a point
(0,0') to the line {(T,y) lyE n}. In Section 3.1, we formulated this problem
as the problem of linding an optimal cont.rol t.rajectory {u( t) I t E [0, T]} t.hat
rniniIni",es
."1"
1 V I + (U(t))2 dt
subject to
i(t) = u(t),
x(O) = o'.
Let. us apply the preceding necessary conditions. The llamilt.onian is
l1(x,u,p) = + pu,
and t.he adjoint equation is
jJ(t) = 0,
p(T) = 0.
It follows t.hat
p(t) = 0,
for all t E [0, T],
so minimization of t.he Hamiltonian gives
u'(t) = argmin = 0,
uEal
for all t E [0, TJ.
Therefore we have ;j;.(t) = U for all t, which implies that :1;'(t) is constant
Using t.he initial condition x' (0) = a, it follows t.hat
x'(t) = a,
for all t E [U, '1'].
We thus obtain the (a priori obvious) opt.imal solution, which is the horiwntal
line passing through (0,0').
Sec. 3.3
'j'he jJontryagin jl,Jil' .11n Principle
10:3
Example 3.2 (Resource Allocation Continucd)
Consider the optimal production problem (Example 1.2 in Sect.ion 3.1). We
want to maximize
T
1 (1 -u(t))x(t)dt
subject to
O::;u(t)::;l,
for all t E [0, T],
x(t) = ,u.(t)x(t),
x(O) > ° : given.
The Hamiltonian is
lI(x, u,p) = (1 - u)x + JYYux.
The adjoint equation is
jJ(t) = -,u'(t)p(t) - 1 + u,(t),
p(T) = 0.
Maximization of the Hamilt.onian over u E [0,1] yields
u'(t) = { ° if p(t) < ,
1 ifp(1-) 2:.
Since p(T) = 0, for t close to T we will have p(t) < lIT and /L'U) = O.
Therefore, for t near T the adjoint equation has the form jl(t) = I and p(t)
has the form shown in Fig. 3.3.1.
Thus, near t = T, p(t) decreases with slope -1. For t = T - lIT, p(t)
is equal to 1/1, so u'(t) changes to u'(t) = 1. It follows that for t < T - ]/1,
the adjoint equation is
p(t) = -,p(t)
or
p(t) = e-"'I( + constant.
Piccing together p(I.) for t greater lUtd less than T - 1/" we ol>t.ain t.he fOl"ln
shown in Fig. 3.3.2 for p(t) and u'(t). Note that if T < 1/1, the opt.imill
control is u'(t) = ° for aU t E [0, T]; that is, for a short enough horizon, it.
does not pay to rcinvest at any time.
- .I
104
Deterministic Continuous-Time Optimal ContI
Chap. J
1/y
Figure 3.3.1 Form of thc adjoint variable
p(t) for t near T in the resource allocation
example.
o
T- 1/y
p(t)
1/y
Figure 3.3.2 Form of thc adjoiliL variable
p(t) and the optimal control in t.he resource
allocation example.
o
T-1/y
T
u.(t)
./
u(t)=1
u'(t) =0
/
T- 1/y T
o
Example 3.3 (A Linear-Quadratic Problem)
Considcr the one-dimcnsionallincar systcm
x(t) = ax(t) + bu(t),
whcrc a amI I> arc givcn scalars. Wc want to find an optimal control over a
givcn intcrval [0, '1'] that minimizes thc quadratic cost
T
q'(X('1'))2+1 (U(t))2 dt ,
whcre q is a givcn positivc scalar. There are no constraints on the control, so
we havc a special case of the linear-quadratic problem of Examplc 2.2. We
will solve this problem via the Minimum Principle.
The Hamiltonian here is
H(x, u,p) = U2 + p(ax + bu),
Sec. J.J
The Pontryagin _ .ill1ullJ lJrinciplc
105
and t.he adjoint equation is
P(t.) = -ap(t.) ,
with the terminal condition
p('1') = qx'('1').
The optimal control is obt.ained by minimizing the lIamiltonian wit.h r"spect.
to u, yielding
u'(t) = argmJn [U2 + p(t) (ax'(t) + bu)] = -I>p(t).
(:.11))
We will extract the optimal solution from these conditions using two different
approaches.
In thc first approach, we solve the two-point boundary value problem
discussed following Prop. 3.1. In particular, by eliminating the control from
the system equation using Eq. (3.18), we obtain
x'(t) = ax'(t) - 1>2p(t).
Also, from the adjoint equation, we see that
p(t) = e- at (,
for all t E [0, '1'],
where ( = p(O) is an unknown parameter. The last two equations yield
x'(t) = ax'(t) _ b 2 e- at (.
(:3.19)
This differential equation, together with the given initial condit.ion x' (0)
x(O) and the terminal condition
x'('1') = e- aT ( ,
q
(which is the terminal condition for the adjoint equation) can be solved for
the unknown variable (. In particular, it can be verified that the solution of
the differential equation (3.19) is givcn by
x'(t) = x(O)e at + b 2 ( (e- at -1),
a
and ( can be obtained from the last two relations. Given (, we obtain p(t) =
e- at (, and from p(t), we can then determine the optimal control t.rajectory
as u'(t) = -bp(t), t E [0, '1'1 [cf. Eq. (3.18)1.
In the secolld approach, we basically derive the Riccati equation en-
countered in Example 2.2. In particular, we hypothesize a linear relation
between x*(t) and p(t), that is,
J((t)x*(t) = p(t),
for all t E [0, '1'j,
106
Deterministic Continllolls-Time Optimal Contm.
Chap. 3
Sec. 3.3
The Pontryagin Jl.l1. .urn Principle
107
_ J
and we how that K(t) can be obtaincd by olving t.he Riccati equat.ion.
Indeed, from Eq. (3.18) we have
can be reformulated as a t.erminal cot. by int.roducing a IW\\" st.ate \'ariahh'
y and the additional difl'erential equation
u*(t) = -I>K(t):c*('-),
iJ(t) = g(x(t),u(t)).
(3.21)
which by substitution in the system equation, yields
:i:*(t) = (a 1>2 [(1»):1;*('-).
The cost thell becomes
h(x(T)) +y(T),
(3.:!2)
I3y differentiat.ing the equation K(t)x*(t) = p('-) and by also uing tlw adjoint
equation, we obtain
k(t)x*(t) + K(t):i:*(t) = p(t) = -ap(t) = -o.K(t):1;*(t).
and the Minimum Principle corresponding to this terminal cost yields the
Minimum Principle for the general cost (3.20).
We illtroduce some a.o.;sumptions:
13y combining the lillit two relation, we have
k(t)x*(t) + K(t)(a - b 2 K(t»)x*(t) = --aK(t)x*(t),
Convexity Assumption: For every state x the set
from which we see that K(t) should satisfy
D = {J(x, u) I u E U}
k(t) = -2aK(t.) + b 2 (K(t)(
is convex.
Thi i the Riccati equat.ion of Example 2.2, speciali2ed t.o the problem of t.he
preent example. This equation can be solved using the terminal condition
K('1') = <1,
The convexity assumption is satisfied if U is a convex set and f is
linear in u [and g is linear in 'U in the case where there is an integral cost
of the form (3.20), which is reformulated a.'i a terminal cost by using the
additional state variable y of Eq. (3.21 )]. Thus the convexity assumption is
quite restrictive. However, the Minimum Principle typically holds without
the convexity assumption, because even when the set D = {J(x, u) I U E U}
is nonconvex, any vector in the convex hull of D can be generated by
quick alternation between vectors from D (for an example, see Exercise
3.10). This involves the complicated mathematical concept of randomized
or relaxed controls and will not be discussed further.
which is implied by the terminal condition peT) = <1x* (T) for the adjoint
equation. Once K(t) is known, the optimal control is obtained in the closed-
loop form u*(t) = -bK(t)x*(t). I3y reversing the preceding arguments, this
control can then be shown to satisfy all t.he conditions of the Minimum Prin-
ciple.
3.3.2 A Derivation Based on Variational Ideas
In this subsection we outline an alternative and more rigorous proof
of t.h;:) JVlinilllulll Principle. This proof i:; primarily directed towards the ad-
vanced reader, and is based on a comparison of an optimal trajectory with
neighboring trajectories, which are obtained by making small variations in
the optimal trajectory.
For convenience, we restrict attention to the case where the cost is
Regularity Assumption: Let u(t) and u*(t), t E [0, T], be any two
admissible control trajectories and let {x*(t) It E [0, Tn be the state
trajectory corresponding tou*(t). For any E E [0,1], the solution
{x.(t) I t E [0, Tn of the system
x.(t) = (1- t)f(x.(t),u*(t)) + Ef(x.(t),u(t)),
(3.23)
h(x(T)).
with x.(O) = x*(O), satisfies
The more general cost
x.(t) = x*(t) + t(t) + O(E),
(3.24)
T
h(x(T)) + 10 g(x(t),u(t))dt
(3.20)
where {(t) I t E [0, T]} is the solution of the linear differential system
_. J
108
Deterministic Gantinuaus- Time Optimal Gantn
Sec. 3.3 The Pantryagin M,,,.mum Principle
Ghap.3
109
{(t) = Y'x!(x*(t),u*(t»)W) + !(x*(t),u(t)) - !(x*(t), u*(t)), (3.25)
Using a standard result in thc theory of linear difl'erential (xluations
(see e.g. [CoL65]), thc solution of the linear differential systcm (3.25) can
be written in closed form as
with initial condition ((0) = O.
W) = <I!(t,T)((T)+ 1 t q)(t,T)(J(x*(T),u(T))-!(:I;*(T),u*(T)))dT, (3.27)
The regularity assumption "typically" holds because from Eq. (3.23)
we have
where the square matrix 1> satisfies for all t and T,
x«(t) - x*(t) = !(x«(t), u*(t)) - !(x*(t), u*(t))
+ f(J(X«(t), u(t)) - !(x«(t),u*(t))),
D1>(t, T) ( ) '
(37 = -1>(t,T)Y'x! X*(T),U*(T) ,
1>(t, t) = [.
Since ((0) = 0, we have from Eq. (3.27),
so from a first order Taylor series expansion we obtain
ox(t) = Y'!(x*(t),u*(t))'ox(t) +o(llox(t)lI)
+ f(J(X«(t),u(t)) - !(x«t), u*(t))),
((T) = iT 1>(T,t)(J(x*(t),u(t)) - !(x*(t),u*(t)))dt.
where
Dcfine
ox(t) = x«t) - x*(t).
Dividing by f and taking the limit as f --> 0, we see that the function
p(T) = Y'h(x*(T)),
p(t) = 1>(T, t)'p(T), t E [0, T].
W) = lim ox(t)/t,
(->0
t E [0, T],
By differentiating with respect to t, we obtain
should "typically" solve the linear system of differential equations (3.25),
while satisfying Eq. (3.24). .
Suppose llOW that {.u*(t) It C [0, T]) i::; all opt.imal cOlltrol t.raJectory,
and let {x* (t) I t E [0, T]} be the correspollding state trajectory. Then for
any other admissible cOlltrol trajectory {-u(t) It E [0, T]} and any f E [0,1],
the convexity assumption guaralltees that for each t, there eXl:>ts a control
u(t) E U such that
!(x«(t),u(t)) = (1- f)!(X«(t),u*(t)) + f!(X«(t),u(t)).
p(t) = a1>, t)' p(T).
(3.28)
(3.29)
(3.30)
Combining this equatioll with Eqs. (3.28) and (3.30), we see that p(t) is
generated by the differential equation
This is the adjoint equation corresponding to {(x*(t),u*(t)) It E [O,T]}.
Now, to obtain the Minimum Principle, we note that from Eqs. (3.26),
(3.29), and (3.30) we have
p(t) = -Y'x!(x*(t), u*(t))p(t),
with the terminal condition
Thus, the state trajectory {x«t) I t E [0, T]} of Eq. (3.23) corre.sponds
to the admissible control trajectory {u(t) I t E [O,T]}. Hence, llsmg the
optimality of {x*(t) It E [0, T]} and the regularity assumption, wc have
p(T) = Y'h(x*(T)).
h(x*(T)) ::::; h(x:(T))
= h(x* (T) + f((T) + o( f))
= h(x*(T)) + fY'h(x*(T))'((T) + O(f),
o ::::; p(T)'((T)
T
=p(T)'l <I>(T,t)(J(x*(t),u(t)) - !(x*(t),u*(t)))dt
= iT p(t)'(J(x*(t), u(t)) - !(x*(t), u*(t)))dt,
which implies that
Y'h(x*(T))'((T) 2: o.
(3.26)
(3.31)
(3.32)
(3.33)
_. J
110
Deterministic Continuous-Time Optimal Cant.
Chap. 3
from which it can be shown that for all t at which "lL*(') is continuou, we
have
p(t)'f(x*(t),"lL*(t)) ::; p(t)'f(:I;*(t),"lL),
for all "lL E U.
(3.34)
Indeed, if for some it E U and to E [0, T], we havc
p(to)' f(:r*(to), "lL*(to)) > p(to)'f(x*(to), it),
while {"lL*(t) I t E [0, T]} is continuous at to, we would also have
p(t)' j(:I:*(t), 1L*(t)) > p(t)'f(:/:*(t),il.),
for all t in somc nontrivial interval I containing to. Dy taking
u(t) = {*(t)
for tEl
for t r/c I,
we would then outain a contradiction of Eq. (3.33). \Vc have thus proved
the Minimum Principle (3.34) under the convexity and regularity assump-
tions.
To prove the Minimum Principle for the morc general integral cost
function (3.20), one can apply the preceding development to the ystem of
differential equations :i; = f(x, u) augmented by the additional Eq. (3.21)
and thc cquivalent terminal cost (3.22).
3.3.3 Minimum Principle for Discrete-Time Problems
In t.his subection we briefly dcrive a version of the Minimum Princi-
ple for discrcte-time detcrministic opt.imal control problems. Interestingly,
it is essential to make some convexity assumptions in order for the Min-
imum Principle to hold. For continuous-time problems these convexity
assumptions arc typically not needed, because, as mentioned earlier, the
differential system can generate any :i;(t) in the convex hull of the set. of
possi ble vectors j (:/:(t) , u(t)) by randomi,r,at.ion (see for eXftlllplc Excrcise
3.10).
Suppose that. we want to find a control sequence ("lLO, U 1, . . . , "lL N - J)
and a corwponding state sequence (:co,.r 1, . . . , X N ), which minimize
N--J
J(u) = .rJN(XN) + L !Jk(:Ck,"lL,.),
k=O
subjcct to the discrete-timc syst.em constraints
,rH-1 = h,(:I:k, ud, /.; = 0,..., N - 1,
:ro: givcn,
Sec. 3.3
The Pontryagin. .1imum Principle
111
and the control constraints
Uk E Uk C ffi'm,
k = 0, . . . , N - 1.
We first develop an expression for thc gradient 'V J(uo,. . . ,UN- I). We
have, using the chain rule,
'VUkJ(uo,..., UN-I) = 'Vuklk . 'V Xk+Jk+1 ... 'VJ:N_lfN-1 . 'Vg N
+ 'Vuklk. 'VXk+llk+l'" 'VJ:N_ 2 fN-2' 'VXN_l!/N-l
+ 'Vuklk. 'VJ:k+lflk+l
+ 'VUkg k ,
whcre all gradients arc evaluated along the control trajectory (uo,. . . ,"lLN-J)
and the corresponding state trajectory. To streamline the computation, we
introduce the discrete-time adjoint equation
Pk = 'V xkfk . Pk+l + 'V xk.rJk, k = 1,.. ., N - 1,
with terminal condition
PN = 'V.rJN.
Then it is straightforward to verify that
'VUkJ(UO,...,UN-J) = 'VukHk(Xk,Uk,Pk+J),
(3.35)
wherc Hk is the Hamiltonian function defined by
Hdxk, 1Lk,Pk+J) = gk(Xk, Uk) + Pk+llk(Xk, ud.
We will assume that the constraint sets Uk are convex, so t.hat. we can
apply the optimality condition
N-l
L 'Vl1kJ(U,..., u N _ 1 )'(1J.k -. uiJ 2: 0,
k=O
for all feasible (uo,...,UN-l) (see Appendix B). This condition can be
decomposcd into the N conditions
'VUJ(u o ,... ,UN_l)'(1Lk - u:) 2: 0,
for all Uk E Uk, k = 0,. . . ,N - 1.
(3.3(j)
We thus obtain:
112
Deterministic Continuous-Time Optimal Contra.
Chap. 3
Sec. 3.4
Extensions of the. .Jimum Principle
113
_. J
have
Proposition 3.2. (Discrete-Time Minimum Principle) Sup-
pose that (ui), ui . . . , uN -1) is an optimal control trajectory and that
(xi), xi,... , xN) is the corresponding state trajectory. Assume also
that the constraint sets Uk are convex. Then for all k = 0,. .. , N - 1,
we have
J*(T,x)= { O ifx=(T)
00 otherwIse.
Thus J* (T, x) cannot be differentiated with respect to :r, and thc terminal
boundary condition p(T) = Vh(x*(T)) for the adjoint equation docs 1101.
hold. However, as compensation, we have the extra conditioll
V u Jh(x k ,'u k ,]1k+1)'(uk - uk) 2: 0,
for all Uk E Uk,
(3.37)
x(T) : given,
thus mailltaining the balance between boundary conditions and ullkllowns.
If ollly some of the terminal states arc fixed, that is,
wherc the vcctors ]11,... ,]1N arc obtained from the adjoint equation
Pk = V xk/k . ]1k+l + V xkgk,
k = 1,...,lV - 1,
(3.38)
Xi(T) : givcn,
for all i E I,
where I is some index set, wc have thc partial boundary condition
with the terminal condition
PN = VgN(xN)'
(3.39)
. ( ) _ 8h(x*(T))
]1] T - iJ '
x]
for all J r.t I,
The partial derivatives above are evaluated along the optimal state and
control trajectories. If, in addition, the Hamiltonian Ih is a convex
function of Uk for any fixed Xk and ]1k+l, we have
for thc adjoint equation.
Example 4.1
Uk = arg min H k (x k ,uk,Pk+l),
ukEUk
for all k = 0,. .. ,lV - 1. (3.40)
Consider the problem of finding the curve of minimum length connecting two
points (0,0:) and (1', {3). This is a fixed endpoint variation of Example 3.1 in
the preceding section. We have
Proof: Equation (3.37) is a restatement of the necessary condition (3.36)
using thc expression (3.35) for the gradicnt of J. If Hk is convex with
respect to Uk, Eq. (3.36) is a sufficient condition for the minimum condition
(3.39) to hold (see Appendix D). Q.E.D.
x(t) = u(t),
x(o) = 0:, x(1') = {J,
and the cost is
iT V I + (U(t))2 dt.
The adjoint equation is
p(t) = 0,
implying that
3.4 EXTENSIONS OF THE MINIMUM PRINCIPLE
p(t) = constant,
for all t E [0,1'1.
Minimization of the Hamiltonian,
We now consider some variations of the continuous-time optimal con-
trol problem and derive corresponding variations of the Minium Principle.
min [ + p(t)u ] ,
uEm:
yields
Suppose that in addition to the initial state x(O), the final state x(T)
is given. Then the preceding informal derivations still hold except that the
terminal condition J*(T,x) = h(x) is not true anymore. In cffect, here we
u*(i.) = eOllstallt., for all I. C IO,'!'j.
Thus the optimal trajectory {x* (t) I t E [O,1']} is a straight line. Since
this trajectory must. pass through (0,0:) and (1',{J), we obtain the (a priori
obvious) optimal solution shown in Fig. 3.4.1.
3.4.1 Fixed Terminal State
_ J
114
Deterministic Continuolls-Time Optimal Control
Chap. 3
a
Ji
I
I
I
1
1
I
o
T
Figure 3.4.1 Optimal solution of the problem of connecting the two points (0, e<)
and (T, (3) with a minimum length curve (cf. Example 4.1).
Example 4.2 (The Brachistochrone Problem)
In 1696 Johann Bernoulli challenged the mathematical world of his time with
a problem that played an instrumental role in the development of t.he calculus
of variations: Given t.wo points A and B, find a curve connecting J\ and B
such that a body moving along the curve under thc force of gravity reaches
B in minimum time (see Fig. 3,4.2). Let A be (0,0) and B be ('1', -b) with
b> 0. Then it can be seen that the problem is to find {x(t) I t E [O,1']} with
x(o) = ° and x(T) = b, which minimi:lCs
i T F+
dt,
a v 2 I'x(t)
whcre r is thc acceleration due to gravity. Herc {( t, -x( t)) I t E [0, 1'J}, is
the desired curve, the term V I + (x(t)) 2 dt is the length of the curVe from
x(t) to x(t + dt), and t.he tcrm J'2')'x(t) is the velocity of the body upon
reaching the level x(t) [if m and v denote the mass and the velocity of the
body, the kinetic energy is mv 2 /2, which at level x(t) must be equal to t.he
change in potential cnergy, which is mrx(t); t.his yields v = J2I'x(t) ].
We introduce the syst.cm x = U, and we obtain a fixed terminal state
problcm [x(O) = ° and x(T) = b]. Letting
JI+U2
g(x,u) = = ,
y 2 r x
the Hamil tonian is
H(x, u, p) = !/(x, u) + pu.
Wc minimizc t.he Hamiltonian by setting to zero it.s derivative with respect
t.o u:
p(t) = -Y'ug(x*(t),u*(t)).
Sec. 3..1
Extensions of UlC J\., nUl1J 1 'rinciplc
115
We know from the Minimum Principk t.hat. t.he Ilalllilt.oJlian is const.ant along
an optimal trajectory, i.e"
g(x*(t),u*(t)) - Y'ug(x*(t),u*(t.))u*(t) = const.ant,
Using the expression for g, t.his can be written as
forallIE[O,T).
VI + (U*(t))2
J2I'x*(t)
(U*(t))2
= constant,
V I + (U*(t))2 J21'2;*U)
for all I. E [0, '1'],
or equivalently
1
= constant for all t E [0, '1'].
V I + (U*(t))2 J2I'x*U) ,
Using the relation x'(t) = u*(t), this yields
x*(t)(I+x*(t)2)=C, for all tE[O,T],
for some constant C. Thus an optimal traject.ory satisfies t.he differential
equation
x*(t) =
C-.x*(t)
x*(t)
for all t E [0,1'].
The solution of this differential cquation was known at. 13ernoulli's time to
be a cycloid; see Fig. 3.4.2. The unknown parameters of thc cycloid are
determined by the boundary conditions x*(O) = ° and x*(T) = b.
A
Distance A to 'B = Arc 'B to B
Distance A to 'c = Arc 'c to C
'c T '8
.-
, v
v
" "
I I :1
, , " I
- _\ - - - - - \ B /
, , ;
, , , ,
-- ....--
Cycloid: Trajectory of point A
as the circle turns moving
horizontally towards T.
Radius of the circle is such that
(T,b) lies on the cycloid.
,
,
,
,
.\1
;
,
Figure 3.4.2 Formulation and optimal solution of the brachistochrone problem.
_. J
116
Deterministic Coninuous-TilJlc Qptillwl Conl"'
Cilap. :J
3.4.2 l<ree Initial State
If the initial state :c(O) is nut fixed but is subject to optimization, we
have
.7*(0,:/;*(0)) :S; .7*(0,:/:),
for all:1: E lJn,
yielding
VxJ*(O,x*(O)) = 0
and the extra boundary condition for the adjoint equation
p(O) = O.
Also if there is a cost e(x(O)) on the initial state, i.e., the cost is
T
£(x(O)) + 1 g(x(t), u(t))dt + h(x(T)),
the boundary condition becomes
p(O) = -V£(x*(O)).
This follows by setting to zero the gradient with respect to x of €(:c) +
J(O, x), i.e.,
Vx{£(x) + J(O,x)}lx=x*(o) = O.
3.4.3 Free Terminal Time
Suppose the initial state and/or the terminal state are given, but the
terminal time T is subject to optimization.
Let {(x*(t), 11.* (t)) It E [0, TJ} be an optimal state-control trajectory
pair and let T* be the optimal terminal time. Then if the terminal time
were fixed at T*, {(u*(t),x*(t)) It E [O,TJ} would satisfy the conditions
of the Minimum Principle. In particular,
11. * (t) = arg min H (x* (t), 11., p( t)) ,
uEU
for all t E [0, T*],
where p(t) is the solution of the adjoint equation. What we lose with the
terminal time being free, we gain with an extra condition derived as follows.
We argue that if the terminal time were fixed at T* and the initial
state were fixed at the given x(O), but instead the initial time were subject
to optimization, it would be optimal to start at t = O. This means that
the first order variation of the optimal cost with respect to the initial time
must be zero; i.e.,
Vd*(t,x*(t))lt=o = o.
Sec. 3..j
EXelJsiow; of hc
.illlUIJI Principle
117
The H.JD equation can be written along the optinlill t.raj('(-(,ory a;;
Vt.J*(t,:r*(t)) = -H(:c*(t),1J.*(t),p(l.)), for all t. Cc [0,'1'*]
[ef. Eqs. (3.1) and (3.6)], so the last two equations yield
H(x*(O), u*(O),p(U)) = U.
Sine the Hamiltonian was shown earlier to be constant along the optimal
trajectory, we obtain for the case of a free terminal tinlf'
H(x*(t),u*(t),p(t)) = 0,
for all t E [0, T*].
Example 4.3 (Minimum-Time Problem)
A unit. IIlasS object moves horizontally under the influence of a force u(t),
that is,
jj(t) = u(t),
whr y(t) is the position of the object at time t. Give(l t.he object's initial
positiOn y(O) and initial velocity y(O), it is required to bring the object. to
rest (zero velocity) at. a given position, say zero, while using at most. unit.
magnitude force,
-1::;u(t)::;I,
for all to
We want. to accomplish this transfer in minimum time. Thus, we want. to
minimize T = iT Idt.
Not.e that the integral cost, g(x(t),u(t)) =- 1, is unusual here; it does not
deend on. he state or t.he control. However, the t.heory docs not. preclude
this posslblhty, and the problem is still meaningful because the terminal time
T is free and subject. to optimization.
Let the state variables be
Xl(t) = y(t),
X2(t) = i;(t),
so t.he system equation is
X!(t) = X2(t),
X2(t) = u(t).
The initial st.ate (Xl(0),X2(0)) is given and the terminal st.ate is also givcll
xl(T) = 0,
x2(T) = O.
. If {u*(t I t [0, TJ} is an optimal control t.rajectory, u*(t) must. mini-
mize the Hamiltonian for each t, Le.,
u*(t) = arg min [1 + Pl(t)X;(t) + p2(t)U ] .
-l::;u::;!
11,
_. .I
118
Dctenninistic Continuous-Time Optimal Contm
Chap. 3
Therefore
u' (t) = { I
-1
i[ P2(t) < 0,
if P2(t) O.
The adjoint equation i
Pl(t) =0,
1)2(t) = -PI(t),
so
PI (l) = Cl, P2(l) = C2 - c11.,
where CI and C2 are const.ants. It follows that {p2(t) I t E [0, TJ} has OIle of
t.he [our forms shown in Fig. 3,4.3(a); that. is, {p2(t) It E [0, TJ} witches
at most once in going from negative to po::;itive or reversely. [Note that it
is not possible for P2 (t) to be equal to 0 for all t because this implies that
PI (t) is also equal to 0 for all t, SO t.hat the Hamiltonian is equal to 1 for
all t; the necessary conditions require that thc Hamiltonian be 0 along t.he
optimal trajectory.] The corresponding control trajectories are shown in Fig.
3.4.:3(b). The conclusion is that., for each t, u*(t) is eithcr +1 or -I, and
{u*(t) It E [0, TJ} has at mo::;t one swit.ching point in the int.erval [0, TJ.
p,(11
p,(I)
p,t l )
p,{l)
1"
,
,
,
1"
,
(a)
U(I)
u'(I)
U'(I)
u'(1)
---,
,
,
.-------.
, ,
, ,
o
.1
1',
,
, 1',
!...--...!
I'
.1
.,
(b)
Figure 3.4.3 (a) Possible forms of the adjoint variable P2(t). (b) Correspond-
ing forms of the optimal control trajectory.
To determine the precise form of the optimal control t.rajectory, we use
the given initial and final states. For u(t) == (, where ( = :1:1, the system
evolves according to
( 2
Xl (t) = Xl(O) + X2(0)t + t ,
2
X2(t) = X2(0) + (t.
Sec. 3.4
Extensions oE the M1"...WTD Principle
llU
13y eliminating the t.ime t ill these two equations, we ('(' t.hat I'm alii
1 ) 2 1 ( ) 2
Xl(t) - 2((:/;2(t) = Xl(O) - 2( ):2(0) .
Thus for intervals where u(t) == I, the systcm moves along the curve whcrc
Xl(t) - (X2(t))2 is constant, shown in Fig, 3,4.4(a). For int.crvals where
2
u(t) == -1, the syst.em moves along the curves where Xl(t) + (X'2(t,)) is
constant, shown in Fig. 3.4.4(b).
X 2
X 2
x,
x,
(a)
(b)
Figure 3.4.4 State trajectories when the control is u(t) == I [Fig. (a)] and
whcn the control is u(t) == -I [Fig. (b)].
To bring the systcm from the initial st.at.e (Xl(0),X2(0)) t.o the orig;in
with at most one switch in the value of cont.rol, we must follow the following
rules involving the switching curve shown in Fig. 3.4.5.
(a) If the initial statc lies above the switching curve, use u*(t) == -1 until
t.he state hits the switching curve; t.hen use u*(t) == 1 until reaching t.he
origin.
(b) If the initial state lies below the switching curve, use u*(t) == 1 until the
stat.e hits the switching curve; then use u*(t) == -1 until reaching the
origin,
(c) If the initial state lies on the top (bottom) part of the switching curve,
usc u*(t) == -1 [u*(t) == 1, respectively] until reaching the origin.
3.4.4 Time- Varying System and Cost
If the system equation and the integral cost depend on the time t,
Le.,
x(t) = f(x(t), u(t), t),
120
Deterministic Continuous- Time Optimal Contm
Chap. 3
Sec. 3.'1
Extensions of ilJe M. .WIll Principle
121
_.J
X2
interval of time. Such problems are called singula.r. Their optimal tra.i('c(o-
ries con:>ist of portions, called regular arcs, where u*(t) can be determined
from the minimum condition (4.1), and other portions, callcu ,i'l/.!lula'" au;",
which can be determined from the condition that the Hamiltonian is inde-
pendent of u. The following is an example.
Example 4.4 (Road Construction)
X1
Suppo:>e that we want to construct a road over a one-dimensional terrain
whose ground elevation (altitude measured from sOllie reference point) is
known and is given by z(t), t E [0, TJ. The elevation of the road is de-
noted by x(t), t E [0, TJ, and the difference x(t) - z(t) must. be made up by
fill-in or excavation. It is desired to minimize
l i T 2
2 0 (x(t) - z(t)) dt,
Figure 3.4.5 Switching curve (shown with a thick line) and closed-loop op-
timal control for the minimum time example,
subject. t.o the constraint that the gradient of the road x(t) lies between -a
and a, where a is a specified maximum allowed slope. Thus we have the
constraint
cost = h(x(T)) + iT g(x(t),u(t),t)dt,
lu(t)1 ::; a,
t E [0, T],
where
we can convert the problem to one involving a time-independent system
and cost by introducing an extra sta.te variable y(t) representing time:
x(t) = u(t),
t E [0, T].
The adjoint equation here is
y(t) = 1,
y(O) = 0,
p(t) = -x*(t) + z(t),
with the terminal condition
x(t) = !(x(t),u(t),y(t)), x(O): given,
T
cost = h(x(T)) + 1 g(x(t),u(t),y(t))dt.
p(T) = o.
Minimization of the Hamiltonian
After working out the corresponding optimality conditions, we see
that they are the same as when the system and cost are time-independent.
The only difference is that the Hamiltonian depends explicitly on time and
need ,lOt be constant along the optimal trajectory.
II(x*(t),u,p(t),t) = (x*(t) - Z(t))2 +p(t)u
with respect to u yields
u*(t) = arg min p(t)u,
lulSa
3.4.5 Singular Problems
u* (t) = arg min H(x*(t), u, p(t), t)
uEU
(4.1 )
for all t, and shows that optimal trajectories are obt.ained by concatenation
of three types of arcs:
(a) Regular arcs where p(t) > 0 and u*(t) = -a (maximum downhill slope
arcs) .
(b) Regular arcs where p(t) < 0 and u* (t) = a (maximum uphill slope arcs).
(c) Singular arcs where p(t) = 0 and u* (t) can take any value in [-a, a] that
maintains the condition p(t) = O. From the adjoint. equation we :>ee t.hat.
In some cases, the minimum condition
is insufficient to det.ermine u*(t) for all t, because the values of x*(t) and
p(t) arc such that H(:r;*(t), u,p(t), t) is independent of u over a nontrivial
_ J
122
DetvrmhJistic Continuous-Time Optimal COl.
Chap, j
singular arcs are those along which p(l.) = 0 and x*(t) = z(I.), i.e., the
road follows the ground elevation (no fill-in or excavation). Along such
arcs we must. have z(i) = u*(t) E [-a,a].
Optimal solutions can be obtained by a graphical method using the
above observations. Consider the shar'ply uphill inl.e7'1Jals 7 such that. z(i) 2 a
for all t E 7, and the sharply downhill intervals L such that z(t) :S -a for
all /, E L Clearly, within each sharply uphill int.erval 7 the opt.inlal slope is
u* (t) = a, but the optimal slope is also a within a la..:Eer maximum uphill slope
interval V::) Y, which is such that p(t) < 0 wit.hin V and p(tl) = p(i2) = 0 at.
the endpoints t] and t2 of 17. In view of the form of the adjoint. equation, we
see that the endpoints tl and t2 of 17 should be such that
1 '2
(z(t) :1:*(1))<11 = 0;
,[
t.hat. is, t.he total Jill-in should be equal 10 the tol.al excavation within 17, (see
Fig. 3.4.6). Similarly, each sharply downhill interval L should be contained
within a larger maximum downhill slope interval,!:::. ::) L which is such that.
p(t) > 0 within,!:::., while t.he total fill-in should be equal to the total cxcavatiD!.!:
un thin ,!:::., (see Fig. 3.4.6). Thus t.he regular MCS consist of the intervals V
and V described above. Between the regular arcs there can be one or more
sing'1r arcs where x*(t) = z(t). The opt.imal solut.ion can be pieced together
starting at the endpoint t = T [where we know that p(7') = 0], and proceeding
backwards.
._______z(I)
51 Iopea
r oa .
_X(I)
I
I
,
o
T
='
Regular arcs
I
V
Figure 3.4.6 Graphical method for solving the road construction example. The
sharply uphill (downhill) intervals 7 (respectively, D are first identified, and are
t.hen embedded wit.hin maximuIII uphill (r"spect.ivcly, downhill) slope regular arcs
V (reslwctively, ) within which t.he t.otal fill-in is eqnal to the total excavation.
The regular arcs are joined by singular arcs where there is no fill-in or excavation.
The graphical process is started at the endpoint t = T.
Sec. 3.5
Notes, Sources, aJ. lExercjses
123
3.5 NOTES, SOURCES, AND EXERCISES
"
il
Iii
I
il!
it
j:
The calculus of variations is a classical subject that originated with
the works of the great matlwmaticial1s of the 17th al1d 18t.h ('cnt\ll"i(s. Its
rigorous development (by modern mathematical standards) took pla('(' ill
the lU30s and lU40s, with the work of a group of mathematicians that
originated mostly from the University of Chicago; Bliss, McShane, and
Hestenes ,He some of the most prominent members of this group. Curiously,
this development preceded the development of nonlinear programming; lJ}"
many years.t The nlOuern theory of detcrministic optimal control has its
roots primarily in the work of Pontryagin and his students in the lOGOs.
The t.heordical and applica!.ions lit.erature on the subject is very cxtcllsiv(:.
We give three representative references: the book by Athans and Falb
[AtF66] (a classical extensive text that includes engineering applications),
the book by Hestenes [HesG6] (a rigorous mathematical treatment), and
the book by Luenberger [Lue69] (which deals with optimal control within
a broader infinite dimensional context). The book by Cannon, Cullultl, and
Polak [CCP70] gives a detaileu treatment of discrete-time op!.inwl cont.rol
that includes proofs of the Minimum Principle under weaker convexity
assumpt.ions than the one given here.
EXERCISES
3.1
Solve the problem of Example 2.1 for the case where t.he cost function is
T
(:r,(T)) 2 + 1 (u(t)) 2 di.
I
--j
:';:
':
S;,
1'
'f.
:,:,',
t III the 30s and 40s journal spacc was at. a premium, and Iinite-dill1cusional
optimizat.ion research was thought t.o be a simple special ca.se of th(' ca.lculus
of variations, thus insufliciently challenging or novel for publicat.ion. Indeed the
modern optimality conditions of finite-dimensional opt.imi,mt.iou subject to ('qual-
ity and inequalit.y constraint.s were first developed in the 1939 Master's t.hrsis hy
Karush but first appeared in a journal quite a few years later under t.he nallles
of other researchers.
,j
J!
;j!
I.!.
j,:'
'i.
, '
i:i
t
I
_. J
124
Deterministic Continuous-Time Optimal Can.. 01
Chap. 3
3.2
An egocentric young man has inheritcd a large sum S and plans La spend it
so as Lo maximize his cnjoyment through the rcst of his lifc without working.
Ile estimat.es that hc will live exactly T more years and that his capit.al x(t)
should bc reduced to zcro at timc T, i.e., x(1') = O. Also hc models the
evolution of his capital by the differential equation
dx(t)
-;It = exx(t) - u(t),
where x(O) = S is his initial capit.al, ex > ° is a given int.erest rat.e, and
u(t) 2:: 0 is his rate of expenditure, The total enjoyment he will obtain is
given by
iT e- 13t V;;W dt.
Herc /3 2:: 0 is a given discount factor for future enjoyment. Find the optimal
{u(t) It E [0, TJ}.
3.3
Consider the system of rcservoirs shown in Fig. 3.5.1. The syst.em equations
are
Xl(t) = -Xl(t) + u(t),
X2(t) = Xl(t),
and the cont.rol constraint is ° :S u(t) :S 1 for all t. Initially
Xl (0) = .1:2(0) = O.
We want to maximize x2(1) subject to the constraint Xl (1) = 0,5. Solve the
problem.
u
-
Xl : Level of reservoir 1
x 2 : Level of reservoir 2
u : Inflow to reservoir 1
Figure 3.5.1 Reservoir system for Ex-
ercise 3.3.
Scc, 3.5
Notes, Sourccs, and. Jcises
.
'"
125
3.4
"Vork out t.he minimum-time problem (Example 4.3) for t.he case whel'<' t.hNe
is friction and the object's position moves according t.o
jj(l) = -aiJ(t) + u(t),
whcre a > 0 is given. Hint: The solution of the systcm
pl(t)=O,
P2(t) = -Pl(t) + ap2(t),
is
Pl(t) = Pl(O),
P2(t) = .!.(l- e ut )pl(O) + e ut p2(0),
a
The trajcctories of the syst.em for u(t) == -1 and u(t) == I arc sk('\'ch('d ill Fig.
3.5.2.
X2
x,
Xl
Figure 3.5.2 State trajectories of the system of Exercise 3.4 for u(t) == -1 and
u(t) == 1.
3.5 (Isoperimetric Problem)
Analyze the problem of finding a curve {x( t) I t E [0, T)} t.hat maximiz('s t.he
arca under X,
iT x(t)dt,
_. J
126
DeterminisUc Continuolls-Timc OpUmal Cont
Chap. 3
subject. t.o t.hc constraint.s
X(O) = a,
T
1 V I + (x(t))> dt = L,
x(T) = b,
whcrc a, b, aud L arc given positive scalars, The last constraint. is known
as an isopcrimetric constraint.; it requires that the lengt.h of t.he curve be L.
Hint: Introduce the system Xl = U, X2 = V1+U"2, and view t.he problem as
a fixed terminal stat.e problem. Show that the optimal u'(t) depends linearly
on t. Under some assumptions on a, b, and L, the optimal curve is a circular
arc.
3.6 (L'H6pital's Problem)
Let a, b, and T be positive scalars, and jet A = (0, a) and B = (T, b) be
two points in a medium wit.hin which the velocity of propagat.ion of light is
proportional to the vertical coordinate. Thus the t.ime it takes for light to
propagate from A to B along a curve {x(t) It E [0, TJ} is
I T VI + (X(t))2
o Cx(t) dt,
where C is a given positive constant. Find the curve of minimum travel time
of light from A to B, and show that it. is an arc of a circle of the form
X(t)2 + (t - d)2 = D,
where d and D are some constants.
3.7
A boat moves with constant unit velocity in a stream moving at constant
velocity s, The problem is to find the steering angle u(t), 0 ::; t ::; T, which
minimizes the time T required for the boat to move between the point (0,0)
to a given point (a, b). The equations of motion are
XI (t) = s + cosu(t),
X2(t) = sin 11(1.),
where Xl (t) and X2(t) are t.he positions of the boat parallel and perpendicular
to the stream velocity, rcspcctively, Show that the optimal solution is to steer
at. a constant angle.
Sec. 3,5
Notes, Sources, and L . Giscs
127
3.8
A unit mass object moves 011 a straight lille frolll a givcll illit.ialposit.ioll ./"1 (U)
and velocity X2(0). Find the force {u(t) It E [0, 1J} that. brings t.he object. at.
time 1 to rest. [x,(I) = 0] at. position Xl (1) = 0, andminimi2es
1 1 (u(t))' dt.
3.9
Usc the Millimum Principle to solve the linear-quadratic proble111 of Example
2.2. Hint: Follow the lines of Example 3.3.
3.10 (On the Need for Convexity Assumptions)
Solve the continuous-time problcm involving the system x(t) = 11(1.), t.he
terminal cost (x(T))', and the control constraint u(t) = -lor 1 for all /', and
show that thc solut.ion satisfies the Minimum Principle. Show that, depending
on the initial state Xo, this may not be true for t.he discrete-timc version
involving the systcm Xk+l = Xk + Uk, t.he terminal cost x, and the cont.rol
constraint Uk = -lor 1 for all k.
3.11
Use the discret.e-t.ime Minimum Principle to solve Exercise 1.14 of Chapter I,
assuming that each Wk is fixcd at a known dcterministic value.
3.12
Use the discretc-time l'"linimum Principle to solve Exercise 1.15 of Chapter 1,
assuming that 'Yk and Ok are fixed at known deterministic values.
"\(
_ J
4
Problems with Perfect State
Information
p.130
p. 144
p.152
p.158
p.168
p.172
Contents
4.1. Linear Systems and Quadratic Cost
4..Inventory Control . . . . .
4.3. Dynamic Portfolio Analysis
4,4. Optimal Stopping Problems
4.5.. Scheduling and the Interchange Argument
4.().Notes, Sources, and Exercises. . . . . .
t":
-
.
129
_. J
130
l'rooiems with PCl'feci State informaL
Chap. 4
In this chapter we consider a number of applications of discrete-time
stochastic optimal control with perfect state information. These appli-
cations are special cases of the basic problem of Section 1.2 and can be
addressed via the DP algorithm. In all these applications the stochastic
nature of the disturbances is significant. For this reason, in contrast with
the deterministic problems of the preceding two chapters, the use of closed-
loop control is esscntial to achieve optimal performance.
4.1 LINEAR SYSTEMS AND QUADRATIC COST
In this section we consider the special case of a linear system
:r:k+l = Ak:rk + BkUk + 'Wk,
k=O,I,...,N-l,
and the quadratic cost
{ N-l }
k=O.I,N-l XNQNXN + E (XkQkXk + Uk RkUk ) .
In these expressions, Xk and Uk arc vectors of dimension nand m, respec-
tively, and the matrices Ak, Bk, Qk, Rk are given and have appropriate
dimension. We assume that the matrices Qk arc positive semidefinite sym-
metric, and the matrices Rk are positive definite symmetric. The controls
Uk are unconstrained. The disturbances 'Wk are independent random vec-
tors with givcn probability distributions that do not depend on Xk and Uk.
Furthermore, each 'Wk has zero mean and finite second moment.
The problem described above is a popular formulation of a regulation
problcm whereby we want to keep the state of the system close to the origin.
Such problems arc common in the theory of automatic control of a motion
or a process. The quadratic cost function is often reasonable because it
induces a high penalty for large deviations of the state from the origin but
a relatively small penalty for small deviations. Also, the quadratic cost is
frequently used, even when it is not entirely justified, because it leads to a
nice analytical solution. A number of variations and generalizations have
similar solutions. For example, the disturbances 1lJk could have nonzero
means and the quadratic cost could have the form
{ N-l }
E (''!:N - XN )'Q N(:rN - XN) + L ((:rk - xd'Qd.Tk - xd + ILkRkUk) ,
k=O
which expresses a desire to keep the state of thc system close to a given
trajectory (xo, Xl, . . . , XN) rather tha.n close to the origin. Another gener-
ali;r,ed version of the problem arises when Ak, Bk are independent random
Sec. 4.1
Linear Systems and c.. lratic Cost
131
matrices, rather than being known. This case is considered at. t.he cnd ur
this section.
Applying now the DP algorithm, we have
IN(XN) = XNQNXN, (1.1)
Jk(Xk) = minE{xkQkXk + UkRkUk + .h+l(AkXk + Ihuk + Wk)}. (1.2)
Uk
It turns out that the cost-to-go functions J k arc quadratic and as a result
the optimal control law is a linear function of the state. These facts can be
verified by straightforward induction. We write Eq. (1.2) for k = N - 1,
IN-l(XN-I) = min E{.1:N_IQN-IXN-l +uN_IRN-IILN-1
tLN_l
+ (AN-IXN-l + BN-1UN-I + WN-l)'QN
. (AN-1XN-I + BN-1UN-I + WN-rJ},
and we expand the last quadratic form in the right-hand side. We then use
the fact E{WN-r} = 0 to eliminate the term E{wN_1QN(AN-IXN-1 +
BN-IUN-I)}, and we obtain
IN-l(XN-l) = xN_IQN-l.TN-l + min [UN_1RN-IUN-l
UN_I
+ UN_1B'tv_1QNBN-lUN-l + 2XN_IAN_1QNBN-IUN-J]
+ x'tv_1AN_IQNAN-1XN-l + E{w'tv_1QN1lJN-r}.
By differentiating with respect to UN -1 and by setting the derivative equal
to zero, we obtain
(RN-I + BN_IQNBN-rJUN-I = -BN_IQNAN-I:I:N-I.
'(
The matrix multiplying UN-Ion the left is positive definite (and hence
invertible), since RN -1 is positive definite and B'tv -1 Q N B N -1 is positive
semidefinite. As a result, the minimizing control vector is givcn by
U N _ l = -(RN-l + B'tv_1QNBN-l)-lB'tv_1QNAN-IXN-1.
By substitution into the expression for I N - 1 , we have
IN-l(XN-l) = xN_l](N-lXN-1 + E{w'tv'_IQNWN-.r},
whcre by straightforward calculation, the matrix l\. N -I is veri lieu to bc
KN-l = AN_l(QN - QNBN-l(B'tv_lQNBN-l + RN-rJ-1B'tv_1QN)AN-J
+QN.-I.
_ J
132
Problems wit!1 PerFect State InForm.. .1
Chap. 4
The matrix KN-l is clearly symmetric. It is also positive semidefinite. To
see this, note that from the preceding calculation we have for x E n
x' KN-1X = min[x'QN_IX + U'RN-IU
u
+ (AN-IX + BN-IU)'QN(AN-1X + BN-IU)].
Since Q N -1, RN-l, and Q N are positive semidefinite, the expression within
brackets is nonnegative. Minimization over U preserves nonnegativity, so
it follows that x' KN-IX 2: 0 for all x E n. Hence K N - l is positive
semi definite.
Since J N -1 is a positive semidefinite quadratic function (plus an in-
consequential constant term), we may proceed similarly and obtain from
the DP equation (1.2) the optimal control law for stage N - 2. As earlier,
we show that IN-2 is a positive semidefinite quadratic function, and by
proceeding sequentially, we obtain the optimal control law for every k. It
has the form
jJ.'k(Xk) = LkXk,
(1.3)
where the gain matrices Lk are given by the equation
Lk = -(BIJ<k+lBk + Rk)-lB£KkHAk,
(1.4)
and where the symmetric positive semidefinite matrices Kk are given re-
cursively by the algorithm
KN = QN,
(1.5)
Kk = Ak(I<k+l - Kk+lBdB;J<k+1Bk + Rk)-lBIJ<kH)Ak + Qk. (1.6)
Just like DP, this algorithm starts at the terminal time N and proceeds
backwards. The optimal cost is given by
N-l
Jo(xo) = xoKoxo + L E{wkKk+lWk}.
k=O
The control law (1.3) is simple and attractive for implementation in
engineering applications: the current state Xk is being fed back as input
through the linear feedback gain matrix Lk as shown in Fig. 4.1.1. This
accounts in part for the popularity of the linear-quadratic formulation. As
we will see in Chapter 5, the linearity of the control law is still maintained
even for problems where the state Xk is not completely observable (imper-
fect state information).
Sec. 4.1
Linear Systems an ,!uadratic Cost
133
W k
Uk
Xk+ 1 = A0k+ B,pk+ W k
Xk
Figure 4.1.1 Linc<1r fccdback structurc of thc optimal controllcr for the liucar-
quadratic problem.
The Riccati Equation and Its Asymptotic Behavior
Equation (1.6) is called the discrete-time Riccati equation. It plays
an important role in control theory. Its properties have been studied exten-
sively and exhaustively. One interesting property of the Riccati equation is
that whenever the matrices A k , Bk, Qk, Rk are constant and cqual to A, B,
Q, R, rcspectively, then the solution Kk converges as k -> -00 (undcr mild
assumptions) to a steady-state solution K satisfying the al!lcbmic Riccati
equation
K = A'(K - KB(B'KB + R)-lB'K)A + Q.
(1.7)
This property, to be proved shortly, indicates that for the system
Xk+l = AXk + BUk + Wk,
k = 0,1,..., N - 1,
(1.8)
and a large number of stages N, one can reasonably approximate the control
law (1.3) by the control law {jJ.*, jJ.*, . . . , jJ.*}, where
jJ.*(x) = Lx,
(1.9)
L = -(B'KB + R)-lB'KA,
(1.10)
and K solves the algebraic Riccati equation (1.7). This control law is
stationary; that is, it docs not change over time.
We now turn to proving convergence of the sequence of matrices {K k}
generated by the Riccati equation (1.6). We first introduce the notions
of controllability and observability, which are very important in control
theory.
i
;':
l'
'I
,.
_....1
134
Problems with PerFect State InForrnat
Chap. 4
Definition 1.1: A pair (A, B), where A is an n x n matrix and B is
an n x m matrix, is said to be controllable if the n x nm matrix
[B,AB,A2B,... ,An-IB]
has full rank (Le., has linearly independent rows). A pair (A, G), where
A is an n x n matrix and G an m x n matrix, is said to be observable if
the pair (A', G') is controllable, where A' and G' denote the transposes
of A and G, respectively.
One may show that if the pair (A, B) is controllable, then for any
initial state Xo there exists a sequence of control vectors uo, Ul,. .., 'Un-l
that force the state X n of the system
Xk+1 = AXk + HUk
(1.11)
to be equal to zero at time n. Indeed, by successively applying the above
equation for k = n - 1, n - 2,.. .,0, we obtain
X n = Anxo + BUn-1 + ABu n -2 +... + An-IBuo
or equivalent.ly
X n Anx, (B, AB, H ,An-' B{f: ) (112)
If (A, B) is controllable, the matrix (B, AB, . . . , An-l B) has full rank
and as a result the right-hand side of Eq. (1.12) can be made equal to any
vector in n by appropriate selection of (uo, UI,..., un-I). In particular,
one can choose (uo, UI, . . . , U n - tJ so that the right-hand side of Eq. (1.12)
is equal to -Anxo, which implies X n = O. This property explains the name
"controllable pair" and in fact is often used to define controllability.
The notion of observability has an analogous interpretation in the
context of estimation problems; that is, given measurements Zo, ZI, . . . , Zn-I
of the form Zk = GXk, it is possible to infer the initial state Xo of the system
."Lk+ I = A:tk, in view of the relation
(1') Cd;},
Alternatively, it can be seen that observability is equivalent to the property
that, in the absence of control, if Xk -; 0 then GXk -; O.
Sec. 4.1
Linear Systems anc .,Juadratic Cost
135
The notion of stability is of paramount importance in cont.rol t.hcory.
In the context of our problem it is important that. the stationary wntrol
law (1.9) results in a stable closed-loop system; that is, in the ab::;ence of
input disturbance, the state of the system
Xk+1 = (A + BL)xk,
k = 0,1,...,
tends to zero as k -; 00. Since Xk = (A + BL )kxo, it follows that the
closed-loop system is stable if and only if (A + BL)k -; 0, or equivalently
(see Appendix A), if and only if the eigenvalues of the matrix (A + ilL)
are strictly within the unit circle.
The following proposition shows that for a stationary controllable
system and constant matrices Q and R, the solution of the Riccati equation
(1.6) converges to a positive definite symmetric matrix J( for an arbitrary
positive semidefinite symmetric initial matrix. In addition, the proposition
shows that the corresponding closed-loop system is stable. The proposit.ion
also requires an observability assumption, namely, that Q can be written
as G'G, where the pair (A, G) is observable. Note that if l' is the rank of Q,
there exists an r x n matrix G of rank r such that Q = G'G (:'ice Appendix
A). The implication of the observability assumption is that in the absence
of control, if the state cost per stage XkQ."Lk tends to :lero or equivalently
GXk -; 0, then also Xk -; O.
To simplify notation, we reverse the time indexing of the Riccati
equation in the following proposition. Thus Pk in the following Eq. (1.13)
corresponds to I<N -k in Eq. (1.6). A graphical proof of the proposition for
the case of a scalar system is given in Fig. 4.1.2.
Proposition 4.1: Let A be an n x n matrix, B be an n x m mat.rix,
Q be an n x n positive semidefinitc symmetric matrix, and R be an
m x m positive definite symmetric matrix. Consider the discrete-time
Riccati equation
PHI = A'(Pk - PkB(B'PkB + R)-IB'Pk)A + Q, k = 0,1,...,
(1.13)
where the initial matrix Po is an arbitrary positive semi definite sym-
metric matrL'I:. Assume that the pair (A, B) is controllable. Assume
also that Q may be written as G'G, where the pair (A, G) is observable.
Then:
(a) There exists a positive definite symmetric matrix P such that for
every positive semidefinite symmetric initial matrix Po we have
lim P k = P.
k-+oo
: !
I
_ .J
136
Proulems with PerFect State InFormati,
Chap. 4
Furthermore, P is the unique solution of the algebraic matrix equation
P = A'(P - PB(B'PB + R)-lB/P)A + Q
(1.14)
within the class of positive semidefinite symmetric matrices.
(b) The corresponding closed-loop system is stable; that is, the eigen-
values of the lIlatrix
D = A + BL,
(1.15)
where
L = -(B'PB + R)-lB'PA,
(1.16)
arc strictly within the unit circle.
Proof: The proof proceeds in several steps. First we sho,,": conv.ergence of
the sequence generated by Eq. (1.13) when the initial matnx Po IS equ.al to
zero. Next we show that the corresponding matrix D of Eq. (1.15) satisfies
Dk -> O. Then we show the convergence of the sequence generated by Eq.
(1.13) when Po is any positive semidefinite symmetric matrix, and finally
we show uniqueness of the solution of Eq. (1.14).
Initial Matrix Po = O. Consider the optimal control problem of find-
ing Uo, UI,..., Uk-l that minimize
k-l
2)XQXi + uRu,)
i=O
(1.17)
subject to
i = 0,1,..., k - 1, (1.18)
according to the
XHl = AXi + BUi,
where Xo is given. The optimal value of this problem,
theory of this section, is
XPk(O)XO,
where Pk(O) is given by the Riccati equation (1.13) with Po = O.
control sequence (1LO, U 1, . . . , Uk) we have
k-I k
L(.'<Q:l:i + URUi) :S L(XQXi + U:RUi)
i=O i=O
For any
and hence
k-l
XPk(O)XO = n l)XQXi + URUi)
i=O,...,k-l i=O
k
:S in 2)XQXi + URUi) = XoPk+l(O)XO'
,
i:::::Q,...,k i=O
.
Sec. 4.1
Linear Systems ane. ladratic Cost
137
2
A +Q
B
P k
P k + 1 p'
p
Figure 4.1.2 Graphical proof of Prop. 4.1 for the case of a scalar stationary
system (onLdimensional state and control), assuming that A ¥' 0, 13 ¥' 0. Q > 0.
and R> O. The Riccati equation (1.13) is given by
( B2 p2 )
P - A 2 P k
k+l - k - B2Pk + R + Q,
which can be equivalently written as
Pk+l = F'(i\).
where the function F' is given by
A 2 RP
F(P) = 2 +Q.
B P+R
Because F is concave and monotonically increasing in the interval (-RI B 2 , 00),
as shown in the figure, the equation P = F(P) has one positive solution p' and
one negative solution p, The Riccati iteration Pk+1 = F(Pk) converges to P*
starting anywhere in the interval (P, 00) as shown in the figure.
where both minimizations are subject to the system equation constraint
Xi+! = Ax; + flu;. Furthermore, for it fixed ;];0 and for eV0,ry k, .1:;jP,.,(O):I:O
is bounded from above by the cost corresponding to a control sequcnce that
forces Xo to the origin in n steps and applies zero control after t.hat. Such
a sequence exists by t.he controllability assumption. Thus t.he sequen('.c
{XOPk(O)XO} is nondecreasing with respect to k and bounded from above,
and therefore converges to some real number for every Xo E ffi'n. It follows
that the sequence {Pk(O)} converges to some matrix P in the sense that
each of the sequences of the clements of PdO) converges to the correspond-
_ J
138
Problems with l'erfed State InformaL
Chap. 4
ing clements of P. To see this, take Xo = (1,0,. . . ,0). Then xPdO)xo is
equal to the first diagonal element of Pk(O), so it follows that the sequence
of first diagonal elements of Pk (0) converges; the limit of this sequence is the
first diagonal clement of P. Similarly, by taking :Co = (0,. . . ,0,1,0, . . . ,0)
with the 1 in the ith coordinate, for i = 2, . . . ,n, it follows that all the di-
agonal elements of Pk (0) converge to the corresponding diagonal elements
of P. Ncxt take Xo = (1,1,0,...,0) to show that the second clements of
the first. row converge. Continuing similarly, we obtain
lilll h(O) = P,
k-.oo
wherc Pk(O) are generated by Eq. (1.13) with Po = O. Furthermore, sincc
Pk(O) is positive semidefinite and symmetric, so is the limit matrix P. Now
by taking the limit in Eq. (1.13) it follows that P satisfics
P = A'(P - PB(B'PB + R)-lB'P)A + Q.
(1.19)
In addition, by dircct calculation we can verify the following useful equality
P = D' P D + Q + L'RL,
(1.20)
where D and L are given by Eqs. (1.15) and (1.16). An alteruative way to
dcrive t.his equalit.y is t.o observe t.hat from the DP algorithm corresponding
to a finite horizon N we have for all states ."L N-k
:C_kPk+l(O)XN-k = x_kQXN-k + jJ:N_k(XN-k)' RJ1.N_k(XN-k)
+ x_k+lPdO)XN-k+l.
By using the optimal controller expressionILN_k(XN-k) = LN-k:CN-k and
the closed-loop system equation XN-k+l = (A + BLN-k)XN-k, we thus
obtain
Pk+l(O) = Q + L_kRLN-k + (A + BLN-k)'Pk(O)(A + BLN-k). (1.21)
Equation (1.20) then follows by taking the limit as k -> 00 in Eq. (1.21).
Stability of the Closed-Loop System. Consider the system
Xk+l = (A + BL)Xk = DXk
(1. 22)
for an arbitrary initial state xo. We will show that Xk -> 0 as k -> 00. We
have for all k, by using Eq. (1.20),
;C"+LP;Ck+I-X"PXk =x,,(D'PD-P)xk = -x,,(Q+L'RL)Xk'
Hence
k
X"+lPXk+l = xPxo - 2:.:>;(Q + L'RL)x,.
;=0
(1.23)
Sec. 4. I
Linear Systems ane. ,uadratic Cost
139
The left-hand side of this equation is boundcu lwlow by ;oel"O, so it. follows
that
lim x,,(Q + L'RL);I:k = O.
koo
Since R is positive definite and Q may be written as C'C, w( obt.ain
lim CXk = 0,
k -HXJ
lim LXk = lim li* (:rd = O.
k-+oo 1.,-+00
( 1.2,1)
The preceding relations imply that as the control asyllljJt.otically be-
comes ngligibl:, :re hwe limkoo C;Ck = 0, and in view of the observabilit.y
a:'sumptlOn, tIlls nnphes that Xk -> O. To express this argument more pre-
cisely, let us use the relation Xk+L = (A + BL)Xk [ef. Eq. (1.22)], to write
C (."Lk+n-l - 2::::/ A'-l BLxk+n-i-l)
C (Xk+n-2 - 2:::: 1 2 Ai-I BLxk+n-i_2)
( CAn_l J
_ CA.,.-2 .
- : Xk.
CA
C
(1.25)
C(;Ck+l - BLxd
C."Lk
Since LXk -> 0 by Eq. (1.24), the left-hand side tends to zcro and hence
the right-hand side tends to zero also. By the observability a..<;sumption,
however, the matrix multiplying Xk on the right side of (1.25) has full rank.
It follows that Xk -> O.
I,
ii
(
,
"
Positive Definiteness of P. Assume the contrary, i.e., there exists
some Xo 01 0 such that xPxo = O. Since P is positive semidefinite, from
Eq. (1.23) we obtain
i'
I'
'I'
:c,,(Q + L'RL)l:k = 0,
k=O,l,...
Since Xk -> 0, we obtain X"QXk = :C"C'CXk = 0 and x"L'RLxk = 0, or
CXk = 0,
LXk = 0,
k = 0,1,...
;>
J:'
I.:,
Thus all the controls IL* (Xk) = LXk of the closed-loop system arc zero while
we have CXk = 0 for all k. Based on the observability assumption, we will
show that this implies Xo = 0, thereby reaching a contradiction. Indecd
consider Eq. (1.25) for k = O. By the preceding equalities, the left-hand
side is zero and hence
i;, '
L:
[:{
:1"
5.
( CA;n_l )
CA Xo.
C
Since the matrix multiplying Xo above has full rank by the observability
assumption, we obtain Xo = 0, which contradicts the hypothesis Xo 01 0
and proves that P is positive definite.
0=
'"\;
_ .I
140
I>roblems with I>erfcci State InformaL
ClJap. 4
Sec. /1.1
Linear Systems a, ..Juadraiic Cost
141
Arbitmnj Initial Matrix Po. Next we show that the sequence of ma-
trices {Pk(PO)}, defined by Eq. (1.13) when the starting matrix is an
arbitrary positive semidefinite symmetric matrix Po, converges to P =
limkoo PdO). Indeed, the optimal cost of the problem of minimizing
for an arbitrary positive semi<lefinite symmetric init.iallllat.rix 1)).
Uniqueness of Solution. If P is another positivc semidefinitc synnnet-
ric solution of the algebraic Riccati equation (1.14), wc ha\'(' F'dP) cc_ j"
for all k = 0,1, . . . From the convergence result jllst. proved, w( thcn obt.ain
k-l
XPOXk + 2)X;QXi + U;RUi)
i=O
( 1.26)
lim Pk (p) = P,
k-+oo
subject to the system equation Xi+! = AXi + BUi is equal to xoPk(AJ)xo.
Hence we have for every Xo E n
implying that P = P. Q.E.D.
lim PdO) = P.
k-+oo
( 1.28)
The assumptions of the preceding proposition can be relaxed some-
what. Suppose that, instead of controllability of the pair (A, B), we assume
that the system is stabilizable in the sense that there exists an m x n feed-
back gain matrix G such that the closed-loop system .Tk+l = (A + EG)Xk
is stable. Then the proof of convergence of Pk(O) to some positive semidef-
inite P given previously carries through. [We use t.he stationary control
law j-L(x) = Gx for which the closed-loop system is stable to ensure that
XPk(O)XO is bounded.] Suppose that, instead of observability of the pair
(A, C), the system is assumed detectable in the sense that A is such that if
Uk -> 0 and CXk -> 0 then it follows that Xk -> O. (This essentially means
that instability of the system can be detected by looking at. the measurc-
ment sequence {zd with Zk = Cxd Then Eq. (1.24) implies that Xk -> 0
and that the system Xk+l = (A + BL)Xk is stable. The other parts of
the proof of the proposition follow similarly, with the exception of positive
definiteness of P, which cannot be guaranteed anymore. (As an example,
take A = 0, B = 0, C = 0, R > O. Then both the stabilizability and the
detectability assumptions are satisfied, but P = 0.)
To summarize, if the controllability and observability assumptions of
the proposition are replaced by the preceding stabilizability and detectabil-
ity assumptions, the conclusions of the proposition hold with the exception
of positive definiteness of the limit matrix P, which can now only be guar-
anteed to be positive semi definite.
XPk(O)XO XPk(PO)XO'
Consider now the cost (1.26) corresponding to the controller Il(:I;d = Uk =
LJ:k, where L is defined by Eq. (1.20). This cost is
Xu (Dk' PODk + Di'(Q + L'RL)Di) Xo
and is greater or equal to XoPk(PO)XO, which is the optimal value of the
cost (1.26). Hence we have for all k and X E n
x' Pk(O)X x' Pk(PO)X x' (Dk' PODk + Di'(Q + L'RL)Di) x.
(1.27)
Now we have proved that.
We also have, IIsing the fact limkoo Dk' PODk = 0, and the relation Q +
L'RL = P - D'PD [ef. Eq. (1.20)],
} { Dk' PoDk + Di'(Q + L' RL)Di}
= !im { D"(Q + L'RL)D' }
koo (1.29)
i=O
= !im { D"(P - D' P D)Di }
k-+oo
i=O
=P.
Random System Matrices
We consider now the case where {Au, Eo},..., {AN-I' EN-d arc not.
known but rather are independent random matrices that are also indepen-
dent of wo, WI,..., WN-l. Their probability distributions arc given, and
they arc assumed to have finite second moments. This problem falls again
within the framework of the basic problem by considering as disturbance
at each time k the triplet (Ak, Bk, Wk). The DP algorithm is written w;
i1,
Combining the preceding three equations, we obtain
IN(XN) = XNQNXN,
Ii
r
lim PdP o ) = P,
k-+oo
Jk(Xk) = min E {XkQkXk + URkUk + Jk+l (AkXk + BkUk + Wk)}'
Uk wk,Ak;B k
_. J
142
Problems with Perfect State Informatio.
Chap. 4
Calculations very similar to those for the C,l.<;e wherc Ak, ilk arc not random
show that the optimal control law has the form
I-Lic(Xk) = Lk:£k,
(1.30)
wherc the gain matrices Lk arc given by
Lk = -(Rk + E{BIJ<k+1Bk}) -1 E{BKk+1Ad,
(1.31)
and where the matriccs Kk arc given by the recursive equation
KN = QN,
(1.32 )
Kk = E{A"Kk+1 A d
_ E{AIJ<k+1ild(Rk + E{BKk+1Bd) -1 E{BJ(k+lAd + Qk.
( 1.33)
In the casc of a stationary system and constant matrices Qk and Rk it
is not necessarily true that the above equation converges to a steady-state
solution. This is demonstrated in Fig. 4.1.3 for a scalar system, where it is
shown that if the expression
2 2
l' = E{A2}E{B2} - (E{A}) (E{il})
exceeds a certain threshold, the matrices Kk diverge to 00 starting from
any nonnegative initial condition. A possible interpretation is that if there
is a lot of uncert.ainty about. the system, as quantified by T, optimization
ovcr a long horizon is meaningless. This phenomenon has been called the
uncertainty threshold principle; see [AGK77], [KuA77].
On Certainty Equivalence
'vVe close this section by making an observation about the natme of
the quadratic cost. Consider the minimization over u of
E{ (ax + bu + w)2},
w
where a and b arc given scalars, x is known, and w is a random variable.
The optimum is at.tained for
u* = - () x - 0) E{w}.
Thus 'u* depends on the probability distribution of w only through the
mean E{w}. In particlll;1l', the result of the optimization is the same as
for the corresponding deterministic problcm where w is replaced by E{w}.
T
i
Sec. 4.1
Linear Syctems aIh ..!uadraticCost
143
.-'3-,
2
E{a} :
p
Figure 4.1.3 Graphical illustration of the DSymptotic behavior of the generalized
Riccati equation (1.33) in the CDSe of a scalar stationary system (one-dimensional
state and control). Using Pk in place of KN-k. this equation is written as
P k + 1 = P(1'k),
where the function j? is given by
f'(P) = E{A2}RP + Tp 2
E{B2}P+R Q+ E{li2}P+R '
T = E{A 2 }E{B 2 } - (E{A})2(E{B}t
If T = 0, DS in the Ca.'3e where A and B are not random, the Riccati equation
becomes identical with the one of Fig. 4.1.2 and converges to a steady-state.
Convergence also occurs when T hDS a small positive value. However, as illustrated
in the figure, for T large enough, the graph of the function l- and the 45-degrcc
Ime that pa.'3ses through the origin do not intersect at a positive value of 1', and
the Riccati equation diverges to infinity.
This property is called the certainty equivalence principle and appcars in
arious forms in many (but not all) st.ochastic control problems involving
Imear systems and quadratic cost. For the first problem of this section,
whcre A k , ilk are known, certainty cquivalence holds bccause t.he optimal
control law (1.3) is the same as the one that would be obtained from the
corresponding deterministic problem where Wk is not random but rather is
known and is equal to zero (its expected value). Howevcr, for the problcm
where A k , ilk are random, the certainty cquivalence principle docs not
hold, sinc if one replaces Ak, ilk with their cxpected valucs in Eq. (1.33),
the resultmg control law need not be optimal.
_ J
144
Problems wiLl} Perfect State InformL AJ
Chap, 4
Sec. .1.2
Inventory Can. ,
145
Xk+l = Xk + Uk - Wk,
k = O,I,...,IV - 1.
The function H can be eent.o.be convex, ince Uw functions Inil.x(l1, il',: y)
and max(O, Y-Wk) are convex 111 Y for each fixed 'Wk, and taking expectation
over Wk preserves convexity. We will prove shortly that .h+ I it; convex, but.
for t.he moment let, us proceed by assuming this fact. Then t.he function in
br.akets auove is convex. Suppose that t.his fnllction has an uncoust.ra.ined
mnul1lum with. respect to Yk, denoted by Sk. Then it is seen that (in vipw
of the constrall1t. Yk 2:. :cd a lllinillliing Yk e<jwL!s Sk if .Tk < Sk, and
equals Xk o.tlerwlse: USll1g the reverse transformation Uk = Yk .Tk, we see
that the 11l1l11mUm 111 the DP equation (2.2) is attained at. It,. = S,. - .r.. if
:Ck < Sk, and at Uk = 0 otherwise. An optimal policy is determined by the
sequence of scalars {So, 51, . . . , SN -d and has the form .
4.2 INVENTORY CONTROL
We consider now the inventory cont.rol problem discussed ill Sections
1.1 and 1.2. We assume that excess demand at each period is backlogged
and is filled when additional inventory becomes available. This is repre-
sented by negative inventory in the system equation
We assume that the demands Wk take values within some bounded inter-
val and are independent. We will analyze the problcm for the case of a
holding/shortage cost of the form
7'(:C) = ]Jmax(O, -:c) + hmax(O,;I;),
11* ( ' ) _ { Sk-Xk
"'k 'Ck - 0
if .Tk < Sk,
if :Ck 2: Sk'
(2.3)
where ]J and h arc given nonnegat.ivc scalars. Thus t.he total expect(d cost.
to be minimized is
For each k, thc scalar S k minillli:ws t.he function
{ N-l }
k=O,:N-l (CUk + pmax(O,wk - Xk - Uk) + hmax(O,xk + Uk - Wk))
Cdy) = cy + H(y) + E{ h+I(Y -w)}.
(2.4)
Wc assume that the purchase cost per unit stock C is positive and that
p > c. The last assumption is necessary for the problem to be well posed;
if c, the purchase cost per unit, were greater than p, the depletion cost. per
unit, it. would never be optimal to buy new stock at the last period and
possibly in earlier periods. Much of the subsequent analysis generalizes to
the case where r is a convex function that grows to infinity with asymptotic
slopes p and h as its argument tends to 00 and -00, respectively.
By applying the DP algorithm, we have
Thus, the optimality of the policy (2.3) will be proved if we can show
that the cost-to-go functions h [and hence also the functions G k of Ell.
(2:4)] are convex, and furthermore limlvl--+oo Ck(Y) = 00, so that the mini-
nnZll1g scalars Sk exist. .We proceed to show these properties inductiwly.
We hve hat J N IS the zero function, so it is convex. Since c < p
and. th denvatlve of H(y) tends to -[1 as Y -> -00, we see that GN-I(Y)
[whIch IS cY + H.()] has a derivative that becomes negative as Y -> -00
and becomes posltJVe as Y -> 00 (see Fig. 4.2.1). Therefore
lim GN-I(y) = 00.
Ivl--+oo
IN(XN) =0
h(Xk) = min [CUk + H(Xk + Uk) + E{Jk+1 (Xk + Uk - Wk)}],
"k2:0
(2.1)
(2.2)
Thus, as shown above, an optimal policy at time IV - 1 is given by
where the function H is defined by
* ( ) { SN-I-.TN-I
JlN_I .TN-I = 0
if :CN-l < SN-I,
if .TN-I 2: SN-I.
H(y) = pE{ max(O, Wk - y)} + hE{ max(O, y - Wk)}'
Furthermore, from the DP equation (2.1) we have
Actually, H depends on k whenever the probability distribution of Wk de-
pends on k. To simplify notation, we do not show this dependence and
assume that all demands are identically distributed, but the following anal-
ysis carries through even when t.he demand statistics are time-varying.
By introducing the variable Yk = Xk + Uk, we can write the right-hand
side of the DP equation (2.2) as
min [C'lJk + H(Yk) + E{ Jk+l(Yk - Wk)}] - CXk.
Vk2: x k
IN-1(."CN-J) = { C(SN-J - XN-J) + H(SN_I)
H(XN-tJ
if .T N - I < ,'iN. I,
if x N - I 2: S N - ] ,
which is a convex function because If is convex and SN _ I minimizes CII -/-
H(y) (see Fig. 4.2.1). Thus, given the convexity of IN, we were abl<: to
prove the convexity of J N -1. Furthermore
,
lim I N - 1 (y) = 00.
Ivl--+oo
146
Problems with PerFect State InFormation
Chap. 4
Sec. 4.2
Inventory Control
147
. . .. ; . i . . .. . . .f. .
'," '
:.:.
J
:1
_ J
The DP algorit.hm takct; t.he fmlll
IN(XN) =0,
Jk(:J.:k) = min [C(Uk) + H(Xk + Uk) + E{ ,h+l (Xk + UA, - Vlk)} 1,
llk:2: 0
with H defined as earlier by
H(y) = pE{max(O,w - V)} + hE{max(O,y - w)}.
. --
---
---
---
---
Consider again the functions
-cy
SN.l
y
Gdy) = cy + H(y) + E{ Jk+l(y - w)}.
(2.G)
I N _ 1 (XN.l)
Then Jk is written as
. --
---
I ---
--
I -
Jk(xd = min [ G k (.7: k ), min [I< + CUk + GdXk + UdJ ] ,
"k>O
or equivalently, through the change of variable Yk = Xk + Uk,
-C X N.l
SN_l
XN_l
Jk(Xk) = min [ C:Ek + G(Xk), min [I< + CYk + GdYk)J ] - CXk.
Vk>Xk
---
---
Figure 4.2.1 Structure of the cost-to-go functions when the fixed cost is zero.
If, as in thc casc where I< = 0, we could prove t.hat the functions
Gk are convex, then it would not be difficult to verify [see part (d) of the
following Lemma 2.1] that the policy
This argument can be repcated to show that for all k = N - 2,.. . ,0,
if Jk+l is convex, limlvl->ooJk+l(Y) = 00, and limlvl->ooGk(Y) = 00, thcn
wc h,LVe
. ( ) { Sk - Xk
J1.k Xk = 0
if Xk < Sk,
if Xk 2: Sk,
(2.6)
Jdxd = { C(Sk - Xk) + H(Sk) + E{Jk+l(Sk - wd}
H(Xk) + E{Jk+l (Xk - Wk)}
if Xk < Sk,
if Xk 2: Sk,
is optimal, where Sk is a value of Y that minimizes Gk(y) and Sk is the
smallest value of y for which Gdy) = J( + Gk(Sd. A policy of the form
(2.6) is known as a multiperiod (s, S) policy.
Unfortunately, when K > 0 it is not necessarily true that the func-
tions G k are convex. This opens the possibility of Gk having the form
shown in Fig. 4.2.2, where the optimal policy is to order (S - :E) in interval
I, zero in intervals II and IV, and (8 - x) in interval III. However, we will
show that even though the functions Gk may not be convex, they have thc
property
where Sk minimizes cy+H(y)+ E{ .h+l (y-w)}. Furthermore, J k is convcx,
limlvl->oo h(y) = 00, and limlvl->oo G k - 1 (y) = 00. Thus, the optimality
proof of the policy (2.3) is completed.
Positive Fixed Cost and (., S) Policies
We now turn to the more complicated case where there is a positive
fixed cost J( associated with a positive inventory order. Thus the cost for
ordering inventory 11. :2: 0 is
( Gk(Y) - Gk(y - b) )
K+G k (z+y):2: Gdy)+z b '
for all z :2: 0, b > 0, y.
C(U) = { O J( + cu if 11. > 0,
ifu = O.
(2.7)
This property is called K-convl'.xity and was first used by Scarf [SeaGO]
to show the optimality of multiperiod (s, S) policies. Now if K-convexity
holds, the situation shown in Fig. 4.2.2 is impossible; if Yo is the 10c,1l
!:t
'I'
.,
148
Problems with Perfect State Information
Cllap. 4
Sec. 4.2
Inventory Control
149
_ .J
maximum in the interval III, then we must have, for sufficiently small
b > a,
Gk(y)
,
,
,
I
. I
I
I
,S ' -
Yo ,S
.1. + oj. Y
UI IV
Lemma 2.1:
(a) A real-valued convex function 9 is also a-convex and hcnce also
K-convex for all K ;::: a.
(b) If gl(Y) and g2(Y) are K-convex and L-collvex (1(;::: 0, L:::: 0),
respectively, then agl(Y) + {3g2(Y) is (aK + {3L)-convcx for all
a > a and {3 > o.
(c) If g(y) is K-convcx and w is a random variable, thcn Ew{g(y-
w)} is also K -convex, provided Ew {Ig(y - w) I} < 00 for all y.
(d) If 9 is a continuous K-convex function and g(y) -> 00 as Iyl-> 00,
then there exist scalars 8 and 5 with 8 :::; 5 such that
(i) g(5) :::; g(y), for all scalars Yi
(ii) g(5) + K = g(8) < g(y), for all y < 8i
(iii) g(y) is a decreasing function on (-00,8);
(iv) g(y) :::; g(z) + K for all y, z with 8:::; y:::; z.
Gk(YO) - Gk(YO - b)
b :::: a,
and from Eq. (2.7) it follows that
K + Gk(S) ;::: Gk(YO),
which contradicts the construction shown in Fig. 4.2.2. More generally, it
will be shown by using part (d) of the following Lemma 2.1 that if the
K-convexity Eq. (2.7) holds, then an optimal policy takes the (8,5) form
(2.6).
Proof: Part (a) follows from elcmentary properties of convex functions,
and parts (b) and (c) follow from the dcfinition of a K -convex function.
We will thus concentrate on proving part (d).
Since 9 is continuous and g(y) -> 00 as Iyl -> 00, therc exists a
minimizing point of g. Let 5 be such a point. Also let 8 bc thc smallest
scalar z for which z :::; 5 and g(5) + K = g(z). For all y with y < 5, we
have from thc definition of K-convexity
Figure 4.2.2 Potential form of the function Gk when the fixed cost is nomlcro.
If G k had indeed the form shown in the figure, the optimal policy would be to
order (8 - x) in interval I, zero in intervals II and IV, and (8 - x) in interval III.
The use of K-convexity allows us to show that the form of Gk shown in the figure
is impossible.
5-8
K + g(5) ;::: g(8) + - (g(8) - g(y)).
8-y
Definition 2.1: Wc say that a real-valued function 9 is K-convcx,
where K ;::: 0, if
Since K + g(5) - g(5) = a, we obtain g(8) - g(y) :::; a. Since y < 5 and 5 is
the smallest scalar for which g(5) + K = g(8), we must have g(8) < g(y)
and (ii) is proved. To prove (iii) , note that for Yl < Y2 < 5, we have
K + g(z + y) ;::: g(y) + z ( g(y) - (y - b) ) ,
5 - 112
K + g(5) :::: g(Y2) + (g(Y2) - g(yJ)).
Y2 - YI
for all z:::: a,b > O,y.
Also from (ii),
g(Y2) > g(5) + K,
Some properties of K -convex functions are provided in the following
lemma. Part (d) of thc lemma essentially proves the optimality of the (8,5)
policy (2.6) when the functions Gk are K-convex.
and by adding thesc two inequalities we have
5-Y2 )
a > - (g(Y2) - g(yJ) ,
Y2 - Yl
_ J
150
JJrolJJems wiih J'eded Hiaie InformaiJ
Chap. '1
from which we obtain g(YI) > g(yz), thus proving (iii). To prove (iv), we
notc that it holds for y = z as well a,<; for either y = S or Y = s. There
rcmain t.wo other possibilities: S < y < z and 8 < y < S. If S < Y < z,
then by K-convexity
Z-1j
K -I- g(z) :0.:: g(y) -I- (g(y) - g(S)) 2: g(y),
y-
and (iv) is provcd. If s < y < S, thcn by K-convexity
8-y
g(8) = J( -I- g(8) :0.:: g(y) -I- -(g(y) - g(s)),
Y-8
from which
( 1 -I- 8 - y ) g(8) :0.:: ( 1 -I- 8 - y \ ) g(y),
Y-8 y-s
and g( 8) :0.:: g(y). Noting that
g(z) -I- K :0.:: g(8) -I- J( = g(s),
it follows that g(z) -I- J( :0.:: g(y). Thus (iv) is proved for this case as well.
Q.E.D.
Consider now the function GN-l of Eq. (2.5):
GN-I(Y) = cy -I- H(y).
Clearly, G N -1 is convex and hence by part (a) of the previous lemma, it is
also K -convex. We have
IN-l(X) = min [ G:l; -I- GN-l(:r:), min[K -I- cy -I- GN-l(Y)] ] - C:I;.
y>x
and it can be seen that
IN-l(X) = { J( -I- G N-l(8-d - c:r
GN-l(X) - ex
for x < 8N-l,
for x :0.:: 8N-l,
where 8N--l minimizes GN-l(Y) and 8N-l is the smallest valuc of y for
which GN-l(Y) = J{ -I- GN-l(8N-d. Note that since J{ > 0, we have
8 N -1 f. 8 N -1 and furthermore the derivative of G N -1 at 8 N -1 is negative.
As a result the left derivative of IN-l at 8N-l is greater than t.he right
derivative, as shown in Fig. 4.2.3, and IN-l is not convex. However, we
will show t.hat J N -1 is J{ -convex based on the fact that G N -1 is J{ -convex.
To this end we must verify that for all z :0.:: 0, b > 0, and y, wc have
( IN-l(y) - .IN-l(Y - b) )
[{-I-.JN_I(y-I- Z ):o.::JN_l(Y)-I- Z II .
(2.8)
(2.9)
Sec. 4.2
Jnveniory ConiroI
151
.
-
-
SN_1
x
-ex
Figure 4.2.3 Structure of the cost-ta-go function when fixed cost is nonzero.
We distinguish three cases:
Ca8e 1: y :0.:: 8N-l. If Y - b:o.:: 8N-l, then in this region of values of z,
b, and y, the function IN-l, by Eq. (2.8), is the sum of a K-convex function
and a linear function. Hence by part (b) of Lemma 2.1, it is J{-convex and
Eq. (2.9) holds. If y - b < 8N-l, then in view of Eq. (2.8) we can write Eq.
(2.9) as
J{ -I- GN-l(Y -I- z) - c(y -I- z) :0.:: GN-l(Y) - ey
-I- z ( GN--I(Y) - cy - GNb-I(8N-d -I- c(y - b) )
or equivalcntly
J( -I- GN-l(y -I- z) :0.:: GN-l(y) -I- z ( GN-l(Y) - N-I(SN-l) ). (2.10)
Now if y is such that GN-l(Y) :0.:: GN-l(SN-l), then by J{-convexity of
GN-l we have
J{ -I- GN-l(Y -I- z):o.:: GN-l(Y) -I- z ( GN-l(y) - GN-l(SN-d )
Y - 8N-l
> G ( ) ( GN-l(Y) - GN-l(8 N - d )
- N-l y -I- z b .
Thus Eq. (2.10) and hence also Eq. (2.9) hold. Ify is such t.hat G N - 1 (y) <
GN-l(8N-l), then we have
K -I- GN-l(Y -I- z) :0.:: K -I- GN-l(8N-d
= GN-l(SN-d
> GN-l(y)
> G _ ( 1 ) z ( GN-l(Y) - GN-l(8 N - d )
_N1J-I- II .
_. .1
152
Problcms wiih Pcrfcct Statc Informatio.
ClJap. 4
So for this case, Eq. (2.10) holds, and hence also the desired K-convexity
Eq. (2.9) holds.
Case 2: y ::; Y + z::; SN-l. In this region, by Eq. (2.8), the function
IN-l is linear and hence the K-convexity Eq. (2.9) holds.
Case 3: y < SN-l < Y + z. For this case, in view of Eq. (2.8), we can
write the K-convexity Eq. (2.9) as
K + GN-l(Y + z) - c(y + z) :::: GN-l(SN-d - cy
+ z ( G N -I (s N -I) - cy - N -I (s N - d + c(y - Ii) ) ,
or cquivalently
I< + GN-l(Y + z) :::: GN-1(SN-J),
which holds by the definition of SN-l.
We have thus proved that I< -convexity and continuity of G N -I, to-
gether with the fact that G N -1 (y) --> 00 as Iyl --> 00, imply I< -convexity
of IN-J. In addition, IN-I can be seen to be continuous. Now using
Lemma 2.1, it follows from Eq. (2.5) that GN-2 is a I<-convex function.
Furthermore, by using the boundedness of W N _ 2, it follows that G N - 2 is
continuous and, in addition, GN-2(Y) --> 00 as Iyl --> 00. Repeating the
preceding argument, we obtain that I N -2 is K-convex, and proceeding
similarly, we prove I<-convexity and continuity of the functions G k for all
k, as well as that Gk(Y) --> 00 as Iyl --> 00. At the same time [by using part
(d) of Lemma 2.1] we prove optimality of the multiperiod (s, S) policy of
Eq. (2.6).
The optimality of policies of the (s, S) type can be proved for several
other inventory problems (see Exercises 4.3 to 4.10).
4.3 DYNAMIC PORTFOLIO ANALYSIS
Portfolio theory deals with the question of how to invest a certain
amount of wealth among a collection of assets, perhaps over a long time
interval. One approach, to be discussed in this section, is to assume that
an investor makes a decision in each of several successive time periods with
the objective of maximizing final wealth. We will start with an analysis of
a single-period modd and thcn cxtend I.he result.s to the multiperiod ca.<;e.
Let Xo denote the initial wealth of the investor and assume that there
are n risky assets, with corresponding random rates of return eJ,. . . , en
among which the investor can allocate his wealth. The investor can also
invest in a riskless asset offering a sure rate of return s. If we dcnote by
Sec. 4.3
Dynamic Portfolio 1. lysjs
153
Ul,.. ., 11n the corresponding amounts invested in the 11, risky asset.s and hy
(xo - UI - . . . - 11n) the amount invested in the riskless asset, the wealth
at the end of the first period is given by
n
Xl = 8(XO - 111 - . . . - 11n) + L ei 11 i,
i=l
or equivalently
n
XI = SXO + L(ei - 8)11i.
i=1
The objective is to maximize over 111, . . . ,11n,
(3.1)
E{U(:cJ) },
.;
'.
where U is a known function, referred to as the utility function for the in-
vestor. We assume that U is concave and twice continuously differentiable,
and that the given expected value is well defined and finite for all Xo and
11i. We will not impose constraints on 111,..., 11n. This is necessa.ry in order
to obtain the results in convenient form. A few additional assumptions will
be made later.
Let us denote by 117 = 11-'* (xo), i = 1,. . . ,n, the optimal amounts to
be invested in the n risky assets when the initial wealth is xo. We will show
that when the utility function satisfies
"j
U'(x)
- U"(x) =a+bx,
for all X,
(3.3)
where U' and U" denote the I1rst and second derivatives of U, respectively,
and a and b are some scalars, then the optimal portfolio is given by the
linear policy
': !
ILi*(XO) = ai(a + bsxo),
i = 1,... ,11"
(3.4)
where a i are some constant scalars. .Furthermore, if J(xo) is the optimal
value of the problem
J(xo) = maxE{U(:rJ)},
Ui
(3.5)
then we have
1'(xo) a
- J"(xo) = -.; + bxo, for all Xo. (3.6)
It can be veril1ed that the following utility functions U(:c) sat.isfy
condition (3.3):
exponential: e- x / a , for b = 0,
logarithmic: In(x + a), for b = 1,
power: (l/(b -l))(a + bx)J-(1/b), [or b oJ 0, b 11.
l
f
t
i'
.1,
:'
J;
l" I
1:
F
_ J
154
Problems with Perfect State InformaUoJ.
Chap. 4
Only concave utility functions from this class are admissible for our prob-
lcm. Furthermore, if a utility function that is not defined ovcr the whole
real line is used, the problem should be formulated in a way that ensures
that all possible values of the resulting final wealth are within the domain
of definition of the utility function.
To show the desired relations, let us hypothesize that an optimal
portfolio exists and is of the form
p,i*(XO) = ai(xo)(a -I- b8Xo),
where ai(xo), i = 1,. . . , n, are some differentiable functions. We will prove
that dai(xo)/dxo = 0 for all XO, implying that the functions a' must be
constant, so the optimal portfolio has the lincar form (3.4).
We have for cvery Xo and i = 1,. .., n, by the optimality of Iii' (xo),
dE{U(Xl)} { ( . ) }
. = E U' sXo -I- L..,.(Cj - 8)a J (xO)(a -I- bsxo) (Ci - 8)
du, j=l
= O.
(3.7)
Diffcrentiating each of thc n equations above with respect to ;£0 yields
{ (( ) 2 ( )( )) } ( d<>I(.T.O) )
Cl - 8 .., Cl - 8 Cn - 8 -----.rxo
E : U"(xt)(a -I- bsxo) :
(C n - s)(el -)... (C n - 8)2 JO:'(;O)
( E{U"(Xt)(CI - 8)8(1 -I- L:l (Ci - s)ai(xo)b)} )
E {U"(:ct)(C n - 8 )8(1 L:l (c, - s )ai(xo)b) }
(3.8)
Using relation (3.4), we have
U'(XI)
U"(:z:t) = - a -I- b(sxo -I- L:l (Ci - s)ai(xo)(a -I- bsxo))
U'(Xl)
(a -I- b8:Co) (1 -I- L=I(Ci - 8)a i (xo)b)'
(3.9)
Substituting in Eq. (3.8) and using Eq. (3.7), we have that the right-hand
sidc of Eq. (3.8) is the 7:cro vector. Thc matrix on the left in Eq. (3.8),
cxcept for degenerate cascs, can be shown to be nonsingular. Assuming
that it is indeed nonsingular, we obtain
dai(xo)
=o,
i = 1,... ,n,
If
Sec. '1.3
Dynamic Portfolio J., .lysis
155
and a'(xo) = ai, whcre a' arc some constants, thus proving the optimality
of the linear policy (3.4). .
Wc now prove Eq. (3.6). We have
J(Xo) = E{U(xt)}
= E {U ( (1 -I- t(Ci - 8)a i b) Xo -I- t(C' - s)aLa) }
and hence
J'(xo) = E {U'(:Z;I)S (1 -I- t(C' - 8)a i b) },
J"(Xo) E { U"(.T>J,' (1+ t. (e, - ,)a'b) '} .
The last relation after somc calculation using Eq. (3.9) yields
(3.10)
J"(xo) = _ E{U'(xd 8 (1 -I- L=l (c, - s)a'b) }s
a -I- bsxo
By combining Eqs. (3.10) and (3.11), we obtain thc desired rcsult:
(3.11 )
J'(xo) a
-- J " ( ) = - -I- b.TO.
.TO S
(3.12)
The Multiperiod Problem
We no,,:", extend the preceding one-period analysis to the multi period
case. We will assumc that the current wealth can be reinvested at the
beginning of cach of N consecutive t.ime periods. We denot.c
:Ck: the wealth of the investor at the start of the kth period,
uf: the amount invested at the start of the kth period in thc ith risky
ct, .
e: the rate of return of the ith risky asset during the kth period,
Sk: the rate of return of thc riskless asset during the kth p(ri()d.
We have the system equation
n
Xk+l = 8k X k -I- L(c7 - sk)u7,
,=1
k = O,l,...,N - 1.
(3.13)
_..1
156
Problems wjih Perfect State InformaUo_
Chap. 4
We assume that the vectors e k = (e,..., e), k = 0,... , N - 1, are in-
dependent with given probability distributions that yield finite expected
values throughout the following analysis. The objective is to maximize
E{U(XN)}, the expected utility of the tcrminal wcalth XN, whcre we as-
sume that U satisfies
U'(X)
-- =a+bx
U"(x) ,
for all x.
Applying the DP algorithm to this problem, we have
IN(XN) = U(XN),
(3.14)
JdXk) = "r E {JkH (SkXk + (e - Sk)U) } . (3.15)
From the solution of the one-period problem, we have that the optimal
policy at period N - 1 has the form
IJ.N-l (XN-l) = ClN-l(a + bSN-1XN-l),
where ClN -1 is an appropriate n-dimensional vector. Furthermore, we have
J_I(X) = + bx.
J7._1(X) SN-l
(3.16)
Hence, applying the one-period result in the DP Eq. (3.15) for the next to
the last period, we obtain the optimal policy
J1.N_2(XN-2) = ClN-2 ( + bS N - 2 XN-2 ) ,
SN-l
where ClN -2 is again an appropriate n-dimensional vector.
Proceeding similarly, we have for the kth period
J1.'k(Xk) = Clk ( a + bSkXk ) ,
SN-l'" Sk+l
(3.17)
where Clk is an n-dimensional vector that depends on the probability distri-
butions of the rates of return e k of the risky assets and are determined by
I f .
maximizatioll in t.he DP Eq. (3.15). The corresponding cost.-to-go ullctlOns
satisfy
Jk(X) _ a
-- - +bx,
Jf(x) SN-l"'Sk
k = O,I,...,N - 1.
(3.18)
Sec. 4.3
Dynamjc PortfoJjo A"Jysjs
157
Thus it is sccn that the investor, whcn faced with t.he opport.unit.y t.o
sequentially rcinvest his wealth, uses a policy similar to that of t.hc single-
period case. Carrying the analysis one step further, it. is seen that if t.l1<'
utility funclion U is such that a = 0, that is, U has onc of the forms
lnx, forb=l,
C 1 ) (bx)l-(1jb),
for b 1 0, b 11,
then it follows from Eq. (3.17) that the investor acts at each stage k as
if he were faced with a single-period investmcnt charactcrized by the rates
of return Sk and e7, and t.he objective function E{U(Xk+l)}. This policy
whereby t.he investor can ignore the fact t.hat he will havc t.hc opportunity
to reinvcst his wealth is called a myopic policy.
Note that a myopic policy is also optimal when Sk = 1 for all k,
which mcans that wcalth is discounted at the rat.e of rct.um of the riskless
asset. Furthermore, it can be proved that when a = 0 a myopic policy
is optimal even in the more general case where the rates of return Sk are
independent random variablcs [Mos68], and for the casc where forecasts on
the probability distributions of the rates of return e of the risky assets
become available during the investment process (see Exercise 4.14).
It turns out that even in the more general case where a 1 0 only a
small amount of foresight is required on the part of the decision maker. It
can be seen [compare Eqs. (3.15) to (3.18)] that the optimal policy (3.17)
at period k is the one that the investor would use in a single-period problem
to maximize over u, i = 1,. . . , n,
E{U(SN_l ... sk+Jxk+d}
subject to
n
xk+l = SkXk + 2)c - Sk)Uf.
i=J
In other words, the investor at period k should maximize the expected
utility of wealth that results if amounts u are invested in the risky assets in
period k and the resulting wealth Xk+J is subsequently invested exclusively
in the riskless asset during the remaining periods k + 1, . . . , N - 1. This
is known as a partially myopic policy. Such a policy can also be shown to
be optimal when forecasts on the probability distributions of the rates of
return of the risky assets become available during the invcstmcnt process
(see Exercise 4.14).
AnoLlwr int.erest.ing aspect. uf t.he c'e whel'( a ! 0 is t.hat. if -'k > 1 1l)J"
all k, then as the horizon becomes increasingly long (N --> (0), the policy
in the initial stages approaches a myopic policy [comparc Eqs. (3.17) and
(3.18)]. Thus, for Sk > 1, a partially myopic policy becomes asymptotically
myopic as the horizon tends to infinity.
_..I
158
Problems with Perfect State lnformatio.
Chap. 4
4.4 OPTIMAL STOPPING PROBLEMS
Optimal stopping problems of the type that we will consider in this
and subsequent sections are characterized by the availability, at cach state,
of a control that stops the evolution of the system. Thus at each stage
the decision maker observes the current state of the system and decides
whether to continue the process (perhaps at a certain cost) or stop the
process and incur a certain loss. If the decision is to continue, a control
must be selected from a given set of available choices. If there is only one
choice other than stopping, then each policy is characterized at each period
by the stopping set, that is, the set of states where the policy stops the
system.
Asset Selling
As a first example, consider a person having an asset (say a piece of
land) for which he is offered an amount of money from period to period.
We assume that the offers, denoted WO,Wl,...,WN-l, are random and
independent, and take values within some bounded interval. We consider
a horizon of N stages and assume that if the person accepts the offer, he
can invest the money at a fixed rate of interest 7' > 0, and if he rejects
the offer, he waits until the next period to consider the next offer. Offers
rejected are not renewed, and we assume that the last offer W N -1 must be
accepted if every prior offer has been rejected. The objective is to find a
policy for accepting and rejecting offers that maximizes the revenue of the
person at the Nth period.
The DP algorithm for this problem can be derived by elementary
reasoning. As a modeling exercise, however, we will try to embed the
problem in the framework of the basic problem by defining the state space,
control space, disturbance space, system equation, and cost function. We
consider as disturbance at time k the random offer Wk and as corresponding
disturbance space the real line. The control space consists of two elements
u 1 and u 2 , which correspond to the decisions "sell" and "do not sell,"
respectively. We define the state space to be the real line, augmented with
an additional state (call it T), which is a termination state. The system
moves into the termination state as soon as the asset is sold. By writing
that the system is at a state Xk 'I T at time k, we lllean that the asset has
not been sold as yet and the current offer under consideration is equal to
:!:k. Dy writing that the syst.em is at state :\'k = T at time k, we mean that
the asset has already been sold.
With these conventions we may write a system equation of the form
Xk+l = h(Xk, llk, Wk),
k = O,I,...,N - 1,
Sec. 4.'1
Optimal Stopping Prob
15U
i
I
where the state space is the real line augment.ed by t.he t.ennination st.at.e
T, and the [unction fk is defined via the relation
{ T
Xk+l =
Wk
if Xk = T, or if Xk 'I T and Uk = u l (sell),
otherwise.
The corresponding reward function may be written as
{ N-l }
k=O.I','N-l 9N(XN) + t; gdXk, llk, Wk)
where
{ XN
.(}N(XN) = 0
if XN 'I T,
ot.herwise,
{ ( 1 + r ) N-kx
gk(Xk, llk, Wk) = 0 . k
if Xk l T and Uk = 'u l (sell),
otherwise.
Dased on this formulation we can write the corresponding DP algo-
rithm:
{ XN
.IN(XN) = 0
if XN iT,
if XN = T,
(4.1 )
Jk(Xk) = { max[(1 + r)N-kxk, E{.lk+l (wd}]
. 0
i[ ;£k I T,
if :£k = T.
( 4.2)
In the above equation, (1 + r)N-k xk is the revenue resulting from decision
u 1 (sell) when the offer is Xk, and E{JkH(Wk)} represents the expected
revenue corresponding to the decision u 2 (do not sell). Thus, the optimal
policy is to accept an offer if it is greater than E{ J k + 1 (Wk)} /(1 + 7.)N -k,
which represents expected revenue discounted to the present time:
accept the offer Xk
if Xk > O'.k,
.
reject the offer Xk
if Xk < O'.k,
where
E{.h+J(wd}
CXk = (1 + 7.)N-k .
When Xk = O'.k, both acceptance and rejection are optimal. Thus the
optimal policy is determined by the scalar sequence {O'.d (see Fig. 4.4.1).
_..I
160
Problems with Perfect State Informatio
Chap. 4
k
Figure 4.4.1 Optimal policy for accepting offers in the asset selling problem.
Properties of the Optimal Policy
We will now derive some properties of the optimal policy with some
further analysis. Let us assume that the offers "Wk are identically dis-
tributed, and to simplify notation, let us drop the time index k and uenote
by Ew{-} the expected value of the corresponding expression over Wk, for
all k. We will then show that
Qk :0.:: Qk+l,
for all k,
which expresses the intuitive fact that if an offer is good enough to be
acceptable at time k, it should also be acceptable at time k + 1 when there
.will be one less chance for improvement. We will also obtain an equation
for the limit of Qk as k -t -00.
Let us introduce the functions
.h(xd
VdXk) = (1 + r)N-k '
Xk IT.
It can be seen from Eqs. (4.1) and (4.2) that
VN(XN) = XN,
Vd.Tk) = max [Xk, (1 +r)-l{Vk+l(W))],
k = 0,1,.. .,f{ - 1,
and that
Ew {Vk+1 (w)}
Qk = .
l+r
(4.3)
(4.4)
Sec. 4.4
Optimal Stopping ProL .IS
161
To prove that Qk:o.:: CtHI, note that from Eqs. (4.:)) alld (/1.<1), W(' h;\.\,('
V N - 1 (:I:) 2: VN(:I:),
for all :I: 2: o.
Applying Eq. (4.4) for k = f{ - 2 and k = f{ - 1, and llsing t.he pnccding
inequality, we obtain for all x :0.:: 0
VN-2(X) = max [X, (1 + 1')-1 {VN-1 (w)} ]
:0.:: max [x, (1 + r)-l {p,{VN(W)}]
= VN-l(X).
Continuing in the same manner, we see tlult
Vdx) 2: Vk+1(X),
for all .T 2: 0 and k.
Since Qk = Ew{V k + 1 (w)} /(1 + 1'), we obtain Qk :0.:: Qk+1, as uesired.
Let us now see what happens when the horizon f{ is very large. From
the algorithm (4.3) and (4.4) we have
Vk(Xk) = max(xk,Qk). (4.5)
Hence we obtain
1
Qk = -Ew{Vk+l(W)}
l+r
1 l "k+l 1 1 00
= - Qk+1dP(W) + - wdP(w),
1 + r 0 1 + l' "k+l
where the function P is defined [or all scalars .\ by
P(.\) = Prob{w < .\}.
The difference equation for Qk may also be written as
.
P(Qk+l) 1 1 00
Qk = Qk+1 + - wdP(w),
1 + r 1 + l' "k+l
for all k,
(Hi)
with QN = O.
Now since we have
P ( Q ) 1
0<-<-<1
-1+r-l+1' '
for all Q:o.:: 0,
1 1 00 E{w}
o - wdP(w) -,
l+r "'k+l 1+1'
for all k,
Ii
__-1
162
Problems with Perfect State In[ormatio.
Cllap. 4
it can be seen, using the property D:k 2: D:k+l, that the sequence {D:d
generated (backward) by the difference equation (4.6) converges (as k ->
- 00) to a constant a satisfying
(1 + r)a = P(a)a + L oo wdP(w).
This equation is obtained from Eq. (4.6) by taking limits as k --> -00 and
by using the fact that P is continuous from the left.
Thus, when the horizon N tends to become longer, the optimal policy
for every fixed k 2: 1 can be approximated by the stationary policy
accept. the offer .(;k
reject the ofl'er :Ck
if Xk > a,
if Xk < a.
The optimalit.y of t.hi::; policy for t.he corresponding infinite horiE:OH problem
will be shown in Section 7.3.
Purchasing with a Deadline
Let us consider another stopping problem that has similar nature.
As::;ume that a certain quantity of raw material is requircd by a certain
time. If the price of t.his material fluctuates, there arises the problem of
deciding whether to purchase at the current price or wait a further period,
during which the price may go up or down. We thus want to minimize
the expected price of purchase. We nS3ume that successive price::; Wk are
independent and identically distributed with distribution P(Wk), and that
the purchase must be made within N time periods.
This problem and the earlier asset selling problem have obvious sim-
ilarities. Let us denote by
Xk+l = Wk
the price prevailing at the beginning of pcriod k + 1. We have ::;imilar to
the earlier problem the DP algorithm
IN(J:N) = J:N,
.h(.l:d = min[:I:k' E{.lk+I(IJJk)}].
Note that Jk(Xk) is the optimal cost-to-go when the current price is Xk and
t.he material has not. been purchased yet. To be strictly formal, we should
iHtroduce a termination state T, to which the system moves following a
purchasing decision and at which the sy::;tem subsequently stays at no cost.
A nonzero cost is incurred only when moving from Xk to T; this cost is
equal to Xk. Thus the cost-to-go from the termination state Tis 0, and for
this reason it was neglected in the preceding DP equation.
Sec. 4.4
Optimal Stopping Prob. .;
IG3
The optimal policy is given by
purchase if Xk < D:k,
do HOt. plln:ha.e if .7:,- > (I'..,
where
D:k = E{Jk+l(Wk)}.
Similar to the asset selling problem, the thresholds D:l, 0:2, .. . , ON -I can
be obtained from the discrete-time equation
J "'k+l
O:k = CXk+l(l - P(CXk+l)) + wdP(w),
.0
with the terminal condition
CXN-l = 1 00 wdP(w) = E{w}.
The Case of Corrclated Prices
Consider now a variation of the purchasing problem where we do not
assume that the successive prices WO,..., WN-l are independent. Inst.ead,
we assume that they are correlated and can be represented as the staLe of
a linear system driven by independent disturbances (d. Section 1.4). In
particular, we have
Wk = .Tk+l,
k = 0,1,... ,tv - 1,
with
Xk+l = AXk + k,
Xo = 0,
where A is a scalar wit.h 0 :S: A < 1 and o, 6, . . . , N -I are independent
identically distributed random variables taking positive values with given
probability distribution. As discussed in Section 1.4, the DP algorithm
under these circumstances has the form
IN(XN) = XN,
JdXk) = min[J:k, E{Jk+I(AXk +k)}],
where the cost associated wit.h the purchasing decision is J:k and the co::;t
associated with the waiting decision is E { Jk+ I (AXk + d}.
We will show t.hat in this case the opt.imal policy is of the same type
as the one for independent prices. Indeed, we have
IN-1(:l:N-I) = min[:l:N_l, A.TN-l +],
_ J
164
Problems with Perfect SLaLe In[ormaUoL
Chap. 4
XN..1
,
,
,
,
""",
""""
AXN_1 +
o
XN.'
---
Purchase aN_'
Do not Purchase
Figure 4.4.2 Structure of the cost-to-go function IN-dxN-d when priccs are
correlated.
where = E{N-d. As shown in Fig. 4.4.2, an optimal policy at time
N - 1 is given by
purchase if XN-I < O'.N-I,
do not purchase if XN-I > O'.N-I,
where O'.N-l is defined from the equation O'.N-l = AO'.N-l + j that is,
O'.N-l = 1 - A '
Note that
IN-l(X) s: IN(X),
for all x,
and that IN-l is concave and increasing in x. Using this fact in the DP
algorithm, one lIlay show (the lllonotonicit.y property of DPj Exercise 1.23
in Chapter 1) that
Jk(X) s: J k + l (x),
for all x and k,
and that Jk is concave and increasing in x for all k. Furthermore, since
= E{d > 0 for all k, one can show that
E {Jk+l (k)} > 0,
for all k.
These facts imply (as illustrated in Fig. 4.4.3) that the optimal policy for
every period k is of the form
purchase if Xk < O'.k,
Sec. ,1.4
OpUmal SLopping ProL. ,,5
165
do not purcha.e if XI.; > Ok,
where the scalar 0'.1.; is obtained as the unique positive solut.ion of t.he equa-
tion
x = E{JI.;+I(A:r + k)}.
Note t.hat the relation Jd:r) s: JI.;+I(X) for all:c and k implies that.
0'.1,;-1 s: 0'1,; s: 0'.1.;+1,
for all k,
and hence (as one would expect) the threshold price to purchase incl'<ases
as the deadline gets closer.
o
E{J k + 1 ().x + k)}
/
x
Purchase ak
Do not Purchase
Figure 4.4.3 Determining the optimal policy when prices are correlated.
General Stopping Problems and the One-Step-Look-Ahead Rule
\Ve now fUl"lnulat.e a general type of N-stag,e problem where st.opping
is mandatory at or before stage N. Consider the stationary version of
the basic problem of Chapter 1 (state, control, and disturbance spaces,
disturbance distribution, control constraint set, and cost per stage arc the
same for all times). Assume that at each state XI.; and at time k there
is available, in addition to the controls Uk E U(Xk), a stopping action
that forces the system to enter a termination state at a cost t(XI.;) and
subsequently remain there at no cost. The terminal cost, assuming st.opping
has not occurred by the last stage, is t(XN). Thus, in effect, we a.o.;sume
that the termination cost will always be incurred either at. the last. stage
or earlier.
The DP algorithm is givcn by
i'
r.
i'.
I'
I:'
I
1
i" i
:1
I."j
i
i
;
I .
"
,Ir '
.t!
! 1 1 \ .
b
I':;
IN(XN) = t(XN)
(.1. 7)
_. J
166
l'rovlews wit.h l>crfcct. St.at.e In[onnat.iol.
Chap. 4
Jk(Xk) = min [ t(:c d , min E{g(Xk' lLk, Wk) + Jk+1 (J(:I;k,Uk, wd) } ] ,
1LkEU(a:d
(.1.8)
and it is optimal to stop at time k for states x in the set
11. = { J; I t(:c):::; min E{g(:c, lL, w) + Jk+1 (J(x, 11., W))} } .
uEU(x)
We have from Eqs. (4.7) and (4.8)
IN-I(X) :::; IN(X), for all x,
and using this fact in the DP equation (4.8), we obtain inductively
Jk(X) :::; Jk+l(X), for all x and k.
[We arc making use here of the stationarity of the problem and the mono-
tonicity property of DP (Exercise 1.23 in Chapter 1).] Using this fact and
the definition of 11.. we see that
To c... C 110 C n+1 C... C TN-I' (4.9)
We will now consider a condition guaranteeing that all the stopping
sets Tk are equal. Suppose that the set TN-I is absorbing in the sense that
if a state belongs to TN -I and termination is not selected, the next state
will also be in TN-I; t.hat is,
f(x, lL, w) E TN-l, for all x E TN -1,U E U(x), w. (4.10)
We will show that equality holds in Eq. (4.0) and for all k we have
Tk = TN-I = { X E S I t(x):::; min E{g(X,lL,W) +t(J(X,lL,W))} } .
uEU(x)
To see this, note that by the definition of TN -I, we have
IN_I(X) = t(x), for all x E TN-I,
and using Eq. (4.10) we obtain for x E 1N-l
min E{g(:C,lL,W) + IN-I(J(J:,U,1JJ))}
uEU(a:)
= min E{g(X,lL,W)-I-t(J(x,U,1JJ))}
uEU(a:)
2: t(x).
Therefore, stopping is optimal for all XN -2 E TN-lor equivalently TN -I C
TN-2. This together with Eq. (4.9) implies 7N-2 = TN-I, Proceeding
similarly, we obt.ain 11.. = 1N -I for all k.
In conclusion, if condition (4.10) hoLds (the one-step stopping set
TN-I is absorbing), then the stopping sets Tk are aLL equaL to the set of
states faT which -it is better to stop rather than continue faT one more stage
and thcn stop. A policy of t.his t.ype i;; knowll as a onc-stc]l-look-n.hcad
poLicy. Such a policy turns out to be optimal in several types of applica-
t.ions. We provide next some examples. Additional examples are given in
t.he exercises, and in Vol. II, Chapter 3, where the stopping problem will
be reexamined in an infmite horizon cont.ext.
Sec. .1..1
Opt.imal St.opping Fl"<.. ws
lU7
Example 4.1 (Asset Sellillg with Past. Offers Retained)
C.ollsi(k.. Lhe assd sellillg prohlelll discllssed earlie.. ill Lhis sl'dioll wit.h the
dIfference that. rejected offers ca.n be accept.ed at. a later t.illle. Theil if t.he
asset. IS not sold at. time k t.he st.ate evolves according Lo
:/:k 1-1 = III ax (:rk', III..}
instead of Xk+l = Wk. The DP equat.ions (4.:3) anu (4.4) t.hen UCCOIllC
VN(XN) = XN,
Vk(Xk) = maX[J:k, (1 +r)-lE{V k + l (llIax(:z;k,llIk))}]'
The one-st.ep stopping set is
TN-l = {x I x (1 +r)-IE{rnax(x,w)}}.
It is seen [compare with Eqs. (4.5) and (4.6)] that. an alt.ernative charact.eri-
zation is
TN-l={xlx2:a},
where Q is obtained frolll t.he equation
(4.11)
(1 + r)a = P(a)a + lX> wdp(w).
Since past. offers can be accepted at. a 1at.er date, the effect.ive offer available
annot dcrease with time, and it follows that. the one-step stopping set. (4.11)
IS abs.orbmg m the sense of Eq. (4.10). Therefore, the one-st.ep-look-ahead
st.oppmg rule that accepts t.he first offer that equals or exceeds Q is opt.imal.
Note that this policy is independent of the horizon lengt.h N.
Example 4.2 (The Rational Burglar [Whi82])
A burglar may at. any night k choose to retire with his acculllulat.ed eal'llillgs
Xk or enter a house and bring home an amount. Wk. Howevcr, in Lhc lattcr
case he gets caught with probability p, and then he is forced t.o t.crminat.e his
tivties and fofeit his arnings thus far. The amounts Wk are independent.,
ldenlcllY dlstnbut.ed with mean w. The problem is to find a policy t.hat.
maxl1ll\zes the burglar's expected earnings over N night.s.
. We call formulate t.his problem as a stopping problem wit.h two act.ions
(retire or continue) and a state space consist.ing of t.he real line, t.he retirement.
state, and a special stat.e corresponding to t.he burr;lar r;dt.illg ('aught.. The
DP algorit.hm is givcn by
IN(XN) = XN
.h(Xk) = max [;Z;k, (1 - p) J'; { .h+l (Xk + Wk)}].
_.1
168
Problems with Perfect Siat.e Information
Chap. 4
TIIP one-step stopping sd is
_ { I (l-P)W }
TN-l = {x I x (1 - p)(x + w)} = x x P ,
(more accurately this set together with the special state corresponding to the
burglar's arrest), Since this set is absorbing in the sense of Eq. (4.10), e
see that the onc-step-look-ahead policy by which the burglar retires when Ills
earnings reach or exceeu (1 - p)wjp is optimal. The optimalit.y of litis. policy
for the corresponding infinite horizon problem will be demonstrat.ed 111 Vol.
II, Chapter 3.
4.5 SCHEDULING AND THE INTERCHANGE ARGUMENT
Suppose one has a collection of tasks to perform bu.t the orderin of
the tasks is subject to optimal choice. As examples, consider the ordenng
of operations in a construction project so as to minimize construction time
or the scheduling of jobs in a workshop so as to minimize machine idle time.
In such problems a useful technique is to start with some schedule and tlen
to interchange two adjacent tasks and see what happens. 'vVe first provide
some examples, and we then formalize mathematically the technique.
Example 5.1 (A Quiz Contest)
Consider a quiz contest where a person is given a list of N questions and can
answer these questions in any order he chooses. Question i will be answered
correctly with probability pi, and the person will then receive reward R i .
At the first incorrect answer, the quiz terminates and the person IS allowe? to
keep his previous rewards. The problem is to choose the ordermg of questIOns
so as to maximize expected rewards.
Let i anu j be the klh and (k + l)st questions in an optimally ordered
list
L = (i o ,... ,ik-l,i,j,ik+2,... ,iN-l).
Consider the list
L' = (io,... ,'ik-l,j,'i,ik+2,... ,iN-i)
obtaineu from L by interch,lnging the order of questions i and j. We comparc
t.he expected rewards of Land L'. We have
E{reward of L} = E{reward of {io,... ,ik-d}
+ Pio'" Pik-l (p,Ri + p,pjRJ)
+ p,o" 'Pik_1PiPJE{reward of {ik+2"" ,iN-d}
Sec. 4.5
Scheduling and the In. . change J1rgument.
169
E {reward of r/} = E {reward of {in, . . . , i k _ I } }
+ Pin'" Pi k _ 1 (pj IlJ + PJPiR.)
+ PiO ... Pik_lPJPiE{ reward of {i.H-2,. .., iN- d }.
Since L is optimally ordered, we have
E{reward of L} E{reward of L'},
so it follows from t.hese equations that
piR i + PiPJRj pJRJ + P)P,Ri
or equivalently
1), Il, > pjR)
1 - Pi - 1 - Pj .
Therefore, to maximiz8 expect.ed rewards, questions should be antiwered ill
decreasing order of PiR.j(l - Pi).
Example 5.2 (Job Scheduling on a Single Processor)
Suppose we have N jobs to process in tiequent.ial order with t.he lth job requir-
ing a random t.ime T, for it.s execution. The times T j , . . . , TN are independent..
If job l is complet.ed at time t, the reward is a' R" where a is a diticount fact.or
with 0 < a < 1. The prob1Cln is to find a schedule that maximizes t.he tot.al
expecleu reward.
It can be seen that the state for this problem is jUtit. t.he collection of jobs
yet to be processed. Indeed, because the execution t.imes 'ii arc independent.,
and also because future costs arc multiplicatively affected t.hrough discounting
by the completion times of preceding jobs, t.he optimization of t.he schcuuling
of future jobs is unaffected by the completion times of preceding jobs. As a
result these times need not be ineluued in the state; this would not be so if
either the times T i were correlated or if t.he reward for completing job i at.
time t were not a'R i but instead hau a general dependence on I.. Now, given
that the state is the collection of jobs yet to be processed, it. is clear t.hat. an
optimal policy can be mappeu into an opt.imal job schedule (io, . . ., iN-I).
Suppose that L = (i o ,...,ik-l,i,j,ik+2,...,i N _ l ) is an optimal job
schedule, anu consider the schedule L' = (io,.,. ,ik-l,j,i,ik+2,... ,iN-I) OU-
tained by interchanging i and j. Let tk be the time of completion of job lk-l'
We compare the rewards of the schedules Land L', similar to the preceding
example. Since the reward for completing t.he remaining jobs ik+ 2, . . . , iN-l
is indepenuent of the order in which jobs i and j arc executed, we outain
E{ at.k+1 R. + Q'k+T,+1j R)} 2: E{ c,/k+Tj R] + a'k+1j+Ti R.}.
Since ik, 'ii, and 7j are independent, this relation can be written as
E{a'k}(E{aTi}Ri + E{a T '}E{a 1J }R j )
2: E{a'k}(E{aTj}Rj + E{nTJ}B{aT,}U,),
_.1
170
Problems with Perfee( State Information
Chap. ,1
from whidl we finally obt.ain
E{ o:r,} H, E{ ( 1 )} Hj
2
1- B{a 1i } 1- E{aTJ}'
It. follows t.hat schedulin jobs in order of decrcasin E {aT;} H;j (I - E {nT, })
maximizes expected rewards. The structure of the opt.imal policy is idcntical
with the one we derived for the preceding quiz contest example (identify
E{ aT;} with the probability p; of answering correctly question i).
Example 5.3 (Job Scheduling on Two Processors in Series)
Conidcr the scheuulin!; of N jobs on t.wo processors ;1 anu U, nch that fJ
accepts the output of A as input. Job i requires known times a, anu v; for
processing in A and 13, respectively. The problem is t.o find a schedulp that
rniuimizes the total processing time.
To formulate the problem into the form of the basic problem, we in-
crement discret.e time at the moments when processing of a job is completed
at machine A and the next job is started. We take as state at time k t.he
collection of jobs X k that remain to be processed at A, together with the
backlog of work Tk at. machine B, that is, the amount of time that t.he jobs
currently at B need to clear B. Thus if (Xk,Tk) is the state at stage k and
job i is complet.ed at. machine A, the state changes to (Xk+I, Tk+l) givcn by
X HI = X k - {i},
Tk+l = bi + max(D, Tk - a;).
The corresponding DP algorithm is
Jk(Xk, Tk) = ii [a; + Jk+l (Xk - {i}, b; + max(D, Tk - a;))]
with the terminal condition
I N (0, TN) = TN,
where 0 is the empty set.
Since the problem is deterministic, there exists an optimal open-loop
schedule
{io, . . . , j'k-l, i, j, ik+2, . . . , iN-I}'
By arguing that the cost of this schedule is no worse than the cost of the
schpdnle
{io,..., ik-l,j, i, ik+2,..., iN-I},
obt.ained by interchanging i and j, it can be verified that
Jk+2(Xk - {i} - {j},T;j) ::; Jk+2(Xk - {i} - {j},Tj;), (5.1)
Sec. 4.5
V'
f:
I I' ,
, (i
It
Scheduling and the L change Argument
171
wherp T'J and TJI arc the bilcklogs at Inilchine 11 at t.inH' k I :2 wh('n I i
pro:essed before j and j is processed before i, respectively, awl the backlog
at tune k was Tk. A straightforward calculation shows t.hat
T'J = b i + b j - a, - aj + max(Tk, a;, a, + a J - v,),
(5.2)
TJi = b j + b, - aj - a; + max( Tk, a J , aj f- a; - b j ).
(5.:!)
ClPHrly, .h+2 is monot.onically illC1"l"..illg ill T, so frolll Eq. (:>.1) W(' obtain
Tt) Tjt.
In view of Eqs. (5.2) and (5.:3), this relation implies t.wo possibilities. The
firs t is
Tk 2: Inax(l1. tJ at + (J,J - btL
Tk 2 max(aj,aj + a; - Vj),
in which case TiJ = TJ' and the order of i and j makes no uifference. (This is
the case where the backlog at t.ime k is so large that both jobs i and j will
find B working on an earlier job.) The second possibility is that
max(a" a, +aj -b;)::; max(aj,aj +ai -b j ),
which can be seen to be equivalent to
min(a;,v J )::; min(aj,v,).
,
,
A schedule satisfying t.hese necessary conditions for opt.imalil.y can bp
constructed by the following procedure:
1. Find min, min(a" b,).
2. If t.he minimizing value is an a take the cOlTespondillg job first; if it is
a b, take the corresponding job last.
3. Repeat the procedure with the remaining jobs until a complete scheuule
is co nstructed.
To show that this schedule is indeed optimal, we start wit.h all opt.imal
schedule. We consider the job io that minimizes min(ai, b,) and by successive
interchanges we move it t.o t.he same position as in I.Iw schedule f'onst.ruet.pd
previously. It is seen frolll t.he preceding analysis that t.he resultillg sdwdul"
is still optimal. Similarly, cont.inuing through successive interchanges and
maintaining optimality throughout, we can transform the optimal schedule
into the schedule constructed earlier. We leave t.he det.ails to t.he reauer.
1.
_. J
172
Problems with Perfect State fnformatiol
Chap. 4
The Interchange Argument
Let us now consider the basic problem of Chapter 1 and formalize the
interchange argument used in the preceding examples. The main require-
ment is that t.he problem has structure such that. thae exists an opcn-loop
[Iolicy that is optimal, that is, a sequence of controls that performs as well
as any sequence of control functions. This is certainly true in detenninistic
problems as discussed in Chapter 1, but it is abo true in some stoclHlst.ic
problems such as those of Examples 5.1 and 5.2.
To apply t.he int.erchange argument, we start with an optimal seq\1ellce
{Uo, . . . , 'Uk-1, u', U, Uk+2, . . . ,UN -1}
and focus attention on t.he controls u' and u applied at times k and k + 1,
respectively. We t.hen argue that if t.he order of u' and u is int.erclmnged
the expected cost C<1l1110t decrease. In particular, if X k is the set of states
that. can occur with positive probability starting from the given initial state
Xo and using the control subsequence {uo,..., uk-d, we must have for all
;Ck E Xk
E{gk(Xk,U',Wk) + gk+l(XZ+1,ii,wk+1) + .J+2(xk+2)}
s: E{gdxk, ii, Wk) + gk+1(Xk-f-1, u', Wk+tJ + .J+2(h+2)},
(5.4)
where xk+1 and xk+2 (or Xk+1 and ik+2) are the states subsequent to Xk
when Uk = u' and Uk+1 = 11 (or Uk = ii and Uk+l = u', respectively) are
applied, and .Jk+2(') is the optimal cost-to-go function for time k + 2.
Relation (5.4) is a necessary condition for optimality. It holds for
every k and every optimal policy that is open-loop. There is no guarantee
that this necessary condit.ion is powerful enough to lead to an optimal
solution, but. it is worth considering in some specially structured problems.
4.6 NOTES, SOURCES, AND EXERCISES
The certaint.y equivalence principle for dynamic linear-quadratic prob-
lems was first discussed in [Sim56]. His work was preceded by [The54], who
considered a single-period case, and [HMS55], who considered a determin-
istic case. Similar problems were independently considered in [KaK58],
[JoT61], and [GuFG3]. The linear-quadratic problem has received a lot of
att.ention in control theory. See the special issue [IEE71], which contains
hundreds of references.
The lit.erature on inventory control st.imulated by the pioneering pa-
per of Arrow et al. [AHM51] is also voluminous. The 19G6 survey paper
I t
Sec. 4.6
Notes, Sources, and ,.-..ercises
173
[VdG6] contains 118 references. An imporhl t. '1- . -
the respareh \1 ) ". . ' I \VOl, sUnllnan;l,llIp; nlOst of
. tl A. - 1' 1 to 1 908 s [AKS5S]. The proof of optilllali t.y of (s. S) poliej('s
l1\ 1e ce 0 nonzero fied osts is due to [Sca60].
. Most of t.he material l1\ Section 4.3 is taken from [Mos68] M.
apphcatlOlIs of DP in econonlies C"m b , r I . [S J " ,tny
Tl' - , e lOUn( III aI'/)7 and [StL/)Uj
5 l' .le mte [ r;:;; of Section 4.4 is largely drawn from [Whj(j0]. EX,;lllPk
.: d ls given III s70], Example 5.2 is given in [Ros83 ] and EX'u nl)le " 3
IS ue to [ WePSO ] A t . " - d..
. n ex enSlve reference on scheduling is [Pin95].
i
I
I
EXERCISES
4.1 (Linear-Quadratic Problems with Forecasts)
onsid)er t.he linear-quadratic problem first examined in Section 4.1 (A. H .
nown for the case where at. the be g innin g of I )cri od I. we 11 ' 1 e . r k, k t '
1 E { I 2 } .. ,. . ,v ,I WI ccas
Vk " . . . ,n conslstmg of an accurate P rediction th t . 11 b I
i d . I a Wk WI e se ected
T n h accor anee Wit 1 a particular probability distribution P k l ( cf SectiOl 1 1 4)
e vectors w. need t b' . Yk" .
I . k no ,we zero meflll undcr t.he dist.ribut.ion I' SI.
t lat. the optlllial control law is of the forll! klllk' . 10\\
/-Lk(Xk, Vk) = -(B[(k+1Bk + Rk)-l Bf(k+1 (AkXk + E{ Wk I vd) + Ok,
wherc the matrices](' . b } . .
. k are given y tiC Rlccatl equat.ion (1.5) alld ( I G ) . d
CXk are approprIate vectors. . an
4.2
Consider a scalar linear system
Xk+l = akxk + bkUk + Wk k = 0 1 N - 1
, , "." ,
whcre ak b k E Rand' I . G .
d .' 2' , cae I Wk IS a aUSSlan random variable with zero Illcan
an varIance a . Show that the control law { '" . . .
the cost function /-Lo, /-LI,." ,/-LN-d that IllIlllIlllzes
E{ex p [x+(x+ru)]},
r > 0,
. lter in the state variable. .We assumc no control constraints, independent.
IS ur ancs, a.d that the ptIaI cost. is finite for every xo. Show by cxam )le
that th ?ausslan assumptIOn IS essential for thc result to hold ( "' or I I.
of multld . I . . L" ana yses
1l11CnSIOna vcrslons of this cxercisc, see [Jae73] and [WhiS2].)
_. .1
174
Problems with Perfect State InFormation
Chap. 4
4.3
Consider an invent.ory problem similar t.o t.he problem of Section 1\.2 (zero
fixed cost). The only difference is t.hat. at t.he beginning of each period k the
decision maker, in addit.ion to knowing t.he current. inventory level Xk, receives
an accurate forecast. t.hat t.he demand Wk will be selected in accordance with
one out of two possible prolmbility tlist.ributions Fe, Ps (large demand, small
demand). The a priori probability of a large demand forecast is known (d.
Section 1.4).
(a) Obtain t.he optimal invent.ory ordering policy for t.he case of a single-
period problem.
(b) Extend the result. t.o the N-period case.
(c) Ext.end t.he result. t.o t.he case of any finite number of possible dist.ribu-
tions.
4.4
COllsider t.he invent.ory problem ofSect.ion 1\.2 (zero fixed cost), where t.he pur-
chase cost.s Ck, k = 0,1, . . . ,N - 1, are not initially known, but instead they
are independent. random variables wit.h a priori known probability distribu-
t.ions. The exact. value of t.he cost. Ck, however, becomes known t.o t.he decision
maker at the beginning of t.he kth period, so t.hat t.he invent.ory purchasing
decision at t.ime k is made with exact knowledge of the cost Ck, Characterize
the optimal ordering policy assuming that p is greater than all possible values
of Ck.
4.5
Consider t.he inventory problem of Section 4.2 for the case where the cost has
the general form
E {rk(Xk) } .
The functions rk are convex and differentiable and
lim drk(X) = -00,
X-+-OQ dx
I" drk(x)
xoo = 00,
k= O,...,N.
(a) Assume that. the fixed cost is zero. Write the DP algorithm for this
problem and show that the optimal ordering policy has the same form
as t.he one derived in Section 4.2.
(b) Suppose there is a one-period t.ime lag bet.ween the order and the de-
livery of invent.ory; t.hat is, the syst.em equation is of t.he form
Xk+l = Xk + "Uk-l - Wk,
k=O,l,...,N-l,
of!
Sec. fl.G
Notes, Sources, and L Clses
175
where II- I is given. H.<'iormulat.e t.he probklll so LhaL it. h"s Lhe rOrln
of t.he problem of part. (a). flint.: Make a changl' of ,'ariabl"S.llk -.
Xk+Uk-l.
(c) Repeat parts (a) and (b) for the case where t.he lixcu cost is lIonzero.
4.6 (Invent.ory Control for Nonzcro Fixed Cost.)
Consider the invent.ory problem of Section 1\.2 (nonzero fixed cost) ullder t.he
assumption t.hat unfilled delnand al. (';I('h stage is not. backlogg,'d bilL raLhl'r
is lost; t.hat. is, t.he system equation is Xk+l = max(O, Xk + "Uk - wd insteau of
Xk+l = Xk + "Uk - Wk. Complete the details of the following argument., which
shows that. a 1l1ultiperiod (s, S) policy is optimal.
Abbrcviated Proof: (S. Shreve) Let IN(J;) = 0 and for all k
Gk(V) = cV+E{h max(O, y-wk)+pmax(O, wk-v)+.h+l (max(O, V-Wk)) },
,hex) = -c.r + min [[(6(n) + GdJ: + It) ] ,
u?::o
where 0(0) = 0, 6(n) = 1 for "U > O. The result. will follow ir we ("all show Lhat
G k is [{-convex, continuous, and Gdv) --t 00 as Ivl --t 00. The diflicult. part. is
to show K-convexity, since [(-convexity of G k - I docs IIOt. imply [( -convexit.y
of E{.h-H(max(O, V - w))}. It will be :;uflicicnt. t.o show t.hat. I\-l"oIlVl'xity (;1'
Gk+l implies [(-convcxit.y of .
H(V) = pmax(O, -V) + JHl (max(O,v)),
(G.l)
or equivalently that.
f .' + fl(y + 7 ) 2: 11 (VI ) + 7 ( Il(y) - b Il(Y-I)) ) ,
' _ _ z 2: 0, {) > 0, V C: n. (>.2)
By [(-convexity of Gk+1 we have for appropriate scalars Sk+1 anu Sk+l such
that GH1(Sk+l) = miny GHI(V) and ]{ + Gk+l(Sk+J) = Gk+J(Sk+l):
J ( ) { ]{ + GH1(S'k+J) - CX
k+1 X =
GH1(X) - C."r
if x < Sk+l,
if x 2: 8k+l,
(G.3)
and Jk+1 is [{-convex by the t.heory of Section 4.2.
Case J: 0 ::; V - b < V ::; V + Z . In this region, Eg. (G.2) follolVs from
K-convexity of Jk+l.
Case 2: y - b < y ::; y + Z ::; 0 . In this region, H is linear amI hence
K-convex.
Case 3: y-b < V::; 0::; y+z. III t.his region, Eq. (G.2) Ina)' be writt.('11
[in view of Eq. (6.1)] as
J( + kf-!(Y + z) 2: kt-l(O) - p(y + z).
_ J
176
Problems witil Perfect State InformatioJ.
Chap. 4
We will show that
K + Jk+l(Z) 2: Jk+l(O) = pZ,
z o.
(6.4)
If 0 < Sk+l ::; z, then using Eq. (6.3) and the fact p > c, we have
K + .Jk+l(Z) = J( - cz + Gk+I(Z) 2: [( - pz + Gk+l(S'k+l) = .h+I(O) = l'Z.
If 0 ::; Z ::; Ski 1, then using Eq. (6.3) and t.he fact p > e, we have
[( +JU1(Z) = 2[( -cz+GUl(,'h+d 2: [( -pZ+GU1(">\+I) = .h+l(O) -pz.
If Sk+l ::; 0 ::; z, then using Eq. (6.3), the fact p > c, and part. (iv) of t.he
lemma in Section 4.2, we have
[( + Jk+l(Z) = J( - cz + Gk+I(Z) 2: Gk+I(O) - pz = .h+l(O) - pz.
Thus Eq. (6.4) is proved and Eq. (6.2) follows for the c;lSe under consiueration.
Casc 4: y - b < 0 < y ::; y + Z . Then 0 < y < b. If
ll(y) - H(O) H(y) - H(y - b)
y 2: b '
(6.5)
then since 1I agrees with J UI on [0,00) and Jk+1 is K-convex,
K + H(y + z) 2: H(y) + Z ( lI(Y) fI(O) )
( ) ( II(y)-fI(Y-b) )
Hy+z b '
where the last. step follows from Eq. (6.5). If
H(y) - fICO) < H(y) - !I(y - b)
y b
then we have
H(y) - II(O) < ¥(H(y) - H(y - b)) = ¥(II(y) - H(O) + p(y - b)).
It follows that
(1 - t) (H(y) - H(O)) < (¥) p(y - b) = -py (1- ¥) ,
and since b > y,
H(y) - H(O) < -py. (6.6)
Now we have, using the definition of II, Eqs. (6.4), and (6.6),
fI(y)-H(y-b) ( ) ( fI(O)-PY-H(O)+P(y-b) )
H(y) + z b = II y + Z b
= H(y) - pz
< II(O) - p(y + z)
::;K+ll(y+z).
Hence Eq. (6.4) is proved for this cclSe as well. Q.E.D.
"
<.
Sec. 4.6
Notes, Sources, and f. eises
177
4.7
Consider the inventory problem of Section 4.2 (zero fixed cost) with the dif-
ference that successive demanus are corrclateu and satisfy a relat.ioll of the
form
Wk = ek - jCk-l,
/,; = U, 1,...,
where j is a given scalar, ek are indepelldent ranuolIl variahlps, alld C-l is
given.
(a) Show that t.his problem can be converted into an inventory problelll wit.h
independent. demands. ffinl.: Given1llu, 1111,..., 1IIk-l, we can (It.t.cnnine
Ck _ I in view of the relation
k-l
k '" i
ek-l = j C-I + L.; j Wk-I-i.
t=O
Define Zk = Xk + ICk-l as a new state variable.
(b) Show that the same is true when in addition there is a ol\{-period delay
in the delivery of inventory (cL Exercise 4.5).
4.8
Consider the inventory problem of Section 4.2 (zero fixed cost), the only
difference being that there is an upper bound b and a lower bound Q to t.he
allowable values of the stock Xk. This imposes the additional constraint on
Uk
Q + d ::; Uk + Xk ::; b,
where d > 0 is the ma.ximum value that the demanu Wk can take (we assume
Q + d < b). Show that an optimal policy {ji'Q,. . . , Ill, -I} is of t.he form
.'.
. { Sk - Xk
ji.dXk) = 0
if Xk < !:>\,
if Xk 2: Sk,
where So, Sl,..., SN-l are some scalars.
4.9
Consider the invent.ory problem of Section 4.2 (nonzero fixed cost) with the
difference that demand is deterministic and must be met at each time period
(i.e., the shortage cost per nnit is 00). Show that it b opt.imal t.o ordcr a
positivc amount at. period k if and olily if t.he st.ock :q is illsnllici"lIl. t.o 111(',,(,
the demand Wk. Furthermore, when a positive amount is orden'd, it. should
bring up stock to a level that will satisfy uemalld for an integrailiumber of
periods.
I.
!
;j\1
1 m '"
I .
',1
: 1\"
_.J
i
178 Problems with Perfect State InformaUOl.
Chap. ,1
Sec. 4.6
Notes, Sources, an, xercises
179
11
J:
4.10 [Vei65], [Tsi84b]
4.12
Consider the inventory control problem of Section 4.2 (zero fixed cost) with
t.he only difference that the orders 'Uk arc const.rained to be nonnegat.ive 1.n-
I.egers. Let. Jk be the optimal cost-to-go function. Show that:
( a) J k is continuous.
(b) Jk(X + 1) - Jk(X) is a nondecreasing function of x.
(c) There exists a sequence {Sd of numbers such that the policy given by
\Ve want to use a machine to produce a cert.ain item in quantities that meet
as closely as possible a known (nonrandom) S('(jnCIH'(' of d('ma.nd (h OWl" N
periods. The machine can be in one of t.wo states: good (G) or bad (H). The
st.ate of the machine is perfectly observed and evolves from one period t.o t.he
next. according to
peG I G) = AG, PCB I G) = l-Ac;, PCB I B) = All, peG 113) = l-A/3,
I f
: 1
ILk(:I;k) = {
if ,'h - n :S Xk < ,'h - n + 1,
if Xk 2: Sk
n =- 1,2,.. 'J
where AG and AB are given probabilities. Let. Xk be t.he stock at. t.he bq,;inning
of the kt.h period. If the machine is in good state at. period k, it. can produce
Uk, where Uk E [0, u], and the stock evolves according t.o
Xk+l = Xk + Uk - d k :
otherwise the st.ock evolves according to
Xk+l = Xk - d k .
There is a cost g(Xk) for having stock Xk in period k, and t.he terminal cost
is also g(XN). We assume that. the cost per stage 9 is a convex function such
that g(x) -+ 00 as Ixi -+ 00. The objective is to find a production poliey that
minimizes t.he t.ot.al expect.ed cost..
(a) Prove inductively a convexity property of the cost-to-go functions, and
show that for each k there is a target stock level Sk+l such t.hat if t.he
machine is in the good state, it is opt.imal to produce 'Uk E [0, u] that
will bring Xk+l as close as possible to Sk+l.
(b) Generalize part (a) for the case where each demand dk is random and
t.akes values in an int.erval [0, dl wit.h given probability dist.ribution. The
st.ock and the state of the machine are still perfectly observable.
is optimal.
4.11 (Capacity Expansion Problem)
Consider a problem of expanding over N time periods the capacity of a pro-
duction facility. Let us denote by Xk the production capacity at the beginning
of t.he Hh period and by Uk 2: 0 t.he addit.ion to capacity during the J.:t.h pe-
riod. Thus capacity evolves according to
Xk+l = Xk + 'Uk,
k = 0,1,.. .,N - 1.
The demand at the kl.h period is denoted Wk and has a known probability dis-
tribution that does not depend on eit.her Xk or Uk. Also, successive demands
are assumed to be independent and bounded. We denote:
Ck(Uk): expansion cost associated with adding capacity Uk
Pk(Xk+Uk-Wk): penalty associated with capacity Xk+Uk and demand
Wk
k=O.I.N-l { -S(XN) + (Ck(Uk) + Pk(Xk + Uk - Wk)) } .
A gambler enters a game whereby he may at time k stake any amount 'Uk 2: 0
that does not exceed his current fortune Xk (defined to be his init.ial capit.al
plus his gain or minus his loss thus far). He wins his stake back and as
much more with probability p, where < p < 1, and he loses his stake
with probability (1 - p). Show t.hat t.he gambling strategy that. maximizcs
E{ln XN}, where XN denotes his fortune after N plays, is to stake at. each
time k an amount Uk = (2p - l)xk. Hint: The problem is rclated to the
portfolio problem of Section 4.3.
I:'
I:
I'
i:'
I:
I'
f;
i'
I,'
I
I
.! '
4.13 (A Gambling Problcm)
S(XN): salvage value of final capacity XN.
Thus the cost function has the form
(a) Derive the DP algorithm for this problem.
(b) Assume that S is a concave function with limxoo dS(x)jdx = 0, Pk
are convex functions, and the expansion cost Ck is of the form
4.14
{ f( + CkU
Cd") = 0
if U > 0,
if u = 0,
Consider the dynamic. portfolio problcm or Sedioll '1.:1 ("or t.he ("a.,;(' will''"'' at
each period k there is a forecast that the rates of return of the risky assets
for that period will be selected in accordance with a particular probabilit.y
distribut.ion as in Section 1.4. Show that a partially myopic policy is opt.imal.
;1
I
I ,
, I
where K 2: 0, Ck > 0 for all k. Show that the optimal policy is of the
(8, S) type assuming CkY + E {Pk(Y - Wk)} -+ 00 as lyl -+ 00.
I
I
_. J
180
J>rovleIns wjth Perfect State In[ormaiiol.
ClJap. 4
4.15
Consider a problem involving t.he linear system
Xk+l = Ak;k + Ih1!k,
k = 0, I, . . . , N - 1,
where the n x n matrices Ak are given, and the n x m matrices Hk are random
and independent with given probability distributions that do not depend on
Xk, Uk. The problem is to find a policy t.hat maximizes E {U (c' XN) }, where
c is a given n-dimensional vector. We assume that U is a concave twice
continuously different.iable ut.ility function satisfying for all y
U'(y)
- U"(y) = a + by,
and that the control is unconstrained. Show that. the optimal policy consists
of linear functions of the current st.ate. Hint: Reduce the problem to a one-
dimensional problem and use the re;;ults of Section 4.3.
4.16
Suppose that a person wants to sell a house and an offer comes at the begin-
ning of each day. We assume that ,;uccessive offers are independent and an
offer is w J with probability pj, j = 1,. . . , n, where Wj are given nonnegative
scalars. Any offer not immediately accepted is not lost but may be accepted
at any later date. Also, a maintenance cost c is incurred for each day that the
house remains unsold. The objective is to maximize the price at which the
house is sold minus the maintenance costs. Consider t.he problem when there
is a deadline t.o sell the house wit.hin N days and characterize the optimal
policy.
4.17
Assume that we have Xo items of a certain type that we want to sell over a
period of N days. At each day we may sell at most one item. At the kth day,
knowing the currcnt number Xk of remaining unsold items, we can set the
selling price Uk of a unit item to a nonnegative number of our choice; then,
the probability Ak(Uk) of selling an item on the kth day depends on Uk as
follows:
"'\k( lt k) -=-..:: lH..:-- 1Lk ,
where a: is a given scalar with 0 < a: ::; 1. The objective is to find the optimal
price sett.ing policy so as to maximize t.he total expect.ed revenue over N days.
(a) Assuming that, for all k, the cost-to-go function Jk(Xk) is monotonically
nondecreasing as a function of J;k, prove that for Xk > 0, the optimal
prices have the form
J.Lk(Xk) = 1 + Jk+l(Xk) - -h+l(Xk - I),
Sec. 4.6
Notes, Sources, and. rCIses
u!!
and t.hat.
Jk(Xk) = a:e-"k(Xk) + Jk+l(Xk).
! t'
(b)
Prove simultaneously by induction that., for all k, t.he cost-to-go functiou
.h(:q) is indeed monot.onically nond"(TeiL'iing Wi a function of CD,. t.hat.
the optimal price J.Lk(Xk) is monotonically nonincreasing as a function
of Xk, and that Jk{xk) is given in closed form by
.
"
{ (N - k)a:c- l
Jk ( Xk ) = ,\"N-,ck -I,*(xk) + -1
i=k oe 1 .7:k Q e
o
if Xk 2 N - k,
if 0 < Xk < N - k,
if Xk = O.
II:'
4.18 (Optimal Termination of Sampling)
This is a classical problem, which when appropriately paraphrased, is known
as the job selection, or as the secretary selection, or as the spouse selection
problem. A collect.ion of N 2 2 objects is observed randomly and sequentially
one at a time. The observer may either select the current object observed,
in which case the selection process is terminated, or reject the object and
proceed to observe the next. The observer can rank each object relat.ive t.o
those already observed, and the objective is to maximize the probability of
selecting the "best" object according to some criterion. It is assumed t.hat no
two objects can be judged to be equal. Let r* be the smallest posit.ive int.eger
r such that
1 1 1
-+-+...+-<1.
N-I N-2 r-
Show that an optimal policy rel/uires that the first r' objeds be observed. If
the r* th object has rank 1 relative to the others already observed, it should
be selected; otherwise, the observation process should be continued until an
object of rank 1 relative to those already observed is founeL Hint.: We assume
that, if the rth object has rank 1 relative to the previous (r -1) objects, then
the probability that it is best is riN. For k 2 r*, let Jk(O) be the maximal
probability or finding the best object assuming k objects have been selected
and the kth object is not best relative to the previous (k - I) objects. Show
that
k ( 1 1 )
Jk(O) = - - + ... + -
N N-I k .
4.19
.
A driver is looking for parking on the way to his de;;t.in,lt.ion. Each parkillg
place is free with probability p independently of whet.her other parking places
are free or not. The driver cannot observe whether a parking place is free
until he reaches it.. If he parks k places from his destination, he incurs a cost.
k. If he reaches the dest.ination without having parked the cos(, is C.
_..1
182
Problems with Perfect. State Infonnaiion
Ch«p. 1
Sec. 1.6
Naif's, Sources, «nd
rcises
183
(a) Let. Fk be t.hc minimal expected east if he is k parking places from his
destinat.ion. Show that.
Fa =C,
money hc get.s over N periods by optiuwl dlOi('(' of t.hl' l'ayuH'1I1 d"UlaIHls Ilk.
(Note t.hat therc is no addit.ional penalty for being denouuced t.o t.hc police.)
\Vrite a DP algorithm and fiud the opt.irnal policy.
1,1.. =pmin(k,I'k_l) +qI'k-l,
k=I,2,...,
where q = 1 - p.
(b) Show that an optimal policy is of t.he form: uever park if k :::: k*, but
take t.he first. free place if k < k', wl,,'re !.: is t!H, 111111I1",,' of parkillg
places from the destination and k* is the smallest int.eger i satisfying
qi-l < (pC + q)-l.
4.23 [Whi82]
4.20 [Whi82]
The Creek adventurer Theseus is trapped in Kjng Minos' Labyriuth mam.
lie ("lIl try 011 each day olle of N pa.sag('s. I f I... ('I1I.,'rs 1);1.s"g" I I... will
escape with probability Pi, he will be killed with probabilit, q" aud hI' will
determine t.hat t.he passage is a dead end with probabilit.y (I-Pi -qi), in which
case he will return t.o the point from which he st.arted. Use an int.erchange
argument t.o show that. trying passages in order of decreasing p,/'1' maximizes
t.he probabi1it.y of escape within N days.
.
; ,
A person may go huuting for a cert.aiu type of animal on a given day or stay
home. When the animal population is x, the probability of capt.uring one
animal is p(x), a known increasiug function, and the probability of capt.uring
more than one is zero. A capt.ured animal is wort.h one unit and a day of
hunting costs c units. Assume that x does not. change due to dcaths or
birt.hs, that. t.he hunt.er knows x at. all times, that the horizon is finite, and
that the terminal reward is zero. Show that it is optimal to hunt only when
p(x) :::: c.
4.24 (Hardy's Theorem)
Let {al,'" ,an} and {b l ,... ,Un} be monotonically nondecreasing sequences
of numbers. Let us associate with each i = 1,.,., n a distinct jndex j" and
consider the expression L=l a.b ji . Use an interchange argument. to show
that this ""xpression is maximized when j, = i for all i, and is minimized
when ji = n - i + 1 for all i.
4.21
4.25
Considcr t.he scalar linear syst.em
Xk+l = aXk + bUk,
A busy professor has to complete N projects. Each project. k has a deadline
d k and the t.ime it takes the professor to complete it. is tk. The professor can
work on only one project at a time and must complete it before moving on
to a new project. For a given order of completion of the projects, denote by
Ck the time of completion of project k, i.e.,
where a and u are known. At each period k we have the option of using a
cont.rol Uk and incurring a cost qXk + 7'uL or else st.opping and incurring a
stopping cost tXk. If we have not stopped by period N, the terminal cost. is
the stopping cost Lx"iv. We assume that q :::: 0, r > 0, t > O. Show that there
is a threshold value for t below which immediate stopping is optimal at. every
initial state, and above which continuing at every state Xk and pcriod k is
opt.imal.
Ck = tk +
L
ti.
projects i
completed before k
The professor wants to order the projects so as to minimize the maximum
t.ardiness, given by
Considcr a situat.ion involving a blackmailer and his victim. In each period
thc blackmailer has a choice of: a) Accepting a lump sum paymcnt of R from
the vict.im and promising not to blackmail again. b) Demanding a payment of
u, where u E [0,1]. If blackmailed, the victim will cithcr: 1) Comply with t.he
demand and pay U to the blackmailer. This happens with probability 1 - u.
2) Refuse to pay and denounec the blackmailer to the po1icc. This happcns
with probability u. Once known to the policc, the blackmailer cannot. osk for
any more money. The blackmailer wants t.o maximizc the cxpected amount of
max max(O, Ck - d k ).
kE(1,....N}
Use an interchange argument to show that it is optimal to completc the
projects in the order of their deadlines (do the project with the closest dcadlinc
first).
4.22
.'
?-
.I
f.
__J
ii
i
"
5
ii'
:1:
I,
p
Hi
p,.
I'"
! ! , I
;f
I,
. !
Problems with Imperfect State
Information
Contents
5.1. Reduction to the Perfect Information Case
5.2. Linear Systems a.nd Quadratic Cost . . .
5.3, Minimum Variance Control of Linear Systems
5.4. Sufficient Statistics and Finite-State Markov Chains
5.5. Sequentia.l Hypothesis Testing
::i.6. Notes, Sources, and Exercises. . . . . . . . . .
p.186
p.196
p.203
p.219
p.232
p. 237
185
I::
, ,I"
1 11
'
, !
: t i
. F.L
_.J
186
Problems lVit,h Imperfect StaLe, Information
Chap. 5
We have assumed so far that the controller has access to thc cxact
value of the current state, but this assumption is often unrealistic. For
example, some state variables may be inaccessible, the sensors used for
measuring them ma.y be inaccurate, or the cost. of obtaining the exact
value of the state may be prohibitive. We model sit.uations of this type
by assuming that at each stage the controller receives some observations
about the value of the current state, which may be corrupted by stochastic
uncertainty.
Problems where the controller uses observations of this type in place
of the state are called problems of imperfect state information, and are
the subject of this chapter. We will find that even though there are DP
algorithms for imperfect information problems, these algorithms arc far
more computationally intensive than in the perfect information case. For
this reason, in the absence of an analytical solution, imperfect information
problems arc typically solved suboptimally in practice. On the other hand,
we will see that concept.ually, imperfect state information problems arc no
different than the perfect state information problems we have been studying
so far. In fact by various reformlllations, we can red uce an imperfect state
information problem to one with perfect state information. 'vVe will study
two difrerent reductions of this type, which will yield two different DP
algorithms. The first reduction is the subject of the next section, while the
second reduction will be given in Section G.4.
5.1 REDUCTION TO THE PERFECT INFORMATION CASE
We first formulate the imperfect state information counterpart of the
basic problem.
Basic Problem with Imperfect State Information
Consider the basic problem of Section 1.2 where the controller, instead
of having perfect knowledge of the state, has access to observations Zk of
the form
Zo = ho(xo, vo),
Zk = hk(Xk, Uk-I, Vk),
k = 1,2, ..., N - 1. (1.1)
The observation Zk belongs to a given observation space Zk. The random
observation disturbance Vk belongs to a given space V k and is characterized
by a given probability distribution
P1Jk (- I Xk, . . . ,Xo, Uk-I, . . . ,Uo, Wk-l, . . . , lUo, Vk-l, . . . , vo),
which depcnds on the current state and the past states, controls, and dis-
turbances.
Sec. 5.1
Heduction to the Per.
Information Case
187
The initial st.ate .1:0 is also random and charad,('riz('d hy ;\ gi\"l'll pl"l,h-
ability distribution P.To' The probability distribution P Wk (- I .7;k, Uk) of Wk
is given, and it may depend explicitly on Xk and Uk but not on the prior
disturbances Wo,..., Wk-l, vo,..., Uk-I. The control Uk is L:onsl.railled t.o
take values from a given nonempty subset Uk of the control space n,. It. is
assumed that this subset does not depend on 1:k.
Let us denote by h the information available to the controller at t.ime
k and call it the information vector. We have
i
;
i
1k = (zu,Zj",.,Zk,UO,Ul,...,Uk_l),
10 = ZOo
k = 1,2,.. .,N - 1,
(1.2)
5C'
We consider the class of policies consisting of a sequence of funct.ions
7r = {fLO, fLl,. . . ,fLN-J}, where each function p.k maps the information vec-
tor h into the control space C k and
ILk(h) E Uk,
for all h, k = 0,1,. . . , N - 1.
Such policies are called admissible. We want to lilld ,1!l admbsible policy
7r = {JiO, ILl, . . . , JLN-d that minimizes the cost function
{ N-l }
J rr = E gN(XN) + L 9k(Xk,fLk(h),lUk)
;TQ,Wk, 1J k
k=O,...,N-I k=U
(1.3)
subject to the system equation
, ,
I
!,
Xk+l = fk(xk,JLk(h),wk),
k = O,I,...,N - 1,
and the measurement equation
i
,
Zo = ho(xo, vo),
Zk = h k (Xk, 11k-l (h-J), Vk),
k = 1,2,... ,N - 1.
I
i
.f'
! ,
I i
i!
. I.;',
, 'I, '1;'
,I
t
. ,;!
Note the difference from the perfect state information case. Whered.o.;
before we were trying to find a rule that would specify the control 'ILk t.o
be applied for each state Xk and time k, now we are looking for a rule that
gives the control to be applied for every possible information vector h (or
state of inform,Ltion), that is, for every sequence of observat.iolls received
and controls employed up to time k.
Example 1.1 (M ultiaccess Communication)
Consider a collection of transmitting stations sharing a common channel,
for example, a set of ground stat.ions communicating with a satellite at a
common frequency. The stations are synchronized to transmit packet.s of
_. J
188
l'rovIems with lmperfect. State Information
Chap. 5
data at integer times, Each packet requires one t.ime unit (also called a
slot) for t.ransmission. The total number ak of packet arrivals during slot
k is iudependeut of prior arrivals and has a given probabilit.y distribution.
The st.ations do not. know t.he backlog J:k at t.he beginning of t.he hh slot.
(thl:' nlllnber of packets wait.ing t.o be transmit.t.ed). Packet. t.rausmissions are
scheduled using a strat.egy (known as slolled Aloha) whereby each packet
residing in the system at the beginning of the kth slot is transmitted during
the kt.h slot with probabilit.y Uk (common for all packets). If two or mOl'e
packets arc t.rausmit.t.ed simult.aueously, t.hey collide and have t.o rejoin t.he
backlog for retransmission at a later slot, However, the stations cau observe
the channel and determine whether in anyone slot. there was a collision (two
or more packets), a success (one packet.) , or an idle (uo packet.s). These
observatious provide information about the state of the system (the backlog
Xk) and can be used to select appropriat.ely the control (the t.ransmission
probability Uk)' The objective is to keep the backlog small, so we assume a
cost per st.age gk(Xk), which is a monotOllically increasing function of :Ck.
The stat.e of the syst.cm here is the backlog Xk and evolves accordiug to
t.hc cquatiou
Xk+l = Xk + ak - tk,
where <Lk is the number of new arri vals a.nd tk is the number of packets
successfully transmitted during slot k. Both ak and tk may be viewed as
disturbances, and the distribution of tk depends on the state Xk and the
control Uk. It can be seen that tk = 1 (a success) with probability xk u k(l -
Uk)Xk- l , and tk = ° (idle or collision) ot.herwise [the probability of anyone
of t.he Xk waiting packets being transmitted, while all the other packet.s are
not transmitted, is Uk (1- UdXk-IJ.
If we had perfect state information (i.e., the backlog Xk were known at
the beginning of slot k), the optimal policy would be to select the value of
Uk that. maximizes XkUk(1 - Uk)Xk- l , which is the success probability.t By
setting the derivative of this probability to zero, we find the optimal (perfect
state information) policy to be
1
/Lk(xk) = -,
Xk
for all Xk 2: 1.
t For a more det.ailed derivation, note that the DP algorithm for the perfect
st.ate information problem is
Jk(Xk) = gk(Xk) + min E{P(Xk,Uk)Jk+l(Xk + ak - 1)
u:::;uk:S 1 ak
+ (I - p(:q,uk)).hl-l(Xk + <Lk)},
where p(:rk,uk) is the success probabilit.y XkUk(1 _Uk)Xk- 1 . Since t.he cost. per
st.age Yk(:Ck) is au iucre<lliiug fuuction of the backlog Xk, it is clear t.hat each
cost.-to-go funet.ion A(Xk) is an increasing function of Xk (this cau also be easily
proved by induction), Thus Jk+l(Xk + <Lk) 2: Jk+l(Xk + ak - 1) for all Xk and
ak, based on which the DP algorithm implies that the optimal Uk maximizes
p(Xk, Uk) over [0,1].
Sec. 5.1
neduction to the PedeL .Ji"ormatiou Case
H!U
In practice, however, Xk is not. known (imperfect st.aLl' infol"lllatiou).
and the optimal control must be chosen on the basis of the available ob-
servatious (i.e., the cut ire chauucl history of successes, idles, alld collisions).
These observat.ions relate t.o t.he backlog history (t.he pa:o;t. sLIt.es) and t II(' past
transmission probabilit.ies (the past cont.rols), but. al'l:' ("O!Tupt.('!i by st.ocha:;t.ic
uncertainty. Mat.hematically, we may write an cquat.iou Zk+ 1 = /Jk 1-1, whcrc
Zk+l is the observation obtained a.t the end of t.he kth slot., and t.he ran-
dom variable Vk+l yields an idle with probabilit.y (1 - UkYk, a success wit.h
probabilit.y 'ck ILk( 1 - ltd""k -. I , and a. colli",ion ot.hcrwis(..
It can be seen t.hat t.his is a problem that fit.s t.he given imperfect. st.aLe
information framework. Unfortunately, the optimal solution to this problem
is very complicated and for all practical purposes cannot. be compnt.ed. A
subopt.imal solution will be discussed in Section 6.1.
Reformulation as a Perfect State Information Problem
\Vc now show how to effect the reduction from imperfect. l.o perfect.
state information. As in the discussion of state augmentation in Sect.ion
1.4, it is intuitively clear that we should define a new system whose state
at time k is the set of all variables the knowledge of which can be of benefit
to thc controller when ma.kiug the kth decision. Thus a [init candiuat.c a:;
the state of the new system is the information vector h. Indeed we will
show that this choice is appropriate.
We have by the definition of the information vector [ef. Eq. (1.2)]
h+l = (h, Zk+l, Uk),
k = 0,1,.. .,]V - 2,
10 = ZOo
(1.4)
These equations can be viewed as describing the evolution of a system of
the same natw'e as the one considered in the ba.sic pmblrm of Section 1.2.
The state of the system is h, the control is Uk, and Zk+l can be viewed as
a random disturba.nce. Furthermore, we have
P(Zk+1 I h, Uk) = P(Zk+l I h, Uk, Zo, Zl,..., Zk), (1.5)
since Zo, Zl, . . . , Zk are part of the information vector h. Thus t.he probabil-
ity distribution of Zk+l dcpends explicitly only on the state h and control
Uk of the uew system (1.4) and not on thc prior "disturbances" Zk,..., Zu.
By writing
E{Yk(."I:k, Uk, "Wk)} = E { E {9d:Ck, uk, "Wk) I h,I1.k} } ,
xk, 1V k
we can similarly reformulat.e t.he cost. fun<:l.ioll in t.erms of t.he variables of
the new system. The cost pcr stage as a function of the ncw state hand
the control Uk is
fjdh,Uk) = E {gk(xk,uk,Wk) I h,uk}.
'Ck,Wk
(1.0)
190
Problems with Imperfect State Informaiio,
Chap. 5
Sec. 5.1
HeduciiolJ to the PerL Information Case
l!H
_. J
Thus the basic problem with imperfect statc information has been
reformulated as a problem with perfect state information that involves the
system (1.4) and thc cost per stage (1.6). I3y writing the DP algorithm
for this lattcr problem and substituting the exprcssions (1.4) and (1.G), we
obtain
- 1
P(G 1 :c = }') = 4'
:\
P(lJ 1.1:= }') 4;
see Fig. 5.1.1. After each inspection one of two posiblc actions can [)('
takcn:
.IN-I (IN-I) = min [ E {9N(JN-I(XN-I,UN-I,WN-tJ)
tLN-IEUN_l XN-l'WN_l
+ f}N-I(XN-I, UN-I,WN-d 1 I N - J , UN-I}],
(1.7)
c: Continuc opcration of the machinc.
S: Stop the machinc, Jeterminc its state through an accuratc diagnost.ic
tcst, and if it is in thc bad slat.e P bring it back t.o t.he good stat.( I'
At each period there is a cost of 2 and ° units for starting t.11<' ]wri()d
with a machine in the bad statc P and the good state P, respectively. The
cost for taking the stop-and-repair action 5' is 1 unit and thc tenninal cost
is 0.
and for k = 0,1,. . . , N - 2,
Jk(h) = min [ E {9k(Xk, Uk, Wk) + Jk+1 (h, Zk+l, Uk) I h, u d ] .
ukEUk .r;k! wk, zk+l
( 1.8)
Thcse equations constitute one possiblc DP algorithm for thc imper-
fcct state information problem. An optimal policy {ILo,ILj,...,JLN_d is
obtained by first minimizing in the right-hand sidc of the DP Eq. (1.7) for
every possible value of the information vector IN -I to obtain JLN -I (IN -d.
Simultancously, .IN -I (IN -1) is computcd and uscd in the computation of
IN-2(IN-2) via thc minimization in the DP Eq. (1.8), which is carried
out for every possible value of IN-2. Proceeding similarly, one obtains
IN-3(IN-3) and JLN-3 and so on, until Jo(Io) = Jo(zo) is computeJ. The
optimal cost .1* is then obtaincd from
Figure 5.1.1 State transition diagram and probabilities of inspection outcomes
in the Inachinc repair cxan1plc.
\
\!i
J* = .e{Ju(zu)}.
ZO
(1.9)
Thc probkm is t.o dct."rmine th( policy that. minimi7.cs tlw exp('.cf.<'d
costs over the three timc periods. In other words, we want t.o jiIlll the
optimal action after the result of the first inspection is known, and aftcr
the results of the first and second inspections, a::; wcll as the action takcn
after the first inspection, are known.
It can be seen that this example falls within thc gencral framework
of the problem of this section. The state space consists of the two states P
and P,
I
j
Machine Repair Example
A machine can be in one of two statcs denoted P and P. State P
corresponds to a machinc in proper condition (good state) and state P to
a machine in improper condition (bad statc). If the machine is operatcd
for onc timc pcriod, it stays in statc P with probability 1 if it started in P,
and it stays in state P with probability 1 if it started in P. The machine is
operatcd for a total of three time periods and starts in state P. At the end
of the first and second timc pcriods the machine is inspected and there are
two possible inspection outcomes denoted G (probably good state) and B
(probably bad state). If the machine is in thc good state P, thc inspection
outcome is G with probability j if the machine is in the bad state P, the
inspcction outcomc is B with probability :
state space = {P, P},
and the control space consists of the two actions
I \
control space = {C, S}.
The system evolution may be described by introducing a system equa-
tion
Xk+1 = Wk,
k = 0,1,
whcrc for k = 0,1, the probability distribut.ion of Wk is givcn by
3
P(G I :r = P) = 4'
1
P(B I x = P) = 4'
2
P(Wk = P 1 Xk = P, Uk = C) = '3'
- 1
P(Wk = PI :rk = P,ILk '-.0 C) =},
_.1
192
Problems witlJ Imperfect State InFormatic.
ClJap. 5
P(lIJk = P I Xk = 15,1Lk = C) = 0,
2
P(Wk = PI Xk = P,Uk = S) = 3'
- 2
P(Wk = P I ;rk = P, Ilk = S) = 3'
P(Wk = 15 I ;rk = 15, 1Lk = C) = 1,
- 1
P(Wk = P I Xk = P, Uk = S) = 3'
1
P(Wk = 15! :l;k = 15,lI.k = 8) c- 3
We denote by Xo, XI, X2 the state of the machine at the end of the first,
second, and third time period, respectively. Also we denote by Uo the
action taken after the first inspection (end of first time period) and by 1LI
the action taken after the second inspection (end of second time period).
The probability distribution of :Z:O is
2
P(J;o = P) = 3'
- 1
P(xo = P) = -.
3
Note that we do not have perfect state information, since the inspec-
tions do not reveal the state of the machine with certainty. Rather the
result of each inspection may be viewed as a measurement of the system
state of the form
Zk = Vk,
where for k = 0,1, the probability distribution of Vk is given by
k = 0,1,
3
P(Vk = G I Xk = P) = 4'
- 1
P(Vk = G I :Z:k = P) = 4'
1
P(Vk = B I Xk = P) = -,
4
- 3
P(Vk = B I Xk = P) = -.
4
is
The cost resulting from a sequence of states Xo, Xl and actions uo, 1LI
g(Xo, uo) + g(XI' U]),
whcre
y(P, C) = 0,
g(P, S) = 1,
g(P, C) = 2,
g(15, S) = 1.
fo = ZO,
The information vector at times 0 and 1 is
h = (ZO,ZI,UO),
Hnd we seek functions ILo{Io), ILl (h) that minimize
E {g(xO,J.Lo(Io)) + g(:Z:I,J.LI(h))}
.rO,WO, tlJ l
lIU,VL
= E {9(xo,ILo(zo)) + g(XI,J.LI(ZO, Zl,JLO(ZO)))}.
XQ,WO,Wl
110,1 1 1
We now a.pply the DP algorithm.
(1.10)
I'
I
Sec. 5.1
Reduction to the Pel. InFormation Case
193
La.st Sta!le: \Ve use Eq. (1.7) t.o comput.e .1 1 (II) for ('i\(:h or t.ll(' eight.
possi ble information vectors f I = (zo, Zl, uo). For each of these vectors, we
shall compute the expected cost of the two possible actions, "ILl = C Hnd
UI = S, Hnd select as optimal the action with the smallest. cost. vVe have
'I
,I
;1
: il
cost of C = 2. P(XI = 15 lId,
cost. of S = 1,
and therefore
JI(h) = min[2P(xl = 15 I h), 1].
The probabilities P(Xl = 15 I h) can be computed by using Bayes' rule
and the problem data. Some of the details will be omitted. We have:
(1) For 1 1 = (G,G,S)
P ( x = 15 I G G S ) = P(X] = P,G,G,S)
I , , P( G, G, S)
.1 . 1. ( 1 ..Lt .1 . .1 ) 1
J 4 3 4 3 4
( 1. ;1 + 1. .1 ) 2 7
3 4 3 4
. I
11
I
I
:1
I
fl
II
I:
I
I
I
.
Hence
Jl(G, G, S) = ,
7
J.Li(G, G, S) = C.
(2) For h = (B, G, S)
- - 1
P(XI = P I B,G,S) = P(.""C] = PI G,G,S) =-,
7
2
JI(B,G,S) =-,
7
J.Li(B, G, S) = C.
(3) For h = (G, B, S)
P ( x = P 1 GB S ) = P(Xl = 15,G,B,S)
I , , P(G,B,S)
i. . (. + *. )
( 1. .1 + 1. ;1 ) ( 1 . ;1 + 1. .1 )
34343434
I'
;,
3
" '
v
h(G, B, S) = 1,
(4) For h = (B, B, S)
P(XI = PI B,B,S) = P(XI = 15 I G,B,S) =,
G
J.Li(G, B, S) = S. ,
JI(B,B,S) = 1,
J.Li(B,B,S) = S.
_ J
194
Problem:; wiih Imperfect State lnfoT1nat.ic
Clmp. 5
(5) :For h = (G,G,C)
P ( x = ]5 I G G C ) = P(:CI '-' r,G,G,C) =
1 , , P(G,G,C) G'
2
Jl(G,G,C) = 5' ILj(G,G,C) = C.
(G) For h = (B, G, C)
- 11
P(XI = PI B,G,C) = 23 '
22
J1(B,G,C) = 23 ' /lj(B,G,C) = C.
(7) For II = (G, B, C)
- !)
P(:rl = PI G,lJ,C) = 13 '
Jl(G,B,C) = 1, ILj(G,B,C) = S.
(8) For II = (B, n, C)
- :33
P(XI = PI B,n,C) = 37 '
Jl(B,n,C) = 1, ILj(n,B,C) = S.
SUllnllari"illg the results for t.he la,,:t. stage, t.he optimal policy is to
contillue (Ul = C) if the result of the last inspection was G, <Llld to stop
(Ul = S) if the result of the last inspection was B.
First Stage: Here we use the DP Eq. (1.8) to compute Jo(fo) for each
of the two possible illforlllation vectors 10 = (G), 10 = (B). We have
cost of C = 2P(xo = ]5 I la, C) -1- E{ Jl(fO, ZI, C) I 10, C}
Zl
= 2P(.LO = ]5 I 10, C) + P(ZI = G I la, C)J 1 (fa, G, C)
+ P(ZI = B I 1 0 , C)Jl (fo, n, C),
cost. of S = 1 + E{ Jl(fo, ZI, S) I 1 0 , S}
ZI
= 1 +- P(ZI = G I la, S).1 1 (fa, G, 8) +- P(ZI = n I 1 0 , S)Jl(fO, n, 8),
alld
.1 0 (10) = min[2P(1;O = ]5 I Io,C) +- E{JI(fo,zl,C) I Io,C},
21
1 +E{Jl(f0,ZI,8) I Io,S}]
Zl
Sec. 5.1
Hedllctjon to the Pc t 1nfurnwtiolJ CIl:;C
(1) For 10 = (G): Direct calculation yields
1G
P(ZI = G I G,C) =-,
28
lJ
P ( ZI = lJ I G,C ) =-,
28
7 [)
P(ZI = G I G,8) = -, P ( ZI = B I G,S) =-,
12 12
- 1
P(:ro = P I G,C) =-,
7
alld hellce
[ 1 15 13
Jo(G) = lllin 2. - + - 8 J1(G,G,C) + -.lI(G,n,C),
7 2 28
7 5 ]
1 + - 2 J1(G,G,8) + -JI(G,B,8) .
1 12
Using t.he values of J 1 obtained ill the previous stage
. [ 1 15 2 13 7 2 5 ]
Jo(G) = nun 2. - + - . - + - . 1 1 + - . - + - . 1
7 28 5 28' 12 7 12
[ 27 19 ] 27
= min 28 ' 12 = 28 '
J ( G ) = 27
a 28'
/lo(G) = C.
(2) For In = (B): Direct calculatioll yields
23
P(ZI = G I B, C) = -,
60
37
P(ZI = niB, C) = -,
GO
7 [)
P(ZI = G I B,S) = _ 2 ' P(ZI = n I n,8) =-,
1 12
- 3
P( Xo = P I B, C) = -,
5
and
. [ 3 23 37
Jo(B) = mlIl 2. - + -Jl(B, G, C) + -.lI(B, lJ, C),
5 60 60
1 + .h(B,G,s) + .11(n,n,S) ] .
12 12
Using the values of Jl obtained in the previous state
,
': ;
J ( B ) = min [ 19 ] = 19
o GO ' 12 12'
195
r:
I,
,
i'
I
I
I,
Ii:
I:
I
I'
I;
.t
I:
I
r
I;
I'
I
I!
I'
i:
II
I
\'
I:
I:
II
!
t!
!:
j'
!:
I
!
_ .J
196
Problems with Imperfect State In[ormath
Chap. 5
Sec. 5.2
Linear Systems and
adratic Cost
197
19
./o(B) = 12 '
I
1j,(B) = s.
where Zk E ?s, Ck is a given s x 11. matrix, and 'Ilk (: ., is all obs('r\"atioll
noise vector with given probability distribution. Furthermore, the vectors
Vk are independent, and independent from Wk and Xo as well. W" lIIake
the same assumptions as in Section 4.1 concerning the input uisturbances
Wk, Le., that they are independent, zero mean, and that they have finite
variance. The system matrices Ak, Bk are known; there is no analyt.ical
solution of t.he imperfect. information counterpart of the moue! with ranuolll
system matrices considered in Section 4.1.
From the DP Eq. (1.7) we have
I:
I:
I.:
Summarizing, the optimal policy for both stages is to continue if the
result. of the latest inspection is G, and to stop and repair otherwise.
The optimal cost is
J* = P(G).Jo(G) -I- P(B)./o(B).
We can verify that P( G) = f,;, and P( B) = f2, so that
7 27 [) 19 176
.1* = 12 . 28 -I- 12 . 12 = 144 '
-I- E {w'tv_lQNWN-d
WN-I
i
Ii
r
I.
I
I
l i, . .
j
i
Ii
II
In the above example, the computation of the optimal policy amI the
optimal cost by means of the DP algorithm (1.7) and (1.8) was possible
because the problem Willi very simple. It is eilliY to see that for a more com-
plex problcm, the computational requirements of the DP algorithm can be
prohibitive, part.icularly if the number of possible information vectors h is
large (or inflllite). Unfortunately, even if the control and observation spaces
are simple (one-dimensional or finite), the space of the information vector
h may have large dimension. This makes the application of the algorithm
very difficult or computationally impossible in many cases. However, there
arc problems where an analytical solution can be found, and the next two
sections deal with such problems.
.IN-l (IN-I) = lIlin [ E {.L'rv_lQN-lXN-l -I- U'rv_lRN-1UN-l
uN_l xN-l,wN_l
-I- (AN-l:/:N-l -I- BN-IUN-l -I- 'lIJN-I)'
. QN(AN-lJ:N-l -I- BN-1UN-l -I- WN-I) I IN-I}]
Since E{WN-l I IN-I} = E{WN-d = 0, this expression can be written as
.IN-l(IN-I) = E {:t:'tv_l(A'rv_1QNA N - 1 -I-QN-d;!:N-l I IN-I}
XN-l
The minimization yields the optimal policy for the last stage:
-I- min [u'tv_l(B'tv_lQNBN-l -I- RN-J)UN-I
tlN-l
-I- 2E{XN-l I IN-d' A'rv_1QNBN-lUN-I].
(2.1)
5.2 LINEAR SYSTEMS AND QUADRATIC COST
We will show how the DP algorit.hm of the preceding section can
be used to solve the imperfect state information analog of the linear sys-
tem/quadratic cost problem of Section 4.1. We have the same linear system
UN-l = IlN-l(IN-l)
= -(B'tv_lQNBN-l -I- RN-I)-lB'tv_1QNAN-lE{XN_l I IN-d,
(2.G)
but now the controller does not have access to the current state. Instead
it. receivcs at the beginning of each pcriod k an obscrvation of the form
where the matriccs K N - l and PN-l are given by
(2.6)
. 1 .
;.
r.
I:
t
It
i'
I
I:
I)
Ii
i
and upon subst.itution in Eq. (2.4), we obtain
Xk+l = AkXk -I- BkUk -I- Wk,
k=0,1,...,N-1,
(2.1)
.IN-l(IN-I) = E {X'tv_lKN-1XN-l I IN-d
XN_l
and quadratic cost
-I- E {(XN-l - E{XN-l I IN-d)'
XN-l
{ N-l }
E X'tvQNXN -I- (XQkXk -I- uicRkuk) ,
(2.2)
. PN-l(XN-l - E{.LN-l I IN-d) I IN-d
-I- E {w'tv_lQNWN-d,
WN-l
Zk = CkXk -I- "Uk,
k = 0, 1,...,N - 1,
(2.3)
P N - 1 = A'rv_IQNBN-l(RN-1 -I- B'rv_lQNBN-I)-IBN_IQNAN I,
K N - l = A'rv_lQNAN-l - P N - l -I- QN-l.
_ J
!'
IU8
Problems wiil1 Imperfeci Siaie lnfonnaiic
Chap. 5
Sec. 5.2
Linear Sysiems ill. Juadraiic Cosi
IUU
I!
Note that the opt.imal policy (2.G) is identical to its perfect. st.ate
illformation counterpart except that x N _I is rcplaced by its cOlHlitional
expect.ation E{XN-l I IN-d. Note also that the cost-to-go IN-l(IN-tJ
exhibits a corresponding similarity to its perfect state information counter-
part. except t.hat. J N -I (IN _ tJ cont.ains an addit.iollal middle term, which is
ill efrect a penalty for estimation error.
Now the DP equation for period N - 2 is
'/N-2(IN-2) = min [ E {.T'rv_2QN-2;CN-2 + U'rv_2 RN -2 1I N-2
UN-2 XN_2'WN_2'ZN_l
+'/N-l(I N - l ) I I N - 2 ,UN-2}]
= E{;C'rv_2QN-2 X N-2 I I N -2}
+ min [u'rv_2RN-2'UN-2 + x'rv_lJ(N-l;CN-l I I N - 2 }]
UN-2
alld when their system disturbance and observation noise vectors ,11"(' alsu
identical,
'WI.: = 'Wk,
'Uk = 'lfJ,;,
k = 0,1,... , N - 1.
Consider the vectors
Zk = (zo"",Zk)',
\lVk = (wO,...,Wk)',
k
Z =(zo,...,Zk)',
Vk = (vo,...,Vk)',
Uk = (uo, . . . , Uk)'
{
I,i
F
!'
Linearity implies the existence of matrices P k , Gk, and fh such t.hat.
Xk = Fk.TO + GkUk-l + JfJ,;Wk-l,
Xk = FiJo + HkWk-l.
t-E{(:I;N-l - E{;CN-l I IN-d)'
,PN-l(.TN-l-E{XN-ll IN-d) I IN-2,1LN-2}
+ E {w'rv_lQNwN-d.
WN-I
Since the vector Uk-l = (uo,... ,uk-d' is part of the information vector
h, we have Uk-l = E{Uk-l I h}, so
E{Xk I h} = }kE{XO I h} + GkUk-l + fhE{Wk-l I h},
E{Xk I h} = FkE{xo I h} + HkE{Wk-l I h}.
I:,
' I:,
I::
(2.7)
Note that we have excluded the next to last term from the millimization
with respect to UN -2. We have done so since this term turns out to be
independent of UN-2. To show this fact, we need the following lemma.
The lemma says essentially that the quality of estimation as expressed
by the statistics of the error Xk - E{Xk I h} cannot be influenced by the
choice of control. This is due to the linearity of both the system and the
mea.<-;uremellt cquation.
We thus obtain
Xk - E{Xk I h} = Xk - E{Xk I h}.
From the equations for Zk and Zk, we see that
-k
Z = Zk - RkUk-l = 5k Wk-l + Tk Vk,
Lemma 2.1: For every k, there is a function A1J,; such that we have
where R k , 5k, and '11 are some matrices of appropriate dimension. Thus,
the information provided by h = (Zk, Uk-l) regarding Xk is summarized
in Zk, and we have E{Xk I h} = E{Xk I Zk}, so that
.Tk - E{Xk I h} = Mk(XO, Wo,. .., Wk-l, vo,..., Vk),
-k
Xk - E{Xk I h} = Xk - EfTk I Z }.
independently of the policy being used.
The function Mk given by
We can now justify excluding the term
i
\
I,.
L
"
I
if
J!
I
I'
)1
j:1
Proof: Fix a policy and consider the following two systems. In the first
systcm thcre is control as determined by the policy,
Xk+1 = AkXk + BkUk + Wk, Zk = GkXk + Vk,
while in the second system there is no control,
Xk+l = AkXk + 'iiJJ,;, Zk = GkXk + 'i:i}.;.
We consider the evolution of these two systems when their initial conditions
are identical,
J\1J.:( x o, wo,..., Wk-l, Vo,..., vJ.:) = Xk - E{Xk I Zk}
serves the purpose stated in the lemma. Q.E.D.
E{(XN-l-E{.TN_l I IN-d)'PN-l(XN-l-E{XN_11 IN-d) I IN-2,UN--2}
from the minimization in Eq. (2.7), as being independent of UN-2. Indeed,
by using the lemma, we see that
Xu = Xu,
XN-l - E{;J;N-l I IN-I} = N-I,
!,
_ J
200
Problems with Imperfect State Informatic
Chap. 5
Sec. 5.2
Lincar Systems ant. ouadratic Cost
201
where N -I is a function of 1:0, Wo, . . . ,11J N -2, "/10, . . . ,"/IN-I. vV( have' that.
N-l as well as IN-2 are independent of UN-2, so the condit.ional expec-
tation of eN _I PN -I N -I satisfies
/LjJh) = LkE{:Ck I h},
(2.8)
Furt.hel"lilore, t.he gaill mat.rix Lk is independent. of" t.he st.a(,ist.i(" of" t.lH'
problem and is the same as the one t.hat would be used if we were faced
with the deterministic problem, where Wk and 1:0 wOlllcl be fixed ami eqllal
to their expecteu values. On the other hand, as shown in Appendix E, the
estimatei: of a random vector :r given some informat.ion (randolll v(:clor)
I, which minimizes the mean squared error E.r{Il.1: - il1 2 II} is precisdy
the conditional expectation E {x I I} (expand the quadratic form and set
to zero the derivative with respect to x). Thus t.he cstimator poTtiun of {he
optlmal controllcr is an optimal sollLtion of the problem of estimatil/g the
state Xk asslLming no control takes place, while the acbwtor portion is an
optimal sollLtion of the control problem asslLmin!/ perfect statc info1'1naUo1/.
prevails. This property, which shows that the two portions of the optimal
controller can be designed independently as optimal solut.ions of an est.i-
mation and a control problem, has been called the separation theorem for
linear systcms and qlLadratic cost and occupies a central posit.ion in model'll
automatic control theory.
ii
:t
E{'tv__JPN-lN-1 I IN-2, UN-2} = E{'tv_1PN-lN-l I IN-2}.
Retuming now to our problem, the minimization in Eq. (2.7) yields,
llsing an argument similar to the olle for the last. stage,
ll N _ 2 = /L N _ 2 (IN-2)
= -(RN-2 + B'tv_2J(N-IBN-2)-IBN_2J(N-IAN-2E{XN-2 IIN-2}.
We can proceed similarly to obtain the optimal policy for every st.age:
Uk
Lk
E{Xk"J
Zk
I"
i'
1.1
I;
ii
I'
I,
I
I'
f'((
L;
1:;
I
I..
i
I
i
where the matrix Lk is given by
Lk = -(Rk + BkJ(k+lBk)-IB.f(k+1Ak,
(2.9)
W k
Vk
with the matrices J(k given recursively by the Riccati equation
Xk
Zk
Xk+ t = A.0k+ 8IPk+ Wk
Zk= Ck"k+ Vk
Uk
KN = QN,
n = AV<k+IBk(Rk + BkKk+1Bk)-IBkKk+1Ak,
J(k = AkJ(k+1Ak - Pk + Qk.
(2.10)
(2.11 )
The key step in this derivation is t.hat at stage k of the DP algorithm, the
minimization over Uk that defines Jdh) involves the additional term:>
E{ (1:s - E{xs lIs})' Ps(xs - E{1: s I Is}) I h,llk},
Figure 5.2.1 Strudure of the optimal controller for the linear-quadratic prob-
lem. It consists of an estimator, which generates the conditional expectation
E{Xk I h}, and an actuator, which multiplies E{Xk I h} by the gam matrix f'k'
where s = k + 1, . . . ,N - 1. Dy using thc argument of the proof of t.he
earlier lemma, it can be seen that none of these terms depends on Uk so
that. the prcsence of these terms docs not affect the minimization in the
DP algorithm. As a result, the optimal policy is the same as the one for
the perfect information case, except that the st.ate Xk is replaced by its
conuitional expectation E{.1:k I h}.
It is interesting to note that the optimal controller can be decomposed
into the two parts shown in Fig. 5.2.1:
(a) An csti71wtO'l', which uses the data to gellcwte the conditional expec-
tation E{:rk I h}.
(b) An actuatur', which multiplies EVk I h} by t.he gain lllatrix Lk and
applies the control input Uk = LkE{:Ck I h}.
Another interesting observation is that the optimal controller applies
at each time k the control that would be applied when faced with the
deterministic problem of minimizing the cost-to-go
N-l
X't,QNXN + L (X;QiXi + ll;Ri U i ),
i=k
and the input dist.urbances Wk, Wk+l,"', 11JN-l and current. tat.e Xk were
known and fixed at their conditional expected values, which are zero and
E{ Xk I h}, respectively. This is another manifestation of the certa.inty
equivalence principle, which was referred t.o in Section ,1.1. A similar result.
holds in the case of correlated disturbances; see Exercise 5.1.
i
,
"
I':
,
_..1
202
Problems with Imperfect State InformatiOl.
Chap. 5
Implementation Aspects - Steady-State Controller
AH explained in t.he pcrfect information case, t.he lincar form of Lll( ac-
t.uator portion of the opt.imal policy is part.icularly attractive for implemcn-
t.ation. In the imperfect. informat.ion ca.o.;c, however, we lHed t.o implclllcnt.
an cstimator that produces the conditional expectation
Xk = E{Xk I h},
and this is not easy in general. Fortunat.ely, in thc important spccial case,
where the distu7'bances Wk, Vk, and the indial state Xo a're Gaussian mnrlum
vectors, a convenient implementation of thc estimator is possible by means
of thc Kalman filtcring algorithm, which is developed in Appendix E. This
algorithm h; organizcd rccursively so that to producc Xk-H at time /.; + 1,
only thc most recent measurement Zk+l is needed, together with Xk and
Uk. In particular, we have for all k = 0,.. ., N - 1,
Xk+l = Ak:h + BkUk + I:k+llk+lq+INk1 (Zk+l - Ck+I(AkXk + Bknd),
(2.12)
and
;/;0 = E{xo} + I:oloCbNol(zo - CoE{xo}) ,
(2.13)
where the matrices I:kl k are precomputable and are givcn recursively by
L:k+lik+l = L:k+llk - L:k+llkCk+1 (Ck+Ik+llkCk+1 + Nk+r)-ICk+1 L:k-Hlk,
(2.14)
I: k +lI k = AkL:klkA" + !.h, k = 0,1,..., N - 1, (2.15)
wit.h
I:ol o = S - SCb(CoSCQ + NO)-ICOS.
In thcse equations, Mk, Nk, and S are the covariance matrices of Wk, 'Uk,
and Xo, respectively, and we a.sume that Wk and Vk have zero mcan; that
is
E{wd = E{Vk} = 0,
M k = E{WkWk}, Nk = E{'Uk'U k },
S = E{(xo - E{xo})(xo - E{xo})'}.
In addition, the matrices Nk arc assumed to be positive definite.
Consider now thc case where the system and measurement equations
and the disturbance statistics are stationary. We can then drop subscripts
from the system matrices. Assume that the pair (A, B) is controllable and
that the matrix Q can be written as Q = F' F, whcre F is a matrix such that
thc pair (A, F) is observable. By the thcory of Section 4.1, if the horizon
tends to infinity, the optimal controller tcnds to the steady-state policy
J1*(h) = L.ik,
(2.16)
Sec. 5.3
lIIinimullJ Fariance ,Itro] of Linear Systems
203
wlwre
L = -(R + B'I<B)-IB' I<A,
(2.17)
and. j( s the uniqc positive scmidefinitc symmctric solut.ion of the al).!;l'-
bralc H,lCcat.1 cquat.lon
',I
"
f( = A' (K - KB(R + B'J(B)-IB'J() A + Q.
(2.18)
By a similar argument, it can be shown (see Appendix E) t.hat. .1',. can
be genera Leu in t.hc limit as k --+ 00 by a steauy-state Kalman filt.ering
algorithm:
,
.'
"
I
I;
I
I
i.
i'
j p
'
I
!
i
I
I:
I
Xk+l = (A + BL)Xk + I;C' N-I (Zk+l - C(A + BL)Xk),
(2.10)
whcre I; is given by
I; = I: - I:C'(CI:C' + N)-lCI:,
(2.20)
anu I: is t.hc unique positive semidefinitc symmctric solution of the RiccaU
equation
I: = A(I: - I:C'(CI:C' + N)-ICE)A' + M.
(2.21)
The assumptions required for this are that the pair (A, C) IS obscruablc
and that the matrix NI can be written as !vI = DD', whcTe D is 0. mr!tl'i:r;
such that the pair (A, D) is contmllable. The steady-stat.e controller of
Eqs. (2:16), (2.17), and (2.19) is particularly attractive for practical illlple-
mentatIon. Furthermore, as shown in Appendix E, it rcsult.s ill a st.able
closcd-loop system, under t.he preceding controllability and obscrvability
assumptions. .
i,
I.
(.'
I
.'
I:
l-
I
I
I
5.3 MINIMUM VARIANCE CONTROL OF LINEAR SYSTEMS
'Are havc considered so far the control of linear systems ill state vari-
able form as in thc prcvious scction. However, linear systems arc often
modeled by means of an input-output equation, which is morc economical
in the number of parametcrs needed to describe the system dynamics. In
this section we consider single-input, single-output, linear, time-invariant
systems: an.d a sp:cial typc of quadratic cost function. The rCHultin).!; opt.i-
mal pohcy IS particularly simple ctnd has found wide applicat.ion. \Ve first
introduce some of the basic facts regarding linear systems ill input-out.put.
form. Detailcd discussions may be fouud ill [AsW84] , [AsW90], [GoS84],
and [Whi63].
I
: I
- J
204
Problems with Imperfed State Informatio,
Chap. 5
See. 5.3
Minimum Variance
liro] of Linear Sysiems
205
We consider a single-input single-out.put discrde-tinw linear syst.em,
which is specified by an equation of the form
or
Yk + alYk-l +... + amYk-m = bouk + b1Uk-l +... + bmuk-m, (3.1)
where ai, hi are given scalars. The scalar sequences {1Lk I k = 0, :1.:1, :1.:2,.. .},
{Uk I k = 0, :1.:1, :1.:2,.. .} arc viewed as the input and output of the syst.em,
respectively. Note that we allow time to extend to -00 as well as +OOj
this will be useful for describing generic properties of the system relating
to stability. We williatcr revert to our usual convention of starting at time
o and proceeding forward.
It is convenient to describe this type of system by means of the
backward shift operator, denoted 05, which when operating on a sequence
{:Ck I k = 0, :1.:1, :1.:2,. ..} shifts its index by one unit; that is,
A(s)
B(s) Yk = Uk.
The meaning of both equations is that the sequences {yd and {'ud are
related via A(S)Yk = B(S)Uk' There is a certain ambiguity here in that,
for a fixed {ud, the equation A( S )Yk = B( S )Uk ha, an infinit.e number of
solutions in {yd. For example, the equation
Yk + aYk-l = Uk
f;
I:
: I .'
I...'-j
1[,
. : 1 ' i
i :j
II;'
1.1
f
i
I
o51'(Xd = Xk-1',
k = 0, :1.:1, ::1::2,.. .
(3.2)
for Uk == 0 has as solutions all sequences of the form Yk = {3( -a)k, where
{3 is any scalar; the solution becomes unique only after some boundary
condition for the sequence {Yk} is specified. As will be discussed shortly,
however, for stable systems and for a bounded sequence {ud there is a
unique solution {yd that is bounded. It is this solution that will be denoted
by (B(05)/A(s))Uk in what follows. The reader who is familiar with linear
dynamic system theory will note that B(s)/ A( 05) can be viewed as a tmnsfe1'
function involving z-transforms.
We now introduce some terminology. When the sequences {Yd and
{ud satisfy A( s )Yk = B (05 )Uk, we say that Yk is obtained by passing Uk
throu!/h the filter B (05) / A (s). This comes from engineering terminology,
where linear time-invariant systems arc commonly referred to as filters.
We also refer to the equation A(S)Yk = B(S)1Lk as the filter B(s)/A(s).
A filter B(s)/A(s) is said to be stable if the polynomial A(s) has all its
(complex) roots strictly outside the unit circle of the complex plane; that
is, Ipl > 1 for all complex p satisfying A(p) = O. A stable filter B(s)/A(s)
has the following two properties:
(a) Every solution {yd of
S(Xk) = :I:k-l,
k = 0, ::1::1, 1:2,...
We denote by 051' the operator resulting from l' successive applications of 05:
A(S)Yk = 0
We also write for simplicity o51'Xk = :Ck-1" The forward shift opcmtor,
denoted 05- 1 , is the inverse of 05 and is defined by
S-l(Xk) = Xk+l,
k = 0, ::1::1, :1.:2,...
Thus the notation (3.2) holds for all integers 1'. We can forlll linear com-
binations of operators of the form s1'. Thus, for example, the operator
(05 + 205 2 ) is defined by
(05 + 2s2)(Xk) = Xk-l + 2Xk-2,
k = 0, :1.:1, :1.:2,. . .
A(S)Yk = B(s)uk,
(3.3)
satisfies limk-+oo Yk = OJ that is, the output Yk tends to zero if the
input sequence {ud is identically zero.
(b) For every bounded sequence fad, the equation
With this notation, Eq. (3.1) can be written as
where A( s), B (05) arc the operators
A(S)Yk = B(05)'fik
A(s) = 1 + alS +... -1- a",sm,
(3.4)
has a unique solution W.,} within the class of bounded sequences.
Furthermore, every solution {yd (possibly unbounded) of the equa-
tion satisfies
B(s) = bo + blS +... + b",s"'. (3.5)
Sometimes it is convenient to write the equation A(S)Yk = B(S)Uk as
lim (Yk - Tlk) = O.
k-+oo
(3.G)
For example, consider the system
B(s)
Yk = A(s) Uk
Yk - 0.5Yk-l = Uk.
_' J
206
Problems with Imperfect State In[ormatio,
Chap. 5
Given the bounded input sequence Uk = {.. . , 1, 1,1, . . .}, the ,;et of
all solutions is given by
(3
Yk = 2 + 2 k '
wherc (1 is a scalar, but. of these the only bounded solut.ion is ilk
{.. . ,2,2,2,. . .}. The solution {yd can thus be naturally associated
with the input sequence {Uk}; it is also known as the forced n,s]uJnse
of the system due to the input {ud.
ARMAX Models - Reduction to State Space Form
We now consider a linear system with output Yk, which is driven by
two inputs: a random noise input fk, and a control input 'ILk. It ha",; the
forlll
Yk + alYk-1 +... + amYk-m = bl1Lk-1 +... + bmUk-",
+ f.. + Clfk-l +... + C",fk-m,
(3.7)
and it is known as an ARMAX model (AutoRegressive, Moving Average,
with eXogenous input). We assume throughout that the random variables
fk arc mutually independent. We can write the model in the shorthand
form
A(S)Yk = B(S)Uk + C(S)fk,
wherc the polynomials A(s), B(s), and C(s) arc given by
A(s) = 1 + alS +... + ams m ,
B(s) = /1105 +... + b",s"',
C(s) = 1 + C1S +... + c",s"'.
The ARMAX model is very common ancl its derivation is outlined
in Appcndix F, where it is shown that. wit.hout loss of generalit.y we can
assume that C(s) has no roots strictly inside the unit circle. For much
of t.he analysis in subsequent sections, it will be necessary to exclude the
critical case where C(s) has roots on the unit circle and assume that C(s)
has all its roots strictly outside the unit circle. This assumption is satisfied
in most practical cases.
In several situations, analysis and algorithms relating to the ARMAX
model are greatly simplified if C(s) = 1 sO that the noise terms C(s )fk = Ck
arc independent. However, this is typically an unrealistic assumption. To
emphasize this point and see how easily the noise can be correlated, suppose
that we have a first-order system
:Ck+1 = aXk + 1lJk,
See. 5.3
'I
d
it
[I
I,f
f
;
II
Ii
II
l
Minimum Variance .1t1'Ol of Linear Syste1lls
207
where we observe
Yk = Xk + Vk.
Then
Yk+l = .Lk+l + Vk+1
= (U;k + "llIk + Vk t I
= a(Yk - lid + Wk + lik./.I,
so finally
Yk+l = aYk + Vk+l - (LVk + Wk.
However, the noise sequence {Vk+l - aVk + wd is correlated even if {lid
and {wd are individually and mutually independent..
The ARMAX model (3.7) can be put into state space form. The
process is ba1ied on state augmentation and can perhaps be best understood
in terms of an example. Consider the system
Yk + alYk-l + a2Yk-2 = h U k-l + b2Uk-2 + fk + CIEk-l. (3.8)
We have
( YEI ) ( -r
Ek+1 0
-a2 b 2
o 0
o 0
o 0
'i'
Cl ) ( Yk ) ( bl ) ( ck+ I )
o Yk-I 0 0
O + 1 Uk + 0
Uk-I
o Ek 0 Ek+ 1
(:3.U)
'I'
'i,
i
[i,
t.,
t: ;
.1\,
i . F;;;
' ...... ; . '
: ii,
,
/'1.
, i
'j
i
i
:
I':
\:,
1.,-:
I"
j
jd
j:i
!
I!i
HI
I"
!
,
.,
By setting
( Yk )
Yk-l
Xk = ,
Uk-l
Ek
1lJk =
( Ek+ I )
Ekl '
Bn}
L(T
-a2 b2
o 0
o 0
o 0
)
o '
o
we can write Eq. (3.9) as
Xk+l = AXk + BUk + 1lJk,
(:UO)
, .
, .
where {1lJd is a stationary independent process. We have arrived at this
state space model through state augmentation. Notice that the state Xk
includes Ek. Thus if the controller is assumed to know at time k only the
present and past outputs Yk,)/k-l,"', and past controls Uk-I, Uk-2, .. .
(but not Ek, ck-l,. . .), we are faced with a model of imperfect state infor-
mation. If Cl = 0 in Eq. (3.8) then the state space model can be simplified
so that
( Yk )
Xk = Yk-l ,
Uk-l
_. J
E {(Yk)2} .
Minimum Variance ContI... of Linear Systems
Minimum Variance Control: Perfect Stat.e Information Case
!II.- = (k
! f'
209
(3.1G)
,
I
(:\. Hi)
208 Problems with Imperfect State InformatioJ Chap.
y
,
in which ca8C we have perfect. 8tate information. More generally, we have.
perfect state illfol'Illation in the ARMAX model (3.7) if bt i= 0 and Cl .
Cz = . . . = Crn = O.
Ii(s) = hI + hzs +... + hms lll - J = s-llJ(s),
The resulting closed-loop system is
(3.7):
N E{(Ek)2}.
Consider t.h( perfect st;lI.( informatioll case of t.he ARMAX mod
Yk + alYk-J +... + amYk-m = hl"U.k-l +... + h m "/1,k-m + EI.-,
tice that the optimal policy, called minimum val'iance control law, IS
e invariant and does not depend on the horizon N.
where hI i= O. A problem of interest, known as the minimum varia
conl1'ol p1'Ohlem., is to select Uk as a function of the present and past outpu
Yk, Yk-l, . . ., as well as t.he pa8t. controls Uk-I, Uk-Z,' . . , so as to mi . .
the cost
£k
1
A(s)
There are no constraints on 11,k. By transforming the system to state s
form, we 8ce that this problem can be reduced to a perfect state informati
linear-quadratic problem where the state :Ck is
Uk
B(s)
A(s)
+
Yk
(Uk, Uk-I,. " , Yk-m+l, Uk-I,' . . , Uk-rn+l)'.
The problem is of the same nature as the linear-quadratic problem of,
tion 4.1 except that the corresponding matrices Rk in the quadratic
function are zero here. Nonetheless, in Section 4.1 we used the invertib' .
of Rk only to ensure that variou8 matrices in the optimal policy and
Riccati equation arc invertible. If invertibility of these matrices can
guaranteed by other means, the same analysis ;tpplies even if Rk is
tive semidefinite. This is indeed the case here. An analysis analogous
the one of Section 4.1 shows that the optimal control uk at time k ( .
YI.-, Yk-l,..., Yk-rn+1 and Uk-I,..., Uk-m+l) is the same as the one
would be applied if all future disturbances (k+1,' . ., EN were set equal
zero, their expected v;11ue (certainty equivalence). It follows that
Uk
B(s)
;\(s)
Yk
Figure 5.3.1 Minimum variance control with pcrfect state information. Structure
oCthe optimal closed-loop system, where A(s) = 1 + ajS +... + amS m , 8(s) =
.6I"+'" + bms m , A(s) = S-l (A(s) - 1), and B(s) = s-l 8(s).
1
1 1 "k(Yk,..., Yk-m+l, uk-I,..., Uk-m+J) = b";(a1Yk +... + a m Yk-m+1
- hZUk-l - ... - bmUk-m
Whereas the optimal closed-loop system as given by Eq. (3.16) is
.... ly stable, we can anticipate serious difficulties if the filter A(s)/ll(s)
.:J%... -
i.e feedback loop is unstable. For if B(s) has some roots inside the
circle, then the sequence {ud will tend to be unbounded. This is
rated by the following example.
and {ILk} is generated via the equation
pIe 3.1 (An Optimal but Unstable Controller)
\ I
, I
,I
J
In other words, {uk} is generated by passing {yd through the linear
A(s)/B(s), where
A(s) = 0,1 + azs +... + a m s m - 1 = s-I(A(s) -1),
Yk + Yk-.I = Uk-l - 2U'k-2 + (k.
:::(
. i
i
, ,
I
'1
hlU'k+hZ/L'k_1 +.. .+bmU'k_m+l = al)Jk+o,z)Jk-l +.. .+amYk-m+l'
Uk = 2Uk-l + Yk
- J
210
Problems with Imperfect State lnformat.ior.
Chap. 5
and thc optimal closed-loop systcm is
Yk = Ck,
which is a st.ahlc sysl.cll1. 011 thc otlH'r hand, tlJ(' la.t. t.wo <,qllat.ions .vi"ld
Uk = 2Uk_1 + Ek.
TIlliS, Uk is generatcd by an unstable syst.em, alld in fad it. is givclI hy
k
Uk = L 2 u Ck_n..
n=O
Thereforc, cvcn t.hough thc out.put Yk stays bounded, the control Uk typically
becomcs unbounded.
For another view of the same difficulty, suppose that the coefficients
b I , . . . , h m of JJ (s) are slightly different from the ones of the true system.
We will show that if the feedback filter A( s) III (s) is unstable, then the
closed-loop system is also unstable in the sense that both Uk and Yk become
unbounded with probability one.
Assume that the system is governed by
AO(S)Yk = BO(S)Uk + Ek,
(3.17)
while the policy is calculated under the assumption that the system model
IS
A(S)Yk = B(S)Uk + Ek,
where the coefficients of A(s) and B(s) differ slightly from those of AO(s),
--.-0 -0
BO(s). Define A (s), 13 (s) by
--.-0
1 + sA (s) = AO(s),
(3.18a)
s]3°(s) = BO(s). (3.18b)
--0 - -0 -
Note l,hat A (s) = A(8) and B (8) = B(8) if AO(8) = A(s), BO(s) = B(s).
I3y nmlt.iplyillg Eq. (3.17) wit.h ll(s) and by using the relation defining the
optimal policy
B(S)Uk = A(S)Yk'
we obtain
B(s)AO(s)Yk = BO(s)A(s)Yk + ll(S)Ek'
--.-0 -0 - -
If the coefficients of A (s) and B (s) arc close to those of A (s), B (s), then
t.he roots of the polynomial
ll(s) + s(ll(s)jf1(s) -llo(s)A(s))
, '
Sec. 5.3
Minimum Variance l .I.rol uf Linear Systems
211
are close to t.he roots of ll(,). Thus the closed-looJl systclI/. is s(lIblt. only if
the roots o{IJ(s) arc outside the unit circle, or 0fjllivalcllt.ly, if awl only if t.!J('
filter A(s)/IJ(s) is stable. If our model is exact., t.he closed-loop syst.elll will
be stable due to what is commonly referred to as a j!olc-zem call.cellat.iulI,
Howev(,r, 1.11(' slighl.(,st. Ill()(!..lillg discrq);l.ll<'Y will illdllc(' illsl."hilil.y.
The conclusion from the preceding analysis is that t.he minilllulli vnri-
ance control law is advisable only if it can be realized through a stabh, tilter
[ll(s) has root.s outside the unit circle]. Even if B(s) has its roots out.side
the unit circle, but some of these roots arc near the unit circle, t.lw perfor-
mance of the minimum variance policy can be very sensitive to variatiolls
in the parameters of the polynomials A(s) and B(s). Olle way to overc:onle
this sensitivity is to change the cost t.o
E {((Yk)2 + R(Uk_I)2) } ,
where R is some positive parameter. This requires solution Viii !.lw Hic:cati
equation as in Section 4.1. For a detailed derivation, see [Ast83].
In some problems, the system equation includes an additional ext.ernal
input sequence {Vk}, the values of which can be meilSuwd by t.he controlkr
as they occur. In particular, consider the scalar system
,.'
I
,
:J!
Yk + UIYk-1 +... + amYk-.m = bPLk-l +... + b m 1L k-m
+ dlVk-l +... + dmVk-.m + fk,
where each value Vk becomes known to t.he controller wit.hout. error at tinw
k. The minimum variance controller then takes the form
/l'k(Yk,..., Yk-m+l,Uk-l,..., Uk-m+l, Vk,"', V/,:-m+l)
1
= t;;(alYk +... + amYk-m+l - dlvk -... - d",v/,: -m+1
- b2Uk-l'" - bmUk-m+I),
1.1'
and the optimal controls {un arc generated by
ij
ll(s)1L'k = A(S)Yk - D (S)Vk,
where
J1'
A(s) = al + a2S +... + a m S m -l,
B(s) = b i + b 2 s +... + bmS m - l ,
D (s) = d l + rhs +... + rlrns m - l .
The closed-loop system is again Yk = £/,:, but for practical pmposes it. is
stable only if ll(s) has its roots outside t.he unit circle. The process whcrcLy
external inputs are measured and used for control is commonly referred to
as feedforward control.
0:
_ J
212
Problems with Imperfcct State Information
Chap. 5
Imperfect State Information Case
Consider now the general ARMAX model
Yk + alYk-l + ... + amYk-m = bMUk-M + ... + bmUk-m
+ Ek + CIEk-l + . . . + CmEk-m
or, equivalently,
..4(S)Yk = B(S)Uk + C(S)Ek,
where
..4(S) = 1 + alS + ... + ams m ,
B(s) = bMs M +... + bms"',
C(8) = 1 + C1S +... + cms m .
We assume the following:
(1) bM i- 0 and 1 ::; M ::; m.
(2) {Ed is a zero mean, independent, stationary process.
(3) The polynomial C(8) has all its roots outside the unit circle. (As
explained in Appendix F, this assumption is not overly restrictive.)
The controller knows at each time k the past inputs and outputs.
Thus the information vector at time k is
Ik = (Yk, Yk-l,..., Y-m+l, Uk-I, Uk-2,..., 'U-m+M)'
(3.19)
(We include in the information vector the control inputs U-I,..., U-m+M.
If control starts at time 0, these inputs will be zero.) There are no con-
straints on Uk. The problem is to find a policy {Jlo(Io),,,',JlN-l(IN-d}
that minimizes
E {(Yk)2 } .
By using state augmentation, we can cast this problem into the frame-
work of the linear-quadratic problem of Section 5.2. The corresponding
linear system in state space format involves a state Xk given by
Xk = (Yk+M-I"'" Yk+M-m, Uk-I,..., Uk+M-m, Ek+M-l,..., Ek+M-m)'
Because Yk+A-f-I,...,Yk+l and Ek+M-l,...,Ek+M-m are unknown to the
controller, we are faced with a problem of imperfect state information.
An analysis analogous to the one of Section 5.2 shows that certainty
equivalence holds; that is, the oplimal cont1'Ol 'ILk al limc k !/ivcn h -is lhe
same as the one that would be applied in the deterministic problem where
the current state
Xk = (Yk+M-l,... ,Yk+M-m,Uk-I,... ,Uk+M-m, Ek+M-I,... ,Ek+M-m)
Sec. 5.3
l\Iinimurn Variance l. .,trol of Linear Systems
213
is set equal to its conditional expect cd valuc given h, iLnd thc futurc dis-
tur'bac S f k+ AI, - . '. ' EN ar'c 8cl cqual to zcro (th.eir c:r:pcclcd /il/.lu.r).
lhus the opt.unal control uk = 11'k(h) is obtainecl by soh'in'" for 11,.
the equation - h
E{Yk+M I uk,h} = E { Yk+M I Y k Y }
., k-l,... ,Y-m+l,Uk, Uk-I,..., IL",+i\l
=0.
,. _ (;).20)
1I11s leads t.o the problem of calculating E{Yk+M I h, ud, known as the
forecastmg or prediction problem, which is important in its own right. We
first treat the easier case where there is no delay (!vI = 1) and then discuss
the more general case where the delay can be positive.
Forecasting for ARMAX Models - No Delay (M = 1)
Assume that !vI = 1. We would like to generate an equation for
the forecast E{Yk+l I h, "U.d, and then determine the optimal control
uk = Il'k(h) by setting this forecast to zero. Let. us introduce an auxil-
iary sequence {zd via the equation
zk = Yk - Ek.
(:).21)
A key fact is that, since {Ed is an independent, zero-mean sequence, we
have
E{Zk+1 I h, ud = E{Yk+l I h, ud.
We can thus obtain the desired forecast of Yk+l by forecasting zk+1 instead.
We can thel obtain the ltimal control uk by setting E{Zk+l I h, uk} = O.
By usmg the defimtlOn Zk = Yk - Ek to express Yk in terms of Zk in
the ARMAX model equation for M = 1, we obtain
Zk+l + C1Zk +... + CmZk-m+l = bl"Uk +... + bmUk-m+l + Wk, (3.22)
where
Wk = (CI - al)Yk + . . . + (c m - iLm)Yk-m+l.
We note that Wk is perfectly observable by the controller however the
I ' ,
sca ars Zk, . . . ,Zk-m+l are not known to the controller because the initial
conditions Zo, . . . ,Zl-m of the system (3.22) are unknown. Nonetheless, the
system (3.22) is stable, since the roots of the polynomial C(s) have beell
assumeu t.o be outside the unit circle. As a result, t.he initial cundiLiulls du
no.t matter asymptotically. In other words, if we generate a sequence {yd
usmg the system (3.22) and zero initial conditions, i.e.,
!
K'
't
: I:'
!.
W
I I
jl
,!i
"
Ii:
I
i:
,;1
Ii
Yk + l + Cl . Qk + . . . + C ",y A k _ ll'+1 b lL + + I -t
= I k ... i",ILk_"'+1 - Wk,
(:).23)
_ J
214
Problems with imperfect. State information
Chap. 5
Sec. 5.3
Minimum V,"lriance Guntrol of Linear Syst.ems
215
I
i: _
,
,
with
Forecasting: The General Case
Yo = 0,
Y_I =0,
YI-m = 0,
lim (Yk - Zk) = o.
k---+oo
Considcr now thc gcncral case where the delay AI can 1)(' greater Ih.1I1
1. The forecastiug problem cau still be uiccly solved by usiug a certain Irick
to transform thc ARMAX equatiou into a nlOre convenient fOl"ln. To thi,,;
end, wc first obtain polynomials F(s) and G(s) of the form
, "
theu we will have
C(s) = A(s)F(s) -I- sMG(S).
(:3,2(j)
[:
f!
'I
:'(!
t :::
[, ':
II, <
\:1;',1
1.1",
Thus, :01.+ I h; au asymptotically accurate approximation to the optimal
forecast E{Yk+1 I h,ud.
F(s) = 1 -I-!Is -1-... -I- 1M_IS M - I ,
(J.:2.I)
Minirnum Variance Control: Imperfect State Information and
No Delay
G(s) = go -I- gls -1-... -I- grn_IS m - 1
(:3.25)
that ;atisfy
I3ased on the earlier discussion, an asymptotically accurate approx-
imation to the minimum variancc policy is obtained by setting Uk to the
value that makes Yk+l = 0; that is, by solving for Uk thc equation
The cocfficicnts of F( s) and G( s) arc uniquely determiued from thos(' of
C (s) aud A (s) by equating coefficicnts of both sides of thc relation
Yk+l -I- CIYk -I- ... -I- CrnYk-m+1 = hi Uk -I- .., -I- bmUk-m+1 -I- Wk.
1 -I- CIS -1-... -I- cms m = (1 -I- a.IS -1-'" -I- n m s 'T1 )(1 -I- lis I.... -1- fAl . I S l\l-l)
-I- sM (gO -I- gls -I- . . . -I- gm_IS m - I ).
If this policy is followcd, however, thc earlier forecasts ih,. .. , ih-m+1 will
bc cqual to zero. Thus thc (approximatc) minimum variance policy is given
by
Example 3.2
1
Uk = b l (Wk - b 2 uk-1 - ... - bmUk.-m+l)
1
= hI ((al - CJ)Yk -1-... -I- (am - C",)Yk-m-1
- bUk-l - ... - hm"ILk-m+I).
Let m = :3 and AI = 2. Then t.he preceding equation takes t.he form
1 -I- CIS -I- C2S 2 -I- C3S J = (1 -t- o.1S -I- a2s2 -I- o.3s3)(1 + !is) -I- s2(gO -I- gls -I- g2S2).
and by equating coefIicicnts we have
I3y substituting this policy in the ARMAX model
cl=o.l+!i, c2=o.2+o.l!i-l-go, c3=a3-1-a2!i-l-gl, aJ!I-I-g2=O,
[rom which h, go, gl, and g2 are uniqucly determined.
Yk+1 -I- alYk -1-... -I- amYk-",+1 = hlUk -1-..' -I- 11",'l1k-m+1
-I- Ek+l -I- CIEk -I- .. . + CmEk-m+J,
The ARMAX model can be written as
we see that the closed-loop system becomcs
A(S)Yk+M = B(S)Uk -I- C(Sh+M,
C.27)
i,
I'
F(s)A(s)Yk+M = F(.fIJ(s)Uk -I- F(s)C('h+A1,
Yk+1 - Ek+1 -I- CJ (Yk - Ek) -I- . . . -I- Cm(Yk-m+1 - Ek-m+J) = 0,
wherc
or equivalcntly C(S)(:l)k - Ek) = O. Since C(s) has its roots outside the unit
circle, this is a stable systcm, and wc havc
B(s) = s-MB(s) = hM -I- bM+IS -1-'" -I- bms m - M .
Multiplying both sides of Eq. (3.27) with F(s), wc have
Vk = Ek -I- ,(k),
and using Eq. (3.26) to exprcss F(s)A(s) as C(s) - sAfG(s), wc obtain
wherc ,(k) -> 0 as k -> 00.
(C(s) - sMG(s))Yk+M = F(s)B(s)Uk -I- F(S)C(S)Ek+Af,
.1
>1
_ J
216
Problems with Imperfect State In[ormatiofl
Chap. 5
Sec. 5.3
Minimum Variance ( .w1 of Lincar Systems
217
or cquivalcntly
and
C(S) (Yk+M - F(S)Ek+M) = F(s)ll(s)Uk + G(S)Yk.
(3.28)
m
E{Zk+M I h,uk} = Yk+M + L ii(k)E{ZM-i I h,lIA:},
i=1
Let us now introduce the auxiliary sequencc {zd via the equation
wherc il (k), . . . , im (k) are appropriatc scalars depending on k. Sincc C (s)
has all its roots outside thc unit circle, we have (compare with thc discussion
on stability earlier in this section)
Zk+M = Yk+M - F(S)Ek+M = Yk+M - (k+M - hEk+M-l -... - !M-IEk+l'
(:3.29)
Not.e that whcn M = 1, wc havc F(s) = 1 and Zk = Yk - Ek, so {zd is the
samc sequence as the one introduccd earlicr for the case of no delay. Again,
since {Ed is an indcpendent, zcro-mcan sequence, by taking cxpcctations
in thc dcfinition Zk+M = Yk+M - F(S)EkHd, we obtain
lim il(k) = lim i2(k) = .., = lim im(k) = O.
/..;-00 k-+oo k-OJ
(:L!fi)
It follows that, for large values of k,
Yk+M c::: E{Zk+M I h, ud = E{Yk+M I h, ud.
E{Zk+M I h,uk} = E{Yk+M I h,'lld,
C(S)Zk+M = Wk
(Morc prccisely, we have IYk+M - E{Yk+M I h, 11.dl -> 0 as k -> 00, where
thc convcrgcnce is in the mean-squarc sense.)
In conclusion, an asymptotically accurate approximation to the opti-
mal forccast E{Yk+M I h, ud is givcn by fh+M and is gencmted by the
equation
and wc can obtain the dcsircd forccast of Yk+M by forccasting Zk-t-M in its
placc. Furthcrmore, by Eq. (3.28), Zk+M is written as
or
Yk+M + CIYk+M-l +... + CmYk+M-m = F(s)B(s)Uk + G(S)Yk (3.3G)
Zk+M + CIZk+M-l +... + CmZk+M-m = Wk,
(3.30)
with the initial condition
where
Wk = F(s)B(s)Uk + G(S)Yk.
(3.31)
YM-l = YM-2 = '" = YM-m = O.
(3.37)
Since the scalar Wk of Eq. (3.31) is availablc at time k (i.e., it is
determined from h and Uk), the system (3.30) can serve as a basis for
forecasting Zk+M. We would be ablc to prcdict exact.ly Zk+M and use
it as a forecast of Yk+M if we knew appropriate initial conditions with
which to start the cquation (3.30) that generates it. Wc don't know such
initial conditions, but because this equation represents a stable system, the
choice of initial conditions does not matter asymptotically, as we proceed
to explain more formally.
We consider thc sequence Yk+M generated by
/1
Minimum Variance Control: The Gcneral Case
Bascd on the earlier discussion, the minimum variance policy is ob-
tained by solving for Uk the equation E{Yk+M I h, ud = O. Thus an
asymptotically accurate approximation is obtained by setting Uk t.o the
value that makes Yk+M = 0, that is, by solving for Uk the equation [ef. Eqs.
(3.36) and (3.37)]
t,
;f{'
,.
.,."
F(s)ll(s)Uk + G(S)Yk = 0;
(3.:!i))
Yk+M + CIYk+M-l +... + CmYk+M-m = Wk
(3.32)
F(s)B(s)Uk + G(S)Yk = QYk+M-l +... + Cr"Yk+M-m'
YM-l = YM-2 = ... = YM-m = 0,
(3.33)
If this policy is followed, however, the earlicr forecasts Yk+M -I, . . . ,Yk+M-m
will be equal to zero. Thus the (approximate) minimum variancc policy is
given by
with initial condition
and wc claim that the forecast E{Zk+M I h} can be approximated by
Yk+M. To sce this, note that from Eqs. (3.30) to (3.33) we have
that is, uk is gcncrated by passing Yk through the linear filter
Zk+M = Yk+M + (Jl(k)ZM-l +... + rm(k)ZM-m)
(3.34)
-G(s)/F(s)B(s),
_. J
218
Problems wUh Impedect State In[onuatioJ.
Chap. 5
Ek
C(s)
A(s)
Uk
+
Yk
!Jl
A(s)
Uk
-
F(s)B(s)
Yk
Figure 5.3.2 Minimum variance control with imperfect state information. Struc-
ture of the optimal closed-loop system.
as shown in Fig. 5.3.2.
From Eqs. (3.28) and (3.38), we obtain the equation for the closed-
loop system
C(S)(Yk+M - F(S)Ek+M) = O.
Since C(s) has its roots outside the unit circle, we obtain
Yk+M = F(S)Ek+M + i(k),
where ,(k) --> 0 as k --> 00. So asymptotically, the closed-loop system takes
the form
Yk = Ek + !1Ek-1 +... + fM-IEk-M-t-l.
Let us consiuer now the stability properties of the closed-loop system
when the true system parameters differ slightly from those of the assumed
mode!. Let the true system be described by
-=0
AO(S)Yk = sMB (S)Uk + CO(S)Ek,
(3.39)
while Uk is given by the minimum variance policy
F(s)B(s)Uk + G(S)Yk = 0,
(3.40)
where
C(S) = A(s)F(s) + sM G(s).
Operating on Eq. (3.39) with F(s)B(s) and using Eq. (3.40), we obtain
-=0 -
F(s)13(s)AO(s)Yk = -sM B (s)G(S)Yk + F(s)B(s)CO(S)Ek'
See. 5..1
SuIlicienl, ::Jl,aiisiic., .d Finite-::Jtaf.e Markov Chains
21!J
"
,
Combining the last two equations and collecting t.('l"Ins, \\"c have
{F(s)B(s)AO(s) + (c(s) - A(s)F(s))lt(s)}yk = F(s)IJ(s)ClI(sh
;f
'I
fi
; II
1' [ ' ,
I I
I '
, r l ' ,, '
/.'
I,
t..;.
I I ': '
f'
, i
r
or
{13\s)C(s) + F(s)(13(s)AO(s) - A(s)BO(s))}uk = F(s)13(s)ClI(s)Ek.
If the coefficients of AO(s), 130(s), and CO(s) arc near those of A(s), B(s),
and C(s), then the poles of the closed-loop systcm will be near t.he root."
of [J(s)C(s). Thus thc closed-luop system will be in effect stable only if the
roots ofB(s) are strictly outside the unit circle, similar to the perfect state
information case examined earlier.
5.4 SUFFICIENT STATISTICS AND FINITE-STATE MARKOV
CHAINS
The main difficulty \\lith the DP algorithm for imperfect st.ate infor-
mation problems is that it is carried out over a state space of expanding
dimension. As a new measurement is added at each stage k, the dimen-
sion of the state (the information vect.or h) increases accordiJlgIy. This
motivat.es an effort to reduce the data that are truly necessary for control
purposes. In other worus, it is of interest to look for quantities known
as sufficient statistics, which ideally would be of smaller dimension than
h and yet sUlllmarize all the essential content of h as far as control is
concerned.
Consider the DP algorithm (1.7) and (1.8) restated here for conve-
nience:
IN-l(IN-l) = min [ E {gN(JN-I(XN-l, UN-l,WN-J))
UN-IEUN_l XN-l,WN-l
+ gN-I(XN-I, UN-I, WN-l) I IN-I,UN-J}],
( 4.1)
Jk(h) = min [ E {9k(Xk,Uk,Wk) + Jk+l(h,Zk+l,Uk) I h,ud ] .
11.k EU k Xk,Wk,Zk+l
(4.2)
Suppose that we can find a funct.ion 5k (h) of the informat.ion vector, such
that a minimbling control in Eqs. (4.1) and (4.2) depends on h via Si.;{h).
By this we mean that the minimization in t.he right-hand side of the DP
algorithm (4.1) and (4.2) can be writ.ten ill terms of some function Ih as
;
min Hk(Sk(h),Uk).
ukE[Tk
I
"
_ J
220
Problems with Imperfeci State In[ormatioi
Chap. 5
Sec. 5.4
Sullicient Statistics a iinite-State Markov Chains
221
Such a function Sk is called a sufficient statistic. Its salient feature h; that
an optimal policy obtained by the preceding minimization can be written
as
IL'kUd = Jik(Sdh)) ,
Note that ifthe conditional distribution P.ckl1k is uniquely dctel"1ninl'd
by another expression Sk(h), that is, PXkllk = Gk(Sk(h)) for an appropri-
ate function Gk, then Sk(h) is also a sufficient statistic. Thus, for example,
if we can show that Pxkl 1k is a Gaussian distribution, theu the mean alld
the covilriance mat.rix corresponding to J>ckl1k form a sufficicllt. st.at.ist.ic.
Regardless of its computational value, t.he representation of the opti-
lllal policy as a sequence of functions of the conditional probability dist.ri-
bution PXkllk'
wherc Jik is an appropriatc function. Thus, if the ;uflicicnt 8tati:,;tic is
characterized by a set of fewer numbers than the information vector h, it
may be easier to implement the policy in the form Uk = Jik(Sk(h)) and
take advantage of the resulting data reduction.
The Conditional State Distribution as a Sufficient Statistic
llk(h) = Jik(PXkllk)' k = 0,1,..., N - I,
is conceptually very useful. It provides a decomposition of the optimal
controller in two parts:
(a) An estimator, which uses at time k the measurement Zk and the
control Uk-l to generate the probability distribution PXkl/k'
(b) An actuator, which generates a control input. to the system a1; a fUIlc-
tion of the probability distribution P xk1h (Fig. 5.4.1).
This interpretation has formed the basis for various suboptimal control
schemes that separate the controller a priori into an estimator and an actu-
ator and attempt to design each part in a manner that seems "reasonable."
Schemes of this type will be discussed in Chapter 6.
There are many different functions that can serve as sufficient statis-
tics. The identity function Sk(h) = h is certainly one of them. To obtain
another important sufficient statistic, we assume that the probability dis-
tribution of the observa.tion disturbance Vk+ I depends explicitly only on the
immediately pr'ecedin!l state, control, and system disturuance :L:k, "ILk, Wk,
and not on Xk-l,"',XO, Uk-l,...,UO, 1lJk-l...,1lJO, Vk-l,...,Vo. Under
t.his assumption we can show that a sufficient statistic is given by the condi-
tion,tl probability distribution Pxkl 1k of the state xk, given the information
vector h.
Indeed, for a fixed h and Uk, in order to calculate the expression
E {Yk(Xk, Uk, 1lJk) + Jk+l(h,Zk+l,Uk) I h,Uk}
xk,wk,zk+l
Wk
Vk
in the DP equation (4.2), we need the joint distribution
Uk
System
X k + 1 = fi/,Xk,Uk.Wk)
Xk
Measurement
zk= hi/,xk,Uk _1. v k)
z.
P(Xk, Wk, Zk+l I h, Uk)
Uk .,
or, equivalently,
p(Xk, Wk, hk+l (Jk(Xk, Uk, Wk), Uk, Vk+l) I h, Uk)'
Dy using Bayes' rule, this distribution can be expressed in terms of PXkllk'
the given distributions
Pxl/
z.
P(1lJk I :J;k,Uk),
p( Vk+l I fk(xk, Uk, 1lJk), Uk, 1lJk),
Figure 5.4.1 Conceptual separation of the optimal controller into an estimator
and an actuator.
and the system equation Xk+l = fk(Xk, Uk, Wk). Therefore the DP algo-
rithm can be written as
Alternative Perfect State Information Reduction
Jk(h) = min Hk(PXkI1k,Uk)
UkEUk
By using the sufficient statistic PTkl1k we can also write the DP algo-
rithm in an alternative form. The key fact here is that PXkl1k is generated
recursively in time and can be viewed as the state of a controlled discrete-
time dynamic system. In particular, as we discuss shortly, it turns out that
for a suitable functioll Hk, and the distribution PXkl1k is a sufficient statis-
tic.
_. J
:[
222
Problems with Impeded. State [n[ormatior.
Chap. 5
Sec. 5.4
Suflicicnt Statistics ilL Finite-Sf,ate Markov Chains
223
we can write for all k
PXk+lllk+1 = <Pk (P'CkI1k , Uk, Zk+I),
( 4.3)
that describes the evolution of t.he nHlditiollal dist.ribut.ion j"'kl1k lit.s th.
framework of the basic problem. Furthermore, the controller can calculate
(at least in principle) the state P'Ckl1k of this system at time k, so l)('rfpct.
state information prevails. Thus the alternate DP algorithm (4.4) and (4.G)
may be viewed as the DP algorit.hm of the perfect. st"lte illfonnat.iun prohkln
that involves the above system whose state is Pxkl h and an appropriately
reformulated cost function. III the absence of perfect knowledge of the state,
the contr'oller can be viewed as controlling the "probabilist'ic statc" P ck1h
so as to minimize the expected cost-to-go conditioned 011. thc info1'1n.a.tioll
h available.
where <Pk is somc funct.ion t.hat can be determincd from thc data of t.he
problem, 'Uk is the cOlltrol of the system, and Zk+ 1 plays the role of a random
disturbance, the statistics of which are known and depend explicitly only
on P"kl1k and 1Lk, and not on Zk,...,ZO. Suppose now_that. we use the
sufficient statistic property of P'Ckl1k to define functions J k by
-:h(l)ckl1k) = Jk(h) = min [ E {!ldXk,lIk,11Jd
ukEUk XbWk,Zk+l
+ Jk+l(h+d I h,Uk}].
The Conditional State Distribution Recursion
We can t.hen use the recursion (4.3) to write the DP algorithm for k < N -1
in the ,llt.C'rnat.e form
There remains to justify the recursion
]dP:CkI1k) = ruin [ E {9k(Xk,uk,11Jd
1LkEUk Xk, 11J k,Zk+l
PXk+lllk+l = <Pk(PXkllk,"lLk,Zk+I).
Let us first give an example.
+]k+l(<I>dPXkI1k,"llk,Zk-fl)) I h,nd].
(4.4)
Example 4.1 (Search Problem)
In the case where k = N - 1, the l:Orresponding equation is
min
tLN-IEU N - 1
[ E {.lJN(JN-I(:CN-I, UN-I, 11JN-l))
xN-l,wN--l
+ 9N-l(XN-I, "llN-I,11JN-I) I IN-I, UN-I}]'
( 4.5)
In a classical problem of search, one has to decide at each period whether to
search a site that may contain a treasure. If a treasure is present, the search
reveals it with probability (3, in which case the treasure is removed from the
sit.e. Here t.he st.ate has t.wo values: either a t.reasure is prescnt. in the sit.(.
or it is not. The control Uk takes two values: search and not. search. If the
site is searched, the observation Zk+l takes two values, treasure found or not.
found, while if the site is not. searched, t.he value of Zk+1 is irrelevant.
Denot.e
]N-I(PXN_tlIN_I)
I3y minimizing on the right-hand side of the above equations, this
altel'llat.e DP algorithm yields a policy of the form
Pk: probability a treasure is present at. the beginning of period k.
JL'k(Pxkllk)'
k = 0,1,..., N - 1.
This probability evolves according to the equation
In addition, the optimal cost is given by
J* = Epo(Pxolzo)}'
ZO
{ Pk
P - 0
k+1 - Pk(1-{3J
Pk(1-{3J+I-Pk
i[ the site is not searched at time k,
i[ the site is searched and a t.reasure is found,
if the site is searched but no treasure is [ound.
whcn, .10 is obt.ailHxl by t.hc last step of t.his a.lgorithm, and t.he proba-
bilit.y distribution of Zo is obtained hum the measurement equatioll Zo =
11. 0 (:1:0, vo) and the statistics of :r.o and vo.
The preceding ,lllalysis is in effect an alternate reduction of t.he basic
problem with imperfect state information to a problem with perfcct. st.ate
information. Indeed, the system
PXk+tl1k+l = <Pk(P'CklIk,"U.k,Zk+l)
The seeoud rclatioll holds becaue the treasure i rellloved aileI' a suecr'ssrul
search. The third relation [ollows by application of Bayes' rule (Pk+l is equal
to the kth pcriod probability of a treasure being present. and t.he search being
unsuccessful, divided by the probability of an unsuccessful search). The pre-
ceding equat.ion delines t.he desired recursion for t.he colldi\.iollal dist.rihut.ioll
o[ the state.
This recursion can be used t.o writc thc DP algorithm (4.4), assuming
that. the treasure's worth is V, that. each search costs C, alld t.hat. Ollee we
_..1
224
Problems with Imperfect State InformatiOJ.
Chap. 5
Sec. 5.4
Sullicient Statistics i. Finite-State Markuv CJWillS
225
I
II I '
fl':
f
: t
I;
I
I
i I '
I'
I
;,'r;
:1"
I.e,
It,
II
1\.
(
:1':
: \ '
t.."
II
; 1 '
I
I
: !
; i
I:
: I:
! I I:
j',.
I,'
'
i' ;
decide not to search at a particular t.ime, t.hen we cannot search at. future
t.imes. The algorit.hm t.akes t.he form
Finite-State Systems
]ik{3V C,
Wlwll t.he system is a fillit.c-st.at.e Markov chaill, t.lte cOlldit.iollal fJl'OiJ-
abilit.y dist.ribut.ion P."kllk is characterized by a finit.(. sd of 1l11l1l!H'rs. This
is particularly convenient, and leads to further simplificaUoll whell t.h(' COll-
trol and observation spaces are also finite sets. It then turns out that the
cost-to-go fUllctions ] k in the DP algorithm (4.4) alld (4.G) <Ire piecewise
lincar ;lIld concave. The demonstration of this fact is I>traightforward, but
tedious, and is outlined in Exercise 5.7. The piecewise lillearit.y of]/.: is,
however, an important property since it shows that] k can be characterized
by a fillit.e set. of I>calars. St.ill, however, for fixcd k, t.he IlllllllJcr of these
scalars can increase feuit with N, and there may be no l'omput,;lti()llall'
efficient way to solve the problem (see [PaT88]). We will not discuss hprc
allY special procedures for computing J k (see [Lov91], [SmS73], [So1l71]).
Instead we will demonstrate the DP algorithm by means of examples.
- [ , - ( ]lk(I(J) )]
.h(]lk) = lIlax 0, -C + ]lk{JV + (1- Pkj).h+l (1 j) J '
Pk - + - 1'k
wit.h ] N(JlN) = O. It can be shown by induction that the functions ]k sat.isfy
}k(pk) = 0 for all Jlk ::; Cj({3V). Furt.hermore, it is optimal to search at
period k if and only if
t.hat. is, if and only if the expected reward fwm the lIext search i gr"ater or
equal to the cost of the search.
The general form of the recursion PXk+Jilk+l = <Pk(PXkllk,lL/.:,Zk+l)
is developed in Exercise 5.7 for the case where the state, control, observa-
tion, and disturbance spaces are finite sets. In the case where these spaces
are the real line and all random variables involved possess probability den-
sity functions, the conditional density p(Xk+l I h+tJ is generated from
p(:J:k I h), 11k, alld Zk+1 by means of the equation
Example 4.2 (Machine Repair)
111 the two-state machine repair example of Sedioll 5.], let liS <1['1101.('
p(Xk+l I h+tJ = p(Xk+l I h, Uk, Zk+l)
p(Xk+l, Zk+l I h, 1L k)
p(Zk+l I h, Uk)
p(Xk+l I h,Uk)p(Zk+l I h,Uk,Xk+tJ
J.':, p(Xk+l I h, Uk)p(Zk+l I h, Uk, Xk+tJdXk+l
PI = P(XI = P I h),
Po = P(xo = ]5 I 10).
The equation relat.ing 1'1,Po,UO,ZI [ef. Eq. (4.3)] is writ.tcn as
1'1 = <I>o(Po, Un, Zl)'
On" may verify by st.raightforward calculat.ion t.hat <I'o is given by
In this equation all the probability densities appearing in the right-hand
side may be expressed in terms of p(Xk I h), Uk, and Zk+l. In particu-
lar, the density p(Xk+l I h,Uk) may be expressed through p(Xk I h), Uk,
and the system equation Xk+l = fdxk, Uk, 1JJk) using the given density
p(1JJk I Xk, Uk) and the relation
p( 1JJ k I h,Uk) = 1: p(Xk I h)p( 1JJ k I Xk,Uk)dxk.
{ I
'1
3
PI = <Po(PO,UO,ZI) = +2PO
7-4PO
3+6PO
5+4PO
if lLo = S,
if Uo = S,
iful = C,
if Uo = C,
Zl = G,
Zj = B,
Zl = G,
Zl = B.
The DP algorit.hm (4.4) and (4.5) may be writt.en in t.erms of Po, PI, and <f>o
above as
p(Xk I h), p( 1JJ k I Xk, Uk), p(Vk+l I Xk, Uk, 1JJk)'
}l(pJ) = min[2pl, I],
}o(po) = min [21'0 + P(ZI = G I Io,C)J1«I>0(po,C,G))
+ P(ZI = B I 10, C)Jl (<J>o(po, C, B)),
1 + P(ZI = G I fo, S)J l (<J>o(po, S, G))
+p(Zl = B I IO,S)Jl(<PO(po,S,B))].
The probabilities entering in the second cquation may be expresscd ill t.erms
of po by st.raightforward calculat.ion as
I
i I
I
, J
;,1
!
Similarly, the density p(Zk+l I h, Uk, ;I:k+l) is expressed through the mea-
surement equation Zk+l = h k + 1 (Xk+l,11k,Vk+J) using the densities
I3y substituting these expressions in the equation for p(Xk+l I h+l), we ob-
tain a dynamic system equation for the conditional state distribution of the
desired form. Ot.her similar examples will be given in subsequent sections.
A mathematically rigorous substantiation of the recursion PXk+lllk+1 =
<PI:( PXkllk' Uk, Zk+l) and the alternate DP algorithm (4.4) and (4.5) can be
found in [BeS78].
7 - 'Ipo
P(ZI = G I Io,C) =,
7
P(ZI = G I Io,S) = 12 '
5 + -Ipo
P(ZI = JJ I Jo,C) =,
P(ZI = B I Io,S) = l G 2
_.J
U8ing t.he8e values we have
Problems with Imperfect State Informatio.
Chap. 5
226
Jo(po) = min [ 2 PO + 7 - 1 ,;po JJ ( 1 + 2po ) + 5 + 1p0 71 ( 3 + 6PO )
" 7 - 1po 12' [) + 1pu '
1 7- ( 1 ) 5- ( 3 )]
+ 12 h 7 + 12 Jl "5 .
Now by minimization in thc equation defining Jl(Pl), we obtain an optimal
policy for thc last stage
_' ( { C
ILl p!) = S
if PI::; 4,
if PI> 4.
Also by sub8t.itution OfJI(PI) and by carrying out t.he straight.forward <:alcu-
lat.ioll we obtain
{ 19
- 12
J 0 (po) = 7+ 32 po
12
and an optimal policy for the first. stage
_' ( { C
/-Lo ]10) =
.'3
Notc that
- 1
P(xo = 1> I Zo = G) = 7'
7
P(zo = G) = 12 '
so that the formula
if ::; po ::; 1,
if 0 ::; po ::; ,
if Po ::; ,
if Po > .
- 3
P(xo = P I Zo = B) = ,
iJ
5
P ( z = B ) = -
o 12'
. { - ( } 7 - ( 1 ) 5 - ( 3 ) 176
J = E J 0 Pxo I '0) = - J 0 - + - J 0 - =-
'0 12 7 12 5 141
yields the 8amc optimal C08t as the one obtained in Section 5.1 by mcans of
t.hc DP algorit.hm (1.7) and (1.8).
A Problem of Instruction
Consider a problem of instruction where the objcctive is to teach a
student a certain simple item. At the beginning of each period, the student
may be in one of two possible states:
L: Item learned.
'"
Sec. 5.4
Suilicicnt Statistics aJ
'inite-State Markov Chains
227
I: Item not leamed.
At the beginning of each period, the instructor must makc onc of two
decisions:
T: Terminate thc instruction.
T: Continue the instruction for one pcriod and then conduct a. test that.
indicates whether the studcnt has leamcd thc item.
The test has two p08siblc outcomes:
R: Studcnt gives a correct answer.
R: Studcnt gives an incorrect answer.
The transition probabilities from onc state to the ncxt if instmctioll
takes place a.re given by
P(Xk+l = L I ."Lk = L) = 1,
P(Xk+l = L I Xk = L) = t,
P(Xk+l = L I J:k = L) = 0,
P(Xk+l = L I Xk = L) = 1 - t,
where t is a given scalar with 0 < t < 1.
The outcome of the test depends probabilistically on the state of
knowledge of thc tudent as follows:
P(Zk = R I Xk = L) = 1,
P(Zk = R I J;k = I) = 7',
P(Zk = R I :Ck = L) = 0,
P(Zk = R I J:k = I) = 1 - 7',
where r is a given scalar with 0 < r < 1. The probabilist.ic structme of tlw
problem is illustrated in Fig. 5.4.2.
Figure 5.4.2 Probabilistic structure of the instrncl.ion prohlem.
The cost of instruction and tcsting is I per pcriod. Thc cost. of termi-
nating instruction is 0 or C > 0 if the t. udent has Icamet! or has nol. lcamed
the item, respectively. The objective is to find an optimal instruction-
termination policy for each period k as a function of the tcst information
accrued up to that period, assuming that there is a maximum of N pcriods
of instruction.
_ J
228
Problems with Imperfed State In[ormatiG
Chap. 5
It is straightforward to reformulate this problem int.o the framework
of the basic problem with imperfect state information and to conclude that
the decision whether to terminate or continue instruction at period k should
depend on the conditional probability that the student has learaed the item
given the test results so far. This probability is denoted
Vk = P(xklh) = P(Xk = L I Zu, ZI,..., Zk).
In addition, we can use the DP algorithm (4.4) and (4.5) defined over the
space of the sufficient statistic Pk to obtain an optimal policy. However,
rather than proceeding with this elaborate reformulation, we prefer to argue
and obtain this DP algorithm directly.
Concerning the evolution of the conditional probability Pk (a.<,suming
instruction occurs), we have by Bayes' rule
P(Xk+l = L, Zk+l I ZO, . .. , Zk)
Pk+l = P(Xk+1 = L I zo,... ,Zk+1) = P( I )
-"k+1 o,...,Zk
where
P(Zk+l I zo,.. . ,Zk)
= P(Xk+l =L I zo,...,Zk)P(Zk+ll Zo,...,Zk,Xk+1 =L)
+ P(Xk+l = L I Zo,.. ., Zk)P(Zk+l I Zo,.. . , Zk, Xk+l = L).
From the probabilistic descriptions given, we have
{ 1 if Zk+ I = R,
P(Zk+1 I zo,..., Zk, ;Ck+l = L) = P(Zk+l I ;Ck+l = L) = 0 f R
i Zk+l = -.
P(Zk+l I Zo,"', Zk, Xk+l = L) = P(Zk+l I Xk+l = L)
{ 1" if Zk+1 = R,
= 1 - r if Zk+1 = R.
P(Xk+1 = L I Zo,. . . , Zk) = Pk + (1 - Pk)t,
P(Xk+l = L I Zo,'" ,zd = (1 - Pk)(l - t).
Combining these equations, we obtain
{
- "" ( Z ) - Pk+(1-Pk)t+(1-Pk)(1-t)r
Pk+l - '¥ Pb k+1 -
o
if Zk-j-l = R,
if Zk+1 = R,
or equivalently
{ 1-(I-t)(I-Pk)
- "" (]J Z ) - 1- ( I-t)(l-r )( I-Pk )
Pk+l - '¥ k, k+1 -
o
if Zk+l = R,
if Zk+l = R.
(4.6)
Sec. 5.1
SlIfTiciCllf. Sf.aUsf.ics a. Finif.e-St,,,t.,, MarkoI' (;!,aills
22!)
This equatiun is a sJwci:d ca:>e of t.he general n'cursi\"e updat.e ('qu;tI.ioll for
thc conditional probability of the state; cf. Eq. (4.3). A cursory examina-
tion of Eq. (4.6) shows that, as expected, the conditional probability ]!!.-+ I
that the student has learned the itcm incrcascs wit.h every coned answm'
and drops to zero with every incorrcct answer.
We now derive the DP algorithm for the problem. At, the end of
the Nth period, assuming instruction has continued to that period, t.he
cxpected cost is
IN(PN) = (1- PN)C.
(<1.7)
At thc cnd of period N - 1, the instructor has calculated the conditional
probability PN -1 that the student has learncd the item and wishes t.o decide
whether to terminate instruction and incur an cxpccted cost (1 - PN-dC
or continue thc instruction and incur an expected cost I + E {J N (]I N ) }.
This leads to thc following equation for thc optimal expected cost-t.o-go:
: (
J N-l(PN-d = min[(l - PN-dC, 1+ (1 - t)(1 - PN-dC].
The term (1- P N -1) C is the cost of terminating instruction, while the term
(1 - t)(l - PN-1) is the probability that thc student still ha.<; not karned
the itelll following an additional period of instruction.
Similarly, the algorithm is written for cvcry stage k by replacing N
by k + 1:
'I'
i
, ,
Jk(Pk) = min [ (1 - Pk)C, 1+ _ E Pk+l (<I>(pk, zk+d) } ] .
"k+l
Now using expression (4.6) for thc function <I> and the probabilities
P(Zk+1 = R I vd = (1 - t)(l - 1")(1 - Pk),
P(Z"+1 = R I Pk) = 1 - (1 -- t)(1 - 1")(1 - pd,
we have
Jdpd = min[(1- Pk)C, 1+ Ak(Pk)],
(4.8)
where
Adpk) = P(Zk+1 = R I h)J k +1(<I>(Pk,R))
+Ph+1 = R I h)J k + 1 (<I>(Pk,R))
or, cquivalently, using Eq. (4.G),
,. -
- ( l-(l-t)(I-Pk) )
Ak(Pk) = (1 - (1 - t)(1 - r)(1 - Pk)).Jk+1 1 _ (1 _ t)(l -1')(1 - pd
+ (1 - t)(l - r)(l - Pk)Jk+l(O).
, i
I
(4.9)
_ J
230
Problems with Imperfect State Information
Chap. 5
Sec. 5.,1
Suilicjent Statistics
. Finite-State Markov Chains
231
1< tC,
(4.10)
An optimal policy for period /.; is given by
continue instruction if Pk :S O'k,
terminate instruction if Pk > nj,-.
Since the functions Adp) are monotonically nOllJecrca.,illg with )"C-
spect to k, it follows from Fig. 5.4.4 that
1
CY.N-l < CY.N-2 < .., < CY. k < ... < 1 - -
- - - - - C'
and therefore the sequence {nd converges to somc scalar a as /.; --> -00.
Thus, as the horizon get.s longer, the optimal policy (at least for the initial
stages) can ue approximated by the stationary policy
!
As shown in Fig. 5.4.3, if I -I- (1 - t)C :::: Cor, cquivalently, if
then thcre exists a. scalar O'N-l with 0 < CtN-l < 1 that determincs an
optimal policy for the last period:
continue instruction
terminate instruction
if PN-l :S CY.N-l,
if PN-l > CY.N-l'
In the opposite case, where 12 tC, the cost of instruction is so high relative
to the cost of not learning that instrucLing the studcnt is nevcr optimal.
continue instruction
terminate instruction
if Pk :S ak,
if Pk > ak.
(4.101)
'I
C
(1- p)C
,/
I+A N . 1 (p)
\
---
c
o
p
Continue aN - 1
Instruction
Terminate
Instruction
o
(Y.N_1
(Y.N_2 (Y.N-3 1.1
C
p
Figure 5.4.3 Determining the optimal instruction policy in the last period.
It may be shown by induction (Exercisc 5.8) that if I < tC, the
functions Adp) arc concave and piecewise linear for each k and satisfy, for
all k,
Figure 5.4.4 Demonstrating that the instruction thresholds are decreasing with
time.
.fh(l) = o.
(4.11)
Adp) 2 Ak(p'),
for 0 :S P < 11' :S 1,
( 4.12)
It turns out that this stationary policy has a convenient implementa-
tion that docs not require the calculation of the conditional probability at
each stage. From Eq. (4.G) we have that Pk+l increases over Pk, if it correct
answer R is given and drops to zero if an incorrect allswer R is givell. For
m = 1,2,. . ., define recursively the probability 7r m of getting m successive
correct answers following an incorrect answer:
7r1 = (1)(0, R), 7r2 = <l.J(7rl, R), ,7rk+l = <I>(7rk, R),...
Let n be the smallest integer for which tr n > a. Then the stationary
policy (4.14) can be implemented by terminating illstruction if and only if
n successive correct answers have been received.
Furthermore, they satisfy for all k,
Ak-l(P) :::: Ak(p) :::: Ak+l(P), for all P E [0,1]. (4.13)
Thus the functions (1 - pdC and 1+ Ak(1Ik) intersect at a single point,
alld from the DP algorithm (4.8), we obtain that the optimal policy for
each period is determined by the unique scalars ak, which arc such that
(1 - ak)C = 1+ Adad,
k = 0,1,..., N - 1.
_ J
232
Problems with Imperfect State Information
Chap. 5
233
Sec. 5.5
Sequential HypotJ](.. Testing
5.5
SEQUENTIAL HYPOTHESIS TESTING
where (I-PN-1)Lo is the expected cost. for accept.ing fo and PN-1L, is the
expected cot for accepting fr. Taking into account Eqs. (5.1) and (5.2),
we can obtam the optnnal expected cost-to-go for the kth period as
]dJid = min [(1 - fJdL o , JikLl,
In this section we consider a hypothesis testing problem that is typical
of statistical sequential analysis. A decision maker can make observat.ions,
at a cost C each, relating to two hypotheses. Given a new observation,
he can either accept one of the hypotheses or delay the decision [or one
more period, pay the cost C, and obtain a new observation. At issue is
trading off the cost of observation with the higher probability of accepting
the wrong hypothesis.
Let Zo, Zl, . . . , ZN -1 be the sequence of observations. We assume that
they are independent., identically distributed random variables taking val-
ues on a finite set Z. Suppose we know that the probability distribution of
the Zk'S is either fo or Jr and that we are trying to decide on one of these.
Here, for any clement Z E Z, Jo(z) and fr(z) denote the probabilities of
Z when fo and Jr are the true distributions, respectively. At time k after
observing Zo,. . . , Zk, we may either stop observing and accept either fo or
fr, or we may take an additional observation at a cost C > O. If we stop
observing and make a choice, then we incur ",ero cost if our choice is cor-
rect, and costs Lo and Ll if we. choose incorrectly fo and Jr, respectively.
We are given the a priori probability p that the true distribution is fo, and
we assume that at most N observations are possible.
It can be seen that we are faced with an imperfect state information
problem involving the two states:
C + E { 7 ( pdO(Zk+l) ) } ]
Zk+l . H1 PkfO(ZH1) + (1- Pk)fr(Zk+l) ,
where the expectation over Zk+l is taken with respect to the probability
distribution
p(ZHd = Pkfo(zHJ) + (1 - Pk)fr(Zk+l),
Equivalently, for k = 0,1,. .. , N - 2,
]dPk) = min[(1 - pdLo, PkLl, C + Ak(pd], (G.4)
Zk+1 E Z.
where
Ak(Pk) = E { ]k H ( pk/O(Zk+rJ ) } (5.G)
Zk+l Pkfo(zkH) + (1 - Pk)fr (Zk+1) .
An optimal policy for the last period (see Fig. 5.5.1) is obtained from
the minimization indicated in Eq. (5.3):
accept fo if PN -1 :::: "
accept Jr ifpN_I <"
where, is determined from the relation (1 - ,)Lo = ')"L I or equivalently
Lo
,=
Lo + Ll .
x O : true distribution is fo,
Xl : true distribution is fr.
The altemate DP algorithm (4.4) and (4.5) is defined over the interval [0,1]
of possible values of the conditional probability
.1:
.1
..I
I
I
\
, f
:\
t
i
I
I
!
Pk = P(Xk = X O I zo,..., Zk).
I N - 1 (P)
pL 1
\
(1 . p)Lo
/
Similar to the previous section, we will obtain this algorithm directly.
The conditional probability Pk is generated recursively according to
the following equation [assuming fo(z) > 0, h(z) > ° for all z E Z]:
pfo(zo)
po = pJo(zo) + (1 - p)Jdzo) '
p
(5.1)
Pkfo(zkH)
PHl = Pkfo(zk+1) + (1 - Pk)h(Zk+1) '
where P is the a priori probability that the true distribution is Jo. The
optimal expected cost for the last period is
k = 0, 1,. .. ,N -- 2, (5.2)
o
La
La+ Ll
Accept f 1
Accept fo
] N -I (PN-1) = min [(1 - PN-l)Lo, PN _lLr],
(5.3)
Figure 5.5.1 Determining the optimal policy for the last period.
_ J
234
Problems wjth Imperfect State In[ormatjon
Chap. [j
Sec. 5.5
0equentjal I1ypoth I(stjIJg
235
We !lOW prove the following Icmma.
this inequality is equivalent to
Ak(O) = Adl) = 0,
AI J ( Pi!o(Z') ) (1 - -')6 -- ( P"2fO(ZI) )
-'l + (1- -')6 k+1 I + -'6 + (1- -')6 J k + 1 "2
S J k + 1 ( (-'PI + (1 - -')P2)fO(Zi) )
-'I + (1 - -')6 .
Lemma 5.1: Thc functions Ak : [0,1] -> R of Eq. (5.5) are concave,
and satisfy for all k and p E [0,1]
A k - 1 (P) S Ak(p).
This relation, howevcr, is implicd b y the concavit y of J
k+l.
Q.E.D.
Using Lemma 5.1, we obtain (see Fig. 5.5.2) that if
]N-2(p) S min[(l- p)Lo, pLo] = IN-l(P).
C ' + A ( L o ) LoLl
N-2 <
Lo + L 1 Lo + Ll '
Proof: We havc for all p E [0,1]
thell an optimal policy for each period k is of t.lw fO)"l]1
By lllakilig use of the stat.iollarity of t.he systcm and the mOlioLonicity
propcrty of DP (Excrcise 1.23 in Chapter 1), we obtain
accept fo if Pk :::: ak,
Jdp) S ]k+l(p)
accept /1 if Pk S {Jk,
for all k and P E [0,1]. Using Eq. (5.5), wc obtain Ak+l(p) S Adp) for all
k alld p E [0,1].
To provc concavity of Ak in view of Eqs. (G.3) and (5.4), it is sutficient
to show that com:avity of J k+1 implics concavity of Ak through relation
(5.G). Indeed, assumc that J k + 1 is concave over [0,1]. Let zl,z2,...,zn
uenote thc elements of the obscrvation spacc Z. We have from Eq. (5.5)
that
continue the observations if {3k < Pk < Q;k,
whcrc thc scalars ak, fJk arc determined from the relation::;
fJkLl = C + AdfJk),
(1 - adLo = C + Adak)'
F\lrthermore, we have
. . - ( pfo(z') )
Adp) = L...,Jpfo(z') + (1- p)/1(z'))Jk+ 1 1 I fo(z') + (1 - p)fl(z') .
,I
C
'.. S ak+l S Ctk < Ctk-I < ... < 1 - -
- - -- Do '
. - ( pfo(z') )
H,(p) = (pfo(z') + (1 - p)/1(Z')).Jk+l pfO(Zi) + (1 - p)/1(z') .
C
. . . :::: fJk+l :::: fJk :::: fJk-l 2: . . . :::: L 1 '
!!ence as !'1-> 00 the scquenccs {CtN-;}, {fJN-,} converge to scalars a,
fJ, :espcctlvely, and the optimal policy is approximated by the stationary
pohcy
I-Icnce it is sufficient to show that concavity of J k+ 1 implies concavity of
each of thc functions
'it) show concaviLy of Hi, we lllust show that for cvcry -' E [0,1], PI,
P"2 E [0,1] we have
accept fo if Pk :::: a,
(G.Ga)
(5.Gb)
accept h if Pk S 73,
-'IIi (p!) + (1 - -,)H, (P2) S II, (API + (1 - -,)p2).
continue the ob::;ervatioll::; if 7J < ]Jk < a.
Now the conditional probability Pk is given by
(G.Gc)
Using the notation
I = piJo(z') + (1 - PI)fl(Z'),
6 = p210(Zi) + (1 - P2)/1(Z'),
_ pfo(zo) . . . fO(Zk)
Pk -
pfo(zo)'" fo(zd f- (1- p)/1(zo)'" !i(ZI.)'
(5.7)
_ J
236
Problems wiLl] imperfeci State information
Chap. 5
wherc p is the a priori probability that fo is the true hypothesis. Using Eq.
(5.7), the stationary policy (5.6) can be written in the form
accept !o if Rk :::: A,
accept h if Rk ::; B,.
continue the observations if B < Rk < A,
where
(1 - p)fi
A= ,
p(l - fi)
(l-p)!3
B= ,
p(l - (3)
and
(5.8a)
(5.8b)
(5.8c)
fo(zo) .,. !O(Zk)
Rk = h(zo)'" h(Zk)'
Note that Rk can be easily generated by means of the recursive equation
Rk+l = fO(Zk+l Rk'
h(Zk+l)
(1 - p)Lo
................../
pL 1
\
I
I
I
I
I
I
I
I
I
I
I
--- C
I
LoL1
Lo+ L1
/-t+A N _ 2 (P)
C+A N _ 3 (P)
C
o
I
C l I L \ \ C
- {3N-3 {3N'2 (l.N-2 (l.N-31 --
L1 La+ L1 La
Figure 5.5.2 Determining the optimal hypothesis testing policy.
p
The policy (5.8) is known as the sequential probability mti test, an.d
was among the first formal methods studied in statistical sequential an.al y sls
[WaI47]. The optimality of this policy for the infinite horizon versIOn of
the problem will be shown in Vol. II, Chapter 3.
Sec. 5.6
Notes, Sources, all<.. ..xercises
237
5.6 NOTES, SOURCES, AND EXERCISES
For literature on linear-quadratic problems with illlp(rfect stat, in-
formation, see the references quoted [or Section 4.1 and Witscllhauscll's
survey paper [Wit71]. Detailed discussions of the Kalman filtering algo-
rithm can be found in many textbooks [AnM79], [Jaz70], [LjS83], [Med69],
[Mcn73], [Nah69]. For lincar-quadrat.ic problClns wit.h Gaussian ullccrt.ain-
ties and observation cost in the spirit of Exercise 5.6, see [AoL69]. Exercise
5.1, which indicates the form of the certainty equivalence principle when
the random disturbances arc correlated, is based on an unpublished report
by the author [Ber70]. Thc minimum variance approach is also described
in [AsW84] and [Whi63]. Imperfect state information problems with expo-
nential cost functions are discussed in [JBE94] and [FeM94].
The idea of data reduction via a sufficient statistic gained wide at-
tention following the 1965 paper by Striebel [Str65]; see also [ShiG4] and
[Shi66]. For the analog of the sufficicnt statistic idea in sequenLi,,1 minimax
problcms, sce [DeR73].
Sufficient statistics havc been used for analysis of finite-state problcms
with imperfect state information in [Eck68], [SmS73], and [Son71]. The
proof of pieccwise linearity of thc cost-to-go functions, and an algorithm for
their computation are given in [SmS73] and [Son71]. For further material
on finite-state problems with imperfect state information, sec [ADF93],
[Lov91], and [WhS89]. The instruction model described in Scction 5.4 has
been considered (with some variations) by a numbcr of authors [ABC6G],
[GrA66], [KaD66], [Sma71].
For a discussion of thc sequential probability ratio test and related
subjects, see [Che72], rDeG70], [Whi82], and the references quoted therein.
The treatment given here stems from [ABG49].
EXERCISES
5.1
';I
Consider t.he linear system and measuremcnt. equation of Section G.2 and
consider the problem of finding a policy {I-'o (In), . . . , /l.iv _ 1 (IN _ I )} that. mi 11-
imizes the quadratic cost
{ N-l }
E X'rvQXN + URkUk .
_.I
238
Problems with Imperfect. State Information
Chap. 5
Sec. 5.G
Notes, Sources, all(.. ercises
239
Assume, however, that t.he random vectors Xo, WO,. . . , WN-l, vo,. . . , VN -I an
correlated and have a given joint probability distribution, and linit.e first and
second moments. Show that the optimal policy is given by
5.3
A linear system with Gaussian disturbanccs
J.L'k(h) = LkE{Yk I h},
Xk+l = AXk + BUk + Wk,
wherc the gain matrices Lk are obt.ained from the algorithm
Lk = -(BU<k-t-1IJk + Rk)-I [JI<k+lAk,
KN =Q,
Kk = A(I<k+l - Kk+l[Jk(HI<k+1IJk + Rk)-l IJI<k+i)Ak'
and the vectors Yk are given by
Yk = Xk + AklWk + Ak l AkIWk-t-l + .., + Ak i '" A;;;'IWN-'1
(assuming the mat.rices Ao, Ai,..., AN-l arc invertible). Ilint: Show that
the cost can be written as
E {yJ{kYO + (Uk - LkYk)' jJk(Uk - LkYk) } ,
is to be cont.rolled so as to minimize a quadratic cost similar to t.hat. in Section
G.2. The difference is that. t.he cont.roller has t.he opt.ion of choosing at each
time k one of two types of measurements for t.he next. st.age (k t- I):
first. type:
second type:
Zk+l = C1Xk+l + vk+1
Zk+l = C2Xk+1 + V+l'
Here C l and C 2 are given mat.rices of appropriate dimension, and {vD and
{vi;} are zero-mean, independent, random sequences wit.h given finite covari-
ances that do not depend on Xo and {Wk}. There is a cost 91 (or 92) each time
a measurement of type 1 (or type 2) is taken. The problem is t.o find the op-
timal control and measurement selection policy that minimizes t.he expected
value of t.he sum of the quadratic cost
where
Xk+l = Xk + U.k + Wk,
N-l
X'r,QXN + 2.)XQxk + U.RUk)
k=O
and the total measurement cost. Assume for convenience t.hat N = 2 and that.
the first measurement. Zo is of type 1. Show t.hat t.he optimal measurement.
selection at k = 0 and k = 1 does not depend on the value of the informat.ion
vectors fo and 1" and can be determined a priori. Describe t.he nature of t.he
optimal policy.
P k = BI<k+lBk + Rk.
5.2
Consider the scalar system
Zk = Xk + Vk,
where we assume that the initial condition XO, and the disturbances Wk and
Vk are all independent. Let the cost. be
5.4
E { x7v + % (x% + un } ,
and let the given probability distributions be
1 1
p(xo=2)=2' p(wk=1)=2'
1 1
p(xo = -2) = 2' p(Wk = -1) = 2'
Consider a scalar single-input, single-output system given by
Yk + alYk-1 +... + amYk-", = bMUk-M +... + bmUk-m
+ Ck + C1Ck-l + .., + CmCk-m + "Uk-n,
P (Vk = D = ,
P (Vk = -D = .
where :::;!vI:::; m, 0 :::; n :::; m, and Vk is generated by an equat.ion of t.he
form
(c) DetcrIuine t.he asympt.ot.ic form of the policies in parts (a) and (b) as
N --> 00.
Vk + dlvk-l +... + dmVk-m = k + (Ik-l +... + C"'k-m,
and the polynomiaL" (1 + CIS + ... + ems"'), (1 + dls + .,. + dms"'), and
(1 + (IS + ... + cmS m ) have roots strictly outside the unit circle. The vahll'
of the scalar Vk is observed by the controller at time k together wit.h !Jk. The
sequences {cd and {k} arc zero mean independent. identically distribut.ed
with finite variances. Find an easily implement.able approximation t.o the min-
imum vriance controller minimizing E {L=o(Yk)2}. Discuss t.he st.ability
properttes of the closed-loop system.
(a) Determine the optimal policy. IIint: For this problem, E{Xk I h} can
be determined from E{Xk_1 I h-J), Uk-I, and Zk.
(b) Dct.Nmine the policy that is opt.imal within the class of all policies t.hat
arc linear functions of the measurements.
_ J
240
Prohlems with Imperfect; St.aJe Information
Chap. .5
Sec. 5.6
Notes, Sources, i11,
.xercises
241
Xk+l = Akxk -I- HkUk -I- !Uk,
Zk = C k 3;k -I- "Uk.
with a machine in good st.at.e at a cost R. The cost for producing a bad it('m
is C > O. Write a DP algorithm for obtaining an optimal inspection policy
assuming a machine initially in good stat.e and a horizon of N lJl'riods. Solve
the problem for t = 0.2, I = 1, H = 3, C = 2, and N = 8. (Th,' optimal
policy is to inspect at the end of the third period and not. inspect in any ot.her
period.)
5.5
(a) Within t.he framework of the basic problem with imperfect stat.e infor-
mation, consider the cose where the system and the observations are
linear:
5.7 (Finite-State Systems - Imperfect State Informat.ion)
The initial st.ate Xo and the disturbances !Uk and "Uk are assumed Gaus-
sian and mutually independent. Their covariances are given, and !Uk
and "Uk have zero mean. Show that E{xo I Io},..., E{XN-l I IN-I}
constitute a sufficient statistic for this problem.
(b) Use t.he result of part (a) to obtain an optimal policy for the special case
of the single-stage problem involving the scalar system and observation
Zo = Xo -I- "Uo,
Consider a system that at any time call be in anyone of a finit.e number of
states 1,2,..., n. When a control U is applied, t.he system moves from state
i to state j with probabilit.y p,j(u). The control Ii is dlOscn from a finit.c
collection u l , u 2 ,. . . ,urn. Following each state transition, an observat.ion is
made by the controller. There is a finite number of possible observation
outcomes Zl, Z2,. . . ,zq. The probabilit.y of occurrence of zO, given that the
current state is j and t.he preceding control was u, is denot.ed by 7' J (n,O),
O=l,...,q.
(a) Consider t.he column vector of conditional probabilit.ies
i
.1
I
,I
1
I
. i
Xl = ;co -I- no,
and the cost function E{lxkl}.
(c) Generalize part (b) for the case of the scalar syst.em
Pk = [P1".. ,Pk]',
whcrc
Xk+l = aXk -I- Uk,
p =P(Xk =j I zO,...,Zk,UO,...,Uk_I),
j = 1,... ,11"
Zk = =k -I- "Uk,
and the cost function E{L=llxkl}. The scalars a and c are given.
Note: You may find useful the following "differentiation of an integral"
formula:
and show that it can be updat.ed according to
J Ll PkPij(uk)rj(uk, zk+d
Pk+l =
L;=l L=l PPi.(uk)r.(uk, zk+d'
j=l,...,n.
i j l3(Y)
, f(y,)d =
. n(y)
(13(Y) df(y, ) d
J a(y) dy
d{3(y) ( ) dC\'(1j)
-I-f(y,{3(y))d)j-f y,C\'(Y) d(j'
Write this equation in the compact form
[r(uk,zk+d] * [P(Uk)'Pk]
Pk+l =
r( Uk, Zk+l)' P(Uk)' Pk
Consider a machine that can be in one of two states, good or bad. Suppose
that the machine produces an item at the end of each period. The item
produced is either good or bad depending on whether the machine is in a
good or bad st.ate at. t.he beginning of the corresponding period, respectively.
We suppose that. once the machine is in a bad state it remains ill that state
until it is replaced. If the machine is in a good state at the beginning of a
certain period, then with probability t it will be in the bad state at the end of
the period. Once an item is produced, we may inspect the item at a cost I or
not inspect. If an inspected item is found to be bad, the machine is replaced
where prime denotes t.ransposition and
P(Uk) is the n x n transition probability mat.rix with ijt.h element.
P.j(Uk),
r( Uk, Zk+ I) is the column vector with j th coordinate 7') (Uk, Zk+ 1),
[P(Uk)' P k L is the jth coordinate of t.he vector P(Uk)' Pk,
[r(uk,zk+l)] * [P(Uk)'P k ] is t.he vect.or whose jt.h C"oordinatc is
the scalar r)(Uk'Zk+l) [P(Uk)'PkL.
(b) Assume that there is a cost for each stage k denoted gd i, u, j) and
<JSsociated with the control U and a transition from i to j. There is no
terminal cost. Consider the problem of finding a policy that. minimizes
5.6
_ J
242
Problems with Imperfect State Information
Chap. 5
the sum of expected cost.s per stage over N periods. Show t.hat. the
corresponding D1> algorithm is given by
IN-I(P N - 1 ) = min J>_lGN-I(1i),
uE{u1,,,.,u m }
min [ PGk( u)
uE{u 1 ,.."u 1n }
q " - ( [r(u,o)] * [P(U)'Pk] )]
+ r(u,O) P(u) Pdk+l r(u,O)'P(U)'Pk '
where Gk(U) is the vector of expected kth stage cost.s given by
Jk(Pk) =
( L:;=I Plj()9k(1,U,j) )
Gk(U) = : .
L::'=I Pnj (u)gk(n, u, j)
(c) Show that, for all k, Jk when viewed as a function on the set of vectors
with nonnegative coordinates, is positively homogeneous; that IS,
Jk(M'k) = >.Jk(Pk),
A> 0.
Use this fact to write the D1" algorithm in the simpler form
Jk(Pk) = min [ PGk(1i) + tJk+l ([r(u,O)] * [P(U)I Pk ]) ] .
UE{ul....,u ffi } 0=1
(d) Show by induction that, for all k, J k has the form
- [ :>1 I P ' ffik ]
J k(P k ) = min 1 kOik,"', kOik '
where Oil, . . . ,Oink are some vectors in n.
5.8
Consider the functions Jk(Pk) in the inst.ruction problem. Show inductively
t.hat each of these functions is piecewise linear, concave, and of t.he form
- 1 1 :2 2 mk (3 11J'k ]
Jk(Pk)=min[ak+/hpk, Oi k+(3kPk,...,Oi k + k pk,
where Uk,. . . , u'k, /11,. . . ,{J'k arc suit.able sc;dars.
Sec. 5.6
Notes, Source.>, ane .erClses
243
5.9
A person is offered N free plays to be distributed as he pleases between t.wo
slot. machines !I. and 13. Machine A pays U dollars wit.h Imown probabilit.y "
and nothing with probability (1 - s). Machine 13 pays (3 dollars wit.h prob-
ability P and nothing with probability (1 - p). The person does not know p
but instead has an a priori probabilit.y dist.ribut.ion F(p) for p. The problem
is to find a playing policy t.hat maximizes expected profit. Let (m+ n) denote
the number of plays in machine B after k free plays (m + n ::; k), and let. m
denote the number of successes and n t.he number of failures. Show t.hat. a
DP algorithm for this problem is given for m + n ::; k by
IN-l(m,n) = max[sOi, p(m,n)(3],
Jk(m,n) = max[s(Oi + Jk+l(m,n)) + (1- S)J k + 1 (m,n),
p(rn,n)(r'I+]k+l(m+ I,n)) + (1--p(rn,1I.))]k+l(m,1I.+ I)]
where
t pffi+l(l - p)ndF(p)
p(rn,n) = 0 I .
fa pm(1 - p)ndF(p)
Solve the problem for N = 6, Ct = (3 = 1, S = 0.6, dF(p)jdp = 1 for ° ::; p ::; 1.
[The answer is to play machine 13 for the following pairs (rn,n) : (0,0), (1,0),
(2,0), (3,0), (4,0), (5,0), (2,1), (3,1), (4,1). Otherwise, machine A should
be played.]
5.10
Ji
!I. person is offered 2 to 1 odds in a coin-tossing game where he wins whenever
a tail occurs. However, he suspects that the coin is biased and has an a priori
probability distribution F(p) for the probability p that a head occurs at. each
toss. The problem is to find an optimal policy of deciding whether to continue
or stop participating in the game give.l the outcomes of the game so far. A
maximum of N tossings is allowed. Indicate how such a policy can be found
by means of D P.
5.11
Consider the ARMAX model of Section 5.3 where instead of E {L::=l (Yk)2},
the cost is
E { t(Yk - Y)2 } ,
k=l
where y is a given scalar. Generalize the minimum variance policy for t.his
case.
_. J
244
Problcms witl} Impedeci State Information
Chap. 5
5.12
Consider the ARMAX model
!lk -I- aYk-l = Uk-AI + (k,
where M 1. Show that the minimum variance controller is
AI M
2 +... _ (__1)M- 1 a M - 1 Uk_AHI - (-1) a !lk,
IlkCh) = aUk-I - a Uk-2
that the resulting closed-loop system is
2 .,. ( _1 ) M-1aM-1Ck_M+l'
!lk = (k - aCk-l + a (k-2 - +
and t.hat. the long-term out.put variance is
1 - a 2M 2
E{(yd} = E{((k) }.
Discuss the qualitat.ive difference belweenthe cases la\ < 1 illlllla\ > 1,nd
relate it to the stability properties of the uncontrolled system Yk + aYk-l - Ck
and the size of t.he delay M.
6
Suboptimal and
Adaptive Control
Contents
6.1. Certainty Equivalent and Adaptive Control
6.1.1. Caution, Probing, and Dual Control .
6.1.2. Two-Phase Control and Identifiability
6.1.3. Certainty Equivalent Control and Identifiability
6.1.4. Self-Tuning Regulators. . . . . . .
6.2. Open-Loop Feedback Control. . . . . . .
6.3. Limited Lookahead Policies and Applications
6.3.1. Flexible Manufact.uring .....
6.3.2. Computer Chess . . . . .
6.4. Approximations in Dynamic Programming
6.4.1. Discretization of Optimal Control Problems
6.4.2. Cost-to-Go Approximation
6.4.3. Other Approximations
6.5. Notes, Sources, and Exercises. .
p.246
p.251
p.252
p. 254
p. 259
p.261
p.265
p. 26U
p.272
p.280
p.280
p.282
p. 28G
p. 287
245
_ J
246
Suvoptimal and Adaptive Control
Chap. (j
We have seen that it is sometimes possible to solve the DP algorithm
and to obtain an optimal policy in closed form. However, this tends to be
the exception. In most cases, one has to resort to a numerical solution.
The computational requirements for this are often overwhelming, and for
many problems a complete solution of the problem by DP is illlpossible.
The reason lies in what Bellman has called the "curse of dimensionality."
Consider for example a problem where the state, control, and distur-
bance spaces are the Euclidean spaces n, m, and r, respectively. In a
straightforward numerical approach, these spaces are discretized. Taking
d discretization points per state axis results in a state space grid with d n
points. For each of these points the minimization in the right-hand side
of the DP equation must be carried out numerically, which involves com-
parison of as many as d m numbers. To calculate each of these numbers,
one must calculate an expected value over the disturballce, which is the
weighted sum of as many as d r liumbers. Finally, the calculations must be
done for each of N stages. Thus as a first approximation, the number of
computational operations is at least of the order of N d n and cali be of the
order of N d n + m + r . It follows that for perfect state information problems
with Euclidean state and control spaces, DP can be applied numerically
only if the dimensions of the spaces arc relatively small. Based on the
analysis of the preceding chapter, we can also conclude that for problems
of imperfect state information the situation is hopeless, except, for very
simple or very special cases.
Tlnls, in practice one often has to settle for a suboptimal cont;rol
scheme that strikes a reasonable balance between convenient implementa-
tion and adequate performance. In this chapter we discuss some general
approaches for suboptimal control.
6.1
CERTAINTY EQUIVALENT AND ADAPTIVE CONTROL
The certainty equivalent controllcr (CEC) is a suboptimal control
schem-, that is motivated by linear-quadratic control theory. It applies
at each stage the control that would be optimal if some or all of the un-
certain quantities were fixed at their expected values; that is, it acts as if
a form of the certainty equivalence principle were holding.
We take as our model the basic problem with imperfect state infor-
mation of Section 5.1. Let us assume that the probability distributions of
the disturbances llIk do not depend on :I:k and Uk, a.nd that the state spaces
and disturbance spaces are convex subsets of Euclidean spaces, so that the
expected values
Xk = E{a:k I h},
llh = E{wk},
Sec. 6.1
Certainty Equivalcl. nd Adaptive Control
247
belong to the corresponding spaces. Example 1.2 describes a moditication
to handle the imperfectly observed Markov chain case.
The control input jiJ.-{h) applied by the CEC at each time I.- is det('l"-
mined by the following rule:
(1) Given the information vector h, compute
Xk = E{Xk I h}.
(2) Find a control sequence {Uk, -ij'k+1, . . . , UN _ d that solves the deter-
ministic problem obtained by fixing the uncertain quantities :Ck and
Wk,..., WN-l at their typical values, that is,
N-1
minimize gN(XN) + L g,(:I:i, Ii" TIi,)
,=k
subject to the initial condition Xk = Xk and t.he constrnints
u, E U i ,
Xi+1 = fi(Xi, U" w,),
i = k, k + 1,. . . , N - 1.
(3) Apply the cont.rol input
P'k(h) =Uk.
"
''!-)
Note that step (1) is unnecessary if we have perfect statr infonnatiol1.
The determin\1,tic optimization problem in step (2) must be solved at. each
time k, once the initial state Xk = E {Xk I h} becomes known by mC<1ns of
an estimation (or perfect observation) procedure. (In practice, one oft.en
uses a suboptimal estimation scheme that generates an approximation to
Xk') A total of N such problems must be solved by the CEC at. every system
run. Each of these problems, however, is deterministic. In many cases of
interest, these problems can be solved by powerful Ilumerical teclmiques
such as steepest descent, conjugate gradient, and Newton's method, see
e.g. [Der82b], [Lue84], and [Ber95]. Furthermore, the implemcntation of
the CEC requires no storage of the type required for the optimal feedback
controller.
An alternative to solving N optimal control problems in an "on-line"
fashion is to solve these problems a priori. This is accomplished by com-
puting an optimal feedback controller for the deterministic optimal con-
trol problem obtained from t.he original problem by replacing allllllClltaill
quantities by their expected values. It is easy to verify, based on the equiv-
alence of open-loop and feedback implementation of optimal controllers for
deterministic problems, that the implementation of the CEC given earlier
is equivalent to the following.
\;i
i
[
. I
i
I
i
K
)
'''Ii'
"",;
:,:,
.
)
!)
', i
,,1
_ J
248
SuUoptillJai ilnd Adapiive Contra.
Chap. (j
fiec. (j.1
Certainty loJquivak . ilnd Adapiive Control
249
Let {J.Lg(xo),..., J.L'fv-1 (XN-I)} be an optimal controller obtained from
the DP algorithm for the deterministic problem
N 1
minimize 9N(XN) + L 9k(Xk,fJ.d x d,llh)
k=O
subject to :rk+1 = /k (Xk, rtdXk), 'iIh), 11k(Xk) E Uk, k = 0, . . . ,N - 1.
Then the control input j"tk(h) applied by the CEC at time k is givcn by
j"lk(h) = J.LHE{Xk I h}) = IlCXk)
as shown in Fig. 6.1.1.
then the control input j"tk(h) applied by t.his variant of CEC at. t.illlc k b
given by
jtdh) -. II.r(.ed.
Wk
Vk
Generally, there are several variants of the CEC, where tbo st.ocllil.t.ic
uncertainty about some of the unknown quantities is explicitly dealt with,
while all other unknown quantities are replaced by estimates obtained in a
vmiety of ways.
The CEC approach often performs well in practice and yields near-
optimal policies. In fact, for the linear-quadratic problems of Scct.iolls -1.1
and 5.2, the CEC is identical to the optimal controller (certainty equiva-
lence prillciple). It is possible, however, that a CEC performs strictly worse
than the optimal open-loop controller (see Exercise 6.2). We now provide
some exam pies.
Uk;Jl(Xk)
Syslem
X k + 1 ; fi/.Xk 'Uk 'Wk)
Xk
Measurement
Zk; hi/.xk,uk .,.vk)
Zk
Example 1.1 (Multiaccess Communication)
Uk .,
Xk; E(Xkl/k}
Zk
Consider the slotted Aloha system described in Example 1.1 of Scct.io!l :>.1. I (
is very difficult to obtain an optimal policy for this problem, primarily because
there is no simple characterization of the conditional distribution of the stat.e
(the system backlog), given t.he channel transmission history. We t.herefore
resort to a suboptimal policy. As discussed in Section 5.1, t.he perfect. st.at.e
information version of the problem admits a simple optimal policy:
Figure 6.1.1 Structure of the certainty equivalent controller when implemented
in feed back form.
1
Itk(Xk) = -,
Xk
for all Xk 2: 1.
As a result, there is a natural CEC,
In other words, an alternative implementation of the CEC consists of
finding a feedback controller {J.Lg, J.L, . . . , J.L'fv -1} that is optimal for a cor-
responding deterministic problem, and subsequently using this controler
for control of the uncertain system (modulo substitution of the state by ltS
expected value). Either one of the definitions given for the CEC can serve
as a basis for its implementation. Depending on the nature of the problem,
one method may be preferable to thc other.
The preceding description of the CEC sets all future disturbances
w, to their expected values Wi. A useful variation for some imperfect
state information problems is to take into account the stochastic nature
of these disturbances, and to treat the problem as one of perfect state
information, using an estimate Xk of Xk as if it were exact. In particular,
if {ll(XO),...,J.Lj,,_1(XN-1)} is an optimal policy obtained from the DP
algorit.hm for the stochastic perfect state 1.11formo.twn problem
minimize k=O,I'N-l {9N(XN) + 9k(Xk,J.LdXk),Wk)}
subject to Xk+l = h (Xk, Ildxd, Wk), ILdxk) E Uk, k = 0,. . . , N - 1,
it..(h) = min [1, J '
where Xk is an estimate of the current packet backlog based on t.he enUre
past channel history of successeS, idles, and collisions. Recursive estimat.ors
for generating Xk are discussed in [Mik79], [HaL82], [Riv85], [13eG92], and
[Tsi87].
The versions of the CEC discussed above apply to t.he case wherc the Late
and disturbance spaces are Euclidean, so that Xk and lIik are well defined
as averages. In the case where the system is a finit.e-stat.e J\larkov "hnin
under imperfect state informat.ion, t.h('se avernge have no lIIf'aning, hilt. 1.1..,
CEC approach can be applied ill the modilied form described earlier. The
idea is to solve the corresponding problem of perfect st.ate informat.ion, and
then use the controller thus obtained for control of the imperfectly obsorved
system, modulo substitution of the exact state by an estimat.e obt.ained via
the Vit.erbi algorit.llIll described in Section 2.2.2. III part.icular, snppos(' that.
Example 1.2 (A Modified CEC for Finite-State Markov Chains)
_ J
250
Suboptimal and Adi1pt.ive Contrc
Chap. 6
Sec. 6.1
Cert.aint.,v EquivaIc.
nd Adapt.ivc Cont.rol
251
{/l.r;, . . . ,/l' -I} is an optimal policy for the corre;;ponding problem where t.he
;;tat.e i;; perfectly observed. Then the modified version of t.he CEC, given t.he
information vector h, u;;e;; the Viterbi algorithm to obtain (in real time) an
estimate x(h) of the current stat.e Xk, and then applies t.he control
thi'i i'i t.J"Il<' for example if for a fixed 0, t.he problem i'i linear-quadrat i.. of I Ill'
type considered in Sections 4.1 and 5.2. Then a version of the CEC t.ake'i t.he
form
Tt.dh) = ILk(h,(h),
[ik(h) = J.L(xk(h)).
where ih is some estimate of 0 ba<.;ed on the information vector h. Thu'i, in
t.his approach, the system is identified while it is being cont.rolled. llo\\'e\'('r.
t.he estimates of the unknown paramet.ers are used as if they were exact..
Example 1.3 (Control of Systems with Unknown Paramctcrs)
'vVe have been dealing so far with systems having a known stilte equat.ion. In
practice, however, there are many cases where the system parameters are not
known exactly or change over time. One possible approach is to estimate the
unknown paramc1.ers from input-output records of the system by using system
identification techniques (see [KuV8G], [LjS83], [Lju86]). This, however, can
be time consuming. Furthermore, the estimation must be repeated if the
paramcters change.
The alternative is to formulate the stocha.5tic control problem SO that
unknown parameters are dealt with directly. It can be shown that problems
involving unknown system parameters can be embedded within the frame-
work of our basic problem with imperfect state information by using state
augmentat.ion. Indeed, let the system equation be of the form
The approach of the preceding example is one of the principalllldhods
of adaptive control, that is, control that adapts itself to changing values of
system paranlCters. In the renwinder of t.hi" section, we discuss SOllie of
the associated issues.
6.1.1 Caution, Probing, and Dual Control
ii
'I
I
Suboptimal control is often guided by the qualitative nature of op-
timal control. It is therefore important to try to underst.and some of the
characteristic features of the latter in the case where some of the system
parameters are unknown. One of these is the need for balance octween
"caution" and "probing," two notions that are best explaincd by IIIcaWi of
all example; see also [Uar8!].
'1
I
i
I
Xk+l = Jd;l:k, 0, Uk, Wk),
where 0 is a vector of unknown parameters with a given a priori prouability
distribution. We introduce an additional state variable Yk = 0 and obt.ain a
system equation of the form
Example 1.4 [Kum83]
Consider the linear scalar syst.em
, f
: i
I
( Xk+l ) = ( h(Xk,Yk,Uk,Wk) ) .
Yk+l Yk
Xk+l = .Tk + bUk + Wk,
k = 0, I,. . . , N - I,
Xk+l = lk(xk,uk,wk),
and the quadratic t.erminal cost. E{ (XN)2}. Here everyt.hing is a'i in Section
4.1 (perfect state information) except that t.he control codlicil'nt b is unknown.
Inst.ead, it is known that the a priori probability dist.riuut.ion of {) is Gaussian
with mean and variance
This equation can be written compactly as
where Xk = (Xk, Yk) is t.he new state, and 1 k is an appropriat.e function. The
initial state is
bE{b}>O,
2 { - 2 }
G'b = E (b - b) .
Xo = (xo,O).
Furthermore, Wk is zero mean Gaussian with variance G' for each k.
Consider first t.he case where N = 1, so the cost is calculated t.o be
With a suitable reformulation of t.he cost function, the resulting proulcm
becomes one that fits our usual framework.
Unfortunately, however, since Yk = ° is unobservable, we are faced wit.h
a problem of imperfect stat.e information eyen if the controller knows the st.ate
:l:k exactly. Thus, typically an optimal solution cannot be found. Nonet.heless,
a certainty equivalent controller is often convenient. In particular, suppose
that for a fixed parameter vector 0, we can compute the corresponding optimal
policy
E{ (Xr)2} = J{(xo + buo WO)2} = :r6 + 2bJ:01Lo + (b 2 + G'nnG I (J,,.
The minimum oYer Uo is attained at.
{1.L(l0,0),... ,J.L",,-I(lN-l,O)};
{;
Uo = -xo,
b + G'
_ J
252
Suvoptimal and Adaptive COIJtn
Chap. 6
Sec. (j.l
Certainty .E(juivalel
JJ Adapt,ive Control
:.![);J
a(1) 2 2
Jl(h)= _ 2 Xl+a w ,
(b(l)) + aW)
(1.1)
a parameter identification phasc and a contr-ul pha.se. In the first phase tl\('
unknown parameters are identified, while t.he cont.rol takes no accollnt of
the interim result.s of identification. The final paramder est.illlat.<.s [rOll!
the first phase are then used to implement an opt.imal control law in t.he
second phase. This alternation of identification and control phases may be
repeated several times during any system run in order to t.ake int.o account.
subsequent changes of t.he parameters.
One drawback of this approach is that information gathered during
the identification phase is not used to adjust the control law until the begin-
ning of the second phase. Furthermore, it is not always easy to determine
when to terminate one phase and start the other.
A second difficulty, of a more fundamental nature, is due to t.he fact
that the control process may make some of the unknown parameters in-
vi::iib1c to t.he identification process. This is t.he problem of paralndcr
identifiability [Lju86], which is best explained by means of an example.
ami the optimal cost is verified by straightforward calculat.ion to be
2
(J& 2 2
xo+aw'
b + a
Therefore, the optimal control here is cautious in that the optimum Juo I de-
crea.<;es as t.he uncertaint.y in b (i.e., a) inCreases.
Consider next the case where N = 2. The optimal cost-to-go at stage
1 is obtained by the preceding calculation: '
where h = (Xo,1/.0, Xl) is t.he informat.ion vector and
b(l) = .E{b lId,
a(l) = E{(b-b(1))2 1 h}.
Example 1.5
Let. us focus on the term a(1) in the expression (1.1) for .II (lJ). We
can obtain at(1) from the equation Xl = Xo + buo + Wo (which we view as a
noise-corrupted measurement of b) and least-squares estimation theory (see
Appendix B). The formula for at(l) will be of no further use to us, so we just
state it without going into the calculation:
Consider the scalar system
Xk+l = aXk + bUk + Wk,
k = 0, 1,..., N - 1,
wit.h the quadratic cost
E {t(Xk)2}.
2 2
2 (1) = _ (J/,a",
ab 2 2 2 .
uOa& + a w
( 1.2)
The salient feature of this equation is that at (1) is affected by the control Uo.
Basically, if 111.01 is small, t.he measurement Xl = Xo + buo + Wo is dominated
by Wo and the "signal-to-noise ratio" is small. Thus to achieve small error
variance a(l) [which is desirable in view of Bq. (1.1)], we must. apply a
control 11.0 that is large in absolute value. A choice of large control to enhance
parameter identification is often referred to as probing. On the other hand,
if luol is large, Ixd will also be large, and this is not desirable in view of Bq.
(1.1). Therefore, in choosing Uo we must strike a balance bet.ween caution
(choosing a small value to keep Xl reasonably small) and probing (choosing a
large value to improve the signal-to-noise ratio and enhance estimation of b).
\'lIe assume perfect state information, so if the parameters a and b are known,
this is a minimum variance control problem (ef. Section 5.3), and the optimal
control law is
J1.k(Xk) = -Xk'
Assume now that t.he parameters a and b are unknown and consider the two-
phase method. During the first phase the control law
ji.k(Xk) = "(Xk
(1.3)
The tradeoff between the control objective and the parameter estima-
tion objective is comlllonly referred to as dual control. It manifests itself
often when system parameters are unknown, but unfortunately it cannot
be quantified precisely in most cases.
is used ("( is some scalar; for example, "( = -a/b, where a and T, arc a priori
estimates of a and b, respectively). At the end of the first phase, the control
law is changed t.o
ji.k(Xk) = -Xk'
b
where hand b are the estimates obtained from the identificat.ion process.
However, wit.h the control law (1.3), t.he closed-loop syst.em is
6.1.2 Two-Phase Control and Idcntifiability
Xk+l = (a + b"()Xk + Wk,
An apparcntly reasonable form of suboptimal control in the presence
of unknown parameters is to separate the control process into two phases,
so the ident.ificatioll process can at best identify t.he value of (a + b"() but
not the values of both a and b. In other words, the identiftcat.ioll proc(,ss
_...1
254
Suboptimal and Adaptive Contr
Chap. G
Sec. G.l
Certainty Bquivall .wd Adaptive Control
255
cannot discriminate between pairs of values (al,b J ) and (a2,b 2 ) such that.
al + bn = a2 + byy. Therefore, a and b are not identifiable when feedback
cont.rol of t.he forlll (1.3) is applied.
One way to correct the difIiculty is to add an additional known input
Ok to the control law (1.3); that is, use
where ih is an estimate of 0 based on the information
I k -= {.Cu,:c: 1, . . . , .CA:, '/I.u, It 1 , . . . , lJ.k - d
available at time kj for example,
fik(:J:k) = "(Xk + Ok.
Ok = E{O I h}
Xk+l = (a + b"()Xk + bOk + Wk,
or, more likely in practice, an estimate obtained via an on-line system
identification method (see [KuV86], [LjS83], [Lju86]).
One would hope that when the horizon is very long, the parameter
estimates Ok will converge to the true value 0, so the certainly equivalent
controller will become asymptotically optimal. Unfortunately, we will see
that difficulties related to identifiability arise here as well.
Suppose for simplicity that the system is stationary with a priori
known transition probabilities P{Xk+1 I.Tk,lJ.k, O} and that the control law
used is also stationary:
Then the closed-loop system becomes
and the knowledge of {xd and {od makes it possible t.o identify (a + b"() and
b. Given "(, one can t.hen obtain estimates of a and b. Actually, to guaran-
tee this in a more general context where the system is of higher dimension,
the sequence {Ok} must satisfy certain conditions: it must be "persistently
exciting" (see [LjS83]).
A second possibilit.y to bypass the identifiability problem is to change
the structure of the syst.em by artificially introducing a one-unit. delay in
the control feedback. Thus, instead of considering control laws of the form
jLk(Xk) = "(Xk, we consider cont.rols of the form
Mk(h) = IL*(Xk,Bk),
k = 0,1,...
Uk = ILk(Xk-J) = "(Xk-J.
There are three systems of interest here (d. Fig. 6.1.2):
(a) The system (perhaps falsely) believed by the controller to be true,
which evolves probabilistically according to
The closed-loop system then becomes
p{Xk+ll.I:k,/J.*(:I:k,Ok),Od.
Xk+l = aXk + b"(Xk-J + Wk,
and given "(, it is possible t.o identify both parameters a and b. This technique
can be generalized for systems of arbitrary order, but artificially introducing
a control delay makes the system less responsive to control.
(b) The true closed-loop system, which evolves probabilistically according
to
P{Xk+ll Xk,IJ.*(xk,ih),O}.
6.1.3 Certainty Equivalent Control and Identifiability
(c) The optimal closed-loop system that corresponds to the true value of
the parameter, which evolves probabilistically according to
P{Xk+ll Xk,IL*(.Tk,O),O}.
At the opposite extreme of t.he two-phase method we have t.he cer-
taint:r equivalent control approach, where the parameter estimates arc in-
corporated into the control law as they are generated, and they are treated
as if they were true values. In terms of the system
Xk+1 = /k(.Tk, 0, Uk, Wk)
For asymptotic optimality, we would like the last two systems to be equal
asymptotically. This will certainly be true if ih -t 0. However, it is quite
possible that either
(1) fh does not converge to anything, or that
(2) ih converges to a parameter 0 i- 0.
There is not much we can say about the first C<1.e, so we cOllcentmtc
on the second. To see how the parameter estimates can converge to a wrong
value, assume that for some iJ i- 0 and all Xk+l, :I:k, we have
considered in Example 1.3, suppose that, for each possible value of 0, the
control law 71"* (0) = {1J. o (', 0), . . . , ILN -1 (.,O)} is optimal with respect to a
certain cost J,,(xu, 0). Then the (suboptimal) control used at time k is
Mk(h) = 1L(Xk, Ok),
P{Xk+11 Xk,/l*(.Tk,O),B} = p{Xk+1 I Xk,IL*(:rk,B),O}.
(1.4)
_. J
256
Suboptimal ami AJaptive Cont.re
Chap. (j
Sec. 6.1
CeriainLy Equivalc. .nJ AJapiive Conirol
257
In words, therc is a false value of parameter for which thc systcm under
dosed-loop control looks cxactly as if the false value were truc. Then, if thc
c01\troller estimates at some timc the parameter to be iJ, subsc(!uent. data
will tend to reinforcc this erroncous cstimatc. As a rcsult, a situation may
develop whcrc the identificat.ion procedure locks Ollto a wrong paramd(r
value, regardless of how long information is collccted. This is a diJliculty
with identifiability of the t.ypc discussed earlier in connection with two-
phasc control.
On thc othcr hand, if thc parametcr cstimatcs converge to some (pos-
sibly wrong) value, wc can argue intuitively that the first two systems
(believcd and true) typically become cqual in thc limit <l.'i k --+ 00, since,
generally, parametcr estimate convergcncc in idcntification mcthods im-
plics that the data obtaincd arc asymptotically consistent with thc view of
thc system one has bascd on thc current estimates. Howcvcr, the believed
and truc systems mayor may not become asymptotically equal to the opti-
mal closed-loop system. Wc first present two examples that illustrate how,
cven whcn the parametcr cstimatcs convcrge, the truc closed-loop system
can differ asymptotically from the optimal, thereby resulting in a certainty
cquivalcnt controller that is strictly suboptimal. We then discuss thc spe-
cial case of the self-tuning regulator for ARMAX models with unknown
paramcters, whcre, rcmarkably, it turns out that all thrce of thc above sys-
tems are typically equal in the limit, even though the paramet(,r est.imates
typically converge to falsc values.
Example 1.6 [Bo V79]
Consider a two-statc system wit.h two controls IL I and ILl. The transitiun
probabilities depend on the control applied as well as a parameter 0 which
is known to take one of t.wo values 0' and O. They arc as shown in Fi. 6.1..1.
There is ,.;cro cost for a transition from st.at.e I t.o it.self and a unit. ("ost. lor
all ot.her t.ransitions. Therefore, the opt.imal control at. st.at.p 1 is !II(' on,' t.hat
maximizes t.he probabilit.y of t.he st.at.e remaining at. 1. Assume t.hat. the t.rue
paramet.er is r)' and that
PII (11. 1,0) > PII (11. 2 ,0), P II (u 1, 0') < P II (IL 2 , 0').
Then t.he opt.imal control is u 2 , but if t.he controller think.< that. t.he t.rnc
parameter is 0, it will apply Ul. Suppose also that
pII(U 1 ,0) =PII(UI,O').
Thcn, undcr 11. 1 the system looks identical for bo.th values of t.he parameter,
so if the controller estimates the parameter to be 0 and applies u 1 , subsequent.
dat.a wi!l tend to reinforce the controller's belief that the t.rue parameter is
indeed O.
PI1(U 2 ) = 0.6
Cost = 1
P 1 1(U 1 ) = 0.5 Cost = 1
Cost=O : =:0
Optimal Closed-Loop System
P(Xk + 1 I X. ,,u' (Xk, 0),8}
True Closed-Loop System
P(Xk + 1 I Xk,,u' (Xk'ok). 0 }
Transition probabilities for IJ = IJ'
(true parameter value)
Cost = 1
System Believed to beTrue
P(Xk + 1 I xk,,u'(xk,9k),9k }
P 11 (U 1 ) = 0.5
Cost=O "=:0
P11 (u 2 ) = 0.3 Cost = 1
Transition probabilities for IJ = iJ
(false parameter value)
Figure 6.1.2 The three systems involved in certainty eqnivalent control, where
o is the trne paramcler and ih is the parametcr estimate at time k. Loss of
opt.imality occurs when the true system differs asympt.ot.ic..lly from t.he optimal
closed-loop system. If the parameter estimates converge to some v..lue 0, the true
syst.em t.ypically becomes asymptotically equal t.o the system believed t.o be true.
However, the parameter estimates need not converge, and even if they do, both
systems may be different asymptotically from the optimal.
Figure 6.1.3 Transition probabilities for t.he two-state system of Examplc
1.6. Under the nonoptimal cont.rolu l , the system looks ident.ic..] nnder the
true and the false values of the paramet.er O.
More precisely, suppose that we estimate 0 by sclect.ing al. each timc k
the valuc that. maximizes
P{O I h} = P{h 10}P(0)
P(h)
_ J
258
Suboptimal and Adaptive Contr
Chap. 6
See. G.1
Certainty Equivalc. .nd Adaptive Control
259
where prO) is the a priori probabilit.y that the true parameter is 0 (this is a
popular estimation method). Then if prO) > P(O'), it can be seen, by using
iuduction, that at. each t.ime k, the coutroller will est.imate falsely () t() be
a and apply t.he incorrect control u l . To avoid t.he difficulty illustrat.ed in
t.his example, it has beeu suggest.ed t.o occasionally deviate from tI\(' certaint.y
equivalent cont.rol, applying other controls that enhance the identilication of
t.he unknown parameter (see [DoS80], [KuL82]). For example, by making sure
that t.he control u 2 is used infrequently but. inlinitely often, we C,Ul guarantee
that the correct parameter value will be identified by the preceding est.imation
scheme.
Consider the linear scalar system
If at time k the controller estimates incorrectly t.he parameters t.o be rI = (1, 1),
because Vk(O) < VdO), the control applieu will be Uk = -."Ck/2 and t.he I.me
closed-loop syst.cm will evolve accoruing t.o
Xk
Xk+l = 2 + Wk. (1.7)
On the ot.her hanu, the controller thmks (given the est.imat.e () t.hat. t.he dosed-
loop syst.elll will evolve accoruing to
Xk Xk
Xk-j-I = Xk + Uk + Wk = Xk - 2 + Wk = 2 + Wk, (1.8)
so from Eqs. (1.6) and (1.7) we see t.hat undcr thc cuntrollaw Uk = .--.l:k/2,
the closed-loop system evolves identically for both the true and the false values
of the parameters [ef. Eq. (1.4)].
To see what. can go wrong, note t.hat if Vdrl) < Vk(O) for some k we
will have, from Eqs. (1.5)-(1.8),
Example 1.7 [Kum83]
N-l
L:((xd+ 2 (ud),
k=O
Xk+l + Uk = Xk+l - Xk - Uk,
so from Eqs. (1.5a) and (1.5b) we obtain
Vk+l(O) < Vk+l(O).
Therefore, if VI (0) < VI (0), the least-squares ident.ification method will yield
the wrong est.imate () for evel'y k. To see that this can happen wit.h positive
probabilit.y, note that, since the true system is Xk+l = -Uk + Wk, we have
VI (0) = (Xl - Xo - UO)2 = (wo - Xo - 2uo)",
Vl(O) = (Xl + uo)" = w5.
Therefore, the inequality VI (0) < VI (0) is equivalent to
(xo + 2UO)2 < 2wo(xo + 2uo),
which will occur with positive probability since Wo is Gaussian.
Xk+1 = aXk + bUk + Wk,
where we know that. t.he parameters are eit.her (a., b) = (1,1) or (a,lJ) =
(0, -I). The sequence {wd is independent., stationary, zero mean, and Gaus-
sian. The cost. is quadratic of the form
where N is very large, so the stat.ionary forlll of the optimal control law is
used (see Section 4.1). This cont.rol law can be calculated via t.he Riccat.i
equation t.o be
J.t*(Xk) = { f
if (a,b) = (I, I),
if (a, b) = (0, -1).
The preceding examples illustrate that loss of identifiability is a se-
rious problem that frequently arises in the context of certainty equivalent
control.
To est.imat.e (a, lJ), we use a least-squares identification method. The value of
t.he least-squares crj terion at. time k is given by
k-l
"" 2
Vk(l, 1) = L../Xi+l - Xi - u;) ,
i=O
6.1.4 Self-Tuning Regulators
for (n,b) = (1,1),
(1.5a)
Xk+1 = -Uk + Wk.
(1.6)
We described earlier the nature of the identifiability issue in cer-
tainty equivalent control: under closed-loop control, incorrect parameter
estimates can make the system behave as if these estimates were correct
[ef. Eq. (1.4)]. As a result, the identification scheme may lock onto false
parameter values. This is not necessarily bad, however, since it Illay hap-
pen that the control 1,1w implemented on the basis of the false parameter
values is near opt.imal. Indeed, through a fortuitous coincidence, it turns
out that in the practically important mini1lw.m "I!!LT'ilI.1l.ce colI.lml !o'!"1/1.'//.la-
tion (Section 5.3), whcn the pammeter estimates converge, they typicaLLy
converge to false valnes, but the l'csulting con/.rollaw typically Clin"l!c lye", to
the optimal. We can get an idea about this phenomenon by means of all
example.
k-I
VdO,-l) = L(Xi+l +Ui)2,
t=O
for (a, b) = (0, -1).
(1.5b)
The cont.rol a.pplied at time k is
{ _::.&.
Uk ji.k(h) = 0"
if Vk(l, 1) < VdO, -1),
irVdl,l) > VdO,-I).
Suppose the t.rue parameters are 0 = (0, --I). Then t.he true system evolves
according t.o
_. J
260
SllvopLimal and AdapLivc Contr,
Chap. 6
Examplc 1.8
Consider the simplest AIUvIAX model:
Yk+l + aYk = bUk + Ek+l.
The minimum variancc control law when a and b arc known is
a
Uk = ILk(h) = bYk.
Suppose now that. a and b arc not known but. arc identified on-linc by means
of SOIllC schcmc. Thc control applicd is
ak
Uk = -;;-Yk,
bk
(1.9)
where ak and bk are thc cstimates obtained at timc k. Then the difficulty
with idcntifiability occurs when
ak a)
bk --> b,
where a and b arc such that thc true closed-loop system given by
ba
Yk+l + aYk = -;:-Yk + Ek+l
b
coincides with the closcd-loop system that the controller thinks is true on the
basis of thc estimates a and b. This latter system is
Yk+l = Ek+l.
For t.hesc two systcms to be idcntical, we must have
a a
b t;'
which mcans that the control law (1.9) a.<;ymptotically becomes opt.imal de-
spitc the fact that the asymptotic estimates a and b may be incorrect..
Example 1.8 can be extended to the general ARMAX model of Section
5.3 with no delay:
m m m
Yk + L aiYk-i = L biuk-i + Ek + L CiEk-i.
i=1 i=1 i=1
(1.10)
If the parameter estimates converge (regardless of the identification method
used and regardless of whether the limit values are correct), then a min-
imum variance controller thinks that the closed-loop system is asymptoti-
cally
Yk = Ek.
(1.11)
Sec. 6.2
Open-Loop Feedv<lc )onirol
261
Furthermore, parameter estimate convergence intuitively nwans that the
true closed-loop system is also asymptotically Yk = Ek, and this is clearly
the optimal closed-loop system. Results of this type have been proved
in the literature in connection with several popular methods for parameter
estimation. In fact, surprisingly, in some of these results, the model adopted
by the controller is allowed to be incorrect to some extent.
One issue that we have not discussed is whether the paramct.er esti-
mates indeed converge. A complete analysis of this issue is quite difficult.
We refer to the survey paper [Kum85], and the textbooks [GoS84], [Ku V86],
and [AsW90] for a discussion and sources on this subject. However, exten-
sive simulations have shown that with proper implementation, these esti-
mates typically converge for the type of systems likely to arise in many
applications. For this reason, self-tuning regulator schemes are presently
used widely in practice as general purpose process control algorithms.
6.2 OPEN-LOOP FEEDBACK CONTROL
.,
The open-loop feedback controller (OLFC) is similar to the CEC ex-
cept that it takes into account the uncertainty about Xk,Wk,...,WN-1
when calculating the control 7lk(h) to be applied at time k. This control
is determined as follows:
(1) Given the information vector h, compute the conditional probability
distribution PXk11k'
(2) Find a control sequence {Uk, Uk+l"", UN-I} that minimizes
{ N-1 }
E gN(XN) + L 9i(Xi, Ui, wiJ I h
Xk, W k,...,wN_l
i=k
subject to the constraints
Xi+l = !i(Xi, Ui, Wi),
Ui E U i ,
i = k, k + 1,..., N - 1.
(3) Apply the control input
:
7lk(Ii) = Uk.
Thus the OLFC uses at time k the new measurement Zk to calcu-
late the conditional probability distribution P xk1h ' However, it selects the
control input as if no further measurements will be received in the future.
Similar to the CEC, the OLFC requires the solution of N optimal con-
trol problems. Each problem may again be solved by deterministic optimal
_ J
262
Suboptimal and Adaptive Contro,
Chap. 6
Sec. 6.2
Open-Loop Feedbu Control
263
control tcchniqucs. Thc computations, however, may be more complicatcd
than thosc for thc CEC, sincc now thc cost involves an expectation with
respect to the uncertain quantities. The main difficulty in the implementa-
tion of the OLFC is the computation of PXk11k' In many cases one cannot
compute PXkl1k cxactly, in which case some "reasonablc:' approimatio.n
scheme must be used. Of course, if we have perfect state mformatlOn, this
difficulty does not arise.
In any suboptimal control scheme, one would like to be assured that
measurements are advantageously used. By this we mean that the scheme
performs at least as well as any open-loop policy that uses a sequcnce
of controls that is independent of the valucs of thc measurements re-
ceived. An optimal open-loop policy can be obtained by finding a sequence
{ u o , U i , .. . , uN _ I} that minimizes
] k(h) = E {!lk (:rk, /idh), Wk)
Xk,Wb V k+l
+ ] k+l (h, hk+1 (ik (Xk, /ik(h), Wk), /ik(h), Vk+l), /ik(h)) I h},
k = 0,..., N - 1,
(2.4)
where h k is the function involvcd in the measurcmcnt equation as in the
basic problem with imperfect state information of Section 5.1.
Considcr the functions Jk(h), k = 0,1,. . ., N - 1, defined by
Jk(h) =
{ N-l }
min E gN(XN) + L gt(X t , Ui, Wi) I h .
uiEUi xk''Wi
i=k,...,N-l x t +l==/i(xi,ui''Wi) i=k
i=k.....N-l
{ N-l }
](uo Ul,...,UN-d= E gN(XN) + Lgk(Xk,Uk,Wk)
, xQ''Wk
k=O,I,...,N-l k=O
(2.5)
The minimization problem in this equation is the one that must be solved
at time k in order to calculate the control /ik(h) of the OLFC. Clearly,
Jk(h) can be interpreted as the optimal open-loop cost from time k to
time N whcn the current information vector is h. It can bc scen that
subject to the constraints Xk+l = fk(Xk, Uk, Wk), and Uk E Uk, k =
0,1,. . ., N -1. A nice property of the OLFC is that it performs at least as
well as an optimal open-loop policy, as shown by the following proposition.
Dy contrast, the CEC does not have this property (sce Exercise 6.2).
E{ J8(zo)} S J o .
zo
(2.6)
i
I
We will prove that
]k(h) S Jk(h),
for all hand k.
(2.7)
Proposition 2.1: The cost Jw corresponding to an OLFC satisfies
Then from Eqs. (2.2), (2.6), and (2.7), it will follow that
hSJ o ,
(2.1)
hSJ o ,
where J o is the cost corresponding to an optimal open-loop policy.
which is the relation to be proved. We show Eq. (2.7) by induction.
By the definition of the OLFC and Eq. (2.5), we have
Proof: We assume throughout the proof that all expected values appear-
ing are well dcfined and finite, and that the minimum in the following Eq.
(2.5) is attained for every h. Lct'if = {/iO,/iI,...,/iN-d bc the OLFC.
Its cost is given by
]N-l(IN-tJ = J'iv_l(IN-d,
for all IN-I,
and hence Eq. (2.7) holds for k = N - 1. Assume that
]k+l(h+l) S Jk+l(h+l),
for all h+l.
(2.8)
h = E{Jo(Io)} = E{Jo(zo)}, (2.2)
20 20
Then from Eqs. (2.4), (2.5), and (2.8), we have
]N-l(IN-I) = E {9N(JN-I(XN-I,/iN_1(IN-1),WN-tJ)
XN-l,WN-l (2.3)
+ gN-I (XN-I, /iN-I (IN-tJ, WN-I) I IN-I},
]k(h) = E {gk(Xk,/ik(h),Wk)
Xk,Wk'Vk+l
+ ] UI (h, hk+l (ik (Xk, /ik(h), Wk), /ik(h), VU1), /ik(h)) I h}
S E {gk(Xk,/ik(h),Wk)
Xk,Wk,Vk+l
wherc ]0 is obtained from the recursive algorithm
+ J k + 1 (h, hk+l (ik (Xk, /ik(h), Wk), /ik(h), Vk+l), /idh)) I h}
_..1
264
Suboptimal and Adaptive Contra.
Chap. 6
E { min E { 9k (Xk, Jidh), Wk)
xk,wk,vk+l . uiEUi xk+l,wi
1.=k+l,....N -1 xi+l =Ii(xi,ui,wi)
i=k+l.....N-l
N-l }
+ i1 gi(Xi, Ui, Wi) + gN(XN) I h+l} I h
s:
m i n
uiEUi
i=k+l.....N-l
{!/N(:r:N) +!/k (:r:k, Tik(h), Wk)
E
xk,wk,wi
xi+l =fi(xi,ui,wi)
i=k+l.....N-l
""k+l =h(""k./ik(lk)''''k)
N-l
+ L Yi(Xi, Ui, Wi) I h}
i=k+1
= Jk(h).
The second inequality follows by interchanging expectation and minimiza-
tion (since we generally have E{min[']) s: min[E{.}]) and by "intcgrating
out" Vk+l. Thc last equality follows from the definition of OLFC. Thus
Eq. (2.7) is proved for all k and the dBsired result is shown. Q.E.D.
It is worth noting that by Eq. (2.7), Jk(Id, which is the calculated
open-loop optimal cost from time k to time N, provides a readily obtainable
performance bound for thc OLFC.
The preceding proposition shows that the OLFC uses the measure-
ments advantageously even though it selects at each period the present
control input as if no further measurements will be taken in the future. Of
course, this says nothing about how closely the resulting cost approximates
the optimal. It appears, however, that the OLFC is a fairly satisfactory
mode of control for many problems.
Partial Open-Loop Feedback Control
A form of suboptimal control that is intermediate between the optimal
feedback controller and the OLFC is provided by what we call thc partial
open-loop feedback controller (POLFC for short). This controllcr uses past
measurements to compute Pxllk' but calculates the control input on the
ba:sis that some (but not all) of the measurements will in fact bc takcn in
thc future, and the other measurements will not be taken.
This method often allows one to deal with those measurements that
are troublesome and complicate the solution. As an example consider an
inventory problem such as the one cousidercd in Sect.ion 4.2, where forccasts
in the form of a probability distribution of each of the future demands
become available over time. A POLFC calculates at each stage an optimal
(8, S) policy based on the current forecast of future demands and follows
this policy until a new forecast bccomes available. When this happens,
Sec. 6.3
Limited Lookaheaa ..,jicies and Applications
265
the currcnt. policy is abandoned in favor of a I1CW onc t.hat is calculated
on the basis of the new probability distribution of future dcmands, ctc.
Thus tle complications due to the forccasts are bypasscd at thc cxpcusc of
suboptl1nahty of the policy obtaincd.
We notc that an analog of Prop. 2.1 can bc shown for t.he POLFC
(see [Ber76]). In fact thc corresponding error bound is potentially much
better than the bound (2.1), reflccting thc fact that thc POLFC t.akes int.o
acconnt. thc futurc availability of somc of t.he mcasuremcnts.
6.3 LIMITED LOOKAHEAD POLICIES AND APPLICATIONS
A practical way to rcducc thc computation rcquircd by DP is to trun-
cate the time horizon and use at cach stage a dccision based on loolmhcad
of a small number of stages. The simplest possibility is to use a one..
step look ahead policy whereby at stage k and statc Xk onc uses thc control
Jik(Xk), which attains the minimum in the expression
m U iI ( l ) E{9k(Xk,Uk,Wk) + Jk+I(!k(Xk,Uk,Wk) )} ,
Uk E k xk
",:hee J k + 1 is some approximation of the true cost-to-go function Jk+l.
Smularly, a two-step lookahead policy applies at timc k and state Xk, the
contrC?,1 ii-k(xd attaining the minimum in the preceding equation, where
now J k +1 is obtained itself on the basis of a one-stcp lookahcad approxi-
mation. In other words, for all possible statcs
Xk+1 = fk(Xk, Uk, Wk),
we have
lk+I(Xk+d = min E{gk+I(Xk+l,Uk+I,Wk+l)
Uk+l EUk+l (xk+d
+ lk+2 Uk+l (Xk+l, Uk+l, Wk+iJ) },
wlere lk+2 is some approximation of the cost-to-go function J k + 2 . Policies
with lookahead of more than two stages are similarly dcfined.
The computational savings of this approach are evident. For a OI1C-
st.ep lookahead policy, only a :;ingle 1I1ini1l1i",ation probleln lm.'i t.o 1)(, sol ved
per stage, while in a two-step policy the number of states at which the DP
equation has to be solved at stage k is one plus the number of all possible
next states Xk+l that can be generated from the current state Xk. Note
also that the fixed lookahead approach can be combined with the certainty
_ J
266
Suboptimal and Adaptive Control
Chap. 6
equivalent and open-loop-feedback control approachcs of Sections 6.1 and
6.2 to simplify cven further thc calculations.
Note that thc limited lookahcad approach can be used cqually well
with infinite horizon problems. One simply uses as thc terminal cost-to-go
function an approximation to the optimal cost of the infinite horizon prob-
lem that starts at the end of the lookahead. Thus the following discussion
applics to both finite and infinite horizon problcms.
Solving Limited Lookahead Subproblems
An important fact that facilitates the application of limited lookahead
policies is that a perfect state information problcm with a small number
of stages can often be solved as a nonlinear programming problem with
a relatively small number of variables. In particular, suppose that the
control spacc is the Euclidean space 3(m, and the disturbance can take a
finite numbcr of valucs, say r. Then it can be shown that for a givcn initial
state, an N-stage perfcct state information problem can be formulated as a
nonlinear programming problem with m(l+rN-l) variablcs. We illustrate
this by means of an important example and then discuss the general case.
Example 3.1 (Two-Stage Stochastic Programming)
Hcre we want to find an optimal two-stage decision rule for the following
situation: In the first stagc we will choose a vector Uo from a subset Uo C m
with cost !Jo(uo). Thcn an unccrtain evcnt represcntcd by a random variable w
will occur whcre w will take one of thc values Wi, . . . ,w r with corresponding
probabilities pi,. . . ,pr. Wc will know the value w j oncc it occurs, and e
must then choose a vector u{ from a subset U l C m at a cost 9l(U{,W J ).
Thc objectivc is to minimize the expect cd cost
9o(uo) + 2::>j 9l(U{, w j )
j=l
SU)jcct to
Uo E Uo,
u{EU l , j=l,...,'I'.
This is a nonlincar programming problcm of dimension m(l + r) (thc opti-
mization variables are Uo, u;,.. . ,uD. It can also bc viewed as a two-stage
pcrfect state information problcm, where Xl = Wo is the state equation, Wo
can take the values Wi,.. ., w r wit.h probabilities pi,. . ., pT, the cost of t.he
I1rst stage is 90(uo), and the cost of the second stage is 9l(XI, Ul).
The preceding example can be generalized. Consider the basic prob-
lem of Chapter 1 for the case where there are only two stages (N = 2)
and the disturbances Wo and WI can independently take one of the r values
Sec. 6.3
Limited Lookahcac. Ajcies and ApplicaUons
267
w l , . . . ,w T with corrcsponding probabilities pl, . . . ,pT. Thc optimal cost
function Jo(xo) is given by the two-stage DP algorithm
Jo(xo) =
min [ tpJ { 90(X O ' Uo, w j )
uoEUo(xo) ,
J=1
+ j min . [ t pi {91 (Jo(xo, uo,wj),ui, w')
U l EUI (fo(xo,uo,wJ)) i=1
+ 92(/1 (Jo(xo, uo, w j ), uL w')) } ] } ] .
This DP algorithm is cquivalent to solving thc nonlinear programming
pro blem
minimize P.i {90(XO, Uo, w j ) + p,{ 91 (Jo(xo, uo, wJ), u, wi)
subject to Uo E Uo(xo),
+,1]2 (h (Jo(xo, Uo, w j ), u{, Wi)) } }
u{ E Ul(JO(XO,uo,w j )), j = 1,...,1'.
If the controls Uo and u 1 are elements of 3(m, the number of variables in
the abovc problcm is m(l + r). More generally, for an N -stagc perfcct state
information problcm a similar reformulation as a nonlinear programming
problem rcquires m(l + r N - 1 ) variables. Thus if the number of lookahead
stages is relatively small, a nonlinear programming approach may be thc
preferred option in calculating suboptimal limited lookahead policics.
Terminal Cost-to-Go Approximations
A key issue in implementing a limited lookahead policy is the selection
of the cost-to-go approximation at the final step. It may appcar important
at first that the true cost-to-go function should be approximated wcll ovcr
the range of relevant states; however, this is not necessarily true. What
is important is that the cost-to-go differentials (or relative values) be ap-
proximated well; that is, for an n-step lookahead policy it is important to
have
Jk+n(X) - Jk+n(x') Jk+n(X) - Jk+n(:r'),
for any two states x and x' that can be generated n steps ahead from the
c_urrent state. For example, if cquality were to hold above for all x, x', then
Jk+n(X) and Jk+n(x) would differ by the same constant for each relevant
x and the n-step lookahead policy would be optimal.
_..1
268
Suboptimal and Adaptive Contro.
Chap. 6
Sec. 6.3
Limited Lookahcau Aides and Applications
269
The manncr in which the cost-to-go approximation is selccted dc-
pends very much on the problem being solved. There is a wide variety of
possibilities here. For example, in some games like chess, the approximate
cost-to-go in a certain position (state) involves a heuristic incorporation of
certain features of the position into a score for the position. In other prob-
lems, a cost-to-go approximation may be based on solution of a simpler
problem that is tractable computationally or analytically. We illustrate
this possibility with an example; see also the manufacturing problem in
Section 6.3.1.
tradc off accuracy of t.he cost-to-go approximat.ion with siw or lookahead;
that is, by looking additional stages ahead, we can ameliorate the effect of
errors in the cost-to-go approximation. The following subsections illustrate
some of thesc approaches.
6.3.1 Flexible Manufacturing
J (x, y) = . max [Pi (r d J (x - 1, y - 1)) + (1 - Pi) J (x, y - 1)] ,
t=l"..,rn
Flexible manufacturing systems (FMS) provide a popular approach
for increasing productivity in manufacturing small batchcs of related parts.
There are several workstations in an FMS, and each is capable of carrying
out a variety of operations. This allows the simultaneous manufacturing
of morc than one pa.rt type, reduces idle time, and allows production to
continuc even when a workstation is out of service because of failure or
maintenance.
Consider a work center in which n part types are produced. Denote
u: the amount of part i produced in period k.
d: a known demand for part i in period k.
xl: the cumulative difference of amount of part i produced and de-
manded up to period k.
Let us denote also by Xk, Uk, d k the n-dimensional vectors with coordinates
xL Uk' dL respectively. We then havc
Example 3.2
Consider the problem of the unscrupulous innkeeper who charges one of m
different rates rl,..., r m for a room as the day progresses, dependillg on
whether he has many or few vacancies, so as to maximize his expected total
income during the day (Exercise 1.24 in Chapter 1). A quote of a rate ri is
accepted with probability Pi and is rejected with probability 1 - p" in which
case the customer departs, never to return during that day. When the number
y of customers that will ask for a room during the rest of the day (including the
customer currently asking for a room) is known and the number of vacancies
is x, the optimal expected income J(x, y) of the innkeeper is given by the DP
recursion
for all x 2: 1 and y 2: 1, with initial conditions
J(x,O) = J(O, y) = 0,
for all x and y.
Xk+1 = Xk + uk - dk.
(3.1 )
On the other hand, when the innkeeper does not know y at the times of
decision, but instead only has a probability distriuution for y, it can be seen
that the problem becomes a difficult imperfect state information problem. Yet
a reasonable one-step lookahead policy is based on approximating the optimal
cost-to-go of subsequent decisions with J(x-1, y-1) or J(x, y-1), where the
function J is calculated by the above recursion and y is the closest integer to
the expected value of y. In particular, according to this one-step lookahead
policy, when the innkeeper has a number of vacancies x 2: 1, he quotes to the
current customer the rate that maximizes p, (r i + J (x - 1, Ti - 1) - J (x, Ti - 1)).
The work center consists of a number of workstations t.hat fail and get.
repaired in random fashion, thereby affecting the productive capacity of the
system (i.e., thc constraints on Uk)' Roughly, our problem is to schedule
part production so that Xk is kept around zero to the extent possible.
The productive capacity of the system depends on a random variable
ak that reflects the status of the workstations. In particular, we assume
that the production vector Uk must belong to a constraint set U(ad. Wc
model the evolution of CXk by a Markov chain with known transition prob-
abilities P(CXk+l I CXk). In practice, these probabilities must be cstimated
from individual station failure and repair rates, but we will not go into thc
matter further. Note also that in practicc these probabilities may depend
on Uk. This dependence is ignored for the purpose of developmcnt of a
cost-to-go approximation [ef. the following Eq. (3.7)]. It may be takcn into
account whcn the actual suboptimal cont.rol is computed [cf. the lilillillliza-
tion in the following Eq. (3.8)].
We select as system state the pair (Xk, CXk), where Xk evolves according
to Eq. (3.1) and CXk evolves according to the Markov chain described earlier.
The problem is to find for every state (Xk, ak) a production vector Uk E
Another approach, is to base a cost-to-go approximation on a subop-
timal policy such as a certainty equivalent or open-loop-feedback controller.
Generally, if a reasonably good suboptimal policy is known, it can be used
to evaluate a cost-to-go approximation. For example, one may use simu-
lation to evaluate the cost-to-go for some represelltative initial states and
then use interpolation for the remaining states. Finally, one may start with
a given cost-to-go approximation, obtain a corresponding policy by single
or multiple stage lookahead, use that policy to obtain an improved cost-
to-go approximation, and iterate. It is reasonable to assume that one can
_ J
270
Suboptimal and Adaptive Control
Chap. 6
Sec. 6.3
Limited Looka.hea ?olicics and Applications
271
U(ak) such that a cost function of the form
{ N-1 n }
J,,(xo) = E gi(xkJ
Furthermore, since !l(ak) C U(ak) C U(ak), the cost-to-go functions }
and l. provide lower and upper bounds to the true cost-to-go function Jk,
is minimized. The function gi expresses the desire to keep the current
backlog or surplus of part i near zero. Twoexamples are gi(xi) = .Bilxi[ or
gi(xi) = .Bi(X i )2, where .Bi > O.
The DP algorithm for this problem is
n n
LJ(xk,ak):::; A(xk,ak):::; Ll.(xk,ak),
i=1 i=1
(:3.7)
and can be used to construct approximations to Jk that are suitable for a
one-step lookahead policy. A simple possibility is to adopt the averaging
approximation
n
A(xk,ak)=Lgi(xk)+ min E {Jk+l(Xk+Uk-dk,ak+l) lad.
i=1 UkEU(o.k) o.k+1
(3.2)
If there is only one part type (n = 1), the optimal policy can be fairly easily
determined from this algorithm (see Exercise 6.7). Howevcr, in gencral,
thc algorithm rcquires a prohibitivc amount of calculation for an FMS of
realistic size (say for n > 10 part types). We thus considcr a one-step
lookahead policy with the cost-to-go Jk+l replaced by an approximation
J k + 1 that cxploits the nearly separable structure of our problem.
In particular, we note that the problem can to a large extent be
decomposed with respect to individual part types. Indeed, the system
cquation (3.1) and the cost per stage are decoupled, and the only coupling
between parts comes from the constraint Uk E U(ak)' Suppose we ap-
proximate U(ak) by inner and outer approximating hypercubes !l(ak) and
U(ak) of the form
n
Jk(Xk, ak) = L(J(x, ak) + l.k(xL ak))
i=1
and use at state (Xk, ak) the control Uk that minimizes [ef. Eq. (3.2)]
E {t(J+1(X + u - dLak+J) + l.+l(xk + ut - dL a k+1)) I ak}
(3.8)
over all Uk E U(ak). Multiple upper bound approximations, based 011
multiple choices of IIi (ak), can also be used.
ut
Inscribed Hypercube !:!(a,.) Circumscribed Hypercube U(cxk)
(3.3b)
(3.4)
!?2(a,.)
!11(a,.) Bl(cxk)
Production Constraint Set U(a,.)
u
!l(ak) = {ut 10:::; ut :::;fl.;(ak)},
U(ak) = {ut 10:::; ut :::;Bi(ak)},
!l(ak) C U(ak) C U(ak),
(3.3a)
as shown in Fig. 6.3.1. If U(ak) is replaced for each ak by eithcr U(ak)
or !l(ak), then the problem is decomposed completely with respect to part
types. For every part i the DP algorithm for the outer approximation is
given by
J(Xk' ak) = gi(xk)+ ml!.J. E {J(Xk +ul-d k , ak+1) I ak}, (3.5)
O::;u1::;B i (o.k) o.k+l
aud for the inner approximation it is given by
Figure 6.3.1 Inner and outer approximations of the production capacity con-
straint set by hypercubes.
l.l(xk,ak) =gi(X k ) + min E {l.l+1(Xk+uk-dt,ak+1) [ak}'
o::;u1::;lli(o.k) 0./';+1
(3.6)
To implement this scheme, it is necessary to compute and store the
approximate cost-to-go functions and l.l in tables, so that they can
be used in the real-time computation of the suboptimal control via the
_..1
272
Suboptimal and Adaptive Control
Chap. 6
Sec. 6.3
Limited Loolmhea, olicies and Applications
273
minimization of expression (3.8). Thc corresponding calculations [cf. the
DP algorithms (3.5) and (3.6)] are nontrivial, but they can bc carricd out
off-linc, and in any case they are much less than what would be required
to compute the optimal controller. Thc feasibility and the benefits of the
overall approach have been demonstratcd by simulation for FIVIS of realistic
size in [Kim82]. See also [KGD82] and [Tsi84a].
p
6.3.2 Computer Chess
Chess-playing computer programs are one of the morc visible suc-
cesses of artificial intelligencc. Their underlying methodology relics on DP
and provides an intercsting case study in the use of suboptimal control. It
involves the idea of limited lookahead, but also illustrates somc DP ideas
that we have not had much opportunity to look at in detail. These are the
idea of a forward search, an important mcmory-saving technique that was
discussed in Section 2.3 in thc context of label correcting methods, and the
idca of alpha-beta pruning, which is an effective method for reducing the
amount of calculation required to find optimal game stratcgies.
The fundamental paper on which all computer chess programs arc
based was written by one of the most illustrious modern-day applicd math-
ematicians, C. Shannon [Sha50]. It was argued by Shannon that whether
the starting chess position is a win, loss, or draw is a question that can
bc answered in principle, but the answer will probably never be known.
He estimated that, based on the chess rule requiring a pawn advance or
a capture every 50 moves (otherwise a draw is declared), there are on the
order of 10 120 differcnt possible sequences of moves in a chcss game. He
concludcd that to examine these and sclect the best initial move for White
would require 10 90 years of a "fast" computer's time (fast here relates to
the standards of the '50s). As an alternative, Shannon proposed a limited
look ahead of a few moves and evaluating the end positions by means of a
scoring function. The scoring function may involve, for example, the calcu-
lation of a numcrical value for each of a set of major features of a position
(:.;uch as material balance, mobility, pawn structure, and othcr positional
factors), together with a mcthod to combine these numerical valucs into a
singlc IJcore. The convcntion hcre is that White is favored in positions with
high score, while Black is favored in positions with low score.
Consider first a one-move lookahead strategy for selecting the first
move in a given position P. Lct NIt,. . . , M r be all the legal moves that
can be llladc in position P by thc side to move. Denote the resulting
positions by M1P,..., MrP, and let S(M1P)"", S(MrP) be thc corre-
sponding scores. Then the Illove selectcd by White (Black) in position P is
the move with maximum (minimum) score. This is known as the backed-up
score of P and is given by
M 1 P
%P
%P
M 4 P
Figure 6.3.2 A one-move lookahead tree. If White moves at position P, the best
move is Ml and the backed-up score is +5. If Black moves in position P, the best
move is M3, and the backed-up score of Pis -3.
This process is illustrated in Fig. 6.3.2.
Consider next a two-move lookahead strategy in a given position P.
Assume for concreteness that Whitc movC8, and let the legal moves be
M1,..., AIr and the corresponding positions be M 1 P,..., MrP. Then in
each of the positions MiP, i = 1,..., r, apply the one-move lookahead
strategy with Black to move. This gives a best movc and a backed-up
score BS(MiP) for Black in each of the positions M,P, i = 1,...,1'. Finally,
based on the backed-up scores BS(M 1 P),..., BS(MrP), apply a onc-move
lookahcad strategy for White, thereby obtaining the best move at position
P and a backcd-up score for position P of
BS(P) = max{ BS(M1P),... , BS(MrP)}.
BS(P) = { max{ S(M1 P )"" , S(MrP)}
min{ S(M 1 P),..., S(MrP)}
if White is to move in P,
if Black is to movc in P.
The sequcnce of best moves is known as the principal continuation. The
process is illustrated in Fig. 6.3.3. It is clear that Shannon's method as just
described can be generalized for an arbitrary number of lookahead moves
(sce Fig. 6.3.4).
Generally, to evaluate the best move at a given position P and the
corresponding backed-up scorc using lookahead of n moves, one can use the
following DP-like procedure:
1. Evaluate the scores of all possible positions that are n moves ahead
from thc givcn position P.
2. Using the scores of the terminal positions just evaluated, calculate
the backed-up scores of all possible positions that are n - 1 1ll0VPS
ahead from P.
3. For k = 1,..., n - 1, using the backed-up scores of all possible po-
sitions that are n - k moves ahead from P, calculate the backed-up
scores of all positions that are n - k - 1 moves ahead frolll P.
_ J
274
Suboptimal and Adaptive Control
Sec. 6.3
Limited LookaheaL olicies and Applications
275
Chap. 6
P (White to Move)
+10 -5 +3
+16 -1 +27
.4
o
+5
2 4 5 6
10 11
Figure 6.3.3 A two-move lookahead tree with White to move. The backed-up
scores are shown in parentheses. The best initial move is M J and the principal
continuation is (MJ, Mn).
22
20
15 16
Figure 6.3.5 Traversing a tree in depth-first fashion. The numbers indicate the
order in which the scores of the terminal positions and the backed-up scores of
the intermediate positions are evaluatE.'C!.
, ,
, ,
, ,
, ,
, ,
Q b
I \ I \
I \ I \
I \ I \
I \ I \
. , t .
P
+(. +20 + 18 + 16 +24 +20 + 10 + 12 -4 +8 +21 +11 -5 + 1 0 +32 +27 + 1 0 +9 +3
.
Figure 6.3.4 A four-move lookahead tree with White to move. The backed-up
scores are shown in parentheses. The best initial move is MI. The principal
continuation is heavily shaded.
J
, I II
, , I I
I \ I I
I I , I
66 66
The above procedure requircs a lot of memory storage evcn for a mod-
est number of lookal1ead moves. Shannon pointed out that with an alter-
native but mathematically equivalent calculation, the amount of memory
required grows only linearly with the depth of lookahead, thereby allowing
Figure 6.3.6 Storage requirements of the depth-first version of the mnnmax
algorithm for the tree of Fig. 6.3.5. At the time that the t.erminal position IlIarked
by a checkmark is scored, only the solid-!iue moves are stored in memory. The
dotted-line moves have been generated and purged from memory. The broken-line
moves have not been generated as yet. The memory required grows linearly with
the depth of the !ookahead.
_. J
276
Suboptimal and Adaptive Control
Chap. 6
Sec. 6.3
Limited Loolmhead _ Jlicies and Applications
277
TBS(n) := max{TBS(n), BS(ni)};
Thus the numbcr of tcrminal positions grows exponent.ially with t.he si"c of
lookahead, practically limiting n to no more than roughly 10 with prcsent
computers. Unfortunately, in some chess positions it is essential to look
a substantially larger number of moves ahead. In particular, in dynamic
positions involving many captures and countercapturcs, the necessary size
of lookahead can be very large.
Thcse considerations led Shannon to consider another strategy, called
type B, whereby the depth of the search tree is variable. He suggested that
at each position the computcr give all legal moves a preliminary exam-
ination and discard those that are "obviously bad." A scoring function
togcther with somc heuristic strategy can be used for this purpose. Sim-
ilarly, he suggested that some positions that arc dynamic, such as those
involving many captures or checkmate threats, be cxplorcd furthcr beyond
the nominal depth of the search.
Chess-playing computer programs typically use a variation of Shan-
non's type B strategy. These programs use scoring functions, the forms
of which have evolved by trial-and-error, but they also use sophisticated
heuristics to evaluate dynamic terminal positions in detail. In particular,
an effective algorithm, known as swupofJ, is used to quickly analyze long
sequences of captures and countercaptures, thereby making it possiblc to
score realistically complex, dynamic positions (see [Lev84] for a descrip-
tion). Onc may view such heuristics as cither defining a sophisticated
scoring function, or as implementing a type B strategy.
chcss programs to operate in limitcd-memory microprocessor systcms. This
is accomplished by searching the tree of moves in depth-first fashion, and
by generating new moves only when needed, as illustrated in Figs. 6.3.5
and 6.3.6. It is only necessary to store the one move sequence under cur-
rent examination together with one list of legal moves at each level of the
search tree. The precise algorithm is described by the following routine,
which calls itself recursively.
Minimax Algorithm
To determine the backed-up score BS(n) of position n, do the follow-
ing:
1. If n is a terminal position return its score. Otherwise:
2. Generate the list of legal moves at position n and let the cor-
responding positions be nl, " . , n r . Set the tcntative backed-up
score TBS(n) of position n to 00 if it is White's turn to move at
n and to -00 if it is Black's turn to move at n.
3. For i = 1,..., r, do;
a. Determine the backed-up score BS(ni) of position ni.
b. If it is White's move at position n, set
Alpha-Beta Pruning
if it is Black's move at position n, set
TBS(n):= min{TBS(n), BS(ni)}.
The efficiency of the mmlmax algorithm can be substantially im-
proved by using the alpha-beta pruning procedure (denoted 0'.-/3 for short),
which can be used to forego some calculations involving positions that can-
not affect the selection of the best move. To understand the 0'.-/3 procedure,
consider a chess player pondering the next move at position P. Suppose
that the playcr has already exhaustively analyzed one relatively good move
MI with corresponding score BS(M1P) and proceeds to examinc the next
movc M2. Suppose that as the opponcnt's replies are examined, a partic-
ularly strong response is found, which assures that the score of M 2 will be
worse than that of MI. Such a response, called a refutation of move M 2 ,
makes further consideration of move M 2 unnccessary (i.e., the portion of
the search trec that desccnds from movc M 2 can be discarded). An example
is shown in Fig. 6.3.7.
The 0'.-/3 procedure can be generalized to trees of arbitrary or irregular
depth and can be incorporated very simply into thc minimax algorithm.
Generally, if in the process of updating the backed-up score of a given
position (step 3b) this score crosses a certain bound, then no further cal-
culation is needed regarding that position. The cutoff bounds are adjusted
dynamically as follows:
4. Return BS(n) = TBS(n).
The idea of solving one-step lookahead problems with a terminal cost
(or backcd-up score) that summarizes future costs is of. curse centr1 in
the D? algorithm. Indeed, it can be seen that the ml1llmax algonthm
just described is nothing but the DP algorithm for minimax prob.lems .(see
Exercise 1.5 in Chapter 1). Here, positions and moves can be Identified
with states and controls, respectively, therc are only terminal costs (the
scores of the tcrminal positions), and the backed-up score of a position is
nothing but the optimal cost-to-go at the corresponding state.
The minimax algorithm is also known as the type A strategy. Shannon
argued that with this strategy, one could not expect a computer to seriously
challenge human players of even moderate strength. In a typical chess
position there are around 30 to 35 legal moves. It follows that for an n-move
lookahead there will be around 30 n to 35 n terminal positions to be scored.
_. .I
278
Suboptimal and Adaptive Control
Chap. 6
Sec. 6.3
Limited Lookaheac. Aicies and Applications
279
p
+5
+1
+20
+3
-2
-5
+3
+20
Figure 6.3.7 The a-{J procedure. White has evaluated move M I t.o have backed-
up score (+1), and starts evaluating move M2. The first reply of I31ack is a
refutation of M2, since it leads to a temporary score of -2, less than the backed-
up score of MI. Since the backed-up score of M2 will be -2 or less, M 2 will be
inferior to MI. Therefore, it is not necessary to evaluate move M 2 further.
+8 +20 +18 +16 +24 +20 +10+12 -4 +8 +21 +11 -5 +10+32 +27+10 +9 +3
1. The cutoff bound in position n, where Black has to move, is denoted
a and equals the highest current score of all ancestor positions of
n where White has to move. The exploration of position n can be
terminated as soon as its temporary backed-up score equals or falls
below a.
2. The cutoff bound in position n, where White has to move, is denoted
fJ and equals the lowest current value of all ancestor positions of n
where Black has the move. The exploration of position n can be
terminated as soon as its temporary backed-up scorc rises above fl.
The process is illustrated in Fig. 6.3.8. It can be shown that the
backed-up score and optimal move at the starting position are unaffected
by the incorporation of the a-fJ procedur'e in the minimax algorithm. We
leave the verification of this fact to the reader (Exercise 6.8). It can also
be seen that the a-fJ p7'Ocedure will be more effective if the best moves in
each position are explored first. This tends to keep the a bounds high and
the fJ bounds low, thus saving a maximum amount of calculation. Current
chess programs use sophisticated techniques for ordering moves so as to
maximize the effectivencss of the a-fJ procedure. We discuss briefly two of
these techniques: iterative deepening and the killer heuristic.
Iterative deepening, in its pure form, consists of first conducting a
search based on lookahead of one move; then carrying out (from scratch) a
scarch based on lookahead of t.wo moves; thcn carrying out a search bascd
on lookahead of three movcs and so OIl. This process is continued either
up to a fixed level of lookahead or until some limit on computation time
is exceeded. At each iteration associated with a certain level of lookahead,
one obtains a best move at the starting position, which is examined first in
Figure 6.3.8 The a-{J procedure applied to the tree of Fig. 6.3.4. For example,
the {J-cutoff in position PI is due to the fact that its temporary score (+20)
exceeds its current {J-bound (+16). The a-cntoffs in positions P2, P3, and P4 are
due to the fact that the corresponding temporary scores, +8, +11, and +11, have
fallen below the current a-bound, which is + 16, the current temporary scor" ill
position P.
the subsequent iteration that requires one extra move of lookahead. This
enhances the power of the a-fJ procedure, thereby more than making up for
the extra computation involved in doing a short lookahead search beforc
doing a longer one. (Actually, given that the number of terminal posit.ions
increases on the average by more than a factor of about 30 with each ad-
ditional level of lookahead, the extra computation is rclativcly small.) An
additional benefit of this method is that a best move is maintained through-
out the search and can be produced at any time as needed. This comes
in handy in commercial programs that incorporate a feature whereby the
computer is forced to move either upon exhausting a given time allocation
or upon command by a human opponent. An improvement of the method
is to obtain a thoroughly sorted list of moves at the starting position via
a one-move lookahead, and then use the improved ordering in subscquent
iterations to enhance the performance of the a-fJ procedure.
The killer heuristic is similar to iterative deepening in that it aims
at examining first the most powerful moves at each position, tlwwby en-
hancing the pruning power of the a-fJ procedure. To understand the idca,
suppose that in some position, White selects the first move JVh [rom a can-
didate list {Ml, M2, M3,.. .}, and upon examining Black's responses to Ml
finds that a particular move, which we will refer to as the killer move, is by
_ J
280
Suboptimal and Adaptive Control
Chap. 6
far Black's best. Then it is often true that the killer move is also Black's
best response to the second and subsequent moves lVh, M 3 , . .. in White's
list. It is therefore a good idea from the point of view of CY.-/3 pruning to
consider the killer move first as a potential response to thc remaining moves
M2, M3,. . . Of course, this does not always work as hoped, in which case it
is advisable to change the killer move depending on subsequent rcsults of
the computation. In fact, some programs maintain lists of more than one
killcr move at each level of lookahead.
The CY.-/3 procedure is safe in the sense that searching a game tree
with and without it will producc thc same result. Somc computer chess
programs use more drastic tree-pruning proccdures, which usually rcquire
lcss computation for a given level of looka.head, but may miss on occasion
the st.rongcst move. There is some debate at present regarding thc mcrits of
such procedures. References [Lev84] and [New75] consider this subject, and
provide a broader discussion of the limitations of computer chess programs.
6.4 APPROXIMATIONS IN DYNAMIC PROGRAMMING
Another general method to deal with the "curse of dimensionality" is
based on low-dimensional approximations of the DP algorithm. Thcre are
two general approaches here:
(a) Approximating the original problem by a computationally "simpler"
problem.
(b) Approximating the DP algorithm (for the original problem) by a com-
putationally "simpler" algorithm.
In this section, we will provide an informal discussion of these two ap-
proaches.
6.4.1 Discretization of Optimal Control Problems
A typical example of approximation arises in problems with contin-
uous state, control, or disturbance spaces, which are discretized in ordcr
to execute the DP algorithm. Here each of the continuous spaces of the
problem is replaced by a space with a finite number of elements, and the
system equation is appropriately modified. The approximate problem in-
volves a finite number of states, and the system equation is essentially
replaced by a set of transition probabilities between these states. Once the
discretization is done, the DP algorithm is executed to yield the optimal
cost-to-go functions and an optimal policy for the discrete approximating
problem. The optimal cost function and/or the optimal policy for the dis-
crete problem may then be extended to an approximate cost function or a
Sec. 6,4
Approximations iI. J 'namic Programming
281
suboptimal policy for the original continuous problem through some form
of interpolation.
A prerequisitc for success of this typc of discretization is consistency.
By this we mean that t.he opt.imal cost of t.he original problem should be
achieved in the limit as the discrctization bccomes finer and fincr. Consis-
tency is typically guaranteed if there is a "sufficient amount of continuity"
in the problem; for cxample, if the cost-to-go functions and the opt.imal
policy of the original problem arc continuous functions of the state. This
in turn can be guaranteed through appropriate continuity assumptions on
the original problem data (see the refcrcnces).
Continuity of the cost-to-go functions may be sufficicnt to guarantee
consistency, evcn if the optimal policy is discontinuous in the state. What
may happen hcre is that for somc statcs thcre may be a large discrepancy
between thc optimal policy of thc continuous problem and the optimal
policy of its discretized vcrsion, but this discrepancy may occur over a
portion of the state space that diminishes as the discretization becomes
finer. As an example consider the inventory control problem of Sect.ion 4.2
with nonzero fixed cost. We obtained an optimal policy of the (s, S) typc
. ( ) { Sk - Xk
J.Lk Xk = 0
ifxk<sk,
if Xk 2: Sk,
which is discontinuous at the lower threshold Sk. The optimal policy ob-
tained from the discretized problem may not approximate well the optimal
around the point of discontinuity Sk, but it is intuitively clear that the dis-
crepancy has a diminishing effect on the optimal cost as the discretization
becomes finer.
In the case of a continuous-time problem, the time must also be dis-
cretized, so that the original cont.inuous-time problem is rcplaced by a
discrete-time problem. The issue of consistency becomes now considerably
more complex, because the time discretization intcrval affects not only the
system cquation but also the control constraint set. Furthermore the con-
trol constraint set lIlay change considerably as we pass to the apropriatc
discrete-time approximation. As an example, consider the two-dimensional
system
Xl(t) = Ul(t),
X2(t) = U2(t),
with the control constraint lIu(t)1I = 1. It can be seen thcn that the state
at time t+.6.t can be anywhere within thc sphere centered at x(t) of radius
.6.t (note that the effect of zero control can be obtained in the continuous-
time system by "chattering" between two control vectors that have norm
equal to 1 alld arc opposite t.o each othcr). Thus, given .6.t, the appropriaLc
discrete-time approximation of the control constraint set should involve a
discretization of the entire sphere of radius Ilt and not just its surface. All
example that illust.rates some of the pitfalls associated with the discrctii:a-
tion process is given in Excrcise 6.10.
_. .1
282
Suboptimal and Adaptive Contro.
CIJap. 6
There is an interesting general method to address the discretization
i;;stle;; of cont.inuous-time opt.imal control. The main idea in this method
is to discretizc, in addition to time, thc state space using some kind of
grid, and then to approximate the cost-to-go of nongrid states by linear
intcrpolation of the cost-to-go valucs of the nearby grid states. By this we
mean that if a nongrid state x is expressed as
m
X = LEix i
i=l
in terms of the grid statcs xl, . . . , x m , where the positive weights E i add to
to 1, then the cost-to-go of x is approximated by
m
L iJ(xi),
i=l
whcrc J(x i ) is the cost-to-go of xi. When this is worked out, one ends up
with a stochastic optimal control problem having as states the finite num-
ber of grid states, and transition probabilities that are detcrmined from the
weights , above. If the original continuous-time optimal control problem
has fixed terminal time, the resulting stochastic control approximation has
finite horizon. If the terminal time of the original problem is free and sub-
ject to optimization, the stochastic control approximation is of the stochas-
tic shortest path type to be discussed in Section 7.2. We refer to the papers
[GoR85] and [FaI87], the survey [Kus90], and the monograph [KuD92] for a
descript.ion and analysis of the associated issues. An important special case
is the continuous-spacc shortest path problem (see Exercise 6.10). For this
problem, a finitcly tcrminating adaptation of the Dijkstra shortest path
algorithm has been developed [Tsi93] (sce also [PDT95]).
6.4.2 Cost-to-Go Approximation
Let us now consider the approach of approximating the DP algo-
rithm by a computationally simpler algorithm. Such an algorithm need
not ir.volve an explicit discretization of the state space. The main idea
is to calculate (possibly approximate) values for the cost-to-go (and/or
the optimal policy) at a finite set of state-time pairs, and then to make
a "least-squares fit" of these values with a function of a given type, such
as a polynomial function, a Fourier expansion, a wavelet expansion, or the
output of a neural network.
In particular, suppose that we have calculated the correct value of
the optimal cost-to-go I N _ 1 (X i ) at the next to last stage for m states
xl, . . . , x m through the D P formula
IN-l(X) = min E{g(x,u,w) + IN(J(X,U,w))},
uEU(x) w
(4.1)
Sec. 6.4
Approximations in . .amic Programming
283
and the given terminal cost function J N. (For simplicity in this and the
following formulas, wc drop the subscript.s of g, x, "IL, and w, ,).<;sllming it
stationary problem, but this is !lot esse!ltial for the met.hodology uescriueu.)
We can then approximate the entire function J N -1 (x) by a function of some
given form
I N - 1 (x,rN_l),
where rN-l is a vector of parameters, which can be obtained by solving
thc problem
m
min LIJN-l (Xi) - I N - l (xi, rW.
r
i=l
(4.2)
For example if x is an n-dimensional vector and J N -1 is spccified to be
linear, the vector r consists of an n-dimensional vector a and a scalar (3,
and
IN-l(X,r) =a ' x+{3.
The least squares problem (4.2) is
(4.3)
m
min LIJ N - 1 (X i ) - a'xi - (312
a, (3 i=l
(4.4)
and can be solved in closed form. Actually, a linear approximating function
is seldom adequate, but there are other possibilities that can be tailored to
the problem at hand.
Note that this approximation procedurc can be enhanced if we have
additional information on the true cost-to-go function IN-1(X). For ex-
ample, if we know that IN-l(X) 0 for all x, wc may first compute thc
approximation J N -1 (x, rN -1) by solving the lcast-squarcs problem (4.2)
and thcn rcplace this approximation by max{ 0, I N - 1 (X, TN-J)}. This idea
applies to all the approximation procedures to bc prescnted in this section
and later.
Once an approximating function I N - 1 (x,rN-d for the next to last
stage has becn obtained, it can be used to similarly obtain an approxi-
mating function I N - 2 (X, rN-2). In particular, (approximate) cost-to-go
function values I N _ 2 (X i ) are obtained for m statcs xl,... ,x m through the
(approximate) DP formula
I N - 2 (X) = rnin E{g(x,u,w) + I N - 1 (J(X,u,W),rN-l)}. (4.5)
uEU(x) w
These values are then used to approximate the cost-to-go function IN-2(X)
by a function of some given form
I N - 2 (X,rN-2),
_. J
284
Suboptimal and Adaptive Control
Chap. G
Sec. 6.4
Approximations in
.Jamie Programming
285
wherc 1'N -2 is a vector of parameters, which is obtaincd by solving thc
problem
Approximation Through Feature Extraction Mappings
m
min 2:=1 J N _2(Xi) - .J N -2 (:Z;i , r) 1 2 .
T
i=1
The process can be similarly continued to obtain ik(x, rk) up to k = 0 by
solving for each k thc problem
m
mjn 2:]J k (X i ) - J k (x i ,r)!2.
i=1
Given approximate cost-to-go functions Jo(x, ro),..., IN-dx, rN -1),
one may obtain a suboptimal policy by using at state-time pair (x, k) the
control
ilk(x)=arg min E{g(x,u,w) + J k +1(J(x,u,w),rk+1)}' (4.8)
uEU(x) w
This control must bc calculatcd on-linc once thc statc Xk at time k becomcs
known.
( 4.7)
Clearly, for thc success of thc approximation approach, it is vcry im-
portant to select a class of functions i(;z;,1') t.hat is suitable for the problem
at hand. One particularly interesting type of cost approximation is pro-
vided by feature extraction, a process that maps the state x into some other
vector y, called the feature vector associated with state x.
Feature vectors summarize, in a heuristic sensc, important charactcr-
istics of the state, and they are very useful in incorporating the dcsigner's
prior knowledge or intuition about the problem and about thc structure of
the optimal controller. For example, in a queueing system involving scveral
queues, a feature vector may involvc for each queuc a three-value indicator,
that specifies whether the queue is "nearly cmpty," "moderatcly busy," or
"nearly full." Also, feature extraction is closely related to the notion of
state aggregation, which will be discussed in Vol. II, Section 1.3.
Feature vectors arc normally associated with piecewise constant cost
approximation. In particular, when a set of features is selected to reprcsent
states, the cost function is approximated by a function that is constant
ovcr the subset of states that are mapped into the same fcaturc vector.
The parameter vector r in the function i(x, r) is the vector of the constant
levels, that is, the costs associated with all the different feature vectors.
In particular, in a system with m possible feature vectors, l' is an m-
dimensional vector (r1,...,r m )', and we have J(x,r) = 1'j for all statcs
x that are mapped into the jth feature vector. More gcnerally, one may
consider cost approximations of the form J(y,1'), where r is a parameter
vector and the cost approximation depends on the state i through its feature
vector y (see Fig. 6.4.1).
(4.6)
Approximating Policies
An alternative to calculating on-line suboptimal controls based on the
preceding cquation is to approximate opt.imal policies directly. In partiCll-
lar, suppose that the control space is a Euclidean space, and that we obtain,
for a finite number of states xi, i = 1, . . . , m, the minimizing controls
jJ.k(X i ) = arg min E{9(Xi, u, w) + J k + 1 (J(xi, U, w), rk+1)}'
uEU(x') W
We can then approximate the optimal policy Jlk(X) by a function of some
given form
iik(X, Sk),
where Sk is a vector of parameters obtained by solving the problem
m
aa ure Cost Approxi
State x Feature Extraction Vector y Cost Approximator wI J(y,r)
Mapping Parameter Vector r
F t
mation
m}n L IIMk(X i ) - Mk(X i , s)112. (4.9)
i=l
In the case of deterministic optimal control problems, we can take
advantage of the equivalence between open-loop and feedback control to
carry out the approximation process more efficiently. In particular, for
such problems we may select a representative finite subset of initial states,
and generate an optimal open-loop trajectory starting from each of these
states. (Gradient-based methods can often be used for this purpose.) Each
of these trajectories yields a sequence of pairs (Xk, Jk(Xk)) and a sequence
of pairs (:rk,JLk(:rk)), which can be used in the least-squares approximation
procedures discussed above. In particular, we can use the exact values
Jk(X i ) and /-Lk(Xi) obtained from the optimal open-loop trajectories ill place
of Jk(X i ) and Mk(X i ), respectively, in thc least-squares problems of Eqs.
(4.7) and (4.0).
Figure 6.4.1 Using a feature extraction mapping to generate an input to a cost
approximator.
A feature extraction mapping may also be useful in thc context of
problems of imperfect state information. In this context, a state cstimator
that provides some form of state reconstruction may be vicwed as a feature
extraction mapping. In fact, we may view a sufIicient stat.istic. a.. all "ex-
act" feature extraction mapping, in the sense that it is equivalent to the
full information vector for optimization purposes. This leads to the conjec-
turc that good feature extraction mappings arc provided by "approximate
sufficient statistics," that is, mappings that in a heuristic sensc arc close to
__J
286
Suboptimal and Adaptive Contra.
Chap. 6
being a sufficient statistic for the given problem.
G.4.3 Other Approximations
We meution briefly two additional approaches for using approxima-
tions. In the first approach, the optimal cost-to-go functions Jk(xd are
approximated with functions lk(xk,rk), where r = (rO,rl,...,rN) is a
vector of unknown parameters, which is chosen to minimize some form of
error in the DP equations; for example by solving the problem
mjn L _ ! lk(Xk, rk)
(xk,k)ES
- min E { 9(Xk,U, Wk) + lk+l (J(Xk, u, Wk), 1'k+l) } 1 2
uEU(xk) wk
( 4.10)
where S is a suitably chosen subset of "representative" state-time pairs.
The above minimization can be attempted using some type of gradient
method. Note that there is some difficulty in doing so because the cost
function of Eq. (4.10) may be nondifferentiable for some values of r. I-Iow-
ever, there are adaptations of gradient methods that work with nondiffer-
entiable cost functions, and for which we refer to the specialized literature.
One possibility is to replace the nondifferentiable term
min E { 9(Xk, u, Wk) + lk+l (J(Xk, u, Wk), r k + l ) }
uEU(Xk) Wk
by a smooth approximation (see [Ber82b], Ch. 3). Thc approach of approx-
imating cost-to-go functions by minimizing the error in the DP equations
will also be discusscd in more detail within an infinitc horizon context (see
Vol. II, Section 1.3).
The second approach is bascd Qn policy approximation. Thus in place
of an overall optimal policy, we look for an optimal policy within some
rcstricted class that is parameterized by somc vector s of relatively low
dimeusion. In particular, we consider policies of a given form
1rs = {ilo(xo,s),iLl(Xl,S),...,ilN-l(XN-l,S)},
where s is the unknown parameter vector, and P'k are given functions with
certain structure (e.g. linear in Xk). We then minimize over s the expected
cost
E{J".(xo)},
XQ
where the expectation is with respect to some suitable probability distribu-
tion of the initial state xo. The advantage of this approach is that it applies
Sec. 6.5
Notes, Sources, and
lfcises
287
even for complex problems where t.here is no explicit. model for t.he syst.('m
and the cost; instead the cost corresponding to a given policy may be cal-
culated by simulatiou. One difficulty is that the gradieut of the cost. with
respect. to s may not be easily calculated. However, optimizatiou methods
that require only cost values (and not. gradicnts) may be lIsed. Wf'. rcfer to
the specialized literature for description and analysis of such mcthous.
6.5 NOTES, SOURCES, AND EXERCISES
Excellent surveys of adaptive control are given in [Ast83] and [K um85],
which contain many other references. For textbook treatments, see [AsW90],
[GoS84], [Her89], [Ku V86], and [LjS83]. Self-tuning rcgulators received
wide att.ention following the paper by Astrom and Wittenmark [AsW73].
Open-loop feedback control was suggcsted by Dreyfus [DreG5]. Its
superiority over open-loop cont.rol was established by the author in the
context of minimax control [Ber72]. A generalization of this result is givcn
in [WhH80]. The POLFC was proposed in [Ber76].
For consistency analyses of various discrcti:lation and approxima-
tion procedures for discrete-time stochastic optimal control problems, sce
[Ber75], [Ber76a], [ChT89], [ChT91], [Fox71], [Whi78], [Whi79]. The dis-
cretization issues of continuous-time optimal control problems have been
the subject of considerable research; see the monograph [KuD92], thc sur-
vey [Kus90], and the sources cited there.
Whenever a suboptimal controller is used in practice it is desirable to
know how close the resulting cost approximatcs the optimal. Tight bounds
on the performance of the suboptimal controller arc useful but are usually
quite hard to obtain. For some interesting results in this direction, see
[Wit69] and [Wit70].
EXERCISES
6.1
Consider a problem with perfect state information involving the n-dimensional
linear system of Section 4.1:
Xk+l = Akxk + BkUk + Wk,
k = 0, 1,..., N - 1,
....1
288
Suboptimal and Adaptive Contn
Chap. 6
and a cost function of the form
{ N-I }
k=O'I.N-l 9N(C'XN) + 9k(Uk) ,
where cERn is a given vector. Show that the DP algorithm for this problem
can be carried out over a one-dimensional state space.
6.2 [Th W66]
Consider the following two-dimensional, two-stage linear system with scalar
control and disturbance
Xk+1 = Xk + bUk + dWk, k = 0,1,
where b = (1,0)' and d = (1/2, ../2/2)'. The initial state is Xo = O. The
controls Uo and Ul are unconstrained. The disturbances Wo and WI are in-
dependent random variables and each takes the values 1 and -1 with equal
probabilit.y 1/2. Perfect state information prevails. The cost is
E {lIx211},
WO,Wl
where 11.11 denotes the usual Euclidean norm. Show that t.he optimal open-loop
cost and the optimal closed-loop cost arc both ,;3/2, but the cost correspond-
ing to the CEC is 1.
6.3
Consider a two-stage problem with perfect state information involving the
scalar system
xo=l, XI=XO+UO+wo, x2=/(Xl,UI).
The control constraints are Uo, Ul E {O, -I}. The random variable Wo takes
the values 1 and -1 with equal probability 1/2. The function 1 is defined by
1(1,0) = 1(1,-1) = 1(-1,0) = 1(-1,-1) = 0.5,
1(2,0) = 0, 1(2, -1) = 2, 1(0, -1) = 0.6, 1(0,0) = 2.
The cost function is
E{X2}.
wo
(a) Show that one possible OLFC for this problem is
/i o( xo ) = -1, - ( ) - { o if ."!;l = ::1:1,2,
J1.l Xl - -1 if Xl =0,
and the rc.ulting cost is 0.5.
(b) Show that one possible CEC for this problem is
jio ( xo ) = 0, - ( . ) _ { 0 if XI = ::1:1,2,
ILl XI - -1 ifxl=O,
and the resulting cost is 0.3. Show also that this CEC is an optimal
feedback controller.
Sec. 6.5
Notes, Sources, and
..Jrcjses
28!J
6.4
Consider the system and cost function of Exercise 6.3 but. with the difTerence
t.hat
1(0, -1) = O.
fi
(a)
(b)
::,
;i
f
.;<
Show that the controller of part (a) of Exercise 6.3 is bot.h an OLFC
and a CEC, and that the corresponding cost is 0.5.
Assume that the control constraint set for the first stage is {O} rat.her
than {O, -I}. Show that. the controller of part. (b) of Exercise 6.3 is both
an OLFC and a CEC and that. the corresponding cost is O. Note: This
problem illustrates a pathology that occurs generically in suboptimal
control; that is, if the control constraint set is restricted, the perfor-
mance of a suboptimal scheme may be improved. To see t.his, consider
a problem and a suboptimal control scheme that is not optimal for t.he
problem. Let 71"* = {J1.o,... ,J1.N-d be an optimal policy. Rest.rict the
control constraint set so that only the optimal control ILk (Xk) is allowed
at state Xk. Then the cost attained by the suboptimal control scheme
will be improved.
6.5
Consider the ARMAX model
Yk+l + aYk = bUk + Ek+l + CEk,
where the parameters a, b, and C are unknown. The controller hypothesizes
a model of the form
Yk+l + aYk = Uk + Ck+l
and uses at each k the minimum variance/certainty equivalent control
Uk = UkYk,
where Uk is the least-squares estimate of a obtained as
"
,'1
k
_ . "' ( * ) 2
ak = argmm Yn + aYn-l - Un-l .
a
n=l
i',
'.
!;
Writ.e a computer program to test the hypothesis that. the sequence {<1d
converges to t.he optimal value, which is (c -- a)/b. Experiment. with values
lal < 1 and lal > 1.
1:
6.6 (Scmilincar Systems)
Consider the basic problem for semilinear systems (Exercise 1.13 in Chapter
1). Show that the CEC and the OLFC are optimal for this problem.
_ J
290
Suboptimal and Adaptive ContT<.
Chap. 6
6.7
Consider the production control problem of Scction 6.3.1 for thc case where
there is only one part type (n = 1), and assume that the cost per stage is a
convex function g with Iimlrl_4l g(x) = 00.
(a) Show that the cost-to-go function Jk(Xk,CXk) is convex as a function of
Xk for each value of CXk.
(b) Show that for cach k and CXk, there is a target value Xk+l such that
for each Xk it is optimal to choose the control Uk E Uk(CXk) that brings
Xk+l = Xk + Uk - d k as close as possible to Xk+l.
6.8
Provide a careful argument showing that searching a chess position wit.h and
without cx-{3 pruning will give the same result.
6.9
In a version of the game of Nim, two players start with a stack of fivc pcnnies
and t.ake t.urns removing one, two, or thrce pennies from the stack. The playcr
who removes the last penny loses. Construct the game tree and verify that
the second player to move can win with optimal play.
6.10 (Continuous Space Shortest Path Problems)
Considcr the two-dimensional system
Xl(t) = Ul(t),
X2(t) = U2(t),
with the control constraint Ilu(t)1I = 1. We want to find a state trajecto,ry that
starts at a given point x(O), ends at another given point x(T), and minimizes
T
1 r(x(t))dt.
The function r(') is nonnegative and continuous, and the final timc T is
subjcct to optimization. Suppose we discrctize the plane with a mesh of
size D. that passes through x(O) and x(T), and we introduce a shortest path "
problem of going from x(O) to x(T) using moves of the following type: from
each mesh point x = (Xl, X2) we can go to each of the mesh points (XI + D., X2),
(Xl - D., X2), (Xl, X2 + D.), and (Xl, X2 - D.), at a cost r(x)D.. Show by example
that this is a bad discretization of the original problem in the sense that the
shortest distance need not approach the optimal cost of the original problem
as D. -+ O.
7
Introduction to
Infinite Horizon Problems
Contents
7.1. An Overview . . . . . . . . . .
7,2. Stochastic Shortest Path Problems
7.3. Discounted Problems
7.4. Average Cost Problems
7.5. Notes, Sources, and Exercises
p. 292
p.295
p.306
p.310
p.324
291
_. J
292
Introduciion to Irlfjnjte IIorjzOIJ Problem.
Chap. 7
In this chapter, we provide an introduction to infinitc horizon prob-
lems. These problems differ from those considered so far in two respects:
(a) The number of stages is infinitc.
(b) The systcm is stationary, that is, the systcm equation, the cost. per
stage, and thc random disturbancc statistics do not changc from one
stage to the next.
The assumption of an inJinite numbcr of st.ages is never satisfied in practice,
but constitutes a reasonable approximation for problems involving a finitc
but very large number of stages. The assumption of stationarity is often
satisfied in practice, and in other cases it approximates well a situation
where the system parametcrs vary slowly with time.
Infinite horizon problems are interesting bccause their analysis is el-
egant and insightful, and the implementation of optimal policies is often
simple. For example, optimal policies are typically stationary; that is, the
optimal rule for choosing controls docs not change from one stage to the
next.
On the other hand, infinite horizon problems generally require more
sophisticated analysis than their finite horizon counterparts, because of
the need to analyze limiting behavior as the horizon tends to infinity. This
analysis is often nontrivial and at times reveals surprising possibilities.
Our treatment will be limited to finite-state problems. A far more detailed
development, together with applications from a variety of fields can be
found in Vol. II of this work.
7.1
AN OVERVIEW
There are four principal classes of infinite horizon problems. In the
first three classes, we try to minimize the total cost over an infinite number
of stages, givcn by
J,,(xo) = lim E
N -+ex:> Wk
k=O,l,...
{ N-l }
akg(xk' J.Lk(Xk), Wk) ,
where J,,(xo) dcnotes the cost associated with an initial state xo and a
policy 7r = {J.LO, J.L 1, .. .}, and a is a positive scalar with 0 < a ::; 1, called
the discount factor. Thc mcaning of a < 1 is that future costs matter to
us less than the same costs incurred at the present time. As an example,
think of kth period dollars depreciated to initial period dollars by a factor
of (1 +r)-k, where r is a rate of interest; here a = 1/(1 +r). An important
concern in total cost problems is that the limit in the definition of J,,(xo) be
Sec. 7.1
An Overvjew
293
finite. In the first two of the following classes of problems, this is guarant.eed
through various assumptions on the problem structure and the discount
factor. In the third class, the analysis is adjusted to deal with inJinite cost.
for some of the policies. In the fourth class, this sum need not. be finite for
any policy, and for this reason, the cost. is appropriately reclcJincd.
(a) Stochastic shortest path problems. Here, a = 1 but there is a special
cost-free termination state; once the system rcaches that. state it re-
mains there at no further cost. vVe will assume a problem struct.ure
such that termination is inevitable (this assumption will be relaxed
somewhat in Chapter 2 of Vol. II). Thus the horizon is in effcct finitc,
but its length is random and may be affected by the policy being
used. These problems will be considered in the next section and their
analysis will provide the foundation for the analysis of the other types
of problems considercd in this chapter.
(b) Discounted problems with bounded cost per stage. Here, a < 1 and
the absolute cost per stage Ig(x, u, w)1 is bounded from above by
some constant M; this makes the cost J,,(xo) well defincd because
it is the infinite sum of a sequence of numbers that arc bounded in
absolute value by the decreasing geometric progression {a k M}. We
will consider these problems in Section 7.3.
(c) Discounted and undiscounted problems with unbounded cost per stage.
Here the discount factor a mayor may not be less than 1, and t.he cost.
per stage may be unboundcd. These problems requirc a complicated
analysis because the possibility of infinite cost for some of the policies
is explicitly dcalt with. We will not consider these problems here; see
Chapter 3 of Vol. II.
(d) A vemge cost per stage problems. Minimization of the total cost J" (xo)
makes sense only if J,,(xo) is finite for at least some admissible policies
7r and some initial states Xo. Frequently, however, it turns out that
J,,(xo) = 00 for every policy 7r and initial state Xo (think of the case
where a = 1, and the cost for every state and control is positive).
It turns out that in many such problems thc avem!/e cost per sta!/e,
defined by
{ N-l }
lim N 1 E L 9(Xk' J.Lk(Xk), Wk) ,
N -CXJ Wk
k=O,I,... k=O
is well defined and finite. We will consider some of these problems in
Section 7.4.
A Preview of Infinite Horizon Results
There are several analytical and computational issues regarding infi-
nite horizon problems. Many of t.hese revolve around thc relation between
_J
294
Introduction to Infinite Horizon Problem,
Chap. 7
the optimal cost-to-go function J* of the infinite horizon problem and the
optimal cost-to-go functions of the corresponding N-stage problems. In
particular, consider the case Q = 1 and let J N (x) denote the optimal cost
of the problem involving N stages, initial state x, cost per stage g(x, u, w),
and zero terminal cost. The optimal N-stage cost is generated aft.er N
iterations of the DP algorithm
Jk+l(X) = min E {g(x, u, w) + J k (J(x, U, w))} ,
uEU(x) w
k = 0,1,... (1.1)
starting from the initial condition Jo(x) = 0 for all x (note here that we
have reversed the time indexing to suit our purposes). Since the infinite
horizon cost of a given policy is, by definition, the limit of the corresponding
N-stage costs as N --> 00, it is natural to speculate that:
(1) The optimal infinite horizon cost is the limit of the corresponding
N-stage optimal costs as N --> 00; that is,
J*(x) = lim IN(x)
N--+oo
(1.2)
for all states x. This relation is extremely valuable computationally
and analytically, and, fortunately, it typically holds. In particular, it
holds for the models of the next two sections [categories (a) and (b)
above]. However, there are some unusual exceptions for problems in
category (c) above, and this illustrates that infinite horizon problems
should be approached with some care.
(2) The following limiting form of the DP algorithm should hold for all
states x,
J*(x)= min E{9(x,u,w)+J*(J(x,u,w))}, (1.3)
uEU(x) w
as suggested by Eqs. (1.1) and (1.2). This is not really an algorithm,
but rather a system of equations (one equation per state), which has
as solution the costs-to-go of all the states. It can also be viewed as
a functional equation for' the cost-to-go function J*, and it is called
Bellman's equation. Fortunately again, an appropriate form of this
equation holds for every type of infinite horizon problem of interest.
(3) If J-L(x) attains the minimum in the right-hand side of Bellman's equa-
tion for each x, then the policy {J-L, J-L, . . .} should be optimal. This is
true for most infinite horizon problems of interest and in particular,
for all the models discussed in the subsequent sections.
Most of the analysis of infinite horizon problems revolves around the
above three issues and also around the issue of efficient computation of
J* and an optimal policy. In the next three sections we will provide a
discussion of these issues for some of the simpler infinite horizon problems,
all of which involve a finite state space.
Sec. 7.2
Stochastic SlJOrtet .Jth Problems
295
Total Cost Problcm Formulation
Throughout this chapter we assume a controlled discrete-time dy-
namic system whereby, at state i, the use of a control u specifies t.he t.ran-
sition probability Pij (u) to the next state j. Here t.he st.at.e i is an element.
of a finite state space, and the control u is constrained to take values in a
given finite constraint set U(i), which may depend on the current state i.
As discussed in Section 1.1, the underlying system equation is
Xk+l = Wk,
where Wk is the disturbance. We will generally suppress Wk from the cost to
simplify notation. Thus we will assume a kth stage cost g(Xk, Uk) for using
control Uk at st.ate Xk. This amounts to averaging the cost per stage over
all successor states in our calculations, which makes no essential difference
in the subsequent analysis. Thus, if fj( i, u, j) is the cost of using .u at state
i and moving to st.ate j, we use as cost per stage the expected cost g(i,u)
given by
g(i, u) = LPij(U)fj(i, u,j).
j
The total expected cost associated with an initial state i and a policy
7r = {J-Lo, J-Ll, . . .} is
{ N-l }
J 7T (i) = Jr;,E L QkY(Xk,J-Lk(xd) I Xo = i ,
k=O
where Q is a discount factor with 0 < Q ::; 1. In the following two sections,
we will impose assumptions that guarantee the existence of the above limit.
The optimal cost from state i, that is, the minimum of J 7T (i) over all ad-
missible 7r, is denoted by 1* (i). A stationary policy is an admissible policy
of the form 7r = {J-L, J-L, . . .}, and itli corresponding cost function is denoted
by J" (i). Fur brevity, we refer t.o {/l', fl" . . .} as t.he stationary policy /1.. We
say that J-L is optimal if J,"(i) = J*(i) for all states i.
1.2 STOCHASTIC SHORTEST PATH PROBLEMS
Here, we <1ssume that there is no discounting (0: = 1), and t.o ulake
the cost meaningfuJ, we assume that. there is a special cost-fr"cf termination
state t. Once the system reaches that state, it remains there at. no further
cost, that is, Ptt(u) = 1 and g(t, u) = 0 for all u E U(t). We denute by
1,. . . , n the states other than the termination state t.
_ J
296
lniroduciion io lnfiniic Horizon Problcms
Chap. 7
Scc. 7.2
SioclJastic ShortesL . ..lih Problems
297
We are interested in problems where reaching the termination state
is inevitable, at least under an optimal policy. Thus, the essence of the
problem is t.o reach t.he t.ermination st.ate wit.h minimum expect.ed cost.. 'vVe
call this problem the stochastic shortest path problem. The deterministic
shortest path problem is obtained as the special case where for each stat.e-
control pair (i, u), the transition probability Pij (u) is equal to 1 for a unique
statc j that depends on (i, u). The reader may also verify that the fi.nite
horizon problem of Chapter 1 can be obtaincd as a special case by viewing
as state the pair (Xk, k) (see also Section 3.6 of Vol. II).
Certain conditions are required to guarantee that, at least under an
optimal policy, termination occurs with probability 1. We will make the
following assumption that guarantees eventual termination under all poli-
cies:
Sincc p" depends only on the first m compollents of the policy 11", it follows
that thcrc is only a finite number of distinct valucs of p" so that
p<1.
We t.herefore have for any 11" and any initial st.ate i
P{X2m i= t I Xo = i, 11"} = P{X2m i= t I X m i= t, Xo = i, 11"}
X P{x m i= t 1."Lo = i,1I"}
p2.
Morc gencrally, for each admissible policy 7r, the probability of not rcaching
the termination state after km stages diminishes like pk rcgardless of the
initial statc, that is,
(2.3)
P7r = .max P{x m i= t I Xo = i,7r} < 1.
'l.=l,...,n
(2.1)
P{xkmi=tIXo=i,7r}pk, i=l,...,n. (2.4)
Thus the limit defining the associated total cost vector J" exists and is
finite, since the expected cost incurred in the m periods betwcen km and
(k + l)m - 1 is bounded in absolute value by
mpk max Ig(i, u)/.
t=l,...,n
uEU(i)
Assumption 2.1: There exists an integer m such that regardless of
the policy used and the initial state, there is positive probability that
the termination statc will be reached after no more that m stages; that
is, for all admissible policies 7r we have
In particular, we have
We note, howcver, that the results to be presented are valid under
more general circumstances. t Furthermore, it can be shown that if there
exists an integer m with the property of Assumption 2.1, then there also
exists all integer less or cqual to n with this property (Exercise 7.12). Thus,
we can always use m = n in Assumption 2.1, if no smaller value of m is
known. Let
00
IJ,,(i)! Lmpk!.nax [g(i,u)1 = max Ig(i,u)l. (2.5)
'-l.....n 1 - P ,=1 n
k=O uEU(i) UEU'('i)
The results of the following proposition are basic and are typical of
many infinite horizon problems. The key idea of t.he proof is that the "tail"
of the cost series,
00
p = maxp".
"
(2.2)
L E {9(Xk,J.!k(Xk))}
k=mK
vanishes as K increases to 00, since t.he probability that XmJ( i= t decreases
like pK [ef. Eq. (2.4)].
t Let us call a stationary policy 7r proper if the cOndition (2.1) is satisfied
for some m, and call 7r improper otherwise. It can be shown that Assumpt.ion
2.1 is equivalent to the seemingly weaker assumption that. all stationary policies
are proper (sce Vol. ll, Exercise 2.3). However, the rc:,;ults of Prop. 2.1 can also
be shown under the genuinely weaker assumption that there exists at least. one
proper policy, and furthermore, every improper policy results in infinite expected
cost from at least one initial state (see [BeT8!!], [BeT!!l], or Vol. II, Chapter 2).
These assumptions, when specialized to deterministic shortest path problems, are
similar to the ones we used in Chapter 2. They imply that there is at least one
path to the destination from every starting node and that all cycles have positive
cost.
Proposition 2.1 Undt";)r Assumption 2.1, the following hold for the
stochastic shortest path problem;
(a) Given any initial conditions Jo(l),..., Jo(n), the sequence ,h(i)
generated by the DP iteration
Jk+1(i) = min [ 9(i,U) + tPij(U)JkU) ] ,
uEU(,) .
;=1
i = 1,... ,n,
(2.6)
converges to the optimal cost. J*(i) for each i.
-
298
Introduciion to lnfinite Horizon Problems
Chap. 7
Sec. 7.2
Stochastic Shorte ath Problems
299
J*(i) = min [ 9(i,U) + tpij(U)J*(j) ] ,
uEU(0 j=l
i = 1,... ,n,
The expected cost during the Kt.h 111.-stage cycle [stage:; /\.111. to (1\.1 1 )/1/ 1]
is upper bounded by MpK [ef. Eqs. (2.4) and (2.5)], so that
I Joo E { tl g(Xk' fLk(Xk)) } I ::; M f pk = pi,!! .
k=mK k=K 1 P
(b) The optimal costs J*(I),..., J*(n) satisfy Dellman's equation,
(2.7)
Also, denoting Jo(t) = 0, let us view J o as a terminal cost function and
bound its expected value under 7r after mJ( stages. We have
and in fact they are the unique solution of this equation.
(c) For any stationary policy fL, the costs J,.(I),. .. , J,.(n) are the
unique solution of the equation
n
J,.(i) = g(i,fL(i)) + LPij(fL(i))J,.(j),
j=l
i=l,...,n.
IE{Jo(XmK)} I = IP(XmK = i I XO,7r)Jo(i)1
::; (tP(XmK = i I XO,7r)) iV,ax.JJo(i)1
::; pK . max IJo(i)l,
:::;l,...tn
Furthermore, given any initial conditions Jo(I),..., Jo(n), the
sequence Jk(i) generated by the DP iteration
n
Jk+l(i) = g(i,fL(i)) + LPij(fL(i))Jk(j),
j=l
i=l,...,n,
since the probability that XmK i= t is less or equal to pK for any policy.
Combining the preceding relations, we obtain
{ N-l }
J,,(xo) = lim E L 9(xk,fLk(Xk))
N->oo
k=O
{ mK-l }
= E g(Xk,fLk(Xk))
{ N-l }
+ lim E L y(xk,fLk(Xk)) .
N->oo
k=mK
_pK !.nax IJo(i)1 + J,,(xo)
tl,...,1L
{ mK-I } K
::; E JO(XmK) + L g(Xk,fLk(Xk)) + p M
k=O 1 - P
::; pK . max IJo(i)1 + J,,(xo).
t=l,...,n
(2.8)
Note that the expected value in the middle term of the above inequalities is
the mK-stage cost of policy 7r starting from state Xo, with a terminal cost
J o (.XmK ); the minimum of this cost over all 7r is equal to the value J mK (:1:0),
which IS generated by the DP recursion (2.6) after mK iterations. Thus,
by taking the minimum over 7r in Eq. (2.8), we obtain for all Xo and K,
converges to the cost J,.(i) for each i.
(d) A stationary policy fL is optimal if and only if for every state i,
fL( i) attains the minimum in Bellman's equation (2.7).
Proof: ( a) For every positive integer J{, initial st.ate XO, and policy 7r =
{fLo, fL I, . . .}, we break down the cost J" (xo) into the portions incurred over
the first mK stages and over the remaining stages
pKJvI
-pK!.nax IJo(i)1 + J*(xo) ::; JmK(:[O) + -
'-l....,n 1 - P
::; pK!.nax IJo(i)1 + J*(:/;o),
t-l,...,n
(2.9)
and by taking the limit. as J{ -t 00, we obtain limK->oo JmK(XO) = J*(.7:o)
for all Xo. Since
Let M denote the following upper bound on t.he cost of an m-stage cycle,
assuming termiriation does not occur during the cycle,
IJmK+q(XO) - JmK(xo)1 ::; pKM,
q = 1,... ,1/1,
M=mmax j!/(i,u)j.
=l,...,n
uEU(i)
we see that limJ(->ooJmK+q(xo) is the samc [or all q = 1,...,m, so that
we have limk->oo Jk(XO) = 1*(:[0).
_..1
300
Introduction to Infinite Horizon Problems
Chap. 7
Sec. 7.2
Stochastic SIJOrte, ath ]Jroblems
301
(b) By taking the limit as k -t 00 in the DP itcration (2.6) and using the
result of part (a), we see that 1*(1),..., J*(n) satisfy Bellman's equation.
'1'0 shuw uniqueness, observe that. if J(l),..., J(n) sat.isfy Bellman's eqll<L-
tion, then the DP iteration (2.6) starting from J(l),..., J(n) just repli-
cates .1(1),..., .1(n). It follows from the convergcnce rcsult of part (a) that
.1(i) = .1*(i) for all i.
(c) Given the stationary policy J.L, we can consider a modified stochastic
shortest path problem, which is the same as the original exccpt that the
control constraint set contains only one element for each state i, the con-
trol J.L(i); that is, the control constraint set is U(i) = {J.L(i)} instead of
U(i). From part (b) we then obtain that .11-'(1),..., .11,(n) solve uniquely
l3ellman's equation for this modified problem, that is,
In the special case where t.here is only one cont.rol at. each stat.e, .l'(i) repre-
sents the mean first passage time from i t.o t (see Appendix D). These times,
d(.not.ed m., are t.he uni'/lI<' solut.ion of the (''luations
mi = 1 + LPiJmJ'
J=l
-i = 1,... ,n.
Example 2.2
n
.11,(i) = g(i,ll(i)) + Lllij(ll(i)).1,,(j),
j=l
i = 1,... ,n,
A spider and a fly move along a straight line at times k = 0,1,... The initial
positions of the fly and the spider are integer. At each time period, the
fly moves one unit to the left with probability p, one unit to the right with
probability p, and stays where it is with probabilit.y 1- 2p. The spider, knows
the position of the fly at the beginning of each period, and will always move
one unit towards the fly if its distance from the fly is more that one unit. If the
spider is one unit away from the fly, it will either move one unit towards the
fly or stay where it is. If the spider and the fly land in the same position at the
end of a period, then the spider captures the fly and the process terminates,
The spider's objective is to capture the fly in minimum expected time.
We view as state the distance between spider and fly. Then the problem
can be formulated as a st.ochastic shortest path problem wit.h st.at.es 0, 1, . . . ,n,
where n is the initial distance. State 0 is the termination state where the
spider captures the fly. Let us denote Plj(M) and Plj( M) the t.ransition
probabilities from st.ate 1 to state j if the spider moves and docs not. move,
respectively, and let us denote by P,j the transition probabilities from a st.ate
i 2:: 2. We have
and from part (a) it follows that the corresponding DP iteration convcrges
to .11-'(i).
(d) We have that J.L(i) attains the minimum in Eq. (2.7) if and only if we
have
.1*(i) = min [ 9(i,U) + 'tPij(U).1*(j) ]
uEU(,) j=l
n
= g(i,J.1(i)) + LPij(lt(i))1*(j),
j=l
i = 1,..., n.
p.. = P, Pi(i-l) = 1 -- 2p, Pi(i-2) = P,
i 2:: 2,
Part (c) and the abovc cquation imply that .11-'(i) = .1*(i) for all i. Con-
versely, if .11-'(i) = 1* (i) for all i, parts (b) and (c) imply thc above equation.
Q.E.D.
Pll(M) = 2p, PIO(M) = 1 - 2p,
PI2( M ) = P, PI\( M ) = 1 - 2p, PIO( M ) = P,
with all other transition probabilities being O.
For states i 2:: 2, Bellman's equation is written as
Example 2.1 (Minimizing the Expected Time to Termination)
J*(i) = 1 + pJ*(i) + (1 - 2p)J*(i - 1) + pJ*(i - 2), i 2:: 2,
(2.10)
The case where
g(i,u) = 1,
i=I,...,n, UEU(i),
where J* (0) = 0 by definition. The only state where the spider has a choice
is when it is one unit away from the fly, and for that state Bellman's equat.ion
is given by
corresponds to a problem where t.he object.ive is to terminate as fast as possible
on the average, while the corresponding optimal cost J* (-i) is the minimum
expected t.ime to t.ermiuat.iol1 start.ing from state i. Under our ,..sumpt.ions,
the costs J* (i) uniquely solve Bellman's equation, which has the form
J*(I) = 1 +min[2pJ*(1), pJ*(2) + (1- 2p)J*(I)] ,
(2.11 )
J*(i) = min [ 1 + tpij(U)J*(j) ] ,
uEU(,)
j=l
where the first. and t.he second exprcssion within the bracket above are asso-
ciated with the spider moving and not moving, respectively. By writing Eq.
(2.10) for i. = 2, we obtain
i = 1,.., ,n.
.1*(2) = 1 + pJ*(2) -I- (1 - 2p).l*(I),
_ J
302
Introduction to Infinite Horizon Provlem6
Chap. 7
from which
P(2) = + (1 - 2p)F(1) .
1-p 1-p
Substituting this expression in Eq. (2.11), we obtain
(2.12)
P(l) = 1 + min [ 2 P P(1), + p(l - 2p)P(1) + (1 - 2 P )P(1) ] ,
1-p 1-p
or equivalently,
P(l) = 1 + min [2 P J*(1), 1 p + (1 -12*(1) ] .
To solve the above equation, we consider the two cases where the first
expression within the bracket is larger and is smaller than the second expres-
sion. Thus we solve for F(l) in the two cases where
J*(l) = 1 + 2pP(1),
2 P(l) < + (1 - 2p)F(1)
p -l-p 1-p'
(2.13a)
(2.13b)
and
P(l) = 1 + + (1- 2 p )J'(1) ,
1-p 1-p
2pJ*(1) + (1 - 2p)F(1) .
1-p 1-p
The solution of Eq. (2.13a) is seen to be J*(l) = 1/(1 - 2p), and by substi-
tution in Eq. (2.13b), we find that this solution is valid when
(2.14a)
(2.14b)
<+
1 - 2p - 1 - p 1 - p'
or equivalently (after some calculation), p 1/3. Thus for p 1/3, it is
opt.imal for the spider to move when it is one unit away from the fly.
Similarly, the solution of Eq. (2.14a) is seen to be F(l) = l/p, and by
sabstitution in Eq. (2.13b), we find that this solution is valid when
2>+ 1-2p
- 1 - p p(l - p)'
or equivalcntly (after some calculation), p 1/3. Thus, for p 1/3 it is
optimal for the spider not to move when it is one unit away from the fly.
The minimal expected number of steps for capture when the spider is
one unit away from the fly was calculated earlier to be
.r (1) = { 1/(1 - 2p) if p 1/3,
l/p if p 1/3.
Given the value of F(l), we can calculate from Eq. (2.12) the minimal ex-
pected number of steps for capture when two units away, 1*(2), and we can
then obtain the remaining values J*(i), i = 3,..., n, from Eq. (2.10).
Sec. 7.2
Stochastic Short( 'ath Problems
303
Value Iteration and Error Bounds
The DP it.eration
.h+l(i) = uJ(,) [ 9(i' u) + tP'j(U)Jk(j) ] , i = 1,..., n, (2.1G)
J=l
is called value iteration and is a principal method for calculating the optimal
cost function J'. Generally, value iteration requires an infinite number of
iterations, although there are important special cases where it terminates
finitely (see Vol. II, Section 2.2). Note that from Eq. (2.9) we obtain t.hat
the error
IJmK(i) - J'(i)1
is bounded by a constant multiple of pK.
The value iteration algorithm can sometimes be strengthened with
the use of some error bounds. In particular, it; can be shown (see Exercise
7.13) that for all k and j, we have
Jk+1(j) + (N.(j) -l)!2k J'(j) JJ1.k(j) Jk+1(j) + (Nk(j) -l)ck,
(2.16)
where ILk is such that p,k(i) attains the minimum in the kth value iteration
(2.15) for all i, and
N. (j): The average number of stages to reach t starting from j and
using some optimal stationary policy,
Nk(j): The average number of stages to reach t starting from j and
using the stationary policy p,k,
!2k =_min [Jk+1(i) - Jk(i)), Ck = . max [Jk+1(i) - Jdi)).
t-l,,,,,n t=l,...,n
Unfortunately, the values N.(j) and Nk(j) are easily computed or approx-
imated only in the presence of special problem structure (see for example
the next section). Despite this fact, the bounds (2.16) often provide a use-
ful guideline for stopping the value iteration algorithm while being assured
that Jk approximates J' with sufficient accuracy.
Policy Iteration
There is an alternative to value iteration, which always terminates
finitely. This algorithm is called policy iteration and operates as follows:
we start with a stationary policy ILo, and we generate a sequence of new
policies p,1, p,2, . . . Given the policy Ilk, we perform a Jlolicy evaluation stcp,
that computes J1,k (i), i = 1,. . . , n, as the solut.ion of t.he (linear) syst.em of
equations
n
J(i) = g(i,p,k(i)) + LPij(p,k(i))J(j),
j=l
i=I,...,n,
(2.17)
_.I
304
Introduction to infinite Horizon Problems
Chap. 7
Sec. 7.2
Stochastic Shorte" Path Problems
305
in the n unknowns J(I),..., J(n) [ef. Prop. 2.1(c)]. We then perform a
policy improvement step, which computes a new policy JLk+1 as
Jtk+l(i) = arg min [ 9(i,U) + tP,J(U)J,.k(j) ] ,
uEU(.) )=1
i = 1,..., n.
for all i, and by cont.inuing similarly we have
.fo(i) 2: .f 1 (i) 2:...2: IN(i) 2: .IN+I(i) 2: "',i = 1,. ..,n. (2.1!J)
Since by Prop. 2.1(c), IN(i) -> .f/-Lk+l(i), we obtain Jo(i) 2: J,"k+l(i) or
J/-Lk(i) 2: .f/-Lk+l(i), i = 1,...,11., k = 0,1,....
Thus the sequence of generated policies is improving, and since the numbcr
of stationary policies is finite, we must after a finitc number of iterations,
say k + 1, obtain J/-Lk(i) = J/-Lk+l(i) for all i. Then we will have equality
holding throughout in Eq. (2.19), which means that
J/-Lk(i) = uJ?i) [ 9(i'U)+tpi)(U)J/-Lk(j) ] . i=I,...,n.
J=l
Thus thc costs J/-Lk(I)...., J/-Lk(n) solve l3ellman's cquation, and by Prop.
2.1(b), it follows that J/-Lk(i) = J*(i) and that /lk is optimal. Q.E.D.
(2.18)
Thc process is rcpcatcd with JLk+ 1 used in placc of JLk, unlcss we havc
J,"k+l (i) = J/-Lk (i) for all i, in which case the algorithm terminates with
the policy /lk. The following proposition establishes thc validity of policy
iteration.
Proposition 2.2 Under Assumption 2.1, thc policy iteration algo-
rithm for the stochastic shortest path problem generates an improving
sequence of policies [that is, J/-Lk+l(i) :s; JJ.Lk(i) for all i and k] and
terminates with an optimal policy.
n
IN+1(i) = !/(i,/lk+1(i)) + L,lh) (/lk+1(i))JN(j),
)=1
i = 1,... ,n,
The linear system of equations (2.17) of the policy cvaluation step
can be solved by standard methods such as Gaussian elimination, but whcn
the number of states is large, this is cumbersome and time-consuming. A
typically more efficient altcrnative is to approximate the policy cvaluation
step with, few value iterations aimed at solving the corresponding system
(2.17). One can show that the policy iteration mcthod that uses such
approximate policy evaluation yields in the limit the optimal costs and an
optimal stationary policy, cven if we evaluate each policy using any positive
number of value iterations (see Vol. II, Section 1.3).
Proof: For any k, consider the sequence generated by the rccursion
wherc N = 0,1,..., and
Jo(i) = J1.k(i),
From Eqs. (2.17) and (2.18), we have
i = 1,... ,n.
Linear Programming
n
Suppose that we use value itcration to gencrate a scquencc of vec-
tors Jk = (Jk(l),..., Jk(n)) starting with an initial condition vector .fo =
(.10(1),..., Jo(n)) such that
Jo(i):::; min [ 9(i,U)+tPi)(U)Jo(j) ] i=l,...,n.
uEU(,) .
J=l
Then we will have Jk(i) :::; Jk+1(i) for all k and i (the monotonicity propcrty
of DP; Exercisc 1.23 in Chapter 1). It follows from Prop. 2.1(a) that we will
also have Jo(i) :::; J*(i) for all i. Thus J* is the "largest" J that satisfies
the constraint
n
Jo(i) = g(i,Jtk(i)) + L,Pij(JLk(i))Jo(j)
)=J
::::: g(i,JLk+ 1 (i)) + L,Pij(JL k + 1 (i))J o (j)
)=1
= .h(i),
for all i. By using thc above inequality we obtain (compare with the mono-
tonicity property of DP, Exercisc 1.23 in Chapter 1)
n
J 1 (i) = g(i" L k+1(i)) + I:>'j(/lk+1(i))J o (j)
)=1
n
n
J(i) :::; g(i, u) + LP'j(u)J(j), for all i = 1,..., nand U E U(i).
)=1
In particular, J*(I),...,J*(n) solve the linear program of maximizing
2:7=1 J(i) subject to the above constraint (see Fig. 7.2.1). Unfortunately,
for large n the dimension of this program can be very large and its solution
can bc impractical, particularly in the absence of special structurc.
::::: y(i,ILk'\-!(i)) + LPij(/lk+1(i))J1(j)
j=l
= h(i),
_. .1
306
Intmductioll to Infinite Horizon Problems
Chap. 7
Sec. 7.3
Discounted ProL s
307
Pij(U)
Pi/{u) ill:. : Pj/u) apjj(u)
Pji(u)
apij(u)
J(2)
J(2) = g(2.u')+ P ,(u')J(1)+p,iu')J(2)
o
J(1)
Figure 7.3.1 Transition probabilities for an a-discounted problem and its associ-
ated stochastic shortest path problem. In the latter problem, the probability that
the state is not t after k stages is a k . The expected cost per stage at each state
i = I,..., n is g(i, u) for both problems, but it must be multiplied by a k lwcause
of discounting (in the discounted case) or because it is incmrcd with probability
a k when termination has not yet been reached (in the stochastic shortest path
case).
J(1)= g(1.u')+ p.,(u')J(1) + pdu')J(2)
Figure 7.2.1 Linear program ussociated with a two-state stochastic shortest
path problem. The constraint set is shaded, and the objective to maximize is
J(I) + J(2).
Proposition 3.1 The following hold for the discounted cost problem:
(a) The value iteration algorithm
7.3 DISCOUNTED PROBLEMS
Jk+l(i) = min [ 9(i,u)+atPij(U)Jk(j) ] ,
uEU(,) j=1
i = 1,... ,17.,
We now consider a discounted problem, where there is a discount
factor a < 1. We will show that this problem can be convcrted to a
stochastic shortest path problcm for which the analysis of the preceding
section holds. To sce this, let i = 1,...,17. be the states, and consider an
associated stochastic shortest path problem involving the states 1,...,17.
plus an extra termination state t, with state transitions and costs obtained
as follows: From a state i # t, when control U is applied, a cost !/(i, u)
is incurred, and thc next state is j with probability ap,j (u) and t with
probability 1- a; see Fig. 7.3.1. Note that Assumption 2.1 of the preceding
section is satisfied for thc associated stoclwBtic shortest path problem.
Suppose now that we use the samc policy in the discounted prob-
lem and in the associated stochastic shortest path problem. Then, as long
as termination has not occurred, the state evolution in the two problems
is governed by the samc transition probabilities. Furthermore, the ex-
pected cost of the kth stage of the associated shortest path problcm is
!/(Xk,JLdxk)) multiplied by the probability that state t has not yet been
reacheu, which i:,; n k . Thb is also the expect.eu cost of the kth stage for the
discounted problem. We conclude that the cost of any policy starting from
a given state, is the same for the original discounted problem and for the
associated stochastic shortest path problem. We can thus apply t.he rcsults
of thc preceding section to the latter problem and obtain the following:
(3.1)
converges to the optimal costs J*(i), i = 1,...,17., starting from
arbitrary initial conditions Jo(l),..., Jo(n).
The optimal costs J*(l),..., J*(n) of the discounted problem
satisfy Bellman's equation,
(b)
J*(i) = min [ 9(i,U) + a'tpij(U)J*(j) ]
uEU(,) j=1
i = 1,... ,n,
(3.2)
(c)
and in fact they are the unique solution of this equation.
For any stationary policy J.L, the costs JI'(I),..., JI'(n) are the
unique solution of the equation
n
JJl.(i) = g(i,J.L(i)) + a LPij(J.L(i))J1,(j),
j=1
i=I,...,n.
Furthermore, given any initial conditions J o (I),..., Jo(n), thc
sequence Jk( i) generated by the DP iteration
_ J
308
Introduction to Infinite Horizon Provle1I16
ClJap. 7
Sec. 7.3
Discounted Prob.. .s
309
n
Jk+l(i) = g(i,p,(i)) + Q I>ij(p,(i))Jk(j),
j=l
i = 1,... ,n,
reach the termination state t is 1/( 1 - a), so that the terms N * U) - 1 and
Nk(j) - 1 appearing in Eq. (2.16) arc equal to a/(l - a). We note also
that there arc a number of additional enhancements to the value iteration
algorithm for the discounted problem (see Section 1.3 of Vol. 11). There arc
also discounted cost variants of the approximate policy iteration aud linear
programming approaches discussed for stochastic shortest path problems.
(d)
converges to the cost Jp.(i) for each i.
A stationary policy p, is optimal if and only if for every state i,
p,(i) attains the minimum in Bellman's equation (3.2).
The policy iteration algorithm given by
Example 3.1 (Asset Selling)
generates an improving sequence of policies and terminates with
an optimal policy.
Considcr an infinite horizon version of the asset sclling example of Section
4.4, assuming the set of possible offers is finite. Here, if acccptcd, t.hc aIllount
Xk offered in period k, will be invested at a rate of interest r. By depreciating
the sale amount to period 0 dollars, we view (1 + r)-k xk as the reward for
selling the asset in period k at a price Xk, whcre r > 0 is the rate of int.erest.
Then we have a total discounted rcward problem with discount factor a =
1/(1 + r). The analysis of the present section is applicable, and the optimal
value function J* is the unique solution of Bellman's equat.ion
(e)
p,k+l(i) = arg min [ 9(i' u) + Q tPij(U)JP.k(j) ] , i = 1,..., n,
uEU(.) j=l
Proof: Parts (a)-(d) and part (e) arc proved by applying parts (a)-(d) of
Prop. 2.1, and Prop. 2.2, respectively, to the associated stochastic shortest
path problem described above. Q.E.D.
* [ EU*(W)} ]
J (x) = max x, ,
l+r
[cf. Eq. (4.4) in Section 4.4]. The optimal reward function is charactcrizcd by
the critical number
Bellman's equation (3.2) has a familiar DP interpretation. At state
i, the optimal cost J*(i) is the minimum over all controls of the sum of
the expected current stage cost and the expected optimal cost of all future
stages. The former cost is g(i, u). The latter cost is J*(j), but since this
cost starts accumulating after one stage, it is discounted by multiplication
with a.
As in the case of stochastic shortest path problems [see Eq. (2.9) and
the discussion following the proof of Prop. 2.1], we can show that the error
_ EU*(w)}
a=
l+r '
which can be calculated as in Section 4.4. An optimal policy is to sell if and
only if the current offer Xk is great.er than or equal to a.
Example 3.2
IJk(i) - J*(i)1
A manufacturer at cach timc pcriod receivcs an ordcr for her product wit.h
probability p and receives no order with probability 1 - p. At any pcriod
she has a choice of processing all t.he unfilled orders in a batch, or process no
ordcr at all. The cost pcr unfilled order at cach time period is c > 0, and thc
setup cost to process the unfilled orders is K > O. The manufacturcr wants
to find a processing policy that minimizes the total expectcd cost, assuming
the discount factor is a < 1 and the maximum number of ordcrs that can
remain unfilled is n.
Here thc state is the number of unfillcd orders at thc beginning of cach
period, and Bellman's cquation takes thc form
is bounded by a constant times a k . Furthermore, the error bounds (2.16)
become
Jk+l(j) + _ 1 a rk:; J*(j):; Jp.k(j):; Jk+l(j) + _ 1 a Ck, (3.3)
-Q -Q
where p,k is such that p,k(i) attains the minimum in the kth value iteration
(3.1) for all i, and
rk = . mill [Jk+l(i) - Jk(i)],
t=l,.o.,n
Ck = . max [Jk+1(i) - Jk(i)],
1.=l,...,n
J"('i)=minlK+u(l-p)J'(O)+opJ"(l),ci-ln([---p)J"(i) t (>1'./'(/.11)],
(3.4)
for the states i = 0, 1, . . . , n - 1, and takes the form
since for the associated stochastic shortest path problem it can be shown
that for every policy and starting state, the expected number of stages to
J*(n) = K + a(1 - p)J*(O) + apJ*(l)
(3.5)
_. J
310
Introduction to Infinite Horizon Problem
Chap. 7
Scc. 7.'1
A vcrage Cost pCI .age Problcms
311
for state n. The first exprcssion within brackets in Eq. (3.4) corrcsponds
to processing the i unfilled orders, while the second expression corrcsponds
to leaving the orders unfilled for one more period. When the maximum n
of unfilled orders is reached, the orders must necessarily be processed, as
indicated by Eq. (3.5).
To solve the problem, we observe that the optimal cost J* (i) is mono-
tonically nondecreasing in i. This is intuitively clear, and can be rigorously
proved by using the value iteration method. In particular, we can show by us-
ing the (finite horizon) DP algorithm that the k-stagc optimal cost. functions
Jk(i) are monotonically non decreasing in i for all k (Exercise 7.7), and then
argue that the optimal infinite horizon cost function J* (i) is also monotoni-
cally nondecreasing in i, since J*(i) = limkoo Jk(i) by Prop. 3.1(0.). Given
that. J*(i) is monotonically nondecreasing in i, from Eq. (3.4) we have that
if processing a batch of m orders is optimal, that is,
To this end we note that the average cost. per st.age of" a policy pri-
marily expresses cost incurred in the long term. Costs incurred in t.he early
stages do not matter since their contribution to the average cost per stage
is reduced to zero as N -> 00; that is,
K + a(1 - p)J*(O) + apJ*(I) ::; em + a(1 - p)J*(m) + apJ*(m + 1),
oo E { t9(.Tk'J1k(Xk)) } =0,
k=O
for any fixed J(. Consider now a stationary policy II and two states i and
j such that the system will, under J1, eventually reach j with probability 1
starting from i. Then intuitively, it is clear that the average costs per stage
starting from i and from j cannot be different, since the costs incurred in
the process of reaching j from i do not contribute essentially to the average
cost per stage. More precisely, let J(ij(J1) be the first passage time from i
to j under J1, that is, the first index k for which Xk = j starting from Xo = i
under II (see Appendix D). Then the average cost per stage corresponding
to initial condition Xo = i can be expressed as
(4.1 )
then processing a batch of m + 1 orders is also optimal. Therefore a threshold
policy, that is, a policy that processes the orders if thcir number cxceeds some
threshold integer m *, is optimaL
We leave it as Exercise 7.8 for the reader to verify that. if we start
t.he policy iteration algorithm with a thrcshold policy, every subsequently
generated policy will be a threshold policy. Since thcrc are n + 1 distinct.
threshold policies, and the sequence of generated policies is improving, it
follows t.hat the policy iteration algorit.hm will yicld an optimal policy aft.er
at most n itcrations.
1 { Kij(I.)-l }
J,..(i) = N E 9(Xk,II(Xk))
+ E { t Y(;Ck'II(;Ck)) } '
k=Kij("')
If E{ J(ij(II)} < 00 (which is equivalent to assuming that the system even-
tually reaches j starting from i with probability 1; see Appendix D), then
it can be seen that the first limit is zero [ef. Eq. (4.1)], while the second
limit is equal to J,..(j). Therefore,
7.4 AVERAGE COST PER STAGE PROBLEMS
J,..(i) = J,..(j),
for all i,j with E{ I(ij(J1)} < 00.
The methodology of the last two sections applies mainly to problems
where the optimal total expected cost is finite either because of discounting
or because of a cost-free termination state that the system eventually en-
ters. 1'1 many situations, however, discollnting is inappropriate and there is
no natural cost-free termination state. In such situations it is often mean-
ingful to optimize the average cost per stage starting from a state i, which
is defined by
The preceding argument suggests that the optimal cost J* (i) should
also be independent of the initial state i under normal circumstances. To
see this, assume that for any two states i and j, there exists a stationary
policy J1 (dependent on i and j) such that j can be reached from i with
probability 1 under J1. Then it is impossible that
J*(j) < J*(i),
{ N--l }
J,,(i) = E L g(J:k,flk(Xk)) I ;TO = i .
k=O
since when starting from i we can adopt the policy J1 up to the Lime when
j is first reached and then switch to a policy that is optimal when st.arting
from j, thereby achieving an average cost start.ing from i t.hat. is equal t.o
J*(j). Indeed, it can be shown that
Let us first argue heuristically that. for most problems of inteTesl thc
average cost per stage of a policy and the optimal average cost per stage are
independent of the initial state.
J*(i) = J*(j),
for all i, j = 1,. . . , n,
under the preceding assumption (see Vol. II, Section 4.2).
_J
312
Tntrodllction to Tnfinite Horizon Problem;,
Cha.p. 7
The Associated Stochastic Shortest Path Problem
The results to be developed in this section can be proved under a
variety of different assumptions (see Chaptcr 4 in Vol. II). Here, we will
make the following assumption, which will allow us to use the stochastic
shortest path analysis of Section 7.2.
ASSUIuptiou 4.1: One of the statcs, by convention state n, is sllch
that for some integer m > 0, and for all initial states and all policies,
n is visited with positive probability at least once within the first m
stages.
Assumption 4.1 can be shown to be equivalent to the assumption that
the special state n is recurrent in the Markov chain corresponding to each
stationary policy (see Appendix D).
Under Assumption 4.1 we will make an important connection of the
average cost problem with an associated stochastic shortest path problem.
To motivate this connection, consider a sequence of generated states, and
divide it into cycles marked by successive visits to the state n. The first
cycle includes the transitions from the initial state to the first visit to
state n, and the kth cycle, k = 2,3, . . ., includes the transitions from the
(k - l)th to the kth visit to state n. Each of the cycles can be viewed as a
state trajectory of a corresponding stochastic shortest path problem with
the termination state being essentially n. More precisely, this problem is
obtained by leaving unchanged all transition probabilities Pij (u) for j i-
n, by setting all transition probabilities PinCU) to 0, and by introducing
an artificial termination state t to which we movc from each state i with
probability Pin(U); see Fig. 7.4.1. Note that Assumption 4.1 is equivalent
to the Assumption 2.1 of Section 7.2 under which the results of Section 7.2
on stochastic shortest path problems were shown.
We have not yet fixed the cost structure of the stochastic shortest
path problem. We will next argue that if we fix the expected stage cost
incurrcd at state i to be
g(i,U)-A*,
where A * is the optimal average cost per stage starting from the special state
n, then the associated stochastic shortest path problem becomes essentially
equivalent to the original average cost per stage problem. Furthermore,
Bellman's equation for the associated stochastic shortest path problem can
be viewed as Dellman's equation for the original average cost per stage
problcm.
For a heuristic argument of why this is so, note that under all station-
ary policies there will be an infinite number of cycles marked by successive
visits to state n. From this, it can be conjectured (and it can also be shown)
Sec. 7.'1
A verage Cost per .ge Problems
313
Pii(U)
Pjj(U)
(
Artificial Termination State
Figure 7.4.1 Transition probabilities for an average cost problem and its asso-
ciated stochastic shortest path problem. The latter problem is obtaincd by intro-
ducing, in addition to 1, . . . ,n, an artificial termination state t to which we move
from each state i with probability Pin(U), by setting all transition probabilities
Pin(U) to 0, and by leaving unchanged all other transition probabilities.
that the average cost problem amounts to finding a stationary policy J1 that
minimizes the average cycle cost
AI> = G nn (J1)
N nn (J1) ,
( 4.2)
where for a fixed Jt,
G nn (JL) : expected cost starting from n up to the first return to n,
N nn (J1) : expected number of stages to return to n starting from n.
The optimal average cost per stage A * starting from n, sat.isfies A,. :::: A *
for all J1, so from Eq. (4.2) we obtain
Gnn(fL) - Nnn(JL)A* :::: 0,
(4.3)
with equality holding if J1 is optimal. Thus, to attain such an optimal JL,
we must minimize over J1 the expression G nn (J1) - Nnn(lt)A*, which is the
expected cost of It starting from n in the associated stochastic shortest
path problem with state costs
g(i,u) - A*,
i = 1,... ,n.
(4.4)
Let us denote by h * (i) the optimal cost of this stochastic shortest path
problem when starting at the nontermination states i = 1, . . . , n. Then by
Prop. 2.1(b), h*(I),..., h*(n) solve uniquely the corresponding Bellman's
equation, which has the form
[ n-l ]
h*(i) = min g(i,u) - A* + LPij(u)h*(j) ,
uEU(,) j=l
i=l,...,n, (4.5)
_. J
314
Introduction to Infinite Horizon Problem.
Chap. 7
Sec. 7.3
A verage Cost per .ge Problems
315
since in the stochastic shortest path problem, the transition probability
from i to j cI n is Pij (u) and the transition probability from i to n is
zero undcr all u. If It* is an optimal statiollary policy for t.he average cost
problem, then this policy must satisfy, in view of Eq. (4.2),
C nn (J1.*) - NTm(J1.*).. = 0,
(b) If a scalar).. and a vector 11. = {h(I), . . . ,11.( n)} satisfy Bellman's
equation, t.hen ).. is the average optimal cost per stage for e,Lch
initial state.
(c) Given a stationary policy J1. with corresponding average cost per
stage )../.', there is a unique vector 11.1' = {h , .(I),..., hl,(n)} such
that h/.'(n) = 0 and
and from Eq. (4.3), this policy must also be optimal for the associated
stochastic shortest path problem. Thus, we must have
h*(n) = C nn (J1.*) - N nn (J1.*)..* = O.
(4.6)
n
)..J1. + h/.'(i) = g(i,J1.(i)) + LPij(J1.(i))h/.'U),
j=l
i = 1,... ,n.
Dy using this equation, we can now write Dellman's equation (4.5) as
(4.0)
)"*+h*(i) = min [ 9(i'U)+"tPij(U)h*(j) ] , i=l,...,n. (4.7)
uEU(,) j=l
Proof: (a) Let us denote
Equatioll (4.7), which is really Dellman's equation for t.he associated
stochastic shortest path problem, will be viewed as Bellman's equation
for the average cost per stage problem. The preceding argument indicates
that this equation has a unique solution as long as we impose the constraint
h*(n) = O. Furthermore, by minimization of its right-hand side we should
obtain an optimal stationary policy. We will now prove these facts formally.
"\ . C nn (J1.)
II = mln - N ( ) '
J1. nn J1.
(4.10 )
where C nn (J1.) and N nn (J1.) have been defined earlier, and the minimum is
taken over the finite set of all stationary policies. Note that Cnn(lt) and
N nn (J1.) are finite in view of Assumption 4.1 and the results of Sectioll 7.2.
Then we have
C nn (J1.) - N nn (J1.))" :::: 0,
(4.11)
The followillg proposition provides the main results regarding l3ell-
man's equation.
with equality holding for all J1. that attain the minimum in Eq. (4.10). Con-
sider the associated stochastic shortest path problem when the expected
stage cost incurred at state i is
Bellman's Equation
g(i, u) - )".
( 4.12)
Proposition 4.1 Under Assumption 4.1 the following hold for the
average cost per stage problem:
(a) The optimal average cost )..* is the same for all initial states
and together with some vector 11.* = {h*(1),..., h*(n)} satisfies
Bellman's equation
Then by Prop. 2.1(b), the costs h*(l),...,h*(n) solve uniquely the corre-
sponding Bellman's equation
)..* + 11.* (i) = min [ 9(i,U) + "tPij(U)h*(j) ] ,
uEU(D j=l
i = 1,... ,n.
11.*( i) = min [ 9(i' u) - )" + I: Pij (u)h* (j) ] ,
uEU(i) j=l
since the transition probability from i to n is zero in the associated stochas-
tic shortest path problem. An optimal stationary policy must minimize the
cost C nn (J1.) - N nn (J1.))" and reduce it to zero [in view of Eq. (4.10)], so we
see that
(4.13)
( 4.8)
Furthermore, if J1.(i) attains the minimum in the above equation
for all i, the stationary policy fL is optimal. In addition, out of
all vectors 11.* satisfying this equation, there is a unique vector
for which h*(n) = O.
h*(n) = o.
(4.14)
Thus, Eq. (4.13) is written as
)" + h*(i) = min [ 9(i,U) + "tPij(U)h*U) ]
uEU(,) j=l
i = 1,... ,n.
( 4.15)
_. J
316
Introduction to Infinite Horizon Problcm6
Chap. 7
Sec. 7.3
A verage Cost per S ) Problems
317
We will show that this relation implies that = ). *.
Indeed, let 7r = {J1.o, J1.1, . ..} be any admissible policy, let N be a
positive integer, and for all k = 0, . . . , N -1, define J k( i) using the following
recursion
Jo(i) = h*(i),
i = 1,... ,n,
average cost per stage of 7r starting at state i.. The r(,11"on is that. from
the definition (4.1G), IN(i) is the N-stage cost of 7r starting at 'i, when the
terminal cost function is h*; when we take the limit of (ljN)JN(i), the
dependence on the terminal cost fnnct.ioll h* disappears. TilliS, by t.akill
the limit as N ---+ 00 in Eq. (4.17), we obtain
n
Jk+1(i) = g(i,JLN-k-l(i)) + LPij(JLN-k-l(i))Jdj), i = 1,... ,n.
j=l
( 4.16)
Note that J N (i) is the N -stage cost of 7r when the starting state is i and
the terminal cost function is h*. From Eq. (4.15), we have
). .J-rr(i),
i = 1,... ,11.,
for all admissible 7r, with equality if 7r is a stationary policy IL snch t.hat
J1.(i) attains the minimum in Eq. (4.15) for all i and k. It follows that
= min J-rr(i) =).*,
-rr
i = 1,... ,n,
n
+ h*(i) g(i,J1.N-l(i)) + Lp,j(J1.N-l(i))h*(j), i = 1,..., n,
j=1
+ Jo(i) :::; J 1 (i),
i = 1,... ,no
and from Eq. (4.15), we obtain the desired Eq. (4.8).
Finally, Eqs. (4.14) and (4.15) are equivalent to Bellman's equation
(4.13) for the associated stochastic shortest path problem. Since the sol u-
tion to the latter equation is unique, the same is tme for the SOllltioll of
Eqs. (4.14) and (4.15).
(b) The proof of this part is obtained by using the argument of the proof
of part (a) following Eq. (4.15).
(c) The proof of this part is obtained by specializing part (a) to the case
where the constraint set at each state i is U(i) = {J1.(i)}. Q.E.D.
or equivalently, using Eq. (4.16) for k = 0 and the definition of Jo,
Using this relation, we have
n
g(i,J1.N-2(i)) + + LPij(J1.N-2(i))Jo(j)
j=1
n
An examination of the preceding proof shows that the unique vector
h* in Dellman's equation (4.8) for which h* (n) = 0 is the optimal cost vect.or
for the associated stochastic shortest path problem when the expected stage
cost at state i is
9(i,J1.N-2(i)) + LPij(J1.N-2(i))Jl(j), i = 1,... ,n.
j=l
By Eq. (4.15) and the definition Jo(j) = h*(j), the left-hand side of the
above inequality is no less than 2 + h*(i), while by Eq. (4.16), the right-
hand side is equal to h(i). We thus obtain
- 1 1
). + N N JN(i),
i = 1,... ,n.
( 4.17)
g(i,u) - ).*,
[ef. Eq. (4.13)]. Consequently, h*(i) has the interpretation of a relative 01'
diffC'rential cost; it is the minimum of the difference between the expected
cost to reach n from i for the first time and the cost that would be incurred
if the cost per stage was the average). *. We note that the relation be-
tween optimal policies of the stochastic shortest path and the average cost
problems is clarified in Exercise 7.15.
We finally mention that Prop. 4.1 can be shown under considerably
weaker conditions (see Section 4.2 of Vol. II). In particular, Prop. 4.1 can
be shown assuming that all stationary policies have a single recurrent class,
even if their corresponding recurrent classes do not have state n in common.
The proof, however, requires a more sophisticated use of the connection
with all associated stochastic shortest path problem. Proposition 4.1 can
also be shown assuming that for every pair of states i and j, there exists
a stationary policy under which there is positive probability of reaching
j starting from i. In this case, however, an associated stochastic shortest
path problem cannot be defined and the corresponding connection with the
2 + h*(i) :::; h(i),
i = 1,..., n.
Dy repeating this argument several times, we obtain
k + h*(i) :::; Jk(i),
k = 0,..., N, i = 1,..., n,
and in particular, for k = N,
Furthermore, equality holds in the above relation if ILk( i) attains the min-
imum in Eq. (4.15) for all i and k.
Let us now take the limit as N ---+ 00 in Eq. (4.17). The left-hand
side tends to. We claim that the right-hand side tends to J-rr(i), the
_.J
318
Introduction to Infinite Horizon Problcn.
Chap. 7
Scc. 7.'1
A verage Cost pcr. .ge ]Jrovlems
:nu
average cost per stage problem cannot be made. The analysis of Chapter 4
of Vol. II relies on another connection that exists between the average cost
per stage problem and the discount.ed cost problem, but. t.o establish t.his
connection and to fully explore its ramifications, a much more sophisticated
analysis is required.
To show this, let us define the recursion
J k + 1 (i) = nlin [ 9(i' u) + 't)hJ (IL)J k (J) ]
uEU( j=1
'i = 1,. . . , II,
(-l.(])
Example 4.1
with the initial condition
Consider t.he average cost vcrsion of the manufacturcr's problem of Example
3.2. Here, state 0 plays the role of the special state n in Assumption 4.L
Bellman's equation takes the form
JO'(i) = h*(i),
i = 1,...,"It,
(4.21)
where h* is a differential cost vector satisfying Bellman's equation
).* + h*(i) = min[ K + (1 - p)h*(O) + ph*(I), ci + (1 - p)h*(i) + ph*(i + 1)],
(4.18)
for the states i = 0, 1,..., n - 1, and takes the form
).* + h*(n) = K + (1 - p)h*(O) + ph*(I)
)..* + h*(i) = min [ 9(i' u) + tp'j(U)h*(j) ] ,
uEU(,) j=1
i = 1, .. . , n. (4.22)
for st.ate n. The first expression within brackets in Eq. (4.18) corresponds
to processing thc i unfilled orders, while the second exprcssion corrcsponds
to leaving the orders unfilled for one more period. The optimal policy is to
process i unfilled orders if
Using this equation, it can be shown by induction that for all k we have
Jk(i) = k)"* + h*(i),
i = 1,.., ,n.
(4.23)
On the other hand, it can be seen that for all k,
K + (1 - p)h*(U) + ph*(I) ::; ci + (1 - p)h*(i) + ph*(i + 1).
IJk(i) - .J k (i) I ::; .max j.Jo(j) - h*(j)l,
J=l,...,n
i = 1,... ,n.
(4.24 )
If we view h*(i), i = 1,..., n, as different.ial costs associated with an opt.imal
policy, it is int.uitively clear that h* (i) is monotonically nondccrcru;ing with
i (t.his can also be proved with an analysis bascd on the theory presented in
Vol. 11, Section 4.2). As in Example 3.2, the monotonicity property of h*(i)
implies that a threshold policy is optimal.
The reason is t.hat .Jk(i) and Jk(i) are optimal cost.s for two k-stage prob-
lems that differ only in the corresponding terminal cost functions, which
arc .10 and h*, respectively. From the preceding two equations, we see that
for all k,
Value Iteration
l.Jk(i) - k)..*j SE1ax l.Jo(j) - h*(j)[ +nax Ih*(j)[,
J-l''''Jn J-l,...,n
i=I,...,n.
.Jk+l(i) = min [ 9(i' u) + tP'j(U).J dj ) ]
uEU(,) j=l
i=l,...,n.
(4.19)
(4.25)
so that .Jk(i)/k converges to )..* at the rate of a constant divided by k.
Note that the above proof shows that .Jk(i)/k converges to )..* under any
conditions that guarantee that Bellman's equation (4.22) holds for some
vector h*.
The value iteration method just described is simple and stT<lightfor-
ward, but has two drawbacks. First, since typically some of the compo-
nents of J k diverge to 00 or -00, direct calculation of limkoo Jdi)/k is
numerically cumbersome. Second, this method will not provide us with a
corresponding differential cost vector h*. We can bypass both difticultics
by subt,racting the same constant from iLlI COllll)()lJCnt.s of t.lw vcctor .h,
so that the difference, call it hk, remains bounded. In particular, we can
consider the algorithm
The most natural version of the value iteration method for the average
cost problem is simply to select arbitrarily a terminal cost function, say J o ,
and to generate successively the corresponding optimal k-stage costs Jk(i),
k = 1,2,... This can be done by executing the DP algorithm starting with
Jo, that is, by using the recursion
It is natural to expect that t.he ratios Jdi)/k should converge to t.he optimal
average cost per stage )..* as k -+ 00, that is,
Jk ( i )
lim - = )..*.
koo k
hk(i) = Jk(i) - Jd,),
i = 1,... ,n.
(4.2G)
_. J
320
IJJiroduciion to InllJJiie Horizon ProblclT.
Chap. 7
Scc. 7.'1
Average Cosi per
C Prublc11I8
321
whcre s is some fixcd state. Dy using Eq. (4.10), we thcn obt.ain
of the relativc valuc it.erat.ion for which convergence is guarant.eed unt!l'r
Assumption 4.1. This variant is given by
h k + 1 (i) = Jk+l(i) - Jk+1(S)
= 1nin [ !/(i,U) + f:JJ,j (U)Jk (j) ]
uEU(,) .
J=l
- min [ !/(S,U) + tP.j(U)Jk(j) ]
uEU(.) .
J=l
hk+l (i) = (1 - T)hk( i) + min [ 9(i' u) + T t Pij (U)hk(j) ]
uEU(,) j=l
- min [ g(S'U)+TtpSj(U)hk(j) ] , i=l,...,n,
uEU(.) j=l
(4.28)
from which in view of the relation hk(j) = Jdj) - Jk(S), we have
where T is a scalar such that 0 < T < 1. Notc that for T = 1, we obtain
the relative value iteration (4.27). The convergence proof of t.his algorithm
is is somewhat complicated. It can be found in Section 4.3 of Vol. II.
hk+l(i) = min [ 9(i,1L) + tpij(U)hk(j) ]
nEUe,) .
J=l
- min [ g(S,U) + tp'j(U)h dj ) ]
uEU(.) .
J=l
Policy Iteration
( 4.27)
It is possible to use a policy iteration algorit.hm for the avcrage cost
problem. This algorithm operates similar to thc policy iteration algorithms
of the preceding sections: given a stationary policy, we obtain an improved
policy by means of a minimization process until no further improvement
is possible. In particular, at the typical step of the algorithm, wc have a
stationary policy J.Lk. We then perform a policy evaluation step; that is, we
obtain corresponding average and differential costs )..k and h k (i) sat.isfying
i = 1,... ,no
The above algorithm, known as relative value iteration, is mathemat-
ically equivalent to the value iteration method (4.19) that generates Jk(i).
The iterates generated by the two methods merely differ by a constant [ef.
Eq. (4.26)], and the minimization problems involved in the corresponding
iterations of the two methods are mathematically equivalent. However, un-
der Assumption 4.1, it can be shown that the iterates hk(i) generated by
the relative value iteration method arc bounded, while this is typically not
true for the value iteration method.
It can be seen that if the relative value iteration (4.27) converges to
some vector h, then we have
n
)..k + hk(i) = g(i,J.Lk(i)) + LPij(J.Lk(i))hk(j),
j=l
i = 1,...,n, (4.29)
hk(n) = O.
(4.30)
\Ve subsequently perform a policy impmvement step; that is, we find a
stationary policy J.Lk+1, where for all i, J.Lk+1(i) is such that
)"+h(i) = min [ 9(i'U)+tpij(U)h(j) ]
uEU(,) .
J=l
n
g(i,J.Lk+l(i)) + LPij(J.Lk+ 1 (i))h k (j)
j=l
).. = min [ g(S,U) + tpSj(IL)h(j) ] .
uEU(.) .
J=l
= min [ 9(i,U) + tpij(U)hk(j) ]
uEU( j=l
(4.31)
where
Dy Prop. 4.1(b), this implies that).. is the optimal average cost per stage
for all initial states, and h is an associated differential cost vector. Unfor-
tunately, t.he convergence of the relative value iteration is not guaranteed
under Assumption 4.1 (sec Exercise 7.14 for a counterexample). A stronger
assumption is required. It turns out, however, that there is a simple variant
If )..k+1 = )..k and hk+1(t) = hk(i) for all i, thc algorithm t.ermillates;
otherwisc, the process is repeated with J.Lk+ 1 replacing ILk.
To prove that the policy iteration algorithm terminates, it. is sufficicnt
that cach iteration makes some irreversible progress towards optimality,
since there arc finitely many stat.ionary policies. The type of irrcvcrsible
progress that we can demonstrate is described in the following proposition,
which also shows that an optimal policy is obtained upon termination.
_.I
322
Introduction to Infinite Horizon Problem
Chap. 7
Proposition 4.2 Under Assumption 4.1, in the policy iteration algo-
ritllln, for each k we either have
Ak+1 < Ak
( 4.32)
or else we have
Ak+1 = A k ,
hk+l(i)hk(i), i=I,...,n.
( 4.33)
Furthermore, the algorithm terminate:; and the policies Ik and flk+ 1
obtained upon termination are optimal.
Proof: To simplify notation, denote flk = I, Ik+l = Ti, A k = A, Ak+l = ):,
hk(i) = h(i), hk+ 1 (i) = h(i). Define for N = 1,2,...
n
hN (i) = g( i, Ti(i)) + L Pij (Ti(i)) hN -1 (j),
j=l
i=I,...,n,
(4.34)
where
ho(i)=h(i), i=l,...,n. (4.35)
Note that hN(i) is the N-stage cost of policy II starting from i when the
terminal cost function is h. Thus we have
- 1 )
A = JJi(i) = lim N hN(i,
Noo
i = 1,...,n,
(4.36)
since the contribution of the terminal cost to (l/N)hN(i) vanishes when
N ---> 00. Dy the definition oflI and Prop. 4.1(c), we have for all i
n
'q(i) = g(i,lI(i)) + LPij(lI(i))ho(j)
j=1
n
g(i,fl(i)) + LP'j(fl(i))ho(j))
j=l
( 4.37)
= A + ho(i).
From the above equation, we also obtain
n
h2(i) = g(i,Jl(i)) + LPi) (Jl(i))hl(j)
j=1
n
g( i, lI( i)) + L Pij (lI(i))(A + ho(j))
j=1
Sec. 7.3
Discounted Problt
323
n
= A + !/(i,7J.(i)) + Lll,j(7J.(i))ho(j)
j=1
n
:::; A + g(i,IL(i)) + LPij(J1(i))ho(j)
)=1
= 2A + ho(i),
and by proceeding similarly, we see that for all i and N we have
hN(i) :::; NA + ho(i).
(4.38)
Thus,
hN(i) :::; A + ho(i),
and by taking the limit as N ---> 00 and using Eq. (4.36), we obtain): :::; A.
If ): = A, then it is seen that the iteration that produces flk+1 is a
policy improvement step for the associated stochastic shortest path problem
with cost per stage
!/(i, u) - A.
Furthermore, h( i) and h( i) are the optimal costs starting from i and cor-
responding to J1 and lI, respectively, in this associated stochastic shortest
path problem. Thus, by Prop. 2.2, we must have h(i) h(i) for all i.
Let us now show that when the algorithm terminates with ): = A
and h(i) = h(i) for all i, the policies II and J1 are optimal. Indeed, upon
termination we have for all i
A + h(i) = ): + h(i)
n
= g(i,lI(i)) + LPij(Jl(i))h(j)
j=1
,
:i
'.4
n
= g(i,7J.(i)) + LPij(7J.(i))h(j)
j=1
= min [ 9(i,U) + t 1 J;j(U)h(j) ] .
uEU(,) )=1
Thus A and h satisfy Bellman's equation, and by Prop. 4.1(b), A must
be equal to the optimal average cost. Furthermore, lI(i) attains the mini-
mum in the right-hand side of Bellman's equation, so by Prop. 4.1(a), II is
optimal. Since we also have for all i
n
>. + h(i) = g(i,fl(i)) + LPij(J.L(i))h(j),
j=1
i
1.
the same is true for Ii. Q.E.D.
We note that policy iteration can be shown to terminate under less
restrictive conditions than Assumption 4.1 (see Vol. II, Section 4.3).
_...1
324
Introduction to Infinite Horizon Pro bIen
Chap. 7
7.5 NOTES, SOURCES, AND EXERCISES
This chapter is an introduction to infinite horizon problems. There is
an extensive theory for these problems with interesting mathematical and
computational content. Volume II provides a comprehensive treatment
and gives many references to the literature. The line of development given
here is original in that it uses the stochastic shortest path problem as the
starting point for the analysis of the other problems. Also, the error bounds
(2.16) are new.
EXERCISES
7.1
A tennis player has a Fast serve and a Slow serve, denoted F and S, respec-
tively. The probability of F (or S) landing in bounds is [iF (or ps, respec-
tively). The probability of winning the point assuming the serve landed in
bounds is qF (or qS, respectively). We assume that PF < ps and qF > qs.
The problem is to find the serve to be used at each possible scoring situation
during a single game in order to maximize the probability of winning that
game.
(a) Formulate this as a stochastic shortest path problem, argue that As-
sumption 2.1 of Section 7.2 holds, and write Bellman's equation.
(b) Computer assignment: Assume that qF = 0.6, qS = 0.4, and ps = 0.95.
Use value iteration to calculate and plot (in increments of 0.05) the
probability of t.he server winning a game with optimal serve selection
as a function of PF.
7.2
A quarterback can choose between running and passing the ball on any given
play. The number of yards gained by running is integer and is Poisson dis-
t.ributed with parameter AT' A pass is incomplete with probability P, is inter-
cepted with probability q, and is completed with probability 1 - P - q. When
completed, a pass gains an integer number of yards that is Poisson distributed
with parameter Ap. We assume that the probability of scoring a touchdown
on a single play starting i yards from the goal is equal to the probability of
gaining a number of yards greater than or equal to i. We assume also that
yardage cannot be lost on any play and that there are no penalties. The ball
Sec. 7.4
Notes, Sources, al. ;xercises
325
is turned over t.o t.he ot.he'r team on a fourt.h down or when <In int"ITept.ion
occurs.
(a) Formulate the problem as a stochast.ic shortest path problem, argue' that.
Assumption 2.1 of Section 7.2 holds, and write Bellman's equation.
(b) Computer assignment: Use value iterat.ion to compute the quarterback's
play-selection policy that maximizes the probability of scoring a t.ouch-
down on any single drive for AT = 3, Ap = 10, P = 0.4, and q = 0.05.
7.3
A computer manufacturer can be in one of two states. In state 1 his product.
sells wel!, while in state 2 his product sells poorly. While in state 1 he can au-
vertise his product in which the one-stage reward is 4 units, ,,"nd the transit.ion
probabilities are PII = 0.8 and Pl2 = 0.2. If in state 1, he does not advert.ise,
the reward is 6 units anu t.he transition probabilit.ies arc ]ill = 1'12 = n.G.
While in state 2, he call do research to improve his prouuct, ill which ca.", the
one-stage reward is -5 units, and the transition probabilities are [J2l = 0.7
and P22 = 0.3. If in state 2 he does not do research, the reward is -3, and
the transition probabilities are [J2l = 0.4 and P22 = 0.6. Consider the infinite
horizon, discounted version of this problem.
(a) Show that when the discount factor ex is sufficiently small, the computer
manufacturer should follow the "shortsighted" policy of not advertising
(not doing research) while in state 1 (state 2). By contrast, when ex
is sufficiently close t.o unity, he should follow the "farsighted" policy of
advertising (doing research) while in st.ate 1 (state 2).
(b) For ex = 0.9 calculate the optimal policy using policy iteration.
(c) For ex = 0.99, use a computer to solve the problem by value iteration,
with and without the error bounds (3.3).
7.4
"r
An energetic salesman works every day of the week. I-Ie can work in only one
of two towns A and 13 on each day. For each day he works in town A (or 13)
his expected reward is r A (or r B, respectively). The cost for changing towns
is c. Assume that c > r A > r B and that there is a discount factor ex < 1.
(a) Show that for ex sufficiently small, the optimal policy is to stay in t.he
town he starts in, and that for ex sufficiently close to 1, the optimal
policy is to move to town A (if not starting there) and stay in A for all
subsequent times.
(b) Solve the problem for c = 3, rA = 2, rn = I, and ex = 0.9 using policy
iteration.
Ji
,
;
:
1
.'!
'p
ii-!.
(c) Use a computer to solve the problem of part (b) by value iteration, with
and without the error bounds (3.3).
_. J
326
Introduction to Infinite Horizon Problel,
Cliap. 7
Sec. 7.5
Notes, Sources, aI. ;xercises
327
7.5
7.9
A person has an umbrella that she takes from home to office and vice versa.
There is a probability p of rain at the time she leaves home or office indepen-
dently of earlier weather. If the umbrella is in the place where she is and it
rains she takes the umbrella to go to the other p1acc (this involves no cost).
If thre is no umbrella and it rains, therc is a cost W for getting wet. If
the umbrella is in the place where she is but it does not rain, she may take
the umbrella to go to thc other place (this involves an inconvcniencc cost
V) or she may leave thc umbrella behind (this involves no cost). Costs are
discountcd at a factor a < 1.
Solvc the avcrage cost vcrsion (a = 1) of the computcr manufacturer's prob-
lem (Exercise 7.3).
7.10
(a)
Formulatc this as an infinite horizon total cost discount cd prob1cm.
Hint: Try to use as fcw statcs as possible.
Characterize thc optimal policy as best as you can.
An unemployed worker receives a job offer at. each time period, which she Illay
accept or reject. The offered salary takes one of n possible values Wi,.. . ,111"
with given probabilities, indcpendcntly of preccding offers. If she accept.s t.he
offer, she must kcep the job for the rest of her lifc at the same salary level.
If she rejects thc offer, she rcceives unemploymcnt. compcnsation c for t.he
current period and is eligiblc to accept future offcrs. Assume that income is
discountcd by a factor a < 1.
(a) Show that there is a thrcshold 'ill such that it is optimal to accept an
offer if and only if its salary is larger than 'ill, and characterize w.
(b) Consider the variant of the problcm where thcre is a given probability
Pi that thc worker will bc fired from hcr job at anyone period if her
salary is Wi. Show t.hat the rcsult of part. (a) holds in the casc where 11,
is t.he same for all i. Analyze the casc whcre Pi depends on i.
(b)
7.6
For the tcnnis player's problem (Excrcise 7.1), show that it is optimal (re-
gardlcss of score) to use F on both serves if (PFqF)/(PSqs) > 1, to use Son
both scrves if (PFqF)/(PSqs) < 1 + PI" - ps, and to use F on thc first scrve
and S on thc second otherwise.
7.11
7.7
Do part (b) of Exercisc 7.10 for the case whcrc incomc is not. discounted and
the workcr maximizes hcr average income per period.
Consider the value iteration method for the Example 3.2:
7.12
Jk+l(i) = min[K + a(l- p)Jk(O) + aph(I),
ci + a(1 - p)Jk(i) + apJk(i + 1)],
i = 0,1,. ., ,n - 1,
Show that one can always take m = n in Assumption 2.1. lIint: For any 7f
and i, lct Sk(i) bc t.he sct of states that arc reachable with positive probabilit.y
from i undcr 7f in k stages or less. Show that under Assumption 2.1, we cannot
have Sk(i) = Sk+l(i) while t f. Sk(i).
Jk+l(n) = K + a(1 - p)Jk(O) + apJ k (I),
wherc Jo(i) = 0 for all i. Show by induction that Jk(i) is monotonically
nondecrcasing in i.
7.13
7.8
Show the error bounds (2.16). Hint: Complete the details of the following
argumcnt. Lct ,.l(i) attain the minimum in the value iteration (2.15) for all
i. Then, in vector form, we have
Consider the policy iteration algorithm for thc problcm of Example 3.2.
(a) Show that if wc start thc algorithm with a threshold policy, every sub-
sequcntly gencrated policy will bc a threshold policy.
(b) Carry out the algorithm for the casc c = 1, K = 5, n = 10, P = 0.5,
a = O.!J, and an initial policy that always processes thc unfilled orders.
.h+l = gk + PkJk,
where Jk and gk are the vectors with component.s Jk{i) , i = 1,..., n, and
gk(i,,.l(i)) , i = 1,...,n, respectively, and P k is t.he mat.rix with plement.s
Pi) (ILk(i)). Also from Bellman's cquation, we have
J*:S gk + PkJ',
_ J
328
Introduction to Infinite Horizon Pro bIen
CIJap.7
where the vector inequality above is meant to hold separat.ely for each com-
ponent. Let e = (1,.. .,1)' be t.he unit vector. Using t.he above t.wo relations,
we have
J* - Jk ::; J* - Jk+l + Cke ::; Pk(J* - Jk) + Cke.
Mult.iplying this relation with Pk and adding; CkC, we obtain
Pk(J* - Jk) + Cke ::; P(J* - Jk) + Ck(I + Pk)e.
Similarly cont.inuing, we have for all r 2: 1
J* - Jk+l + Cke ::; P"k(J* - Jk) + Ck(I + P k + ... + p;-I)C.
For s = 1,2,. .., t.he it.h component. of the vector Pke is equa] to the probabil-
ity P{x. of. t I Xo = i, J.l.k} that t has not been reached after s stages starting
from i and using the stationary policy fJ.k. Thus, Assumption 2.1 implies that
limr= P; = 0, while we have
(5.1)
lim (1 + Pk +... + p;-1)C = N k ,
roo
where N k is t.he vector (N k (1),..., Nk(n) )'. Combining the above two rela-
tions, we obtain
J* ::; Jk+1 + ck(N k - e),
proving the desired upper bound.
The lower bound is proved similarly, by using in place of fJ.k, an optimal
stationary policy It*. In particular, in place of Eq. (5.1), we can show that
Jk - J* ::; Jk+l - J* - fke::; F*(Jk - J*) - fke,
where P* is the matrix with elements Pij (fJ.* (i)), We similarly obtain for all
r 2: 1
Jk+l - J* - fke::; (PT(Jk - J*) - fk(I + p* +... + (F*)'-l)C,
from which A+l +fk(N* -e) ::; J*, where N* is the vector (N° (1),.. . , N*(n) )'.
7.14
Apply the relative value iteration algorit.hm (4.27) for the case where there
are two stat.es and only one control per state. The transition probabilities are
VII = €, Pl2 = 1 - €, P2l = 1 - €, and ])22 = €, where 0 ::; € < 1. Show that if
o < € the algorit.hm converges, but if € = 0, the algorithm may not converge.
Show also that the variation (4.28) converges when € = 0.
7.15
Consider the average cost problem and its associated stochastic shortest path
problem when the expected cost incurred at state i is g(i,u) - >.*.
(a) Use Prop. 2.1(d) and Prop. 4.1(a) to show that if a stationary policy is
optimal for the latter problem it is also optimal for the fonner.
(b) Show by example that the reverse of part (a) need not be true.
APPENDIX A:
Mathematical Review
The purpose of this appendix is to provide a list of mathematical
definitions, notations, and results that are used frequently in the text. For
detailed cxpositions, the reader may consult references [HoK61], [Roc70],
[Roy68], [Rud64], and [Str76].
SETS
If x is a member of the set S, we writc XES. We write x t/:. 3 if x is
not a member of 3. A set S may be specified by listing its elements within
braces. For example, by writing 3 = {Xl,X2,... ,Xn} we mean that the set
S consists of the elements Xl, X2, . . . , Xn. A set 3 may also be specificd in
the generic form
S = {x I x satisfies P}
as the set of elements satisfying property P. For example,
S = {:c I x: real, 0 ::; x ::; 1}
denotes the set of all real numbers x satisfying 0 ::; x ::; 1.
The union of two sets Sand T is denoted by SUT and the intcr'scction
of 3 and T is denoted by SnT. The union and the intersection of a sequence
of sets 31,3 2 ,..., Sk,... are denoted by Ul 3k ami nl 3k, respectively.
If S is a subset of T (i.e., if every element of 3 is also an clement of T), we
write SeT or T::) 3.
329
_ J
330
Mathcmatical HcviclV
,ppcnJix A
Finite and Countable Sets
A set S is said to be finite if it consists of a finite number of clements.
It is said to be countable if there exists a one-to-one function from S into
the sct. of nonnegative intcgers. Thus, according to our definition, a finitc
sct is also countable but not conversely. A countable set S t.hat. is not
finite may be represented by listing its elements XO,X1,X2,... (i.e., S =
{:ro, :r 1, :f.2, . . . } ). A co untable union of countable sets is countable, that is,
if A = {ao, a I, . . .} is a countablc set and Sao, Sa 1 , . .. arc each countable
sets, thcn Uk=OSak is also a countable set.
Sets of Real Numbers
If a and b arc real numbers or +00, -00, we denote by [a, b] the set
of numbers x satisfying a ::; x ::; b (including the possibility x = +00 or
:r = - 00). A rounded, instead of square, bracket denotes strict inequlit in
the definition. Thus (a, b], [a, b), and (a, b) denote the set of all x satlsfymg
a < x < b a < x < b and a < x < b, respectively.
If S \s ;; set of' real numbers that is bounded above, then there is a
smallest real number y sucl, that x ::; y for all xES. This number is called
the least uppcr bound or supremum of S and is denoted by sup{ x I XES} or
max{x I XES}. (This is somewhat inconsistent with normal mathematicl
usage, where the use of max in place of sup indicates that thc suprcmum IS
attained by some element of S.) Similarly, the greatest real number z such
that z < x for all xES is called the greatest lower bound or infimmn of
S and i denoted by inf{x I XES} or min{x I XES}. If S is unbounded
above, we write sup{x I xES} = +00, and if it is unbounded below, we
write inf{x I XES} = -00. If S is the empty sct, then by convention we
writc inf{x I XES} = +00 and sup{x I XES} = -00.
A.2
EUCLIDEAN SPACE
The set of all n-tuples x = (Xl,"" J;n) of real numbers constitutes
the n-dimensional Euclidean space, denoted by ?n. The elements of 3?n
arc referred to as n-dimensional vectors or simply vectors when confusion
cannot arise. The one-dimensional Euclidean space 3?1 consists of all the
real numbers and is denoted by 3? Vectors in 3?n can bc added by adding
their corresponding components. Thcy can be multiplied by a scalar by
multiplication of each component by a scalar. The inner produt of to
vectors x = (X1,...,X n ) and y = (Y1,...,Yn) is denoted by xy and IS
e q ual to 2:: n _ XiYi. The norm of a vector x = (Xl" . . , Xn) E 3?n is denoted
,-1 1/2
by II xII and is equal to (X'X)1/2 = (2::=1 x) .
Sec. A.3
Matrices
331
. A sct of vectors aI, a2, . . . , ak is said to be lincarl depcndcnt if there
eXlst scalars AI, A2,... ,Ak, not all zero, such that 2::i=l Aiai = O. If no
such sct of scalars exists, the vectors arc said to be linearly indepcndent.
A.3
MATRICES
An 1n x n matrix is a rectangular array of numbers, referred to as
elements, which are arranged in m rows and n columns. If m = n the
matrix is said to be square. The element in the ith row and jth column
of a matrix A is denoted by a subscript ij, such as aij, in which case we
write A = [aij]. The n x n identity matrix, denoted by I, is thc matrix
with elements aij = 0 for i cI j and aii = 1, for i = 1,..., n. The sum
of two m x n matrices A and B is written as A + B and is the matrix
whose elements are the sum of the corresponding elements in A and B.
The product of a matrix A and a scalar A, written as AA or AA, is obtained
by multiplying each element of A by A. The product AB of an Tn x n
matrix A and an n x p matrix B is the m x p matrix C with elemcnts
Cij = 2::=1 aikbkj. If b is an n-dimensional column vector and A is an
m x n matrix, then Ab is an m-dimensional column vector.
The transpose of an m x n matrix A is the n x 111 matrix A' with
elements a;j = aji. The elements of a given row (or column) of A constitut.e
a vector called a row vector (or column vector, respectively) of A. A square
matrix A is symmetric if A' = A. An n x n matrix A is called nonsingular
or invertible if there is an n x n matrix called the inverse of A and denoted
by A-I, such that A-1A = I = AA-I, where I is the n x n identity
matrix. An n x n matrix is nonsingular if and only if its n row vectors are
linearly independent or, equivalently, if its n column vcct.ors are lincarly
independent. Thus, an n x n matrix A is nonsingular if and only if the
relation Av = 0, where v E 3?'", implies that v = O.
Rank of a Matrix
The rank of a matrix A is equal to the maximum number of linearly
independent row vectors of A. It is also equal to the maximum nmnber of
linearly independent column vectors. Thus, the rank of an m x n matrix
is at most equal to the minimum of the dimensions Tn and n. An m x n
matrix is said to be of full rank if its rank is maximal, that is, if its rank is
equal to the minimum of m and n. A square matrix is of full rank if and
only if it is nonsingular.
Eigenvalues
Given a square n x n matrix A, the determinant of the matrix II - A,
where I is the n x n identity matrix and I is a scalar, is an nt.h dq,;rcc
__ J
332
MaLllClIlatical lievjew
ppendjx A
Sec. A.3
Mairjccs
333
polynomial. The n roots of this polynomial arc called the eigenvalues of A.
Thus, , is an eigenvalue of A if and only if the matrix ,I - A is singular,
or equivalellt.ly, if and only if there exist.s a non"ero vector v sllcli that
Av = ,v. Such a vector v is called an eigenvector corresponding to ,.
The eigenvalues and eigenvectors of A can be complex even if A is real. A
matrix A is singular if and only if it has an eigenvalue that is equal to zero.
If A is nonsingular, t.hen t.he eigenvalues of A -I are the reciprocals of t.he
eigenvalues of A. The eigenvalues of A and A' coincide.
If ,1, . . . , ,n are the eigenvalues of A, then the eigenvalues of cI + A,
where c is a scalar and I is the identity matrix, arc c + ,1, . . . , c + ,n' The
eigenvalues of Ak, where k is any positive integer, are equal to ,, . . . , ,k.
From this it follows that limk-->o Ak = 0 if and only if all the eigenvalues of
A lie strictly within the unit. circle of the complex plane. Furthermore, if
the latter condition holds, the iteration Xk+1 = Ark + b, where b is a given
vector, converges to x = (I - A)-Ib, which is the unique solution of the
equation x = Ax + b.
If all the eigenvalues of A are distinct, then their number is exactly
11., and there exists a set of corresponding linearly independent eigenvec-
tors. In this case, if ,1, . . . , ,n are the eigenvalues and VI, . . . , V n are such
eigenvectors, every vector x E n can be decomposed as
A positive definite symmetric matrix is invert.ible and its inverse is
also positive definite symmetric. Also, an invertiblc positive semidefinitc
sYllllndric mat.rix is positive definite. Allalogolls result.s hold for llcgativc
definite and semidefinite symmetric matrices. If A and Bare n x n positive
semidefinite (definite) symmctric matrices, then the matrix )"A + /l.B is also
positive semidefinite (definite) symmetric for all ).. > 0 and J1 > O. If A is
an II. x 1/. positive sellliddinit.e symmetric nwtrix ,lml G is an II! x I/, InaLrix,
then the matrix GAG' is positive semidefinite symmctric. If A is positive
definite symmetric, and G has rank m (equivalently, m ::; nand G has full
rank), then GAG' is positive definite symmetric.
An n x n positive definite symmetric matrix A can be written as
GG' where G is a square invertible matrix. If A is positive semidefinite
symmctric and its rank is m, then it can be written as GG', where G is an
n x Tn matrix of full rank.
A symmetric n x n matrix A has real eigenvalues and a set of n real
linearly independent eigenvectors, which are orthogonal (thc inner product
of any pair is 0). If A is positive semidefinite (definite) symmetric, its
eigenvalues arc nonnegative (respectively, positive).
Partitioned Matrices
n
X = LiVi'
i=1
It is often convenient to partition a matrix into submatrices. For
example, the matrix
where i arc some unique (possibly complex) numbers. Furthermore, we
have for all positivc integers k,
( all
A = a21
a31
a12 al3
a22 a23
a32 a33
a14 )
a24
a34
n
Ak x = L ,fiVi.
i=1
may be partitioned into
A = ( All
A 21
A I2 )
A 22 '
If A is a transition probability matrix, that is, all the elements of A
are nonnegative and the sum of the elements of each of its rows is equal to
one, then all the eigenvalues of A lie within the unit circle of the complex
plane Furthermore, 1 is an eigenvalue of A and the unit vector (1,1,. .. , 1)
is a corresponding eigenvector.
where
A square symmetric n x n matrix A is said to be positive semidefinite
if :r' Ax 0 for all x E n. It is said to be positive definite if x' Ax > 0
for all nonzero x E n. The matrix A is said to be negative semidefinite
(definite) if -A is positive semidefinite (definite). In this book, the notions
of positive definiteness and semidefiniteness will be used only in connection
with symmetric matrices.
All = (all aI2), A I2 = (a13 aI4),
A - ( a21 a22 ) A ( a23 a24 )
21 - ,22 = .
a31 a32 a33 a34
We separate the components of a partitioned matrix by a space, as in (B G),
or by a comma, as in (B, G). The transpose of the partitioned matrix A is
A' = ( AI A1 )
A2 A2 .
Partitioned matrices may be multiplied just as non partitioned matrices,
provided the dimensions involved in the partitions are compatible. Thus if
A= ( All A12 ) B= ( Bll B12 )
A2l A 22 ' B 21 B 22 '
Positive Definite and Semidefinite Symmetric Matrices
_...1
334
Mathematical Review
ppelJdlx A
Sec. AA
Analysis
335
AB = ( AllBll + A 12 B 21
A21Bll + A 22 B21
AllBl2 + Al2B22 )
A 21 B 12 + A22B22 '
A vcctor x is said to be a limit point of a S('qIH'IH'(' {.I'd if I.IH'I"(' iii
a subscquence of {xd that converges to :/:, that is, if thcre is an infinite
subset f( of the nonnegativc integers such that for any c > 0, then' is an
integer N such that. for all kEf( wit.h k :2': N we have II.I'/,;- .1:11 < ( .
A sequence of real numbers {1'k}, which is monotonically lIonde-
creasing (nonincreasing), that is, satisfies Tk ::; r/,;+1 for all k, must ei-
ther converge to a real number or be unbounded above (below). In t.he
lat.ter case we write limkooTk = 00 (-00). Givcn ,my bOllllded ii('-
quence of real numbers {rd, we may consider the sequence {sk}, where
5k = sup{r; 1 i 2:: k}. Since this sequence is monotonically nonincreasing
and bounded, it must have a limit. This limit is called the limit superior
of {Tk} and is denoted by limsuPkoo rk. The limit inferior of {1'k} is
similarly defined and is denoted by lim inf/,;oo rk. If {TJ.:} is unbounded
above, we writ.e limsuPkoo 1'k = 00, and if it is unbounded below, we write
lim infkoo rk = -00. We also use this notation if l'k E [-00,00] for all k.
then
provided the dimensions of the submatrices are such that the preceding
products A,jBjk, i,j, k = 1,2 can be formed.
Matrix Inversion Formulas
Let A and B be square invertible matrices, and let C be a matrix of
appropriate dimension. Then, if all the following inverses exist, we have
(A + CBC/)-1 = A-I - A-1C(B-l + C/A-1C)-IC'A-l.
The equation can be verified by multiplying the right-hand side by A +
CBC' and showing that the product is the identity matrix.
Consider a partitioned matrix M of the form
Open, Closed, and Compact Sets
M=( ).
M-1 = (
_D 9 1CQ
-QBD-l )
D-l + D-ICQBD-l '
A subset 8 of Rn is said to be open if for every vector :/: E 8 one
can find an E > 0 such that {z I Ilz - xii < E} C S. A set 8 is closed if
and only if every convergent sequence {xd with elements in 8 converges
to a point that also belongs to 8. A set S is said to be compact if and
only if it is both closed and bounded (Le., it is closed and for some M > 0
we have Ilxll M for all XES). A set 8 is compact if and only if every
sequence {Xk} with elements in 8 has at least one limit point that belongs
to S. Another important fact is that if So, SI,..., Sk,... is a sequence
of nonempty compact sets in Rn such that Sk =:J Sk+l for all k, then the
intersection nk=oSk is a nonempty and compact set.
Then we have
where
Q = (A - BD-IC)-l,
provided all the inverses exist. The proof is obtained by multiplying M
with the expression given for M-l and verifying that the product yields
the identity matrix.
Continuous Functions
Convergence of Sequences
A function f mapping a set SI into a set S2 is denoted by f : Sl -> S2.
A function f : n -> m is said to be continuous if for all x, f(Xk) -> f(:I;)
whenever Xk -> x. Equivalently, f is continuous if, given x E Rn and E > 0
there is a 8> 0 such that whenever Ily-xll < 8, we have Ilf(y) - f(.1:) II < c:
The function
AA
AN A.L YSIS
(adl + a2h)(.) = al/1(') + a2h(')
A sequence of vectors XO,X1,...,Xk,'" in n, denoted by {xd, is
said to converge to a limit x if Ilxk - xII -> 0 as k -> 00 (Le., if, given any
E > 0, there is an integer N such that for all k 2:: N we have IIxk - xII < E).
If {xd converges to x, we write Xk -> x or limk-+oo Xk = x. We have
AXk + BYk -> Ax + By if Xk ---+ x, Yk -> y, and A, B are matrices of
appropriate dimension.
is continuous for any two scalars a I, az and any t.wo cont.il1uous fUl1ctiolls
/1, h : ac n -> m. If b\, Sz, 83 are any sets and /1 : 8 1 ---+ 82, h : 8 2 -> 83
are functions, the function h . /1 : SI ---+ 83 defined by (f2 . /1)(x) =
h (/1 (:£)) is called the composition of it and h. If it : 3(n -> m and
h : m -> p are continuous, then h . it is also continuous.
_J
336
Mathematjcal Revjew
ppendix A
Derivatives
Let f : " f-> be some function. For a fixed x E an, the first partial
derivative of f at the point x with respect to the ith coordinate is defined
by
af(x) = lim f(x + aei) - f(x) ,
ax, 0--->0 a
where ei is the ith unit vector, and we assume that the above limit exists.
If the partial derivatives with respect to all coordinates exist, f is called
differentiable at x and its gradient at x is defined to be the column vector
( 8f(x) )
8Xl
'ilf(x) = : .
8f(x)
BXn
Thc function f is callcd differcntiable if it is diffcrentiable a.t every :c E n.
If 'il f(x) exists for every x and is a continuous function of x, f is said to
be continuously differentiable. Such a function admits, for every fixed x,
the first order expansion
f(x + y) = f(x) + y''il f(x) + o(llyl!),
where o(lIyl!) is a function of y with the property limllYII--->o oUlyl!)/llyll = o.
A vector-valued function f : " f-> m is called differentiable (re-
spectively, continuously differentiable) if each component fi of f is differ-
entiable (respectively, continuously differentiable). The gradient matrix of
f, denoted by 'il f(x), is the n x m matrix whose ith column is the gradient
'il fi(X) of J;. Thus,
'ilf(x) = ['ilh(x)"''ilfm(X)].
The transpse of 'il f is the Jacobian of f; it is the matrix whose ijth entry
is equal to the partial derivative afi/fJx j .
If the gradient 'il f(x) is itself a differentiable function, then f is said
to be twice differentiable. We denote by 'il 2 f(x) the Hessian matrix of f
at x, that is, the matrix
'il 2 X = [ a 2 f(X) ]
f() axiax j
the elements of which arc the second partial derivatives of f at x.
Let f : k f-> m and g : m f-> n be continuously differentiable
functions, and let h(x) = g(J(x)). The chain rule for differentiation states
that
'ilh(x) = 'ilf(x)'ilg(J(x)), for all x E k.
For example, if A and B are given matrices, then if h(x) = Ax, we have
'ilh(x) = A' and if h(x) = ABx, we have 'ilh(x) = B'A'.
1 A.5
I
I
Sec. A.S
Convex Sets 1.. FuncUons
337
CONVEX SETS AND FUNCTIONS
A subset C of n is said to be convex if for every x, y E C ,tJ1d every
scalar a with 0 ::; a ::; 1, we have ax + (1 - a)y E C. In words, C is
convex if the line segment connecting any two points in C belongs to C.
A function f : C -> defined over a convex subset C of n is said t.o he
convex if for every x, y E C and every scalar a with 0 ::; a ::; 1 we have
f(ax + (1 - a)y) ::; af(x) + (1 - a)f(y).
Thc function f is said to be concave if (- f) is convex, or equivalently if
for every x, y E C and every scalar a with 0 ::; a ::; 1 we have
f(ax + (1 - a)y) ;::: af(x) + (1 - a)f(y).
If f : C -> is convex, then thc sets r,\ = {x I x E C, f(:1.;) :S ..\} arc
also convex for every scalar..\. An important property is that a rea.l-valucd
convex function on n is continuous.
If h, h, . . . , f m arc convex functions over a convex subset C of n and
a I, 0'2, . . . , am are nonnegative scalars, then the function a I h + . . . + am fm
is also convex over C. If f : m -> is convex, A is an m x n matrix,
and b is a vector in m, the function g : n -> defined by g(x) =
f (Ax + b) is also convex. If f : n -> is convex, then the function
g(x) = Ew{f(x + w)}, where w is a random vector in n, is a convex
function provided the expected value is finite for every x E n.
For functions f : n -> that are differentiable, there are alternative
characterizations of convexity. Thus, the function f is convex if and only
if
f(y) ;::: f(x) + 'il f(x)'(Y - x),
for all x,y E n.
If f is twice continuously differentiable, then f is convex if and only if
'il 2 f(x) is a positive semi definite symmetric matrix for every x E n.
_....1
Sec. D.l
Optjmal Solut. s
339
Note that a minimizing clement need not exist.. For exampk, the scalar
functions f(x) = x and f(x) = eX have no minimiing clements over t.lw
sct of real numbers. Thc first function decreases without bound to -00 as
x tends toward -00, while the second decreases toward 0 as x tends toward
-00 but always takes positive values. Given the range of values that f(x)
takes as x ranges over X, that is, the set of real numbers
APPENDIX B:
{J(x) I x EX},
On Optimization Theory
there an two possibilities:
1. The set {J(x) I x E X} is unbounded below (Le., contains arbitrarily
small real numbers) in which case we write
min{J(x) I x E X} = -00 or
min f(x) = -00.
xEX
2. The set {f(x) I x E X} is bounded below; that is, there exist.s a scalar
M such that M :::; f(x) for all x E X. The greatest lowcr bound of
{J(x) I x E X} is also denoted by
The purpose of this appendix is to provide a few definitions and result.s
of deterministic optimization. For detailed expositions, see references such
as [Lue84] and [Der05].
min{J(x) 11: E X} or
m}l f (:1:).
x"x
In either case we callminxEx f(x) thc optimal valuc of problem (D.1).
A maximization problem of the form
B.1
OPTIMAL SOLUTIONS
maximize f (x)
subject to x E X
Given a set S, a real-valued function f : S --> , and a subset XeS,
the optimization problem
may be converted to the minimization problem
minimize f (x)
subject to x E X,
(B.l)
minimize - f (x)
subject to x E X,
is to find an clement :1;* E X (called a minimizing element or an optimal
solution) such that
f(;I;*) :::; f(x),
for all x E X.
in the sense that both problems have the same optimal solutions, and the
optimal value of one is equal to minus the optimal value of the other. The
optimal value for the maximization problem is denoted by max.rEX f (1;).
Existence of at least one optimal solution in problem (13. j) is guar-
anteed if f : n --> is a cont.inuous function and X is a compact subset.
of n. This is the Weierstrass theorem. By a related result, existence of
an optimal solution is guaranteed if f : n --> is a continuous function,
X is dosed, and f(x) --> 00 if 111:11 --> 00.
For any minimizing element x*, we write
x* = arg mi f(x).
xEX
338
_. J
340
On Optimization Theory
Appendix B
B.2 OPTIMALITY CONDITIONS
Optimality conditions are available when f is a differentiable function
on n and X is a convex subset of n (possibly X = n). In particular, if
x* is an optimal solution of problem (D.1), f : n --> is a continuously
differentiable function on n, and X is convex, we have
"'\1f(x*)'(x - x*) 2: 0,
(B.2)
for all x E X,
where "'\1f(x*) denotes the gradient of f at X*. When X = n (i.c., the
minimization is unconstrained), the necessary condition (B.2) is equivalent
to
"'\1 f(x*) = O.
(D.3)
When f is twice continuously differentiable and X = n, an additional nec-
essary condition is that the Hessian matrix "'\12 f(:/;*) be positive semidcfinite
at X*. An important fact is that if f : n --> is a convex function and X
is convex, then Eq. (B.2) is both a necessary and a sufficient condition for
optimality of x* .
B.3 MINIMIZATION OF QUADRATIC FORMS
Let f : n --> be a quadratic form
f(x) = 1x'Qx + b'x,
where Q is a symmetric n x n matrix and b E n. Its gradient is given by
"'\1f(x) = Qx + b.
The function f is convex if and only if Q is positive semidefinite. If Q is
positive definite, then f is convex and Q is invertible, so by Eq. (B.3), a
vector x* minimizes f if and only if
"'\1f(x*) = Qx* + b = 0,
or cquivalently
x* = -Q-Ib.
APPENDIX C:
On Probability Theory
This appendix lists selectively some of the basic probabilistic notions
that we will be using. Its main purpose is to familiarize the reader with
some of our terminology. It is not meant to be exhaustive, and the reader
should consult references such as [Ash70], [FeI68], [Pap65], and [Ros85] for
detailed treatments. For fairly accessible treatments of measure theoretic:
probability, see [AdG86] and [Ash72].
C.1
PROBABILITY SPACES
A probability space consists of
(a) A set n.
(b) A collection :F of subsets of n, called events, which includes nand
has the following properties:
(1) If A is an event, then the complement. A = {w E n I w rf A}
is also an event. (The complement of n is the empty set and is
considered to be an event.)
(2) If AI and A 2 arc events, then Al n A2 and AI U A 2 are also
events.
341
_ J
342
On Probability T1Jeory
Sec. G.2
343
HandoIlJ
Appendix C
iables
(3) If Al,A2,...,Ak,... are events, then UblAk and nk=lAk are
also cvents.
(c) A function P(.) assigning to each event A a real number P(A), called
the probability of the event A, and satisfying:
(1) P(A) :::: 0 for every event A.
(2) p(n) = 1.
(3) P(AI U A2) = P(Al) + P(A2) for every pair of disjoint events
AI, A 2 .
(4) P(Uk=l A k ) = 2:r=l P(A k ) for every sequence of mutually dis-
joint events AI, A2,..., Ak,...
random vector x = (Xl,.T2,... ,x n ) by
F(ZI, Z2,..., z,,) = P( {w E n I .1:1(W) S Zl,:C2(W) :S: Z2,..., .r,,(w) , :'n}).
Given thc distrilmtion function of a random vector:1: = (.1:1,..., .1:,,),
thc (marginal) distribution function of each random variable Xi is obtained
fro 111
F,(Zi) = lim. F(Zl' Z2,. . ., z,,).
Zj--+ CXJ tJ:f:.1.
The random variablcs Xl, . . . , X n arc said to bc independent if
F(Zl, Z2,. . . , zn) = Fl (ZJ)F2(Z2)' . . F,,(zn),
Thc function P is referred to as a probability measure.
for all scalars Zl, . . '.' Zn.
The expected value of a random variable :r with distribution fnnction
F is dcfined by
Convention for Finite and Countable Probability Spaces
E{:r} = I: zdF(z)
The case of a probability space wherc the set n is a countable (possi-
bly finite) set is encountered frequently in this book. When we specify that
n is finite or countable, we implicitly assumc that the associated collection
of events is the collection of all subsets of n (including nand thc empty set).
Then, if n is a finite set, n = {WI, W2, . . . , w n }, the probability space is spec-
ified by the probabilities PI, P2, . . . , Pn, where Pi dcnotes the probability of
the event consisting of Wi. Similarly, if n = {WI, W2, . . . , Wk, . . .}, the proba-
bility space is specified by the corresponding probabilities PI, P2, . . . , Pk, . . .
In either case we refer to (Pl,P2,... ,Pn) or (pl,p2,'" ,Pk,..') as a proba-
bility distribution over n.
provided the integral is well defined. The expected value of a random vector
x = (Xl,.." Xn) is the vector
E{x} = (E{xI}, E{X2},..., E{x n }).
The covariance matrix of a random vector x = (Xl"", Xn) with cxpected
val ue E {x} = (Xl" . . , x n ) is defined to be the n x n positive semidcfinitc
symmetric matrix
( E{(:rl-xl)2}
E{(x" -X1)(XI-XJ)}
E{(:I:I - Xl(X" - X,,)} )
: '
E{ (x n - x n )2}
C.2
RANDOM VARIABLES
A random variable on a probability space(n, F, P) is a function x :
n --> such t.hat for every scalar .>- the set
provided the expectations arc well defined.
Two random vectors x and yare said to be uncorrelated if
{wEnlx(w):s:.>-}
E{(x-E{x})(Y-E{y})/} =0,
is an evcnt (i.e., belongs to the collection F). An 11r-dimensional random
vector' :1: = (.1:1, :1:2,. . . , :I: n ) is an 11r-tnple of randoln vari,lhles :/:1, .7:2,. . . ,Tn,
each defined on the same probability space.
We define the distribution function F : --> of a random variable
x by
where (x - E{:r}) is viewed as a column vcctor and (y - E{y})' is viewed
as a row vector.
The random vector x = (Xl,. .. , Xn) is said to be characterized by a
probability density function f : n --> if
J Z! J Z2 J zn
F(Zl,Z'2,,,,,ZH) = ... f(Yl,...,y,,)dYl...dYH'
-00 -00 -00
F(z) = p({w E n I :r(w) :S: z});
that is, F(z) is the probability that the random variable takes a value less
than or equal to z. We define the distribution function F : 1ftn --> 1ft of a
[or evcry Zl,. . . , Zn.
_. J
344
On Probability Theory
ppendix C
C.3
CONDITIONAL PROBABILITY
We shall restrict ourselves to the case where the underlying probabil-
ity space n is a countable (possibly finite) set and the set of events is the
set of all subsets of n.
Given two events A and B, we define the conditional probability of B
given A by
{ p(AnB) . f P(A) 0
P(B I A) = J5(A) 1 > ,
o if P(A) = O.
We also use the notation P{B I A} in place of P(B I A). If Bl, B2,... are a
countable (possibly finite) collection of mutually exclusive and exhaustive
events (Le., the sets B, arc disjoint and their union is n) and A is an event,
then we have
P(A) = L P(A n B i ).
From the two preceding relations it is seen that
P(A) = 'L P(Bi)P(A I B,).
We thus obtain for every k,
(B I A) = P(A n B k ) = P(Bk)P(A I B k )
P k P(A) 2:i P(Bi)P(A I B i ) ,
assuming that P(A) > O. This relation is referred to as Bayes' rule.
Consider now two random vectors x and y taking values in n and
m, respectively [Le., x(w) E n, y(w) E m for all wEn]. Given two
subsets X and Y of n and m, respectively, we denote
P(X I Y) = p({w I x(w) E X} I {w I y(w) E Y}).
For a fixed vector v E n, we define the conditional distribution function
of x given v by
F(z I v) = p({w I x(w):::; z} I {w I y(w) = v}),
and the conditional expectation of x given v by
E{x I v} = ( zdF(z I v),
J!RU
assuming that the integral is well defined. Note that E {x Iv} is a function
mapping v into n.
Sec. G.3
Condiiional Prob, ty
345
Finally, let us derive Bayes' rule for random vect.ors. If WJ, We,. . . an'
the elements of n, denote
z, = :E(Wi),
v, = y(w,),
i'-= 1,2,...
Also, for any vectors z E an, v E m, let us denote
P(z) = p({w I x(w) = z}),
P(v) = P({w I y(w) = v}).
We have P(z) = 0 if z cI Zi, i = 1,2,.. ., and P( v) = 0 if Ii cI Vi, i = 1,2, .. .
Denote also
P(z I v) = P({w I x(w) = z} I {w I y(w) = v}),
P(v I z) = p({w I y(w) = v} I {w I x(w) = z}).
Then, for all k = 1,2,..., Bayes' rule yields
{ d [> ( v l z d
P(Zk I v) = f,I'(Z,W(V1z,J
if f)(v) > 0,
if P(v) = O.
_ J
APPENDIX D:
On Finite-State Markov Chains
This appendix provides some of the basic probabilistic notions related
to stationary Markov chains with a finite number of states. For detailed
presentations, see [Ash70], [Chu60], [KeS60], and [Ros85].
D.l
STATIONARY MARKOV CHAINS
A square n x n matrix [pij] is said to be a stochastic matrix if all its
clements are nonnegative, that is, Pij 2:: 0, i, j = 1,. . . , n, and the sum of
the elements of each of its rows is equal to 1, that is, 'L,7=1 Pij = 1 for all
i = 1,. . ,n.
SUPPOS() we arc giveu a stocha.tic n x n matrix P together with a
finite set of sLates S = t 1,. .., n}. The pair (8, P) will be referred to as
a statio1/.ary finite-state Mm'kov chain. We associate with (S, P) a process
w hereby an initial state :I:o E 8 is chosen in accord,lllce with some initial
probability distribution
7'0 = (1'6,1'5,..., 1').
Subsequently, a transition is made from state Xo to a new state Xl E 8 in
accordance with a probability distribution specified by P as follows. The
346
Sec. D.2
Classification of, ,cs
347
probability that the new state will bc .i is equal to 11,j whenever the initi<1l
state is i, that is,
P(Xl = j I Xo = i) = Pij,
i,j=l,...,n.
Similarly, subsequent transitions produce states :C2, X3,... in accordance
with
P(Xk+l = j I Xk = i) = Pij,
i,j=l,...,n.
(D.l)
The probability that after the kth transition the state Xk will be j, given
that the initial state :ro is i, is denoted by
Pj = P(Xk = j I Xo = i),
i,j = 1,...,n.
(D.2)
A straightforward calculation shows that these probabilities arc cqual to
the elements of the matrix pk (P raised to the kth power), in the sense
that Pj is the element in the ith row and jth column of pk:
pk = [P k ]
'J .
(D.3)
Given the initial probability distribution po of the state Xo (viewed as a row
vector in n), the probability distribution of the state Xk after k transitions
rk = (r,r,... ,r k )
(viewed again as a row vector) is given by
7'k = roPk,
k = 1,2,...
(D.4)
This relation follows from Eqs. (D.2) and (D.3) once we write
n n
r = L P(Xk = j I Xo = i)ro = LPj1'o'
i=l i=l
CLASSIFICATION OF STATES
Given a stationary finite-state Markov chain (8, P), we say that two
states i and j communicate if there exist two positive integers k l and k2
such that p} > 0 and P;; > O. In words, states i and j communicate if
one can b reached from the other with positive probability.
Let S c 8 be a subset of states such that:
1. All states in S communicate.
_. J
D.3
348
On Finite-State Markov Chains
Jpendix D
2. If i E Sand j S, t.hen pJ = 0 for all k.
Then we say that S forms a recurrent class of states.
If S forms by itself a recurrent class (Le., all states cOlllmullicate with
each other), then we say that the Markov chain is irnducible. It is possible
that thcre exist several rccurrellt classes. It is also possible to prove that at
least one recurrent class must exist. A state that belongs to some recurrent
class is called rccurr-ent; otherwise it. is called tmn.sient. We have
lim pt = 0 if and only if i is transient.
k -+ CXJ
In other words, if the process starts at a transient state, the probability
of returnillg to the same state after k transitions diminishes to zero as k
tends to infinity.
The defillitions imply that if the process starts within a recurrent
class, it stays within that class. If it starts at a transient state, it eventually
(with probability olle) enters a recurrent class after a llumber of transitions
and subsequently remains there.
LIMITING PROBABILITIES
An im portant property of any stochastic matrix P is that the matrix
p. defined by
. 1 N-l
p. = hm - ""' pk
Noo N L
k=O
exists [in the sense that the sequences of the elements of (1/ N) L,:ol pk
converge to the corresponding clements of P']. The elements pi j of p.
satisfy
n
""' P '. = 1
tJ '
j=l
Thus, p. is a stochastic matrix. A proof of this is given in Prop. A.1 of
Appendix A in Vol. II.
Note that the (i, j)th element of the matrix pk is the probability that
the state will be j after k transitions starting from state i. With this in
mind, it can be seen from the definition (D.5) that P'i j can be interpreted as
the long term fraction of time that the state is j given that the initial state
is i. This suggests that for any two states i and i' in the same recurrent
class we have p' = p'" and this can indeed be proved. In particular, if a
tJ t J
Markov chain is irreducible, the matrix p' has identical rows. Also, if j is
a transient state, we have
pi j 2: 0,
i,j = 1,...,n.
pi j = 0,
for all i = 1,.. . , n,
(D.5)
Sec. D.4
First Passage Time
34U
so the columJis of the mat.rix p. correspondillg to trallsicnt st.atl's rOllsist.
of zeroes.
FIRST PASSAGE TIMES
Let. us denot.e by q the probabilit.y that. thc st.atc will be j {(H' LlH'
first time after exactly k 2: 1 transitions given that the initial state is i,
that is,
qj = P(."'Ck = j, X m 1= j, 1 ::; m < k I Xo = i).
Denotc also, for fixedi and j,
Kij = min{ k 2: 1 I Xk = j, :00 = i}.
Then J{ij, called the first passage time from i to j, may hc viewed a.'i it
random variablc. We havc, for cvery k = 1,2,. . . ,
P(Kij = k) = qj'
and we write
00
P(Kij = (0) = P(Xk 1= j, k = 1,2,... I Xo = i) = 1 - L qj'
k=l
Note that it is possible that L,l qfj < 1. This will occur, for example, if
j cannot be reached from i, in which case qfj = 0 for all k = 1,2, . . . The
mean first passa!/e time from i to j is the expected value of K'J:
{ "'00 k k
E{I(,j} = k=l qij
. f ",k k 1
I k=l qij = ,
. f ",00 k 1
1 k=l qij < .
It may be proved that if i and j belong to the same recurrent class thcn
E{K ij } < 00.
In fact if there is only one recurrent class and t is a state of that class, the
mean first passage times E{ Kit} are the unique solution of the following
linear system of equations
n
E{Kit} = 1 + L p,jE{I{jt},
j=l,ji't
i = 1,. .., n, i 1= t;
see Example 2.1 in Section 7.2. If i and j belong to two ditferent recurrent
classes, then E {Kij} = E{ Kj'} = 00. If i belongs to a recurrent class and
j is transient, we have E {Kij} = 00.
_ J
Sec. E.l
Least-Squares
imation
351
APPENDIX E:
:r: lllay have been the expected value E{:r}, once t.he valne of if is kilo\\" II ,
we want to form an updated estimate x(y) of the value x. This updated
estimate depends, of course, on the value of y, so we arc interested in a
rule that gives us the estimate for each possible value of y, i.e., we arc
interested in a function xC), where :r(y) is the estimate of :r given y. Such
a function x(.) : 1R m --> 1R m is called an estimator. We are seeking an
estimator that is optimal in some sense and the criterion we shall (lllploy
is based on minimization of
Kalman Filtering
E{llx-x(y)112}.
x,y
(E.1)
Here, 11.11 denotes the usual norm in1R n (lIz112 = z'z for z E 3n). Further-
more, throughout the appendix, we assume that all encountered expected
values are finite.
An estimator that minimizes the expected s(!uared error above over
all x(.) : 1R n --> 1R m is called a least-squares estimator and is denoted by
x*(.). Since
E {llx - x(y)112} = E{E{II:r - x(y)1121 y}},
x,y y x
In this appendix we present the basic principles of least-squares esti-
mation and their application in estimating the state of a linear discrete-time
dynamic system using measurements that are linear in the state variables.
Fundamentally, the problem is the following. There are two random
vectors :I: and 1/, which are related through their joint probability distri-
bution so that the value of one provides information about the value of
the other. We get to know the value of y, and we want to estimate the
value of x so that the average squared error between x and its estimate
is minimized. A related problem is to find the best estimate of x wit.hin
the class of all estimates that are linear in the measured vector y. 'vVe
will specialize these problems to a case where there is an underlying linear
dynamic system. In particular, we will estimate the state of the system
using measurements that are obtained sequentially in time. Dy exploiting
the special structure of the problem, the computation of the state estimate
can be organized conveniently in a recursive algorithm - the Kalman filter.
it is clear that x* (.) is a least-squares estimator if x* (y) mllllmlZCS the
conditional expect.ation in the right-hand side above for every y E m,
that is,
E{lIx-x*(y)112Iy}= min E{lIx-zI1211/},
x zE!Rm. x
for all y E 3rn. (E.2)
By carrying out this minimization, we obtain the following proposition.
Proposition E.l: The least-squares estimator x*(-) is given by
x*(y) = E{x I y},
x
for all y E 1R m .
(E.3)
Proof: We have for every fixed z E 1R"
E.l
LEAST-SQUARES ESTIMATION
E{llx - zl12 I y} = E{II:r:1121 y} - 2z' E{.T I y} + IIz11 2 .
x x x
Consider t.wo jointly distributed random vectors x and y taking values
in 3" and 1R rn , rcspectively. We view y as a measurement that provides
some information about x. Thus, while prior to knowing y our estimate of
By setting to zero the derivative with respect to z, we see that the above
expression is minimized by z = Ex{x I y}, and the rcsult follows. Q.E.D.
350
_.1
352
Kalman Filtering
Jpendix E
Sec. E.2
Unear Least-S res EsUmatiolJ
353
E.2
LINEAR LEAST-SQUARES ESTIMATION
To simplify our proof we assume that I: is a posit.ive dcfinit.e synlllldric
matrix so that it possesses an inverse; the result, however, holds without
this assumption. Since z is Gaussian, its probability density function is of
the form
The least-squares est.imat.or Ex{x I y} may be a complicat.ed nonlinear
function of y. As a result its practical calculation may be difficult. This
motivates finding optimal estimators within the restricted class of linear
estimators, Le., estimators of the form
p(z) = p(x, y) = cc-1(z-z)'E- l (z-z),
where
x(y) = Ay + b,
(E.4)
c = (27r)-(n+m)/2(dctI:)-1/2
where A is an n x m matrix and b is an n-dimensional vector. An estimator
x(y) = Ay + b
and det I: denotes the determinant of I:. Similarly the probabilit.y density
functions of x and yare of the form
where A and b minimize
l ( - ) '-1 ( - )
p(x) = qc--'; x-x ""xx x-x,
E {lix - Ay - b11 2 }
x,y
p(y) = C2 c - 1 (y-1i)'EY"J , (y-1i)
over all n x m matrices A and vectors b E 1R n is called a linear least-squares
estimator.
In the special case where x and y are jointly Gaussian random vectors
it turns out that the conditional expectation Ex {x I y} is a linear function
of y (plus a constant vector), and as a result, a linear least-squares estimator
is also a least-squares estimat.or. This is shown in the next proposition.
wherc Cl and C2 are appropriate constants. Dy Baycs' rule thc conditional
probability densit.y function of x given y is
p(x I y) = p(x ( ,y ) ) = c-H(z-z)'E-l(z-z)_(Y-1i)'EY"J(Y-1i»). (E.7)
p y C2
Proposition E.2: If x, yare jointly Gaussian random vectors, then
the least-squares estimate Ex {x I y} of x given y is linear in y.
It can now be seen that there exist a positive definite symmetric n x n
matrix D, an n x m matrix A, a vector b E 1R n , and a scalar s such that
z=()
(z - z)'I:-l(z- z) - (y- Y)'I:yJ(y -y) = (x - Ay -b)'D-l(x - Ay -b) +s.
(E.8)
This is because by substitution of the expressions for z and I: of Eqs. (E.5)
and (E.6), the left-hand side of Eq. (E.8) becomes a quadratic form in x
and y, which can be put in the form indicated in the right-hand side of Eq.
(E.8). In fact, by computing the inverse of I: using the partitioned matrix
inversion formula (Appendix A) it can be verified that A, b, D, and s in Eq.
(E.8) have the form
Proof: Consider the random vector z E 1Rn+m
and assume that z is Gaussian with mean
z = E{z} = ( E{X} ) = ( )
E{y} y
(E.5)
A = I:XyI:yJ,
b = x-I:xyI:ilJy,
D = I:xx - I:XyI:yJ I: yx ,
s = o.
and covariance matrix
I: = E{(z -z)(z -z)'} = ( E{(X -x)(x -x)'}
E{ (y - y)(x - x)'}
_ ( I:xx I: xy )
- I:yx I:yy .
E{ (x - x)(y - Y)'} )
E{ (y - y)(y - y)'}
Now it follows from Eqs. (E.8) and (E.7) that the conditional expectation
Ex {x I y} is of the form Ay + b, where A is some n x m matrix and b E 1R n .
Q.E.D.
(E.6)
'vVe now turn to the characterization of t.he linear least-squares esti-
mator.
_ J
354
Kalman Filtcrjng
ppcndix E
See, E.2
Ljncar Least- ares Estimation
355
so t.hat
Proposition E.3: Let x, y be random vectors taking values in 1R n
and 1R m , respectively, with given joint probability distribution. The
expected values and covariance matrices of x, yare denoted by
A I
Y E {A(y - y) - (x - x)} = o.
x,y
(E.18)
By subtracting Eq. (E.18) from Eq. (E.17), wc obtain
E{x} = x
E{ (:r - x)(:r - x)'} = ELI'
E{ (x - x)(y - ilY} = Exy,
E{y} = y,
E{ (y - y)(y - y)1} = Eyy,
E{ (y - y)(x - X)I} = Ei y ,
(E.9)
A I
E{(y-y)(A(y-y)-(x-x))} =0.
x,y
(E.lO)
(E.ll )
Equivalently,
Eyy.1 ' - Eyx = 0,
and we assume that Eyy is invertible. Then the linear least-squares
estimator of x given y is
from which
.A = ExEyJ = ExyEyJ.
(E.19)
x(y) = x + ExyEyJ(y - y).
(E.12)
Using the expressions (E.16) and (E.19) for b and .A, respectively, we obtain
The corrcsponding crror covariance mat.rix is given by
i(y) =.A.y + b = x + ExyL.yyl(y - y),
E {(x - x(y)) (x - X(y))'} = Exx - ExyEyJEyx.
x,y
(E.13)
which was to be proved. The desired Eq. (E.13) for the error covariance fol-
lows upon substitution of the exprcssion for x(y) obtained above. Q.E.D.
Proof: The lincar least-squares estimator is defined as
We list some of the properties of the least-squares estimator as corol-
laries.
x(y) = .1y + b,
Corollary E.3.1: The linear least-squares estimator is unbiased, Le.,
where .1,b minimizc the function f(A,b) = Ex,y{lIx - Ay - b11 2 } over A
and b. Taking the derivatives of f(A, b) with respect to A and b and setting
them to ero, we obtain the two conditions
E{X(Y)} =X.
y
0= ?f I d = 2 E {(b +.1y - X)yl},
aA A,b x,y
(E.14)
Proof: This follows from Eq. (E.12). Q.E.D.
Df I A A
o =- a ..=2E{b+Ay-x}.
b A,b x,y
(E.15 )
The second condition yields
Corollary E.3.2: The estimation error x - x(y) is uncorrelated with
both y andx(y), Le.,
b = x - .1y,
(E.16)
E {y(x - X(y))'} == 0,
x,y
E {x(y)(x - X(y))'} = o.
x,y
and by substitution in the first, we obtain
E {y(.A.(y - y) - (J: - X))'} o.
x,y
(E.17)
A } '
E {A(y - y) - (x - x) = 0,
X,y
Proof: The first equality is Eq. (E.14). The second equality can be written
A A I
as Ex,y{(Ay + b)(x - x(y)) } = 0 and follows from the first equality and
Cor. E.3.1. Q.E.D.
We havc
_ .1
356
Kalman Filterjng
Sec. E.2
Uncar Lea:;i-. .lrcs EsthnaUon
ppcndix IE
Corollary E.3.2 is known as the orthogonal projection principle. It
statcs a propcrty that characterizes the linear least-squares estimate and
forms the basis for an altcrnative treatment of least-squares estimation as a
problem of projection in a Hilbert space of random variables (see [Lue60]).
x(z) = x + ExyC'(CEyyC')-l(z - CfJ - u),
and the corresponding error covariance matrix is
357
(E.21)
Corollary E.3.3: Consider in addition to x and y, the random vector
z defined by
E {(x - x(z)) (x - :i:(z))'} = Exx - ExyC'(CEyyC')-lCEyx. (E.22)
X,z
Proof: We have
z=Cx,
where C is a given p x m matrix. Then the linear least-squares estimate
of z given y is
z=E{z}=Cy+u,
Ezz = E{ (z - z)(z - z)'} = CEyyC',
E zx = E{ (z - z)(x - x)'} = CE yx ,
E xz = E{ (x - x)(z - z)'} = ExyC'.
From Prop. E.3 we have
z(y) = Ci(y),
and the corresponding error covariance matrix is given by
E {(z - z(y)) (z - z(y))'} = C E {(x - i(y))(x - i(y))'}C'.
z,y . . x,y
Proof: We have E{z} = z = Cx and
Ezz = E{ (z - z)(z - z)'} = CExxC',
z
E zy = E {(z - z)(y - V)'} = CE.T.Y,
z,y
i(z) = x + ExzE;}(z - z),
E {(x - i(z)) (x - x(z))'} = Exx - ExzE;}E zx ,
X,z
(E.23a)
(E.2:b )
(E.23c)
(E.23d)
(E.2..ta)
(E.24b)
wherc Ezz = CEyyC' has an inverse, since Eyy is invcrtible alHl C has rank
p. Dy substituting the relations (E.23) into Eqs. (E.24a) and (E.24b) t.he
result follows. Q.E.D.
Eyz = Ei y = Eyx C '.
Dy Prop. E.3 we have
z(y) = z + EzyE;;(y - y) = Cx + CExyE;;(y - y) = Cx(y),
E {(z - z(y)) (z - z(y))'} = Ezz - EzyE;;Eyz
x,y
Frequently we want to estimate a vector of paramcters x E 3{n given
a measurement vector z E 1R m of the form z = Cx + v, where C is a givcn
m x n matrix, and v E ac m is a random measurement error vector. The
following corollary gives the linear least-squares estimate x(z) and its error
covariancc.
= C(Exx - ExyEyylEyx)C'
= C E {(x - :i:(y)) (x - :i:(y))'}C'.
x,y
Q.E.D.
Corollary E.3.5: Let
z = Cx + v,
where C is a given m x n matrix, and the random vectors x E (n and
v E 1R m are uncorrelated. Denote
Corollary E.3.4: Consider in addition to x and y, an additional
random vector z of the form
z""" Cy + u,
(E.20)
E{x} = x,
E{ (x - x)(x - x)'} = Exx,
E{(v-v)(v-v)'} = Evv,
where C is a given px m matrix of rank p and u is a given vector in
1Rp. Then the linear least-squares estimate i(z) of x given z is
E{v} = v,
and assume further that Evv is a positive definite matrix. Then
_ J
358
Kalman Filtering
pendix E
i(z) = x + ExxC'(CExxC' -\_l:vv)-I(Z - Cx -.v),
E {(x - x(z)) (x - x(z))'} = Exx - ExxC'(CExxC' + Evv)- lCE xx.
x,v
Proof: Define
y = (x' v' )' ,
y = (x' v')',
C=(C 1).
Then we have z = Cy, and by Cor. E.3.3,
x(z) = (I 0 )Y(z),
E{ (x - x(z)) (x - x(z))'} = (I 0) E{ (y - y(z)) (y - y(z))'} () ,
where y(z) is the linear least-squares estimate of y given z. By applying
Cor. E.3.4 with u = 0 and x = y we obtain
y(z) = y + E yy C'(CE yy C')-l(z - Cy),
E{ (y - y(z)) (y - y(z))'} = Eyy - EyyC'(CEyyC')-lCEyy.
Dy using the equations
E ( Exx
yy = 0
EJ,
C=(C 1),
and by carrying out the straightforward calculation the result follows.
Q.E.D.
The next two corollaries deal with least-squares estimates involving
multiple measurement vectors that are obtained sequentially. In particular,
the corollaries show how to modify an existing least-squares estimate x(y)
to obtain x(y, z) once an additional vector z becomes known. This is a
central operation in Kalman filtering.
Corollary E.3.6: Consider in addition to x and y, an additional
random vector z taking values in Rp, which is uncorrelated with y.
Then the linear least-squares estimate x(y, z) of x given y and z [Le.,
given the composit.e vector (y, z)] has t.he ["orin
x(y, z) = x(y) + x(z) - x,
(E.25)
Sec. E.2
Linear Least-Squ. "s Estimation
359
where (y) and x(z) are the linear least-squares estimates of :r given
y and given z, respectively. Furthermore,
X E {(x - x(y, z)) (x - x(y, z))'} = Exx - ExyE;JEyx - ExzE;}E zx ,
,y,z
(E.26)
where
E xz = E{(X-x)(z-z)'},
x,z
E zx = E{(z-z)(x-x)'},
x,z
Ezz = E{ (z - z)(z - z),},
z
z = E{z},
z
and it is assumed that Ezz is invertible.
Proof: Let
w=(n, w=(;).
By Eq. (E.12) we have
x(w) = x + ExwE;;;(w - w).
(E,27)
Furthermore
Exw = [Exy, E xz ],
and since y and z are uncorrelated, we have
E _ ( Eyy
JWW - 0
EJ.
Substituting the above expressions in Eq. (E.27), we obtain
x(w) = x + ExyEyyl(y - y) + ExzE;}(z - z) = x(y) + i:(z) - x,
and q. (E.25) is proved. The proof of Eq. (E.26) is similar by using the
relatIOns above and the covariance Eq. (E.13). Q.E.D.
Corollary E.3.7: Let 'z be as in the preceding corollary and assume
that y and z are not necessarily uncorrclated, that ii>, we may have
Eyz = Ei y = E {(y - y)(z - z)'} # o.
y,z
_. .1
360
Kalman FjJtering
Appendix E
Sec. E.2
361
State Esti, ,on - The Kalman FjJier
Then
\Vc lIse thc notation
S = E{ (xo - E{.TO}) (xo - E{xo}) '}, lvh = E{ 1JJ k W ;J,
N k =- E{VkV},
(E.:U)
x(y, z) = i(y) + x(z - z(y)) - x,
(E.28)
where i(z - i(y)) denotes the linear least-squares estimate of x given
the random vector z-i(y) and i(y) is the linear least-squares estimate
of z given y. Furthermore,
and we assume that N k is positive definite for all k.
A Nonrecursive Least-Squares Estimate
E {(x - iCy, z))(x - iCy, z))'} = Exx - ExyE;;Eyx - txzt;}t zx ,
x,y,z
(E.29)
We first give a straightforward but somewhat tedious method to de-
rive the linear least-squares estimate of .Tk+1 or .Tk given the valm's of
Zo, Zl, . . . , Zk. Let us denote
Zk - ( z' z' , ) ' ( ' " )
.- 0' l""'k' rk-1= xO,.WO,w1"",w_I'.
In this method, we first find the linear least-squares estimate of 1'k-l giv(n
Zk, and we thell obtain the lincar least-squares cst.imat.e of :I:k givcn Zk
after expressing Xk as a linear function of 1'k-1.
For each i with 0 :::: i :::: k we have, by using the system equation,
X'+l = Lir"
where L i is the n x (n(i + 1)) matrix
L, = (Ai...Ao, Ai...A 1 , , A" 1).
As a result we may write
where
t xz = E {(x-x)(z-i(y))'},
x,y,z
t zz = E {(z - z(y))(z - z(y))'},
y,z
t zx = E {(z-i(y))(x-x)'}.
x,y,z
Proof: It can be seen that, since iCy) is a linear function of y, the linear
least.-squares C>it.illlat.c of :1: given y and z is t.he samc a; t.h( lilHar lca.t.-
squares estimate of x given y and z - i(y). By Cor. E.3.2 the random
vectors y and z - i(y) are uncorrelated. Givell this observation the result
follows by applicatioll of the preceding corollary. Q.E.D.
Zk = Wk-11'k-1 + V k ,
where
Vk = (vb, v , .. . , vU',
and Wk-1 is an s(k + 1) x (nk) matrix of the form
Wk-l = ( cL J
C k - 1 L k - 2 0
CkL k - l
the equations above,
alld the data of the
E.3
STATE ESTIMATION - THE KALMAN FILTER
Consider now a linear dynamic system of the type considered in Sec-
tion 5.2 but without a control vector (Uk == 0)
We can thus use Cor. E.3.5,
problem to compute
Xk+l = Ak.1:k + Wk,
k = 0,1,..., N -- 1,
(E.30)
where Xk E 1R n and Wk E 1R n denote the state and random disturbance
vectors, respectively, and the matrices Ak are known. Consider also the
measurement equation
rk-1(Zk) and E{ h-1 - 7\-1(Zk)) (rk-1 -7\-1(Zk))'}.
Let ,us denote tlC linear least-squares estimates of :1:k+l and :Ck given Zk
by xk+l!k and .xklk, respectively. We can now obtain xklk = ik(Zk) and
the corrcspondmg error covariance matrix by using Cor. E.3.3, that is,
.1:klk = Lk-1':k-I(Zd,
E{(.1:k - xklk)(Xk - Xk!k)'}
= Lk-1E{ h-l - rk-l(Zd) h-l - 7:k-l(Zk))'}L_1'
These equations may in turn be Ilsed to yield Xk+1lk and thc corresponding
error covariance again via Cor. E.3.3.
zk = CkXk + Vk,
k = O,I,...,N - 1,
(E.31 )
where Zk E 1R s and 1Jk E 1R s are the observation and observation noise
vectors, respectively.
We assume that xo, wo, . . . , W N -1, vo, . . . , V N -1 are independent ran-
dom vectors with givcn probability distributions and that
E{wk} = E{vk} = 0,
k = 0,1,. ..,N - 1.
(E.32)
_. J
362
Kalman Filtering
fJGnJix E
Sec. E.2
State EstimL .1- The Kalman nIter
363
The Kalman Filtering Algorithm
Using Eqs. (E.36)-(E.:38) in Prop. E.;I, We' oht.ain
The preceding method for obtaining the least-squares estimate of Xk
is cumbersome when the number of measurements is large, Fortunately,
the sequential structure of the problem can be exploited and the computa-
tions can be organized conveniently, as first proposed by Kalman [Ka160].
The main attractive feature of the Kalman filtering algorithm is that the
estimate :1:k+llk can be obtained by means of a simple equation that in-
volves the previous estimate :1:klk-l and the new measurement Zk but does
not involve any of the past measurements Zo, ZI,..., Zk-l'
Suppose that we have computed the estimate :1:klk-l together with
the covariance matrix
:h (zk-h(Zk-l)) = E{xd+ E klk_l Ck(Ck2:klk-l C£ +Nd- l (Zk-Ck:i\lk_I),
and Eq. (E.35) is writ.ten as
Xklk = xklk-l + Eklk-1C£(CkEklk_1C£ + Nk)-I(Zk - Cki:kl k - 1 )'
(E.30)
Dy using Cor. E.3.3 we also have
klk-l = E{(:l:k - :1:klk-l)(:l:k - :/:klk-l)'}.
(E.;)il )
At time k we receive t.he addit.ional measurement
Xk+llk = AkXklk.
(E.10)
Zk = CkXk + Vk.
We may use now Cor. E.3.7 to compute the linear least-squares estimate
of Xk given Zk-l = (zb, z, . . . , zk_l)' and Zk. This estimate is denoted by
xklk and, by Cor. E.3.7, it is given by
Concerning the covariance matrix Ek+llk, we have from Eqs. (E.30), (E.32),
(E.33), and Cor. E.3.3,
Xklk = i:k\k-l + :1:k(Zk - Zk(Zk-l)) - E{Xk},
(E.35)
Ek+llk = AkEklkA + Mk,
(E.41)
where Zk(Zk-l) denotes the linear least-squares estimate of Zk given Zk-l
and :/;k (Zk - Zk(Zk-l)) denotes the linear least-squares estimate of Xk given
(Zk - .h(Zk-l))'
We now calculate the term Xk(Zk - Zk(Zk-l)) in Eq. (E.35). We have
by Eqs. (E.31), (E.32), and Cor. E.3.3,
where
Eklk = E{ (Xk - xklk)(Xk - xklk)'}.
The error covariance matrix Eklk Illay be computed via Cor. E.3.7 similar
to xklk [ef. Eq. (E.35)]. Thus, we have from Eqs. (E.29), (E.37), (E.38)
h(Zk-l) = CkXklk-l.
(E.36)
Also we use Cor. E.3.3 to obtain
Eklk = E k1k - 1 - Eklk-1Ck(CkEklk-1C£ + Nk)-ICkEklk_l' (E.42)
E{ (Zk - Zk(Zk-JJ) (Zk - Zk(Zk-l))'} = C k E k l k - 1 C£ + Nk,
E{X;,.(Zk - Zk(Zk-l))'}
= E{:Ck(Cdxk - ikl k - I ))'} + E{:CkVU
= E { (:Ck - :1:klk-1 ) (:I:k -- :/:klk-l)'} C, + E{ i: k1k - l (:I:k - xklk- tJ' }C".
(E.37)
Equations (E.39)-(E.42) with the initial conditions [cf. Eq. (E.33)]
£01-1 = E{xo},
2: 0 1-1 = S,
(E.43 )
E{Xk(Zk - Zk(Zk-l))'} = Eklk-1C£,
(E.38)
const.it.ut.e t.lw Kalm!l.7/. jiltC7i7/..'} !l.lyori/.hm.. This algoriUllll n:cursivdy g':u-
erates the linear least-squares estimates ik+llk or xklk toget.her with the
associated error covariance matrices Ek+llk or Eklk' In particular, given
E k1k - 1 and :1:klk-l, Eqs. (E.39) and (E.42) yield Eklk and xklk, and then
Eqs. (E.41) and (E.40) yield Ek+llk and :1:k+llk.
The last terlll in the right-hand side above is 2ero by Cor. E.:3.2, so by using
Eq. (E.34) we have
_..1
364
Kalman Filtering
Appendix }<;
An alternative expression for Eq. (E.3!)) is
Xklk = Ak-1Xk-llk-l + 1',klkCkN;l(Zk - CkAk-1Xk-llk-l), (E.'l4)
which can be obtained from Eqs. (E.39) and (E.40) by using thc following
equality
1',klkCN;l = 1',klk_1C(Ck1',klk_1C + Nk)-l.
This equality may be verified by using Eq. (E.42) to write
(E.45)
1',klkCN;l = (1',klk-l - 1',klk-1C(Ck1',klk-1C + Nk)-lCk1',klk_l)CNel
= 1',klk_1C(N;1 - (Ck1',klk-lq + Nk)-lCk1',klk_1CN;1),
and then use in the above formula the following calculation
Ne 1 = (Ck1',klk-1C + Nk)-1(Ck1',klk-1C + Nk)Ne l
= (Ck1',klk-1C + Nk)-I(Ck1',klk_1CN;1 + I).
When the system equation contains a control vector Uk,
Xk+l = Akxk + BkUk + Wk,
k = O,l,...,N - 1,
it is straightforward to show that Eq. (E.44) takes the form
Xklk = Ak-1Xk-1Ik-l + Bk-1Uk-l
+ 1',klkCN;l(Zk - CkAk-1Xk-1Ik-l - CkBk-1Uk-I),
(E.46)
wherc xklk is the linear least-squares estimate of Xk given zo, Zl, . . . ,Zk and
Uo, Ul,..., Uk-l. The equations (E.41)-(E.43) that generate 1',klk remain
unchanged.
Steady-State Kalman Filtering Algorithm
Finally we note that Eqs. (E.41) and (E.42) yield
1',k+llk = Ak(1', k1k - 1 - 1',klk-lq(Ck1',klk-lq + Nk)-lCk1',klk-l)A + M k ,
(E.47)
with thc initial condition 1',01- 1 = S. This cquation is a matrix Riccati
equation of the type considered in Section 4.1. Thus when Ak, Ck, Nk,
and Mk are constant matrices,
Ak = A, Ck = C, Nk = N, Mk = M,
k = O,I,...,N - 1,
Sec. ]<;,2
StalJility A ;ts
365
we have by invoking t.he proposition proved t.lH'w, t.hat. )k I ilk h'nds t.o
a positive definite symmetric matrix 1', t.hat solves the algebraic Riccati
equation
1', = A(1', - 1',C'(C1',C' + N)-lC1',)A' -1- AI,
assuming obscrvability of the pair (A, C) and controllability of LIt(' pair
(A, D), where M = DD'. Under the same conditions, we havc 1',klk ---> ,
whcre from Eq. (E.42),
E = 1', - 1',C'(C1',C' + N)-lC1',.
We may then write the Kalman filtcr recursion [cf. Eq. (E.44)] in the n..YII1P-
totic form
Xklk = AXk-llk-l + EC'N-l(Zk - CAXk_llk_l)'
(E.48)
This estimator is simple and convenient for implementation.
EA
STABILITY ASPECTS
Let us consider now the stability properties of the steady-statc form
of the Kalman filter. From Eqs. (E.39) and (E.40), we have
Xk+1lk = AXklk-l + A1',C'(C1',C' + N)-I(Zk - CXklk-d.
(E.49)
Let Ck denote the "one-step prediction" error
Ck = Xk - xklk-l'
Dy using Eq. (E.49), the system equation
Xk+1 = A.Tk + Wk,
and the measurement equation
Zk = C:rk + Vk,
we obtain
Ck+l = (A - A1',C'(C1',C' + N)-lC)ek + Wk - A1',C'(C1',C' + N)-lVk.
(E.GO)
_.I
366
Kalman Filtering
Ippendix E
From the practical point of view it is important that thc error equa-
tion (E. 50) represents a stable system, that is, the matrix
A - AC'(CEC' + N)-IC
(E.51)
has eigenvalues strictly within the unit circle. This, however, follows by
Prop. 4.1 of Scction4.1 under the observability and controllability assump-
tions given earlier, since is the unique positive semidefinite symmetric
solution of the algebraic Riccati equation
= A( - CI(CCI + N)-IC)A' + M.
Actually this proposition yiclds that the transpose of the matrix (E.51)
has eigenvalues strictly within thc unit circle, but this is sufficient for our
purposes since the eigenvalues of a matrix are the same as those of its
transpose.
Let us consider also the stability properties of the equation govcrning
the estimation error
Ck = Xk - Xkjk'
We have by a straightforward calculation
ek = (I - CI(CCI + N)-IC)ek - CI(CCI + N)-luk.
(E.52)
Dy multiplying both sides of Eq. (E.50) by I - CI(CC' + N)-lC and
by usillg Eq. (E.52), we obtain
CHI + CI(CC' + N)-IUk+1
= (A - CI(CCI + N)-ICA) (Ck + C'(CC' + N)-I Uk )
+ (I - CI(CCI + N)-IC) (Wk - AC'(CC' + N)-I Uk ),
or equivalently
CHI = (A - C'(CCI + N)-lCA)Ck
+ (I -- 2:CI(CC' + N)-IC)Wk - C'(CC' + N)-IUk+I'
(E.53)
Since the matrix (E.G1) has eigenvalues strictly within the unit circle, the
sequellce {Ck} gcnerated by Eq. (E,50) tends to zero whenever the vcctors
Wk and Uk are identically zero for all k. Hence, by Eq. (E.52), the same is
truc for the sequencc {cd. It follows from Eq. (E. 53) that the matrix
A - C'(CCI + N)-ICA
(E.54)
has eigenvalues strictly within the unit circle, and the estimation crror
seq uence {e d is generated by a stable system.
Sec. B-2
Gauss-Ale >v Estimators
367
Let us finally consider thc stability properties of 1,11<' 2/i-dilll<'lIsioll,il
system of equations with state vector (x, xU:
:Ck+1 = A.rk 1- IJ L.i:k,
(E.S»)
Xk+1 = EC'N-ICAxk + (A + IJL - ECIN-ICA)Xk.
(E.SG)
This is thc steady-state, asymptotically optimal closcd-Ioop systelll that
was encountered at the end of Section 5.2.
We assume that the appropriate observability and controllability as-
sumptions stiLted there are in effect. Dy using the equation
EC'N-I = C'(CC' + N)-l,
shown earlicr, wc obtain frolll Eqs. (B.5S) and (E.5G) that
(.1:k+1 - h+l) = (A - EC'(CC' + N)-lCA)(:rk - :i;d.
Since we havc proved that the matrix (E. 54) has eigenvalucs strictly within
the unit circle, it follows that
lim (Xk+l - :i:k+IJ = 0,
k-oo
(KG 7)
for arbitrary initial states :ro and xo. From Eq. (E.5G) wc obtain
Xk+l = (A + BL).Tk + IJL(Xk - ;z:d.
(E.S8)
Since in accordance with thc theory of Section 4.1 the matrix (A -+ IJL)
has eigenvalues strictly within the unit circle, it follows from Eqs. (E.57)
alld (E.58) that we have
lim Xk = 0
k-oo
(E.G9)
and hence from Eq. (E.57),
lim Xk = O.
k-+oo
(E.GO)
Sincc the equations above hold for arbitrary initial states :ro and :i:o it.
follows that the system defined by Eqs. (E. 55) and (E.5G) is stable.
E.5
GAUSS-MARKOV ESTIMATORS
Suppose that we want to estimatc a vector .1: E n given a nW<1sur('-
ment vcctor z E !Jm that is related to x by
z = Cx + u,
(E.Gl)
_. J
368
Kalman Filiering
Appcndix E
Sec. B.2
Gauss-Ala. / Estimators
369
where C is a given m x n mat.rix with rank 'In, and v is a random nJe,J.urc-
mcnt crror vector. Lct us assume that v is un correlated with :r, and has a
known mean and a positive definite covariance matrix
To derive thc optimal solution A of probbn (E,(;'I), 1('(, u; (knot(' tll<'
ith row of A. We have
E{v} = v,
E{ (v - v)(v - v)'} = Evv.
(E.62)
JlA(v - v)112 = (v - v)' (a1
an) ( ": ) (v - v)
an
If the a priori probability distribution of :r is known, we can obt.ain
a linear least-squares estimate of x given z by using the theory of Section
E.2 (d. Cor. E.3.5). In many cases, however, t.he probability distribution
of x is unknown. In such cascs we can usc t.hc Gauss-Markov estimator,
which is optimal within the class of linear estimators that sat.isfy cC1tain
rcstrictions, as described below.
Let us consider an estimator of the form
n
= 2)v - v)'aia;(v - v)
i=1
n
= L a;(v - v)(v - v)'ai.
i=l
X(z) = A(z - v),
Hcnce, the minimization problem (E.G4) can also be written as
f(A) = E {11:r - A(z - v)11 2 }
x,z
(E.G3)
n
minimize L a;Evvai
<=1
subject to C'ai = ei,
i=l,...,n,
w here A minimizes
over all n x m matrices A. Since x and v are uncorrelated, we have using
Eqs. (E.G1 )-(E.63)
f(A) = E {llx - ACx - A(v - v)112}
X,v
= E{II(I - AC)xI1 2 } + E{JlA(v - v)Jl2},
x v
where ei is the ith column of the identity matrix. The minimization can
be carried out separately for each i, yielding
Xi = E;;v1C(C'EvvC)-1Ci,
i = 1,... ,n,
and finally
A = (C'E;;;C)-1C'E;;v 1 .
Thus, the Gauss-Markov estimator is given by
where I is the n x n identity matrix. Since f(A) depends on the unknown
statistics of x, we sce that the optimal matrix A also depends on these
statistics. We can circumvent this difficulty by requiring that
X(z) = (C'E;;;C)-1C'E;;;(z - v).
(E.65)
AC=I.
Let us also calculate the corresponding error covariance matrix. Wc have
minimize E{JlA(v - v)112}
v
(E.64)
E{ (x - x(z)) (x - x(z))'} = E{ (x - A(z - v)) (x - A(z - v))'}
= E{A(v -v)(v -v)'A'}
= AEvvA'
= (C'E;; v 1 C)-lC'E;;; EvvE;;v1C( C'E;;;C)-1,
Thcn our problem becomes
subject to AC = I.
Note that the requirement AC = I is not only convenient. analyti-
cally, but also makes sense concept.ually. In particular, it is equivalent t.o
requiring that the estimator x(z) = A(z - v) be unbiased in the sense t.hat.
and finally
E{ (:r - x(z))(J: - .i;(z))'} = (C'E;;,;C)-l.
(E.GG)
E{x(z)} = E{J;} = X,
for all x E 1R n .
Finally, let us compare t.he Gauss-Markov cst.imator wit.h t.h(, linpar
least-squares estimator of Cor. E.3.5. Assuming that Exx is invertible, a
straightforward calculation shows that the latter estimator can be written
as
This can bc sccn by writing
E{x(z)} =E{A(Cx+v-v)} = ACE {x} =ACx=x.
A ( ) _ - + ( ",-1 C ,"'-1 C) -l c ,"'-1 ( C - - )
X Z - X L.,xx + L.,vv L.,vv z - x - v .
(E.67)
_.J
370
Kalll1an Jo'ilicring
Jpcndix }<;
By comparing Eqs. (E.G5) and (E.G7), we see that the Gauss-Markov esti-
mator is obtaincd fwm the linear least-squares estimator by sctting :I: = 0
and E;1 = 0, Le., a zero mean and infinite covariance for the unknown
random variable x. Thus, the Gauss-Markov estimat.or may be view(d as
a limiting /(JI'nl or th( lincar kast.-sqllal"<'s <'st.illlat.or. Tlw error <:oVarii\.I1('(
matrix (E.6G) of thc Gauss-Markov et;timat.or is similarly relat.ed wit.b t.be
crror covariance matrix of the lincar least-squares cstimator.
APPENDIX F:
E.G
DETERMINISTIC LEAST-SQUARES ESTIMATION
Supposc again that we want to cstimate a vector :c E :}n givfn a
IlH'a>iIlI"<'IlH'Ilt. v('c!.or :c ( \)III !.hat. is rcb!."d !.o .1: by
Modeling of Stochastic Linear
z = Cx +"V,
Systems
wherc C is a known m x n matrix of rank m. However, we know nothing
about thc probability distribution of :c and "v, and thus wc call't. usc a
statistically-based cstimator. Thcn it is rcasonable t.o select as our cstimate
the vcctor :i: that minimizes
J(x) = IIz - C:cI1 2 ,
that is, t.he cstimate that fits best the dat.a in a least-squares sense. We
denote this estimate by:i:(z).
By setting to zero thc gradient of J at i(z), we obtain
In this appendix we show how controlled lincar timc-invariant syst.cms
with stochastic inputs can be represented by the ARMAX model used in
Section 5.3.
VJlx(z) = 2C'(Cx(z) - z) = 0,
from which
F.l
LINEAR SYSTEMS WITH STOCHASTIC INPUTS
i(z) = (C'C)-lC'z.
(E.6S)
An intcresting observation is that the estimate (KGS) is thc same as
the Gauss-Markov estimate given by Eq. (E.65), provided the measurement
error has zero mean and covariance matrix equal to the identity, i.e., v = 0,
EVlI = I. In fact, if instead of Ilz - Cx1l 2 , wc minimize
(z - v - Cx)'E;;,;(z - v - C:c),
then the deterministic least-squares estimate obtained is identical to the
Gauss-Markov estimate. If instead of IIz - CxI12 we minimize
(x - x)'E;1(z - x) + (z - v - Cx)'E;;lI l (z - 11 - C.T),
Consider a linear system with output {yk}, control input {uk}, and
an additional zero-mean random input {wk}. We assume that {wk} is
a stationary (up to second order) st.ochastic process. That is, {wk} is a
sequence of random variables satisfying, for all i, k = 0,::1::1, ::1::2, . . . ,
E{wk} = 0,
E{WOWi} = E{WkWk+d < 00.
thcn thc cstimat.c obtained is identical to thc linear least.-squares cst.imate
given by Eq. (E.67). Thus, we arrive at the interesting conclusion that
the cstimators obtained earlier on the basis of a stochast.ic opt.imization
framework can also be obtaincd by minimization of a deterministic measure
of fitness of estimated parameters t.o the data at ham!.
(All references to stationary processes in this section are mcant in the
limited sense just described.) By linearity, Yk is the sum of one sequence
{yD due to the presence of {uk} and another sequence {YD due to the
presence of {wk}:
Yk = yi, + y'f.,. (1".1)
Vve assume that Yk and YZ are generated by some filters n 1 (s) / A I (s) ,Uld
B2 (s) / A2 (s), respectively:
Al(S)Yk = Bl(S)Uk,
(F.2a)
371
_. J
372
Mudeling of Stuchastic Lincar Systems
,ppendix 1<'
Scc. 1<'.:2
1 'roccsses \ . HationaJ Spcctrum
373
A2(S)Y = B2(S)Wk.
(F.2b)
where a is a scalar, C(z) and D(z) arc some polynomials with real cod li-
cients
Operating on Eqs. (F.2a) and (F.2b) with A2(S) and Al(S), rct;p(x;-
tively, adding, and using Eq. (F.1), we obtain
C(z) = 1 + CIZ +... + c",z"',
(F.Ga)
A(S)Yk = B(S)1Lk + Uk,
(1<'.:3)
D(z) = 1 + (hz + '" + d",zrn,
(F.Gh)
Uk = Al(S)B2(S)Wk,
(FA)
and D(z) has no roots on the unit circle {z Ilzl = 1}.
Thc following facts are of intcrest:
(a) If {ud is anuncorrclated process with V(O) = a 2 , V(k) = 0 for k I- 0,
then
where A(s) = Al(S)A2(S), B(s) = A2(S)Bl(S), and {vd, givcn by
is a zero-mean, generally correlatcd, stationary stochastic process.
We are interested in the case where Uk is a control input applied
after Yk has occurred and has becn observed, so that in Eq. (F.2a) we havc
Bl(O) = O. Then, we may assume that the polynomials A(s) and B(s)
have the form
SV(A) = a 2 ,
A E [-7r,7r],
(b)
and clearly {vd has rational spectrum.
If {vd has rational spectrum Sv given by Eq. (F.5), thcn Sv can be
written as
A(s) = 1 + UIS + ... + u"'osm o ,
B(s) = b1s +... + b",osrno
S ( A ) = 0-2 1(eP')12
v ID(eP')12'
A E [-7r, 7r],
for some scalars Ui and bi, and SOUle positive integer mo.
To summarize, we have constructed a model of the form
where 0- is a scalar and C(z), D(z) are unique real polynomials of the
form
A(S)Yk = B(S)Uk + Vk,
C(z) = 1 + GIZ +... + G",Zrn,
wherc A(s) and B(s) are polynomials of thc prcceding form and {ud is
some zero-mean, correlated, stationary stochastic process. Wc now nced to
model further the sequence {vd.
D(z) = 1 + (Ilz +... + (I",zrn,
Given a zero-mean, stationary scalar process {Vk}, denote by V(k)
t.hc autocorrelation function
such that:
(1) C(z) has all its roots outside or on the unit circle, and if C(z)
has no roots on the unit circle, then the samc is truc for C(:.:).
(2) D(z) has all roots strictly outside the unit circle.
These facts are seen by noting that if p f. 0 is a root of D(:.:), then
ID(e J >')1 2 = D(ej>')D(e- j >.) contains a factor
F.2
PROCESSES WITH RATIONAL SPECTRUM
V(k) = E{ViVi+d,
k = 0, :1:1, :1:2,...
(1- p- 1 e j >')(1 - p-1e- j >') = p-2(p - d>')(p - cJ>').
IC(e i >')12
Sv(A) = a2 ID(ej>')12 '
AE[-7r,7r],
(F.5)
A little reflection shows that the roots of D( z) should bc p or p-l
depending on whether p is outside or inside the unit circle. Similarly,
the roots of C(z) are obtained from the roots of C(z). Thus thc
polynomials C(z) and D(z) as well a.. 0- 2 can be nniquely determined.
We may thus assume without loss of generality that C(z) and D(z)
in Eq. (F.5) havc no roots inside the unit circle.
There is a fundamental result here that relates to the realization of
processes with rational spectrum. The proof is hard; see for example,
[AsG75, pp. 75-76].
We say that {Vk} has rational spectrum if the transform of {V (k)} defined
by
00
Sv(A) = L V(k)e- jk >'
k=-OQ
exists for A E [-7r,7r] and can be expressed as
_...1
374
Modeling of Siochasiic Linear Systems
Appendix F
Proposition F .1: If {vk} is a zero-mean, stationary stochastic pro-
cess with rational spectrum
IC(e P ')12
Sv(..\) = 0'2 ID(e P ' )1 2 '
..\ E [-71",71"],
where the polynomials C(s) and D(s) are given by
D(s) = 1 + d1S +... + dms m ,
References
C(s) = 1 + C1S +... + cms m ,
and are assumed (without loss of generality) to have no roots .inside
the unit circle, then there exists a zero-mean, uncorrelated statlOnary
process {Ek} with E{ €n = 0'2 such that for all k
[ABC65] Atkinson, R. C., Dower, G. H., and Crothers, E. J., 1965. An
Introduction to Mathematical Learning Theory, Wiley, N. Y.
[ABF93] Arapostathis, A., Borkar, V., Fernandez-Gaucherand, E., Ghosh,
M., and Marcus, S., 1993. "Discrete- Time Controlled Markov Processes
with Average Cost Criterion: A Survey," SIAM J. on Control and Opti-
mization, Vol. 31, pp. 282-344.
[ADG49] Arrow, K. J., Blackwell, D., and Girshick, M. A., 1949. "Daycs
and Minimax Solutions of Sequential Design Problems," Econometrica, Vol.
17, pp. 213-244.
[AGK77] Athans, M., Ku, R., and Gershwin, S. B., 1977. "The Uncertainty
Threshold Principle," IEEE Trans. on Automatic Control, Vol. AC-22, pp.
491-495.
Vk + d1Vk-1 +... + dmVk-m = €k + C1€k-1 +... + Cm€k-m.
F.3
THE ARMAX MODEL
Let us now return to the problem of representatio of a line system
with stochastic inputs. We had arrived at the model A(S)Yk = B(S)Uk +
Vk. If the zero-mean stationary process {vk} has ratioal spectrum, the
preceding analysis and proposition show that there eX1sts a zero-mean,
uncorrelated stationary process {€k} satisfying
[AHM51] Arrow, K. J., Harris, T., and Marschack, J., 1951. "Optimal
Inventory Policy," Econometrica, Vol. 10, pp. 2GO-272.
[AKS58] Arrow, K. J., Karlin, S., and Scarf, H., 1958. Studies in the Math-
ematical Theory of Inventory and Production, Stanford Univ. Press, Stan-
ford, CA.
[AdG86] Adams, M., and Guillemin, V., 1986. Measure Theory and Prob-
ability, Wadsworth and Brooks, Monterey, CA.
[AsG75] Ash, R. B., and Gardner, M. F., 1975. Topics in Stochastic Pro-
cesses, Academic Press, N. Y.
[AnM79] Anderson, B. D.O., and Moore, J. D., 1979. Optimal Filt.ering,
Prentice-Hall, Englewood Cliffs, N. J.
[AoL69] Aoki, M., and Li, M. T., 1969. "Optimal Discrde-TilIl(' Cont.rol
Systems with Cost for Ouservat.ion," IEEE Trans. Aut.omatk Cont.rol, Vol.
AC-14, pp. 1G5-175.
[AsW73] Astrom, K. J., and Wittenmark, 8., 1973. "On'SelVI'uning Reg-
ulators," Automatica, Vol. 9, pp. 185-199.
D(S)Vk = C(S)€k,
(F.7)
where C(s) and D(s) are polynomials, and C(s) h no rootinsidc the
unit circle. Operating on both sides of the equation A(S)Yk = B(S)Uk + Vk
with D(s) and using the relation D(s)vk = C(S)€k, we obtain
A(S)Yk = B(S)Uk + C(S)fk'
(F.8)
where A(s) = D(s)A(s) and B(s) = D(s)1i(s). Since A(O) = 1, n(O) = 0,
we can write Eq. (F.8) as
m m m
)Jk + ai)Jk-i = b i 1L k-i + Ok + Ci€k-i,
i=1 i=l ,=1
. . -- Tl' . th ARMAX
for some mteger m and scalars ai, bi, Ci, - 1, . . . , m. liS 1S e
model that we have used in Section 5.3.
375
_..1
376
References
Heferenccs
377
[AsWS4] Astrom, K. J., and Wittenmark, D., 19S4. Computer Controlled
Systems, Prentice-Hall, Englewood Cliffs, N. J.
[AsW90] Astrom, K. J., and Wittenmark, B., 1990. Adaptive Control,
Addison- Wesley, Reading, MA.
[Ash70] Ash, R. B., 1970. Basic Probability Theory, Wiley, N. Y.
[Ash72] Ash, R. B., 1972. Real Analysis and Probability, Academic Press,
N. Y.
[AstS3] Astrom, K. J., 19S3. "Theory and Applications of Adaptive Control
- A Survey," Automatica, Vol. 19, pp. 471-4S6.
[AtF66] Athans, M., and Falb, P., 1066. Optimal Control, McGraw-Hill, N.
Y.
[BGM94] Bertsekas, D. P., Guerriero, F., and Musmanno, R., 1994. "Par-
allel Label Corrrecting Methods for Shortest Paths," Lab. for Info. and
Decision Systems Report LIDS-P-2250, Massachusetts Institute of Tech-
nology, J.O.T.A., to appear.
[DPS92] Dertseka.s, D. P., Pallottino, S., and Scutella', M. G., 1992. "]y-
nomial Auction Algorithms for Shortest Paths," Lab. for Info. and DeCISiOn
Systems Report. LIDS Report P-2107, Massachusetts Institute of Technol-
ogy; also Computational Optimization and Applications, Vol. 4, 1U95, pp.
99-125.
[BarS1] Bar-Shalom, Y., 10S1. "Stochastic Dynamic Programming: Cau-
tion and Probing," IEEE Trans. on Automatic Control, Vol. AC-26, pp.
l1S4-1105.
[BeD62] Bellman, R., and Dreyfus, S., 19G2. Applied Dynamic Program-
ming, Princeton Univ. Press, Princeton, N. J.
[BeG92] Bertsekas, D. P., and Gallager, R. G., 1992. Data Networks (2nd
Edition), Prentice-Hall, Englewood Cliffs, N. J.
[BeR73] Bertsekas, D. P., and Rhodes, 1. B., 1973. "Sufficient.ly Inform-
tive Functions and the Minimax Feedback Control of Uncertam Dynamic
Systems," IEEE Trans. Automatic Control, Vol. AC-18, pp. 117-124.
[BeS7S] Bertsekas, D. P., and Shreve, S. E., 1975. Stochastic Optimal Con-
trol: The Discrete Time Case, Academic Press, N. Y.
[BeTS9] Dertsekas, D. P., and Tsitsiklis, J. N., 19S9. Parallel and Ds-
tributed Computation: Numerical Methods, Prentice-Hall, Englewood Chffs,
N. J.
[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. "An Analysis of
Stochastic Shortest Path Problems," Math. Operations Res., Vol. 16, pp.
5S0-595.
[Del57] Bellman, R., 1957. Applied Dynamic Programming, Priuc('(on Uui-
versity Press, Princeton, N. J.
[Ber70] Dertsekas, D. P., 1970. "On the Separation Theorem for Linear
Systems, Quadratic Criteria, and Correlated Noise," Unpublished Report.,
Electronic Systems Lab., Massachusetts Institute of Technology.
[Ber72] Bertsekas, D. P., 1072. "On the Solution of Some Minimax Cont.rol
Problems," Proc. 1972 IEEE Decision and Control Conf., New Orleans,
LA.
[Ber75] Bertsekas, D. P., 1975. "Convergence of Discretization Procedures
in Dynamic Programming," IEEE Trans. Automatic Control, Vol. AC-20,
pp. 415-419.
[Ber76] Bertsekas, D. P., 1976. Dynamic Programming and Stochastic Con-
trol, Academic Press, N. Y.
[Ber82a] Bertsekas, D. P., 19S2. "Distributed Dynamic Programming,"
IEEE Trans. Automatic Control, Vol. AC-27, pp. 610-616.
[DerS2b] Bertsekas, D. P., 1982. Constrained Optimization and Lagrange
Multiplier Methods, Academic Press, N. Y.
[Der91a] Bertsekas, D. P., 1991. Linear Network Optimhmtion: Algorithms
and Codes, MIT Press, Cambridge, MA.
[Der91b] Bertsekas, D. P., 1991. "The Auction Algorithm for Shortest
Paths," SIAM J. on Optimization, Vol. 1, pp. 425-447.
[Ber93] Bertsekas, D. P., 1993. "A Simple and Fast Label Correcting Algo-
rithm for Shortest Paths," Networks, Vol. 23, pp. 703-709.
[Ber95] Bertsekas, D. P., 1995. Nonlinear Programming, Athena Scient.ific,
Belmont, MA, to appear.
[BoV70] Borkar, V., and Varaiya, P. P., 1979. "Adaptive Control of Markov
Chains, I: Finite Parameter Set," IEEE Trans. Automatic Control, Vol.
AC-24, pp. 053-95S.
[CCP70] Cannon, M. D., Cullum, C. D., and Polak, E., 1970. Theory of
Optimal Control and Mathematical Programming, McGraw-Hill, N. Y.
[CDP02] Cerulli, R., De Leone, R., and Piacenti, G., 1092. "A Modified
Auction Algorithm for the Shortest Path Problem," University of Salerno
Report, J. of Optimization and Software, to appear.
[ChTSO] Chow, C.-S., and Tsitsiklis, J. N., 19S9. "The Complexit.y of
Dynamic Programming," Journal of Complexity, Vol. 5, pp. 4G6-488.
[ChT91] Chow, C.-S., and Tsitsiklis, J. N., 1991. "An Optimal One-Way
Multigrid Algorithm for Discrete-Time Stochastic Control," IEEE Trans.
on Automatic Control, Vol. AC-36, 1991, pp. S9S-914.
_. J
378
J{c[crcnccs
Hcfcrences
37!J
[FeM94] Fernandez-Gaucherand, E., and Markus, S. 1., 1994. "Risk Sensi-
tive Optimal Control of Hidden Markov Models," Proc. 33rd IEEE Conf.
Dec. Control, Lake Duena Vista, Fla.
[FelGS] Feller, W., 19G5. An Introduction to Probability Theory and Its
Applications, Wiley, N. Y.
[ForG6] Ford, L. R., Jr., 1956. "Network Flow Theory," Report P-923, The
Rand Corporation, Santa Monica, CA.
[For7:] Forney, G. D., 1973. "The Viterbi Algorithm," Proc. IEEE, Vol.
61, pp. 26S-27S.
[Fox71] Fox, 13. L., 1 U71. "Finite Stat.e Appl"Oxinlat.ions t.o ]),:llullll:ralJk
State Dynamic Progams," J. Math. Anal. Appl., Vol. 34, pp. 665-670.
[GaPS8] Gallo, G., and Pallottino, S., 19S8. "Shortest Path Algorit.hms,"
Annals of Operations Research, Vol. 7, pp. 3-70.
[GoRS5] Gonzalez, R., and Rofman, E., 10SG. "On Det.erminist.ic Control
Problems: An Approximation Procedure for the Optimal Cost, Parts I, II,"
SIAM J. Control Optimization, Vol. 23, pp. 242-285.
[GoSS4] Goodwin, G. C., and Sin, K. S. S., 1984. Adaptive Filtering, Pre-
diction, and Control, Prentice-Hall, Englewood Cliffs, N. J.
[GrA66] Groen, G. J., and Atkinson, R. C., 19GG. "Models for Optimizing
the Learning Process," Psychol. Bull., Vol. 66, pp. 309-320.
[GuF63] Gunckel, T. L., and Franklin, G. R., 1963. "A General Solution
for Linear Sampled-Data Control," Trans. ASME Ser. D. J. Basic Engrg.,
Vol. S5, pp. 197-201.
[HMS55] Holt, C. C., Modigliani, F., and Simon, H. A., 1955. "A Lincar
D(cision Rule for Production and Employment. Scheduling," Management
Sci., Vol. 2, pp. 1-30.
[HaLS2] Hajek, D., and van Loon, T., 1982. "Decent.ralized Dynamic Con-
trol of a Multiaccess Broadcast Channel," IEEE Trans. Automatic Control,
Vol. AC-27, pp. 559-569.
[Hes66] Hestenes, M. R., 1966. Calculus of Variations and Optimal Control
Theory, Wiley, N. Y.
[HerS9] Hernandez-Lerma, 0., 19S9. Adaptive Markov Control Processes,
Springer-Verlag, N. Y.
[HoK61] Hoffman, K., and Kunze, R., 10G1. Linear Algebra, Prentice-Hall,
Englewood Cliffs, N. J.
[IEE71] IEEE Trans. Automatic Control, 1971. Special Issue on Linear-
Quadratic Gaussian Problem, Vol. AC-16.
[JDE04] James, M. R., Baras, J. S., and Elliott, R. J., 1994. "Risk-Sensitive
Control and Dynamic Games for Partially Observed Discrete-Time Nonlin-
ear Systems," IEEE Trans. on Automatic Control, Vol. AC-39, pp. 780-792.
[Jac73] Jacobson, D. H., 1973. "Optimal Stochastic Linear Systems With
Exponential Performance Criteria and their Relation to Deterministic Dif-
ferential Games," IEEE Trans. Automatic Control, Vol. AC-1S, pp. 124-
131.
[Che72] Chernoff, H., 1072. "Sequential Analysis and Optimal Design,"
Regional Conference Series in Applied Mathematics, SIAM, Philadelphia,
PA.
[Chu60] Chung, K. L., 1960. Markov Chains with Stationary Transition
Probabilities, Springer-Verlag, Derlin and N. Y.
[CoL55] Coddington, E. A., and Levinson, N., 195G. Theory of Ordinary
Differential Equations, McGraw-Hill, N. Y.
[DeG70] DeGroot, M. H., 1070. Optimal Statistical Decisions, McGraw-
Hill, N. Y.
[DePS4] Deo, N., and Pang, C., 19S4. "Shortest Path Problems: Taxonomy
and Annotation," Networks, Vol. 14, pp. 275-323.
[DoS80] Doshi, D., and Shreve, S., HJ80. "Strong Consistency of a lVlodi-
f\Cd Maximum Likelihood Estimator for Controlled Markov Chains," J. of
Applied Probability, Vol. 17, pp. 726-734.
[DreG5] Dreyfus, S. D., 1065. Dynamic Programming and the Calculus of
Variations, Academic Press, N. Y.
[DreG9] Dreyfus, S. D., 1969. "An Appraisal of Some Shortest-Path Algo-
rithms," Operations Research, Vol. 17, pp. 395-412.
[Eck68] Eckles, J. E., 1965. "Optimum Maintenance with Incomplete In-
formation," Operations Res., Vol. 1G, pp. 1058-1067.
[Elm78] Elmaghraby, S. E., 1978. Activity Networks: Project Planning and
Control by Network Models, Wiley - Interscience, N. Y.
[FaI87] Falcone, M., 10S7. "A Numerical Approach to the Infinite Horizon
Problem of Deterministic Control Theory," Appl. Math. Opt., Vol. 15, pp.
1-13.
[Jaz70] Jazwinski, A. H., 1070. Stochastic Processes and Filtering Theory,
Academic Press, N. Y.
[JoT61] Joseph, P. D., and Tou, J. T., 1961. "On Linear Control Theory,"
AlEE Trans., Vol. SO (II), pp. 193-196.
[K81382] KinlCmia, J., Gershwill, S.D., and 13erLsekas, D.P., lU82. "COlll-
putation of Production Control Policies by a Dynamk Programming Tech-
nique," in Analysis and Optimization of Syst.ems, A. Densoussan awl .1. L.
Lions (eds.), Sprillger- Verlag, N. Y., pp. 243-269.
_..1
380
"c[crcnccs
Hcfcrcnccs
381
[KaD66] Karush, W., and Dear, E. E., 1966. "Optimal Stimulus Presen-
tation Strategy for a Stimulus Sampling Model of Learning," J. Math.
Psychology, Vol. 3, pp. 15-47.
[KaK58] Kalman, R. E., and Koepcke, R. W., 1958. "Optimal SYllthesis of
Linear Sampling Control Systems Using Generalized Performance Indexes,"
Trans. ASME, Vol. 80, pp. 1820-1826.
[KaIGO] Kalman, R. E., 1960. "A New Approach to Linear Filtering and
Prediction Problems," Trans. ASME Ser. D. J. Basic Engrg., Vol. 82, pp.
35-45.
[KeS60] Kemeny, J. G., and Snell, J. L., 1960. Finite Markov Chains, Van
Nostrand-Reinhold, N. Y.
[Kim82] Kimemia, J., lU82. "Hierarchical Control of Production in Flexible
Manufacturing Systems," Ph.D. Thesis, Dep. of Electrical Engineering and
Computer Science, Massachusetts Institute of Technology.
[KuA77] Ku, R., and Athans, M., 1977. "Further Results on the Uncer-
tainty Threshold Principle," IEEE Trans. on Automatic Control, Vol. AC-
22, pp. 866-868.
[KuD92] Kushner, H. J., and Dupuis, P. G., 1992. Numerical Methods for
Stochastic Control Problems in Continuous Time, Springer-Verlag, N. Y.
[KuL82] Kumar, P. R., and Lin, W., 1982. "Optimal Adaptive Controllers
for Unknown Markov Chains," IEEE Trans. Automatic Control, Vol. AC-
27, pp. 765-774.
[KuV86] Kumar, P. R., and Varaiya, P. P., 1986. Stochastic Systems: Es-
timation, Identification, and Adaptive Control, Prentice-Hall, Englewood
Cliffs, N. J.
[Kum83] Kumar, P. R., 1983. "Optimal Adaptive Control of Linear-Quad-
ratic-Gaussian Systems," SIAM J. on Control and Optimizatioll, Vol. 21,
pp. 163-178.
[Kum85] Kumar, P. R., 1985. "A Survey of Some Results in Stochastic
Adaptive Control," SIAM J. on Control and Optimization, Vol. 23, pp.
329-38J.
[LjS83] Ljung, L., and Soderstrom, T., 1983. Theory and Practice of Hc-
cursive Identification, MIT Press, Cambridge, MA.
[Lju86] Ljung, L., 1U86. System Identification: Theory for the User, l'rent.ice-
Hall, Englewood Clift's, N. J.
[Lov91] Lovejoy, W. S., 1991. "A Survey of Algorithmic Methods for Par-
tially Observed Markov Decision Processes," Allnab of Operations 11e-
search, Vol. 18, pp. 47-66.
[Lue69] Luenberger, D. G., 1969. Optimization by Vector Space Methods,
Wiley, N. Y.
[Lue84] Luenberger, D. G., 1984. Linear and Nonlinear Programming, Addi-
son- Wesley, Reading, MA.
[Med69] Meditch, J. S., 1969. Stochastic Optimal Linear Estimation and
Control, McGraw-Hill, N. Y.
[Men73] Mendel, J. M., 1973. Discrete Techniques of Parameter Estimation
- The Equation Error Formulation, Marcel Dekker, N. Y.
[Mik79] Mikhailov, V. A., 1979. Methods of Random Multiple Access, Can-
didate Ellgineering Thesis, Moscow Institute of Physics and Technology,
Moscow.
[Mos68] Mossin, J., 1968. "Optimal Multi-Period Portfolio Policies," J.
Business, Vol. 41, pp. 215-220.
[Nah69] Nahi, N., 1960. Estimation Theory and Applications, Wiley, N. Y.
[New75] Newborn, M., 1975. Computer Chess, Academic Press, N. Y.
[PDT95] Polymenakos, L. C., Bertsekas, D. P., and Tsitsiklis, J. N., 1995.
"Efficient Algorithms for Continuous-Space Shortest Path Problems," Lab.
for Info. and Decision Systems Report LIDS-P-2292, Massachusetts Insti-
tute of Technology.
[PaS82] Papadimitriou, C. H., and Steiglitz, K., 1982. Combinatorial Opti-
mization: Algorithms and Complexity, Prentice-Hall, Englewood Cliffs, N.
J.
[Kus90] Kushner, H. J., 1090. "Numerical Methods for Continuous Control
Problems in Continuous Time," SIAM J. on Control and Optimizatioll,
Vol. 28, pp. 990-1048.
[Las85] Lasserre, J. D., 1985. "A Mixed Forward-Dackward Dynamic Pro-
gramming Method Using Parallel Computation," J. Optimization Theory
Appl., Vol. 4G, pp. IG5-1G8.
[Lev84] Levy, D., 1084. The Chess Computer Handbook, B. T. Batsford
Ltd., London.
[PaT87] Papadimitriou, C. H., and Tsitsiklis, J. N., 1987. "The Complexity
of Markov Decision Processes," Math. Operations Res., Vol. 12, pp. 441-
450.
[Pap74] Pape, V., 1074. "Implementation and Efficiency of Moore Algo-
rithms for the Shortest Path Problem," Math. Progr., Vol. 7, pp. 212-222.
[Pap65] Papoulis, A., 1965. Probability, Random Variables and Stocha.tic
Processes, McGraw-Hill, N. Y.
[Pea84] Pearl, J., 1984. Heuristics, Addison-Wesley, Reading, MA.
_..I
382
J{e[crCIlCCS
Hc[crCllCC:;
383
[PicOO] Picone, J., 1990. "Continuous Speech Recognition Using Hidden
Markov Models," IEEE ASSP Magazine, July Issue, pp. 26-41.
[piuDG] Pinedo, M., 19DG. Scheduling: Theory, Algorithms, and Systems,
Prentice-Hall, Englewood Cliffs, N. J.
[PoB94] Polymenakos, L. C., and Bertsekas, D. P., 1994. "Parallel Shortest
Path Auction Algorithms," Parallel Computing, Vol. 20, pp. 1221-1247.
[PoI71] Polak, E., 1971. Computational Methods in Optimization: A Uni-
fied Approach, Academic Press, N. Y.
[PrS94] Proakis, J. G., and Salehi, M., 1994. Communication Systems En-
gineering, Prentice-Hall, Englewood Cliffs, N. Y.
[RabS9] Rabiner, L. R., 10S9. "A Tutorial on Hidden Markov Models and
Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77,
pp. 257-2S6.
[Riv85] Rivest, R. L., 19S5. "Network Control by Dayesian Broadcast,"
Lab. for Computer Science Report LCS/TM-2S5, Massachusetts Institute
of Technology.
[Roc70] Rockafellar, R. T., 1970. Convex Analysis, Princeton Uuiversity
Press, Princeton, N. J.
[RosS3] Ross, S. M., 19S3. Introduction to Stochastic Dynamic Program-
ming, Academic Press, N. Y.
[RosS5] Ross, S. M., 10S5. Probability Models, Academic Press, Orlando,
Fla.
[Roy6S] Royden, H. L., 19G5. Principles of Mathematical Analysis, McGraw-
Hill, N. Y.
[Rud64] Rudin, W., 1964. Real Analysis, McGraw-Hill, N. Y.
[Sar87] Sargent, T. .1., 19S7. Dynamic Macroeconomic Theory, Harvard
Univ. Press, Cambridge, MA.
[Sca60] Scarf, H., 1960. "The Optimality of (8, S) Policies for the Dynamic
Inventory Problem," Proceedings of the 1st Stanford Symposium on Math-
ematical Methods in the Social Sciences, Stanford University Press, Stan-
ford, CA.
[Sha50] Shannon, C., 1050. "Programming a Digital Computer for Playing
Chess," Phil. Mag., Vol. 41, pp. 356-375.
[Shi64] Shiryaev, A. N., 1964. "On Markov Sufficient Statistics in Non-
Additive Bayes Problems of Sequential Analysis," Theory of Probability
and Applications, Vol. 9, pp. 604-61S.
[Shi66] Shiryaev, A. N., 1966. "On the Theory of Decision Functions and
Control by an Observation Process with Incomplete Data," Selected Trans-
lations in Jvlath. Statistics and Probability, Vol. 6, pp. 162-188.
[ShrS1] Shreve, S. E., 19S1. "A Note on Optimal Switching Detween Two
Activities," Naval Research Logistics Quarterly, Vol. 28, 185-1UO.
[Sim5G] Simon, l-I. A., 1956. "Dynamic Programming Under UnccrLaillty
with a Quadratic Criterion Function," Econometrica, Vol. 24, pp. 74-151.
[SkISS] Sklar, B., 19S8. Digital Communications: F\l1ldanllnLals alHl Ap-
plications, Prentice-Hall, Englewood Cliffs, N. Y.
[SmS73] Smallwood, R. D., and Sondik, E. .1., 1973. "The Optimal Cont.rol
of Partially Observable Markov Processes Over a Finite Horizon," Opera-
tions Res., Vol. 11, pp. 1071-10SS.
[Sma71] Smallwood, R. D., 1971. "The Analysis of Economic Teaching
Strategies for a Simple Learning Model," J. Math. Psychology, Vol. 8, pp.
2S5-301.
[Son71] Sondik, E. J., 1971. "The Optimal Control of Partially Observ-
able Markov Processes," Ph.D. Dissertation, Department of Engineering-
Economic Systems, Stanford University, Stanford, CA.
[StLS9] Stokey, N. L., and Lucas, R. E., 19S9. Recursive Methods in Eco-
nomic Dynamics, Harvard University Press, Cambridge, MA.
[Str65] Striebel, C. T., 1965. "Sufficient Statistics in the Optimal Control
of Stochastic Systems," J. Math. Anal. Appl., Vol. 12, pp. 576-592.
[Str76] Strang, G., 197G. Linear Algebra and its Applications, Academic
Press, N. Y.
[ThW66] Thau, F. E., and Witsenhausen, H. S., 1966. "A Comparison of
Closed-Loop and Open-Loop Optimum Systems," IEEE Trans. Automatic
Control, Vol. AC-11, pp. 619-G21.
[The54] Theil, E., 1954. "Econometric Models and Welfare Maximization,"
Weltwirtsch. Arch., Vol. 72, pp. 60-S3.
[Tsi84a] Tsitsiklis, J. N., 19S4. "Convexity and Characterization of Optimal
Policies in a Dynamic Routing Problem," J. Optimization Theory Appl.,
Vol. 44, pp. 105-136.
[Tsi84b] Tsitsiklis, J. N., 19S4. "Periodic Review Inventory Systems with
Continuous Demand and Discrete Order Sizes," Management Sci., Vol. 30,
pp. 1250-1254.
[TsiS7] Tsitsiklis, J. N., 19S7. "Analysis of a Multiaccess Control Scheme,"
IEEE Trans. Automatic Control, Vol. AC-32, pp. 1017-1020.
[Tsi93] Tsitsiklis, J. N., 1993. "Efficient Algorithms for Globally Optimal
Trajertories," Lab. for Info. and Decision Systems Report LIDS-P-221O,
Massachusetts Institute of Technology, IEEE Trans. on Automatic Control,
_...1
384
Hc[crcnces
to appear.
[Vei65] Veinott, A. F., Jr., 1965. "The Optimal Inventory Policy for Dat.ch
Ordering," Operations Res., Vol. 13, pp. 424-432.
[Vei66] Veinott, A. F., Jr., 19G6. "The Status of Mathematical Inventory
Theory," Management Sci., Vol. 12, pp. 745-777.
[Vit67] Viterbi, A. J., 1967. "Error Dounds for Convolutional Codes and
an Asymptotically Optimum Decoding Algorithm," IEEE Trans. on Info.
Theory, Vol. 1T-13, pp. 2GO-269.
[WaI47] Wald, A., 1947. Sequential Analysis, Wiley, N. Y.
[WeP80] Weiss, G., and Pinedo, M., 1980. "Scheduling Tasks with Expo-
nential Service Times on Nonidentical Processors to Minimize Various Cost
Functions," J. Appl. Prob., Vol. 17, pp. 187-202.
[WhH80] White, C. C., and Harrington, D. P., 1980. "Application of Jen-
sen's Inequality to Adaptive Suboptimal Design," J. Optimization Theory
Appl., Vol. 32, pp. 89-99.
[WhS80] White, C. C., and Scherer, W. T., 1989. "Solution Procedures for
Partially Observed Markov Decision Processes," Operations Res., Vol. 30,
pp. 791-797.
[WhiG3] Whittle, P., 19G3. Prediction and Regulation by Linear Least-
Square Methods, English Universities Press, London.
[WhiG9] White, D. J., 1969. Dynamic Programming, Holden-Day, San Fran-
cisco, CA.
[Whi78] Whitt, W., 1978. "Approximations of Dynamic Programs I," Math.
Operations Res., Vol. 3, pp. 231-243.
[Whi79] Whitt, W., 1979. "Approximations of Dynamic Programs II,"
Math. Operations Res., Vol. 4, pp. 179-185.
[Whi82] Whittle, P., 1982. Optimization Over Time, Wiley, N. Y., Vol. 1,
1982, Vol. 2, 1983.
[Wit69] Witsenhausen, H. S., 1969. "Inequalities for the Performance of
Suboptimal Uncertain Systems," Automatica, Vol. 5, pp. 507-512.
[Wit70] Witsenhausen, H. S., 1970. "On Performance Bounds for Uncertain
Systems," SIAM J. on Control, Vol. 8, pp. 55-89.
[Wit71] Witsenhausen, H. S., 1971. "Separation of Estimation and Control
for Discrete-Time Systems," Proc. IEEE, Vol. 59, pp. 1557-1566.
INDEX
A
ARMAX model, 20G, 371, 374
Adaptive control, 250
Admissible policy, 12
Aggregation, 285
Alpha-beta pruning, 272, 277
Asset selling, 158, 309
Asynchronous algorithms, 85
Auction algorithm, 74, 84
Augmentation of state, 30
Autoregressive process, 206
Average cost problem, 293, 310
B
Dackward shift operator, 204
Dasic problem, 10, 186
Bayes' rule, 344
Bellman's equation, 294
Best-first search, 70, 82
Brachistochrone problem, 114
Branch-and-bound algorithm, 72
Breadth-first search, 69
c
Calculus of variations, 90, 102
Capacity expansion, 178
Caution, 251
Certainty equivalence principle, 23,
142
Certainty equivalent controller, 246,
254
Chess, 9, 13, 27, 272
Closed-loop control, 4
Closed set, 335
Communicating states, 347
Compact set, 335
Composition of functions, 335
Concave function, 337
Conditional probability, 344
Continuous function, 33G
Controllability, 134
Control law, 11
Control trajectory, 88
Convergence of sequences, 334
Convex function, 337
Convex set, 337
Convolutional coding, 60
Correlated disturbances, 32, 163
Cost-to-go function, 20
Countable set, 330
Covariance matrix, 343
Critical path analysis, 54
D
D'Esopo-Pape method, 71
DP algorithm proof, 19
Delays, 30
Depth-first search, 69
Detectability, 141
Differential cost, 317
Dijkstra's algorithm, 71, 82
Discounted cost, 40, 293, 30G
Discretization, 280
Distributed computation, 85
Distribution function, 342
Disturbance, 11
Dual control, 252
E
Eigenvalue, 332
Eigenvector, 332
Euclidean space, 330
Event, 341
Existence of optimal solutions, 339
Expected value, 343
385
_.J
386
Exponential cost function, 40, 173
F
Feature extraction, 285
Feature vectors, 285
First passage time, 340
Flexible manllfacturing, 2G9
Forecasts, 33, 1G7, 173, 264
Forward DP algorithm, 52
Forward search, 272
Forward shift operator, 204
Four queens problem, 63
G
Gambling, 170
Gauss-Markov estimator, 3G7
Gaussian random vector, 202, 352
H
Hamilton-J acobi-Dellman eq uation,
91
Hamiltonian function, 100
Hidden Markov model, 56
Hypothesis testing, 232
I
Identifiability, 252
Improper policy, 296
Independent random variables, 343
Information vector, 187
Inner product, 330
Interchange argument, 172
Inventory control, 3, 144, 175
Investmcnt problems, 48, 152
Irreducible Markov chain, 348
Isoperimctric problem, 125
Iterative deepening, 278
K
K-convexity, 148
Kalman filter, 202, 350, 360
Killer heuristic, 279
Inrlex
Index
L
p
L'H6pital's problem, 12G
LLL strat.egy, 71
Label correcting method, 66, 83
Label setting method, 71
Limitedloolmhead policy, 26G
Limit inferior, 335
Limit point, 335
Limit superior, 335
Linear independence, 331
Linear programming, 305
Linear quadratic problems, 23, 96,
104, 130, 173, 196, 208, 237
Partially myopic policy, 157
Partial open-loop feedback control,
264
Pole-zero cancellation, 211
Policy, 11
Policy iteration, 303, 308, 321
Pontryagin minimum principle, 97
Portfolio analysis, 152
Positive definite matrix, 332
Positive semidefinite matrix, 332
Principle of optimality, 16
Probing, 251
Probability density function, 343
Probability distribution, 342
Probability space, 341
Proper policy, 29G
Purchasing problem:,;, 162
M
Manufacturing, 2G9
Markov chains, 346
Maxinlllm likelihoud decoJ"r, 5G
Mean first passage time, 34U
Minimax algorithm, 276
Minimum principle, 07, 110
Minimum-time problem, 117
Minimum variance control, 203, 2G9
Min-ma.'C problems, 39, 276
Moving average process, 206
Multiaccess communication, 187,240
Multiplicative cost, 41
Myopic policy, 1G7
Q
Quadratic cost, 23, 96, 104, 130,
173,106,208,237,258
Queueing control, 7
R
Random variable, 342
Rank of a matrix, 331
Rational spectrum, 373
Recurrent state, 348
Relative cost, 317
Replacement problems, 7
Riccati equation, 96, 133
Riccati equation convergence, 135
N
Norm, 330
o
Observability, 134
One-step-lookahead rule, 165
Open-loop control, 4, 262
Open-loop feedback control, 261
Open set, 335
Optimal cost, function, 12
Optimal value function, 12
Optimality principle, 16
s
SLF method, 71
Scheduling problems, 64, 168, 183
Self-tuning regulator, 259
Semilinear systems, 42, 2RJ
Separation theorem, 201
Sequential hypothesis testing, 232
Sequential probability ratio, 236
3S7
Shortcst pat.h problem, Gl, 8, 90
Singular problem, 120
Slotted Aloha, 188
Stabilizability, 141
Stable filter, 205
Stable system, 135
Statc trajectory, 88
Stationary policy, 295
Stochastic matrix, 346
Stochastic programming, 2Gli
Stochastic shortest, paths, 282, 29;),
295
Stopping problems, 158
Sufficient statistic, 219
T
Terminating process, 41
Time lags, 30
Transient stat,e, 318
Transition probabilities, 6
Transition probability graph, 6
Traveling salesman problcm, 64
u
Uncertainty threshold principle, 142
Uncorrelated random variables, 343
v
Value iteration, 303, 307, 318
Value of information, 13
Viterbi algorithm, 56
w
Weierstrass theorem, 339