Author: Bard Y.
Tags: mathematical statistics control systems statistical analysis parameter estimation
Year: 1974
Text
Nonlinear Parameter
Estimation
YONATHAN BARD
International Business Machines Corporation
Cambridge, Massachusetts
ACADEMIC PRESS New York and London 1974
A Subsidiary of Harcourt Brace Jovanovich, Publishers
Contents
Preface
Chapter I Introduction
1-1. Curve Fitting
1-2. Model Fitting
1-3. Estimation
1-4. Linearity
1-5. Point and Interval Estimation
1-6. Historical Background
1-7. Notation
Chapter II Problem Formulation
A. Deterministic Models
2-1. Basic Concepts
2-2. Structural Model
2-3. Parameter Evaluation
2-4. Reduced Model
2-5. Application Areas
B. Data
2-6. Experiments and Data Matrix
C. Probabilistic Models and Likelihood
2-7. Randomness in Data
2-8. The Normal Distribution
2-9. The Uniform Distribution
2-10. Distribution of Errors
2-11. Stochastic Form of the Model
2-12. Likelihood-Standard Reduced Model
2-13. Likelihood-Structural Models
2-14. An Example
2-15. Utility of Distribution Assumptions
D. Prior Information and Posterior Distribution
2-16. Prior Information
2-17. Prior Distribution
2-18. Informative and Noninformative Priors
2-19. Bayes' Theorem
2-20. Problems
Chapter III Estimators and Their Properties
A. Statistical Properties
3-1. The Sampling Distribution
3-2. Properties of the Sampling Distribution
3-3. Evaluation of Statistical Properties
B. Mathematical Properties
3-4. Optimization
3-5. Unconstrained Optimization
3-6. Equality Constraints
3-7. Inequality Constraints
3-8. Problems
Chapter IV Methods of Estimation
4-1. Residuals
A. Least Squares
4-2. Unweighted Least Squares
4-3. Weighted Least Squares
4-4. Multiple Linear Regression
B. Maximum Likelihood
4-5. Definition
4-6. Likelihood Equations
4-7. Normal Distribution
4-8. Unknown Diagonal Covariance
4-9. Unknown General Covariance
4-10. Independent Variables Subject to Error
4-11. Exact Structural Models
4-12. Data Requirements
4-13. Some Other Distributions
C. Bayesian Estimation
4-14. Definition
4-15. Mode of the Posterior Distribution
4-16. Minimum Risk Estimates
D. Other Methods
4-17. Minimax Deviation
4-18. Pseudomaximum Likelihood
4-19. Linearizing Transformations
4-20. Minimum Chi-Square Method
4-21. Problems
Chapter V Computation of the Estimates I: Unconstrained Problems
5-1. Introduction
5-2. Iterative Scheme
5-3. Acceptability
5-4. Convergence
5-5. Steepest Descent
5-6. Newton's Method
5-7. Directional Discrimination
5-8. The Marquardt Method
5-9. The Gauss Method
5-10. The Gauss Method as a Sequence of Linear Regression Problems
5-11. The Implementation of the Gauss Method
5-12. Variable Metric Methods
5-13. Step Size
5-14. Interpolation-Extrapolation
5-15. Termination
5-16. Remarks on Convergence
5-17. Derivative Free Methods
5-18. Finite Differences
5-19. Direct Search Methods
5-20. The Initial Guess
5-21. A Single-Equation Least Squares Problem
5-22. Adding Prior Information
5-23. A Two-Equation Maximum Likelihood Problem
5-24. Problems
Chapter VI Computation of the Estimates II: Problems with Constraints
A. Inequality Constraints
6-1. Penalty Functions
6-2. Projection Methods
6-3. Projection with Bounded Parameters
6-4. Transformation of Variables
6-5. Minimax Problems
B. Equality Constraints
6-6. Exact Structural Models
6-7. Convergence Monitoring
6-8. Some Special Cases
6-9. Penalty Functions
6-10. Linear Equality Constraints
6-11. Least Squares Problem with Penalty Functions
6-12. Least Squares Problem-Projection Method
6-13. Independent Variables Subject to Error
6-14. An Implicit Equations Model
6-15. Problems
Chapter VII Interpretation of the Estimates
7-1. Introduction
7-2. Response Surface Techniques
7-3. Canonical Form
7-4. The Sampling Distribution
7-5. The Covariance Matrix of the Estimates
7-6. Exact Structural Model
7-7. Constraints
7-8. Principal Components
7-9. Confidence Intervals
7-10. Confidence Regions
7-11. Linearization
7-12. The Posterior Distribution
7-13. The Residuals
7-14. The Independent Variables Subject to Error
7-15. Goodness of Fit
7-16. Tests on Residuals
7-17. Runs and Outliers
7-18. Causes of Failure
7-19. Prediction
7-20. Parameter Transformation
7-21. Single-Equation Least Squares Problem
7-22. A Monte Carlo Study
7-23. Independent Variables Subject to Error
7-24. Two-Equation Maximum Likelihood Problem
7-25. Problems
Chapter VIII Dynamic Models
8-1. Models Involving Differential Equations
8-2. The Standard Dynamic Model
8-3. Models Reducible to Standard Form
8-4. Computation of the Objective Function and Its Gradient
8-5. Numerical Integration
8-6. Some Difficulties Associated with Dynamic Systems
8-7. A Chemical Kinetics Problem
8-8. Linearly Dependent Equations
8-9. Problems
Chapter IX Some Special Problems
9-1. Missing Observations
9-2. Inhomogeneous Covariance
9-3. Sequential Reestimation
9-4. Computational Aspects
9-5. Stochastic Approximation
9-6. A Missing Data Problem
9-7. Further Problem with Missing Data
9-8. A Sequential Reestimation Problem
9-9. Problems
Chapter X Design of Experiments
10-1. Introduction
10-2. Information and Uncertainty
10-3. Design Criterion for Parameter Estimation
10-4. Design Criterion for Prediction
10-5. Design Criterion for Model Discrimination
10-6. Termination Criteria
10-7. Some Practical Considerations
10-8. Computational Considerations
10-9. Computer Simulated Experiments
10-10. Design for Decision Making
10-11. Problems
Appendix A Matrix Analysis
A-1. Matrix Algebra
A-2. Matrix Differentiation
A-3. Pivoting and Sweeping
A-4. Eigenvalues and Vectors of a Real Symmetric Matrix
A-5. Spectral Decompositions
Appendix B Probability
Appendix C The Rao-Cramer Theorem
Appendix D Generating a Sample from a Given Multivariate Normal Distribution
Appendix E The Gauss-Markov Theorem
Appendix F A Convergence Theorem for Gradient Methods
Appendix G Some Estimation Programs
References
Author Index
Subject Index
Preface
This book is intended primarily for use by the scientist or engineer who
is concerned with fitting mathematical models to numerical data, and for
use in courses on data analysis which deal with that subject. Such fitting is
frequently done by the method of least squares, with no regard paid to
previous knowledge concerning the values of the parameters (coefficients),
nor to the statistical nature of the measurement errors. In Chapters II-IV
we show how the problem can be formulated so as to take all these factors
into account. In Chapters V-VI we discuss the computational methods used
to solve the problem, once its formulation has been completed. Chapter VII
is devoted to the question of what conclusions can be drawn, after the estimates
have been computed, concerning the validity of the estimates, or of the model
which has been fitted. In Chapter VIII we discuss the important special case
of models which are stated in the form of differential equations. Other
special problems are treated in Chapter IX. Finally, in Chapter X we suggest
methods for planning the experiments in such a way that the data will shed
the greatest possible light on the model and its parameters. We cannot stress
too strongly the point that if data are to be gathered for the purpose of
establishing a mathematical model, then the experiments should be designed
with this purpose in mind. Hence the importance of Chapter X.
A practical, rather than theoretical point of view has been taken through-
out this book. We describe computational algorithms which have performed
well on a variety of problems, even if their convergence has not been proven,
and even if they have failed on some other problems. We have as yet no
foolproof, efficient methods for solving nonlinear problems; hence we cannot
afford to throw away useful tools just because they are not perfect.
The presentation uses matrix algebra and probability theory on a very
elementary level. Reviews of the needed concepts and proofs of some impor-
tant theorems will be found in the appendixes. Some supplementary material
has been included in the form of problems at the ends of chapters. Problems
requiring actual computation have not been included; the reader is likely to
have his own data to compute with, and additional data may be found in
many of the cited references. Several numerical problems have, however,
been worked out in great detail in separate sections at the ends of Chapters
V-IX for the purpose of illustrating the methods discussed in those chapters.
The author is deeply indebted to the IBM Corporation, and in particular
to the managements of the New York and Cambridge Scientific Centers, who
have supported the writing of this book and provided all the necessary
resources. The author is also grateful to Professor L. Lapidus of Princeton
University, and to his colleagues J. G. Greenstadt, P. G. Comba,
H. Eisenpress, K. Spielberg, and P. Backer, for many helpful discussions,
and for reviewing portions of the manuscript.
Chapter
I
Introduction
1-1. Curve Fitting
A scientist who has compiled tables of data wishes to reduce them to a
more convenient and comprehensible form. He accomplishes this by representing
the data in graphical or functional form. In the first case, he plots his
data points, and then draws some curve through them. In the second case, he
selects a class of functions, and chooses from this class the one that best fits
his data. This is called curve fitting.
In the simplest case, the data consist of values $y_1, y_2, \ldots, y_n$ of a dependent
variable $y$ measured for various values $x_1, x_2, \ldots, x_n$ of an independent
variable $x$. A frequently chosen class of functions is the set of all polynomials
of order not exceeding $m$:
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_m x^m \tag{1-1-1}$$
The values of the parameters $\theta_0, \theta_1, \ldots, \theta_m$ are chosen so as to get the best
possible fit to the data. The most commonly used technique for accomplishing
this is the least squares method, in which those values of the $\theta_i$ are selected
which minimize the sum of squares of the residuals, i.e.,
$$S = \sum_{\mu=1}^{n} \left( y_\mu - \sum_{\alpha=0}^{m} \theta_\alpha x_\mu^\alpha \right)^2 \tag{1-1-2}$$
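As a minimal sketch of the least squares fit defined by Eqs. (1-1-1) and (1-1-2), the following Python fragment (assuming NumPy is available) builds the matrix of powers of $x$ and lets a standard linear least squares routine choose the $\theta_\alpha$. The function name `fit_polynomial` and the data are hypothetical and serve only as an illustration.

```python
import numpy as np

def fit_polynomial(x, y, m):
    """Least squares estimate of theta_0..theta_m in Eq. (1-1-1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix whose columns are x**0, x**1, ..., x**m.
    X = np.vander(x, m + 1, increasing=True)
    # lstsq returns the theta minimizing the sum of squares S of Eq. (1-1-2).
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta
    return theta, float(residuals @ residuals)

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
x_data = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_data = 1.0 + 2.0 * x_data + 3.0 * x_data**2
theta_fit, S = fit_polynomial(x_data, y_data, m=2)
print(theta_fit, S)
```

Because the model is linear in the parameters, no starting guess or iteration is needed; this is the computational convenience that the curve fitting approach relies on.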
Curve fitting procedures are characterized by two degrees of arbitrariness.
First, the class of functions used is arbitrary, being dictated only to a
minor extent by the physical nature of the process from which the data came.
Second, the best fit criterion is arbitrary, being independent of statistical
considerations. This arbitrariness can be exploited to make the fitting job
easy. Choosing equations which, like Eq. (1),† are linear functions of the
parameters; using orthogonal or Fourier polynomials (in place of ordinary
polynomials) as the functions to fit; employing the least squares criterion: all
these contribute to making the computation of the parameters a mathemati-
cally easy job. On the other hand, due to their arbitrary nature, the equations
that we get are useful only for summarizing the data and for interpolating
between tabulated values. They cannot be used to extrapolate, i.e., to predict
the outcome of experiments removed from the region of already available
data. Also, the equations and the parameters occurring in them shed little
insight on the nature of the process being measured, except to answer such
questions as to whether variable x has an influence on variable y.
† This reference is to the first equation of the current section, i.e., Eq. (1-1-1).
Curve fitting techniques have widespread applications in situations that
go far beyond the simple y vs. x table. An example is the identification of
dynamic systems by means of rational transfer functions or Volterra series.
Most multiple linear regression, analysis of variance, and econometric time-
series problems are also of a curve fitting nature, since the equations used are
not derived from "laws of nature." In most of these applications, however,
assumptions are made concerning the statistical behavior of the errors, there-
by elevating them at least partly to the status of estimation problems as dis-
cussed in Section 1-3.
1-2. Model Fitting
Often the scientist is, to a certain extent, familiar with the laws which
govern the behavior of the physical system under observation. He can then
derive equations describing the relationships among the observed quantities.
For instance, the fraction $y$ of a radioactive isotope remaining $x$ seconds after
the isotope's formation is given by
$$y = e^{-\theta x} \tag{1-2-1}$$
where the parameter $\theta$ is a physical constant proportional to the instantaneous
rate of decay of the isotope. The magnitude of $\theta$ is unknown, but we wish to
assign to it a value which makes Eq. (1) fit the data $(y_1, x_1), (y_2, x_2), \ldots,
(y_n, x_n)$ as well as possible, e.g., by the least squares criterion.
An equation such as Eq. (1) which is derived from theoretical considera-
tions is called a model, and the procedure just described constitutes model
fitting. In principle, model fitting is not much different from curve fitting,
except that we can no longer guide the selection of a functional form by
considerations of computational convenience. For instance, Eq. (1) is not a
linear function of the parameter, and because of this the computation of the
"best fit" is more difficult than the computation of the $\theta_i$ in Eq. (1-1-1).
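To illustrate why the fit of the decay model $y = e^{-\theta x}$ is harder than a polynomial fit, here is a minimal sketch of an iterative least squares computation in Python (assuming NumPy). It uses a Gauss-type iteration with step halving, which is one possible choice and not necessarily the procedure the text has in mind at this point; the function names, starting value, and data are hypothetical.

```python
import numpy as np

def sum_of_squares(theta, x, y):
    """Sum of squared residuals for the model y = exp(-theta * x)."""
    r = y - np.exp(-theta * x)
    return float(r @ r)

def fit_decay(x, y, theta0=1.0, max_iter=50, tol=1e-12):
    """Iterative least squares estimate of theta (assumed scheme, see lead-in)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = float(theta0)
    for _ in range(max_iter):
        f = np.exp(-theta * x)
        r = y - f                # residuals at the current theta
        J = -x * f               # derivative of the model with respect to theta
        denom = float(J @ J)
        if denom == 0.0:
            break
        step = float(J @ r) / denom   # solution of the linearized problem
        S_old = float(r @ r)
        for _halving in range(30):    # halve the step until S does not increase
            if sum_of_squares(theta + step, x, y) <= S_old:
                break
            step *= 0.5
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical error-free data generated with theta = 0.5.
x_data = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
y_data = np.exp(-0.5 * x_data)
theta_fit = fit_decay(x_data, y_data, theta0=1.0)
print(theta_fit)
```

Unlike the polynomial case, the answer is reached only through repeated linearization from a starting guess, and a poor guess may require the step control shown.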
1-3. Estimation
A new consideration arises in model fitting that does not exist in curve
fitting. The parameters occurring in a model, e.g., $\theta$ in Eq. (1-2-1), usually
represent quantities that have physical significance. If the model is a correct
one, then it is meaningful to ask what is the true value of $\theta$ in nature. Because
of the generally imprecise nature of measurements we can never hope to
determine the true values with absolute certainty. Also, due to the random
nature of the errors in measurements, the value of $\theta$ that best fits one series
of measurements differs from the value that fits another series, even though
both series are performed on the same isotope. However, we can look for
procedures to obtain values of the parameters that not only fit the data well,
but also come on the average fairly close to the true values, and do not vary
excessively from one set of experiments to the next. The process of determining
parameter values with these statistical considerations in mind is termed model
estimation.
The classical problem of statistical estimation differs somewhat from the
model estimation problem that we have just defined. The statistician observes
a sequence of values ("realizations") that a random variable assumes. For
instance, he may obtain a sequence of numbers such as 1, 5, 6, 3, ... denoting
successive throws of a die. The statistician assumes a "model" in the form
of a probability distribution which may depend on some unknown param-
eters. In our case, the statistician who suspects the die may be loaded
assigns probabilities $[\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, 1 - \sum_{i=1}^{5} \theta_i]$ to the six possible out-
comes of a throw. He then attempts to estimate the $\theta_i$ from the observed
values of the random variable. Here he will probably use the estimate
$$\theta_i = n_i \Big/ \sum_{j=1}^{6} n_j \tag{1-3-1}$$
where $n_i$ is the number of throws on which the number $i$ showed up ($i = 1,
\ldots, 6$).
As a further example, the observed value of the random variable may be
the height $h$ of adults in a community. If we assume that this variable has a
normal (Gaussian) distribution with mean $h_0$ and standard deviation $\sigma$, then
the probability density function is given by
$$p(h) = [1/(2\pi)^{1/2}\sigma] \exp[-(1/2\sigma^2)(h - h_0)^2] \tag{1-3-2}$$
If we measure the heights $h_1, h_2, \ldots, h_n$ of $n$ randomly chosen individuals
from the community, we form the usual estimates:
$$h_0 = (1/n) \sum_{\mu=1}^{n} h_\mu \tag{1-3-3}$$
$$\sigma^2 = [1/(n-1)] \sum_{\mu=1}^{n} (h_\mu - h_0)^2 \tag{1-3-4}$$
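The two classical estimates above, the relative frequencies of Eq. (1-3-1) and the sample mean and variance of Eqs. (1-3-3) and (1-3-4), can be sketched in a few lines of Python (assuming NumPy). The function names and the sample values are hypothetical.

```python
import numpy as np

def die_estimates(throws):
    """Relative frequency estimate of Eq. (1-3-1) for faces 1..6."""
    # counts[i-1] is n_i, the number of throws showing face i.
    counts = np.bincount(np.asarray(throws, dtype=int), minlength=7)[1:7]
    return counts / counts.sum()

def normal_estimates(h):
    """Sample mean (Eq. 1-3-3) and variance with divisor n - 1 (Eq. 1-3-4)."""
    h = np.asarray(h, dtype=float)
    n = h.size
    h0 = h.sum() / n
    sigma2 = ((h - h0) ** 2).sum() / (n - 1)
    return h0, sigma2

# Hypothetical observations.
theta_die = die_estimates([1, 5, 6, 3, 1, 1])
h0_est, sigma2_est = normal_estimates([160.0, 170.0, 180.0])
print(theta_die, h0_est, sigma2_est)
```

In both cases every observation is drawn from the same distribution, which is the feature that distinguishes the classical problem from the model estimation problem.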
The model estimation problem can be embedded in the statistical esti-
mation problem in the following way: It is reasonable to suppose that the
outcome $y_\mu$ of a measurement taken at time $x_\mu$ (we shall phrase our discussion
in terms of the radioactive decay model of Section 1-2) is a random variable
whose mean value is given by Eq. (1-2-1) as $\exp(-\theta x_\mu)$. If many measure-
ments were to be taken at the same $x_\mu$, we would discover that the observed
values $y_\mu$ fluctuate around their mean value with standard deviation $\sigma$.
Suppose these fluctuations have a normal probability distribution. The prob-
ability density function for $y_\mu$ would then have the form similar to Eq. (2)
$$p(y_\mu) = [1/(2\pi)^{1/2}\sigma] \exp\{-(1/2\sigma^2)[y_\mu - \exp(-\theta x_\mu)]^2\} \tag{1-3-5}$$
In fact we only take one measurement at any specific $x_\mu$. What we have are
realizations $y_1, y_2, \ldots, y_n$, each of a different random variable whose dis-
tribution depends on the parameter $x_\mu$, which varies from one variable to the
next, and on some other parameters $(\theta, \sigma)$ which are common to all these dis-
tributions. The parameter estimation problem which is the primary concern
of this book is the problem of estimating these common parameters.
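A short sketch may help show how the common parameters $(\theta, \sigma)$ enter the densities of Eq. (1-3-5). The Python function below (assuming NumPy) sums the logarithms of the individual densities over all measurements. Treating the errors in different measurements as independent is an assumption made here for illustration; the name `log_density_sum` and the data are hypothetical.

```python
import numpy as np

def log_density_sum(theta, sigma, x, y):
    """Sum over mu of log p(y_mu) from Eq. (1-3-5).

    Assumes the errors of different measurements are independent,
    so that the joint log density is the sum of the individual ones.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    resid = y - np.exp(-theta * x)
    n = y.size
    return float(-n * np.log(sigma * np.sqrt(2.0 * np.pi))
                 - (resid @ resid) / (2.0 * sigma**2))

# Hypothetical error-free data generated with theta = 0.5.
x_data = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
y_data = np.exp(-0.5 * x_data)
at_true = log_density_sum(0.5, 0.1, x_data, y_data)
at_wrong = log_density_sum(0.8, 0.1, x_data, y_data)
print(at_true, at_wrong)
```

Each term has its own $x_\mu$ but shares $\theta$ and $\sigma$, so the summed log density is larger at the value of $\theta$ that generated the data than at a value far from it.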
At first glance, the parameter estimation problem appears more general
than the classical statistical estimation problem, since in the latter all samples
are taken from the same distribution. The distinction between the two prob-
lems disappears if we choose to regard all the data as being a single multi-
variate sample from the joint distribution of all the observations made in the
course of the series of experiments. It follows that many of the statistical
estimation methods can be applied to our parameter estimation problems.
The single sample point of view is, however, rather awkward when one
examines, say, the asymptotic properties of these estimates (see Chapter III
for definitions) since it requires that the entire set of experiments be repeated
over and over again.
Parameter estimation techniques may be applied as computational tools
to pure curve fitting problems. One must remember, however, that the sta-
tistical properties of these estimates (e.g., those described in Chapters III and
VII) sometimes lose their meaning in the curve fitting context.
Clearly, parameter estimation is a more difficult operation than curve
fitting, calling for more sophisticated analysis and more extensive computa-
tion. The effort is worthwhile since a well established model and precisely
estimated physical parameters are much more versatile tools, both for illu-
minating the present situation and for prediction in new situations, than ar-
bitrarily fitted curves can ever be. To bring home this point, one need only
observe that a physical parameter estimated from one model can always be
used in another model to which it is relevant. For instance, the viscosity of a
liquid estimated from viscometer data can be used to predict the required
pumping load for a piping system being designed.
There are other mathematical problems which may be solved by means of
parameter estimation or curve fitting techniques. These techniques may be
regarded as attempts to solve (as best one can) an overdetermined (more
equations than unknowns) system of simultaneous equations. Solving a sys-
tem of n equations in n unknowns may, therefore, be regarded as fitting to n
data points a model involving n unknown parameters. Two-point boundary
value problems in ordinary differential equations may be treated as models
in which the known terminal conditions are the data, and the missing initial
conditions are the unknown parameters. Some optimal control problems may
be solved by regarding the control actions as unknown parameters, and the
desired trajectory of the system as the data to be fitted. Similarly, some engi-
neering design problems may be posed as requiring parameter values which
induce the systems to meet prescribed conditions as closely as possible.
1-4. Linearity
To understand what we mean by the term "nonlinear estimation" we
must first make the following definitions: An expression is said to be linear
in a set of variables $\phi_1, \phi_2, \ldots, \phi_n$ if it has the form $a_0 + \sum_{i=1}^{n} a_i \phi_i$, where
the coefficients $a_i$ ($i = 0, 1, \ldots, n$) are not functions of the $\phi_i$. An expression
is quadratic in the $\phi_i$ if it has the form $a_0 + \sum_{i=1}^{n} a_i \phi_i + \sum_{i,j=1}^{n} b_{ij} \phi_i \phi_j$,
again with all coefficients not depending on the $\phi_i$. If we differentiate a quad-
ratic expression with respect to one of the $\phi_i$, we obtain a linear expression.
Linear estimation problems are ones in which the model equations are
linear expressions in the unknown parameters, e.g., Eq. (1-1-1). When the
model equations are not linear, as in Eq. (1-2-1), we speak of nonlinear estima-
tion. However, even some apparently linear problems are essentially nonlinear.
This is so because in order to estimate the parameters we usually minimize
some function, such as the sum of squares of residuals. To find the minimum
we equate the derivatives of the function to zero and solve for the values of
the parameters. Now when the model equations are linear, the sum of squares
function is quadratic, and the derivatives are again linear. The estimates are
obtained, therefore, by solving a set of simultaneous linear equations, and
all is well. But if some other functions which are not quadratic are chosen to
be minimized, then the equations to be solved are no longer linear, even when
the model equations are linear. Such problems should also be regarded as
nonlinear estimation problems. Examples of such problems are given in
Sections 4-8 and 4-9.
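The argument above can be made concrete with a short worked derivation, offered here as an illustrative sketch using the polynomial model of Eq. (1-1-1) and the decay model of Eq. (1-2-1). Differentiating the sum of squares of Eq. (1-1-2) gives equations that are linear in the parameters, whereas the same step applied to the decay model gives an equation in which $\theta$ appears inside an exponential.

```latex
% Linear model: S is quadratic in the theta's, so dS/dtheta = 0 is linear.
\begin{align*}
S &= \sum_{\mu=1}^{n} \Bigl( y_\mu - \sum_{\alpha=0}^{m} \theta_\alpha x_\mu^{\alpha} \Bigr)^{2}, \\
\frac{\partial S}{\partial \theta_\beta}
  &= -2 \sum_{\mu=1}^{n} x_\mu^{\beta}
     \Bigl( y_\mu - \sum_{\alpha=0}^{m} \theta_\alpha x_\mu^{\alpha} \Bigr) = 0
  \quad\Longrightarrow\quad
  \sum_{\alpha=0}^{m} \Bigl( \sum_{\mu=1}^{n} x_\mu^{\alpha+\beta} \Bigr) \theta_\alpha
  = \sum_{\mu=1}^{n} x_\mu^{\beta} y_\mu
  \qquad (\beta = 0, 1, \ldots, m).
\end{align*}
% Nonlinear model: the stationarity condition is not linear in theta.
\begin{align*}
S &= \sum_{\mu=1}^{n} \bigl( y_\mu - e^{-\theta x_\mu} \bigr)^{2}, &
\frac{dS}{d\theta}
  &= 2 \sum_{\mu=1}^{n} x_\mu e^{-\theta x_\mu}
     \bigl( y_\mu - e^{-\theta x_\mu} \bigr) = 0 .
\end{align*}
```

The first system consists of $m + 1$ simultaneous linear equations in the $\theta_\alpha$; the second condition has no closed-form solution for $\theta$ in general and must be solved iteratively.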
1-5. Point and Interval Estimation
There exist many methods (e.g., least squares) which calculate specific
numbers representing estimates for the parameter values. Such numbers are
called point estimates. A point estimate for the parameters $\theta$, $\sigma$ appearing in
Eq. (1-3-5) may take the form
$$\theta^* = 4, \qquad \sigma^* = 0.1 \tag{1-5-1}$$
A point estimate standing alone is not very satisfactory. Random errors are
present in all measurements, and no mathematical model accounts for all
facets of a physical situation. Therefore we cannot hope to obtain point esti-
mates exactly equal to the true values of the parameters (if such exist). Nor
can we expect point estimates calculated from different data samples to be
equal, even if the samples were obtained under similar conditions. Therefore
we need to augment the point estimate with some information on its vari-
ability. For instance, in place of Eq. (1) we wish to have a statement such as
$$\theta^* = 4 \pm 0.2, \qquad \sigma^* = 0.1 \pm 0.02 \tag{1-5-2}$$
The numbers 0.2 and 0.02 are meant to represent the standard deviations of
the variability of the estimates for $\theta$, $\sigma$.
The information contained in Eq. (2) may be translated into a statement
of the type† "We are 75% sure that $\theta$ is between 3.6 and 4.4, and we are 75%
sure that $\sigma$ is between 0.06 and 0.14." This statement constitutes an interval
estimate for our parameters.
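The translation from a point estimate with its standard deviation to an interval statement can be sketched as follows. The Python function below applies the Bienaymé-Chebyshev inequality, which guarantees probability at least $1 - 1/k^2$ of lying within $k$ standard deviations of the mean; the function name `chebyshev_interval` is hypothetical.

```python
def chebyshev_interval(estimate, std_dev, k=2.0):
    """Interval estimate +/- k standard deviations and its Chebyshev bound.

    The bound 1 - 1/k**2 is a lower limit on the probability and is
    informative only for k > 1.
    """
    if k <= 1.0:
        raise ValueError("k must exceed 1 for a useful bound")
    confidence = 1.0 - 1.0 / k**2
    return estimate - k * std_dev, estimate + k * std_dev, confidence

# Values taken from Eq. (1-5-2), with k = 2.
theta_low, theta_high, theta_conf = chebyshev_interval(4.0, 0.2)
sigma_low, sigma_high, sigma_conf = chebyshev_interval(0.1, 0.02)
print(theta_low, theta_high, theta_conf)
print(sigma_low, sigma_high, sigma_conf)
```

With $k = 2$ the bound is 0.75, and the intervals reproduce, up to rounding, the limits 3.6 to 4.4 and 0.06 to 0.14 quoted in the text.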
Interval estimates can be computed directly, without first calculating
point estimates and their variability. In fact, many statisticians prefer interval
estimates, because they feel one is not justified in picking out one specific
preferred value to be used as a point estimate. We feel, however, that the
needs of the scientist or engineer are best served by point estimates with
measures of their reliability, so we will not discuss any direct interval esti-
mation procedures. The calculation of interval estimates (called confidence
intervals in this context) from point estimates is discussed in Sections 7-9-7-10.
† The statement is derived from Eq. (2) using the Bienaymé-Chebyshev inequality with
k = 2. See Eq. (7-9-11).
1-6. Historical Background
Legendre (1805) was the first to suggest in print the use of the least squares
criterion for estimating coefficients in linear curve fitting. Gauss (1809) laid
the statistical foundation for parameter estimation by showing that least
squares estimates maximized the probability density for a normal (Gaussian)
distribution of errors. In this, Gauss anticipated the maximum likelihood
method. Gauss and his contemporaries seemed to prefer, however, purely
heuristic justifications for the least squares method. Further work in the 19th
and early 20th centuries, by Gauss himself, Cauchy, Bienaymé, Chebyshev,
Gram, Schmidt, and others‡ concentrated on computational aspects of
linear least squares curve fitting, including the introduction of orthogonal
polynomials.
The development of statistical estimation methods received its impetus
from the work of Karl Pearson around the turn of the century and R. A.
Fisher in the 1920s and 1930s. The latter revived the maximum likelihood
method and studied estimator properties such as consistency, efficiency, and
sufficiency [see the collection of Fisher's (1950) papers]. The development of
decision theory by Wald and others has, in the post-World War II years,
introduced a new basis for selecting estimation criteria. The practical impact
of these methods in the area of nonlinear parameter estimation has so far
been slight, except for causing increased awareness of the uses of prior dis-
tributions.
The first modern applications of statistical estimation theory to model
estimation were made in the field of econometrics by Koopmans and others,
starting in the 1930s. Their work is summarized in the Cowles Commission
Reports (Hood and Koopmans, 1953). The main contributions to the appli-
cation of statistical techniques in the construction and estimation of mathe-
matical models in the physical sciences have come from Professor G. E. P. Box
and his coworkers at Princeton University and the University of Wisconsin.
The computation of estimates for nonlinear models usually requires find-
ing the maximum or minimum of a nonlinear function. Computational
methods bearing the names of Newton, Gauss, and Cauchy have been known
for a long time, but their extensive application to practical problems had to
await the arrival of the electronic computer. The first general purpose com-
puter program for solving nonlinear least squares problems was written by
Booth and Peterson (1958) in collaboration with Box. The program employed
a modified Gauss method. It has since been followed by many other programs,
some more general in nature and some dealing with more specific estimation
problems. A list of such programs can be found in Appendix G.
‡ References to this work, along with a more detailed historical survey, are given by
Seal (1967).
1-7. Notation
Matrix and vector notation are used throughout this book.
A boldface capital letter denotes a matrix: $\mathbf{A}$, $\boldsymbol{\Gamma}$.
A boldface lower case letter denotes a column vector: $\mathbf{a}$, $\boldsymbol{\gamma}$.
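The $(i,j)$ element, appearing in the $i$th row and $j$th column of $\mathbf{A}$, is denoted
$A_{ij}$ or $[\mathbf{A}]_{ij}$.
The $i$th element of $\mathbf{a}$ is denoted $a_i$ or $[\mathbf{a}]_i$.
$\mathbf{A}_\mu$ is the $\mu$th in a sequence of matrices $\mathbf{A}_1, \mathbf{A}_2, \mathbf{A}_3, \ldots$. The $(i,j)$ element
of $\mathbf{A}_\mu$ is denoted $A_{\mu ij}$ or $[\mathbf{A}_\mu]_{ij}$. Analogously for vectors.
$\mathbf{A}^T$ is the transpose of $\mathbf{A}$, i.e., $[\mathbf{A}^T]_{ij} = [\mathbf{A}]_{ji}$.
$\mathbf{a}^T$ is the row vector with the same elements as $\mathbf{a}$.
$\mathbf{A}^{-1}$ is the inverse of $\mathbf{A}$ if such exists.
$\mathbf{A}^{+}$ is the pseudoinverse of $\mathbf{A}$.
$\det(\mathbf{A})$ is the determinant of $\mathbf{A}$.
$\mathrm{Tr}(\mathbf{A}) = \sum_i A_{ii}$ is the trace of $\mathbf{A}$.
$\mathbf{A}$ is said to be $m \times n$ if it has $m$ rows and $n$ columns. A column vector is
$m \times 1$ and a row vector $1 \times n$.
$\mathbf{I}$ is the identity matrix, i.e.,
$$I_{ij} = \delta_{ij} = \begin{cases} 1 & (i = j) \\ 0 & (i \neq j) \end{cases}$$
$\mathbf{I}_m$ is the $m \times m$ identity matrix.
$\mathbf{A} = \mathrm{diag}(\mathbf{a})$ means that $\mathbf{A}$ is a matrix with elements $A_{ij} = a_i \delta_{ij}$.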
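Suppose $\alpha$ is a function of the vectors $\mathbf{a}$ and $\mathbf{b}$ and the matrix $\mathbf{A}$. Then:
$\partial\alpha/\partial\mathbf{a}$ is the column vector $[\partial\alpha/\partial\mathbf{a}]_i = \partial\alpha/\partial a_i$
$\partial\alpha/\partial\mathbf{A}$ is the matrix $[\partial\alpha/\partial\mathbf{A}]_{ij} = \partial\alpha/\partial A_{ij}$
$\partial^2\alpha/\partial\mathbf{a}\,\partial\mathbf{b}$ is the matrix $[\partial^2\alpha/\partial\mathbf{a}\,\partial\mathbf{b}]_{ij} = \partial^2\alpha/\partial a_i\,\partial b_j$
Suppose $\mathbf{a}$ is a vector function of the scalar $\beta$ and the vector $\mathbf{b}$. Then:
$\partial\mathbf{a}/\partial\beta$ is the column vector $[\partial\mathbf{a}/\partial\beta]_i = \partial a_i/\partial\beta$
$\partial\mathbf{a}/\partial\mathbf{b}$ is the matrix $[\partial\mathbf{a}/\partial\mathbf{b}]_{ij} = \partial a_i/\partial b_j$
Suppose $\mathbf{A}$ is a matrix function of the scalar $\alpha$. Then:
$\partial\mathbf{A}/\partial\alpha$ is the matrix $[\partial\mathbf{A}/\partial\alpha]_{ij} = \partial A_{ij}/\partial\alpha$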
Derivatives of matrices with respect to vectors and matrices, or of vectors
with respect to matrices, give rise to arrays with more than two dimensions.
Rules for differentiating vector and matrix expressions are given in Section
A-2 of Appendix A.
We also make use of some notation associated with probability concepts.
Pr(A) is the probability of event A.
If $x$ is a random variable, then
$p(x)$ is the probability density function of $x$.
$p(x \mid A)$ is the probability density of $x$ given that $A$ occurred.
$E(x)$ is the expected value of $x$.
$E(x \mid A)$ is the expected value of $x$ given that $A$ occurred.
$V_x = E\{[x - E(x)]^2\}$ is the variance of $x$.
$\sigma_x = V_x^{1/2}$ is the standard deviation of $x$.
The notation $p(x \mid y)$ is meant to indicate that the probability density of $x$
is also a function of the variable $y$.
The reader totally unfamiliar with matrix and probability theories is urged
to study texts on these subjects. The reader who merely wishes to refresh his
memory may consult Appendixes A and B which contain skeleton definitions
of the terms involved and the operations applying to them.
Other notation:
$A \equiv B$ means that $A$ equals $B$ by definition.
$A \approx B$ means that $A$ equals $B$ approximately, or to within the order of
approximation being considered (e.g., up to second-order terms in a Taylor
series).
$\log x$ is the natural logarithm of $x$.
$\exp(x) \equiv e^x$.
$N_k(\mathbf{a}, \mathbf{V})$ is the $k$-dimensional normal distribution with mean $\mathbf{a}$ and covar-
iance matrix $\mathbf{V}$.
Unless otherwise stated, the notation $x = a \pm b$ denotes that $x$ is a random
variable or estimate with mean $a$ and standard deviation $b$.
The estimated value of some quantity x is denoted x*, and its true (though
unknown) value is denoted x.
Formulas and equations are numbered by chapter and section. For in-
stance, Eq. (5-3-6) is the sixth equation in Section 5-3. The chapter and section
numbers are omitted from references to equations within the same section.
Subscripts:
$a, b, c, \ldots$ refer to model equations or dependent variables. The usual
range is 1 to $m$.
Example: in $y_a = f_a(\mathbf{x}, \boldsymbol{\theta})$ the $a$th dependent variable $y_a$ is a function $f_a$ of
$\mathbf{x}$ and $\boldsymbol{\theta}$.
$\alpha, \beta, \gamma, \ldots$ refer to parameters. The usual range is 1 to $l$.
Example: $q_\alpha = \partial\Phi/\partial\theta_\alpha$ is the $\alpha$th component of the gradient of $\Phi$ with
respect to $\boldsymbol{\theta}$.
$\mu, \eta, \phi$ refer to experiments. The usual range is 1 to $n$.
Example: $\mathbf{y}_\mu$ is the vector of dependent variables measured in the $\mu$th
experiment. Its $a$th component is $y_{\mu a}$.
$i$ frequently (but not always) refers to iteration number.
Example: $\boldsymbol{\theta}_i$ is the vector $\boldsymbol{\theta}$ appearing in the $i$th iteration. Its $\alpha$th component
is $\theta_{i\alpha}$.
Chapter
II
Problem Formulation
A. Deterministic Models
2-1. Basic Concepts
The scientist often expresses his theories in the form of mathematical re-
lationships among certain quantities. Similarly, the engineer derives equations
that describe the properties of his structures or the workings of his processes.
We refer to the relations which supposedly describe a certain physical situa-
tion as a model. Typically, a model consists of one or more equations. The
quantities appearing in the equations we classify into variables and param-
eters. The distinction between these is not always clear cut, and it frequently
depends on the context in which the variables appear. Usually a model is de-
signed to explain the relationships that exist among quantities which can be
measured independently in an experiment; these are the variables of the
model. To formulate these relationships, however, one frequently introduces
"constants" which stand for inherent properties of nature (or of the materials
and equipment used in a given experiment). These are the parameters.
We illustrate by means of an example: A cylindrical vessel of cross-sectional
area $A$ is filled with a liquid of density $\rho$ and viscosity $\mu$. It is allowed
to drain through a capillary tube of radius $R$ and length $L$. Let $h$ and $h_0$
denote the depth of the liquid in the vessel at times $t$ and $t_0$, respectively.
The equations of laminar flow yield, for this case, the relation
$$\log(h_0/h) = (\pi g R^4 / 8A\varphi L)(t - t_0) \tag{2-1-1}$$
where $g$ is the acceleration due to gravity, and $\varphi = \mu/\rho$ is the kinematic
viscosity of the liquid. If we interpret Eq. (1) as a relationship between the height
of the liquid and the time, then we shall regard $h$, $h_0$, $t$, and $t_0$ as the variables,
and $g$, $A$, $R$, $L$, and $\varphi$ as the parameters. Among the latter, the first is a
constant of nature, the next three reflect the properties of the apparatus, and the
last one a property of the material used. If we performed experiments on
several different vessels, we might add R, A, and L to the list of variables,
leaving $g$ and $\varphi$ as the sole parameters.
On the other hand, suppose our instrument is to be used as a viscometer.
We place two marks on the vessel, at heights $h_0$ and $h$ from the bottom, and
measure the time $\Delta t$ that it takes for the surface of the liquid to pass from the
higher to the lower mark. The kinematic viscosity of the liquid can then be
calculated from the following rearrangement of Eq. (1)
$$\varphi = \alpha\,\Delta t \tag{2-1-2}$$
where $\alpha = \pi g R^4 / [8AL \log(h_0/h)]$. We calibrate the instrument with liquids
whose viscosities are known. For the purposes of calibration, then, Eq. (2)
contains the variables $\Delta t$ (directly measurable) and $\varphi$ (which can be found in
published tables), and the parameter $\alpha$ (in whose physical significance we are
not at the moment interested).
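As a minimal sketch of Eqs. (2-1-1) and (2-1-2), the following Python fragment computes the drain time between the two marks and then recovers the kinematic viscosity from it. The function names and all numerical values are illustrative assumptions, not data from the text.

```python
import math

G = 9.81  # acceleration due to gravity, m/s^2 (treated as known)

def drain_time(h0, h, A, R, L, phi, g=G):
    """Time for the level to fall from h0 to h, solved from Eq. (2-1-1)."""
    return math.log(h0 / h) * 8.0 * A * phi * L / (math.pi * g * R**4)

def calibration_constant(h0, h, A, R, L, g=G):
    """alpha of Eq. (2-1-2): pi g R^4 / (8 A L log(h0/h))."""
    return math.pi * g * R**4 / (8.0 * A * L * math.log(h0 / h))

# Hypothetical apparatus (SI units) and liquid
A, R, L = 1.0e-3, 5.0e-4, 0.10
h0, h = 0.20, 0.10
phi_true = 1.0e-6

dt = drain_time(h0, h, A, R, L, phi_true)
alpha = calibration_constant(h0, h, A, R, L)
phi_recovered = alpha * dt   # Eq. (2-1-2)
```

In an actual calibration the apparatus dimensions would not be used at all; `alpha` would be estimated from measured pairs of drain time and tabulated viscosity.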
The values of some of the parameters which appear in a model may be
known with great precision (e.g., the gravitational constant $g$ in Eq. (1)).
The role of such parameters does not differ, at least for our purposes, from
that of purely numerical constants, such as $\pi$ or 8 in Eq. (1). We exclude
such parameters from further considerations.
2-2. Structural Model
The models we have so far considered take the general functional form
$$g(z, \theta) = 0 \tag{2-2-1}$$
where:
$g = [g_1, g_2, \ldots, g_m]^T$
is an $m$-dimensional vector of functions.
$z = [z_1, z_2, \ldots, z_k]^T$
is a $k$-dimensional vector of variables.
$\theta = [\theta_1, \theta_2, \ldots, \theta_l]^T$
is an $l$-dimensional vector of parameters whose values are not precisely
known.
Equations (1) are referred to as the structural equations of the model.
Looking at the model represented by Eq. (2-1-1), we find that there is only
one equation, hence $m = 1$; there are four variables $z_1 = h$, $z_2 = h_0$, $z_3 = t$,
and $z_4 = t_0$; and there are four unknown parameters $\theta_1 = A$, $\theta_2 = R$, $\theta_3 = L$,
$\theta_4 = \varphi$. Eq. (1) then takes the form
$$g_1(z, \theta) \equiv \log(z_2/z_1) - (\pi g/8)(\theta_2^4/\theta_1\theta_3\theta_4)(z_3 - z_4) = 0 \tag{2-2-2}$$
A model for which $m = 1$ is called a single equation model.
We refer to a model as linear if each one of the model equations has the
form
$$g_i(z, \theta) = B_{i0}(z) + \sum_{j=1}^{l} B_{ij}(z)\,\theta_j = 0 \qquad (i = 1, 2, \ldots, m) \tag{2-2-3}$$
where the $B_{ij}$ ($i = 1, 2, \ldots, m$; $j = 0, 1, \ldots, l$) are known functions of the $z$.
Models which are not linear are referred to as nonlinear. Equation (2-1-2) is
a linear model (with $\alpha$ as the parameter), whereas Eq. (2) is nonlinear.
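The distinction can be checked numerically. A function that is linear in the parameters in the sense of Eq. (2-2-3) takes, at the midpoint of two parameter vectors, exactly the average of its values at the two vectors. The sketch below applies this test to Eq. (2-1-2), written in structural form, and to Eq. (2-2-2). The helper `midpoint_gap` and the numerical values are assumptions made for illustration; a zero gap at one pair of points is a necessary condition for linearity, not a proof of it.

```python
import math

G = 9.81  # m/s^2, treated as a known constant

def g_drain(z, theta):
    """Eq. (2-2-2): z = (h, h0, t, t0), theta = (A, R, L, phi)."""
    h, h0, t, t0 = z
    A, R, L, phi = theta
    return math.log(h0 / h) - (math.pi * G / 8.0) * R**4 / (A * L * phi) * (t - t0)

def g_viscometer(z, theta):
    """Eq. (2-1-2) as phi - alpha*dt = 0: z = (dt, phi), theta = (alpha,)."""
    dt, phi = z
    (alpha,) = theta
    return phi - alpha * dt

def midpoint_gap(g, z, ta, tb):
    """|g at the midpoint of ta, tb  minus  the mean of g(ta), g(tb)|."""
    mid = tuple(0.5 * (a + b) for a, b in zip(ta, tb))
    return abs(g(z, mid) - 0.5 * (g(z, ta) + g(z, tb)))

gap_linear = midpoint_gap(g_viscometer, (100.0, 1e-6), (1e-8,), (3e-8,))
gap_nonlinear = midpoint_gap(
    g_drain, (0.1, 0.2, 50.0, 0.0),
    (1e-3, 5e-4, 0.1, 1e-6), (2e-3, 6e-4, 0.2, 2e-6),
)
```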
2-3. Parameter Evaluation
A model whose form corresponds to Eq. (2-2-1) is called a deterministic
model, since all the quantities appearing in it are assumed to be well deter-
mined, at least in principle. The model can be of little practical value, how-
ever, unless the values of its parameters are known. There are two principal
methods by which we may establish the values of the parameters:
1. Calculate the value of a parameter by applying established laws of nature
to already known quantities. For example, if $R$, $A$, $L$, $h$, and $h_0$ have been
measured, we can compute $\alpha = \pi g R^4 / [8AL \log(h_0/h)]$ as the value of the
parameter to be used in Eq. (2-1-2).
2. Measure the values of the model variables that occur in actual physical
situations, and then seek parameter values which cause the model equations
to be satisfied, at least approximately. We are concerned here with the imple-
mentation of this second procedure.
2-4. Reduced Model
The structural Eqs. (2-2-1) are suitable for checking the validity of the
model. If values can be found for the parameters such that the equations are
at least approximately satisfied, then we do not reject the model. The most
important practical use to which the model may be put is that of prediction.
For this purpose, the variables z are classified into two groups:
1. The $r$ variables $y = y_1, y_2, \ldots, y_r$ whose values we wish to predict.
These we call the dependent variables.
2. The $s$ variables $x = x_1, x_2, \ldots, x_s$ on the basis of which we wish to do
the prediction. We call these the independent variables.
The problem of prediction, then, is that of determining in advance the
values that the dependent variables will take for given values of the indepen-
dent variables.
Rewriting the structural equations with $x$ and $y$ replacing $z$, we have
$$g(x, y, \theta) = 0 \tag{2-4-1}$$
We see that reasonable prediction is possible if all of the following conditions
hold:
1. The model is reasonably correct.
2. The values of the parameters are known to a good approximation.
3. The structural equations can be solved for the dependent variables,
yielding the reduced equations
$$y = f(x, \theta) \tag{2-4-2}$$
where $f = f_1, f_2, \ldots, f_r$ is an $r$-dimensional vector of functions.
Since the number of structural equations is $m$, we can usually solve for
the values of up to m dependent variables, leaving s = k - m independent
variables.
A linear reduced model is one in which the functions $f$ are linear in the
$\theta$. A linear structural model may result in a nonlinear reduced model. For
instance, the linear structural model $\log y + \theta x = 0$ reduces to the nonlinear
model $y = e^{-\theta x}$.
Strictly speaking, we should refer to the "structural form" or "reduced
form" of the same model. In practice, however, we shall attach the
designation "model" to whatever set of equations we happen to be dealing with at
the moment.
2-5. Application Areas
There is nothing in Eqs. (2-2-1) or (2-4-2) to imply that we need have
explicit analytic expressions for the functions $g$ and $f$. All that is required is
that given the values of their arguments ($z$ and $\theta$, or $x$ and $\theta$), one can
calculate the values of the functions. This may require solution of a system of
differential equations, or a complicated system simulation. When the
structural equations cannot be solved explicitly, we may still obtain predicted
values of the dependent variables by solving the equations numerically.
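A minimal sketch of such a numerical solution follows. It applies bisection to the structural equation $\log y + \theta x = 0$ of Section 2-4, chosen only because its explicit solution $y = e^{-\theta x}$ is available for comparison. The bracketing interval, the tolerance, and the function names are assumptions of this sketch; any root-finding method could be substituted.

```python
import math

def g(x, y, theta):
    """Structural equation from Section 2-4: log y + theta*x = 0."""
    return math.log(y) + theta * x

def solve_for_y(x, theta, lo=1e-12, hi=1e6, tol=1e-12, max_iter=200):
    """Bisection on y; assumes g is increasing in y and changes sign on [lo, hi]."""
    if g(x, lo, theta) * g(x, hi, theta) > 0:
        raise ValueError("root not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(x, mid, theta) > 0:
            hi = mid      # root lies below mid
        else:
            lo = mid      # root lies at or above mid
        if hi - lo <= tol * max(1.0, abs(mid)):
            break
    return 0.5 * (lo + hi)

y_numeric = solve_for_y(x=2.0, theta=0.5)
y_explicit = math.exp(-0.5 * 2.0)   # reduced form y = exp(-theta*x)
```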
The following example of a model requiring the solution of differential
equations is taken from the field of chemical reaction kinetics. Consider a
chemical reaction in which molecules of a certain species (compound) A
decompose spontaneously into molecules of B and C. In chemical notation,
the reaction would be written as
$$\mathrm{A} \rightarrow \mathrm{B} + \mathrm{C} \tag{2-5-1}$$
The law of mass action states that the rate of decomposition is, at any
moment, proportional to the concentration of A at that moment. This leads
to the differential equation
$$dy_A/dt = -k_1 y_A \tag{2-5-2}$$
where $y_A$ is the concentration of A at time $t$, and $k_1$ is the so-called reaction
rate constant. Eq. (2) may be integrated explicitly to yield
$$y_A = x_A \exp(-k_1 t) \tag{2-5-3}$$
where $x_A$ is the concentration of A at zero time. This is a reduced equation,
with $y_A$ the dependent variable, $x_A$ and $t$ the independent variables, and $k_1$
a parameter. While in this case the differential equation could be solved
explicitly, it is not uncommon to find models where the integration can only
be performed numerically. Such models are treated in detail in Chapter
VIII.
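To illustrate what numerical integration of such a model involves, the sketch below integrates Eq. (2-5-2) with the classical fourth-order Runge-Kutta method and compares the result with the explicit solution, Eq. (2-5-3). The choice of integrator, the step count, and the numerical values are assumptions made for this example; they are not the methods of Chapter VIII.

```python
import math

def rhs(t, y, k1):
    """Right-hand side of Eq. (2-5-2): dy_A/dt = -k1 * y_A."""
    return -k1 * y

def rk4(f, y0, t_end, n_steps, *params):
    """Classical fourth-order Runge-Kutta from t = 0 to t = t_end."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        s1 = f(t, y, *params)
        s2 = f(t + h / 2, y + h * s1 / 2, *params)
        s3 = f(t + h / 2, y + h * s2 / 2, *params)
        s4 = f(t + h, y + h * s3, *params)
        y += h * (s1 + 2 * s2 + 2 * s3 + s4) / 6
        t += h
    return y

x_A, k1, t_end = 2.0, 0.3, 5.0            # hypothetical values
y_numeric = rk4(rhs, x_A, t_end, 100, k1)
y_closed = x_A * math.exp(-k1 * t_end)    # Eq. (2-5-3)
```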
We cannot show here how mathematical models are derived in the various
branches of science, but we can cite a few examples to demonstrate that the
utility of parameter estimation methods is not confined to the field of chemical
reaction kinetics.
(a) Nuclear Physics. Scattering data have been used to estimate parameters
referring to nuclear structure or nuclear-nuclear forces [see Melkanoff et al.
(1966); Arndt and MacGregor (1966)].
(b) Geophysical Exploration. Geophysical surveys are often conducted by
flying over the region of interest and recording measured values of variables
such as magnetic and gravimetric field intensities. These records are then
scanned for anomalies which may indicate the underground presence of
valuable ore deposits. Assuming the ore deposit to have given shape, size,
and location, it is possible to derive expressions for the magnetic and
gravimetric fields along the flight paths [see Grant and West (1965)]. Although
these expressions are very complicated, they can be used (Eisenpress and
Surkan, 1966) to estimate ore deposit parameters from aerial survey data.
(c) Biophysics. To study the manner in which substances are transported
from one part of an organism to another, biologists conceive of the body as
consisting of compartments separated by semipermeable membranes. A
tracer substance is injected into one compartment, and its concentration in
the other compartments is subsequently measured at various points in time.
These data may be used to estimate intercompartmental transport rate
parameters (Berman et al., 1962; Turner et al., 1963; Beauchamp and
Cornell, 1966).
Another interesting application is the determination of the dipole moments
of various sections of the heart from measurements of skin potential
(Bellman, Collier, Kagiwada, Kalaba, and Selvester, 1964).
(d) Probability. Given many samples of a random variable having a given
probability distribution, we wish to determine parameters (e.g., mean,
standard deviation, etc.) appearing in the distribution. This is the classical
estimation problem in statistics. A "curve fitting" approach to the problem
is to construct a histogram from the data, and fit to it the expression for the
probability density function.
(e) Econometrics. Econometricians attempt to construct mathematical models
for the national economy or certain segments of it. These models describe
the dynamic relationships among variables such as income, sales,
production, and employment. Parameters appearing in the model may be estimated
from past data, and used to predict future trends (Johnston, 1963).
(f) Orbit Calculations. The orbit of a satellite can be expressed as a function
of parameters which describe the heavenly bodies that attract the satellite.
These parameters can be estimated from the observed orbits (Kelley and
Denham, 1966).
All these are examples in which the parameters to be estimated possessed
(more or less) a physical significance, and the model equations attempted to
represent true cause and effect relationships. The following examples are of a
different kind. We attempt to determine design parameters which will confer
desirable properties on a device to be constructed.
(g) A smoothing filter is to be installed in an electrical circuit. We
calculate the ideal transfer function for the filter by solving the Wiener-Hopf
equation (Wiener, 1949). The filter must be constructed from passive
elements (resistors, capacitors, and inductors) so that its transfer function can
only be a rational function, i.e., the ratio of two polynomials. Our task is to
determine the coefficients in the two polynomials so that their ratio
approximates the Wiener-Hopf solution as closely as possible.
(h) Designers of artificial limbs attempt to reproduce the observed kine-
matics of natural limbs. They must estimate the design parameters so as to
best approximate the observed motions (Freudenstein and Woo, 1968).
B. Data
2-6. Experiments and Data Matrix
Parameter estimation is based on data, and the data consist of observed
or measured values of the model variables. One may obtain the data by
observing situations occurring in nature, or one may set up experiments in
which conditions are controlled so as to favor the process of observation.
In Chapter X we shall go into the question of what experiments should be
performed for estimating a given model. For the present, however, it is
immaterial where and how the data were obtained, except inasmuch as the
measurement process affects the errors in the observations.
In most cases the data gathering process possesses a certain structure.
Performing an experiment consists of recording the observed values of a set
of variables under a given set of experimental conditions. Sometimes this means
that the dependent variables are observed for given values of the independent
variables. Sometimes, however, the experimental conditions themselves are
not among the variables of the model. We may, for instance, wish to relate
height and weight of individuals in a population. In this case, the individual
chosen can be considered the "set of experimental conditions," whereas the
height and weight are the model variables.
Frequently, in the course of an investigation, several experiments are
performed, each under a different set of experimental conditions. A variable
subscripted by a letter $\mu$, $\eta$, or $\varphi$ denotes the value of that variable as measured
in the corresponding experiment; for example, the elements of
$$z_\mu = [z_{\mu 1}, z_{\mu 2}, \ldots, z_{\mu k}]^T$$
are the values of the model variables observed in the $\mu$th experiment. A
function subscripted with one of these letters denotes that function computed for
the values observed in the corresponding experiment
$$g_\mu(\theta) \equiv g(z_\mu, \theta) \tag{2-6-1}$$
We shall use corresponding capital letters to designate the data matrix,
i.e., the matrix whose $\mu$th row consists of the data vector for the $\mu$th
experiment. Thus, $Z$ and $G$ are the matrices whose $\mu$th rows are $z_\mu^T$ and $g_\mu^T$
respectively, e.g.,
$$Z = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1k} \\ z_{21} & z_{22} & \cdots & z_{2k} \\ \vdots & \vdots & & \vdots \\ z_{n1} & z_{n2} & \cdots & z_{nk} \end{bmatrix} \tag{2-6-2}$$
where $n$ is the number of experiments. The definitions of $x_\mu$, $y_\mu$, $X$, and $Y$
are obvious.
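The row-per-experiment layout can be sketched as follows. The single-equation model $\log y + \theta x = 0$ of Section 2-4 stands in for $g$, and the three observations are invented for illustration; neither comes from this section.

```python
import math

def g(z, theta):
    """A single-equation model (m = 1): log y + theta*x = 0, with z = (x, y)."""
    x, y = z
    return [math.log(y) + theta[0] * x]

# Hypothetical data matrix Z: row mu holds z_mu^T (n = 3 experiments, k = 2 variables)
Z = [
    [0.0, 1.00],
    [1.0, 0.61],
    [2.0, 0.37],
]

def G_matrix(Z, theta):
    """n x m matrix whose mu-th row is g(z_mu, theta)^T, as in Eq. (2-6-1)."""
    return [g(z_mu, theta) for z_mu in Z]

G = G_matrix(Z, theta=[0.5])
n, k, m = len(Z), len(Z[0]), len(G[0])
```

With a parameter value close to the one that generated the data, every entry of `G` is close to zero, which is the sense in which the model equations are "at least approximately satisfied."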
In practice it happens frequently that not all variables are measured in
every experiment, or even that the set of dependent variables measured differs
completely from one set of experiments to the next. In most cases this will
raise no undue difficulties; we simply use the appropriate set of model
equations for each experiment. Some of the problems that do arise in this
connection are discussed in Section 9-1.
C. Probabilistic Models and Likelihood
2-7. Randomness in Data
Deterministic models describe reality only in an idealized sense. If the
values of all the variables were known exactly, and if no forces other than
those explicitly considered were at work, then and only then could we expect
to find parameter values that cause the model equations to be satisfied
exactly.
In practice, we know that measurement techniques possess limited
accuracy, that repeated measurements of one and the same quantity yield
different values, that the conditions for which the model was derived are never
quite attainable, and that disturbances which could not be predicted or taken
into account in the model always occur. Yet these unpredictable disturbances
are as much parts of physical reality as are the underlying exact quantities
which appear in the model. The model is not complete, then, unless it also
describes in an appropriate manner these random elements of the situation.
The appropriate description of random phenomena is through probability
statements. The following sections will demonstrate the manner in which the
deterministic model can be imbedded in the probabilistic description of the
data, but first we digress somewhat to describe some probability distribu-
tions that are applicable to experimental errors.
2-8. The Normal Distribution
The importance of the normal distribution (defined below) derives from
several reasons:
(a) It has been found to approximate closely the behavior of many
measurements in nature.
(b) It is the limit which many other distributions approach when the
sample size is increased beyond bound. In particular, we have so-called
central limit theorems (Feller, 1966) which state that, under fairly general
conditions, the distribution of the sum of $n$ independent random
variables approaches the normal distribution as $n$ is made sufficiently large.
Central limit theorems are often used to explain the widespread occurrence
of this distribution in nature: if the observed value of the random variable
is the resultant of many additive, independent effects, the resulting
distribution is likely to be normal.
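A small simulation can make the statement concrete. The sketch below sums twelve independent uniform variables, standardizes the sum, and compares the fraction of samples within one standard deviation of the mean with the normal value of about 0.683. The number of terms, the sample size, and the seed are arbitrary assumptions; the simulation illustrates the theorem and proves nothing.

```python
import random

def standardized_sum(rng, n_terms=12):
    """Sum of n_terms independent U(0,1) variables, shifted and scaled
    to mean 0 and variance 1 (each term has mean 1/2, variance 1/12)."""
    s = sum(rng.random() for _ in range(n_terms))
    return (s - n_terms / 2.0) / (n_terms / 12.0) ** 0.5

rng = random.Random(12345)   # fixed seed so that the run is repeatable
samples = [standardized_sum(rng) for _ in range(20000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
frac_within_1 = sum(abs(s) <= 1.0 for s in samples) / len(samples)
```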
In some cases the normal distribution applies not to the variable itself,
but to some function of it. For instance, if a given effect is built up over a
period of time as the sum of many random effects, each of which has a standard
deviation proportional to the magnitude of the overall effect at the time,
then the distribution of the logarithm of the overall effect is likely to be
normal. This phenomenon is observed in situations relating to the growth of
individuals (Cramer, 1946, p. 220).
(c) By specifying the distribution of a random variable, we convey a
certain amount of information concerning the values assumed by the
variable. A suitable measure of the information contained in the distribution
whose probability distribution function (pdf) is $p(x)$ is given by
$$I(p) \equiv E(\log p) = \int p(x) \log p(x)\,dx \tag{2-8-1}$$
(Shannon, 1948; see also Section 10-2). Consider the following situation:
A scientist knows that the measuring errors of some apparatus have mean $\mu$
and standard deviation $\sigma$. For certain reasons (which should become clear in
the sequel) the scientist is compelled to assign a pdf $p(x)$ to the measurement
errors. This pdf will later be used to make inferences concerning the true
values of the measured variable. In the absence of any further information,
what function $p(x)$ should be chosen?
The function $p(x)$ must satisfy the following conditions:
1. It is a pdf, i.e., $p(x) \geq 0$, and
$$\int_{-\infty}^{\infty} p(x)\,dx = 1 \tag{2-8-2}$$
2. Its mean is as specified, i.e.,
$$\int_{-\infty}^{\infty} x\,p(x)\,dx = \mu \tag{2-8-3}$$
3. Its variance is as specified, i.e.,
$$\int_{-\infty}^{\infty} (x - \mu)^2 p(x)\,dx = \sigma^2 \tag{2-8-4}$$
It is reasonable to select from those functions $p(x)$ satisfying these
conditions the one whose information content is least. By doing so, we are adding
the smallest possible additional information over and above what we
legitimately know (i.e., the values of $\mu$ and $\sigma$).
Finding $p(x)$ such that $I(p)$ is minimized and Eqs. (2)-(4) are satisfied is
an exercise in the calculus of variations. Following standard procedures, we
form the Lagrangian functional
$$\Lambda(p) \equiv \int_{-\infty}^{\infty} [\,p \log p + \lambda_1 p + \lambda_2 x p + \lambda_3 (x - \mu)^2 p\,]\,dx \tag{2-8-5}$$
where the $\lambda_i$ are Lagrange multipliers (see Section 3-6). The Euler equation
that $p$ must satisfy to make $\Lambda(p)$ stationary is obtained by differentiating with
respect to $p$ the expression under the integral sign
$$\log p + 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0 \tag{2-8-6}$$
Hence
$$p(x) = \exp[-1 - \lambda_1 - \lambda_2 x - \lambda_3 (x - \mu)^2] \tag{2-8-7}$$
The values of the $\lambda_i$ can be determined by substituting Eq. (7) in Eqs. (2)-(4).
Using the relation $\int_{-\infty}^{\infty} \exp(-\alpha u^2)\,du = (\pi/\alpha)^{1/2}$, we find ultimately that
$$\lambda_1 = \tfrac{1}{2}\log 2\pi + \log\sigma - 1, \qquad \lambda_2 = 0, \qquad \lambda_3 = 1/(2\sigma^2) \tag{2-8-8}$$
Hence
$$p(x) = [1/(2\pi)^{1/2}\sigma]\,\exp[-(1/2\sigma^2)(x - \mu)^2] \tag{2-8-9}$$
This is the univariate normal distribution with mean $\mu$ and variance $\sigma^2$. We
designate this distribution $N_1(\mu, \sigma^2)$.
When $x$ is an $n$-dimensional vector random variable with mean $\bar{x}$ and
nonsingular covariance matrix $V$, we find by similar arguments that the least
informative pdf has the form
$$p(x) = (2\pi)^{-n/2} (\det V)^{-1/2} \exp[-\tfrac{1}{2}(x - \bar{x})^T V^{-1} (x - \bar{x})] \tag{2-8-10}$$
which is the multivariate normal distribution with mean $\bar{x}$ and covariance
matrix $V$. We designate this distribution $N_n(\bar{x}, V)$.
To summarize: when we specify only the mean and variance of a random
variable, we have not determined the entire distribution. If an entire dis-
tribution is demanded, however, then by specifying the normal distribution
we assume the least possible amount of extraneous information.
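The minimum-information property can be checked numerically for a few candidate densities. The sketch below evaluates $I(p)$ of Eq. (2-8-1) by the midpoint rule for a normal, a Laplace, and a uniform density, all scaled to mean zero and the same standard deviation. The choice of comparison densities, the integration limits, and the grid size are assumptions of this example; only the normal density appears in the derivation above.

```python
import math

def information(pdf, lo, hi, n=20000):
    """Midpoint-rule approximation of I(p) = integral of p(x) log p(x) dx."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * h)
        if p > 0.0:                      # p log p tends to 0 as p tends to 0
            total += p * math.log(p) * h
    return total

SIGMA = 1.0  # common standard deviation; every density below has mean 0

def normal(x):
    return math.exp(-0.5 * (x / SIGMA) ** 2) / (math.sqrt(2 * math.pi) * SIGMA)

def laplace(x):
    b = SIGMA / math.sqrt(2.0)           # Laplace variance is 2 b^2
    return math.exp(-abs(x) / b) / (2 * b)

def uniform(x):
    half_width = math.sqrt(3.0) * SIGMA  # variance is (2*half_width)^2 / 12
    return 1.0 / (2 * half_width) if abs(x) <= half_width else 0.0

I_normal = information(normal, -12.0, 12.0)
I_laplace = information(laplace, -12.0, 12.0)
I_uniform = information(uniform, -12.0, 12.0)
I_normal_exact = -0.5 * math.log(2 * math.pi * math.e * SIGMA ** 2)
```

Of the three, the normal density gives the smallest value of $I(p)$, in agreement with the variational argument.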
(d) The normal distribution is particularly tractable mathematically.
Many results can be worked out explicitly only for this distribution. Therefore,
it is frequently convenient to assume a normal distribution where no specific
justification for it exists. This is unlikely to cause much harm, except when the
estimation method selected is very sensitive to the shape of the tails of the
distributions.
A normal distribution of an $n$-dimensional random vector $x$ is completely
characterized by the mean and the covariance matrix $V$, as shown by
Eq. (10). We assumed that $V$ was nonsingular; otherwise $V^{-1}$ could not have
been formed. When $V$ is singular, i.e., $\det V = 0$, we speak of a singular
normal distribution. If $m < n$ is the rank of $V$, then there exist $m$ linear
combinations of the $x$ which possess a nonsingular normal distribution.†
In a normal distribution, uncorrelated variables are independent. The
mean, mode, and median coincide. The following are additional useful
properties of the normal distribution. We assume throughout that $x$ is $N_n(0, V)$
(this is just another way of saying that $x$ is an $n$-dimensional normally
distributed random vector with mean 0 and covariance $V$). Then
1. $Ax$ is $N_m(0, AVA^T)$, where $A$ is any $m \times n$ matrix.
2. Let $C$ be an $n \times n$ matrix, such that $C^T C = V^{-1}$. Then $y \equiv Cx$ is
$N_n(0, I)$, i.e., the elements of $y$ are $n$ independent normal variables with zero
means and unit variances. Such variables are called standard normal deviates.
3. If $y$ is $N_n(0, I)$, then $y^T y$ is a random variable whose distribution is
called chi-square with $n$ degrees of freedom, designated $\chi_n^2$. In other words,
$\chi_n^2$ is the distribution of the sum of squares of $n$ independent standard normal
deviates.
4. We have $y^T y = x^T C^T C x = x^T V^{-1} x$. Hence, $x^T V^{-1} x$ is $\chi_n^2$.
5. If $\mathcal{A}$ and $\mathcal{B}$ are independent random variables with distributions
$\chi_p^2$ and $\chi_q^2$, respectively, then $q\mathcal{A}/p\mathcal{B}$ is a random variable whose distribution
is designated $F_{p,q}$.
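Properties 2 and 4 can be verified on a small example. The sketch below builds $C$ for a $2 \times 2$ covariance matrix as the inverse of its Cholesky factor, which is one of many matrices satisfying $C^T C = V^{-1}$, and checks that $y^T y$ equals $x^T V^{-1} x$. The matrix, the vector, and the helper names are assumptions chosen for illustration.

```python
import math

def cholesky_2x2(V):
    """Lower-triangular L with L L^T = V, for a symmetric positive definite 2x2 V."""
    l11 = math.sqrt(V[0][0])
    l21 = V[1][0] / l11
    l22 = math.sqrt(V[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

def whiten(V, x):
    """y = C x with C = L^{-1}, computed by forward substitution on L y = x."""
    L = cholesky_2x2(V)
    y1 = x[0] / L[0][0]
    y2 = (x[1] - L[1][0] * y1) / L[1][1]
    return [y1, y2]

def quad_form_inv(V, x):
    """x^T V^{-1} x using the explicit inverse of a 2x2 matrix."""
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    return (V[1][1] * x[0] ** 2 - 2 * V[0][1] * x[0] * x[1]
            + V[0][0] * x[1] ** 2) / det

V = [[4.0, 1.2], [1.2, 1.0]]   # hypothetical covariance matrix
x = [1.5, -0.7]                # hypothetical realization of x
y = whiten(V, x)
yTy = y[0] ** 2 + y[1] ** 2
xTVinvx = quad_form_inv(V, x)
```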
The $\chi^2$ and $F$ distributions play an important role in establishing
confidence intervals for estimated parameters, and in testing the goodness of
fit of the model to the data (see Chapter VII).
† These are the principal components corresponding to nonzero eigenvalues of $V$.
See Section 7-8.
2-9. The Uniform Distribution
The uniform distribution (also called rectangular) is one in which the
range of possible values of each variable is confined to a finite interval, and
all values in the interval are equally likely. Thus, if $a$ and $b$ are $n$-dimensional
vectors with $b_i > a_i$ for $i = 1, 2, \ldots, n$, then the uniform distribution within
the $n$-dimensional rectangle $a_i \leq x_i \leq b_i$ is given by:
$$p(x) = \begin{cases} [(b_1 - a_1)(b_2 - a_2) \cdots (b_n - a_n)]^{-1} & \text{for } a_i \leq x_i \leq b_i,\ i = 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases} \tag{2-9-1}$$
For this distribution we have:
$$\bar{x} = \tfrac{1}{2}(a + b) \tag{2-9-2}$$
$$V_{ii} = (1/12)(b_i - a_i)^2, \qquad V_{ij} = 0 \quad (i \neq j) \tag{2-9-3}$$
The components of the vector $x$ are independent.
The uniform distribution frequently describes the errors of measurement
due to the limited number of significant digits that can be read on a scale,
because all intermediate values between scale marks are equally probable.
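The scale-reading case can be sketched as follows: true values spread evenly over one scale division of width $d$ are read to the nearest mark, and the resulting errors are compared with the mean and variance that Eqs. (2-9-2) and (2-9-3) give for $a = -d/2$, $b = d/2$. The division width and the evenly spaced grid of true values are assumptions of this example.

```python
def rounding_errors(d, n=10000):
    """Errors (reading minus true value) when true values evenly spaced over
    one scale division of width d are read to the nearest scale mark."""
    errors = []
    for i in range(n):
        true = (i + 0.5) * d / n          # evenly spaced in (0, d)
        reading = round(true / d) * d     # nearest mark: 0 or d
        errors.append(reading - true)
    return errors

d = 0.1                                   # hypothetical scale division
errs = rounding_errors(d)
mean_err = sum(errs) / len(errs)
var_err = sum((e - mean_err) ** 2 for e in errs) / len(errs)
var_uniform = d ** 2 / 12.0               # Eq. (2-9-3) with b - a = d
```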
The assumption of a uniform distribution with known bounds for all
the errors in a model implies that all these errors are restricted in magnitude.
This means that the model must be rejected if no parameter values can be
found that keep all the errors within the permitted bounds. The use of this
distribution can be justified only if one is willing to accept such drastic con-
clusions. One must be quite certain that the measurements really differ from
the true values of the variables by no more than the specified error bounds,
and that no other random factors have been overlooked in the model. In
contrast, the normal distribution assigns nonzero (though small) probabilities
to any error, no matter how large. It is more forgiving towards inadequacies
in the model, and does not break down upon the appearance of an occasional
unexpectedly large error. This objection to the uniform distribution does not
apply if the upper bound on the error magnitude is not known in advance.
2-10. Distribution of Errors
Attempting to relate the deterministic model to the data gathered from
$n$ experiments, we are led to the set of equations
$$g_\mu(\theta) = 0 \qquad (\mu = 1, 2, \ldots, n) \tag{2-10-1}$$
The total number of equations in Eq. (1) usually far exceeds the number of
unknown parameters $\theta$. Only under exceptional circumstances do there exist
values of $\theta$ which cause all the equations in Eq. (1) to be satisfied. Indeed
we cannot expect all these equations to be satisfied, since
1. The measured values of the variables do not always represent their
true values.
2. The model is not exactly accurate, various effects having been neglected
in its formulation.
To account for errors of type 1, we scan the list of all the measured
quantities $Z$, and break it up into two sets: quantities $U$ which are believed
to be free of significant error, and quantities $W$ whose measured values may
differ significantly, in a random manner, from their true underlying values,
which we designate $\hat{W}$. The difference between the measured and true values
of a variable we call the error
$$E \equiv W - \hat{W} \tag{2-10-2}$$
that is
$$e_\mu \equiv w_\mu - \hat{w}_\mu \qquad (\mu = 1, 2, \ldots, n)$$
We now assume that each $w_{\mu a}$ is a realization of a random variable $\omega_{\mu a}$,
or, equivalently, that $W$ is a realization of a matrix random variable $\Omega$. This
means that $W$ is one sample out of all possible results of our series of $n$
experiments. Furthermore, we assume that the random variables $\omega_{\mu a}$ possess
a joint pdf which depends on the true values $\hat{w}_{\mu a}$, as well as on some
parameters $\psi$, whose values may or may not be known. Thus, the pdf has the
form $p(\Omega \mid \hat{W}, \psi)$. It is usually the case that the pdf depends explicitly on the
$\Omega$ and $\hat{W}$ only through their difference, i.e., it has the form $p(E \mid \psi)$. It is also
frequently the case that the errors in different experiments are statistically
independent. That means that we have a pdf $p_\mu(e_\mu \mid \psi_\mu)$ associated with the
errors $e_\mu$ in the $\mu$th experiment, and the joint pdf for all experiments is given
by
$$p(E \mid \psi_1, \psi_2, \ldots, \psi_n) = \prod_{\mu=1}^{n} p_\mu(e_\mu \mid \psi_\mu) \tag{2-10-3}$$
To illustrate, assume that the errors in the $\mu$th experiment are distributed as
$N_k(0, V_\mu)$. Then
$$p_\mu(e_\mu \mid V_\mu) = (2\pi)^{-k/2} (\det V_\mu)^{-1/2} \exp(-\tfrac{1}{2} e_\mu^T V_\mu^{-1} e_\mu) \tag{2-10-4}$$
Hence, the joint pdf is given by
$$p(E \mid V_1, V_2, \ldots, V_n) = (2\pi)^{-nk/2} \prod_{\mu=1}^{n} (\det V_\mu)^{-1/2} \exp\Bigl(-\tfrac{1}{2} \sum_{\mu=1}^{n} e_\mu^T V_\mu^{-1} e_\mu\Bigr) \tag{2-10-5}$$
The vector of distribution parameters $\psi$ here consists of the elements of the
matrices $V_\mu$.
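Because the experiments are independent, the logarithm of the joint pdf of Eq. (2-10-5) is simply the sum of the logarithms of the single-experiment densities of Eq. (2-10-4). The sketch below shows this for error vectors of dimension 2. The dimension, the explicit $2 \times 2$ inverse, and the sample errors and covariances are assumptions made to keep the example short.

```python
import math

DIM = 2  # dimension of each error vector e_mu (fixed at 2 in this sketch)

def log_pdf_experiment(e, V):
    """Logarithm of Eq. (2-10-4) for a 2-vector e and a 2x2 covariance V."""
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    quad = (V[1][1] * e[0] ** 2 - 2 * V[0][1] * e[0] * e[1]
            + V[0][0] * e[1] ** 2) / det          # e^T V^{-1} e
    return -0.5 * DIM * math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def log_joint_pdf(E, Vs):
    """Logarithm of Eq. (2-10-5): independent experiments, so log densities add."""
    return sum(log_pdf_experiment(e, V) for e, V in zip(E, Vs))

# Hypothetical errors and covariance matrices for n = 3 experiments
E = [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.1]]
Vs = [[[1.0, 0.2], [0.2, 0.5]]] * 3

log_p = log_joint_pdf(E, Vs)

# Sanity case: zero errors with identity covariances give -(n*DIM/2) log(2 pi)
identity = [[1.0, 0.0], [0.0, 1.0]]
log_p_zero = log_joint_pdf([[0.0, 0.0]] * 3, [identity] * 3)
expected_zero = -3 * math.log(2 * math.pi)
```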
It must be remembered that when we speak here of random variables we
are referring to the results of the measurements, not to the choice of
experimental conditions. In many cases, the experimental conditions are selected
randomly, e.g., by drawing individuals at random from a population. This
does not concern us here; once the individual has been chosen, he ceases to
be random. What we are interested in are the random differences that may
arise between repeated measurements on the same individual.
The values of $w_\mu$ and $w_\eta$ for $\mu \neq \eta$ are usually realizations of different
vector random variables $\omega_\mu$ and $\omega_\eta$. Only in the case when experiments $\mu$ and
$\eta$ are replications of each other are $w_\mu$ and $w_\eta$ realizations of one and the
same random variable.
2-11. Stochastic Form of the Model
How can the deterministic model be modified so as to account for the
variability in the data and model? There are several ways in which this can
be done. The specific form chosen should depend on what we know about the
system being described. Do we have strong confidence in the model but not
in the data? Do we trust the data but not the model? Perhaps both are subject
to significant errors? The type of model that is appropriate in a given situa-
tion depends on the answers to these questions. We list below some of the
forms that the model may take. A typical example which illustrates the con-
ditions under which these forms are appropriate follows in Section 2-14.
(a) Suppose the data are subject to measurement errors, but the model
equations are thought to apply exactly, to the true (though unknown) values
of the variables
$$g(u_\mu, \hat{w}_\mu, \theta) = 0 \qquad (\mu = 1, 2, \ldots, n) \tag{2-11-1}$$
We refer to Eq. (1) as an exact structural model. The measurement errors
$E = W - \hat{W}$ are assumed to have a joint pdf $p(E \mid \psi)$.
(b) Suppose the model equations apply only approximately even to the
true values of the variables. The error in the model at the $\mu$th experiment is
assumed to be a random variable $\gamma_\mu$
$$g(u_\mu, \hat{w}_\mu, \theta) = \gamma_\mu \qquad (\mu = 1, 2, \ldots, n) \tag{2-11-2}$$
We refer to Eq. (2) as an inexact structural model. In conformity with our
usual notation, we let $\Gamma$ be the matrix whose $\mu$th row is $\gamma_\mu^T$. The $\gamma_\mu$ are
supposed to account for forces that were neglected in the formulation of the
model. One usually assumes that the $\gamma_\mu$ have zero means, and that they are
statistically independent of the measurement errors. Then the overall pdf
applicable to this model has the form $p(E \mid \psi)\,p(\Gamma \mid \psi')$, where $\psi'$ is an additional
set of distribution parameters.
(c) A special case of (b) occurs when all variables are measured precisely
so that $w$ is vacuous. Then
$$g(u_\mu, \theta) = \gamma_\mu \qquad (\mu = 1, 2, \ldots, n) \tag{2-11-3}$$
The relevant pdf is simply $p(\Gamma \mid \psi')$. Let us introduce an artificial variable
$y_\mu$ to which we assign the "observed" value zero, and let us define $e_\mu \equiv -\gamma_\mu$.
Then Eq. (3) is equivalent to $y_\mu = g(u_\mu, \theta) + e_\mu$, which has the same form as
the reduced model Eq. (9) discussed below.
(d) In some applications, particularly in the field of econometrics, it has
been found appropriate not to introduce the "true value" $\hat{w}_\mu$ explicitly, but
rather to treat the model equations as applying approximately to the
measured values
$$g(u_\mu, w_\mu, \theta) = \gamma_\mu \qquad (\mu = 1, 2, \ldots, n) \tag{2-11-4}$$
where $\gamma_\mu$ is an error term which is not treated as a random variable in its
own right. Rather, one assumes that $w_\mu$ is a random variable distributed in
such a way that $\gamma_\mu = g(u_\mu, w_\mu, \theta)$ (regarded as a function of $w_\mu$) has a given
pdf $p(\gamma_\mu)$. The pdf for the original variable $w_\mu$ is then obtained according to
the rules for transforming variables in probability distributions
$$p(w_\mu) = p(\gamma_\mu)\,\lvert\det(\partial g_\mu/\partial w_\mu)\rvert \tag{2-11-5}$$
The quantity $\det(\partial g_\mu/\partial w_\mu)$ is the Jacobian of the transformation from $w_\mu$
to $g_\mu$, and the dimension of $w$ must be the same as that of $g$. The
econometricians refer to the $w$ as the endogenous variables.
Example. The following two-equation production model is due to Bodkin
and Klein [1967, Eqs. (12) and (18)]
$$g_{\mu 1} \equiv w_{\mu 1} - \theta_3\,\theta_2^{u_{\mu 1}}\,w_{\mu 2}^{1-\theta_1} = \gamma_{\mu 1}, \qquad g_{\mu 2} \equiv w_{\mu 1} - u_{\mu 2}/\theta_1 = \gamma_{\mu 2} \tag{2-11-6}$$
where $w_1$ is the ratio of real production output to labor input, $w_2$ the ratio
of capital input to labor input, $u_1$ the time, and $u_2$ the ratio of wage rate to
price of output. Here
$$\det(\partial g_\mu/\partial w_\mu) = \begin{vmatrix} 1 & -(1-\theta_1)\,\theta_3\,\theta_2^{u_{\mu 1}}\,w_{\mu 2}^{-\theta_1} \\ 1 & 0 \end{vmatrix} = (1-\theta_1)\,\theta_3\,\theta_2^{u_{\mu 1}}\,w_{\mu 2}^{-\theta_1} \tag{2-11-7}$$
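If $[\gamma_{\mu 1}, \gamma_{\mu 2}]$ are assumed distributed as $N_2(0, V)$, then $[w_{\mu 1}, w_{\mu 2}]$ have the
pdf
$$p(w_\mu) = \lvert(1-\theta_1)\,\theta_3\,\theta_2^{u_{\mu 1}}\,w_{\mu 2}^{-\theta_1}\rvert\,(2\pi)^{-1}(\det V)^{-1/2}\exp(-\tfrac{1}{2} g_\mu^T V^{-1} g_\mu) \tag{2-11-8}$$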
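The Jacobian of the example can be checked against central finite differences. The sketch below assumes the model functions in the form $g_1 = w_1 - \theta_3\theta_2^{u_1}w_2^{1-\theta_1}$ and $g_2 = w_1 - u_2/\theta_1$, a reading of Eq. (2-11-6) that is consistent with the determinant in Eq. (2-11-7). The numerical values of $w$, $u$, and $\theta$ are invented for illustration.

```python
def g(w, u, theta):
    """The two structural functions, in the form assumed here for Eq. (2-11-6)."""
    t1, t2, t3 = theta
    g1 = w[0] - t3 * t2 ** u[0] * w[1] ** (1.0 - t1)
    g2 = w[0] - u[1] / t1
    return [g1, g2]

def jacobian_det_analytic(w, u, theta):
    """Right-hand side of Eq. (2-11-7)."""
    t1, t2, t3 = theta
    return (1.0 - t1) * t3 * t2 ** u[0] * w[1] ** (-t1)

def jacobian_det_numeric(w, u, theta, h=1e-6):
    """det(dg/dw) for the 2x2 case, by central differences."""
    cols = []                      # cols[j][i] approximates dg_i / dw_j
    for j in range(2):
        wp, wm = list(w), list(w)
        wp[j] += h
        wm[j] -= h
        gp, gm = g(wp, u, theta), g(wm, u, theta)
        cols.append([(gp[i] - gm[i]) / (2 * h) for i in range(2)])
    return cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]

w, u, theta = [2.0, 3.0], [5.0, 1.2], [0.7, 1.02, 1.5]   # hypothetical values
det_a = jacobian_det_analytic(w, u, theta)
det_n = jacobian_det_numeric(w, u, theta)
```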
(e) Suppose the dimension of $w$ is equal to the dimension of $g$, and
suppose further that the structural equations $g(u, w, \theta) = 0$ can be solved for $w$.
Then we obtain the reduced model $w = f(u, \theta)$. In the $\mu$th experiment there
may be errors of two kinds: errors $e_{\mu 1}$ in the measurement of $w$, and errors
$e_{\mu 2}$ in the model equations. In conformity with usual practice when dealing
with reduced models, we write $y$ in place of $w$ and $x$ in place of $u$. The model
now takes the form
$$y_\mu = f(x_\mu, \theta) + e_\mu = f_\mu(\theta) + e_\mu \tag{2-11-9}$$
where $e_\mu \equiv e_{\mu 1} + e_{\mu 2}$. If we define $\hat{y}_\mu \equiv y_\mu - e_\mu$, then we may write the model
as
$$\hat{y}_\mu = f(x_\mu, \theta) = f_\mu(\theta) \tag{2-11-10}$$
The quantity $\hat{y}_\mu$ cannot legitimately be thought of as a "true" value of $y_\mu$
unless $e_{\mu 2}$ is negligible. The relevant pdf for the model Eq. (9) has the form
$p(e_{\mu 1}, e_{\mu 2})$, but in practice the dual nature of the errors is usually ignored, and
the pdf is written simply as $p(e_\mu)$. The joint pdf for all the errors has the form
$p(E \mid \psi)$.
We refer to a reduced model 111 which the Independent variables x are
measured precisely as a standard reduced model. The appropriate representa-
tion of such a model is given by Eq. (9) or, equivalently, by Eq. (10). Of all
nonlinear models, this is the one for which the calculation of the estimates is
easiest. For this reason, it is tempting to neglect errors in XII in any reduced
model, regardless of whether this is justified in physical fact. The resulting
errors in the estimates are difficult to predict, and a Monte Carlo study would
be appropriate (see Section 3-3). Be that as it may, the vast majority of all
nonlinear estimation calculations have in practice been undertaken on the
implicit assumption that the model was in standard reduced form.
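A Monte Carlo study of the kind suggested above might be sketched as follows. The model here (a straight line through the origin, $y = \theta x$), the error levels, and the function names are all assumptions made for illustration; they do not come from the text. The slope is estimated by least squares as if the model were in standard reduced form, once with precise $x$ and once with errors in $x$ that the estimator ignores.

```python
import random


def slope_through_origin(xs, ys):
    """Least squares slope for y = theta * x, treating xs as error free."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def monte_carlo_slope(theta, x_true, sd_x, sd_y, trials, seed=0):
    """Average slope estimate over repeated simulated experiments.

    sd_x is the standard deviation of the (ignored) errors in x,
    sd_y the standard deviation of the errors in y.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        x_obs = [x + rng.gauss(0.0, sd_x) for x in x_true]
        y_obs = [theta * x + rng.gauss(0.0, sd_y) for x in x_true]
        estimates.append(slope_through_origin(x_obs, y_obs))
    return sum(estimates) / trials


x_true = [float(i) for i in range(1, 11)]
print("precise x :", monte_carlo_slope(2.0, x_true, 0.0, 0.5, 2000))
print("noisy x   :", monte_carlo_slope(2.0, x_true, 1.0, 0.5, 2000))
```

In this particular setup the average estimate with noisy $x$ falls below the true slope, which illustrates why neglecting errors in $\mathbf{x}_\mu$ should be a deliberate decision rather than a default.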
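2-12. Likelihood-Standard Reduced Model
The standard reduced model Eq. (2-11-9) can be put in the more concise
form
$$\mathbf{Y} = \mathbf{F}(\mathbf{X}, \boldsymbol{\theta}) + \mathbf{E}$$
(2-12-1)
Suppose the model is specified, along with the joint pdf $p(\mathbf{E} \mid \boldsymbol{\psi})$ and with the
data $\mathbf{Y}$, $\mathbf{X}$. For any given values of the parameters $\boldsymbol{\theta}$ we can compute the
residuals
$$\mathbf{E}(\boldsymbol{\theta}) \equiv \mathbf{Y} - \mathbf{F}(\mathbf{X}, \boldsymbol{\theta})$$
(2-12-2)
i.e., the differences $e_{\mu a}(\boldsymbol{\theta})$ between the observed values $y_{\mu a}$ and the "computed"
values $f_a(\mathbf{x}_\mu, \boldsymbol{\theta})$ of the dependent variables. If $\boldsymbol{\theta}$ is close to its true value,
then $\mathbf{E}(\boldsymbol{\theta})$ should be close to the true errors $\mathbf{E}$. In the joint pdf, let us replace
the errors by the expressions for the residuals. The resulting expression, which
is a function of $\boldsymbol{\theta}$ and $\boldsymbol{\psi}$ alone, is called the likelihood function of the sample
$$L(\boldsymbol{\theta}, \boldsymbol{\psi}) \equiv p(\mathbf{E}(\boldsymbol{\theta}) \mid \boldsymbol{\psi}) = p(\mathbf{Y} - \mathbf{F}(\mathbf{X}, \boldsymbol{\theta}) \mid \boldsymbol{\psi})$$
(2-12-3)
Note that since $\mathbf{X}$ and $\mathbf{Y}$ are known quantities, they do not appear as variables
among the arguments of the likelihood function.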
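As an example, suppose the pdf is given by Eq. (2-10-5), i.e., the errors in
the $\mu$th experiment are distributed as $N_m(\mathbf{0}, \mathbf{V}_\mu)$, and errors in different
experiments are independent. The likelihood is obtained by substituting
$\mathbf{y}_\mu - \mathbf{f}(\mathbf{x}_\mu, \boldsymbol{\theta})$ for $\boldsymbol{\epsilon}_\mu$ in Eq. (2-10-5)
$$L(\boldsymbol{\theta}, \mathbf{V}_1, \mathbf{V}_2, \ldots, \mathbf{V}_n) = (2\pi)^{-nm/2} \prod_{\mu=1}^{n} \det{}^{-1/2}\,\mathbf{V}_\mu \times \exp\left\{-\tfrac{1}{2}\sum_{\mu=1}^{n} [\mathbf{y}_\mu - \mathbf{f}(\mathbf{x}_\mu, \boldsymbol{\theta})]^T \mathbf{V}_\mu^{-1} [\mathbf{y}_\mu - \mathbf{f}(\mathbf{x}_\mu, \boldsymbol{\theta})]\right\}$$
(2-12-4)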
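For computation it is usually the logarithm of Eq. (2-12-4) that is evaluated. The following is a minimal sketch for the special case of a single dependent variable per experiment ($m = 1$), so that each $\mathbf{V}_\mu$ reduces to a scalar variance. The exponential-decay model `decay` and all numerical values are hypothetical, chosen only to make the example self-contained.

```python
import math


def log_likelihood(theta, xs, ys, variances, f):
    """Logarithm of Eq. (2-12-4) for m = 1 (scalar y_mu, scalar variance v_mu).

    f(x, theta) is the reduced model; xs, ys are the data and variances
    holds the assumed error variance of each experiment.
    """
    total = 0.0
    for x, y, v in zip(xs, ys, variances):
        residual = y - f(x, theta)
        total += -0.5 * (math.log(2.0 * math.pi) + math.log(v) + residual ** 2 / v)
    return total


def decay(x, theta):
    """Hypothetical reduced model y = theta_1 * exp(-theta_2 * x)."""
    a, k = theta
    return a * math.exp(-k * x)


xs = [0.0, 1.0, 2.0]
ys = [decay(x, (2.0, 0.5)) for x in xs]  # error-free "data" for illustration
variances = [1.0, 1.0, 1.0]
print(log_likelihood((2.0, 0.5), xs, ys, variances, decay))
print(log_likelihood((2.0, 0.6), xs, ys, variances, decay))
```

With error-free data the log-likelihood is largest at the parameter values that generated the data, as the second call illustrates.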
The likelihood function can be defined in more generality as follows:
take the joint pdf of the deviations or errors, and substitute for all random
variables their sample values in the form of expressions involving measured
variables and unknown parameters; the resulting expression is the likelihood
function. In the next section we carry out this procedure for several additional
models.
2-13. Likelihood-Structural Models
(a) Exact Structural Model. Referring to Eq. (2-11-1) we find that the $\hat{\mathbf{w}}_\mu$
appear as additional unknown parameters in the model. We define the
residuals here as the differences between the measured values $\mathbf{w}_\mu$ and any
particular assumed values for $\hat{\mathbf{w}}_\mu$, i.e.,
$$\mathbf{E}(\hat{\mathbf{W}}) \equiv \mathbf{W} - \hat{\mathbf{W}}$$
(2-13-1)
Hence the likelihood function, as derived from the pdf $p(\mathbf{E} \mid \boldsymbol{\psi})$, has the form
$$L(\hat{\mathbf{W}}, \boldsymbol{\psi}) \equiv p(\mathbf{W} - \hat{\mathbf{W}} \mid \boldsymbol{\psi})$$
(2-13-2)
The parameters $\hat{\mathbf{W}}$ are not free to assume any values whatsoever; they are
constrained to satisfy the structural Eqs. (2-11-1).
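As an example, when the joint pdf is given by Eq. (2-10-5), all we have to
do is substitute $\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu$ for $\boldsymbol{\epsilon}_\mu$ to form the likelihood
$$L(\mathbf{V}_1, \mathbf{V}_2, \ldots, \mathbf{V}_n, \hat{\mathbf{w}}_1, \hat{\mathbf{w}}_2, \ldots, \hat{\mathbf{w}}_n) = (2\pi)^{-nr/2} \prod_{\mu=1}^{n} \det{}^{-1/2}\,\mathbf{V}_\mu \times \exp\left[-\tfrac{1}{2}\sum_{\mu=1}^{n} (\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu)^T \mathbf{V}_\mu^{-1} (\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu)\right]$$
(2-13-3)
Note again that $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_n$ are known vectors (being the measured
data). Hence they do not appear among the arguments of $L$.
Since an exact structural model requires a large number of additional
unknown parameters $\hat{\mathbf{W}}$, it is desirable to transform it to reduced form when
possible.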
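(b) Inexact Structural Model. Let the model be described by Eq. (2-11-2).
Once more, the residuals are defined by Eq. (1), and they take the place of $\mathbf{E}$
in the pdf $p(\mathbf{E} \mid \boldsymbol{\psi})\,p(\boldsymbol{\Gamma} \mid \boldsymbol{\psi}')$. An expression for $\boldsymbol{\Gamma}$ is obtained simply by evaluating
Eq. (2-11-2) for specific values of $\hat{\mathbf{W}}$ and $\boldsymbol{\theta}$, i.e., by substituting $\mathbf{G}(\mathbf{U}, \hat{\mathbf{W}}, \boldsymbol{\theta})$ for
$\boldsymbol{\Gamma}$. Thus
$$L(\hat{\mathbf{W}}, \boldsymbol{\theta}, \boldsymbol{\psi}, \boldsymbol{\psi}') = p(\mathbf{W} - \hat{\mathbf{W}} \mid \boldsymbol{\psi})\,p[\mathbf{G}(\mathbf{U}, \hat{\mathbf{W}}, \boldsymbol{\theta}) \mid \boldsymbol{\psi}']$$
(2-13-4)
In this case, no restrictions apply to the $\boldsymbol{\theta}$ and $\hat{\mathbf{W}}$. As an example, suppose the
errors in $\mathbf{W}$ are again distributed as Eq. (2-10-5) and that the $\boldsymbol{\gamma}_\mu$ are similarly
distributed as $N_m(\mathbf{0}, \mathbf{Q}_\mu)$. We obtain the likelihood function by substituting
$\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu$ for $\boldsymbol{\epsilon}_\mu$, and $\mathbf{g}_\mu$ for $\boldsymbol{\gamma}_\mu$
$$L(\boldsymbol{\theta}, \mathbf{V}_1, \ldots, \mathbf{V}_n, \mathbf{Q}_1, \ldots, \mathbf{Q}_n, \hat{\mathbf{w}}_1, \ldots, \hat{\mathbf{w}}_n) = (2\pi)^{-(n/2)(r+m)} \prod_{\mu=1}^{n} \left(\det{}^{-1/2}\,\mathbf{Q}_\mu\,\det{}^{-1/2}\,\mathbf{V}_\mu\right)$$
$$\times \exp\left\{-\tfrac{1}{2}\sum_{\mu=1}^{n} \left[(\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu)^T \mathbf{V}_\mu^{-1} (\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu) + \mathbf{g}_\mu^T(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \boldsymbol{\theta})\,\mathbf{Q}_\mu^{-1}\,\mathbf{g}_\mu(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \boldsymbol{\theta})\right]\right\}$$
(2-13-5)
Again, $\mathbf{u}_\mu$ and $\mathbf{w}_\mu$, being known vectors, do not appear as variables among
the arguments of $L$.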
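(c) For the econometric models discussed under (d) in Section 2-11, the
likelihood function is found by multiplying the terms Eq. (2-11-5) for all
values of $\mu$, i.e.,
$$L(\boldsymbol{\theta}) = \prod_{\mu=1}^{n} p(\mathbf{w}_\mu) = \prod_{\mu=1}^{n} p(\mathbf{g}_\mu)\,|\det(\partial \mathbf{g}_\mu/\partial \mathbf{w}_\mu)|$$
(2-13-6)
For the model of Eqs. (2-11-6) this turns out to be
$$L(\boldsymbol{\theta}, \mathbf{V}) = \left[(1-\theta_1)\theta_3(2\pi)^{-1}(\det \mathbf{V})^{-1/2}\right]^n \times \theta_2^{\sum_{\mu=1}^{n} u_{\mu 1}} \left[\prod_{\mu=1}^{n} w_{\mu 2}\right]^{-\theta_1} \exp\left[-\tfrac{1}{2}\sum_{\mu=1}^{n} \mathbf{g}_\mu^T \mathbf{V}^{-1} \mathbf{g}_\mu\right]$$
(2-13-7)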
The reducibility of an exact model $\mathbf{g}(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \boldsymbol{\theta}) = \mathbf{0}$ to the form $\hat{\mathbf{w}}_\mu =
\mathbf{f}(\mathbf{u}_\mu, \boldsymbol{\theta})$ depends primarily on the relation between the number $m$ of equations
and the number $r$ of random variables per experiment. We distinguish three
cases:
1. $r = m$. Except in certain singular cases (vanishing Jacobian) the equations
may, in principle, be solved for the $\hat{\mathbf{w}}_\mu$ (although the solution may not
be unique). Even when the solution cannot be exhibited in explicit form, it
can be computed numerically. Thus, at least in principle, the $\hat{\mathbf{w}}_\mu$ may be
eliminated from the likelihood function, which remains a function of $\boldsymbol{\theta}$ and
$\boldsymbol{\psi}$ alone.
2. $r > m$. Not all the $\hat{\mathbf{w}}_\mu$ can be eliminated. We can, however, choose $m$ of
the components of $\hat{\mathbf{w}}_\mu$, solve for those, and substitute in the likelihood function. This leaves
us with only $r - m$ unknown components of $\hat{\mathbf{w}}_\mu$ per experiment, and these are unrestricted
in value.
3. $r < m$. In this case, we may solve $r$ of the equations for the $\hat{\mathbf{w}}_\mu$, and use
these to eliminate the $\hat{\mathbf{w}}_\mu$ from the likelihood and from the remaining $m - r$
model equations. For each experiment there remain $m - r$ equations, containing
only $\mathbf{u}_\mu$ and $\boldsymbol{\theta}$. Therefore, if $n$ is the number of experiments, the total
number of equations is $(m - r)n$, and the model must contain at least $(m - r)n$
unknown parameters $\boldsymbol{\theta}$ for there to be a solution. In most cases we find that
the number of random variables per experiment must at least equal the number
of equations.
2-14. An Example
The following example should clarify the conditions under which the
various types of model are appropriate. A sphere of radius $r$ and mass $m$ is
dropping freely through an incompressible Newtonian fluid of viscosity $\mu$.
The force of gravity acting on the sphere is $g(m - m_0)$, where $m_0$ is the mass
of the fluid displaced by the sphere. According to Stokes's law, the drag
opposing the motion of the sphere (when the motion is slow) is $6\pi r\mu v$, where
$v$ is the velocity of the sphere. Newton's second law of motion takes the form
$$m\dot{v} = g(m - m_0) - 6\pi r\mu v$$
(2-14-1)
where $\dot{v} \equiv dv/dt$ is the acceleration. We may rewrite this as
$$\dot{v} + \alpha v = \beta$$
(2-14-2)
where
$$\alpha \equiv 6\pi r\mu/m, \qquad \beta \equiv g(m - m_0)/m$$
(2-14-3)
Assuming that the sphere was initially at rest, we can integrate Eq. (2) to find
$$v(t) = (\beta/\alpha)(1 - e^{-\alpha t})$$
(2-14-4)
The distance $s$ traveled by the sphere since the inception of its motion is
$$s = \int_0^t v(\tau)\,d\tau = (\beta/\alpha)t - (\beta/\alpha^2)(1 - e^{-\alpha t})$$
which can be translated into the "model"
$$g(s, t, \alpha, \gamma) \equiv s - \gamma[t - (1/\alpha)(1 - e^{-\alpha t})] = 0$$
(2-14-5)
where
$$\gamma \equiv \beta/\alpha = g(m - m_0)/6\pi r\mu$$
(2-14-6)
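As a check on the integration leading to Eq. (2-14-5), one can integrate Eq. (2-14-2) numerically and compare the distance traveled with the closed form. The sketch below uses a classical fourth-order Runge-Kutta step; the function names, the step count, and the values of $\alpha$ and $\beta$ are arbitrary choices made for illustration.

```python
import math


def distance_closed_form(t, alpha, gamma):
    """s(t) solved from Eq. (2-14-5): s = gamma * [t - (1/alpha)(1 - exp(-alpha t))]."""
    return gamma * (t - (1.0 - math.exp(-alpha * t)) / alpha)


def distance_numeric(t_end, alpha, beta, steps=2000):
    """Integrate dv/dt = beta - alpha*v, ds/dt = v from rest with RK4."""
    h = t_end / steps
    v, s = 0.0, 0.0

    def dv(vel):
        return beta - alpha * vel

    for _ in range(steps):
        k1v, k1s = dv(v), v
        k2v, k2s = dv(v + 0.5 * h * k1v), v + 0.5 * h * k1v
        k3v, k3s = dv(v + 0.5 * h * k2v), v + 0.5 * h * k2v
        k4v, k4s = dv(v + h * k3v), v + h * k3v
        v += h * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        s += h * (k1s + 2.0 * k2s + 2.0 * k3s + k4s) / 6.0
    return s


alpha, beta = 2.0, 3.0  # arbitrary illustrative values
gamma = beta / alpha    # Eq. (2-14-6)
print(distance_numeric(1.5, alpha, beta), distance_closed_form(1.5, alpha, gamma))
```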
Suppose we have measurements $s_1, s_2, \ldots, s_n$ recorded when the clock
indicated times $t_1, t_2, \ldots, t_n$. We are interested in estimating some of the
physical constants appearing in the model. At the outset it is clear that the
model equation contains only two independent parameters. Hence only two
of the physical constants $g$, $r$, $m$, $m_0$, $\mu$ appearing in the model can be estimated
independently. Since all the information contained in Eq. (5) relative to these
constants is derivable from the values of $\alpha$ and $\gamma$, we shall assume that these
are the parameters to be estimated. We examine the following cases, in all
of which it is assumed that errors in different measurements are statistically
independent. The parameters of the error distributions are represented as $\boldsymbol{\psi}$.
(a) The model equation Eq. (5) is exact, i.e., any systematic deviations from
it are negligible compared to measurement errors.
1. The measurements of $t_\mu$ are precise, but those of $s_\mu$ are subject to
errors with pdf $p(\epsilon_\mu)$. Eq. (5) becomes in reduced form
$$s_\mu = \gamma[t_\mu - (1/\alpha)(1 - \exp(-\alpha t_\mu))] \qquad (\mu = 1, 2, \ldots, n)$$
(2-14-7)
and the likelihood is
$$L(\alpha, \gamma, \boldsymbol{\psi}) = \prod_{\mu=1}^{n} p\{s_\mu - \gamma[t_\mu - (1/\alpha)(1 - \exp(-\alpha t_\mu))] \mid \boldsymbol{\psi}\}$$
(2-14-8)
2. Both $t_\mu$ and $s_\mu$ are subject to measurement errors $\epsilon_t$ and $\epsilon_s$, respectively,
with pdf $p(\epsilon_t, \epsilon_s \mid \boldsymbol{\psi})$. We have the exact structural model
$$\hat{s}_\mu - \gamma[\hat{t}_\mu - (1/\alpha)(1 - \exp(-\alpha \hat{t}_\mu))] = 0 \qquad (\mu = 1, 2, \ldots, n)$$
(2-14-9)
where $\hat{s}_\mu$ and $\hat{t}_\mu$ are the "true" values of $s$ and $t$ at the $\mu$th measurement. The
likelihood is
$$L[\hat{s}_\mu, \hat{t}_\mu\,(\mu = 1, 2, \ldots, n), \boldsymbol{\psi}] = \prod_{\mu=1}^{n} p\{t_\mu - \hat{t}_\mu,\; s_\mu - \hat{s}_\mu \mid \boldsymbol{\psi}\}$$
(2-14-10)
with $\hat{t}_\mu$ and $\hat{s}_\mu$ constrained by Eq. (9). Alternately, we can substitute $\hat{s}_\mu$ from
Eq. (9) into Eq. (10) to obtain
$$L(\hat{t}_\mu\,(\mu = 1, 2, \ldots, n), \alpha, \gamma, \boldsymbol{\psi}) = \prod_{\mu=1}^{n} p\{t_\mu - \hat{t}_\mu,\; s_\mu - \gamma[\hat{t}_\mu - (1/\alpha)(1 - \exp(-\alpha \hat{t}_\mu))] \mid \boldsymbol{\psi}\}$$
(2-14-11)
with no constraints applying to the $\hat{t}_\mu$.
3. Only $t_\mu$ is subject to significant errors. If Eq. (5) could be solved for $t_\mu$
we would have a standard reduced model. Since this is impossible, we again
adopt an exact structural model, except that $\hat{s}_\mu$ is replaced by $s_\mu$ in Eq. (9),
and all references to $s_\mu$ and $\hat{s}_\mu$ are deleted from Eq. (10).
(b) The model Eq. (5) is inexact. For instance, if the sphere is sufficiently
small, then the drag force is randomly perturbed due to the impact of individual
molecules. A Brownian motion is thereby superimposed on the
falling motion of the sphere. Eq. (2) must be amended to read
$$\dot{v} + \alpha v = \beta + \eta$$
(2-14-12)
where $\eta$ is a random variable whose distribution may be derived from the
laws of statistical mechanics. When the equations are integrated, there will
arise a perturbation on $s$, i.e., Eq. (5) will take the form
$$s - \gamma[t - (1/\alpha)(1 - e^{-\alpha t})] = \phi$$
(2-14-13)
where $\phi$ is also a random variable. Let $p(\phi_1, \phi_2, \ldots, \phi_n \mid \alpha, \gamma, \boldsymbol{\psi})$ be the joint
pdf of the $\phi_\mu$. If the measurements of $s_\mu$ and $t_\mu$ are precise relative to the
standard deviation of $\phi$, then we have the likelihood
$$L(\alpha, \gamma, \boldsymbol{\psi}) = p\{s_1 - \gamma[t_1 - (1/\alpha)(1 - \exp(-\alpha t_1))],\; s_2 - \gamma[t_2 - (1/\alpha)(1 - \exp(-\alpha t_2))],\; \ldots \mid \alpha, \gamma, \boldsymbol{\psi}\}$$
(2-14-14)
We have used a joint pdf instead of the product of individual pdf's because
the assumption of independence between observations is tenable here only if
the experiment is restarted from rest for each measurement. Otherwise, if the
disturbances up to time $t_2$ have conspired to make $s_2$ larger than expected
from Eq. (5), then $s$ is likely to remain too large in succeeding periods. See
Problem 4 of Section 8-9.
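The persistence of disturbances described above can be illustrated with a deliberately crude stand-in for the physics: we assume, purely for illustration, that the deviation $\phi$ at each reading equals the deviation at the previous reading plus a fresh independent normal disturbance (a random walk). This is not the distribution that statistical mechanics would supply; it only shows why deviations at successive times in a single uninterrupted run are correlated. All names below are hypothetical.

```python
import random


def simulate_disturbances(n_times, n_runs, seed=0):
    """Each run accumulates independent unit-variance disturbances over time."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        phi = 0.0
        path = []
        for _ in range(n_times):
            phi += rng.gauss(0.0, 1.0)  # disturbance accrued since the last reading
            path.append(phi)
        runs.append(path)
    return runs


def correlation(xs, ys):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


runs = simulate_disturbances(n_times=3, n_runs=5000)
corr_23 = correlation([r[1] for r in runs], [r[2] for r in runs])
print(corr_23)  # for this random-walk assumption the theoretical value is sqrt(2/3)
```

A strongly positive correlation between $\phi_2$ and $\phi_3$ is what makes the product of individual pdf's inappropriate for a single continuous run.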
2-15. Utility of Distribution Assumptions
The problem of estimating the parameters $\boldsymbol{\theta}$ has now been augmented,
inasmuch as we must also estimate the parameters $\boldsymbol{\psi}$, and possibly $\hat{\mathbf{W}}$. We
let $\boldsymbol{\phi}$ denote the entire set of unknown parameters, i.e.,
$$\boldsymbol{\phi} \equiv \{\boldsymbol{\theta}, \boldsymbol{\psi}, \hat{\mathbf{W}}\}$$
(2-15-1)
Those $\hat{\mathbf{W}}$ which could be eliminated (see Section 2-13) are excluded. A scientist
may be reluctant to erect the entire probabilistic superstructure, only to find
himself with a larger problem than he started with. It is true that some parameter
estimation procedures may be applied without making any probabilistic
assumptions. The resulting estimates, however, are rather meaningless. They
may suffice for curve fitting, but nothing will be known about the relationship
between the estimated and true values of the parameters.
Frequently people make implicit assumptions concerning the probability
distribution without realizing the fact. This happens, for instance, when
weights are assigned in the least squares procedure (see Section 4-3). By recognizing
the role of such weights as parameters $\boldsymbol{\psi}$ of a distribution, we are able
to estimate the weights rather than assign them. Thus we are able to shift a
burden from ourselves to the computer (see Sections 4-8 and 4-9).
It should be noted that in some cases (particularly with linear models) it
suffices to specify the covariance matrix of the distribution without commit-
ting oneself to any specific form of the density functions.
D. Prior Information and Posterior Distribution
2-16. Prior Information
The scientist usually has some ideas concerning the values of his param-
eters even before any data have been gathered. He is frequently able to
exclude entirely some values. For instance, the rate constant in a chemical
reaction or the viscosity of a liquid must be positive. An estimation procedure
that came up with negative values for such parameters should be entirely
unacceptable. His physical intuition may lead the scientist to reject some other
values as being entirely implausible, even though they are strictly speaking
not impossible. Even among the admissible values the scientist may regard
some as more plausible than others. For instance, suppose a chemist knows
with great precision the viscosities $\eta_6$ of n-hexane and $\eta_8$ of n-octane, and he
is trying to determine $\eta_7$, the viscosity of n-heptane. Experience with the
properties of homologous series of organic compounds will lead him to reject
entirely values of $\eta_7$ such that $\eta_7 \le \eta_6$ or $\eta_7 \ge \eta_8$. Among the remaining
values, he will prefer those near $(\eta_6 + \eta_8)/2$ to those near $\eta_6$ or $\eta_8$.
2-17. Prior Distribution
The scientist may summarize his prior information in what is called the
prior distribution of the parameters. The prior distribution may be characterized
by means of the prior density function $p_0(\boldsymbol{\phi})$. The prior density function
is required to be nonnegative, and to possess the property that if $\boldsymbol{\phi}_1$ and $\boldsymbol{\phi}_2$
are any two values of $\boldsymbol{\phi}$, then $p_0(\boldsymbol{\phi}_1)/p_0(\boldsymbol{\phi}_2)$ represents the ratio of the plausibility†
of $\boldsymbol{\phi}_1$ to that of $\boldsymbol{\phi}_2$. Note that we do not require the normalization
condition $\int p_0(\boldsymbol{\phi})\,d\boldsymbol{\phi} = 1$. In fact, we do not even require the integral
$\int p_0(\boldsymbol{\phi})\,d\boldsymbol{\phi}$ to exist. Thus, we are permitted to assign the uniform prior density
$p_0(\boldsymbol{\phi}) = 1$ to describe the case when all values of $\boldsymbol{\phi}$ are equally plausible. We
always assign $p_0(\boldsymbol{\phi}) = 0$ to all values of $\boldsymbol{\phi}$ which are to be entirely excluded.
Controversy still rages around the question of whether or not the prior
distribution may be regarded as a true probability distribution. For the statistician
belonging to the frequentist school a probability distribution is
meaningful only when applied to a random variable. When the parameters
represent physical constants, their values are perfectly definite (although unknown),
and they cannot be regarded as random variables. Proponents of
subjective probability (e.g., Savage, 1954) and decision theory (e.g., Raiffa
and Schlaifer, 1961; Ferguson, 1967), however, do indeed admit "degrees
of belief" and subjective choices of plausibility as probability densities. Not
only do they allow one to postulate prior densities, they actually insist that
one do so. They believe that any sensible subsequent action (e.g., parameter
estimation) must be based on some choice of a prior distribution.
We shall not attempt to resolve the controversy here. For a concise discussion
of the problem we refer the reader to Cornfield (1967). We feel that
it is up to the scientist to decide for himself the extent of his commitment to
a prior distribution. He should remember that introducing a prior distribution
biases the results of the estimation process so as to favor parameter values
for which $p_0(\boldsymbol{\phi})$ is relatively large. This bias diminishes as the number of experiments
is increased. In other words, when the amount of data available
is sufficiently large, the effect of the prior distribution on the parameter estimates
is negligible, except that values of $\boldsymbol{\phi}$ for which $p_0(\boldsymbol{\phi}) = 0$ remain
excluded.
† This is an intuitive concept, which we do not attempt to define here. Some authors
have given definitions based on bets that the scientist is willing to lay on each value of $\boldsymbol{\phi}$.
There are several cases for which the use of a prior distribution is noncontroversial:
(a) Assigning $p_0(\boldsymbol{\phi}) = 0$ to physically impossible values of $\boldsymbol{\phi}$.
(b) If $\boldsymbol{\phi}$ is truly a random variable, its pdf should be used as the prior density.
For instance, $\boldsymbol{\phi}$ may represent the physical properties of a batch of chemicals.
If these properties are known to vary randomly from one batch to the next
according to some pdf $p(\boldsymbol{\phi})$, it is entirely proper to use $p_0(\boldsymbol{\phi}) = p(\boldsymbol{\phi})$ when
attempting to estimate the properties of one specific batch.
(c) Suppose a number of relevant experiments has already been conducted,
but additional experiments are being planned. As will be seen shortly, the
information on the parameters contained in the data may be expressed in the
form of a so-called posterior distribution. It is entirely proper to use the
posterior distribution from the already completed experiments as the prior
distribution for the experiments yet to be conducted.
2-18. Informative and Noninformative Priors
In case (c) above the choice of the prior distribution is obvious. The same
is true in case (b), provided $p(\boldsymbol{\phi})$ is known. How should one choose a distribution in
other cases? We distinguish three situations:
(a) The noninformative case occurs when we really have no marked preferences
for some values of $\boldsymbol{\phi}$ over others, at least within the relevant region in
$\boldsymbol{\phi}$-space.‡ The simplest solution, and a very satisfactory one in practice,§ is
to assume no prior distribution at all, and to use a parameter estimation
method which does not require one. This avenue, however, is closed to practitioners
who are irrevocably committed to decision theory and Bayesian
statistics. They are forced to choose a prior distribution, and are likely to
assume a uniform prior density. This is logically unsatisfactory, since if $\phi$
has a uniform prior distribution, any nontrivial function of $\phi$ (say $\phi^3$) has a
nonuniform distribution. As we could have written the model in terms of $\phi^3$
rather than $\phi$, we find ourselves in a situation where the choice of parametrization
affects the outcome of the estimation.
‡ By the relevant region we mean that region in which the likelihood function is far
from vanishing.
§ Author's opinion.
Alternative procedures are available. Raiffa and Schlaifer (1961) introduce
the concept of the conjugate distribution. This is a distribution which has the
same mathematical form as the likelihood function derived from some hypothetical
sample. A suitable noninformative prior distribution can sometimes
be obtained by finding the function to which the conjugate distribution tends
as the sample size (now assumed a continuous variable) tends to zero.
Another suggestion, due to Jeffreys (1961), is to use $1/\phi$ as the noninformative
prior density for nonnegative variables. The justification given is that this
distribution is unaffected when $\phi$ is replaced by $\phi^n$.
(b) The informative case occurs when we do prefer some values of $\boldsymbol{\phi}$ to others
within the relevant region. Generally the precise form of the prior distribution
is immaterial as long as it has approximately the right shape. The method of
conjugate distributions can be used here too; the scientist postulates what
seems to him a likely data set, and the likelihood function corresponding to
it is used as a prior density. The advantage this method offers is that the prior
density, the likelihood function, and the posterior density all have the same
mathematical form, which sometimes simplifies formal manipulations. In the
case of nonlinear models, however, all these functions are complicated.
Numerical evaluation replaces formal manipulation, and this method no
longer offers any advantage. One is probably better off assuming a normal or
other simple prior density with suitably chosen means and variances. Another
approach is to attempt the graphical construction of a suitable pdf or cumulative
distribution function. Winkler (1967) has carried out experiments in
which students were made to construct prior densities to represent their
beliefs concerning the values of certain parameters, using several of the above
mentioned methods. The feasibility of constructing such functions was demonstrated,
but the question of whether they would be of practical value to the
parameter estimator remains unanswered.
In practice, our prior information on a parameter $\theta$ often takes the form
of a value $\theta_0 \pm \alpha$ reported in the literature. The number $\alpha$ may be a standard
deviation, in which case we assign the prior density $N_1(\theta_0, \alpha^2)$; or, $\pm\alpha$ may
represent absolute bounds on the deviation from $\theta_0$, so that the uniform distribution
over the interval $\theta_0 - \alpha \le \theta \le \theta_0 + \alpha$ is appropriate. In both cases,
the chosen distribution is the least informative one among all distributions
satisfying the given conditions.
(c) It may happen that $\boldsymbol{\phi}$ is truly a random variable with pdf $p(\boldsymbol{\phi})$ (as in
case (b), Section 2-17), yet the function $p(\boldsymbol{\phi})$ is not known. The empirical
Bayes' method of Robbins (1955, 1964) (see also Neyman, 1962) provides
an approach to the estimation of $p(\boldsymbol{\phi})$ on the basis of available data.
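The dependence on parametrization raised under the noninformative case (a) can be seen in a small simulation. The sketch below is an illustration under stated assumptions: the uniform prior is taken on the interval $(0, 1)$, and the $1/\phi$ density, which cannot be normalized on $(0, \infty)$, is truncated to the arbitrary interval $[0.1, 10]$ so that it can be sampled. The function and variable names are hypothetical.

```python
import random


def fraction_below(samples, cutoff):
    """Fraction of the samples lying below the cutoff."""
    return sum(1 for s in samples if s < cutoff) / len(samples)


rng = random.Random(1)
n = 20000

# Uniform prior on phi in (0, 1): phi**3 is far from uniform on (0, 1).
uniform_phi = [rng.random() for _ in range(n)]
cubed = [p ** 3 for p in uniform_phi]
frac_uniform = fraction_below(cubed, 0.5)  # a uniform phi**3 would give about 0.5

# Density proportional to 1/phi on [lo, hi]: log(phi) is uniform.
lo, hi = 0.1, 10.0
jeffreys_phi = [lo * (hi / lo) ** rng.random() for _ in range(n)]
jeffreys_cubed = [p ** 3 for p in jeffreys_phi]
# phi**3 then has density proportional to 1/(phi**3) on [lo**3, hi**3],
# so the fraction below the geometric midpoint (here 1.0) is the same.
frac_phi = fraction_below(jeffreys_phi, 1.0)
frac_cubed = fraction_below(jeffreys_cubed, 1.0)

print(frac_uniform, frac_phi, frac_cubed)
```

Under the uniform prior, about $0.5^{1/3} \approx 0.79$ of the mass of $\phi^3$ lies below $0.5$, so $\phi^3$ is not uniformly distributed; under the truncated $1/\phi$ prior, the cubed variable keeps the same form of density.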
2-19. Bayes' Theorem
We have summarized the information contained in the data by means of
the likelihood function $L$, and the prior information by means of the prior
density $p_0(\boldsymbol{\phi})$. We combine the two in the so-called posterior density which is
proportional to their product
$$p^*(\boldsymbol{\phi}) = cL(\boldsymbol{\phi})\,p_0(\boldsymbol{\phi})$$
(2-19-1)
with
$$c = \left[\int L(\boldsymbol{\phi})\,p_0(\boldsymbol{\phi})\,d\boldsymbol{\phi}\right]^{-1}$$
(2-19-2)
provided the integral exists.†
If $p_0(\boldsymbol{\phi})$ is regarded as the probability density ascribed to $\boldsymbol{\phi}$ before the
experiments were performed, then $p^*(\boldsymbol{\phi})$ is the density we must ascribe to $\boldsymbol{\phi}$
after the data were obtained. This follows from Bayes' (Bayes, 1763) theorem,
which may be stated as follows:
Bayes' Theorem. Let $A$ and $B$ be two events whose probabilities of occurrence
are $P(A)$ and $P(B) \ne 0$ respectively. Let $P(A \mid B)$ denote the conditional probability
that $A$ occurs, given that $B$ has occurred, and let $P(B \mid A)$ be defined
analogously. Then
$$P(A \mid B) = P(B \mid A)\,P(A)/P(B)$$
(2-19-3)
Proof: The proof follows immediately from the definition of conditional
probability
$$P(A \,\&\, B) = P(A \mid B)\,P(B)$$
(2-19-4)
where $P(A \,\&\, B)$ is the probability of $A$ and $B$ both occurring. But also
$$P(A \,\&\, B) = P(B \mid A)\,P(A)$$
(2-19-5)
Equating the right-hand sides of Eqs. (4) and (5) and solving for $P(A \mid B)$ yields Eq. (3) directly.
In our case we define $A$ to be the event "the true value of $\boldsymbol{\phi}$ is within a
hypercube of volume‡ $d\boldsymbol{\phi}$ centered at $\boldsymbol{\phi}_0$" and $B$ to be the event "the true
value of $\boldsymbol{\Omega}$ is within a hypercube of volume $d\boldsymbol{\Omega}$ centered at $\mathbf{W}$."
† In some applications the value of $c$ is immaterial, and we can proceed even when the
integral does not exist. See Section 4-15.
‡ We use the notation $d\boldsymbol{\phi}$ as a shorthand for $d\phi_1\,d\phi_2 \cdots d\phi_l$ and $d\boldsymbol{\Omega}$ for $d\omega_{11}\,d\omega_{12} \cdots d\omega_{1r}\,d\omega_{21} \cdots d\omega_{nr}$.
By the definitions of the pdf, the prior distribution, and the likelihood
function, we have
$$P(A) = p_0(\boldsymbol{\phi}_0)\,d\boldsymbol{\phi}$$
(2-19-6)
$$P(B \mid A) = L(\boldsymbol{\phi}_0)\,d\boldsymbol{\Omega}$$
(2-19-7)
The value of $P(B)$ is obtained by summing $P(A \,\&\, B) = P(B \mid A)\,P(A)$ over all
possible $A$'s, i.e.,
$$P(B) = \left[\int L(\boldsymbol{\phi})\,p_0(\boldsymbol{\phi})\,d\boldsymbol{\phi}\right] d\boldsymbol{\Omega}$$
(2-19-8)
Substitution in Eq. (3) yields
$$P(A \mid B) = L(\boldsymbol{\phi}_0)\,p_0(\boldsymbol{\phi}_0)\,d\boldsymbol{\phi} \Big/ \int L(\boldsymbol{\phi})\,p_0(\boldsymbol{\phi})\,d\boldsymbol{\phi}$$
(2-19-9)
But $P(A \mid B)$ is the probability of $A$ occurring given that the experiments
yielded the data $\mathbf{W}$. By definition, then
$$P(A \mid B) = p^*(\boldsymbol{\phi}_0)\,d\boldsymbol{\phi}$$
(2-19-10)
from which Eq. (1) follows immediately.
Note that $p^*(\boldsymbol{\phi})$ is a meaningful pdf only if $p_0(\boldsymbol{\phi})$ was one. The frequentist
who in a given case does not accept a prior distribution will not accept a
posterior distribution either.
When the prior distribution is uniform, the posterior density is proportional
to the likelihood function. If the results of two series of experiments
are statistically independent, the joint likelihood function is the product of
the two individual likelihoods. Formally it is then possible to regard the
likelihood from the first series as the prior for the second series, and then the
posterior for the second series equals (except for a constant factor) the joint
likelihood function. This is the basis for the assertion made under case (c) in
Section 2-17.
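The equivalence between sequential and joint use of two independent series can be verified numerically on a grid. The sketch below assumes, for illustration only, a single parameter $\theta$ measured directly with normal errors of known standard deviation; the data values, grid, and function names are hypothetical, and constant factors of the likelihood are dropped since they cancel on normalization, as in Eq. (2-19-1).

```python
import math


def likelihood(theta, data, sigma):
    """Normal likelihood of theta, up to a constant factor."""
    return math.exp(-sum((w - theta) ** 2 for w in data) / (2.0 * sigma ** 2))


def normalize(values, step):
    """Scale values so that they integrate to one on the grid (rectangle rule)."""
    total = sum(values) * step
    return [v / total for v in values]


step = 0.01
grid = [i * step for i in range(0, 1001)]  # theta from 0 to 10
series1 = [4.8, 5.3, 5.1]                  # hypothetical first series
series2 = [4.9, 5.4]                       # hypothetical second series
sigma = 0.5

# Likelihood of the first series used as the prior for the second series.
prior = [likelihood(t, series1, sigma) for t in grid]
post_sequential = normalize(
    [p * likelihood(t, series2, sigma) for p, t in zip(prior, grid)], step
)

# Uniform prior combined with the joint likelihood of both series.
post_joint = normalize([likelihood(t, series1 + series2, sigma) for t in grid], step)

print(max(abs(a - b) for a, b in zip(post_sequential, post_joint)))
```

The printed maximum difference should be at the level of floating-point rounding.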
The posterior distribution (equal to the likelihood in the absence of prior
information), combined with any constraints that may be applicable, embodies
all four elements that enter into the parameter estimation problem, namely
the model, the data, the probability distribution of the errors, and the prior
information on the parameters. Formulating a parameter estimation problem
is in many cases equivalent to writing down the posterior distribution.
2-20. Problems
1. Write down explicit expressions for all the likelihood functions appearing
in Section 2-14, assuming that all error distributions are normal with
zero means and given variances.
2. The gamma distribution with parameters $a$ and $\nu$ has the pdf
$$\Gamma_{a,\nu}(\theta) = \begin{cases} \Gamma^{-1}(\nu)\,a^{\nu}\theta^{\nu-1}e^{-a\theta} & (\theta \ge 0) \\ 0 & (\theta < 0) \end{cases}$$
where
$$\Gamma(\nu) \equiv \int_0^\infty x^{\nu-1}e^{-x}\,dx$$
is the gamma function. Show that $E(\theta) = \nu/a$ and $V(\theta) = \nu/a^2$.
3. Suppose an object is measured $n$ times to determine its length $\theta$. The
measurements are denoted $w_\mu$ ($\mu = 1, 2, \ldots, n$). The model equation is
$\hat{w}_\mu = \theta$. Assume the errors are distributed as $N_1(0, \sigma^2)$. Suppose $\sigma$ is known
and $\theta$ is assigned the prior distribution $N_1(\theta_0, \sigma_0^2)$. Write down the likelihood
and the posterior pdf $p^*(\theta)$. Show that $p^*(\theta)$ is normal, and find its mean and
variance.
4. As in the previous problem, but assume $\theta$ is known and $\sigma$ is to be
estimated. Let $\tau \equiv 1/\sigma^2$ and assume that $\tau$ has the prior distribution $\Gamma_{a_0,\nu_0}$.
Show that the posterior distribution of $\tau$ is also gamma, and find its parameters.
5. Investigate the shape of the gamma distribution for various ranges of
its parameter values. Under what circumstances is the gamma distribution a
suitable prior for a parameter?
6. Show that the examples of Problems 3 and 4 contain conjugate distributions.
Show how the parameters of the prior distribution are related to
the sizes of hypothetical samples. Investigate the behavior of the prior
distribution as the hypothetical sample size is reduced to zero.
Chapter III
Estimators and Their Properties
A. Statistical Properties
3-1. The Sampling Distribution
A point estimation method is a procedure which enables one to compute
an estimate $\boldsymbol{\phi}^*$ for the parameter vector $\boldsymbol{\phi}$, given the data matrix $\mathbf{W}$. The
estimation method defines (at least implicitly) a vector valued function $\mathbf{h}$
$$\boldsymbol{\phi}^* = \mathbf{h}(\mathbf{W})$$
(3-1-1)
We shall use the expression "the estimator $\mathbf{h}$" to mean "the estimation procedure
defined by the function $\mathbf{h}$."
If the experiments which yielded our data were to be repeated, we would
obtain different values of the $\mathbf{W}$, i.e., different realizations of the random
variables $\boldsymbol{\Omega}$. Application of the estimator $\mathbf{h}$ to the new data would yield different
values of $\boldsymbol{\phi}^*$. We see then that the estimates $\boldsymbol{\phi}^*$ are themselves random
variables, possessing a certain probability distribution, which depends both on
the nature of $\mathbf{h}$ and on the distribution of the $\boldsymbol{\Omega}$. We refer to this distribution
as the sampling distribution of the estimator $\mathbf{h}$, and denote its pdf (if one exists)
$p_h(\boldsymbol{\phi}^*)$.
Note the fundamental difference between the sampling distribution and
the posterior distribution. The sampling distribution refers to the estimate $\boldsymbol{\phi}^*$,
which is truly a random variable. The sampling distribution is defined only
once the estimation procedure is defined; different estimation procedures when
applied to the same data generally give rise to different sampling distributions.
On the other hand, the posterior distribution is independent of the estimation
procedure. It applies to the true values $\boldsymbol{\phi}$, and its interpretation in cases where
these are not random is, therefore, controversial.
A glance at Eq. (1) reveals that the sampling distribution of $\boldsymbol{\phi}^*$ depends on
the actual distribution of $\boldsymbol{\Omega}$. This distribution, however, depends on the true
values $\boldsymbol{\phi}$, which are generally unknown (or we would not be trying to estimate
them). Therefore, even when we can derive a formula for the sampling distribution,
we can evaluate only the approximation obtained by substituting the
estimated parameter values for the true ones. Still, one can frequently deduce
some important properties of the distribution, as shown by the following
example: Suppose we measure an object $n$ times to determine its length $\theta$. The
measurements will be denoted by $w_\mu$ ($\mu = 1, 2, \ldots, n$). The model equations
take the form
$$\hat{w}_\mu = \theta$$
(3-1-2)
Assume the measurements to be independent, and normally distributed with
variance $\sigma^2$ and mean $\theta$. Consider the estimator
$$\theta^* = (1/n)\sum_{\mu=1}^{n} w_\mu$$
(3-1-3)
i.e., $\theta^*$ is the mean of the observations. It is well known that $\theta^*$ is normally
distributed with mean $\theta$ and variance $\sigma^2/n$. Thus we have found that the mean
of the sampling distribution is equal to the true value $\theta$, and its variance
decreases as $1/n$ when the sample size is increased.
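The sampling distribution in this example can also be explored by simulation, which is the only route available when no formula can be derived. The following sketch repeats the "experiment" many times and examines the resulting estimates; the values of $\theta$, $\sigma$, $n$, the number of repetitions, and the function name are arbitrary choices made for illustration.

```python
import random
import statistics


def sample_mean_estimates(theta, sigma, n, repeats, seed=0):
    """Apply the estimator of Eq. (3-1-3) to `repeats` simulated data sets."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        for _ in range(repeats)
    ]


estimates = sample_mean_estimates(theta=10.0, sigma=2.0, n=25, repeats=20000)
mean_of_estimates = statistics.mean(estimates)
var_of_estimates = statistics.pvariance(estimates)

# Theory: mean theta = 10 and variance sigma**2 / n = 0.16.
print(mean_of_estimates, var_of_estimates)
```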
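The above example introduced the concepts of the mean and variance of
the sampling distribution, also referred to as the mean and variance of the
estimate. In the general case, the mean $\bar{\boldsymbol{\phi}}$ and covariance matrix $\mathbf{V}_\phi$ of the
estimate are given by:
$$\bar{\boldsymbol{\phi}} \equiv E(\boldsymbol{\phi}^*) = \int \boldsymbol{\phi}^* p_h(\boldsymbol{\phi}^*)\,d\boldsymbol{\phi}^* = \int \mathbf{h}(\mathbf{W})\,p(\mathbf{W} \mid \boldsymbol{\phi})\,d\mathbf{W}$$
(3-1-4)
$$\mathbf{V}_\phi \equiv E(\boldsymbol{\phi}^* - \bar{\boldsymbol{\phi}})(\boldsymbol{\phi}^* - \bar{\boldsymbol{\phi}})^T = \int (\mathbf{h}(\mathbf{W}) - \bar{\boldsymbol{\phi}})(\mathbf{h}(\mathbf{W}) - \bar{\boldsymbol{\phi}})^T p(\mathbf{W} \mid \boldsymbol{\phi})\,d\mathbf{W}$$
(3-1-5)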
3-2. Properties of the Sampling Distribution
It is clearly desirable to have an estimator whose sampling distribution is
concentrated in the neighborhood of the true values of the parameters. More
formally, we define the following properties of estimators:
(a) The bias of an estimator is the difference between the expected value of
the estimator and the true value of the parameter, i.e., $\mathbf{b} \equiv \bar{\boldsymbol{\phi}} - \boldsymbol{\phi}$. An estimator
is unbiased if its bias vanishes, i.e., if $\bar{\boldsymbol{\phi}} = \boldsymbol{\phi}$. In the example of Section
3-1 we saw that the estimate Eq. (3-1-3) was unbiased. Clearly, we desire
estimators with small (in absolute value) bias, but total unbiasedness is mostly
unobtainable. Nor is unbiasedness particularly important, since the bias is not
the only error in any given estimator. Furthermore, if an estimate is unbiased
for some parameter $\phi$, it is generally biased for nontrivial functions of $\phi$.
That is, even if $E(\phi^*) = \phi$, it need not be true that, say, $E(\phi^{*2}) = \phi^2$. Thus,
the presence or absence of bias in an estimator is affected by a change in parametrization.
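For the sample mean of Eq. (3-1-3) the effect can be worked out explicitly. The short derivation below is an added illustration; it uses only the mean $\theta$ and variance $\sigma^2/n$ of $\theta^*$ quoted in Section 3-1, together with the identity $E(x^2) = V(x) + [E(x)]^2$.

```latex
E(\theta^{*2}) = V(\theta^*) + [E(\theta^*)]^2 = \sigma^2/n + \theta^2 \neq \theta^2
```

Hence $\theta^{*2}$ overestimates $\theta^2$ by $\sigma^2/n$ on the average, even though $\theta^*$ itself is unbiased for $\theta$; the bias vanishes only as $n$ increases.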
(b) While the bias is a measure of the "systematic" error in an estimator, the
variance measures its random error. The Rao-Cramer theorem (see Appendix
C) establishes a theoretical lower bound on the attainable covariance matrix
$V_\phi$ of an estimator.
We see from Eq. (3-1-4) that $\bar\phi$ is a function of $\phi$. If this function is
differentiable, then we can form the matrix
$$P \equiv \partial\bar\phi/\partial\phi = \int h(W)(\partial p/\partial\phi)^T\, dW \tag{3-2-1}$$
We also define a matrix $R$ by
$$R = E\left((\partial \log p/\partial\phi)(\partial \log p/\partial\phi)^T\right) = \int (\partial \log p/\partial\phi)(\partial \log p/\partial\phi)^T\, p\, dW \tag{3-2-2}$$
The theorem asserts that the matrix $V_\phi - P R^{-1} P^T$ is positive semidefinite,
and that it is null if and only if there exists a matrix $A$ (whose elements may be
functions of $\phi$) such that
$$\phi^* - \phi = A(\phi)\, \partial \log p/\partial\phi \tag{3-2-3}$$
The proof is given in Appendix C. The matrix $P R^{-1} P^T$ is called the minimum
variance bound (MVB).
Since the diagonal elements of a positive semidefinite matrix must be
nonnegative, we have for the variance of each $\phi_\alpha^*$
$$V_{\phi,\alpha\alpha} \geq \sum_{\beta,\gamma} P_{\alpha\beta}\,[R^{-1}]_{\beta\gamma}\, P_{\alpha\gamma} \tag{3-2-4}$$
with equality holding only when Eq. (3) is satisfied.
An estimate is called efficient† if its variance is the lowest theoretically
attainable, i.e., when
$$V_\phi = P R^{-1} P^T \tag{3-2-5}$$
When the estimate is unbiased, we have $\bar\phi = \phi$, and hence $P = I$. The
Rao-Cramer theorem reduces to the statement that $V_\phi - R^{-1}$ is positive
semidefinite, and an efficient unbiased estimate has the covariance
$$V_\phi = R^{-1} \tag{3-2-6}$$
† Some authors use the term "MVB estimate" instead, and reserve the term "efficient"
for what we call "asymptotically efficient" here.
In the example of Section 3-1 we had an unbiased estimate for the single
parameter $\theta$. Its variance was found to be $V_\theta = \sigma^2/n$.
The likelihood function is
$$p(W, \theta) = (2\pi)^{-n/2}\sigma^{-n}\exp\left[-(1/2\sigma^2)\sum_{\mu=1}^n (w_\mu - \theta)^2\right] \tag{3-2-7}$$
whence
$$\partial \log p/\partial\theta = (1/\sigma^2)\sum_{\mu=1}^n (w_\mu - \theta) \tag{3-2-8}$$
and
$$R = E(\partial \log p/\partial\theta)^2 = (1/\sigma^4)\,E\left\{\left[\sum_{\mu=1}^n (w_\mu - \theta)\right]^2\right\} = (1/\sigma^4)\,E\left[\sum_{\mu=1}^n\sum_{\nu=1}^n (w_\mu - \theta)(w_\nu - \theta)\right] \tag{3-2-9}$$
Now the variables $w_\mu$ ($\mu = 1, 2, \ldots, n$) were assumed independent with means
$\theta$ and variances $\sigma^2$. Hence
$$E(w_\mu - \theta)(w_\nu - \theta) = \sigma^2\delta_{\mu\nu} \tag{3-2-10}$$
and
$$R = (1/\sigma^4)\sum_{\mu=1}^n E(w_\mu - \theta)^2 = n\sigma^2/\sigma^4 = n/\sigma^2 \tag{3-2-11}$$
In this case, then, $V_\theta = R^{-1}$, and the estimate is efficient. We could have
deduced this fact also from Eq. (3-1-3), which yields
$$\theta^* - \theta = (1/n)\sum_{\mu=1}^n w_\mu - \theta = (1/n)\sum_{\mu=1}^n (w_\mu - \theta) = (\sigma^2/n)\,\partial \log p/\partial\theta \tag{3-2-12}$$
so that Eq. (3) is satisfied with $A = \sigma^2/n$.
As with unbiasedness, efficiency can be attained only in a small class of
relatively simple models. It should be pointed out that in some cases, although
no efficient estimate exists, among those estimates that do exist or among
estimates of a certain class there may be one whose variance is least. For
instance, if all errors in a linear model are identically and independently
distributed, then least squares estimates have, among all linear unbiased
estimates, the smallest variance. Yet, they are not necessarily efficient.
(c) While unbiased and efficient estimates cannot generally be found for
samples of finite size, the situation changes drastically when the number of
experiments increases beyond bound. Under this condition, the actual
value of many estimates converges with probability one to the true parameter
values. We refer to such estimates as consistent or asymptotically unbiased.
The variances of most of the estimates that we shall deal with approach zero
as $1/n$ when $n$ increases. Hence, if the estimator is consistent for $\phi$ it is
consistent for any well-behaved function of $\phi$. Thus consistency is a more
significant concept than unbiasedness.
(d) Both $V_\phi$ and $R^{-1}$ tend to zero as $1/n$ for most relevant estimators. Hence
we call a consistent estimator asymptotically efficient if with probability one
$$\lim_{n \to \infty} n(V_\phi - R^{-1}) = 0 \tag{3-2-13}$$
(e) Our discussion of estimation criteria would not be complete without the
mention of sufficient statistics. A statistic $\rho$ of a sample is any function
computed from the values of the sample for the purpose of extracting relevant
information
$$\rho = r(W) \tag{3-2-14}$$
In particular, any estimate $\phi^*$ is a statistic defined by Eq. (3-1-1). A statistic $\rho$
is deemed sufficient for the parameters $\phi$ if the value of $\rho$ conveys as much
information concerning the value of $\phi$ as did the original sample $W$. In other
words, we may compute the value of $\rho$ from the sample, and then discard the
data $W$ without losing any information relevant to the estimation of $\phi$.
This statement should be interpreted as follows. A sample can contain
information concerning the value of $\phi$ only if the distribution of the sample is
a function of $\phi$. To say that $\rho$ is a sufficient statistic for $\phi$ implies, then, that
once the value of $\rho$ is determined, the distribution of the sample can be
represented in terms of $\rho$ alone, with no further dependence on $\phi$. Therefore, there
must exist a function $q(W, \rho)$ such that
$$p(W \mid \phi, \rho) = q(W, \rho) \tag{3-2-15}$$
But we have, from the definition of conditional probability
$$p(W \mid \phi) = p(W \mid \phi, \rho)\, p(\rho \mid \phi) \tag{3-2-16}$$
Letting $p(\rho \mid \phi) \equiv s(\rho, \phi)$ (this is the sampling distribution of the statistic $\rho$)
we are led to the factorization theorem for sufficient statistics $\rho$
$$p(W \mid \phi) = q(W, \rho)\, s(\rho, \phi) \tag{3-2-17}$$
Taking logarithms and differentiating both sides with respect to $\phi$ we obtain
$$\partial \log p(W \mid \phi)/\partial\phi = \partial \log s(\rho, \phi)/\partial\phi \equiv t(\rho, \phi) \tag{3-2-18}$$
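In the example of Section 3-1, we have
$$\begin{aligned} p(W \mid \theta) &= (2\pi)^{-n/2}\sigma^{-n}\exp\left[-(1/2\sigma^2)\left(\sum_{\mu=1}^n w_\mu^2 - 2\theta\sum_{\mu=1}^n w_\mu + n\theta^2\right)\right] \\ &= \left\{(2\pi)^{-n/2}\sigma^{-n}\exp\left[-(1/2\sigma^2)\sum_{\mu=1}^n w_\mu^2\right]\right\}\left\{\exp\left[(1/2\sigma^2)\left(2\theta\sum_{\mu=1}^n w_\mu - n\theta^2\right)\right]\right\} \end{aligned} \tag{3-2-19}$$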
which is in the form Eq. (17) with $\rho = \sum_{\mu=1}^n w_\mu$. Thus, the sum of the
observations is a sufficient statistic in this case.
An estimate which is a function of a sufficient statistic is a sufficient
estimate. In our example, $\theta^* = (1/n)\rho$, hence $\theta^*$ is a sufficient estimate for $\theta$.
Supposing $\phi^*$ is an efficient estimate of $\phi$, then from Eq. (3) we find that
$$\partial \log p/\partial\phi = A^{-1}(\phi)(\phi^* - \phi) \tag{3-2-20}$$
Comparison of Eq. (20) with Eq. (18) shows that $\rho = \phi^*$ is sufficient. Thus we
have proved that if an efficient estimate exists, it must also be sufficient.
Conversely, if a sufficient estimate exists, some function of it is an efficient
estimate of some function of the parameters.
(f) The value of an estimate usually depends on the form of the probability
distribution that we assume. We rarely possess exact knowledge of the
distribution, and must usually content ourselves with a rough approximation. We
desire, therefore, that our estimate be robust, i.e., that it be only slightly
affected by seemingly unimportant changes in the form of the assumed
distribution.
(g) The choice of parameters appearing in a model is often arbitrary. We may
replace the original parameters $\phi$ with a different set of parameters $\psi$ which
are single valued functions of the $\phi$, e.g., $\psi = s(\phi)$ where $s$ is a vector of
functions. It is desirable that our estimators be invariant under
reparametrization. That is, if $\phi^*$ and $\psi^*$ are the estimates obtained when the model is
represented respectively in terms of $\phi$ and $\psi$, then we expect to find that $\psi^* = s(\phi^*)$.
(h) An estimation procedure which cannot be implemented on available
computing machinery is of little use. An estimate which is readily computable is,
from a practical point of view, more valuable than a statistically more efficient
estimate which is computable only with an excessive amount of labor. In
other words, an adherent of decision theory should make his cost function
depend not only on the error of the estimate, but also on the cost of computing
the estimate.
(i) A linear estimate is one which is a linear function of the data, i.e., $\phi^*$ is a
linear estimate if there exists a matrix $A$ (which may be a function of the $x_\mu$)
such that
$$\phi^* = AW \tag{3-2-21}$$
Useful linear estimates valid over a wide range of data values can be found
only when the model itself is linear in the parameters.
Among the properties that we have defined, the most important in practice
are small bias, small variance, robustness, and computability.
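As a numerical illustration of properties (a) and (b) for the example of Section 3-1, the following Python sketch draws repeated normal samples, forms the sample mean, and compares its simulated bias and variance with the bound $R^{-1} = \sigma^2/n$ of Eq. (3-2-11). The parameter values, the seed, and the function names are illustrative assumptions and do not come from the text.

```python
import random
import statistics

def sample_mean(w):
    """Estimator of Section 3-1: theta* = (1/n) * sum of the observations."""
    return sum(w) / len(w)

def score(w, theta, sigma):
    """d log p / d theta for independent normal data, Eq. (3-2-8)."""
    return sum(x - theta for x in w) / sigma**2

def simulate(theta, sigma, n, replications, seed=0):
    """Return the estimates theta* obtained from many simulated samples."""
    rng = random.Random(seed)
    return [sample_mean([rng.gauss(theta, sigma) for _ in range(n)])
            for _ in range(replications)]

# Illustrative (assumed) true parameter, error standard deviation, sample size.
theta, sigma, n = 2.0, 0.5, 10
estimates = simulate(theta, sigma, n, replications=20000)

bias = statistics.fmean(estimates) - theta   # should be close to zero
variance = statistics.variance(estimates)    # should be close to the bound
mvb = sigma**2 / n                           # R^{-1}, from Eq. (3-2-11)
print(bias, variance, mvb)
```

The helper `score` also lets one confirm Eq. (3-2-12) for any single sample: `sample_mean(w) - theta` equals `(sigma**2 / len(w)) * score(w, theta, sigma)` up to rounding.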
A true measure of the accuracy and precision of an estimate is given by
the mean square error relative to the true, rather than mean, value. It is easily
verified that
$$E(\phi^* - \phi)(\phi^* - \phi)^T = E(\phi^* - \bar\phi)(\phi^* - \bar\phi)^T + (\bar\phi - \phi)(\bar\phi - \phi)^T = V_\phi + bb^T \tag{3-2-22}$$
where $b$ is the bias. The root-mean-square error in the estimate $\phi_i^*$ of the $i$th
parameter is given, therefore, by $(\sigma_i^2 + b_i^2)^{1/2}$, where $\sigma_i \equiv V_{\phi,ii}^{1/2}$ is the
standard deviation of the estimate $\phi_i^*$. This points out the fact that we gain little
in laboring to make a highly biased estimate (large $b_i$) very efficient (small $\sigma_i$),
or to eliminate bias in an inefficient estimate.
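The decomposition of Eq. (3-2-22) can be checked numerically in the scalar case. The sketch below uses a deliberately biased estimator (a sample mean shrunk by an assumed factor `c`; this estimator, the parameter values, and the seed are hypothetical choices made only for illustration). When the variance is computed with divisor $N$ about the sample mean of the estimates, the identity holds exactly for the simulated values, not just in expectation.

```python
import random

def shrunk_mean(w, c=0.8):
    """A deliberately biased estimator: the sample mean multiplied by c."""
    return c * sum(w) / len(w)

rng = random.Random(1)
theta, sigma, n, N = 2.0, 1.0, 5, 5000       # assumed illustrative values

# N simulated estimates, each from a fresh sample of n observations.
est = [shrunk_mean([rng.gauss(theta, sigma) for _ in range(n)])
       for _ in range(N)]

mean_est = sum(est) / N
bias = mean_est - theta                               # b
var = sum((e - mean_est) ** 2 for e in est) / N       # scalar V (divisor N)
mse = sum((e - theta) ** 2 for e in est) / N          # error about true value
rms_error = mse ** 0.5                                # (sigma_i^2 + b_i^2)^(1/2)
print(bias, var, mse, var + bias**2)
```

With these settings the bias is close to $(c - 1)\theta = -0.4$, so the squared bias makes up most of the mean square error; reducing the variance further would gain little.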
Ideally, we would like to derive formulas for estimators having the
desired properties. This, unfortunately, is not possible except in simple cases
such as with linear models (see Appendix E). The best that can be done in
practice is to propose reasonable estimators, and to test their properties
as described in the next section.
3-3. Evaluation of Statistical Properties
Given an estimation procedure $h(W)$, how do we determine its statistical
properties? How do we determine whether its bias and variance are within
acceptable limits? How can we compare its performance to that of other
estimators? How can we assess robustness? Several approaches to the
answering of these questions suggest themselves:
(a) Theoretical Analysis. Once an estimation procedure is properly defined,
it may be possible to derive the precise sampling distribution, or at least some
of its relevant properties such as the mean and variance. In practice, such an
analysis can be carried out only for linear models (see Appendix E), or for
the asymptotic distribution when the sample size is increased beyond bound.
This unfortunately leaves open the most common situation, i.e., a nonlinear
model with moderate sample size.
(b) Replication. If we repeat the whole series of experiments many times, and
apply our estimator to each data set in turn, our estimates will form a large
sample drawn from the sampling distribution. This sample can be used to
estimate the mean, variance, and other properties of the sampling distribution.
This procedure possesses some serious drawbacks. First, it is expensive,
inasmuch as a very large number of experiments must be performed each
time the estimator is applied to a new model. Second, although we may find
the mean of the sampling distribution, we cannot determine the bias unless
the true values of the parameters are known. Therefore, we must first carry
out a series of experiments on a system whose parameters are known. Such a
system is not always available.
(c) Computer Simulation (Monte Carlo Method). The objections to the
replication method disappear if the experiments are not carried out on a real
physical system, but are simulated on a computer instead. To simulate a
series of experiments on the computer we proceed as follows:
1. Define the system by prescribing the model equations, the probability
distribution of the errors, and, where applicable, a prior distribution. Assign
"true" values $\hat\phi$ to all the parameters $\phi$.
2. Assign "true" values $\hat z_\mu$ to the variables in the $\mu$th experiment
($\mu = 1, 2, \ldots, n$). Choose these values so that the model equations $g(\hat z_\mu, \hat\theta) = 0$
are satisfied. One way of doing this is to choose a set of independent
variables, assign arbitrary values to these, and then solve the equations for the
values of the remaining variables. The task is most easily performed if the
equations are in reduced form.
3. Use the computer to produce a set of errors $\epsilon_\mu$ drawn from the
prescribed probability distribution. For most computers there are available
routines which generate a stream of numbers having the appearance of being
random numbers uniformly distributed in the interval zero to one. They are
referred to as pseudorandom numbers. From these, suitable transformations
may be used to obtain samples from any other desired distribution. In
Appendix D we show how a sample from a multivariate normal distribution
may be obtained. For a more general treatment the reader is referred to the
literature on Monte Carlo methods, e.g., Hammersley and Handscomb (1964).
Having generated the values of the errors $\epsilon_\mu$, we add them to the $\hat w_\mu$
(previously generated among the $\hat z_\mu$) to obtain the actual "data" $w_\mu$.
4. The estimation procedure is now applied to the data generated by the
computer as though they were obtained in real experiments. This yields an
estimate $\phi^*$ for the parameters.
5. Replicate the series of experiments as many times as we please by
repeating steps 3 and 4, each time with a new sample of errors.
6. The relevant properties of the sampling distribution are obtained by
averaging over all replications. Let $\phi_i^*$ be the estimate of $\phi$ obtained in the
$i$th replication of the series of experiments, and let $N$ be the total number of
replications. Then, we estimate the mean of the sampling distribution as
$$\bar\phi^* = (1/N)\sum_{i=1}^N \phi_i^* \tag{3-3-1}$$
and its covariance matrix as
$$Q^* = [1/(N - 1)]\sum_{i=1}^N (\phi_i^* - \bar\phi^*)(\phi_i^* - \bar\phi^*)^T \tag{3-3-2}$$
These formulas apply, of course, also when the experiments are real, not
simulated.
The bias $b$ of the estimator is then estimated as
$$b \approx \bar\phi^* - \hat\phi \tag{3-3-3}$$
The flexibility of the simulation method is endless. We can estimate the
properties of the sampling distribution for any model, and for any values of
the parameters within the model. We may examine the effects of errors in the
formulation of the model, by deliberately using a slightly different model in
the estimator than was used in the data generator. When the errors come from
a distribution that is not the same as the one assumed in the estimation
procedure, we obtain a measure of the robustness of the estimator. We can
also compare the true sampling distributions to theoretically derived
approximations (see Chapter VII). All this can be done on the modern computer at a
small fraction of the cost, in time and money, of a comparable set of physical
experiments.
The results of an actual Monte Carlo study appear in Section 7-22.
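The six simulation steps of method (c) can be sketched in a few lines of Python. The model (a straight line through the origin), its least squares estimator, the normal error distribution, and all numerical values below are assumptions chosen only to keep the sketch short; any other model, estimator $h(W)$, or error distribution could be substituted in the marked places.

```python
import random

def model(x, theta):
    """Assumed reduced model y = f(x, theta): a line through the origin."""
    return theta * x

def estimate(xs, ys):
    """Assumed estimator h(W): least squares for the model above."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def monte_carlo(theta_true, xs, sigma, N, seed=0):
    """Return (mean, variance, bias) of the simulated sampling distribution."""
    rng = random.Random(seed)
    y_true = [model(x, theta_true) for x in xs]            # steps 1 and 2
    estimates = []
    for _ in range(N):                                      # step 5
        ys = [y + rng.gauss(0.0, sigma) for y in y_true]    # step 3
        estimates.append(estimate(xs, ys))                  # step 4
    mean = sum(estimates) / N                               # Eq. (3-3-1)
    var = sum((e - mean) ** 2 for e in estimates) / (N - 1) # Eq. (3-3-2), scalar
    bias = mean - theta_true                                # Eq. (3-3-3)
    return mean, var, bias

xs = [0.5, 1.0, 1.5, 2.0, 2.5]        # assumed values of the independent variable
mean, var, bias = monte_carlo(theta_true=3.0, xs=xs, sigma=0.2, N=4000)
print(mean, var, bias)
```

For this linear case the simulated variance can be compared with the known value $\sigma^2/\sum_\mu x_\mu^2$. Replacing `rng.gauss` with a different error generator while leaving `estimate` unchanged gives the kind of robustness test described above.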
B. Mathematical Properties
3-4. Optimization
In most parameter estimation methods we proceed in two stages:
(a) Define a function $\Phi(\phi)$ which is a suitable measure of the departure of
the data from the model, i.e., of the "lack of fit". We refer to this function as
the objective function. For example, in the least squares method, the objective
function is the sum of squares of residuals.
(b) Seek those values $\phi^*$ of the parameters $\phi$ at which the objective function
attains its minimum or maximum, as appropriate. We accept $\phi^*$ as our
estimate for $\phi$. The process of computing $\phi^*$ is called optimization.
Chapter IV is devoted to the realization of (a), i.e., to the definition of the
objective functions. The remainder of this chapter is devoted to the analytic
properties exhibited by the solutions to the optimization problem.
Descriptions of the optimization process itself will be found in Chapters V and VI.
When the unknown parameters are free to assume any values whatsoever,
we speak of unconstrained optimization. Sometimes, only parameter values
satisfying certain constraints are permitted. We may have a vector of equality
constraints
$$g(\phi) = 0 \tag{3-4-1}$$
and/or inequality constraints
$$h(\phi) \geq 0 \tag{3-4-2}$$
where $h \geq 0$ means that each component $h_i \geq 0$. The set of all values of $\phi$
satisfying all the constraints is called the feasible region. Feasible points
satisfying Eq. (2) with strict inequality constitute the interior of the feasible
region. A point satisfying $h_j(\phi) = 0$ for some $j$ is said to be on the $j$th
constraint, and also on the boundary of the feasible region.
We examine conditions that characterize the minima† in the various
cases that may arise. A point $\phi^*$ is said to be a local minimum of $\Phi(\phi)$ if
in some neighborhood (e.g., sphere) around $\phi^*$ there is no feasible point
$\phi^{**}$ such that $\Phi(\phi^{**}) < \Phi(\phi^*)$. A point is a global minimum if there is no
feasible point $\phi^{**}$ such that $\Phi(\phi^{**}) < \Phi(\phi^*)$. Clearly, any global minimum
is also a local minimum. Although we wish to find the global minimum, the
conditions at our disposal usually characterize local minima, and there is
generally no easy way to tell whether a given local minimum is the global
minimum.
The problem of optimization, with or without constraints, is often
referred to as the problem of mathematical programming. If both objective
function and constraints are linear functions of the unknown parameters, we
speak of linear programming. When constraints are present, but either they
or the objective function are nonlinear, we speak of nonlinear programming.
In the sequel, we shall assume that the objective and all constraint functions
are twice differentiable functions of the parameters.
3-5. Unconstrained Optimization
To characterize an unconstrained minimum we use the rules of elementary
calculus. We state the results briefly. The following are necessary conditions
for $\phi^*$ to be a minimum of $\Phi$:
N1. $\phi^*$ is a stationary point of $\Phi$, that is, the gradient of $\Phi$ vanishes
at $\phi^*$
$$\partial\Phi/\partial\phi\,|_{\phi=\phi^*} = 0 \tag{3-5-1}$$
N2. Let $H(\phi)$ be the Hessian matrix of $\Phi$, i.e., $H_{\alpha\beta} \equiv \partial^2\Phi/\partial\phi_\alpha\,\partial\phi_\beta$. Then
$H(\phi^*)$ must be positive semidefinite, i.e., for any nonzero vector $y$ we must
have
$$y^T H(\phi^*)\, y \geq 0 \tag{3-5-2}$$
† If a maximum of $\Phi$ is required, we seek the minimum of $-\Phi$ instead.
The following conditions are sufficient for $\phi^*$ to be a local minimum of $\Phi$:
S1. $\phi^*$ is a stationary point of $\Phi$.
S2. $H(\phi^*)$ is positive definite, i.e., Eq. (2) holds with strict inequality.
If $H(\phi)$ is positive definite for all $\phi$, and $\phi^*$ is stationary, then $\phi^*$ is the
unique global minimum of $\Phi$.
When $\phi^*$ satisfies N1 and N2 but not S2, it is impossible to determine
whether $\phi^*$ is a local minimum without considering higher order derivatives.
Relations Eq. (1), regarded as equations in the unknown $\phi^*$, are called
the normal equations. Condition N1 states that the minimum must be a
solution of the normal equations. Any solution to the normal equations may
be a local minimum only if it also satisfies N2, and it must be a local minimum
if it satisfies S2.
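Conditions N1 and S2 are easy to test numerically at a candidate point. The sketch below approximates the gradient and Hessian by central differences and tests positive definiteness by attempting a Cholesky factorization. The two objective functions, the step sizes, and the tolerances are illustrative assumptions; a real application would normally use analytic derivatives where available.

```python
def gradient(f, x, h=1e-5):
    """Central-difference approximation to the gradient of f at x."""
    g = []
    for i in range(len(x)):
        up = list(x); up[i] += h
        dn = list(x); dn[i] -= h
        g.append((f(up) - f(dn)) / (2 * h))
    return g

def hessian(f, x, h=1e-4):
    """Central-difference approximation to the Hessian of f at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                y = list(x); y[i] += si * h; y[j] += sj * h
                return f(y)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H

def is_positive_definite(H, tol=1e-8):
    """Cholesky factorization succeeds only for a positive definite matrix."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][i] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True

def classify(f, x, tol=1e-6):
    """Apply N1 (stationarity) and S2 (positive definite Hessian) at x."""
    if any(abs(gi) > tol for gi in gradient(f, x)):
        return "not stationary"
    if is_positive_definite(hessian(f, x)):
        return "local minimum"
    return "inconclusive or not a minimum"

def bowl(p):      # assumed example with a minimum at (1, -0.5)
    return (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 0.5) ** 2

def saddle(p):    # assumed example: stationary at the origin, no minimum
    return p[0] ** 2 - p[1] ** 2

print(classify(bowl, [1.0, -0.5]), classify(saddle, [0.0, 0.0]))
```

Note that a point failing S2 is reported only as inconclusive: as stated above, N1 and N2 without S2 do not settle the question.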
3-6. Equality Constraints
Suppose $\Phi(\phi)$ is to be minimized subject to the equality constraints
Eq. (3-4-1). If the solution is denoted $\phi^*$, then in the neighborhood of $\phi^*$
the function $\Phi(\phi)$ must be stationary for all variations in $\phi$ that stay within
the constraints. Let $\delta\phi$ be such a variation, i.e.,
$$g(\phi^* + \delta\phi) = 0 \tag{3-6-1}$$
To a first-order approximation
$$g(\phi^* + \delta\phi) \approx g(\phi^*) + (\partial g/\partial\phi)\,\delta\phi \tag{3-6-2}$$
From Eq. (1) and Eq. (2) we conclude that
$$(\partial g/\partial\phi)\,\delta\phi = 0 \tag{3-6-3}$$
for all permissible variations $\delta\phi$. Expanding $\Phi(\phi)$ around $\phi^*$ we find, to a
first-order approximation
$$\Phi(\phi^* + \delta\phi) \approx \Phi(\phi^*) + (\partial\Phi/\partial\phi)^T\,\delta\phi \tag{3-6-4}$$
Since $\Phi(\phi)$ is to be stationary, we must have
$$(\partial\Phi/\partial\phi)^T\,\delta\phi = 0 \tag{3-6-5}$$
for all $\delta\phi$ satisfying Eq. (3). This condition may be paraphrased as follows:
The vector $\partial\Phi/\partial\phi$ must be orthogonal to all vectors $\delta\phi$ which are orthogonal
to the rows of the matrix $\partial g/\partial\phi$. It follows that $\partial\Phi/\partial\phi$ must belong to the
subspace spanned by the rows of $\partial g/\partial\phi$, which in turn implies the existence
of numbers $\lambda_1^*, \lambda_2^*, \ldots, \lambda_p^*$ such that
$$\partial\Phi/\partial\phi = \sum_{i=1}^p \lambda_i^*\, \partial g_i/\partial\phi \tag{3-6-6}$$
where $p$ is the number of constraints. The $\lambda_i^*$ are called Lagrange multipliers.
Let us now construct the function
$$\Lambda(\phi, \lambda) \equiv \Phi(\phi) - \sum_{i=1}^p \lambda_i g_i(\phi) \tag{3-6-7}$$
which has a stationary point at $\phi = \phi^*$, $\lambda = \lambda^*$ if and only if
$$\partial\Lambda/\partial\phi\,|_{\phi=\phi^*} = \partial\Phi/\partial\phi\,|_{\phi=\phi^*} - \sum_{i=1}^p \lambda_i^*\, \partial g_i(\phi^*)/\partial\phi = 0 \tag{3-6-8}$$
and
$$\partial\Lambda/\partial\lambda_i\,|_{\phi=\phi^*} = -g_i(\phi^*) = 0 \qquad (i = 1, 2, \ldots, p) \tag{3-6-9}$$
But Eq. (8) and Eq. (9) correspond exactly to Eq. (6) and Eq. (3-4-1),
respectively. We thus conclude that $\Phi(\phi)$ has a constrained stationary point at
$\phi = \phi^*$ if and only if $\Lambda(\phi, \lambda)$ has an unconstrained stationary point at
$\phi = \phi^*$ and $\lambda = \lambda^*$.
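A small worked example may help fix ideas. The problem below (two parameters, one linear constraint) is an assumed illustration, not one taken from the text; it simply follows the recipe of forming the Lagrangian with the sign convention of Eq. (3-6-7) and setting its partial derivatives to zero.

```latex
\begin{aligned}
&\text{Minimize } \Phi(\phi) = \phi_1^2 + \phi_2^2
  \quad \text{subject to} \quad g(\phi) = \phi_1 + \phi_2 - 1 = 0, \\
&\Lambda(\phi, \lambda) = \phi_1^2 + \phi_2^2 - \lambda(\phi_1 + \phi_2 - 1), \\
&\partial\Lambda/\partial\phi_1 = 2\phi_1 - \lambda = 0, \qquad
 \partial\Lambda/\partial\phi_2 = 2\phi_2 - \lambda = 0, \qquad
 \partial\Lambda/\partial\lambda = -(\phi_1 + \phi_2 - 1) = 0, \\
&\Rightarrow\ \phi_1^* = \phi_2^* = \tfrac{1}{2}, \qquad \lambda^* = 1, \qquad
 \Phi(\phi^*) = \tfrac{1}{2}.
\end{aligned}
```

At the solution the gradient of the objective is $(1, 1)^T$, which equals $\lambda^*$ times the constraint gradient $(1, 1)^T$, as the multiplier condition requires.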
In some problems we have an $n \times m$ matrix of constraints $G(\phi) = 0$. In
this case we need an $m \times n$ matrix of Lagrange multipliers $L$, and the
Lagrangian takes the form
$$\Lambda(\phi, L) \equiv \Phi(\phi) + \mathrm{Tr}(LG) \tag{3-6-10}$$
To determine the nature of the stationary point $(\phi^*, \lambda^*)$, expand both
$\Phi$ and the $g_i$ in Taylor series around $\phi^*$, retaining terms up to second order:
$$\Phi(\phi^* + \delta\phi) \approx \Phi(\phi^*) + (\partial\Phi/\partial\phi)^T\,\delta\phi + \tfrac{1}{2}\,\delta\phi^T(\partial^2\Phi/\partial\phi\,\partial\phi)\,\delta\phi \tag{3-6-11}$$
$$g_i(\phi^* + \delta\phi) \approx g_i(\phi^*) + (\partial g_i/\partial\phi)^T\,\delta\phi + \tfrac{1}{2}\,\delta\phi^T(\partial^2 g_i/\partial\phi\,\partial\phi)\,\delta\phi \qquad (i = 1, 2, \ldots, p) \tag{3-6-12}$$
If $\phi^*$ is a local constrained minimum of $\Phi$, it follows that $\Phi(\phi^* + \delta\phi) \geq
\Phi(\phi^*)$ for all sufficiently small $\delta\phi$ for which
$$g_i(\phi^* + \delta\phi) = 0 \qquad (i = 1, 2, \ldots, p)$$
Such $\delta\phi$ must therefore satisfy (approximately) the equations
$$(\partial g_i/\partial\phi)^T\,\delta\phi + \tfrac{1}{2}\,\delta\phi^T(\partial^2 g_i/\partial\phi\,\partial\phi)\,\delta\phi = 0 \qquad (i = 1, 2, \ldots, p) \tag{3-6-13}$$
and the inequality
$$(\partial\Phi/\partial\phi)^T\,\delta\phi + \tfrac{1}{2}\,\delta\phi^T(\partial^2\Phi/\partial\phi\,\partial\phi)\,\delta\phi \geq 0 \tag{3-6-14}$$
Multiplying each Eq. (13) by $\lambda_i^*$ and subtracting from Eq. (14) we obtain,
in view of Eq. (6)
$$\tfrac{1}{2}\,\delta\phi^T\left[(\partial^2\Phi/\partial\phi\,\partial\phi) - \sum_{i=1}^p \lambda_i^*\,(\partial^2 g_i/\partial\phi\,\partial\phi)\right]\delta\phi \geq 0 \tag{3-6-15}$$
This must hold for any $\delta\phi$ satisfying Eq. (13), i.e., by chords joining the
solution point $\phi^*$ to neighboring points approximately on the constraint surfaces.
Because of continuity, Eq. (15) must, therefore, also be satisfied by the
tangents to the constraint surfaces at the solution, these being limits of such
chords. These tangents satisfy
$$(\partial g/\partial\phi)\,\delta\phi = 0 \tag{3-6-16}$$
That is, they are the null vectors of the matrix $\partial g/\partial\phi$. Now let $B$ be a matrix
whose columns span the null space of $\partial g/\partial\phi$. That is, every null vector $\delta\phi$
of $\partial g/\partial\phi$ can be expressed in the form $\delta\phi = Bx$, where $x$ is an arbitrary vector
whose dimension is that of the null space. If the dimension of $\phi$ is $r$, and if
$\partial g/\partial\phi$ has $p$ linearly independent rows, then the dimension of $x$ is $r - p$, and
$B$ is an $r \times (r - p)$ matrix. Letting
$$A \equiv \frac{\partial^2\Phi}{\partial\phi\,\partial\phi} - \sum_{i=1}^p \lambda_i^*\,\frac{\partial^2 g_i}{\partial\phi\,\partial\phi} \tag{3-6-17}$$
we know, from Eq. (15), that
$$x^T B^T A B x \geq 0 \tag{3-6-18}$$
for all $(r - p)$-dimensional vectors $x$, i.e., $B^T A B$ must be positive semidefinite.
Conversely, if $B^T A B$ is positive definite, it follows by continuity that
Eq. (15) holds for all sufficiently small $\delta\phi$ satisfying Eq. (13). Hence, $\phi^*$ is a
constrained minimum of $\Phi$.
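The role of the projected matrix $B^T A B$ can be seen in a two-parameter example where the Hessian of the objective itself is indefinite. The problem below is an assumed illustration: minimize $-\phi_1\phi_2$ subject to $\phi_1 + \phi_2 - 2 = 0$, whose constrained stationary point is $\phi^* = (1, 1)$ with multiplier $\lambda^* = -1$ under the sign convention of Eq. (3-6-7).

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

# Assumed example: Phi = -phi1*phi2, g = phi1 + phi2 - 2, phi* = (1, 1).
grad_phi = [-1.0, -1.0]                   # dPhi/dphi at phi*
jac_g = [[1.0, 1.0]]                      # dg/dphi (one constraint, p = 1)
lam = -1.0                                # multiplier: grad_phi = lam * jac_g[0]

hess_phi = [[0.0, -1.0], [-1.0, 0.0]]     # Hessian of Phi (indefinite)
hess_g = [[0.0, 0.0], [0.0, 0.0]]         # Hessian of the linear constraint

# A of Eq. (3-6-17).
A = [[hess_phi[i][j] - lam * hess_g[i][j] for j in range(2)] for i in range(2)]

# Columns of B span the null space of dg/dphi; here r - p = 1.
B = [[1.0], [-1.0]]
null_check = matmul(jac_g, B)             # should be [[0.0]]

projected = matmul(transpose(B), matmul(A, B))   # B^T A B, a 1 x 1 matrix
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]    # negative: A is indefinite
print(null_check, projected, det_A)
```

Here $B^T A B$ is positive although $A$ has eigenvalues of both signs, so the point is a constrained minimum even though it would be a saddle point of the unconstrained problem.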
>.:
3-7. Inequality Constraints
Suppose we wish to minimize $\Phi(\phi)$ subject to the inequality constraints
Eq. (3-4-2). If the minimum occurs at a point $\phi^*$ which is in the interior of
the feasible region, that is $h_i(\phi^*) > 0$ for all $i$, then the constraints are
irrelevant as far as the local nature of the minimum is concerned. Therefore, $\phi^*$
must satisfy the conditions for an unconstrained minimum. When $\phi^*$ lies on
the boundary of the feasible region, there will be some values of $i$ for which
$$h_i(\phi^*) = 0 \tag{3-7-1}$$
We refer to these $h_i$ as the active constraints. For the purpose of
characterizing the point $\phi^*$, we may disregard all the inactive constraints, and in the
sequel we shall consider the vector $h$ to consist of all the active constraints
alone. We denote by $t$ the number of active constraints, and by $T$ the total
number of constraints.
At an inequality constrained minimum it is required that the gradient of
the objective function should point decisively into the feasible region. We
observe that since the constraint functions are positive inside and negative
outside the feasible region, their gradients also point into the feasible region.
The necessary condition for minimality is, then, that the gradient of the
objective function should be a linear combination with positive coefficients
of the gradients of the constraint functions.
More precisely, John (1948) has proven that $\phi^*$ is a minimum only if
there exist nonnegative numbers $\lambda_0, \lambda_1, \lambda_2, \ldots, \lambda_t$ not all zero such that
$$\lambda_0\,\partial\Phi(\phi^*)/\partial\phi = \sum_{i=1}^t \lambda_i\,[\partial h_i(\phi^*)/\partial\phi] \tag{3-7-2}$$
The more famous Kuhn-Tucker condition (Kuhn and Tucker, 1951) asserts
that we may choose $\lambda_0 = 1$ provided the constraints [Eq. (1)] meet a certain
qualification,‡ which in practice is almost always satisfied.
Clearly, Eq. (2) is unaffected if we add on the right-hand side terms
corresponding to the inactive constraints, provided their multipliers assume the
value zero. The Kuhn-Tucker condition can then be stated as follows:
$$\partial\Phi(\phi^*)/\partial\phi = \sum_{i=1}^T \lambda_i\,\partial h_i/\partial\phi \tag{3-7-3}$$
$$h_i(\phi^*) \geq 0 \qquad (i = 1, 2, \ldots, T) \tag{3-7-4}$$
$$\lambda_i \geq 0 \qquad (i = 1, 2, \ldots, T) \tag{3-7-5}$$
$$\lambda_i h_i(\phi^*) = 0 \qquad (i = 1, 2, \ldots, T) \tag{3-7-6}$$
The last equation states that either a constraint is active ($h_i = 0$) or its
multiplier vanishes. This is known as the principle of complementary slackness.
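The four Kuhn-Tucker relations, Eq. (3-7-3) through Eq. (3-7-6), translate directly into a checking routine. The sketch below is a hypothetical helper (its name, argument layout, and tolerance are assumptions) applied to an assumed problem: minimize $(\phi_1 - 2)^2 + (\phi_2 - 1)^2$ subject to $h_1 = 1 - \phi_1 \geq 0$ and $h_2 = \phi_2 \geq 0$.

```python
def kuhn_tucker_holds(grad_phi, h_values, grad_hs, lambdas, tol=1e-8):
    """Check Eqs. (3-7-3) to (3-7-6) at a candidate point.

    grad_phi : gradient of the objective at the point
    h_values : values h_i of all T constraints at the point
    grad_hs  : gradients of the h_i at the point, one list per constraint
    lambdas  : candidate multipliers, one per constraint
    """
    # Eq. (3-7-3): objective gradient is a combination of constraint gradients.
    for k in range(len(grad_phi)):
        rhs = sum(lam * gh[k] for lam, gh in zip(lambdas, grad_hs))
        if abs(grad_phi[k] - rhs) > tol:
            return False
    feasible = all(h >= -tol for h in h_values)                 # Eq. (3-7-4)
    nonnegative = all(lam >= -tol for lam in lambdas)           # Eq. (3-7-5)
    slack = all(abs(lam * h) <= tol
                for lam, h in zip(lambdas, h_values))           # Eq. (3-7-6)
    return feasible and nonnegative and slack

def problem_at(p):
    """Gradient of the assumed objective and constraint data at point p."""
    grad_phi = [2.0 * (p[0] - 2.0), 2.0 * (p[1] - 1.0)]
    h_values = [1.0 - p[0], p[1]]
    grad_hs = [[-1.0, 0.0], [0.0, 1.0]]
    return grad_phi, h_values, grad_hs

# At (1, 1) the first constraint is active, the second inactive.
at_solution = kuhn_tucker_holds(*problem_at([1.0, 1.0]), lambdas=[2.0, 0.0])
# At the interior point (0.5, 1) no admissible multipliers exist; the ones
# tried here satisfy Eq. (3-7-3) but violate complementary slackness.
at_interior = kuhn_tucker_holds(*problem_at([0.5, 1.0]), lambdas=[3.0, 0.0])
print(at_solution, at_interior)
```

The routine only verifies a given set of multipliers; finding them is part of the optimization process itself.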
Sufficient conditions for local optimality have been derived by McCormick
(1967) and Fiacco (1968). They require the quantity $y^T A y$ to be positive for
all vectors $y$ which point into the feasible region or are tangent to it at $\phi^*$.
Here $A$ is the Hessian of $\Phi - \sum_i \lambda_i h_i$ ($\lambda_0$ having been set to one).
The role of the $\lambda_i$ here is analogous to that of the Lagrange multipliers
in the case of equality constraints. The $\lambda_i$ are called dual variables or shadow
prices. When both equality and inequality constraints apply, the necessary
condition for a minimum states that there exist scalars $\mu_1, \mu_2, \ldots$ and
nonnegative scalars $\lambda_1, \lambda_2, \ldots$ such that
$$\partial\Phi/\partial\phi\,|_{\phi=\phi^*} = \sum_i \mu_i\,\partial g_i/\partial\phi\,|_{\phi=\phi^*} + \sum_i \lambda_i\,\partial h_i/\partial\phi\,|_{\phi=\phi^*} \tag{3-7-7}$$
The subject of optimality conditions is discussed in great detail by
Mangasarian (1969).
‡ The qualification requires that for every vector $u$ such that $u^T\,\partial h_i/\partial\phi\,|_{\phi=\phi^*} \geq 0$
($i = 1, 2, \ldots, t$), there should exist a vector of functions $\phi(\tau)$ such that $\phi(0) = \phi^*$, $\phi(\tau)$
is in the feasible region for $0 \leq \tau \leq 1$, and $\partial\phi/\partial\tau\,|_{\phi=\phi^*} = u$. A case where the qualification
is not met occurs when $\phi^*$ is at a cusp formed by constraints tangent to each other.
3-8. Problems
1. Consider the model whose likelihood is given by Eq. (3-2-7). Show that
if e and v are unknown parameters, then I;: = I \1'1 1 and I;:= I 11'/ form a pair
of statistics which is jointly sufficient for 0 and v.
2. Show that the Lagrange multiplIers represent the unit cost of the con-
strain.ts. That is, if a constraint g;(O) = 0 is replaced by g,(O) = E, then the
minimum attainable value of cfJ is increased (to a first-order approximation)
by an amount Xi E.
3. As above, but for lllequality constrall1ts.
4. Let A be a square matrix. Using Lagrange multipliers, show that of all
vectors x of unit length (i.e., x T x = I), the ones for which x TAx i:-. eithcr a
minimum or a maximum are eigenvectors of A.
Chapter IV
Methods of Estimation
4-1. Residuals
In the previous chapter we have discussed, in general terms, what desirable
properties an estimator should possess, and how any specific estimator may
be tested in order to determine whether it possesses these properties. We now
proceed to describe specific estimators, or estimation methods, which are in
widespread use.
We have defined the error $\epsilon_{\mu a}$ as the difference between the measured and
true values of a variable. In the case of a reduced model $y = f(x, \theta)$, if we
knew the true value $\hat\theta$ of $\theta$ we could compute the error
$$\epsilon_{\mu a} \equiv y_{\mu a} - f_a(x_\mu, \hat\theta) \tag{4-1-1}$$
We can also compute these differences for any other value of $\theta$. This defines
functions
$$e_{\mu a}(\theta) \equiv y_{\mu a} - f_a(x_\mu, \theta) \tag{4-1-2}$$
to which we refer as the residuals. The errors are equal to the residuals
evaluated with the true value $\theta = \hat\theta$.
The residuals in an inexact structural model $g(z, \theta) = \gamma$ are obtained
simply by evaluation of the model equations
$$e_{\mu a}(\theta) = g_a(z_\mu, \theta) \tag{4-1-3}$$
When the structural model is exact, the residuals are the differences between
the observed and estimated values of the variables
$$e_{\mu a}(W^*) \equiv w_{\mu a} - w_{\mu a}^* \tag{4-1-4}$$
A. Least Squares
4-2. Unweighted Least Squares
The method of least squares is the oldest and most widely used estimation
procedure. At least some of its popularity is due to the fact that it can be
applied in an ad hoc manner directly to the deterministic model, without any
cognizance being taken of the probability distribution of the observations.
Needless to say, estimates obtained in such a way may be very unsatisfactory
indeed, although one can envision situations in which nothing better can be
done. We do not wish to imply that least squares estimates are always merely
ad hoc. Quite the contrary is true, and where the observations have certain
probability distributions, these estimates even possess optimal statistical
properties, which will be described in the sequel.
In cases of pure curve fitting, where the coefficients have no physical
significance, the least squares method is usually adequate.
We employ the following notation: A small capital letter denotes the
vector formed by adjoining to each other the rows of the matrix denoted by
the same letter. Thus
$$\textsc{e}^T = [e_{11}, e_{12}, \ldots, e_{1m}, e_{21}, \ldots, e_{nm}]$$
The least squares procedure in its simplest form consists of finding the values
of $\theta$ which minimize the function
$$\Phi(\theta) = \textsc{e}^T(\theta)\,\textsc{e}(\theta) = \mathrm{Tr}[E^T(\theta)E(\theta)] \tag{4-2-1}$$
which is, in component form
$$\Phi(\theta) = \sum_{a=1}^m \sum_{\mu=1}^n e_{\mu a}^2(\theta) \tag{4-2-2}$$
i.e., we minimize the sum of squares of the residuals. When $m = 1$ we speak
of single equation least squares. In practice, most estimation problems fall
into this category. A typical problem is worked out in detail in Section 5-21.
We derive the normal equations easily
$$\partial\Phi/\partial\theta = 2\sum_{a=1}^m \sum_{\mu=1}^n e_{\mu a}\,\partial e_{\mu a}/\partial\theta = 0 \tag{4-2-3}$$
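In the most common case of a single reduced equation, $e_\mu = y_\mu - f(x_\mu, \theta)$,
and we have:
$$\Phi(\theta) = \sum_{\mu=1}^n [y_\mu - f(x_\mu, \theta)]^2 \tag{4-2-4}$$
$$\partial\Phi/\partial\theta_\alpha = -2\sum_{\mu=1}^n e_\mu\,\partial f(x_\mu, \theta)/\partial\theta_\alpha \qquad (\alpha = 1, 2, \ldots, l) \tag{4-2-5}$$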
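For a concrete single-equation case, the objective of Eq. (4-2-4) and its gradient, Eq. (4-2-5), can be coded directly. The exponential decay model, the data values, and the function names below are assumptions made for illustration; the finite-difference comparison at the end is only a sanity check of the analytic gradient.

```python
import math

def f(x, theta):
    """Assumed model: y = theta1 * exp(-theta2 * x)."""
    return theta[0] * math.exp(-theta[1] * x)

def df_dtheta(x, theta):
    """Partial derivatives of f with respect to theta1 and theta2."""
    e = math.exp(-theta[1] * x)
    return [e, -theta[0] * x * e]

def objective(theta, xs, ys):
    """Sum of squared residuals, Eq. (4-2-4)."""
    return sum((y - f(x, theta)) ** 2 for x, y in zip(xs, ys))

def gradient(theta, xs, ys):
    """Gradient of the objective, Eq. (4-2-5)."""
    g = [0.0, 0.0]
    for x, y in zip(xs, ys):
        residual = y - f(x, theta)
        d = df_dtheta(x, theta)
        for a in range(2):
            g[a] += -2.0 * residual * d[a]
    return g

xs = [0.0, 1.0, 2.0, 3.0]             # made-up independent variable values
ys = [2.0, 1.2, 0.8, 0.4]             # made-up observations
theta = [1.5, 0.3]                    # trial parameter values

analytic = gradient(theta, xs, ys)
h = 1e-6
numeric = []
for a in range(2):
    up = list(theta); up[a] += h
    dn = list(theta); dn[a] -= h
    numeric.append((objective(up, xs, ys) - objective(dn, xs, ys)) / (2 * h))
print(analytic, numeric)
```

Setting this gradient to zero gives the normal equations for the assumed model; they are nonlinear in $\theta_2$, which is why iterative optimization is needed in general.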
4-3. Weighted Least Squares
An objective function consisting of a simple sum of squares is often
unsatisfactory for the following reasons:
(a) The various quantities $y_{\mu a}$ (or $g_{\mu a}$) may represent entities having different
physical dimensions, or measured on different scales. For instance, $y_{\mu 1}$ may
be the concentration of a chemical, expressed in mole fractions and falling
in the range zero to one; at the same time, $y_{\mu 2}$ may be a temperature
measured in degrees centigrade, and falling in the range 500-1000. It clearly makes
no sense to sum together squares of numbers of such disparate orders of
magnitude; the residuals of the temperatures are likely to dominate those of
the mole fractions and any information contained in the latter will be lost.
(b) Some observations may be known to be less reliable than others, and we
want to make sure that our parameter estimates will be less influenced by
those than by the more accurate ones. (Note that we cannot, after all, escape
the statistical structure of the data.)
The solution to both of these problems is one and the same; assign a
nonnegative weight factor $b_{\mu a}$ to each $e_{\mu a}^2(\theta)$, and minimize
$$\Phi(\theta) \equiv \sum_{a=1}^m \sum_{\mu=1}^n b_{\mu a}\, e_{\mu a}^2(\theta) \tag{4-3-1}$$
We choose small $b_{\mu a}$ for $y_{\mu a}$ which are measured on a large scale, or which are
highly unreliable, and conversely for large $b_{\mu a}$. A more general formulation
is one which assigns weights to cross product terms as well, i.e.,
$$\Phi(\theta) = \sum_{a=1}^m \sum_{b=1}^m \sum_{\mu=1}^n \sum_{\eta=1}^n B_{(\mu a)(\eta b)}\, e_{\mu a}(\theta)\, e_{\eta b}(\theta) \tag{4-3-2}$$
The weights $B_{(\mu a)(\eta b)}$ must be elements of a positive definite or semidefinite
matrix, for otherwise $\Phi(\theta)$ can be made to approach $-\infty$. Clearly, Eq. (1)
is a special case of Eq. (2), with $B_{(\mu a)(\eta b)} = b_{\mu a}\delta_{\mu\eta}\delta_{ab}$, and Eq. (4-2-2) is a
special case of Eq. (1), with all $b_{\mu a} = 1$.
Additional important special cases of (2) are the following:
(a) Weighting by Variable. Where $B_{(\mu a)(\eta b)} = 0$ $(\mu \neq \eta)$, and the weights are
independent of $\mu$:
$$\Phi(\theta) = \sum_{a=1}^{m} \sum_{b=1}^{m} B_{ab} \sum_{\mu=1}^{n} e_{\mu a}(\theta)\, e_{\mu b}(\theta) = \sum_{\mu=1}^{n} e_\mu^T B e_\mu \tag{4-3-3}$$
When B is a diagonal matrix, this simplifies to
$$\Phi(\theta) = \sum_{a=1}^{m} b_a \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-3-4}$$
(b) Weighting by Experiment. Applicable mostly to the single equation case:
$$\Phi(\theta) = \sum_{\mu=1}^{n} b_\mu\, e_\mu^2(\theta) \tag{4-3-5}$$
(c) Weighting by Experiment and Variable.
$$\Phi(\theta) = \sum_{\mu=1}^{n} \sum_{a=1}^{m} \sum_{b=1}^{m} b_\mu B_{ab}\, e_{\mu a}(\theta)\, e_{\mu b}(\theta) = \sum_{\mu=1}^{n} b_\mu\, e_\mu^T B e_\mu \tag{4-3-6}$$
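The relation between the general form (4-3-2) and its special cases can be
checked numerically. The sketch below stores the residuals as an array
`E[mu, a]` and the general weights as a four-index array; this array layout,
the function names, and the numbers are assumptions of the example.

```python
import numpy as np

def phi_general(E, B4):
    # Eq. (4-3-2): B4[mu, a, eta, b] holds B_{(mu a)(eta b)}
    return float(np.einsum("ma,maeb,eb->", E, B4, E))

def phi_by_variable(E, B):
    # Eq. (4-3-3): sum over experiments of e_mu^T B e_mu
    return float(np.einsum("ma,ab,mb->", E, B, E))

def phi_by_experiment_and_variable(E, b, B):
    # Eq. (4-3-6): sum over experiments of b_mu e_mu^T B e_mu
    return float(np.einsum("m,ma,ab,mb->", b, E, B, E))

rng = np.random.default_rng(0)
n, m = 4, 2
E = rng.normal(size=(n, m))                # illustrative residuals e_{mu a}
b = np.array([1.0, 0.5, 2.0, 1.0])         # one weight per experiment
B = np.array([[2.0, 0.3], [0.3, 1.0]])     # positive definite weights per variable pair

# Embed case (c) in the general form: B_{(mu a)(eta b)} = b_mu delta_{mu eta} B_ab
B4 = np.einsum("m,me,ab->maeb", b, np.eye(n), B)
phi_c = phi_by_experiment_and_variable(E, b, B)
```

With all `b` equal to one, case (c) collapses to case (a), which is why the
same moment-type sums appear in both formulas.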
Does statistical theory tell us what values should be assigned to the weights,
or when we are entitled to use the simpler formulas? The answer is at least
partly in the affirmative. We shall see later that if the model equations are
linear in the parameters (Section 4-4), or if the number of observations is
large and the errors are normally distributed (Section 4-7), then the choice of
weights leading to least-variance estimates is given by the elements of the
inverse of the covariance matrix of the errors. That is,
$$B_{(\mu a)(\eta b)} = (V^{-1})_{(\mu a)(\eta b)} \tag{4-3-7}$$
where
$$V_{(\mu a)(\eta b)} = E(\epsilon_{\mu a}\epsilon_{\eta b}) \tag{4-3-8}$$
Although we cannot prove optimal properties in the general case (nonnormal
distributions with nonlinear models), it is still reasonable, and
approximately optimal, to use weights which are the elements of the inverse of
the covariance matrix. When the covariance matrix is not known, one may
choose either to guess or to use a method such as maximum likelihood which
sometimes enables one to estimate the weights along with the other
parameters. Or, one can obtain a direct estimate of the covariance matrix by
replicating at least some of the experiments.
4-4. Multiple Linear Regression
When the model is linear, the choice of proper weights in Eq. (4-3-2)
ensures optimal statistical properties for the corresponding estimators. The
linear model takes the form
$$f_\mu = f(x_\mu, \theta) = B_\mu(x_\mu)\,\theta \tag{4-4-1}$$
where $B_\mu(x_\mu)$ is a matrix of given functions (polynomials and trigonometric
functions are often used in curve fitting).† Adjoining the equations for all
values of $\mu$, we obtain, in matrix form
$$F = B\theta \tag{4-4-2}$$
where $F^T \equiv [f_1^T, f_2^T, \ldots, f_n^T]$ and $B^T \equiv [B_1^T(x_1), B_2^T(x_2), \ldots, B_n^T(x_n)]$. For
given data, B is a constant matrix. Suppose the $x_\mu$ are measured precisely, and
each observation $y_\mu$ is a sample from a random variable whose mean value
is $f_\mu$, and let the joint covariance matrix of all the elements of Y be V, i.e.,
$$E(y_{\mu a} - f_{\mu a})(y_{\eta b} - f_{\eta b}) = V_{(\mu a)(\eta b)} \tag{4-4-3}$$
If we determine $\theta = \theta^*$ so as to minimize the function
$$\Phi(\theta) \equiv (Y - B\theta)^T V^{-1} (Y - B\theta) \tag{4-4-4}$$
then $\theta^*$ must satisfy:
$$\frac{\partial \Phi}{\partial \theta} = -2 B^T V^{-1} (Y - B\theta^*) = 0 \tag{4-4-5}$$
This is equivalent to the normal equations
$$B^T V^{-1} B\, \theta^* = B^T V^{-1} Y \tag{4-4-6}$$
Solving for $\theta^*$, we find, provided $B^T V^{-1} B$ is nonsingular,‡
$$\theta^* = (B^T V^{-1} B)^{-1} B^T V^{-1} Y \tag{4-4-7}$$
† Suppose we are fitting the single equation model $f(x, \theta) = \theta_1 + \theta_2 x + \theta_3 x^2$. Then
$B_\mu$ is the row vector $[1, x_\mu, x_\mu^2]$ and
$$B = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}$$
‡ If $B^T V^{-1} B$ is singular, the normal equations possess infinitely many solutions. Of
these, the one for which $\theta^{*T}\theta^*$ is minimum is given by
$$\theta^* = (B^T V^{-1} B)^{+} B^T V^{-1} Y$$
where $A^{+}$ is the pseudoinverse of A (see Section A-1).
This is the well-known multiple linear regression formula. Clearly $\theta^*$ is a
linear estimate, having the form $\theta^* = AY$. By our assumption, $E(Y) = B\theta$.
Hence, B and V being constant, we find from Eq. (7)
$$E(\theta^*) = \theta \tag{4-4-8}$$
That is, $\theta^*$ is an unbiased estimate of $\theta$. Now, Eq. (3) is equivalent to
$$E(Y - B\theta)(Y - B\theta)^T = V$$
Also, it is easily seen that
$$\theta^* - \theta = (B^T V^{-1} B)^{-1} B^T V^{-1} (Y - B\theta)$$
Hence the covariance of the sampling distribution of the estimate $\theta^*$ turns
out to be
$$V_\theta \equiv E(\theta^* - \theta)(\theta^* - \theta)^T = (B^T V^{-1} B)^{-1} \tag{4-4-9}$$
The Gauss-Markov Theorem (proved in Appendix E) asserts that among
all linear unbiased estimates, Eq. (7) yields the one whose variance is smallest.
If, furthermore, the distribution of the $\epsilon_\mu$ is normal, the estimate is efficient.
In the case where the errors of all the observations are independent and of
equal variance $\sigma^2$, we have $V = \sigma^2 I$, and
$$\theta^* = (B^T B)^{-1} B^T Y \tag{4-4-10}$$
which is the usual unweighted linear least squares estimate. The covariance
matrix of this estimate is given by $V_\theta = \sigma^2 (B^T B)^{-1}$.
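A minimal numerical sketch of Eqs. (4-4-7) and (4-4-9) follows, using the
quadratic model of the footnote. The data, the diagonal covariance, and the
function name `gls_estimate` are assumptions of the example. The normal
equations (4-4-6) are solved directly instead of forming $V^{-1}$ explicitly,
which is one common numerical choice and not a prescription of the text.

```python
import numpy as np

def gls_estimate(B, V, Y):
    # Solve the normal equations (4-4-6): (B^T V^-1 B) theta* = B^T V^-1 Y
    Vinv_B = np.linalg.solve(V, B)
    Vinv_Y = np.linalg.solve(V, Y)
    normal_matrix = B.T @ Vinv_B
    theta_star = np.linalg.solve(normal_matrix, B.T @ Vinv_Y)
    cov_theta = np.linalg.inv(normal_matrix)   # Eq. (4-4-9)
    return theta_star, cov_theta

x = np.linspace(-1.0, 1.0, 7)
B = np.column_stack([np.ones_like(x), x, x**2])   # rows [1, x_mu, x_mu^2]
theta_true = np.array([1.0, -2.0, 0.5])
V = np.diag(np.linspace(0.1, 1.0, x.size))        # assumed known error covariance
Y = B @ theta_true                                # error-free data for checking

theta_star, cov_theta = gls_estimate(B, V, Y)

# Special case V = sigma^2 I, Eq. (4-4-10): the weights cancel out of theta*
theta_unweighted, _ = gls_estimate(B, 0.25 * np.eye(x.size), Y)
```

With error-free data the estimate reproduces `theta_true` whatever V is used;
V matters only once the observations contain errors.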
Computational methods for solving linear regression problems are
discussed in Section 5-11. A question that often arises in connection with linear
regression problems is which variables should be included, and which should
be excluded from the model. Stated in another way, the question is which
parameters should be left out (assumed to be zero) because they do not
contribute significantly to the model. The method of stepwise regression (Section
A-3) provides an answer to this question.
Before leaving the subject of linear models, let us examine briefly the
question of how the optimal properties of the regression estimate are affected
when the assumed model is incorrect.
First, suppose some important terms were omitted from the model. As a
result, it is no longer true that $E(Y) = B\theta$; rather, we have
$$E(Y) = B\theta + s \tag{4-4-11}$$
where s is a fixed vector consisting of the omitted terms. If $\theta^*$ is computed
from Eq. (7), we find
$$E(\theta^*) = \theta + (B^T V^{-1} B)^{-1} B^T V^{-1} s \tag{4-4-12}$$
so that $\theta^*$ is no longer an unbiased estimate. The bias is precisely equal to
$(B^T V^{-1} B)^{-1} B^T V^{-1} s$.
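The bias term can be evaluated directly. In the sketch below a straight line
is fitted to data whose mean also contains a quadratic term; the omitted term
`s`, the parameter values, and the variable names are hypothetical and serve
only to illustrate Eq. (4-4-12).

```python
import numpy as np

x = np.linspace(0.0, 1.0, 6)
B = np.column_stack([np.ones_like(x), x])   # fitted model: theta_1 + theta_2 * x
s = 0.3 * x**2                              # omitted term (hypothetical)
theta = np.array([1.0, 2.0])                # true values of the retained parameters
V = np.eye(x.size)                          # assumed error covariance

# theta* = A Y with A = (B^T V^-1 B)^-1 B^T V^-1, so E(theta*) = A E(Y)
V_inv = np.linalg.inv(V)
A = np.linalg.solve(B.T @ V_inv @ B, B.T @ V_inv)

expected_theta_star = A @ (B @ theta + s)   # uses E(Y) = B theta + s, Eq. (4-4-11)
bias = A @ s                                # second term of Eq. (4-4-12)
```

Because `A @ B` is the identity, the bias depends only on how much of `s`
projects onto the columns of B.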
Secondly, consider the case where an erroneous value has been taken for
V. Suppose the true covariance matrix is $U \neq V$. Then the covariance matrix
of the estimate Eq. (7) is given by
$$V_\theta = (B^T V^{-1} B)^{-1} B^T V^{-1} U V^{-1} B (B^T V^{-1} B)^{-1} \tag{4-4-13}$$
We wish to determine how inefficient this estimate is relative to the best
possible estimate in which V = U. The covariance of the latter estimate is,
according to Eq. (9), $(B^T U^{-1} B)^{-1}$. We define the relative inefficiency e of
Eq. (7) as the ratio of its generalized covariance to the minimum attainable
generalized covariance, i.e.,
$$e = \frac{\det\left[(B^T V^{-1} B)^{-1} B^T V^{-1} U V^{-1} B (B^T V^{-1} B)^{-1}\right]}{\det\left[(B^T U^{-1} B)^{-1}\right]} \tag{4-4-14}$$
Clearly, e = 1 if V = U. In other cases, it can be shown that e can assume any
value in the range given below; its actual value depends on B:
$$1 \le e \le (1 + \alpha)^2 / 4\alpha \tag{4-4-15}$$
where $\alpha$ is the condition number, i.e., the ratio of largest to smallest
eigenvalues, of the matrix $V^{-1/2} U V^{-1/2}$. To illustrate, suppose an unweighted least
squares estimate is used where in fact the error variances range between 10
and 100. We have, then, $V^{-1/2} = I$, and $U = \mathrm{diag}(u)$, where u is a vector of
numbers in the range 10 to 100. It follows that $V^{-1/2} U V^{-1/2} = U = \mathrm{diag}(u)$,
and $\alpha = 100/10 = 10$. The inefficiency of $\theta^*$ may be as high as $(1 + 10)^2/40 \approx 3$.
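The numerical illustration can be reproduced for the simplest possible case,
a single parameter estimated from two observations with variances 10 and 100.
This two-observation design and the function name are assumptions chosen so
that the upper end of the range in Eq. (4-4-15) is attained; other choices of
B give smaller values of e.

```python
import numpy as np

def relative_inefficiency(B, V, U):
    # Numerator: covariance of theta* when V is used but U is true, Eq. (4-4-13)
    Vinv_B = np.linalg.solve(V, B)
    normal_inv = np.linalg.inv(B.T @ Vinv_B)
    cov_used = normal_inv @ Vinv_B.T @ U @ Vinv_B @ normal_inv
    # Denominator: best attainable covariance, Eq. (4-4-9) with V = U
    cov_best = np.linalg.inv(B.T @ np.linalg.solve(U, B))
    return np.linalg.det(cov_used) / np.linalg.det(cov_best)   # Eq. (4-4-14)

B = np.array([[1.0], [1.0]])        # one parameter, two observations
V = np.eye(2)                       # unweighted least squares is used
U = np.diag([10.0, 100.0])          # true error variances

e = relative_inefficiency(B, V, U)

eigenvalues = np.linalg.eigvalsh(U)             # V^(-1/2) U V^(-1/2) = U here
alpha = eigenvalues.max() / eigenvalues.min()   # condition number
upper_bound = (1.0 + alpha) ** 2 / (4.0 * alpha)
```

For this design `e` and `upper_bound` both equal 121/40 = 3.025.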
While an estimate of the form Eq. (7) or Eq. (10) is the best unbiased
estimate, it is possible to construct biased estimates whose total expected
squared error is less. For instance, in the method of ridge regression (Hoerl,
1962; Hoerl and Kennard, 1970), one substitutes for Eq. (10) the estimate
$$\theta^*(\lambda) = (B^T B + \lambda I)^{-1} B^T Y \tag{4-4-16}$$
where $\lambda$ is a positive parameter. It can be shown that the expected square
error is
$$U(\lambda) \equiv E(\theta^*(\lambda) - \theta)(\theta^*(\lambda) - \theta)^T = (B^T B + \lambda I)^{-1} (\sigma^2 B^T B + \lambda^2 \theta\theta^T)(B^T B + \lambda I)^{-1} \tag{4-4-17}$$
and that the quantity $\det U(\lambda)$ is minimum when $\lambda$ satisfies the equation
$$\lambda\, \theta^T (\sigma^2 B^T B + \lambda^2 \theta\theta^T)^{-1} \theta = \mathrm{Tr}\,(B^T B + \lambda I)^{-1} \tag{4-4-18}$$
Since $\theta$ is unknown, the optimal $\lambda$ cannot be determined a priori. Hoerl and
Kennard recommend construction of a so-called ridge trace, which is a plot
of the components of $\theta^*(\lambda)$ versus $\lambda$ with $\lambda$ increasing from zero. One chooses
a value of $\lambda$ where $\theta^*(\lambda)$ ceases to vary rapidly. Note that at $\lambda = 0$ we have
the usual least squares estimate. Note also that $\lambda = 0$ does not satisfy Eq. (18).
Hence, the least squares estimate is never the linear least squared error
estimate.
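A ridge trace is easy to tabulate. The sketch below evaluates Eq. (4-4-16) on
a small grid of $\lambda$ values for a deliberately ill-conditioned design; the
design matrix, the data, the grid, and the function names are assumptions of
the example, and in practice the rows of `trace` would be plotted against
$\lambda$.

```python
import numpy as np

def ridge_estimate(B, Y, lam):
    # Eq. (4-4-16): theta*(lambda) = (B^T B + lambda I)^-1 B^T Y
    n_params = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(n_params), B.T @ Y)

def ridge_trace(B, Y, lambdas):
    # One row of estimates per value of lambda
    return np.array([ridge_estimate(B, Y, lam) for lam in lambdas])

x = np.linspace(0.0, 1.0, 8)
# The second and third columns are nearly collinear (illustrative choice)
B = np.column_stack([np.ones_like(x), x, x + 0.01 * np.sin(7.0 * x)])
Y = 1.0 + 2.0 * x + 0.05 * np.cos(11.0 * x)

lambdas = np.array([0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])
trace = ridge_trace(B, Y, lambdas)
norms = np.linalg.norm(trace, axis=1)   # length of theta*(lambda) for each lambda
```

The first row of `trace` is the ordinary least squares estimate; increasing
$\lambda$ shrinks the estimate toward the origin.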
B. Maximum Likelihood
4-5. Definition
In Sections 2-12 and 2-13 we defined the likelihood function $L(\phi)$ of the
sample as being the joint pdf of the observations, viewed as a function of the
unknown parameters $\phi$. These unknown parameters were of three kinds:
1. $\theta$ represents the unknown parameters of the deterministic models.
2. $\hat{W}$ represents the true values of the observed variables.
3. $\psi$ represents other distribution parameters.
In Section 2-13 we saw that the model equations could be regarded as equality
constraints which limit the possible values that the $\theta$ and $\hat{W}$ could attain.
In addition, prior information may impose certain inequality constraints
(e.g., nonnegativity) on the parameters.
The maximum likelihood estimate (MLE) of $\phi$ is that value of $\phi$ satisfying
all the equality and inequality constraints, for which the likelihood function
attains its maximum value (if such a value exists).
Under relatively mild conditions on the form of the likelihood function,
the MLE is consistent and asymptotically efficient. This is a strong argument
for using the MLE when the sample is large. The MLE does not usually
possess any optimal properties for small samples. It is generally neither
unbiased nor efficient, although it is sufficient when a sufficient statistic
exists. Sampling experiments [see, e.g., Chow (1964), Cragg (1967), Carney
and Goldwyn (1967)] have shown, however, that the maximum likelihood
method produces acceptable estimates in many situations. Whereas better
methods may be available in specific cases, a powerful argument for the
use of the maximum likelihood method is its generality and relative ease of
application.
Since the logarithm is a monotonic increasing function of its argument,
the value of $\phi$ that maximizes $L(\phi)$ also maximizes $\log L(\phi)$. Since log L is
frequently a simpler function than L, it is in terms of maximizing log L that
we shall usually formulate the problem.
The following heuristic argument may make the maximum likelihood
method seem plausible: the probability of observing a sample lying in a region
$\delta W$ around the actually observed sample W is given by $p(W \mid \phi)\,\delta W =
L(\phi)\,\delta W$. The value $\phi = \phi^*$ for which this probability is greatest is the MLE.
We say that $\phi^*$ is the most likely value of $\phi$. Of all possible values of the
parameters, $\phi^*$ is the one having the largest probability of giving rise to a
sample within $\delta W$ of the actually observed one.
4-6. Likelihood Equations
In this and subsequent sections we examine the application of MLE to
various cases. We shall proceed as far as we can formally. In most
applications, the final computation of the estimates requires numerical methods to
be described in the next two chapters.
We first discuss the case where no constraints of any kind apply. This
occurs when our model is of the reduced type, discussed in Section 2-12.
Then the likelihood is a function of $\theta$ and $\psi$ alone, as shown by Eq. (2-12-3).
We know from Section 3-5 that a free (unconstrained) maximum of the
function $\log L(\phi)$ must satisfy the set of likelihood equations
$$\partial \log L(\phi)/\partial \phi = 0 \tag{4-6-1}$$
Since $L(\phi) = p(W \mid \phi)$, we find from Eq. (3-2-18) that if p is a sufficient
statistic for $\phi$, then Eq. (1) is equivalent to
$$f(p, \phi) = 0 \tag{4-6-2}$$
Hence, to calculate the maximum likelihood estimate it is sufficient to know
the value of the sufficient statistic p, and we may discard the original data W.
Unfortunately, there are few practical cases involving nonlinear models for
which a sufficient statistic can be found.
One may approach the problem of finding the estimates in two ways:
(a) By solving the likelihood equations, and then determining whether the
solution is indeed a maximum. This is the approach taken when the solution
is to be found analytically.
(b) By attempting to find the maximum of the likelihood function directly,
paying no regard to the likelihood equations. This is the more fruitful
approach when the solution is to be found numerically (see Chapter V). Even
in this case, however, the likelihood equations can sometimes be used to
eliminate some of the parameters, thus reducing the size of the problem to be
solved numerically. This method, known as stagewise maximization, works
out particularly well for the elimination of the distribution parameters $\psi$,
and specific illustrations are given in Sections 4-8 and 4-9.
4-7. Normal Distribution
We consider the case of a normal distribution. In the most general case,
we denote by $T_{(\mu a)(\eta b)}$ the covariance of $\epsilon_{\mu a}$ with $\epsilon_{\eta b}$, and by $B_{(\mu a)(\eta b)}$ the
elements of the matrix inverse to T. The errors $\epsilon_{\mu a}$ ($\mu = 1, 2, \ldots, n$; $a = 1, 2,
\ldots, m$) possess the normal distribution $N_{nm}(0, T)$. The logarithm of the pdf
may be derived from Eq. (2-8-10) as being
$$\log L = -(nm/2) \log 2\pi - \tfrac{1}{2} \log \det T - \tfrac{1}{2} \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{\mu=1}^{n} \sum_{\eta=1}^{n} B_{(\mu a)(\eta b)}\, e_{\mu a}(\theta)\, e_{\eta b}(\theta) \tag{4-7-1}$$
If all the elements of T are known, finding the values of $\theta$ that maximize
Eq. (1) is equivalent to minimizing
$$\Phi(\theta) \equiv \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{\mu=1}^{n} \sum_{\eta=1}^{n} B_{(\mu a)(\eta b)}\, e_{\mu a}(\theta)\, e_{\eta b}(\theta) \tag{4-7-2}$$
which is the same as Eq. (4-3-2). Thus, for a normal distribution with known
covariance, MLE reduces to weighted least squares, with the weights given
by the elements of the inverse of the covariance matrix.
We now turn to the case when the covariance matrix is not known, and
must be estimated from the data. Variances and covariances are measures of
the magnitudes of the errors. The data themselves can tell us nothing about
the magnitude of an error unless we have replications of the same error. If we
measure the length of an object once, we can gain no idea of the error in the
measurement; if we measure it twice, the difference between the measurements
can be used to estimate the error.
In the general case described by Eq. (1) we have no replication: each
error $\epsilon_{\mu a}$ is assumed to have its own variance $T_{(\mu a)(\mu a)}$. In order that
estimation of the variances should be feasible, we must assume that several
measurements (or quantities derived from them) possess identical variances.
The following are typical assumptions:
(a) Errors in different experiments are independent.
(b) Errors in each experiment are distributed with the same covariance
matrix V.
Both assumptions may be summarized by
$$T_{(\mu a)(\eta b)} = \delta_{\mu\eta} V_{ab}, \qquad V_{ab} \equiv E(\epsilon_{\mu a}\epsilon_{\mu b}) \tag{4-7-3}$$
The trace of a matrix A is the sum of its diagonal elements, i.e., $\mathrm{Tr}(A) \equiv
\sum_i A_{ii}$. It follows that $\mathrm{Tr}(AB) = \sum_{i,j} A_{ij} B_{ji}$. It is easily verified then that
$$\sum_{\mu=1}^{n} e_\mu^T(\theta)\, V^{-1} e_\mu(\theta) = \mathrm{Tr}[V^{-1} M(\theta)] \tag{4-7-4}$$
where $M(\theta)$ is the moment matrix of the residuals, defined by $M_{ab} \equiv
\sum_{\mu=1}^{n} e_{\mu a} e_{\mu b}$, i.e.,
$$M(\theta) \equiv \sum_{\mu=1}^{n} e_\mu(\theta)\, e_\mu^T(\theta) \tag{4-7-5}$$
Under our assumptions (a) and (b) above, the likelihood function takes
the form
$$\log L = -(nm/2) \log 2\pi - (n/2) \log \det V - \tfrac{1}{2} \mathrm{Tr}[V^{-1} M(\theta)] \tag{4-7-6}$$
Clearly, maximizing Eq. (6) when V is a known matrix is equivalent to
minimizing
$$\Phi(\theta) \equiv \tfrac{1}{2} \mathrm{Tr}[V^{-1} M(\theta)] \tag{4-7-7}$$
Retaining the factor $\tfrac{1}{2}$ in the above expression (and similar constant
coefficients in other objective functions to be derived later) is important in
Bayesian estimation, when $\log p_0(\theta)$ is added to $-\Phi(\theta)$ (see Section 4-15).
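The trace identity (4-7-4) and the log likelihood (4-7-6) translate directly
into array operations. In this sketch the residuals are stored as `E[mu, a]`,
so that the moment matrix is `E.T @ E`; the layout, the function names, and
the numbers are assumptions of the example.

```python
import numpy as np

def moment_matrix(E):
    # Eq. (4-7-5): M = sum_mu e_mu e_mu^T, with row mu of E holding e_mu^T
    return E.T @ E

def log_likelihood(E, V):
    # Eq. (4-7-6)
    n, m = E.shape
    _, logdet_V = np.linalg.slogdet(V)
    trace_term = np.trace(np.linalg.solve(V, moment_matrix(E)))
    return -0.5 * n * m * np.log(2.0 * np.pi) - 0.5 * n * logdet_V - 0.5 * trace_term

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 2))                # illustrative residuals
V = np.array([[2.0, 0.5], [0.5, 1.0]])      # assumed known covariance

# Left- and right-hand sides of Eq. (4-7-4)
quadratic_sum = sum(e @ np.linalg.solve(V, e) for e in E)
trace_form = np.trace(np.linalg.solve(V, moment_matrix(E)))
```

The practical advantage of the trace form is that the residuals enter only
through the m-by-m matrix M, whatever the number of experiments.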
A further specialization of Eq. (4) is obtained when one adds the following
assumption:
(c) All errors are independent, i.e., V is a diagonal matrix
$$V_{ab} = \delta_{ab} v_a, \qquad v_a \equiv E(\epsilon_{\mu a}^2) \equiv \sigma_a^2 \tag{4-7-8}$$
in which case
$$\log L = -(nm/2) \log 2\pi - (n/2) \sum_{a=1}^{m} \log v_a - \tfrac{1}{2} \sum_{a=1}^{m} v_a^{-1} M_{aa}(\theta) \tag{4-7-9}$$
In the single equation case
$$\log L = -(n/2) \log 2\pi - n \log \sigma - (1/2\sigma^2) M(\theta) \tag{4-7-10}$$
where
$$M(\theta) = \sum_{\mu=1}^{n} e_\mu^2(\theta) \tag{4-7-11}$$
Whether or not $\sigma$ is known, log L is maximized relative to $\theta$ by minimizing
$M(\theta)$. Maximum likelihood here is equivalent to unweighted least squares.
4-8. Unknown Diagonal Covariance
We shall treat Eq. (4-7-9) first, and then generalize to Eq. (4-7-6). Assuming
then that the $v_a$ are unknown, we seek those values of $\theta$ and the $v_a$ that maximize
Eq. (4-7-9). We proceed by the method of stagewise maximization (Koopmans
and Hood, 1953). This consists of finding, for any value of $\theta$, the values of
the $v_a$ that maximize log L. These will be some functions of $\theta$, say $\tilde{v}_a(\theta)$.
Substitution of $\tilde{v}_a(\theta)$ for $v_a$ in Eq. (4-7-9) reduces log L to a function $\tilde{L}$ of
$\theta$ alone, and we seek $\theta^*$ so as to maximize $\tilde{L}(\theta)$. The first step, then, is to
differentiate Eq. (4-7-9) with respect to each $v_a$, and equate the derivatives
to zero
$$\frac{\partial \log L}{\partial v_a} = -\frac{n}{2 v_a} + \frac{1}{2 v_a^2} \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) = 0 \qquad (a = 1, 2, \ldots, m) \tag{4-8-1}$$
This equation has the unique finite solution
$$\tilde{v}_a(\theta) = (1/n) \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-8-2}$$
Substituting Eq. (2) in Eq. (4-7-9) one obtains
$$\tilde{L}(\theta) = \log L(\theta, \tilde{v}_a(\theta)) = -(nm/2) \log 2\pi - (n/2) \sum_{a=1}^{m} \log \left[ (1/n) \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \right] - \tfrac{1}{2} \sum_{a=1}^{m} \left[ \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \Big/ (1/n) \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \right]$$
which can be reduced to
$$\tilde{L}(\theta) = (nm/2)(\log (n/2\pi) - 1) - (n/2) \sum_{a=1}^{m} \log \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-8-3}$$
Maximizing Eq. (3) is clearly equivalent to minimizing
$$\Phi(\theta) \equiv (n/2) \sum_{a=1}^{m} \log M_{aa}(\theta) \tag{4-8-4}$$
where $M(\theta)$ is the moment matrix of the residuals defined by Eq. (4-7-5). We
refer to $\tilde{L}(\theta)$ as the concentrated likelihood function.
To solve our estimation problem, we proceed as follows:
1. Find $\theta^*$ which maximizes $\tilde{L}(\theta)$, or minimizes $\Phi(\theta)$.
2. Estimate $v_a^* = \tilde{v}_a(\theta^*)$, using Eq. (4-8-2). This estimate for $v_a$ is biased
but consistent. The bias may be eliminated approximately (exactly for certain
linear models) by replacing $v_a^*$ with $n v_a^*/(n - l/m)$ (see Section 7-13 for further
details).
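The inner stage of the stagewise maximization can be verified numerically for
a fixed set of residuals. The sketch below checks that the variances of
Eq. (4-8-2) maximize Eq. (4-7-9) and that the maximum equals the concentrated
likelihood (4-8-3). The residual array `E[mu, a]`, the function names, and
the scale factors are assumptions of the example; the outer search over
$\theta$ is not shown.

```python
import numpy as np

def log_likelihood_diag(E, v):
    # Eq. (4-7-9) for residuals E[mu, a] and variances v[a]
    n, m = E.shape
    M_diag = np.sum(E**2, axis=0)                 # diagonal elements M_aa
    return (-0.5 * n * m * np.log(2.0 * np.pi)
            - 0.5 * n * np.sum(np.log(v))
            - 0.5 * np.sum(M_diag / v))

def variance_mle(E):
    # Eq. (4-8-2): v_a = (1/n) sum_mu e_{mu a}^2
    return np.mean(E**2, axis=0)

def concentrated_log_likelihood(E):
    # Eq. (4-8-3)
    n, m = E.shape
    M_diag = np.sum(E**2, axis=0)
    return (0.5 * n * m * (np.log(n / (2.0 * np.pi)) - 1.0)
            - 0.5 * n * np.sum(np.log(M_diag)))

rng = np.random.default_rng(0)
# Three variables measured on very different scales (illustrative)
E = rng.normal(size=(12, 3)) * np.array([0.1, 1.0, 10.0])

v_tilde = variance_mle(E)
L_at_v_tilde = log_likelihood_diag(E, v_tilde)
L_concentrated = concentrated_log_likelihood(E)
```

Because only the logarithms of the $M_{aa}$ enter, rescaling one variable adds
a constant to the objective and leaves the minimizing $\theta$ unchanged.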
4-9. Unknown General Covariance
The results for the case of a nondiagonal unknown covariance matrix
(Eq. 4-7-6) are similar to the ones obtained in the preceding section, but
require some additional matrix calculus. Let $f(V)$ be some scalar function of
a nonsingular matrix V, and let $\partial f/\partial V$ denote the matrix of partial derivatives
of f with respect to the elements of V, i.e.,
$$(\partial f/\partial V)_{ab} \equiv \partial f/\partial V_{ab} \tag{4-9-1}$$
Then the following formulas hold (see Appendix A-2)
$$\partial \log \det V/\partial V = (V^T)^{-1} \tag{4-9-2}$$
$$\partial [\mathrm{Tr}(V^{-1} M)]/\partial V = -(V^T)^{-1} M (V^T)^{-1} \tag{4-9-3}$$
Applying these formulas to Eq. (4-7-6) and remembering that V is
symmetric, i.e., $V^T = V$, we obtain
$$\partial \log L/\partial V = -(n/2) V^{-1} + \tfrac{1}{2} V^{-1} M(\theta) V^{-1} = 0 \tag{4-9-4}$$
We may rewrite Eq. (4) as
$$V^{-1} = (1/n) V^{-1} M(\theta) V^{-1} \tag{4-9-5}$$
Premultiplying and postmultiplying Eq. (5) by V, we obtain
$$\tilde{V}(\theta) = (1/n) M(\theta) \tag{4-9-6}$$
We now have
$$\mathrm{Tr}[\tilde{V}^{-1}(\theta) M(\theta)] = n\, \mathrm{Tr}\, I_m = nm \tag{4-9-7}$$
whence, by substituting Eq. (6) in Eq. (4-7-6), we are led, after simplification, to
$$\tilde{L}(\theta) = (nm/2)(\log (n/2\pi) - 1) - (n/2) \log \det M(\theta) \tag{4-9-8}$$
Maximizing this is equivalent to minimizing
$$\Phi(\theta) \equiv (n/2) \log \det M(\theta) \tag{4-9-9}$$
The two steps are:
1. Find $\theta^*$ to maximize $\tilde{L}(\theta)$ or minimize $\Phi(\theta)$.
2. Estimate $V^* = \tilde{V}(\theta^*)$ from Eq. (6). Here, too, the estimate is biased.
See Section 7-13 for possible bias removal.
If the off-diagonal elements of $M(\theta)$ are neglected, we have $\det M =
M_{11} M_{22} \cdots M_{mm}$, and $\log \det M = \sum_{a=1}^{m} \log M_{aa}$. In that case, Eq. (9)
reduces to Eq. (4-8-4).
The cases dealt with in this and the preceding sections may be regarded
as the solving of weighted least squares problems with unknown weights.
Formulas (4-8-4) and (9) give maximum likelihood estimates in the case
of a normal distribution. One is tempted, however, to recommend their
use even where the form of the distribution is unknown, provided assumptions
(a) and (b) in Section 4-7 are valid (see Section 4-18). The use of Eq. (9) is
illustrated by means of practical problems in Sections 5-23 and 9-7.
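The steps leading from Eq. (4-7-6) to Eq. (4-9-8) can be checked for a fixed
set of residuals, as in the following sketch. The residual array `E[mu, a]`,
the mixing matrix used to correlate the synthetic residuals, and the function
names are assumptions of the example.

```python
import numpy as np

def log_likelihood_full(E, V):
    # Eq. (4-7-6)
    n, m = E.shape
    _, logdet_V = np.linalg.slogdet(V)
    M = E.T @ E
    return (-0.5 * n * m * np.log(2.0 * np.pi) - 0.5 * n * logdet_V
            - 0.5 * np.trace(np.linalg.solve(V, M)))

def covariance_mle(E):
    # Eq. (4-9-6): V = M / n
    return (E.T @ E) / E.shape[0]

def concentrated_log_likelihood(E):
    # Eq. (4-9-8)
    n, m = E.shape
    _, logdet_M = np.linalg.slogdet(E.T @ E)
    return 0.5 * n * m * (np.log(n / (2.0 * np.pi)) - 1.0) - 0.5 * n * logdet_M

rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.0, 0.0], [0.6, 1.0, 0.0], [0.2, -0.4, 1.0]])
E = rng.normal(size=(15, 3)) @ mixing.T       # correlated synthetic residuals
n, m = E.shape

V_tilde = covariance_mle(E)
trace_value = np.trace(np.linalg.solve(V_tilde, E.T @ E))   # Eq. (4-9-7): equals n*m
L_at_V_tilde = log_likelihood_full(E, V_tilde)
```

The determinant criterion (4-9-9) is the quantity one would pass to a
minimization routine, with E recomputed from $\theta$ at every trial point.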
4-10. Independent Variables Subject to Error
Suppose our model is in reduced form, but the independent variables are
also subject to error. It will be recalled that in this case the model equations
take the form
$$\hat{y}_\mu = f(\hat{x}_\mu, \theta) \tag{4-10-1}$$
We now have residuals of two kinds:
$$e_{x\mu}(\hat{x}_\mu) \equiv x_\mu - \hat{x}_\mu, \qquad e_{y\mu}(\theta, \hat{x}_\mu) \equiv y_\mu - f(\hat{x}_\mu, \theta) \tag{4-10-2}$$
We adjoin the s-dimensional $e_{x\mu}$ and m-dimensional $e_{y\mu}$ into a single
(s + m)-dimensional vector $e_\mu$ of residuals
$$e_\mu(\theta, \hat{x}_\mu) \equiv \begin{pmatrix} e_{x\mu} \\ e_{y\mu} \end{pmatrix} \tag{4-10-3}$$
;...
f:
1
Jr".
;I- .
{', '
If the $e_\mu$ are assumed normally distributed with zero means and covariance
matrices V, the likelihood function is given by
$$\log L(\theta, \hat{X}, V) = -[(s + m)n/2] \log 2\pi - (n/2) \log \det V - \tfrac{1}{2} \sum_{\mu=1}^{n} e_\mu^T(\theta, \hat{x}_\mu)\, V^{-1} e_\mu(\theta, \hat{x}_\mu) \tag{4-10-4}$$
If V is known in its entirety, the function to be minimized is
$$\Phi(\theta, \hat{X}) \equiv \tfrac{1}{2} \sum_{\mu=1}^{n} e_\mu^T(\theta, \hat{x}_\mu)\, V^{-1} e_\mu(\theta, \hat{x}_\mu) \tag{4-10-5}$$
Unfortunately, when V is entirely unknown, it cannot be estimated by the
method of maximum likelihood. To see why this is so, we partition the matrix
V as follows:
$$V = \begin{bmatrix} V_{xx} & V_{xy} \\ V_{xy}^T & V_{yy} \end{bmatrix} \tag{4-10-6}$$
where
$$V_{xx} \equiv E(e_{x\mu} e_{x\mu}^T), \qquad V_{xy} \equiv E(e_{x\mu} e_{y\mu}^T), \qquad V_{yy} \equiv E(e_{y\mu} e_{y\mu}^T) \tag{4-10-7}$$
Let us set $V_{xy} = 0$, and $V_{xx} = \varepsilon I$, where $\varepsilon$ is a very small positive number,
and let us set $\hat{x}_\mu = x_\mu$ ($\mu = 1, 2, \ldots, n$), i.e., $e_{x\mu} = 0$. Then Eq. (4) is
reduced to
$$\log L = -[(s + m)n/2] \log 2\pi - (n/2) \log \det V_{yy} - (ns/2) \log \varepsilon - \tfrac{1}{2} \sum_{\mu=1}^{n} e_{y\mu}^T V_{yy}^{-1} e_{y\mu} \tag{4-10-8}$$
Because of the term $-(ns/2) \log \varepsilon$, the quantity log L may be made arbitrarily
large by choosing $\varepsilon$ small enough. Thus, the likelihood function does not
possess a maximum.
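The divergence is easy to observe numerically. The sketch below evaluates
Eq. (4-10-8) for a fixed set of y-residuals and a decreasing sequence of
$\varepsilon$; the residuals, the matrix $V_{yy}$, the dimensions, and the function
name are assumptions of the example.

```python
import numpy as np

def log_likelihood_degenerate(E_y, V_yy, s, eps):
    # Eq. (4-10-8): V_xy = 0, V_xx = eps * I, and all x-residuals set to zero
    n, m = E_y.shape
    _, logdet_Vyy = np.linalg.slogdet(V_yy)
    # sum_mu e_ymu^T V_yy^-1 e_ymu, with row mu of E_y holding e_ymu^T
    quadratic = np.sum(E_y * np.linalg.solve(V_yy, E_y.T).T)
    return (-0.5 * (s + m) * n * np.log(2.0 * np.pi)
            - 0.5 * n * logdet_Vyy
            - 0.5 * n * s * np.log(eps)
            - 0.5 * quadratic)

rng = np.random.default_rng(0)
E_y = rng.normal(size=(8, 2))      # fixed y-residuals (illustrative)
V_yy = np.eye(2)
s = 1                              # one independent variable

eps_values = np.array([1e-2, 1e-4, 1e-8, 1e-16])
values = np.array([log_likelihood_degenerate(E_y, V_yy, s, eps) for eps in eps_values])
```

Each reduction of $\varepsilon$ raises log L by $(ns/2)$ times the decrease in
$\log \varepsilon$, without bound.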
The above difficulty disappears when $V_{xx}$ is known, say
$$V_{xx} = P \tag{4-10-9}$$
where P is a known positive definite matrix. If, in addition, we assume that
the x and y errors are mutually uncorrelated, then the nonconstant part of
Eq. (4) reduces to
$$\log L(\theta, \hat{X}, V_{yy}) = -(n/2) \log \det V_{yy} - \tfrac{1}{2} \sum_{\mu=1}^{n} \left( e_{x\mu}^T P^{-1} e_{x\mu} + e_{y\mu}^T V_{yy}^{-1} e_{y\mu} \right) \tag{4-10-10}$$
One verifies easily that the MLE for $V_{yy}$ is
$$\tilde{V}_{yy} = (1/n) M_{yy} = (1/n) \sum_{\mu=1}^{n} e_{y\mu} e_{y\mu}^T \tag{4-10-11}$$
so that the concentrated objective function to be minimized is
$$\Phi(\theta, \hat{X}) = (n/2) \log \det M_{yy} + \tfrac{1}{2} \sum_{\mu=1}^{n} e_{x\mu}^T P^{-1} e_{x\mu} \tag{4-10-12}$$
The bias of the estimate Eq. (11) for $V_{yy}$ can be considerable, and this estimate
is not even consistent. A suitable correction factor is derived in Section 7-14.
Computationally it is often best to treat the problems discussed in this
section as constrained minimization problems. That is, $f(\hat{x}_\mu, \theta)$ is not
substituted for $\hat{y}_\mu$ in the expression for $\Phi$. The $\hat{y}_\mu$ are retained as explicit
unknowns, and Eq. (1) is treated as a set of equality constraints. In this form,
the problem is amenable to solution by the method of Sections 4-11 and
6-6 through 6-8.
4-11. Exact Structural Models
Recall that the model takes the form
$$g(u_\mu, \hat{w}_\mu, \theta) = 0 \tag{4-11-1}$$
The $u_\mu$ have been measured precisely; the $w_\mu$ are subject to measurement
errors. The residuals are defined by Eq. (4-1-4). The likelihood function is
given by Eq. (2-13-2). If the errors in each experiment are independently
distributed as $N_r(0, V)$, where r is the dimension of $w_\mu$, then the likelihood takes
the form
$$\log L(\hat{W}, V) = -(n/2) \log \det V - \tfrac{1}{2} \sum_{\mu=1}^{n} e_\mu^T V^{-1} e_\mu \tag{4-11-2}$$
(constant terms have been dropped). The maximum likelihood estimate is
found by determining the values of $\hat{W}$ and $\theta$ which maximize log L while
satisfying the constraints [Eq. (1)]. We introduce an m-dimensional vector of
Lagrange multipliers $\lambda_\mu$ for each experiment to form the Lagrangian function
$$\Lambda(\hat{W}, \theta, V, \lambda_1, \ldots, \lambda_n) \equiv \log L + \sum_{\mu=1}^{n} \lambda_\mu^T g(u_\mu, \hat{w}_\mu, \theta) \tag{4-11-3}$$
The solution to the estimation problem will be found at a stationary point
of $\Lambda$. Numerical methods for finding the solution are described in Sections
6-6 through 6-8, and an example is worked out in detail in Section 6-13.
4-12. Data Requirements
In Sections 4-8 and 4-9, we saw that when unknown, the elements of the
covariance matrix (or, equivalently, the weights for weighted least squares)
could be estimated along with the model parameters. In the case of
independent observations we found that we must minimize
$$\Phi_1(\theta) = \sum_{a=1}^{m} \sum_{\mu=1}^{n} v_a^{-1} e_{\mu a}^2(\theta) \tag{4-12-1}$$
when the $v_a$ are known, and
$$\Phi_2(\theta) = \sum_{a=1}^{m} \log \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-12-2}$$
when the $v_a$ are not known. Clearly, $\Phi_1(\theta) \ge 0$ for all $\theta$, and the equality holds
if and only if
$$y_{\mu a} = f_a(x_\mu, \theta) \qquad \text{or} \qquad g_a(z_\mu, \theta) = 0 \tag{4-12-3}$$
for all $\mu$ and a. Thus, we have a total of nm equations to be satisfied, and
meaningful estimation can occur as soon as nm at least equals l, the number
of parameters to be estimated. On the other hand, suppose we can find values
of $\theta$ which satisfy Eq. (3) exactly just for one specific value of a. This could
occur if $l_a \ge n$, where $l_a$ is the number of parameters appearing in the ath
equation. But in this case the ath term in Eq. (2) is $-\infty$. For meaningful
estimation, then, we must have $n > \max_a (l_a)$. In particular, if all l parameters
appear in every equation, we must have n > l. The situation where V is not
assumed to be diagonal is similar, but we have an additional restriction when
Eq. (4-9-9) is used. The m x m matrix $M(\theta)$ is the sum of n matrices $e_\mu e_\mu^T$,
each of rank one. Hence, the rank of M cannot exceed n, and for M to be
nonsingular it is necessary that $n \ge m$. If M is singular, its determinant
vanishes and Eq. (4-9-9) is meaningless.
To summarize, the number n of required experiments must satisfy:
1. $n > \max_a (l_a)$ if V is unknown. Also, $n \ge m$ if V is not known to
be diagonal.
2. $n \ge l/m$ if V is known.†
More observations are usually required when V is unknown than when it
is known. This is not surprising.
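The rank restriction on the moment matrix can be seen in a small experiment.
In the sketch below there are m = 3 measured variables; with only two
experiments M is singular, while with five it is (for generic residuals)
nonsingular. The random residuals and the names are assumptions of the
example.

```python
import numpy as np

def moment_matrix(E):
    # M = sum_mu e_mu e_mu^T for residuals stored as E[mu, a]
    return E.T @ E

rng = np.random.default_rng(1)
m = 3
E_few = rng.normal(size=(2, m))       # n = 2 < m experiments
E_enough = rng.normal(size=(5, m))    # n = 5 >= m experiments

rank_few = np.linalg.matrix_rank(moment_matrix(E_few))
rank_enough = np.linalg.matrix_rank(moment_matrix(E_enough))
det_few = np.linalg.det(moment_matrix(E_few))
```

With `det_few` equal to zero up to rounding, `log det M` has no finite
value, which is the sense in which Eq. (4-9-9) becomes meaningless.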
4-13. Some Other Distributions
Perhaps the greatest virtue of the maximum likelihood method is its
straightforward applicability to the formulation of a wide variety of
estimation problems. Given a distribution for the errors, it is an easy matter to
write down the expression for the likelihood function. When this function is
continuous and smooth, its maximum can be found by means of some of the
gradient methods of Chapter V that are applicable to the normal distribution
problems. The situation is entirely different, however, in the case of a
discontinuous distribution, such as the following: Suppose our measurement
errors all follow uniform distributions. Let the range of $\epsilon_{\mu a}$ be $\pm r_{\mu a}$. Any
value $\theta$ for which $|e_{\mu a}(\theta)| > r_{\mu a}$ for even one $\mu, a$ has likelihood zero. All
values of $\theta$ for which $|e_{\mu a}| \le r_{\mu a}$ for all $\mu, a$ possess the same positive
likelihood, and are all equally acceptable as maximum likelihood estimates. It may
easily happen that no such values exist. The best procedure is to find the
value of $\theta$ for which
$$\Phi(\theta) \equiv \max_{\mu, a} |e_{\mu a}(\theta)/r_{\mu a}| \tag{4-13-1}$$
attains its minimum value. If this minimum value turns out to be no greater
than unity, we have found a maximum likelihood estimate. Otherwise, we
know that no such estimate exists. What we have found, then, is a minimax
weighted deviation estimate, as described in Section 4-17. When the range of
$\epsilon_{\mu a}$ is not known, but all errors are assumed to have the same range, then
minimizing $\Phi(\theta) \equiv \max |e_{\mu a}(\theta)|$ gives a maximum likelihood estimate, and
the minimum value of $\Phi$ is an estimated value of all $r_{\mu a}$.
If the errors have the two-sided exponential distribution
$$p(\epsilon_{\mu a}) = c_{\mu a} \exp(-k_{\mu a} |\epsilon_{\mu a}|) \tag{4-13-2}$$
then, provided the $k_{\mu a}$ are known, the maximum likelihood estimate calls for
minimizing the weighted sum of absolute values of the residuals
$$\Phi(\theta) \equiv \sum_{\mu, a} k_{\mu a} |e_{\mu a}(\theta)| \tag{4-13-3}$$
† In some exceptional circumstances, a smaller number of experiments may suffice.
One verifies easily that the constants $c_{\mu a}$ and $k_{\mu a}$ are related as follows to
the standard deviation $\sigma_{\mu a}$:
$$c_{\mu a} = 1/(\sqrt{2}\, \sigma_{\mu a}), \qquad k_{\mu a} = \sqrt{2}/\sigma_{\mu a} \tag{4-13-4}$$
If we assume $\sigma_{\mu a} = \sigma_a$ for all $\mu$, and if all errors are independent, the log
likelihood is
$$\log L = -(nm/2) \log 2 - n \sum_{a=1}^{m} \log \sigma_a - \sqrt{2} \sum_{a=1}^{m} (1/\sigma_a) \sum_{\mu=1}^{n} |e_{\mu a}(\theta)| \tag{4-13-5}$$
Differentiating with respect to $\sigma_a$, equating to zero, and solving for $\sigma_a$ gives
the maximum likelihood estimate
$$\tilde{\sigma}_a = (\sqrt{2}/n) \sum_{\mu=1}^{n} |e_{\mu a}(\theta)| \tag{4-13-6}$$
Substituting back in Eq. (5) we eventually find that to estimate $\theta$ when the
$\sigma_a$ are unknown, we must minimize
$$\Phi(\theta) \equiv n \sum_{a=1}^{m} \log \sum_{\mu=1}^{n} |e_{\mu a}(\theta)| \tag{4-13-7}$$
The objective functions of Eqs. (3) and (7) may be brought into the
realm of conventional mathematical programming problems by means of the
following device: We define new variables $e_{\mu a}^{+}$ and $e_{\mu a}^{-}$ satisfying
$$e_{\mu a}(\theta) = e_{\mu a}^{+} - e_{\mu a}^{-}, \qquad e_{\mu a}^{+} \ge 0, \qquad e_{\mu a}^{-} \ge 0 \tag{4-13-8}$$
Equation (3) is replaced with
$$\Phi(\theta, e_{\mu a}^{+}, e_{\mu a}^{-}) = \sum_{\mu, a} k_{\mu a} (e_{\mu a}^{+} + e_{\mu a}^{-}) \tag{4-13-9}$$
Clearly, if $e_{\mu a}(\theta)$ is positive then $e_{\mu a}(\theta) = e_{\mu a}^{+}$ and $e_{\mu a}^{-} = 0$, and vice versa. The
theory of mathematical programming leads us to expect that the number of
nonzero variables in the solution will equal mn, i.e., the number of equality
constraints in Eq. (8). Among these will be the l parameters $\theta$, leaving only
mn - l nonzero residuals. This means that the fitted equations will pass exactly
through at least l of the observed data points. Hence this estimation method is
relatively insensitive to the presence of a few observations with very large
errors; these are simply ignored.
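Both properties, exact passage through data points and insensitivity to gross
errors, show up already in the simplest case of a constant model with one
parameter (l = 1), for which the sum of absolute residuals is minimized by the
sample median. The data and names below are assumptions of the example; no
linear programming routine is needed for this special case.

```python
import numpy as np

def sum_abs_residuals(theta, y):
    # Eq. (4-13-3) with all k = 1 and the constant model f = theta
    return float(np.sum(np.abs(y - theta)))

# Five observations of one quantity; the last is grossly in error
y = np.array([1.9, 2.0, 2.1, 2.2, 50.0])

theta_abs = float(np.median(y))   # minimizes the sum of absolute residuals
theta_ls = float(np.mean(y))      # unweighted least squares estimate, for comparison

# Brute-force confirmation on a grid of trial values
grid = np.linspace(0.0, 60.0, 6001)
grid_values = np.array([sum_abs_residuals(t, y) for t in grid])
```

The absolute-value estimate coincides with one of the observations (2.1) and
is untouched by the outlier, whereas the least squares estimate is pulled to
11.64.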
The mathematical programming formulations of the problems given by Eqs.
(1) and (3) are discussed by Kelley (1958) and Wagner (1959), who deal
specifically with the linear programming problems which arise when the
model equations are linear in the parameters.
C. Bayesian Estimation
4-14. Definition
In the estimation methods discussed so far we have made no use of the
prior information, which in Sections 2-16 through 2-19 we treated as an integral
part of the problem. As we have seen, the posterior pdf $p^*(\phi)$ is given (Eq.
2-19-1) by Bayes' theorem as
$$p^*(\phi) = c\, L(\phi)\, p_0(\phi) \tag{4-14-1}$$
where $L(\phi)$ is the likelihood function, and $p_0(\phi)$ the prior pdf, which
summarizes the prior information. Estimates which make use of the prior
information are usually based on the posterior distribution, and are therefore
known as Bayesian estimates.
If $p^*(\phi)$ is to be a pdf, we must have $\int p^*(\phi)\, d\phi = 1$, and hence c must
be 1/J, where
$$J \equiv \int L(\phi)\, p_0(\phi)\, d\phi \tag{4-14-2}$$
We refer to the function $p^*(\phi)$ as a proper or improper posterior distribution
if the integral J does or does not exist, respectively. In the latter case, we let
c = 1. The following are sufficient but by no means necessary conditions for
the existence of J:
1. $L(\phi)$ is bounded and $p_0(\phi)$ is normal.
2. $L(\phi)$ and $p_0(\phi)$ are bounded, and $p_0(\phi)$ vanishes everywhere except in
a bounded region of $\phi$ space.
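For a single parameter the normalization of Eq. (4-14-1) can be carried out on
a grid. The sketch below uses a normal likelihood for one hypothetical
observation and a prior that is uniform on a bounded interval and zero
elsewhere, so that condition 2 applies; the numbers, the grid, and the
function names are all assumptions of the example.

```python
import numpy as np

def trapezoid(values, grid):
    # Trapezoidal rule, written out to avoid depending on a particular NumPy version
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def normalized_posterior(phi_grid, likelihood, prior):
    # Eq. (4-14-1) with c = 1/J, J from Eq. (4-14-2)
    unnormalized = likelihood(phi_grid) * prior(phi_grid)
    J = trapezoid(unnormalized, phi_grid)
    return unnormalized / J, J

def likelihood(phi):
    # Hypothetical: one observation y = 1.2 with normal error, sigma = 0.5
    return np.exp(-0.5 * ((1.2 - phi) / 0.5) ** 2) / (0.5 * np.sqrt(2.0 * np.pi))

def prior(phi):
    # Uniform on [0, 3], zero elsewhere (bounded support)
    return np.where((phi >= 0.0) & (phi <= 3.0), 1.0 / 3.0, 0.0)

phi_grid = np.linspace(-2.0, 5.0, 7001)
posterior, J = normalized_posterior(phi_grid, likelihood, prior)
posterior_mode = float(phi_grid[np.argmax(posterior)])
```

Since the maximum of the likelihood lies inside the support of the prior, the
posterior mode falls at the same point as the maximum likelihood estimate.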
To select an estimate for the parameters $\phi$, we pick some typical values
of the posterior distribution, such as the mean, median, or mode. Such values
are referred to as location parameters of the distribution since they locate the
region in $\phi$ space where most realizations of the random variable occur. Some
of these location parameters exist even if the posterior distribution is
improper, while others may not exist even for proper distributions. Among
those parameters which exist for a given problem, the choice is somewhat
arbitrary. In the sequel we shall describe two distinct approaches toward
making this choice.
At this point it is well to summarize some of the benefits that accrue from
Bayesian estimation:
1. One is sure to obtain estimates which are physically meaningful. It is
guaranteed that estimates for parameters known to be positive are indeed
positive.
2. The model equations may be degenerate relative to some of the
parameters. For instance, the model for the falling sphere [Eq. (2-14-5)]
contains only two independent combinations of five distinct parameters.
Non-Bayesian methods can be used to estimate at most two of these parameters,
and exact knowledge of the others is required. But if inexact prior
information is available on at least three of the parameters, then the
posterior density, being a nondegenerate function of all five, can be used to
estimate all five.
4-15. Mode of the Posterior Distribution
The natural extension of the maximum likelihood method to Bayesian
estimation problems consists of looking for the mode of the posterior
distribution. That is, we accept as our estimate the value of $\phi$ for which
$p^*(\phi)$ is maximum. This method, to which we refer as MPD (maximum of
posterior distribution), offers the following advantages:
1. The estimate coincides with the maximum likelihood estimate in case
of a uniform prior distribution, since then $p^*(\phi)$ is proportional to
$L(\phi)$. The estimates coincide even if $p_0(\phi)$ is uniform only within a
bounded region and zero elsewhere, provided only the maximum of $L(\phi)$
occurs within this region. The practitioner who accepts the MLE when no prior
information is given would naturally wish his estimates to be affected only
slightly when a slight amount of prior information becomes available. MPD
satisfies this requirement.
2. We know from a theorem by von Mises (1919) that if $p_0(\phi)$ is
continuous and does not vanish at the maximum of $L(\phi)$, then the MPD
converges to the MLE as the number of experiments is increased indefinitely.
The MPD shares the consistency and asymptotic efficiency of the MLE.
3. The MPD can be obtained whether or not $p^*(\phi)$ is a proper
distribution.
4. It is usually much easier to compute the MPD than other Bayesian
estimates.
In computing MPD estimates, we distinguish two cases:
(a) The prior distribution does not vanish anywhere. In this case, we
maximize
$$\Phi(\phi) = \log L(\phi) + \log p_0(\phi) \tag{4-15-1}$$
The same techniques as are used for MLE can be applied here. In particular,
if $L(\phi)$ is one of the normal cases discussed in Section 4-8 and Section
4-9, and if $p_0$ does not depend on the elements of $V$, then those may be
eliminated as before, and the concentrated likelihood may replace the
likelihood in
Eq. (1). Care must be taken, however, to retain any constants multiplying the
concentrated likelihood. For instance, if $L$ is given by Eq. (4-7-6), we may
use Eq. (4-9-8) to replace Eq. (1) by
$$\Phi(\theta) = (n/2)\log\det M(\theta) - \log p_0(\theta) \tag{4-15-2}$$
which is to be minimized. For numerical examples, see Sections 5-22 and
8-7. Note that in the presence of $\log p_0(\theta)$, the factor $n/2$ may not
be dropped. In the case of single equation least squares with unknown
$\sigma$, the term $(n/2)\log\det M$ takes the form
$(n/2)\log\sum_{\mu=1}^{n} e_\mu^2(\theta)$.
(b) The prior distribution vanishes outside the region defined by a set of
constraints
$$h(\phi) \ge 0 \tag{4-15-3}$$
In this case we have a typical nonlinear programming problem: find the
maximum of Eq. (1) subject to all the applicable constraints. Methods of
dealing with this problem are described in Chapter VI.
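Case (a) can be illustrated by a minimal sketch under assumptions of our own: a single parameter, observations distributed normally about it with known standard deviation, and a normal prior. The data values, the prior constants, and the function names `log_posterior` and `golden_max` are hypothetical. The golden-section search merely stands in for whatever maximization technique is preferred, and the result is compared with the closed-form posterior mode available in this simple case.

```python
import math

def log_posterior(theta, ys, sigma, m0, s0):
    """Phi(theta) = log L(theta) + log p0(theta), additive constants dropped."""
    log_like = -sum((y - theta) ** 2 for y in ys) / (2.0 * sigma ** 2)
    log_prior = -((theta - m0) ** 2) / (2.0 * s0 ** 2)
    return log_like + log_prior

def golden_max(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):          # maximum lies in [a, d]
            b, d = d, c
            c = b - g * (b - a)
        else:                    # maximum lies in [c, b]
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

# Hypothetical data and prior constants, chosen only for illustration.
ys = [1.2, 0.8, 1.1, 0.9]
sigma, m0, s0 = 0.5, 0.0, 1.0

mpd = golden_max(lambda t: log_posterior(t, ys, sigma, m0, s0), -10.0, 10.0)
mle = sum(ys) / len(ys)
# Closed-form mode for the normal-normal case, used here as a check.
closed_form = (sum(ys) / sigma ** 2 + m0 / s0 ** 2) / (len(ys) / sigma ** 2 + 1.0 / s0 ** 2)
print(mpd, mle, closed_form)
```

In this sketch the prior pulls the MPD slightly away from the MLE toward the prior mean, and the pull weakens as more observations are added.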
4-16. Minimum Risk Estimates
So far our motive has been to find values of $\theta$ which are most likely
to be close to the true values. Sometimes, however, the estimated value is
required for a specific purpose, e.g., for designing a plant, and we are
interested in finding the value of $\theta$ which is best for this particular
purpose. In many situations, what is "best" is determined by economic
considerations, and the choice of the best estimate can be made by means of
decision theory.
In decision theory, a cost is assigned to any loss suffered because of an
error in the estimate. That is, to the act of using the parameter value
$\phi^*$ when the true value is $\phi$ we assign the cost $c(\phi^*, \phi)$.
Since $\phi$ is unknown, the actual cost $c(\phi^*, \phi)$ cannot be computed.
However, if we are willing to say that $\phi$ is distributed according to the
posterior distribution, then we can compute the risk, defined as the expected
value of the cost of assigning the value $\phi^*$ to $\phi$
$$R(\phi^*) \equiv E\,c(\phi^*, \phi) = \int c(\phi^*, \phi)\,p^*(\phi)\,d\phi \tag{4-16-1}$$
The minimum risk estimate (MRE) is defined as the value of $\phi^*$ which
minimizes $R(\phi^*)$. Here $p^*(\phi)$ must be a proper pdf. The following is
a simple example:
A manufacturer conducts experiments to measure the tensile strength $\theta$
of an alloy. He intends to use the alloy to manufacture a component whose
size, and hence cost, will be inversely proportional to $\theta$. Let
$\theta^*$ be the estimate
to be used for $\theta$. Then the cost of the component will be $a/\theta^*$
dollars (any additional fixed cost is irrelevant to the present discussion).
The component will fail if the true value $\theta$ is less than $\theta^*$. If
the component does fail, however, the manufacturer will have to pay a fine of
$K$ dollars. His total cost will be
$$c(\theta^*, \theta) = \begin{cases} a/\theta^* & (\theta^* \le \theta) \\ K + a/\theta^* & (\theta^* > \theta) \end{cases} \tag{4-16-2}$$
Assuming that the posterior density $p^*(\theta)$ summarizes all available
information on $\theta$, the risk or expected cost is
$$R(\theta^*) = \int_{-\infty}^{\infty} c(\theta^*, \theta)\,p^*(\theta)\,d\theta = a/\theta^* + K \int_{-\infty}^{\theta^*} p^*(\theta)\,d\theta \tag{4-16-3}$$
To find the minimum risk estimate, we differentiate
$$dR/d\theta^* = -a/(\theta^*)^2 + K\,p^*(\theta^*) = 0 \tag{4-16-4}$$
Hence, one should use the value $\theta^*$ which satisfies the equation
$$(\theta^*)^2\,p^*(\theta^*) = a/K \tag{4-16-5}$$
In a sense, the MRE is not really an estimate. The value of $\theta^*$ which
satisfies Eq. (5) cannot be considered the most likely to be true; it is merely
the value which in the given economic situation involves the least risk.
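The tensile strength example can be worked numerically under assumptions of our own: a normal posterior density with a hypothetical mean and standard deviation, and hypothetical values for $a$ and $K$. The sketch below solves Eq. (4-16-4) by bisection on a bracket chosen by inspection, where the slope of the risk changes from negative to positive. Equation (5) may have other roots that are not minima, so the bracket matters.

```python
import math

def normal_pdf(t, mean, sd):
    z = (t - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(t, mean, sd):
    return 0.5 * (1.0 + math.erf((t - mean) / (sd * math.sqrt(2.0))))

def risk(t, a, K, mean, sd):
    """Eq. (4-16-3): R = a/theta* + K * Pr(theta < theta*)."""
    return a / t + K * normal_cdf(t, mean, sd)

def risk_slope(t, a, K, mean, sd):
    """Eq. (4-16-4): dR/dtheta* = -a/theta*^2 + K p*(theta*)."""
    return -a / t ** 2 + K * normal_pdf(t, mean, sd)

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical numbers: posterior N(100, 5^2), cost constant a, fine K.
a, K, mean, sd = 1000.0, 1000.0, 100.0, 5.0
theta_star = bisect(lambda t: risk_slope(t, a, K, mean, sd), 80.0, 100.0)
print(theta_star, risk(theta_star, a, K, mean, sd))
```

With these numbers the minimum risk value lies well below the posterior mean, since the fine for failure outweighs the saving from a smaller component.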
Attempts to use decision-theory-like methods in pure (i.e., economics-free)
estimation problems usually start with the assumption of quadratic cost
functions taking the form
$$c(\phi^*, \phi) = (\phi^* - \phi)^T P\,(\phi^* - \phi) \tag{4-16-6}$$
where $P$ is a given positive definite weighting matrix. This essentially
defines the cost as a weighted sum of squares of the estimation errors.
Substituting Eq. (6) in Eq. (1) yields
$$R(\phi^*) = \int (\phi^* - \phi)^T P\,(\phi^* - \phi)\,p^*(\phi)\,d\phi \tag{4-16-7}$$
and
$$\partial R/\partial\phi^* = 2P \int (\phi^* - \phi)\,p^*(\phi)\,d\phi \tag{4-16-8}$$
Where $R(\phi^*)$ attains its minimum, $\partial R/\partial\phi^*$ vanishes,
whence (assuming $P$ nonsingular)
$$\int \phi^*\,p^*(\phi)\,d\phi = \int \phi\,p^*(\phi)\,d\phi \tag{4-16-9}$$
Since $\phi^*$ is a constant and $\int p^*(\phi)\,d\phi = 1$, Eq. (9) reduces to
$$\phi^* = \int \phi\,p^*(\phi)\,d\phi \tag{4-16-10}$$
We conclude, then, that the MRE for a quadratic cost function is the mean
of the posterior distribution. More explicitly, Eq. (10) can be written as
$$\phi^* = \int \phi\,L(\phi)\,p_0(\phi)\,d\phi \Big/ \int L(\phi)\,p_0(\phi)\,d\phi \tag{4-16-11}$$
Fortunately, the estimate Eq. (11) does not depend on the weights $P$. Hence,
one need not worry about what values should be assigned to them.
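Equation (4-16-11) and its independence of the weights can be checked numerically for a single parameter. The sketch below assumes a hypothetical, deliberately skewed unnormalized posterior $\theta^2 e^{-\theta}$ on $\theta \ge 0$ (mode 2, mean 3), truncated at 40 for the integration. The trapezoidal rule and all names are our own choices, not a recommended integration method.

```python
import math

def trapezoid(f, lo, hi, n=20000):
    """Composite trapezoidal rule on [lo, hi] with n panels."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h

def unnorm_posterior(t):
    """Hypothetical L(theta) * p0(theta): gamma-shaped, mode 2, mean 3."""
    return t * t * math.exp(-t)

LO, HI = 0.0, 40.0   # the tail beyond 40 is negligible for this density
J = trapezoid(unnorm_posterior, LO, HI)                       # Eq. (4-14-2)
post_mean = trapezoid(lambda t: t * unnorm_posterior(t), LO, HI) / J

def quad_risk(t_star, weight):
    """Eq. (4-16-7) for one parameter with scalar weight P = weight."""
    return trapezoid(lambda t: weight * (t_star - t) ** 2 * unnorm_posterior(t),
                     LO, HI) / J

for weight in (0.5, 7.0):
    print(weight, quad_risk(post_mean, weight), quad_risk(2.0, weight))
```

For either weight the risk at the posterior mean is smaller than the risk at the mode, as Eq. (10) requires.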
There are many practical disadvantages associated with this MRE:
(a) The estimate does not exist if $p^*(\phi)$ is an improper distribution.
Consider the case where our model takes the form
$$y_\mu = \alpha \exp(-\theta x_\mu) \tag{4-16-12}$$
$\alpha$ being a known constant. Assuming a normal distribution with standard
deviation $\sigma$, we have
$$L(\theta) = (2\pi)^{-n/2}\sigma^{-n} \exp\Big\{-(1/2\sigma^2)\sum_{\mu=1}^{n}[y_\mu - \alpha\exp(-\theta x_\mu)]^2\Big\} \tag{4-16-13}$$
Suppose all $x_\mu$ are positive. As $\theta$ increases beyond bound,
$L(\theta)$ becomes proportional to
$\exp[-(1/2\sigma^2)\sum_{\mu=1}^{n} y_\mu^2] \ne 0$. Hence, if $p_0(\theta)$ is
uniform for all values of $\theta$, the integrals in Eq. (11) diverge. What is
even worse, however, is the fact that if we assume $p_0(\theta) = 0$ outside
the region $0 \le \theta \le A$, the integrals in Eq. (11) exist, but their
ratio tends to infinity with $A$. Thus, the estimate Eq. (11) is not robust
under seemingly unimportant changes in $p_0(\theta)$. After all, the value
chosen for $A$ is arbitrary, and the estimate should not depend strongly on
this choice. The source of the difficulty here is the sensitivity of the MRE
to the tails of the assumed distribution.
(b) Even when the integrals in (11) exist, their evaluation may be
impractical. If $\phi$ is an $l$-dimensional vector, then $l + 1$ integrals
must be evaluated: one in the denominator, and one for each component of
$\phi$ in the numerator. Each one of these integrations must be carried out
over an $l$-dimensional space, each dimension possibly extending from
$-\infty$ to $+\infty$. No satisfactory methods for performing such
integrations (unless $l = 1$) are available. In addition, any reasonable
approach to this integration problem requires finding, as a first step, the
location of the mode of the posterior distribution. Thus, computation of the
MPD is a prerequisite to the computation of the MRE.
(c) The MRE is not invariant under reparametrization, whereas the MPD is.
(d) The MRE does not generally converge to the MLE as $p_0(\phi)$ approaches
the uniform distribution.
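Disadvantage (a) can be seen numerically. The sketch below assumes three hypothetical observations from the model of Eq. (4-16-12) with $\alpha = 1$ and $\sigma = 0.2$, a prior that is uniform on $0 \le \theta \le A$, and a plain trapezoidal sum with a fixed step. It evaluates the posterior mean of Eq. (4-16-11) for several values of $A$.

```python
import math

# Hypothetical data, roughly consistent with theta = 1 and alpha = 1.
XS = [1.0, 2.0, 3.0]
YS = [0.37, 0.14, 0.05]
ALPHA, SIGMA = 1.0, 0.2

def likelihood(theta):
    """Eq. (4-16-13) with the constant factor dropped."""
    ss = sum((y - ALPHA * math.exp(-theta * x)) ** 2 for x, y in zip(XS, YS))
    return math.exp(-ss / (2.0 * SIGMA ** 2))

def posterior_mean(A, step=0.01):
    """Eq. (4-16-11) for a prior uniform on [0, A], by the trapezoidal rule."""
    n = int(round(A / step))
    num = den = 0.0
    for k in range(n + 1):
        t = k * step
        w = 0.5 if k in (0, n) else 1.0
        value = likelihood(t)
        num += w * t * value
        den += w * value
    return num / den

means = {A: posterior_mean(A) for A in (10.0, 100.0, 1000.0)}
for A, m in means.items():
    print(A, m)
```

The likelihood levels off at a nonzero constant for large $\theta$, so the computed mean keeps growing with the arbitrary cutoff $A$ even though the data are unchanged.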
In conclusion, the MRE can be recommended only where called for by
true economic decision making purposes. For a further discussion of minimum
risk estimates, not confined to quadratic cost functions, the reader is
referred to Chapters 2 and 3 of Deutsch (1965) and Chapter 6 of Raiffa and
Schlaifer (1961).
D. Other Methods
4-17. Minimax Deviation
The parameters are determined in such a way as to minimize the maximum
deviation of the model from the data. This is particularly useful for design
purposes (see Section 2-5), or for obtaining maximum likelihood estimates
with uniform error distributions (Section 4-13). Such estimates are sometimes
called Chebyshev estimates.
Let $\zeta$ denote the magnitude of the largest residual. Then the following
conditions are satisfied
$$e_{\mu a}(\theta) \le \zeta, \qquad -e_{\mu a}(\theta) \le \zeta \qquad (\mu = 1, 2, \ldots, n;\ a = 1, 2, \ldots, m) \tag{4-17-1}$$
or, equivalently
$$e_{\mu a}^2(\theta) \le \zeta^2 \qquad (\mu = 1, 2, \ldots, n;\ a = 1, 2, \ldots, m) \tag{4-17-2}$$
Our problem may then be formulated as follows:
Find the values of the parameters $\theta$ and $\zeta$ which minimize the
objective function
$$\Phi(\theta, \zeta) \equiv \zeta \tag{4-17-3}$$
while satisfying Eq. (1) or Eq. (2). This is a classical nonlinear programming
problem.
It is possible to attach different weights to different residuals, i.e., by
replacing Eq. (2) with
$$\alpha_{\mu a}\,e_{\mu a}^2 \le \zeta^2 \qquad (\mu = 1, 2, \ldots, n;\ a = 1, 2, \ldots, m) \tag{4-17-4}$$
with $\alpha_{\mu a}$ given positive numbers.
Numerical procedures for solving this problem are discussed in Section
6-5. If the model equations are linear, the algorithm of Bartels and Golub
(1968) may be used.
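For a single parameter the minimax problem can be solved without a programming code. The sketch below assumes a hypothetical one-parameter linear model $y = \theta x$ and three hypothetical data points. It exploits the fact that the largest absolute residual is then a convex function of $\theta$, and minimizes it by ternary search. This is an illustration only, not the procedure of Section 6-5 nor the Bartels and Golub algorithm.

```python
def max_abs_residual(theta, xs, ys):
    """zeta(theta): magnitude of the largest residual for the model y = theta * x."""
    return max(abs(y - theta * x) for x, y in zip(xs, ys))

def ternary_min(f, lo, hi, tol=1e-10):
    """Minimize a convex function of one variable on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# Hypothetical data.
xs = [1.0, 2.0, 3.0]
ys = [1.1, 1.9, 3.2]

theta_minimax = ternary_min(lambda t: max_abs_residual(t, xs, ys), 0.0, 5.0)
zeta = max_abs_residual(theta_minimax, xs, ys)
print(theta_minimax, zeta)
```

At the solution two residuals of opposite sign attain the bound $\zeta$ simultaneously, which is typical of Chebyshev estimates.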
4-18. Pseudomaximum Likelihood
In this method we employ the maximum likelihood equations derived on
the assumption of a normal distribution, regardless of whether the distribution
is or is not in fact normal. Since in practice we often assume normality even
when we have no basis for doing so, pseudomaximum likelihood is perhaps
the most widely used method. We may regard any use of nonlinear least
squares or weighted least squares as an application of this method. Equations
(4-8-4) and (4-9-9) are the most important extensions of pseudomaximum
likelihood beyond the weighted least squares concept.
4-19. Linearizing Transformations
Consider the model equations $y = f(x, \theta)$. Suppose we were able to
effect a transformation of variables $\tilde{y} = t(y)$ in such a way that the
function $t[f(x, \theta)]$ is linear in $\theta$. Then we apply the method of
multiple linear regression to estimate $\theta$. The advantage gained derives
from the fact that this estimate may be obtained by direct calculations,
whereas nonlinear estimation procedures require complicated iterative schemes.
Understandably, this method was very popular before nonlinear estimation codes
for electronic computers became available.
We illustrate by means of a simple example:
Let
$$y = x_1 \exp(-\theta x_2) \tag{4-19-1}$$
be our model equation, with $\theta$ to be estimated. Letting
$\tilde{y} \equiv \log y$ transforms the equation into
$$\tilde{y} = \log x_1 - \theta x_2 \tag{4-19-2}$$
This is linear in $\theta$, which can be estimated, say, by minimizing
$$\Phi(\theta) \equiv \sum_{\mu=1}^{n} (\log y_\mu - \log x_{\mu 1} + \theta x_{\mu 2})^2$$
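Setting the derivative of this sum of squares to zero gives the direct formula $\theta^* = \sum_\mu x_{\mu 2}(\log x_{\mu 1} - \log y_\mu) / \sum_\mu x_{\mu 2}^2$. The sketch below applies it to hypothetical, error-free data generated from Eq. (4-19-1), so the estimate should reproduce the value used to generate them. The function name and data are assumptions.

```python
import math

def fit_theta_loglinear(x1, x2, y):
    """Least squares on log y = log x1 - theta * x2 (closed form)."""
    num = sum(b * (math.log(a) - math.log(c)) for a, b, c in zip(x1, x2, y))
    den = sum(b * b for b in x2)
    return num / den

# Hypothetical error-free data generated from y = x1 * exp(-theta * x2).
theta_true = 0.7
x1 = [2.0, 3.0, 5.0, 4.0]
x2 = [0.5, 1.0, 1.5, 2.0]
y = [a * math.exp(-theta_true * b) for a, b in zip(x1, x2)]

theta_hat = fit_theta_loglinear(x1, x2, y)
print(theta_hat)
```

No iteration is required, which is the whole attraction of the method.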
The method can be applied equally well when the transformed equations
are linear not in the $l$ parameters $\theta$, but in a set of $l$ independent
functions of them, say $\pi(\theta)$. Then we can obtain linear regression
estimates $\pi^*$ of the $\pi$, and estimates of $\theta$ by solving the
equations $\pi(\theta^*) = \pi^*$. To illustrate by means of a trivial example,
let
$$y = \theta_1 \exp(-\theta_2 x) \tag{4-19-3}$$
with $\theta_1$ and $\theta_2$ to be estimated. As before, let
$\tilde{y} \equiv \log y$ so that
$$\tilde{y} = \log\theta_1 - \theta_2 x \tag{4-19-4}$$
Letting $\pi_1 \equiv \log\theta_1$, $\pi_2 \equiv \theta_2$, we have
$$\tilde{y} = \pi_1 - \pi_2 x \tag{4-19-5}$$
which is linear in $\pi_1$ and $\pi_2$. Estimates for $\theta$ may be obtained
from those of $\pi$ by means of
$$\theta_1^* = \exp(\pi_1^*); \qquad \theta_2^* = \pi_2^*$$
Other examples, arising in chemical reaction kinetics, are
(a)
$$y = x_1 \exp[-\theta_1 x_3 \exp(-\theta_2/x_2)] \tag{4-19-6}$$
which, under
$$\tilde{y} \equiv \log[-\log(y/x_1)], \qquad \pi_1 \equiv \log\theta_1, \qquad \pi_2 \equiv \theta_2$$
transforms into
$$\tilde{y} = \pi_1 - (1/x_2)\,\pi_2 + \log x_3 \tag{4-19-7}$$
and
(b)
$$y = \theta_1 x_1/(1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4)^2 \tag{4-19-8}$$
which under
$$\tilde{y} \equiv (x_1/y)^{1/2}, \quad \pi_1 \equiv 1/\theta_1^{1/2}, \quad \pi_2 \equiv \theta_2/\theta_1^{1/2}, \quad \pi_3 \equiv \theta_3/\theta_1^{1/2}, \quad \pi_4 \equiv \theta_4/\theta_1^{1/2}$$
becomes
$$\tilde{y} = \pi_1 + x_2\pi_2 + x_3\pi_3 + x_4\pi_4 \tag{4-19-9}$$
The main objection that can be raised to this method (other than its
limited applicability) is that the statistical distribution of the errors on
the calculated values of $\tilde{y}_\mu \equiv t(y_\mu)$ is not the same as that
of the errors in $y_\mu$. Therefore it may be appropriate to apply the least
squares criterion to the residuals in $y$ but not in $\tilde{y}$. We can
overcome this problem in part by recognizing that if $V_\mu$ is the covariance
matrix of the errors in $y_\mu$, then (provided these errors are small and the
transformation $t$ and its derivatives are continuous and bounded) the
covariance matrix of $\tilde{y}_\mu$ is approximately
$$\tilde{V}_\mu = (\partial t/\partial y_\mu)\,V_\mu\,(\partial t/\partial y_\mu)^T \tag{4-19-10}$$
Hence, in place of minimizing
$\sum_{\mu=1}^{n} (y_\mu - f_\mu)^T V_\mu^{-1} (y_\mu - f_\mu)$ we should minimize
$\sum_{\mu=1}^{n} [\tilde{y}_\mu - t(f_\mu)]^T \tilde{V}_\mu^{-1} [\tilde{y}_\mu - t(f_\mu)]$.
In the single equation case, a variance $\sigma^2$ of $y_\mu$ is translated,
approximately, into a variance
$\tilde{\sigma}^2 = (\partial\tilde{y}/\partial y_\mu)^2\sigma^2$ of
$\tilde{y}_\mu$.
The transformation $t$ usually introduces a bias; i.e., if the errors in
$y_\mu$ have zero means, those in $\tilde{y}_\mu$ do not. This bias can usually
be neglected under the above assumptions.
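For the logarithmic transformation, $\partial\tilde{y}/\partial y = 1/y$, so a constant variance $\sigma^2$ in $y_\mu$ becomes approximately $\sigma^2/y_\mu^2$ in $\tilde{y}_\mu$, and the transformed residuals should be weighted in proportion to $y_\mu^2$. The sketch below compares unweighted and weighted fits on hypothetical data from Eq. (4-19-1) with $x_1 = 1$, each contaminated by the same small additive error. The data and names are assumptions chosen to make the effect visible.

```python
import math

def fit_theta(x2, y, weights):
    """Weighted least squares on log y = -theta * x2 (x1 = 1 assumed)."""
    num = sum(w * b * (-math.log(c)) for w, b, c in zip(weights, x2, y))
    den = sum(w * b * b for w, b in zip(weights, x2))
    return num / den

# Hypothetical data: true theta = 1, each y carries the same additive error.
theta_true, err = 1.0, 0.01
x2 = [0.5, 1.0, 2.0, 4.0]
y = [math.exp(-theta_true * b) + err for b in x2]

theta_unweighted = fit_theta(x2, y, [1.0] * len(y))
theta_weighted = fit_theta(x2, y, [c * c for c in y])   # weights proportional to y^2
print(theta_unweighted, theta_weighted)
```

The unweighted fit is dominated by the smallest $y_\mu$, whose logarithm is most distorted by the error. The weighted fit discounts that point and lands closer to the generating value.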
4-20. Minimum Chi-Square Method
The minimum chi-square method (Cramer, 1946; Rao, 1957) is used to
put statistical estimation problems (i.e., estimating the parameters in a
probability distribution) in a least squares form. Suppose we have observed
$N$ realizations $x_i$ of a random variable $\xi$, and suppose $\xi$ is assumed
to have a pdf $p(\xi|\theta)$. If we divide the entire range of $\xi$ into $n$
disjoint intervals $[a_j, b_j]$, then the expected number of observations
$x_i$ falling in the $j$th interval is
$$E_j(\theta) \equiv N \Pr(a_j < \xi \le b_j) = N \int_{a_j}^{b_j} p(x|\theta)\,dx$$
Let $N_j$ be the number of $x_i$ actually observed in this interval. The
minimum chi-square method consists in finding the value of $\theta$ which
minimizes
$$\Phi(\theta) \equiv \sum_j [N_j - E_j(\theta)]^2/E_j(\theta) \tag{4-20-1}$$
In the modified minimum chi-square method, $1/N_j$ is used in place of $1/E_j$
as the weighting factor
$$\Phi(\theta) \equiv \sum_j [N_j - E_j(\theta)]^2/N_j \tag{4-20-2}$$
We must choose our intervals in such a way that $N_j \ne 0$ for all $j$. The
modified form Eq. (2) is easier to use, since the denominators are constants.
Both estimates are consistent and asymptotically efficient (if each $N_j$
goes to infinity). These properties are identical to those of the maximum
likelihood method, and the latter may be preferred on account of its greater
simplicity. The minimum chi-square method, however, enables us to examine the
fit of the proposed distribution to the data throughout the range of the
distribution, by comparing the observed occurrences $N_j$ with the predicted
values $E_j(\theta^*)$. When the number of observations is small, the loss of
information due to the grouping of the data may be considerable, and the method
cannot be recommended.
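Both criteria can be tried on a small grouped sample. The sketch below assumes an exponential pdf $p(\xi|\lambda) = \lambda e^{-\lambda\xi}$, four intervals covering the whole positive axis, and hypothetical counts. The minimization is a plain grid search, chosen for transparency and not for efficiency.

```python
import math

# Hypothetical grouped data: interval edges and observed counts N_j.
EDGES = [0.0, 1.0, 2.0, 3.0, math.inf]
COUNTS = [40, 24, 14, 22]
N = sum(COUNTS)

def expected(lam):
    """E_j(lambda) for an exponential distribution with rate lambda."""
    cdf = [1.0 if e == math.inf else 1.0 - math.exp(-lam * e) for e in EDGES]
    return [N * (cdf[j + 1] - cdf[j]) for j in range(len(COUNTS))]

def chi_square(lam, modified=False):
    """Eq. (4-20-1), or Eq. (4-20-2) when modified is True."""
    total = 0.0
    for n_j, e_j in zip(COUNTS, expected(lam)):
        total += (n_j - e_j) ** 2 / (n_j if modified else e_j)
    return total

grid = [0.001 * k for k in range(50, 3001)]        # lambda from 0.05 to 3.0
lam_chi2 = min(grid, key=chi_square)
lam_mod = min(grid, key=lambda lam: chi_square(lam, modified=True))
print(lam_chi2, lam_mod)
print(COUNTS, [round(e, 1) for e in expected(lam_chi2)])
```

The last line prints the observed counts next to the fitted expectations, which is the interval-by-interval comparison of $N_j$ with $E_j(\theta^*)$ mentioned in the text.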
4-21. Problems
1. Work out the statistical assumptions on the distribution of errors
which correspond to the weighting schemes of Eqs. (4-3-3)-(4-3-6).
2. Verify the relation Eq. (4-4-15) for the two parameter case (i.e., V is
2 x 2) with U = I. Explain why the latter assumption entails no loss of
generality.
3. Derive Eqs. (4-4-17)-(4-4-10).
4. Show that if the model equations are linear in the parameters, the
covariance matrix is known, and the distribution of errors is normal, then
the MLE is efficient.
5. Show that if the errors in each experiment are normally and
independently distributed with covariance matrix $V = \tau Q$, where $Q$ is a
known matrix and $\tau$ is an unknown constant, then the MLE can be found by
minimizing the concentrated objective function
$$\Phi(\theta) = (nm/2)\log\mathrm{Tr}[Q^{-1}M(\theta)] \tag{4-21-1}$$
and $\tau$ can be estimated from
$$\hat{\tau}(\theta) = (1/nm)\,\mathrm{Tr}[Q^{-1}M(\theta)] \tag{4-21-2}$$
6. Consider the exact structural model $y_\mu - \theta x_\mu = 0$
$(\mu = 1, 2, \ldots, n)$. The measurement errors of $x_\mu$ and $y_\mu$ are
normal and independent with known variances $\sigma_x^2$ and $\sigma_y^2$,
respectively. Show that the MLE for $\theta$ is
$$\theta^* = \{\alpha S_{yy} - S_{xx} + [(\alpha S_{yy} - S_{xx})^2 + 4\alpha S_{xy}^2]^{1/2}\}/2\alpha S_{xy}$$
where
$$\alpha \equiv \sigma_x^2/\sigma_y^2, \qquad S_{xx} \equiv \sum_{\mu=1}^{n} x_\mu^2, \qquad S_{yy} \equiv \sum_{\mu=1}^{n} y_\mu^2, \qquad S_{xy} \equiv \sum_{\mu=1}^{n} x_\mu y_\mu$$
Show that this estimate converges to the usual least squares estimate
$S_{xy}/S_{xx}$ as $\sigma_x \to 0$. See also Barnett (1967) for a slightly more
general case.
7. Consider a sequence of disjoint time intervals of lengths
$t_1, t_2, \ldots, t_n$. Let the number of phone calls passing through an
exchange in the $\mu$th interval be $N_\mu$. Under the assumption of a Poisson
process, the probability that $N_\mu = k$ $(k = 0, 1, 2, \ldots)$ is
$[(\lambda t_\mu)^k/k!]\exp(-\lambda t_\mu)$, where $\lambda$ is the average
arrival rate of calls. Find the maximum likelihood and minimum chi-square
estimates for $\lambda$.
8. Consider the following model: Certain "state" variables $s$ are functions
of the independent variables $x$
$$s_\mu = f(x_\mu, \theta) + \epsilon_\mu$$
and the dependent variables $y_\mu$ are linear functions of the state
variables ($B$ is a known matrix)
$$y_\mu = B s_\mu + \delta_\mu$$
Assume that $\epsilon_\mu$ and $\delta_\mu$ are random variables independently
distributed as $N_r(0, P)$ and $N_m(0, Q)$ respectively, that $Q$ is a known
matrix, and that $m > r$. Show that the MLE of $\theta$ can be obtained as
follows:
i. Compute the multiple linear regression estimates of the "true values"
of the state variables
$$\hat{s}_\mu = (B^T Q^{-1} B)^{-1} B^T Q^{-1} y_\mu \qquad (\mu = 1, 2, \ldots, n)$$
ii. Use the computed $\hat{s}_\mu$ as though they were actually measured
values of $s_\mu$, and apply the appropriate MLE estimate from Sections 4-7
through 4-9.
Note: For an application, see Section 8-8.
9. Suppose each experiment consists of measuring the same quantity $y$
several ($m$) times. Show how the results of the previous problem can be
applied to the present case. Note that under a reasonable set of assumptions
one has $Q = \sigma^2 I$, and one does not have to know the value of
$\sigma^2$ in order to apply the method.
10. Let $S$ be a matrix such that $S^T S = V^{-1}$. Define
$\tilde{B} = SB$ and $\tilde{Y} = SY$. Show that this transformation reduces
Eq. (4-4-4) to an unweighted sum of squares.
Chapter V
Computation of the Estimates I: Unconstrained Problems
5-1. Introduction
Most parameter estimation methods require that we find values $\phi^*$ of the
parameters $\phi$ for which some objective function $\Phi(\phi)$ attains its
maximum or minimum. Typical objective functions to be minimized are the sum of
squares, weighted sum of squares, and risk function. Those to be maximized are
the likelihood, concentrated likelihood, and posterior density functions. All
these were described in Chapter IV.
In most practical applications, any unknown distribution parameters $\psi$
are eliminated from the objective function by methods such as those described
in Sections 4-8 and 4-9. Therefore, we shall write our objective function as
$\Phi(\theta)$ rather than $\Phi(\phi)$ and we regard $\theta$ as a vector with
$l$ components. Some of the methods of solution we shall discuss are easily
extended to the case when distribution parameters remain present in the
objective function.
Sometimes the unknown parameters are free to assume any values whatsoever,
and we speak of unconstrained optimization.† In other cases, only values
satisfying certain inequalities and/or equations are admissible. The problem
is then to find $\theta$ such that $\Phi(\theta)$ is maximum (or minimum)
subject to:
$$h(\theta) \ge 0 \tag{5-1-1}$$
$$g(\theta) = 0 \tag{5-1-2}$$
where $h$ and $g$ are vectors (possibly vacuous) of given functions.
† Minimization or maximization, as appropriate.
We sacrifice nothing by restricting our attention to minimization, for
maximizing a function can always be accomplished by minimizing its negative.
Minimizing a function subject to Eq. (1) and Eq. (2) constitutes the problem
of nonlinear programming. In spite of extensive treatment in the literature
[see Daniel (1971), Wilde and Beightler (1967), Abadie (1967a), Kunzi and
Krelle (1966), Hadley (1964), ...] no single method has emerged which is best
for the solution of all nonlinear programming problems. One cannot even hope
that a "best" method will ever be found, since problems vary so much in size
and nature. For parameter estimation problems we must seek methods which
are particularly suitable to the special nature of these problems, which may
be characterized as follows:
1. A relatively small number of unknowns, rarely exceeding a dozen or so.
2. A highly nonlinear (though continuous and differentiable) objective
function, whose computation is often very time consuming.
3. A relatively small number (sometimes zero) of inequality constraints.
Those are usually of a very simple nature, e.g., upper and lower bounds.
4. No equality constraints, except in the case of exact structural models
(where, incidentally, the number of unknowns is large). These will be treated
separately in Sections 6-6 through 6-8.
Since the constraints play a relatively minor role in most estimation
problems, we shall first discuss some methods for unconstrained optimization.
Additional methods can be found in Kowalik and Osborne (1968). In Section
6-1 we shall show how constrained problems can be converted into
unconstrained ones, making the previously described methods applicable.
5-2. Iterative Scheme
The methods we shall discuss are iterative in nature. We start with a given
point $\theta_1$, known as the initial guess, and proceed to generate a
sequence of points $\theta_2, \theta_3, \ldots$ which we hope converges to the
point $\theta^*$ at which $\Phi(\theta)$ is minimum. The computation of
$\theta_{i+1}$ is called the $i$th iteration, and the point $\theta_i$ the
$i$th iterate. In practice, one terminates the sequence after a finite number
$N$ of iterations, and one accepts $\theta_N$ as an approximation to
$\theta^*$. The vector
$$\sigma_i \equiv \theta_{i+1} - \theta_i \tag{5-2-1}$$
is called the $i$th step. We wish each step to bring us closer to the minimum.
Since we do not know where the minimum is, we cannot test for this condition
directly. In a sense, however, we may consider the $i$th step to have
"improved" our situation (by bringing us closer to the minimum in $\Phi$
space, if not in $\theta$ space) if
$$\Phi_{i+1} < \Phi_i \tag{5-2-2}$$
where
$$\Phi_j \equiv \Phi(\theta_j) \qquad (j = 1, 2, \ldots) \tag{5-2-3}$$
We call the $i$th step acceptable if Eq. (2) holds. An iterative method is
acceptable if all the steps it produces are acceptable. We shall only consider
acceptable methods.
All the methods we shall discuss adhere to the following scheme:
1. Set $i = 1$. An initial guess $\theta_1$ must be provided externally.
2. Determine a vector $v_i$ in the direction of the proposed $i$th step.
3. Determine a scalar $\rho_i$ such that the step
$$\sigma_i = \rho_i v_i \tag{5-2-4}$$
is acceptable. That is, we take
$$\theta_{i+1} = \theta_i + \rho_i v_i \tag{5-2-5}$$
and require that $\rho_i$ be chosen so that Eq. (2) holds.
4. Test whether the termination criterion (see Section 5-15) is met. If not,
increase $i$ by one and return to step 2. If yes, accept $\theta_{i+1}$ as the
value of $\theta^*$.
The various methods to be described below differ only in the manner of
choosing $v_i$ and $\rho_i$. We refer to these quantities as step direction
and step size, respectively. Since $v_i$ is not required to be a unit vector,
$\rho_i$ is only proportional, but not necessarily equal, to the step length
in the usual sense.
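The four-step scheme can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the step direction is passed in as a function of the gradient, the step size is found by repeated halving until Eq. (5-2-2) holds, and termination is declared when the gradient is small, which is only a stand-in for the criteria of Section 5-15. The quadratic objective is hypothetical.

```python
def iterate(phi, grad, theta, direction, rho0=1.0, tol=1e-6, max_iter=1000):
    """Steps 1-4 of the scheme; returns the final iterate and the Phi values."""
    history = [phi(theta)]                    # step 1: Phi at the initial guess
    for _ in range(max_iter):
        q = grad(theta)
        # Assumed termination test (stand-in for Section 5-15): small gradient.
        if sum(g * g for g in q) ** 0.5 < tol:
            break
        v = direction(q)                      # step 2: step direction v_i
        rho = rho0                            # step 3: halve rho_i until acceptable
        trial = [t + rho * d for t, d in zip(theta, v)]
        while phi(trial) >= history[-1] and rho > 1e-16:
            rho *= 0.5
            trial = [t + rho * d for t, d in zip(theta, v)]
        if phi(trial) >= history[-1]:
            break                             # no acceptable step was found
        theta = trial                         # Eq. (5-2-5)
        history.append(phi(theta))
    return theta, history

# Hypothetical objective with its minimum at (1, -2).
def phi(t):
    return (t[0] - 1.0) ** 2 + 10.0 * (t[1] + 2.0) ** 2

def grad(t):
    return [2.0 * (t[0] - 1.0), 20.0 * (t[1] + 2.0)]

theta_final, phis = iterate(phi, grad, [0.0, 0.0], lambda q: [-g for g in q])
print(theta_final, len(phis) - 1)
```

Because a step is taken only when it lowers the objective, the recorded values of $\Phi$ decrease strictly, so every step produced is acceptable in the sense defined above.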
5-3. Acceptability
Consider the $i$th iteration of a minimization procedure. Suppose we strike
out from $\theta_i$ along some direction $v$, generating the ray
$$\theta(\rho) \equiv \theta_i + \rho v \qquad (\rho \ge 0) \tag{5-3-1}$$
Along this ray, the objective function varies as $\rho$ is changed, thus
becoming a function of $\rho$ alone. We designate this function
$$\Psi_{iv}(\rho) \equiv \Phi(\theta(\rho)) = \Phi(\theta_i + \rho v) \tag{5-3-2}$$
Its derivative is given by
$$d\Psi_{iv}/d\rho = (\partial\Phi/\partial\theta)^T(\partial\theta/\partial\rho) = (\partial\Phi/\partial\theta)^T v \tag{5-3-3}$$
The gradient vector of $\Phi(\theta)$ is $\partial\Phi/\partial\theta$, which we
designate as $q(\theta)$. The $\alpha$th component of $q$ is the quantity
$\partial\Phi/\partial\theta_\alpha$. Denoting by $q_i$ the gradient vector
evaluated at $\theta = \theta_i$, we have
$$\Psi'_{iv} \equiv (d\Psi_{iv}/d\rho)_{\rho=0} = q_i^T v \tag{5-3-4}$$
In the sequel we assume $q_i \ne 0$.
The quantity $\Psi'_{iv}$ is called the directional derivative of $\Phi$
relative to $v$ at $\theta_i$. If $\Psi'_{iv}$ is negative, then
$\Phi(\theta)$ decreases in value when one starts moving away from $\theta_i$
in the direction of $v$. Therefore, if $\rho$ is a sufficiently small positive
number, the step $\rho v$ is acceptable. On the other hand, if
$\Psi'_{iv} \ge 0$, there may not exist any positive value of $\rho$ for which
$\rho v$ is an acceptable step. We call $v$ an acceptable direction if
$\Psi'_{iv} < 0$.
We can now prove the following:
Theorem A direction $v$ is acceptable if and only if there exists a positive
definite matrix $R$ such that
$$v = -R q_i \tag{5-3-5}$$
Proof
1. Let $R$ be a positive definite matrix, and let $v$ be given by Eq. (5).
Then, from Eq. (4) and the definition of positive definiteness
$$\Psi'_{iv} = q_i^T v = -q_i^T R q_i < 0$$
2. Suppose $q_i^T v < 0$. Select
$$R = I - (q_i q_i^T/q_i^T q_i) - (v v^T/v^T q_i) \tag{5-3-6}$$
Then Eq. (5) holds, and $R$ is positive definite (we leave the details as an
exercise for the reader).
The requirement $\Psi'_{iv} = q_i^T v < 0$ says that the direction $v$ leads
downhill if it forms a greater than 90° angle with the gradient $q_i$. The
theorem states that this condition can be insured if the direction is
determined by operating on the negative gradient with a positive definite
matrix according to Eq. (5). A minimization method in which the directions are
obtained in this manner is called an acceptable gradient method (it is simply
a gradient method if $R$ is not required to be positive definite). The basic
equation of the $i$th iteration in a gradient method is
$$\theta_{i+1} = \theta_i - \rho_i R_i q_i \tag{5-3-7}$$
Various gradient methods differ in the manner of choosing the $R_i$ and
$\rho_i$.
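The second part of the proof can be checked numerically for a particular case. The sketch below takes a hypothetical gradient $q$ and a hypothetical acceptable direction $v$ in three dimensions, forms $R$ from Eq. (5-3-6), and verifies that $-Rq$ reproduces $v$ and that $R$ is positive definite. Positive definiteness is tested by attempting a Cholesky factorization, a choice of our own. This is a spot check, not a proof.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_R(q, v):
    """R = I - q q^T / (q^T q) - v v^T / (v^T q), Eq. (5-3-6)."""
    n = len(q)
    qTq, vTq = dot(q, q), dot(v, q)
    return [[(1.0 if r == c else 0.0) - q[r] * q[c] / qTq - v[r] * v[c] / vTq
             for c in range(n)] for r in range(n)]

def is_positive_definite(M):
    """Cholesky factorization succeeds only for a positive definite symmetric M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][j] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True

# Hypothetical gradient and direction with q^T v = -4 < 0.
q = [3.0, -1.0, 2.0]
v = [-1.0, 2.0, 0.5]

R = build_R(q, v)
minus_Rq = [-dot(row, q) for row in R]
print(minus_Rq, is_positive_definite(R))
```

The same construction applied to a direction with $q^T v > 0$ would make the last term of $R$ negative semidefinite, and the factorization test would then fail.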
In devising or choosing an optimization method one attempts to minimize
the total computation time required for convergence to the minimum. This
time is composed primarily of the following two factors:
1. Function and derivative evaluations.
2. Algebraic manipulations such as matrix inversions or eigenvalue
determinations.
It is usually possible to trade off these factors against each other.
A method employing more laborious algebraic procedures may require fewer
iterations, and hence fewer function evaluations. This is likely to pay off if
the objective function is a complicated one. In parameter estimation problems,
the objective function is synthesized from the model equations and from the
data obtained in many experiments. Its computation is usually time consuming.
We do not hesitate therefore to recommend methods which are sophisticated
algebraically, as long as they are efficient in terms of the number of
required function and derivative evaluations.
5-4. Convergence
One would like to be able to prove that the method one has selected
converges to the true minimum of the objective function. Unfortunately,
convergence proofs usually require that certain assumptions be made concerning
the nature of the objective function, and the validity of these assumptions is
difficult to verify on any given problem. Even more significantly, the
existence of a convergence proof is no guarantee of reasonable performance in
practice. A method may converge in theory, yet take an excessive number of
iterations, or require computations to be carried out with an unreasonable
number of significant digits. For this reason, our discussion of convergence
theorems will be brief.
Let $\Phi_i$ denote the value of $\Phi(\theta_i)$. If at each iteration we
select an acceptable point, then the sequence
$\{\Phi_i\} \equiv \{\Phi_1, \Phi_2, \Phi_3, \ldots\}$ is monotone decreasing.
If the values of the objective function possess a lower bound, then this
sequence must converge to a limit $\Phi_\infty$. If the sequence
$\{\theta_i\}$ is bounded (i.e., there exists a number $M$ such that
$\theta_i^T\theta_i < M$ for all $i$) then it has at least one limit point.
It follows from the continuity of $\Phi$ that
$\Phi(\theta_\infty) = \Phi_\infty$, where $\theta_\infty$ is any limit point of
$\{\theta_i\}$. Because of this, the sequence $\{\theta_i\}$ can have more than
one limit point only by remarkable coincidence. In all practical cases, the
sequence $\{\theta_i\}$ is either unbounded, or converges to a point
$\theta_\infty$. The rate of convergence, however, may be so slow that the
sequence appears nonconvergent.
A stationary point of the objective function is one at which
$q(\theta) = 0$. If $\theta_i$ is stationary, i.e., $q_i = 0$, then Eq. (5-3-7)
shows that all $\theta_j$ $(j \ge i)$ coincide with $\theta_i$. Therefore, the
most that we can hope to prove about any gradient method is that it converges
to a stationary point. Convergence to the true minimum can be guaranteed only
if it can be shown that the objective function has no other stationary points.
In practice, convergence to a local maximum or saddle point requires an
improbable coincidence. One usually reaches at least a local minimum.
Convergence proofs require that the $\rho_i$ be chosen sufficiently large,
and the matrices $R_i$ sufficiently positive definite. The following theorem
is typical. Its proof is given in Appendix F:
Theorem Let $\mathcal{D}$ denote the set of all $\theta$ such that
$\Phi(\theta) < \Phi_1$. Suppose the following conditions are satisfied:
1. $\Phi$ has continuous first and bounded second derivatives in
$\mathcal{D}$.
2. Let $\bar{\rho}_i$ be the smallest nonnegative value of $\rho$ at which
$\Psi_{i v_i}(\rho)$ attains a local maximum, where $v_i = -R_i q_i$. Let
$\alpha$ be a positive number less than one, and $\rho_0$ a positive number.
We choose each $\rho_i$ so that either
$\alpha\bar{\rho}_i \le \rho_i \le \bar{\rho}_i$
or $\min(\rho_0, \alpha\bar{\rho}_i) \le \rho_i \le \bar{\rho}_i$.
3. Let $\beta$ and $\gamma$ be constants satisfying $\gamma > \beta > 0$. We
choose each $R_i$ so that all its eigenvalues lie between $\beta$ and
$\gamma$.
Then all the limit points of $\{\theta_i\}$ are stationary points of $\Phi$.
These conditions are sufficient, but not necessary. In the algorithms that
we shall describe the $R_i$ are usually chosen so as to satisfy condition 3.
There does not seem to be any need, in practice, to trouble oneself with
satisfying condition 2 precisely. Condition 1 is almost always satisfied in
principle, but the limited accuracy of numerical calculations sometimes causes
trouble.
5-5. Steepest Descent
The simplest gradient method employs $R_i = I$, so that $v_i = -q_i$ in all
iterations. The direction $-q_i$ is the one in which the objective function
decreases most rapidly, at least initially. Hence this method is called steepest
descent. Unfortunately, as discussed more fully in subsequent sections, this
method is often very inefficient, requiring a large number of steps which tend
to zigzag in a so-called hemstitching pattern. The method is not recommended
for practical applications, and is discussed here only for reference.
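As a rough illustration of the hemstitching behavior, the following Python sketch applies steepest descent with an exact line search to a two-parameter quadratic objective. The test function, its Hessian diag(1, 100), the starting point, the tolerance, and the helper name `steepest_descent` are all assumptions chosen for illustration; none of them comes from the text.

```python
def steepest_descent(h_diag, theta, tol=1e-8, max_iter=10000):
    """Minimize Phi = 0.5 * sum(h_k * theta_k**2) using R = I.

    The step length rho is the exact minimizer along -q, which is available
    in closed form only because Phi is quadratic (an assumption of this sketch).
    Returns the final point and the number of iterations used.
    """
    for it in range(max_iter):
        q = [h * t for h, t in zip(h_diag, theta)]        # gradient of Phi
        qq = sum(g * g for g in q)
        if qq ** 0.5 < tol:                               # gradient small: stop
            return theta, it
        q_h_q = sum(h * g * g for h, g in zip(h_diag, q))
        rho = qq / q_h_q                                  # exact line search
        theta = [t - rho * g for t, g in zip(theta, q)]   # step along -q
    return theta, max_iter


# Condition number 100; this starting point makes successive steps zigzag
# across the narrow valley.
theta, iters = steepest_descent([1.0, 100.0], [100.0, 1.0])
print(iters, theta)
```

Even for this mildly ill-conditioned surface the iteration count runs into the hundreds, whereas a Newton step would reach the minimum of a quadratic in one iteration.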
5-6. Newton's Method
The Hessian matrix $H(\theta)$ of the function $\Phi(\theta)$ is the matrix of
second partial derivatives, i.e.,
$$H_{\alpha\beta}(\theta) \equiv \partial^2\Phi/\partial\theta_\alpha\,\partial\theta_\beta \tag{5-6-1}$$
Let $H_i$ be the Hessian matrix of $\Phi$ evaluated at $\theta = \theta_i$. We
define the function
$$Q_i(\theta) = \Phi_i + q_i^T(\theta - \theta_i) + \tfrac{1}{2}(\theta - \theta_i)^T H_i (\theta - \theta_i) \tag{5-6-2}$$
which consists of the terms up to second order in the Taylor series expansion
of $\Phi$ around the point $\theta_i$. In a sense, $Q_i(\theta)$ matches the
behavior of $\Phi(\theta)$ at $\theta = \theta_i$ more closely than does any other
second order surface.
Suppose we wish to find the point at which $Q_i(\theta)$ is stationary. We equate
to zero the gradient of $Q_i$
$$\partial Q_i/\partial\theta = q_i + H_i(\theta - \theta_i) = 0 \tag{5-6-3}$$
which, if $H_i$ is nonsingular, has the solution
$$\theta_{i+1} = \theta_i - H_i^{-1} q_i \tag{5-6-4}$$
Equation (4) defines the ith iteration of the Newton (also known as
Newton-Raphson) method. It conforms to the general formula of gradient
methods Eq. (5-3-7) with $\rho_i = 1$ and $R_i = H_i^{-1}$.
If $\Phi(\theta)$ coincides with $Q_i(\theta)$, i.e., if $\Phi$ is a quadratic
function, then $\theta_{i+1}$ is a stationary point of $\Phi$. This is a minimum
if $H_i$ is positive definite. In that case $R_i$ is positive definite, the method
is acceptable, and it converges in a single iteration. If $H_i$ is negative
definite, $\theta_{i+1}$ is a maximum, and if $H_i$ is indefinite, $\theta_{i+1}$
is a saddle point. In both cases the method is not acceptable.
When $\Phi$ is not quadratic, $\theta_{i+1}$ does not generally coincide with the
stationary point, and the method does not converge in a single iteration. The
method is acceptable, however, as long as $H_i$ is positive definite, as it should
be at least in some neighborhood of the minimum. In this neighborhood,
convergence is quadratic. This means that the number of correct digits in
$\theta$ is approximately doubled by each iteration, until further improvement is
barred by the rounding errors in the calculations. Outside this neighborhood,
convergence cannot be guaranteed. It is worth noting that the step $v = -Rq$
($R$ positive definite) solves the problem of minimizing the function
$q^T v + \tfrac{1}{2} v^T R^{-1} v$. The closer $R$ is to $H^{-1}$, the closer is
this modified problem to the original one.
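The doubling of correct digits can be observed on a small example. The sketch below applies the iteration of Eq. (5-6-4) to a one-parameter function; the choice $\Phi(\theta) = e^{\theta} - 2\theta$ (minimum at $\theta = \ln 2$, second derivative positive everywhere), the starting point, and the helper name `newton_minimize` are assumptions made only for illustration.

```python
import math


def newton_minimize(grad, hess, theta, n_iter=6):
    """Scalar Newton iteration theta <- theta - q / H with step length 1."""
    history = [theta]
    for _ in range(n_iter):
        theta = theta - grad(theta) / hess(theta)
        history.append(theta)
    return history


# Phi(theta) = exp(theta) - 2*theta, so q = exp(theta) - 2 and H = exp(theta) > 0.
history = newton_minimize(lambda t: math.exp(t) - 2.0,
                          lambda t: math.exp(t),
                          theta=0.0)
errors = [abs(t - math.log(2.0)) for t in history]
for i, err in enumerate(errors):
    print(i, err)   # each error is roughly half the square of the previous one
```

Because the second derivative is positive everywhere for this function, every step is acceptable; for a function whose Hessian is not positive definite away from the minimum the same loop could diverge.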
Returning to the case of a quadratic function $\Phi$ with a positive definite
Hessian $H$, we find that the Newton step $-H^{-1}q$ is the only one that takes us
to the minimum in a single iteration. Any other step $-Rq$ with $R \ne H^{-1}$ will
miss the minimum. If we define the efficiency of a method as the decrease in
function value obtained by a single step of the method, divided by the
maximum possible decrease, then the efficiency of the Newton method is 1. It was
shown by Greenstadt (1967) that the efficiency $e(R)$ of a method using the
direction $-Rq$ is bounded by
$$4\gamma/(1 + \gamma)^2 \le e(R) \le 1 \tag{5-6-5}$$
where $\gamma$ is the condition number, i.e., the ratio of largest to smallest
eigenvalues of the matrix $R^{1/2} H R^{1/2}$. With $R = I$ (the method of steepest
descent), $\gamma = \gamma_H$ is the condition number of $H$ itself. It is quite
common to find cases where $\gamma > 10^5$, so that the Newton method may be
25,000 times more efficient than steepest descent!
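The figure of 25,000 follows directly from the lower bound in Eq. (5-6-5). A minimal sketch that tabulates the bound for a few condition numbers is given below; the function name `efficiency_lower_bound` and the sample values of $\gamma$ are illustrative assumptions.

```python
def efficiency_lower_bound(gamma):
    """Lower bound 4*gamma / (1 + gamma)**2 on the efficiency e(R), Eq. (5-6-5)."""
    return 4.0 * gamma / (1.0 + gamma) ** 2


for gamma in (1.0, 10.0, 1e3, 1e5):
    bound = efficiency_lower_bound(gamma)
    # 1/bound is the worst-case factor by which the Newton step (efficiency 1)
    # outperforms a step in the direction -Rq.
    print(gamma, bound, 1.0 / bound)
```

The bound describes the worst case only; for a particular gradient direction a non-Newton step may do considerably better than the bound suggests.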
Even in the nonquadratic case, a large value of $\gamma_H$ indicates the presence
of a long and narrow ridge or trough in the surface $\Phi(\theta)$. The ridge is
aligned with the direction of the eigenvector of $H$ whose eigenvalue is very
small in absolute value. The ridge is concave upwards if the eigenvalue is
positive, and downwards if it is negative. In the first case, the Newton method
attempts to reach a minimum or saddle point at the bottom of the ridge; in the
second case a maximum or saddle point at the top of the ridge. In either case, as
shown below, it proceeds to take long steps along the ridge in the direction of
the expected stationary point. The results are good in the first case, but
disastrous in the second. In contrast, a method employing a matrix $R$ which
differs radically from $H^{-1}$ tends to crisscross the ridge in a so-called
hemstitching pattern, and makes very slow progress.
Let $H = \sum_{i=1}^{l} \lambda_i v_i v_i^T$ be the eigenvalue decomposition of $H$
(see Section A-5), and let the gradient be expressible in terms of the
eigenvectors as $q = \sum_{i=1}^{l} \alpha_i v_i$. Then, since
$v_i^T v_j = \delta_{ij}$,
$$-H^{-1} q = -\sum_{i=1}^{l} \lambda_i^{-1} v_i v_i^T \sum_{j=1}^{l} \alpha_j v_j = -\sum_{i=1}^{l} (\alpha_i/\lambda_i)\, v_i \tag{5-6-6}$$
If $\lambda_i$ is very small, then unless $\alpha_i$ is accidentally also very
small, the step $-H^{-1}q$ has a large component in the direction $v_i$. That is
why the Newton step approximately parallels the direction of a ridge.
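As a small worked instance of Eq. (5-6-6), with numbers that are purely illustrative assumptions rather than values taken from the text, consider two parameters with eigenvalues $\lambda_1 = 1$ and $\lambda_2 = 10^{-4}$, and a gradient with equal components $\alpha_1 = \alpha_2 = 1$ along the two eigenvectors:

```latex
\begin{aligned}
q &= v_1 + v_2, \\
-H^{-1} q &= -\frac{\alpha_1}{\lambda_1}\, v_1 - \frac{\alpha_2}{\lambda_2}\, v_2
           = -v_1 - 10^{4}\, v_2 .
\end{aligned}
```

The Newton step is therefore almost exactly parallel to $v_2$, the ridge direction, whereas the steepest descent direction $-q = -v_1 - v_2$ weights the two eigenvectors equally.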
In spite of its splendid performance in those cases where it works, the
Newton method is not a practical one, for the following two reasons:
1. It does not work in many cases, because $H(\theta)$ is not necessarily positive
definite except near the minimum.
2. It requires the evaluation of second derivatives. This places a heavy
burden on the user, particularly where the objective functions are as
complicated as those to be found in parameter estimation problems.
Various tricks have been devised for overcoming these difficulties, while
retaining the advantage of the method. Directional discrimination and
Marquardt's method are designed to overcome the problem of indefiniteness,
whereas the Gauss and variable metric methods eliminate the need for
computing second derivatives.† All these methods are described in succeeding
sections.
We note that of the two deficiencies, the first one (nonconvergence) must
be overcome if the method is to be useful. The second difficulty (second
derivatives required) is merely a matter of convenience. One may raise the
question of whether the Newton method (modified to guarantee convergence)
is not so much more efficient than methods that do not require second
derivatives, as to make the evaluation of these derivatives worthwhile. We have
no definitive answers to this question, but a limited amount of experience has
led to the following tentative conclusions:
1. If the model fits the data well, the Gauss method often requires no
more iterations than the Newton method (Bard, 1967).
2. If the model does not fit well, the Newton method may require fewer
iterations than the Gauss method, but the computing times for the two methods
are roughly the same (Flanagan et al., 1969).
† One may also compute second derivatives directly by finite differences on q,
as proposed by Goldstein and Price (1967).
5-7. Directional Discrimination
Assume that we have a matrix $A_i$ which is the Hessian $H_i$ or an approximation
to it. We would like to obtain a positive definite (or at least semidefinite)
matrix $R_i$ which is in some sense close to $A_i^{-1}$, so that $v = -R_i q_i$ is
an acceptable direction. Furthermore, we wish to obtain a reasonable $v$ even if
$A_i$ is singular, or nearly so.
The idea behind directional discrimination methods is to use different
formulas for computing the various components (in a suitably chosen
coordinate system) of $v$. Jennrich and Sampson (1968) use the original
coordinate system, and set to zero those components of $v$ which do not seem, at
the moment, to affect $\Phi$. The technique is most suitable for use in
conjunction with the Gauss method, and will be discussed further in Section 5-11.
Generally, however, it pays to transform the coordinates so as to eliminate
"interaction" among the parameters, i.e., so as to obtain a diagonal Hessian. In
such a coordinate system, the effect on $\Phi$ of varying one component of $v$ is
approximately independent of the value of any other component of $v$. Fariss and
Law (1967) have coined the term rotational discrimination to describe such
methods.
To obtain a suitable transformation of coordinates, we may use any
spectral decomposition of $A_i$ (see Appendix A-5). Good numerical accuracy is
obtained with scaled decompositions, and these will be used here. Let the
inverse scaled decomposition of $A_i$ be given by Eq. (A-5-19), i.e.,
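$$A_i = (G^T)^{-1} \Pi G^{-1} \tag{5-7-1}$$
The equation $A_i v = -q_i$ can, therefore, be written as
$$\Pi G^{-1} v = -G^T q_i \tag{5-7-2}$$
Let $\tilde\theta \equiv G^{-1}\theta$ and $\tilde v \equiv G^{-1} v$. We have
$$\tilde q_\alpha \equiv \partial\Phi/\partial\tilde\theta_\alpha = \sum_{\beta=1}^{l} (\partial\Phi/\partial\theta_\beta)(\partial\theta_\beta/\partial\tilde\theta_\alpha)$$
But $\theta = G\tilde\theta$, so that
$\partial\theta_\beta/\partial\tilde\theta_\alpha = G_{\beta\alpha}$ and
$\tilde q_\alpha = \sum_{\beta=1}^{l} (\partial\Phi/\partial\theta_\beta) G_{\beta\alpha}$;
or in matrix notation
$$\tilde q = G^T q_i \tag{5-7-3}$$
Hence Eq. (2) may be written in the $\tilde\theta$ coordinate system as
$$\Pi \tilde v = -\tilde q \tag{5-7-4}$$
or, since $\Pi$ is diagonal with $\Pi_{\alpha\alpha} = \pi_\alpha$
$$\pi_\alpha \tilde v_\alpha = -\tilde q_\alpha \qquad (\alpha = 1, 2, \ldots, l) \tag{5-7-5}$$
The solution is
$$\tilde v_\alpha = -\gamma_\alpha \tilde q_\alpha \qquad (\alpha = 1, 2, \ldots, l) \tag{5-7-6}$$
with
$$\gamma_\alpha = \pi_\alpha^{-1} \tag{5-7-7}$$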
We are now ready to exercise directional discrimination by departing from
Eq. (7) for some of the components. First of all, to guarantee an acceptable
step, all $\gamma_\alpha$ must be positive. Hence we follow Greenstadt's
suggestion (Greenstadt, 1967) and replace any negative eigenvalues by their
absolute values. In place of Eq. (7) we take
$$\gamma_\alpha = |\pi_\alpha^{-1}| \qquad (\alpha = 1, 2, \ldots, l) \tag{5-7-8}$$
As noted in Section 5-6, the presence in $H$ of a small negative eigenvalue
corresponds to a downward concave ridge or trough in the objective function.
Taking $\gamma_i = |\pi_i^{-1}|$ merely aligns the direction of search with the
direction pointing downwards along the ridge (see Fig. 5-1).
We still face the problem of what to do about zero or very small
eigenvalues. In numerical computations, an eigenvalue is almost never exactly
zero, so we need not worry unduly about that situation. If a $\pi_\alpha$ is very
small, e.g., $|\pi_\alpha| < 10^{-5} \max_\beta |\pi_\beta|$, then we face a
dilemma: on the one hand, the Newton method tells us that the direction
corresponding to this eigenvalue is precisely the one leading most efficiently to
the minimum, provided the objective function is quadratic. On the other hand, we
know the function is not quadratic, and we are reluctant to take the very long
step $|\pi_\alpha^{-1}|\,\tilde q_\alpha$ which is based on the quadratic
assumption. Also, the convergence theorem of Section 5-4 requires that the range
of eigenvalues of $R_i$ be limited. We can think of at least three basic
strategies:
(a) Newton-Greenstadt Method
$$\gamma_\alpha = \max[\,|\pi_\alpha|^{-1}, \delta\,] \tag{5-7-9}$$
where $\delta$ is a small constant.
Fig. 5-1. Ridges on objective function. [Two panels: small positive eigenvalue;
small negative eigenvalue. The Newton direction and the Greenstadt direction are
marked.]
(b) Modified† Fariss-Law Method
$$\gamma_\alpha = \begin{cases} \max[\,|\pi_\alpha|^{-1}, \delta\,] & (|\pi_\alpha| > \varepsilon) \\ a\,|\pi_\alpha| & (|\pi_\alpha| \le \varepsilon) \end{cases} \tag{5-7-10}$$
where $\delta$, $\varepsilon$, and $a$ are constants. This method may be viewed as
being Newton-like relative to the large eigenvalues, and steepest-descent-like
relative to the small eigenvalues. If we take $\delta = \varepsilon = 0$ we get
the pseudoinverse, so that this method may be characterized as an approximate
pseudoinverse method.
(c) A "Neutral" Method
$$\gamma_\alpha = \begin{cases} \max[\,|\pi_\alpha|^{-1}, \delta\,] & (|\pi_\alpha| > \varepsilon) \\ \beta & (|\pi_\alpha| \le \varepsilon) \end{cases} \tag{5-7-11}$$
where $\delta$, $\varepsilon$, and $\beta$ are constants. This method is appealing
because it is the only one which confines the eigenvalues of $R_i$ to a range
bounded by positive numbers from above and below, as required by the convergence
theorem of Section 5-4.
† In the original method, $\gamma_\alpha = 0$ for negative and very small $\pi_\alpha$.
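The three strategies differ only in how they treat eigenvalues of small magnitude, which the following Python sketch makes explicit. The function name `step_multipliers`, the default values of `delta`, `eps`, `a`, and `beta`, and the sample eigenvalues are hypothetical choices for illustration; the text does not prescribe them. The threshold `eps` is treated here as an absolute constant.

```python
def step_multipliers(pi, method, delta=1e-6, eps=1e-5, a=1.0, beta=1.0):
    """Return gamma_alpha for each eigenvalue pi_alpha.

    method "a": Newton-Greenstadt, Eq. (5-7-9)
    method "b": modified Fariss-Law, Eq. (5-7-10)
    method "c": "neutral" method, Eq. (5-7-11)
    """
    if method not in ("a", "b", "c"):
        raise ValueError("method must be 'a', 'b', or 'c'")
    gammas = []
    for p in pi:
        mag = abs(p)
        if mag > eps or (method == "a" and mag > 0.0):
            gammas.append(max(1.0 / mag, delta))   # Newton-like branch
        elif method == "a":
            gammas.append(float("inf"))            # exactly zero eigenvalue
        elif method == "b":
            gammas.append(a * mag)                 # small |pi|: Fariss-Law branch
        else:
            gammas.append(beta)                    # small |pi|: neutral branch
    return gammas


pi = [2.0, -0.5, 1e-8]          # one large, one negative, one tiny eigenvalue
for m in "abc":
    print(m, step_multipliers(pi, m))
```

For the tiny eigenvalue, strategy (a) produces a very large multiplier, while (b) and (c) keep it small or moderate; the negative eigenvalue is handled identically by all three through the absolute value.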
Practical experience with these methods does not as yet provide us with a
clear-cut choice among them. In some extremely ill-conditioned problems
[e.g., Jennrich and Sampson's two exponentials (Jennrich and Sampson, 1968)]
methods (b) and (c) performed splendidly while (a) failed. In several less
artificial cases, however, method (a) has converged fastest.
Whatever our choices for the $\gamma_\alpha$, let $\Gamma$ be the diagonal matrix
with $\Gamma_{\alpha\alpha} = \gamma_\alpha$. Then, from Eq. (6)
$$\tilde v = -\Gamma \tilde q \tag{5-7-12}$$
Replacing $\tilde v$ and $\tilde q$ by their definitions we obtain, after
premultiplication by $G$
$$v = -G\Gamma G^T q_i \tag{5-7-13}$$
so that $R_i \equiv G\Gamma G^T$ is the required "almost inverse" of $A_i$.
5-8. The Marquardt Method
An alternative method for converting an arbitrary matrix into a positive
definite one is given in papers by Levenberg (1944), Marquardt (1963), and
Goldfeld et al. (1966). The method rests on the observation that if $P$ is any
positive definite matrix, then $A_i + \lambda P$ is positive definite for
sufficiently large $\lambda$, no matter what $A_i$. Marquardt suggests the choice
$P_i \equiv B_i^2$ where $B_i$ is defined by Eq. (A-5-14). That is, $P$ is a
diagonal matrix whose elements coincide with the absolute values of the diagonal
elements of $A_i$ (with say zero elements replaced by ones). Thus, we use
$$R_i \equiv (A_i + \lambda_i B_i^2)^{-1} \tag{5-8-1}$$
and take a step with $\rho_i = 1$, i.e.,
$$\sigma_i = -R_i q_i \tag{5-8-2}$$
Observe that as $\lambda_i \to \infty$, the term $\lambda_i B_i^2$ dominates
$A_i$. The step then becomes
$$\sigma_i \to -\lambda_i^{-1} B_i^{-2} q_i \qquad (\lambda_i \to \infty) \tag{5-8-3}$$
This is an extremely short step in a downhill direction, $B_i^2$ being positive
definite. A sufficiently large $\lambda_i$ always produces an acceptable step. On
the other hand, when $\lambda_i$ is very small, $\sigma_i$ approaches the Newton
direction $-A_i^{-1} q_i$.
Marquardt (1963) suggests the following algorithm for the selection of
$\lambda_i$:
1. When $i = 1$, start, say with $\lambda = 0.01$.
2. At the start of the ith iteration, compute
$v = -(A_i + \lambda B_i^2)^{-1} q_i$, $\theta^{(1)} = \theta_i + v$, and
$\Phi^{(1)} \equiv \Phi(\theta^{(1)})$.
3. If $\Phi^{(1)} < \Phi_i$, accept $\theta_{i+1} = \theta^{(1)}$, and replace
$\lambda$ with $\max(0.1\lambda, \varepsilon)$ where $\varepsilon$ is a small
positive number, say $10^{-7}$. Otherwise:
4. If $(v^T q_i)^2/[(q_i^T q_i)(v^T v)] < \tfrac{1}{2}$, replace $\lambda$ with
$10\lambda$ and return to step 2. Otherwise, interpolate, i.e., find a value
$\rho_i$ sufficiently small so that $\Phi(\theta_i + \rho_i v) < \Phi_i$ (see
Section 5-14). Accept $\theta_{i+1} = \theta_i + \rho_i v$.
The preceding algorithm may require computation of $v$ for several values
of $\lambda$ in a single iteration. This can be avoided by replacing step 4 with:
4'. Find a value $\rho_i$ sufficiently small so that
$\Phi(\theta_i + \rho_i v) < \Phi_i$ (see Section 5-14). Accept
$\theta_{i+1} = \theta_i + \rho_i v$. Replace $\lambda$ with $10\lambda$.
Yet another method for choosing $\lambda_i$ is described by Smith and Shanno
(1971).
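The selection rule for $\lambda$ can be sketched in a few lines of Python. This is a simplified illustration under several assumptions that are not part of the text: the objective is the two-parameter Rosenbrock function, $A_i$ is its exact Hessian, and step 4 is reduced to "always multiply $\lambda$ by 10 and retry", so the angle test and the interpolation of Section 5-14 are omitted. All function names are hypothetical.

```python
def phi(t):
    x, y = t
    return 100.0 * (y - x * x) ** 2 + (1.0 - x) ** 2


def gradient(t):
    x, y = t
    return [-400.0 * x * (y - x * x) - 2.0 * (1.0 - x), 200.0 * (y - x * x)]


def hessian(t):
    x, y = t
    return [[1200.0 * x * x - 400.0 * y + 2.0, -400.0 * x], [-400.0 * x, 200.0]]


def marquardt_step(A, q, lam):
    """Solve (A + lam*B^2) v = -q for 2x2 A, with B^2 = diag(|A_kk|), zeros -> 1."""
    b2 = [abs(A[k][k]) or 1.0 for k in range(2)]
    m00, m01 = A[0][0] + lam * b2[0], A[0][1]
    m10, m11 = A[1][0], A[1][1] + lam * b2[1]
    det = m00 * m11 - m01 * m10
    if det == 0.0:
        return None                                   # singular: caller raises lam
    return [(-q[0] * m11 + q[1] * m01) / det, (-q[1] * m00 + q[0] * m10) / det]


def marquardt(theta, lam=0.01, eps=1e-7, tol=1e-8, max_iter=1000):
    for it in range(max_iter):
        q = gradient(theta)
        if (q[0] ** 2 + q[1] ** 2) ** 0.5 < tol:
            return theta, it
        A, phi_i = hessian(theta), phi(theta)
        for _ in range(60):                           # retry with larger lambda
            v = marquardt_step(A, q, lam)
            trial = None if v is None else [theta[0] + v[0], theta[1] + v[1]]
            if trial is not None and phi(trial) < phi_i:
                theta, lam = trial, max(0.1 * lam, eps)   # step 3: accept, relax
                break
            lam *= 10.0                               # simplified step 4
        else:
            return theta, it                          # no acceptable step found
    return theta, max_iter


theta, iters = marquardt([-1.2, 1.0])
print(iters, theta)
```

Increasing $\lambda$ on every failure is the most conservative reading of the algorithm; the angle test of step 4 exists precisely to avoid needless increases when the direction is already reasonable and only the step length is too long.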
There are two ways in which the actual computation of $v$ may be
undertaken:
(a) Solve the set of simultaneous linear equations
$$(A_i + \lambda_i B_i^2)\, v = -q_i \tag{5-8-4}$$
If several values of $\lambda_i$ must be tried in a single iteration, these
equations must be re-solved each time. Note that if all diagonal elements of
$A_i$ are positive, then $A_i + \lambda_i B_i^2$ is identical to $A_i$ except that
each diagonal element is multiplied by $1 + \lambda_i$. If
$A_i + \lambda_i B_i^2$ is known to be positive definite (as, e.g., in the
Gauss method; see Section 5-9), then the Cholesky decomposition (Section
A-5) method should be used to solve for $v$. Even if positive definiteness is not
assured, it is a good idea to start the Cholesky decomposition. If a negative
pivot occurs, the procedure is halted, $\lambda_i$ is multiplied by 10, a new
matrix generated, and the procedure restarted. Alternately, a separate $\lambda$
may be maintained for each row of the matrix. If a negative pivot is encountered
say in the third row, then the corresponding $\lambda$ is increased sufficiently
to make that pivot positive, and the Cholesky procedure can now be continued
without restart.
(b) Let Eq. (A-5-22) be the inverse scaled decomposition of $A_i$. Then the
reader may easily verify that
$$R_i = (A_i + \lambda_i B_i^2)^{-1} = \sum_{\alpha=1}^{l} (\pi_\alpha + \lambda_i)^{-1} g_\alpha g_\alpha^T \tag{5-8-5}$$
Once the decomposition of $A_i$ has been obtained, $R_i$ may be computed for
any number of values of $\lambda_i$. We observe that we can restrict our choice to
values of $\lambda_i$ satisfying
$$\lambda_i > -\min_\alpha \pi_\alpha \tag{5-8-6}$$
It is worth remarking that Marquardt's method finds the step $v$ which
minimizes the quadratic approximation to $\Phi$ given by
$$Q_i(v) \equiv \Phi_i + v^T q_i + \tfrac{1}{2} v^T A_i v \tag{5-8-7}$$
subject to the restriction that
$$v^T B_i^2 v = c \tag{5-8-8}$$
That is, the step $v$ takes us to that point on the ellipsoid defined by Eq. (8)
at which the function $Q_i(v)$ attains its minimum.
To prove this, we form the Lagrangian
$$\Lambda(v) = \Phi_i + v^T q_i + \tfrac{1}{2} v^T A_i v + \tfrac{1}{2}\lambda_i (v^T B_i^2 v - c) \tag{5-8-9}$$
We differentiate with respect to $v$, equate to zero
$$q_i + A_i v + \lambda_i B_i^2 v = 0 \tag{5-8-10}$$
and solve for $v$
$$v = -(A_i + \lambda_i B_i^2)^{-1} q_i \tag{5-8-11}$$
in agreement with Eq. (1). The particular ellipsoid chosen depends on
$\lambda_i$, since by substituting Eq. (11) into Eq. (8) we find
$$c = q_i^T (A_i + \lambda_i B_i^2)^{-1} B_i^2 (A_i + \lambda_i B_i^2)^{-1} q_i \tag{5-8-12}$$
The larger $\lambda_i$ is, the smaller is $c$, and the smaller is the ellipsoid to
which we are confined. The algorithm starts with an ellipsoid of a certain size,
determined through Eq. (12) by the initial choice for $\lambda_i$. If the
corresponding step $v$ fails to decrease the objective function, this is an
indication that the chosen ellipsoid is much larger than the region within which
the quadratic approximation Eq. (7) is valid. By increasing $\lambda$, we shrink
the ellipsoid and try again.
The Marquardt method has proven very reliable in practice. In the
problem of Section 5-21 with difficult initial guess, and in several cases of the
problem of Section 5-23, the Marquardt method proved faster than
directional discrimination methods. On the other hand, in a series of ten other
test problems (Bard, 1970), the reverse held true. The need for further testing is
evident.
5-9. The Gauss Method
In most parameter estimation problems, the unknown parameters appear
only indirectly in the objective function. The latter depends explicitly on the
model equations, which in turn depend on the parameters. To compute
derivatives of the objective function, we first differentiate it with respect to
the model equations, and then differentiate those with respect to the
parameters. The Gauss (1809) method, originally applied to least squares problems,
consists of simply omitting the second derivatives of the model equations when
the Hessian is being computed.
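We illustrate by means of the simplest example, that of single equation
nonlinear least squares. Here we minimize
$$\Phi(\theta) = \sum_{\mu=1}^{n} [y_\mu - f(x_\mu, \theta)]^2 = \sum_{\mu=1}^{n} (y_\mu - f_\mu)^2 = \sum_{\mu=1}^{n} e_\mu^2 \tag{5-9-1}$$
whence
$$q_\alpha = \partial\Phi/\partial\theta_\alpha = 2\sum_{\mu=1}^{n} e_\mu\, \partial e_\mu/\partial\theta_\alpha = -2\sum_{\mu=1}^{n} e_\mu\, \partial f_\mu/\partial\theta_\alpha \tag{5-9-2}$$
and
$$H_{\alpha\beta} = \partial^2\Phi/\partial\theta_\alpha\,\partial\theta_\beta = -2\sum_{\mu=1}^{n} e_\mu\, \partial^2 f_\mu/\partial\theta_\alpha\,\partial\theta_\beta + 2\sum_{\mu=1}^{n} (\partial f_\mu/\partial\theta_\alpha)(\partial f_\mu/\partial\theta_\beta) \tag{5-9-3}$$
In the Gauss method, we neglect the first term, and use $N$ in place of $H$,
where $N$ is defined by
$$N_{\alpha\beta} = 2\sum_{\mu=1}^{n} (\partial f_\mu/\partial\theta_\alpha)(\partial f_\mu/\partial\theta_\beta) \tag{5-9-4}$$
A numerical example appears in Section 5-21.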
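A minimal sketch of the resulting iteration for a single parameter is shown below. The model $f(x, \theta) = e^{-\theta x}$, the synthetic noise-free data generated with $\theta = 0.7$, the starting value, and the function name `gauss_fit` are assumptions chosen for illustration; they are not the example of Section 5-21. The step length is fixed at 1, with no interpolation.

```python
import math


def gauss_fit(x, y, theta, tol=1e-12, max_iter=50):
    """Gauss iterations for the one-parameter model f(x, theta) = exp(-theta * x)."""
    for it in range(max_iter):
        f = [math.exp(-theta * xm) for xm in x]
        e = [ym - fm for ym, fm in zip(y, f)]                # residuals e_mu
        dfdt = [-xm * fm for xm, fm in zip(x, f)]            # df_mu / dtheta
        q = -2.0 * sum(em * d for em, d in zip(e, dfdt))     # gradient, Eq. (5-9-2)
        n_mat = 2.0 * sum(d * d for d in dfdt)               # N of Eq. (5-9-4), scalar here
        step = -q / n_mat                                    # v = -N^{-1} q
        theta += step
        if abs(step) < tol:
            return theta, it + 1
    return theta, max_iter


x = [1.0, 2.0, 3.0, 4.0]
y = [math.exp(-0.7 * xm) for xm in x]      # synthetic data, zero residuals at the optimum
theta, iters = gauss_fit(x, y, theta=0.3)
print(iters, theta)
```

Because the residuals vanish at the solution in this synthetic case, the neglected term of Eq. (5-9-3) also vanishes there, which is the situation in which $N$ approximates $H$ best.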
In the preceding discussion we have derived $N$ as an approximation to $H$,
and the Gauss method as an approximation to the Newton method. There is
an alternative interpretation: suppose the model equations are replaced by
their tangents; that is, the nonlinear (in $\theta$) model is approximated by one
that is linear. If we now try to solve the corresponding linear least squares
problem we find the solution to be precisely
$\tilde\theta = \theta_i - N^{-1} q$. Now $\tilde\theta$ is usually not the
correct solution to the nonlinear problem. Yet, if we accept
$\theta_{i+1} = \tilde\theta$, we may regard the Gauss method as solving a
sequence of linear problems. This interpretation is pursued further in Section
5-10.
The term neglected in Eq. (3) contained the residual $e_\mu$ as a factor. Since
the residuals are, hopefully, small, this provides some justification for
regarding $N$ as a good approximation to $H$, particularly near the minimum. The
same justification applies to all of the more general cases in which the
objective function depends on the parameters only through the elements of the
moment matrix of the residuals
$M(\theta) = \sum_\mu e_\mu(\theta)\, e_\mu^T(\theta)$. This includes most of
the least squares and maximum likelihood estimates for normal distributions
that were discussed in Chapter IV, e.g., Eqs. (4-7-7), (4-7-9), (4-8-4), (4-9-9),
and (4-21-1). In all these cases we have
$$\Phi(\theta) = \Psi(M(\theta)) \tag{5-9-5}$$
where $\Psi$ is a suitable function.
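Differentiation of Eq. (4-7-5) yields:
$$\partial M_{ab}/\partial\theta_\alpha = \sum_\mu \left( e_{\mu a}\, \partial e_{\mu b}/\partial\theta_\alpha + e_{\mu b}\, \partial e_{\mu a}/\partial\theta_\alpha \right) \tag{5-9-6}$$
$$\partial e_{\mu b}/\partial\theta_\alpha = \begin{cases} -\partial f_{\mu b}/\partial\theta_\alpha & \text{for reduced models} \\ \partial g_{\mu b}/\partial\theta_\alpha & \text{for inexact structural models} \end{cases} \tag{5-9-7}$$
Therefore, from Eq. (5) and because of the symmetry of $M$
$$\begin{aligned} q_\alpha = \partial\Phi/\partial\theta_\alpha &= \sum_{a,b} (\partial\Psi/\partial M_{ab})(\partial M_{ab}/\partial\theta_\alpha) \\ &= \sum_{\mu,a,b} (\partial\Psi/\partial M_{ab}) \left( e_{\mu a}\, \partial e_{\mu b}/\partial\theta_\alpha + e_{\mu b}\, \partial e_{\mu a}/\partial\theta_\alpha \right) \\ &= 2\sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})\, e_{\mu a}\, (\partial e_{\mu b}/\partial\theta_\alpha) \end{aligned} \tag{5-9-8}$$
One can further work out that
$$\begin{aligned} H_{\alpha\beta} = \partial^2\Phi/\partial\theta_\alpha\,\partial\theta_\beta &= 2\sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})(\partial e_{\mu a}/\partial\theta_\alpha)(\partial e_{\mu b}/\partial\theta_\beta) \\ &\quad + 2\sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})\, e_{\mu a}\, (\partial^2 e_{\mu b}/\partial\theta_\alpha\,\partial\theta_\beta) \\ &\quad + 4\sum_{\mu,a,b}\ \sum_{\eta,c,d} (\partial^2\Psi/\partial M_{ab}\,\partial M_{cd})\, e_{\mu a} (\partial e_{\mu b}/\partial\theta_\alpha)\, e_{\eta c} (\partial e_{\eta d}/\partial\theta_\beta) \end{aligned} \tag{5-9-9}$$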
As we see, the second derivatives of the model equations are always
multiplied by residuals $e_{\mu a}$, and the terms involving them are dropped in
the Gauss method. We note also that the terms involving
$\partial^2\Psi/\partial M_{ab}\,\partial M_{cd}$ contain residuals and we drop
these too. This leaves us with the approximate Hessian
$$N_{\alpha\beta} \equiv 2\sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})(\partial e_{\mu a}/\partial\theta_\alpha)(\partial e_{\mu b}/\partial\theta_\beta)$$
or, in matrix notation
$$N = 2\sum_{\mu=1}^{n} B_\mu^T \Gamma B_\mu \tag{5-9-10}$$
where $B_\mu$ and $\Gamma$ are matrices defined by
$$B_{\mu,a\alpha} \equiv -\partial e_{\mu a}/\partial\theta_\alpha = \partial f_{\mu a}/\partial\theta_\alpha, \qquad \Gamma_{ab} \equiv \partial\Psi/\partial M_{ab} \tag{5-9-11}$$
Using the same notation, the gradient Eq. (8) is given by
$$q = -2\sum_{\mu=1}^{n} B_\mu^T \Gamma e_\mu \tag{5-9-12}$$
In Table 5-1 we give formulas for $\Gamma$ in the cases of the normal
distributions discussed in Sections 4-7 through 4-9. It is significant that in all
these cases $\Gamma$ turns out to be positive definite (or at least semidefinite).
It follows from Eq. (10) that $N$ and $R = N^{-1}$ are also positive definite. In
particular, this is true in the single equation least squares case, Eq. (4).
Application of the Gauss method to an objective function of the form Eq. (4-9-9)
is illustrated in Section 5-23.

Table 5-1
Some Log Likelihood Functions and Their Derivatives

Objective function  Description                              Objective function        Γ = ∂Ψ/∂M
given by                                                     Ψ(M)
------------------  ---------------------------------------  ------------------------  ----------------------------
Eq. (4-7-7)         Normal distribution, known covariance    (1/2) Tr(V^{-1} M)        (1/2) V^{-1}
                    matrix, weighted least squares.
Eq. (4-21-1)        Normal distribution, covariance matrix   (nm/2) log Tr(Q^{-1} M)   (nm/2) Q^{-1} / Tr(Q^{-1} M)
                    known except for multiplicative factor.
Eq. (4-9-9)         Normal distribution, unknown             (n/2) log det M           (n/2) M^{-1}
                    covariance matrix.

If the objective function is a posterior density as in Eq. (4-15-1) or
Eq. (4-15-2), the Gauss method may still be used to approximate the Hessian
of the log likelihood, to which the true Hessian of the log prior density must
be added. The latter is frequently easy to compute. For instance, if
$p_0(\theta)$ is normal with covariance matrix $V$, then the Hessian of
$\log p_0$ is simply $-V^{-1}$. See Section 5-22 for a numerical example.
5-10. The Gauss Method as a Sequence of Linear Regression Problems
The essence of the Gauss method is to use in the ith iteration a step whose
direction is given by
$$v_i = -N_i^{-1} q_i \tag{5-10-1}$$
Equivalently, $v_i$ is the solution of the set of simultaneous linear equations
$$N_i v_i = -q_i \tag{5-10-2}$$
which, in view of Eq. (5-9-10) and Eq. (5-9-12), may be written out as
$$\sum_{\mu=1}^{n} B_\mu^T \Gamma B_\mu v = \sum_{\mu=1}^{n} B_\mu^T \Gamma e_\mu \tag{5-10-3}$$
where the subscript $i$ has been dropped for convenience.
Referring back to Section 4-4, we find that if we want to determine by
multiple linear regression the coefficients $v$ in the expressions
$$e_\mu = B_\mu v \qquad (\mu = 1, 2, \ldots, n) \tag{5-10-4}$$
and if we assume the covariance matrix
$$E(e_\mu - B_\mu v)(e_\eta - B_\eta v)^T = \delta_{\mu\eta}\, \Gamma^{-1} \tag{5-10-5}$$
we are led to minimizing the function
$$\Phi_l(v) \equiv \sum_{\mu=1}^{n} (e_\mu - B_\mu v)^T \Gamma (e_\mu - B_\mu v) \tag{5-10-6}$$
The normal equations corresponding to $\Phi_l(v)$ are given precisely by Eq. (3),
with all quantities evaluated for $\theta = \theta_i$. Thus, each iteration of the
Gauss method may be regarded as the solution of a multiple linear regression
problem.
The above remark applies only to the determination of the direction $v_i$
but not of the length $\rho_i$. The solution of the linear problem is
$\theta_i + v_i$, i.e., $\rho_i = 1$. This step may prove unacceptable in the
nonlinear problem, and we use an interpolation-extrapolation scheme as described
in Section 5-14 to determine a better value of $\rho_i$. Such schemes were
originated by Box (1957), Hartley (1961), and McGhee (1963).
The linear regression problem represented by the objective function Eq. (6)
and its normal equations Eq. (3) can be generated by inspection, without
reference to Table 5-1 and the derivations of Section 5-9. All we have to do is
replace the model equations by their linear approximations around the current
value $\theta_i$
$$f_\mu(\theta) = f_\mu(\theta_i) + (\partial f_\mu/\partial\theta)(\theta - \theta_i) = f_\mu(\theta_i) + B_\mu v_i \tag{5-10-7}$$
so that $e_\mu = y_\mu - f_\mu(\theta_i) = B_\mu v_i$ as in Eq. (4). These are the
linearized model equations. For the weighting matrix $\Gamma$ we take $V^{-1}$
when known, or its current estimate (e.g., $nM^{-1}$) when not known.
It goes without saying that various tricks that are useful for solving linear
regression problems are applicable here too. For instance, it is well known
that the condition of the normal equations is usually improved if one
"subtracts out the means." This strategy applies provided the model has a
constant term. Suppose, for instance, that for a single-equation model, Eq. (4)
has the form
$$e_\mu = v_1 + \sum_{\alpha=2}^{l} b_{\mu\alpha} v_\alpha \tag{5-10-8}$$
Let $\bar b_\alpha$ be the average value of $b_{\mu\alpha}$, i.e.,
$$\bar b_\alpha \equiv (1/n) \sum_{\mu=1}^{n} b_{\mu\alpha} \tag{5-10-9}$$
Then Eq. (8) may be rewritten as
$$e_\mu = \bar v_1 + \sum_{\alpha=2}^{l} \tilde b_{\mu\alpha} v_\alpha \tag{5-10-10}$$
where
$$\bar v_1 \equiv v_1 + \sum_{\alpha=2}^{l} \bar b_\alpha v_\alpha \tag{5-10-11}$$
$$\tilde b_{\mu\alpha} \equiv b_{\mu\alpha} - \bar b_\alpha \tag{5-10-12}$$
We now use model Eq. (10) to calculate $\bar v_1, v_2, \ldots, v_l$, and from
these and Eq. (11), we can compute $v_1$.
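The effect of subtracting out the means on the condition of the normal equations can be checked numerically. The sketch below uses a two-coefficient model (a constant term plus one regressor) with hypothetical regressor values clustered far from zero; the data and the helper name `cond_2x2_sym` are assumptions made only for this illustration.

```python
def cond_2x2_sym(a, b, d):
    """Condition number of the symmetric positive definite matrix [[a, b], [b, d]]."""
    lam_max = 0.5 * (a + d) + (0.25 * (a - d) ** 2 + b * b) ** 0.5
    lam_min = (a * d - b * b) / lam_max      # det / lam_max avoids cancellation
    return lam_max / lam_min


b = [1000.0, 1001.0, 1002.0, 1003.0, 1004.0]   # hypothetical regressor values b_mu2
n = len(b)
b_bar = sum(b) / n                             # Eq. (5-10-9)
b_tilde = [bm - b_bar for bm in b]             # Eq. (5-10-12)

# Normal-equation matrices for the columns [1, b] and [1, b_tilde].
raw = cond_2x2_sym(n, sum(b), sum(bm * bm for bm in b))
centered = cond_2x2_sym(n, sum(b_tilde), sum(bt * bt for bt in b_tilde))
print(raw, centered)
```

After centering, the constant column and the regressor column are orthogonal, so the normal matrix becomes diagonal and its condition number drops by many orders of magnitude in this example.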
It is a remarkable property of the normal equations in regression problems
that they always have a solution, even when $N$ is singular. In fact, when $N$ is
singular there are infinitely many solutions. Among these, the one of minimum
length is given by
$$v_i = -N_i^{+} q_i \tag{5-10-13}$$
Other solutions may be obtained, for instance, by means of stepwise
regression (one continues pivoting until no nonzero pivots are left; see Section
A-3). In this solution, the number of nonzero components in $v_i$ does not exceed
the rank of $N_i$. We prefer the pseudoinverse solution because of its minimum
length property. However, presence of rounding errors makes use of the exact
pseudoinverse undesirable, and it is best to make $N_i$ nonsingular by the
directional discrimination or Marquardt methods.
5-11. The Implementation of the Gauss Method
There are several ways in which the direction $v_i$ given by Eq. (5-10-1) may
be computed. Any method suitable for the solution of multiple linear
regression problems is also useful here. In a sense, however, linear problems
place more stringent requirements on methods of solution: We expect to obtain the
correct answer in a single step, and must therefore compute $N^{-1}$ very
precisely. A nonlinear problem, on the other hand, requires several iterations;
slight errors in each iteration can be tolerated, as long as the chosen directions
are acceptable. In other words, $N_i^{-1}$ need in principle only be positive
definite for nonlinear problems. However, substantial errors in the computation of
$N_i^{-1}$ may greatly increase the number of iterations required.
In the presence of a prior density (as when we seek the mode of the
posterior distribution) or a penalty function (when inequality constraints apply,
see Section 6-1), appropriate terms must be added to both $N_i$ and $q_i$. The
terms added to $N$ are usually positive definite, and when $N$ is ill-conditioned,
an improvement in its condition may result. The linear regression structure is
seemingly lost, but at the end of this section we indicate how it can sometimes
be recovered.
Numerical techniques for computing the direction $v_i$ fall into two classes.
First, methods for solving the normal equations, without taking account of
their particular structure. These methods are obviously applicable whether or
not the equations possess the linear regression structure. Second, methods
that rely on the linear regression structure, and are sometimes inapplicable in
the presence of a prior distribution. We may also classify methods into those
which simply compute $v_i = -N_i^{-1} q_i$, and those which allow the inverse to
be adjusted in favorable ways.
(a) Normal Equation Methods. The simplest method simply solves Eq.
(5-10-2) for $v_i$, using standard simultaneous equation techniques. The fastest
method is the Cholesky decomposition (see Section A-5), but it is not
recommended unless $N_i$ is known to be positive definite and fairly
well-conditioned.‡ In general, we recommend one of the directional discrimination
methods (Section 5-7) or the Marquardt method (Section 5-8), all of which allow us
to improve the condition of $N_i$.
(b) Regression Methods. The method of Jennrich and Sampson (1968)
consists of applying the stepwise regression technique (see Appendix A-3) to the
regression problem of Section 5-10. The normal equations are formed, but
components of $v_i$ which cannot significantly reduce the value of the objective
function are set to zero. This is a directional discrimination method, the
directions coinciding with the coordinate axes. We hazard the guess that
backward stepwise regression would be more efficient than forward regression
in most nonlinear problems.
Methods which do not require formation of the normal equations are
capable of greater numerical accuracy,§ and are particularly suitable when
precise solutions to highly ill-conditioned linear regression problems are
required. The main disadvantage of these methods is the need to keep in
computer storage all the $B_\mu$ matrices and $e_\mu$ vectors (to form the normal
equations, these may be generated and discarded one at a time).
We mention here two of these methods: (1) Golub (1965) generates the
Cholesky decomposition of $N$ directly from the $B_\mu$, using Householder
transformations; and (2) modified Gram-Schmidt orthogonalization has been
found by Longley (1967) to be considerably more accurate than solution of
normal equations. Golub (1969) reports it to be slightly more accurate than
the Householder procedure.
‡ The method can be adapted to the singular or near singular case, as shown by
Healy (1968), but this adaptation has performed poorly in some test cases we
tried. This is because although the Cholesky method gives an accurate solution to
the normal equations even when they are nearly singular, the step direction thus
generated is so far from the negative gradient as to be almost unacceptable.
§ Golub's method can be adapted to operate sequentially on small segments of the
matrix $B$, thereby overcoming the computer storage problem. The resulting
algorithm
, We present here the details of orthogonalization: Let us adopt the follow-
"ing notation, similar to the one used in Section 4-4
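$$B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix}, \qquad \Pi = \begin{bmatrix} \Pi_1 & & & 0 \\ & \Pi_2 & & \\ & & \ddots & \\ 0 & & & \Pi_n \end{bmatrix}, \qquad E = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \tag{5-11-1}$$
$B$ has $nm$ rows and $l$ columns; we denote the latter as $b_1, b_2, \ldots, b_l$. $\Pi$ is
$nm \times nm$, and $E$ is a column vector with $nm$ elements. The normal Eq. (5-10-3)
may be rewritten as
$$B^T \Pi B v = B^T \Pi E \tag{5-11-2}$$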
Suppose the $b_\alpha$ are linearly independent. Then we can find a set of $l$ vectors
$p_1, p_2, \ldots, p_l$ which are orthonormal relative to $\Pi$, i.e.,
$$p_i^T \Pi p_j = \delta_{ij} \qquad (i, j = 1, 2, \ldots, l) \tag{5-11-3}$$
and which form a basis for the $b_\alpha$. This means that the $b_\alpha$ are independent
linear combinations of the $p_i$, i.e., there exists a nonsingular $l \times l$ matrix $A$
such that
$$b_\alpha = \sum_{i=1}^{l} A_{i\alpha} p_i \qquad (\alpha = 1, 2, \ldots, l) \tag{5-11-4}$$
Let $P$ be the matrix whose columns are the $p_i$. Then Eq. (3) and Eq. (4) are
equivalent to:
$$P^T \Pi P = I \tag{5-11-5}$$
$$B = PA \tag{5-11-6}$$
In the sequel, whenever we use the term "orthogonal," we mean orthogonal
relative to $\Pi$. The vector $E$ can be decomposed into a component which is a
linear combination of the $p_i$, and a component $D$ which is orthogonal to all
of them. This can be stated concisely in the following equations:
$$E = D + \sum_{i=1}^{l} t_i p_i = D + Pt \tag{5-11-7}$$
where $D$ satisfies $P^T \Pi D = 0$ and $t$ is an $l$-vector of coefficients. We verify
easily that $P^T \Pi E = P^T \Pi D + P^T \Pi P t$, i.e.,
$$t = P^T \Pi E \tag{5-11-8}$$
Employing Eqs. (5)-(8), we can write the solution to Eq. (2) as
$$v = (B^T \Pi B)^{-1} B^T \Pi E = (A^T P^T \Pi P A)^{-1} A^T P^T \Pi E = (A^T A)^{-1} A^T t = A^{-1} t \tag{5-11-9}$$
The computation of $v$ is particularly easy if $A$ is an easily inverted matrix,
e.g., of upper triangular form. The following procedure generates such a
matrix. It is known as modified Gram-Schmidt orthogonalization.
1. Form the matrix $C \equiv [B, E]$. This matrix, which has $nm$ rows and
$l + 1$ columns, will be transformed as described below. We let $c_i$ denote the
$i$th column of $C$.
2. Set $k = 1$.
3. Let $S_{kk} = (c_k^T \Pi c_k)^{1/2}$.
4. Replace $c_k$ with $c_k / S_{kk}$. Note that now $c_k$ is normalized in the sense that
$c_k^T \Pi c_k = 1$.
5. Let $S_{ki} = c_k^T \Pi c_i$ for $i = k + 1, k + 2, \ldots, l + 1$.
6. Replace $c_i$ with $c_i - S_{ki} c_k$ for $i = k + 1, k + 2, \ldots, l + 1$. Note that
thereby the $c_i$ ($i > k$) are rendered orthogonal to $c_k$, without losing their
previously established orthogonality to the $c_j$ ($j < k$).
7. If $k = l$, terminate. Otherwise, replace $k$ by $k + 1$ and return to step 3.
It is clear from the remarks accompanying steps 4 and 6 that the first $l$
columns of the final $C$ tableau form an orthonormal (relative to $\Pi$) basis for
$B$, and that the last column of $C$ contains the component of $E$ orthogonal to
the vectors in that basis. In other words, we now have
$$C = [P, D] \tag{5-11-10}$$
It is easily verified by reference to our algorithm that
$$p_\alpha = (1/S_{\alpha\alpha}) \Bigl( b_\alpha - \sum_{i=1}^{\alpha-1} S_{i\alpha} p_i \Bigr) \qquad (\alpha = 1, 2, \ldots, l) \tag{5-11-11}$$
and
$$D = E - \sum_{i=1}^{l} S_{i,l+1} p_i \tag{5-11-12}$$
Hence
$$b_\alpha = \sum_{i=1}^{\alpha-1} S_{i\alpha} p_i + S_{\alpha\alpha} p_\alpha \tag{5-11-13}$$
and
$$E = D + \sum_{i=1}^{l} S_{i,l+1} p_i \tag{5-11-14}$$
Comparing Eq. (13) to Eq. (4) yields:
$$\begin{aligned} A_{i\alpha} &= S_{i\alpha} && (i = 1, 2, \ldots, \alpha - 1) \\ A_{\alpha\alpha} &= S_{\alpha\alpha} \\ A_{i\alpha} &= 0 && (i = \alpha + 1, \alpha + 2, \ldots, l) \end{aligned} \tag{5-11-15}$$
Similarly, from Eq. (14) and Eq. (7) it appears that
$$t_i = S_{i,l+1} \qquad (i = 1, 2, \ldots, l) \tag{5-11-16}$$
The matrix $A$ and the vector $t$ are thus fully determined. The system of
equations $Av = t$ can now be solved for $v$ by successive substitutions:
$$v_l = t_l / A_{ll}$$
$$v_\alpha = \Bigl( t_\alpha - \sum_{\beta=\alpha+1}^{l} A_{\alpha\beta} v_\beta \Bigr) / A_{\alpha\alpha} \qquad (\alpha = l - 1, l - 2, \ldots, 1) \tag{5-11-17}$$
It may happen that the $b_\alpha$ are not linearly independent. Suppose the rank
of $B$ (i.e., the number of linearly independent $b_\alpha$) is $h < l$. Then in $l - h$ of
the iterations in the orthogonalization procedure it will turn out that $c_k = 0$,
hence $S_{kk} = 0$ and steps 4-6 cannot be carried through. The simplest solution
is to leave $C$ unchanged and set $S_{ki} = 0$ ($i = k + 1, k + 2, \ldots, l + 1$). It
follows then that $l - h$ rows of $A$ and the corresponding elements of $t$ will
be zero, and the corresponding $v_\alpha$ will be indeterminate according to Eq. (17).
To these $v_\alpha$ we may assign arbitrary values, e.g., zero, and the remaining $v_\alpha$ are
computed using Eq. (17).
The following is a simple numerical example. Let
$$B = \begin{bmatrix} 2 & 12 \\ 6 & 19 \\ 15 & 8 \\ 16 & 10 \end{bmatrix}, \qquad E = \begin{bmatrix} 14 \\ 4 \\ 9 \\ 13 \end{bmatrix}, \qquad \Pi = \begin{bmatrix} 2 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 2 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}$$
Hence follow the steps of the algorithm:
1. $$C = \begin{bmatrix} 2 & 12 & 14 \\ 6 & 19 & 4 \\ 15 & 8 & 9 \\ 16 & 10 & 13 \end{bmatrix}$$
2. $k = 1$
3. $S_{11} = 35.41186$
4. Replace the first column of $C$ with $[0.056478, 0.169435, 0.423587, 0.451826]^T$.
5. $S_{12} = 26.82717$, $S_{13} = 27.92849$
6. The second and third columns of $C$ are replaced by
$$\begin{bmatrix} 10.48485 & 12.42265 \\ 14.45455 & -0.73206 \\ -3.36364 & -2.83014 \\ -2.12121 & 0.38118 \end{bmatrix}$$
7. $k = 2$
3. $S_{22} = 27.80833$
4. Replace the second column of $C$ with $[0.37704, 0.51979, -0.12096, -0.07628]^T$.
5. $S_{23} = 15.99368$
6. Replace the third column of $C$ with $[6.39239, -9.04545, -0.89558, 1.60117]^T$.
Now we have
$$A = \begin{bmatrix} 35.41186 & 26.82717 \\ 0 & 27.80832 \end{bmatrix}, \qquad t = \begin{bmatrix} 27.92849 \\ 15.99368 \end{bmatrix}$$
so that
$$v = \begin{bmatrix} 0.35296 \\ 0.57514 \end{bmatrix}$$
It is easily verified that this satisfies the normal equations
$$\begin{bmatrix} 1254 & 950 \\ 950 & 1493 \end{bmatrix} v = \begin{bmatrix} 989 \\ 1194 \end{bmatrix}$$
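The orthogonalization steps and the back substitution of Eq. (5-11-17) can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: the function names `pi_dot` and `mgs_regression`, the column-wise storage of $C$, and the absolute tolerance `tol` used to detect a dependent column are all assumptions introduced here. The data are those of the numerical example, with $\Pi$ taken as block diagonal.

```python
import math

def pi_dot(x, y, Pi):
    """Return x^T Pi y for vectors x, y (lists) and a square matrix Pi (list of rows)."""
    return sum(x[r] * Pi[r][s] * y[s]
               for r in range(len(x)) for s in range(len(y)))

def mgs_regression(B, E, Pi, tol=1e-12):
    """Modified Gram-Schmidt solution of B^T Pi B v = B^T Pi E.

    Returns (v, A, t) with A upper triangular and A v = t.
    tol is a hypothetical threshold for treating a column as dependent.
    """
    rows, l = len(B), len(B[0])
    # Step 1: C = [B, E], stored as a list of columns.
    C = [[B[r][a] for r in range(rows)] for a in range(l)] + [list(E)]
    S = [[0.0] * (l + 1) for _ in range(l)]
    for k in range(l):                                    # steps 2 and 7
        norm = math.sqrt(max(pi_dot(C[k], C[k], Pi), 0.0))  # step 3
        if norm <= tol:
            continue        # dependent column: leave C unchanged, S_ki = 0
        S[k][k] = norm
        C[k] = [x / norm for x in C[k]]                   # step 4
        for i in range(k + 1, l + 1):
            S[k][i] = pi_dot(C[k], C[i], Pi)              # step 5
            C[i] = [ci - S[k][i] * ck for ci, ck in zip(C[i], C[k])]  # step 6
    A = [row[:l] for row in S]      # Eq. (5-11-15)
    t = [row[l] for row in S]       # Eq. (5-11-16)
    v = [0.0] * l
    for a in range(l - 1, -1, -1):  # back substitution, Eq. (5-11-17)
        if A[a][a] == 0.0:
            continue                # indeterminate component, left at zero
        v[a] = (t[a] - sum(A[a][b] * v[b] for b in range(a + 1, l))) / A[a][a]
    return v, A, t

B = [[2, 12], [6, 19], [15, 8], [16, 10]]
E = [14, 4, 9, 13]
Pi = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 1]]
v, A, t = mgs_regression(B, E, Pi)
print([round(x, 5) for x in v])   # close to the values quoted in the text
```

Because each column is orthogonalized against the already updated columns rather than the original ones, the sketch follows the modified rather than the classical Gram-Schmidt ordering. A relative tolerance would be a more careful choice for badly scaled problems.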
In some cases, the normal equations take the form
$$(B^T \Pi B + Q) v = B^T \Pi E + \phi \tag{5-11-18}$$
where $Q$ is a given positive definite matrix and $\phi$ is a given vector. For instance:
1. In the Marquardt method, $Q$ consists of the diagonal elements of
$B^T \Pi B$ multiplied by a scalar $\lambda$, and $\phi = 0$.
2. If $\theta$ has a normal prior distribution with covariance $V_0$ and mean $\theta_0$,
then $Q = V_0^{-1}$ and $\phi = V_0^{-1}(\theta_0 - \theta_i)$.
3. If $\theta$ is subject to inequality constraints and the penalty function method
is used, then Eq. (6-1-9) supplies $Q$ and Eq. (6-1-7) supplies $-\phi$, both to be
summed over all constraints.
By appending $l$ fictitious additional equations to the $nm$ model equations,
one can reduce Eq. (18) to the normal regression form Eq. (2). Let $S$ be a
matrix such that $SS^T = Q$ (e.g., the Cholesky decomposition). Define
$$\tilde{B} \equiv \begin{bmatrix} B \\ S^T \end{bmatrix}, \qquad \tilde{E} \equiv \begin{bmatrix} E \\ S^{-1}\phi \end{bmatrix}, \qquad \tilde{\Pi} \equiv \begin{bmatrix} \Pi & 0 \\ 0 & I \end{bmatrix}$$
One verifies easily that $\tilde{B}^T \tilde{\Pi} \tilde{B} = B^T \Pi B + Q$, and $\tilde{B}^T \tilde{\Pi} \tilde{E} = B^T \Pi E + \phi$. Therefore,
performing a linear regression with $\tilde{B}$, $\tilde{E}$, and $\tilde{\Pi}$ replacing $B$, $E$, and $\Pi$ is
equivalent to solving Eq. (18).
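The augmentation can be checked numerically. The sketch below builds the augmented matrices from a Cholesky factor of $Q$ and forms the resulting normal equations. The helper names, and the particular $Q$ and $\phi$, are illustrative assumptions and do not come from the text; $B$, $E$, and $\Pi$ are those of the numerical example.

```python
import math

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    cols = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def cholesky(Q):
    """Lower-triangular S with S S^T = Q, for symmetric positive definite Q."""
    n = len(Q)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = Q[i][j] - sum(S[i][k] * S[j][k] for k in range(j))
            S[i][j] = math.sqrt(s) if i == j else s / S[j][j]
    return S

def forward_solve(S, phi):
    """Return S^{-1} phi for lower-triangular S."""
    y = []
    for i, row in enumerate(S):
        y.append((phi[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y

def augment(B, E, Pi, Q, phi):
    """Append l fictitious equations so that plain regression solves Eq. (5-11-18)."""
    rows, l = len(B), len(B[0])
    S = cholesky(Q)
    B_aug = [list(r) for r in B] + transpose(S)        # [B; S^T]
    E_aug = list(E) + forward_solve(S, phi)            # [E; S^{-1} phi]
    Pi_aug = [list(r) + [0.0] * l for r in Pi]         # [Pi 0; 0 I]
    Pi_aug += [[0.0] * rows + [float(i == j) for j in range(l)] for i in range(l)]
    return B_aug, E_aug, Pi_aug

def normal_system(B, E, Pi):
    """Return (B^T Pi B, B^T Pi E)."""
    BtPi = matmul(transpose(B), Pi)
    N = matmul(BtPi, B)
    rhs = [row[0] for row in matmul(BtPi, [[e] for e in E])]
    return N, rhs

B = [[2, 12], [6, 19], [15, 8], [16, 10]]
E = [14, 4, 9, 13]
Pi = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 1]]
Q = [[4.0, 1.0], [1.0, 3.0]]      # illustrative positive definite matrix
phi = [1.0, 2.0]                  # illustrative vector
N_aug, rhs_aug = normal_system(*augment(B, E, Pi, Q, phi))
print(N_aug, rhs_aug)             # equals B^T Pi B + Q and B^T Pi E + phi up to rounding
```

In practice one would pass the augmented matrices to the orthogonalization procedure directly rather than form the normal equations; they are formed here only to verify the identity.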
5-12. Variable Metric Methods
The Gauss method in its various forms is undoubtedly the best available
for the solution of those problems to which it applies. When the objective
function is not one of those shown in Table 5-1, however, the method may
not be applicable. One of the so-called variable metric methods is recommended
in such cases. The term variable metric methods was coined by Davidon (1959)
to designate schemes in which the matrix $R$ is systematically adjusted from
iteration to iteration in such a way as to make it behave like $H^{-1}$. These
methods may be viewed as sophisticated finite difference schemes for computing
the second derivatives of $\Phi$. The specific scheme proposed by Davidon
has been modified slightly by Fletcher and Powell (1963), and has been widely
used in this form, gaining a reputation of being the most efficient general
unconstrained optimization method available. This particular implementation
was admittedly arbitrary, and subsequent papers have come forth with alternative
implementations, e.g., Broyden (1967), Greenstadt (1970), Fiacco and
McCormick (1968), Davidon (1968), Pearson (1969), Bard (1970), and
Fletcher (1970).
Following a general introduction to these methods, we shall describe in
detail the ROC and IROC variations which we have found (Bard, 1970)
somewhat more efficient than others. This will be followed by a brief description
of the Davidon-Fletcher-Powell method, which is well documented in
the literature.
The main idea behind the variable metric methods is the following: From
the definitions of the gradient $q$ and the Hessian $H$ we have
$$H_i = (\partial q / \partial \theta)_{\theta = \theta_i} \tag{5-12-1}$$
Therefore, to a first-order approximation
$$H_i \sigma_i = \eta_i \tag{5-12-2}$$
where $\sigma_i \equiv \theta_{i+1} - \theta_i$, and $\eta_i \equiv q_{i+1} - q_i$. This means that
$$\sigma_i = H_i^{-1} \eta_i \tag{5-12-3}$$
Suppose that before the $i$th iteration we have a matrix $A_i$ which is an
approximation to $H_i^{-1}$. We wish to add to it a correction $\Delta A_i$ in such a way
that the resulting matrix $A_{i+1}$ satisfies Eq. (3) when replacing $H_i^{-1}$. That is,
with
$$A_{i+1} \equiv A_i + \Delta A_i \tag{5-12-4}$$
we require that
$$\sigma_i = A_{i+1} \eta_i = A_i \eta_i + \Delta A_i \eta_i \tag{5-12-5}$$
Hence
$$\Delta A_i \eta_i = \rho_i \tag{5-12-6}$$
where
$$\rho_i \equiv \sigma_i - A_i \eta_i \tag{5-12-7}$$
Eq. (6) does not determine $\Delta A_i$ uniquely, since it contains only $l$ conditions
for the $l(l + 1)/2$ independent elements of the symmetric matrix $\Delta A_i$.
The simplest possible matrix $\Delta A_i$ is of rank one, i.e., it has the form
$$\Delta A_i = r_i r_i^T \tag{5-12-8}$$
where $r_i$ is some vector. Substituting in Eq. (6) we obtain
$$r_i r_i^T \eta_i = \rho_i \tag{5-12-9}$$
that is
$$r_i = (1 / r_i^T \eta_i) \rho_i = \alpha \rho_i \tag{5-12-10}$$
where $\alpha \equiv (r_i^T \eta_i)^{-1}$ is an unknown constant. Substituting in Eq. (9) we find
$$\alpha^2 \rho_i (\rho_i^T \eta_i) = \rho_i \tag{5-12-11}$$
Therefore
$$\alpha^2 = 1 / \rho_i^T \eta_i \tag{5-12-12}$$
Finally
$$\Delta A_i = r_i r_i^T = \alpha^2 \rho_i \rho_i^T = (1 / \rho_i^T \eta_i) \rho_i \rho_i^T \tag{5-12-13}$$
Eq. (13) defines the Rank One Correction method (ROC). Broyden (1967),
Davidon (1968), and Fiacco and McCormick (1968) have all proven the
following:
Theorem Suppose $\Phi(\theta)$ is a quadratic function with a constant nonsingular
Hessian matrix $H$. Let $\theta_1, \theta_2, \ldots, \theta_{l+1}$ be a set of points such that the vectors
$\sigma_i \equiv \theta_{i+1} - \theta_i$ ($i = 1, 2, \ldots, l$) are linearly independent. Let $A_1$ be an arbitrary
symmetric matrix, and let $A_i$ ($i = 2, 3, \ldots, l + 1$) be defined recursively by
means of Eq. (4) and Eq. (13). Then, provided $\rho_i^T \eta_i \neq 0$ for $i = 1, 2, \ldots, l$, we
have
$$A_{l+1} = H^{-1} \tag{5-12-14}$$
The theorem says that if $\Phi$ is quadratic, the ROC method produces the
exact inverse Hessian in $l$ steps. Once the inverse Hessian is known, a single
Newton step converges to the minimum. When $\Phi$ is not quadratic, one expects
$A_i$ ($i > l$) to represent an approximation to $H^{-1}$ evaluated somewhere in the
region of the last $l$ iterates. This should be particularly true near the minimum,
where successive iterates lie close together. We expect the matrices $A_i$ to
converge to the value of $H^{-1}$ at the minimum. Though no rigorous proof of
this proposition has been found as yet, numerical tests have confirmed its
validity.
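The theorem is easy to check on a small quadratic. The sketch below applies the rank one correction of Eq. (5-12-13) twice for $l = 2$; the Hessian, the three points, and the function names are illustrative assumptions. For a quadratic function the gradient difference is exactly $\eta_i = H \sigma_i$, which is how $\eta_i$ is generated here.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def roc_update(A, sigma, eta):
    """Return A + rho rho^T / (rho^T eta), with rho = sigma - A eta."""
    rho = [s - a for s, a in zip(sigma, matvec(A, eta))]    # Eq. (5-12-7)
    denom = dot(rho, eta)
    if denom == 0.0:
        raise ZeroDivisionError("rho^T eta vanishes; the correction is undefined")
    n = len(A)
    return [[A[r][c] + rho[r] * rho[c] / denom for c in range(n)]
            for r in range(n)]                              # Eq. (5-12-13)

H = [[4.0, 1.0], [1.0, 3.0]]                   # illustrative constant Hessian
points = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]  # theta_1, theta_2, theta_3
A = [[1.0, 0.0], [0.0, 1.0]]                   # arbitrary symmetric A_1
for old, new in zip(points, points[1:]):
    sigma = [b - a for a, b in zip(old, new)]
    eta = matvec(H, sigma)                     # q_{i+1} - q_i for a quadratic
    A = roc_update(A, sigma, eta)
print(A)   # approximately [[3/11, -1/11], [-1/11, 4/11]], the inverse of H
```

Note that the intermediate matrices need not be positive definite even when $H$ is, since $\rho_i^T \eta_i$ may have either sign (it is negative in both steps of this example).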
Although the theorem holds in principle for arbitrary $A_1$, numerical stability
of the calculations [see Bard (1968)] requires that the elements of $A_1$ have
the right order of magnitude. A good choice is a diagonal matrix with
A Iu = -()I)ql.
(5-12-15)
Since $A_i$ is an approximation to $H_i^{-1}$, we would like to take $A_i$ for $R_i$.
There is no guarantee, however, that $A_i$ is positive definite. This may be
remedied by means of a slightly modified Greenstadt procedure:
Let Eq. (A-5-21) be the scaled decomposition of $A_i$, that is
$$A_i = \sum_{j=1}^{l} \pi_j f_j f_j^T \tag{5-12-16}$$
Let us define
$$\gamma_j = \max[\beta, \min(|\pi_j|, \gamma)] \tag{5-12-17}$$
where $\beta$ and $\gamma$ are small and large positive constants, respectively. Then
$$R_i \equiv \sum_{j=1}^{l} \gamma_j f_j f_j^T \tag{5-12-18}$$
is positive definite. It coincides with $A_i$ when the latter is positive definite with
all eigenvalues between $\beta$ and $\gamma$.
Marquardt type corrections to the diagonal elements could, in principle,
also be used for rendering $A_i$ positive definite. This type of correction does not
appear to work very well when applied to a matrix that is an approximation
to the inverse, rather than to the Hessian itself. It happens, however, that a
procedure entirely analogous to the ROC method can be used to construct an
approximation to the Hessian directly. We call this method inverse rank one
correction (IROC) (Bard, 1970). In this case we wish to satisfy $(A_i + \Delta A_i) \sigma_i = \eta_i$
[see Eq. (2)], and we are led to
$$\Delta A_i = (1 / s_i^T \sigma_i) s_i s_i^T \tag{5-12-19}$$
where
$$s_i \equiv \eta_i - A_i \sigma_i \tag{5-12-20}$$
We initialize $A_1$ as the inverse of the matrix given by Eq. (15). The matrices $A_i$
converge to $H$ in the quadratic case. Since $A_i$ is an approximation to $H$, we
can use the Cholesky decomposition with the Marquardt method to compute
$v_i$ efficiently, as described in Section 5-8.
In the Davidon-Fletcher-Powell method (DFP) (Davidon, 1959; Fletcher
and Powell, 1963), the matrix $\Delta A_i$ is of rank two, instead of rank one. The
simplest such choice satisfying Eq. (6) is
$$\Delta A_i = (1 / \sigma_i^T \eta_i) \sigma_i \sigma_i^T - (1 / \eta_i^T A_i \eta_i) A_i \eta_i \eta_i^T A_i \tag{5-12-21}$$
Suppose we choose $\sigma_i = -\rho_i A_i q_i$, where $\rho_i$ is a positive value of $\rho$ at which
$\Phi(\theta_i - \rho A_i q_i)$ attains a minimum. Fletcher and Powell have shown that under
these conditions $A_{i+1} = A_i + \Delta A_i$ is positive definite provided $A_i$ was so.
Therefore using $R_i = A_i$ always produces an acceptable step.
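That the rank two correction of Eq. (5-12-21) satisfies the condition $A_{i+1} \eta_i = \sigma_i$ can be confirmed with a short sketch. The function name `dfp_update` and the sample vectors are assumptions made for illustration; $A_i$ is assumed symmetric, so that $A_i \eta_i \eta_i^T A_i$ is the outer product of $A_i \eta_i$ with itself.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def dfp_update(A, sigma, eta):
    """Rank two correction of Eq. (5-12-21) applied to a symmetric matrix A."""
    A_eta = matvec(A, eta)
    s_e = dot(sigma, eta)          # sigma^T eta
    e_A_e = dot(eta, A_eta)        # eta^T A eta
    n = len(A)
    return [[A[r][c] + sigma[r] * sigma[c] / s_e - A_eta[r] * A_eta[c] / e_A_e
             for c in range(n)] for r in range(n)]

A = [[1.0, 0.0], [0.0, 1.0]]
sigma = [1.0, 0.0]
eta = [4.0, 1.0]                   # e.g. H sigma for H = [[4, 1], [1, 3]]
A_next = dfp_update(A, sigma, eta)
print(matvec(A_next, eta))         # reproduces sigma up to rounding
```

In this example $\sigma_i^T \eta_i = 4 > 0$ and the updated matrix has positive trace and determinant, consistent with the positive definiteness result quoted above.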
5-13. Step Size
In the preceding sections we were concerned primarily with choosing the
direction of the step taken in the $i$th iteration, that is, with the choice of $R_i$.
We shall now turn our attention to the determination of step size, i.e., to the
choice of $\rho_i$. The methods that have been used fall into three categories:
1. $\rho_i = 1$. Required by Newton's method (to guarantee quadratic convergence
near the minimum) and by Marquardt's method. In the latter case,
the step size is determined indirectly through the choice of $\lambda_i$.
2. $\rho_i = \hat{\rho}_i$, i.e., we proceed along the chosen direction to the point at
which $\Phi$ ceases to decrease, as required by the Davidon-Fletcher-Powell
method. Suitable methods of searching for $\hat{\rho}_i$ are given by Fletcher and Powell
(1963), Bard (1970), Goldfarb and Lapidus (1968), and others.
3. Interpolation-extrapolation, employed in conjunction with the Gauss
and ROC methods. Here one expends a certain amount of effort on finding a
good, acceptable value of $\rho_i$, without bothering to locate $\hat{\rho}_i$ precisely.
It is true, on the whole, that the closer $\rho_i$ is to $\hat{\rho}_i$, the smaller is the total
number of iterations required. On the other hand, the more precisely we wish
to determine the value of $\hat{\rho}_i$, the larger is the number of times that we must
evaluate the objective function in each iteration. The difference between cases
2 and 3 is that in the former the best balance is struck when $\hat{\rho}_i$ is determined
with much greater precision than is required in the latter. In the succeeding
section we suggest a simple algorithm for determining $\rho_i$ in case 3. This has worked
with a reasonable degree of success, but there is no evidence that it is the most
efficient possible. There is no end to the degree of ingenuity that may be
expended on devising such algorithms.
In all cases the search for $\rho_i$ proceeds without computation of derivatives.
It would be wasteful to compute at each point $l + 1$ functions ($\Phi$ and the $l$
components of its gradient) in order to conduct a one-dimensional search.
The gradient is required only at the main iterates $\theta_1, \theta_2, \ldots$.
In the algorithm of the succeeding section it is assumed that at each
iteration we are given an upper bound $\rho_{i,\max}$ on the feasible values of $\rho_i$.
When inequality constraints need to be satisfied (see Section 6-1), $\rho_{i,\max}$ is the
smallest positive value of $\rho$ for which $\theta_i - \rho R_i q_i$ lies on the boundary of the
feasible region. If no inequality constraints apply, $\rho_{i,\max}$ can be chosen as an
arbitrarily large number. We are also given a lower bound $\rho_{i,\min}$ (see Section
5-15). If no acceptable $\rho > \rho_{i,\min}$ can be found, the search is terminated.
5-14. Interpolation-Extrapolation
Assuming that we have chosen an acceptable direction, there always exists
a number $\eta_i$ such that if $0 < \rho < \eta_i$, then $\Psi_i(\rho) \equiv \Phi(\theta_i - \rho R_i q_i) < \Phi_i$. The
basic idea of the interpolation method is that if we have initially picked a
value $\rho = \rho^{(0)}$ such that $\Psi_i(\rho^{(0)}) > \Phi_i$, we next try a smaller value of $\rho$, and
keep repeating the process until an acceptable value is found. The idea behind
extrapolation is that if our initial choice $\rho = \rho^{(0)}$ turned out acceptable, it pays
to try at least one other value of $\rho$ to see whether we cannot do even better.
In both cases, the new trial value of $\rho$ is chosen so as to minimize a quadratic
approximation to $\Psi_i(\rho)$. We know that $\Psi_i(0) = \Phi_i$, and $(d\Psi_i / d\rho)_{\rho=0} =
-q_i^T R_i q_i$ [see Eq. (5-3-6)].
Suppose we have computed $\Psi_i(\rho^{(0)})$. Let us define $\alpha \equiv \Phi_i$, $\beta \equiv \Psi_i(\rho^{(0)})$,
$\gamma \equiv -q_i^T R_i q_i$, and let us try to find a quadratic function $a + b\rho + c\rho^2$ whose
values match those of $\Psi_i(\rho)$ at $\rho = 0$ and $\rho = \rho^{(0)}$, and whose slope matches
that of $\Psi_i(\rho)$ at $\rho = 0$. We have then:
$$a = \alpha \tag{5-14-1}$$
$$b = \gamma \tag{5-14-2}$$
$$a + b\rho^{(0)} + c\rho^{(0)2} = \beta \tag{5-14-3}$$
Whence
$$c = (\beta - \alpha - \gamma\rho^{(0)}) / \rho^{(0)2} \tag{5-14-4}$$
The quadratic $a + b\rho + c\rho^2$ has a stationary point at
$$\rho^* = -b / 2c = \gamma\rho^{(0)2} / 2(\gamma\rho^{(0)} + \alpha - \beta) \tag{5-14-5}$$
The initial value of $\rho^{(0)}$ for each iteration is determined cautiously or
optimistically depending on whether or not the previous iteration did or did
not require interpolations. A detailed implementation of these ideas is given
in the flowcharts of Fig. 5-2.
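The stationary point of Eq. (5-14-5) and the safeguarded interpolation step can be sketched as follows. The function names are hypothetical, and the clamping of the new trial value to between 0.25 and 0.75 of the previous one is the rule that appears in the interpolation flowchart; the test function is an illustrative quadratic, for which the formula is exact.

```python
def quadratic_stationary_point(phi_i, psi_0, gamma, rho_0):
    """rho* of Eq. (5-14-5): stationary point of the quadratic that matches
    Psi(0) = phi_i, Psi'(0) = gamma and Psi(rho_0) = psi_0.
    Returns None when the fitted quadratic has no curvature."""
    denom = 2.0 * (gamma * rho_0 + phi_i - psi_0)
    if denom == 0.0:
        return None
    return gamma * rho_0 ** 2 / denom

def interpolate(rho_prev, rho_star):
    """Keep the next trial value between 0.25 and 0.75 of the previous one."""
    return max(0.25 * rho_prev, min(0.75 * rho_prev, rho_star))

def psi(rho):
    # Illustrative one-dimensional cross section with its minimum at rho = 0.3.
    return (rho - 0.3) ** 2 + 1.0

gamma = -0.6                       # slope of psi at rho = 0
rho_star = quadratic_stationary_point(psi(0.0), psi(1.0), gamma, 1.0)
print(rho_star, interpolate(1.0, rho_star))
```

The stationary point is a minimum only when the fitted curvature $c$ of Eq. (5-14-4) is positive; the sketch does not check this, and a complete implementation would also enforce the bounds $\rho_{i,\min}$ and $\rho_{i,\max}$.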
[Fig. 5-2a. Determination of step length, interpolation. Flowchart with Yes/No decision branches. Legible boxes:
Entry: Given $\Phi_i$, $\theta_i$, $R_i$, $q_i$, $\rho_{i,\max}$, $\rho_{i,\min}$, $J$ (an integer set = 1 in the first iteration), $a$ (= 0.5 if penalty functions are used, = 1 otherwise).
Set $\rho^{(0)} = 2^{-J} \min(1, a\rho_{i,\max})$, $\gamma = -q_i^T R_i q_i$. Compute $\Psi^{(0)}$ and $\rho^* = \gamma\rho^{(0)2} / 2(\gamma\rho^{(0)} + \Phi_i - \Psi^{(0)})$.
Compute $\rho^{(2)} = \max[0.25\rho^{(1)}, \min(0.75\rho^{(1)}, \rho^*)]$.
Compute† $\Psi^{(2)}$; increase $J$ by 1.
Set $\rho^{(1)} = \rho^{(2)}$. Compute $\rho^* = \gamma\rho^{(1)2} / 2(\gamma\rho^{(1)} + \Phi_i - \Psi^{(1)})$.
Terminate. Accept $\theta^* = \theta_i$.
Exits: To Fig. 5-2b; Out (to next iteration).
Notation: $\Psi^{(j)} \equiv \Psi_i(\rho^{(j)}) = \Phi(\theta_i - \rho^{(j)} R_i q_i)$.
† If the computation of $\Psi^{(2)}$ is impossible (due, e.g., to an excessively large argument in an exponential) then increase ...]
[Fig. 5-2b. Determination of step length, extrapolation. Flowchart with Yes/No decision branches. Legible boxes:
From Fig. 5-2a. Replace $J$ by the largest integer not exceeding $J/2$.
Exits: Out (to next iteration).
Notation: $\Psi^{(j)} \equiv \Psi_i(\rho^{(j)}) = \Phi(\theta_i - \rho^{(j)} R_i q_i)$.]
5-15. Termination
It is necessary to devise a criterion for stopping the iterative search for
the minimum of $\Phi(\theta)$. As was stated before, all one can hope for is convergence
to a stationary point of $\Phi$. It may seem natural, therefore, to adopt
the vanishing of the gradient as the termination criterion. Unfortunately,
rounding errors and poor scaling often make the goal of a vanishing gradient
unattainable even approximately. In many cases, the computer comes up with
parameter values very close to the minimum, yet the gradient is still sizable.
In addition, if perchance the algorithm fails to converge at all, a termination
rule based entirely on the gradient leaves the program to iterate endlessly.
A more practical criterion dictates stopping as soon as further iterations
fail to change the parameter values significantly. That is, given a set of small
numbers $\varepsilon_\alpha$ ($\alpha = 1, 2, \ldots, l$), we accept $\theta_{i+1}$ as the solution $\theta^*$ provided
$$|\theta_{i+1,\alpha} - \theta_{i,\alpha}| \le \varepsilon_\alpha \qquad (\alpha = 1, 2, \ldots, l) \tag{5-15-1}$$
where $\theta_{i,\alpha}$ is the $\alpha$th component of $\theta_i$. The numbers $\varepsilon_\alpha$ may either be
prescribed in advance, or they may be computed by the program. In the latter
case, following Marquardt (1963), we recommend
$$\varepsilon_\alpha = 10^{-4}(|\theta_{i,\alpha}| + 10^{-3}) \tag{5-15-2}$$
where the additive term $10^{-3}$ is designed to avoid embarrassment if $\theta_{i,\alpha}$
happens to be nearly zero. This criterion has worked very well in practice. It
tends to be on the conservative side, sometimes allowing a few more iterations
than are strictly necessary. The rationale for the criterion is that, convergence
or no convergence, it does not pay to keep iterating if the parameter values
cease changing.
Suppose in the $i$th iteration a step direction $v_i$ has been determined. Then
Eq. (1) is satisfied if for each $\alpha$, $\rho |v_{i,\alpha}| \le \varepsilon_\alpha$, i.e., if $\rho \le \min_\alpha [\varepsilon_\alpha / |v_{i,\alpha}|]$. Hence
the minimum admissible $\rho$ for the $i$th iteration is $\rho_{i,\min} = \min_\alpha [\varepsilon_\alpha / |v_{i,\alpha}|]$. As
shown in Fig. 5-2a, termination occurs if the algorithm is forced to choose
$\rho_i \le \rho_{i,\min}$.
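The termination test of Eqs. (5-15-1) and (5-15-2), together with the resulting lower bound on the step size, can be sketched as below. The function names and default arguments are assumptions introduced for illustration; components of the step direction that are exactly zero are skipped because they place no restriction on $\rho$.

```python
def tolerances(theta, rel=1e-4, floor=1e-3):
    """epsilon_alpha of Eq. (5-15-2) for the current iterate theta."""
    return [rel * (abs(t) + floor) for t in theta]

def converged(theta_new, theta_old, eps):
    """Eq. (5-15-1): every component changed by no more than its tolerance."""
    return all(abs(a - b) <= e for a, b, e in zip(theta_new, theta_old, eps))

def rho_min(v, eps):
    """Smallest step size worth trying along direction v in this iteration."""
    ratios = [e / abs(vi) for vi, e in zip(v, eps) if vi != 0.0]
    return min(ratios) if ratios else float("inf")

theta_old = [2.0, 0.0]
eps = tolerances(theta_old)
print(eps)                                   # roughly [2.001e-4, 1e-7]
print(converged([2.0001, 0.0], theta_old, eps))
print(rho_min([1.0, 0.5], eps))
```

Any step with $\rho$ below `rho_min(v, eps)` would change no parameter by more than its tolerance, which is why the line search can stop there.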
The above criterion does not offer an ironclad guarantee that the process
will terminate in a finite number of steps. If the objective function is known
to have a finite minimum, then termination can be guaranteed if we stop whenever
$\Phi_{i-1} - \Phi_i < \varepsilon$ for some small prespecified positive number $\varepsilon$. That is, we
stop as soon as no significant progress is made in reducing the value of the
objective function. It may be safer, however, to require $\Phi_{i-j} - \Phi_i < \varepsilon$, i.e., to
continue unless no significant progress has been made over a number of
iterations. The variable metric methods are particularly liable to stall over a
number of iterations, then to make sudden progress.
Finally, an upper bound may be placed on the number of iterations
allowed. This should be coupled with a restart procedure to permit continuation
with possibly a different algorithm.
Once the iterative process is terminated at $\theta = \theta^*$, one would like to know
whether or not one has arrived at a minimum. We assume that we know the
gradient $q^* = q(\theta^*)$ and at least some approximation $H^*$ to the Hessian
$H(\theta^*)$. If we cut a cross section of the $\Phi$ surface along the $\theta_\alpha$ axis, we have a
curve whose approximate equation near $\theta^*$ is given by
$$\Psi(\theta_\alpha) = \Phi^* + q_\alpha^* (\theta_\alpha - \theta_\alpha^*) + \tfrac{1}{2} H_{\alpha\alpha}^* (\theta_\alpha - \theta_\alpha^*)^2 \tag{5-15-3}$$
which has a stationary point at
$$\theta_\alpha = \theta_\alpha^* - q_\alpha^* / H_{\alpha\alpha}^* \tag{5-15-4}$$
The quantity $\delta_\alpha = |q_\alpha^* / H_{\alpha\alpha}^*|$ is therefore a measure of the error in the
determination of $\theta^*$. If each $\delta_\alpha$ is small on the scale by which $\theta_\alpha$ is measured, then
it is likely that $\theta^*$ is very close to a stationary point of $\Phi$. Though not
foolproof, this test works in most cases. Its reliability is improved if applied in the
coordinate system of the canonical variables (see Section 7-3).
If $H^*$ is indeed the Hessian of $\Phi$ at $\theta^*$, then we may easily determine
whether $\theta^*$ (already known to be a stationary point) is really a minimum. All
that is required is that $H^*$ be positive definite (see Section 3-5), i.e., that all its
eigenvalues be positive. When using the Gauss method, our approximation $N^*$
is constructed so as to be automatically positive definite, regardless of whether
or not $H(\theta^*)$ is so. In these cases, then, $N^*$ contains no information pertaining
to the nature of the point $\theta^*$. Our only recourse is to explore directly the
behavior of $\Phi$ around $\theta^*$. In the ROC method, on the other hand, the matrix
$R_i$ from the last iteration may be a true approximation to $[H(\theta^*)]^{-1}$. We
cannot prove that $R_i$ is or is not positive definite when $\theta^*$ is or is not a
minimum; however, if $R_i$ is not positive definite we suspect that $\theta^*$ is not a
minimum, and vice versa.
If one has reason to doubt that $\theta^*$ is a minimum, one should restart the
iterative procedure from a point close but not identical to $\theta^*$. If convergence
to the same $\theta^*$ is obtained, this is likely to be at least a local minimum.
5-16. Remarks on Convergence
Suppose our process has converged to a point $\theta^*$ at which $\Phi$ turns out to
be not stationary. We assume that we have used a gradient method that is
at least nominally acceptable. The directional derivative $-q_i^T R_i q_i$ of $\Phi$ in the
direction of the vector $-R_i q_i$ is supposed to be negative, yet even a very small
step in this direction fails to produce a decrease in $\Phi$. The cause is probably
one of the following:
(a) The gradient was not calculated with sufficient accuracy. If an incorrect vector
$p$ is used in place of $q_i$ in the calculation of the direction, the true directional
derivative is $-q_i^T R_i p$, which need not be negative even when $R_i$ is positive
definite. What is needed, then, is increased accuracy in the computation of the
derivatives. This is why the derivatives should be computed from their analytic
formulas. If this is impossible, one must revert to finite difference
approximations. The question of finite differences is discussed in Section 5-18.
When the model equations are thought to be too complicated for analytic
differentiation by hand, one may use the computer to perform this task.
Computer programs to perform analytic differentiation are available. In fact,
the FORMAC system (Bond et al., 1964) is utilized by Eisenpress et al.
(1966) to supply first and second analytic derivatives of nonlinear model
equations in an implementation of the Newton-Greenstadt algorithm [see
Eisenpress and Greenstadt (1966)] for parameter estimation of econometric
models. For a survey of other schemes for analytic differentiation by computer
see Sammet (1966).
(b) The matrix $R_i$ is not sufficiently positive definite, due to accumulation of
rounding errors. This cannot happen with either the Gauss or ROC methods,
as long as we use the discrimination or Marquardt trick to insure positive
definiteness. It can, however, happen in ill-conditioned cases as a result of
rounding errors when conventional matrix inversion or simultaneous equations
solution methods are used. These are to be avoided.
We do not possess any methods that guarantee convergence to the global
minimum. If we have computed a solution which we suspect is not the global
minimum, we can restart the calculations from a radically different initial
guess, and repeat the process until we are satisfied. Such procedures are
rarely required in well-posed parameter estimation problems, that is, in problems
where: (1) the errors in the data are not excessive, (2) the model fits the
data well (with the proper values of the parameters), (3) the true parameter
values are not outside the permitted range, and (4) the data were obtained
from properly designed experiments (see Section 7-18 and Chapter X). When
these conditions do not hold, almost anything can happen: the objective
function may possess multiple minima, or may slope down asymptotically as
certain parameters increase beyond bound. These things do occasionally
happen even in well-posed problems, but not very frequently.
The reader should realize that the state of the art of nonlinear optimization
is such that one cannot as yet write a computer program that will produce the
correct answer to every parameter estimation problem in a single computer
run. All too often, the first run produces unacceptable results. By studying
these results one can perhaps obtain better starting guesses; one can choose to
impose bounds or a prior distribution on the variables, or to relax previously
imposed bounds; one can search for errors in the coding of the model equations
or their derivatives. By careful coaxing, the computer may be made to
yield acceptable results in subsequent runs. An interactive computer system
can be particularly useful for this purpose.
5-17. Derivative Free Methods
We have dwelt at length on gradient methods because these have proven
to be fastest and most reliable for a large number of problems.‡ This is not
surprising; precise knowledge of the objective function gradient at any point
immediately puts at our disposal the totality of downhill directions at that
point. To test whether a given direction belongs to this class, all we need to
do is verify that it forms an obtuse angle with the gradient. As we have
remarked in the previous section, we lose this crucial ability as soon as the
true gradient is replaced by an approximation.
Nevertheless, the burden of differentiating the model equations may at
times prove too onerous, and while precise derivatives may be crucial in some
problems, other (perhaps most) problems can be solved without them. We
discuss below some of the methods of doing so.
5-18. Finite Differences
The most obvious, and in our experience most successful, method for
avoiding analytic differentiation of the objective function is to use a gradient
method, with finite difference approximations supplying the required
derivatives.
The simplest finite difference approximation to the gradient is given by
the one-sided difference method
$$q_\alpha \approx \frac{\Phi(\theta_1, \theta_2, \ldots, \theta_\alpha + \delta\theta_\alpha, \ldots, \theta_l) - \Phi(\theta_1, \theta_2, \ldots, \theta_\alpha, \ldots, \theta_l)}{\delta\theta_\alpha} \qquad (\alpha = 1, 2, \ldots, l) \tag{5-18-1}$$
‡ Colville (1968) reports this to be the case even with some highly constrained
nonlinear programming problems.
Two sources of error contribute to the inaccuracy of $q_\alpha$: (1) the rounding
error arising when two closely spaced values of $\Phi$ are subtracted from each
other, and (2) the truncation error due to the inexact nature of Eq. (1), which
is accurate only in the limit as $\delta\theta_\alpha \to 0$.
The rounding error increases as $\delta\theta_\alpha$ decreases. We shall henceforth write
$q_\alpha = [\Phi(\theta + \delta\theta_\alpha) - \Phi(\theta)] / \delta\theta_\alpha$ as shorthand for Eq. (1). Let $\varepsilon$ be the relative
error in the computed values of $\Phi$ (at best $\varepsilon = 2^{-b}$ where $b$ is the number of
binary digits carried by the computer in use). The actual error in $\Phi$ has
magnitude $\varepsilon |\Phi|$, and the error in $\Phi(\theta + \delta\theta_\alpha) - \Phi(\theta)$ can be as high as $2\varepsilon |\Phi|$,
although the root-mean-square error is only $\sqrt{2}\, \varepsilon |\Phi|$. The maximum
rounding error in Eq. (1) is, therefore
$$\delta_{R,\alpha} \equiv 2\varepsilon |\Phi| / |\delta\theta_\alpha| \tag{5-18-2}$$
On the other hand, we have the Taylor series expansion
$$\Phi(\theta + \delta\theta_\alpha) = \Phi(\theta) + q_\alpha \delta\theta_\alpha + \tfrac{1}{2} H_{\alpha\alpha} \delta\theta_\alpha^2 + \cdots \tag{5-18-3}$$
The truncation error in Eq. (1) is, therefore, approximately
$$\delta_{T,\alpha} \equiv \tfrac{1}{2} |H_{\alpha\alpha} \delta\theta_\alpha| \tag{5-18-4}$$
The maximum total error is approximately
$$\delta_\alpha \equiv \delta_{R,\alpha} + \delta_{T,\alpha} = 2\varepsilon |\Phi / \delta\theta_\alpha| + \tfrac{1}{2} |H_{\alpha\alpha} \delta\theta_\alpha| \tag{5-18-5}$$
This has a minimum at
$$|\delta\theta_\alpha| = (4\varepsilon |\Phi / H_{\alpha\alpha}|)^{1/2} \tag{5-18-6}$$
If we are interested in the mean square error instead, we would minimize
$2\varepsilon^2 |\Phi|^2 / |\delta\theta_\alpha|^2 + H_{\alpha\alpha}^2 \delta\theta_\alpha^2 / 4$,
so that
$$|\delta\theta_\alpha| = 2^{3/4} (\varepsilon |\Phi / H_{\alpha\alpha}|)^{1/2} \tag{5-18-7}$$
Eq. (6) or Eq. (7) can be used as a basis for estimating the step sizes $\delta\theta_\alpha$
($\alpha = 1, 2, \ldots, l$) required for computing the differences. The same equations
could have been obtained by requiring that the two error sources contribute
equally. In the case of Eq. (6) the total maximum error turns out to be
$2(\varepsilon |\Phi H_{\alpha\alpha}|)^{1/2}$, whereas the root-mean-square error attendant upon Eq. (7)
is $2^{1/4} (\varepsilon |\Phi H_{\alpha\alpha}|)^{1/2}$.
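The step length of Eq. (5-18-6) can be tried out on a function whose derivatives are known. In the sketch below the test function, the function names, and the use of `sys.float_info.epsilon` as a stand-in for the relative error $\varepsilon$ are all assumptions made for illustration.

```python
import math
import sys

def optimal_step(phi, h_aa, eps=sys.float_info.epsilon):
    """Step length of Eq. (5-18-6), balancing rounding and truncation error."""
    return math.sqrt(4.0 * eps * abs(phi / h_aa))

def forward_difference(f, theta, a, step):
    """One-sided estimate of the a-th gradient component, Eq. (5-18-1)."""
    shifted = list(theta)
    shifted[a] += step
    return (f(shifted) - f(theta)) / step

def f(theta):
    # Illustrative objective: gradient component 0 and H_00 both equal exp(theta[0]).
    return math.exp(theta[0]) + theta[1] ** 2

theta = [1.0, 2.0]
exact = math.exp(1.0)
step = optimal_step(f(theta), h_aa=math.exp(1.0))
error_opt = abs(forward_difference(f, theta, 0, step) - exact)
error_big = abs(forward_difference(f, theta, 0, 1e-2) - exact)
print(step, error_opt, error_big)
```

With these values the step is of order $10^{-8}$ and the error of order $10^{-7}$ or less, whereas the step $10^{-2}$ gives an error dominated by truncation, near $\tfrac{1}{2} H_{\alpha\alpha} \delta\theta_\alpha \approx 0.014$.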
To apply these formulas we need estimates of the Hessian $H$. These are
available in the Gauss and variable metric methods. In the latter, we usually
have $A \approx H^{-1}$ rather than $H$. However, if we start out with a diagonal $A_1$ we
can form $A_1^{-1}$ easily. Then the easily verifiable formula
$$A_{i+1}^{-1} = A_i^{-1} - (1 / \sigma_i^T A_i^{-1} \rho_i) A_i^{-1} \rho_i \rho_i^T A_i^{-1} \tag{5-18-8}$$
enables us in the ROC method to compute successively the matrices $A_i^{-1}$
which are approximations to $H$. A similar procedure for the Davidon-
Fletcher-Powell method is given by Stewart (1967).
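Eq. (5-18-8) can be verified numerically for a single rank one correction. The sketch below updates both $A_i$ and its inverse and multiplies the results together; the function names and the sample vectors are illustrative assumptions, and $A_i^{-1}$ is assumed symmetric.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def roc_inverse_update(A_inv, sigma, rho):
    """Eq. (5-18-8): inverse of A + rho rho^T / (rho^T eta), given A^{-1}."""
    u = matvec(A_inv, rho)                   # A^{-1} rho
    denom = dot(sigma, u)                    # sigma^T A^{-1} rho
    n = len(A_inv)
    return [[A_inv[r][c] - u[r] * u[c] / denom for c in range(n)]
            for r in range(n)]

A = [[1.0, 0.0], [0.0, 1.0]]
A_inv = [[1.0, 0.0], [0.0, 1.0]]
sigma, eta = [1.0, 0.0], [4.0, 1.0]
rho = [s - a for s, a in zip(sigma, matvec(A, eta))]        # Eq. (5-12-7)
scale = dot(rho, eta)
A_next = [[A[r][c] + rho[r] * rho[c] / scale for c in range(2)] for r in range(2)]
A_next_inv = roc_inverse_update(A_inv, sigma, rho)
product = [matvec(A_next, list(col)) for col in zip(*A_next_inv)]
print(A_next_inv, product)   # product is the identity up to rounding
```

For these data the updated inverse is $[[4, 1], [1, 4/3]]$, whose first column already agrees with the Hessian $[[4, 1], [1, 3]]$ used to generate $\eta$.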
In the Gauss method we need the derivatives of the individual model
equations, rather than of the objective function directly. In the absence of
anything better, we would still recommend using Eq. (6) or Eq. (7) for
choosing $\delta\theta_\alpha$. In place of Eq. (1), however, we would apply similar equations
to the model equations for each experiment in turn.
These formulas should not be used blindly. Gross errors in the estimated
$H$ may lead to absurd values of $\delta\theta_\alpha$. Lower and upper limits should be
imposed on the $\delta\theta_\alpha$, e.g., $10^{-5} |\theta_\alpha| \le |\delta\theta_\alpha| \le 10^{-2} |\theta_\alpha|$. A smaller lower
bound would be appropriate if the calculations are performed in double
precision.
Eq. (1) represents the crudest possible estimate of $q_\alpha$. A better estimate is
given by the central difference scheme
$$q_\alpha \approx \frac{\Phi(\theta_1, \theta_2, \ldots, \theta_\alpha + \delta\theta_\alpha, \ldots, \theta_l) - \Phi(\theta_1, \theta_2, \ldots, \theta_\alpha - \delta\theta_\alpha, \ldots, \theta_l)}{2\,\delta\theta_\alpha} \qquad (\alpha = 1, 2, \ldots, l) \tag{5-18-9}$$
Unfortunately, this scheme requires computation of two additional function
values for each gradient component, instead of the one required by the
one-sided difference. The truncation error of Eq. (9) is $\Phi_\alpha^{(4)} \delta\theta_\alpha^3 / 24$, where
$\Phi_\alpha^{(4)} \equiv \partial^4 \Phi / \partial \theta_\alpha^4$. The step with least maximum error has length $|\delta\theta_\alpha| =
(16\varepsilon |\Phi| / |\Phi_\alpha^{(4)}|)^{1/4}$, and the attendant error has magnitude $\tfrac{4}{3} [\varepsilon^3 |\Phi|^3 |\Phi_\alpha^{(4)}|]^{1/4}$.
These formulas are not very useful, since we rarely know $\Phi_\alpha^{(4)}$. However, it is
safe to say that somewhat larger values of $\delta\theta_\alpha$ may be used here than with the
one-sided difference scheme.
For economy of calculation we suggest that one-sided differences be
used for several iterations, until no further progress can be made. Then one
may switch to central differences if one feels that the solution has not been
attained.
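The two difference schemes and the suggested limits on the step can be sketched as follows. This is only an illustration: the function names, the test function `phi`, and the rule used when a parameter is exactly zero are assumptions made here, not part of the text.

```python
import math

def clamp_step(theta_a, step, lower=1e-5, upper=1e-2):
    """Force lower*|theta_a| <= |step| <= upper*|theta_a| (limits quoted in the text)."""
    # Assumption: use a unit scale when the parameter itself is zero.
    scale = abs(theta_a) if theta_a != 0.0 else 1.0
    return min(max(abs(step), lower * scale), upper * scale)

def one_sided_gradient(phi, theta, steps):
    """One extra function value per component."""
    base = phi(theta)
    gradient = []
    for a, h in enumerate(steps):
        shifted = list(theta)
        shifted[a] += h
        gradient.append((phi(shifted) - base) / h)
    return gradient

def central_gradient(phi, theta, steps):
    """Two extra function values per component, as in Eq. (5-18-9)."""
    gradient = []
    for a, h in enumerate(steps):
        up, down = list(theta), list(theta)
        up[a] += h
        down[a] -= h
        gradient.append((phi(up) - phi(down)) / (2.0 * h))
    return gradient

# Hypothetical smooth objective with a known gradient, used only for comparison.
def phi(theta):
    return math.exp(theta[0]) + theta[0] * theta[1] ** 2

theta = [0.5, 2.0]
steps = [clamp_step(t, 1e-3 * abs(t)) for t in theta]
exact = [math.exp(0.5) + 4.0, 2.0]
one_sided_error = [abs(g - e) for g, e in zip(one_sided_gradient(phi, theta, steps), exact)]
central_error = [abs(g - e) for g, e in zip(central_gradient(phi, theta, steps), exact)]
```

With the same step, the central scheme costs twice as many function values but leaves a much smaller error, which is why it is worth reserving for the final iterations.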
V Computation of the Estimates I: Unconstrained Problems
5-19. Direct Search Methods
The term direct search was coined by Hooke and Jeeves (1961). It has come
to be applied to methods which (like Hooke and Jeeves') search for the
minimum without explicit evaluation of derivatives, analytic or numerical.
The idea of direct search methods is appealing, and they have performed
well in certain cases [see, e.g., the survey by Box (1966)]. Our own
experience, however, has been disappointing; gradient methods, even using
finite difference approximations, have outperformed direct search methods
on all but the most trivial parameter estimation problems, both in
reliability and speed of convergence. For this reason we shall mention a few
of the more promising or popular methods, but not describe any of them in
detail.
The methods that performed best in the Box (1966) survey were those due
to Powell (1964, 1965). The first one minimizes an arbitrary function; it has
been amended by Zangwill (1967b), who also describes a method of his own.
The second Powell method (Powell, 1965) is designed specifically for
minimizing a sum of squares, but can be adapted easily to other problems which
admit the Gauss approximation. This algorithm is related to the Gauss
method, with finite differences taken along the search directions (instead of
along the coordinate directions, as would be the case with the usual finite
difference version of the Gauss method). The weakness of the method
derives from the fact that the differences are taken only in a single direction
per iteration, so that one's estimated derivatives in all other directions are
perennially out of date. This effect worsens as the dimension of the parameter
vector increases.
Other methods that have found considerable use in solving optimization
problems are those of Hooke and Jeeves (1961), Rosenbrock (1960) [see also
Rosenbrock and Storey (1966)], Buzzi Ferraris (1968), Brent (1971), and the
Simplex method of Nelder and Mead (1965). The latter method was adapted
to least squares problems by Spendley (1969). A review of direct search
methods appears in Fletcher (1965).
5-20. The Initial Guess
All the optimization methods that we have described require that one
supply an initial guess $\theta_1$ for the values of the parameters. The choice of a
good initial guess can spell the difference between success and failure in
locating the optimum, or between rapid and slow convergence to the solution.
Unfortunately, while we can prescribe algorithms for proceeding from the
initial guess, we must rely heavily on intuition and prior knowledge in
selecting the initial guess. Nevertheless, we can provide some suggestions which
may be helpful in many cases. A comprehensive discussion of such methods
can be found in Kittrell et al. (1965).
At the outset we must caution the reader not to exaggerate the importance
of finding a good initial guess. In many cases the proper solution has been
obtained starting from the first initial guess that came to mind. In these
cases, at the possible expense of a few additional minutes of computer time,
one has saved oneself a considerable amount of trouble. We suggest, therefore,
that (unless computer time is exceptionally scarce or expensive) one attempt
to estimate the parameters "by brute force". Only if this strategy fails should
one resort to more delicate techniques.
The most obvious method for making the initial guesses is by the use of
prior information. Estimates calculated from previous experiments, known
values from similar systems, values computed from theoretical considerations:
all these form ideal initial guesses.
On the opposite end of the spectrum stand problems in which our only
information concerning the parameter values is given in the form of upper
and lower bounds on their values. If we do not even have such bounds, we can
transform our variables into bounded ones: e.g., a positive variable $\theta$ can
be replaced by the bounded variable $\varphi = 1/(1 + \theta)$, or a completely free
variable $\theta$ may be replaced by $\varphi = \arctan\theta$.
Once we have all our parameters confined to a rectangular region in
$\theta$ space, we can conduct a grid search: compute the value of the objective
function at every point on a regular rectangular grid, and choose the point
with the best value as the initial guess. The main difficulty with this approach
is "the curse of dimensionality": in a grid with $k$ levels in each one of the $l$
dimensions, the total number of points at which the objective function must
be evaluated is $k^l$. This is a prohibitively large number for all but the smallest
values of $k$ and $l$.
An alternative to the grid search is random search. Here a number of
points within the feasible region are chosen at random, and the one giving
the best value of the objective function is used as the initial guess. It is true
that among a hundred points there is a good chance of finding one that is
within 1% of the solution. However, this 1% applies to the volume of the
feasible region. If there are $l$ parameters, the relative accuracy of each
parameter is only $0.01^{1/l}$, or 31.6% of the permitted range when $l = 4$. The
random search method does not overcome the curse of dimensionality, but
it does offer some advantages over grid search. One may bias the sampling
so as to favor certain regions of parameter space (this can be regarded as
sampling from a prior distribution), and one may use a sequential termination
criterion: stop sampling as soon as a function value significantly better than
the average has been found. Sometimes, a transformation of variables is
called for prior to commencement of the search. For instance, if even the
order of magnitude of a parameter is unknown, it should be replaced by its
logarithm.
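The counting arguments above, and a random search that samples one parameter on a logarithmic scale, can be sketched as follows. The objective function, the bounds, and all names are hypothetical and serve only to make the sketch runnable.

```python
import math
import random

def grid_points(levels, n_params):
    """Number of objective evaluations in a full grid: k**l."""
    return levels ** n_params

def per_parameter_accuracy(volume_fraction, n_params):
    """Side length (as a fraction of the range) of a cube holding the given volume fraction."""
    return volume_fraction ** (1.0 / n_params)

def random_search(objective, bounds, n_points, rng, log_scale=()):
    """Return the best of n_points random points; indices in log_scale are sampled in log space."""
    best_theta, best_value = None, math.inf
    for _ in range(n_points):
        theta = []
        for a, (low, high) in enumerate(bounds):
            if a in log_scale:
                theta.append(math.exp(rng.uniform(math.log(low), math.log(high))))
            else:
                theta.append(rng.uniform(low, high))
        value = objective(theta)
        if value < best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value

# Hypothetical objective whose first parameter has an unknown order of magnitude.
def objective(theta):
    return (math.log10(theta[0]) - 3.0) ** 2 + (theta[1] - 0.5) ** 2

rng = random.Random(0)
bounds = [(1.0, 1.0e6), (0.0, 1.0)]
best_theta, best_value = random_search(objective, bounds, 100, rng, log_scale=(0,))
```

Sampling the first parameter uniformly in its logarithm spreads the trial points evenly over the orders of magnitude, which is the effect of the transformation suggested in the text.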
It is not always necessary to provide initial guesses for all the parameters
in a model. If some of the parameters enter the model equations linearly, and
an initial guess is provided for the other parameters, then the linear
parameters can be estimated by linear multiple regression. Suppose, for instance,
that the model has the form $y_\mu = \theta_1 \exp(-\theta_2 x_\mu)$. If we have the initial guess
$\theta_2 = 6$, and let $z_\mu \equiv \exp(-6x_\mu)$, then an initial guess for $\theta_1$ can be found by
solving the linear least squares problem $\min \sum_\mu (y_\mu - \theta_1 z_\mu)^2$. Special versions
of the Gauss method to deal with partly linear models have been devised
(Lawton and Sylvestre, 1971; Golub and Pereyra, 1972).
The most fruitful approach to finding an initial guess is to substitute a
simpler problem for the original estimation problem. The answers to the
simpler problem can be used as initial guesses for the original problem. There
is no systematic way of applying this idea to all problems, but the following
is a partial list of what may be attempted.
(a) Linearization. We try, by means of transformation of variables, to change
the model equations into ones that are linear in the parameters (see Section
4-19). The linear problem can be solved by multiple linear regression with no
need for an initial guess.
(b) Multistage Estimation. By breaking up the data into groups, we may
estimate certain auxiliary parameters for each group; then we estimate the
original parameters as functions of the auxiliary parameters. For example, the
rate of a chemical reaction is given by the expression

$$y = kx \exp(-E/T) \tag{5-20-1}$$
where $y$ is the rate, $x$ the concentration, $T$ the temperature, and $k$ and $E$
the parameters to be estimated. The rate $y$ is measured as a function of $x$ at
several values of $T$, say $T_1, T_2, \ldots, T_q$. Suppose we use the data taken at
$T_i$ to estimate the coefficient $K_i$ in the equation

$$y = K_i x \qquad (i = 1, 2, \ldots, q) \tag{5-20-2}$$

The estimated $K_i$ can then be used as data for estimating $\log k$ and $E$ in the
linearized model

$$\log K_i = \log k - E/T_i \qquad (i = 1, 2, \ldots, q) \tag{5-20-3}$$

Of course, in this case we could have linearized Eq. (1) directly; however, in
kinetics models involving simultaneous reactions, the original equations
cannot be linearized, whereas the multistage procedure still applies.
(c) Model Simplification. It is frequently possible to approach the final
model through a sequence of simpler ones, in which various effects are
neglected and the corresponding parameters suppressed. After the
parameters have been estimated for a simple model, analysis of the residuals (see
Section 7-13) can provide an indication as to what terms should be added to
the model next. This method serves not only to obtain initial guesses for a
given model, but also (and perhaps more importantly) to synthesize a final
model where none is given. Examples for such syntheses are given by Box and
Youle (1955), Peterson (1962), Box and Hunter (1962), Hunter and Mezaki
(1964), and Kittrell, Hunter, and Mezaki (1966).
(d) Simpler Estimation Method. We replace the proper objective function by
one which is easier to minimize. For instance: (1) We linearize the model as
under (a) above. (2) In a multi-equation model, we use one of the equations to
obtain preliminary estimates of the parameters. It is true, however, that
sometimes it is easiest to obtain estimates when all equations are used
simultaneously, since otherwise the information relating to the values of some of the
parameters is lost. To give a trivial example, let the three model equations be

$$y_1 = (\theta_1 + \theta_2)x, \qquad y_2 = (\theta_2 + \theta_3)x, \qquad y_3 = (\theta_3 + \theta_1)x$$

Clearly, the three parameters can be estimated independently only if all three
equations are used. (3) In dynamic models we may use easy-to-apply data
integration or differentiation methods to obtain initial guesses for the
integration of equations method (see Section 8-1).
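The multistage procedure of item (b) can be sketched for the rate model $y = kx\exp(-E/T)$. The data below are synthetic and noise-free, and the values of `k_true` and `E_true`, the temperatures, and the function names are all assumptions made for illustration.

```python
import math

def fit_through_origin(xs, ys):
    """Stage 1: least squares estimate of K in y = K x for one temperature."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_line(us, vs):
    """Ordinary least squares for v = intercept + slope * u."""
    n = len(us)
    u_mean, v_mean = sum(us) / n, sum(vs) / n
    slope = (sum((u - u_mean) * (v - v_mean) for u, v in zip(us, vs))
             / sum((u - u_mean) ** 2 for u in us))
    return v_mean - slope * u_mean, slope

def multistage_guess(groups):
    """groups maps each temperature T_i to (concentrations, rates); returns (k, E)."""
    temperatures = sorted(groups)
    log_K = [math.log(fit_through_origin(*groups[T])) for T in temperatures]
    # Stage 2: log K_i = log k - E / T_i is linear in 1 / T_i.
    intercept, slope = fit_line([1.0 / T for T in temperatures], log_K)
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated from assumed parameter values.
k_true, E_true = 50.0, 1200.0
concentrations = [0.5, 1.0, 2.0]
groups = {
    T: (concentrations, [k_true * x * math.exp(-E_true / T) for x in concentrations])
    for T in (300.0, 350.0, 400.0)
}
k_guess, E_guess = multistage_guess(groups)
```

With real data the two stages only approximate the joint least squares solution, so the result is best treated as an initial guess for the full nonlinear fit.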
5-21. A Single-Equation Least Squares Problem†
Let $y$ be the fraction remaining at time $x_1$ of a chemical compound A
undergoing the first order reaction

$$A \rightarrow B \tag{5-21-1}$$

The variable $y$ satisfies the differential equation

$$dy/dx_1 = -ky \tag{5-21-2}$$

where $k$ is the rate constant. The solution to this equation with the initial
condition $y = 1$ at $x_1 = 0$ is

$$y = \exp(-kx_1) \tag{5-21-3}$$

The rate constant $k$ depends on the absolute temperature $x_2$ as follows

$$k = \theta_1 \exp(-\theta_2/x_2) \tag{5-21-4}$$
† The numerical results quoted in the discussion of this and subsequent problems were
obtained as output from calculations performed in single precision floating point arithmetic
on an IBM System/360 computer. The results were converted to decimal from a binary
representation inside the computer. Therefore, the results of performing the same
calculations on a decimal desk calculator (or, for that matter, on any other computer, or even
using a different program on the same computer) would differ slightly from those presented
here. In a long iterative procedure such differences can build up to such an extent that a
different number of iterations may be required to reach substantially the same end result.
where $\theta_1$ is the so-called frequency constant, and $\theta_2$ is the activation energy
(expressed in suitable units). Our model equation takes the form

$$y = f(x_1, x_2, \theta_1, \theta_2) = \exp[-\theta_1 x_1 \exp(-\theta_2/x_2)] \tag{5-21-5}$$

Data for a set of fifteen observations on $x$ and $y$ are given in Table 5-2.
Table 5-2
Data for Least Squares Problem

Experiment    Time,           Temperature,     Fraction A
number, μ     x_μ1 (hr)       x_μ2 (°K)        remaining, y_μ
    1           0.1              100             0.980
    2           0.2              100             0.983
    3           0.3              100             0.955
    4           0.4              100             0.979
    5           0.5              100             0.993
    6           0.05             200             0.626
    7           0.1              200             0.544
    8           0.15             200             0.455
    9           0.2              200             0.225
   10           0.25             200             0.167
   11           0.02             300             0.566
   12           0.04             300             0.317
   13           0.06             300             0.034
   14           0.08             300             0.016
   15           0.1              300             0.066
Our aim is to estimate $\theta_1$ and $\theta_2$. As far as we know, the errors in the $x_\mu$
are negligible, whereas those of the $y_\mu$ are all independent and with equal
standard deviations. The least squares criterion is, therefore, appropriate. We
seek to minimize

$$\Phi(\theta) = \sum_{\mu=1}^{15} e_\mu^2(\theta) = \sum_{\mu=1}^{15} [y_\mu - f_\mu(\theta)]^2 \tag{5-21-6}$$

Following Eq. (5-9-2), we find that the gradient of $\Phi$ is given by:

$$q_1 \equiv \partial\Phi/\partial\theta_1 = -2\sum_{\mu=1}^{15} e_\mu\,\partial f_\mu/\partial\theta_1 = 2\sum_{\mu=1}^{15} e_\mu f_\mu \exp(-\theta_2/x_{\mu 2})\,x_{\mu 1} \tag{5-21-7}$$

$$q_2 \equiv \partial\Phi/\partial\theta_2 = -2\sum_{\mu=1}^{15} e_\mu\,\partial f_\mu/\partial\theta_2 = -2\sum_{\mu=1}^{15} e_\mu (\theta_1 x_{\mu 1}/x_{\mu 2}) f_\mu \exp(-\theta_2/x_{\mu 2}) \tag{5-21-8}$$
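The approximate Hessian $N$ is given by Eq. (5-9-4)

$$N_{\alpha\beta} = 2\sum_{\mu=1}^{15} (\partial f_\mu/\partial\theta_\alpha)(\partial f_\mu/\partial\theta_\beta) \qquad (\alpha, \beta = 1, 2) \tag{5-21-9}$$

Let our initial guess be

$$\theta_1 = \begin{bmatrix} \theta_{1,1} \\ \theta_{1,2} \end{bmatrix} = \begin{bmatrix} 750 \\ 1200 \end{bmatrix}$$

Table 5-3 gives the values of the $f_\mu$, $e_\mu$, and $\partial f_\mu/\partial\theta$ for $\theta = \theta_1$. From Table 5-3
entries and Eqs. (6)-(9) one easily calculates

$$\Phi_1 = 1.090441, \qquad q_1 = \begin{bmatrix} -0.002230450 \\ 0.006863795 \end{bmatrix}, \qquad N_1 = \begin{bmatrix} 0.2689478 & -0.7730614 \\ -0.7730614 & 2.310325 \end{bmatrix} \times 10^{-5}$$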
Table 5-3
Least Squares Problem Functions at θ_1 = [750, 1200]^T

 μ     f_μ          e_μ = y_μ - f_μ    10^6 × ∂f_μ/∂θ_1    10^5 × ∂f_μ/∂θ_2
 1     0.9995393    -0.0195393           -0.6141379            0.4606032
 2     0.9990788    -0.0160788           -1.227710             0.9207821
 3     0.9986185    -0.0436185           -1.840716             1.380537
 4     0.9981585    -0.0191585           -2.453158             1.839868
 5     0.9976986    -0.0046986           -3.065035             2.298776
 6     0.9112362    -0.2852362         -112.9364              42.35113
 7     0.8303515    -0.2863515         -205.8234              77.18373
 8     0.7566463    -0.3016463         -281.3304             105.4990
 9     0.6894834    -0.4644834         -341.8115             128.1793
10     0.6282821    -0.4612821         -389.3387             146.0020
11     0.7597739    -0.1937739         -278.3146              69.57867
12     0.5772563    -0.2602563         -422.9124             105.7281
13     0.4385841    -0.4045841         -481.9767             120.4942
14     0.3332248    -0.3172248         -488.2577             122.0644
15     0.2531757    -0.1871757         -463.7071             115.9268
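The quantities tabulated above can be recomputed directly from Table 5-2 and Eqs. (5-21-5) to (5-21-9). The sketch below uses double precision, so the last digits differ slightly from the single precision values quoted in the text; the function and variable names are illustrative choices.

```python
import math

# Data of Table 5-2: time (hr), temperature, fraction remaining.
X1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.05, 0.1, 0.15, 0.2, 0.25, 0.02, 0.04, 0.06, 0.08, 0.1]
X2 = [100.0] * 5 + [200.0] * 5 + [300.0] * 5
Y = [0.980, 0.983, 0.955, 0.979, 0.993, 0.626, 0.544, 0.455, 0.225, 0.167,
     0.566, 0.317, 0.034, 0.016, 0.066]

def model(theta, x1, x2):
    """Eq. (5-21-5): f = exp[-theta_1 x_1 exp(-theta_2 / x_2)]."""
    return math.exp(-theta[0] * x1 * math.exp(-theta[1] / x2))

def model_derivatives(theta, x1, x2):
    """Partial derivatives of f with respect to theta_1 and theta_2."""
    f = model(theta, x1, x2)
    arrhenius = math.exp(-theta[1] / x2)
    return (-f * x1 * arrhenius, f * theta[0] * x1 * arrhenius / x2)

def gauss_quantities(theta):
    """Return the sum of squares, its gradient q, and the Gauss matrix N."""
    phi = 0.0
    q = [0.0, 0.0]
    n = [[0.0, 0.0], [0.0, 0.0]]
    for x1, x2, y in zip(X1, X2, Y):
        e = y - model(theta, x1, x2)
        d = model_derivatives(theta, x1, x2)
        phi += e * e
        for a in range(2):
            q[a] -= 2.0 * e * d[a]
            for b in range(2):
                n[a][b] += 2.0 * d[a] * d[b]
    return phi, q, n

phi_1, q_1, n_1 = gauss_quantities([750.0, 1200.0])
```

Evaluating `gauss_quantities` at other trial points gives the inputs needed by each later step of the worked example.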
To determine the first step direction $v_1$ we must compute $v_1 = -N_1^{-1}q_1$,
i.e., solve the set of simultaneous equations $N_1 v_1 = -q_1$. In our two-dimensional
problem this can be done trivially on a desk calculator. However, for
purposes of illustration, we shall apply the Greenstadt method. For this, we
need to compute the inverse scaled decomposition of $N$ (we omit the
subscript 1). We follow the steps outlined in Section A-5:
1. $N_{11}^{1/2} = 0.001639963$, $N_{22}^{1/2} = 0.004806584$. Therefore

$$N = BCB = \begin{bmatrix} 0.001639963 & 0 \\ 0 & 0.004806584 \end{bmatrix} \begin{bmatrix} 1 & -0.980716 \\ -0.980716 & 1 \end{bmatrix} \begin{bmatrix} 0.001639963 & 0 \\ 0 & 0.004806584 \end{bmatrix}$$

2. The matrix $C$ has the form $\begin{bmatrix} 1 & -a \\ -a & 1 \end{bmatrix}$. The reader may verify that
such a matrix has eigenvalues $1 + a$ and $1 - a$ with corresponding
eigenvectors $[1/\sqrt{2}, -1/\sqrt{2}]^T$ and $[1/\sqrt{2}, 1/\sqrt{2}]^T$ respectively. Hence the eigenvalue
decomposition of $C$ is given by

$$C = U\Pi U^T = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix} \begin{bmatrix} 1.980716 & 0 \\ 0 & 0.019284 \end{bmatrix} \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix}$$

3. The inverse decomposition of $N$ is given by $N^{-1} = G\Pi^{-1}G^T$ where

$$G = B^{-1}U = \begin{bmatrix} (1/\sqrt{2})/0.001639963 & (1/\sqrt{2})/0.001639963 \\ -(1/\sqrt{2})/0.004806584 & (1/\sqrt{2})/0.004806584 \end{bmatrix} = \begin{bmatrix} 431.1723 & 431.1723 \\ -147.1122 & 147.1122 \end{bmatrix}$$

$$\Pi^{-1} = \begin{bmatrix} 1/1.980716 & 0 \\ 0 & 1/0.019284 \end{bmatrix} = \begin{bmatrix} 0.504868 & 0 \\ 0 & 51.8573 \end{bmatrix}$$

Hence

$$N^{-1} = \begin{bmatrix} 431.1723 & 431.1723 \\ -147.1122 & 147.1122 \end{bmatrix} \begin{bmatrix} 0.504868 & 0 \\ 0 & 51.8573 \end{bmatrix} \begin{bmatrix} 431.1723 & -147.1122 \\ 431.1723 & 147.1122 \end{bmatrix}$$

The ratio of eigenvalues was $1.980716/0.019284 \approx 100$, indicating that $N$ is
mildly ill-conditioned. We can tolerate, however, eigenvalue ratios of up to
$10^4$ or $10^5$, hence there is no need to adjust the value of the smaller eigenvalue.
4. To compute $-N^{-1}q$ we proceed as follows:

$$\begin{bmatrix} 431.1723 & -147.1122 \\ 431.1723 & 147.1122 \end{bmatrix} \begin{bmatrix} 0.002230450 \\ -0.006863795 \end{bmatrix} = \begin{bmatrix} 1.971456 \\ -0.04803973 \end{bmatrix}$$

$$\begin{bmatrix} 0.5048679 & 0 \\ 0 & 51.85727 \end{bmatrix} \begin{bmatrix} 1.971456 \\ -0.04803973 \end{bmatrix} = \begin{bmatrix} 0.9953249 \\ -2.491209 \end{bmatrix}$$

$$v_1 = \begin{bmatrix} 431.1723 & 431.1723 \\ -147.1122 & 147.1122 \end{bmatrix} \begin{bmatrix} 0.9953249 \\ -2.491209 \end{bmatrix} = \begin{bmatrix} -644.9785 \\ -512.9099 \end{bmatrix}$$
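Steps 1 to 4 can be traced in code for the two-parameter case. This sketch handles only a 2 x 2 matrix with a negative off-diagonal element, uses the closed-form eigenvectors quoted in step 2, and omits any adjustment of small eigenvalues; the function name is an illustrative choice.

```python
import math

def scaled_decomposition_step(n, q):
    """Solve N v = -q through N = B C B and C = U Pi U^T (2 x 2 case only)."""
    # Step 1: B holds the square roots of the diagonal of N.
    b = [math.sqrt(n[0][0]), math.sqrt(n[1][1])]
    a = -n[0][1] / (b[0] * b[1])          # C = [[1, -a], [-a, 1]]
    # Step 2: eigenvalues 1 + a and 1 - a; eigenvectors are the columns of U.
    eigenvalues = [1.0 + a, 1.0 - a]
    r = 1.0 / math.sqrt(2.0)
    u = [[r, r], [-r, r]]
    # Step 3: G = B^{-1} U.
    g = [[u[i][j] / b[i] for j in range(2)] for i in range(2)]
    # Step 4: v = G Pi^{-1} G^T (-q), evaluated right to left.
    w = [-(g[0][j] * q[0] + g[1][j] * q[1]) / eigenvalues[j] for j in range(2)]
    v = [g[i][0] * w[0] + g[i][1] * w[1] for i in range(2)]
    return v, eigenvalues

N1 = [[0.2689478e-5, -0.7730614e-5], [-0.7730614e-5, 2.310325e-5]]
Q1 = [-0.002230450, 0.006863795]
step, eigenvalues = scaled_decomposition_step(N1, Q1)
condition_ratio = eigenvalues[0] / eigenvalues[1]
```

The ratio of the eigenvalues is available as a by-product, which is what makes the decomposition useful for detecting ill-conditioning before the step is taken.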
Along the ray $\theta = \theta_1 + \rho v_1$, the directional derivative at $\rho = 0$ is

$$\partial\Psi/\partial\rho = v_1^T q_1 = (-644.9785)(-0.002230450) + (-512.9099)(0.006863795) = -2.081916$$

This quantity being negative confirms the fact that $\Phi$ decreases, at least
initially, as one proceeds from $\theta_1$ in this direction.
Trying initially $\rho^{(0)} = 1$ we arrive at

$$\theta^{(0)} = \begin{bmatrix} 750 - 644.9785 \\ 1200 - 512.9099 \end{bmatrix} = \begin{bmatrix} 105.0215 \\ 687.0901 \end{bmatrix}$$

where $\Phi(\theta^{(0)}) = 0.9133969 < \Phi_1$. This indicates that we are at an acceptable
point. We try to find an even better point by fitting a parabola to $\Phi(\theta_1 + \rho v_1)$.
The equation of the parabola is

$$\Psi(\rho) = a + b\rho + c\rho^2$$

and it must assume the following values:

$$\Psi(0) = \Phi_1 = 1.090441 = \alpha$$
$$\Psi(1) = 0.9133969 = \beta$$
$$(d\Psi/d\rho)_{\rho=0} = -2.081916 = \gamma$$

Using Eq. (5-14-5) we find that $\Psi(\rho)$ has a stationary point at

$$\rho^* = \frac{-2.081916 \times 1}{2(-2.081916 \times 1 + 1.090441 - 0.9133969)} = 0.5464714$$

Trying $\theta^{(1)} = \theta_1 + \rho^* v_1 = [397.5376, 919.7092]^T$ we find $\Phi(\theta^{(1)}) = 0.3345645$,
which is a great improvement over both $\theta_1$ and $\theta^{(0)}$. We accept $\theta^{(1)}$ then as
$\theta_2$, the starting point for the next iteration.
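The interpolation step amounts to fitting a parabola to two function values and one slope. A minimal sketch, with a function name chosen here for illustration:

```python
def parabola_stationary_point(alpha, beta, gamma, rho):
    """Stationary point of the parabola with Psi(0) = alpha, Psi(rho) = beta, Psi'(0) = gamma."""
    # Psi(r) = alpha + gamma * r + c * r**2, with c fixed by Psi(rho) = beta.
    c = (beta - alpha - gamma * rho) / rho ** 2
    return -gamma / (2.0 * c)

# Values from the first iteration of the worked example.
rho_star = parabola_stationary_point(1.090441, 0.9133969, -2.081916, 1.0)
theta_next = [750.0 + rho_star * (-644.9785), 1200.0 + rho_star * (-512.9099)]
```

The expression `-gamma / (2 * c)` is algebraically the same as the quotient displayed in the text; it is a minimum only when `c` is positive, which holds here.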
A computer program using the flowchart of Fig. 5-2 produced the sequence
of iterations given in Table 5-4. No further reduction of $\Phi(\theta)$ was obtained
after six iterations, so that we took as our estimate $\theta^* = [813.4583, 960.9063]^T$
with $\Phi^* = 0.03980599$. At this point, the gradient was

$$q^* = \begin{bmatrix} -0.218524 \\ 0.631308 \end{bmatrix} \times 10^{-6}$$

and the approximate Hessian

$$N^* = \begin{bmatrix} 0.271890 & -0.957336 \\ -0.957336 & 3.50371 \end{bmatrix} \times 10^{-5}$$

Applying the test of Section 5-15 we find that

$$\delta\theta_1 = |q_1^*/N_{11}^*| \approx 0.1, \qquad \delta\theta_2 = |q_2^*/N_{22}^*| \approx 0.02$$

These values are negligible compared to $\theta_1^*$ and $\theta_2^*$, so we may assume that
we have converged to a stationary point. The final residuals corresponding to
this solution are given in Table 5-5.

Table 5-4
Least Squares Problem, Good Initial Guess (Gauss Iterations)

 i     Φ(θ_i)         θ_i
 1     1.090441       [750, 1200]^T
 2     0.3345645      [397.5376, 919.7092]^T
 3     0.05765885     [646.0847, 938.5288]^T
 4     0.04038005     [810.6260, 965.7625]^T
 5     0.03980731     [818.3628, 962.1228]^T
 6     0.03980599     [813.4583, 960.9063]^T

Table 5-5
Least Squares Problem, Final Residuals

 μ     e_μ* = y_μ - f(x_μ, θ*)       μ     e_μ* = y_μ - f(x_μ, θ*)
 1     -0.0145552                    9     -0.0387225
 2     -0.00613993                  10     -0.0219878
 3     -0.0287542                   11      0.0497515
 4      0.000602186                 12      0.0504873
 5      0.0199295                   13     -0.103587
 6     -0.0906165                   14     -0.0550289
 7      0.0304608                   15      0.0293314
 8      0.0869893
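The convergence check uses only the final gradient and the diagonal of the approximate Hessian. The sketch below repeats that arithmetic with the values quoted above; the function name and the relative threshold of $10^{-3}$ are assumptions for illustration, not the criterion of Section 5-15 itself.

```python
def step_estimates(q, n_diagonal):
    """Rough size of the remaining correction in each parameter: |q_a / N_aa|."""
    return [abs(q_a / n_aa) for q_a, n_aa in zip(q, n_diagonal)]

theta_star = [813.4583, 960.9063]
q_star = [-0.218524e-6, 0.631308e-6]
n_star_diagonal = [0.271890e-5, 3.50371e-5]

estimates = step_estimates(q_star, n_star_diagonal)
# Assumed threshold: corrections below 0.1% of the parameter value count as negligible.
converged = all(d < 1e-3 * abs(t) for d, t in zip(estimates, theta_star))
```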
In the preceding calculations we were fortunate enough to have started
from a good initial guess: $\theta_1 = [750, 1200]^T$ as compared to the final estimate
$\theta^* = [813.4583, 960.9063]^T$. Suppose now that we had started from the much
poorer initial guess $\theta_1 = [100, 2000]^T$. Proceeding as before we find

$$\Phi_1 = 5.299502, \qquad q_1 = \begin{bmatrix} -0.0007098080 \\ 0.0002442936 \end{bmatrix}, \qquad N_1 = \begin{bmatrix} 0.7036033 & -0.2354773 \\ -0.2354773 & 0.07896382 \end{bmatrix} \times 10^{-7} \tag{5-21-10}$$

$$v_1 = -N_1^{-1}q_1 = \begin{bmatrix} -134608.0 \\ -432361.0 \end{bmatrix}, \qquad \theta^{(0)} = \theta_1 + v_1 = \begin{bmatrix} -134508.0 \\ -430361.0 \end{bmatrix} \tag{5-21-11}$$
When we attempt to compute $\Phi(\theta^{(0)})$ we note that the exponents
occurring in the formulas for $f_\mu$ are so large that computer capacity is exceeded. We
attempt to remedy things by halving $\rho$, but we have to repeat this process
eight times before the exponentials are brought under control. We have then,
with $\rho^{(0)} = 2^{-8} = 0.00390625$

$$\theta^{(0)} = \theta_1 + 0.00390625\,v_1 = \begin{bmatrix} -425.8140 \\ 311.0039 \end{bmatrix}$$

with $\Psi^{(0)} = \Phi(\theta^{(0)}) = 0.3366272 \times 10^{20}$. Since this exceeds $\Phi_1$, $\theta^{(0)}$ is not
acceptable and we must interpolate. Here

$$\gamma = v_1^T q_1 = -10.08228$$

Hence, from Eq. (5-14-5)

$$\rho^* = \frac{-10.08228 \times (0.00390625)^2}{2(-10.08228 \times 0.00390625 + 5.299502 - 0.3366272 \times 10^{20})} \approx 2.3 \times 10^{-24}$$

Since this is less than $0.25\rho^{(0)}$, we follow Fig. 5-2A and set

$$\rho^{(2)} = 0.25\rho^{(0)} = 0.0009765625$$

leading to

$$\theta^{(2)} = \begin{bmatrix} -31.45349 \\ 1577.782 \end{bmatrix}, \qquad \Psi^{(2)} = 5.471375$$

This is still unacceptable. Repetition of the interpolation procedure once
more forces us to take

$$\rho^{(3)} = 0.25\rho^{(2)} = 0.000244141$$

and

$$\theta^{(3)} = \begin{bmatrix} 67.13663 \\ 1894.446 \end{bmatrix}, \qquad \Psi^{(3)} = 5.301888$$

Once more we need to interpolate

$$\rho^* = \frac{-10.08228 \times (0.000244141)^2}{2(-10.08228 \times 0.000244141 + 5.299502 - 5.301888)} = 0.0000619701$$

This time, $0.75\rho^{(3)} > \rho^* > 0.25\rho^{(3)}$, so we take $\rho^{(4)} = \rho^*$ and

$$\theta^{(4)} = \begin{bmatrix} 91.65955 \\ 1973.211 \end{bmatrix}, \qquad \Psi^{(4)} = 5.299135$$
We have finally an acceptable point. We set $\theta_2 = \theta^{(4)}$ and proceed to the
next iteration. The procedure converges, though rather slowly, in 25 iterations,
which are plotted in Fig. 5-3. Also shown is the direction of the negative
gradient at $\theta_1$. Since the vector pointing from $\theta_1$ to $\theta^*$ lies between the negative
gradient and the Gauss direction from $\theta_1$ to $\theta_2$, it appears that the Marquardt
method may prove efficient in this case, and indeed it turns out to be so.
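The step length decisions in this example combine parabolic interpolation with limits on how far the step may shrink. The sketch below clamps the interpolated value to the interval $[0.25\rho, 0.75\rho]$; that interval is inferred from the numbers in the worked example, since Fig. 5-2A is not reproduced here, and the handling of a non-positive curvature is an assumption.

```python
def safeguarded_step(phi_0, phi_rho, slope_0, rho, lower=0.25, upper=0.75):
    """Interpolated step length, kept between lower*rho and upper*rho."""
    curvature = (phi_rho - phi_0 - slope_0 * rho) / rho ** 2
    if curvature <= 0.0:
        # Assumption: with no interior minimum, fall back to the largest allowed step.
        return upper * rho
    rho_star = -slope_0 / (2.0 * curvature)
    return min(max(rho_star, lower * rho), upper * rho)

PHI_1 = 5.299502      # objective at the poor initial guess
GAMMA = -10.08228     # directional derivative along the Gauss direction

# First trial: the objective is enormous, so the interpolated step is cut off at 0.25 * rho.
rho_2 = safeguarded_step(PHI_1, 0.3366272e20, GAMMA, 0.00390625)
# Last trial: the interpolated step falls inside the allowed interval and is used as is.
rho_4 = safeguarded_step(PHI_1, 5.301888, GAMMA, 0.000244141)
```

The lower limit is what keeps a single wildly large function value from collapsing the step to essentially zero.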
[Figure 5-3: iteration paths in the ($\theta_1$, $\theta_2$) plane, with $\theta_1$ on the horizontal axis and
$\theta_2$ on the vertical axis. The legend distinguishes the unconstrained Gauss and Marquardt
methods from the constrained penalty and projection methods; the solution $\theta^*$ is marked.]
Fig. 5-3. Least squares problem
Returning to the first iteration, the Marquardt step would be given by

$$v_1 = -10^7 \begin{bmatrix} 0.7036033(1 + \lambda_1) & -0.2354773 \\ -0.2354773 & 0.07896382(1 + \lambda_1) \end{bmatrix}^{-1} \begin{bmatrix} -0.0007098080 \\ 0.0002442936 \end{bmatrix}$$

Trying first $\lambda_1 = 0.01$, we obtain a step leading to a value of $\theta$ for which $\Phi$
cannot be computed, and similarly for $\lambda_1 = 0.1$. With $\lambda_1 = 1$, however, we find

$$v_1 = \begin{bmatrix} 3272.001 \\ -10589.988 \end{bmatrix}, \qquad \theta^{(0)} = \begin{bmatrix} 3372.001 \\ -8589.988 \end{bmatrix}, \qquad \Phi^{(0)} = 6.183162$$

This is still unacceptable, so we increase $\lambda_1$ to 10, and eventually to $\lambda_1 = 100$
with

$$v_1 = \begin{bmatrix} 98.8778 \\ -303.391 \end{bmatrix}, \qquad \theta_2 = \begin{bmatrix} 198.8778 \\ 1696.609 \end{bmatrix}, \qquad \Phi_2 = 4.979104$$

which is an acceptable starting point for the next iteration. The full process,
which converges in 10 iterations, is also shown in Fig. 5-3.
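The Marquardt step for a given $\lambda_1$ requires only the solution of a 2 x 2 system with the diagonal of $N_1$ inflated by the factor $(1 + \lambda_1)$. A minimal sketch, with a function name chosen for illustration:

```python
def marquardt_step(n, q, lam):
    """Solve [N with diagonal scaled by (1 + lam)] v = -q for a symmetric 2 x 2 matrix N."""
    a11 = n[0][0] * (1.0 + lam)
    a22 = n[1][1] * (1.0 + lam)
    a12 = n[0][1]
    det = a11 * a22 - a12 * a12
    # Inverse of [[a11, a12], [a12, a22]] is [[a22, -a12], [-a12, a11]] / det.
    return [-(a22 * q[0] - a12 * q[1]) / det,
            -(a11 * q[1] - a12 * q[0]) / det]

N1 = [[0.7036033e-7, -0.2354773e-7], [-0.2354773e-7, 0.07896382e-7]]
Q1 = [-0.0007098080, 0.0002442936]
THETA_1 = [100.0, 2000.0]

steps = {lam: marquardt_step(N1, Q1, lam) for lam in (1.0, 100.0)}
theta_2 = [t + v for t, v in zip(THETA_1, steps[100.0])]
```

As $\lambda_1$ grows the step shortens and rotates from the Gauss direction toward a scaled negative gradient, which is why the larger value yields a usable point.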
If the initial guesses for the parameters are much too large, $f$ or its
derivatives vanish, and the process does not converge. We can use the linearization
method to obtain good initial guesses. For this purpose, we observe that
Eq. (5) is equivalent to

$$\log(-\log y) = \log\theta_1 - \theta_2/x_2 + \log x_1 \tag{5-21-12}$$

or

$$y^+ = \theta_1^+ + \theta_2^+ x^+ \tag{5-21-13}$$

where

$$y^+ \equiv \log(-\log y) - \log x_1, \qquad x^+ \equiv -1/x_2, \qquad \theta_1^+ \equiv \log\theta_1, \qquad \theta_2^+ \equiv \theta_2 \tag{5-21-14}$$

We now have a model linear in the parameters $\theta_1^+$ and $\theta_2^+$, which may be
estimated by linear least squares as

$$\theta_1^+ = 6.643963, \qquad \theta_2^+ = 928.6492$$

corresponding to

$$\theta_1 = \exp\theta_1^+ = 768.1331, \qquad \theta_2 = \theta_2^+ = 928.6492$$

These are obviously good initial guesses for estimating $\theta$ by the nonlinear
least squares procedure.
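The linearized fit can be reproduced from the data of Table 5-2 with an ordinary straight-line regression. In this sketch the term $\log x_1$ is moved to the left-hand side before fitting; the function name is an illustrative choice.

```python
import math

# Data of Table 5-2: time (hr), temperature, fraction remaining.
X1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.05, 0.1, 0.15, 0.2, 0.25, 0.02, 0.04, 0.06, 0.08, 0.1]
X2 = [100.0] * 5 + [200.0] * 5 + [300.0] * 5
Y = [0.980, 0.983, 0.955, 0.979, 0.993, 0.626, 0.544, 0.455, 0.225, 0.167,
     0.566, 0.317, 0.034, 0.016, 0.066]

def linearized_guess(x1, x2, y):
    """Fit log(-log y) - log x1 = log(theta_1) + theta_2 * (-1 / x2) by least squares."""
    xs = [-1.0 / temperature for temperature in x2]
    ys = [math.log(-math.log(obs)) - math.log(time) for obs, time in zip(y, x1)]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((a - x_mean) * (b - y_mean) for a, b in zip(xs, ys))
             / sum((a - x_mean) ** 2 for a in xs))
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), slope

theta_1_guess, theta_2_guess = linearized_guess(X1, X2, Y)
```

The transformation requires every observation to lie strictly between 0 and 1, which holds for this data set but would need checking in general.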
5-22. Adding Prior Information
Assume that prior to having obtained the data of Table 5-2 we had some
knowledge concerning the values of the parameters. Let us suppose that this
knowledge could be summarized in the following equations

$$\theta_1 = 1000 \pm 200, \qquad \theta_2 = 1000 \pm 200 \tag{5-22-1}$$

The quantity 200 is meant to represent the standard deviation of the
distributions of $\theta_1$ and $\theta_2$. Let us elect to assign to $\theta$ a normal prior distribution,
so that apart from an additive constant

$$\log p_0(\theta) = -\tfrac{1}{2}[(1/200^2)(\theta_1 - 1000)^2 + (1/200^2)(\theta_2 - 1000)^2] \tag{5-22-2}$$

Assuming that the observation errors in $y$ are also normal with unknown
variance $v$, then the log likelihood is, apart from an additive constant,

$$\log L(\theta) = -(15/2)\log v - (1/2v)\sum_{\mu=1}^{15} [y_\mu - f_\mu(\theta)]^2 \tag{5-22-3}$$

and the log posterior distribution is the sum of the two. As in Section 4-8,
we can eliminate $v$ and form the concentrated log posterior distribution,
which with sign reversed, reduces to the following objective function to be
minimized

$$\Phi(\theta) = (15/2)\log S(\theta) + (1/80{,}000)[(\theta_1 - 1000)^2 + (\theta_2 - 1000)^2] \tag{5-22-4}$$

where

$$S(\theta) \equiv \sum_{\mu=1}^{15} [y_\mu - f_\mu(\theta)]^2 = \sum_{\mu=1}^{15} e_\mu^2(\theta) \tag{5-22-5}$$

Hence

$$q = \begin{bmatrix} -\dfrac{15}{S(\theta)}\displaystyle\sum_{\mu=1}^{15} e_\mu \dfrac{\partial f_\mu}{\partial\theta_1} + \dfrac{1}{40{,}000}(\theta_1 - 1000) \\[2ex] -\dfrac{15}{S(\theta)}\displaystyle\sum_{\mu=1}^{15} e_\mu \dfrac{\partial f_\mu}{\partial\theta_2} + \dfrac{1}{40{,}000}(\theta_2 - 1000) \end{bmatrix} \tag{5-22-6}$$

$$N = \begin{bmatrix} \dfrac{15}{S(\theta)}\displaystyle\sum_{\mu=1}^{15} \left(\dfrac{\partial f_\mu}{\partial\theta_1}\right)^2 + \dfrac{1}{40{,}000} & \dfrac{15}{S(\theta)}\displaystyle\sum_{\mu=1}^{15} \dfrac{\partial f_\mu}{\partial\theta_1}\dfrac{\partial f_\mu}{\partial\theta_2} \\[2ex] \dfrac{15}{S(\theta)}\displaystyle\sum_{\mu=1}^{15} \dfrac{\partial f_\mu}{\partial\theta_1}\dfrac{\partial f_\mu}{\partial\theta_2} & \dfrac{15}{S(\theta)}\displaystyle\sum_{\mu=1}^{15} \left(\dfrac{\partial f_\mu}{\partial\theta_2}\right)^2 + \dfrac{1}{40{,}000} \end{bmatrix} \tag{5-22-7}$$
With this prior distribution it is natural to start with the initial guess
$\theta_1 = [1000, 1000]^T$. From here, the Marquardt method converges in three
iterations to

$$\theta^* = \begin{bmatrix} 929.7134 \\ 990.8511 \end{bmatrix} \qquad (\sigma_\theta = 200)$$

When the standard deviations of $\theta_1$ and $\theta_2$ are assumed to be 100 instead of
200, our estimates turn out to be

$$\theta^* = \begin{bmatrix} 976.2349 \\ 1000.1695 \end{bmatrix} \qquad (\sigma_\theta = 100)$$

The solution of Section 5-21 may be regarded as corresponding to a prior
distribution with infinite standard deviations. We recall that the result was

$$\theta^* = \begin{bmatrix} 813.4583 \\ 960.9063 \end{bmatrix} \qquad (\sigma_\theta = \infty)$$

Observe how the solution progressively approaches the mode of the prior
distribution $[1000, 1000]^T$ as the variance (i.e., uncertainty) of our prior
information decreases.
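The objective function of Eq. (5-22-4) is simply the concentrated sum-of-squares term plus a quadratic penalty from the prior. The sketch below evaluates both pieces for the data of Table 5-2; it does not perform the minimization, and the function names are illustrative.

```python
import math

# Data of Table 5-2: time (hr), temperature, fraction remaining.
X1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.05, 0.1, 0.15, 0.2, 0.25, 0.02, 0.04, 0.06, 0.08, 0.1]
X2 = [100.0] * 5 + [200.0] * 5 + [300.0] * 5
Y = [0.980, 0.983, 0.955, 0.979, 0.993, 0.626, 0.544, 0.455, 0.225, 0.167,
     0.566, 0.317, 0.034, 0.016, 0.066]

def sum_of_squares(theta):
    """S(theta) of Eq. (5-22-5) for the model of Eq. (5-21-5)."""
    total = 0.0
    for x1, x2, y in zip(X1, X2, Y):
        f = math.exp(-theta[0] * x1 * math.exp(-theta[1] / x2))
        total += (y - f) ** 2
    return total

def prior_penalty(theta, mean=(1000.0, 1000.0), sd=200.0):
    """Negative log of the normal prior, apart from an additive constant."""
    return sum((t - m) ** 2 for t, m in zip(theta, mean)) / (2.0 * sd ** 2)

def posterior_objective(theta, sd=200.0):
    """Eq. (5-22-4): (n/2) log S(theta) plus the prior penalty."""
    return 0.5 * len(Y) * math.log(sum_of_squares(theta)) + prior_penalty(theta, sd=sd)

s_at_least_squares = sum_of_squares([813.4583, 960.9063])
penalty_at_least_squares = prior_penalty([813.4583, 960.9063])
```

A smaller `sd` makes the penalty steeper, which is the mechanism by which the estimates are pulled toward $[1000, 1000]^T$.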
5-23. A Two-Equation Maximum Likelihood Problem
We take a two-equation econometric model which was used by Bodkin
and Klein (1967) to fit U.S. production data for the years 1909-1949. The
model is based on the constant elasticity of substitution (CES) theory of
production, and it takes the form

$$g_1 = c_1 10^{c_2 z_4}[c_5 z_1^{-c_4} + (1 - c_5)z_2^{-c_4}]^{-c_3/c_4} - z_3 = 0$$
$$g_2 = [c_5/(1 - c_5)](z_1/z_2)^{-1-c_4} - z_5 = 0 \tag{5-23-1}$$

where $z_1$ is capital input, $z_2$ is labor input, $z_3$ is real output, $z_4$ is time (in
years; 1929 taken as origin), $z_5$ is ratio of price of capital services to wage
scale, and $c_1, c_2, c_3, c_4, c_5$ are unknown parameters.
The data, in the form of yearly values of $z_\mu$ for $\mu = 1, 2, \ldots, 41$, are given
in Table 5-6. Of the variables involved, $z_1$ and $z_2$ are considered dependent
(endogenous) whereas $z_3$, $z_4$ and $z_5$ are independent (exogenous). The
treatment given by Bodkin and Klein is the standard one in econometrics, i.e.,
the distribution of the measured values of $z_1$ and $z_2$ is such as to give rise to
normally distributed errors in Eq. (1). The likelihood is formed as in Eq.
(2-13-6). The details of those calculations are given in Eisenpress et al. (1966)
and Eisenpress and Greenstadt (1966).
For illustrative purposes, we shall adopt a different approach here. We
note that the model equations can be solved explicitly for the dependent
variables to give the reduced form equations

$$z_1 = A z_5^{-1/(1+c_4)}, \qquad z_2 = A[(1 - c_5)/c_5]^{1/(1+c_4)} \tag{5-23-2}$$

where

$$A = (z_3/c_1 10^{c_2 z_4})^{1/c_3}\{c_5[((1 - c_5)/c_5)^{1/(1+c_4)} + z_5^{c_4/(1+c_4)}]\}^{1/c_4} \tag{5-23-3}$$

To cast these equations into more tractable form we introduce the following
new variables

$$y_1 \equiv z_1, \qquad y_2 \equiv z_2, \qquad x_1 \equiv z_4, \qquad x_2 \equiv \log z_3, \qquad x_3 \equiv \log z_5 \tag{5-23-4}$$

and we reparametrize the problem by defining

$$\theta_1 \equiv (1/c_4)\log c_5 - (1/c_3)\log c_1, \qquad \theta_2 \equiv -(c_2/c_3)\log 10$$
$$\theta_3 \equiv 1/c_3, \qquad \theta_4 \equiv 1/c_4, \qquad \theta_5 \equiv [(1 - c_5)/c_5]^{1/(1+c_4)} \tag{5-23-5}$$
Table 5-6
U.S. Production Data^a

 μ     z_μ1       z_μ2       z_μ3      z_μ4     z_μ5
 1     1.33135    0.64619    0.4026    -20      0.24447
 2     1.39235    0.66302    0.4084    -19      0.23454
 3     1.41640    0.65172    0.4223    -18      0.23206
 4     1.48773    0.67318    0.4389    -17      0.22191
 5     1.51015    0.67720    0.4605    -16      0.22487
 6     1.43385    0.65175    0.4445    -15      0.21879
 7     1.48188    0.65570    0.4387    -14      0.23203
 8     1.67115    0.71417    0.4999    -13      0.23828
 9     1.71327    0.77524    0.5264    -12      0.26571
10     1.76412    0.79465    0.5793    -11      0.23410
11     1.76869    0.71607    0.5491    -10      0.22181
12     1.80776    0.70068    0.5052     -9      0.18157
13     1.54947    0.60764    0.4679     -8      0.22931
14     1.66933    0.67041    0.5283     -7      0.20595
15     1.93377    0.74091    0.5994     -6      0.19472
16     1.95460    0.71336    0.5964     -5      0.17981
17     2.11198    0.75159    0.6554     -4      0.18010
18     2.26266    0.78838    0.6851     -3      0.16933
19     2.33128    0.79600    0.6933     -2      0.16279
20     2.43980    0.80788    0.7061     -1      0.16906
21     2.58714    0.84547    0.7567      0      0.16239
22     1.54865    0.77232    0.6796      1      0.16103
23     2.26042    0.67880    0.6136      2      0.14456
24     1.91974    0.58519    0.5145      3      0.20079
25     1.80000    0.58065    0.5046      4      0.18307
26     1.86020    0.62007    0.5711      5      0.18352
27     1.88201    0.65575    0.6184      6      0.18847
28     1.97018    0.72433    0.7113      7      0.20415
29     2.08132    0.76838    0.7461      8      0.19006
30     1.94062    0.69806    0.6981      9      0.17800
31     1.98646    0.74679    0.7722     10      0.19979
32     2.07987    0.79083    0.8557     11      0.21115
33     2.18232    0.88461    0.9925     12      0.23453
34     1.52779    0.95750    1.0877     13      0.20937
35     2.62747    1.00285    1.1834     14      0.19843
36     2.61235    0.99329    1.2565     15      0.18898
37     2.52320    0.94857    1.2293     16      0.17203
38     2.44632    0.97853    1.1889     17      0.18140
39     2.56478    1.02591    1.2249     18      0.19431
40     2.64588    1.03760    1.2669     19      0.19492
41     2.69105    0.99669    1.2708     20      0.17912

^a Data adapted from Solow (1957).
Our reduced equations now take the form

$$y_1 = f_1(x, \theta) = \exp[a - \theta_4 x_3/(1 + \theta_4)], \qquad y_2 = f_2(x, \theta) = \exp(a + \log\theta_5) \tag{5-23-6}$$

where

$$a \equiv \theta_1 + \theta_2 x_1 + \theta_3 x_2 + \theta_4 \log\{\theta_5 + \exp[x_3/(1 + \theta_4)]\} \tag{5-23-7}$$

We shall solve the problem in terms of the $\theta$, and then convert the answer into
$c$ using the inverse transformations

$$c_1 = [1 + \theta_5^{(1+\theta_4)/\theta_4}]^{-\theta_4/\theta_3}\exp(-\theta_1/\theta_3), \qquad c_2 = -\theta_2/(\theta_3 \log 10)$$
$$c_3 = 1/\theta_3, \qquad c_4 = 1/\theta_4, \qquad c_5 = 1/[1 + \theta_5^{(1+\theta_4)/\theta_4}] \tag{5-23-8}$$
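The reparametrization can be checked numerically: values of $y_1$ and $y_2$ computed from the reduced equations should satisfy the structural equations (5-23-1), and the inverse transformations should recover $c$. The parameter values and the single data point below are hypothetical, chosen only to exercise the formulas.

```python
import math

def c_to_theta(c):
    """Eq. (5-23-5)."""
    c1, c2, c3, c4, c5 = c
    return [math.log(c5) / c4 - math.log(c1) / c3,
            -(c2 / c3) * math.log(10.0),
            1.0 / c3,
            1.0 / c4,
            ((1.0 - c5) / c5) ** (1.0 / (1.0 + c4))]

def theta_to_c(theta):
    """Eq. (5-23-8)."""
    t1, t2, t3, t4, t5 = theta
    base = 1.0 + t5 ** ((1.0 + t4) / t4)
    return [base ** (-t4 / t3) * math.exp(-t1 / t3),
            -t2 / (t3 * math.log(10.0)),
            1.0 / t3,
            1.0 / t4,
            1.0 / base]

def reduced_model(theta, z3, z4, z5):
    """Eqs. (5-23-6) and (5-23-7), with x1 = z4, x2 = log z3, x3 = log z5."""
    t1, t2, t3, t4, t5 = theta
    x1, x2, x3 = z4, math.log(z3), math.log(z5)
    a = t1 + t2 * x1 + t3 * x2 + t4 * math.log(t5 + math.exp(x3 / (1.0 + t4)))
    return math.exp(a - t4 * x3 / (1.0 + t4)), math.exp(a + math.log(t5))

def structural_residuals(c, z1, z2, z3, z4, z5):
    """Left-hand sides g1 and g2 of Eq. (5-23-1)."""
    c1, c2, c3, c4, c5 = c
    g1 = (c1 * 10.0 ** (c2 * z4)
          * (c5 * z1 ** (-c4) + (1.0 - c5) * z2 ** (-c4)) ** (-c3 / c4) - z3)
    g2 = (c5 / (1.0 - c5)) * (z1 / z2) ** (-1.0 - c4) - z5
    return g1, g2

c_values = [1.2, 0.005, 0.9, 0.5, 0.4]     # hypothetical parameter values
z3, z4, z5 = 0.6, 3.0, 0.2                 # hypothetical exogenous data
theta_values = c_to_theta(c_values)
y1, y2 = reduced_model(theta_values, z3, z4, z5)
g1, g2 = structural_residuals(c_values, y1, y2, z3, z4, z5)
c_recovered = theta_to_c(theta_values)
```

Both residuals vanish to rounding error, which confirms that the reduced equations are an explicit solution of the structural model for these values.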
Let us formulate some alternative likelihood functions to be maximized.
Assuming the errors in the reduced equations to be normally distributed,
independently for each year and with covariance matrix $V$, we have (apart
from irrelevant constants)

$$\log L(\theta, V) = -(n/2)\log\det V - \tfrac{1}{2}\sum_{\mu=1}^{41} (y_\mu - f_\mu)^T V^{-1}(y_\mu - f_\mu) \tag{5-23-9}$$

We examine the following cases:
(a) Unknown $V$. The concentrated likelihood is equivalent to the objective
function Eq. (4-9-9)

$$\Phi(\theta) = (n/2)\log\det M = (41/2)\log\left\{\sum_{\mu=1}^{41}(y_{\mu 1} - f_{\mu 1})^2 \sum_{\mu=1}^{41}(y_{\mu 2} - f_{\mu 2})^2 - \left[\sum_{\mu=1}^{41}(y_{\mu 1} - f_{\mu 1})(y_{\mu 2} - f_{\mu 2})\right]^2\right\} \tag{5-23-10}$$

(b) Unknown diagonal $V$. The objective function is given by Eq. (4-8-4)

$$\Phi(\theta) = (n/2)\sum_{a=1}^{2}\log M_{aa}(\theta) = (41/2)\log\left[\sum_{\mu=1}^{41}(y_{\mu 1} - f_{\mu 1})^2 \sum_{\mu=1}^{41}(y_{\mu 2} - f_{\mu 2})^2\right] \tag{5-23-11}$$

(c) Covariance matrix proportional to $Q \equiv \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$, i.e., the errors in $y_1$ are
assumed twice as large as, and independent of, the errors in $y_2$. The relevant
objective function is given by Eq. (4-21-1)

$$\Phi(\theta) = (nm/2)\log\mathrm{Tr}(Q^{-1}M) = (41 \times 2/2)\log\left[\tfrac{1}{4}\sum_{\mu=1}^{41}(y_{\mu 1} - f_{\mu 1})^2 + \sum_{\mu=1}^{41}(y_{\mu 2} - f_{\mu 2})^2\right] \tag{5-23-12}$$
136
',I
.
V Computation of the Estimates I: Unconstrained Problems ';i
(d) As in (a) above, but 10g)'1 and log)'2 are the dependent variables. The
objective function has the same form, with log )'''1' - log/;w replacingy l ", - .010'
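The objective functions of cases (a) to (d) differ only in how the sums of squares and cross products of the residuals are combined. The sketch below assumes two equations and takes the residuals as plain lists; the residual values are made up for illustration, and case (d) is obtained by passing residuals of the logarithms.

```python
import math

def moments(res1, res2):
    """Sums of squares and cross products of the residuals of the two equations."""
    m11 = sum(e * e for e in res1)
    m22 = sum(e * e for e in res2)
    m12 = sum(e1 * e2 for e1, e2 in zip(res1, res2))
    return m11, m12, m22

def objective_unknown_covariance(res1, res2):
    """Cases (a) and (d): (n/2) log det M."""
    m11, m12, m22 = moments(res1, res2)
    return 0.5 * len(res1) * math.log(m11 * m22 - m12 * m12)

def objective_unknown_diagonal(res1, res2):
    """Case (b): (n/2) * sum of log M_aa."""
    m11, _, m22 = moments(res1, res2)
    return 0.5 * len(res1) * (math.log(m11) + math.log(m22))

def objective_known_ratio(res1, res2, q_diagonal=(4.0, 1.0)):
    """Case (c): (n m / 2) log Tr(Q^{-1} M) with diagonal Q and m = 2 equations."""
    m11, _, m22 = moments(res1, res2)
    return len(res1) * math.log(m11 / q_diagonal[0] + m22 / q_diagonal[1])

# Made-up residuals for three observations, used only to exercise the formulas.
residuals_1 = [0.2, -0.1, 0.1]
residuals_2 = [0.1, 0.1, -0.2]
phi_a = objective_unknown_covariance(residuals_1, residuals_2)
phi_b = objective_unknown_diagonal(residuals_1, residuals_2)
phi_c = objective_known_ratio(residuals_1, residuals_2)
```

Because $\det M$ can never exceed the product of the diagonal elements, the value for case (a) is never larger than that for case (b) on the same residuals.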
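All these objective functions have the Gauss form, and the approximate
Hessian has the form $N = 2\sum_\mu B_\mu^T \Gamma B_\mu$. Let us examine case (d) first. Here
$B_\mu = \partial\log f_\mu/\partial\theta$.

Table 5-7
Elements of $B_\mu$, Case (d)^a: $B_{\mu a\alpha} = \partial\log f_{\mu a}/\partial\theta_\alpha$

α = 1:   a = 1:  $1$
         a = 2:  $1$
α = 2:   a = 1:  $x_{\mu 1}$
         a = 2:  $x_{\mu 1}$
α = 3:   a = 1:  $x_{\mu 2}$
         a = 2:  $x_{\mu 2}$
α = 4:   a = 1:  $\log\left\{\theta_5 + \exp\dfrac{x_{\mu 3}}{1 + \theta_4}\right\} - \dfrac{x_{\mu 3}\,\theta_4 \exp[x_{\mu 3}/(1 + \theta_4)]}{(1 + \theta_4)^2\{\theta_5 + \exp[x_{\mu 3}/(1 + \theta_4)]\}} - \dfrac{x_{\mu 3}}{(1 + \theta_4)^2}$
         a = 2:  $\log\left\{\theta_5 + \exp\dfrac{x_{\mu 3}}{1 + \theta_4}\right\} - \dfrac{x_{\mu 3}\,\theta_4 \exp[x_{\mu 3}/(1 + \theta_4)]}{(1 + \theta_4)^2\{\theta_5 + \exp[x_{\mu 3}/(1 + \theta_4)]\}}$
α = 5:   a = 1:  $\dfrac{\theta_4}{\theta_5 + \exp[x_{\mu 3}/(1 + \theta_4)]}$
         a = 2:  $\dfrac{\theta_4}{\theta_5 + \exp[x_{\mu 3}/(1 + \theta_4)]} + \dfrac{1}{\theta_5}$

^a In cases (a), (b), and (c) multiply table entries by $f_{\mu a}$.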
The elements of $\mathbf{B}_\mu$ are given in Table 5-7. Referring to the third row of Table 5-1, we find that
$$\boldsymbol{\Gamma} = \frac{41}{2} \begin{bmatrix} \sum_{\mu=1}^{41} (\log y_{\mu 1} - \log f_{\mu 1})^2 & \sum_{\mu=1}^{41} (\log y_{\mu 1} - \log f_{\mu 1})(\log y_{\mu 2} - \log f_{\mu 2}) \\ \sum_{\mu=1}^{41} (\log y_{\mu 1} - \log f_{\mu 1})(\log y_{\mu 2} - \log f_{\mu 2}) & \sum_{\mu=1}^{41} (\log y_{\mu 2} - \log f_{\mu 2})^2 \end{bmatrix}^{-1} \tag{5-23-13}$$
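In cases (a), (b), and (c) we obtain $B_{\mu a \alpha}$ by multiplying the corresponding entry in Table 5-7 by $f_{\mu a}$, since $\partial f_{\mu a}/\partial \theta_\alpha = f_{\mu a}\, \partial \log f_{\mu a}/\partial \theta_\alpha$. The expression for $\boldsymbol{\Gamma}$ as given in Table 5-1 comes out to be:
Case (a)
$$\boldsymbol{\Gamma} = \frac{41}{2} \begin{bmatrix} \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 & \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})(y_{\mu 2} - f_{\mu 2}) \\ \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})(y_{\mu 2} - f_{\mu 2}) & \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2 \end{bmatrix}^{-1} \tag{5-23-14}$$
Case (b)
$$\boldsymbol{\Gamma} = \frac{41}{2} \begin{bmatrix} \left[ \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 \right]^{-1} & 0 \\ 0 & \left[ \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2 \right]^{-1} \end{bmatrix} \tag{5-23-15}$$
Case (c)
$$\boldsymbol{\Gamma} = \frac{41}{\tfrac{1}{4} \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 + \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2} \begin{bmatrix} \tfrac{1}{4} & 0 \\ 0 & 1 \end{bmatrix} \tag{5-23-16}$$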
We shall omit the details of the calculation. The results for cases (a)-(d) are reported in Table 5-8 in the form of the final estimates $\boldsymbol{\theta}^*$ and the minimum objective function values $\Phi^*$. In Table 5-9 the results are given in terms of the original variables $\mathbf{c}^*$. Also reported are the results of Bodkin and Klein (1967), who used an objective function of the form Eq. (2-13-6).

Table 5-8
Results of Parameter Estimation for Production Theory Problem (Estimates of θ and Minimum of Objective Function)

Case    θ1*          θ2*           θ3*        θ4*       θ5*        Φ*
(a)     -0.0758463   -0.0115747    0.790686   1.00224   0.859255   -82.71488
(a)ᵃ    -4.27288     -0.00882702   0.713410   4.85440   1.47252    -76.70853
(b)     -0.155586    -0.00978696   0.737126   1.05720   0.878024   -79.1353
(b)ᵃ    -4.27489     -0.00834795   0.696653   4.85221   1.47180    -75.8870
(c)     -4.76163     -0.00569320   0.600316   5.27596   1.49747    -66.5601
(c)ᵃ    -2.45064     -0.00556030   0.598743   3.13535   1.30821    -66.0913
(d)     -0.0409260   -0.0118384    0.802121   0.96870   0.850246   -99.3714
(d)ᵃ    -3.54201     -0.00882492   0.724913   4.18994   1.41740    -95.2669

ᵃ Local minimum of objective function.
Table 5-9
Results of Parameter Estimation for Production Theory Problem

Case    c1*      c2*       c3*     c4*     c5*
(a)     0.5460   0.00636   1.265   0.998   0.5752
(a)ᵃ    0.6074   0.00537   1.402   0.206   0.3854
(b)     0.5417   0.00577   1.357   0.946   0.5629
(b)ᵃ    0.6051   0.00520   1.435   0.206   0.3855
(c)     0.5935   0.00412   1.666   0.189   0.3822
(c)ᵃ    0.5791   0.00403   1.670   0.319   0.4123
(d)     0.5473   0.00640   1.247   1.027   0.5806
(d)ᵃ    0.6049   0.00529   1.379   0.239   0.3936
(e)     0.5839   0.00589   1.362   0.475   0.4471
(f)     0.5410   0.00643   1.238   1.130   0.6037

ᵃ Local minimum of objective function.
(e) The model equations take the form of Eq. (1).
(f) The model equations are written as
$$g_1 \equiv \log c_1 + c_2 z_4 \log 10 - (c_3/c_4) \log\left[c_5 z_1^{-c_4} + (1 - c_5) z_2^{-c_4}\right] - \log z_3 = 0$$
$$g_2 \equiv \log c_5 - \log(1 - c_5) - (1 + c_4)(\log z_1 - \log z_2) - \log z_5 = 0$$
It is revealing to find such discrepancies in the estimates (particularly of $c_4$), since in all cases the same model equations were fitted to the same data, the only difference being in the assumptions concerning the distribution of errors. We must defer further discussion of this problem to the end of Chapter VII.
Matters are further complicated by the question of convergence. In attempting to solve these problems using various algorithms and starting values, we found that in each case there exists at least one local minimum of the objective function other than the global minimum. These local minima are also recorded in Tables 5-8 and 5-9. The other minima are, we think, global, but we have no way of proving that this is so. The performance of various algorithms is summarized in Table 5-10.
Table 5-10
Convergence of Various Algorithms for Problem 2
(Cells that are illegible in the source are left blank.)

Case   Starting   Algorithmᵇ   Convergenceᶜ   Iterations   Objective function evaluations
       pointᵃ                                              for interpolation-extrapolation
(a)               1            1              15           28
                  2            1              27           55
                  3            1              28           56
                  4            1              12           16
                  5            1              35           73
                  6            1              36           350
       2          3            2              28           56
       3                                      15           32
(b)    1          3            2              8            15
       4          3                           8            17
(c)    4          3            2              15           27
(d)               1            1              38           78
                  2            2              8            15
                  3            2              8            15
                  4                           10           15

ᵃ Starting points:
1. [0, 0, 0, 0, 1]
2. [4.27489, -0.00834795, 0.696653, 4.85221, 1.47180]
3. [-1.29970, -0.00995700, 0.734317, 2.10535, 1.15490]
4. [-0.0758463, -0.0115747, 0.790686, 1.00224, 0.859255]
ᵇ Algorithms:
1. Gauss, directional discrimination, using Eq. (5-7-9).
2. Gauss, directional discrimination, using Eq. (5-7-10), ε = 10⁻⁵, α = 1.
3. Gauss, directional discrimination, using Eq. (5-7-11), ε = 10⁻⁵, β = 1.
4. Gauss, Marquardt.
5. Variable metric, ROC.
6. Variable metric, DFP.
ᶜ Convergence:
1. To best known solution.
2. To a local minimum.
5-24. Problems
1. Verify that the Gauss method is invariant under reparametrization; i.e., the sequence of iterations is the same whether we use the original set of variables $\boldsymbol{\theta}$ or a transformed set $\mathbf{c} = \mathbf{c}(\boldsymbol{\theta})$, provided only that $\mathbf{c}_1 = \mathbf{c}(\boldsymbol{\theta}_1)$, and that $\mathbf{c}$ is a linear function of $\boldsymbol{\theta}$.
2. Determine what conditions must be fulfilled for the Marquardt and variable metric methods to be invariant under reparametrization.
3. Suppose $\Phi(\boldsymbol{\theta}) = \theta_1^2 + 100000\,\theta_2^2$. Let $\boldsymbol{\theta}_1^T = [100, 1]$. Compare the progress towards the minimum of $\Phi$ that can be made in a single iteration of the steepest descent and Newton methods.
4. For the objective function of Problem 3, find initial guesses for which the upper and lower limits of Eq. (5-6-5) are satisfied. Show that these limits cannot be violated.
5. Devise an algorithm similar to the Gauss method for finding $\boldsymbol{\theta}$ and X to minimize the objective function of Eq. (4-10-12).
Chapter VI
Computation of the Estimates II: Problems with Constraints

A. Inequality Constraints

6-1. Penalty Functions
Inequality constraints of the form Eq. (5-1-1) limit the domain of parameter values within which the estimate is to be found. They often arise from prior information concerning the values of the parameters (see Section 2-16). The presence of inequality constraints, particularly in the form of upper and lower bounds on each parameter, often exerts a beneficial influence on the convergence of an optimization algorithm. In quite a few problems, convergence to a correct minimum results from imposition of somewhat arbitrary bounds, without which the algorithms bog down in irrelevant regions of parameter space. We would go so far as to recommend imposition of generous, though not unreasonably so, bounds in all nonlinear parameter estimation problems.
We possess several powerful algorithms for unconstrained optimization and would like to apply the same algorithms to the constrained problems. We need to modify the objective function in such a way that it remains almost unchanged well in the interior of the feasible region, but increases drastically as one approaches the constraints. To accomplish this, we assign a penalty function to each inequality constraint. This function is nearly zero when the constraint function is strongly positive, but increases sharply as the constraint function approaches zero from above. To the constraint
$$h_j(\boldsymbol{\theta}) \geq 0 \tag{6-1-1}$$
we assign, following Carroll (1961), the penalty function
$$\zeta_j(\boldsymbol{\theta}) \equiv \alpha_j / h_j(\boldsymbol{\theta}) \tag{6-1-2}$$
where $\alpha_j$ is a small positive constant. We now modify the objective function by adding to it the penalty functions for all the constraints†
$$\Phi^\dagger(\boldsymbol{\theta}) = \Phi(\boldsymbol{\theta}) + \sum_j \alpha_j / h_j(\boldsymbol{\theta}) \tag{6-1-3}$$
Let $\boldsymbol{\theta}^\dagger$ and $\boldsymbol{\theta}^*$ be the points at which $\Phi^\dagger$ and $\Phi$ attain their respective minima within the feasible region. Fiacco and McCormick (1964) have proven that under suitable conditions
$$\lim_{\alpha_j \to 0} \boldsymbol{\theta}^\dagger = \boldsymbol{\theta}^* \tag{6-1-4}$$
These concepts are employed in SUMT (sequential unconstrained minimization technique), originally presented in Carroll's paper but later amplified by Fiacco and McCormick:
1. Select the $\alpha_j$ and a feasible initial guess $\boldsymbol{\theta}_1$.
2. Find $\boldsymbol{\theta}^\dagger$ using one of the unconstrained optimization methods.
3. Reduce the values of the $\alpha_j$, and return to step 2, using $\boldsymbol{\theta}^\dagger$ as the initial guess. The process is continued until $\boldsymbol{\theta}^\dagger$ does not change significantly upon reducing the $\alpha_j$. Then we accept $\boldsymbol{\theta}^\dagger$ as our estimate of $\boldsymbol{\theta}^*$.
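The three steps can be sketched for a single parameter and a single constraint $h(\theta) \geq 0$. This is an illustration under stated assumptions, not the SUMT program itself: the function name `sumt_1d` is hypothetical, a damped Newton iteration stands in for "one of the unconstrained optimization methods", and the test problem $\Phi = (\theta - 2)^2$ with $\theta \leq 1$ is invented for the purpose. Its constrained minimum lies on the boundary, so the iterates approach $\theta^* = 1$ from the interior as $\alpha$ is reduced.

```python
def sumt_1d(phi, dphi, d2phi, h, dh, theta, alpha=1.0, shrink=0.1, rounds=8):
    """Sketch of SUMT for one parameter; theta must start strictly feasible."""

    def phi_pen(t, a):
        return phi(t) + a / h(t)                  # penalized objective

    history = []
    for _ in range(rounds):
        # Step 2: minimize the penalized objective for the current alpha.
        for _ in range(200):
            grad = dphi(theta) - alpha * dh(theta) / h(theta) ** 2
            curv = d2phi(theta) + 2.0 * alpha * dh(theta) ** 2 / h(theta) ** 3
            step = -grad / curv
            rho = 1.0
            # Halve the step until it is strictly feasible and not uphill.
            while rho > 1e-12:
                trial = theta + rho * step
                if h(trial) > 0.0 and phi_pen(trial, alpha) <= phi_pen(theta, alpha):
                    break
                rho *= 0.5
            else:
                break                             # no acceptable step found
            moved = abs(trial - theta)
            theta = trial
            if moved < 1e-12:
                break
        history.append((alpha, theta))
        alpha *= shrink                           # Step 3: reduce alpha
    return theta, history


# Hypothetical test problem: minimize (theta - 2)^2 subject to 1 - theta >= 0.
theta_hat, history = sumt_1d(
    phi=lambda t: (t - 2.0) ** 2,
    dphi=lambda t: 2.0 * (t - 2.0),
    d2phi=lambda t: 2.0,
    h=lambda t: 1.0 - t,
    dh=lambda t: -1.0,
    theta=0.0,
)
```

Each entry of `history` pairs a value of $\alpha$ with the corresponding minimizer, so the approach to the boundary can be inspected directly.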
The search for the minimum of $\Phi^\dagger$ in step 2 must still be confined to the feasible region. It may appear, therefore, that nothing has been gained. Why not minimize $\Phi$ directly? The answer is that we have created a situation where the objective function always starts increasing before one has a chance to leave the feasible region. Therefore, the procedures for determining step length always succeed in producing an acceptable feasible step. If we happen to be near a constraint, it is quite possible for a minimization method when applied to $\Phi$ to direct one towards the infeasible region, even though there exist feasible directions in which the function decreases. By adding the penalty functions we deflect our step to a feasible direction. The point is illustrated in Fig. 6-1 where the contours of an objective function are drawn. The minimum occurs at point A. Starting out at point B, the steepest descent procedure carries us along the path BCDA to the minimum. If the feasible region is constrained to lie to the left of the line FG, the path is blocked at point C. Introducing a penalty function (Fig. 6-2) leaves the contours around A almost undisturbed, but distorts them near the constraint in the manner shown. We now have a feasible path BHIA to the minimum. Although the example is given in terms of steepest descent, it applies equally well to other minimization methods. This example illustrates the important point that the path from a feasible starting point to a feasible minimum may pass through infeasible territory.

† Alternatively, we may use $-\alpha \sum_j \log h_j(\boldsymbol{\theta})$. This has the benefit of being unaffected by scaling the functions $h_j(\boldsymbol{\theta})$.
‡ The details of the hemstitching near the minimum were omitted from the Figures.

Fig. 6-1. Contours of $\Phi$; minimization without penalty functions.
If matters were always as simple as in our illustration, then we could simply do away with the constraint altogether. In practice it may happen, however, that there are local minima in the infeasible region, or that it is impossible even to compute the value of the function for infeasible parameter values (this occurs frequently with dynamic systems whose differential equations become unstable). Hence the importance of creating paths that lie entirely within the feasible region.
Fig. 6-2. Contours of $\Phi^\dagger$; minimization with penalty functions.

Suppose $\boldsymbol{\theta}^\dagger$ is well in the interior of the feasible region. This is recognized to be the case when $\sum_j \alpha_j / h_j(\boldsymbol{\theta}^\dagger)$ is very small. Then the minima with and without penalty functions nearly coincide. Having obtained $\boldsymbol{\theta}^\dagger$ we may take $\boldsymbol{\theta}^* = \boldsymbol{\theta}^\dagger$, or perhaps go through an additional iteration of the minimization procedure, starting at $\boldsymbol{\theta}^\dagger$ and omitting the penalty functions entirely. If $\boldsymbol{\theta}^\dagger$ turns out to be near some constraints [sizable value of $\sum_j \alpha_j / h_j(\boldsymbol{\theta}^\dagger)$], a gradual reduction in the $\alpha_j$ is called for. All iterates $\boldsymbol{\theta}_i$ must be restricted to the interior of the feasible region, except for the last iterate which should be allowed to fall on the boundary. This means that in Eq. (5-3-7) we must have
$$\rho_i < \rho_{i,\max} \tag{6-1-5}$$
in all iterations but the last, and
$$\rho_i \leq \rho_{i,\max} \tag{6-1-6}$$
in the last iteration, taken with no penalty functions. Here, $\rho_{i,\max}$ is the greatest lower bound on the positive values of $\rho$ for which $\boldsymbol{\theta}_i + \rho \mathbf{v}_i$ is not feasible. The flowcharts of Section 5-14 assume that $\rho_{i,\max}$ can be calculated at each iteration.
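If we use a gradient method to minimize $\Phi^\dagger(\boldsymbol{\theta})$ we need to compute first and sometimes second derivatives of the penalty functions. From Eq. (2):
$$\partial \zeta_j / \partial \theta_\alpha = -[\alpha_j / h_j^2(\boldsymbol{\theta})]\, \partial h_j / \partial \theta_\alpha \tag{6-1-7}$$
$$\partial^2 \zeta_j / \partial \theta_\alpha\, \partial \theta_\beta = [\alpha_j / h_j^3(\boldsymbol{\theta})]\left[2 (\partial h_j / \partial \theta_\alpha)(\partial h_j / \partial \theta_\beta) - h_j(\boldsymbol{\theta})\, \partial^2 h_j / \partial \theta_\alpha\, \partial \theta_\beta\right] \tag{6-1-8}$$
When $\boldsymbol{\theta}$ is far from the jth constraint, the contribution of $\zeta_j$ to the objective function and its derivatives is very small. Near the jth constraint, $h_j$ is nearly zero, and the second term of Eq. (8) may be neglected relative to the first term. In either case, it is safe to replace Eq. (8) with
$$\partial^2 \zeta_j / \partial \theta_\alpha\, \partial \theta_\beta \approx [2 \alpha_j / h_j^3(\boldsymbol{\theta})](\partial h_j / \partial \theta_\alpha)(\partial h_j / \partial \theta_\beta) \tag{6-1-9}$$
which does not require computation of the second derivatives of the constraint functions. This is analogous to the way in which the second derivatives of the model equations are suppressed in the Gauss method. Note also that Eq. (9) is at least positive semidefinite, and does not spoil the definiteness of $\mathbf{N}$ when added to the latter.
Frequently (e.g., with upper or lower bounds) the constraints are linear functions whose second derivatives vanish anyway. Eq. (9) is then exact.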
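The derivative formulas translate directly into code. The sketch below accumulates the penalty value, its gradient from Eq. (6-1-7), and the approximate Hessian from Eq. (6-1-9) over all constraints; the function name `penalty_terms` and the convention of passing each constraint as a pair of callables are illustrative assumptions.

```python
def penalty_terms(theta, constraints, alphas):
    """Value, gradient and approximate Hessian of sum_j alpha_j / h_j(theta).

    constraints is a list of (h, grad_h) pairs of callables; theta must be
    strictly feasible, i.e. every h(theta) > 0.
    """
    n = len(theta)
    value = 0.0
    grad = [0.0] * n
    hess = [[0.0] * n for _ in range(n)]
    for (h, grad_h), alpha in zip(constraints, alphas):
        hj = h(theta)
        if hj <= 0.0:
            raise ValueError("theta must lie in the interior of the feasible region")
        dh = grad_h(theta)
        value += alpha / hj
        for a in range(n):
            grad[a] -= alpha * dh[a] / hj ** 2                       # Eq. (6-1-7)
            for b in range(n):
                hess[a][b] += 2.0 * alpha * dh[a] * dh[b] / hj ** 3  # Eq. (6-1-9)
    return value, grad, hess


# Bounds 0 <= theta_1 <= 1 on the first of two parameters (hypothetical data).
bounds = [
    (lambda t: 1.0 - t[0], lambda t: [-1.0, 0.0]),   # upper bound
    (lambda t: t[0] - 0.0, lambda t: [1.0, 0.0]),    # lower bound
]
value, grad, hess = penalty_terms([0.25, 3.0], bounds, [0.01, 0.01])
```

Because both constraints here are linear, the approximate Hessian is the exact one, and it can be added to $\mathbf{N}$ without any second derivatives of the constraints.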
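The initial choice of $\alpha_j$ should be dictated by the range of values that $h_j(\boldsymbol{\theta})$ and $\Phi(\boldsymbol{\theta})$ can take in the feasible region. For instance, if we have two constraints reflecting the bounds $b_\alpha \leq \theta_\alpha \leq a_\alpha$
$$h_j(\boldsymbol{\theta}) \equiv a_\alpha - \theta_\alpha \geq 0 \tag{6-1-10}$$
$$h_{j+1}(\boldsymbol{\theta}) \equiv \theta_\alpha - b_\alpha \geq 0 \tag{6-1-11}$$
we might set $\alpha_j = \alpha_{j+1} = 0.001 (a_\alpha - b_\alpha) \Phi(\boldsymbol{\theta}_1)$. If the initial guesses $\boldsymbol{\theta}_1$ are well in the interior of the feasible region, then a good choice for $\alpha_j$ may be
$$\alpha_j = 0.001\, h_j(\boldsymbol{\theta}_1)\, \Phi(\boldsymbol{\theta}_1) \tag{6-1-12}$$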
The penalty function method is easy to program, and has been found to work well when the solution is known (or expected) to be in the interior of the feasible region. A numerical illustration appears in Section 6-11. When the solution is likely to be on the boundary, then the projection method discussed below is preferable. Even an interior minimum may be reached faster by the projection method, but the complexity of the latter militates against its use.
Penalty function methods other than the one described here have been proposed, e.g., by Zangwill (1967a) and Fiacco and McCormick (1967). These methods possess the advantage (for general nonlinear programming problems) that the initial guess and intermediate iterates are not restricted to the interior of the feasible region; i.e., they are allowed to violate the constraints. In the case of parameter estimation problems, however, this is not at all an advantage. In the first place, it is usually easy to stay in the feasible region because of the simple nature of the constraints. Secondly, the objective function often behaves in an erratic manner, and may even be uncomputable outside the feasible region. The importance of staying within the feasible region was stressed above.
While penalty functions may appear to be merely a computational artifact, they do indeed possess a statistical interpretation. Suppose $\Phi(\boldsymbol{\theta})$ is a log likelihood, and let us assign to $\boldsymbol{\theta}$ a discontinuous prior distribution, which is zero outside, and uniform inside the feasible region. Then the posterior density is zero outside, and proportional to $\Phi(\boldsymbol{\theta})$ inside the region, and the problem can be formulated as being that of finding the maximum of the posterior density. Let us, however, try to smooth out the prior distribution so that its density approaches zero continuously (though rapidly) as one goes out to the boundary of the feasible region. This can be accomplished precisely by making
$$-\log p_0(\boldsymbol{\theta}) = \sum_j \zeta_j(\boldsymbol{\theta}) \tag{6-1-13}$$
As an example, suppose $0 \leq \theta_1 \leq 1$, and we use $\zeta_j = -\alpha \log h_j(\boldsymbol{\theta})$. In this case
$$-\log p_0(\boldsymbol{\theta}) = -\alpha \log \theta_1 - \alpha \log(1 - \theta_1) \tag{6-1-14}$$
so that
$$p_0(\boldsymbol{\theta}) = \theta_1^{\alpha} (1 - \theta_1)^{\alpha} \tag{6-1-15}$$
which is the $\beta_{1+\alpha,\,1+\alpha}$ (beta) distribution.
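As a worked note on Eq. (6-1-15), the density is written there without its normalizing constant. Supplying it, using the standard beta function $B$ (a symbol not used in the text), gives

$$p_0(\theta_1) = \frac{\theta_1^{\alpha}(1-\theta_1)^{\alpha}}{B(1+\alpha,\,1+\alpha)}, \qquad B(1+\alpha,\,1+\alpha) = \frac{\Gamma(1+\alpha)^2}{\Gamma(2+2\alpha)}$$

so that as $\alpha \to 0$ we have $B \to 1$ and the prior tends to the uniform density on $[0, 1]$. This matches the limiting behavior of the penalty function itself as the $\alpha_j$ are reduced.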
6-2. Projection Methods
Another class of methods for optimizing with inequality constraints is variously known as gradient projection and reduced gradient (Rosen, 1960, 1961; Wolfe, 1963; Faure and Huard, 1965; Abadie and Carpentier, 1966; Abadie, 1967b). These methods, which in Fig. 6-1 would take us along the path BCEA, may be summarized as follows:
At each iteration define the normal step as the one computed according to the gradient method of our choice, with the constraints ignored. We now face one of the following two situations:
1. If $\boldsymbol{\theta}_i$ is in the interior of the feasible region, apply the normal step (e.g., BC in Fig. 6-1). If this results in an infeasible point, the step is truncated so as to leave us on the boundary of the feasible region.
2. If $\boldsymbol{\theta}_i$ is on the boundary, take the normal step or a fraction thereof if this is feasible (e.g., EA in Fig. 6-1). Otherwise, treat some of the active constraints as equality constraints, and take a step along these constraints (e.g., CE in Fig. 6-1).
The question of which ones of the active constraints should be retained in
any given situation is a difficult one. Although Rosen (1960, 1961) gives a
working solution to this problem, this solution is not necessarily optimal. The
quadratic programming solution listed below is a good one provided all the
algebraic manipulations involved are less time-consuming than the function
evaluations.
Efficient algorithms have been constructed by using these techniques in combination with variable metric methods for generating the step directions [Goldfarb and Lapidus (1968) and Murtagh and Sargent (1969) for linear constraints; Davies (1970) for nonlinear constraints]. These algorithms are usually superior to the penalty function method for finding minima that lie on a constraint. In many parameter estimation problems, where we hope to find an interior minimum, the penalty function method seems preferable because of its greater simplicity. Exceptions to this rule do occur, however, and therefore we indicate how the ith iteration of a gradient method may be modified in the presence of linear inequality constraints (e.g., upper and lower bounds). Note that when the penalty function method is used, all iterates $\boldsymbol{\theta}_i$ (except possibly the last) are in the interior of the feasible region, and hence the method is immediately applicable even when some of the constraints Eq. (6-1-1) are in the form of strict inequalities $h_j(\boldsymbol{\theta}) > 0$. Such situations arise, for instance, when terms like $1/\theta_\alpha$ or $\log \theta_\alpha$ appear in the model equations, requiring $\theta_\alpha > 0$. In the projection method some iterates may fall on the boundary. Hence the constraint $\theta_\alpha > 0$ must be replaced by $\theta_\alpha - \varepsilon \geq 0$, where $\varepsilon$ is a small positive number.
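In an unconstrained gradient method we take a step in the direction $\mathbf{v} = -\mathbf{R}_i \mathbf{q}_i$. The minimum would be attained in a single step if the objective function to be minimized were, in fact
$$Q_i(\mathbf{v}) \equiv \Phi_i + \mathbf{v}^T \mathbf{q}_i + \tfrac{1}{2} \mathbf{v}^T \mathbf{R}_i^{-1} \mathbf{v} \tag{6-2-1}$$
which is close to $\Phi$ provided $\mathbf{R}_i$ is a good approximation to $\mathbf{H}_i^{-1}$.
Suppose the constraints take the form
$$\mathbf{a}_j^T \boldsymbol{\theta} \geq b_j \qquad (j = 1, 2, \ldots) \tag{6-2-2}$$
Let $\mathbf{A}_i$ be the matrix whose columns are precisely those vectors $\mathbf{a}_j$ such that $\mathbf{a}_j^T \boldsymbol{\theta}_i = b_j$, i.e., those constraints which are active at the point $\boldsymbol{\theta}_i$. Let p be the number of such constraints; $\mathbf{A}_i$ is $l \times p$. Any feasible step must satisfy
$$\mathbf{A}_i^T \mathbf{v} \geq \mathbf{0} \tag{6-2-3}$$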
The following strategy will be adopted for finding the direction $\mathbf{v}_i$: Minimize the current approximation Eq. (1) to the objective function, subject to the currently active constraints (as will be seen later, currently inactive constraints will help determine the step length, but not its direction). The quadratic expression Eq. (1) therefore acts as a temporary objective function, and the problem of finding its minimum subject to the linear constraints Eq. (3) is called quadratic programming (QP). Since $\mathbf{R}_i$, $\mathbf{q}_i$, and $\mathbf{A}_i$ differ from iteration to iteration, it follows that each iteration of the original problem requires the solution of a different QP problem. The algorithm described below is very efficient, however, and in many estimation problems the computation of $\Phi$ and its derivatives is more time consuming than the solution of the QP problem.
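Let $\mathbf{v}_i$ be the solution to the ith iteration QP problem. At $\boldsymbol{\theta} = \boldsymbol{\theta}_i + \mathbf{v}_i$ the gradient of $Q_i(\mathbf{v})$ is obtained from Eq. (1) as
$$\mathbf{q}(\mathbf{v}_i) = \mathbf{q}_i + \mathbf{R}_i^{-1} \mathbf{v}_i \tag{6-2-4}$$
According to the Kuhn-Tucker conditions, there must exist a vector of Lagrange multipliers $\boldsymbol{\lambda}_i$ satisfying Eq. (3-7-3), which becomes in our case
$$\mathbf{q}_i + \mathbf{R}_i^{-1} \mathbf{v}_i = \mathbf{A}_i \boldsymbol{\lambda}_i \tag{6-2-5}$$
Let $\mathbf{w}_i \equiv \mathbf{A}_i^T \mathbf{v}_i$ denote the vector of constraint functions evaluated at $\boldsymbol{\theta} = \boldsymbol{\theta}_i + \mathbf{v}_i$. Then Eqs. (3-7-4)-(3-7-6) take the form
$$\mathbf{w}_i \geq \mathbf{0}, \qquad \boldsymbol{\lambda}_i \geq \mathbf{0}, \qquad \boldsymbol{\lambda}_i^T \mathbf{w}_i = 0 \tag{6-2-6}$$
Let $\mathbf{z}_i \equiv -\mathbf{A}_i^T \mathbf{R}_i \mathbf{q}_i$ and $\mathbf{W}_i \equiv \mathbf{A}_i^T \mathbf{R}_i \mathbf{A}_i$. Premultiply Eq. (5) by $\mathbf{A}_i^T \mathbf{R}_i$. Conditions Eqs. (5) and (6) are now transformed into the following problem (we henceforth drop the subscript i):
Find $\boldsymbol{\lambda}$ and $\mathbf{w}$ satisfying
$$\mathbf{w} = \mathbf{z} + \mathbf{W} \boldsymbol{\lambda}, \qquad \mathbf{w} \geq \mathbf{0}, \qquad \boldsymbol{\lambda} \geq \mathbf{0}, \qquad \boldsymbol{\lambda}^T \mathbf{w} = 0 \tag{6-2-7}$$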
This is known as the complementary pivot problem (Cottle and Dantzig, 1968) and can be solved by an algorithm given by Dantzig and Cottle (1967). We present here a simpler and faster algorithm (Zoutendijk, 1960; Bard, 1971) whose convergence has not been proven, but which has not failed in hundreds of applications.
Observe that from Eq. (7) for each j either $\lambda_j$ or $w_j$ vanishes. Since $w_j$ is the value of the jth constraint function at the solution to the QP problem, it follows that this constraint remains active if $w_j = 0$. In this case we refer to the jth constraint as binding. If $w_j > 0$ then the solution would not be affected by removal of this constraint, and it is called nonbinding. Let $\mathbf{w}_B = \mathbf{0}$ and $\mathbf{w}_N > \mathbf{0}$ be the vectors of binding and nonbinding constraints, respectively. Of course, we do not know as yet which constraints are going to be included in which set, but for the time being we ignore this difficulty. Let $\boldsymbol{\lambda}_B$ and $\boldsymbol{\lambda}_N$ be the corresponding partition of $\boldsymbol{\lambda}$. Then from Eq. (7) $\boldsymbol{\lambda}_N = \mathbf{0}$.
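Let $\mathbf{W}$ be partitioned along the same lines into
$$\begin{bmatrix} \mathbf{W}_{BB} & \mathbf{W}_{BN} \\ \mathbf{W}_{BN}^T & \mathbf{W}_{NN} \end{bmatrix}$$
and $\mathbf{z}^T$ into $\mathbf{z}_B^T$, $\mathbf{z}_N^T$. Then Eq. (7) becomes:
$$\mathbf{0} = \mathbf{z}_B + \mathbf{W}_{BB} \boldsymbol{\lambda}_B \tag{6-2-8}$$
$$\mathbf{w}_N = \mathbf{z}_N + \mathbf{W}_{BN}^T \boldsymbol{\lambda}_B \tag{6-2-9}$$
$$\mathbf{w}_N > \mathbf{0}, \qquad \boldsymbol{\lambda}_B \geq \mathbf{0} \tag{6-2-10}$$
From Eq. (8), $\boldsymbol{\lambda}_B = -\mathbf{W}_{BB}^{-1} \mathbf{z}_B$, so that from Eq. (9) $\mathbf{w}_N = \mathbf{z}_N - \mathbf{W}_{BN}^T \mathbf{W}_{BB}^{-1} \mathbf{z}_B$. Let us form the tableau
$$\mathbf{E} = [\mathbf{W}, \mathbf{z}] = \begin{bmatrix} \mathbf{W}_{BB} & \mathbf{W}_{BN} & \mathbf{z}_B \\ \mathbf{W}_{BN}^T & \mathbf{W}_{NN} & \mathbf{z}_N \end{bmatrix} \tag{6-2-11}$$
If Gauss-Jordan pivots (see Section A-3) were to be effected in turn on all the diagonal elements of $\mathbf{W}_{BB}$, then the last column of $\mathbf{E}$ would be transformed into
$$\begin{bmatrix} \mathbf{W}_{BB}^{-1} \mathbf{z}_B \\ \mathbf{z}_N - \mathbf{W}_{BN}^T \mathbf{W}_{BB}^{-1} \mathbf{z}_B \end{bmatrix} = \begin{bmatrix} -\boldsymbol{\lambda}_B \\ \mathbf{w}_N \end{bmatrix}$$
It follows that if the proper partitioning of the constraints into binding and nonbinding ones is at hand, and if we sweep those rows of $\mathbf{E}$ which correspond to the binding constraints, then the last elements in those rows must become nonpositive (since $\boldsymbol{\lambda}_B \geq \mathbf{0}$). Conversely, the last element in each unswept row of $\mathbf{E}$ must become nonnegative (since $\mathbf{w}_N > \mathbf{0}$, but zero elements may appear under certain conditions of degeneracy). In order to find the proper partition, the following algorithm is suggested: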
1. Form the tableau $\mathbf{E}$ which has p rows and p + 1 columns (p is the number of constraints active at $\boldsymbol{\theta}_i$).
2. Assign to the jth row of $\mathbf{E}$ (j = 1, 2, ..., p) the indicator $k_j = 1$.
3. Let $e_j$ denote the current value of the last element in the jth row. Find $a = \min_j k_j e_j$. If $a \geq -\varepsilon$ (where $\varepsilon$ is a small positive constant, say $\varepsilon = 10^{-6}$ for single precision calculations), proceed to step 5. Otherwise:
4. Let r be an index for which $a = k_r e_r$. Sweep the rth row (i.e., execute a Gauss-Jordan pivot on $E_{rr}$) and change the sign of $k_r$. Return to step 3.
5. The solution is now at hand. Consider the jth row of $\mathbf{E}$. If $k_j = 1$, the jth row is unswept (or swept an even number of times). Hence the jth constraint is nonbinding. Therefore, $e_j = w_j = (\mathbf{A}^T \mathbf{v})_j \geq 0$, and $\lambda_j = 0$. If $k_j = -1$, the jth row is swept and the jth constraint is binding. Hence $e_j = -\lambda_j \leq 0$, and $w_j = (\mathbf{A}^T \mathbf{v})_j = 0$. To compute $\mathbf{v}_i$ (we now restore the subscript i), solve Eq. (5)
$$\mathbf{v}_i = \mathbf{R}_i (\mathbf{A}_i \boldsymbol{\lambda}_i - \mathbf{q}_i) = \mathbf{R}_i (\mathbf{A}_B \boldsymbol{\lambda}_B - \mathbf{q}_i) \tag{6-2-12}$$
where $\mathbf{A}_B$ consists of the columns of $\mathbf{A}_i$ corresponding to the swept rows of $\mathbf{E}$, and $\boldsymbol{\lambda}_B$ consists of the last elements of those rows, with signs changed.
6. The actual step $\boldsymbol{\sigma}_i = \rho_i \mathbf{v}_i$ is computed by interpolation-extrapolation along the ray $\boldsymbol{\theta}_i + \rho \mathbf{v}_i$ ($\rho > 0$), with the additional proviso that $\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_i + \rho_i \mathbf{v}_i$ must also satisfy all the constraints inactive at $\boldsymbol{\theta}_i$, and therefore not included in $\mathbf{A}_i$. If we denote these constraints as $\mathbf{c}_j^T \boldsymbol{\theta} - b_j \geq 0$ (j = 1, 2, ...), then we must have $\mathbf{c}_j^T \boldsymbol{\theta}_i + \rho_i \mathbf{c}_j^T \mathbf{v}_i - b_j \geq 0$. Since $\rho_i > 0$ and $\mathbf{c}_j^T \boldsymbol{\theta}_i - b_j > 0$ (inactivity at $\boldsymbol{\theta}_i$), it follows that only constraints for which $\mathbf{c}_j^T \mathbf{v}_i < 0$ threaten to become violated. Hence $\rho_i$ must satisfy the inequality
$$\rho_i \leq \rho_{\max} \equiv \min_{j \in J} \left[ (b_j - \mathbf{c}_j^T \boldsymbol{\theta}_i) / \mathbf{c}_j^T \mathbf{v}_i \right] \tag{6-2-13}$$
where J is the set of indices j for which $\mathbf{c}_j^T \mathbf{v}_i < 0$. If J is vacuous, then $\rho_{\max} = \infty$.
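The following simple example illustrates the computation of $\mathbf{v}_i$. Assume that the objective function is approximated locally by the quadratic function
$$Q(\mathbf{v}) = \tfrac{1}{2}(v_1 + 1)^2 + \tfrac{1}{2} v_2^2 \tag{6-2-14}$$
We start at $v_1 = v_2 = 0$, where the active constraints are
$$v_2 \geq 0, \qquad v_1 + v_2 \geq 0, \qquad -v_1 + v_2 \geq 0 \tag{6-2-15}$$
The constraints and the contours of the objective function are displayed in Fig. 6-3. From Eq. (14) and Eq. (15) we deduce $\mathbf{q}^T = [1, 0]$, $\mathbf{H} = \mathbf{I}_2$, $\mathbf{R} = \mathbf{H}^{-1} = \mathbf{I}_2$, and
$$\mathbf{A} = \begin{bmatrix} 0 & 1 & -1 \\ 1 & 1 & 1 \end{bmatrix}$$
We form $\mathbf{z} = -\mathbf{A}^T \mathbf{R} \mathbf{q} = [0, -1, 1]^T$ and
$$\mathbf{W} = \mathbf{A}^T \mathbf{R} \mathbf{A} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 0 \\ 1 & 0 & 2 \end{bmatrix} \tag{6-2-16}$$

Fig. 6-3. Projection method with $Q(\mathbf{v}) = \tfrac{1}{2}(v_1 + 1)^2 + \tfrac{1}{2} v_2^2$.

The algorithm proceeds as follows:
1. $\mathbf{E} = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 1 & 2 & 0 & -1 \\ 1 & 0 & 2 & 1 \end{bmatrix}$
2. $\mathbf{k}^T = [1, 1, 1]$.
3. $a = e_2 k_2 = -1 < -10^{-6}$.
4. r = 2, hence we pivot on $E_{2,2}$. The resulting tableau is
$$\mathbf{E} = \frac{1}{2} \begin{bmatrix} 1 & -1 & 2 & 1 \\ 1 & 1 & 0 & -1 \\ 2 & 0 & 4 & 2 \end{bmatrix}, \qquad \mathbf{k} = \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}$$
3. $a = e_1 k_1 = e_2 k_2 = \tfrac{1}{2} > -10^{-6}$.
5. Since $k_2 = -1$, the second constraint is binding, whereas the first and third are not. Hence $\mathbf{A}_B = [1, 1]^T$ (the second column of $\mathbf{A}$) and $\boldsymbol{\lambda}_B = [\tfrac{1}{2}]$ (minus the second element in the last column of $\mathbf{E}$). Hence, according to Eq. (12)
$$\mathbf{v} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \left( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \times \tfrac{1}{2} - \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right) = \begin{bmatrix} -\tfrac{1}{2} \\ \tfrac{1}{2} \end{bmatrix}$$
which, as is evident from Fig. 6-3, is the correct solution.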
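The pivoting procedure and this worked example can be reproduced in a few lines. The sketch below is an illustration rather than the published program: the names `sweep` and `qp_direction` are hypothetical, and the pivot is assumed to be the exchange form of the Gauss-Jordan pivot (Section A-3 is not reproduced here), chosen because sweeping the same row twice then restores the tableau, as step 5 of the algorithm requires.

```python
def sweep(E, r):
    """Exchange-type Gauss-Jordan pivot on E[r][r], done in place."""
    p = E[r][r]
    ncols = len(E[0])
    for i in range(len(E)):
        if i == r:
            continue
        factor = E[i][r] / p
        for j in range(ncols):
            if j != r:
                E[i][j] -= factor * E[r][j]
        E[i][r] = -factor
    for j in range(ncols):
        if j != r:
            E[r][j] /= p
    E[r][r] = 1.0 / p


def qp_direction(R, q, A_cols, eps=1e-6, max_sweeps=100):
    """Return (v, lam, k) for active constraints a_j^T v >= 0.

    A_cols lists the vectors a_j, i.e. the columns of A.
    """
    n, p = len(q), len(A_cols)
    Rq = [sum(R[i][j] * q[j] for j in range(n)) for i in range(n)]
    RA = [[sum(R[i][j] * a[j] for j in range(n)) for i in range(n)]
          for a in A_cols]                        # RA[c] holds R a_c
    # Step 1: tableau E = [W, z], W = A^T R A, z = -A^T R q.
    E = [[sum(A_cols[r][i] * RA[c][i] for i in range(n)) for c in range(p)]
         + [-sum(A_cols[r][i] * Rq[i] for i in range(n))]
         for r in range(p)]
    k = [1] * p                                   # step 2
    for _ in range(max_sweeps):
        if p == 0:
            break
        r = min(range(p), key=lambda j: k[j] * E[j][-1])
        if k[r] * E[r][-1] >= -eps:               # step 3: partition found
            break
        sweep(E, r)                               # step 4
        k[r] = -k[r]
    else:
        raise RuntimeError("no partition found within max_sweeps")
    # Step 5: multipliers of the binding constraints, then Eq. (6-2-12).
    lam = [-E[j][-1] if k[j] == -1 else 0.0 for j in range(p)]
    v = [sum(RA[c][i] * lam[c] for c in range(p)) - Rq[i] for i in range(n)]
    return v, lam, k


# Data of the example: q = [1, 0], R = I, constraints of Eq. (6-2-15).
R = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 0.0]
A_cols = [[0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
v, lam, k = qp_direction(R, q, A_cols)
```

The cap `max_sweeps` is a safeguard added here because the text notes that convergence of the procedure has not been proven.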
An extension of this algorithm to the case of nonlinear constraints is given by Bard (1971).
6-3. Projection with Bounded Parameters
The reader may wonder at the need for a complicated algorithm when bounds are the only constraints imposed on the parameters. Why not simply suppress those components of the step $\mathbf{v}$ which would violate the bounds? A simple example shows that such a procedure can produce erroneous results. Let, for instance
Oi = [l
qi=[-l
R i = [ J
and suppose both components of $\boldsymbol{\theta}$ have zero lower bounds. Since $q_1 = -1$, it is clear that the objective function can be reduced by increasing $\theta_1$ while keeping $\theta_2$ constant. However, if we compute $\mathbf{v}$ we find $-\mathbf{R}_i \mathbf{q}_i = [-3, -6]^T$. According to the above suggestion, since both components of $\mathbf{v}$ would carry the parameters below their lower bounds, we would suppress both, conclude erroneously that we are at the minimum, and take no step at all. We shall proceed therefore to apply the algorithm of the preceding section.
At the ith iteration, several of the components of $\boldsymbol{\theta}$ (we omit the subscript i for convenience) may be at their lower bounds, and others at their upper bounds, while the remainder are free to move in either direction. For simplicity in the algebraic exposition to follow, we shall treat lower bounds only, but later we give the arithmetic details for both cases. If $\theta_\alpha$ is at its lower bound, the corresponding constraint takes the form
$$v_\alpha \geq 0 \tag{6-3-1}$$
Let $\mathbf{v}^{(0)} = -\mathbf{R}\mathbf{q}$ be the step that would be taken in the absence of constraints. It is easy to see that forming $\mathbf{A}^T \mathbf{R}$ merely picks out of $\mathbf{R}$ those rows which correspond to variables which are at a bound; postmultiplying this by $\mathbf{A}$ picks out the corresponding columns. Thus, $\mathbf{W} = \mathbf{A}^T \mathbf{R} \mathbf{A} = \tilde{\mathbf{R}}$ where $\tilde{\mathbf{R}}$ is the set of elements of $\mathbf{R}$ at the junction of active rows and columns. Similarly, $\mathbf{z} = -\mathbf{A}^T \mathbf{R} \mathbf{q}$ is obtained by taking the active elements of $\mathbf{v}^{(0)}$. Hence, $\mathbf{E}$ can be formed by inspection from $\mathbf{R}$ and $\mathbf{v}^{(0)}$
$$\mathbf{E} = [\tilde{\mathbf{R}}, \tilde{\mathbf{v}}^{(0)}] \tag{6-3-2}$$
It is easily verified that the following procedure is equivalent to the algorithm of Section 6-2 in the present context:
1. Set up the tableau $\mathbf{E}$, which has p rows and p + 1 columns.
2. For j = 1, 2, ..., p let $m_j = 1$ if the jth row corresponds to a variable at its lower bound, and $m_j = -1$ if it corresponds to a variable at its upper bound.
3. For j = 1, 2, ..., p let $k_j = 1$.
4. Let $e_j$ denote the current value of the last element in the jth row of $\mathbf{E}$. Find $a = \min_j m_j k_j e_j$. If $a \geq -\varepsilon$, proceed to step 6. Otherwise:
5. Let r be an index for which $a = m_r k_r e_r$. Sweep the rth row (Gauss-Jordan pivot on $E_{rr}$) and change the sign of $k_r$. Return to step 4.
6. The solution is now at hand. Bounds for which $k_j = -1$ are binding, others are not. We construct $\boldsymbol{\lambda}_B$ by taking the elements of $-\mathbf{e}$ in the binding rows, and we construct $\mathbf{R}\mathbf{A}_B$ by taking the columns of $\mathbf{R}$ corresponding to variables which are at a binding bound. It is thus easy to compute $\mathbf{v}$ using Eq. (6-2-12)
$$\mathbf{v} = (\mathbf{R}\mathbf{A}_B) \boldsymbol{\lambda}_B + \mathbf{v}^{(0)} \tag{6-3-3}$$
Since variables at binding bounds cannot change their values, the corresponding elements of $\mathbf{v}$ automatically come out 0.
We illustrate the procedure by means of a numerical example (a further illustration appears in Section 6-12):
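Suppose the following values are current in the ith iteration
$$\boldsymbol{\theta}_i = \begin{bmatrix} 0 \\ 0.5 \\ 1 \end{bmatrix}, \qquad \mathbf{q}_i = \begin{bmatrix} -3 \\ -1 \\ -2 \end{bmatrix}, \qquad \mathbf{R}_i = \begin{bmatrix} 5 & -10 & -4 \\ -10 & 30 & 8 \\ -4 & 8 & 5 \end{bmatrix}$$
Hence, $\mathbf{v}_i^{(0)} = -\mathbf{R}_i \mathbf{q}_i = [-3, 16, 6]^T$. Suppose all parameters are restricted to the range zero to one. Thus $\theta_1$ and $\theta_3$ are at their lower and upper bounds, respectively. The algorithm proceeds as follows:
1. $\mathbf{E} = \begin{bmatrix} 5 & -4 & -3 \\ -4 & 5 & 6 \end{bmatrix}$.
2. $\mathbf{m} = [1, -1]^T$.
3. $\mathbf{k} = [1, 1]^T$.
4. $a = m_2 k_2 e_2 = -6 < -10^{-6}$.
5. r = 2, hence we pivot on $E_{2,2}$ obtaining the new tableau
$$\mathbf{E} = \begin{bmatrix} 1.8 & 0.8 & 1.8 \\ -0.8 & 0.2 & 1.2 \end{bmatrix}, \qquad \mathbf{k} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
4. $a = m_2 k_2 e_2 = 1.2 > -10^{-6}$.
6. Since $k_2 = -1$, the second constraint (the bound on $\theta_3$) is binding. We have $\boldsymbol{\lambda}_B = [-1.2]$ (second element of $-\mathbf{e}$) and $\mathbf{R}\mathbf{A}_B = [-4, 8, 5]^T$ (third column of $\mathbf{R}$). Hence
$$\mathbf{v} = \begin{bmatrix} -4 \\ 8 \\ 5^* \end{bmatrix} \times (-1.2) + \begin{bmatrix} -3 \\ 16 \\ 6^* \end{bmatrix} = \begin{bmatrix} 1.8 \\ 6.4 \\ 0 \end{bmatrix} \tag{6-3-4}$$
As expected $v_3 = 0$, and we could have replaced the starred elements in Eq. (4) by zeros without altering the result.
The actual step $\boldsymbol{\sigma}$ will be some multiple of $\mathbf{v}$, i.e., $\boldsymbol{\sigma} = \rho \mathbf{v}$. The multiplier $\rho$ is limited to the range $0 \leq \rho \leq \min(1, \rho_{\max})$, with $\rho_{\max}$ to be determined by the requirement that $\boldsymbol{\theta}_i + \rho_{\max} \mathbf{v}$ remains feasible. That is, we must have
$$0 + 1.8 \rho_{\max} \leq 1, \qquad 0.5 + 6.4 \rho_{\max} \leq 1$$
Hence
$$\rho_{\max} = \min[(1 - 0)/1.8,\ (1 - 0.5)/6.4] = 0.078125$$
The actual value of $\rho$ would be determined by interpolation-extrapolation (see Section 5-14) to guarantee a decrease in the objective function value.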
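The bounded-parameter procedure can be traced in code on the three-parameter example. This is a sketch under stated assumptions: the names `sweep`, `bounded_direction`, and `rho_max` are hypothetical, the pivot is taken to be the exchange form of the Gauss-Jordan pivot, and the numerical values of $\mathbf{R}_i$ and $\mathbf{q}_i$ are a reading of the example chosen to be consistent with $\mathbf{v}^{(0)} = [-3, 16, 6]^T$ and with the tableau after the pivot.

```python
def sweep(E, r):
    """Exchange-type Gauss-Jordan pivot on E[r][r], done in place."""
    p = E[r][r]
    ncols = len(E[0])
    for i in range(len(E)):
        if i == r:
            continue
        factor = E[i][r] / p
        for j in range(ncols):
            if j != r:
                E[i][j] -= factor * E[r][j]
        E[i][r] = -factor
    for j in range(ncols):
        if j != r:
            E[r][j] /= p
    E[r][r] = 1.0 / p


def bounded_direction(R, q, theta, lower, upper, eps=1e-6, tol=1e-12):
    """Step direction v when simple bounds are the only constraints."""
    n = len(q)
    v0 = [-sum(R[i][j] * q[j] for j in range(n)) for i in range(n)]
    at_lower = [abs(theta[i] - lower[i]) <= tol for i in range(n)]
    at_upper = [abs(theta[i] - upper[i]) <= tol for i in range(n)]
    active = [i for i in range(n) if at_lower[i] or at_upper[i]]
    E = [[R[i][j] for j in active] + [v0[i]] for i in active]   # Eq. (6-3-2)
    m = [1 if at_lower[i] else -1 for i in active]              # step 2
    k = [1] * len(active)                                       # step 3
    for _ in range(100):
        if not active:
            break
        r = min(range(len(active)), key=lambda j: m[j] * k[j] * E[j][-1])
        if m[r] * k[r] * E[r][-1] >= -eps:                      # step 4
            break
        sweep(E, r)                                             # step 5
        k[r] = -k[r]
    else:
        raise RuntimeError("no partition found")
    v = list(v0)
    binding = [(j, i) for j, i in enumerate(active) if k[j] == -1]
    for j, i in binding:                                        # Eq. (6-3-3)
        lam = -E[j][-1]
        for row in range(n):
            v[row] += R[row][i] * lam
    for _, i in binding:
        v[i] = 0.0          # variables at binding bounds do not move
    return v


def rho_max(theta, v, lower, upper):
    """Largest multiple of v that keeps every parameter within its bounds."""
    limits = []
    for t, step, lo, hi in zip(theta, v, lower, upper):
        if step > 0.0:
            limits.append((hi - t) / step)
        elif step < 0.0:
            limits.append((lo - t) / step)
    return min(limits) if limits else float("inf")


R = [[5.0, -10.0, -4.0], [-10.0, 30.0, 8.0], [-4.0, 8.0, 5.0]]
q = [-3.0, -1.0, -2.0]
theta = [0.0, 0.5, 1.0]
lower, upper = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
v = bounded_direction(R, q, theta, lower, upper)
limit = rho_max(theta, v, lower, upper)
```

Setting the binding components of `v` to exactly zero, rather than relying on cancellation, keeps rounding error from producing a spurious step-length limit.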
6-4. Transformation of Variables
Sometimes a change of variables can transform a constrained problem into an unconstrained one. For instance, to minimize $\Phi(\theta)$ with $\theta$ required to be positive is equivalent to minimizing $\Phi(\rho^2)$ with $\rho$ free to assume any value. Similarly, if $\theta$ must satisfy $\beta \leq \theta \leq \alpha$, then we minimize $\Phi((\alpha + \beta)/2 + [(\alpha - \beta)/2] \sin \rho)$ with $\rho$ unconstrained, since as $\rho$ varies from $-\infty$ to $\infty$, the quantity $(\alpha + \beta)/2 + [(\alpha - \beta)/2] \sin \rho$ remains within the bounds $\alpha$ and $\beta$. Box (1966) has demonstrated that with ingenuity even more complicated constraints can sometimes be eliminated by means of such transformations. We have some numerical evidence that the use of transformations is no more efficient than the use of penalty functions, and we prefer the latter because of their greater generality.
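The sine transformation is easy to try on a small case. In the sketch below the function names, the fixed-rate gradient descent, and the test objective $\Phi(\theta) = (\theta - 3)^2$ with $0 \leq \theta \leq 2$ are all illustrative assumptions; the descent merely stands in for whatever unconstrained method is preferred.

```python
import math


def bounded(rho, beta, alpha):
    """Map an unconstrained rho into the interval [beta, alpha]."""
    return 0.5 * (alpha + beta) + 0.5 * (alpha - beta) * math.sin(rho)


def descend(g, rho, rate=0.1, iters=500, d=1e-6):
    """Fixed-rate gradient descent on g using a central-difference slope."""
    for _ in range(iters):
        slope = (g(rho + d) - g(rho - d)) / (2.0 * d)
        rho -= rate * slope
    return rho


def phi(theta):
    return (theta - 3.0) ** 2          # unconstrained minimum at theta = 3


beta, alpha = 0.0, 2.0                 # bounds: beta <= theta <= alpha
rho_hat = descend(lambda r: phi(bounded(r, beta, alpha)), rho=0.0)
theta_hat = bounded(rho_hat, beta, alpha)
```

Here the unconstrained minimum of $\Phi$ lies outside the bounds, and the transformed search settles at the upper bound without the iterates ever leaving the interval.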
6-5. Minimax Problems
Some estimation problems (see Sections 4-13 and 4-17) led us to seek the value of $\boldsymbol{\theta}$ that minimizes
$$\Phi(\boldsymbol{\theta}) \equiv \max_\mu |e_\mu(\boldsymbol{\theta})| \tag{6-5-1}$$
This was shown to be equivalent to finding $z$, $\boldsymbol{\theta}$ so as to minimize
$$\Phi(\boldsymbol{\theta}, z) \equiv z \tag{6-5-2}$$
subject to
$$z - |e_\mu(\boldsymbol{\theta})| \geq 0 \qquad (\mu = 1, 2, \ldots, n) \tag{6-5-3}$$
An iterative scheme which is analogous to the Gauss method for least squares problems has been suggested by Osborne and Watson (1969). Let $\boldsymbol{\theta}_i$ be the value of $\boldsymbol{\theta}$ at the ith iteration. We approximate $e_\mu(\boldsymbol{\theta}_i + \mathbf{v}_i)$ by a linear expression $e_\mu(\boldsymbol{\theta}_i) + \mathbf{b}_\mu^T \mathbf{v}_i$. The nonlinear programming problem Eq. (2), Eq. (3) is replaced by the linear programming problem:
Find $z$, $\mathbf{v}_i$ so as to minimize Eq. (2) subject to
$$z - e_\mu(\boldsymbol{\theta}_i) - \mathbf{b}_\mu^T \mathbf{v}_i \geq 0, \qquad z + e_\mu(\boldsymbol{\theta}_i) + \mathbf{b}_\mu^T \mathbf{v}_i \geq 0 \qquad (\mu = 1, 2, \ldots, n) \tag{6-5-4}$$
This problem can be solved by means of standard linear programming (LP) methods. Once $\mathbf{v}_i$ has been computed, the length of the step taken in this direction is determined by interpolation-extrapolation.

B. Equality Constraints

6-6. Exact Structural Models

In Section 4-11 we have formulated some estimation problems as requiring
the minimization of a function $\Phi(\hat{W})$ subject to the constraints $G(\hat{W}, \theta) = 0$.
The "true" data $\hat{W}$ and the parameters $\theta$ are the unknowns. In the ensuing
discussion we will treat $G$ and $\hat{W}$ as vectors, $G$ and $\hat{w}$. We suppose that at the
$i$th iteration we have current values of $\hat{w}_i$ and $\theta_i$. The method of solution
given below follows suggestions by Deming (1943).
We denote the current values of $\Phi$ and $G$ by $\Phi_i$ and $G_i$, respectively, and
adopt the following notation:

$$q \equiv \partial\Phi/\partial\hat{w}, \qquad H \equiv \partial^2\Phi/\partial\hat{w}\,\partial\hat{w}, \qquad A \equiv \partial G/\partial\hat{w}, \qquad B \equiv \partial G/\partial\theta \tag{6-6-1}$$

with subscript $i$ denoting the value at $\hat{w} = \hat{w}_i$, $\theta = \theta_i$. We define the following
functions:

$$\tilde{\Phi}_i(\delta\hat{w}) \equiv \Phi_i + q_i^T\,\delta\hat{w} + \tfrac{1}{2}\,\delta\hat{w}^T H_i\,\delta\hat{w} \tag{6-6-2}$$

$$\tilde{G}_i(\delta\hat{w}, \delta\theta) \equiv G_i + A_i\,\delta\hat{w} + B_i\,\delta\theta \tag{6-6-3}$$

$\tilde{\Phi}_i$ is the second-order Taylor series approximation to $\Phi$, and $\tilde{G}_i$ is the first-
order Taylor series approximation to $G$. We now replace our original problem
by the following:
Find $\delta\hat{w}$ and $\delta\theta$ so as to minimize $\tilde{\Phi}_i$, while satisfying the constraints
$\tilde{G}_i = 0$. We introduce a vector of Lagrange multipliers $\lambda$ and seek the stationary
point of the Lagrangian

$$\Lambda(\delta\hat{w}, \delta\theta, \lambda) \equiv \tilde{\Phi}_i + \lambda^T \tilde{G}_i \tag{6-6-4}$$

Accordingly, we form the normal equations:

$$\partial\Lambda/\partial(\delta\hat{w}) = q_i + H_i\,\delta\hat{w} + A_i^T\lambda = 0 \tag{6-6-5}$$

$$\partial\Lambda/\partial\lambda = G_i + A_i\,\delta\hat{w} + B_i\,\delta\theta = 0 \tag{6-6-6}$$

$$\partial\Lambda/\partial(\delta\theta) = B_i^T\lambda = 0 \tag{6-6-7}$$

From Eq. (5)

$$\delta\hat{w} = -H_i^{-1}(q_i + A_i^T\lambda) \tag{6-6-8}$$

So that, from Eq. (6)

$$G_i - A_i H_i^{-1} q_i - A_i H_i^{-1} A_i^T\lambda + B_i\,\delta\theta = 0 \tag{6-6-9}$$

Solving for $\lambda$ we obtain

$$\lambda = C_i^{-1}(B_i\,\delta\theta - A_i H_i^{-1} q_i + G_i) \tag{6-6-10}$$

where

$$C_i \equiv A_i H_i^{-1} A_i^T \tag{6-6-11}$$

Substituting Eq. (10) in Eq. (7) and solving for $\delta\theta$ we obtain

$$\delta\theta = D_i^{-1} B_i^T C_i^{-1}(A_i H_i^{-1} q_i - G_i) \tag{6-6-12}$$

where

$$D_i \equiv B_i^T C_i^{-1} B_i \tag{6-6-13}$$

The matrix $D_i$ plays a role analogous to that of $N_i$ in the Gauss method,
and the same "almost inversion" methods (e.g., directional discrimination or
Marquardt's method) should be used where $D_i^{-1}$ is required.
Eq. (12) enables one to compute $\theta_{i+1} = \theta_i + \delta\theta$. Then Eq. (10) can
be used to compute $\lambda$, which in turn can be substituted in Eq. (8) to compute
$\delta\hat{w}$ and the new approximation $\hat{w}_{i+1} = \hat{w}_i + \delta\hat{w}$. Usually $\hat{w}$ is close to
the observed values $w$, so that we naturally take $\hat{w} = w$ as the initial guess.
6-7. Convergence Monitoring

When we apply the Deming procedure there is no natural way of telling
whether or not progress towards the solution has occurred in any given iteration.
There is no way of telling in advance whether the final value of the
objective function or the Lagrangian must be less or greater than the current
value. At the solution, however, the equations $G = 0$ must be satisfied, so that
if $Q_i$ is some positive definite matrix, it is natural to require that the $i$th
iteration must cause the value of $G^T Q_i G$ to decrease. Now we have:

$$\partial(G^T Q_i G)/\partial\theta = 2 B_i^T Q_i G \tag{6-7-1}$$

$$\partial(G^T Q_i G)/\partial\hat{w} = 2 A_i^T Q_i G \tag{6-7-2}$$

Hence, if $\delta\theta$ and $\delta\hat{w}$ are given by Eq. (6-6-12) and Eq. (6-6-8), it turns out
after much arithmetic that

$$\delta(G^T Q_i G) = -2 G^T Q_i G < 0 \tag{6-7-3}$$

This means that the quantity $G^T Q_i G$ decreases initially as we take a small
step in the prescribed direction. A natural choice of $Q_i$ is the inverse of the
covariance matrix of $G$. If $V_w$ is the covariance matrix of the data $w$, then the
covariance of $G$ is approximately $A_i V_w A_i^T$. Usually $V_w = H_i^{-1}$, so we choose

$$Q_i = (A_i H_i^{-1} A_i^T)^{-1} = C_i^{-1} \tag{6-7-4}$$

Our strategy at the $i$th iteration is the following:
1. Compute $Z_0 \equiv G_i^T C_i^{-1} G_i$.
2. Compute $\delta\theta$ and $\delta\hat{w}$ using Eq. (6-6-12), Eq. (6-6-10), and Eq. (6-6-8).
3. Compute $Z_1 \equiv G^T(\theta_i + \delta\theta, \hat{w}_i + \delta\hat{w})\, C_i^{-1}\, G(\theta_i + \delta\theta, \hat{w}_i + \delta\hat{w})$.
4. If $Z_1 < Z_0$, set $\theta_{i+1} = \theta_i + \delta\theta$, $\hat{w}_{i+1} = \hat{w}_i + \delta\hat{w}$ and proceed to the
next iteration. Otherwise, use interpolation (Section 5-14) to shorten $\delta\theta$ and
$\delta\hat{w}$, and return to step 3.

Note that a different weighting matrix $C_i^{-1}$ is used in each iteration.
Therefore, we can compare $Z_1$ and $Z_0$ in any given iteration, but not from one
iteration to the next.
At the end of the calculations, we can apply the necessary condition Eq.
(3-6-6) as a test of convergence. In terms of our present variables, the conditions
take the form

$$B^T\lambda = 0, \qquad q + A^T\lambda = 0 \tag{6-7-5}$$

These, along with the original equations

$$G = 0 \tag{6-7-6}$$

must be satisfied by the final values $\theta^*$, $\hat{w}^*$, $\lambda^*$.

6-8. Some Special Cases
Consider an objective function of the weighted least squares form

$$\Phi(\hat{w}) = \tfrac{1}{2}\sum_{\mu=1}^{n} (\hat{w}_\mu - w_\mu)^T V_\mu^{-1} (\hat{w}_\mu - w_\mu) \tag{6-8-1}$$

Therefore:

$$q = \begin{bmatrix} V_1^{-1}(\hat{w}_1 - w_1) \\ V_2^{-1}(\hat{w}_2 - w_2) \\ \vdots \\ V_n^{-1}(\hat{w}_n - w_n) \end{bmatrix} \tag{6-8-2}$$

$$H = \begin{bmatrix} V_1^{-1} & & & 0 \\ & V_2^{-1} & & \\ & & \ddots & \\ 0 & & & V_n^{-1} \end{bmatrix} \tag{6-8-3}$$

The other interesting objective function is typified by Eq. (4-10-12), and
occurs when a subset of variables $y_\mu$ (not exceeding $m$ in number) have unknown
covariance, while the remaining (if any) variables $x_\mu$ have known covariance
$P_\mu$. We have then (assuming zero correlation between $x_\mu$ and $y_\mu$)

$$\Phi(\hat{x}, \hat{y}) = (n/2) \log\det \sum_{\mu=1}^{n} (\hat{y}_\mu - y_\mu)(\hat{y}_\mu - y_\mu)^T + \tfrac{1}{2}\sum_{\mu=1}^{n} (\hat{x}_\mu - x_\mu)^T P_\mu^{-1} (\hat{x}_\mu - x_\mu)$$

The vector $q$ is made up of elements

$$\partial\Phi/\partial\hat{x}_\mu = P_\mu^{-1}(\hat{x}_\mu - x_\mu), \qquad \partial\Phi/\partial\hat{y}_\mu = n M_{yy}^{-1}(\hat{y}_\mu - y_\mu) \tag{6-8-4}$$

The Hessian $H$ has a complicated form, but we may use the Gauss approximation
in which $(1/n)M_{yy}$ replaces the covariance of $y_\mu$ where required. We take
then

$$\partial^2\Phi/\partial\hat{x}_\mu\,\partial\hat{x}_\mu = P_\mu^{-1}, \qquad \partial^2\Phi/\partial\hat{x}_\mu\,\partial\hat{y}_\mu = 0, \qquad \partial^2\Phi/\partial\hat{y}_\mu\,\partial\hat{y}_\mu \approx n M_{yy}^{-1} = n\left[\sum_{\mu=1}^{n} (\hat{y}_\mu - y_\mu)(\hat{y}_\mu - y_\mu)^T\right]^{-1} \tag{6-8-5}$$

In the sequel, then, we shall assume that Eqs. (2) and (3) apply always,
with (in the case of partly unknown covariance)

$$V_\mu = \begin{bmatrix} P_\mu & 0 \\ 0 & (1/n)M_{yy} \end{bmatrix} \tag{6-8-6}$$

Note that $P_\mu$ remains the same, but $(1/n)M_{yy}$ varies from iteration to iteration.
If the unknown portions of $V$ are assumed diagonal, then we only use
the corresponding diagonal elements of $(1/n)M$, and substitute zeroes for the
off-diagonal elements.

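We now work out the details of the algorithm. Letting

$$e_\mu \equiv \hat{w}_\mu - w_\mu \tag{6-8-7}$$

we note that:

$$q = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix} \quad \text{where } q_\mu \equiv V_\mu^{-1} e_\mu \tag{6-8-8}$$

$$H^{-1} = \begin{bmatrix} V_1 & & & 0 \\ & V_2 & & \\ & & \ddots & \\ 0 & & & V_n \end{bmatrix} \tag{6-8-9}$$

$$G = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{bmatrix} \tag{6-8-10}$$

$$A = \begin{bmatrix} A_1 & & & 0 \\ & A_2 & & \\ & & \ddots & \\ 0 & & & A_n \end{bmatrix} \quad \text{where } A_\mu \equiv \partial g_\mu/\partial\hat{w}_\mu \tag{6-8-11}$$

$$B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix} \quad \text{where } B_\mu \equiv \partial g_\mu/\partial\theta \tag{6-8-12}$$

so that:

$$C \equiv A H^{-1} A^T = \begin{bmatrix} C_1 & & & 0 \\ & C_2 & & \\ & & \ddots & \\ 0 & & & C_n \end{bmatrix} \quad \text{where } C_\mu \equiv A_\mu V_\mu A_\mu^T \tag{6-8-13}$$

$$D \equiv B^T C^{-1} B = \sum_{\mu=1}^{n} B_\mu^T C_\mu^{-1} B_\mu \tag{6-8-14}$$

Computations which we leave as an exercise result in:

$$\delta\theta = D^{-1} \sum_{\mu=1}^{n} B_\mu^T C_\mu^{-1} (A_\mu e_\mu - g_\mu) \tag{6-8-15}$$

$$\lambda_\mu = C_\mu^{-1}(B_\mu\,\delta\theta - A_\mu e_\mu + g_\mu) \tag{6-8-16}$$

$$\delta\hat{w}_\mu = -e_\mu - V_\mu A_\mu^T \lambda_\mu \tag{6-8-17}$$
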
The pseudo-objective function of Section 6-7 takes the form

$$Z = G^T Q G = \sum_{\mu=1}^{n} g_\mu^T C_\mu^{-1} g_\mu \tag{6-8-18}$$

Further simplifications occur in the single equation case. Here $g_\mu$ is a
single number, while $A_\mu$ and $B_\mu$ are row vectors $a_\mu^T$ and $b_\mu^T$ respectively.
Hence:

$$a_\mu \equiv \partial g_\mu/\partial\hat{w}_\mu, \qquad b_\mu \equiv \partial g_\mu/\partial\theta \tag{6-8-19}$$

$$C = \mathrm{diag}(c_\mu), \quad \text{where } c_\mu \equiv a_\mu^T V_\mu a_\mu \tag{6-8-20}$$

$$D = \sum_{\mu=1}^{n} (1/c_\mu)\, b_\mu b_\mu^T \tag{6-8-21}$$

$$\delta\theta = D^{-1} \sum_{\mu=1}^{n} (1/c_\mu)(a_\mu^T e_\mu - g_\mu)\, b_\mu \tag{6-8-22}$$

$$\lambda_\mu = (1/c_\mu)(b_\mu^T\,\delta\theta - a_\mu^T e_\mu + g_\mu) \tag{6-8-23}$$

$$\delta\hat{w}_\mu = -e_\mu - \lambda_\mu V_\mu a_\mu \tag{6-8-24}$$

$$Z = g^T Q g = \sum_{\mu=1}^{n} (1/c_\mu)\, g_\mu^2 \tag{6-8-25}$$

The algorithm is always started with some initial guess $\theta_1$, and with
$\hat{w}_\mu = w_\mu$. A difficulty arises when the covariance matrices $V_\mu$ are at least
partly unknown. For then, from Eq. (6), certain rows and columns of $V_\mu$ are
taken from $(1/n)M$; but in the first iteration $M = 0$, causing $V_\mu$ to be singular.
Furthermore, a glance at Eq. (17) reveals that all components of $\delta\hat{w}_\mu$
corresponding to the $y$ variables will be zero. The difficulty is easily overcome by
arbitrarily assigning to the unknown elements of $V$ some reasonable initial
guesses for the first iteration only. The method is illustrated in Sections
6-13 and 6-14.
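The single-equation formulas, Eqs. (6-8-19)-(6-8-25), together with the step-acceptance test of Section 6-7, can be sketched as follows. The example is hypothetical and deliberately tiny: a straight line through the origin, $g_\mu = \theta \hat{w}_{\mu 1} - \hat{w}_{\mu 2}$, with one parameter and known variances `p1`, `p2`. The function name, the data, and the use of simple step halving in place of the interpolation of Section 5-14 are all assumptions made for illustration.

```python
def fit_line_through_origin(x, y, theta, p1, p2, max_iter=50):
    """Errors-in-variables fit of y = theta * x via Eqs. (6-8-19)-(6-8-25).

    Constraint per point: g = theta * wx - wy = 0, where (wx, wy) are the
    unknown "true" values; p1 and p2 are the (assumed known) variances.
    """
    wx, wy = list(x), list(y)          # start with w-hat equal to the observations
    n = len(x)
    for _iteration in range(max_iter):
        a = (theta, -1.0)              # dg/dw-hat, the same for every point here
        c = a[0] ** 2 * p1 + a[1] ** 2 * p2                            # (6-8-20)
        e = [(wx[m] - x[m], wy[m] - y[m]) for m in range(n)]
        g = [theta * wx[m] - wy[m] for m in range(n)]
        b = wx[:]                      # dg/dtheta (a scalar per point)
        ae = [a[0] * e[m][0] + a[1] * e[m][1] for m in range(n)]
        D = sum(b[m] ** 2 / c for m in range(n))                       # (6-8-21)
        dtheta = sum((ae[m] - g[m]) * b[m] / c for m in range(n)) / D  # (6-8-22)
        lam = [(b[m] * dtheta - ae[m] + g[m]) / c for m in range(n)]   # (6-8-23)
        dwx = [-e[m][0] - lam[m] * p1 * a[0] for m in range(n)]        # (6-8-24)
        dwy = [-e[m][1] - lam[m] * p2 * a[1] for m in range(n)]
        size = abs(dtheta) + max(abs(d) for d in dwx + dwy)
        if size < 1e-12:
            break                      # step is negligible: converged
        z0 = sum(gm ** 2 / c for gm in g)                              # (6-8-25)
        rho = 1.0
        for _attempt in range(30):     # shorten the step until Z decreases
            t = theta + rho * dtheta
            tx = [wx[m] + rho * dwx[m] for m in range(n)]
            ty = [wy[m] + rho * dwy[m] for m in range(n)]
            z1 = sum((t * tx[m] - ty[m]) ** 2 / c for m in range(n))
            if z1 < z0:
                break
            rho *= 0.5
        else:
            break                      # no acceptable step found
        theta, wx, wy = t, tx, ty
    return theta, wx, wy

# Data lying exactly on y = 2x: one full step from theta = 1 reaches the answer.
theta_exact, _, _ = fit_line_through_origin([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 1.0, 1.0, 1.0)
print(theta_exact)
```

Note that `z1` is evaluated with the same weight `c` as `z0`, in keeping with the remark that $Z_0$ and $Z_1$ may be compared only within a single iteration.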
6-9. Penalty Functions

The idea of penalty functions can also be applied to equality constraints.
Here we penalize values of the unknown $\phi$ (including both $\theta$ and $\hat{w}$) according
to their deviations from the constraints. To the objective function $\Phi(\phi)$ we
add a term proportional to $g_j^2(\phi)$ for each equality constraint $g_j = 0$. SUMT
(sequential unconstrained minimization technique) as applied to problems
including both equality and inequality constraints consists of defining the
objective functions (Fiacco and McCormick 1965, 1967)

$$\Phi_k^\dagger(\phi) \equiv \Phi(\phi) + \alpha_k \sum_j 1/h_j(\phi) + \alpha_k^{-1/2} \sum_j g_j^2(\phi) \tag{6-9-1}$$

where $\alpha_1, \alpha_2, \ldots$ is a sequence of decreasing positive numbers converging to
zero. Let $\phi_k^*$ be the value of $\phi$ which minimizes $\Phi_k^\dagger$; then, under suitable
assumptions on the convexity of the feasible region and the concavity of the
functions, Fiacco and McCormick prove the convergence of the sequence
$\phi_1^*, \phi_2^*, \ldots$ to the minimum of $\Phi(\phi)$ satisfying the constraints

$$h_j(\phi) \ge 0 \quad (j = 1, 2, \ldots), \qquad g_j(\phi) = 0 \quad (j = 1, 2, \ldots) \tag{6-9-2}$$

The application of the Gauss method to the minimization of $\Phi_k^\dagger$ is
obvious.
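A minimal sketch of the sequence of problems defined by Eq. (6-9-1) is given below. Everything specific in it is an assumption for illustration: the objective `phi`, the single inequality `h` and single equality `g`, the sequence $\alpha_k = 10^{-k}$, and above all the crude coordinate search `pattern_search`, which is used here only as a stand-in for the Gauss method.

```python
def pattern_search(f, x, step=0.5, tol=1e-10):
    """Crude coordinate search; a stand-in for the Gauss method (assumption)."""
    x = list(x)
    fx = f(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                trial = x[:]
                trial[i] += d
                ft = f(trial)
                if ft < fx:
                    x, fx, improved = trial, ft, True
                    break
        if not improved:
            step *= 0.5                # nothing helped at this scale: refine
    return x

def phi(p):                            # hypothetical objective
    return (p[0] - 3.0) ** 2 + (p[1] - 2.0) ** 2

def h(p):                              # inequality constraint, h >= 0
    return 1.5 - p[0]

def g(p):                              # equality constraint, g = 0
    return p[1] - 1.0

def penalized(p, alpha):
    """Eq. (6-9-1) with one inequality and one equality constraint."""
    if h(p) <= 0.0:
        return float("inf")            # keep the search strictly feasible
    return phi(p) + alpha / h(p) + alpha ** -0.5 * g(p) ** 2

point = [0.0, 0.0]                     # feasible starting guess
for k in range(9):
    alpha = 10.0 ** (-k)               # alpha_1, alpha_2, ... decreasing to zero
    point = pattern_search(lambda p: penalized(p, alpha), point)
print(point)                           # approaches the constrained minimum (1.5, 1)
```

Each minimization is started from the solution of the previous one, so the iterates stay inside the feasible region while the weight on the equality constraint grows.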
6-10. Linear Equality Constraints
If the unknown parameters are supposed to satisfy linear equality relation-
ships, these can be handled by means of the projection method of Section 6-2,
All we need to do is include permanently the equality constraints in the bind-
ing set by sweeping the corresponding rows of E. All tests on the sign of the
last element in a row belonging to an equality constraint are omitted, The
initial guess must be chosen so that it satisfies all the constraints.

6-11. Least Squares Problem with Penalty Functions

We return to the single equation least squares problem of Section 5-21.
We recall that we encountered some difficulties in converging to the solution
when starting from the initial guess $\theta_1 = [100, 2000]^T$. We shall attempt to
overcome these difficulties by imposing bounds on the parameters. Specifically
let us require that

$$0 \le \theta_1 \le 100{,}000, \qquad 0 \le \theta_2 \le 2{,}000{,}000$$

corresponding to the constraints:

$$h_1(\theta) \equiv \theta_1 \ge 0, \quad h_2(\theta) \equiv 100{,}000 - \theta_1 \ge 0, \quad h_3(\theta) \equiv \theta_2 \ge 0, \quad h_4(\theta) \equiv 2{,}000{,}000 - \theta_2 \ge 0 \tag{6-11-1}$$

According to Eq. (6-1-2) we form the penalty function

$$\zeta(\theta) \equiv \sum_{j=1}^{4} \zeta_j(\theta) = 0.01/\theta_1 + 0.01/(100{,}000 - \theta_1) + 0.2/\theta_2 + 0.2/(2{,}000{,}000 - \theta_2) \tag{6-11-2}$$

The coefficients (l.i were determined as 10- 50 1 . Our new objective function is

$$\Phi^\dagger(\theta) = \sum_{\mu=1}^{15} e_\mu^2(\theta) + \zeta(\theta) \tag{6-11-3}$$

Using Eqs. (6-1-7) and (6-1-9) we find:

$$q^\dagger = \begin{bmatrix} -2\sum_{\mu=1}^{15} e_\mu \dfrac{\partial f_\mu}{\partial\theta_1} - \dfrac{0.01}{\theta_1^2} + \dfrac{0.01}{(100{,}000 - \theta_1)^2} \\[2ex] -2\sum_{\mu=1}^{15} e_\mu \dfrac{\partial f_\mu}{\partial\theta_2} - \dfrac{0.2}{\theta_2^2} + \dfrac{0.2}{(2{,}000{,}000 - \theta_2)^2} \end{bmatrix} \tag{6-11-4}$$

$$N^\dagger = \begin{bmatrix} 2\sum_{\mu=1}^{15} \left(\dfrac{\partial f_\mu}{\partial\theta_1}\right)^2 + \dfrac{0.02}{\theta_1^3} + \dfrac{0.02}{(100{,}000 - \theta_1)^3} & 2\sum_{\mu=1}^{15} \dfrac{\partial f_\mu}{\partial\theta_1}\dfrac{\partial f_\mu}{\partial\theta_2} \\[2ex] 2\sum_{\mu=1}^{15} \dfrac{\partial f_\mu}{\partial\theta_1}\dfrac{\partial f_\mu}{\partial\theta_2} & 2\sum_{\mu=1}^{15} \left(\dfrac{\partial f_\mu}{\partial\theta_2}\right)^2 + \dfrac{0.4}{\theta_2^3} + \dfrac{0.4}{(2{,}000{,}000 - \theta_2)^3} \end{bmatrix} \tag{6-11-5}$$

For the first iteration, using the data of Eq. (5-21-10), we find:

$$\Phi_1^\dagger = 5.299702$$

$$q_1^\dagger = \begin{bmatrix} -0.0007098080 - 0.01/100^2 \\ 0.0002442936 - 0.2/2000^2 \end{bmatrix} = \begin{bmatrix} -0.0007108080 \\ 0.0002442436 \end{bmatrix}$$

$$N_1^\dagger = \begin{bmatrix} 0.7036033 \times 10^{-7} + 0.02/100^3 & -0.2354773 \times 10^{-7} \\ -0.2354773 \times 10^{-7} & 0.07896382 \times 10^{-7} + 0.4/2000^3 \end{bmatrix} = \begin{bmatrix} 0.9036033 & -0.2354773 \\ -0.2354773 & 0.07946382 \end{bmatrix} \times 10^{-7}$$

$$v_1^\dagger = -(N_1^\dagger)^{-1} q_1^\dagger = \begin{bmatrix} -629.9713 \\ -32604.29 \end{bmatrix}$$

Comparing $v_1^\dagger$ to $v_1$ given by Eq. (5-21-11) we see that although the penalty
function has but a small effect on the value of $\Phi$, it has the power to turn the
step direction away from the troublesome $\theta_1 = 0$ axis.
We compute now the largest value of $\rho$ for which $\theta_1 + \rho v_1^\dagger$ is feasible:

$$\rho_{\max} = \min\{(100 - 0)/629.9713,\ (2000 - 0)/32604.29\} = 0.06134162$$

Following the flowchart of Fig. 5-2a, we initially try

$$\rho^{(0)} = 0.5\rho_{\max} = 0.03067081$$

for which

$$\theta = \begin{bmatrix} 100 - 0.03067081 \times 629.9713 \\ 2000 - 0.03067081 \times 32604.29 \end{bmatrix} = \begin{bmatrix} 80.67827 \\ 1000 \end{bmatrix}$$

where $\Phi^\dagger = 3.652150$, an acceptable value. We proceed, however, to extrapolate
according to Fig. 5-2b, and obtain

$$\theta = \begin{bmatrix} 80.67828 - 0.5 \times 0.03067081 \times 629.9713 \\ 1000 - 0.5 \times 0.03067081 \times 32604.29 \end{bmatrix} = \begin{bmatrix} 71.01740 \\ 500 \end{bmatrix}$$

with $\Phi^\dagger = 0.5349855$.
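The computation of $\rho_{\max}$ and of the first trial point can be reproduced with a short sketch. The helper name `max_feasible_multiplier` is hypothetical; the numbers are the starting point, step direction, and bounds quoted in this section.

```python
def max_feasible_multiplier(theta, v, lower, upper):
    """Largest rho for which theta + rho * v stays inside [lower, upper]."""
    rho_max = float("inf")
    for t, step, lo, hi in zip(theta, v, lower, upper):
        if step < 0:
            rho_max = min(rho_max, (t - lo) / -step)   # moving toward the lower bound
        elif step > 0:
            rho_max = min(rho_max, (hi - t) / step)    # moving toward the upper bound
    return rho_max

theta1 = [100.0, 2000.0]
v1 = [-629.9713, -32604.29]            # step direction from the text
rho_max = max_feasible_multiplier(theta1, v1, [0.0, 0.0], [100000.0, 2000000.0])
rho0 = 0.5 * rho_max                   # first trial multiplier
trial = [t + rho0 * s for t, s in zip(theta1, v1)]
print(rho_max, trial)
```

Only the component that would reach its bound first limits the step; here that is $\theta_2$, which is why the trial value of $\theta_2$ is exactly half of its starting value.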
This much improved value is the basis for starting the second iteration.
We converge after nine iterations (plotted in Fig. 5-3) of the Gauss method
to

$$\theta = \begin{bmatrix} 814.4814 \\ 961.1797 \end{bmatrix}, \qquad \Phi^\dagger = 0.04002851$$

At this point, $\Phi = 0.03980605$. One further iteration without penalty functions
leads to

$$\theta = \begin{bmatrix} 813.8381 \\ 960.9944 \end{bmatrix}, \qquad \Phi = 0.03980603$$

which is close to the solution obtained in 24 iterations without penalty
functions.
6-12. Least Squares Problem-Projection Method

Suppose we try to solve the problem of the previous section, but using the
projection method in place of penalty functions. A glance at Fig. 5-3 shows
that even in the absence of any constraints, the Gauss method never carries
one actually as far as the $\theta_1 = 0$ axis. Hence, no occasion to project a step into
any of the constraints Eq. (6-11-1) arises, and the projection method does not
affect the course of the iterations (other than avoiding some of the futile
function evaluations in the unfeasible region). Let us, however, change the
constraints to read

$$\theta_1 \ge 100, \qquad \theta_2 \ge 0 \tag{6-12-1}$$

We now apply the algorithm of Section 6-3:
1. $R = N^{-1}$ is obtained by inversion from Eq. (5-21-10) and $v^{(0)}$ from Eq.
(5-21-11). Since only $\theta_1$ is at a bound, only the first row and column are taken
from $R$, and only the first element from $v^{(0)}$, to form $E = [0.7199572 \times 10^{10},
-134608.0]$.
2. Since $\theta_1$ is at its lower bound, $m_1 = 1$.
3. $k_1 = 1$.
4. $a = m_1 k_1 e_1 = -134608 < -10^{-6}$.
5. $r = 1$. Pivot on $E_{11}$ to obtain $E = [0.1388971 \times 10^{-9}, -0.1869666
\times 10^{-4}]$, $k_1 = -1$.
4. $a = m_1 k_1 e_1 = 0.1869666 \times 10^{-4} > -10^{-6}$.
7. Since $k_1 = -1$, the bound on $\theta_1$ is binding. Hence $v_1 = 0$ ($\theta_1$ cannot
change) and $v_2 = R_{12}\lambda_1 - v_2^{(0)} = 2.146975 \times 10^{10} \times 0.1869666 \times 10^{-4} -
432361.0 = -30948.37$.

We now find $\rho_{\max} = 2000/30948.37$, and try $\delta\theta =
\rho_{\max} v_1 = [0, -2000]^T$. This brings us to $\theta = [100, 0]^T$ and $\Phi = 6.028293$,
which is unacceptable. Interpolation forces $\rho = 0.25\rho_{\max}$ with consequent
$\theta = [100, 1500]^T$, $\Phi = 4.975522$, which is acceptable. At this point the tableau
reads $E = [0.5366083 \times 10^{8}, -7951.062]$. Row 1 must be swept to produce
$E = [0.1863556 \times 10^{-7}, -0.1481725 \times 10^{-3}]$ and $v_2 = [0, -6144.531]^T$.
Only in the fourth iteration with $\theta = [100, 656.7756]^T$ do we obtain the
tableau $E = [93635.87, 196.3826]$ which does not require sweeping, so that the
lower bound on $\theta_1$ ceases to be binding. In this iteration $v_4 = [196.3826,
269.1018]^T$. Convergence occurs in ten iterations, which are also plotted in
Fig. 5-3.
6-13. Independent Variables Subject to Error

Let us take once more the model of Eq. (5-21-5), but assume now that in
addition to $y$, also $x_1$ and $x_2$ are subject to measurement errors. We shall use
the exact structural model approach, writing

$$g_\mu \equiv g(\hat{w}_{\mu 1}, \hat{w}_{\mu 2}, \hat{w}_{\mu 3}, \theta_1, \theta_2) = \exp[-\theta_1 \hat{w}_{\mu 1} \exp(-\theta_2/\hat{w}_{\mu 2})] - \hat{w}_{\mu 3} \tag{6-13-1}$$

where $\hat{w}_{\mu 1}$, $\hat{w}_{\mu 2}$, $\hat{w}_{\mu 3}$ represent the "true" but unknown values of the measured
quantities $w_{\mu 1} \equiv x_{\mu 1}$, $w_{\mu 2} \equiv x_{\mu 2}$, $w_{\mu 3} \equiv y_\mu$. We assume all measurement errors
to be independent, and form the objective function

$$\tfrac{1}{2}\sum_{a=1}^{3} (1/p_a) \sum_{\mu=1}^{15} (\hat{w}_{\mu a} - w_{\mu a})^2$$

Since we have only one equation, it follows from the discussion in Section 4-10
that a maximum likelihood estimate can be obtained for only one of the
variances $p_a$. Let us then assign standard deviations of 0.01 and 0.5 to the
time ($w_{\mu 1}$) and temperature ($w_{\mu 2}$) measurements, respectively, and let the
variance of the $w_{\mu 3}$ measurements remain unknown. Following Eq. (4-10-12),
the concentrated objective function, from which $p_3$ has been eliminated, takes
the form

$$\Phi(\hat{w}) = (15/2) \log S_3 + \tfrac{1}{2}\left((1/0.0001) S_1 + (1/0.25) S_2\right) \tag{6-13-2}$$

where

$$S_a \equiv \sum_{\mu=1}^{15} (\hat{w}_{\mu a} - w_{\mu a})^2 \qquad (a = 1, 2, 3) \tag{6-13-3}$$

Because we take as initial guesses $\hat{w}_{\mu a} = w_{\mu a}$, it follows that $S_3 = 0$ initially,
and neither $\Phi$ nor its derivatives can be evaluated. The solution suggested
in Section 6-8 is to take for the first iteration only the objective function

$$\Phi^{(0)}(\hat{w}) = \tfrac{1}{2}\left((1/0.0001) S_1 + (1/0.25) S_2 + (1/v^{(0)}) S_3\right) \tag{6-13-4}$$

where $v^{(0)}$ is an initial guess for the variance of $w_{\mu 3}$. We take $v^{(0)} = 0.01^2 =
0.0001$. As our initial guess for $\theta$ we take $\theta_1 = [750, 1200]^T$.
The solution is obtained iteratively, using Eqs. (6-8-19)-(6-8-25). We present
details of the first iteration.
First we compute the vectors $a_\mu = (\partial g_\mu/\partial\hat{w}_\mu)$ and $b_\mu = (\partial g_\mu/\partial\theta)$ for $\mu =
1, 2, \ldots, 15$. The values of $b_\mu$ already appear in the last two columns of Table
5-3. In Table 6-1 we list the values of $a_\mu$, as well as of $g_\mu$ (from Eq. (1)) and of
$c_\mu$ which according to Eq. (6-8-20) is given by

$$c_\mu = a_\mu^T V_\mu a_\mu = 0.0001\, a_{\mu 1}^2 + 0.25\, a_{\mu 2}^2 + v^{(0)} a_{\mu 3}^2$$

Table 6-1
First Iteration Data
(columns $a_{\mu 1}$, $a_{\mu 2}$, $a_{\mu 3}$ are the components of $a_\mu = \partial g_\mu/\partial\hat{w}_\mu$)

$\mu$   $g_\mu = g(\theta, w_\mu)$   $a_{\mu 1}$       $10^4 a_{\mu 2}$   $a_{\mu 3}$   $10^3 c_\mu$
 1     0.01953936     -0.004606035    -0.5527238    -1    0.1000029
 2     0.01607883     -0.004603911    -1.104938     -1    0.1000052
 3     0.04361850     -0.004601792    -1.656644     -1    0.1000090
 4     0.01915848     -0.004599672    -2.207842     -1    0.1000143
 5     0.004698634    -0.004597552    -2.758531     -1    0.1000211
 6     0.2852362      -1.694046       -25.41069     -1    0.3885932
 7     0.2863515      -1.543676       -46.31024     -1    0.3436550
 8     0.3016463      -1.406653       -63.29931     -1    0.3078843
 9     0.4644834      -1.281794       -76.90754     -1    0.2790862
10     0.4612821      -1.168016       -87.60117     -1    0.2556110
11     0.1937739      -10.43681       -27.83145     -1    10.9946
12     0.2602563      -7.929612       -42.29121     -1    6.392341
13     0.4045841      -6.024710       -48.19766     -1    3.735518
14     0.3172247      -4.577417       -48.82574     -1    2.201234
15     0.1871758      -3.477806       -46.37070     -1    1.314888
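The last column of Table 6-1 follows directly from Eq. (6-8-20) with a diagonal covariance. The short check below recomputes $c_\mu$ for rows 1 and 6; the helper name `c_mu` is hypothetical, and the inputs are the tabulated $a_\mu$ components (with the $10^4$ scaling of the second component undone).

```python
def c_mu(a, variances):
    """c = a^T V a for a diagonal covariance V (Eq. 6-8-20)."""
    return sum(v * ai ** 2 for ai, v in zip(a, variances))

variances = [0.0001, 0.25, 0.0001]           # 0.01**2, 0.5**2, and the guess v0
a_1 = [-0.004606035, -0.5527238e-4, -1.0]    # row 1 of Table 6-1
a_6 = [-1.694046, -25.41069e-4, -1.0]        # row 6 of Table 6-1
c_1 = c_mu(a_1, variances)
c_6 = c_mu(a_6, variances)
print(1e3 * c_1, 1e3 * c_6)                  # compare with the 10^3 c column
```

For the first five rows the term $v^{(0)} a_{\mu 3}^2 = 0.0001$ dominates, which is why those entries of $10^3 c_\mu$ are all close to 0.1.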
We now compute $D = \sum_{\mu=1}^{15} (1/c_\mu)\, b_\mu b_\mu^T$. The first term ($\mu = 1$) in the
sum is

$$\frac{1}{0.1000029 \times 10^{-3}} \times \begin{bmatrix} (0.6141379)^2 \times 10^{-12} & -0.6141379 \times 0.4606032 \times 10^{-11} \\ -0.6141379 \times 0.4606032 \times 10^{-11} & (0.4606032)^2 \times 10^{-10} \end{bmatrix}$$

and the result is

$$D = \begin{bmatrix} 0.001794073 & -0.006267238 \\ -0.006267238 & 0.02235472 \end{bmatrix}$$

Using Eq. (6-8-22) we compute $\delta\theta$. In the first iteration $e_\mu = 0$. The first
term of the sum is thus $-(g_1/c_1)\, b_1$, and

$$\delta\theta = \begin{bmatrix} -846.9727 \\ -563.7969 \end{bmatrix}$$

It is now easy to compute $\lambda_\mu$ and $\delta\hat{w}_\mu$ using Eqs. (6-8-23) and (6-8-24) in turn.
Table 6-2 contains the results.

Table 6-2
First Iteration Results

$\mu$   $\lambda_\mu$       $\delta\hat{w}_{\mu 1}$        $\delta\hat{w}_{\mu 2}$      $\delta\hat{w}_{\mu 3}$
 1     174.6214     0.00008043119    0.002412934    0.01746213
 2     119.2671     0.00005490948    0.003294569    0.01192671
 3     373.9075     0.0001720644     0.01548579     0.03739074
 4     108.6156     0.00004995962    0.005995154    0.01086156
 5     -56.64597    -0.00002604326   -0.003906488   -0.005664594
 6     365.7187     0.06195441       0.2323291      0.03657187
 7     74.25694     0.01146286       0.08597142     0.007425692
 8     -178.2293    -0.02507068      -0.2820446     -0.01782293
 9     112.2150     0.01438364       0.2157544      0.01122149
10     -125.6333    -0.01467417      -0.2751405     -0.01256333
11     3.384982     0.003532839      0.002355224    0.0003384980
12     3.497939     0.002773728      0.003698302    0.0003497938
13     35.72807     0.02152512       0.04305023     0.003572807
14     19.33928     0.008852392      0.02360636     0.001933928
15     -56.02644    -0.01948491      -0.06494963    -0.005602643

Finally, we evaluate our test function $Z$ (Eq. (6-8-25)), whose value turns
out to be 2508.496. After modifying $\theta$ and $\hat{w}_\mu$ by adding $\delta\theta$ and $\delta\hat{w}_\mu$ we
recompute the value of $Z$, which has now increased to 59729.57. The step is
unacceptable, and we must interpolate. The method of Fig. 5-2 should be used,
but for simplicity we just cut the step in half ($\rho = 0.5$) to obtain the new value
$\theta_2 = [533.1387, 894.4258]^T$, and $\hat{w}_\mu$ as given in Table 6-3 along with the new
values of $g_\mu$. The corresponding value of $Z$ is 908.2344, which is smaller than
the initial value and hence acceptable. We are ready now for the second iteration
for which we first compute the new value of

$$v_3 = (1/15) \sum_{\mu=1}^{15} (\hat{w}_{\mu 3} - w_{\mu 3})^2 = 0.6729098 \times 10^{-4}$$
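The new variance estimate $v_3$ can be checked from the last column of Table 6-2. The sketch below assumes, as stated in the text, that $\hat{w}_\mu = w_\mu$ before the step and that the step was cut in half, so that each residual $\hat{w}_{\mu 3} - w_{\mu 3}$ equals half the tabulated correction; the variable names are illustrative.

```python
# Last column of Table 6-2: the full-step corrections to w-hat_mu3.
dw3 = [0.01746213, 0.01192671, 0.03739074, 0.01086156, -0.005664594,
       0.03657187, 0.007425692, -0.01782293, 0.01122149, -0.01256333,
       0.0003384980, 0.0003497938, 0.003572807, 0.001933928, -0.005602643]

rho = 0.5                                   # the step was cut in half
residuals = [rho * d for d in dw3]          # w-hat_mu3 - w_mu3 after the step
v3 = sum(r ** 2 for r in residuals) / len(residuals)
print(v3)                                   # about 0.673e-4
```

Because the residuals are of order 0.01, their mean square is of order $10^{-4}$, the same order as the initial guess $v^{(0)} = 0.0001$.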
Table 6-3
Start of Second Iteration

$\mu$   $g_\mu$            $\hat{w}_{\mu 1}$       $\hat{w}_{\mu 2}$     $\hat{w}_{\mu 3}$
 1     0.007910669    0.1000401     100.0012    0.9897310
 2     0.004332781    0.2000274     100.0016    0.9889632
 3     0.01625860     0.3000859     100.0077    0.9736953
 4     0.002205729    0.4000249     100.0030    0.9844307
 5     -0.006835222   0.4999869     99.99805    0.9901676
 6     0.1198401      0.08097714    200.1161    0.6442859
 7     0.1565019      0.1057313     200.0430    0.5477127
 8     0.1889961      0.1374646     199.8590    0.4460884
 9     0.2718676      0.2071918     200.1079    0.2306107
10     0.2879401      0.2426628     199.8624    0.1607183
11     0.1505129      0.02176642    300.0010    0.5661692
12     0.2136052      0.04138686    300.0017    0.3171748
13     0.3027226      0.07076252    300.0212    0.03578640
14     0.2576900      0.08442611    300.0115    0.01696696
15     0.1881614      0.09025747    299.9625    0.06319863
Convergence is obtained in 21 iterations. The final results are given in Table
6-4. The value of the objective function Eq. (2) at the solution is $-32.09045$.
The final value of $D$, which we shall need in Section 7-23, is

$$D = \begin{bmatrix} 0.0005472710 & -0.003062132 \\ -0.003062132 & 0.01747675 \end{bmatrix} \tag{6-13-5}$$

We shall also need later the moment matrix of the final residuals

$$M = \sum_{\mu=1}^{15} e_\mu^* e_\mu^{*T} = \sum_{\mu=1}^{15} (\hat{w}_\mu^* - w_\mu)(\hat{w}_\mu^* - w_\mu)^T$$

which, using the data of Table 6-4, we find to be

$$M = \begin{bmatrix} 0.001781781 & 0.009602152 & 0.001960129 \\ 0.009602152 & 0.07176667 & 0.01202689 \\ 0.001960129 & 0.01202689 & 0.0041455559 \end{bmatrix} \tag{6-13-6}$$

Table 6-4
Final Results, with $\theta^* = [1170.861, 1027.773]^T$

$\mu$   $\hat{w}_{\mu 1}^*$        $\hat{w}_{\mu 2}^*$     $\hat{w}_{\mu 3}^*$      $\lambda_\mu^*$
 1     0.1002329      100.0060    0.9959696     57.79277
 2     0.2001303      100.0067    0.9919683     32.46887
 3     0.3004779      100.0369    0.9879283     119.1740
 4     0.4000721      100.0074    0.9840074     18.15489
 5     0.4998157      99.97632    0.9801232     -46.54858
 6     0.06580788     200.0668    0.6359499     36.13177
 7     0.09067100     199.9456    0.5370256     -25.30307
 8     0.1235170      199.7895    0.4301884     -90.03149
 9     0.2081143      200.1083    0.2386000     49.34750
10     0.2534879      200.0567    0.1749817     28.97121
11     0.01497819     299.9978    0.5653576     -23.33610
12     0.03035963     299.9915    0.3147850     -8.046125
13     0.07608849     300.0349    0.05511629    76.52017
14     0.08861703     300.0217    0.03421704    65.96593
15     0.08537388     299.9644    0.03879805    -98.72794

The slow convergence of the algorithm in this problem is not typical. In
many different cases relating to the same problem convergence occurred in
fewer than ten iterations.
6-14. An Implicit Equations Model

In Section 5-23 we estimated the parameters $c$ appearing in Eq. (5-23-1)
by solving these equations explicitly for the dependent variables $z_1$ and $z_2$.
Suppose, however, that explicit solutions were impossible. We could then use
the methods of this chapter, by introducing as unknowns the "true" values
$\hat{z}_{\mu 1}$ and $\hat{z}_{\mu 2}$, and using the model equations as constraints. Since we do not
know the covariance of the errors, we take as our objective function

$$\Phi(\hat{z}) = (41/2) \log\det M = (41/2) \log\left\{ \sum_{\mu=1}^{41} (\hat{z}_{\mu 1} - z_{\mu 1})^2 \sum_{\mu=1}^{41} (\hat{z}_{\mu 2} - z_{\mu 2})^2 - \left[ \sum_{\mu=1}^{41} (\hat{z}_{\mu 1} - z_{\mu 1})(\hat{z}_{\mu 2} - z_{\mu 2}) \right]^2 \right\}$$

The $\hat{z}_\mu$ and $c$ must satisfy the model equations which become after taking
logarithms:

$$\log c_1 + z_{\mu 4} c_2 \log 10 - (c_3/c_4) \log[c_5 \hat{z}_{\mu 1}^{-c_4} + (1 - c_5) \hat{z}_{\mu 2}^{-c_4}] - \log z_{\mu 3} = 0$$

$$\log c_5 - \log(1 - c_5) - (1 + c_4) \log \hat{z}_{\mu 1} + (1 + c_4) \log \hat{z}_{\mu 2} - \log z_{\mu 5} = 0$$

$$(\mu = 1, 2, 3, \ldots, 41)$$

Application of the method of Section 6-8 is now straightforward. We have

$$A_\mu = \frac{\partial g}{\partial \hat{z}_\mu} = \begin{bmatrix} \dfrac{c_3 c_5 \hat{z}_{\mu 1}^{-c_4 - 1}}{\zeta_\mu} & \dfrac{c_3 (1 - c_5) \hat{z}_{\mu 2}^{-c_4 - 1}}{\zeta_\mu} \\[2ex] -\dfrac{1 + c_4}{\hat{z}_{\mu 1}} & \dfrac{1 + c_4}{\hat{z}_{\mu 2}} \end{bmatrix}$$

where $\zeta_\mu \equiv c_5 \hat{z}_{\mu 1}^{-c_4} + (1 - c_5) \hat{z}_{\mu 2}^{-c_4}$, and

$$B_\mu^T \equiv \left(\frac{\partial g}{\partial c}\right)^T = \begin{bmatrix} \dfrac{1}{c_1} & 0 \\[1.5ex] z_{\mu 4} \log 10 & 0 \\[1.5ex] -\dfrac{1}{c_4} \log \zeta_\mu & 0 \\[1.5ex] \dfrac{c_3}{c_4^2} \log \zeta_\mu + \dfrac{c_3 [c_5 \hat{z}_{\mu 1}^{-c_4} \log \hat{z}_{\mu 1} + (1 - c_5) \hat{z}_{\mu 2}^{-c_4} \log \hat{z}_{\mu 2}]}{c_4 \zeta_\mu} & \log \dfrac{\hat{z}_{\mu 2}}{\hat{z}_{\mu 1}} \\[2ex] -\dfrac{c_3 (\hat{z}_{\mu 1}^{-c_4} - \hat{z}_{\mu 2}^{-c_4})}{c_4 \zeta_\mu} & \dfrac{1}{c_5} + \dfrac{1}{1 - c_5} \end{bmatrix}$$

For $V$ we take the value of $(1/41)M$ from the previous iteration. In the first
iteration, we use $V_1 = I$. Starting from the initial guess $c = [1, 1, 1, 1, 0.5]^T$,
we converged to the same value of $c^*$ as given under case (a), Table 5-8, in
twelve iterations.

6-15. Problems
1. Verify Eqs. (6-8-15)-(6-8-17).
2. Prove that if the model equations take the form

$$g(\hat{w}_\mu, \theta) = \hat{w}_\mu - f_\mu(\theta) = 0$$

then the iterations produced by the method of Section 6-8 are identical to
those produced by the Gauss method.
3. Using the data of Section 5-21, find the parameter estimates which
minimize the maximum residual (MLE for uniform error distribution).
Compare the results to the least squares estimates.

Chapter VII
Interpretation of the Estimates

7-1. Introduction

It is not enough to compute a vector $\theta^*$ and to state that this is the estimated
value of the unknown parameters $\theta$. We must also investigate the
reliability and precision of our estimates. We wish to answer questions such
as "what are the chances that the estimate is off by no more than 1%?"
or "how much can we change the estimates and still fit the data well?"
There are several ways in which one can go about answering these questions;
some of these are of a heuristic nature, while others depend on statistical
considerations. We shall present several alternative approaches in the succeeding
sections.
Even more important than the question of the reliability of the estimate
is that of the reliability of the model itself. This question is answered by
goodness of fit criteria and statistical hypothesis testing. We cover these
topics only very briefly here, since extensive treatments can be found in the
statistical texts. In particular, the reader may consult Anderson (1958) on
the topics that are of direct interest to us here, and Lehmann (1959) for a
more general treatment.
Some of the statistical tests and estimates of variability that we discuss
here apply only approximately to nonlinear models. Refinement of these
approximations is often possible [see, e.g., Beale (1960), Hartley (1964), and
Guttman and Meeter (1965)], but even with linear models, the tests are
exact only if the measurement errors do indeed follow whatever distribution
was assumed for them. Since this is rarely if ever so, even so-called "exact"
tests are only approximate in practice. Furthermore, we do not feel that the
statement "the probability that model A is incorrect is exactly 5%" has
greater practical utility than the statement "the probability that model A
is incorrect is approximately 5%." For these reasons, we present only the
simplest approximate tests, and leave the reader interested in more exact
formulations to consult the cited references.

7-2. Response Surface Techniques
The estimate $\theta^*$ is usually obtained by minimizing or maximizing some
function $\Phi(\theta)$. Then $\Phi^* \equiv \Phi(\theta^*)$ is the "best" attainable value of the objective
function $\Phi$. Suppose for a moment that $\Phi(\theta)$ is a risk function of decision
theory (see Section 4-16), i.e., the value of $\Phi(\theta)$ represents the economic
loss that we expect to sustain if we act on the assumption that the parameters
have the value $\theta$. In this case $\Phi^*$ is the minimum possible expected loss.
However, should some parameter values $\theta \ne \theta^*$ give rise to a risk $\Phi(\theta)$
that is only insignificantly larger than $\Phi^*$, then we have no compelling reason
to prefer $\theta^*$ over $\theta$. In fact, let $\varepsilon$ be the largest difference between risks that
we are willing to consider insignificant. Then we have no reason to prefer $\theta^*$
over any other value of $\theta$ for which

$$|\Phi(\theta) - \Phi^*| \le \varepsilon \tag{7-2-1}$$

We refer to the set of values of $\theta$ which satisfy Eq. (1) as the $\varepsilon$-indifference
region.
The argument used here may be applied heuristically to any other objective
function. The fact that we have elected to minimize a function $\Phi(\theta)$ means
that we set some store by obtaining a low value of this function. It is not
unreasonable to suppose that values of $\Phi$ almost as low as $\Phi^*$ would satisfy
us almost as much as $\Phi^*$. This gives rise to an indifference region in $\theta$ space
as described by Eq. (1). The choice of a suitable $\varepsilon$ may be more arbitrary
when $\Phi$ is a sum of squares or a likelihood than when it is an economic risk.
Once $\varepsilon$ is chosen, however, the analysis is the same in all cases.
When $\Phi$ is continuous and $\theta^*$ its unique unconstrained minimum, the
$\varepsilon$-indifference region for a sufficiently small positive $\varepsilon$ is a simply-connected
domain surrounding $\theta^*$ in the $l$-dimensional $\theta$ space. The region is bounded
by the $(l-1)$-dimensional hypersurface whose equation is

$$\Phi(\theta) = \Phi^* + \varepsilon \tag{7-2-2}$$

We shall restrict our attention to regions of this nature; i.e., we shall ignore
the possibility that for a given $\varepsilon$ there may be regions surrounding local
minima other than $\theta^*$ in which Eq. (1) holds.
In a sufficiently small neighborhood of $\theta^*$ we may approximate $\Phi$ by
means of the first few terms of its Taylor series expansion

$$\Phi(\theta) \approx \Phi^* + q^{*T}\,\delta\theta + \tfrac{1}{2}\,\delta\theta^T H^*\,\delta\theta \tag{7-2-3}$$

where $\delta\theta \equiv \theta - \theta^*$, and $q^*$ and $H^*$ are, respectively, the gradient and Hessian
of $\Phi$ at $\theta = \theta^*$. If $\theta^*$ is an unconstrained optimum of $\Phi$, then $q^* = 0$ and
Eq. (3) becomes

$$\Phi(\theta) \approx \Phi^* + \tfrac{1}{2}\,\delta\theta^T H^*\,\delta\theta \tag{7-2-4}$$

so that the $\varepsilon$-indifference region is defined, approximately, by

$$|\delta\theta^T H^*\,\delta\theta| \le 2\varepsilon \tag{7-2-5}$$

Let $A = H^*$ if $\theta^*$ is a minimum ($H^*$ positive definite), and $A = -H^*$
if $\theta^*$ is a maximum ($H^*$ negative definite). In either case, $A$ is positive definite
(semidefinite in exceptional cases), and Eq. (5) becomes

$$\delta\theta^T A\,\delta\theta \le 2\varepsilon \tag{7-2-6}$$

which is the equation of an $l$-dimensional ellipsoid whose volume is
$(2\varepsilon\pi)^{l/2} \det^{-1/2} A\,/\,\Gamma(l/2 + 1)$. The ellipsoids corresponding to different values of $\varepsilon$ are
concentric and similar in shape and orientation, so that much information
can be gained from the analysis of the matrix $A$, without regard to the actual
value of $\varepsilon$.
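As a quick illustration of the volume formula, the sketch below evaluates it for a 2-by-2 matrix $A$ (the same matrix that is used in the two-dimensional example of this section) with $\varepsilon = 0.5$. The function name `indifference_volume` is hypothetical, and the determinant is passed in directly to keep the example free of matrix routines.

```python
import math

def indifference_volume(det_a, dim, eps):
    """Volume of {d : d^T A d <= 2*eps} for a positive definite A."""
    return (2.0 * eps * math.pi) ** (dim / 2) / math.sqrt(det_a) / math.gamma(dim / 2 + 1)

A = [[0.505, -0.495], [-0.495, 0.505]]
det_a = A[0][0] * A[1][1] - A[0][1] * A[1][0]      # determinant of a 2-by-2 matrix
area = indifference_volume(det_a, 2, 0.5)
print(area)                                        # area of the ellipse, about 31.4
```

In one dimension the same formula reduces to the length of the interval $2(2\varepsilon/A_{11})^{1/2}$, which is a convenient sanity check.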
We can now answer the question of how much the individual parameter
$\theta_\alpha$ can be varied from its optimal value $\theta_\alpha^*$. If we let $\delta\theta_\beta = 0$ for all $\beta \ne \alpha$,
Eq. (6) reduces to

$$A_{\alpha\alpha}\,\delta\theta_\alpha^2 \le 2\varepsilon \tag{7-2-7}$$

so that

$$\theta_\alpha^* - (2\varepsilon/A_{\alpha\alpha})^{1/2} \le \theta_\alpha \le \theta_\alpha^* + (2\varepsilon/A_{\alpha\alpha})^{1/2} \tag{7-2-8}$$

This is often written in shorthand notation as $\theta_\alpha = \theta_\alpha^* \pm (2\varepsilon/A_{\alpha\alpha})^{1/2}$.
We say that $\theta_\alpha$ is well-determined if the quantity $(2\varepsilon/A_{\alpha\alpha})^{1/2}$ is small on the
scale by which $\theta_\alpha$ is measured; $\theta_\alpha$ is ill-determined if $(2\varepsilon/A_{\alpha\alpha})^{1/2}$ is large.
Usually we wish the parameters to be well-determined. There are exceptions,
though; a design parameter may advantageously be ill-determined for greater
flexibility in the implementation of the design. It is important, however, that
the ill-determination should be inherent in the design, and not merely the
result of poor data.
It is not enough to determine how well the individual parameters are
determined. Consider, for instance, the two-dimensional case depicted in
Fig. 7-1, with
$$A = \begin{bmatrix} 0.505 & -0.495 \\ -0.495 & 0.505 \end{bmatrix}$$
and $\varepsilon = 0.5$. Here Eq. (6) reduces to
$$0.505\,\delta\theta_1^2 - 0.99\,\delta\theta_1\,\delta\theta_2 + 0.505\,\delta\theta_2^2 \le 1 \tag{7-2-9}$$
so that with $\delta\theta_2 = 0$ we have $|\delta\theta_1| \le 1.407$, and with $\delta\theta_1 = 0$ we have
$|\delta\theta_2| \le 1.407$. Thus $\theta_1$ and $\theta_2$ may be varied individually by $\pm 1.407$ without
leaving the indifference region. It is clear from Fig. 7-1, however, that if
we increase (or decrease) $\theta_1$ and $\theta_2$ simultaneously, much larger changes can
be tolerated. In fact, $\delta\theta_1 = \delta\theta_2 = 7.071$ satisfies Eq. (9). On the other hand,
if the changes in $\theta_1$ and $\theta_2$ are taken in opposite directions, their bound is
lower: $\delta\theta_1 = -\delta\theta_2 = 0.7071$. While $\theta_1$ and $\theta_2$ appear individually well-determined,
the quantity $\theta_1 - \theta_2$ is even better determined, but $\theta_1 + \theta_2$ is
relatively ill-determined. This implies that we have a wide latitude in choosing,
say, a value of $\theta_1$ as long as we adjust $\theta_2$ so that $\theta_1 - \theta_2$ is nearly equal to
$\theta_1^* - \theta_2^*$.

Fig. 7-1. $\varepsilon = 0.5$ uncertainty region for $A = \begin{bmatrix} 0.505 & -0.495 \\ -0.495 & 0.505 \end{bmatrix}$, plotted in the ($\delta\theta_1$, $\delta\theta_2$) plane.

Another numerical example appears in Section 7-21.
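
The bounds quoted for the two-dimensional example can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original text; the helper names `quad_form` and `bound_along` are introduced here for illustration only. For a chosen direction $d$, it finds the largest step $t$ for which $t\,d$ still satisfies Eq. (6).

```python
import math

# Matrix A and epsilon from the two-dimensional example (Eq. 7-2-9).
A = [[0.505, -0.495],
     [-0.495, 0.505]]
EPS = 0.5


def quad_form(matrix, d):
    """Return d^T matrix d for a 2-vector d."""
    return sum(d[i] * matrix[i][j] * d[j] for i in range(2) for j in range(2))


def bound_along(matrix, direction, eps):
    """Largest t such that t*direction satisfies d^T A d <= 2*eps."""
    return math.sqrt(2.0 * eps / quad_form(matrix, direction))


single = bound_along(A, [1.0, 0.0], EPS)     # theta_1 varied alone
same = bound_along(A, [1.0, 1.0], EPS)       # delta_theta_1 = delta_theta_2
opposite = bound_along(A, [1.0, -1.0], EPS)  # delta_theta_1 = -delta_theta_2

print(f"single parameter : {single:.4f}")
print(f"same direction   : {same:.4f}")
print(f"opposite         : {opposite:.4f}")
```

The three numbers correspond to the individual bound, the bound for simultaneous equal changes, and the bound for equal and opposite changes discussed above.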
7-3. Canonical Form
We wish to find the points on the ellipsoid $\delta\theta^T A\,\delta\theta = 2\varepsilon$ which are furthest
away from the origin, and also those which are closest. These points determine,
respectively, the least-determined and best-determined linear combinations of
the parameters.
To find the vector $\delta\theta$ satisfying $\delta\theta^T A\,\delta\theta = 2\varepsilon$ which maximizes or minimizes
$\delta\theta^T\delta\theta$, i.e., the squared distance from the origin, we introduce the
Lagrange multiplier $\mu$, and look for the stationary points of
$$\Omega(\delta\theta, \mu) \equiv \delta\theta^T\delta\theta - \mu(\delta\theta^T A\,\delta\theta - 2\varepsilon) \tag{7-3-1}$$
Therefore
$$\partial\Omega/\partial(\delta\theta) = 2\,\delta\theta - 2\mu A\,\delta\theta = 0 \tag{7-3-2}$$
Letting $\lambda = 1/\mu$ and rearranging, we have
$$A\,\delta\theta = \lambda\,\delta\theta \tag{7-3-3}$$
Premultiplying by $\delta\theta^T$
$$\delta\theta^T A\,\delta\theta = \lambda\,\delta\theta^T\delta\theta \tag{7-3-4}$$
Hence
$$\delta\theta^T\delta\theta = 2\varepsilon/\lambda \tag{7-3-5}$$
Eq. (3) states that the desired vector $\delta\theta$ is an (unnormalized) eigenvector of $A$
with eigenvalue $\lambda$. Eq. (5) states that the length of the vector is $(2\varepsilon/\lambda)^{1/2}$.
The $l$ eigenvectors form the $l$ principal axes of the ellipsoid. The longest
axis, corresponding to the smallest eigenvalue, defines the worst-determined
direction in $\theta$ space, and the shortest axis (largest eigenvalue) defines the best
determined direction.
Let the eigenvalue decomposition (see Section A-5) of $A$ be given by
$$A = U\Lambda U^T \tag{7-3-6}$$
where $U$ is the unitary matrix whose columns are the normalized eigenvectors
of $A$, and $\Lambda$ is the diagonal matrix of eigenvalues. Then
$$\delta\theta^T A\,\delta\theta = \delta\theta^T U\Lambda U^T\,\delta\theta \tag{7-3-7}$$
Letting $\psi = U^T\,\delta\theta$ we obtain
$$\delta\theta^T A\,\delta\theta = \psi^T\Lambda\psi = \sum_{i=1}^{l} \lambda_i\psi_i^2 \tag{7-3-8}$$
Since $U^T$ is unitary, the transformation of coordinates given by $\psi = U^T\,\delta\theta$
is a rigid rotation (with possibly some reflections) which leaves distances and
angles unaffected. The number $\psi_i$ ($i = 1, 2, \ldots, l$) is the $i$th component of the
vector $\delta\theta$ expressed in the system of coordinates whose axes are the eigenvectors
of $A$. Eq. (8) indicates that the principal axes of the ellipsoid coincide
with the coordinate axes in the $\psi$ space, and displays clearly the inverse
relationship between the lengths of the axes and the square roots of the
eigenvalues.
The $\psi_i$ are referred to as the canonical variables, and the expression
$\sum_{i=1}^{l} \lambda_i\psi_i^2$ as the canonical form of the quadratic expression $\delta\theta^T A\,\delta\theta$.
In the two-dimensional example of the preceding section, we had
$$A = \begin{bmatrix} 0.505 & -0.495 \\ -0.495 & 0.505 \end{bmatrix}$$
whose normalized eigenvectors are $[0.7071, -0.7071]$ and $[0.7071, 0.7071]$
with eigenvalues 1 and 0.01. Therefore
$$U^T = \begin{bmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{bmatrix}$$
$$\psi = \begin{bmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{bmatrix} \begin{bmatrix} \delta\theta_1 \\ \delta\theta_2 \end{bmatrix} = \begin{bmatrix} 0.7071(\delta\theta_1 - \delta\theta_2) \\ 0.7071(\delta\theta_1 + \delta\theta_2) \end{bmatrix}$$
The canonical form is $\psi_1^2 + 0.01\psi_2^2$. Thus, the principal axes have lengths
1 and 10, making the quantities $0.7071(\delta\theta_1 - \delta\theta_2)$ and $0.7071(\delta\theta_1 + \delta\theta_2)$
relatively well- and ill-determined, respectively.
Care must be taken that all the parameters be measured on compatible
scales. It is evident from Fig. 7-1 that by drastically reducing or expanding
the scale of one of the variables one can distort the ellipsoid to the point
where it contains no useful information.
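
The canonical form of the two-dimensional example can be reproduced numerically. The sketch below is illustrative only: it uses the closed-form eigen-decomposition of a symmetric 2 x 2 matrix rather than a general routine, and the function names are assumptions introduced here. Note that the sign of each eigenvector is arbitrary, so the computed vectors may differ from those quoted above by a factor of -1.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], largest eigenvalue first."""
    if b == 0.0:
        # Already diagonal: the coordinate axes are the eigenvectors.
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (A - lam*I) v = 0 gives v proportional to (b, lam - a).
        norm = math.hypot(b, lam - a)
        pairs.append((lam, (b / norm, (lam - a) / norm)))
    return pairs


EPS = 0.5
A = (0.505, -0.495, 0.505)  # entries a, b, d of the example matrix
pairs = eig_sym_2x2(*A)
eigenvalues = [lam for lam, _ in pairs]
# Length of each principal axis, (2*eps/lambda)^(1/2), as in Eq. (7-3-5).
axis_lengths = [math.sqrt(2.0 * EPS / lam) for lam in eigenvalues]


def canonical_form(delta):
    """Sum of lambda_i * psi_i^2 with psi = U^T delta (Eq. 7-3-8)."""
    total = 0.0
    for lam, (ux, uy) in pairs:
        psi = ux * delta[0] + uy * delta[1]
        total += lam * psi * psi
    return total


def quadratic_form(delta):
    """delta^T A delta evaluated directly."""
    a, b, d = A
    return a * delta[0] ** 2 + 2.0 * b * delta[0] * delta[1] + d * delta[1] ** 2


delta = (0.3, -0.2)  # arbitrary test displacement
print(eigenvalues, axis_lengths)
print(canonical_form(delta), quadratic_form(delta))
```

The last line prints the same quadratic expression evaluated in the canonical and in the original coordinates; the two values agree up to rounding.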
7-4. The Sampling Distribution
The sampling distribution of the estimates was defined in Section 3-1. It
represents the manner in which the estimates would vary in response to the
random variations we expect to occur from one data sample to another.
The sampling distribution can shed light on the reliability of the estimates.
A parameter is ill-determined if its estimated value can be affected strongly
by seemingly insignificant variations in the data. Such a situation is characterized
by the estimate having a large variance. The feature of the sampling
distribution that is of most interest to us is, therefore, its covariance matrix.
We have remarked before (Chapter III) that we cannot generally hope to
determine the true sampling distribution. The best that we can hope to do on
the basis of a single data sample, and in the absence of a Monte Carlo study,
is to arrive at a rough approximation to the covariance matrix. We would
also like to know the mean of the sampling distribution, and how it is related
to the true values of the parameters, and the actual estimate $\theta^*$. Generally we
accept $\theta^*$ as an estimate of the mean of the sampling distribution, and we
neglect the bias of this mean relative to the true value. There exist some
methods for reducing this bias (see Section 7-9). For the moment, however,
we restrict our attention to approximating the covariance matrix. In essence,
then, we attempt to answer the question "If we were to repeat our series of
experiments many times, how would the estimates differ from one replication
to the next?"
7-5. The Covariance Matrix of the Estimates
Suppose our estimate is the unconstrained minimum of some objective
function $\Phi(\theta)$. This objective function also depends on the data; in particular,
it depends on the measured values $w$ of the observed random variables. We indicate
this dependence by writing $\Phi(\theta, w)$ in place of $\Phi(\theta)$. At the minimum we have
$$\partial\Phi(\theta^*, w)/\partial\theta = 0 \tag{7-5-1}$$
Suppose we varied the data slightly, replacing $w$ by $w + \delta w$. This would
cause our minimum to shift from $\theta^*$ to $\theta^* + \delta\theta^*$, where we must have
$$\partial\Phi(\theta^* + \delta\theta^*, w + \delta w)/\partial\theta = 0 \tag{7-5-2}$$
Expanding Eq. (2) in Taylor series and retaining only terms up to first order,
we find after subtracting Eq. (1)
$$(\partial^2\Phi/\partial\theta^2)\,\delta\theta^* + (\partial^2\Phi/\partial\theta\,\partial w)\,\delta w \approx 0 \tag{7-5-3}$$
so that approximately
$$\delta\theta^* = -H^{*-1}(\partial^2\Phi/\partial\theta\,\partial w)\,\delta w \tag{7-5-4}$$
where as usual $H^* \equiv (\partial^2\Phi/\partial\theta^2)_{\theta=\theta^*}$.
The desired covariance matrix $V_\theta$ is defined by
$$V_\theta \equiv E(\delta\theta^*\,\delta\theta^{*T}) \tag{7-5-5}$$
so that
$$V_\theta \approx E\left(H^{*-1}(\partial^2\Phi/\partial\theta\,\partial w)\,\delta w\,\delta w^T(\partial^2\Phi/\partial\theta\,\partial w)^T H^{*-1}\right) \tag{7-5-6}$$
The quantities $H^*$ and $\partial^2\Phi/\partial\theta\,\partial w$ are evaluated at $\theta = \theta^*$ and at the actual
sample $w$. Hence they are constants, and can be taken outside the expectation
sign in Eq. (6)
$$V_\theta \approx H^{*-1}(\partial^2\Phi/\partial\theta\,\partial w)\,V_w\,(\partial^2\Phi/\partial\theta\,\partial w)^T H^{*-1} \tag{7-5-7}$$
where $V_w$ is the covariance matrix of the data, i.e.,
$$V_w \equiv E(\delta w\,\delta w^T) \tag{7-5-8}$$
If we assume that $w_\mu$ (i.e., the results of the $\mu$th experiment) has covariance
matrix $V_\mu$, and is independent of $w_\eta$ ($\eta \ne \mu$), then Eq. (7) reduces to
$$V_\theta \approx H^{*-1}\left[\sum_{\mu=1}^{n} (\partial^2\Phi/\partial\theta\,\partial w_\mu)\,V_\mu\,(\partial^2\Phi/\partial\theta\,\partial w_\mu)^T\right] H^{*-1} \tag{7-5-9}$$
This formula applies to any objective function, whether or not it has a
basis in statistics. More specific results can be obtained when the objective
function depends only on the moment matrix $M$ of the residuals. This class
of functions, which includes sums of squares and log-likelihood for normal
distributions, was shown to admit the Gauss approximation Eq. (5-9-10)
for $H$. We derive a similar approximation to $V_\theta$. Eq. (5-9-10) can be rewritten
as
$$H \approx 2\sum_{\mu=1}^{n} B_\mu^T\,\Gamma\,B_\mu \tag{7-5-10}$$
where
$$B_\mu \equiv -\partial e_\mu/\partial\theta = \partial f_\mu/\partial\theta, \qquad \Psi(M(\theta)) \equiv \Phi(\theta), \qquad \Gamma \equiv \partial\Psi/\partial M \tag{7-5-11}$$
Under assumptions similar to those made in deriving Eq. (10) it can be shown
that for standard reduced models with $w_\mu = y_\mu$
$$\partial^2\Phi/\partial\theta\,\partial y_\mu \approx -2B_\mu^T\,\Gamma \tag{7-5-12}$$
(the sign is immaterial in what follows, since this factor enters Eq. (9) quadratically).
Substituting Eq. (10) and Eq. (12) in Eq. (9), we obtain
$$V_\theta \approx \left(\sum_{\mu=1}^{n} B_\mu^T\Gamma B_\mu\right)^{-1}\left(\sum_{\mu=1}^{n} B_\mu^T\Gamma V_\mu\Gamma B_\mu\right)\left(\sum_{\mu=1}^{n} B_\mu^T\Gamma B_\mu\right)^{-1} \tag{7-5-13}$$
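
The structure of Eq. (13) can be seen in a small numerical sketch. The example below is an illustration under stated assumptions, not a computation from the text: it takes a hypothetical single-equation straight-line model $f_\mu = \theta_1 + \theta_2 x_\mu$ (so that $B_\mu^T = b_\mu = (1, x_\mu)$ and $\Gamma$ is a scalar), with made-up values of $x_\mu$ and of the observation variance. With equal variances $V_\mu = \sigma^2$ the triple product should collapse to $\sigma^2(\sum_\mu b_\mu b_\mu^T)^{-1}$, whatever the value of the scalar $\Gamma$.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def covariance_eq13(b_rows, gamma, variances):
    """Eq. (7-5-13) for a single-equation model with two parameters.

    b_rows[mu] is the gradient of f_mu with respect to theta, gamma is the
    scalar Gamma, and variances[mu] is the variance V_mu of observation mu.
    """
    outer = [[0.0, 0.0], [0.0, 0.0]]   # sum_mu b Gamma b^T
    inner = [[0.0, 0.0], [0.0, 0.0]]   # sum_mu b Gamma V_mu Gamma b^T
    for b, v in zip(b_rows, variances):
        for i in range(2):
            for j in range(2):
                outer[i][j] += gamma * b[i] * b[j]
                inner[i][j] += gamma * v * gamma * b[i] * b[j]
    outer_inv = inv2(outer)
    return matmul2(matmul2(outer_inv, inner), outer_inv)


# Hypothetical design points and variance (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
b_rows = [(1.0, x) for x in xs]
sigma2 = 0.25

v_equal = covariance_eq13(b_rows, 1.0, [sigma2] * len(xs))
v_other_gamma = covariance_eq13(b_rows, 3.0, [sigma2] * len(xs))

# Direct evaluation of sigma^2 (sum b b^T)^{-1} for comparison.
sum_bbt = [[sum(b[i] * b[j] for b in b_rows) for j in range(2)] for i in range(2)]
v_direct = [[sigma2 * e for e in row] for row in inv2(sum_bbt)]

print(v_equal)
print(v_direct)
```

When the variances $V_\mu$ differ from one experiment to the next, the middle factor no longer cancels against the outer ones, and the full triple product must be retained.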
A derivation similar to the one given in Appendix E for the Gauss-Markov
theorem shows that if $V_\mu = V$ ($\mu = 1, 2, \ldots, n$), then choosing $\Gamma$
proportional to $V^{-1}$ leads to the least possible value of $\det V_\theta$. This, in fact,
occurs in the cases listed below, and shows that these maximum likelihood
and pseudomaximum likelihood estimates are at least approximately optimal.
In the case of single equation least squares, assuming observations with
standard deviation $\sigma$, we have
$$B_\mu^T = b_\mu \equiv \partial f_\mu/\partial\theta, \qquad \Gamma = 1, \qquad V_\mu = \sigma^2$$
so that Eq. (13) reduces to
$$V_\theta \approx \sigma^2\left(\sum_\mu b_\mu b_\mu^T\right)^{-1} = \sigma^2\left(\sum_\mu (\partial f_\mu/\partial\theta)(\partial f_\mu/\partial\theta)^T\right)^{-1} \tag{7-5-14}$$
Comparing this to Eq. (5-9-4) shows that here
$$V_\theta \approx 2\sigma^2 N^{-1} \tag{7-5-15}$$
When $\sigma$ is not known, we replace $\sigma^2$ with its estimate $[1/(n - l)]\sum_{\mu=1}^{n} e_\mu^2 =
[1/(n - l)]\Phi^*$. A numerical illustration appears in Section 7-21.
We now treat some of the likelihood functions that were considered in
Section 5-9.
1. Normal distribution with known $V_\mu = V$. According to the corresponding row of
Table 5-1, $\Gamma = \tfrac{1}{2}V^{-1}$, so that Eq. (13) becomes in view of Eq. (10)
$$V_\theta \approx \tfrac{1}{2}\left(\sum_{\mu=1}^{n} B_\mu^T\Gamma B_\mu\right)^{-1} = N^{-1} \approx H^{*-1} \tag{7-5-16}$$
2. As above, but with unknown $V_\mu = V$. From row 3 of Table 5-1, $\Gamma =
(n/2)M^{-1}$. But according to Eq. (4-9-6), the maximum likelihood estimate for
$V$ is given by $(1/n)M$, so that approximately $\Gamma = \tfrac{1}{2}V^{-1}$ and Eq. (16) is still
valid.
For a wide class of maximum likelihood estimates with normal distributions
we have then
$$V_\theta \approx H^{*-1} = -\left(\partial^2 \log L/\partial\theta\,\partial\theta\right)^{-1}_{\theta=\theta^*} \tag{7-5-17}$$
The quality of this approximation improves as the variance of the measurements
decreases and the fit of the model to the data gets better.
For most unconstrained maximum likelihood estimates it can be shown
(Cramér, 1946, p. 500 et seq.) that asymptotically (as the series of experiments
is repeated ad infinitum) the sampling distribution approaches (with probability
1) the normal form, with means equal to the true values of the parameters,
and with covariance matrix given by
$$V_\theta = -\left[E(\partial^2 \log L/\partial\theta\,\partial\theta)\right]^{-1} \qquad (n \to \infty) \tag{7-5-18}$$
The computation of the required expectation is very tedious, if not altogether
impossible; therefore, we generally replace the expected value by the most
likely value, i.e., the value at $\theta = \theta^*$. This brings us back to Eq. (17).
Again, the acceptability of this approximation depends on the goodness of
fit. If the fit is very good, the likelihood function has a sharp peak, and the
expected and most likely values nearly coincide.
The estimates given here and in the sequel for $V_\theta$ and other statistical
parameters are computed from the data. Hence they are in themselves
random variables subject to sampling variations. As illustrated by the example
of Section 7-22 these variations can be quite large even when a good fit
to the data can be obtained. We shall not concern ourselves here with the
computation of the sampling variances of $V_\theta$. Nevertheless, we point out
that these variances can always be estimated by the Monte Carlo method
(Section 3-3). Generally the $V_\theta$ computed from any given data sample can
be regarded as no more than a rough estimate, correct to within an order of
magnitude.
The fact that the approximations may break down when the fit is poor
need not worry us too much, since in this case we would not place much
reliance on the model anyway, and would attempt to improve either the model
or the data. Even a very rough approximation to $V_\theta$ can be of considerable
use, as will be seen in Chapter X.
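
The Monte Carlo check mentioned above can be sketched for the simplest case. The example is hypothetical and deliberately small: a straight-line model fitted by ordinary least squares, with design points, true parameter values, noise level, replication count, and random seed all chosen here for illustration. The variance of the slope predicted by Eq. (7-5-14) is $\sigma^2/\sum_\mu (x_\mu - \bar{x})^2$; the simulation repeats the "series of experiments" many times and compares the observed scatter of the slope estimates with that prediction.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = t1 + t2*x; returns (t1, t2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    t2 = sxy / sxx
    return ybar - t2 * xbar, t2


def monte_carlo_slope_variance(xs, theta, sigma, replications, seed=0):
    """Sample variance of the fitted slope over simulated data sets."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(replications):
        ys = [theta[0] + theta[1] * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(fit_line(xs, ys)[1])
    mean = sum(slopes) / replications
    return sum((s - mean) ** 2 for s in slopes) / (replications - 1)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical design points
sigma = 0.5                       # hypothetical noise standard deviation
xbar = sum(xs) / len(xs)
predicted = sigma ** 2 / sum((x - xbar) ** 2 for x in xs)
simulated = monte_carlo_slope_variance(xs, (1.0, 2.0), sigma, 2000)

print(f"predicted slope variance: {predicted:.5f}")
print(f"simulated slope variance: {simulated:.5f}")
```

For this linear model the prediction is exact, so the two numbers differ only by simulation noise; for a nonlinear model the same comparison indicates how far the approximations of this section can be trusted.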
7-6. Exact Structural Model
We can also derive an approximation to $V_\theta$ for the case of structural
equations acting as equality constraints. Suppose we have obtained estimates
$\theta^*$ and $\hat{w}^*$ given the data $w$ and using the method of Section 6-6. If the data
are replaced by $w + \delta w$ there will be a correction $\delta\theta^*$ in $\theta^*$ given by Eq.
(6-6-12). Now at the solution $g = 0$, and (as the reader may verify for himself)
$B^T C^{-1} A H^{-1} q = 0$, so that Eq. (6-6-12) becomes approximately
$$\delta\theta^* = D^{-1} B^T C^{-1} A H^{-1}\,\delta q \tag{7-6-1}$$
where $\delta q = (\partial q/\partial w)\,\delta w = (\partial^2\Phi/\partial\hat{w}\,\partial w)\,\delta w$ is the change in $q$ due to the
change in $w$. Setting
$$V_w \equiv E(\delta w\,\delta w^T) \tag{7-6-2}$$
we are led to
$$V_\theta \equiv E(\delta\theta^*\,\delta\theta^{*T}) \approx D^{-1}B^TC^{-1}AH^{-1}(\partial^2\Phi/\partial\hat{w}\,\partial w)\,V_w\,(\partial^2\Phi/\partial\hat{w}\,\partial w)^T H^{-1}A^TC^{-1}BD^{-1} \tag{7-6-3}$$
To achieve further progress we assume that $\Phi = \tfrac{1}{2}(w - \hat{w})^T V_w^{-1}(w - \hat{w})$.
Hence $H^{-1} = V_w$, and $\partial^2\Phi/\partial\hat{w}\,\partial w = -V_w^{-1}$. Since $C = AH^{-1}A^T = AV_wA^T$
and $D = B^TC^{-1}B$, it follows that
$$V_\theta \approx D^{-1}B^TC^{-1}AV_wV_w^{-1}V_wV_w^{-1}V_wA^TC^{-1}BD^{-1} = D^{-1}B^TC^{-1}AV_wA^TC^{-1}BD^{-1} = D^{-1}B^TC^{-1}BD^{-1}$$
so that finally
$$V_\theta \approx D^{-1} = \left[B^T(AV_wA^T)^{-1}B\right]^{-1} \tag{7-6-4}$$
In particular, for the model treated in Section 6-8 we obtain $V_\theta$ by inverting
Eq. (6-8-14). A numerical illustration appears in Section 7-23.
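
Eq. (4) can be illustrated on a one-parameter example. The sketch below rests on an assumption that is not restated in this section: that $A$ and $B$ denote the derivatives of the structural equations with respect to the measured quantities and the parameters, respectively. The model itself is hypothetical: a line through the origin, $g_\mu = y_\mu - \theta x_\mu = 0$, with independent errors of variance `var_x` and `var_y` in the measured $x_\mu$ and $y_\mu$.

```python
def structural_variance(x_values, theta, var_x, var_y):
    """V_theta = [B^T (A V_w A^T)^{-1} B]^{-1} for g_mu = y_mu - theta*x_mu = 0.

    Assumed reading of the symbols (labelled hypothetical):
      A_mu = dg_mu/d(x_mu, y_mu) = (-theta, 1)
      B_mu = dg_mu/dtheta        = -x_mu
    With independent experiments, A V_w A^T is diagonal with the same
    entry c for every experiment.
    """
    c = theta ** 2 * var_x + var_y             # one diagonal entry of A V_w A^T
    d = sum(x * x for x in x_values) / c       # B^T C^{-1} B
    return 1.0 / d


x_values = [1.0, 2.0, 3.0]   # illustrative values
only_y_error = structural_variance(x_values, 2.0, 0.0, 0.5)
both_errors = structural_variance(x_values, 2.0, 0.1, 0.5)

print(only_y_error)   # equals var_y / sum(x^2), the ordinary least squares result
print(both_errors)    # larger, since errors in x also contribute
```

With `var_x = 0` the expression reduces to the familiar least squares variance, which gives a quick consistency check on the formula.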
7-7. Constraints
Let us examine now how the covariance matrix is affected by inequality
and equality constraints that do not depend on the data. Let $h(\theta) \ge 0$ and
$g(\theta) = 0$ denote these constraints, which, of course, are satisfied by $\theta^*$.
Let $\theta^{(i)}$ be the value of $\theta$ that optimizes $\Phi(\theta)$ with the constraint $h_i(\theta) \ge 0$
removed, but all other constraints retained. Clearly, if it happens that $\theta^{(i)}$ is
feasible, i.e., if $h_i(\theta^{(i)}) \ge 0$, then $\theta^* = \theta^{(i)}$. Actually, we distinguish the following
four cases (see Fig. 7-2).
(a) $\theta^* = \theta^{(i)}$ is well in the interior of the feasible region relative to the constraint
$h_i(\theta) \ge 0$. Therefore, different values of $\theta^{(i)}$ arising from different
data samples are likely to remain feasible, and the constraint $h_i(\theta) \ge 0$
exerts no influence on $V_\theta$.
(b) $\theta^* = \theta^{(i)}$ is feasible, but it lies very close to the surface $h_i(\theta) = 0$. In
this case, a significant number of data samples may give rise to infeasible
values of $\theta^{(i)}$, causing $\theta^*$ to lie on the constraint. The density of the sampling
distribution of $\theta^*$ is truncated: positive on one side of the constraint, infinite
on the constraint, and zero on the other side. The computation of its covariance
matrix is difficult, and will not be undertaken here.
(c) $\theta^{(i)}$ is infeasible, but only slightly so. $\theta^*$ is on the constraint. Some data
samples make $\theta^{(i)}$ feasible, and therefore some realizations of $\theta^*$ fall in the
interior of the feasible region. The distribution is similar to the one of case
(b), and we will not treat it further.
(d) $\theta^{(i)}$ is extremely infeasible, so that $\theta^*$ remains on the constraint for all
but a negligible proportion of all possible data samples. In this case, we may
treat $h_i$ as an equality constraint $h_i(\theta) = 0$.

Fig. 7-2. Inequality constraint $h_i(\theta) \ge 0$, with panels for Case (a), Case (b), Case (c), and Case (d); $\theta^{(i)}$, optimum without constraint; $\theta^*$, optimum
with constraint; solid line, boundary of region containing 90% of all realizations of $\theta^{(i)}$; dashed line, boundary
of region containing 90% of all realizations of $\theta^*$.
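Let the vector $g(\theta)$ now represent all the equality constraints, including
the inequality constraints of type (d). As we know from Section 3-6, the
Lagrangian conditions
$$\partial\Phi/\partial\theta = \sum_j \lambda_j\,\partial g_j/\partial\theta \tag{7-7-1}$$
must be satisfied at $\theta = \theta^*$. If we change the data by an amount $\delta w$, we
find $\theta^*$ changed by $\delta\theta^*$, and $\lambda_j$ changed by $\delta\lambda_j$. At the new optimum, Eq. (1)
takes the form (approximately to first-order terms)
$$\frac{\partial\Phi}{\partial\theta} + \frac{\partial^2\Phi}{\partial\theta\,\partial\theta}\,\delta\theta^* + \frac{\partial^2\Phi}{\partial\theta\,\partial w}\,\delta w = \sum_j \left(\lambda_j\frac{\partial g_j}{\partial\theta} + \delta\lambda_j\frac{\partial g_j}{\partial\theta} + \lambda_j\frac{\partial^2 g_j}{\partial\theta\,\partial\theta}\,\delta\theta^*\right) \tag{7-7-2}$$
with all derivatives evaluated at $\theta = \theta^*$. Subtracting Eq. (1) from Eq. (2)
leaves
$$A\,\delta\theta^* = (\partial g/\partial\theta)^T\,\delta\lambda - (\partial^2\Phi/\partial\theta\,\partial w)\,\delta w \tag{7-7-3}$$
where
$$A \equiv \partial^2\Phi/\partial\theta\,\partial\theta - \sum_j \lambda_j\,\partial^2 g_j/\partial\theta\,\partial\theta = H^* - \sum_j \lambda_j\,\partial^2 g_j/\partial\theta\,\partial\theta \tag{7-7-4}$$
so that
$$\delta\theta^* = A^{-1}\left((\partial g/\partial\theta)^T\,\delta\lambda - (\partial^2\Phi/\partial\theta\,\partial w)\,\delta w\right) \tag{7-7-5}$$
The variation $\delta\theta^*$ must leave the equations $g(\theta^* + \delta\theta^*) = 0$ satisfied. Hence
$$\delta g \equiv (\partial g/\partial\theta)\,\delta\theta^* = 0 \tag{7-7-6}$$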
182
VII Tnterpretation of the Estimates
':,)
"
I I
,<,' .
i.
"
}
Substituting Eq. (5) in Eq. (6) we find after solving for (SA
" [ e g _ I cg T ] - I cg _I a 2 cp
c)/, = -A - -A -DW
. i)O cO ao ao aw
(7-7-7)
;,.!
1.:
:j
Finally, after inserting Eq. (7) into Eq. (5) we find
()O* = - { I
iJ g T [ iJ g a g T ] -I D g} a2(1)
A- ' - -A- ' - -A- I -
DO ao £10 ao ao aw
()W(7-7-8)
;./t
Comparing Eq. (8) with Eq. 17-5-4) we find two changes
I. H* is replaced by A. From Eq. (4) we see that if aJl the constraints
gj are linear. then actually A = H*. This occurs, e.g., when all the constraints
are upper and lower bounds.
2. The expression A - I (i,2(/J/('O cW) ()W is prem ultiplied by the projection
matrix
't-
,:'
I ag T r ag _ I ag f ] - I ag
p= I - A A
ao ao ao ao
(7-7-9)
I ;'
;:,
,', !
which has the property that if x is any I-dimensional vector, then y = P:x
satisfies (8gfDO)y = O. i.e., y lies in the tangent plane to the constraints.
The matrix P thus projects any vector into this tangent plane.
The expression analogous to Eq, (7-5-7) is nOw
v 0 = PA -1(iJ2(f)/?0 i 1 w)V w (()2(f)/(;0 aW)T A -lp T
(7-7-10)
If all g, are linear, i.e.. A = H*, then under the appropriate conditions " .
Eq. (7-5-16) rcmains valid with the following modification q
VU;:::,PH*-lp T
(7-7-11)
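Let us compute $P$ and $V_\theta$ for the simple though common case in which
all active constraints are derived from upper or lower bounds on the parameters.
For simplicity of representation, we assume that the parameters
$\theta_1, \theta_2, \ldots, \theta_{l_1}$ are actively constrained to equal $a_1, a_2, \ldots, a_{l_1}$, respectively.
The remaining parameters $\theta_{l_1+1}, \theta_{l_1+2}, \ldots, \theta_l$ are not actively constrained.
The active constraints can be written as
$$[I \;\; 0]\,\theta = a \tag{7-7-12}$$
with $I$ being the $l_1 \times l_1$ identity, and $0$ the $l_1 \times (l - l_1)$ null matrix. We have,
then
$$\partial g/\partial\theta = [I \;\; 0]$$
Since our constraints are linear, we have $A = H^*$. Let $H^{*-1}$ be partitioned
as follows
$$A^{-1} = H^{*-1} = \begin{bmatrix} B & C^T \\ C & D \end{bmatrix} \tag{7-7-13}$$
where $B$ is $l_1 \times l_1$, $C$ is $(l - l_1) \times l_1$, and $D$ is $(l - l_1) \times (l - l_1)$. Therefore:
$$\left[\frac{\partial g}{\partial\theta}A^{-1}\frac{\partial g^T}{\partial\theta}\right]^{-1} = \left([I \;\; 0]\begin{bmatrix} B & C^T \\ C & D \end{bmatrix}\begin{bmatrix} I \\ 0 \end{bmatrix}\right)^{-1} = B^{-1} \tag{7-7-14}$$
$$P = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} - \begin{bmatrix} B & C^T \\ C & D \end{bmatrix}\begin{bmatrix} I \\ 0 \end{bmatrix}B^{-1}[I \;\; 0] = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} - \begin{bmatrix} I & 0 \\ CB^{-1} & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ -CB^{-1} & I \end{bmatrix} \tag{7-7-15}$$
and finally
$$V_\theta = \begin{bmatrix} 0 & 0 \\ -CB^{-1} & I \end{bmatrix}\begin{bmatrix} B & C^T \\ C & D \end{bmatrix}\begin{bmatrix} 0 & -B^{-1}C^T \\ 0 & I \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & D - CB^{-1}C^T \end{bmatrix} \tag{7-7-16}$$
As expected, the variances and covariances of $\theta_1, \theta_2, \ldots, \theta_{l_1}$ are zero,
since these parameters are constrained to equal fixed values. The covariance
matrix of the unconstrained $\theta$'s is reduced from $D$ to $D - CB^{-1}C^T$. The
matrix $D - CB^{-1}C^T$ is simply the inverse of the lower right-hand partition
of $A$, so that the covariance of the unconstrained parameters is obtained by
inverting the corresponding part of $A$.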
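
The bound-constrained result can be verified numerically in two dimensions. In the sketch below, the matrix of the example of Section 7-2 is used as a hypothetical $H^*$, and $\theta_1$ is taken to be held at an active bound, so that $\partial g/\partial\theta = [1 \;\; 0]$. The helper functions are introduced here for illustration. The code forms $P$ from Eq. (9), evaluates $PH^{*-1}P^T$ as in Eq. (11), and compares the surviving element with the reciprocal of the lower right-hand element of $H^*$.

```python
def matmul(p, q):
    """Product of two matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]


def transpose(m):
    """Transpose of a matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


H = [[0.505, -0.495], [-0.495, 0.505]]   # hypothetical H* (here A = H*)
G = [[1.0, 0.0]]                          # dg/dtheta for the active bound on theta_1

H_inv = inv2(H)                           # partitioned as [[B, C^T], [C, D]]
HG = matmul(H_inv, transpose(G))          # H^{-1} G^T, a 2x1 matrix
s = matmul(G, HG)[0][0]                   # G H^{-1} G^T, the scalar B

# P = I - H^{-1} G^T [G H^{-1} G^T]^{-1} G   (Eq. 7-7-9)
P = [[(1.0 if i == j else 0.0) - HG[i][0] * G[0][j] / s for j in range(2)]
     for i in range(2)]

V_constrained = matmul(matmul(P, H_inv), transpose(P))   # Eq. (7-7-11)

print("unconstrained variance of theta_2:", H_inv[1][1])
print("constrained covariance matrix    :", V_constrained)
print("1 / H[1][1]                      :", 1.0 / H[1][1])
```

Fixing $\theta_1$ reduces the variance of $\theta_2$ from about 50.5 to about 1.98 in this example, which reflects the strong correlation between the two estimates.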
7-8. Principal Components
Given the matrix $V_\theta$ or an approximation to it, we can determine which
parameters, or linear combinations of parameters, are well determined
(small variance), and which are poorly determined (large variance). The
variance of the estimate for the $\alpha$th parameter is given by $V_{\theta\alpha\alpha}$, and its standard
deviation by $\sigma_\alpha \equiv V_{\theta\alpha\alpha}^{1/2}$. As in Section 7-3, the full picture is obtained by
finding the eigenvalue decomposition of $V_\theta$, say
$$V_\theta = U\Pi U^T, \qquad \Pi = U^T V_\theta U \tag{7-8-1}$$
where $\Pi$ is the diagonal matrix of the eigenvalues $\pi_i$ of $V_\theta$.
Suppose we define a vector of new variables $\rho = U^T\theta$, $\delta\rho = U^T\,\delta\theta$. The
covariance matrix of the $\rho$'s is given by
$$V_\rho \equiv E(\delta\rho\,\delta\rho^T) = E(U^T\,\delta\theta\,\delta\theta^T U) = U^T V_\theta U = \Pi \tag{7-8-2}$$
We have then a set of new variables $\rho_1, \rho_2, \ldots, \rho_l$ which replace the original
parameters $\theta_1, \theta_2, \ldots, \theta_l$. Our estimated value of $\rho_i$ is given by $\rho_i^* =
\sum_{\alpha=1}^{l} U_{\alpha i}\theta_\alpha^*$. Since $V_\rho = \Pi$ is a diagonal matrix, the sampling variations
in $\rho_i^*$ and $\rho_j^*$ are uncorrelated when $i \ne j$. The standard deviation of the
estimate $\rho_i^*$ is $\pi_i^{1/2}$. The $\rho_i$ are called the principal components.
The advantage of dealing with the uncorrelated principal components
rather than with the correlated original parameters is particularly great when
(as in the normal distribution) lack of correlation implies statistical independence.
For then we can establish confidence intervals and statistical tests
for each component individually.
When Eq. (7-5-17) holds, we have $V_\theta \approx H^{*-1} = A^{-1}$. The eigenvectors
of $V_\theta$ and $A$ coincide, hence the $\delta\rho_i$ coincide with the canonical variables.
The eigenvalues of $V_\theta$ are the reciprocals of those of $A$, i.e., $\pi_i = 1/\lambda_i$. Hence
the $\varepsilon$-indifference interval of $\delta\rho_i = \psi_i$ is given by
$$\pm(2\varepsilon/\lambda_i)^{1/2} = \pm(2\varepsilon\pi_i)^{1/2}$$
The length of the interval is proportional to the standard deviation. For
single-equation least squares, Eq. (7-5-15) implies that $\pi_i = 2\sigma^2/\lambda_i$.
Like the canonical variables, the principal components depend on the
scaling of the variables. A natural scaling is one which adjusts each variable
so as to have unit standard deviation. This can be achieved by defining
$v_\alpha = \theta_\alpha V_{\theta\alpha\alpha}^{-1/2}$, so that
$$E(\delta v_\alpha\,\delta v_\beta) = V_{\theta\alpha\beta}\,V_{\theta\alpha\alpha}^{-1/2}\,V_{\theta\beta\beta}^{-1/2}$$
or
$$V_v = DV_\theta D \tag{7-8-3}$$
where $D_{\alpha\beta} = \delta_{\alpha\beta}V_{\theta\alpha\alpha}^{-1/2}$. The matrix $V_v$, whose main diagonal consists entirely
of ones, is the correlation matrix of $\theta^*$. If $P$ is the matrix of eigenvectors of $V_v$,
then the elements of the vector $P^T v = P^T D\theta$ are the principal components of
$V_v$. They are uncorrelated, and their variances are the eigenvalues of $V_v$.
We call them the scaled principal components of $V_\theta$.
In practice one is interested mostly in determining the principal components
having unusually large variances, since these hold clues to inadequacies
in model or data. This point is discussed further in Section 7-18. If $V_\theta$ is
known, but not its eigenvalues and vectors, then we can most easily determine
its largest eigenvalue and corresponding eigenvector by means of the
power method.
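
A minimal sketch of the power method follows. It is one common formulation (repeated multiplication and normalization, with the Rayleigh quotient as the eigenvalue estimate), offered as an illustration rather than as the procedure the text has in mind; the starting vector, iteration limit, and tolerance are arbitrary choices. The test matrix is the inverse of the matrix $A$ of Section 7-2, used here as a hypothetical $V_\theta$.

```python
import math


def power_method(matrix, iterations=100, tol=1e-12):
    """Largest eigenvalue and unit eigenvector of a symmetric positive definite matrix.

    Assumes the starting vector (1, 0, ..., 0) is not orthogonal to the
    leading eigenvector; otherwise a different start must be supplied.
    """
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        # Rayleigh quotient w^T M w for the normalized vector w.
        estimate = sum(w[i] * sum(matrix[i][j] * w[j] for j in range(n))
                       for i in range(n))
        converged = abs(estimate - eigenvalue) <= tol * abs(estimate)
        v, eigenvalue = w, estimate
        if converged:
            break
    return eigenvalue, v


V_theta = [[50.5, 49.5], [49.5, 50.5]]   # hypothetical covariance matrix
largest, direction = power_method(V_theta)

print("largest variance        :", largest)
print("its standard deviation  :", math.sqrt(largest))
print("principal component axis:", direction)
```

For this matrix the dominant principal component lies along $\theta_1 + \theta_2$, the combination found to be ill-determined in Section 7-3. Convergence is fast here because the two eigenvalues differ by a factor of 100; it slows as the leading eigenvalues approach one another.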
7-9. Confidence Intervals
Knowledge of the covariance matrix of an estimator gives an intuitive
feeling of the degree to which the various parameters are well determined.
We may wish, however, to make more explicit statements such as "the true
value of $\theta$ lies between the numbers $a$ and $b$ with 90% probability." From
the point of view of classical statistics this statement is meaningless; $\theta$ is a
constant, albeit unknown. The probability of its lying between $a$ and $b$ is
unity if $a < \theta < b$, and zero otherwise; it can never be 90%. This difficulty
is overcome by Neyman's (Neyman, 1937) theory of confidence intervals.
Suppose we had complete knowledge of the sampling distribution of
the estimate $\theta^*$ (we discuss first the case of a single parameter $\theta$). This distribution
depends on the true value $\theta$, and we denote its density function by
$p(\theta^* \mid \theta)$. For any given value of $\theta$ we can easily determine two numbers
$a(\theta)$ and $b(\theta)$ such that, for a given $0 < \gamma < 1$ we have
$$\Pr[a(\theta) \le \theta^* \le b(\theta)] = \gamma \tag{7-9-1}$$
This is equivalent to demanding that for each $\theta$
$$\int_{a(\theta)}^{b(\theta)} p(\theta^* \mid \theta)\,d\theta^* = \gamma \tag{7-9-2}$$
It is clear that the choice of $a(\theta)$ is quite arbitrary; given any $\varepsilon < 1 - \gamma$, we
can choose $a(\theta)$ so that $\Pr[\theta^* < a(\theta)] = \varepsilon$; $b(\theta)$ is then determined by the
requirement that $\Pr[\theta^* > b(\theta)] = 1 - \gamma - \varepsilon$. Suppose we are able to choose
$a(\theta)$ so that both it and $b(\theta)$ are monotonically increasing functions of $\theta$.
Then there exist inverse functions $\alpha(\theta^*)$ and $\beta(\theta^*)$ such that
$$\theta = \alpha[a(\theta)] = \beta[b(\theta)]$$
Then the statement $\theta^* \ge a(\theta)$ is equivalent to $\theta \le \alpha(\theta^*)$ and the statement
$\theta^* \le b(\theta)$ is equivalent to $\theta \ge \beta(\theta^*)$. It follows that
$$\Pr[\beta(\theta^*) \le \theta \le \alpha(\theta^*)] = \Pr[a(\theta) \le \theta^* \le b(\theta)] = \gamma \tag{7-9-3}$$
Note that $\theta^*$ as a function of the sample is a random variable, and therefore
so are $\alpha(\theta^*)$ and $\beta(\theta^*)$. Eq. (3) asserts that the probability that the value
of the random variable $\beta(\theta^*)$ does not exceed the true value $\theta$, and that the
random variable $\alpha(\theta^*)$ is no less than $\theta$, is equal to $\gamma$. This is a perfectly meaningful
statement within the confines of classical statistics. The interval $[\beta(\theta^*),
\alpha(\theta^*)]$ is called a $\gamma$-confidence interval for $\theta$.
The simple example of Section 3-1 will help clarify the situation. From
$n$ observations $w_\mu$ on a normally distributed random variable $w$ with mean $\theta$
and known variance $\sigma^2$ we obtain the maximum likelihood estimate Eq.
(3-1-3)
$$\theta^* = (1/n)\sum_{\mu=1}^{n} w_\mu$$
It is well known that the sampling distribution of $\theta^*$ is normal with mean $\theta$
and variance $\sigma^2/n$. From tables of the normal distribution [see, e.g., Cramér
(1946, Table 2)], we note that
$$\Pr(|\theta^* - \theta| \le 1.6449\sigma/\sqrt{n}) = 0.9 \tag{7-9-4}$$
so that
$$\Pr(\theta - 1.6449\sigma/\sqrt{n} \le \theta^* \le \theta + 1.6449\sigma/\sqrt{n}) = 0.9 \tag{7-9-5}$$
that is, $a(\theta) = \theta - 1.6449\sigma/\sqrt{n}$ and $b(\theta) = \theta + 1.6449\sigma/\sqrt{n}$. Both $a(\theta)$ and $b(\theta)$
are monotonic in $\theta$ with inverses $\alpha(\theta^*) = \theta^* + 1.6449\sigma/\sqrt{n}$ and $\beta(\theta^*) =
\theta^* - 1.6449\sigma/\sqrt{n}$. Clearly, Eq. (5) is equivalent to
$$\Pr(\theta^* - 1.6449\sigma/\sqrt{n} \le \theta \le \theta^* + 1.6449\sigma/\sqrt{n}) = 0.9 \tag{7-9-6}$$
so that $(\theta^* - 1.6449\sigma/\sqrt{n},\; \theta^* + 1.6449\sigma/\sqrt{n})$ is a 90% confidence interval
for $\theta$. This statement should be interpreted as follows: if the series of experiments
were repeated one hundred times, each replication would yield a
separate estimate $\theta^*$. For each such estimate we could form the interval given
by Eq. (6). Then the true value $\theta$ should be contained in about ninety of these
intervals.
If $\sigma$ is unknown, the interval given by Eq. (6) is undefined. However,
it is well known that in this case the quantity $\sqrt{n}(\theta^* - \theta)/s$, where
$$s \equiv \left\{[1/(n - 1)]\sum_{\mu=1}^{n} (w_\mu - \theta^*)^2\right\}^{1/2} \tag{7-9-7}$$
follows the $t$-distribution with $n - 1$ degrees of freedom. Hence, if $n = 10$,
we have from tables [e.g., Cramér (1946, Table 3)]
$$\Pr[|10^{1/2}(\theta^* - \theta)/s| \le 1.833] = 0.9 \tag{7-9-8}$$
so that
$$\Pr(\theta^* - 1.833s/10^{1/2} \le \theta \le \theta^* + 1.833s/10^{1/2}) = 0.9 \tag{7-9-9}$$
This is a proper confidence interval, since $s$ is computable from the sample.
Since $s$ is a consistent estimate for $\sigma$, Eq. (6) can be used with $s$ replacing $\sigma$
when $n$ is large. In fact, the $t$-distribution approaches the normal as $n$ increases.
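
The two intervals of Eqs. (6) and (9) can be computed as follows. This is a sketch under stated assumptions: the ten observations and the "known" $\sigma$ are made-up numbers, the normal quantile is obtained from the Python standard library, and the $t$ quantile 1.833 is simply the tabulated value quoted above for $n - 1 = 9$ degrees of freedom.

```python
import math
from statistics import NormalDist, mean, stdev


def normal_interval(sample, sigma, gamma=0.9):
    """Eq. (7-9-6): interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + gamma / 2.0)   # 1.6449 for gamma = 0.9
    half = z * sigma / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - half, centre + half


def t_interval(sample, t_quantile):
    """Eq. (7-9-9): interval for the mean when sigma is estimated by s."""
    s = stdev(sample)                              # divides by n - 1, as in Eq. (7-9-7)
    half = t_quantile * s / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - half, centre + half


# Ten hypothetical observations (illustrative values only).
sample = [9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2, 10.0, 9.6]

known = normal_interval(sample, sigma=0.25)
estimated = t_interval(sample, t_quantile=1.833)   # tabulated value for 9 d.f.

print("sigma known    :", known)
print("sigma estimated:", estimated)
```

With only ten observations the $t$ interval is noticeably wider than a normal interval computed with $s$ in place of $\sigma$ would be, which is the price paid for estimating $\sigma$ from the same sample.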
If the exact form of the sampling distribution is unknown except for its
variance $V$, we may derive conservative confidence intervals for the mean $\bar{\theta}$ of
the sampling distribution. According to the Bienaymé-Chebyshev inequality
(see Section 7-10 for the derivation of a more general result)
$$\Pr(|\theta^* - \bar{\theta}| \ge kV^{1/2}) \le k^{-2} \tag{7-9-10}$$
which is equivalent to (set $\gamma = 1 - k^{-2}$)
$$\Pr\{\bar{\theta} - [V/(1 - \gamma)]^{1/2} \le \theta^* \le \bar{\theta} + [V/(1 - \gamma)]^{1/2}\} \ge \gamma \tag{7-9-11}$$
Hence
$$\Pr\{\theta^* - [V/(1 - \gamma)]^{1/2} \le \bar{\theta} \le \theta^* + [V/(1 - \gamma)]^{1/2}\} \ge \gamma \tag{7-9-12}$$
We know then that a $\gamma$ confidence interval for $\bar{\theta}$ is contained in the interval
$$\{\theta^* - [V/(1 - \gamma)]^{1/2},\; \theta^* + [V/(1 - \gamma)]^{1/2}\}$$
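
To see how conservative the interval of Eq. (12) is, its half-width can be compared with the half-width that would apply if the sampling distribution were known to be normal. The sketch below uses unit variance and $\gamma = 0.9$ as illustrative values; the function names are introduced here.

```python
import math
from statistics import NormalDist


def chebyshev_half_width(variance, gamma):
    """Half-width [V/(1 - gamma)]^(1/2) of the interval in Eq. (7-9-12)."""
    return math.sqrt(variance / (1.0 - gamma))


def normal_half_width(variance, gamma):
    """Half-width of the symmetric gamma interval for a normal distribution."""
    return NormalDist().inv_cdf(0.5 + gamma / 2.0) * math.sqrt(variance)


cheb = chebyshev_half_width(1.0, 0.9)
norm = normal_half_width(1.0, 0.9)

print(f"Chebyshev: +/- {cheb:.4f} standard deviations")
print(f"normal   : +/- {norm:.4f} standard deviations")
```

At the 90% level the distribution-free interval is nearly twice as wide as the normal one, and the ratio grows as $\gamma$ approaches 1.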
If $\theta^*$ is an unbiased estimate, then $\theta$ can be substituted for $\bar{\theta}$ in Eq. (12).
Usually the bias is unknown, but when $n$ is fairly large, maximum likelihood
estimates can be treated as having a normal sampling distribution with
variance given by Eq. (7-5-17), and bias proportional to $1/n$. In this case one
may employ the method of Quenouille (1956) to reduce the bias. Let $\theta^*$ be,
as usual, the estimate of $\theta$ based on $n$ experiments. Let $\theta_\mu^*$ ($\mu = 1, 2, \ldots, n$)
be the estimate based on the $n - 1$ data points obtained by dropping the
$\mu$th. Let $\tilde{\theta} \equiv (\sum_{\mu=1}^{n} \theta_\mu^*)/n$ be the mean of the $\theta_\mu^*$. The bias in $\theta^*$ is proportional
to $1/n$, whereas the bias in each $\theta_\mu^*$ is proportional to $1/(n - 1)$,
and so is the bias in $\tilde{\theta}$. Hence the following relations hold approximately
$$\theta^* \approx \theta + \alpha/n, \qquad \tilde{\theta} \approx \theta + \alpha/(n - 1) \tag{7-9-13}$$
where $\alpha$ is an unknown constant. Multiplying the first equation by $n$ and the
second by $n - 1$, we obtain after subtraction
$$\theta \approx n\theta^* - (n - 1)\tilde{\theta} \tag{7-9-14}$$
The bias of this estimate is of the order of $1/n^2$.
We have noted above that the choice of an interval for a given confidence
level $\gamma$ is somewhat arbitrary. Commonly employed criteria for choosing
among all possible functions $a(\theta)$ and $b(\theta)$ that satisfy Eq. (1) are the following:
1. Minimum length. Choose $a(\theta)$ and $b(\theta)$ so that $b(\theta) - a(\theta)$ is minimum.
2. Symmetry around the true value
$$b(\theta) - \theta = \theta - a(\theta)$$
3. Symmetry in probability
$$\Pr[\theta^* < a(\theta)] = \Pr[\theta^* > b(\theta)] = (1 - \gamma)/2$$
4. Equal probability at ends. The density function of the sampling distribution
assumes equal values at $\theta^* = a(\theta)$ and $\theta^* = b(\theta)$.
When $p(\theta^* \mid \theta)$ is unimodal and symmetric, criteria 1, 2, 3, and 4 coincide.
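
Returning to the bias reduction of Quenouille described earlier in this section, the combination $n\theta^* - (n - 1)\tilde{\theta}$ is easily sketched in code. The example estimator is an assumption chosen because its bias is known exactly: the maximum likelihood variance estimate $(1/n)\sum(w_\mu - \bar{w})^2$, whose bias is exactly proportional to $1/n$. For this estimator the corrected value coincides with the usual unbiased sample variance. The data are made-up numbers.

```python
from statistics import variance


def quenouille(estimator, data):
    """Bias-reduced estimate n*theta_star - (n - 1)*theta_tilde."""
    n = len(data)
    full = estimator(data)
    # Estimates obtained by dropping one observation at a time.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    tilde = sum(leave_one_out) / n
    return n * full - (n - 1) * tilde


def ml_variance(values):
    """Maximum likelihood variance estimate (divides by n, hence biased)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]   # illustrative observations

biased = ml_variance(data)
corrected = quenouille(ml_variance, data)

print("biased estimate   :", biased)
print("corrected estimate:", corrected)
print("unbiased variance :", variance(data))
```

In general the correction removes only the leading $1/n$ term of the bias, and it requires $n$ additional estimations, which may be costly when each estimate is the result of a nonlinear optimization.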
7-10. Confidence Regions
The idea of a confidence interval may be generalized to the case of several
unknown parameters. Suppose that for any data sample $W$ we are able to
define a bounded closed subset $S(W)$ of the $l$-dimensional $\theta$ space in such a
way that $S(W)$ contains $\theta$ in a fraction $\gamma$ of all possible data samples, i.e.,
$$\Pr[\theta \in S(W)] = \gamma \tag{7-10-1}$$
Then $S(W)$ is called a $\gamma$-joint confidence region for the parameters $\theta$. In choosing
a confidence region we have even more freedom than we had in the case of
confidence intervals, since we may exercise control not only over the location,
but also over the shape of the region. The most commonly used regions are
$l$-dimensional rectangles or ellipsoids whose centers are at the estimate $\theta^*$.
In Section 7-8 we have introduced the principal components $\rho = U^T\theta$,
which are uncorrelated linear combinations of the parameters. If (as would
be implied if the sampling distribution were normal) the principal components
are statistically independent, then confidence intervals for the individual
components can be combined into a rectangular joint confidence region.
Let $(\alpha_i, \beta_i)$ be a $\gamma$-confidence interval for $\rho_i$. Then
$$\Pr(\alpha_i \le \rho_i \le \beta_i,\; i = 1, 2, \ldots, l) = \gamma^l \tag{7-10-2}$$
so that we have a $\gamma^l$-joint confidence region for $\rho$. Unfortunately, it is difficult
to transform Eq. (2) into a simple statement in terms of the $\theta_\alpha$.
It seems reasonable to choose confidence regions which coincide with the
indifference regions of the objective function that was used to obtain the
estimate $\theta^*$. We have seen in Section 7-2 that these take the approximate form
$$|(\theta - \theta^*)^T H^*(\theta - \theta^*)| \le c \tag{7-10-3}$$
which in view of Eq. (7-5-16) may in most cases be approximated by the
ellipsoid
$$(\theta - \theta^*)^T V_\theta^{-1}(\theta - \theta^*) \le c \tag{7-10-4}$$
For a given confidence level $\gamma$, we need only determine $c$ in such a way that
$$\Pr[(\theta - \theta^*)^T V_\theta^{-1}(\theta - \theta^*) \le c] = \gamma \tag{7-10-5}$$
Such a value of $c$ always exists, regardless of the accuracy of the assumptions
made in estimating $V_\theta^{-1}$. The actual determination of the value of $c$ does,
however, depend on these assumptions, and on the form of the sampling distribution
in general.
When the sampling distribution IS normal, unbiased, and with known covar,;
iance matrix Yo, then as shown in Section 2-8 the quantity (0* - O)V o I (e* -O)!,
is distributed as Xl with I degrees of freedom. Therefore, the constant c may be,j
determined as the upper y point of that distribution. For instance, let 1=1.0';
and y = 0.9. We find then (Cramer. 1946. Table 3\ that .
Pr[(O - O*)TYOI(O - 0*):::; 15.987] = 0.9
(7- 1O-fj):;
When the sa mpling chstributlon cannot be assumed normal. we may generaliz"j;;
the Bienaymc-Chebyshev inequality Eq. (7-9-12) to obtain (provided e = e)
Pr[(O -O*)TV;;I(O - 0*):::; 1/(1 - y)]:;?: y
f!:.r,\.
:t
7-11. Linearization
189
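We derive this relation as follows: Let
$$Q(\theta^*) \equiv (\theta - \theta^*)^T V_\theta^{-1} (\theta - \theta^*)$$
and let $G_a$ be the region in which $Q(\theta^*) \le a$. Let
$$P_a(\theta^*) \equiv 1 - (1/a)\,Q(\theta^*)$$
Then $P_a(\theta^*) \le 0$ for $\theta^*$ outside $G_a$, and $P_a(\theta^*) \le 1$ for $\theta^*$ inside $G_a$. Hence
$$\int_{G_\infty} P_a(\theta^*)\,p(\theta^*)\,d\theta^* \le \int_{G_a} P_a(\theta^*)\,p(\theta^*)\,d\theta^* \le \int_{G_a} p(\theta^*)\,d\theta^* = \Pr(\theta^* \in G_a)$$
where $G_\infty$ is the entire space and $p(\theta^*)$ is the pdf of the sampling distribution. Now
$$\int_{G_\infty} P_a(\theta^*)\,p(\theta^*)\,d\theta^* = \int_{G_\infty} [p(\theta^*) - (1/a)\,Q(\theta^*)\,p(\theta^*)]\,d\theta^*$$
$$= 1 - (1/a)\,E\,Q(\theta^*)$$
$$= 1 - (1/a)\,E\,\mathrm{Tr}\,V_\theta^{-1}(\theta - \theta^*)(\theta - \theta^*)^T$$
$$= 1 - (1/a)\,\mathrm{Tr}\,V_\theta^{-1} V_\theta = 1 - l/a$$
Hence
$$\Pr(\theta^* \in G_a) \ge 1 - l/a$$
Taking $\gamma = 1 - l/a$ one obtains Eq. (7).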
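The two ways of choosing $c$ can be compared numerically. The following is only a sketch under stated assumptions: the function names are hypothetical, the covariance is taken as known, and the normal case is checked by simulation rather than from tables, using the value 15.987 quoted above for $l = 10$, $\gamma = 0.9$.

```python
import random

def chebyshev_c(l, gamma):
    """Distribution-free constant of Eq. (7-10-7): c = l / (1 - gamma)."""
    return l / (1.0 - gamma)

def normal_coverage(c, l, trials=20000, seed=0):
    """Monte Carlo estimate of Pr[Q <= c] when Q is a sum of l squared
    standard normal deviates (normal sampling distribution, known covariance)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        q = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(l))
        if q <= c:
            hits += 1
    return hits / trials

print(chebyshev_c(10, 0.9))          # constant that needs no normality assumption
print(normal_coverage(15.987, 10))   # coverage of the chi-square constant under normality
```

The distribution-free constant is much larger than the normal-theory one, which is the price paid for dropping the normality assumption.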
The method of Quenouille for reducing bias (see Section 7-9) is as applicable in the vector case as in the scalar case.
The volume of the region in $\theta$ space defined by inequality Eq. (4) is
$$\mathcal{V}(c) = (c\pi)^{l/2} \det{}^{1/2} V_\theta \,/\, \Gamma((l/2) + 1) \tag{7-10-8}$$
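Eq. (7-10-8) is easy to evaluate directly. The helper below is a minimal sketch (its name and argument list are assumptions, and the determinant of $V_\theta$ is supplied by the caller); it can be checked against familiar special cases such as an interval, a disc, and a sphere.

```python
import math

def ellipsoid_volume(c, l, det_v):
    """Volume of {theta : (theta - theta*)^T V^{-1} (theta - theta*) <= c}
    in l dimensions, following Eq. (7-10-8)."""
    return (c * math.pi) ** (l / 2.0) * math.sqrt(det_v) / math.gamma(l / 2.0 + 1.0)

# One parameter with variance 9 and c = 4: the region is |theta - theta*| <= 6.
print(ellipsoid_volume(4.0, 1, 9.0))
```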
7-11. Linearization
Somewhat more incisive results than those of the last section are often obtainable when the model equations can be approximated by linear ones in the vicinity of the estimate $\theta^*$. The single equation model $f_\mu = f(x_\mu, \theta)$ may be approximated by
$$f_\mu = f(x_\mu, \theta) \approx f(x_\mu, \theta^*) + (\partial f_\mu/\partial\theta)_{\theta=\theta^*}(\theta - \theta^*) \tag{7-11-1}$$
This model resembles the multiple linear regression situation (Section 4-4), with $B$ designating the $n \times l$ matrix whose $\mu$th row is $\partial f_\mu/\partial\theta^T$, $F$ designating the $n$-vector whose $\mu$th element is $y_\mu - f(x_\mu, \theta^*)$, and $\theta - \theta^*$ replacing $\theta$ in Eq. (4-4-2). According to Eq. (4-4-9), the vector $\theta - \theta^*$ has covariance $(B^T V^{-1} B)^{-1}$, where $V$ is the covariance matrix of the observations $y_\mu$ [this corresponds to Eq. (7-5-16)]. If the errors in the observations are normally distributed, so are the estimates $\theta^*$; and therefore, as stated above, the quantity
$$\mathcal{S} = (\theta - \theta^*)^T B^T V^{-1} B (\theta - \theta^*)$$
is distributed as $\chi^2$ with $l$ degrees of freedom. Hence, we can determine confidence regions for $\theta^*$ provided $V$ is known.
It is well known from the theory of multiple linear regression that the residual weighted sum of squares
$$\mathcal{S}' = [Y - F(X, \theta^*)]^T V^{-1} [Y - F(X, \theta^*)] \tag{7-11-2}$$
is distributed independently of $\mathcal{S}$ as $\chi^2$ with $n - l$ degrees of freedom. Hence $(n - l)\mathcal{S}/l\mathcal{S}'$ has the $F_{l,n-l}$ distribution. Suppose it is known that $V = \tau Q$, where $Q$ is a known matrix, and $\tau$ an unknown constant. Then
$$\frac{(n - l)\mathcal{S}}{l\mathcal{S}'} = \frac{(n - l)(\theta - \theta^*)^T B^T Q^{-1} B (\theta - \theta^*)}{l\,[Y - F(X, \theta^*)]^T Q^{-1} [Y - F(X, \theta^*)]} \tag{7-11-3}$$
Thus $(n - l)\mathcal{S}/l\mathcal{S}'$ may be computed without knowledge of $\tau$, and tables of the $F$ distribution [e.g., Scheffé (1959)] may be used to obtain confidence regions for $\theta$.
An important special case occurs when all observations are independent, so that $V = \sigma^2 I$ and
$$\frac{(n - l)\mathcal{S}}{l\mathcal{S}'} = \frac{(n - l)(\theta - \theta^*)^T B^T B (\theta - \theta^*)}{l\,[Y - F(X, \theta^*)]^T [Y - F(X, \theta^*)]} \tag{7-11-4}$$
which corresponds to the case of unweighted least squares. For single equation least squares (illustrated in Section 7-21), this simplifies to
$$(n - l)\mathcal{S}/l\mathcal{S}' = (\theta - \theta^*)^T N^* (\theta - \theta^*)/2l\tilde{\sigma}^2 \tag{7-11-5}$$
where
$$\tilde{\sigma}^2 = \sum_{\mu=1}^{n} e_\mu^{*2} / (n - l)$$
is the estimated variance of the residuals.
The foregoing discussion was based on the assumption that the model equations were nearly linear in the parameters around the estimate $\theta^*$, at least for variations in $\theta$ of the order of magnitude of the standard deviation of the estimates. It may happen that a change of variables will improve the validity of the linearity assumption. A systematic procedure for effecting such a change of variables is described by Hartley (1964). Procedures for determining the validity of the linearity assumption are described by Beale (1960) and illustrated by Guttman and Meeter (1965). A very simple method for testing the linearity assumption is the following:
Determine a confidence region based on that assumption. Calculate the actual values of the objective function at selected points on the boundary of the region, e.g., at the endpoints of the principal axes. If our assumption is valid, these values should differ only slightly from the approximation value
$$\Phi^* + \tfrac{1}{2}(\theta - \theta^*)^T B^T V^{-1} B (\theta - \theta^*)$$
In fact, it may be worthwhile to determine the extent of the region in which the linearity assumption is valid by computing the values of the objective function on the boundaries of a series of successively larger confidence regions, until a serious mismatch between the true and approximate values occurs. Knowledge of this region may be useful in certain applications, such as sequential estimation (see Section 9-3).
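The check just described can be sketched for a one-parameter model. Everything specific here is an assumption made for illustration: the model $f = e^{-\theta x}$, the noise-free synthetic data (so that the estimate equals the generating value), the error standard deviation, and the function names.

```python
import math

XS = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
THETA_STAR = 0.7      # assumed estimate (data are noise-free, so this is the exact minimum)
SIGMA = 0.01          # assumed standard deviation of the observation errors
YS = [math.exp(-THETA_STAR * x) for x in XS]

def objective(theta):
    """Phi(theta) = (1/2) * sum of squared residuals / sigma^2."""
    return 0.5 * sum((y - math.exp(-theta * x)) ** 2 for x, y in zip(XS, YS)) / SIGMA ** 2

# B^T V^{-1} B for the single parameter, with B_mu = df/dtheta at theta*.
BTVB = sum((x * math.exp(-THETA_STAR * x)) ** 2 for x in XS) / SIGMA ** 2
SD_THETA = 1.0 / math.sqrt(BTVB)   # linearized standard deviation of theta*

def relative_mismatch(k):
    """Relative gap between the true objective and the quadratic
    approximation at a point k standard deviations from theta*."""
    d = k * SD_THETA
    phi_star = objective(THETA_STAR)
    approx = phi_star + 0.5 * d * d * BTVB
    return abs(objective(THETA_STAR + d) - approx) / (approx - phi_star)

for k in (1, 2, 5, 10):
    print(k, round(relative_mismatch(k), 4))
```

The mismatch is small on the boundary of a one-standard-deviation region and grows as the region is enlarged, which is the pattern the procedure is meant to expose.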
7-12. The Posterior Distribution
If we accept the posterior distribution as the probability distribution of our parameters, we are able to define confidence regions in a straightforward way. Let $p^*(\theta)$ be the posterior density. Then for any measurable region $\mathcal{R}$ in parameter space, the quantity $\int_{\mathcal{R}} p^*(\theta)\,d\theta$ is the probability that the true value of $\theta$ lies in $\mathcal{R}$. Hence, any region $\mathcal{R}$ such that $\int_{\mathcal{R}} p^*(\theta)\,d\theta = \gamma$ is a $\gamma$-confidence region (in the Bayesian sense) for $\theta$. If $p^*(\theta)$ is approximately normal with covariance matrix $V_\theta$, then the methods of Section 7-10 can be used to construct such confidence regions. In most other cases, the task is a difficult one.
Suppose our estimate $\theta^*$ has been arrived at by maximizing the logarithm of the posterior distribution. Around $\theta^*$ we then have, approximately,
$$\log p^*(\theta) = \log p^*(\theta^*) - \tfrac{1}{2}(\theta - \theta^*)^T H^* (\theta - \theta^*) \tag{7-12-1}$$
where $H^*$ is the Hessian of $-\log p^*(\theta^*)$. It follows that
$$p^*(\theta) \approx c \exp[-\tfrac{1}{2}(\theta - \theta^*)^T H^* (\theta - \theta^*)] \tag{7-12-2}$$
which indicates that around $\theta^*$ the posterior distribution looks like a normal distribution with mean $\theta^*$ and covariance $V^* = (H^*)^{-1}$, i.e., the negative inverse Hessian of $\log p^*$. If we had chosen a constant prior density, then the posterior density is proportional to the likelihood function. If then the model equations are linear in $\theta$, Eq. (2) holds exactly. In this case the posterior distribution of $\theta$ and the sampling distribution of the maximum likelihood estimate coincide, and inferences based on them are identical.
If the model equations are linear and the prior distribution is normal with covariance matrix $V_0$, then Eq. (2) still holds exactly, but the covariance of the posterior distribution no longer coincides with that of the sampling distribution. Apart from constant terms, the posterior distribution has the form
$$\log p^*(\theta) = -\tfrac{1}{2}\sum_{\mu=1}^{n} (y_\mu - B_\mu\theta)^T V^{-1} (y_\mu - B_\mu\theta) - \tfrac{1}{2}(\theta - \theta_0)^T V_0^{-1} (\theta - \theta_0) \tag{7-12-3}$$
This achieves its maximum at
$$\theta^* = \Big[\sum_{\mu=1}^{n} (B_\mu^T V^{-1} B_\mu) + V_0^{-1}\Big]^{-1} \Big[\sum_{\mu=1}^{n} (B_\mu^T V^{-1} y_\mu) + V_0^{-1}\theta_0\Big] \tag{7-12-4}$$
and has the negative inverse Hessian
$$V^* = \Big[\sum_{\mu=1}^{n} (B_\mu^T V^{-1} B_\mu) + V_0^{-1}\Big]^{-1} \tag{7-12-5}$$
It can be shown that the sampling distribution of $\theta^*$ [Eq. (4)] has mean
$$E(\theta^*) = V^* \Big[\sum_{\mu=1}^{n} (B_\mu^T V^{-1} B_\mu)\,\theta + V_0^{-1}\theta_0\Big] \tag{7-12-6}$$
where $\theta$ is the true value. This is unbiased only when $\theta_0 = \theta$. Furthermore the covariance of the sampling distribution is given by
$$V_\theta = V^* \Big[\sum_{\mu=1}^{n} (B_\mu^T V^{-1} B_\mu)\Big] V^* \tag{7-12-7}$$
which differs from $V^*$.
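For a single parameter the formulas of this section reduce to scalar arithmetic, which makes the difference between the posterior variance and the sampling variance easy to see. The function below is an illustrative sketch only; its name, its arguments, and the numbers in the usage line are assumptions, not values from the text.

```python
def posterior_scalar(bs, ys, v, theta0, v0):
    """Scalar-parameter version of Eqs. (7-12-4), (7-12-5), and (7-12-7)
    for the linear model y_mu = b_mu * theta + error, with error variance v
    and a normal prior with mean theta0 and variance v0."""
    s = sum(b * b for b in bs) / v                   # sum of B^T V^{-1} B
    v_star = 1.0 / (s + 1.0 / v0)                    # posterior variance, Eq. (7-12-5)
    theta_star = v_star * (sum(b * y for b, y in zip(bs, ys)) / v + theta0 / v0)
    v_sampling = v_star * s * v_star                 # sampling variance, Eq. (7-12-7)
    return theta_star, v_star, v_sampling

theta_star, v_star, v_sampling = posterior_scalar([1.0, 2.0], [1.1, 1.9], 1.0, 0.0, 4.0)
print(theta_star, v_star, v_sampling)
```

In this scalar case the sampling variance is always smaller than the posterior variance, because the prior term appears twice in the denominator but not in the numerator.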
7-13. The Residuals
After the estimates $\theta^*$ for the parameters have been obtained, we can compute the final residuals
$$e_\mu^* \equiv e_\mu(\theta^*) = \begin{cases} w_\mu^* - w_\mu & \text{for exact structural model} \\ g(z_\mu, \theta^*) & \text{for inexact structural model} \\ y_\mu - f(x_\mu, \theta^*) & \text{for reduced model} \end{cases} \qquad (\mu = 1, 2, \ldots, n) \tag{7-13-1}$$
These residuals measure the departure of the data from the best curve or surface that could be fitted to them. If the model is exactly valid, these residuals must be related to the errors in the data. If such errors did not exist, there should be no residuals either. The residuals must on the average be smaller than the errors, because we have chosen the $\theta^*$ so as to make the residuals as small as possible. The errors, which should be the residuals obtained with the true values $\theta$, must be larger (unless $\theta = \theta^*$). The residuals, then, are biased estimates of the errors, the bias being toward smaller absolute values.
We now develop approximate expressions for this bias in some typical estimation situations. Suppose the errors in different experiments are uncorrelated, and have the same covariance matrix $V$ in all experiments. Further, suppose $\theta^*$ minimizes an objective function of the form
$$\Phi(\theta) = \Psi(M(\theta)) \tag{7-13-2}$$
as discussed in Section 5-9. At this minimum the gradient of $\Phi$ must vanish. Hence from Eq. (5-9-12) we have
$$q = -2\sum_{\mu=1}^{n} B_\mu^T \Gamma e_\mu^* = 0 \tag{7-13-3}$$
where
$$B_\mu \equiv -(\partial e_\mu/\partial\theta)_{\theta=\theta^*}, \qquad \Gamma \equiv (\partial\Psi/\partial M)_{\theta=\theta^*} \tag{7-13-4}$$
Assuming the model is correct, the error $\varepsilon_\mu$ is the residual computed for the true $\theta$, i.e.,
$$\varepsilon_\mu = e_\mu(\theta) \tag{7-13-5}$$
Assuming that $\theta^*$ does not differ much from $\theta$, we have from Eq. (1), Eq. (4), and Eq. (5) approximately
$$e_\mu^* \approx \varepsilon_\mu - B_\mu(\theta^* - \theta) \tag{7-13-6}$$
Substituting Eq. (6) in Eq. (3) yields
$$\sum_{\mu=1}^{n} B_\mu^T \Gamma [\varepsilon_\mu - B_\mu(\theta^* - \theta)] = 0 \tag{7-13-7}$$
whence
$$\theta^* - \theta = C^{-1} \sum_{\eta=1}^{n} B_\eta^T \Gamma \varepsilon_\eta \tag{7-13-8}$$
where
$$C \equiv \sum_{\mu=1}^{n} B_\mu^T \Gamma B_\mu \tag{7-13-9}$$
Substituting Eq. (8) in Eq. (6)
$$e_\mu^* = \varepsilon_\mu - B_\mu C^{-1} \sum_{\eta=1}^{n} B_\eta^T \Gamma \varepsilon_\eta \tag{7-13-10}$$
Since by assumption
$$E(\varepsilon_\mu \varepsilon_\eta^T) = \delta_{\mu\eta} V$$
we obtain from Eq. (10)
$$E(e_\mu^* e_\mu^{*T}) = V - B_\mu C^{-1} B_\mu^T \Gamma V - V \Gamma B_\mu C^{-1} B_\mu^T + B_\mu C^{-1} \Big(\sum_{\eta=1}^{n} B_\eta^T \Gamma V \Gamma B_\eta\Big) C^{-1} B_\mu^T \tag{7-13-11}$$
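An appropriate measure of the residuals is given by the (unadjusted) sample covariance matrix $V^*$, defined by
$$V^* \equiv (1/n)\,M(\theta^*) = (1/n) \sum_{\mu} e_\mu^* e_\mu^{*T} \tag{7-13-12}$$
So that from Eq. (11)
$$E(V^*) = V - (1/n)\Big(\sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T\Big)\Gamma V - (1/n)\,V\Gamma\Big(\sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T\Big) + (1/n)\sum_{\mu=1}^{n} B_\mu C^{-1}\Big(\sum_{\eta=1}^{n} B_\eta^T \Gamma V \Gamma B_\eta\Big) C^{-1} B_\mu^T \tag{7-13-13}$$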
Of particular interest are the cases where $\Gamma$ is proportional to $V^{-1}$, as in rows 1 and 2 of Table 5-1 ($V$ known or proportional to a known matrix). It is easy to verify that Eq. (13) remains unchanged if $\Gamma$ is multiplied by a constant, since $C$ would be multiplied by the same constant. Hence, we may simply substitute $\Gamma = V^{-1}$, and obtain
$$E(V^*) = V - (2/n)\sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T + (1/n)\sum_{\mu=1}^{n} B_\mu C^{-1} C C^{-1} B_\mu^T = V - (1/n)\sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T \tag{7-13-14}$$
As expected, $E(V^*)$ is "smaller" than $V$, since $V - E(V^*)$ is positive definite.
Consider the case of a single equation model, with $V = \sigma^2$. In this case $B_\mu^T$ is a vector $b_\mu$. Hence from Eq. (9)
$$C = (1/\sigma^2)\sum_{\mu=1}^{n} b_\mu b_\mu^T \tag{7-13-15}$$
and, since $C$ is an $l \times l$ matrix ($l$ being the number of parameters)
$$\sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T = \sum_{\mu=1}^{n} b_\mu^T C^{-1} b_\mu = \sum_{\mu=1}^{n} \mathrm{Tr}(C^{-1} b_\mu b_\mu^T) = \mathrm{Tr}(C^{-1}\sigma^2 C) = \sigma^2\,\mathrm{Tr}\,I_l = l\sigma^2 \tag{7-13-16}$$
Therefore, Eq. (14) becomes
$$E(V^*) = (1 - l/n)\,\sigma^2 \tag{7-13-17}$$
If we want to estimate $\sigma^2$ ($\sigma$ being the standard deviation of the errors) from the residuals, we use
$$\tilde{\sigma}^2 = \frac{1}{1 - l/n}\,V^* = \frac{1}{1 - l/n}\,\frac{1}{n}\sum_{\mu=1}^{n} e_\mu^{*2} = \frac{1}{n - l}\sum_{\mu=1}^{n} e_\mu^{*2} \tag{7-13-18}$$
This is a well known formula, which states that the variance of the errors is estimated by dividing the sum of squares of the residuals by the number of degrees of freedom, which is the number of observations $n$ less the number of unknown parameters $l$.
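The bias factor $1 - l/n$ can be seen in a small simulation. This is a sketch under assumptions chosen for illustration: a straight-line model ($l = 2$) fitted by unweighted least squares to $n = 6$ points with unit-variance normal errors; the function name is hypothetical.

```python
import random

def mean_unadjusted_variance(n=6, sigma=1.0, trials=20000, seed=1):
    """Average of V* = (1/n) * sum of squared residuals over many simulated
    straight-line fits y = theta1 + theta2 * x (l = 2 parameters)."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(n)]
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    total = 0.0
    for _ in range(trials):
        # True parameters are taken as zero; the residuals do not depend on them.
        ys = [rng.gauss(0.0, sigma) for _ in xs]
        ybar = sum(ys) / n
        slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        resid = [y - ybar - slope * (x - xbar) for x, y in zip(xs, ys)]
        total += sum(e * e for e in resid) / n
    return total / trials

n, l = 6, 2
v_star = mean_unadjusted_variance(n)
print(v_star, 1 - l / n)        # simulated E(V*) against (1 - l/n) * sigma^2
print(v_star / (1 - l / n))     # adjusted estimate, to be compared with sigma^2 = 1
```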
If the model contains $m > 1$ equations, the situation is more complicated. Suppose, however, that we wish to estimate $V$ as some multiple of $V^*$; say we assume
$$E(V^*) = \rho V \tag{7-13-19}$$
Then, substituting Eq. (19) in Eq. (14) and multiplying by $V^{-1}$, we obtain
$$\rho I_m = I_m - (1/n)\sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T V^{-1} \tag{7-13-20}$$
Taking traces on both sides, and remembering that if $AB$ is a square matrix then
$$\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$$
we obtain
$$\rho m = m - (1/n)\,\mathrm{Tr}\,C^{-1}\Big(\sum_{\mu=1}^{n} B_\mu^T V^{-1} B_\mu\Big) = m - (1/n)\,\mathrm{Tr}\,C^{-1}C = m - l/n$$
so that
$$\rho = 1 - l/mn \tag{7-13-21}$$
Hence, to estimate $V$ we use the (adjusted) sample covariance matrix
$$\tilde{V} = \frac{1}{\rho}\,V^* = \frac{1}{1 - (l/mn)}\,\frac{1}{n}\,M(\theta^*) = \frac{1}{n - l/m}\sum_{\mu=1}^{n} e_\mu^* e_\mu^{*T} \tag{7-13-22}$$
Once again, to estimate the covariance matrix of the errors, we take the moment matrix of the residuals and divide by the number of degrees of freedom per equation, that is, the number of observations $n$ per variable less the "average" number of parameters per equation $l/m$. Clearly, Eq. (22) reduces to Eq. (18) in the single equation case $m = 1$.
In the case where $V$ is completely unknown, we have from row 3 of Table 5-1 $\Gamma$ proportional to $M^{-1}$. If we now make the further assumption that $M$ is proportional to $V$ we have once more $\Gamma$ proportional to $V^{-1}$, and Eq. (22) is still valid. The maximum likelihood estimate $V^* = (1/n)M(\theta^*)$ is biased by the factor $1 - l/mn$. Since this factor approaches unity as $n \to \infty$, the maximum likelihood estimate is consistent.
The formulas derived here should be viewed with caution, since the
residuals for one equation may turn out much smaller than predicted, while
another equation has much larger ones. It is only on the average, in a certain
sense, that our bias factors apply.
7-14. The Independent Variables Subject to Error
Even consistency does not hold when the independent variables are subject to error, so that the number of unknown parameters is essentially proportional to $n$. A very rough indication of the bias to be expected even when the number of experiments is very large can be obtained as follows:
Suppose we have an $m$-equation model with $r$ variables subject to error. Let the observed, estimated, and true values of these variables for the $\mu$th experiment be designated $w_\mu$, $w_\mu^*$, and $\hat{w}_\mu$, respectively. We assume these three values are not far apart. By definition, the residuals are
$$e_\mu^* \equiv w_\mu^* - w_\mu \tag{7-14-1}$$
and the errors are
$$\varepsilon_\mu \equiv \hat{w}_\mu - w_\mu \tag{7-14-2}$$
Since we are dealing with the asymptotic case, we assume that our estimates $\theta^*$ are errorless, i.e., $\theta^* = \theta$. The estimates $w_\mu^*$ must then minimize
$$\tfrac{1}{2}(w_\mu^* - w_\mu)^T V_\mu^{-1} (w_\mu^* - w_\mu)$$
subject to $g(w_\mu^*, \theta^*) = 0$.
Letting
$$A_\mu \equiv (\partial g/\partial w_\mu)_{\theta=\theta^*}$$
we have approximately
$$g(w_\mu^*, \theta^*) = g(w_\mu, \theta^*) + A_\mu(w_\mu^* - w_\mu) = 0 \tag{7-14-3}$$
Since $\theta^* = \theta$, the true values $\hat{w}_\mu$ also satisfy $g(\hat{w}_\mu, \theta^*) = 0$, i.e., approximately
$$g(\hat{w}_\mu, \theta^*) = g(w_\mu, \theta^*) + A_\mu(\hat{w}_\mu - w_\mu) = g(w_\mu, \theta^*) + A_\mu\varepsilon_\mu = 0 \tag{7-14-4}$$
Solving Eq. (4) for $g(w_\mu, \theta^*)$ and substituting in Eq. (3)
$$-A_\mu\varepsilon_\mu + A_\mu(w_\mu^* - w_\mu) = 0 \tag{7-14-5}$$
Because of its nature as the solution to an equality-constrained minimization problem, $w_\mu^*$ must be a stationary point of the Lagrangian
$$\tfrac{1}{2}(w_\mu^* - w_\mu)^T V_\mu^{-1} (w_\mu^* - w_\mu) + \lambda^T g(w_\mu^*, \theta^*)$$
Differentiating with respect to $w_\mu^*$
$$V_\mu^{-1}(w_\mu^* - w_\mu) + A_\mu^T\lambda = 0 \tag{7-14-6}$$
whence
$$w_\mu^* - w_\mu = -V_\mu A_\mu^T\lambda \tag{7-14-7}$$
so that Eq. (5) becomes
$$-A_\mu\varepsilon_\mu - A_\mu V_\mu A_\mu^T\lambda = 0 \tag{7-14-8}$$
and
$$\lambda = -(A_\mu V_\mu A_\mu^T)^{-1} A_\mu\varepsilon_\mu \tag{7-14-9}$$
Now Eq. (1), Eq. (7) and Eq. (9) combine to form
$$e_\mu^* = V_\mu A_\mu^T (A_\mu V_\mu A_\mu^T)^{-1} A_\mu\varepsilon_\mu \tag{7-14-10}$$
and remembering that $E(\varepsilon_\mu\varepsilon_\mu^T) = V_\mu$, we obtain
$$E(V_\mu^*) = E(e_\mu^* e_\mu^{*T}) = V_\mu A_\mu^T (A_\mu V_\mu A_\mu^T)^{-1} A_\mu V_\mu \tag{7-14-11}$$
Assuming once more that $E(V_\mu^*)$ is proportional to $V_\mu$, e.g., $E(V_\mu^*) = \rho V_\mu$, Eq. (11) reduces to
$$\rho I_r = V_\mu A_\mu^T (A_\mu V_\mu A_\mu^T)^{-1} A_\mu \tag{7-14-12}$$
Taking traces
$$\rho r = \mathrm{Tr}[V_\mu A_\mu^T (A_\mu V_\mu A_\mu^T)^{-1} A_\mu] = \mathrm{Tr}[(A_\mu V_\mu A_\mu^T)(A_\mu V_\mu A_\mu^T)^{-1}] = \mathrm{Tr}\,I_m = m \tag{7-14-13}$$
so that
$$\rho = m/r \tag{7-14-14}$$
If $m = r$, i.e., there is only one inaccurately measured variable per equation, $\rho = 1$. There is no asymptotic bias and $V_\mu^*$ is a consistent estimate of $V_\mu$. This corresponds to the situation discussed in Section 7-13. In the problem of Section 6-13, however, we had one equation with $x_1$, $x_2$, and $y$ subject to error. Here $\rho = 1/3$, and we expect the residual covariance to be only one-third the true covariance, no matter how many experiments are performed.
It is clear that if we assume all $V_\mu$ equal, then the matrix
$$V^* = (1/n)\sum_{\mu=1}^{n} e_\mu^* e_\mu^{*T} \tag{7-14-15}$$
has the same bias factor $\rho$. If, in addition, we correct for the bias caused by the $l$ parameters $\theta$, we arrive at the estimate
$$\tilde{V} = [r/m(n - l/m)]\sum_{\mu=1}^{n} e_\mu^* e_\mu^{*T} \tag{7-14-16}$$
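The trace argument leading to $\rho = m/r$ can be checked numerically for one equation ($m = 1$) with three variables subject to error, as in the one-third example above. The sketch assumes a diagonal $V_\mu$ and an arbitrary row $A_\mu$; both function names and the numbers are hypothetical.

```python
def bias_factor(a_row, v_diag):
    """Trace of V A^T (A V A^T)^{-1} A divided by r, for one equation (m = 1)
    whose r variables have diagonal error covariance V. By Eq. (7-14-13) the
    trace equals m, so the result is rho = m / r."""
    r = len(a_row)
    ava = sum(a * a * v for a, v in zip(a_row, v_diag))       # A V A^T, a scalar here
    trace = sum(v * a * a / ava for a, v in zip(a_row, v_diag))
    return trace / r

def adjusted_divisor(n, l, m, r):
    """Factor multiplying the residual moment sum in Eq. (7-14-16)."""
    return r / (m * (n - l / m))

print(bias_factor([2.0, -1.0, 0.5], [1.0, 4.0, 0.25]))
print(adjusted_divisor(n=20, l=2, m=1, r=3))
```

The bias factor does not depend on the particular entries of $A_\mu$ or $V_\mu$, only on the counts $m$ and $r$.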
7-15. Goodness of Fit
The crucial question that arises after the estimates have been obtained
is whether or not our model fits the data. The question can be answered in
the affirmative if the residuals of the fitted model can be explained as errors
in the observations. On the other hand, if the residuals are so large, or of
such a nonrandom nature, that they cannot be ascribed to random observa-
tion errors, then we say that the model does not fit the data. We stress that
whereas a lack of fit constitutes strong grounds for rejecting, or at least
amending the model, a good fit does not prove that the model is correct.
A good fit merely establishes the fact that there is no reason to reject the model
on the basis of the data at hand. In fact, no amount of data can ever prove a
model; all we can hope is that it does not disprove it.
Our least squares or maximum likelihood estimates were usually based on the assumption that the errors $\varepsilon_\mu$ in each experiment were realizations of a random variable with mean 0 and covariance matrix $V$. After estimating the parameters we have a set of residuals $e_\mu^*$ from which we compute an adjusted covariance matrix $\tilde{V}$ (see, e.g., Eq. (7-13-22)). To establish the goodness of fit is to test the hypothesis that, with certain reservations,† the residuals form a sample from the distribution that we have postulated for the errors, corrected for the bias discussed in Sections 7-13 and 7-14.
To test a statistical hypothesis we usually compute a certain relevant statistic $\lambda$ from the sample. We compare $\lambda$ to a certain reference value $\lambda_0$, and reject the hypothesis if $\lambda \ge \lambda_0$. In doing so we incur the danger of erring in one of the following ways:
1. Error of the first kind; we reject the hypothesis although it is true.
2. Error of the second kind; we accept the hypothesis although it is false.
On the assumption that the model is correct, the distribution of the statistic $\lambda$ may be determined, and hence we can find the value $\lambda_0$ such that
$$\Pr(\lambda \ge \lambda_0) = \alpha \tag{7-15-1}$$
where $\alpha$ is a suitably chosen small number, e.g., 0.05 or 0.01. If we reject the model when $\lambda \ge \lambda_0$, the probability of committing an error of the first kind is $\alpha$. The probability of committing an error of the second kind depends on what the true model actually is, and we shall not consider this question here.
The statistics that we shall use are valid for a wide class of error distributions. The distributions of the statistics, however, are known and tabulated primarily for the case when the distribution of the errors is normal. Only in this case is it easy to find the test value $\lambda_0$ associated with a given probability $\alpha$. On the other hand, it makes no difference here whether or not the model equations are linear. Residuals should be attributable to errors, no matter what kind of model they were computed from.
† E.g., the residuals are serially correlated even when the errors are independent.
When the distribution of $\lambda$ is unknown and cannot be derived by analytic means, it is still easy to estimate the critical $\lambda_0$ by Monte Carlo techniques. Error samples with the proper distribution are generated on the computer, the statistic $\lambda$ is computed for each sample, and $\lambda_0$ is chosen so that $\lambda$ exceeds it in a fraction $\alpha$ of the samples.
Various commonly used statistics $\lambda$ are described in the next section.
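The Monte Carlo recipe above can be sketched as follows. The function name and signature are hypothetical, and the illustration uses a statistic whose distribution is in fact known ($\chi^2$ with one degree of freedom), so that the simulated critical value can be compared with the tabulated 5% point of about 3.84.

```python
import random

def monte_carlo_critical_value(statistic, draw_errors, alpha, n_samples=20000, seed=2):
    """Estimate lambda_0 such that Pr(lambda >= lambda_0) is about alpha."""
    rng = random.Random(seed)
    values = sorted(statistic(draw_errors(rng)) for _ in range(n_samples))
    index = min(int((1.0 - alpha) * n_samples), n_samples - 1)
    return values[index]

# Illustration: lambda = n * (mean error)^2 / sigma^2 for n unit-variance
# normal errors, which is distributed as chi-square with one degree of freedom.
def draw(rng, n=8):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def stat(errors):
    mean = sum(errors) / len(errors)
    return len(errors) * mean * mean

lambda_0 = monte_carlo_critical_value(stat, draw, alpha=0.05)
print(lambda_0)
```

For a statistic with no tabulated distribution, only `draw` and `stat` would change; the quantile step stays the same.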
7-16. Tests on Residuals
The tests we wish to perform on the residuals relate to their mean and covariance. In many cases questions concerning the mean of the residuals are of no significance, since the estimation process guarantees that (except for the effect of rounding errors) the mean is zero. For instance, whenever a model of the form
$$y = \theta_1 + \varphi(x, \theta_2, \theta_3, \ldots, \theta_l) \tag{7-16-1}$$
is estimated by least squares, the average residual is zero. For, let the objective function be given by
$$\Phi = \sum_{\mu=1}^{n} e_\mu^2 = \sum_{\mu=1}^{n} (y_\mu - \theta_1 - \varphi_\mu)^2 \tag{7-16-2}$$
To minimize $\Phi$ we form as one of the normal equations
$$\partial\Phi/\partial\theta_1 = -2\sum_{\mu=1}^{n} (y_\mu - \theta_1 - \varphi_\mu) = -2\sum_{\mu} e_\mu = 0 \tag{7-16-3}$$
Thus the sum of the residuals is zero, and so is their average. Suppose, however, that we have a model that does not guarantee zero average residuals. Now if the errors in each experiment have covariance matrix $V$, we expect the residuals to have covariance matrix $(1 - l/mn)V$ [see Eq. (7-13-22)]. The mean residual vector
$$\bar{e} = (1/n)\sum_{\mu=1}^{n} e_\mu^* \tag{7-16-4}$$
should have covariance matrix $(1/n)(1 - l/mn)V$, since the variance of the mean of $n$ independent observations is $1/n$ times the variance of the observations (we neglect the fact that the residuals are correlated even when the observations are not). If the errors $\varepsilon_\mu$ are assumed to be $N_m(0, V)$, then the average residuals $\bar{e}$ should be $N_m(0, (1/n)(1 - l/mn)V)$, and we may easily construct confidence regions for $\bar{e}$, as we have done for $\theta$ in Section 7-10. In particular, the statistic
$$\lambda \equiv [n/(1 - l/mn)]\,\bar{e}^T V^{-1} \bar{e} \tag{7-16-5}$$
is distributed as $\chi_m^2$.
If $V$ is not known, we must introduce the matrix
$$S \equiv [1/(n - 1)]\sum_{\mu=1}^{n} (e_\mu^* - \bar{e})(e_\mu^* - \bar{e})^T \tag{7-16-6}$$
If our hypothesis is true, then the statistic
$$\lambda \equiv [(n - m)n/(n - 1)m]\,\bar{e}^T S^{-1} \bar{e} \tag{7-16-7}$$
is distributed as $F_{m,n-m}$. The quantity $[(n - 1)m/(n - m)]\lambda$ is sometimes known as $T^2$ [see Anderson (1958, Ch. 5)]. For a single equation model $m = 1$ and $T^2 = \lambda$. In this case, the quantity $\lambda^{1/2}$ is known as $t$, and the associated test is the well-known $t$-test.
If the zero-mean hypothesis is accepted, we may wish to test the hypothesis that the errors in each experiment possess a given covariance matrix $V$. That is, we wish to compare the covariance of the residuals $\tilde{V}$ given by Eq. (7-13-22) with $V$. An appropriate statistic is given by
$$\lambda = n[\log\det(V\tilde{V}^{-1}) - m + \mathrm{Tr}(\tilde{V}V^{-1})] \tag{7-16-8}$$
Its distribution in the normal case is tabulated by Korin (1968).
Another frequently tested hypothesis is that two sets of residuals are uncorrelated. The need for such a test arises when we base our estimates on an assumed diagonal covariance matrix of the errors. We then obtain a covariance matrix $\tilde{V}$ of the residuals, and wish to find out whether $\tilde{V}_{ab}$ $(a \ne b)$ differs significantly from zero.
We compute the correlation coefficient
$$r_{ab} = \tilde{V}_{ab}/(\tilde{V}_{aa}\tilde{V}_{bb})^{1/2} \tag{7-16-9}$$
If $r_{ab}$ is the correlation coefficient computed from a sample of $n^*$ independent pairs of mutually independent normal deviates, then the quantity
$$\lambda = r_{ab}[(n^* - 2)/(1 - r_{ab}^2)]^{1/2} \tag{7-16-10}$$
has the $t$-distribution with $n^* - 2$ degrees of freedom (Anderson, 1958, p. 65). For our purposes, we should probably take $n^* = n - l/m$, but with $n^* > 10$ the $t$-distribution is quite insensitive to the number of degrees of freedom.
Suppose $n^* = 20$ and $r_{ab} = -0.25$; we have $\lambda = -1.095$. According to Table 4 (Cramér, 1946), the probability that $|t_{18}|$ exceeds 1.095 is a sizable 29%, so we cannot reject the hypothesis that $V_{ab} = 0$. We could be 99% sure that $V_{ab} \ne 0$ if we have $|t_{18}| \ge 2.878$, corresponding to $|r_{ab}| \ge 0.561$.
Further examples appear in Section 7-24.
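The arithmetic of this example can be reproduced with two short helpers (hypothetical names; the critical value 2.878 is taken from the text rather than computed):

```python
import math

def correlation_t(r_ab, n_star):
    """Statistic of Eq. (7-16-10)."""
    return r_ab * math.sqrt((n_star - 2) / (1.0 - r_ab ** 2))

def critical_correlation(t_crit, n_star):
    """Smallest |r_ab| whose statistic reaches t_crit (Eq. (7-16-10) inverted)."""
    dof = n_star - 2
    return t_crit / math.sqrt(t_crit ** 2 + dof)

print(round(correlation_t(-0.25, 20), 3))
print(round(critical_correlation(2.878, 20), 3))
```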
7-17. Runs and Outliers
Residuals that have passed the tests of the previous section may still
be unsatisfactory. Though of reasonable magnitude, they may display trends
and other departures from randomness that call for modifications in the
model. The reader is referred to Acton (1959, Ch. 3) or to Draper and Smith
(1966) for an excellent treatment of this problem on a practical level. Briefly,
the residuals should be plotted against the various variables that are included
in the model, and also against the time at which the observations were taken.
Linear, quadratic, or periodic trends may reveal themselves and will call for the inclusion of appropriate additional terms in the model. Trends in the variance of the errors may also be detected, and may shed some light on the measuring process. Finally we may test for randomness by counting the number of "runs" in the residuals, a run being a sequence of residuals of equal sign. If the number of runs is much lower than expected the randomness of the residuals is suspect.
If $n_1$ and $n_2$ are the numbers of negative and positive residuals respectively, then the expected number of runs (on the assumption of complete randomness) is
$$\mu = 2n_1n_2/(n_1 + n_2) + 1 \tag{7-17-1}$$
and the variance of the number of runs is
$$\sigma^2 = 2n_1n_2(2n_1n_2 - n_1 - n_2)/[(n_1 + n_2)^2(n_1 + n_2 - 1)] \tag{7-17-2}$$
The actual distribution of the number of runs was derived and tabulated by Swed and Eisenhart (1943). A table also appears in Draper and Smith (1966, p. 98). When both $n_1$ and $n_2$ exceed 10, then the quantity
$$z \equiv (r - \mu + \tfrac{1}{2})/\sigma \tag{7-17-3}$$
($r$ = number of actual runs) is distributed approximately as $N_1(0, 1)$. A numerical example appears in Section 7-24.
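A minimal sketch of the runs computation follows. The function name is hypothetical, residuals exactly equal to zero are skipped (an assumption, since the text does not say how to treat them), and the example residuals are invented for illustration.

```python
import math

def runs_test(residuals):
    """Return (r, mu, sigma2, z) for the signs of the residuals,
    following Eqs. (7-17-1) to (7-17-3). Zero residuals are skipped."""
    signs = [e > 0 for e in residuals if e != 0]
    n1 = sum(1 for s in signs if not s)        # number of negative residuals
    n2 = sum(1 for s in signs if s)            # number of positive residuals
    r = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    sigma2 = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
              / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (r - mu + 0.5) / math.sqrt(sigma2)
    return r, mu, sigma2, z

# Hypothetical residuals: 24 values arranged in 6 runs of 4 equal signs.
example = [s * 0.1 for s in ([1] * 4 + [-1] * 4) * 3]
print(runs_test(example))
```

With 12 residuals of each sign, 13 runs are expected; the 6 observed here give a strongly negative $z$, the signature of too few runs.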
We stress that failure to pass the number of runs test is no reason for outright rejection of the model. Usually it is merely an indication that some possibly minor effects have been neglected. Particularly in cases where the data are very accurate, neglected effects outweigh random errors in measurement. Consequently, nonrandomness of residuals is the rule, rather than the exception, when models are fitted to good data.
Many tests on residuals are best accomplished by graphical means. If the probability distribution of the errors is to be investigated, a histogram or a cumulative frequency plot is called for. Suppose the residuals are renumbered so that $e_1$ is the smallest (algebraically) and $e_n$ is the largest. Let $P_i \equiv (i - \tfrac{1}{2})/n$; then $P_i$ is an estimate of the quantity $\Pr(e \le e_i)$. A plot of $P_i$ versus $e_i$ then approximates the cumulative distribution function of the errors. When this plot is made on normal probability paper, the result should be a straight line if the error distribution is normal. If all the points are reasonably close to a straight line except for a few at the low and high ends, then the presence of outliers (see below) is suspected. If the points seem to fall into a few clusters rather than follow a smooth curve, then one may conclude that different sources of error were operative in different subsets of the observations.
It may happen that some gross error is committed in the conduct or
recording of some experiment. Naturally, the erroneous observations give
rise to unusually large residuals, called outliers. More seriously, such erroneous
values can gravely distort the parameter estimates. Therefore, one wishes to
eliminate such observations from the analysis, and the easiest way to spot
them is by examination of the residuals. If there is a clear-cut differentiation
between the "regular" residuals which fall on the smooth part of the
probability plot, and the "outliers", then we should not hesitate to remove the latter and recompute the estimates in their absence. However, if the distinction is blurred, then the problem of diagnosing outliers is a difficult one. A procedure often adopted in practice is to remove all residuals whose magnitude exceeds the standard deviation (either known or estimated using all residuals) by a fixed factor, say 2.5 or 3. When setting such a threshold one should take into account the probability of residuals of such magnitude occurring by chance in a population of size $n$. For instance, with a normal distribution and 100 observations the probability of finding a residual exceeding $3\sigma$ is 23.7%, giving one little reason for rejecting such a residual out-of-hand. For a more systematic approach, see Anscombe (1960).
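The 23.7% figure follows from the two-sided normal tail probability and the assumption that the residuals are independent. A short sketch (hypothetical function name) that reproduces it:

```python
import math

def prob_any_exceeds(k, n):
    """Probability that at least one of n independent normal residuals
    exceeds k standard deviations in absolute value."""
    p_single = math.erfc(k / math.sqrt(2.0))   # two-sided tail probability
    return 1.0 - (1.0 - p_single) ** n

print(round(prob_any_exceeds(3.0, 100), 3))
```

The same function shows how quickly a fixed threshold loses its force as the sample grows, which is why the threshold should be chosen with $n$ in mind.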
7-18. Causes of Failure
If the parameters turn out well-determined and the residuals are acceptable, then our estimation problem is solved. Only too often, however, we run into one of the following less satisfactory situations:
(a) Parameters ill-determined, residuals large but acceptable, since the measurement errors are known to be large. Barring the possibility of reducing measurement errors, we can improve our estimates only by conducting many more experiments. As a rule, the standard deviation of estimates decreases roughly as $n^{-1/2}$ so that a tenfold improvement in the estimate requires a hundredfold increase in the number of experiments.
(b) Parameters (or some linear combinations of them) ill-determined, residuals small. This may be due to a degeneracy in the model. For example, in the model
$$y = (\theta_1 + \theta_2)x \tag{7-18-1}$$
it is obviously impossible to estimate $\theta_1$ and $\theta_2$ separately. The degeneracy is not always quite so obvious. Consider for instance our falling sphere model Eq. (2-14-5). If we write out that equation in full we find that the distance $s$ travelled by the sphere in time $t$ is
$$s = [g(m - m_0)/6\pi r\mu]\,t - [gm(m - m_0)/36\pi^2 r^2\mu^2]\,(1 - e^{-(6\pi r\mu/m)t}) \tag{7-18-2}$$
Some study is required to see that among the parameters $g$, $m$, $m_0$, $r$, and $\mu$ appearing in the equation, only two can be estimated independently.
An even more common source of degeneracy are the data themselves. For example, suppose that for the model
$$y = \theta_1 x_1 + \theta_2 x_2 \tag{7-18-3}$$
we have made many observations of $y$ at different values of $x_1$ and $x_2$, but by chance in each experiment $x_1$ turned out to be approximately equal to $x_2$. For these data, model Eq. (3) is indistinguishable from
$$y = (\theta_1 + \theta_2)x_1 \tag{7-18-4}$$
in which $\theta_1$ and $\theta_2$ cannot be estimated independently. The above case appears trivial, but similar conditions obtain, perhaps in more subtle form, in many experimental situations. The only solution is to plan the experiments properly, as indicated in Chapter X.
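Data-induced degeneracy of this kind shows up as a nearly singular matrix $B^T B$. The sketch below (hypothetical function name, invented data) computes the determinant of $B^T B$ for model Eq. (7-18-3), scaled by the product of its diagonal elements so that the result lies between 0 (degenerate) and 1 (orthogonal design).

```python
def scaled_normal_determinant(x1s, x2s):
    """det(B^T B) / (s11 * s22) for the model y = theta1*x1 + theta2*x2,
    where B has rows (x1, x2). Zero means theta1 and theta2 cannot be
    separated with these data."""
    s11 = sum(a * a for a in x1s)
    s22 = sum(b * b for b in x2s)
    s12 = sum(a * b for a, b in zip(x1s, x2s))
    return (s11 * s22 - s12 * s12) / (s11 * s22)

x1 = [1.0, 2.0, 3.0, 4.0]
print(scaled_normal_determinant(x1, x1))                      # x1 equal to x2
print(scaled_normal_determinant(x1, [1.01, 1.98, 3.02, 3.99]))  # x1 nearly equal to x2
print(scaled_normal_determinant(x1, [4.0, 1.0, 3.0, 2.0]))    # well-spread design
```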
(c) Parameters ill-determined and residuals unacceptable. The model must be rejected, or at least amended to include those effects that were observed in the residuals.
One of the hypotheses underlying a model is that the unknown parameters are constants that do not depend on the model variables. It is clearly desirable to test this hypothesis, and this can be done if the data are sufficiently rich. To test, for instance, the hypothesis that $\theta$ is independent of some variable $z_1$, we break up the data into subsets each corresponding to a single value, or a narrow range of values, of $z_1$. We estimate the parameters separately from each data subset, and employ the usual statistical techniques to test whether the estimates obtained from the subsets are significantly different from the estimate obtained from the whole sample, or whether the subset estimates show any trends or other functional relationships with $z_1$. If such relationships exist, they may be used to amend the original model. This technique has been described by Hunter and Mezaki (1964) and Box and Hunter (1965).
It is not always possible to apply this method directly. For instance, let the model be
$$y = \theta_1 + \theta_2 x_1 + \theta_3 x_2 \tag{7-18-5}$$
It is impossible to estimate the parameters $\theta_1$ and $\theta_2$ separately if we restrict the data to a single value of $x_1$. However, we may still break up the entire range of $x_1$ values into a few fairly wide intervals, and obtain a separate estimate for each range.
7-19, Prediction
Perhaps the most important object of mathematical modeling of physical situations is that of predicting future responses to given conditions. The estimation procedures provide values for the parameters to be inserted into the prediction equations. These equations need not be the same as the model equations used for estimation, nor need the variables to be predicted coincide with the dependent variables of the model equations. For instance, we observe the time a liquid takes to flow through a capillary tube in order to estimate viscosity; we use the viscosity to predict damping factors for standing waves in a pool. At any rate, let us say that we wish to predict the value of some vector $\eta$, based on the value of a vector $\xi$ of independent variables and a vector of parameters $\theta$. The prediction is to be made on the basis of the model

$$\eta_p = \Phi(\xi, \theta^*) \tag{7-19-1}$$

where the subscript p stands for the predicted value.
Assuming the model itself is correct, there are three possible sources of inaccuracy in the predictions: errors in the estimated $\theta^*$, errors in the setting of $\xi$, and errors in the measurement of $\eta$. All three sources contribute to the difference between the predicted $\eta_p$ and the eventually observed $\eta$. Usually (except in purely linear models) there will be some bias in the predicted $\eta_p$, but there is little that we can say about it. Assuming, however, that this bias is small compared to the other errors involved, and that the errors from all three sources are statistically independent, then we can obtain an approximation to the covariance matrix of the prediction errors.
Suppose we denote the three errors by $\delta\theta$, $\delta\xi$ and $\delta\eta$, respectively. The observed value of $\eta$ will be given by

$$\eta_o = \Phi(\xi + \delta\xi,\ \theta^* + \delta\theta) + \delta\eta \tag{7-19-2}$$

A Taylor series expansion up to linear terms yields

$$\eta_o - \eta_p = (\partial\Phi/\partial\xi)\,\delta\xi + (\partial\Phi/\partial\theta)\,\delta\theta + \delta\eta \tag{7-19-3}$$

The covariance matrix of the prediction errors is given by

$$V_{\eta_p} \equiv E(\eta_o - \eta_p)(\eta_o - \eta_p)^T = (\partial\Phi/\partial\xi)\,V_\xi\,(\partial\Phi/\partial\xi)^T + (\partial\Phi/\partial\theta)\,V_\theta\,(\partial\Phi/\partial\theta)^T + V_\eta \tag{7-19-4}$$

where $V_\xi$, $V_\theta$ and $V_\eta$ are, respectively, the covariance matrices of $\delta\xi$, $\delta\theta$, and $\delta\eta$. The first term on the right hand side of Eq. (4) may be omitted if $\xi$ can be set (or is known) precisely. The matrix $V_\theta$ is obtained in the process of estimating the parameters, as shown in Section 7-5. If $\eta$ coincides with the $y$ in the model equations, then $V_\eta$ is estimated (if not known previously) from the residuals, as in Section 7-13.
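The propagation formula Eq. (7-19-4) can be sketched numerically as follows. This is only an illustration under stated assumptions: the prediction function $\eta = \theta_1 \exp(-\theta_2 \xi)$, the covariance values, and the helper names `matmul`, `transpose` and `prediction_covariance` are all hypothetical and do not come from the text.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def prediction_covariance(j_xi, v_xi, j_theta, v_theta, v_eta):
    """First-order covariance of eta_o - eta_p, in the form of Eq. (7-19-4)."""
    term_xi = matmul(matmul(j_xi, v_xi), transpose(j_xi))
    term_theta = matmul(matmul(j_theta, v_theta), transpose(j_theta))
    return [[term_xi[i][j] + term_theta[i][j] + v_eta[i][j]
             for j in range(len(v_eta))] for i in range(len(v_eta))]

# Hypothetical scalar prediction eta = theta1 * exp(-theta2 * xi); numbers are made up.
theta1, theta2, xi = 2.0, 0.5, 1.0
phi = theta1 * math.exp(-theta2 * xi)
j_xi = [[-theta2 * phi]]                          # d(phi)/d(xi)
j_theta = [[math.exp(-theta2 * xi), -xi * phi]]   # d(phi)/d(theta1), d(phi)/d(theta2)
v_xi = [[0.01]]                                   # assumed setting-error variance
v_theta = [[0.04, 0.0], [0.0, 0.01]]              # assumed parameter covariance
v_eta = [[0.0025]]                                # assumed measurement-error variance

v_pred = prediction_covariance(j_xi, v_xi, j_theta, v_theta, v_eta)
print("prediction error variance:", v_pred[0][0])
```

If $\xi$ is known precisely, passing a zero matrix for `v_xi` drops the first term, as the text notes.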
7-20. Parameter Transformation
It is frequently convenient to perform the estimation not in terms of the original parameters of the model, but in terms of transformed variables which simplify the mathematical form of the model equations. Examples of this have been given in Section 4-19, in connection with linearizing transformations, and the point is also illustrated in the problem of Section 5-23.

Let us assume, then, that we have estimated a vector $\theta$ of $l$ parameters which are functions $\theta(c)$ of the original model parameters $c$. Let $\theta^*$ and $V_\theta$ be the estimates for $\theta$ and its covariance matrix, respectively. If the transformation from $c$ to $\theta$ is reversible around $\theta = \theta^*$, i.e., if in the neighborhood of $\theta^*$ there exist functions $\gamma(\theta)$ such that $c = \gamma(\theta)$ is a unique solution to the equations $\theta = \theta(c)$, then $J \equiv \partial\gamma/\partial\theta$ is nonsingular at $\theta = \theta^*$. A first-order Taylor series expansion of $\gamma$ has the form

$$c = \gamma(\theta^*) + (\partial\gamma/\partial\theta)(\theta - \theta^*) = c^* + J^*\,\delta\theta \tag{7-20-1}$$

where $J^* \equiv J_{\theta=\theta^*}$. Hence, approximately

$$V_c \equiv E(c - c^*)(c - c^*)^T \approx E(J^*\,\delta\theta\,\delta\theta^T J^{*T}) = J^* V_\theta J^{*T} \tag{7-20-2}$$

Eq. (2) may be regarded as a special case of Eq. (7-19-4), where the $\eta$ to be predicted are simply the $c$. See Section 7-24 for a numerical example.
If the equations $\theta^* = \theta(c^*)$ cannot be solved explicitly for $c^*$, we have to resort to a numerical solution. In this case, we can still calculate $J^* = (\partial\theta/\partial c)^{-1}_{c=c^*}$. If we use the Newton method to solve for $c^*$, we obtain $J^*$ as a by-product.
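A minimal sketch of Eq. (7-20-2) with a finite-difference Jacobian is given below. It assumes, purely for illustration, a reciprocal transformation $c = 1/\theta$ applied to two parameters; the values of $\theta^*$ and $V_\theta$ are the third and fourth entries printed in Section 7-24, and the reciprocal relation is an inference from those printed numbers rather than something stated in the text. The function names are hypothetical.

```python
import math

def jacobian(gamma, theta, h=1e-6):
    """Central-difference approximation to J = d(gamma)/d(theta)."""
    n = len(theta)
    cols = []
    for j in range(n):
        up, dn = list(theta), list(theta)
        up[j] += h
        dn[j] -= h
        cols.append([(a - b) / (2.0 * h) for a, b in zip(gamma(up), gamma(dn))])
    # cols[j][i] holds d(gamma_i)/d(theta_j); rearrange into rows of J
    return [[cols[j][i] for j in range(n)] for i in range(len(cols[0]))]

def transform_covariance(j, v):
    """V_c = J V_theta J^T, Eq. (7-20-2)."""
    n = len(v)
    jv = [[sum(row[k] * v[k][l] for k in range(n)) for l in range(n)] for row in j]
    return [[sum(jv[i][l] * j[r][l] for l in range(n)) for r in range(len(j))]
            for i in range(len(j))]

def gamma(theta):
    # Assumed transformation: each c is the reciprocal of the matching theta
    return [1.0 / theta[0], 1.0 / theta[1]]

theta_star = [0.790686, 1.00224]
v_theta = [[0.122197e-2, -0.704885e-3],
           [-0.704885e-3, 0.698556e-1]]

j_star = jacobian(gamma, theta_star)
v_c = transform_covariance(j_star, v_theta)
c_star = gamma(theta_star)
c_sd = [math.sqrt(v_c[i][i]) for i in range(2)]
print(c_star, c_sd)
```

Under these assumptions the resulting values and standard deviations of $c$ can be compared with the corresponding entries of $c^*$ listed in Section 7-24.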
7-21. Single-Equation Least Squares Problem
We shall now interpret the results of Section 5-21 in the light of the techniques described in this chapter. We recall that the least squares solution to the model Eq. (5-21-5) with the data of Table 5-2 is given by:

$$\theta^* = \begin{bmatrix} 813.4583 \\ 960.9063 \end{bmatrix}, \qquad \Phi^* = 0.03980599$$

$$H^* \approx N^* = \begin{bmatrix} 0.271890 \times 10^{-5} & -0.957336 \times 10^{-5} \\ -0.957336 \times 10^{-5} & 3.50371 \times 10^{-5} \end{bmatrix}$$

According to Eq. (7-2-4) we may represent $\Phi(\theta)$ approximately by means of the equation

$$\Phi(\theta) \approx 0.03980599 + \tfrac{1}{2}\,10^{-5}\,[\,0.271890\,(\theta_1 - 813.4583)^2 - 1.914672\,(\theta_1 - 813.4583)(\theta_2 - 960.9063) + 3.50371\,(\theta_2 - 960.9063)^2\,] \tag{7-21-1}$$
How good is this approximation? In Fig. 7-3 we compare the contours of the true objective function to those of the approximation Eq. (1). We also show the boundary of the region in which the approximate value of $\Phi$ is in error by no more than 5%. We find that there is excellent agreement between the true and approximate values within the region $|\Phi(\theta) - \Phi^*| \le 0.005$, and in some areas the agreement extends far beyond this region.

[Fig. 7-3. Contours of objective function. Contours of $\Phi - \Phi^*$: true; quadratic approximation; limits of 5% error region. Horizontal axis: $\theta_1 - \theta_1^*$.]
The eigenvalues and vectors of $N^*$ are:

$$\lambda_1 = 3.7660 \times 10^{-5}, \quad u_1 = \begin{bmatrix} 0.2642 \\ -0.9645 \end{bmatrix}; \qquad \lambda_2 = 0.0096 \times 10^{-5}, \quad u_2 = \begin{bmatrix} 0.9645 \\ 0.2642 \end{bmatrix}$$

Accordingly, Eq. (1) can be rewritten in canonical form as

$$\Phi - 0.03980599 = \tfrac{1}{2}\,10^{-5}\,(3.7660\,\psi_1^2 + 0.0096\,\psi_2^2) \tag{7-21-2}$$

where

$$\psi_1 = 0.2642\,(\theta_1 - 813.4583) - 0.9645\,(\theta_2 - 960.9063)$$
$$\psi_2 = 0.9645\,(\theta_1 - 813.4583) + 0.2642\,(\theta_2 - 960.9063) \tag{7-21-3}$$
If we choose $\epsilon = 0.005$, then the indifference region $|\Phi - \Phi^*| \le 0.005$ is defined (approximately) by

$$3.7660\,\psi_1^2 + 0.0096\,\psi_2^2 \le 2 \times 0.005/10^{-5} = 1000 \tag{7-21-4}$$

With $\psi_2$ held constant at zero, this corresponds to

$$|\psi_1| \le (1000/3.7660)^{1/2} = 16.3 \qquad (\psi_2 = 0)$$

and with $\psi_1$ held constant at zero

$$|\psi_2| \le (1000/0.0096)^{1/2} = 323 \qquad (\psi_1 = 0)$$

Thus $\psi_1$ (the short axis of the ellipse Eq. (4)) is relatively well-determined, but $\psi_2$ (the long axis of the ellipse) is poorly-determined. In Fig. 7-3, the $\psi_1$ and $\psi_2$ axes do not appear perpendicular to each other, because $\theta_1$ and $\theta_2$ are drawn to different scales.
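The canonical form and the half-axes of the indifference region can be checked with a few lines of arithmetic. The sketch below uses the closed-form eigen-decomposition of a symmetric 2 x 2 matrix; the helper name `eig_sym2` is hypothetical, and the sign of each eigenvector is arbitrary, so it may come out opposite to the printed one.

```python
import math

def eig_sym2(a, b, d):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, d]], b != 0."""
    mean = 0.5 * (a + d)
    rad = math.hypot(0.5 * (a - d), b)
    pairs = []
    for lam in (mean + rad, mean - rad):
        vx, vy = b, lam - a            # solves the first row of (A - lam*I) v = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

# Entries of N* as printed above
N = (0.271890e-5, -0.957336e-5, 3.50371e-5)
eps = 0.005                            # indifference level for |Phi - Phi*|

pairs = eig_sym2(*N)
# Phi - Phi* = (1/2) * lam * psi^2 along each axis, so |psi| <= sqrt(2 * eps / lam)
half_axes = [math.sqrt(2.0 * eps / lam) for lam, _ in pairs]
for (lam, vec), half in zip(pairs, half_axes):
    print(lam, vec, half)
```

Because $N^*$ is nearly singular, the small eigenvalue is sensitive to rounding of the printed entries, so agreement with the text is only to about the digits shown there.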
To estimate the covariance matrix of the estimate, we use Eq. (7-5-15), but we must estimate $\sigma^2$ first. The residual sum of squares is $\Phi^* = 0.03980599$, and:

$$\sigma^2 = [1/(15 - 2)]\,0.03980599 = 0.00306200, \qquad \sigma = 0.05533533$$

$$V_\theta \approx 2 \times 0.00306200\,N^{*-1} = \begin{bmatrix} 60561.7 & 16547.6 \\ 16547.6 & 4696.17 \end{bmatrix} \tag{7-21-5}$$

The standard deviations of the individual parameter estimates are

$$\sigma_1 = 60561.7^{1/2} = 246.093, \qquad \sigma_2 = 4696.17^{1/2} = 68.5286$$

The correlation between the estimates is

$$\rho_{1,2} = 16547.6/(246.093 \times 68.5286) = 0.981214$$
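The arithmetic of this step is easy to reproduce. The sketch below takes the printed residual sum of squares and the printed covariance matrix as given and recomputes $\sigma^2$, the standard deviations, and the correlation; the variable names are hypothetical.

```python
import math

phi_star = 0.03980599          # residual sum of squares
n_obs, n_par = 15, 2           # observations and parameters

sigma2 = phi_star / (n_obs - n_par)
sigma = math.sqrt(sigma2)

# Covariance matrix of the estimates as printed in Eq. (7-21-5)
v_theta = [[60561.7, 16547.6],
           [16547.6, 4696.17]]

sd = [math.sqrt(v_theta[i][i]) for i in range(2)]
rho = v_theta[0][1] / (sd[0] * sd[1])
print(sigma2, sigma, sd, rho)
```

Recomputing $V_\theta$ itself by inverting the printed $N^*$ would differ slightly from Eq. (7-21-5), since the rounded entries of a nearly singular matrix lose accuracy under inversion.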
The principal components, of course, coincide with $\psi_1$ and $\psi_2$. Their variances are given by

$$\pi_1 = 2\sigma^2/\lambda_1 = 162.61, \qquad \pi_2 = 2\sigma^2/\lambda_2 = 65095.3$$

with standard deviations

$$\pi_1^{1/2} = 12.752, \qquad \pi_2^{1/2} = 255.14$$

Again, we see that $\psi_1$ is well-determined, $\psi_2$ less so.

To compute the scaled principal components, we scale each parameter to have unit standard deviation, i.e., we define

$$v_1 = \theta_1/246.093, \qquad v_2 = \theta_2/68.5286$$

The covariance matrix of $v$ is simply the correlation matrix of $\theta$, i.e.,

$$V_v = \begin{bmatrix} 1 & 0.981214 \\ 0.981214 & 1 \end{bmatrix}$$

whose eigenvalues and vectors (in the $v$ coordinates) are:

$$\mu_1 = 1.981214, \quad p_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}; \qquad \mu_2 = 0.018786, \quad p_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}$$

To express $p_1$ and $p_2$ in terms of $\theta$, we have to unscale, i.e.,

$$p_1 = \begin{bmatrix} 1/(\sqrt{2} \times 246.093) \\ 1/(\sqrt{2} \times 68.5286) \end{bmatrix} = \begin{bmatrix} 0.00287333 \\ 0.0103184 \end{bmatrix}, \qquad p_2 = \begin{bmatrix} 0.00287333 \\ -0.0103184 \end{bmatrix}$$

Thus the quantity

$$p_1 = 0.00287333\,(\theta_1 - \theta_1^*) + 0.0103184\,(\theta_2 - \theta_2^*)$$

has variance 1.981214, the quantity

$$p_2 = 0.00287333\,(\theta_1 - \theta_1^*) - 0.0103184\,(\theta_2 - \theta_2^*)$$

has variance 0.018786, and the two are uncorrelated.
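The scaled principal components can be verified directly. The sketch below rebuilds the covariance matrix from the printed standard deviations and correlation, then checks that the two unscaled combinations have variances $1 \pm \rho$ and zero covariance; the helper `quad` is a hypothetical name.

```python
import math

sd = (246.093, 68.5286)        # standard deviations of theta1*, theta2*
rho = 0.981214                 # correlation between the estimates

# Covariance matrix rebuilt from the standard deviations and correlation
v = [[sd[0] ** 2, rho * sd[0] * sd[1]],
     [rho * sd[0] * sd[1], sd[1] ** 2]]

r = 1.0 / math.sqrt(2.0)
p1 = (r / sd[0], r / sd[1])    # unscaled first component
p2 = (r / sd[0], -r / sd[1])   # unscaled second component

def quad(p, q):
    """Covariance of the linear combinations p.theta and q.theta."""
    return sum(p[i] * v[i][j] * q[j] for i in range(2) for j in range(2))

print(p1, p2)
print(quad(p1, p1), quad(p2, p2), quad(p1, p2))
```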
To obtain a 95% confidence region for $\theta$ we use the statistic of Eq. (7-11-5)

$$F_{2,13} \approx \frac{13}{2 \times 0.03980599}\,\tfrac{1}{2}\,10^{-5}\,\delta\theta^T \begin{bmatrix} 0.271890 & -0.957336 \\ -0.957336 & 3.50371 \end{bmatrix} \delta\theta = 163.292 \times \tfrac{1}{2}\,10^{-5}\,(0.271890\,\delta\theta_1^2 - 1.914672\,\delta\theta_1\,\delta\theta_2 + 3.50371\,\delta\theta_2^2) \tag{7-21-6}$$

The upper 0.05 point of $F$ with 2 and 13 degrees of freedom is, according to the tables, 3.81. Our confidence region thus has the equation

$$\tfrac{1}{2}\,10^{-5}\,(0.271890\,\delta\theta_1^2 - 1.914672\,\delta\theta_1\,\delta\theta_2 + 3.50371\,\delta\theta_2^2) \le 3.81/163.292 = 0.023332$$
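As a rough sketch, Eq. (7-21-6) can be wrapped in a function that reports whether a given offset $\delta\theta$ lies inside the approximate 95% region. The general form used here, $(n - l)/(l\,\Phi^*)$ times half the quadratic form, is read off from the numerical instance above and is an assumption about Eq. (7-11-5); the example offset is the one that appears in Section 7-22, and the function name is hypothetical.

```python
phi_star = 0.03980599
n_obs, n_par = 15, 2
F_CRIT = 3.81                  # tabulated upper 0.05 point of F(2, 13)

# Entries of N* as printed in Section 7-21
N = [[0.271890e-5, -0.957336e-5],
     [-0.957336e-5, 3.50371e-5]]

def f_statistic(d1, d2):
    """Approximate F value for the offset (d1, d2) from theta*."""
    quad = N[0][0] * d1 * d1 + 2.0 * N[0][1] * d1 * d2 + N[1][1] * d2 * d2
    return (n_obs - n_par) / (n_par * phi_star) * 0.5 * quad

scale = (n_obs - n_par) / (n_par * phi_star)
threshold = F_CRIT / scale     # bound on Phi - Phi* that defines the region

f_value = f_statistic(-186.5, -39.1)
print(scale, threshold, f_value, f_value <= F_CRIT)
```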
Comparison with Eq. (1) indicates that this region is bounded by the contour

$$\Phi - \Phi^* = 0.023332$$

According to Fig. 7-3, this contour is partly outside the region where Eq. (1) is a reliable approximation. The fact that the exact contour is inside the approximate contour suggests, however, that the latter should be a conservative estimate of the confidence region.

Finally, we examine the residuals, given in Table 7-1 and Fig. 7-4. A glance at the latter suggests that the residuals at T = 100 are considerably smaller than those at T = 200 and T = 300. An F-test on the ratio of the sum of squares of the last ten residuals to that of the first five residuals indicates a significant difference even at the 99.5% confidence level. We shall deal with this problem further in Chapter IX, although this small body of data probably does not merit further analysis.

Table 7-1
Residuals at θ* = (813.4583, 960.9063)^T

   μ    x_μ1 = t    x_μ2 = T    e_μ = y_μ - f(θ*, x_μ)
   1    0.1         100         -0.0145552
   2    0.2         100         -0.00613993
   3    0.3         100         -0.0287542
   4    0.4         100          0.000602186
   5    0.5         100          0.0199295
   6    0.05        200         -0.0906165
   7    0.1         200          0.0304608
   8    0.15        200          0.0869893
   9    0.2         200         -0.0387225
  10    0.25        200         -0.0219878
  11    0.02        300          0.0497515
  12    0.04        300          0.0504873
  13    0.06        300         -0.103587
  14    0.08        300         -0.0550289
  15    0.1         300          0.0293314

[Fig. 7-4. Residuals (least squares problem). Residuals plotted against t, with separate curves for T = 100, T = 200, and T = 300.]
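The sums of squares behind this comparison can be recomputed from Table 7-1. The sketch below groups the residuals by temperature and forms the ratio of mean squares between the last ten and the first five residuals; taking 10 and 5 as the divisors is an assumption, since the degrees of freedom used for the test are not spelled out here, and the comparison with a tabulated F value is left to the reader.

```python
# (t, T, residual) triples transcribed from Table 7-1
RESIDUALS = [
    (0.1, 100, -0.0145552), (0.2, 100, -0.00613993), (0.3, 100, -0.0287542),
    (0.4, 100, 0.000602186), (0.5, 100, 0.0199295),
    (0.05, 200, -0.0906165), (0.1, 200, 0.0304608), (0.15, 200, 0.0869893),
    (0.2, 200, -0.0387225), (0.25, 200, -0.0219878),
    (0.02, 300, 0.0497515), (0.04, 300, 0.0504873), (0.06, 300, -0.103587),
    (0.08, 300, -0.0550289), (0.1, 300, 0.0293314),
]

def sum_of_squares(rows):
    return sum(e * e for _, _, e in rows)

low = [row for row in RESIDUALS if row[1] == 100]    # first five residuals
high = [row for row in RESIDUALS if row[1] != 100]   # last ten residuals

ss_low, ss_high = sum_of_squares(low), sum_of_squares(high)
ss_total = ss_low + ss_high                          # should reproduce Phi*
ratio = (ss_high / len(high)) / (ss_low / len(low))  # ratio of mean squares
print(ss_low, ss_high, ss_total, ratio)
```

The total agrees with the residual sum of squares $\Phi^*$ quoted in Section 7-21 to the precision of the tabulated residuals.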
7-22. A Monte Carlo Study
To investigate the reliability of statistics obtained in the previous section, we used the simulation technique suggested in Section 3-3. We assumed that model Eq. (5-21-5) was correct, with $\theta_1 = \theta_2 = 1000$. We used this to compute $y_\mu$ for the fifteen data points, and added as "experimental error" a pseudorandom number drawn from an appropriate distribution. Six distributions were studied: normal and uniform distributions, each with $\sigma$ = 0.01, 0.03, 0.05. The estimation procedures were always carried out, however, as though the errors were thought to be normally distributed.

For each one of the six cases, 100 replications (samples) of fifteen observations were generated. The parameters were estimated in each one of these samples, and

$$\text{bias} = \theta^* - \theta$$

was calculated, as well as the estimated covariance matrix. These were averaged over all samples. We also calculated the actual covariance of the estimates around their means [Eq. (3-3-2)]. The results appear in Table 7-2.
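The procedure just described can be sketched in a few lines. Since model Eq. (5-21-5) is not reproduced in this section, the sketch substitutes a hypothetical one-parameter decay model $y = \exp(-\theta x)$ with made-up design points; only the structure of the study (replications, average bias, actual versus average estimated variance) follows the text. The seed and the function names are arbitrary choices.

```python
import math
import random

def fit_decay(xs, ys, theta0, iters=50):
    """Gauss-Newton least squares for the stand-in model y = exp(-theta * x)."""
    theta = theta0
    for _ in range(iters):
        f = [math.exp(-theta * x) for x in xs]
        g = [-x * fi for x, fi in zip(xs, f)]               # df/dtheta
        step = (sum(gi * (y - fi) for gi, y, fi in zip(g, ys, f))
                / sum(gi * gi for gi in g))
        theta += step
        if abs(step) < 1e-12:
            break
    f = [math.exp(-theta * x) for x in xs]
    rss = sum((y - fi) ** 2 for y, fi in zip(ys, f))
    sigma2 = rss / (len(xs) - 1)                            # one fitted parameter
    var_est = sigma2 / sum((x * fi) ** 2 for x, fi in zip(xs, f))
    return theta, var_est

def monte_carlo(theta_true=1.0, sigma=0.01, n_rep=100, seed=1):
    rng = random.Random(seed)
    xs = [0.1 * (i + 1) for i in range(15)]                 # fifteen design points
    exact = [math.exp(-theta_true * x) for x in xs]
    estimates, variances = [], []
    for _ in range(n_rep):
        ys = [y + rng.gauss(0.0, sigma) for y in exact]     # normal errors
        # Started at the true value for convenience in this sketch
        theta, var_est = fit_decay(xs, ys, theta_true)
        estimates.append(theta)
        variances.append(var_est)
    mean_est = sum(estimates) / n_rep
    bias = mean_est - theta_true
    actual_var = sum((t - mean_est) ** 2 for t in estimates) / (n_rep - 1)
    mean_var = sum(variances) / n_rep
    return bias, actual_var, mean_var

bias, actual_var, mean_var = monte_carlo()
print(bias, actual_var, mean_var)
```

Replacing `rng.gauss` with a uniform draw of the same standard deviation would give the analogue of the robustness comparison.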
The following conclusions can be drawn from the table:

1. The average bias in all cases is small compared to the standard deviations of the estimates.
2. The estimated covariance matrix is, on the average, an acceptable estimate of the true covariance, particularly at small values of experimental error. Even at $\sigma = 0.05$, the estimates are not unreasonable, particularly when one takes square roots to obtain the standard deviations of the estimates.
3. The estimates are reasonably robust, at least as far as the difference between normal and uniform distributions is concerned.
4. The supposition that for this model the estimated variance is conservative (too large) is confirmed.
5. The standard deviations of the residuals (corrected for bias) provide, on the average, excellent estimates of the experimental error.

Table 7-2
Monte Carlo Study of Single-Equation Least Squares Model (a)

                                     σ = 0.01           σ = 0.03           σ = 0.05
                                     θ1       θ2        θ1       θ2        θ1       θ2
Normal distribution
  Average bias                       6.10     0.79      29.22    2.67      66.68    5.05
  True covariance, Eq. (3-3-2)       3264     751       28167    6468      79169    17566
                                     751      179       6468     1560      17566    4193
  Average estimated covariance,      3315     725       32752    6766      105175   19828
    Eq. (7-5-17)                     725      165       6766     1487      19828    4131
  Standard deviation of residuals    0.01002497         0.03008677         0.05016445
Uniform distribution
  Average bias                       -3.07    -1.01     0.83     -2.67     17.96    -3.82
  True covariance, Eq. (3-3-2)       2743     622       23911    5447      67208    15061
                                     622      146       5447     1300      15061    3597
  Average estimated covariance,      3328     735       31396    6695      96441    19173
    Eq. (7-5-17)                     735      169       6695     1514      19173    4191
  Standard deviation of residuals    0.01015331         0.03046174         0.05076649

(a) Averages are over 100 replications.

While these conclusions hold for the average of many replications, the results for individual replications vary quite sharply. In fact, the specific problem that we have solved in Sections 5-21 and 7-21 is one of the replications (with $y_\mu$ rounded to three decimal places) of the normal distribution with $\sigma = 0.05$. The bias on this particular replication is

$$\begin{bmatrix} 813.4583 - 1000 \\ 960.9063 - 1000 \end{bmatrix} = \begin{bmatrix} -186.5 \\ -39.1 \end{bmatrix}$$

The true value $\theta = (1000, 1000)^T$ is marked on Fig. 7-3. It lies just within the region of good approximation, and corresponds to an $F_{2,13}$ value given by Eq. (7-21-6)

$$F_{2,13} = 163.292 \times \tfrac{1}{2}\,10^{-5}\,(0.271890 \times 186.5^2 - 1.914672 \times 186.5 \times 39.1 + 3.50371 \times 39.1^2) = 0.695$$
Though this value is far from excessive, the bias in this replication is much larger than the average bias of (66.68, 5.05) in Table 7-2. On the other hand, the covariance estimate Eq. (7-21-5) is quite a lot closer to the true covariance than is the average estimate of Table 7-2. In some other replications this estimate is much worse. In one case, for instance (still with $\sigma = 0.05$), the estimated covariance is

$$V_\theta = \begin{bmatrix} 149984 & 33191.4 \\ 33191.4 & 7616.09 \end{bmatrix}$$

which is off approximately by a factor of two (still not very significant in an F-test). Oddly enough, this replication yielded the almost unbiased estimate

$$\theta^* = \begin{bmatrix} 1000.512 \\ 1000.942 \end{bmatrix}$$

We conclude that in this particular problem our estimates for $\theta$, $V_\theta$, and the confidence regions are quite reasonable.
7-23. Independent Variables Subject to Error
We shall now interpret the estimates obtained in Section 6-13 for the same model, but with all variables subject to error. According to Eq. (7-6-4) and using Eq. (6-13-5), we have

$$V_\theta \approx D^{-1} = \begin{bmatrix} 93021.94 & 16298.55 \\ 16298.55 & 2912.917 \end{bmatrix}$$

This is not very different from what we found previously in Section 7-21 under different assumptions. We shall try to determine whether the assumptions underlying the estimate of Section 6-13 are validated by the data. From the matrix $M$ [Eq. (6-13-6)] we obtain, using Eq. (7-14-10), the following estimate for the covariance matrix of the residuals

$$[3/1(15 - 2/1)]\,M = (3/13)\,M = \begin{bmatrix} 0.000411 & 0.00224 & 0.000452 \\ 0.00224 & 0.0162 & 0.00278 \\ 0.000452 & 0.00278 & 0.000957 \end{bmatrix}$$

On the other hand, to obtain our estimate we had assumed an error covariance of

$$V = \begin{bmatrix} 0.0001 & 0 & \cdots \\ 0 & 0.25 & \cdots \\ 0 & 0 & \cdots \end{bmatrix}$$

The quantities 13 × 0.000411/0.0001 and 13 × 0.0162/0.25 should both be samples from a $\chi^2$ distribution with 13 degrees of freedom. A glance at the tables shows the first to be much too large, the second too small to be acceptable; the odds for rejecting each are greater than 99:1. Even summing the two quantities (this is equivalent to evaluating $\mathrm{Tr}(V^{-1} V^*)$ as in Section 7-14), we obtain a number too large for $\chi^2_{26}$. Our assumed covariance matrix is contradicted by the data.
7-24. Two-Equation Maximum Likelihood Problem

We now treat the estimates obtained in Section 5-23. Let us examine case (a), unknown $V$. The estimate $\theta^*$ is given in the first row of Table 5-8, and the corresponding value $c^*$ is found in Table 5-9. Since all calculations were performed in terms of $\theta$, the inverse approximate Hessian with respect to $\theta$ is found to be

V_θ ≈ (N*)^{-1} =

  [ 0.834966 - 1    ...     0.121944 - 2    -0.761164 - 1    -0.269914 - 1 ]
  [                 ...    -0.360968 - 4     0.517330 - 4     0.181755 - 4 ]
  [                         0.122197 - 2    -0.704885 - 3     0.47651 - 3  ]
  [   symmetric                              0.698556 - 1     0.46071 - 1  ]
  [                                                           0.875678 - 2 ]

(The notation 0.834966 - 1 is used to represent 0.834966 × 10^-1.) We write below the values of $\theta_\alpha^*$ along with their standard deviations, the latter being the square roots of the diagonal elements of $V_\theta$

  θ* = [ -0.0758463 ± 0.288958  ]
       [ -0.0115747 ± 0.0011026 ]
       [  0.790686  ± 0.034957  ]
       [  1.00224   ± 0.26430   ]
       [  0.859255  ± 0.093578  ]

We are interested, however, in $c$, not in $\theta$. According to Section 7-20 we need therefore to calculate the matrix

$$J^* = [\partial c/\partial\theta]_{\theta=\theta^*}$$

This can be readily obtained from Eq. (5-23-8) by differentiation
Ii =
;: I '. .0 6905560
I :f .
:j i 0
'X;; 0
Ii";::.::,;
o
- O. 5492625
o
o
o
0.4 178637
-0.008040566
- 1.599525
o
o
- 0.4263207
o
o
-0.9955402
-0.0368951
-Of3583' ]
-0.5681071
Following Eq. (7-20-2) we now compute

V_c = J* V_θ J*^T =

  [ 0.543786 - 3   -0.412978 - 6    0.221255 - 3    0.564779 - 2    0.137911 - 2 ]
  [                 0.126945 - 6   -0.159973 - 4    0.226459 - 4    0.537960 - 5 ]
  [                                 0.312639 - 2    0.112245 - 2    0.266644 - 3 ]
  [   symmetric                                     0.692338 - 1    0.164833 - 1 ]
  [                                                                 0.395299 - 2 ]

and the estimate $c^*$ is represented as

  c* = [ 0.5460134   ± 0.0233192   ]
       [ 0.006357569 ± 0.000356293 ]
       [ 1.264724    ± 0.0559141   ]
       [ 0.9977676   ± 0.263123    ]
       [ 0.8592555   ± 0.0628728   ]

All the parameters are fairly well-determined, with $c_4$ less so than the others. The residuals corresponding to our estimate are listed in Table 7-3.
Table 7-3
Residuals e_μ* = y_μ - f(x_μ, θ*) for Case (a)

   μ     e_μ1        e_μ2          μ     e_μ1        e_μ2
   1    -0.22944    -0.01629      22     0.42388     0.04043
   2    -0.18880     0.00559      23     0.24983     0.02265
   3    -0.19394    -0.01330      24     0.37242    -0.00994
   4    -0.17473    -0.00069      25     0.24696     0.01022
   5    -0.19199    -0.01578      26     0.16855    -0.00204
   6    -0.21667    -0.01114      27     0.11696    -0.00205
   7    -0.10269     0.00038      28     0.07203    -0.01195
   8    -0.05086    -0.00752      29     0.08727     0.02173
   9     0.00012     0.01701      30     0.02814     0.00542
  10    -0.13722     0.00483      31     0.01613     0.00927
  11    -0.06465    -0.02522      32     0.00542     0.02753
  12    -0.00414     0.03791      33     0.05353    -0.04208
  13    -0.01195    -0.03430      34     0.07066    -0.00772
  14    -0.08990    -0.01499      35    -0.01496    -0.00765
  15    -0.02357    -0.00057      36    -0.17103    -0.04543
  16    -0.02433    -0.00699      37    -0.26740    -0.04499
  17     0.00547    -0.01582      38    -0.19278     0.01363
  18     0.06089     0.01065      39    -0.04582     0.03801
  19     0.10571     0.02486      40    -0.00165     0.03415
  20     0.23525     0.02979      41    -0.00726     0.01637
  21     0.25371     0.03832
Their moment matrix is

$$M^* = \begin{bmatrix} 1.066369 & 0.06834212 \\ 0.06834212 & 0.02096695 \end{bmatrix}$$

and the estimated covariance matrix of the errors is

$$V^* = [1/(41 - 5/2)]\,M^* = \begin{bmatrix} 0.0276979 & 0.00177512 \\ 0.00177512 & 0.000544596 \end{bmatrix}$$

corresponding to standard deviations of 0.166427 and 0.0233366 of the $y_1$ and $y_2$ errors, respectively.

We do not know enough about econometrics to decide whether errors of this magnitude are reasonable, and whether they can be ascribed to measurement errors alone. A glance at Table 7-3, however, reveals at once that at least the $y_1$ residuals are not random. They have been plotted in Fig. 7-5. It appears that Eq. (5-23-2) fails to account for certain strong variations of $z_1$ with respect to time. The equation for $z_2$ seems somewhat more satisfactory. However, there are 21 negative residuals, 20 positive ones (both numbers exceed 10) and 16 runs. From Eqs. (7-17-1) and (7-17-2), and the corresponding normal deviate,

$$\mu = 2 \times 20 \times 21/(20 + 21) + 1 = 21.488 = \text{expected number of runs if residuals were random}$$

$$\sigma^2 = 2 \times 20 \times 21\,(2 \times 20 \times 21 - 20 - 21)/[(20 + 21)^2 (20 + 21 - 1)] = 9.982$$

$$z = (16 - 21.488 + 0.5)/(9.982)^{1/2} = -1.579$$

where $z$ is approximately a standard normal deviate. The probability of finding 16 or fewer runs is approximately $F(z = -1.579)$, which according to tables of the normal distribution is only about 6%. Hence there is strong, though not conclusive evidence to indicate that the $z_2$ residuals are also not random.

[Fig. 7-5. First equation residuals, production model. Residuals plotted against time, with axis marks at 1909, 1919, 1929, and 1939.]
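The runs count and the normal deviate can be reproduced from the second residual column of Table 7-3. The sketch below counts sign runs and applies the formulas used above; the normal probability is obtained from the error function instead of tables, and the function name `runs_test` is hypothetical.

```python
import math

# Second-equation residuals e_mu2 transcribed from Table 7-3 (mu = 1, ..., 41)
E2 = [
    -0.01629, 0.00559, -0.01330, -0.00069, -0.01578, -0.01114, 0.00038,
    -0.00752, 0.01701, 0.00483, -0.02522, 0.03791, -0.03430, -0.01499,
    -0.00057, -0.00699, -0.01582, 0.01065, 0.02486, 0.02979, 0.03832,
    0.04043, 0.02265, -0.00994, 0.01022, -0.00204, -0.00205, -0.01195,
    0.02173, 0.00542, 0.00927, 0.02753, -0.04208, -0.00772, -0.00765,
    -0.04543, -0.04499, 0.01363, 0.03801, 0.03415, 0.01637,
]

def runs_test(residuals):
    """Sign-runs test with the continuity correction used in the text."""
    signs = [r > 0 for r in residuals]
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n_pos + n_neg
    mean = 2.0 * n_pos * n_neg / n + 1.0
    var = 2.0 * n_pos * n_neg * (2.0 * n_pos * n_neg - n) / (n * n * (n - 1.0))
    z = (runs - mean + 0.5) / math.sqrt(var)
    prob = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # P(Z <= z), standard normal
    return n_pos, n_neg, runs, z, prob

n_pos, n_neg, runs, z, prob = runs_test(E2)
print(n_pos, n_neg, runs, z, prob)
```

The sum of squares of this column also reproduces the lower-right element of the moment matrix, which is a useful check on the transcription.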
In case (b) we assumed that $V$ is diagonal. The estimated covariance of the residuals was

$$V^* = \begin{bmatrix} 0.0247422 & 0.00131260 \\ 0.00131260 & 0.000571524 \end{bmatrix}$$

The correlation between the residuals of $z_1$ and $z_2$ is

$$r_{12} = 0.00131260/(0.0247422 \times 0.000571524)^{1/2} = 0.349$$

Letting $n^* = 41 - 5/2 = 38.5$, we have from Eq. (7-16-10)

$$\lambda = (36.5)^{1/2}\,0.349/(1 - 0.349^2)^{1/2} = 2.31$$

According to tables of the t-distribution, the chance of encountering a value of $|\lambda|$ as large or larger than 2.31 with 36.5 degrees of freedom is only about 3%. We reject therefore the hypothesis that $V$ is diagonal.

In case (c) we assumed $V$ proportional to $Q = \mathrm{diag}(4, 1)$. The covariance of the residuals turns out to be

$$V^* = \begin{bmatrix} 0.0156715 & 0.000679944 \\ 0.000679944 & 0.00120482 \end{bmatrix}$$

The correlation $r_{12} = 0.157$ leads to $\lambda = 0.943$, which is not incompatible with the supposition that $V_{12} = 0$. On the other hand, if $V_{11}$ is an estimate of a variance four times as large as the variance estimated by $V_{22}$ (each based on 38.5 degrees of freedom), then

$$\lambda = 0.0156715/(4 \times 0.00120482) = 3.25$$

would be an $F_{38.5,\,38.5}$ variate. The probability of encountering such a value is less than 0.5%, so hypothesis (c) stands refuted.

In case (d) we made no assumptions concerning the value of $V$. The residuals of $\log z_1$, however, behave no better than those of $z_1$ (Fig. 7-5). The same is true of the residuals in cases (e) and (f). At this time, on the basis of the data alone, we have no reason to prefer any of the models (a), (d), (e), and (f), and none of them account sufficiently for variations in $z_1$.
7-25. Problems

1. Verify Eqs. (7-12-6)-(7-12-7).

2. Show that Eq. (7-5-16) holds when the covariance matrix is proportional to a known matrix $Q$ [see Eq. (4-21-2) and row 2 of Table 5-1].

3. Suppose $P$ in Problem 8 of Section 4-21 is an unknown matrix. Show that its MLE is given by

$$(1/n)\,M(\theta^*) - (B^T Q^{-1} B)^{-1}$$

where

$$M(\theta^*) = \sum_{\mu=1}^{n} [y_\mu - f(\theta^*, x_\mu)][y_\mu - f(\theta^*, x_\mu)]^T$$

Derive an expression for $P$ applicable to Problem 9 of Section 4-21.

4. Using the Monte Carlo technique, investigate the robustness of the test for correlation Eq. (7-16-10) for nonnormal distributions.

5. Suppose observations $y_\mu$, $x_\mu$ ($\mu = 1, 2, \ldots, n$) are to be fitted by the model

$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots \tag{7-25-1}$$

Let the error distribution be such as to justify estimation of $\theta$ by least squares. Suppose Eq. (1) is to be used for predicting $y$ at given values of $x$. Show that the prediction error variance is minimum at the centroid of the observations used for estimating $\theta$, i.e., at $x = \sum_{\mu=1}^{n} x_\mu/n$.
Chapter VIII
Dynamic Models
8-1. Models Involving Differential Equations
Models are often formulated in terms of differential equations. That is, the model equations contain not only dependent and independent variables, but also derivatives of the former with respect to the latter. The model equations thus take the form

$$g(x, y, \partial y/\partial x, \partial^2 y/\partial x\,\partial x, \ldots, \theta) = 0 \tag{8-1-1}$$

When experiments are conducted, we measure values of $y$ for given values of $x$, but we do not usually directly measure the values of the derivatives. Hence the model equations cannot be used directly for the estimation of the parameters $\theta$. However, this difficulty may be overcome in one of the following ways:
(a) Differentiation of Data. Approximate values of the derivatives appearing in the equation can be calculated by differencing adjacent data values. If $x_\mu$ and $x_{\mu+1}$ are neighboring points differing only in the $i$th coordinate, then $(y_{\mu+1} - y_\mu)/(x_{\mu+1,i} - x_{\mu,i})$ is an approximation to $\partial y/\partial x_i$ in that region. Even though more accurate approximations are available in some cases, the maximum accuracy attainable with this method is severely limited, and its errors difficult to assess. The main advantage of this method (in those cases where it is feasible) is that the estimation is performed using Eqs. (1) directly, and these are usually much simpler than the integrated equations. Therefore, the computation can usually be performed much faster than in the method to be described next. We feel, however, that this advantage does not outweigh the disadvantage of limited accuracy. The method cannot be used at all if the separation between data points is large, but may be useful when experiments are specially planned for it, e.g., by the use of differential reactors.
(b) Integration of Equations. In principle, the differential Eqs. (1) may be integrated to yield expressions of the form

$$y = f(x, \theta) \tag{8-1-2}$$

which is identical to Eq. (2-4-2). Therefore, all standard estimation methods may be applied. If Eq. (1) can be solved analytically in closed form, we end up with explicit formulas for the functions $f$ in Eq. (2), and their origin as solutions of differential equations need no longer concern us. The problem we shall deal with in the succeeding sections is that of estimating $\theta$ when Eq. (1) must be integrated numerically, so that the functions $f$ are only implicitly defined.

A special problem associated with method (b) is that of the initial or boundary values required for integrating the differential equations. These are frequently defined by the experimental conditions, in which case no further problem exists. When these conditions are entirely or partly unknown, they must be included in the problem as additional unknown parameters.
(c) Integration of Data. It is sometimes possible to integrate out all the derivatives appearing in the differential equations. The differential equations are, thereby, transformed into integral equations. If our observations cover the region of interest densely, we may integrate the data numerically to obtain the values of the integrals appearing in the equations, which can now be regarded as algebraic equations in $\theta$.

This method, like (a), requires dense data, and gives rise to unassessable errors. In addition, it is applicable only in a limited number of cases. Its advantage over (a) is that numerical integration is generally more accurate than differentiation. Like (a), it is computationally faster than (b).

On the whole, we recommend method (b) whenever the required computations are not beyond the capability of available machinery. Method (a) or (c) may be used to obtain an initial guess for (b).
We illustrate these methods by means of a simple example, that of a system (e.g., a radioactive material) undergoing first-order decay. Here we have

$$dy/dx + \theta_1 y = 0 \tag{8-1-3}$$

as the model equation. Upon integration, this becomes

$$y = y_0 \exp(-\theta_1 x) \tag{8-1-4}$$

We have measured values $y_\mu$ at an ascending sequence of values $x_\mu$ ($\mu = 1, 2, \ldots, n$). If the initial value $y_0$ is known, we may use our data directly, in conjunction with Eq. (4), to estimate $\theta_1$. If $y_0$ is unknown, we treat it as a parameter $\theta_2$ and use

$$y = \theta_2 \exp(-\theta_1 x) \tag{8-1-5}$$

to estimate both $\theta_1$ and $\theta_2$. This constitutes method (b). Here we were able to integrate the equations analytically. The method applies equally well when the equations must be solved numerically.
To apply method (a), we could define

$$z_\mu \equiv (y_{\mu+1} - y_{\mu-1})/(x_{\mu+1} - x_{\mu-1}) \qquad (\mu = 2, 3, \ldots, n - 1)$$

as an approximation to $dy/dx$ at $x = x_\mu$. We then use

$$z + \theta_1 y = 0 \tag{8-1-6}$$

as the model equation, from which we estimate $\theta_1$, by minimizing, say

$$\sum_{\mu=2}^{n-1} (z_\mu + \theta_1 y_\mu)^2$$
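A minimal sketch of method (a) for the decay example follows. Minimizing the sum of squares above with respect to $\theta_1$ gives the closed form $\theta_1 = -\sum z_\mu y_\mu / \sum y_\mu^2$, which the code evaluates on synthetic, noise-free data; the decay rate 1.5, the initial value 2, the grid, and the function name are assumptions made for the illustration.

```python
import math

def estimate_by_differencing(xs, ys):
    """Method (a): central differences for dy/dx, then least squares on z + theta*y = 0."""
    num = den = 0.0
    for m in range(1, len(xs) - 1):
        z = (ys[m + 1] - ys[m - 1]) / (xs[m + 1] - xs[m - 1])
        num += z * ys[m]
        den += ys[m] * ys[m]
    return -num / den

# Synthetic, noise-free data from y = 2 * exp(-1.5 * x) on a dense grid (assumed values)
xs = [0.01 * i for i in range(201)]
ys = [2.0 * math.exp(-1.5 * x) for x in xs]

theta_a = estimate_by_differencing(xs, ys)
print(theta_a)
```

With noisy or sparse data the differencing step amplifies the errors, which is the limitation the text warns about.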
To apply method (c), we follow Himmelblau et al. (1967) and integrate Eq. (3) from $x = 0$ to $x = x_\mu$

$$y_\mu - y_0 + \theta_1 \int_0^{x_\mu} y(x)\,dx = 0 \tag{8-1-7}$$

If we have measured sufficiently many values of $y$ between $x = 0$ and $x = x_\mu$, then we can obtain an approximate value $I_\mu$ of the integral in Eq. (7), say by using the trapezoidal rule

$$I_\mu = \tfrac{1}{2} \sum_{\eta=1}^{\mu} (y_\eta + y_{\eta-1})(x_\eta - x_{\eta-1}) \tag{8-1-8}$$

Then $\theta_1$ may be estimated from the linear model

$$y_\mu - y_0 + \theta_1 I_\mu = 0 \tag{8-1-9}$$

say, by least squares.
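Method (c) for the same example can be sketched as follows. The running trapezoidal sums play the role of $I_\mu$ in Eq. (8-1-8), and least squares on Eq. (8-1-9) gives $\theta_1 = -\sum I_\mu (y_\mu - y_0) / \sum I_\mu^2$. The synthetic data and the function name are assumptions; the first data point is taken to be $x = 0$ with $y = y_0$.

```python
import math

def estimate_by_integration(xs, ys):
    """Method (c): trapezoidal integrals of the data, then least squares on Eq. (8-1-9)."""
    y0 = ys[0]                     # assumes xs[0] == 0, so ys[0] is the initial value
    integral = 0.0
    num = den = 0.0
    for m in range(1, len(xs)):
        integral += 0.5 * (ys[m] + ys[m - 1]) * (xs[m] - xs[m - 1])
        num += integral * (ys[m] - y0)
        den += integral * integral
    return -num / den

# Synthetic, noise-free data from y = 2 * exp(-1.5 * x) on a dense grid (assumed values)
xs = [0.01 * i for i in range(201)]
ys = [2.0 * math.exp(-1.5 * x) for x in xs]

theta_c = estimate_by_integration(xs, ys)
print(theta_c)
```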
In an alternative data integration method, due to Shinbrot (1954), we multiply Eq. (3) by $\sin \alpha x$, and integrate the result over the range of $x$ values, say from $x = 0$ to $x = A$, using the integration by parts technique

$$0 = \int_0^A (dy/dx) \sin \alpha x\,dx + \theta_1 \int_0^A y \sin \alpha x\,dx$$
$$= [\,y \sin \alpha x\,]_0^A - \alpha \int_0^A y \cos \alpha x\,dx + \theta_1 \int_0^A y \sin \alpha x\,dx$$
$$= y(A) \sin \alpha A + \int_0^A (\theta_1 \sin \alpha x - \alpha \cos \alpha x)\,y\,dx \tag{8-1-10}$$

If we choose $\alpha = k\pi/A$, where $k$ is any integer, then $y(A) \sin \alpha A$ vanishes. Hence we have

$$\theta_1 \int_0^A y \sin(k\pi x/A)\,dx = (k\pi/A) \int_0^A y \cos(k\pi x/A)\,dx \tag{8-1-11}$$

If $y$ is known at a sufficiently dense set of points, then we can integrate both sides of Eq. (11) numerically for various values of $k$. This gives us several equations for the unknown $\theta_1$, and we may choose that value of $\theta_1$ which satisfies these equations in the least squares sense.
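The Shinbrot procedure can be sketched in the same way. For each integer $k$, both integrals in Eq. (8-1-11) are evaluated by the trapezoidal rule, giving one equation $\theta_1 s_k = c_k$; the least squares solution over several $k$ is $\theta_1 = \sum s_k c_k / \sum s_k^2$. The number of modes, the synthetic data, and the function names are assumptions made for the illustration.

```python
import math

def trapz(xs, vals):
    """Trapezoidal rule on a possibly uneven grid."""
    return sum(0.5 * (vals[i] + vals[i - 1]) * (xs[i] - xs[i - 1])
               for i in range(1, len(xs)))

def estimate_shinbrot(xs, ys, n_modes=3):
    """Least squares solution of Eq. (8-1-11) over k = 1, ..., n_modes."""
    span = xs[-1] - xs[0]          # A, assuming the data start at x = 0
    num = den = 0.0
    for k in range(1, n_modes + 1):
        alpha = k * math.pi / span
        s_k = trapz(xs, [y * math.sin(alpha * x) for x, y in zip(xs, ys)])
        c_k = alpha * trapz(xs, [y * math.cos(alpha * x) for x, y in zip(xs, ys)])
        num += s_k * c_k
        den += s_k * s_k
    return num / den

# Synthetic, noise-free data from y = 2 * exp(-1.5 * x) on [0, 2] (assumed values)
xs = [2.0 * i / 400 for i in range(401)]
ys = [2.0 * math.exp(-1.5 * x) for x in xs]

theta_s = estimate_shinbrot(xs, ys)
print(theta_s)
```

No derivative of the data is ever formed, which is the point of multiplying by $\sin \alpha x$ and integrating by parts.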
By choosing appropriate multiplier functions we can apply this method to problems involving higher derivatives, as well as to models involving partial differential equations [see Perdreauville and Goodson (1966)].
8-2. The Standard Dynamic Model
We do not propose to treat models represented by Eq. (8-1-1) in complete generality. Rather, we restrict ourselves to a subclass of models which are particularly tractable, yet at the same time extremely important in practice. These are the so-called standard dynamic models, which we define below by listing the variables included and the relations among them:

(a) A vector of independent variables $x$.
(b) An additional independent variable $t$, usually referred to as time, although it need not represent the actual physical dimensions of time.
(c) A vector of unknown parameters $\theta$.
(d) A vector of state variables $s$, which are functions of $t$, $x$, and $\theta$. The functions are defined implicitly by means of

1. A set of simultaneous first-order ordinary differential equations

$$\dot{s} \equiv ds/dt = h(t, x, s, \theta) \tag{8-2-1}$$

where $h$ is a vector of given functions, and

2. A set of initial conditions

$$s(0) \equiv s|_{t=0} = s_0(x, \theta) \tag{8-2-2}$$

where $s_0$ is a vector of given functions. Note that Eq. (2) includes the possibilities that some or all $s(0)$ are given numbers (which are independent variables), or that they are themselves unknown parameters.

(e) A vector of observed variables $y$, whose exact values are given functions of the state variables, and possibly of the other variables as well

$$y = y(s, t, x, \theta) \tag{8-2-3}$$

A common special case is that in which the state variables are observed directly, i.e., $y = s$. Note that of the set of parameters making up the vector $\theta$, some may appear explicitly only in Eq. (1), others only in Eq. (2) or Eq. (3).
By solving (numerically, if necessary) the differential Eqs. (1) with the initial conditions Eq. (2), and substituting these solutions for $s$ in Eq. (3), we bring Eq. (3) into the form Eq. (8-1-2), with $x$ and $t$ jointly playing the role of $x$ in the latter expression. Hence, the model we have defined conforms to our general form, though in a somewhat roundabout fashion.
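The reduction to the form of Eq. (8-1-2) can be sketched as a function that integrates the state equations and returns the observed variable. The classical fourth-order Runge-Kutta scheme is used here only as a convenient stand-in for whatever integrator is available; the first-order decay system of Section 8-1 (with $\theta_1$ the rate, $\theta_2$ the initial value, and $y = s$) serves as the example, and the step count and function names are assumptions.

```python
import math

def rk4(h_func, s0, t_end, n_steps, x, theta):
    """Integrate ds/dt = h(t, x, s, theta) from t = 0 to t_end with classical RK4."""
    dt = t_end / n_steps
    t, s = 0.0, list(s0)
    for _ in range(n_steps):
        k1 = h_func(t, x, s, theta)
        k2 = h_func(t + dt / 2, x, [si + dt / 2 * ki for si, ki in zip(s, k1)], theta)
        k3 = h_func(t + dt / 2, x, [si + dt / 2 * ki for si, ki in zip(s, k2)], theta)
        k4 = h_func(t + dt, x, [si + dt * ki for si, ki in zip(s, k3)], theta)
        s = [si + dt / 6 * (a + 2 * b + 2 * c + d)
             for si, a, b, c, d in zip(s, k1, k2, k3, k4)]
        t += dt
    return s

def decay_rhs(t, x, s, theta):
    # State equation in the form of Eq. (8-2-1) for first-order decay
    return [-theta[0] * s[0]]

def model_response(t, x, theta, n_steps=100):
    """y = f(x, t, theta): integrate the state equations, then observe y = s."""
    s0 = [theta[1]]                      # initial condition, Eq. (8-2-2): s(0) = theta2
    s = rk4(decay_rhs, s0, t, n_steps, x, theta)
    return s[0]                          # observation equation, Eq. (8-2-3): y = s

y_num = model_response(2.0, None, (1.5, 2.0))
y_exact = 2.0 * math.exp(-1.5 * 2.0)
print(y_num, y_exact)
```

An estimation routine would call `model_response` wherever it would otherwise evaluate an explicit formula for $f$.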
In essence, a dynamic system is characterized by a set of state variables which change with time (or some other independent variable) according to certain first-order differential equations. The initial conditions may or may not be fully known. The state of the system is observed at various points in time, but sometimes the state variables are not directly measurable, and we have to measure the related observed variables instead. Unknown parameters may appear in the initial conditions Eq. (2), in the differential equations (1), and in the observation equations (3). In the last case they usually represent unknown characteristics of the measuring devices, e.g., calibration constants. Our main interest usually lies in estimating the parameters that appear in the differential equations, but we cannot escape estimating the others as well. Fortunately, good initial estimates for these are frequently available. Any inexact knowledge we have concerning the values of these parameters should be included in the form of a prior distribution.
We illustrate the concept of a dynamic system by means of a chemical
reaction involving three species whose concentrations $c_1$, $c_2$, $c_3$ satisfy the
following differential equations:
$$
\begin{aligned}
dc_1/dt &= -k_1 c_1^2 + k_2 c_2 c_3 \\
dc_2/dt &= k_1 c_1^2 - k_2 c_2 c_3 - k_3 c_2 \\
dc_3/dt &= k_1 c_1^2 - k_2 c_2 c_3 + k_3 c_2
\end{aligned}
\tag{8-2-4}
$$
The initial concentrations $c_2$ and $c_3$ are not known exactly, but all concentrations
must add up to unity, so that we may write
$$c_1(0) = \alpha, \qquad c_2(0) = \beta, \qquad c_3(0) = 1 - \alpha - \beta \tag{8-2-5}$$
where $\alpha$ and $\beta$ are respectively a known and unknown quantity.
At time $t$ we withdraw three samples from our reactor. In two of these we
determine $c_1$ directly by titration. The third sample is passed through an
optical instrument, which measures the light absorptivity of the mixture. This
is believed to be a linear function with unknown coefficients of $c_1$ and $c_2$.
Denoting the results of the measurements on the three samples as $y_1$, $y_2$,
and $y_3$, we may write
$$\hat{y}_1 = c_1, \qquad \hat{y}_2 = c_1, \qquad \hat{y}_3 = p + q c_1 + r c_2 \tag{8-2-6}$$
where $p$, $q$, and $r$ are unknown quantities.
In this model, $c_1$, $c_2$, and $c_3$ are the state variables; $y_1$, $y_2$, and $y_3$ are the
observed variables; $\alpha$ and $t$ are the independent variables; $\beta$, $k_1$, $k_2$, $k_3$, $p$, $q$,
and $r$ are the unknown parameters. Eq. (4), Eq. (5), and Eq. (6) correspond to
Eq. (1), Eq. (2), and Eq. (3), respectively. We are primarily concerned with
estimating the reaction rate constants $k_1$, $k_2$, and $k_3$. Good initial guesses
for $\beta$ may be known from the manner in which the solution was made up, and
for $p$, $q$, and $r$ from previous experiments on the same apparatus.
An experiment performed on a dynamic system consists of measuring the
values of the observed variables $y$ for given values of the independent variables
$x$ and $t$. A group of experiments performed with identical values of $x$ and
identical initial conditions, and differing only in the values of $t$, constitute a
run. For our purposes it does not matter whether all the experiments in a
given run were actually performed as part of a physical run, or whether the
apparatus was reset to the same conditions on separate occasions. If several
runs each have unknown initial conditions, these constitute separate unknown
parameters. In the above example, we may have unknown parameters $\beta_1$,
$\beta_2$, ... corresponding to distinct runs.
The covariance between the errors in different experiments may, however,
depend on whether or not the experiments belong to the same physical run
(see Problem 4 in Section 8-9).
8-3. Models Reducible to Standard Form
Our definition of a standard dynamic model is not as restrictive as might
appear at first glance. Many problems not originally in this form may be
recast to fit the definition. We show how this can be done in several cases.
(a) Suppose a model corresponds to our definition in all respects, except that
it contains some second- or higher-order derivatives. If we can rearrange the
differential equations in such a way that we have explicit equations for the
highest-order derivative of each variable, then we can reformulate the model
using the method illustrated by the following example.
Let a model be defined by means of the following two differential equations
in the variables $z_1$ and $z_2$
$$
\begin{aligned}
\log \frac{d^2 z_1}{dt^2} + \theta_1 \frac{d^2 z_2}{dt^2} + \theta_2 \left(\frac{dz_1}{dt}\right)^3 + \theta_3 \frac{dz_2}{dt} + z_1^2 &= 0 \\
\frac{d^3 z_2}{dt^3} + \left(\frac{d^2 z_1}{dt^2}\right)^2 + \theta_4 \sin z_1 z_2 &= 0
\end{aligned}
\tag{8-3-1}
$$
The highest-order derivatives of each variable are $d^2 z_1/dt^2$ and $d^3 z_2/dt^3$.
We may solve for these:
$$
\begin{aligned}
\frac{d^2 z_1}{dt^2} &= \exp\left\{-\left[\theta_1 \frac{d^2 z_2}{dt^2} + \theta_2 \left(\frac{dz_1}{dt}\right)^3 + \theta_3 \frac{dz_2}{dt} + z_1^2\right]\right\} \\
\frac{d^3 z_2}{dt^3} &= -\exp\left\{-2\left[\theta_1 \frac{d^2 z_2}{dt^2} + \theta_2 \left(\frac{dz_1}{dt}\right)^3 + \theta_3 \frac{dz_2}{dt} + z_1^2\right]\right\} - \theta_4 \sin z_1 z_2
\end{aligned}
\tag{8-3-2}
$$
Let us introduce the following state variables
$$s_1 \equiv z_1, \quad s_2 \equiv dz_1/dt, \quad s_3 \equiv z_2, \quad s_4 \equiv dz_2/dt, \quad s_5 \equiv d^2 z_2/dt^2$$
whereupon Eq. (2) are equivalent to:
$$
\begin{aligned}
\dot{s}_1 &= s_2, \qquad \dot{s}_2 = \exp[-(\theta_1 s_5 + \theta_2 s_2^3 + \theta_3 s_4 + s_1^2)], \qquad \dot{s}_3 = s_4, \\
\dot{s}_4 &= s_5, \qquad \dot{s}_5 = -\exp[-2(\theta_1 s_5 + \theta_2 s_2^3 + \theta_3 s_4 + s_1^2)] - \theta_4 \sin s_1 s_3
\end{aligned}
\tag{8-3-3}
$$
which is in the desired form [Eq. (8-2-1)]. Initial conditions on $z_1$, $z_2$, and their
derivatives are immediately translatable into conditions on the state variables.
One need not be able to solve the differential equations explicitly for the
highest-order derivatives. It is sufficient that one have a numerical procedure
for computing these derivatives if the values of all other quantities appearing in
the equations are given.
We note that most computer programs for the numerical integration of
ordinary differential equations require that the problem be formulated as a
system of first-order equations.
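The reduction in (a) can be checked mechanically. The following is a minimal Python sketch, not part of the original text: the function names `rhs` and `residuals_of_original` are hypothetical, the numerical values are arbitrary, and $\sin z_1 z_2$ is assumed to mean the sine of the product $z_1 z_2$. It evaluates the right-hand side of Eq. (8-3-3) and substitutes the result back into Eq. (8-3-1), where both left-hand sides should vanish.

```python
import math

def rhs(s, theta):
    """Right-hand side of the first-order system, Eq. (8-3-3).

    s = [s1, s2, s3, s4, s5] = [z1, dz1/dt, z2, dz2/dt, d2z2/dt2]
    theta = [theta1, theta2, theta3, theta4]
    """
    s1, s2, s3, s4, s5 = s
    th1, th2, th3, th4 = theta
    # Bracket shared by both exponentials in Eq. (8-3-2).
    bracket = th1 * s5 + th2 * s2**3 + th3 * s4 + s1**2
    return [
        s2,                                                   # ds1/dt
        math.exp(-bracket),                                   # ds2/dt = d2z1/dt2
        s4,                                                   # ds3/dt
        s5,                                                   # ds4/dt
        -math.exp(-2.0 * bracket) - th4 * math.sin(s1 * s3),  # ds5/dt = d3z2/dt3
    ]

def residuals_of_original(s, theta):
    """Left-hand sides of Eq. (8-3-1) with the derivatives taken from rhs()."""
    ds = rhs(s, theta)
    s1, s2, s3, s4, s5 = s
    th1, th2, th3, th4 = theta
    d2z1, d3z2 = ds[1], ds[4]
    r1 = math.log(d2z1) + th1 * s5 + th2 * s2**3 + th3 * s4 + s1**2
    r2 = d3z2 + d2z1**2 + th4 * math.sin(s1 * s3)
    return r1, r2

# Arbitrary (hypothetical) state and parameter values for the check.
state = [0.3, -0.2, 0.5, 0.1, 0.4]
params = [1.0, 0.5, 2.0, 0.7]
r1, r2 = residuals_of_original(state, params)
```

A function of the form `rhs` is what a typical first-order ODE integrator expects; the residual check only confirms that no term was lost in the change of variables.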
(b) Some partial differential equations, particularly of the parabolic type, may
be approximated by a dynamic model. For instance, consider the diffusion or
heat conduction equation
$$\partial s/\partial t = \alpha \nabla^2 s \tag{8-3-4}$$
where $\alpha$ is a constant and $\nabla^2$ is the Laplace operator
$$\nabla^2 s = \sum_{i=1}^{k} \partial^2 s/\partial x_i^2$$
and where $k = 1$, 2, or 3 depending on the number of dimensions of the object
we are considering. Suppose we select a grid of points within the object, and
let $s_j(t)$ be the value of $s(t)$ at the $j$th grid point. Furthermore, let $\delta^2 s_j$ be some
finite difference approximation to $\nabla^2 s$. For instance, in the one-dimensional
case, let $x_{1,j} = j\,\Delta x_1$. If we take $\delta^2 s_j = (s_{j+1} - 2s_j + s_{j-1})/(\Delta x_1)^2$ [better
approximations are discussed by Hicks and Wei (1967)], then
$$\dot{s}_j = \alpha\, \delta^2 s_j \tag{8-3-7}$$
has the desired form. Again, this is the way in which parabolic equations are
often formulated for numerical solution (Rosenbrock and Storey, 1966,
Ch. 7). Unfortunately, an excessive number of state variables may be required.
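As an illustration of (b), the sketch below discretizes the one-dimensional heat conduction equation on a uniform grid and integrates the resulting state equations with a classical fourth-order Runge-Kutta step. It is a hedged example rather than the book's procedure: the grid size, the fixed boundary values, the initial profile $\sin \pi x$, and the names `heat_rhs` and `rk4_step` are all assumptions made for the illustration.

```python
import math

def heat_rhs(s, alpha, dx):
    """ds_j/dt = alpha * (s[j+1] - 2*s[j] + s[j-1]) / dx**2 at interior points.

    The two boundary values are held fixed (an assumption of this sketch).
    """
    ds = [0.0] * len(s)
    for j in range(1, len(s) - 1):
        ds[j] = alpha * (s[j + 1] - 2.0 * s[j] + s[j - 1]) / dx**2
    return ds

def rk4_step(f, s, dt):
    """One classical Runge-Kutta step for the vector equation ds/dt = f(s)."""
    k1 = f(s)
    k2 = f([a + 0.5 * dt * b for a, b in zip(s, k1)])
    k3 = f([a + 0.5 * dt * b for a, b in zip(s, k2)])
    k4 = f([a + dt * b for a, b in zip(s, k3)])
    return [a + dt * (b + 2 * c + 2 * d + e) / 6.0
            for a, b, c, d, e in zip(s, k1, k2, k3, k4)]

alpha = 1.0
n_grid = 21                      # 21 state variables on the interval [0, 1]
dx = 1.0 / (n_grid - 1)
s = [math.sin(math.pi * j * dx) for j in range(n_grid)]

dt, n_steps = 1e-4, 1000         # integrate to t = 0.1
for _ in range(n_steps):
    s = rk4_step(lambda v: heat_rhs(v, alpha, dx), s, dt)

midpoint = s[n_grid // 2]
# Continuous solution at x = 0.5 for this initial profile: exp(-pi**2 * alpha * t).
reference = math.exp(-math.pi**2 * alpha * 0.1)
```

Even this coarse grid already needs 21 state variables for one space dimension, which illustrates the remark about the number of state variables.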
(c) A large variety of problems which are already in the desired form arise
from the theory of process control. Linear control theory usually deals with
models whose state variables satisfy the differential equations
$$\dot{s} = As + Bu(t) + \varepsilon(t) \tag{8-3-8}$$
where $A$ and $B$ are matrices, $u(t)$ is a known function (the control signal), and
$\varepsilon(t)$ is an unknown function (the noise) possessing certain statistical properties.
The observed variables, in turn, are given by
$$y = Cs + \delta(t)$$
where $\delta(t)$ is another noise function, and $C$ is a matrix. Generalization to
nonlinear systems is obvious. When $\varepsilon(t) = 0$, we have a dynamic system that
conforms to our definition.
Commonly arising problems are those of identification, in which unknown
elements of $A$, $B$, $C$ are to be determined, and of tracking, in which $s(t)$ is to
be estimated from the measured values of $y(\tau)$ ($\tau \le t$). The former problem
belongs directly to the class of parameter estimation problems that we are
considering here. The tracking problem is essentially one of filtering, and the
methods for dealing with it, mostly due to Kalman (1960), are discussed
extensively in the literature (for lucid expositions with many additional
references, see Deutsch (1965) and Sorenson (1966)). Here we only wish to
point out that if $\varepsilon(t) = 0$, then the initial conditions completely determine the
values of $s(t)$ at any time. Once the initial conditions and the matrices $A$ and $B$
are known, $s(t)$ can be obtained by straightforward integration. Hence, the
tracking problem is equivalent to the problem of estimating unknown initial
conditions and elements of $A$ and $B$, which is a special case of the parameter
estimation problems that we shall treat.
The central problem of control theory is the determination of control
functions $u(t)$ that will cause the state variables to behave in a desired way.
Even this problem may sometimes be treated within the parameter estimation
framework. For practical reasons, one must usually restrict oneself to functions
$u(t)$ which depend on a finite number of parameters (e.g., polynomials with
coefficients to be determined, or piecewise-constant functions). We then wish
to determine the optimal values of these parameters, i.e., those values that
maximize some performance index of the system. This is entirely analogous
to a parameter estimation problem in which the performance index plays the
role of the objective function. The general problem of determining $u(t)$ can
also be reduced to a two-point boundary value problem by using the maximum
principle (Pontryagin et al., 1962). This problem can now be formulated as a
parameter estimation problem, in which the missing initial conditions are the
unknown parameters and the available final conditions act as the observations.
In this form the problem can be solved using, say, the Gauss method.
8-4. Computation of the Objective Function and Its Gradient
In order to proceed with the estimation of the model parameters $\theta$ of a
dynamic system, we must be able to calculate the value of the objective
function $\Phi$ for any given feasible values of the parameters. Now, once the parameter
values have been prescribed, the initial conditions are determined by means of
Eq. (8-2-2). The differential equation Eq. (8-2-1) can now be integrated,
numerically if necessary, from $t = 0$ to $t = t_\mu$ (the time of the $\mu$th experiment)
for $\mu = 1, 2, \ldots, n$. This determines $s_\mu$, the predicted values of the state
variables at the $\mu$th experiment. Now we are in position to determine the $\hat{y}_\mu$ from
Eq. (8-2-3), which in turn are used to compute the residuals $e_\mu = y_\mu - \hat{y}_\mu$.
From these, most objective functions (sum of squares, likelihood, etc.) can be
calculated directly.
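The sequence just described (initial conditions, integration to each $t_\mu$, predicted observations, residuals, objective) can be sketched as follows. This is an illustrative outline only: it assumes a single state variable, a single run with increasing observation times, a fixed-step Runge-Kutta integrator, and a sum-of-squares objective, and every function name in it is hypothetical.

```python
import math

def rk4_integrate(h, s, theta, t0, t1, n_steps=200):
    """Integrate ds/dt = h(t, s, theta) from t0 to t1 (scalar state, fixed step)."""
    t = t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        k1 = h(t, s, theta)
        k2 = h(t + 0.5 * dt, s + 0.5 * dt * k1, theta)
        k3 = h(t + 0.5 * dt, s + 0.5 * dt * k2, theta)
        k4 = h(t + dt, s + dt * k3, theta)
        s += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        t += dt
    return s

def sum_of_squares(theta, times, y_obs, h, s0_of_theta, y_of_state):
    """Objective for one run: integrate from one observation time to the next."""
    s, t_prev = s0_of_theta(theta), 0.0          # initial conditions, cf. Eq. (8-2-2)
    phi = 0.0
    for t_mu, y_mu in zip(times, y_obs):         # times must be increasing
        if t_mu > t_prev:
            s = rk4_integrate(h, s, theta, t_prev, t_mu)   # cf. Eq. (8-2-1)
            t_prev = t_mu
        e_mu = y_mu - y_of_state(s, t_mu, theta)           # residual via Eq. (8-2-3)
        phi += e_mu**2
    return phi

# Hypothetical model: ds/dt = -theta*s, s(0) = 1, state observed directly.
decay = lambda t, s, theta: -theta * s
times = [0.0, 1.0, 2.0, 3.0]
y_obs = [math.exp(-0.7 * t) for t in times]      # synthetic, error-free "data"

phi_at_truth = sum_of_squares(0.7, times, y_obs, decay,
                              lambda th: 1.0, lambda s, t, th: s)
phi_off = sum_of_squares(1.0, times, y_obs, decay,
                         lambda th: 1.0, lambda s, t, th: s)
```

Restarting the integration from the previous observation time, rather than from $t = 0$ for every $\mu$, avoids repeating work within a run.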
If we wish to use a gradient method (Chapter V) for the estimation of
parameters in a dynamic system, we must compute not only the objective function
$\Phi(\theta)$, but also its derivatives $q \equiv \partial\Phi/\partial\theta$. As we have stated before, gradient
methods are the most efficient among currently available methods. The
incentive to use an efficient (in terms of total number of function evaluations)
method is particularly great in the case of dynamic systems, where each
function evaluation is itself a complex procedure requiring the solution of a set of
differential equations. We detail below several ways for calculating the
required derivatives.
(a) Finite Differences. Finite difference methods, discussed in Section 5-18,
are applicable to dynamic systems. As usual, we must face the problem of
balancing the truncation error‡ (increasing with $\Delta\theta$) against the rounding
error in differencing (decreasing with $\Delta\theta$). There is, however, an additional
difficulty associated with dynamic models. Taking small $\Delta\theta$ and avoiding the
concomitant rounding errors by using multiple-precision arithmetic is
ineffective in itself, since the accuracy of $\Phi$ is limited not only by rounding errors,
but primarily by the truncation errors of the integration method. Increased
precision in $\Phi$ can be acquired only by combining multiple-precision
arithmetic with decreased integration steps, or by using a higher-order integration
method. Both solutions are costly in computer time. The finite difference
method in its raw form works satisfactorily in many problems. In many others,
however, it fails to provide the accurate derivative values that are required for
convergence of the gradient method.
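The interplay between differencing and integration accuracy can be seen on a small experiment. The sketch below is an illustration under stated assumptions, not a reproduction of anything in the text: it uses the hypothetical model $ds/dt = -\theta s$, $s(0) = 1$, integrates it with a deliberately crude fixed-step Euler method, and compares one-sided differences of the computed $s(1)$ with the exact derivative $-t e^{-\theta t}$.

```python
import math

def euler_solution(theta, t_end=1.0, n_steps=10):
    """Crude fixed-step Euler integration of ds/dt = -theta*s, s(0) = 1."""
    s = 1.0
    dt = t_end / n_steps
    for _ in range(n_steps):
        s += dt * (-theta * s)
    return s

def forward_difference(f, theta, d_theta):
    """One-sided difference approximation to df/dtheta."""
    return (f(theta + d_theta) - f(theta)) / d_theta

theta = 1.0
exact = -1.0 * math.exp(-theta)          # d/dtheta of exp(-theta*t) at t = 1

# Shrinking d_theta does not help once the integration error dominates.
coarse_errors = {d: abs(forward_difference(euler_solution, theta, d) - exact)
                 for d in (1e-3, 1e-5, 1e-7)}

# Reducing the integration step is what actually improves the derivative.
fine = lambda th: euler_solution(th, n_steps=1000)
fine_error = abs(forward_difference(fine, theta, 1e-5) - exact)
```

With ten Euler steps the difference quotient settles on the derivative of the discretized solution, so its error relative to the true derivative stays near the same size however small $\Delta\theta$ is made; only the finer integration reduces it.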
(b) Sensitivity Equations. Several methods, variously referred to as
quasilinearization, sensitivity analysis, perturbations, etc. (Howland and
Vaillancourt, 1961; Tomovic, 1963; McGhee, 1963; Bellman et al., 1967;
Rosenbrock and Storey, 1966, Ch. 8), are based (at least implicitly) on the fact that
the required derivatives must satisfy certain linear differential equations.
These may be integrated along with the model Eq. (8-2-1) to yield the desired
gradient. In this way, the gradient can be computed with essentially the same
degree of accuracy as the function itself without undue difficulties.

‡ We are talking here of the truncation error incurred by representing $\partial\Phi/\partial\theta$ as $\Delta\Phi/\Delta\theta$.
This is quite different from the truncation error of the integration method, which affects
the accuracy of $\Phi$ itself.
In order to apply the method, we must trace step-by-step the dependency of
the objective function on the various model variables and parameters. We only
list those dependencies which are relevant to our purposes:
1. $\Phi$ depends on $e_\mu = y_\mu - \hat{y}_\mu$. The $y_\mu$ are measured ($\mu = 1, 2, \ldots, n$).
$\Phi$ may also depend directly on $\theta$, e.g., when there is a prior density function.
This requires addition of the appropriate terms to Eq. (1).
2. $\hat{y}_\mu$ depends on $s_\mu = s(t_\mu)$ and $\theta$ [Eq. (8-2-3)].
3. $s(t_\mu)$ depends on $s_0$ for the run containing the $\mu$th experiment, and on $\theta$
[through integration of Eq. (8-2-1)].
4. $s_0$ depends on $\theta$ [Eq. (8-2-2)].
Using the chain rule of differentiation we find that
$$q = \frac{\partial\Phi}{\partial\theta} = \sum_{\mu} \frac{\partial\Phi}{\partial e_\mu}\frac{\partial e_\mu}{\partial\theta} = -\sum_{\mu} \frac{\partial\Phi}{\partial e_\mu}\frac{D\hat{y}_\mu}{D\theta} \tag{8-4-1}$$
where we have used $D\hat{y}_\mu/D\theta$ to indicate the total derivative of $\hat{y}_\mu$ with respect to
$\theta$, given by
$$\frac{D\hat{y}_\mu}{D\theta} = \frac{\partial\hat{y}_\mu}{\partial\theta} + \frac{\partial\hat{y}_\mu}{\partial s_\mu}\frac{\partial s_\mu}{\partial\theta} \tag{8-4-2}$$
so that altogether
$$q = -\sum_{\mu} \frac{\partial\Phi}{\partial e_\mu}\left(\frac{\partial\hat{y}_\mu}{\partial\theta} + \frac{\partial\hat{y}_\mu}{\partial s_\mu}\frac{\partial s_\mu}{\partial\theta}\right) \tag{8-4-3}$$
The quantities $\partial\Phi/\partial e_\mu$, $\partial\hat{y}_\mu/\partial\theta$, and $\partial\hat{y}_\mu/\partial s_\mu$ are easily computed, the latter
two by differentiation of Eq. (8-2-3). That leaves us the problem of determining
$\partial s_\mu/\partial\theta$.
Let us write down the original differential equation Eq. (8-2-1)
$$ds/dt = h(t, x, s, \theta) \tag{8-4-4}$$
Differentiating both sides with respect to $\theta$, and employing the chain rule,
we find
$$\frac{\partial}{\partial\theta}\left(\frac{ds}{dt}\right) = \frac{Dh}{D\theta} = \frac{\partial h}{\partial\theta} + \frac{\partial h}{\partial s}\frac{\partial s}{\partial\theta} \tag{8-4-5}$$
Interchanging the order of differentiations on the left-hand side of Eq. (5)
$$\frac{d}{dt}\left(\frac{\partial s}{\partial\theta}\right) = \frac{\partial h}{\partial\theta} + \frac{\partial h}{\partial s}\frac{\partial s}{\partial\theta} \tag{8-4-6}$$
The quantities $\partial h/\partial\theta$ and $\partial h/\partial s$ are easily determined by differentiation.
We have, then, in Eq. (6) a set of simultaneous linear first-order ordinary
differential equations in the unknown functions $\partial s/\partial\theta$. These are called the
sensitivity equations, since their solutions indicate how sensitive the state
variables are to changes in the parameters. The functions $\partial s/\partial\theta$ themselves are
called sensitivity coefficients. To find $\partial s_\mu/\partial\theta$ we must integrate these equations,
jointly with Eq. (4), from $t = 0$ to $t = t_\mu$. To do this, we need initial values,
i.e., $(\partial s/\partial\theta)_{t=0}$. These are obtained simply by differentiating the initial
conditions Eq. (8-2-2), i.e.,
$$\left(\frac{\partial s}{\partial\theta}\right)_{t=0} = \frac{\partial s_0}{\partial\theta} \tag{8-4-7}$$
If $p$ is the number of state variables and $l$ the number of parameters
appearing either in the initial conditions or in the differential equations, then
the number of quantities $\partial s/\partial\theta$ is $pl$, and the total number of equations to
be integrated [Eq. (4) and (6)] is $p(1 + l)$. On the other hand, if we were to
use one-sided differences to estimate $\partial\Phi/\partial\theta$ we would need to integrate the
$p$ Eqs. (4) for $1 + l$ different values of $\theta$, again resulting in a total of $p(1 + l)$
equations. The computational effort involved in the two methods is roughly
the same, but the accuracy attainable in the sensitivity-equations method is
much higher, and more easily controlled. Admittedly, a greater effort is
required to prepare a problem for treatment by the sensitivity equation
method.
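In the simplest case, the initial conditions are known and the state
variables are observed directly. The initial conditions Eq. (7) then read simply
$$(\partial s/\partial\theta)_{t=0} = 0$$
and Eq. (2) reduces to
$$D\hat{y}_\mu/D\theta = \partial s_\mu/\partial\theta$$
Let us examine a simple example. There is one state variable, with known
initial condition
$$ds/dt = -\theta s, \qquad s(0) = 1 \tag{8-4-8}$$
We know the solution to be $s = e^{-\theta t}$, so that
$$\partial s/\partial\theta = -t e^{-\theta t}$$
However, let us form the sensitivity equation
$$\frac{d}{dt}\left(\frac{\partial s}{\partial\theta}\right) = \frac{\partial h}{\partial\theta} + \frac{\partial h}{\partial s}\frac{\partial s}{\partial\theta} = -s - \theta\frac{\partial s}{\partial\theta} \tag{8-4-9}$$
Substituting for $s$ its value, we have to solve for $\partial s/\partial\theta$ the differential equation
$$\frac{d}{dt}\left(\frac{\partial s}{\partial\theta}\right) + \theta\left(\frac{\partial s}{\partial\theta}\right) = -e^{-\theta t}, \qquad \left(\frac{\partial s}{\partial\theta}\right)_{t=0} = 0 \tag{8-4-10}$$
The solution is $\partial s/\partial\theta = -t e^{-\theta t}$, in agreement with the previous result.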
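The same result can be obtained numerically by integrating Eq. (8-4-8) and Eq. (8-4-9) together, which is how the method is used when no closed-form solution exists. The sketch below is a minimal illustration; the fixed-step Runge-Kutta scheme, the step count, and the values $\theta = 0.5$, $t = 2$ are assumptions chosen for the example.

```python
import math

def augmented_rhs(s, z, theta):
    """State equation (8-4-8) and sensitivity equation (8-4-9); z = ds/dtheta."""
    return -theta * s, -s - theta * z

def integrate(theta, t_end, n_steps=1000):
    """Integrate s and z jointly with the same classical Runge-Kutta steps."""
    s, z = 1.0, 0.0                      # s(0) = 1 and (ds/dtheta)(0) = 0
    dt = t_end / n_steps
    for _ in range(n_steps):
        a1, b1 = augmented_rhs(s, z, theta)
        a2, b2 = augmented_rhs(s + 0.5 * dt * a1, z + 0.5 * dt * b1, theta)
        a3, b3 = augmented_rhs(s + 0.5 * dt * a2, z + 0.5 * dt * b2, theta)
        a4, b4 = augmented_rhs(s + dt * a3, z + dt * b3, theta)
        s += dt * (a1 + 2 * a2 + 2 * a3 + a4) / 6.0
        z += dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6.0
    return s, z

theta, t_end = 0.5, 2.0
s_num, z_num = integrate(theta, t_end)
s_exact = math.exp(-theta * t_end)               # analytic solution
z_exact = -t_end * math.exp(-theta * t_end)      # analytic sensitivity coefficient
```

Because the state and its sensitivity share the same integration steps, the derivative inherits the accuracy of the state itself.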
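The required steps in a more general case are illustrated on the example of
Section 8-2. The unknown parameters involved in the initial conditions and
differential equations are $\beta$, $k_1$, $k_2$, and $k_3$. From Eq. (8-2-5) we have
$$\frac{\partial c_1}{\partial\beta} = 0, \quad \frac{\partial c_2}{\partial\beta} = 1, \quad \frac{\partial c_3}{\partial\beta} = -1, \quad \frac{\partial c_i}{\partial k_j} = 0 \quad (i, j = 1, 2, 3), \qquad t = 0 \tag{8-4-11}$$
And from Eq. (8-2-4):
$$
\begin{aligned}
\frac{d}{dt}\left(\frac{\partial c_1}{\partial\beta}\right) &= -2k_1c_1\frac{\partial c_1}{\partial\beta} + k_2c_3\frac{\partial c_2}{\partial\beta} + k_2c_2\frac{\partial c_3}{\partial\beta} \\
\frac{d}{dt}\left(\frac{\partial c_2}{\partial\beta}\right) &= 2k_1c_1\frac{\partial c_1}{\partial\beta} - (k_2c_3 + k_3)\frac{\partial c_2}{\partial\beta} - k_2c_2\frac{\partial c_3}{\partial\beta} \\
\frac{d}{dt}\left(\frac{\partial c_3}{\partial\beta}\right) &= 2k_1c_1\frac{\partial c_1}{\partial\beta} - (k_2c_3 - k_3)\frac{\partial c_2}{\partial\beta} - k_2c_2\frac{\partial c_3}{\partial\beta} \\
\frac{d}{dt}\left(\frac{\partial c_1}{\partial k_1}\right) &= -c_1^2 - 2k_1c_1\frac{\partial c_1}{\partial k_1} + k_2c_3\frac{\partial c_2}{\partial k_1} + k_2c_2\frac{\partial c_3}{\partial k_1} \\
\frac{d}{dt}\left(\frac{\partial c_2}{\partial k_1}\right) &= c_1^2 + 2k_1c_1\frac{\partial c_1}{\partial k_1} - (k_2c_3 + k_3)\frac{\partial c_2}{\partial k_1} - k_2c_2\frac{\partial c_3}{\partial k_1} \\
\frac{d}{dt}\left(\frac{\partial c_3}{\partial k_1}\right) &= c_1^2 + 2k_1c_1\frac{\partial c_1}{\partial k_1} - (k_2c_3 - k_3)\frac{\partial c_2}{\partial k_1} - k_2c_2\frac{\partial c_3}{\partial k_1} \\
\frac{d}{dt}\left(\frac{\partial c_1}{\partial k_2}\right) &= c_2c_3 - 2k_1c_1\frac{\partial c_1}{\partial k_2} + k_2c_3\frac{\partial c_2}{\partial k_2} + k_2c_2\frac{\partial c_3}{\partial k_2} \\
\frac{d}{dt}\left(\frac{\partial c_2}{\partial k_2}\right) &= -c_2c_3 + 2k_1c_1\frac{\partial c_1}{\partial k_2} - (k_2c_3 + k_3)\frac{\partial c_2}{\partial k_2} - k_2c_2\frac{\partial c_3}{\partial k_2} \\
\frac{d}{dt}\left(\frac{\partial c_3}{\partial k_2}\right) &= -c_2c_3 + 2k_1c_1\frac{\partial c_1}{\partial k_2} - (k_2c_3 - k_3)\frac{\partial c_2}{\partial k_2} - k_2c_2\frac{\partial c_3}{\partial k_2} \\
\frac{d}{dt}\left(\frac{\partial c_1}{\partial k_3}\right) &= -2k_1c_1\frac{\partial c_1}{\partial k_3} + k_2c_3\frac{\partial c_2}{\partial k_3} + k_2c_2\frac{\partial c_3}{\partial k_3} \\
\frac{d}{dt}\left(\frac{\partial c_2}{\partial k_3}\right) &= -c_2 + 2k_1c_1\frac{\partial c_1}{\partial k_3} - (k_2c_3 + k_3)\frac{\partial c_2}{\partial k_3} - k_2c_2\frac{\partial c_3}{\partial k_3} \\
\frac{d}{dt}\left(\frac{\partial c_3}{\partial k_3}\right) &= c_2 + 2k_1c_1\frac{\partial c_1}{\partial k_3} - (k_2c_3 - k_3)\frac{\partial c_2}{\partial k_3} - k_2c_2\frac{\partial c_3}{\partial k_3}
\end{aligned}
\tag{8-4-12}
$$
Eq. (8-2-5) and (11) provide initial conditions to the differential equations
Eq. (8-2-4) and Eq. (12), which may be integrated simultaneously from $t = 0$
to $t = t_\mu$. A separate integration is required for each run, each integration
going up to the largest $t_\mu$ belonging to the run.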
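The twelve sensitivity equations need not be written out by hand: they follow from Eq. (8-4-6) once $\partial h/\partial s$ and $\partial h/\partial\theta$ are available. The following Python sketch assembles and integrates them for the three-species example. It is illustrative only; the numerical values of $\alpha$, $\beta$, and the rate constants are hypothetical, as are the function names and the fixed-step Runge-Kutta integrator.

```python
def h(c, k):
    """Right-hand side of Eq. (8-2-4)."""
    c1, c2, c3 = c
    k1, k2, k3 = k
    return [-k1 * c1**2 + k2 * c2 * c3,
            k1 * c1**2 - k2 * c2 * c3 - k3 * c2,
            k1 * c1**2 - k2 * c2 * c3 + k3 * c2]

def dh_dc(c, k):
    """Matrix of partial derivatives of h with respect to the state variables."""
    c1, c2, c3 = c
    k1, k2, k3 = k
    return [[-2 * k1 * c1, k2 * c3, k2 * c2],
            [2 * k1 * c1, -(k2 * c3 + k3), -k2 * c2],
            [2 * k1 * c1, -(k2 * c3 - k3), -k2 * c2]]

def dh_dtheta(c, k):
    """Partial derivatives of h; columns ordered as (beta, k1, k2, k3)."""
    c1, c2, c3 = c
    return [[0.0, -c1**2, c2 * c3, 0.0],
            [0.0, c1**2, -c2 * c3, -c2],
            [0.0, c1**2, -c2 * c3, c2]]

def augmented_rhs(y, k):
    """y holds c (3 values) followed by Z = dc/dtheta stored row by row (3 x 4)."""
    c = y[:3]
    Z = [y[3 + 4 * i: 7 + 4 * i] for i in range(3)]
    J, G = dh_dc(c, k), dh_dtheta(c, k)
    # Eq. (8-4-6): dZ/dt = dh/dtheta + (dh/ds) Z
    dZ = [[G[i][j] + sum(J[i][m] * Z[m][j] for m in range(3)) for j in range(4)]
          for i in range(3)]
    return h(c, k) + [v for row in dZ for v in row]

def rk4(f, y, t_end, n_steps):
    dt = t_end / n_steps
    for _ in range(n_steps):
        r1 = f(y)
        r2 = f([a + 0.5 * dt * b for a, b in zip(y, r1)])
        r3 = f([a + 0.5 * dt * b for a, b in zip(y, r2)])
        r4 = f([a + dt * b for a, b in zip(y, r3)])
        y = [a + dt * (b + 2 * c + 2 * d + e) / 6.0
             for a, b, c, d, e in zip(y, r1, r2, r3, r4)]
    return y

def solve(alpha, beta, k, t_end=1.0, n_steps=500):
    """Integrate Eq. (8-2-4) and Eq. (8-4-12) from the conditions (8-2-5), (8-4-11)."""
    y0 = [alpha, beta, 1.0 - alpha - beta,
          0.0, 0.0, 0.0, 0.0,       # dc1/d(beta, k1, k2, k3) at t = 0
          1.0, 0.0, 0.0, 0.0,       # dc2/d(beta, k1, k2, k3) at t = 0
          -1.0, 0.0, 0.0, 0.0]      # dc3/d(beta, k1, k2, k3) at t = 0
    y = rk4(lambda v: augmented_rhs(v, k), y0, t_end, n_steps)
    return y[:3], [y[3 + 4 * i: 7 + 4 * i] for i in range(3)]

# Hypothetical values chosen only for the illustration.
alpha, beta, k = 0.6, 0.3, (1.0, 0.5, 0.2)
c, Z = solve(alpha, beta, k)
```

A useful sanity check on hand-coded derivative routines is to compare the returned `Z` with central differences of `solve` over small perturbations of each parameter.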
Setting up the Eq. (12) is a rather tedious task. The computer can perform
this job, using Eq. (6), provided it is given subroutines that compute $\partial h/\partial\theta$
and $\partial h/\partial s$.
Once $\partial c_i/\partial\beta$ and $\partial c_i/\partial k_j$ ($i, j = 1, 2, 3$) have been computed for $t = t_\mu$, we
derive from Eq. (8-2-6):
$$
\begin{aligned}
D\hat{y}_1/D\beta &= D\hat{y}_2/D\beta = \partial c_1/\partial\beta \\
D\hat{y}_1/Dk_j &= D\hat{y}_2/Dk_j = \partial c_1/\partial k_j \qquad (j = 1, 2, 3) \\
D\hat{y}_3/D\beta &= q\,\partial c_1/\partial\beta + r\,\partial c_2/\partial\beta \\
D\hat{y}_3/Dk_j &= q\,\partial c_1/\partial k_j + r\,\partial c_2/\partial k_j \qquad (j = 1, 2, 3)
\end{aligned}
\tag{8-4-13}
$$
and also, for the additional parameters $p$, $q$, and $r$:
$$
\begin{aligned}
D\hat{y}_a/Dp &= D\hat{y}_a/Dq = D\hat{y}_a/Dr = 0 \qquad (a = 1, 2) \\
D\hat{y}_3/Dp &= 1, \qquad D\hat{y}_3/Dq = c_1, \qquad D\hat{y}_3/Dr = c_2
\end{aligned}
\tag{8-4-14}
$$
This gives us all the quantities needed to evaluate $q$, using Eq. (1).
The quantities $D\hat{y}_\mu/D\theta$ are also used to generate $N$, the Gauss approximation
to the Hessian. A numerical illustration of this method appears in
Section 8-7.
It is possible to formulate the problem in such a way that only unknown
initial conditions need be determined. This is done by replacing each parameter
$\theta$ appearing in the differential equations by a new state variable $s_\theta$ subject to
$$\dot{s}_\theta = 0, \qquad s_\theta(0) = \theta \tag{8-4-15}$$
This procedure has been advocated by several authors (e.g., Bellman et al.
(1967) within the framework of quasilinearization), but it serves no purpose
other than to increase the number of differential equations that must be
integrated.
8-5. Numerical Integration

The methods described in the preceding section require the numerical
integration of a set of simultaneous first-order ordinary differential
equations. Methods for performing this task are described in textbooks on the
subject, to which the reader is referred. Routines for evaluating the integrals
are available at most computer installations. The following are some remarks
pertinent to the choice of integration method in parameter estimation
problems.
Most integration methods are either of the fixed- or the variable-step
type. The former methods (e.g., Runge-Kutta) are easier to implement, but
the latter provide better control over the truncation errors incurred in the
calculations. Intelligent adjustment of step size can save a great deal of
computer time. On the other hand, if we use a variable-step method, we must
observe precautions. In such methods, the step length at any time is governed
by the behavior of the equations. Two slightly different values of $\theta$ may give
rise to different sequences of step lengths, resulting in slight discontinuities
in the computed functions. These may cause severe errors in derivatives
obtained by differencing. It is suggested, therefore, that all $p(l + 1)$ equations
required for obtaining a complete set of differences be integrated
simultaneously, all using the same integration step sizes.
In the algorithm for minimizing the objective function there occur some
points (the main iterates) at which both the function and its derivatives are
required, while at other points only the function is required. It is essential
that the same function value should be obtained at a point whether or not
derivatives are also required. Hence, regardless of the method used for
computing derivatives, the integration step size should be determined by the
behavior of the state equations at the point $\theta$ alone. The sensitivity equations,
or the state equations at the perturbed points $\theta + \Delta\theta_\alpha$, should have no effect
on the integration step size. While this runs a certain risk of getting wrong
values of the derivatives, in practice the alternative of computing a
nonreproducible objective function has been found to give more trouble.
8-6. Some Difficulties Associated with Dynamic Systems

Solutions of differential equations behave in a variety of ways: some are
stable and converge to a steady state; some are unstable and diverge to
infinity; others oscillate or enter into limit cycles. What concerns us here is
the fact that the nature of the solutions to a given set of equations may change
drastically when one changes the values of the parameters. For instance, the
solution of $ds/dt + \theta s = 0$ is stable when $\theta$ is positive or zero, and unstable
when $\theta$ is negative. For this reason, we may find it difficult to estimate
parameters if the initial guesses or any subsequent iterates give rise to solutions
of the wrong type.
In a few cases, we can overcome this problem easily enough. If the system
described by $ds/dt + \theta s = 0$ is known from physical considerations to be
stable, all we need do is impose the constraint $\theta \ge 0$. Again, if a system is
described by the set of equations:
$$
\begin{aligned}
ds_1/dt + h_{11}(\theta)s_1 + h_{12}(\theta)s_2 &= 0 \\
ds_2/dt + h_{21}(\theta)s_1 + h_{22}(\theta)s_2 &= 0
\end{aligned}
\tag{8-6-1}
$$
where $h_{ij}(\theta)$ are known functions, then the constraints
$$h_{11}(\theta) > 0, \qquad h_{22}(\theta) > 0, \qquad h_{11}(\theta)h_{22}(\theta) - h_{12}(\theta)h_{21}(\theta) > 0 \tag{8-6-2}$$
guarantee stability by making the matrix of coefficients positive definite.
Unfortunately, in most practical situations such conditions turn out to be
unwieldy. Besides, unless the equations are of the linear time-invariant type
it is difficult to formulate stability conditions which hold at all times. We
don't even have any reason to believe that the solutions must be locally
stable at all times, since although appearing unstable at one time, they may
eventually pass into a stable region.
In addition to unstable solutions in which the state variables increase
rapidly beyond bounds, we may be troubled by solutions which are too
stable, i.e., in which the state variables rapidly reach steady state values which
are independent at least of some of the parameters.
Take, for instance, the system:
$$
\begin{aligned}
ds_1/dt &= -k_1 s_1 + k_2 s_2, \qquad s_1(0) = c_1 \\
ds_2/dt &= k_1 s_1 - k_2 s_2, \qquad s_2(0) = c_2
\end{aligned}
\tag{8-6-3}
$$
the solution of which is:
$$
\begin{aligned}
s_1 &= [(c_1 - Kc_2)/(1 + K)] \exp[-(k_1 + k_2)t] + K(c_1 + c_2)/(1 + K) \\
s_2 &= [(Kc_2 - c_1)/(1 + K)] \exp[-(k_1 + k_2)t] + (c_1 + c_2)/(1 + K)
\end{aligned}
\tag{8-6-4}
$$
where $K \equiv k_2/k_1$. Suppose we have assigned to $k_1$ and $k_2$ initial guesses that
are much too large, so that exponential terms are already negligible for the
smallest $t_\mu$ at which measurements of $s$ are available. Then
$$s_1 = K(c_1 + c_2)/(1 + K), \qquad s_2 = (c_1 + c_2)/(1 + K) \tag{8-6-5}$$
Clearly, we have lost all information pertaining to $k_1$ and $k_2$ individually,
and we can hope to determine only their ratio $K$. In other words, the values
$k_1 = 10{,}000$, $k_2 = 20{,}000$ will fit the data just as well (or just as poorly) as
$k_1 = 100{,}000$, $k_2 = 200{,}000$, and the estimation procedure would have no
incentive to reduce the values of $k_1$ and $k_2$, but only to adjust their ratio.
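The loss of information is easy to reproduce by evaluating Eq. (8-6-4) directly. In the short sketch below, the observation time $t = 0.01$ and the initial concentrations are hypothetical; the two pairs of rate constants are the ones quoted in the text.

```python
import math

def solution(t, k1, k2, c1, c2):
    """Closed-form solution, Eq. (8-6-4), of the system (8-6-3)."""
    K = k2 / k1
    decay = math.exp(-(k1 + k2) * t)
    s1 = (c1 - K * c2) / (1 + K) * decay + K * (c1 + c2) / (1 + K)
    s2 = (K * c2 - c1) / (1 + K) * decay + (c1 + c2) / (1 + K)
    return s1, s2

c1, c2, t_first = 1.0, 0.0, 0.01          # hypothetical run and first observation time

large = solution(t_first, 10_000.0, 20_000.0, c1, c2)
larger = solution(t_first, 100_000.0, 200_000.0, c1, c2)
moderate = solution(t_first, 10.0, 20.0, c1, c2)
```

Here `large` and `larger` coincide to machine precision, since the exponential term has vanished in both cases, whereas `moderate` still depends on the individual rate constants.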
It seems clear then that we are most likely to avoid both instability and
overstability if we start out with very small values of any unknown
parameters which are rate coefficients. This gives us the best chance of obtaining
solutions whose magnitude remains reasonable throughout the time intervals
for which observations are available, and which are sensitive to the values
of the parameters. In many cases it pays to place reasonable bounds on the
magnitudes of the state variables. Should these be exceeded in the course of
an integration, we reject the current value of $\theta$ as infeasible. If we already
have a feasible $\theta$ from the previous iteration, then we can interpolate, i.e.,
return to a value of $\theta$ halfway between the current and previous values. If
necessary, this procedure may be repeated several times. If the infeasibility
occurs in the course of the first iteration, simply reducing the magnitudes of
all parameters by successive halvings often produces feasible values.
Alternatively, we may temporarily assign fictitious observed values zero
to the state variables at time $t$ equal to the value at which these variables
exceed their bounds, and ignore for the present iteration observations taken
at later $t$. For example, suppose $s_1$ is observed at $t = 1, 2, \ldots, 10$, and
$|s_1| \le 1000$ is the bound. If we integrate the equations for current values of $\theta$
and find that $s_1 = 1000$ at $t = 4.5$, then we act as though we only had the
observations at $t = 1, 2, 3, 4$, and in addition we add an "observation"
$s_1 = 0$ at $t = 4.5$. It may be profitable to attach a large weight to this last
observation in forming the objective function.
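The interpolation rule (retreat halfway toward the last feasible point, repeating if necessary) might be coded as follows. This is a sketch under assumptions: the feasibility test uses the hypothetical model $ds/dt = -\theta s$, $s(0) = 1$ integrated to $t = 10$ with the bound $|s| \le 1000$, and the names `within_bounds` and `halve_until_feasible` are invented for the illustration.

```python
def within_bounds(theta, t_end=10.0, n_steps=1000, bound=1000.0):
    """Integrate ds/dt = -theta*s, s(0) = 1, and report whether |s| stays bounded."""
    (th,) = theta
    s = 1.0
    dt = t_end / n_steps
    for _ in range(n_steps):
        k1 = -th * s
        k2 = -th * (s + 0.5 * dt * k1)
        k3 = -th * (s + 0.5 * dt * k2)
        k4 = -th * (s + dt * k3)
        s += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        if abs(s) > bound:
            return False          # reject theta as soon as the bound is exceeded
    return True

def halve_until_feasible(theta_prev, theta_new, is_feasible, max_halvings=20):
    """Move halfway back toward the last feasible point until the trial is feasible."""
    theta = list(theta_new)
    for _ in range(max_halvings + 1):
        if is_feasible(theta):
            return theta
        theta = [0.5 * (a + b) for a, b in zip(theta_prev, theta)]
    raise ValueError("no feasible point found within max_halvings")

theta_prev = [0.5]     # feasible value from the previous iteration (hypothetical)
theta_trial = [-3.0]   # proposed step that makes the solution blow up
theta_accepted = halve_until_feasible(theta_prev, theta_trial, within_bounds)
```

In this example the trial values $-3$ and $-1.25$ are rejected and $-0.375$ is accepted, since $e^{3.75}$ is well inside the bound.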
Degeneracies of various types may arise when the differential equations
are linearly dependent. In the chemical reaction scheme of Eq. (3) for instance,
we have $ds_1/dt = -ds_2/dt$, hence $s_1 + s_2 = c_1 + c_2$ remains constant. Suppose
all our observations were taken in runs for which the initial conditions $c_1$ and
$c_2$ always added up to the same value $\gamma$. Suppose, further, that the observed
variable $y$ is a linear function of $s_1$ and $s_2$ with unknown coefficients $b_0$, $b_1$,
and $b_2$. Thus
$$
\begin{aligned}
y &= b_0 + b_1 s_1 + b_2 s_2 = b_0 + b_1(s_1 + s_2) + (b_2 - b_1)s_2 \\
  &= b_0 + b_1\gamma + (b_2 - b_1)s_2 = \theta_0 + \theta_2 s_2
\end{aligned}
\tag{8-6-6}
$$
Under these conditions, then, $y$ appears to be a linear function of $s_2$ (or $s_1$)
alone. Any attempt to determine three coefficients independently will fail
unless new observations with different values of $c_1 + c_2$ are made. Additional
problems associated with linearly dependent systems are discussed in
Section 8-8.
8-7. A Chemical Kinetics Problem

The following example is somewhat artificially concocted, but it serves to
illustrate many points.
We consider a heterogeneous catalytic reaction in which a molecule of
species A is reversibly transformed into two molecules of species B
$$A \rightleftharpoons 2B$$
If it were not for the catalyst, then the rate of the forward reaction $A \to 2B$
would be proportional to $s_1$, the concentration of A
$$R_F = k_F s_1$$
The rate of the reverse reaction would be proportional to the square of $s_2$,
the concentration of B
$$R_R = k_R s_2^2$$
If the reaction reaches a state of equilibrium, no further changes in
concentrations occur because $R_F = R_R$; hence
$$k_R/k_F = s_1^*/(s_2^*)^2$$
where $s_1^*$ and $s_2^*$ are the concentrations at equilibrium, and $K \equiv k_R/k_F$ is
called an equilibrium constant. It is determined by thermodynamic
considerations alone, and is unaffected by the catalyst. The net forward reaction rate
is given by
$$R = R_F - R_R = k_F s_1 - k_R s_2^2 = k_F(s_1 - K s_2^2)$$
The presence of the catalyst affects the value of $R$. The nature of the effect
depends on the mechanism of the reaction. We shall adopt the following
expression for the rate of the catalyzed reaction
$$R = k_F(s_1 - K s_2^2)/(1 + M s_1)^2$$
The three constants $k_F$, $K$, and $M$ are functions of the temperature $T$, usually
assumed to be of the form:
$$
\begin{aligned}
k_F &= \theta_1 \exp(-\theta_2/T) \\
K &= \alpha \exp(-\beta/T) \\
M &= \theta_3 \exp(-\theta_4/T)
\end{aligned}
$$
We assume that $K$ has been determined accurately from thermodynamic
data as
$$K = \exp(-1000/T)$$
Species A is disappearing at a rate equal to $R$, and B appears at a rate of
$2R$. Hence the differential equations governing the system are:
$$
\begin{aligned}
ds_1/dt &= h_1(s, \theta) = -\theta_1 \exp(-\theta_2/T)(s_1 - e^{-1000/T} s_2^2)/(1 + \theta_3 \exp(-\theta_4/T) s_1)^2 \\
ds_2/dt &= h_2(s, \theta) = 2\theta_1 \exp(-\theta_2/T)(s_1 - e^{-1000/T} s_2^2)/(1 + \theta_3 \exp(-\theta_4/T) s_1)^2
\end{aligned}
\tag{8-7-1}
$$
To estimate $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ we conduct three runs, at temperatures
$T = 200°$, $400°$, and $600°$. The second run is started with pure A, and the
third run with pure B. Otherwise, the initial concentrations are known only
approximately. In the first run
$$s_1(0) = \theta_5 = 1 \pm 0.05, \qquad s_2(0) = \theta_6 = 1 \pm 0.05$$
In the second run
$$s_1(0) = \theta_7 = 1 \pm 0.05, \qquad s_2(0) = 0$$
In the third run
$$s_1(0) = 0, \qquad s_2(0) = \theta_8 = 1 \pm 0.05$$
In the course of each run, samples are withdrawn at ten different times
(including initially, at $t = 0$), and analyzed in a densitometer.
The instrument's readings are linear in the concentrations of A and B,
i.e.,
$$y = 1 + \theta_9 s_1 + \theta_{10} s_2 \tag{8-7-2}$$
The coefficients $\theta_9$ and $\theta_{10}$ are known approximately from past experience
$$\theta_9 = 1 \pm 0.05, \qquad \theta_{10} = 2 \pm 0.05$$
The data are given in Table 8-1.

Table 8-1
Data for Kinetics Problem

Run   T     Initial conditions     μ     t_μ     y_μ
            s_1(0)    s_2(0)
1     200   θ_5       θ_6          1     0       3.988
                                   2     10      4.073
                                   3     20      4.153
                                   4     30      4.231
                                   5     40      4.309
                                   6     50      4.376
                                   7     60      4.457
                                   8     70      4.522
                                   9     80      4.615
                                   10    90      4.667
2     400   θ_7       0            11    0       1.997
                                   12    2       2.149
                                   13    4       2.320
                                   14    6       2.465
                                   15    8       2.611
                                   16    10      2.754
                                   17    12      2.896
                                   18    14      3.034
                                   19    16      3.166
                                   20    18      3.278
3     600   0         θ_8          21    0       3.012
                                   22    0.5     2.956
                                   23    1       2.926
                                   24    1.5     2.877
                                   25    2       2.853
                                   26    2.5     2.823
                                   27    3       2.800
                                   28    3.5     2.776
                                   29    4       2.767
                                   30    4.5     2.760
Along with $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$, we must also estimate the values of the
unknown initial concentrations $\theta_5$, $\theta_6$, $\theta_7$, and $\theta_8$, and of the unknown
coefficients $\theta_9$ and $\theta_{10}$. To account for our partial knowledge of the latter values,
we assign to the six last parameters independent normal prior distributions
with means [1, 1, 1, 1, 1, 2] and standard deviations 0.05.
If we do not know the standard deviation of the measurement errors, we
are led to the objective function
$$
\begin{aligned}
\Phi(\theta) = {} & (30/2) \log \sum_{\mu=1}^{30} e_\mu^2(\theta) \\
& + (1/2 \times 0.05^2)[(\theta_5 - 1)^2 + (\theta_6 - 1)^2 + (\theta_7 - 1)^2 + (\theta_8 - 1)^2 \\
& + (\theta_9 - 1)^2 + (\theta_{10} - 2)^2]
\end{aligned}
$$
where
$$e_\mu(\theta) = y_\mu - 1 - \theta_9 s_1(t_\mu, \theta') - \theta_{10} s_2(t_\mu, \theta')$$
Here $\theta'$ denotes the vector consisting of the first eight elements of $\theta$. The
functions $s(t_\mu, \theta')$ are to be determined by integrating Eq. (1) from $t = 0$ to $t = t_\mu$,
using initial conditions $(\theta_5, \theta_6)$, $(\theta_7, 0)$, or $(0, \theta_8)$ depending on whether the
$\mu$th experiment belongs to run 1, 2, or 3.
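We shall estimate $\theta$ by means of the method of sensitivity equations.
That is, along with the two Eqs. (1) we integrate at each iteration the set of
sixteen differential equations for the functions $\partial s/\partial\theta'$. The initial conditions
for these equations are

$$\left.\frac{\partial s}{\partial\theta'}\right|_{t=0} = \begin{bmatrix} 0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&1&0&0 \end{bmatrix} \quad \text{(run 1)}$$

$$\left.\frac{\partial s}{\partial\theta'}\right|_{t=0} = \begin{bmatrix} 0&0&0&0&0&0&1&0 \\ 0&0&0&0&0&0&0&0 \end{bmatrix} \quad \text{(run 2)}$$

$$\left.\frac{\partial s}{\partial\theta'}\right|_{t=0} = \begin{bmatrix} 0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&1 \end{bmatrix} \quad \text{(run 3)}$$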
To form the differential Eqs. (8-4-6) we need the matrices $\partial h/\partial\theta'$ and $\partial h/\partial s$.
The first row of each matrix is given by:

$$\frac{\partial h_1}{\partial\theta'} = \left[\frac{h_1}{\theta_1},\; -\frac{h_1}{T},\; -\frac{2h_1 \exp(-\theta_4/T)s_1}{1 + \theta_3 \exp(-\theta_4/T)s_1},\; \frac{2h_1\theta_3 \exp(-\theta_4/T)s_1}{T[1 + \theta_3 \exp(-\theta_4/T)s_1]},\; 0,\; 0,\; 0,\; 0\right]$$

$$\frac{\partial h_1}{\partial s} = \left[\frac{-\theta_1 \exp(-\theta_2/T)}{[1 + \theta_3 \exp(-\theta_4/T)s_1]^2} - \frac{2h_1\theta_3 \exp(-\theta_4/T)}{1 + \theta_3 \exp(-\theta_4/T)s_1},\; \frac{2\theta_1 \exp[-(\theta_2 + 1000)/T]s_2}{[1 + \theta_3 \exp(-\theta_4/T)s_1]^2}\right]$$
To obtain the second row in each case we multiply the first row by $-2$. The
reader may write out in full the differential equations for the sixteen functions
$\partial s/\partial\theta'$.
We use the initial guesses [2, 500, 0.5, 50] for the first four parameters.
The guesses [1, 1, 1, 1, 1, 2] for the remaining parameters are obvious.
In Table 8-2 we give the results of integrating the equations in $s$ and
$\partial s/\partial\theta'$ for the first run from $t = 0$ to $t = 10$, using the initial guess values for
$\theta$. The values of $\partial s/\partial\theta_7$ and $\partial s/\partial\theta_8$ were omitted, being all zero for the first
run.

Table 8-2
Integrated First Run Data(a) Using Initial Guess $\theta$

                               t = 0                  t = 10
                               $s_1$    $s_2$         $s_1$            $s_2$
$s$                            1        1             0.3623495        2.275271
$\partial s/\partial\theta_1$  0        0             -0.2064785       0.4129581
$\partial s/\partial\theta_2$  0        0             0.002064790      -0.004129570
$\partial s/\partial\theta_3$  0        0             0.3270462        -0.6540943
$\partial s/\partial\theta_4$  0        0             -0.0008176151    0.001635234
$\partial s/\partial\theta_5$  1        0             0.5249249        0.950146
$\partial s/\partial\theta_6$  0        1             0.01804298       0.9639123

(a) Accurate to about five decimal places.

From these values it is easy to compute the residuals and their derivatives
for the first two observations. The residuals can be found from Eq. (2).
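First, at $\mu = 1$ ($t = 0$):

$$e_1 = y_1 - (1 + \theta_9 s_1 + \theta_{10} s_2) = 3.988 - 1 - 1 \times 1 - 2 \times 1 = -0.012$$

$$\frac{\partial f_1}{\partial\theta} = -\frac{\partial e_1}{\partial\theta} = \begin{bmatrix} \theta_9\,\partial s_1/\partial\theta' + \theta_{10}\,\partial s_2/\partial\theta' \\ s_1 \\ s_2 \end{bmatrix} = [0,\; 0,\; 0,\; 0,\; 1,\; 2,\; 0,\; 0,\; 1,\; 1]^T$$

and, at $\mu = 2$ ($t = 10$):

$$e_2 = 4.073 - 1 - 1 \times 0.3623495 - 2 \times 2.275272 = -1.83989$$

$$\frac{\partial f_2}{\partial\theta} = \begin{bmatrix} 1 \times (-0.2064785) + 2 \times (0.4129581) \\ 1 \times (0.002064790) + 2 \times (-0.004129570) \\ \vdots \\ 0 \\ 0 \\ 0.3623495 \\ 2.275272 \end{bmatrix} = \begin{bmatrix} 0.6194376 \\ -0.006194349 \\ -0.9811422 \\ 0.002452854 \\ 2.425216 \\ 1.945867 \\ 0 \\ 0 \\ 0.3623495 \\ 2.275272 \end{bmatrix}$$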
In similar manner, we can compute $e_\mu$ and $\partial f_\mu/\partial\theta$ for $\mu = 3, 4, \ldots, 30$.
From these, $q$ and $N$ can be computed and the Gauss method applied. The
process does not converge unless penalty functions are used to maintain all
parameters (or at least the first four) strictly positive.
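
The arithmetic for the first two observations can be reproduced directly from the entries of Table 8-2. The following is only a sketch: the function names `residual` and `gradient` are illustrative and do not come from the text, and $\theta_9 = 1$, $\theta_{10} = 2$ are the initial guesses stated above.

```python
# Initial guesses for the instrument coefficients theta_9 and theta_10.
theta9, theta10 = 1.0, 2.0

def residual(y, s1, s2):
    """e = y - (1 + theta9*s1 + theta10*s2), from Eq. (8-7-2)."""
    return y - (1.0 + theta9 * s1 + theta10 * s2)

def gradient(s, ds_dtheta):
    """df/dtheta for all ten parameters.

    ds_dtheta holds (ds1/dtheta_k, ds2/dtheta_k) for k = 1..8.
    """
    g = [theta9 * d1 + theta10 * d2 for d1, d2 in ds_dtheta]
    return g + [s[0], s[1]]  # derivatives with respect to theta_9, theta_10

# Run 1 at t = 0 (initial conditions theta_5, theta_6).
s_t0 = (1.0, 1.0)
ds_t0 = [(0, 0)] * 4 + [(1, 0), (0, 1), (0, 0), (0, 0)]

# Run 1 at t = 10, values copied from Table 8-2.
s_t10 = (0.3623495, 2.275272)
ds_t10 = [(-0.2064785, 0.4129581), (0.002064790, -0.004129570),
          (0.3270462, -0.6540943), (-0.0008176151, 0.001635234),
          (0.5249249, 0.950146), (0.01804298, 0.9639123), (0, 0), (0, 0)]

e1 = residual(3.988, *s_t0)
e2 = residual(4.073, *s_t10)
g1 = gradient(s_t0, ds_t0)
g2 = gradient(s_t10, ds_t10)
```

The same two functions apply unchanged to $\mu = 3, \ldots, 30$ once the integrated values of $s$ and $\partial s/\partial\theta'$ are available.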
The solution is

$$\theta^* = \begin{bmatrix} 1.39266 \pm 0.20891 \\ 1140.01 \pm 75.16 \\ 1.82052 \pm 0.80081 \\ 366.524 \pm 194.213 \\ 1.00601 \pm 0.04502 \\ 0.998853 \pm 0.02988 \\ 0.986829 \pm 0.02606 \\ 1.01898 \pm 0.01677 \\ 1.01086 \pm 0.02620 \\ 1.97541 \pm 0.03198 \end{bmatrix}$$
One would judge all parameters well-determined except for $\theta_3$ and $\theta_4$. The
amount of information concerning the values of $\theta_5$ through $\theta_{10}$ that was
gained from the data can be gauged by comparing the present standard deviations
of these parameters with the prior standard deviations of 0.05. There
is substantial improvement in all cases but $\theta_5$.
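
One simple way to make this comparison concrete is to form the ratio of each posterior standard deviation to the prior value 0.05. The sketch below uses the standard deviations quoted in the solution; the variable names are illustrative only.

```python
prior_sd = 0.05

# Posterior standard deviations of theta_5 .. theta_10 from the solution above.
posterior_sd = {5: 0.04502, 6: 0.02988, 7: 0.02606,
                8: 0.01677, 9: 0.02620, 10: 0.03198}

# A ratio near 1 means the data added little to the prior information.
ratio = {k: sd / prior_sd for k, sd in posterior_sd.items()}
least_improved = max(ratio, key=ratio.get)
```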
8-8. Linearly Dependent Equations
We return to the kinetics problem of the previous section, but now the
initial conditions for each run are known precisely (see Table 8-3). This time
the concentrations of A and B are measured directly in each experiment, and
appear under the headings $y_{\mu 1}$ and $y_{\mu 2}$ respectively in Table 8-4. We recall
Table 8-3
Run Data for Kinetics Problem

       Initial conditions
Run    $s_1(0)$    $s_2(0)$    $\alpha = 2s_1(0) + s_2(0)$    T
1      1.00830     0.99662     3.01322                        200
2      0.98862     0           1.97724                        400
3      0           1.01731     1.01731                        600
Table 8-4
Data for Kinetics Problem

Run   $\mu$   $t_\mu$   $y_{\mu 1}$   $y_{\mu 2}$   $\hat{s}_{\mu 1} = \hat{y}_{\mu 1}$   $\hat{y}_{\mu 2}$
1     1       10        0.98040       1.05134       0.980832                              1.051556
      2       20        0.95262       1.10796       0.952628                              1.107964
      3       30        0.92703       1.16017       0.926626                              1.159968
      4       40        0.90120       1.21002       0.901520                              1.210180
      5       50        0.87706       1.26056       0.876476                              1.260268
      6       60        0.84883       1.31534       0.848918                              1.315384
      7       70        0.82480       1.36299       0.825052                              1.363116
      8       80        0.80163       1.41035       0.801474                              1.410272
      9       90        0.77726       1.45822       0.777452                              1.458316
2     10      2         0.93445       0.10857       0.934358                              0.108524
      11      4         0.88192       0.21395       0.881700                              0.213840
      12      6         0.82997       0.31716       0.830026                              0.317188
      13      8         0.78015       0.41627       0.780418                              0.416404
      14      10        0.73047       0.51561       0.730746                              0.515748
      15      12        0.68271       0.61219       0.682562                              0.612116
      16      14        0.63816       0.70085       0.638188                              0.700864
      17      16        0.59445       0.78925       0.594086                              0.789068
      18      18        0.55375       0.86976       0.553742                              0.869756
3     19      0.5       0.01688       0.98263       0.017248                              0.982814
      20      1         0.03331       0.94991       0.033622                              0.950066
      21      1.5       0.04726       0.92359       0.046940                              0.923430
      22      2         0.05604       0.90485       0.056192                              0.904926
      23      2.5       0.06330       0.89059       0.063348                              0.890614
      24      3         0.07073       0.87613       0.070618                              0.876074
      25      3.5       0.07811       0.86138       0.077994                              0.861322
      26      4         0.08214       0.85298       0.082160                              0.852990
      27      4.5       0.08782       0.84250       0.087488                              0.842334
that for each molecule of A that disappears, two molecules of B are created.
Hence the quantity $\alpha = 2s_1(t) + s_2(t)$ remains constant throughout any run.
We can compute its value from the known initial conditions $\alpha = 2s_1(0) + s_2(0)$
(see Table 8-3), and it does not depend on what values of $\theta$ we choose.
Suppose $\bar{\theta}$ is the true value of $\theta$. The observed concentrations are given
by

$$y_{\mu 1} = s_1(t_\mu, \bar{\theta}) + \varepsilon_{\mu 1}, \qquad y_{\mu 2} = s_2(t_\mu, \bar{\theta}) + \varepsilon_{\mu 2}$$

where $\varepsilon_{\mu 1}$ and $\varepsilon_{\mu 2}$ are errors, hopefully small. Hence

$$2y_{\mu 1} + y_{\mu 2} = 2s_1 + s_2 + 2\varepsilon_{\mu 1} + \varepsilon_{\mu 2} = \alpha + 2\varepsilon_{\mu 1} + \varepsilon_{\mu 2} \tag{8-8-1}$$

But also, for any trial value $\theta$

$$2s_1(t_\mu, \theta) + s_2(t_\mu, \theta) = \alpha \tag{8-8-2}$$

Subtracting Eq. (2) from Eq. (1) and remembering that
$e_\mu(\theta) \equiv y_\mu - s(t_\mu, \theta)$,
we obtain

$$2e_{\mu 1}(\theta) + e_{\mu 2}(\theta) = 2\varepsilon_{\mu 1} + \varepsilon_{\mu 2} \tag{8-8-3}$$

Unless $\theta$ is close to $\bar{\theta}$, the residuals $e_\mu(\theta)$ are large compared to the errors
$\varepsilon_\mu$, hence Eq. (3) takes the approximate form

$$e_{\mu 2}(\theta) \approx -2e_{\mu 1}(\theta) \qquad (\theta \neq \bar{\theta})$$

From this it follows that the moment matrix $M(\theta)$ is nearly singular

$$M(\theta) \approx \begin{bmatrix} \sum_\mu e_{\mu 1}^2 & -2\sum_\mu e_{\mu 1}^2 \\ -2\sum_\mu e_{\mu 1}^2 & 4\sum_\mu e_{\mu 1}^2 \end{bmatrix} \qquad (\theta \neq \bar{\theta})$$

Indeed, an attempt to estimate $\theta = [\theta_1, \theta_2, \theta_3, \theta_4]^T$ by minimizing
$\log \det M(\theta)$ fails when one starts with $\theta^1 = [2, 500, 0.5, 50]^T$. One simply
finds $\det M(\theta^1) = 0$, and no progress can be made. However, using the
results of Problem 8, Section 4-21, let us take $s_1$ as our sole state variable (if
we know $s_1$ we can always compute $s_2 = \alpha - 2s_1$). If we define $y_1' \equiv y_1$ and
$y_2' \equiv \alpha - y_2$, then we have the representation

$$y_1' = s_1, \qquad y_2' = 2s_1$$

or

$$y' = Bs_1 \tag{8-8-4}$$

where $B = [1, 2]^T$. Let us assume, further, that the measurement errors of
$y_{\mu 1}$ and $y_{\mu 2}$ are independent and have the same standard deviation $\sigma$. Since
$\alpha$ is a known constant, the error in $y_2'$ likewise has standard deviation $\sigma$.
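Hence, at each experiment we can obtain the least squares estimate of $s_{\mu 1}$
on the basis of the measured $y_{\mu 1}$ and $y_{\mu 2}$

$$\hat{s}_{\mu 1} = (B^T B)^{-1} B^T y_\mu' \tag{8-8-5}$$

or

$$\hat{s}_{\mu 1} = \left([1, 2]\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right)^{-1} [1, 2] \begin{bmatrix} y_1' \\ y_2' \end{bmatrix} = \tfrac{1}{5}y_1' + \tfrac{2}{5}y_2' = \tfrac{1}{5}y_{\mu 1} + \tfrac{2}{5}(\alpha - y_{\mu 2})$$

The values of $\hat{s}_{\mu 1}$ appear in Table 8-4, as do the values of $\hat{y}_\mu$ computed according
to

$$\hat{y}_{\mu 1} = \hat{s}_{\mu 1}, \qquad \hat{y}_{\mu 2} = \alpha - 2\hat{s}_{\mu 1} \tag{8-8-6}$$

The standard deviation of the measurement errors may be estimated
from

$$\hat{\sigma} = \left\{\frac{1}{27(2 - 1)} \sum_\mu \left[(y_{\mu 1} - \hat{y}_{\mu 1})^2 + (y_{\mu 2} - \hat{y}_{\mu 2})^2\right]\right\}^{1/2} = 0.0002876$$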
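
The projection of Eqs. (8-8-5) and (8-8-6) is easy to check against the first row of Table 8-4. The function name `project` below is illustrative and not part of the text; the data values are those of run 1, $\mu = 1$.

```python
def project(y1, y2, alpha):
    """Least squares estimate of s1 from y1' = y1 and y2' = alpha - y2.

    With B = [1, 2]^T we have B^T B = 5, so
    s1_hat = (y1' + 2*y2') / 5, as in Eq. (8-8-5).
    Returns (s1_hat, y2_hat) with y2_hat from Eq. (8-8-6).
    """
    y2_prime = alpha - y2
    s1_hat = (y1 + 2.0 * y2_prime) / 5.0
    return s1_hat, alpha - 2.0 * s1_hat

alpha_run1 = 3.01322                       # from Table 8-3
s1_hat, y2_hat = project(0.98040, 1.05134, alpha_run1)

# The fitted values satisfy the conservation relation exactly.
conserved = 2.0 * s1_hat + y2_hat
```

By construction the fitted pair $(\hat{y}_{\mu 1}, \hat{y}_{\mu 2})$ satisfies $2\hat{y}_{\mu 1} + \hat{y}_{\mu 2} = \alpha$, which the raw measurements do only approximately.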
The values $\hat{s}_{\mu 1}$ can now be used as "data" for estimating $\theta$. We do this
by minimizing

$$\Phi(\theta) = \sum_{\mu=1}^{27} [\hat{s}_{\mu 1} - s_1(t_\mu, \theta)]^2$$

We employ the Gauss method, and use penalty functions to keep all $\theta_\alpha$
positive. Starting from $\theta^1 = [2, 500, 0.5, 50]^T$ we arrive in 19 iterations at the
solution
l l.8393::t 0.05; 8515 1
0* = 1175.)6 ::t 19.5L9
2.29692:!: 0.344311
471.100 :!: 66,5655
(8-8-7)
The estimates of all parameters are fairly well-determined. The minimum sum
of squares is

$$\Phi(\theta^*) = 0.1353722 \times 10^{-4}$$

The estimate Eq. (7) should be sufficiently close to the true value $\bar{\theta}$ so as to make $M(\theta^*)$
nonsingular. Hence we can use $\theta^*$ as the initial guess for minimizing

$$\Phi(\theta) = (27/2) \log \det M(\theta) \tag{8-8-8}$$

Indeed, three iterations bring us to

$$\theta^{**} = \begin{bmatrix} 1.48555 \pm 0.0395848 \\ 1176.30 \pm 13.5234 \\ 2.31028 \pm 0.241886 \\ 473.880 \pm 46.4845 \end{bmatrix}$$
8-9. Problems
1. State variables are usually computed by solving finite difference equations
which are approximations to the differential equations. It is suggested
by Kelley and Denham (1969) that one ought to obtain exact derivatives of
the approximate $s$ (by differentiating the difference equations with respect to
$\theta$), rather than approximate derivatives of the exact $s$ (by solving the sensitivity
equations approximately with a finite difference method). Show that
with the Runge-Kutta method both approaches lead to precisely the same
results. Illustrate with the models of Eqs. (8-2-4) and (8-4-8).
2. Using the method suggested at the end of Section 8-3, solve the following
two point boundary value problem:
Find $s_1(0)$ for the system

$$\dot{s}_1 = s_1 - 2ts_1/(1 + s_2), \qquad \dot{s}_2 = 2\log[s_1(1 + s_2)], \qquad s_2(0) = 0, \quad s_2(1) = 1$$

Use the initial guess $s_1(0) = 1.5$.
3. Suppose the observed variables are measured continuously, so that a
record exists of $y(t)$ ($0 \le t \le T$). Let the objective function be

$$\Phi(\theta) = \int_0^T e^T(t, \theta)\,Q\,e(t, \theta)\,dt$$

where $Q$ is a given positive definite matrix. Derive the sensitivity equation,
i.e., a differential equation for $\partial\Phi/\partial\theta$. Apply a variable metric method to
finding the minimum of $\Phi$ for an example given by Bellman, Kagiwada, et
al. (1964): There are two parameters and one state variable
.s = -s + O,sJ,
$s(0) = \theta_2$.
The observed data are given as

$$y(t) = s(t) + 0.5 \cos 60t$$

where $s(t)$ is the solution of the differential equation with $\theta_1 = 1/30$ and
$\theta_2 = 1$. Use $T = 5$ and $Q = 1$, i.e.,

$$\Phi(\theta) = \int_0^5 [y(t) - s(t)]^2\,dt$$
4. The errors in measurements taken in the course of one physical run
are generally not independent. Using the theory of power spectra, one can
obtain expressions for the covariance between the errors in different measurements,
provided the differential equations are linear. Specifically, suppose

$$ds/dt = \theta s + \varepsilon(t), \qquad s(0) = s_0$$

where $\varepsilon(t)$ is a random noise with given power spectrum. Let $\bar{s}$ satisfy

$$d\bar{s}/dt = \theta\bar{s}, \qquad \bar{s}(0) = s_0$$

and let $u(t) = s(t) - \bar{s}(t)$ be the error function. Derive expressions for the
power spectrum of $u(t)$, and for the autocovariance function $V(t, \tau) =
E[u(t)u(\tau)]$. Generalize to the case of a vector of state variables, i.e., $ds/dt =
A(\theta)s + \varepsilon(t)$.
Chapter IX
Some Special Problems
9-1. Missing Observations
It is not uncommon to find that one is missing one or more data items,
i.e., elements of the matrix $W$. We distinguish between two situations, which
are illustrated by the following cases:
(a) A single equation model, with the objective function $\sum_{\mu=1}^{n} e_\mu^2(\theta)$.
The value of $x_{1,1}$ is missing. It is clear that we cannot do better than determine
$\theta^*$ so as to minimize $\sum_{\mu=2}^{n} e_\mu^2(\theta)$. The entire first experiment
contributes nothing to the estimation of $\theta$, and may be dropped from the
objective function.
Another example in which a term may be dropped occurs when $y_{1,1}$ is
missing and the objective function is

$$\sum_{\mu=1}^{n} \sum_{a=1}^{m} b_{\mu a}[y_{\mu a} - f_a(x_\mu, \theta)]^2$$

(b) On the other hand, with $y_{1,1}$ missing, suppose the objective function
has the form

$$\sum_{\mu=1}^{n} \sum_{a,b=1}^{m} B_{\mu ab}[y_{\mu a} - f_a(x_\mu, \theta)][y_{\mu b} - f_b(x_\mu, \theta)]$$

Now $y_{1,1}$ appears in several terms, which cannot all be dropped.
Similarly, with $x_{1,1}$ missing, let the model consist of two equations. The
values of $y_{1,1}$ and $y_{1,2}$ jointly do contain some information on $\theta$, since we
cannot in general solve the equations

$$y_{1,1} - f_1(x_{1,1}, \ldots, \theta^*) = 0, \qquad y_{1,2} - f_2(x_{1,1}, \ldots, \theta^*) = 0 \tag{9-1-1}$$

simultaneously for $x_{1,1}$. The first experiment residuals should not be dropped
from the objective function. Instead, the missing value $x_{1,1}$ or $y_{1,1}$ can be
regarded as an additional unknown parameter, whose value is to be determined
together with $\theta$ so as to optimize the objective function.
When several noneliminable items are missing, they can all be treated as
unknown parameters. However, from a practical computational point of
view only a few such parameters can be handled in this manner.
The following is a systematic approach:
1. Write down the objective function, with all missing data items represented
as unknown parameters.
2. Differentiate this expression with respect to all the missing-data
parameters, and equate the derivatives to zero.
3. Solve the equations thus formed for the missing-data parameters.
4. Substitute the solutions in the objective function for those parameters
where the substitution results in a simplification of the expression. Retain
other unknown parameters in the objective function.
Examples:

1. $\Phi(\theta) = \sum_{\mu=1}^{n} e_\mu^2(\theta)$, $x_{1,1}$ unknown.

$$\partial\Phi/\partial x_{1,1} = -2e_1(\theta)\,\partial f_1/\partial x_{1,1} = 0 \quad \therefore \quad e_1(\theta) = 0$$

Substituting in $\Phi(\theta)$, we find $\Phi(\theta) = \sum_{\mu=2}^{n} e_\mu^2(\theta)$, i.e., we drop the first
term.

2. $\Phi(\theta) = \sum_{\mu=1}^{n} \sum_{a,b=1}^{m} B_{\mu ab}\,e_{\mu a}e_{\mu b}$, $y_{1,1}$ missing.

$$\partial\Phi/\partial y_{1,1} = 2\sum_{a=1}^{m} B_{1,a,1}\,e_{1,a} = 0 \quad \therefore \quad e_{1,1} = -B_{1,1,1}^{-1} \sum_{a=2}^{m} B_{1,a,1}\,e_{1,a}$$

Substitution of this expression in $\Phi(\theta)$ only complicates matters, so we
may as well retain $y_{1,1}$ as an unknown parameter. Incidentally, the same
result would be obtained if $e_{1,1}$ is made an unknown parameter, and then
$y_{1,1}$ can be computed from $y_{1,1} = e_{1,1}^* + f_1(x_1, \theta^*)$, where $e_{1,1}^*$ and $\theta^*$ are
the estimated values.
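
Example 1 can be illustrated numerically. The sketch below assumes a hypothetical one-parameter model $f(x, \theta) = \theta x$ and made-up data, neither of which appears in the text; it shows that treating the missing $x_{1,1}$ as a parameter gives the same estimate and the same minimum as dropping the first term.

```python
# Hypothetical data for the model f(x, theta) = theta * x; xs[0] is missing.
xs = [None, 1.0, 2.0, 3.0]
ys = [2.1, 1.9, 4.2, 5.9]

# Dropping the first term: ordinary least squares on observations 2..n.
sxy = sum(x * y for x, y in zip(xs[1:], ys[1:]))
sxx = sum(x * x for x in xs[1:])
theta_star = sxy / sxx

# Treating x_{1,1} as a parameter: dPhi/dx = -2*e1*theta = 0 forces e1 = 0,
# which for this model means x_{1,1} = y_1 / theta.
x1_star = ys[0] / theta_star

def phi(theta, x1):
    """Full sum of squares with the missing x filled in by x1."""
    full_x = [x1] + xs[1:]
    return sum((y - theta * x) ** 2 for x, y in zip(full_x, ys))

phi_full = phi(theta_star, x1_star)
phi_dropped = sum((y - theta_star * x) ** 2 for x, y in zip(xs[1:], ys[1:]))
```

In this single-equation case the extra parameter absorbs the first residual completely, which is why the first experiment carries no information on $\theta$.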
A survey of the literature on the problem of missing observations is given
by Afifi and Elashoff (1966), but most of the reported results pertain only to
multiple linear regression.
Numerical illustrations appear in Sections 9-6 and 9-7.
9-2. Inhomogeneous Covariance
Most of the estimation formulas were derived on the assumption that
the covariance matrices $V_\mu$ ($\mu = 1, 2, \ldots, n$) of the errors in the $\mu$th experiment
were all equal to a fixed (though possibly unknown) matrix $V$. The
modifications required when the $V_\mu$ differ from each other are trivial, provided
the manner of the variation is known. If the $V_\mu$ vary in an entirely
unknown manner, nothing much can be done; we simply cannot estimate a
variance from a single observation.
The following three cases can be treated easily:
(a) Suppose the following holds

$$V_\mu = A_\mu V A_\mu^T \tag{9-2-1}$$

where $V$ is an $m \times m$ positive definite known or unknown matrix, and the
$A_\mu$ are known $m \times m$ nonsingular matrices. This includes the case where
the $V_\mu$ are known matrices, since then $V = I$ and there exist $A_\mu$ such that $V_\mu =
A_\mu A_\mu^T$. In the single equation case, Eq. (1) amounts to

$$\sigma_\mu^2 = a_\mu^2 \sigma^2 \tag{9-2-2}$$

where the $a_\mu$ are known constants. For a normal distribution, the
objective function takes the form

$$\Phi(\theta, V) = \sum_{\mu=1}^{n} \log \det A_\mu + (n/2) \log \det V + \tfrac{1}{2} \sum_{\mu=1}^{n} e_\mu^T (A_\mu^T)^{-1} V^{-1} A_\mu^{-1} e_\mu \tag{9-2-3}$$

We may drop the first term on the right-hand side, which is a constant.
Let us redefine our model equations so that instead of $y_\mu = f(x_\mu, \theta)$ we
write $A_\mu^{-1} y_\mu = A_\mu^{-1} f(x_\mu, \theta)$. We obtain new residuals

$$\tilde{e}_\mu \equiv A_\mu^{-1} e_\mu = A_\mu^{-1} y_\mu - A_\mu^{-1} f(x_\mu, \theta) \tag{9-2-4}$$

The objective function now becomes

$$\Phi(\theta, V) = (n/2) \log \det V + \tfrac{1}{2} \sum_{\mu=1}^{n} \tilde{e}_\mu^T V^{-1} \tilde{e}_\mu \tag{9-2-5}$$

which is identical to the expressions derived in Chapter IV.
Example. Suppose, in a single-equation model, the standard deviation is
proportional to the magnitude of the measurement, that is

$$\sigma_\mu^2 = y_\mu^2 \sigma^2 \tag{9-2-6}$$

The redefined residuals are

$$\tilde{e}_\mu = (1/y_\mu)[y_\mu - f(x_\mu, \theta)] = 1 - (1/y_\mu) f(x_\mu, \theta) \tag{9-2-7}$$

These have variance $\sigma^2$. If instead the errors are assumed proportional
to the true values of the measured variables, we should use

$$\tilde{e}_\mu = y_\mu / f(x_\mu, \theta) - 1 \tag{9-2-8}$$

Eq. (7) is easier to deal with, and the error committed in using it in place of
Eq. (8) is likely to be small.
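
To see how close the two scaled residuals are, one can evaluate both for a single observation. The sketch below uses a made-up measurement and prediction; the function names are illustrative, and the sign conventions follow Eqs. (9-2-7) and (9-2-8) as written above.

```python
def residual_measured_scale(y, f):
    """Eq. (9-2-7): error taken proportional to the measured value y."""
    return 1.0 - f / y

def residual_true_scale(y, f):
    """Eq. (9-2-8): error taken proportional to the model value f."""
    return y / f - 1.0

y, f = 10.2, 10.0      # hypothetical measurement and model prediction
e7 = residual_measured_scale(y, f)
e8 = residual_true_scale(y, f)

# The two are related by e7 = e8 / (1 + e8), so they differ only
# in second order when the relative error is small.
difference = e8 - e7
```

For a 2% relative error the two residuals differ by roughly 0.04%, which is consistent with the remark that the error committed is likely to be small.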
(b) Suppose we know that experiments $\mu = 1, 2, \ldots, n_1$ have the unknown
covariance matrix $V_1$, experiments $\mu = n_1 + 1, n_1 + 2, \ldots, n_1 + n_2$ the
matrix $V_2$, and so on. The objective function has the form

$$\Phi(\theta, V_1, V_2, \ldots) = (n_1/2) \log \det V_1 + (n_2/2) \log \det V_2 + \cdots + \tfrac{1}{2}\,\mathrm{Tr}(V_1^{-1} M_1) + \tfrac{1}{2}\,\mathrm{Tr}(V_2^{-1} M_2) + \cdots \tag{9-2-9}$$

where $M_i$ is the moment matrix of the residuals in the experiments with
covariance $V_i$. Proceeding as in Section 4-9, we differentiate with respect to
$V_i$ and obtain eventually

$$V_i = (1/n_i) M_i(\theta) \tag{9-2-10}$$

and the concentrated objective function becomes

$$\Phi(\theta) = \tfrac{1}{2} \sum_i n_i \log \det M_i(\theta) \tag{9-2-11}$$

Should there be some experiments with known covariances, the objective
function would be

$$\Phi(\theta) = \tfrac{1}{2} \sum_i n_i \log \det M_i(\theta) + \tfrac{1}{2} {\sum_\mu}' e_\mu^T(\theta) V_\mu^{-1} e_\mu(\theta) \tag{9-2-12}$$

where the summation $\sum'$ is extended only over those experiments.
The minimum number of required experiments given in Section 4-12
now applies to each $n_i$ separately, since Eq. (11) cannot be used unless none
of the $M_i$ are singular. Hence we must usually have

$$n_i \ge \max(l + 1, m) \qquad (i = 1, 2, \ldots) \tag{9-2-13}$$

Another problem that can be solved by means of the maximum likelihood
method is one in which the covariance matrix varies regularly as some
function of the independent variables. These functions may depend on unknown
parameters, which can be estimated. The case where both the model
equation and the standard deviation of the errors are linear functions of the
independent variables is treated by Rutemiller and Bowers (1968).
(c) Known Serial Correlations. Correlations between errors in different experiments
are called serial correlations. Suppose, in a single-equation model,
the covariance matrix of all errors is given by $R$, where

$$R_{\mu\eta} = E(\varepsilon_\mu \varepsilon_\eta) \tag{9-2-14}$$

The likelihood takes the form

$$\log L(\theta) = -(n/2) \log 2\pi - \tfrac{1}{2} \log \det R - \tfrac{1}{2} \sum_{\mu,\eta=1}^{n} [R^{-1}]_{\mu\eta}\,e_\mu(\theta) e_\eta(\theta) \tag{9-2-15}$$

If $R$ is known, we can find a matrix $S$ such that $SS^T = R$. Defining a new set
of residuals

$$\tilde{e}_\eta(\theta) \equiv \sum_{\mu=1}^{n} [S^{-1}]_{\eta\mu}\,e_\mu(\theta) \tag{9-2-16}$$

we find that maximizing Eq. (15) is equivalent to minimizing the sum of
squares of the $\tilde{e}_\eta(\theta)$.
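
One convenient choice of $S$ is the Cholesky factor of $R$, although the text does not prescribe any particular factorization. The following sketch, with a made-up $2 \times 2$ covariance and made-up residuals, checks that the sum of squares of the transformed residuals equals the quadratic form $e^T R^{-1} e$ of Eq. (9-2-15).

```python
import math

def cholesky(R):
    """Lower-triangular S with S S^T = R (R symmetric positive definite)."""
    n = len(R)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = R[i][j] - sum(S[i][k] * S[j][k] for k in range(j))
            S[i][j] = math.sqrt(acc) if i == j else acc / S[j][j]
    return S

def whiten(S, e):
    """Solve S z = e by forward substitution, i.e. z = S^{-1} e."""
    z = []
    for i, row in enumerate(S):
        z.append((e[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

R = [[1.0, 0.5], [0.5, 1.0]]      # hypothetical known error covariance
e = [0.3, -0.1]                   # hypothetical residuals
z = whiten(cholesky(R), e)
sum_sq = sum(v * v for v in z)

# Direct evaluation of e^T R^{-1} e for this 2 x 2 case.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
quad = (R[1][1] * e[0] ** 2 - 2.0 * R[0][1] * e[0] * e[1]
        + R[0][0] * e[1] ** 2) / det
```

Solving the triangular system avoids forming $S^{-1}$ explicitly, which matters when $n$ is large.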
The estimation of $R$ when the serial correlations are unknown is relatively
difficult, and we shall not consider this problem here. We observe, however,
that residuals almost always show serial correlations even when the errors
possess none. For instance, in the case of the linear model Eq. (4-4-2) with
$V = \sigma^2 I$, we find that the covariance matrix of the residuals is given by
$V_r = \sigma^2[I - B(B^T B)^{-1} B^T]$. Thus while $V$ is diagonal (no serial correlations),
$V_r$ is nondiagonal. Furthermore, $V_r$ is singular since clearly $V_r B = 0$.
9-3. Sequential Reestimation
Suppose a series of experiments is being conducted, and we wish to reestimate
the parameters as the results of each experiment come in.
Many of the objective functions that we have studied consist of a sum of
terms, each containing the results of a single experiment. Examples are sums
of squares, weighted sums of squares, and log-likelihood functions with
known covariance matrices. In such a case, let us denote by $\Phi_\mu(\theta)$ the term
corresponding to the $\mu$th experiment, and by $\Phi^{(n)}(\theta)$ the objective function
for $n$ experiments. Then

$$\Phi^{(n)}(\theta) \equiv \sum_{\mu=1}^{n} \Phi_\mu(\theta) \tag{9-3-1}$$

It follows that

$$\Phi^{(n+1)}(\theta) = \Phi^{(n)}(\theta) + \Phi_{n+1}(\theta) \tag{9-3-2}$$

If we had estimated the parameters after the $n$th experiment, we would have
found $\theta^{(n)}$ which minimizes $\Phi^{(n)}$. We have also obtained the matrix $H^{(n)}$
which is an approximation to the Hessian of $\Phi^{(n)}$ at $\theta = \theta^{(n)}$. A Taylor series
approximation to $\Phi^{(n)}$ in the neighborhood of $\theta^{(n)}$ is given by

$$\Phi^{(n)}(\theta) \approx \Phi^{(n)}(\theta^{(n)}) + \tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)}) \tag{9-3-3}$$

When the results of the $(n+1)$st experiment are available, we wish to find
$\theta^{(n+1)}$ which minimizes $\Phi^{(n+1)}$. It is reasonable to expect that $\theta^{(n+1)}$ will not
differ very much from $\theta^{(n)}$, so that the approximation Eq. (3) is valid, and
may be substituted in Eq. (2). Instead of minimizing $\Phi^{(n+1)}$ we may
minimize

$$\tilde{\Phi}^{(n+1)}(\theta) \equiv \tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)}) + \Phi_{n+1}(\theta) \tag{9-3-4}$$

The function $\tilde{\Phi}^{(n+1)}$ is much simpler and easier to calculate than $\Phi^{(n+1)}$,
so that a great deal of computer time may be saved by this substitution. Of
course, if it turns out that $\tilde{\Phi}^{(n+1)}$ is minimized at a point so far removed from
$\theta^{(n)}$ that Eq. (3) cannot be accepted, then we may have to revert to $\Phi^{(n+1)}$.
We use $\theta^{(n)}$ as the initial guess for the minimization of $\tilde{\Phi}^{(n+1)}$, and the result
of that minimization as the initial guess for minimizing $\Phi^{(n+1)}$ if required.
A single iteration may suffice for the latter.
If $\Phi_\mu(\theta)$ is the negative log-likelihood for the $\mu$th experiment, then Eq. (4)
may be regarded as the logarithm of a posterior density function, in which
$\tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)})$ plays the part of a log prior density. It corresponds
to a normal distribution of the parameters $\theta$ with mean $\theta^{(n)}$ and covariance
matrix $(H^{(n)})^{-1}$. This accords well with the fact (established in Chapter VII)
that $(H^{(n)})^{-1}$ is an approximation to the covariance matrix of the estimate
$\theta^{(n)}$, and that $\tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)})$ is approximately (apart from an additive
constant) the logarithm of the posterior distribution after $n$ experiments.
Minimizing $\tilde{\Phi}^{(n+1)}$ corresponds to finding the mode of the posterior distribution
after $n + 1$ experiments, where the posterior distribution after $n$
experiments is taken as the prior distribution.
These results may be extended to the case where the parameters are to be
reestimated only after $\nu$ additional experiments have been completed. Here

$$\Phi^{(n+\nu)}(\theta) = \Phi^{(n)}(\theta) + \sum_{p=1}^{\nu} \Phi_{n+p}(\theta) \approx \tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)}) + \sum_{p=1}^{\nu} \Phi_{n+p}(\theta) \tag{9-3-5}$$

Sequential estimation procedures are of particular interest when the computer
designs, controls, and analyzes the results of experiments on line (see
Chapter X).
A numerical illustration appears in Section 9-8.
9-4. Computational Aspects
Let us consider a single new observation in the single equation least
squares case. Here

$$\Phi^{(n)}(\theta) \approx \Phi^{(n)}(\theta^{(n)}) + (\theta - \theta^{(n)})^T A_n (\theta - \theta^{(n)}) \tag{9-4-1}$$

where

$$A_n \equiv \sum_{\mu=1}^{n} b_\mu b_\mu^T, \qquad b_\mu \equiv \partial f_\mu/\partial\theta \tag{9-4-2}$$

At the same time

$$\Phi_{n+1}(\theta) = [y_{n+1} - f(x_{n+1}, \theta)]^2 = e_{n+1}^2(\theta) \tag{9-4-3}$$

Now, for $\theta$ close to $\theta^{(n)}$, we have approximately

$$f(x_{n+1}, \theta) = f(x_{n+1}, \theta^{(n)}) + b_{n+1}^T(\theta - \theta^{(n)})$$

so that the Gauss approximation to Eq. (3) is

$$\Phi_{n+1}(\theta) \approx e_{n+1}^2(\theta^{(n)}) - 2e_{n+1}(\theta^{(n)})\,b_{n+1}^T(\theta - \theta^{(n)}) + [(\theta - \theta^{(n)})^T b_{n+1}]^2 \tag{9-4-4}$$

and Eq. (1) becomes, after dropping constant terms

$$\Phi^{(n+1)}(\theta) \approx (\theta - \theta^{(n)})^T A_{n+1} (\theta - \theta^{(n)}) - 2e_{n+1}(\theta^{(n)})\,b_{n+1}^T(\theta - \theta^{(n)}) \tag{9-4-5}$$

where

$$A_{n+1} = A_n + b_{n+1} b_{n+1}^T \tag{9-4-6}$$

The minimum of Eq. (5) is easily seen to occur at

$$\theta^{(n+1)} \equiv \theta^{(n)} + A_{n+1}^{-1} b_{n+1}\,e_{n+1}(\theta^{(n)}) \tag{9-4-7}$$

Having computed $\theta^{(n)}$ and $A_n$ after $n$ experiments, the updating procedure
after the $(n + 1)$st experiment may be summarized as follows:
1. Compute $e_{n+1}(\theta^{(n)}) = y_{n+1} - f(x_{n+1}, \theta^{(n)})$.
2. Compute $b_{n+1} = \partial f(x_{n+1}, \theta)/\partial\theta\,|_{\theta = \theta^{(n)}}$.
3. Compute $A_{n+1}$ (Eq. 6).
4. Compute $A_{n+1}^{-1}$ (see below).
5. Use Eq. (7) to estimate $\theta^{(n+1)}$.
Step 4 requires elaboration. One does not wish to invert an $l \times l$ matrix
at each step. Fortunately, this is not necessary. Suppose that $A_n^{-1}$ has already
been computed. Define:

$$a_{n+1} = A_n^{-1} b_{n+1} \tag{9-4-8}$$

$$\beta_{n+1} = b_{n+1}^T a_{n+1} \tag{9-4-9}$$

Then, as may be verified by multiplying Eq. (6) by Eq. (10)

$$A_{n+1}^{-1} = A_n^{-1} - a_{n+1} a_{n+1}^T/(1 + \beta_{n+1}) \tag{9-4-10}$$

Thus $A_{n+1}^{-1}$ can be calculated without explicit reinversion. Somewhat more
complicated inverse updating formulas are given by Powell (1969); these
help reduce the accumulation of rounding errors.
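
The five-step procedure can be sketched in a few lines. The example below assumes a hypothetical linear model $f(x, \theta) = \theta_1 + \theta_2 x$ with made-up data (so that $b = [1, x]^T$ and the Gauss approximation is exact); the function names are illustrative. Starting from an exact fit to two points, it absorbs two further observations one at a time.

```python
def mat_vec(M, v):
    """Product of a square matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sequential_update(theta, A_inv, b, e):
    """Return (theta_new, A_inv_new) after one new observation.

    b is the gradient b_{n+1}; e is the residual e_{n+1}(theta^(n)).
    """
    a = mat_vec(A_inv, b)                          # Eq. (9-4-8)
    beta = sum(bi * ai for bi, ai in zip(b, a))    # Eq. (9-4-9)
    A_inv_new = [[A_inv[i][j] - a[i] * a[j] / (1.0 + beta)
                  for j in range(len(b))]
                 for i in range(len(b))]           # Eq. (9-4-10)
    step = mat_vec(A_inv_new, b)
    theta_new = [t + s * e for t, s in zip(theta, step)]   # Eq. (9-4-7)
    return theta_new, A_inv_new

# Hypothetical linear model f(x, theta) = theta_1 + theta_2 * x, so b = [1, x].
theta = [1.0, 2.1]                     # exact fit to (0, 1.0) and (1, 3.1)
A_inv = [[1.0, -1.0], [-1.0, 2.0]]     # inverse of A_2 = [[2, 1], [1, 1]]
for x, y in [(2.0, 4.9), (3.0, 7.2)]:
    b = [1.0, x]
    e = y - (theta[0] + theta[1] * x)  # step 1
    theta, A_inv = sequential_update(theta, A_inv, b, e)
```

For a model that is linear in the parameters the recursion reproduces the batch least squares estimate on all four points; for nonlinear models it is only the approximation of Eq. (9-4-5).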
9-5. Stochastic Approximation
When reestimations have to be performed at a very rapid pace, even the
formulas of the preceding section may be too cumbersome. It is the aim of
stochastic approximation methods to introduce further simplifications.
Equation (9-4-7) is a special case of the general stochastic approximation
formula

$$\theta^{(n+1)} = \theta^{(n)} + c_n\,e_{n+1}(\theta^{(n)}) \tag{9-5-1}$$

where $c_n$ is some suitably chosen vector. This formula states that the correction
to be applied to $\theta^{(n)}$ is proportional to $e_{n+1}(\theta^{(n)})$, i.e., to the error incurred
in predicting $y_{n+1}$ given $x_{n+1}$ and $\theta^{(n)}$. In Eq. (9-4-7) we used

$$c_n = A_{n+1}^{-1} b_{n+1}$$

Sometimes it is preferable to multiply this value by some positive constant
less than one. The following variations represent progressive simplifications:
1. The procedure of Section 9-4 can only be started after enough observations
have been accumulated to make $A_n$ nonsingular (at least $n = l$). Instead,
we can start with an arbitrarily chosen positive definite $A_0$.
2. Use

$$c_n = b_{n+1} \Big/ \sum_{\mu=1}^{n+1} b_\mu^T b_\mu \tag{9-5-2}$$

These methods, and some others, are discussed in detail by Albert
and Gardner (1967).
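
The simplified gain of Eq. (9-5-2) needs only a running sum of squared gradient norms, with no matrix storage at all. The sketch below assumes the same hypothetical linear model $f = \theta_1 + \theta_2 x$ and noise-free made-up data; `sa_step` is an illustrative name, not one from the text.

```python
def sa_step(theta, b, e, norm_sum):
    """theta <- theta + c*e with c = b / sum_mu(b_mu^T b_mu), Eq. (9-5-2)."""
    norm_sum += sum(bi * bi for bi in b)          # include the new b_{n+1}
    theta = [t + bi * e / norm_sum for t, bi in zip(theta, b)]
    return theta, norm_sum

true_theta = [1.0, 2.0]            # hypothetical "true" parameters
theta, norm_sum = [0.0, 0.0], 0.0
errors = []
for x in [0.0, 1.0, 2.0, 3.0] * 50:
    b = [1.0, x]                   # gradient of f = theta_1 + theta_2 * x
    y = true_theta[0] + true_theta[1] * x       # noise-free observation
    e = y - (theta[0] * b[0] + theta[1] * b[1])
    theta, norm_sum = sa_step(theta, b, e, norm_sum)
    # squared distance from the true parameters after this step
    errors.append(sum((t - s) ** 2 for t, s in zip(theta, true_theta)))
```

With noise-free data each step can only shrink (or leave unchanged) the distance to the true parameters, but because the gain decays like $1/n$ the approach is much slower than with the full gain $A_{n+1}^{-1} b_{n+1}$; that is the price of the simplification.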
9-6. A Missing Data Problem
We return to case (a) of the two equation maximum likelihood problem
of Section 5-23. We assume, however, that measurements on $z_{1,1}$ and $z_{2,3}$
are missing. It is clear that relevant data still remain in data points $\mu = 1$ and
2, and these should not be discarded. Instead, we treat $z_{1,1}$ as an unknown
parameter $\theta_6$, and $z_{2,3}$ as an unknown parameter $\theta_7$. The first equation of
Eq. (5-23-6) now takes the form
0= -0 6 + J;tx,,, 0)
for $\mu = 1$ only, and Eq. (5-23-7) becomes
a = 0 1 + 02X/11 + 0 3 log 0 7 + 0 4 log {Os + exp[(x/13/(1 + 04)]}
for $\mu = 2$ only. The model equations for $\mu = 3, 4, \ldots, 41$ remain unchanged.
The matrices $B_\mu = \partial f/\partial\theta$ have zero sixth and seventh columns for $\mu = 3, 4,
\ldots, 41$. For $\mu = 1$ the sixth row is $[-1, 0]$ and the seventh row is zero; for
J1 = 2, the sixth row is zero and the seventh row is [(0 3 /0 7 )12, I' (0 3 /0 7 )/2,2]'
As initial guesses for $\theta_6$ and $\theta_7$ we take 1.3 and 0.4 respectively, values which
are reasonable in view of Table 5-6.
The estimate, obtained by means of the Marquardt method, is

$$\theta^* = [-0.00551000,\; -0.0115201,\; 0.789967,\; 0.939189,\; 0.835258,\; 1.51783,\; 0.417195]^T$$

The estimate for $\theta_6$, i.e., $z_{1,1}$, differs considerably from the known value
1.33135. However, this value leads to a residual of $-0.22944$ (Table 7-3),
whereas the present estimate has a residual of only $-0.0479336$. The estimate
$\theta_7$ is quite close to the true value 0.4084 of $z_{2,3}$.
The parameters $\theta_1, \theta_2, \ldots, \theta_5$ can be converted into $c$ as usual

$$c^* = [0.5397926,\; 0.006333321,\; 1.265875,\; 1.064749,\; 0.5918664]^T$$

The covariance of $\theta^*$ can be obtained as usual; the marginal covariance of
$\theta_1^*, \theta_2^*, \ldots, \theta_5^*$ consists of the $5 \times 5$ upper left hand corner of the $7 \times 7$
matrix $V_\theta$, and $V_c$ can be computed from it as usual.
We wish to see how much information has been lost due to our ignorance
of $z_{1,1}$ and $z_{2,3}$ and also how much more would be lost if we dropped the
first two observations completely. For this purpose we also obtained the
estimate based on observations $\mu = 3, 4, \ldots, 41$. We computed $\det V_c$ for
all three cases; this quantity is proportional to the square of the volume of
any confidence ellipsoid, and is also a measure of the uncertainty in the
sampling distribution (see Section 10-2). The results are given in Table 9-1.
We see that the more data are lost, the more uncertain are our estimates.
Table 9-1
Comparison of Information in Data

Case                                                        $\det V_c$
41 full observations                                        $0.246678 \times 10^{-20}$
41 observations, $z_{1,1}$ and $z_{2,3}$ missing            $0.291663 \times 10^{-20}$
39 observations                                             $0.384105 \times 10^{-20}$
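
Since $\det V_c$ is proportional to the square of the confidence ellipsoid volume, the square root of the ratio of two determinants gives the relative volume. The short sketch below applies this to the three entries of Table 9-1; the dictionary keys are merely descriptive labels.

```python
import math

det_vc = {
    "41 full observations": 0.246678e-20,
    "41 observations, two values missing": 0.291663e-20,
    "39 observations": 0.384105e-20,
}
base = det_vc["41 full observations"]

# det V_c is proportional to the squared volume, so take the square root
# of the ratio to compare confidence ellipsoid volumes.
relative_volume = {case: math.sqrt(d / base) for case, d in det_vc.items()}
```

On this measure, retaining the two incomplete observations costs roughly a 9% increase in ellipsoid volume, against roughly 25% when they are discarded.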
9-7. Further Problem with Missing Data
Galambos and Cornell (1962) supply data on the observed proportions
$y_{\mu 1}$ and $y_{\mu 2}$ of radioactive tracer in two human body compartments at times
$x_\mu$ after injection. These data are presented in Table 9-2. The value $y_{1,1}$ is
missing. The model equations are

$$y_1 = \theta_1 \exp(-\theta_2 x) + (1 - \theta_1) \exp(-\theta_3 x) \tag{9-7-1}$$

$$y_2 = 1 - \frac{\theta_1\theta_3}{\theta_1(\theta_3 - \theta_2) + \theta_2} \exp(-\theta_2 x) + \left[\frac{\theta_1\theta_3}{\theta_1(\theta_3 - \theta_2) + \theta_2} - 1\right] \exp(-\theta_3 x) \tag{9-7-2}$$

Table 9-2
Data for Radioactive Tracer Problem

                        Proportion radioactive tracer
$\mu$   Time $x_\mu$ (hr)   Compartment 1, $y_{\mu 1}$   Compartment 2, $y_{\mu 2}$
1       0.33                missing                      0.03
2       2                   0.84                         0.10
3       3                   0.79                         0.14
4       5                   0.64                         0.21
5       8                   0.55                         0.30
6       12                  0.44                         0.40
7       24                  0.27                         0.54
8       46                  0.12                         0.66
9       72                  0.06                         0.71
Beauchamp and Cornell (1966) used the following method to estimate $\theta$:
First, least squares estimates were obtained for $\theta$ using the $y_1$ data alone,
giving

$$\theta^{(1)} = \begin{bmatrix} 0.555524 \pm 0.072741 \\ 0.0314238 \pm 0.0038325 \\ 0.171109 \pm 0.027847 \end{bmatrix}$$

Corresponding to this estimate is a minimum sum of squares of residuals
$= 0.0009510273$, a variance of $0.0009510273/(8 - 3) = 0.0001902054$, and a
standard deviation $\sigma_1 = 0.01379$. The same procedure was applied to the
$y_2$ data alone, yielding

$$\theta^{(2)} = \begin{bmatrix} 0.0606528 \pm 0.0124113 \\ 0.00680764 \pm 0.00107404 \\ 0.09316 \pm 0.00593128 \end{bmatrix}$$

with residuals having a variance of $0.0002860781/(9 - 3) = 0.0000476797$
and standard deviation $\sigma_2 = 0.006905$. The residuals of each equation are
close to the rounding error of the data. The two estimates of $\theta$, however, are
so far apart (measured by the scale of their standard deviations) as to cast
doubt on the hypothesis that the same parameters apply to both equations.
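
The variance figures, and the claim that the two estimates are far apart, can be checked with a few lines of arithmetic. In the sketch below `residual_std` and `separation` are illustrative helpers; in particular, the separation measure (difference divided by the root sum of squared standard deviations) is an assumption made here for illustration, not a statistic used in the text.

```python
import math

def residual_std(sum_sq, n_obs, n_params):
    """Variance and standard deviation with n_obs - n_params degrees of freedom."""
    variance = sum_sq / (n_obs - n_params)
    return variance, math.sqrt(variance)

var1, sd1 = residual_std(0.0009510273, 8, 3)   # y1 fit (first value missing)
var2, sd2 = residual_std(0.0002860781, 9, 3)   # y2 fit

def separation(a, sd_a, b, sd_b):
    """Difference between two estimates in units of their combined std. dev."""
    return abs(a - b) / math.sqrt(sd_a ** 2 + sd_b ** 2)

# First parameter of the two separate fits.
z_theta1 = separation(0.555524, 0.072741, 0.0606528, 0.0124113)
```

For $\theta_1$ the two fits differ by more than six combined standard deviations, which is the sense in which the estimates are "far apart".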
Nevertheless, let us proceed with the joint fitting of the two equations.
Beauchamp and Cornell compute the residuals from the two separate fits,
and form their covariance matrix (neglecting the first observation on $y_2$)
without compensating for degrees of freedom. They quote the matrix as
being

$$V = \begin{bmatrix} 0.1189 & 0.009753 \\ 0.009753 & 0.03179 \end{bmatrix} \times 10^{-3} \tag{9-7-3}$$

They now use the inverse of this matrix as a weight in the objective function

$$\sum_{\mu=2}^{9} e_\mu^T(\theta)\,V^{-1} e_\mu(\theta)$$

The minimum occurs at

$$\theta^* = [0.06751,\; 0.00706,\; 0.08393]^T$$
We shall now proceed to calculate an estimate based on the method used
in the preceding section. We let $\theta_+$ denote the missing value of $y_{1,1}$, and we
assume $V$ unknown. Our objective function then is

$$\Phi(\theta) = (9/2) \log \det \sum_{\mu=1}^{9} e_\mu(\theta) e_\mu^T(\theta)$$

where

$$e_{11}(\theta) = \theta_+ - \theta_1 \exp(-\theta_2 x_1) - (1 - \theta_1) \exp(-\theta_3 x_1)$$

and all other residuals are defined as usual. Using Beauchamp and Cornell's
initial guess (with a value for $\theta_+$ appended)

$$\theta^1 = (0.381,\; 0.21,\; 0.197,\; 1)^T$$

the Gauss method (with nonnegativity constraints, using penalty functions)
converged to

$$\theta^* = [0.0782549,\; 0.00792904,\; 0.0975048,\; 1.04852]^T \tag{9-7-4}$$

This result is unacceptable since it requires $y_{1,1} = 1.04852$, but no value of $y$
can exceed one. In fact, we must have $y_{\mu 1} + y_{\mu 2} \le 1$; therefore, since $y_{1,2} =
0.03$, we impose the additional constraint $\theta_+ \le 0.97$.† The result this time is

$$\theta^* = \begin{bmatrix} 0.0358913 \pm 0.0073400 \\ 0.00463630 \pm 0.0007897 \\ 0.0812979 \pm 0.0037664 \\ 0.910611 \pm 0.013837 \end{bmatrix} \tag{9-7-5}$$

Note that this is an interior minimum ($\theta_+$ is below its constraint). Curiously,
the objective function attains a lower value at Eq. (5) than at Eq. (4), indicating
that the latter is only a local minimum, and that Eq. (5) is the proper
estimate even in the absence of the constraint on $\theta_+$.
Our estimate is quite well determined, and significantly different from
Beauchamp and Cornell's estimate. This may be accounted for by the fact
that the estimated covariance of residuals corresponding to Eq. (5)

$$V = \begin{bmatrix} 3.42423 & -0.565601 \\ -0.565601 & 0.118719 \end{bmatrix} \times 10^{-3}$$

is very different and much larger than Eq. (3). In other words, the combined
fit attainable for both equations is much worse than the fit obtainable for
each equation separately. The residuals found in fitting the individual
equations are very poor measures for the errors in the simultaneous fit. If,
however, the residuals from the joint fit, which have standard deviations of
$\sigma_1 = 0.0582$ and $\sigma_2 = 0.01090$, are still considered not in excess of experimental
error, then we have no compelling reason for rejecting the joint model
even though the separate models give much better fits.
9-8. A Sequential Reestimation Problem
In Section 5-21 we solved a single equation least squares problem. On
the basis of fifteen observations we found:
$$\Phi^* = 0.03980599, \qquad \theta^* = \begin{bmatrix} 813.4583 \\ 960.9063 \end{bmatrix}, \qquad N^* = \begin{bmatrix} 0.271890 \times 10^{-5} & -0.957336 \times 10^{-5} \\ -0.957336 \times 10^{-5} & 3.50371 \times 10^{-5} \end{bmatrix}$$
† Since 0.03 is not an exact value, we really should have used the constraint
8 1 8, [ 8,8, ]
8..+1- exp(-8,x)+ -I cxp(-8,x)<:I
8 1 (8, - 8 2 ) + 8, 8,(8, - 8 2 ) + 8 2
but in practice, the simpler constraint suffices.
Therefore, the objective function Eq. (5-21-6) has the approximate representation
Eq. (7-21-1). Suppose four additional observations were made. Our
new objective function could be written approximately as
$$\begin{aligned}
\Phi^{(19)}(\theta) &= \Phi^{(15)}(\theta) + \sum_{\mu=16}^{19} e_\mu^2(\theta) \\
&\approx 0.03980599 + (10^{-5}/2)\,[0.271890(\theta_1 - 813.4583)^2 - 1.914672(\theta_1 - 813.4583)(\theta_2 - 960.9063) \\
&\qquad + 3.50371(\theta_2 - 960.9063)^2] + \sum_{\mu=16}^{19} [y_\mu - f(x_\mu, \theta)]^2
\end{aligned}$$
The data for the four new observations are given in Table 9-3.

Table 9-3
Additional Good Data

μ     x_μ1    x_μ2    y_μ
16    0.1     150     0.851
17    0.1     250     0.176
18    0.2     150     0.825
19    0.2     250     0.011

Starting with [813.4583, 960.9063] as the initial guess, a single Gauss iteration takes
us to
$$\theta = \begin{bmatrix} 891.4626 \\ 984.3818 \end{bmatrix}$$
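A total of three iterations bring us to the minimum at
$$\theta = \begin{bmatrix} 895.2656 \\ 985.1655 \end{bmatrix}$$
On the other hand, the true minimum of $\Phi^{(19)}(\theta) = \sum_{\mu=1}^{19} e_\mu^2(\theta)$ occurs at
$$\theta = \begin{bmatrix} 892.9341 \pm 213.702 \\ 983.7429 \pm 53.0124 \end{bmatrix}$$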
We see then that the single Gauss iteration on the approximate objective
function produced very acceptable results.
The new estimate is very close to the old one because the data of Table
9-3 were generated by the same model as the previous data. The data of
Table 9-4, however, came from a different model.

Table 9-4
Additional Poor Data

μ     x_μ1    x_μ2    y_μ
16    0.1     150     0.760
17    0.1     250     0.300
18    0.2     150     0.608
19    0.2     250     0.095

Nevertheless, when these
data are used in place of those of Table 9-3, we find after one Gauss iteration
on the approximate objective function
$$\theta = \begin{bmatrix} 462.8711 \\ 863.4094 \end{bmatrix}$$
The minimum of the approximate function is found in six iterations at
$$\theta = \begin{bmatrix} 448.9961 \\ 852.5732 \end{bmatrix}$$
And the minimum of the exact function is at
$$\theta = \begin{bmatrix} 484.7656 \pm 135.943 \\ 841.6172 \pm 61.2455 \end{bmatrix}$$
Even here, where the new estimate is very far from the old, we obtain an
acceptable result in the single iteration.
9-9. Problems
Show that Eq. (9-4-7) can be generalized for a multiple equation model
as follows
$$\theta^{(n+1)} = \theta^{(n)} + A_{n+1}^{-1} B_{n+1}^T V^{-1} e_{n+1}(\theta^{(n)})$$
where
$$B_{n+1} = \partial e_{n+1}/\partial\theta, \qquad A_n = \sum_{\mu=1}^{n} B_\mu^T V^{-1} B_\mu$$
Show that
$$A_{n+1}^{-1} = A_n^{-1} - A_n^{-1} C_{n+1}\,(I + C_{n+1}^T A_n^{-1} C_{n+1})^{-1}\, C_{n+1}^T A_n^{-1}$$
where $C_\mu = B_\mu^T S$, and $S$ is a matrix such that $S S^T = V^{-1}$, e.g., the Cholesky
decomposition of $V^{-1}$.
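The second identity can be checked numerically before one attempts the proof. The following is a minimal sketch in pure Python, not part of the original problem; the helper names (`matmul`, `transpose`, `inverse`, `updated_inverse`) and the test matrices are illustrative assumptions. It compares the rank-update formula with a direct inversion of $A_n + C_{n+1} C_{n+1}^T$.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(a)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def updated_inverse(a_inv, c):
    """A_{n+1}^{-1} computed from A_n^{-1} and C_{n+1} (an l x m matrix)."""
    ct = transpose(c)
    m = len(ct)
    inner = matmul(matmul(ct, a_inv), c)  # C^T A^-1 C, an m x m matrix
    inner = [[inner[i][j] + (i == j) for j in range(m)] for i in range(m)]
    corr = matmul(matmul(matmul(a_inv, c), inverse(inner)), matmul(ct, a_inv))
    return [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(a_inv, corr)]


# Hypothetical values, chosen only for the check.
A = [[4.0, 1.0], [1.0, 3.0]]
C = [[1.0, 0.5], [2.0, -1.0]]
CCT = matmul(C, transpose(C))
direct = inverse([[A[i][j] + CCT[i][j] for j in range(2)] for i in range(2)])
updated = updated_inverse(inverse(A), C)
max_diff = max(abs(direct[i][j] - updated[i][j]) for i in range(2) for j in range(2))
```

The update only requires inverting an $m \times m$ matrix, which is the practical attraction of the formula when the number of equations $m$ is smaller than the number of parameters $l$.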
Chapter X
Design of Experiments
10-1. Introduction
Parameters are estimated on the basis of data obtained in experiments. It
is natural to ask whether we can plan experiments so as to facilitate the task
of estimating the parameters. The answer is generally in the affirmative, and
this chapter is devoted to the study of suitable experimental strategies.
For our present purposes, we define an experiment as the act of observing
the value of certain dependent variables $y_\mu$ at given values of the independent
variables $x_\mu$. We design an experiment by choosing in some rational way the
values of x at which y is to be observed. We shall use the phrase "the experiment
x" to denote the experiment whose independent variables take the value
of x. The values of the independent variables are referred to as the experimental
conditions.
The design of the experiments and the estimation of the parameters form
but two stages in a scientific investigation. What constitutes a "rational way"
of choosing experimental conditions can be decided only on the basis of the
overall aims of the investigation. A somewhat idealized scheme of a typical
investigation is depicted in Fig. 10-1. In practice, investigators rarely adopt
such a scheme explicitly, but nevertheless they adhere to it in a loose informal
way.
This book is concerned with investigations in which parameter estimation
plays a crucial role, forming the contents of box 3 in Fig. 10-1. Such investigations
are naturally concerned with the development of mathematical
models to represent physical situations. To devise a formal scheme for producing
such a model in a general situation is, as yet, beyond our capabilities.
Therefore, we place somewhat more modest goals into box 1 of our scheme.
Typically, the goal may be one of the following:
(a) The estimation of the parameters in a given model to a specified degree of
precision. For instance, we may wish to estimate the kinematic viscosity of a
liquid, using Eq. (2-1-1) as the model.
[Fig. 10-1. A scheme for scientific investigations. Flowchart boxes: 1. Define goal of investigation. 2. Collect prior available relevant data and information. 3. Analyze data available to date. 4. Has the goal of the investigation been met? 5. Is there a reasonable chance of attaining the goal with available resources? 6. Design the next experiment or series of experiments. 7. Perform the specified experiment(s).]
(b) The prediction of the values of certain variables which depend on some
unknown parameters. For instance, we may wish to predict the power required
to pump the liquid at a specified rate through a given pipe. To do this, we
need to determine the liquid's viscosity.
(c) The selection of which one of several proposed models best accords with
reality. Returning to the liquid and its viscosity, we may wish to determine
whether the liquid is Newtonian (viscosity constant) or non-Newtonian
(viscosity depending on shear rate, past history, or other factors).
(d) The determination of a course of action in a situation where the optimal
action depends on what the correct model is and what the values of the parameters
are. For instance, the proper design of a structure depends on the
tensile strength of the materials used; the proper design of a chemical reactor
depends on whether or not the reaction can be catalyzed; and the inventory
required in a stockroom depends on the predicted demand, which in
turn depends on the values of parameters appearing in an econometric model.
The method of selecting the experiments to be performed must be tailored
to the goal of the investigation. A simple example suffices to illustrate this
point:
Suppose we propose the model $y = \theta_1 + \theta_2 x$. For physical reasons,
measurements are restricted to the range $-1 \le x \le 1$. It is intuitively obvious
(and we shall later derive this fact rigorously) that best estimates for $\theta_1$ and
$\theta_2$ will be obtained if all experiments are performed at the two extreme points
of the range, $x = -1$ and $x = 1$. On the other hand, if our main concern is to
prove that the model is as given and not, say, $y = \theta_1 + \theta_2 x + \theta_3 x^2$, it becomes
imperative to perform experiments with at least three distinct values of
x. In fact, the best three experiments are at $x = -1$, 0, and 1. It is meaningful,
then, to ask "what is the best experiment for the attainment of our goal?",
but not simply "what is the best experiment?"
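The point of this example can be illustrated with a few lines of arithmetic. The sketch below is not from the original text; it assumes equal error variances and no prior information, and it scores a design by the determinant of $X^T X$, anticipating the criterion derived in Section 10-3. The function names and the particular designs are illustrative assumptions.

```python
def moment(xs, k):
    """Sum of the k-th powers of the design points."""
    return sum(x ** k for x in xs)


def det_linear(xs):
    """det(X^T X) for y = theta1 + theta2*x; rows of X are [1, x]."""
    return moment(xs, 0) * moment(xs, 2) - moment(xs, 1) ** 2


def det_quadratic(xs):
    """det(X^T X) for y = theta1 + theta2*x + theta3*x^2; rows are [1, x, x^2]."""
    s0, s1, s2, s3, s4 = (moment(xs, k) for k in range(5))
    return (s0 * (s2 * s4 - s3 * s3)
            - s1 * (s1 * s4 - s3 * s2)
            + s2 * (s1 * s3 - s2 * s2))


extremes = [-1.0, -1.0, 1.0, 1.0]          # all runs at the ends of the range
spread = [-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0]  # evenly spaced runs
three_level = [-1.0, 0.0, 1.0]

linear_at_extremes = det_linear(extremes)
linear_when_spread = det_linear(spread)
quadratic_at_extremes = det_quadratic(extremes)    # only two distinct x values
quadratic_three_level = det_quadratic(three_level)
```

Under these assumptions the end-point design gives the larger determinant for the straight-line model, while for the quadratic model its determinant vanishes: with only two distinct values of x the third parameter cannot be estimated at all.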
The classical methods of experimental design were devised by Fisher (1935),
Davies and coworkers (1954), and others to satisfy goals different from those
we are concerned with here. They referred to agricultural or industrial situations
where no a priori mathematical models were available. Generally, one
designed in advance a large number of experiments to be performed simultaneously
(a necessary condition in agriculture, where an experiment takes months).
In the scientific laboratory, on the other hand, an experiment usually takes
only a short time, but requires expensive apparatus of which not many specimens
are available. Experiments are perforce carried out in sequence, one (or
at most several) at a time.
Wald (1947) has demonstrated that when experiments are carried out in
sequence a smaller number of them is required, on the average, than when they
are performed simultaneously. This is true even where no use is made of
information gained in one experiment for planning the next one. The gain in
this case accrues entirely from the ability to terminate the experimentation
precisely at the point at which one's goal has been met. If, in addition, one is
able to design each experiment in the light of the results of the previous ones,
the gain in efficiency can be even more impressive. Informally, this strategy is
adopted by most chemists, physicists, and other experimental scientists. What
we are seeking here is the formalization of well-established intuitive pro-
cedures. The major contributions to the attainment of this goal are those of
G.E.P. Box and his coworkers, starting with Box and Lucas (1959). Many
more of their papers will be cited in the sequel.
10-2. Information and Uncertainty
It is the purpose of an experiment to gain relevant information. The best
experiment is the one that is most informative. It is only natural that we should
turn to information theory in our quest for a quantitative criterion for selecting
the experiments to be performed.
Suppose $\zeta$ is a random vector. From the probability distribution of $\zeta$ we
can gain a picture of the uncertainty associated with $\zeta$; the more disperse the
distribution of $\zeta$, the more uncertain is the value any specific realization of
$\zeta$ will assume. These intuitive notions of uncertainty have been formalized by
Shannon (1948), who showed that the unique (except for a positive multiplicative
factor) suitable measure of uncertainty associated with the probability
density function $p(\zeta)$ is given by
$$H(p) \equiv -E(\log p) = -\int p(\zeta) \log p(\zeta)\, d\zeta \tag{10-2-1}$$
We gain information by reducing the uncertainty. Suppose $p_0(\zeta)$ and $p^*(\zeta)$
are, respectively, the prior density of $\zeta$, and the posterior density after an
experiment has been performed. According to Lindley (1956), the amount of
information $J$ that is gained by the experiment equals the reduction in uncertainty
from the prior to the posterior distributions
$$J = H(p_0) - H(p^*) \tag{10-2-2}$$
Our aim is to find that experiment which maximizes $J$. Since $H(p_0)$ is unaffected
by the experiment, we may equally well look for the experiment that
minimizes $H(p^*)$.
When $\zeta$ is the vector of unknown parameters $\theta$, $p_0$ and $p^*$ may be the
prior and posterior densities in the usual Bayesian sense. If we wish to eschew
this interpretation, we may take $p_0$ and $p^*$ to be the estimated sampling
distribution densities before and after the experiment is conducted.† When
the normal approximations are adopted, the two interpretations yield identical
results.
We shall need to evaluate $H(p)$ for the multivariate normal distribution.
Let $p(\zeta) = N_n(a, V)$.
We have
$$\begin{aligned}
H(p) = -E(\log p) &= -E[-(n/2)\log 2\pi - \tfrac{1}{2}\log\det V - \tfrac{1}{2}(\zeta - a)^T V^{-1}(\zeta - a)] \\
&= (n/2)\log 2\pi + \tfrac{1}{2}\log\det V + \tfrac{1}{2}\,\mathrm{Tr}[V^{-1}E(\zeta - a)(\zeta - a)^T] \\
&= (n/2)\log 2\pi + \tfrac{1}{2}\log\det V + \tfrac{1}{2}\,\mathrm{Tr}\, V^{-1}V \\
&= (n/2)(1 + \log 2\pi) + \tfrac{1}{2}\log\det V
\end{aligned} \tag{10-2-3}$$
Discarding irrelevant constants, we can say that
$$H^*(p) \equiv \log\det V \tag{10-2-4}$$
is a measure of the uncertainty in the distribution $N_n(a, V)$.
We have remarked previously (Section 7-10) that for a normal distribution
$\det^{1/2} V$ is proportional to the volume of a confidence region in $\zeta$ space.
Eq. (4) tells us that the uncertainty increases linearly with the logarithm of the
volume of the confidence region. An experiment that seeks to minimize uncertainty
also seeks to shrink the volume of the confidence region as much as
possible.
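Equations (10-2-2) and (10-2-3) are easily evaluated numerically. The following is a small sketch in pure Python, offered as an illustration rather than as part of the original text; the function names are assumptions, and the Cholesky factorization is merely one convenient way of obtaining $\log\det V$ for a positive definite covariance matrix.

```python
import math


def log_det_spd(v):
    """log det of a symmetric positive definite matrix via Cholesky factors."""
    n = len(v)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = v[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def normal_entropy(v):
    """H(p) of Eq. (10-2-3) for an n-variate normal with covariance v."""
    n = len(v)
    return 0.5 * n * (1.0 + math.log(2.0 * math.pi)) + 0.5 * log_det_spd(v)


def information_gain(v_prior, v_post):
    """J of Eq. (10-2-2) when prior and posterior are both normal."""
    return normal_entropy(v_prior) - normal_entropy(v_post)


# Hypothetical covariances: the experiment halves both standard deviations.
gain = information_gain([[4.0, 0.0], [0.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Since the constant $(n/2)(1 + \log 2\pi)$ cancels in the difference, the gain depends only on the ratio of the two determinants, which is why Eq. (10-2-4) suffices as a design criterion.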
10-3. Design Criterion for Parameter Estimation
Suppose our current state of knowledge concerning the value of the parameters
$\theta$ may be summarized in a normal prior distribution $N_l(\theta_0, V_0)$.
Typically, this is the posterior (or sampling) distribution relative to experiments
already performed. We are contemplating performing n additional
experiments, in which $y_\mu$ ($\mu = 1, 2, \ldots, n$) are to be measured. Our task is to
determine the values of $x_\mu$ ($\mu = 1, 2, \ldots, n$) at which these measurements are
to be taken. We assume that the errors $y_\mu - f(x_\mu, \theta)$ are distributed as
$N_m(0, V)$.
After these experiments are performed, we shall be able to construct a
new posterior distribution. Let the normal approximation to that distribution
be $N_l(\tilde\theta, \tilde V)$. The vector $\tilde\theta$ will be obtained as the mode of the posterior
density; the posterior covariance $\tilde V$ is given by Eq. (7-12-5)
$$\tilde V = \left[\sum_{\mu=1}^{n} B_\mu^T V^{-1} B_\mu + V_0^{-1}\right]^{-1} \tag{10-3-1}$$
† We assume that several experiments have already been performed, and $p_0$ estimated
from the results.
where, as usual, $B_\mu \equiv \partial f_\mu/\partial\theta$ evaluated at $x = x_\mu$, $\theta = \tilde\theta$. Since the experiments
have not been performed yet, we cannot tell what value $\tilde\theta$ will take;
but in trying to calculate $\tilde V$ we can use $\theta_0$ in place of $\tilde\theta$ when evaluating
$\partial f_\mu/\partial\theta$.
Given any proposed set of experimental conditions $x_1, x_2, \ldots, x_n$ we are
thus able, by using Eq. (1), to estimate what the value of $\tilde V$ will be after the
proposed experiments are conducted. This is the same thing as saying that
the estimated $\tilde V$ is a function of $x_1, x_2, \ldots, x_n$.
To maximize the amount of information gained by the experiments, we
wish to select the $x_\mu$ ($\mu = 1, 2, \ldots, n$) in such a way that the uncertainty is
minimized, i.e., so that
$$R = \log\det\tilde V \tag{10-3-2}$$
is minimized. This also minimizes the volume of the confidence region for the
parameters. Clearly, minimizing $\log\det\tilde V$ is equivalent to minimizing $\det\tilde V$,
or maximizing $\det(\tilde V)^{-1}$.
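Let us reintroduce the notation
$$B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix}, \qquad \Pi = \begin{bmatrix} V & 0 & \cdots & 0 \\ 0 & V & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & V \end{bmatrix} \tag{10-3-3}$$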
The matrix $\Pi$ is the joint covariance matrix‡ of the errors $e_1, e_2, \ldots, e_n$. Then,
Eq. (1) becomes
$$\tilde V = (B^T\Pi^{-1}B + V_0^{-1})^{-1} \tag{10-3-4}$$
and
$$\det(\tilde V)^{-1} = \det(V_0^{-1} + B^T\Pi^{-1}B) = \det V_0^{-1}\,\det(I + V_0 B^T\Pi^{-1}B) \tag{10-3-5}$$
We recall now [see Eq. (A-1-33)] that $\det(I + AB) = \det(I + BA)$ so that
$$\begin{aligned}
\det(\tilde V)^{-1} &= \det V_0^{-1}\,\det(I + B V_0 B^T\Pi^{-1}) \\
&= \det V_0^{-1}\,\det\Pi^{-1}\,\det(\Pi + B V_0 B^T)
\end{aligned} \tag{10-3-6}$$
Since $\det V_0^{-1}\det\Pi^{-1}$ is a positive constant, we may simply maximize the
function
$$T(x_1, x_2, \ldots, x_n) \equiv \det(\Pi + B V_0 B^T) \tag{10-3-7}$$
‡ If the $e_\mu$ are serially correlated, we introduce the proper off-diagonal elements into
the definition of $\Pi$.
Let us examine the matrix $\Pi + B V_0 B^T$. As stated before, $\Pi$ is the joint
covariance matrix of the errors in all the measurements to be taken in the
course of the n proposed experiments, and $B V_0 B^T$ is the covariance matrix of
the errors incurred in computing the predicted outcomes $f(x_\mu, \theta)$ of the proposed
experiments due to the current uncertainty in the values of the parameters.
Therefore, $\Pi + B V_0 B^T$ is the total covariance of the predicted outcome
(see Eq. (7-19-4), with $V_0$ and $\Pi$ playing the roles of $V_\theta$ and $V_\eta$, respectively;
$V_\xi$ is assumed zero). Eq. (7) is then a measure of the joint uncertainty of
the predicted outcomes.† We have shown that to obtain maximum information
we must perform those experiments whose outcome is the most uncertain.
This result is not surprising; experiments whose outcomes are most uncertain
represent the greatest gaps in our knowledge of the system under consideration;
to fill the gaps we must perform those experiments.
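The equivalence of the two determinants in Eqs. (10-3-5) and (10-3-6) can be verified on a small case. The following pure-Python sketch is an illustration only; it assumes a single-response model $f = \theta_1 + \theta_2 x$ with two proposed experiments at the hypothetical conditions $x = -1$ and $x = 2$, so that both matrices happen to be $2 \times 2$, and the helper names are not from the original.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


sigma2 = 0.1                              # measurement error variance
v0 = [[0.1, 0.0], [0.0, 0.5]]             # prior covariance of the parameters
b = [[1.0, -1.0], [1.0, 2.0]]             # rows [1, x] for x = -1 and x = 2
pi_cov = [[sigma2, 0.0], [0.0, sigma2]]   # joint covariance of the two errors

# Left side: det(V0^-1 + B^T Pi^-1 B), an l x l determinant.
lhs = det2(add(inv2(v0), matmul(matmul(transpose(b), inv2(pi_cov)), b)))

# Right side: det V0^-1 * det Pi^-1 * det(Pi + B V0 B^T), an nm x nm determinant.
t_value = det2(add(pi_cov, matmul(matmul(b, v0), transpose(b))))
rhs = det2(inv2(v0)) * det2(inv2(pi_cov)) * t_value
```

The two sides agree to rounding error. In larger problems one would naturally work with whichever of the two matrices has the smaller dimension.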
We have a choice of minimizing $\det(V_0^{-1} + B^T\Pi^{-1}B)^{-1}$ or, equivalently,
maximizing $\det(\Pi + B V_0 B^T)$. Our choice should depend on the relative dimensions
of the two matrices, which are $l \times l$ and $nm \times nm$, respectively. We
should obviously choose the determinant of lower dimension. The case that is
most favorable to the second formulation is one where a single experiment is
to be conducted on a single equation model. Here $\Pi$ reduces to a single
number $\sigma^2$, and $B$ is a row vector $b^T \equiv [\partial f/\partial\theta]^T$. Hence Eq. (7) reduces to
$$T(x_1) = \sigma^2 + b^T V_0 b \tag{10-3-8}$$
If $\sigma^2$ is a constant, we need only find the $x_1$ which maximizes the error of
prediction variance $b^T V_0 b$.
We cite the following simple example:
Suppose the model is linear
$$y = f(x, \theta) = \theta_1 + \theta_2 x \tag{10-3-9}$$
We have $b^T = [1, x]$. Let $\sigma^2 = 0.1$ and suppose the current estimates are
$\theta_1 = 2$, $\theta_2 = 1$, with covariance matrix $V_0 = \mathrm{diag}(0.1, 0.5)$. The predicted outcome
y of any experiment x is given by $2 + x$, and the variance of this prediction
is, according to Eq. (8)
$$T(x) = 0.1 + [1, x]\begin{bmatrix} 0.1 & 0 \\ 0 & 0.5 \end{bmatrix}\begin{bmatrix} 1 \\ x \end{bmatrix} = 0.2 + 0.5x^2 \tag{10-3-10}$$
To improve the estimates of $\theta_1$ and $\theta_2$ we should perform an experiment
maximizing $T(x)$, that is we should choose as large (in absolute value) an x as
is practically feasible. The situation is illustrated in Fig. 10-2, where the predicted
curve $2 + x$ is plotted surrounded by a confidence curve of width
$\pm(0.2 + 0.5x^2)^{1/2}$. We choose for our experiment a value of x at which the
confidence band is as wide as possible.
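The arithmetic of this example may be reproduced as follows. This is a minimal sketch, not part of the original text; the candidate list and the function name `prediction_variance` are illustrative assumptions, and the feasible range $-1 \le x \le 1$ is chosen arbitrarily.

```python
def prediction_variance(x, sigma2=0.1, v0=((0.1, 0.0), (0.0, 0.5))):
    """T(x) = sigma^2 + b^T V0 b of Eq. (10-3-8) for the model theta1 + theta2*x."""
    b = (1.0, x)  # derivative of f with respect to (theta1, theta2)
    quad = sum(b[i] * v0[i][j] * b[j] for i in range(2) for j in range(2))
    return sigma2 + quad


# Hypothetical candidate conditions within an assumed feasible range.
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
scores = {x: prediction_variance(x) for x in candidates}
best_x = max(candidates, key=prediction_variance)
half_width = prediction_variance(best_x) ** 0.5  # half-width of the band at best_x
```

Because $T(x)$ is symmetric in x, the two end points of the range score equally; the criterion selects an extreme point, in agreement with the text.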
† The determinant of a covariance matrix is sometimes referred to as the generalized
variance.
[Fig. 10-2. Predicted y with confidence bands.]
The design criterion that was described here has been arrived at from
different points of view by Box and Lucas (1959) and Box and Hunter (1963),
with further details supplied by Draper and Hunter (1966, 1967a, 1967b) and
Atkinson and Hunter (1968). The use of the method on a computer-simulated
chemical kinetics model is described by Kittrell, Hunter, and Watson (1966),
and details for the estimation of polymerization parameters are worked
out by Behnken (1964).
10-4. Design Criterion for Prediction
If $\theta$ is to be estimated purely for the purpose of predicting certain quantities
$\varphi(\xi, \theta)$, then the uncertainty of the prediction is given by $\det V_\varphi$, where
$V_\varphi$ is defined in Eq. (7-19-4) and $\tilde V$ is used in place of $V_\theta$. Choose $x_\mu$
($\mu = 1, 2, \ldots, n$) so as to minimize the uncertainty of the prediction
$$R = \det V_\varphi = \det[(\partial\varphi/\partial\xi)V_\xi(\partial\varphi/\partial\xi)^T + (\partial\varphi/\partial\theta)\tilde V(\partial\varphi/\partial\theta)^T + V_\eta] \tag{10-4-1}$$
Here $\tilde V$, given by Eq. (10-3-1), is the only quantity that depends on the $x_\mu$.
In the special case when the parameters $\theta$ themselves are the $\eta$ to be predicted,
then Eq. (1) reduces to the criterion of Section 10-3. We may be interested
only in a subset of the parameters, in which case we associate that subset with
$\eta$, and minimize the determinant of the matrix obtained from $\tilde V$ by deleting
the rows and columns corresponding to the unwanted parameters.
10-5. Design Criterion for Model Discrimination
Sometimes several alternative models are proposed for the same physical
situation. We wish to conduct experiments that will enable us to select the
"best" model, i.e., the one that best fits the data.
Each one of our models attempts to predict y as a function of x and $\theta$.
What varies from model to model is the mathematical form of the function,
and the set of parameters involved (although some of the parameters appearing
in different models may possess the same physical interpretation). We
attach a superscript (i) to quantities pertaining to the ith model. The ith
model equation reads
$$y = f^{(i)}(x, \theta^{(i)}) \tag{10-5-1}$$
Suppose that we already have estimates $\theta_0^{(i)}$ for the parameters appearing
in the ith model, and estimates $V_0^{(i)}$ for the associated covariance matrices.
Typically, these are obtained by fitting each model in turn to data from
previously performed experiments. Using the parameter values $\theta_0^{(i)}$ we can
predict the outcome $\hat y^{(i)}$ of any proposed experiment x, assuming the ith
model is the correct one. This prediction is given by
$$\hat y^{(i)}(x) \equiv f^{(i)}(x, \theta_0^{(i)}) \tag{10-5-2}$$
Still assuming that the ith model is correct, we can compute the covariance of
the prediction error in Eq. (2). Following Eq. (7-19-4), this is
$$V_y^{(i)}(x) = V + B^{(i)}(x)\,V_0^{(i)}\,B^{(i)T}(x) \tag{10-5-3}$$
where $B^{(i)} \equiv \partial f^{(i)}/\partial\theta^{(i)}$ and $V$ is the covariance of the measurement errors of
y (which may also be a function of x).
The hypothesis that the ith model is correct leads us to regard the outcome
of a proposed experiment x as a random variable $\eta$ with pdf $p^{(i)}(\eta \mid x)$ having
mean and covariance given by Eq. (2) and Eq. (3), respectively. Suppose the
experiment has actually yielded an outcome y. Then we can compute the
number $p^{(i)}(y \mid x)$, which is the likelihood associated with the ith hypothesis.
For the moment we restrict ourselves to the case of two alternative models.
The quantity
$$a_{12}(x) \equiv \log[p^{(1)}(y \mid x)/p^{(2)}(y \mid x)]$$
is a measure of how much the observed y supports model 1 in preference to
model 2 (it is related to the likelihood ratio, see Section 10-6). In advance of
performing the experiment we do not know y, so we cannot compute $a_{12}$, but
we can compute its expected value under the assumption that model 1 is
correct (the symbol $E^{(1)}$ denoting expectation under this assumption)
$$E^{(1)}[a_{12}(x)] = \int p^{(1)}(y \mid x)\,\log[p^{(1)}(y \mid x)/p^{(2)}(y \mid x)]\,dy \tag{10-5-4}$$
If indeed model 1 is correct, we wish to conduct an experiment x which is
likely to confirm this, i.e., is expected to produce a large value of $a_{12}$. Conversely,
if model 2 is correct, we wish our experiment to have a large value
of the corresponding quantity $E^{(2)}[a_{21}(x)]$. Since we do not know which
model is correct, we form the sum of these two quantities
$$\begin{aligned}
J_{1,2}(x) &\equiv E^{(1)}[a_{12}(x)] + E^{(2)}[a_{21}(x)] \\
&= \int [p^{(1)}(y \mid x) - p^{(2)}(y \mid x)]\,\log[p^{(1)}(y \mid x)/p^{(2)}(y \mid x)]\,dy
\end{aligned} \tag{10-5-5}$$
The experiment to be selected is the one that maximizes $J_{1,2}(x)$; a large value
of $J_{1,2}$ can only be obtained if $p^{(2)}$ is much larger than $p^{(1)}$, or vice versa. In
either case, the outcome shows a strong preference for one model as opposed
to the other.
The quantity $J_{1,2}$ is called the divergence or the information for discrimination
(Kullback and Leibler, 1951; Kullback, 1959). Its similarity to
Eq. (10-2-1) is evident.
If both models assume normal error distributions with covariance
matrices $V_y^{(1)}$ and $V_y^{(2)}$, respectively, then it can be shown that
$$J_{1,2}(x) = -m + \tfrac{1}{2}\,\mathrm{Tr}(Q^{(1)}V_y^{(2)} + Q^{(2)}V_y^{(1)}) + \tfrac{1}{2}(\hat y^{(2)} - \hat y^{(1)})^T(Q^{(1)} + Q^{(2)})(\hat y^{(2)} - \hat y^{(1)}) \tag{10-5-6}$$
where $Q^{(i)} \equiv (V_y^{(i)})^{-1}$. The dependence of $J_{1,2}$ on the experimental conditions
x comes about through Eqs. (2) and (3). An important special case occurs
when the models are of the single-equation type, with $m = 1$. Then $V_y^{(i)} = \sigma_i^2$,
$Q^{(i)} = \sigma_i^{-2}$ ($i = 1, 2$), and
$$J_{1,2}(x) = -1 + \tfrac{1}{2}[(\sigma_2/\sigma_1)^2 + (\sigma_1/\sigma_2)^2] + \tfrac{1}{2}[(1/\sigma_1^2) + (1/\sigma_2^2)](\hat y^{(2)} - \hat y^{(1)})^2 \tag{10-5-7}$$
This equation was derived by Box and Hill (1967). The analogues to Eqs. (2)
and (3) are in this case:
$$\hat y^{(i)}(x) = f^{(i)}(x, \theta_0^{(i)}) \tag{10-5-8}$$
$$\sigma_i^2(x) = \sigma^2 + b^{(i)T}(x)\,V_0^{(i)}\,b^{(i)}(x) \tag{10-5-9}$$
where $\sigma$ is the standard deviation of the measurement errors and $b^{(i)}(x) =
\partial f^{(i)}/\partial\theta^{(i)}$.
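A short computation shows how Eqs. (10-5-7) to (10-5-9) might be applied in the single-equation case. Everything specific in the sketch below is hypothetical and chosen only for illustration: the two rival models (a straight line and an exponential), their current estimates and covariances, the error variance, and the grid of candidate conditions. Only the formula in `divergence` is taken from Eq. (10-5-7).

```python
import math

SIGMA2 = 0.05  # measurement error variance (hypothetical)

# Hypothetical current estimates and covariances for the two rival models.
TH1, V1 = (1.0, 1.0), ((0.01, 0.0), (0.0, 0.02))   # model 1: th0 + th1*x
TH2, V2 = (1.0, 0.7), ((0.01, 0.0), (0.0, 0.005))  # model 2: th0*exp(th1*x)


def predict1(x):
    return TH1[0] + TH1[1] * x


def grad1(x):
    return (1.0, x)


def predict2(x):
    return TH2[0] * math.exp(TH2[1] * x)


def grad2(x):
    e = math.exp(TH2[1] * x)
    return (e, TH2[0] * x * e)


def predictive_variance(b, v0):
    """sigma_i^2(x) = sigma^2 + b^T V0 b, Eq. (10-5-9)."""
    n = len(b)
    return SIGMA2 + sum(b[i] * v0[i][j] * b[j] for i in range(n) for j in range(n))


def divergence(y1, y2, s1sq, s2sq):
    """J_{1,2} of Eq. (10-5-7) for a single response."""
    return (-1.0 + 0.5 * (s2sq / s1sq + s1sq / s2sq)
            + 0.5 * (1.0 / s1sq + 1.0 / s2sq) * (y2 - y1) ** 2)


def design_score(x):
    s1sq = predictive_variance(grad1(x), V1)
    s2sq = predictive_variance(grad2(x), V2)
    return divergence(predict1(x), predict2(x), s1sq, s2sq)


grid = [0.1 * k for k in range(21)]  # candidate conditions 0.0, 0.1, ..., 2.0
best_x = max(grid, key=design_score)
```

The score is zero when the two predictions and their variances coincide, and it is never negative, since $\tfrac{1}{2}(r + 1/r) \ge 1$ for any positive ratio r.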
Equations (6) and (7) have a simple heuristic interpretation, particularly
in the single-equation case. Let us plot the predicted values $\hat y^{(1)}$ and $\hat y^{(2)}$ as
functions of x; this is done for a hypothetical situation in Fig. 10-3 where x
is assumed one-dimensional. If we chose to perform the experiment $x_1$, where
$\hat y^{(1)}$ and $\hat y^{(2)}$ coincide, the results of the experiment will tell us nothing about
which prediction was the better one. On the other hand, the two predictions
are most divergent at $x_2$, and the result of the experiment (unless it happens
to fall exactly midway between the two predictions) is likely to confirm one
or the other of the two models depending on which prediction it falls nearer
to. It seems reasonable to select, then, the experiment x for which $(\hat y^{(2)} - \hat y^{(1)})^2$
is maximum. It may happen, however, that at that value of x ($x_2$ in Fig. 10-3)
one or both of the predictions are particularly uncertain, possessing large
values of $\sigma_i$. Performing this experiment then is likely to be inconclusive, and
we may prefer another experiment for which $(\hat y^{(2)} - \hat y^{(1)})^2$ is somewhat
smaller, but where the uncertainty is much smaller. Therefore we must attach
to the term $(\hat y^{(2)} - \hat y^{(1)})^2$ a weight which is small when even one $\sigma_i$ is large, and
large when both $\sigma_i$ are small. Eq. (7) provides the right weight, and the same
is true of Eq. (6) in the multiresponse case.
[Fig. 10-3. Discrimination between two predicted responses. Axes x and y; curve label $\hat y^{(2)}(x)$; marked abscissas $x_1$ and $x_2$.]
It frequently turns out that the $\sigma_i$ do not vary strongly with x, so the weights
are nearly the same for all values of x. In this case we need only find the
maximum of $(\hat y^{(2)} - \hat y^{(1)})^2$ or $|\hat y^{(2)} - \hat y^{(1)}|$.
Our results can be generalized in several directions. To design several
experiments simultaneously, we maximize $J_{1,2}$ constructed with $\hat y^{(i)}$ and
$V_y^{(i)}$ augmented to include the responses from all the planned experiments; in
Eq. (3) $B^{(i)}$ takes the meaning defined in Eq. (10-3-3), and $\Pi$ of Eq. (10-3-3)
replaces $V$.
There are several ways in which we may treat more than two models.
After each experiment is performed, we can compute the likelihood $L^{(i)}$
associated with each model and the best current estimate of its parameters.
We then design the next experiment so as to discriminate specifically between
the two models with largest values of the likelihood. Or, following Box and
Hill (1967), we may form a joint divergence as a linear combination of the
pairwise divergences
$$J_{1,2,3,\ldots}(x) \equiv \sum_{i}\sum_{j \ne i} L^{(i)} L^{(j)} J_{i,j}(x) \tag{10-5-10}$$
We have at this point no experience to guide us in the choice of the method
to use, but it is obvious that the first one requires fewer calculations.
Our aim may be both to find the best among alternative models, and at
the same time find good estimates for the parameters in the best model. A
solution to this problem suggested by Hill et al. (1968) is to use as design
criterion a weighted sum of Eq. (6) and Eq. (10-3-2), the latter quantity
being evaluated for the currently best model. Initially, a relatively large
weight is placed on Eq. (6), but as one model becomes increasingly preferred,
the relative weight given to Eq. (10-3-2) is progressively increased.
10-6. Termination Criteria
We now turn our attention to box 4 of Fig. 10-1. How do we decide
whether more experiments are needed? How and when do we decide that a
given model is better than the alternatives?
We have advocated the use of the maximum likelihood method for estimating
parameters. We preferred to assign to our parameters $\theta$ the value $\theta_1$ rather
than $\theta_2$, provided that the likelihood associated with $\theta_1$ was greater than
that associated with $\theta_2$. The same idea applies to the choice of models; we
prefer model 1 to model 2 if the maximum likelihood attainable with model 1
is greater than that attainable with model 2. These considerations lead to
Wald's (Wald, 1947) sequential probability ratio (or likelihood ratio) test.
Suppose our aim is to choose one of two alternative hypotheses, $H_1$ (model 1
is correct), or $H_2$ (model 2 is correct). Let $L^{(i)}(y, \theta_0^{(i)})$ be the likelihood
(i.e., the value of the joint probability density function) associated with the
data obtained to date, and with the current best estimate $\theta_0^{(i)}$ for the parameters
based on the ith hypothesis ($i = 1, 2$).
Let A and B be two constants satisfying
$$0 < B < 1 < A \tag{10-6-1}$$
Then the likelihood ratio test proceeds as follows:
1. If $L^{(1)}/L^{(2)} \le B$ accept hypothesis 2.
2. If $L^{(1)}/L^{(2)} \ge A$ accept hypothesis 1.
3. If $B < L^{(1)}/L^{(2)} < A$ continue experimentation.
The choice of the constants A and B is determined by what confidence we
desire to place on the results. Let $\alpha$ be the probability that $H_1$ is accepted
when $H_2$ is true, and $\beta$ the probability that $H_2$ is accepted when $H_1$ is true.
It was shown by Wald (1947) that the following relations hold approximately
(the last two being consequences of the first two)
$$A \approx (1 - \beta)/\alpha, \qquad B \approx \beta/(1 - \alpha), \qquad \alpha \approx (1 - B)/(A - B), \qquad \beta \approx B(A - 1)/(A - B) \tag{10-6-2}$$
If we want, say, to be 95% certain that we accept $H_1$ only if $H_1$ is true,
and 90% certain that we accept $H_2$ only if $H_2$ is true, then $\alpha = 0.05$ and
$\beta = 0.1$ so that $A = 0.9/0.05 = 18$ and $B = 0.1/0.95 = 0.105$. Conversely,
suppose we choose $A = 10$, $B = 0.1$. This is tantamount to accepting error
probabilities $\alpha = 0.9/9.9 = 0.0909$ and $\beta = 0.1 \times 9/9.9 = 0.0909$. The choice
$\alpha = \beta$ leads to $B = 1/A$, and hence $\alpha = \beta = 1/(1 + A)$.
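The relations of Eq. (10-6-2) and the three-way decision rule can be put in code form. This is a minimal sketch under the approximations stated above; the function names are illustrative assumptions and not part of the original text.

```python
def wald_thresholds(alpha, beta):
    """Approximate limits A and B of Eq. (10-6-2) for given error probabilities."""
    return (1.0 - beta) / alpha, beta / (1.0 - alpha)


def error_probabilities(a, b):
    """Approximate alpha and beta implied by chosen limits A and B."""
    return (1.0 - b) / (a - b), b * (a - 1.0) / (a - b)


def sprt_decision(likelihood_ratio, a, b):
    """Apply the three rules of the likelihood ratio test to L(1)/L(2)."""
    if likelihood_ratio <= b:
        return "accept hypothesis 2"
    if likelihood_ratio >= a:
        return "accept hypothesis 1"
    return "continue experimentation"


a_limit, b_limit = wald_thresholds(alpha=0.05, beta=0.1)  # the first case in the text
alpha_10, beta_10 = error_probabilities(10.0, 0.1)        # the converse case
decision = sprt_decision(3.0, a_limit, b_limit)           # hypothetical ratio of 3
```

With $\alpha = 0.05$ and $\beta = 0.1$ the limits reproduce the values 18 and 0.105 quoted above, and a likelihood ratio of 3 falls between them, so experimentation would continue.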
When more than two alternatives are present, we need only apply the test
to the two currently most likely models.
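It is instructive to derive an expression for the likelihood ratio after n
experiments in the single equation case. Assuming normal distributions, we
have
$$L^{(i)} = (2\pi)^{-n/2}\,\sigma^{-n}\exp\left\{-(1/2\sigma^2)\sum_{\mu=1}^{n}[y_\mu - f_\mu^{(i)}(\theta^{(i)})]^2\right\} \tag{10-6-3}$$
For the ith model, $L^{(i)}$ is maximized if we estimate $\sigma$ to be
$$\sigma^{(i)} = \left\{(1/n)\sum_{\mu=1}^{n}[y_\mu - f_\mu^{(i)}(\theta^{(i)})]^2\right\}^{1/2} \tag{10-6-4}$$
Hence
$$L^{(i)} = (2\pi)^{-n/2}\,(\sigma^{(i)})^{-n}\exp(-n/2) \tag{10-6-5}$$
and the likelihood ratio is
$$\frac{L^{(1)}}{L^{(2)}} = \left(\frac{\sigma^{(2)}}{\sigma^{(1)}}\right)^{n} = \left\{\frac{\sum_{\mu=1}^{n}[y_\mu - f_\mu^{(2)}(\theta^{(2)})]^2}{\sum_{\mu=1}^{n}[y_\mu - f_\mu^{(1)}(\theta^{(1)})]^2}\right\}^{n/2} \tag{10-6-6}$$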
f
f
:1-.
:
'i-F
:....
<if
..-
;;
-,..
i!i;',
..;:::;,
.:'t-
:
I
.J.:
::
;
{;
g:
I .,
.
,
:
';
I
I,
10-7. Some Practical COllsideratiolls
271
If after n experiments $\sigma^{(2)} > \sigma^{(1)}$, we expect to find ultimately that model 1
is to be preferred. If $[\sigma^{(2)}/\sigma^{(1)}]^n < A$, we must defer final conclusions until
some more experiments are performed. Having no reason to expect the
estimates of $\sigma^{(2)}$ or $\sigma^{(1)}$ to be changed much by the results of future experiments,
we can predict that $L^{(1)}/L^{(2)}$ will exceed A after we conduct $n_0$
additional experiments with
$$[\sigma^{(2)}/\sigma^{(1)}]^{n+n_0} \ge A, \qquad n_0 \ge (\log A)/(\log \sigma^{(2)}/\sigma^{(1)}) - n \tag{10-6-7}$$
The smallest integer $n_0$ satisfying Eq. (7) is an estimate of the number of
additional experiments required to reach the conclusion that model 1 is the
better one.
If $\sigma^{(2)} < \sigma^{(1)}$, then
$$n_0 \ge -(\log B)/(\log \sigma^{(1)}/\sigma^{(2)}) - n \tag{10-6-8}$$
provides an estimate for the number of additional experiments required to
establish a preference for model 2. The reliability of these estimates, which
is very small when $n \ll n_0$, increases steadily as $n_0$ approaches zero. For
further discussion of the expected number of experiments, the reader is
referred to Wald (1947).
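Equations (10-6-7) and (10-6-8) amount to a one-line calculation each. The sketch below is an illustration only; the function names and the numerical values ($n = 10$, standard deviations 0.05 and 0.06, and the limits $A = 18$, $B = 0.1/0.95$ from the earlier example) are assumptions made for the purpose.

```python
import math


def extra_runs_for_model1(n, sigma1, sigma2, a):
    """Smallest integer n0 satisfying Eq. (10-6-7); requires sigma2 > sigma1."""
    if sigma2 <= sigma1:
        raise ValueError("Eq. (10-6-7) applies only when sigma2 > sigma1")
    n0 = math.log(a) / math.log(sigma2 / sigma1) - n
    return max(0, math.ceil(n0))


def extra_runs_for_model2(n, sigma1, sigma2, b):
    """Smallest integer n0 satisfying Eq. (10-6-8); requires sigma1 > sigma2."""
    if sigma1 <= sigma2:
        raise ValueError("Eq. (10-6-8) applies only when sigma1 > sigma2")
    n0 = -math.log(b) / math.log(sigma1 / sigma2) - n
    return max(0, math.ceil(n0))


# Hypothetical situation after n = 10 experiments.
runs_1 = extra_runs_for_model1(10, sigma1=0.05, sigma2=0.06, a=18.0)
runs_2 = extra_runs_for_model2(10, sigma1=0.06, sigma2=0.05, b=0.1 / 0.95)
```

Both estimates rest on the assumption, stated in the text, that the two standard deviations will not change much as further data arrive.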
When experiments are being conducted for the purpose of estimating
parameters in a single model, the termination criterion is usually formulated
in terms of the variance of the estimates. One demands that $\det V_\theta$ fall below
a specified value, or that the individual parameter variances $V_{\theta ii}$ ($i = 1, 2, \ldots, l$)
all fall below specified levels $\sigma_i^2$. The number of additional experiments
required at any stage may be estimated easily from the fact that the elements
of $V_\theta$ are roughly proportional to $(n - l)^{-1}$. If $\det V_\theta = a$ after n experiments,
and the number of additional experiments $n_0$ required to reach $\det V_\theta = b < a$
is to be determined, then we must solve the equation
$$(n + n_0 - l)^l\, b = (n - l)^l\, a \tag{10-6-9}$$
for $n_0$.
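Equation (10-6-9) has the closed-form solution $n_0 = (n - l)[(a/b)^{1/l} - 1]$, which follows by taking the $l$th root of both sides. A minimal sketch, with a hypothetical function name and hypothetical numbers, is:

```python
def extra_runs_for_precision(n, l, a, b):
    """Solve (n + n0 - l)^l * b = (n - l)^l * a for n0 (not rounded)."""
    if n <= l:
        raise ValueError("need more experiments than parameters (n > l)")
    if not 0.0 < b < a:
        raise ValueError("target determinant b must satisfy 0 < b < a")
    return (n - l) * ((a / b) ** (1.0 / l) - 1.0)


# Hypothetical case: 12 experiments, 2 parameters, determinant to be cut fourfold.
n0 = extra_runs_for_precision(n=12, l=2, a=4.0, b=1.0)
```

In practice the result would be rounded up to the next integer. Since the proportionality on which it rests is only rough, the figure is best regarded as a planning estimate.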
10-7. Some Practical Considerations
We have derived several experimental design criteria, given by Eqs.
(10-3-2), (10-3-7), (10-4-1), (10-5-6), and (10-5-7) for the various cases that
may arise. Let D(x) denote the criterion adopted in a given situation. The
experimental conditions x are to be chosen so as to maximize D(x). We
discuss here some of the problems associated with finding these experimental
conditions,
In the first place, we must realize that the choice of experimental conditions is generally not unrestricted. Mole fractions can only range from zero
to one, the temperature of a liquid is constrained between its freezing and
boiling points, and the pressure in a vessel is limited by the strength of its
walls. Therefore, searching for the maximum of D(x) involves constrained
optimization, with the variables (experimental conditions) confined to a
bounded feasible region. Experience has shown that the maximum usually
falls on the boundary of the feasible region [Atkinson and Hunter (1968) have
derived conditions under which this must be so]. The experimenter must
apply the design criterion with caution; the extreme values of the experi-
mental conditions prescribed by the criterion may be far removed from the
region of interest, and it may be well to impose stricter bounds on the variables
than is required by physical or technical limitations. There is also the danger
that the properties (i.e., the model equations or parameter values) of the
system under investigation are not the same at the boundary as in the center
of the feasible region. We recommend therefore that occasional experiments
be chosen in the interior of the region, even when not prescribed by the
design criterion.
The reader will have noticed that the design criterion cannot be computed
unless initial estimates $\boldsymbol{\theta}_0$ and $\mathbf{V}_0$ are given for the parameters and their covariance matrix. At the start of the investigation such estimates may not be
available, and some initial experiments must be performed to get things going.
The number of such experiments must exceed somewhat the number of
unknown parameters, so that the estimates $\boldsymbol{\theta}_0$ and $\mathbf{V}_0$ can be obtained. The
initial experiments may be selected by standard methods such as factorial,
fractional factorial, or rotatable designs covering the feasible range of the
experimental conditions.
An experimenter using these designs must remember that he cannot
expect to get more out of the procedure than he has put into it. He cannot
expect to obtain clear-cut preference for one model or one value of the
parameters, if major effects have been neglected. For example, suppose a
compound A is converted into a product C according to the consecutive
reaction scheme
$$A \rightarrow B, \qquad B \rightarrow C \tag{10-7-1}$$
However, the experimenter has set down models involving only the reaction
$$A \rightarrow C \tag{10-7-2}$$
He should not be disappointed then if the design criterion does not tell him
to run experiments with varying initial concentrations of B.
In our derivations, expected information was the sole criterion for
selecting experiments. In practice, considerations of economics and convenience in experimental setup must also play a role. In many situations,
particularly those involving dynamic systems, experiments are conducted in
runs; several observations are made at different times on a process starting
from given initial conditions. In such cases one should design whole runs,
rather than single observations. We must select, then, a set of initial conditions $\mathbf{s}_0$, and times $t_1, t_2, \ldots, t_n$ at which observations are to be made. Computing the total information obtainable in each possible run is a formidable
task because of the high correlation between the predicted values of successive
observations of a run. It is, however, easy to calculate the expected information in any single observation taken at time $t$, with initial conditions $\mathbf{s}_0$.
If we plot the expected information $I$ as a function of $t$ for given $\mathbf{s}_0$ we usually
find that there is a definite time $t_M(\mathbf{s}_0)$ at which the expected information
attains a maximum value.† Let $I_M(\mathbf{s}_0)$ be the expected information at $t_M(\mathbf{s}_0)$.
It is reasonable to choose that run (i.e., the value of $\mathbf{s}_0$) whose $I_M(\mathbf{s}_0)$ has the
largest value. The actual observations to be made during the run, i.e., the
values of the $t_\mu$, are chosen in that portion of the $I(t)$ curve where $I$ is not
much below $I_M$. The problem of determining values of $t_\mu$ is further treated by
Heineken et al. (1967a,b).
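The run-selection heuristic just described can be sketched in a few lines. This is an illustration only: the callable `info(s0, t)`, the candidate list, the time grid, and the `tolerance` fraction that makes "not much below $I_M$" concrete are all assumptions introduced here, and the sketch presumes a positive peak information.

```python
import numpy as np


def design_run(info, s0_candidates, t_grid, n_obs, tolerance=0.2):
    """Pick the initial condition with the largest peak information, then
    spread observation times over the part of the I(t) curve lying within
    `tolerance` (as a fraction) of that peak."""
    t_grid = np.asarray(t_grid, dtype=float)
    best = None
    for s0 in s0_candidates:
        curve = np.array([info(s0, t) for t in t_grid])
        peak = curve.max()                      # I_M(s0) on the grid
        if best is None or peak > best[1]:
            best = (s0, peak, curve)
    s0, peak, curve = best
    # Times at which I is not much below I_M.
    usable = t_grid[curve >= (1.0 - tolerance) * peak]
    count = min(n_obs, len(usable))
    picks = np.linspace(0, len(usable) - 1, count).round().astype(int)
    return s0, usable[picks]
```

The even spacing of the chosen times is an arbitrary design choice; the text leaves the placement within the high-information portion of the curve to the experimenter.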
Other complications arise when the cost of an experiment depends strongly
on the experimental conditions. It may then be cheaper to gain a certain
amount of information by performing several cheap though inefficient experiments, rather than a single efficient though expensive one. The simplest solution is to divide the expected information gain in an experiment by the cost
of that experiment, and maximize the expected information per unit cost.
Design criteria based purely on economic considerations can be derived from
decision theory, as shown in Section 10-10.
10-8. Computational Considerations
The problem is to locate the maximum of the design criterion, which is a
complicated nonlinear function D of the experimental conditions x. The
function is often so complicated that analytic computation of its derivatives
is out of the question. Additional factors which contribute to the difficulty of
the problem are the following:
1. The maximum is usually located on the boundary of the feasible region.
2. There are usually several local maxima. In the cases that have been
studied in detail, the number of local maxima tended to be close to the
number of unknown parameters in the model.
† It is possible for the maximum to be approached asymptotically as $t \to \infty$. We then
select $t_M$ as the time at which $I = I_M - \epsilon$.
As indicated in Chapter V, maximization of a nonlinear function is
easiest when derivatives can be calculated, no constraints apply, and there is a
unique local maximum. On all these scores our problem is a difficult one.
Furthermore, if we wish to obtain the most information in each experiment,
we must repeat the maximization procedure before each experiment is per-
formed. Fortunately, there are some mitigating circumstances:
1. There is no need to locate the maximum with a great deal of precision.
2. The locations of the local maxima do not seem to vary much from
one experiment to the next. What do change are the relative heights of the
various maxima, so that the conditions chosen for a sequence of experiments
cycle among the several local maxima.‡
Indeed, Box (1968) shows that if a sequence of $n$ experiments is designed
(nonsequentially) to estimate $l < n$ parameters, then an optimal or near-optimal design is usually obtained if the $l$ best experiments are each replicated
$n/l$ (as closely as possible) times.
It seems, therefore, that we need search for the local maxima throughout
the entire feasible region only the first time around, i.e., after the initial
experiments have been performed. After that, when the results of each new
experiment come in, we need only search in the neighborhood of each already
established local maximum so as to locate its current position (which may
shift slightly after each experiment).
The safest way to conduct the initial thorough search for local maxima is
to evaluate the design criterion at all points on a dense grid throughout the
feasible region. Those grid points where the design criterion exceeds the values
at all neighbors are selected as approximate locations of the local maxima.
Further refinement can then be achieved by starting hill climbing procedures
(e.g., direct search optimization, see Section 5-19) at these points. A sufficiently fine initial grid makes this step superfluous.
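The grid search can be sketched as follows. This is a minimal illustration under stated assumptions: `criterion` is any user-supplied callable standing in for D(x), `bounds` is a list of (low, high) pairs, and "neighbors" is taken here to include diagonal neighbors, which the text does not specify.

```python
import itertools

import numpy as np


def grid_local_maxima(criterion, bounds, levels=11):
    """Evaluate `criterion` on a regular grid and return the grid points whose
    value strictly exceeds the value at every neighboring grid point.

    Returns a list of (value, point) pairs, highest value first.
    """
    axes = [np.linspace(lo, hi, levels) for lo, hi in bounds]
    shape = tuple(len(ax) for ax in axes)
    values = np.empty(shape)
    for idx in itertools.product(*(range(k) for k in shape)):
        values[idx] = criterion(np.array([ax[i] for ax, i in zip(axes, idx)]))

    # All index offsets to adjacent grid points (diagonals included).
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=len(shape)) if any(o)]
    maxima = []
    for idx in itertools.product(*(range(k) for k in shape)):
        is_max = True
        for off in offsets:
            nb = tuple(i + d for i, d in zip(idx, off))
            inside = all(0 <= j < k for j, k in zip(nb, shape))
            if inside and values[nb] >= values[idx]:
                is_max = False
                break
        if is_max:
            point = np.array([ax[i] for ax, i in zip(axes, idx)])
            maxima.append((float(values[idx]), point))
    maxima.sort(key=lambda pair: pair[0], reverse=True)
    return maxima
```

The number of criterion evaluations is `levels` raised to the number of experimental conditions, which is why the method is limited to a few variables.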
The grid search technique is feasible only when the number of independent
experimental conditions is small. With three variables, a ten-level grid in each
dimension results in a thousand points, which is not excessive if the model
equations are simple. With four or more dimensions, the grid search
technique is likely to be impractical. In this case we suggest the following
procedure:
1. Select a feasible point at random.
2. Starting from this point, apply a direct search optimization procedure
until a local maximum is reached.
3. Repeat 1 and 2 until at least $l$ (= maximum number of parameters in
any of the models considered) distinct local maxima have been obtained, or
until a certain number of tries has failed to uncover a new local maximum.

‡ This statement, like most others in this section, is based solely on a limited amount
of experience with computer-simulated experiments.

[Figure 10-4: flow diagram]
Tasks performed by laboratory:
Start
1. Perform initial experiments $\mathbf{x}_1, \ldots, \mathbf{x}_n$, i.e., measure $\mathbf{y}_1, \ldots, \mathbf{y}_n$.
5. Perform the experiment $\mathbf{x}_\mu$, i.e., measure $\mathbf{y}_\mu$.
Tasks performed by computer:
2. Estimate parameters for all proposed models.
3. By grid or random search, locate $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(p)}$, the set of local maxima of $D(\mathbf{x})$.
4. Choose the local maximum with largest value of $D$.
6. Estimate parameters for all proposed models.
7. Termination test; the "Yes" branch leads to End.
8. Starting from the old local maxima (and possibly some additional randomly chosen points) locate the new set of maxima by direct search.
Fig. 10-4. A sequential experimental procedure. Symbols next to arrows indicate
transmitted data.
Let $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(p)}$ be the known local maxima after $n$ experiments have
been performed. The $(n + 1)$st experiment is, of course, conducted at the
highest local maximum, i.e., at the $\mathbf{x}^{(j)}$ whose design criterion is largest. After
this experiment has been performed, we establish new values of the $\mathbf{x}^{(j)}$ by
applying the direct search technique, starting at each of the old $\mathbf{x}^{(j)}$. It is not
uncommon for some of the new $\mathbf{x}^{(j)}$ to coalesce; i.e., searches starting at
several of the old $\mathbf{x}^{(j)}$ may lead to the same (within some tolerance $\epsilon$) new
$\mathbf{x}^{(j)}$. To guard against the possibility that some maxima are being overlooked,
one may also include after each experiment additional random starting points
for searches.
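The restart-and-coalesce bookkeeping can be sketched as below. The hill climber shown is a simple compass (coordinate) search used as a stand-in; it is not claimed to be the direct search method of Section 5-19, and all function names, the step schedule, and the merge tolerance are assumptions made for this illustration.

```python
import numpy as np


def compass_search(criterion, start, bounds, step=0.25, tol=1e-4):
    """Derivative-free hill climbing inside a box: try +/- steps along each
    axis, keep any improvement, and halve the steps when none is found."""
    lo, hi = np.array(bounds, dtype=float).T
    x = np.clip(np.asarray(start, dtype=float), lo, hi)
    fx = criterion(x)
    h = step * (hi - lo)
    while h.max() > tol:
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = x.copy()
                trial[i] = np.clip(x[i] + sign * h[i], lo[i], hi[i])
                f_trial = criterion(trial)
                if f_trial > fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            h = h / 2.0
    return x, fx


def update_local_maxima(criterion, starts, bounds, merge_tol=1e-2):
    """Climb from every starting point and merge end points that coalesce
    (lie within merge_tol of an already recorded maximum)."""
    found = []
    for s in starts:
        x, fx = compass_search(criterion, s, bounds)
        if all(np.linalg.norm(x - y) > merge_tol for y, _ in found):
            found.append((x, fx))
    found.sort(key=lambda pair: pair[1], reverse=True)
    return found  # found[0][0] is where the next experiment would be run


def random_starts(bounds, count, rng):
    """Uniformly distributed feasible starting points."""
    lo, hi = np.array(bounds, dtype=float).T
    return lo + (hi - lo) * rng.random((count, len(lo)))
```

After each experiment one would call `update_local_maxima` with the old maxima, optionally extended by a few points from `random_starts`, as the starting list.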
Fig. 10-4 contains a proposed flow diagram for the procedure to be
followed. The diagram is divided into two sections, dealing respectively with
the functions of the computer (estimation and design), and of the laboratory
(execution of specified experiments). This raises the question of how to
implement the links between the computer and the laboratory apparatus. The
answer depends on the circumstances; if the experiments are of very short
duration and suitable instrumentation is available, the computer may be
connected directly on-line to the apparatus. Otherwise, manual transfer of
data is required. Note that the computer functions described here are quite
distinct from actual on-line control of experiments, where the question is not
what experiments to perform, but how to insure that the specified experiment
is carried out properly. Of course, on-line design cannot perform unless the
control function is also implemented, but the latter is outside the scope of
this book.
10-9. Computer Simulated Experiments
Before applying our design methods to real experiments it may be wise
to test them on computer simulated experiments. In this way we can determine economically whether the method is likely to succeed.
How do we simulate an experiment on the computer? An experiment is,
from our point of view, merely a device for generating the value of $\mathbf{y}_\mu$ for a
given value of $\mathbf{x}_\mu$. To simulate the experiment, all we need then is a computer
routine which accepts a value of $\mathbf{x}_\mu$, and returns a value of $\mathbf{y}_\mu$. Internally, this
subroutine should compute $\mathbf{y}_\mu$ using a formula such as
$$\mathbf{y}_\mu = \mathbf{f}^{(0)}(\mathbf{x}_\mu, \boldsymbol{\theta}^{(0)}) + \boldsymbol{\epsilon}_\mu \tag{10-9-1}$$
where $\mathbf{f}^{(0)}$ is one of the models proposed for the phenomenon under study,
and $\boldsymbol{\theta}^{(0)}$ is a specific set of values for the parameters appearing in this model.
The error term $\boldsymbol{\epsilon}_\mu$ consists of pseudorandom numbers with the proper probability distribution (see Section 3-3). In addition, we may include a systematic
error, to test what happens if none of the proposed models is really correct.
The experimental design procedure is tested by applying the procedure
of Fig. 10-4, with the functions of boxes (1) and (5) performed by the computer
routine just discussed. Note that only this particular routine "knows" which
model has been selected, and what parameter values have been assigned;
precisely as in nature the laboratory apparatus "knows" the model and the
parameters. The only way in which other computer routines [e.g., those
performing the functions of boxes (2) and (6)] can guess at the right model
and parameter values is by analyzing the data ($\mathbf{y}_\mu$ values) supplied by boxes
(1) and (5).
We present, now, a numerical example (Bard and Lapidus, 1968) in which
the design method for discrimination among models is applied to computer-
simulated experiments. The example clearly illustrates the potential power of
the method.
Hougen and Watson (1947, pp. 943-958) have proposed eighteen alternative models for determining the rate of catalytic hydrogenation of mixed
isooctenes into isooctane
$$\mathrm{C_8H_{16}} + \mathrm{H_2} \rightarrow \mathrm{C_8H_{18}} \tag{10-9-2}$$
Blakemore and Hoerl (1963) have attempted to fit all these models, and
two additional ones, to data that were available in the literature. They found
that all but two of the models could be rejected immediately. There was no
conclusive evidence to choose between these two, which have the forms
$$y = \frac{\theta_1^{(1)} x_1 x_2}{\left(1 + \theta_2^{(1)} x_1^{1/2} + \theta_3^{(1)} x_2 + \theta_4^{(1)} x_3\right)^3} \tag{10-9-3}$$
and
$$y = \frac{\theta_1^{(2)} x_1 x_2}{\left(1 + \theta_2^{(2)} x_1 + \theta_3^{(2)} x_2 + \theta_4^{(2)} x_3\right)^2} \tag{10-9-4}$$
where $y$ is the rate of reaction and $x_1$, $x_2$, and $x_3$ are the partial pressures of
hydrogen, isooctene, and isooctane, respectively. Blakemore and Hoerl
conclude, in part
"Carefully designed experiments are necessary ... there are no fitting
techniques which can overcome the deficiencies of poorly-designed experiments ..."
This system was, therefore, considered a good one for testing the experimental design procedure. To simulate the reaction on the computer, the
following relations were used.
For experiments $\mu = 1, 2, \ldots, 6$
$$y = \frac{0.0653477\, x_1 x_2\, [1 + \epsilon(\sigma)]}{\left(1 + 0.128246\, x_1^{1/2} + 0.159038\, x_2 + 0.0206618\, x_3\right)^3} \tag{10-9-5}$$
and for experiments $\mu = 7, 8, \ldots$
$$y = \frac{0.0558\, x_1 x_2\, [1 + \epsilon(\sigma)]}{\left(1 + 0.104\, x_1 + 0.264\, x_2 + 0.0151\, x_3\right)^2} \tag{10-9-6}$$
where $\epsilon(\sigma)$ is a pseudorandom number with distribution $N_1(0, \sigma^2)$. Note that
$\sigma$ is the standard deviation of the relative error in $y$. This choice of model is
to be interpreted as follows:
Model Eq. (4) is the correct one, but by chance the first six experiments
happen to give wrong results which appear to be closer to model Eq. (3).
The aim was to see how soon the experimental design procedure could pick
out Eq. (4) as the correct model, in spite of the handicap posed by the first six
observations. The parameter values used in Eqs. (5) and (6) were those that
gave the best least squares fits to the literature data used by Blakemore and
Hoerl. The permitted ranges of the independent variables were the same as
in the literature data, i.e.,
$$0.1 \le x_1 \le 2.5, \qquad 0.1 \le x_2 \le 3, \qquad 0.05 \le x_3 \le 2.7 \tag{10-9-7}$$
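A simulated "laboratory" of this kind might look as follows. This is a sketch only: the closure-based structure, the names `make_laboratory` and `run_experiment`, and the choice of random number generator are assumptions; the coefficients and bounds are those of Eqs. (10-9-5) through (10-9-7).

```python
import numpy as np

# Feasible region of Eq. (10-9-7): bounds on x1, x2, x3.
BOUNDS = [(0.1, 2.5), (0.1, 3.0), (0.05, 2.7)]


def rate_eq5(x):
    """Noise-free right-hand side of Eq. (10-9-5)."""
    x1, x2, x3 = x
    return 0.0653477 * x1 * x2 / (
        1.0 + 0.128246 * np.sqrt(x1) + 0.159038 * x2 + 0.0206618 * x3) ** 3


def rate_eq6(x):
    """Noise-free right-hand side of Eq. (10-9-6)."""
    x1, x2, x3 = x
    return 0.0558 * x1 * x2 / (1.0 + 0.104 * x1 + 0.264 * x2 + 0.0151 * x3) ** 2


def make_laboratory(sigma, seed=0):
    """Return a routine that plays the part of the apparatus: it alone knows
    which relation generates the data. Experiments 1-6 use Eq. (10-9-5),
    later ones Eq. (10-9-6); the error is relative, with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    count = 0

    def run_experiment(x):
        nonlocal count
        for value, (lo, hi) in zip(x, BOUNDS):
            if not lo <= value <= hi:
                raise ValueError("experimental conditions outside the feasible region")
        count += 1
        noise_free = rate_eq5(x) if count <= 6 else rate_eq6(x)
        return noise_free * (1.0 + rng.normal(0.0, sigma))

    return run_experiment
```

The estimation and design routines would receive only the returned values of `run_experiment`, never the coefficients inside it.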
The flow chart of Fig. 10-4 was implemented in the following way:
Box 1 The initial experiments, six in number, formed a fractional factorial design. They consisted of the centers of the six surfaces bounding the
region defined by Eq. (7). The conditions for these experiments are listed
in Table 10-1, along with the results [computed from Eq. (5)] for the case
$\sigma = 0.03$, i.e., 3% relative error.
Table 10-1
Initial Experiments
μ    x1     x2     x3      y (σ = 0.03)
1    0.1    1.55   1.375   0.00441
2    2.5    1.55   1.375   0.07932
3    1.3    0.1    1.375   0.00508
4    1.3    3      1.375   0.05633
5    1.3    1.55   0.05    0.04912
6    1.3    1.55   2.7     0.04292
Box 2 The least squares criterion was used to estimate parameters for
both models. The fact that the relative rather than absolute error remained
constant from experiment to experiment was ignored (i.e., it was assumed that
the experimenter did not know that the error standard deviation varied from
experiment to experiment). The parameter estimates with their standard
deviations and the residual standard deviations for the data of Table 10-1
are presented in Table 10-2.
Table 10-2
Parameter Estimates for Initial Experiments (σ = 0.03)
Model          θ1           θ2           θ3           θ4           Standard deviation of residuals
Eq. (10-9-3)   0.064372     0.116329     0.160034     0.024028     0.488055 × 10^-4
               ± 0.000294   ± 0.001065   ± 0.000544   ± 0.000272
Eq. (10-9-4)   0.056738     0.071874     0.277537     0.040166     3.75137 × 10^-4
               ± 0.001808   ± 0.005881   ± 0.009287   ± 0.003611
It is not surprising that at this point model Eq. (3) gives much the better
fit, and its parameters are the better-determined ones.
Box 3 Since there are only three independent variables, a complete grid
search was considered feasible. The design criterion function $J_{1,2}(\mathbf{x})$ of
Eq. (10-5-7) was evaluated at all points on an 11 × 11 × 11 grid encompassing
the feasible region defined by Eq. (7). Local maxima are taken to be those
grid points at which $J_{1,2}$ exceeds values at all direct neighbors. The local
maxima after the six preliminary experiments are listed in Table 10-3, with
the highest maximum marked by an asterisk.
Table 10-3
Local Maxima of Design Criterion Function After Initial Experiments (σ = 0.03)
x1     x2     x3     J(1,2)
2.02   0.68   0.05   1.01445
2.5    2.13   0.05   0.1536827
0.58   2.42   0.05   0.9023107
2.5    3      2.7    1.287048 *
Box 4 We choose the highest maximum of $J_{1,2}$ for our next experiment.
According to Table 10-3, then, we perform the seventh experiment at
$\mathbf{x}_7 = (2.5, 3, 2.7)$.
Box 5 Eq. (6) is used to generate $y_\mu$. In our example, $y_7$ turns out to be
0.09769.
Box 6 Procedure identical to Box 2.
Box 7 The simulation runs were terminated after 30 experiments. However, the likelihood ratio Eq. (10-6-6) was evaluated and printed out after
each experiment, so that the number of experiments that would have been
required for given confidence levels $\alpha$, $\beta$ could be determined easily. Let
$R_\mu = L^{(2)}/L^{(1)}$ after $\mu$ experiments, and assume $\alpha = \beta$. Then quitting after $\mu$
experiments would be correct had we set $\beta = 1/(1 + R_\mu)$, and our confidence
in preferring the second model after $\mu$ experiments is given by
$$C^{(2)} \equiv 1 - \beta = \frac{R_\mu}{1 + R_\mu} = \frac{L^{(2)}}{L^{(1)} + L^{(2)}} = \frac{(\sigma^{(1)})^\mu}{(\sigma^{(1)})^\mu + (\sigma^{(2)})^\mu} \tag{10-9-8}$$
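Eq. (10-9-8) is easy to evaluate from the logarithm of the likelihood ratio. The sketch below is illustrative; the function names are hypothetical and natural logarithms are assumed, which is consistent with the tabulated pair 0.635 and 0.654 for experiment 9 in Table 10-4.

```python
import math


def log_likelihood_ratio(sd1, sd2, mu):
    """log(L2/L1) = mu * log(sd1/sd2), where sd1 and sd2 are the residual
    standard deviations of models 1 and 2 after mu experiments."""
    return mu * (math.log(sd1) - math.log(sd2))


def confidence_in_model_2(log_ratio):
    """C2 = R / (1 + R) with R = exp(log_ratio), arranged so that the
    exponential never overflows for large |log_ratio|."""
    if log_ratio >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_ratio))
    r = math.exp(log_ratio)
    return r / (1.0 + r)
```

With the residual standard deviations of Table 10-2 and $\mu = 6$, `log_likelihood_ratio` gives about -12.24, the first entry of Table 10-4.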
Box 8 The entire grid search of box 3 was repeated after each experiment.
This, of course, would be impractical in larger problems. The procedure
described in Fig. 10-4 was also tried, and led to results that were very nearly
as good.
Table 10-4 gives the details of experiments 7-30 for the case $\sigma = 0.03$. In
addition to $\mathbf{x}_\mu$ and $y_\mu$ we list the logarithm of the likelihood ratio and the
confidence $C^{(2)}$ in preferring model Eq. (4) over Eq. (3) after each experiment
has been processed.
Similar runs were made with relative error standard deviations of 1%, 3%,
and 6%. Fig. 10-5 summarizes the results of the three runs. It should be
noted that to establish a preference for model Eq. (4) with 95% confidence,
we needed 17 experiments with $\sigma = 0.01$, 21 experiments with $\sigma = 0.03$, and
by the method of Section 10-6 we predicted that 36 experiments would be
required with $\sigma = 0.06$.
For this problem, the use of $\max |y^{(2)} - y^{(1)}|$ as the design criterion
worked just as well as using Eq. (10-5-7).
To determine whether the sequential design procedure employed here
provides any improvement over classical design procedures, the 27 experiments of a 3 × 3 × 3 factorial design were simulated. These are formed by
taking all possible combinations of the independent variables at the following
levels:
$x_1$ = 0.1, 1.3, 2.5
$x_2$ = 0.1, 1.55, 3
$x_3$ = 0.05, 1.375, 2.7
These include the six initial experiments of Table 10-1. The results are compared to those obtained in the sequential design procedure in Table 10-5.
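The factorial design used for comparison can be generated directly from the three levels of each variable. The following sketch, with hypothetical names, also confirms that the six face-centered initial experiments of Table 10-1 belong to it.

```python
import itertools

# Low, middle, and high level of each independent variable.
LEVELS = {
    "x1": (0.1, 1.3, 2.5),
    "x2": (0.1, 1.55, 3.0),
    "x3": (0.05, 1.375, 2.7),
}


def factorial_design(levels):
    """All combinations of the listed levels (27 points for 3 x 3 x 3)."""
    return list(itertools.product(*levels.values()))


def face_centers(levels):
    """One variable at an extreme, the others at their middle level."""
    middles = [v[1] for v in levels.values()]
    points = []
    for i, v in enumerate(levels.values()):
        for extreme in (v[0], v[-1]):
            p = list(middles)
            p[i] = extreme
            points.append(tuple(p))
    return points


design = factorial_design(LEVELS)
initial = face_centers(LEVELS)
```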
Table 10-4
Computer Designed Experiments (σ = 0.03)
                                                         Confidence in preferring
μ     x1     x2     x3      y         log[L(2)/L(1)]     model Eq. (10-9-4)
1-6   (see Table 10-1)                -12.24             0.00005
7     2.5    3      2.7     0.09769   1.29               0.563
8     1.78   0.68   0.05    0.03709   -0.443             0.391
9     0.58   2.42   0.05    0.02649   0.635              0.654
10    1.78   0.39   0.05    0.02243   1.392              0.801
11    2.5    1.84   1.375   0.08126   1.029              0.737
12    0.34   2.42   1.11    0.01589   1.148              0.759
13    1.54   0.68   0.05    0.03363   1.630              0.836
14    0.34   2.42   3.15    0.01588   1.936              0.874
15    2.02   3      2.17    0.08211   2.209              0.901
16    2.5    1.84   1.64    0.08494   1.825              0.861
17    1.54   0.68   0.05    0.03364   2.325              0.911
18    0.34   2.42   0.05    0.01638   2.503              0.924
19    2.02   3      2.435   0.07961   1.757              0.853
20    2.5    1.84   1.905   0.08108   2.499              0.924
21    1.54   0.68   0.05    0.03411   3.081              0.956
22    0.34   2.42   0.05    0.01525   3.677              0.975
23    1.54   0.68   0.05    0.03146   3.077              0.956
24    0.34   2.42   0.05    0.01627   3.280              0.964
25    2.02   3      2.7     0.08174   3.884              0.980
26    2.5    1.84   2.17    0.07816   4.308              0.987
27    1.54   0.68   0.05    0.03096   3.783              0.978
28    2.5    1.84   1.11    0.07775   3.636              0.974
29    0.34   2.42   1.11    0.01610   3.812              0.978
30    2.5    1.84   2.7     0.08084   4.053              0.983
To interpret the numbers in the table, remember that a 0.5 preference
level indicates complete indifference between the two models. Thus, at error
levels of 3% or more, the factorial design completely fails to differentiate
between the models, whereas the sequential procedure generates 83.3% confidence in the correct model even with a 6% error. At a 1% error level, the
factorial design barely prefers the correct model, whereas the sequential design
selects the proper model with almost complete certainty.
Admittedly, systematic errors and other complications that may be
expected in practice were absent from this study. Still, the benefits of the
sequential approach turned out to be very substantial. One has reason to
hope that even under less favorable circumstances, at least some of these
benefits will be retained. In fact, Hunter and Mezaki (1967) have reported
[Figure 10-5: logarithm of the likelihood ratio (scale -10 to 10), with a corresponding confidence scale from 50% to 99.99%, plotted against the number of experiments (6 to 30).]
Fig. 10-5. Sequential discrimination between models. Standard deviation of measurement errors: ○, 1%; ×, 3%; +, 6%.
Table 10-5
Comparison of Experimental Design Procedures
        Preference for model Eq. (10-9-4) after 27 experiments
σ       Factorial design    Sequential design
0.01    0.589               0.9983
0.03    0.481               0.978
0.06    0.500               0.833
successful application of the sequential design procedure to the discrimination
between two alternative models for the kinetics of the catalytic hydrogenation
of propylene. Nine experiments previously performed yielded a likelihood
ratio of $L^{(1)}/L^{(2)} = 1.22$. After a mere four additional properly designed
experiments a firm preference for model 1 was established with $L^{(1)}/L^{(2)} > 99$.
10-10. Design for Decision Making
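So far we have been concerned with the somewhat abstract aim of elucidating the "true" model or parameter values. Consequently, we used an abstract
measure of information to select the experiments. When the parameter values
are required for some specific purpose it may be more appropriate to minimize
the total expected cost of achieving that purpose. We have already (Section
4-16) introduced the loss function $c(\boldsymbol{\theta}^*, \boldsymbol{\theta})$ which represents the cost of using
the value $\boldsymbol{\theta}^*$ where $\boldsymbol{\theta}$ is the true value. Similarly, we introduce a function
$d(\mathbf{X})$ which represents the cost of performing the series of experiments
$\mathbf{X} \equiv [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]^T$. The outcome of the as yet unperformed experiments is
the random variable $\mathbf{Y} \equiv [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n]^T$ whose pdf is $p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$. The latter
function is by definition also the likelihood† $L(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})$ of any hypothetical
sample $\mathbf{Y}$. Hence, given a prior density $p_0(\boldsymbol{\theta})$, we can form the posterior density
$p^*(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y}) = k\, p_0(\boldsymbol{\theta}) L(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})$ for any possible outcome $\mathbf{Y}$. We can also form
the expected (marginal) pdf of $\mathbf{Y}$ by assigning the weight $p_0(\boldsymbol{\theta})$ to each possible
value of $p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$
$$p(\mathbf{Y} \mid \mathbf{X}) = \int p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, p_0(\boldsymbol{\theta})\, d\boldsymbol{\theta} \tag{10-10-1}$$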
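Using Eq. (4-16-1) we can evaluate the risk associated with using the
value $\boldsymbol{\theta}^*$ on the assumption that the outcome of the experiments to be
performed will be some specific value $\mathbf{Y}$
$$R(\boldsymbol{\theta}^* \mid \mathbf{X}, \mathbf{Y}) = \int c(\boldsymbol{\theta}^*, \boldsymbol{\theta})\, p^*(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\, d\boldsymbol{\theta} \tag{10-10-2}$$
Once the outcome $\mathbf{Y}$ becomes known, we shall of course choose $\boldsymbol{\theta}^*$ so as
to minimize the risk. We denote this minimum risk $R^*(\mathbf{X}, \mathbf{Y})$
$$R^*(\mathbf{X}, \mathbf{Y}) \equiv \min_{\boldsymbol{\theta}^*} R(\boldsymbol{\theta}^* \mid \mathbf{X}, \mathbf{Y}) \tag{10-10-3}$$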
We cannot yet evaluate $R^*$ because we have not measured $\mathbf{Y}$. However,
following Raiffa and Schlaifer (1961), we can find the expected value of $R^*$
by averaging over all possible outcomes of the proposed experiments $\mathbf{X}$, using
$p(\mathbf{Y} \mid \mathbf{X})$ as defined in Eq. (1)
$$R(\mathbf{X}) \equiv \int R^*(\mathbf{X}, \mathbf{Y})\, p(\mathbf{Y} \mid \mathbf{X})\, d\mathbf{Y} \tag{10-10-4}$$
† Contrary to previous practice we retain the arguments $\mathbf{X}$ and $\mathbf{Y}$ in the expression for
the likelihood because the data have not yet been taken.
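$R(\mathbf{X})$ is the expected risk associated with performing the experiments $\mathbf{X}$. To
this we add the cost of experimentation $d(\mathbf{X})$ to obtain the total expected
cost of $\mathbf{X}$
$$C(\mathbf{X}) \equiv d(\mathbf{X}) + R(\mathbf{X}) \tag{10-10-5}$$
We shall perform the set of experiments $\mathbf{X}$ for which $C(\mathbf{X})$ is minimum.
Among the possible sets of experiments is the null set, i.e., no experiments
at all. In this case $d(\mathbf{X}) = 0$ and $p^*(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y}) = p_0(\boldsymbol{\theta})$. Hence $R$ does not depend
on $\mathbf{Y}$, and
$$C = R = \min_{\boldsymbol{\theta}^*} R(\boldsymbol{\theta}^*) = \min_{\boldsymbol{\theta}^*} \int c(\boldsymbol{\theta}^*, \boldsymbol{\theta})\, p_0(\boldsymbol{\theta})\, d\boldsymbol{\theta}$$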
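We now analyze the case in which both $p_0(\boldsymbol{\theta})$ and $p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$ are normal.
Let:
$$p_0(\boldsymbol{\theta}) = N_l(\boldsymbol{\theta}_0, \mathbf{V}_0) \tag{10-10-6}$$
$$p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta}) = N_{nm}[\mathbf{F}(\mathbf{X}, \boldsymbol{\theta}), \boldsymbol{\Omega}] \tag{10-10-7}$$
where, as usual, the small-capital $\mathbf{Y}$ denotes the $nm$-dimensional vector obtained by adjoining
to each other the $n$ rows of the matrix $\mathbf{Y}$, and $\boldsymbol{\Omega}$ is the joint covariance matrix of the errors
in all projected experiments, usually given by Eq. (10-3-3). We now assume
that the model equations $\mathbf{F}(\mathbf{X}, \boldsymbol{\theta})$ can be reasonably approximated in the region
of interest by a first-order Taylor series expansion around $\boldsymbol{\theta} = \boldsymbol{\theta}_0$, i.e.,
$$\mathbf{F}(\mathbf{X}, \boldsymbol{\theta}) \approx \mathbf{Y}_0 + \mathbf{B}(\boldsymbol{\theta} - \boldsymbol{\theta}_0) \tag{10-10-8}$$
where $\mathbf{Y}_0 \equiv \mathbf{F}(\mathbf{X}, \boldsymbol{\theta}_0)$ and $\mathbf{B} \equiv (\partial \mathbf{F}/\partial \boldsymbol{\theta})_{\boldsymbol{\theta} = \boldsymbol{\theta}_0}$. Note that $\mathbf{B}$ is a function of $\mathbf{X}$.
Now Eq. (7) can be rewritten as
$$p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta}) = N_{nm}[\mathbf{Y}_0 + \mathbf{B}(\boldsymbol{\theta} - \boldsymbol{\theta}_0), \boldsymbol{\Omega}] \tag{10-10-9}$$
We leave it as an exercise for the reader to show that both the posterior
density of $\boldsymbol{\theta}$ and the marginal density of $\mathbf{Y}$ are also normal. Specifically
$$p^*(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y}) = N_l(\bar{\boldsymbol{\theta}}, \mathbf{V}) \tag{10-10-10}$$
where
$$\bar{\boldsymbol{\theta}} = \boldsymbol{\theta}_0 + (\mathbf{V}_0^{-1} + \mathbf{B}^T \boldsymbol{\Omega}^{-1} \mathbf{B})^{-1} \mathbf{B}^T \boldsymbol{\Omega}^{-1} (\mathbf{Y} - \mathbf{Y}_0) \tag{10-10-11}$$
$$\mathbf{V} = (\mathbf{V}_0^{-1} + \mathbf{B}^T \boldsymbol{\Omega}^{-1} \mathbf{B})^{-1} \tag{10-10-12}$$
and
$$p(\mathbf{Y} \mid \mathbf{X}) = N_{nm}(\mathbf{Y}_0, \boldsymbol{\Pi}) \tag{10-10-13}$$
where
$$\boldsymbol{\Pi} = [\boldsymbol{\Omega}^{-1} - \boldsymbol{\Omega}^{-1} \mathbf{B} \mathbf{V} \mathbf{B}^T \boldsymbol{\Omega}^{-1}]^{-1} \tag{10-10-14}$$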
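The normal-theory results above can be checked numerically. The sketch below is an illustration with hypothetical names; it evaluates Eqs. (10-10-11), (10-10-12), and (10-10-14) literally with explicit inverses, which is adequate for small examples but is not a recommendation for ill-conditioned problems.

```python
import numpy as np


def normal_linear_update(theta0, V0, B, Omega, Y, Y0):
    """Posterior mean and covariance of theta, and the covariance of the
    marginal density of Y, for a normal prior and a linearized model."""
    V0_inv = np.linalg.inv(V0)
    Omega_inv = np.linalg.inv(Omega)
    V = np.linalg.inv(V0_inv + B.T @ Omega_inv @ B)                 # Eq. (10-10-12)
    theta_bar = theta0 + V @ B.T @ Omega_inv @ (Y - Y0)              # Eq. (10-10-11)
    Pi = np.linalg.inv(Omega_inv - Omega_inv @ B @ V @ B.T @ Omega_inv)  # Eq. (10-10-14)
    return theta_bar, V, Pi
```

By the matrix inversion lemma, the marginal covariance of Eq. (10-10-14) should equal $\boldsymbol{\Omega} + \mathbf{B}\mathbf{V}_0\mathbf{B}^T$, which offers a convenient consistency check on any implementation.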
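The situation is particularly tractable if the loss function is quadratic, i.e.,
as in Eq. (4-16-6)
$$c(\boldsymbol{\theta}^*, \boldsymbol{\theta}) = (\boldsymbol{\theta}^* - \boldsymbol{\theta})^T \mathbf{P} (\boldsymbol{\theta}^* - \boldsymbol{\theta}) \tag{10-10-15}$$
where $\mathbf{P}$ is a given positive definite (or at least semidefinite) matrix. As was
shown in Section 4-16, this leads to an optimal choice of $\boldsymbol{\theta}^* = \bar{\boldsymbol{\theta}}$ (the mean of
the posterior distribution). Then the minimum risk is the expected value of
$(\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})^T \mathbf{P} (\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})$ under the posterior distribution, i.e.,
$$R^*(\mathbf{X}, \mathbf{Y}) = E[(\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})^T \mathbf{P} (\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})] = \operatorname{Tr} \mathbf{P}\mathbf{V} \tag{10-10-16}$$
A glance at Eq. (12) indicates that $\mathbf{V}$ and hence $R^*$ are independent of $\mathbf{Y}$,
hence $R(\mathbf{X}) = R^*(\mathbf{X}, \mathbf{Y})$ and
$$C(\mathbf{X}) = d(\mathbf{X}) + \operatorname{Tr} \mathbf{P} (\mathbf{V}_0^{-1} + \mathbf{B}^T \boldsymbol{\Omega}^{-1} \mathbf{B})^{-1} \tag{10-10-17}$$
When Eq. (10-3-3) applies
$$C(\mathbf{X}) = d(\mathbf{X}) + \operatorname{Tr} \mathbf{P} \Big(\mathbf{V}_0^{-1} + \sum_{\mu=1}^{n} \mathbf{B}_\mu^T \mathbf{V}_\mu^{-1} \mathbf{B}_\mu\Big)^{-1} \tag{10-10-18}$$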
When minimizing $C(\mathbf{X})$ we can seek to find the optimal number of experiments as well as the conditions under which they are to be performed. If the
experiments are to be performed in sequence, it is only necessary at any
given time to find the optimal conditions for a single experiment $\mathbf{x}_1$, and
compare the associated cost $\min_{\mathbf{x}_1} C(\mathbf{x}_1)$ with the expected cost of performing
no experiment at all (which is $\operatorname{Tr} \mathbf{P}\mathbf{V}_0$ when Eq. (18) holds). If the outcome
is favorable to the additional experiment, we perform that experiment, replace $p_0(\boldsymbol{\theta})$ by $p^*(\boldsymbol{\theta})$, and repeat the procedure. The stopping rule is obvious;
cease experimentation when the expected cost of no experiment falls below
the minimum expected cost of the next experiment.
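One step of this sequential rule, for the quadratic loss case, might be sketched as follows. The candidate list, its tuple layout, and the function names are assumptions made for illustration; a finite list of candidate experiments stands in for a continuous minimization over the experimental conditions, and a tie is resolved in favor of stopping.

```python
import numpy as np


def expected_total_cost(P, V0, B, V_err, cost):
    """Eq. (10-10-18) for one candidate experiment with sensitivity matrix B,
    error covariance V_err, and experimental cost `cost`."""
    info = B.T @ np.linalg.inv(V_err) @ B
    V_post = np.linalg.inv(np.linalg.inv(V0) + info)
    return cost + np.trace(P @ V_post), V_post


def next_experiment(P, V0, candidates):
    """candidates: iterable of (label, B, V_err, cost).

    Returns (label, V_post) for the cheapest candidate, or None when the
    expected cost of no experiment, Tr(P V0), does not exceed it."""
    stop_cost = np.trace(P @ V0)
    best = None
    for label, B, V_err, cost in candidates:
        total, V_post = expected_total_cost(P, V0, B, V_err, cost)
        if best is None or total < best[0]:
            best = (total, label, V_post)
    if best is None or stop_cost <= best[0]:
        return None
    return best[1], best[2]
```

In a sequential loop the returned posterior covariance takes the place of `V0` before the next call; in a nonlinear model `B` would also be re-evaluated at the updated parameter estimate.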
It must be admitted that while the procedures outlined above are concep-
tually simple and appealing, their implementation is difficult in most practical
situations. While the minimization of Eq. (18) is no more difficult than the
minimization of Eq. (10-3-2), almost any other loss function leads to severe
computational difficulties which arise from the need to evaluate multiple
infinite integrals for all possible values of $\mathbf{X}$, $\mathbf{Y}$, and $\boldsymbol{\theta}$. To this must be added
the difficulty of assigning realistic cost functions, a by no means trivial task.
10-11. Problems
1. Verify Eqs. (10-10-10)-(10-10-12).
2. Using Eq. (10-2-2), Eq. (10-2-4), and Eq. (10-10-12), show that in the
case of normal prior and error distributions and a linear model, the value of $I$
is positive; i.e., one gains information no matter what the outcome of the
experiment. For more general results, see Lindley (1956).
3. Derive Eq. (10-5-6).
4. Derive a decision-theoretic design criterion for discriminating between
alternative models. Assume that one is given prior probabilities $\pi^{(i)}$ that the
$i$th model is correct, and that the loss function has the form $c_{ij}(\boldsymbol{\theta}^*, \boldsymbol{\theta})$ which
represents the cost of assuming that model $i$ holds with parameter values $\boldsymbol{\theta}^*$,
when in fact model $j$ holds with parameter values $\boldsymbol{\theta}$.
Appendix A
Matrix Analysis
A-1. Matrix Algebra
The reader unfamiliar with matrix notation may prefer to write out
matrix expressions in full. But he will soon develop facility in manipulating
matrices and will no longer need subscripts and summations. This will greatly
enhance his insight and enjoyment of the subject.
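Throughout the book, boldface normal size capital letters (both latin and
greek) denote matrices, e.g.,
$$\mathbf{A} = [A_{ij}] = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix} \tag{A-1-1}$$
is an $m \times n$ matrix. $\mathbf{A}$ is square if $m = n$. A matrix all of whose elements are
zero is denoted $\mathbf{0}$ and is called the null matrix. Boldface small capital letters
denote column vectors obtained by adjoining to each other the rows of the
corresponding matrix. Thus, if $\mathbf{A}$ is defined by Eq. (1), then
$$\text{\textbf{\textsc{a}}} = \begin{bmatrix} A_{11} \\ A_{12} \\ \vdots \\ A_{1n} \\ A_{21} \\ A_{22} \\ \vdots \\ A_{mn} \end{bmatrix} \tag{A-1-2}$$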
Boldface lower case letters denote column vectors, e.g.,
$$\mathbf{a} = [a_i] = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix} \tag{A-1-3}$$
is an $m$-dimensional vector. A vector all of whose elements are zero is denoted $\mathbf{0}$. All nonboldface characters are scalars. Capital or lower case nonboldface letters with subscripts may be elements of the corresponding matrix
or vector. A subscripted boldface letter indicates one in a set of vectors or
matrices.
The superscript T denotes transposition. If $\mathbf{A}$ is defined by Eq. (1), then
$$\mathbf{A}^T = [A_{ji}] = \begin{bmatrix} A_{11} & A_{21} & \cdots & A_{m1} \\ A_{12} & A_{22} & \cdots & A_{m2} \\ \vdots & \vdots & & \vdots \\ A_{1n} & A_{2n} & \cdots & A_{mn} \end{bmatrix} \tag{A-1-4}$$
is an $n \times m$ matrix. A square matrix $\mathbf{A}$ is symmetric if $\mathbf{A}^T = \mathbf{A}$; i.e., $A_{ij} = A_{ji}$
for all $i$ and $j$.
If $\mathbf{a}$ is defined by Eq. (3), then
$$\mathbf{a}^T = [a_1, a_2, \ldots, a_m] \tag{A-1-5}$$
is an $m$-dimensional row vector.
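If $\mathbf{A}$ and $\mathbf{B}$ are both $m \times n$, then $[\mathbf{A} + \mathbf{B}]_{ij} = A_{ij} + B_{ij}$.
We define the following matrix products:
(a) $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times k$. Then $\mathbf{AB}$ is the $m \times k$ matrix whose $i, j$ element
is
$$[\mathbf{AB}]_{ij} = \sum_{l=1}^{n} A_{il} B_{lj} \qquad (i = 1, 2, \ldots, m;\; j = 1, 2, \ldots, k) \tag{A-1-6}$$
(b) $\mathbf{A}$ is $m \times n$ and $\mathbf{b}$ is $n$-dimensional. Then $\mathbf{Ab}$ is the $m$-dimensional column
vector whose $i$th element is
$$[\mathbf{Ab}]_i = \sum_{l=1}^{n} A_{il} b_l \qquad (i = 1, 2, \ldots, m) \tag{A-1-7}$$
(c) $\mathbf{A}$ is $m \times n$ and $\mathbf{b}$ is $m$-dimensional. Then $\mathbf{b}^T \mathbf{A}$ is the $n$-dimensional row
vector whose $i$th element is
$$[\mathbf{b}^T \mathbf{A}]_i = \sum_{l=1}^{m} b_l A_{li} \tag{A-1-8}$$
(d) $\mathbf{a}$ and $\mathbf{b}$ are $m$-dimensional. Then the inner product $\mathbf{a}^T \mathbf{b} = \mathbf{b}^T \mathbf{a}$ is the
scalar
$$\mathbf{a}^T \mathbf{b} = \sum_{i=1}^{m} a_i b_i \tag{A-1-9}$$
The inner product of a vector with itself, i.e., $\mathbf{a}^T \mathbf{a}$, is the square of the length
(also called norm) of $\mathbf{a}$. We use the notation $\|\mathbf{a}\|$ to designate the norm of $\mathbf{a}$.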
-', l
A-I, Matrix Algebra
289
(e) $\mathbf{a}$ is $m$-dimensional and $\mathbf{b}$ is $n$-dimensional. Then the outer product $\mathbf{ab}^T$
is the $m \times n$ matrix whose $i, j$ element is
$$[\mathbf{ab}^T]_{ij} = a_i b_j \qquad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n) \tag{A-1-10}$$
If we regard an $m$-dimensional column vector as an $m \times 1$ matrix and a
similar row vector as a $1 \times m$ matrix, then all the above products become
special cases of (a). From these definitions, one can work out the product of
any number of terms. For example, the quantity $\mathbf{a}^T\mathbf{Ab}$ is the scalar
$$\mathbf{a}^T\mathbf{Ab} = \sum_{i,j} a_i A_{ij} b_j \tag{A-1-11}$$
which may be verified by applying Eq. (7) first, and then Eq. (9). This is
permissible because matrix and vector products are associative, i.e.,
$$\mathbf{a}^T\mathbf{Ab} = \mathbf{a}^T(\mathbf{Ab}) = (\mathbf{a}^T\mathbf{A})\mathbf{b}$$
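Let $\mathbf{A}$ be a square $m \times m$ matrix. The main diagonal of $\mathbf{A}$ is the set of
elements $A_{11}, A_{22}, \ldots, A_{mm}$. A diagonal matrix is one whose only nonzero
elements are on the main diagonal. The identity matrix $\mathbf{I}$ is a diagonal matrix,
all of whose diagonal elements are unity, i.e.,
$$\mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \tag{A-1-12}$$
or
$$I_{ij} = \delta_{ij} \equiv \begin{cases} 1 & (i = j) \\ 0 & (i \ne j) \end{cases} \tag{A-1-13}$$
The symbol $\delta_{ij}$ is called the Kronecker delta. Clearly
$$\mathbf{IA} = \mathbf{A}, \qquad \mathbf{BI} = \mathbf{B}, \qquad \mathbf{Ia} = \mathbf{a}, \qquad \mathbf{b}^T\mathbf{I} = \mathbf{b}^T$$
for any suitable matrices $\mathbf{A}$ and $\mathbf{B}$, and vectors $\mathbf{a}$ and $\mathbf{b}$.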
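If $\mathbf{A}$ is a square matrix, then $\mathbf{A}^{-1}$ designates a matrix (if one exists) such
that
$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I} \tag{A-1-14}$$
$\mathbf{A}^{-1}$ is called the inverse of $\mathbf{A}$. A matrix $\mathbf{A}$ can possess at most one inverse.
If $\mathbf{A}$ has no inverse, it is said to be singular.
The following relations may be derived easily:
$$(\mathbf{Ab})^T = \mathbf{b}^T\mathbf{A}^T, \quad (\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T, \quad (\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}, \quad (\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T \tag{A-1-15}$$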
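A nonzero vector $\mathbf{v}$ is an eigenvector of the square matrix $\mathbf{A}$, and $\lambda$ is the
associated eigenvalue, if
$$\mathbf{Av} = \lambda\mathbf{v} \tag{A-1-16}$$
Vectors $\mathbf{a}$ and $\mathbf{b}$ are orthogonal to each other if $\mathbf{a}^T\mathbf{b} = 0$. If $\mathbf{A}$ is symmetric
$m \times m$, then one can find $m$ mutually orthogonal eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m$
of $\mathbf{A}$. Usually, we normalize the vectors so that
$$\mathbf{v}_i^T\mathbf{v}_j = \delta_{ij} \qquad (i, j = 1, 2, \ldots, m) \tag{A-1-17}$$
The $\mathbf{v}_i$ then form a set of orthonormal eigenvectors of $\mathbf{A}$.
Let $\mathbf{V}$ be the $m \times m$ matrix whose $i$th column is $\mathbf{v}_i$. In view of Eq. (17),
we have $\mathbf{V}^T\mathbf{V} = \mathbf{V}\mathbf{V}^T = \mathbf{I}$, i.e., $\mathbf{V}^T = \mathbf{V}^{-1}$. The matrix $\mathbf{V}$ is said to be unitary.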
If $\mathbf{Ax} = \mathbf{0}$ ($\mathbf{x} \ne \mathbf{0}$), then $\mathbf{x}$ is called a null vector of $\mathbf{A}$. If $\mathbf{A}$ is square, then
it can possess null vectors only if it is singular. A singular matrix has at least
one zero eigenvalue.
Let $\mathbf{x}$ be a vector and $\mathbf{A}$ a symmetric matrix. The scalar $\mathbf{x}^T\mathbf{Ax}$ may be
regarded as a function of $\mathbf{x}$. It is called the quadratic form associated with $\mathbf{A}$.
The matrix $\mathbf{A}$ is positive definite if $\mathbf{x}^T\mathbf{Ax} > 0$ for all $\mathbf{x} \ne \mathbf{0}$, and positive
semidefinite if $\mathbf{x}^T\mathbf{Ax} \ge 0$ for all $\mathbf{x}$. Negative definiteness is defined
analogously. All eigenvalues of a positive definite or positive semidefinite matrix
are positive or nonnegative, respectively.
The symbol $A_{ij}^{-1}$ is used to denote the $i, j$ element of $\mathbf{A}^{-1}$, and not the
reciprocal of $A_{ij}$.
If $\mathbf{A}$ is a square nonsingular matrix and $\mathbf{y}$ is a known vector, then the
solution to the set of simultaneous linear equations
$$\mathbf{Ax} = \mathbf{y} \tag{A-1-18}$$
is given by
$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{y} \tag{A-1-19}$$
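Suppose $\mathbf{A}$ is any matrix, not necessarily square. Then there exists [see
Penrose (1955)] a unique matrix $\mathbf{A}^+$, called the pseudoinverse of $\mathbf{A}$, satisfying
the relations
$$\mathbf{A}\mathbf{A}^+\mathbf{A} = \mathbf{A}, \qquad \mathbf{A}^+\mathbf{A}\mathbf{A}^+ = \mathbf{A}^+, \qquad \mathbf{A}^+\mathbf{A} = (\mathbf{A}^+\mathbf{A})^T, \qquad \mathbf{A}\mathbf{A}^+ = (\mathbf{A}\mathbf{A}^+)^T \tag{A-1-20}$$
If $\mathbf{A}$ is square nonsingular, then $\mathbf{A}^+ = \mathbf{A}^{-1}$. If $\mathbf{A}$ is $m \times n$, then $\mathbf{A}^+$ is $n \times m$.
If the equations $\mathbf{Ax} = \mathbf{y}$ have a solution, then $\mathbf{x} = \mathbf{A}^+\mathbf{y}$ is the solution of
minimum length. If $\mathbf{Ax} = \mathbf{y}$ has no solution, then $\mathbf{x} = \mathbf{A}^+\mathbf{y}$ minimizes the
sum of squares of the deviations $\mathbf{y} - \mathbf{Ax}$; and of all vectors having this
property, $\mathbf{x} = \mathbf{A}^+\mathbf{y}$ has minimum length.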
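The trace of an $m \times m$ matrix $\mathbf{A}$ is the scalar
$$\operatorname{Tr}(\mathbf{A}) \equiv \sum_{i=1}^{m} A_{ii} \tag{A-1-21}$$
The trace of a matrix is equal to the sum of its eigenvalues, and the
determinant of a matrix is equal to the product of its eigenvalues.
One verifies easily that
$$\operatorname{Tr}(\mathbf{AB}) = \sum_{i,j} A_{ij} B_{ji} = \operatorname{Tr}(\mathbf{BA})$$
Hence
$$\operatorname{Tr}(\mathbf{ab}^T) = \mathbf{b}^T\mathbf{a} = \mathbf{a}^T\mathbf{b}$$
and
$$\mathbf{a}^T\mathbf{Aa} = \operatorname{Tr}(\mathbf{Aaa}^T)$$
If $\mathbf{I}$ is the $m \times m$ identity matrix, then $\operatorname{Tr}(\mathbf{I}) = m$ and $\det(\mathbf{I}) = 1$.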
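The trace identities above are easy to check numerically. The following is a minimal sketch in plain Python; the helper names `matmul` and `trace` and all of the numerical values are illustrative assumptions, not part of the text.

```python
def matmul(A, B):
    """Product of an m x n and an n x k matrix, element by element as in Eq. (A-1-6)."""
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def trace(A):
    """Sum of the main-diagonal elements of a square matrix, Eq. (A-1-21)."""
    return sum(A[i][i] for i in range(len(A)))


# Illustrative data (hypothetical, chosen only for the check).
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # 2 x 3
B = [[1.0, 0.0], [2.0, -1.0], [0.5, 3.0]]     # 3 x 2
tr_AB = trace(matmul(A, B))                   # trace of the 2 x 2 product
tr_BA = trace(matmul(B, A))                   # trace of the 3 x 3 product

S = [[2.0, 1.0], [1.0, 3.0]]
a = [1.0, 2.0]
quad_form = sum(a[i] * S[i][j] * a[j] for i in range(2) for j in range(2))  # a^T S a
outer = [[a[i] * a[j] for j in range(2)] for i in range(2)]                 # a a^T
tr_form = trace(matmul(S, outer))                                           # Tr(S a a^T)

print(tr_AB, tr_BA, quad_form, tr_form)
```

Note that `AB` and `BA` have different sizes here, yet their traces agree, which is what the identity asserts.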
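Let $\mathbf{A}$ be the $m \times n$ matrix defined by Eq. (1). Suppose $k$ and $l$ are positive
integers satisfying $k < m$ and $l < n$. Define the following matrices:
$$\mathbf{B} \equiv \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1l} \\ A_{21} & A_{22} & \cdots & A_{2l} \\ \vdots & \vdots & & \vdots \\ A_{k1} & A_{k2} & \cdots & A_{kl} \end{bmatrix}, \qquad \mathbf{C} \equiv \begin{bmatrix} A_{1,l+1} & A_{1,l+2} & \cdots & A_{1n} \\ A_{2,l+1} & A_{2,l+2} & \cdots & A_{2n} \\ \vdots & \vdots & & \vdots \\ A_{k,l+1} & A_{k,l+2} & \cdots & A_{kn} \end{bmatrix}$$
$$\mathbf{D} \equiv \begin{bmatrix} A_{k+1,1} & A_{k+1,2} & \cdots & A_{k+1,l} \\ A_{k+2,1} & A_{k+2,2} & \cdots & A_{k+2,l} \\ \vdots & \vdots & & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{ml} \end{bmatrix}, \qquad \mathbf{E} \equiv \begin{bmatrix} A_{k+1,l+1} & A_{k+1,l+2} & \cdots & A_{k+1,n} \\ A_{k+2,l+1} & A_{k+2,l+2} & \cdots & A_{k+2,n} \\ \vdots & \vdots & & \vdots \\ A_{m,l+1} & A_{m,l+2} & \cdots & A_{mn} \end{bmatrix} \tag{A-1-22}$$
We write the matrix $\mathbf{A}$ in partitioned form as
$$\mathbf{A} = \begin{bmatrix} \mathbf{B} & \mathbf{C} \\ \mathbf{D} & \mathbf{E} \end{bmatrix} \tag{A-1-23}$$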
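Matrices in partitioned form may be multiplied as though the submatrices
were elements, provided the resulting expressions make sense. For instance,
let $\mathbf{x}$ be an $n$-dimensional vector partitioned as follows
$$\mathbf{x} = \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} \tag{A-1-24}$$
where
$$\mathbf{a} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_l \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} x_{l+1} \\ x_{l+2} \\ \vdots \\ x_n \end{bmatrix} \tag{A-1-25}$$
Then one may easily verify that
$$\mathbf{Ax} = \begin{bmatrix} \mathbf{B} & \mathbf{C} \\ \mathbf{D} & \mathbf{E} \end{bmatrix} \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{Ba} + \mathbf{Cb} \\ \mathbf{Da} + \mathbf{Eb} \end{bmatrix} \tag{A-1-26}$$
Note that this makes sense only if $\mathbf{x}$ is partitioned so that the dimension of $\mathbf{a}$
equals the number of columns in $\mathbf{B}$ and $\mathbf{D}$.
The partitioning of a matrix into more than four submatrices proceeds
analogously.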
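The rank of a matrix is the maximum number of linearly independent
columns or rows in the matrix (it makes no difference whether we take rows
or columns). A nonzero vector has rank 1. The rank of a square matrix equals
the number of nonzero eigenvalues. We have:
$$\operatorname{rank}(\mathbf{A} + \mathbf{B}) \le \operatorname{rank}\mathbf{A} + \operatorname{rank}\mathbf{B} \tag{A-1-27}$$
$$\operatorname{rank}(\mathbf{AB}) \le \min(\operatorname{rank}\mathbf{A}, \operatorname{rank}\mathbf{B}) \tag{A-1-28}$$
It follows that
$$\operatorname{rank}(\mathbf{ab}^T) = \begin{cases} 1 & (\mathbf{a} \ne \mathbf{0} \ne \mathbf{b}) \\ 0 & (\mathbf{a} = \mathbf{0} \text{ or } \mathbf{b} = \mathbf{0}) \end{cases} \tag{A-1-29}$$
and
$$\operatorname{rank} \sum_{i=1}^{n} \mathbf{a}_i\mathbf{b}_i^T \le n \tag{A-1-30}$$
A matrix whose rank equals the number of rows or columns (whichever is
less) is said to be of full rank. A square matrix of full rank is nonsingular,
and vice versa.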
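A matrix of the form $\mathbf{A} = \mathbf{aa}^T$ is positive semidefinite, because for every
vector $\mathbf{x}$
$$\mathbf{x}^T\mathbf{Ax} = (\mathbf{x}^T\mathbf{a})^2 \ge 0$$
The sum of positive semidefinite matrices is positive semidefinite. Hence
$\sum_{i=1}^{n} \mathbf{a}_i\mathbf{a}_i^T$ is positive semidefinite.
If $\mathbf{A}$ is positive semidefinite, then so is $\mathbf{B}^T\mathbf{AB}$ where $\mathbf{B}$ is any matrix or
vector.
Let $\mathbf{A}$ be a square matrix. Suppose $\lambda_{\min}$ and $\lambda_{\max}$ are the eigenvalues of $\mathbf{A}$
with smallest and largest absolute values, respectively. Then for any vector $\mathbf{b}$:
$$|\lambda_{\min}|\,\|\mathbf{b}\| \le \|\mathbf{Ab}\| \le |\lambda_{\max}|\,\|\mathbf{b}\| \tag{A-1-31}$$
$$|\lambda_{\min}|\,\|\mathbf{b}\|^2 \le |\mathbf{b}^T\mathbf{Ab}| \le |\lambda_{\max}|\,\|\mathbf{b}\|^2 \tag{A-1-32}$$
If $\mathbf{A}$ and $\mathbf{B}$ are $m \times n$ and $n \times m$ matrices, respectively, then (Wilkinson,
1965, p. 54)
$$\det(\mathbf{I}_m + \mathbf{AB}) = \det(\mathbf{I}_n + \mathbf{BA}) \tag{A-1-33}$$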
A-2. Matrix Differentiation
Let $\alpha$ be a scalar function of a vector $\mathbf{a}$ and a matrix $\mathbf{A}$; let $\mathbf{b}$ be a vector
function of a scalar $\beta$ and a vector $\mathbf{c}$, and let $\mathbf{C}$ be a matrix function of a
scalar $\gamma$. Table A-1 lists the various derivatives that may be formed.
Derivatives of vectors with respect to matrices, and matrices with respect to vectors
and matrices, require more than two subscripts. They cannot, therefore, be
represented in matrix notation. On the rare occasion when they are needed,
subscript notation will be used.
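Table A-1
Matrix Derivatives

The symbol                              is a                whose elements are
$\partial\alpha/\partial\mathbf{a}$     column vector^a     $(\partial\alpha/\partial\mathbf{a})_i = \partial\alpha/\partial a_i$
$\partial\alpha/\partial\mathbf{A}$     matrix              $(\partial\alpha/\partial\mathbf{A})_{ij} = \partial\alpha/\partial A_{ij}$
$\partial\mathbf{b}/\partial\beta$      column vector       $(\partial\mathbf{b}/\partial\beta)_i = \partial b_i/\partial\beta$
$\partial\mathbf{b}/\partial\mathbf{c}$ matrix              $(\partial\mathbf{b}/\partial\mathbf{c})_{ij} = \partial b_i/\partial c_j$
$\partial\mathbf{C}/\partial\gamma$     matrix              $(\partial\mathbf{C}/\partial\gamma)_{ij} = \partial C_{ij}/\partial\gamma$

^a A case may be made for defining $\partial\alpha/\partial\mathbf{a}$ as a row vector,
but we prefer to regard all vectors that do not carry the
symbol T as column vectors.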
To differentiate a product of vectors and matrices with respect to one
term, we proceed as follows (assume we are computing $\partial\alpha/\partial\mathbf{A}$):
1. Write the expression out in terms of subscripts and summations. Do
not use the symbols $i$ and $j$ as subscripts.
2. Suppose the term $A_{kl}$ appears in the summation. Remove this term,
replace the remaining appearances of subscripts $k$ and $l$ with $i$ and $j$,
respectively, and remove summations with respect to $k$ and $l$. The result is the
derivative with respect to $A_{ij}$.
3. Reorder the expression so that the term containing $i$ appears first and
the term containing $j$ appears last. Reorder the other terms so that any two
occurrences of other indices are in consecutive terms. It may happen that
some of the terms are left over. These terms can be grouped to form a scalar,
which can be placed in front of the remaining matrix expression, as in
example (e) below.
4. Drop all summations and indices. Add T symbols where necessary.
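Examples
(a) $\alpha = \mathbf{a}^T\mathbf{Ab}$.
1. $\alpha = \sum_{k,l} a_k A_{kl} b_l$.
2. $\partial\alpha/\partial A_{ij} = a_i b_j$.
3. $\partial\alpha/\partial A_{ij} = a_i b_j$.
4. $\partial\alpha/\partial\mathbf{A} = \mathbf{ab}^T$.
(b) $\alpha = \operatorname{Tr}(\mathbf{BA}^T\mathbf{C})$.
1. $\alpha = \sum_{m,k,l} B_{ml} A_{kl} C_{km}$.
2. $\partial\alpha/\partial A_{ij} = \sum_m B_{mj} C_{im}$.
3. $\partial\alpha/\partial A_{ij} = \sum_m C_{im} B_{mj}$.
4. $\partial\alpha/\partial\mathbf{A} = \mathbf{CB}$.
If the matrix $\mathbf{A}$ appears more than once, each appearance should be
treated separately and the results added.
Example
(c) $\alpha = \mathbf{a}^T\mathbf{ABA}^T\mathbf{b}$.
1. $\alpha = \sum_{k,l,m,n} a_k A_{kl} B_{lm} A_{nm} b_n$.
2. $\partial\alpha/\partial A_{ij} = \sum_{m,n} a_i B_{jm} A_{nm} b_n + \sum_{k,l} a_k A_{kl} B_{lj} b_i$.
3. $\partial\alpha/\partial A_{ij} = \sum_{m,n} a_i b_n A_{nm} B_{jm} + \sum_{k,l} b_i a_k A_{kl} B_{lj}$.
4. $\partial\alpha/\partial\mathbf{A} = \mathbf{ab}^T\mathbf{AB}^T + \mathbf{ba}^T\mathbf{AB}$.
The handling of other derivatives is analogous.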
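Examples
(d) Compute $\partial\alpha/\partial\mathbf{a}$, where $\alpha = \mathbf{a}^T\mathbf{Aa}$.
1. $\alpha = \sum_{k,l} a_k A_{kl} a_l$.
2. $\partial\alpha/\partial a_i = \sum_l A_{il} a_l + \sum_k a_k A_{ki}$.
3. $\partial\alpha/\partial a_i = \sum_l A_{il} a_l + \sum_k A_{ki} a_k$.
4. $\partial\alpha/\partial\mathbf{a} = \mathbf{Aa} + \mathbf{A}^T\mathbf{a}$. If $\mathbf{A}$ is symmetric, $\partial\alpha/\partial\mathbf{a} = 2\mathbf{Aa}$.
(e) Compute $\partial\mathbf{b}/\partial\mathbf{c}$, where $\mathbf{b} = \mathbf{Aca}^T\mathbf{Bc}$.
1. $b_i = \sum_{k,l,m} A_{ik} c_k a_l B_{lm} c_m$.
2. $\partial b_i/\partial c_j = \sum_{l,m} A_{ij} a_l B_{lm} c_m + \sum_{k,l} A_{ik} c_k a_l B_{lj}$.
3. $\partial b_i/\partial c_j = \sum_{l,m} (a_l B_{lm} c_m) A_{ij} + \sum_{k,l} A_{ik} c_k a_l B_{lj}$.
4. $\partial\mathbf{b}/\partial\mathbf{c} = (\mathbf{a}^T\mathbf{Bc})\mathbf{A} + \mathbf{Aca}^T\mathbf{B}$
(note that the term $\sum_{l,m} a_l B_{lm} c_m = \mathbf{a}^T\mathbf{Bc}$ is a scalar and can be placed
anywhere in a product).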
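The result of example (d) can be checked against central finite differences. The sketch below is a plain-Python illustration under stated assumptions: the function names, the step size `h`, and the test matrix are hypothetical choices, not taken from the text.

```python
def quad_form(A, a):
    """alpha = a^T A a written out in subscripts (step 1 of example (d))."""
    n = len(a)
    return sum(a[k] * A[k][l] * a[l] for k in range(n) for l in range(n))


def grad_formula(A, a):
    """Step 4 of example (d): the column vector A a + A^T a."""
    n = len(a)
    return [sum(A[i][l] * a[l] for l in range(n)) +
            sum(A[k][i] * a[k] for k in range(n))
            for i in range(n)]


def grad_numeric(A, a, h=1e-5):
    """Central-difference approximation to d(alpha)/d(a_i)."""
    grad = []
    for i in range(len(a)):
        up, down = list(a), list(a)
        up[i] += h
        down[i] -= h
        grad.append((quad_form(A, up) - quad_form(A, down)) / (2.0 * h))
    return grad


# A deliberately nonsymmetric matrix, so that A a and A^T a differ.
A = [[1.0, 2.0], [-3.0, 4.0]]
a = [0.5, -1.5]
print(grad_formula(A, a))
print(grad_numeric(A, a))
```

Because the function is quadratic in `a`, the central difference has no truncation error, so the two gradients should agree up to rounding.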
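We shall also need the following derivatives:
(f) We wish to compute $\partial A_{kl}^{-1}/\partial A_{ij}$, where $A_{kl}^{-1}$ is the $k, l$ element of $\mathbf{A}^{-1}$.
By definition
$$\partial A_{kl}^{-1}/\partial A_{ij} = \lim_{\varepsilon \to 0}\, (1/\varepsilon)\,[(\mathbf{A} + \varepsilon\mathbf{B})^{-1} - \mathbf{A}^{-1}]_{kl} \tag{A-2-1}$$
where $\mathbf{B}$ is a matrix whose $m, n$ element is $\delta_{mi}\delta_{nj}$; i.e., the $i, j$ element is unity,
and all other elements are zero. Now
$$(\mathbf{A} + \varepsilon\mathbf{B})^{-1} = [\mathbf{A}(\mathbf{I} + \varepsilon\mathbf{A}^{-1}\mathbf{B})]^{-1} = (\mathbf{I} + \varepsilon\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{A}^{-1} \tag{A-2-2}$$
For sufficiently small $\varepsilon$ the following series expansion is valid
$$(\mathbf{I} + \varepsilon\mathbf{A}^{-1}\mathbf{B})^{-1}\mathbf{A}^{-1} = (\mathbf{I} - \varepsilon\mathbf{A}^{-1}\mathbf{B} + \varepsilon^2\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}\mathbf{B} - \cdots)\mathbf{A}^{-1} = \mathbf{A}^{-1} - \varepsilon\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1} + \varepsilon^2\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1} - \cdots \tag{A-2-3}$$
and we can prove easily that
$$\lim_{\varepsilon \to 0}\, (1/\varepsilon)\,[(\mathbf{A} + \varepsilon\mathbf{B})^{-1} - \mathbf{A}^{-1}] = -\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1} \tag{A-2-4}$$
Therefore
$$\partial A_{kl}^{-1}/\partial A_{ij} = -[\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}]_{kl} = -\sum_{m,n} A_{km}^{-1} B_{mn} A_{nl}^{-1} = -\sum_{m,n} A_{km}^{-1}\delta_{mi}\delta_{nj} A_{nl}^{-1} = -A_{ki}^{-1} A_{jl}^{-1} \tag{A-2-5}$$
which is the desired result.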
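(g) Now we can evaluate, for example, $\partial\alpha/\partial\mathbf{A}$ where $\alpha \equiv \mathbf{x}^T\mathbf{A}^{-1}\mathbf{x}$. Indeed:
$$\alpha = \sum_{k,l} x_k A_{kl}^{-1} x_l$$
$$\partial\alpha/\partial A_{ij} = \sum_{k,l} x_k (\partial A_{kl}^{-1}/\partial A_{ij}) x_l = -\sum_{k,l} x_k A_{ki}^{-1} A_{jl}^{-1} x_l = -\sum_{k,l} A_{ki}^{-1} x_k x_l A_{jl}^{-1} \tag{A-2-6}$$
so that
$$\partial(\mathbf{x}^T\mathbf{A}^{-1}\mathbf{x})/\partial\mathbf{A} = -(\mathbf{A}^{-1})^T\mathbf{xx}^T(\mathbf{A}^{-1})^T \tag{A-2-7}$$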
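(h) Let $\alpha = \det\mathbf{A}$. We wish to evaluate $\partial\alpha/\partial\mathbf{A}$. Let us expand the determinant
in cofactors of the $i$th row, i.e.,
$$\det\mathbf{A} = \sum_k A_{ik}\bar{A}_{ik} \tag{A-2-8}$$
where $\bar{A}_{ik}$ is the cofactor of $A_{ik}$. $\bar{A}_{ik}$ does not depend on any of the elements
in the $i$th row. Therefore
$$\partial\det\mathbf{A}/\partial A_{ik} = \bar{A}_{ik} \tag{A-2-9}$$
As is well known
$$A_{ki}^{-1} = \bar{A}_{ik}/\det\mathbf{A} \tag{A-2-10}$$
Hence, $\bar{A}_{ik} = A_{ki}^{-1}\det\mathbf{A}$ and
$$\partial\det\mathbf{A}/\partial\mathbf{A} = (\mathbf{A}^{-1})^T\det\mathbf{A} \tag{A-2-11}$$
Furthermore
$$\partial\log\det\mathbf{A}/\partial\mathbf{A} = (1/\det\mathbf{A})\,\partial\det\mathbf{A}/\partial\mathbf{A} = (\mathbf{A}^{-1})^T \tag{A-2-12}$$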
A-3. Pivoting and Sweeping
Many computations involving matrices may be viewed as a sequence of
operations called pivoting. It is useful to examine the pivoting operation in
detail, and list some of its applications. In the sequel we always assume that
we start with a given matrix B which is progressively modified by successive
pivotings. Unless otherwise stated, whenever we refer to B or to any of its
elements we mean the current, rather than the original values.
Definition. Suppose $B_{ij} \ne 0$ for some pair of indices $i, j$. Then performing
a Gauss-Jordan pivot, or simply pivoting on $(i, j)$, means changing the elements
of $\mathbf{B}$ according to the following scheme:
1. Replace $B_{pq}$ by $B_{pq} - B_{iq}B_{pj}/B_{ij}$ for all $p \ne i$, $q \ne j$.
2. Replace $B_{iq}$ by $B_{iq}/B_{ij}$ for all $q \ne j$.
3. Replace $B_{pj}$ by $-B_{pj}/B_{ij}$ for all $p \ne i$.
4. Replace $B_{ij}$ by $1/B_{ij}$.
The element $B_{ij}$ (before pivoting) is referred to as the pivot. Pivoting on
$(i, i)$, i.e., with a pivot on the main diagonal, is referred to as sweeping (Beaton,
1964) row $i$. Two pivots are unrelated if they differ in both row and column,
i.e., $B_{ij}$ and $B_{kl}$ are unrelated if $i \ne k$ and $j \ne l$. The following properties are
easily verified:
1. Pivoting is reversible, i.e., pivoting on $(i, j)$ twice restores the original
matrix.
2. Pivoting on unrelated pivots is commutative, i.e., pivoting first on $(i, j)$
and then on $(k, l)$ produces the same matrix as pivoting first on $(k, l)$ and then
on $(i, j)$, provided $i \ne k$ and $j \ne l$. Since different elements on the main diagonal
are unrelated, it follows that sweeps are always commutative.
3. From 1 and 2 we deduce that pivoting in sequence on $(i, j)$, $(k, l)$, and
$(i, j)$ is equivalent to pivoting on $(k, l)$ alone if $(i, j)$ and $(k, l)$ are unrelated.
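The four replacement rules of the definition translate directly into code. The following is a minimal plain-Python sketch; the function name `pivot`, the use of 0-based indices, and the 2 x 2 test matrix are assumptions made for illustration.

```python
def pivot(B, i, j):
    """Gauss-Jordan pivot on element (i, j) of B, in place (0-based indices)."""
    piv = B[i][j]
    if piv == 0:
        raise ZeroDivisionError("pivot element is zero")
    rows, cols = len(B), len(B[0])
    # Rule 1: update every element outside pivot row i and pivot column j.
    # Row i and column j are not modified here, so their original values are used.
    for p in range(rows):
        if p != i:
            for q in range(cols):
                if q != j:
                    B[p][q] -= B[i][q] * B[p][j] / piv
    # Rule 2: divide the rest of the pivot row by the pivot.
    for q in range(cols):
        if q != j:
            B[i][q] /= piv
    # Rule 3: divide the rest of the pivot column by the pivot and change sign.
    for p in range(rows):
        if p != i:
            B[p][j] = -B[p][j] / piv
    # Rule 4: replace the pivot by its reciprocal.
    B[i][j] = 1.0 / piv
    return B


original = [[2.0, 1.0], [4.0, 3.0]]
once = pivot([row[:] for row in original], 0, 0)    # sweep row 1 of the text (index 0 here)
twice = pivot([row[:] for row in once], 0, 0)       # property 1: restores the original
swept = pivot([row[:] for row in once], 1, 1)       # both rows swept
print(once, twice, swept)
```

For this small matrix, sweeping both rows yields the inverse of the original, which anticipates application (c) in the text.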
The following applications will motivate the definition of pivoting:
(a) Exchange of Variables. Suppose $\mathbf{B}$ is $m \times n$, $\mathbf{x}$ is an $n$-vector, and $\mathbf{y}$ is an
$m$-vector satisfying
$$\mathbf{y} + \mathbf{Bx} = \mathbf{0} \tag{A-3-1}$$
The elements of $\mathbf{x}$ and $\mathbf{y}$ may be regarded as independent and dependent
variables, respectively. Suppose we wish to interchange the roles of, say, $x_1$
and $y_1$. That is, we wish to express the variables $x_1, y_2, y_3, \ldots, y_m$ as functions
of $y_1, x_2, x_3, \ldots, x_n$. The first row of Eq. (1) reads
$$y_1 + B_{11}x_1 + B_{12}x_2 + \cdots = 0 \tag{A-3-2}$$
If $B_{11} \ne 0$, then this is equivalent to
$$x_1 + B_{11}^{-1}y_1 + B_{11}^{-1}B_{12}x_2 + \cdots = 0 \tag{A-3-3}$$
Solving for $x_1$ and substituting in the $i$th row of Eq. (1) we find, after collecting
terms
$$y_i - B_{i1}B_{11}^{-1}y_1 + (B_{i2} - B_{i1}B_{12}/B_{11})x_2 + \cdots = 0 \tag{A-3-4}$$
Consider the following tableau as a schematic representation of Eq. (1)
$$\begin{array}{c|cccc} & x_1 & x_2 & \cdots & x_n \\ \hline y_1 & B_{11} & B_{12} & \cdots & B_{1n} \\ y_2 & B_{21} & B_{22} & \cdots & B_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ y_m & B_{m1} & B_{m2} & \cdots & B_{mn} \end{array} \tag{A-3-5}$$
Then, after exchanging $x_1$ with $y_1$ we can represent Eq. (3) and Eq. (4) in a
new tableau
$$\begin{array}{c|cccc} & y_1 & x_2 & \cdots & x_n \\ \hline x_1 & 1/B_{11} & B_{12}/B_{11} & \cdots & B_{1n}/B_{11} \\ y_2 & -B_{21}/B_{11} & B_{22} - B_{21}B_{12}/B_{11} & \cdots & B_{2n} - B_{21}B_{1n}/B_{11} \\ \vdots & \vdots & \vdots & & \vdots \\ y_m & -B_{m1}/B_{11} & B_{m2} - B_{m1}B_{12}/B_{11} & \cdots & B_{mn} - B_{m1}B_{1n}/B_{11} \end{array} \tag{A-3-6}$$
It is evident that the elements of $\mathbf{B}$ have been transformed as by pivoting
on $(1, 1)$. Generally, exchanging $y_i$ for $x_j$ is accomplished by pivoting on $(i, j)$.
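(b) Partial Elimination. Instead of interchanging just one pair of variables,
we may wish to interchange several. Let the equations of Eq. (1) be partitioned
as follows
$$\mathbf{y}_1 + \mathbf{B}_{11}\mathbf{x}_1 + \mathbf{B}_{12}\mathbf{x}_2 = \mathbf{0}, \qquad \mathbf{y}_2 + \mathbf{B}_{21}\mathbf{x}_1 + \mathbf{B}_{22}\mathbf{x}_2 = \mathbf{0} \tag{A-3-7}$$
The corresponding tableau is
$$\begin{array}{c|cc} & \mathbf{x}_1^T & \mathbf{x}_2^T \\ \hline \mathbf{y}_1 & \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{y}_2 & \mathbf{B}_{21} & \mathbf{B}_{22} \end{array} \tag{A-3-8}$$
Let $\mathbf{B}_{11}$ be a $k \times k$ nonsingular submatrix of $\mathbf{B}$. Then we can solve the
first $k$ equations in Eq. (7) for $\mathbf{x}_1$, and substitute in the remaining equations to
obtain
$$\mathbf{x}_1 + \mathbf{B}_{11}^{-1}\mathbf{y}_1 + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{x}_2 = \mathbf{0}, \qquad \mathbf{y}_2 - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{y}_1 + (\mathbf{B}_{22} - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{B}_{12})\mathbf{x}_2 = \mathbf{0} \tag{A-3-9}$$
Suppose it is possible to exchange, in sequence, $y_1$ for $x_1$, $y_2$ for $x_2$, ..., $y_k$
for $x_k$. The result is the same as exchanging the entire vector $\mathbf{y}_1$ for $\mathbf{x}_1$.
According to Eq. (9), then, sweeping (if possible) rows $1, 2, \ldots, k$ of tableau
Eq. (8) produces
$$\begin{array}{c|cc} & \mathbf{y}_1^T & \mathbf{x}_2^T \\ \hline \mathbf{x}_1 & \mathbf{B}_{11}^{-1} & \mathbf{B}_{11}^{-1}\mathbf{B}_{12} \\ \mathbf{y}_2 & -\mathbf{B}_{21}\mathbf{B}_{11}^{-1} & \mathbf{B}_{22} - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{B}_{12} \end{array} \tag{A-3-10}$$
This property is used in the projection method (Sections 6-2 and 6-3). It
can be shown that if $\mathbf{B}_{11}$ is positive definite, then the required sweeps can
always be executed, i.e., no $B_{ii}$ ever turns zero.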
(c) Matrix Inversion. When $\mathbf{B}_{11}$ is the entire matrix $\mathbf{B}$, then sweeping all rows
transforms $\mathbf{B}$ into $\mathbf{B}^{-1}$, since $\mathbf{y} + \mathbf{Bx} = \mathbf{0}$ is changed into $\mathbf{x} + \mathbf{B}^{-1}\mathbf{y} = \mathbf{0}$. This
procedure cannot be carried out if zero diagonal elements are encountered.
For instance
[ ]
cannot be swept though it is nonsingular. However, we can always proceed
as follows:
1. Write out the tableau Eq. (5). The $y_i$ and $x_j$ are symbolic headings,
whereas the $B_{ij}$ are numerical values.
2. Among all the elements whose rows are headed by a $y_i$ and whose
columns are headed by an $x_j$, find the one, say $B_{pq}$, with largest absolute
value. If no such elements exist, proceed to step 5. Otherwise:
3. If $B_{pq} = 0$ the matrix $\mathbf{B}$ is singular, and the process is terminated.
Otherwise:
4. Pivot on $(p, q)$ and interchange the headings $y_p$ and $x_q$. Return to step 2.
5. Rearrange the rows so that their headings appear in the natural order
$x_1, x_2, \ldots, x_m$.
6. Rearrange the columns so that their headings appear in the natural
order $y_1, y_2, \ldots, y_m$. Our tableau now contains $\mathbf{B}^{-1}$.
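Steps 1 to 6 can be sketched as follows. This is an illustrative plain-Python version rather than a tuned implementation: the names `pivot` and `invert`, the representation of headings as `("y", index)` and `("x", index)` tuples, and the `tol` parameter are all assumptions introduced for the sketch.

```python
def pivot(B, i, j):
    """Gauss-Jordan pivot on element (i, j) of B, in place (0-based indices)."""
    piv = B[i][j]
    rows, cols = len(B), len(B[0])
    for p in range(rows):
        if p != i:
            for q in range(cols):
                if q != j:
                    B[p][q] -= B[i][q] * B[p][j] / piv
    for q in range(cols):
        if q != j:
            B[i][q] /= piv
    for p in range(rows):
        if p != i:
            B[p][j] = -B[p][j] / piv
    B[i][j] = 1.0 / piv
    return B


def invert(A, tol=0.0):
    """Invert a square matrix by pivoting on the largest eligible element."""
    n = len(A)
    B = [[float(v) for v in row] for row in A]
    row_head = [("y", p) for p in range(n)]    # step 1: symbolic row headings
    col_head = [("x", q) for q in range(n)]    # step 1: symbolic column headings
    while True:
        # Step 2: eligible pivots lie in y-headed rows and x-headed columns.
        candidates = [(abs(B[p][q]), p, q)
                      for p in range(n) for q in range(n)
                      if row_head[p][0] == "y" and col_head[q][0] == "x"]
        if not candidates:
            break
        size, p, q = max(candidates)
        if size <= tol:                        # step 3: all eligible pivots vanish
            raise ValueError("matrix is singular")
        pivot(B, p, q)                         # step 4: pivot and swap the headings
        row_head[p], col_head[q] = col_head[q], row_head[p]
    # Steps 5 and 6: put the row headed x_k in row k and the column headed y_k in column k.
    inverse = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            inverse[row_head[p][1]][col_head[q][1]] = B[p][q]
    return inverse


print(invert([[0.0, 1.0], [1.0, 0.0]]))   # zero diagonal, cannot be swept directly
print(invert([[2.0, 1.0], [4.0, 3.0]]))
```

The first call uses a hypothetical matrix with a zero main diagonal: it cannot be inverted by sweeping rows in order, but the off-diagonal pivots chosen in step 2 handle it.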
(d) Simultaneous Linear Equations. We wish to solve for $\mathbf{x}$ the set of
simultaneous equations
$$\mathbf{Ax} = \mathbf{b} \tag{A-3-11}$$
where $\mathbf{A}$ is $m \times m$. Let us define $\mathbf{B}$ as the $m \times (m + 1)$ matrix $[\mathbf{A}, \mathbf{b}]$ and let us
apply to $\mathbf{B}$ the algorithm of the preceding section, except that no pivots are
allowed in the last column. If $\mathbf{A}$ is nonsingular, one ends up with the matrix
$[\mathbf{A}^{-1}, \mathbf{A}^{-1}\mathbf{b}]$, i.e., the solution $\mathbf{x}$ is found in the last column. If one is only
interested in $\mathbf{x}$, then step 6 may be omitted. Also, in step 5 only the elements
of the last column need be rearranged. This method of solving equations is
known as Gauss-Jordan elimination. Ordinary Gaussian elimination is faster,
but Gauss-Jordan elimination is very convenient and economical in storage
space when the inverse too is desired.
If $\mathbf{A}$ is singular, the process terminates in step 3 with all eligible pivots
equal to zero. Let us partition $\mathbf{x}$ and $\mathbf{y}$ into vectors $\mathbf{x}_1, \mathbf{x}_2$ and $\mathbf{y}_1, \mathbf{y}_2$,
respectively, where subscript 1 refers to those elements which have been exchanged,
and subscript 2 to those elements which have not been exchanged. For
instance, $\mathbf{x}_2$ consists of elements of $\mathbf{x}$ which appear as column headings in the
final tableau. The final tableau takes the form (we have rearranged rows and
columns suitably)
$$\begin{array}{c|ccc} & \mathbf{y}_1^T & \mathbf{x}_2^T & \\ \hline \mathbf{x}_1 & \mathbf{C}_{11} & \mathbf{C}_{12} & \mathbf{c}_1 \\ \mathbf{y}_2 & \mathbf{C}_{21} & \mathbf{C}_{22} = \mathbf{0} & \mathbf{c}_2 \end{array} \tag{A-3-12}$$
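The matrix $\mathbf{C}_{22}$ must vanish, for otherwise we could have continued
pivoting. Let the partitioning of Eq. (11) that corresponds to the partitioning
of $\mathbf{x}$ and $\mathbf{y}$ be
$$\mathbf{A}_{11}\mathbf{x}_1 + \mathbf{A}_{12}\mathbf{x}_2 = \mathbf{b}_1, \qquad \mathbf{A}_{21}\mathbf{x}_1 + \mathbf{A}_{22}\mathbf{x}_2 = \mathbf{b}_2 \tag{A-3-13}$$
Then, from Eq. (10) and Eq. (12) we must have:
$$\begin{aligned} \mathbf{C}_{11} &= \mathbf{A}_{11}^{-1}, & \mathbf{C}_{12} &= \mathbf{A}_{11}^{-1}\mathbf{A}_{12}, & \mathbf{c}_1 &= \mathbf{A}_{11}^{-1}\mathbf{b}_1 \\ \mathbf{C}_{21} &= -\mathbf{A}_{21}\mathbf{A}_{11}^{-1}, & \mathbf{C}_{22} &= \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} = \mathbf{0}, & \mathbf{c}_2 &= \mathbf{b}_2 - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{b}_1 \end{aligned} \tag{A-3-14}$$
Now, if we eliminate $\mathbf{x}_1$ directly from Eq. (13) we find
$$\mathbf{x}_1 = \mathbf{A}_{11}^{-1}\mathbf{b}_1 - \mathbf{A}_{11}^{-1}\mathbf{A}_{12}\mathbf{x}_2, \qquad (\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12})\mathbf{x}_2 = \mathbf{b}_2 - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{b}_1 \tag{A-3-15}$$
which, in view of Eq. (14), can be written as
$$\mathbf{x}_1 = \mathbf{c}_1 - \mathbf{C}_{12}\mathbf{x}_2, \qquad \mathbf{0}\mathbf{x}_2 = \mathbf{c}_2 \tag{A-3-16}$$
From this we deduce the following:
1. If $\mathbf{c}_2 \ne \mathbf{0}$ then the equations $\mathbf{Ax} = \mathbf{b}$ have no solution. Note that $\mathbf{c}_2$ is
the set of elements in the last column which belong to rows with $y$ headings.
2. If $\mathbf{c}_2 = \mathbf{0}$, then the equations $\mathbf{Ax} = \mathbf{b}$ have infinitely many solutions.
These can be obtained by assigning arbitrary values to $\mathbf{x}_2$ and letting $\mathbf{x}_1 =
\mathbf{c}_1 - \mathbf{C}_{12}\mathbf{x}_2$.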
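(e) Rank of Matrix and Linear Independence of Vectors. Let $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$
be a set of $n$-vectors, and let $\mathbf{A}$ be the $m \times n$ matrix whose $i$th row is $\mathbf{a}_i^T$.
We wish to determine the rank of $\mathbf{A}$, or what is the same, the number of
linearly independent $\mathbf{a}_i$. We write down $\mathbf{A}$ in tableau form, and proceed to
apply the algorithm of (c) above. The number of pivots executed before the
process had to be halted equals the rank of the matrix. Referring to Eq. (14),
the condition $\mathbf{C}_{22} = \mathbf{0}$ may be rewritten as $\mathbf{A}_{22} = -\mathbf{C}_{21}\mathbf{A}_{12} = \mathbf{A}_{21}\mathbf{C}_{12}$. But
also $\mathbf{A}_{21} = -\mathbf{C}_{21}\mathbf{A}_{11}$ and $\mathbf{A}_{12} = \mathbf{A}_{11}\mathbf{C}_{12}$. Combining these, we find:
$$\begin{bmatrix} \mathbf{A}_{12} \\ \mathbf{A}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} \\ \mathbf{A}_{21} \end{bmatrix} \mathbf{C}_{12} \tag{A-3-17}$$
$$[\mathbf{A}_{21}, \mathbf{A}_{22}] = -\mathbf{C}_{21}[\mathbf{A}_{11}, \mathbf{A}_{12}] \tag{A-3-18}$$
Thus, $\mathbf{C}_{12}$ and $\mathbf{C}_{21}$ contain the coefficients for the linear dependence among
the columns and rows of $\mathbf{A}$, respectively. The rows of $[\mathbf{A}_{11}, \mathbf{A}_{12}]$ form a
maximal linearly independent subset of the $\mathbf{a}_i$. The columns of
$$\begin{bmatrix} \mathbf{A}_{11} \\ \mathbf{A}_{21} \end{bmatrix}$$
form a maximal linearly independent subset of the columns of $\mathbf{A}$. When the
rank of $\mathbf{A}$ equals the number of rows, then $\mathbf{A}_{21}$, $\mathbf{A}_{22}$, $\mathbf{C}_{21}$, and $\mathbf{C}_{22}$ are
vacuous; when the rank of $\mathbf{A}$ equals the number of columns, then $\mathbf{A}_{12}$,
$\mathbf{A}_{22}$, $\mathbf{C}_{12}$, and $\mathbf{C}_{22}$ are vacuous. When $\mathbf{A}$ is square and nonsingular, only
$\mathbf{A}_{11}$ and $\mathbf{C}_{11} = \mathbf{A}_{11}^{-1}$ exist.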
(f) Determinant. To compute the determinant of $\mathbf{B}$, we follow the procedure
of (c); step 6 may be omitted. If the process cannot be completed, then
$\det\mathbf{B} = 0$. Otherwise, the determinant equals the product of all pivots times
$(-1)^r$, where $r$ is the number of row interchanges required in step 5.
(g) Stepwise Linear Regression. We wish to find the $l$-vector $\boldsymbol{\theta}$ which
minimizes
$$\Phi(\boldsymbol{\theta}) = (\mathbf{y} - \mathbf{B}\boldsymbol{\theta})^T\mathbf{V}^{-1}(\mathbf{y} - \mathbf{B}\boldsymbol{\theta})$$
Let us form the $(l + 1) \times (l + 1)$ matrix
$$\mathbf{A} \equiv \begin{bmatrix} \mathbf{B}^T\mathbf{V}^{-1}\mathbf{B} & \mathbf{B}^T\mathbf{V}^{-1}\mathbf{y} \\ (\mathbf{B}^T\mathbf{V}^{-1}\mathbf{y})^T & \mathbf{y}^T\mathbf{V}^{-1}\mathbf{y} \end{bmatrix}$$
Suppose we sweep some of the first $l$ rows of $\mathbf{A}$, producing a modified matrix
$\mathbf{A}$ (whenever we speak of $\mathbf{A}$ we are referring to its current form). Let $I$ denote
the set of indices of the swept rows, and $J$ the set of indices of the unswept
rows (excluding row $l + 1$). Let $\mathbf{a}$ be the last column of $\mathbf{A}$. Then $a_\alpha$ ($\alpha \in I$)
is the optimal value of $\theta_\alpha$ provided all $\theta_\beta$ ($\beta \in J$) are restricted to vanish.
Furthermore, $a_{l+1}$ is the minimum of $\Phi(\boldsymbol{\theta})$ under the above restriction, and
$$f_\beta \equiv a_\beta^2/A_{\beta\beta} \qquad (\beta \in J)$$
is the reduction in $\Phi(\boldsymbol{\theta})$ that would ensue if $\theta_\beta$ were to be included in the
regression, i.e., if row $\beta$ were to be swept. Therefore, the following algorithm is
suggested for forward stepwise regression:
1. Choose a small positive number $\varepsilon$ such that a change $\varepsilon$ in $\Phi(\boldsymbol{\theta})$ is
considered insignificant.
2. Construct the matrix $\mathbf{A}$. Let $I$ be the empty set, and let $J = \{1, 2, \ldots, l\}$.
3. Of the elements $\beta \in J$ for which $f_\beta > \varepsilon$, find the one, say $\beta^*$, for which
$f_\beta$ is largest. Sweep row $\beta^*$, and transfer $\beta^*$ from $J$ to $I$.
4. Repeat step 3 until no $f_\beta$ exceeds $\varepsilon$. At this point, the model is
represented by the equation
$$y_\mu = \sum_{\alpha \in I} B_{\mu\alpha}\theta_\alpha, \qquad \theta_\alpha = \begin{cases} a_\alpha = A_{\alpha, l+1} & (\alpha \in I) \\ 0 & (\alpha \in J) \end{cases}$$
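In backward stepwise regression, we start with
$$\mathbf{A} = \begin{bmatrix} (\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B})^{-1} & \boldsymbol{\theta}^* \\ -\boldsymbol{\theta}^{*T} & \mathbf{y}^T\mathbf{V}^{-1}\mathbf{y} - \boldsymbol{\theta}^{*T}\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B}\boldsymbol{\theta}^* \end{bmatrix}$$
where $\boldsymbol{\theta}^* = (\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B})^{-1}\mathbf{B}^T\mathbf{V}^{-1}\mathbf{y}$. We let $I = \{1, 2, \ldots, l\}$ and $J$ is the empty
set. Step 3 above becomes:
3'. Of the elements $\alpha \in I$ for which $f_\alpha \le \varepsilon$, find the one, say $\alpha^*$, for which
$f_\alpha$ is smallest. Sweep row $\alpha^*$, and transfer $\alpha^*$ from $I$ to $J$.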
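The forward algorithm can be sketched in plain Python for the special case $\mathbf{V} = \mathbf{I}$. Everything named here (`sweep`, `normal_matrix`, `forward_stepwise`, the data, and the two thresholds) is a hypothetical illustration; indices are 0-based, so row $l + 1$ of the text is the last row of the list.

```python
def sweep(A, k):
    """Pivot on the diagonal element (k, k) of the square matrix A, in place."""
    piv = A[k][k]
    n = len(A)
    for p in range(n):
        if p != k:
            for q in range(n):
                if q != k:
                    A[p][q] -= A[k][q] * A[p][k] / piv
    for q in range(n):
        if q != k:
            A[k][q] /= piv
    for p in range(n):
        if p != k:
            A[p][k] = -A[p][k] / piv
    A[k][k] = 1.0 / piv


def normal_matrix(B, y):
    """The (l+1) x (l+1) matrix of the text, assuming V is the identity."""
    l = len(B[0])
    cols = [[row[j] for row in B] for j in range(l)] + [list(y)]
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]


def forward_stepwise(A, eps):
    """Steps 2 to 4: sweep the row with the largest f until no f exceeds eps."""
    l = len(A) - 1
    swept, unswept = [], list(range(l))
    while True:
        f = {b: A[b][-1] ** 2 / A[b][b] for b in unswept if A[b][b] != 0}
        eligible = {b: v for b, v in f.items() if v > eps}
        if not eligible:
            break
        best = max(eligible, key=eligible.get)
        sweep(A, best)
        unswept.remove(best)
        swept.append(best)
    theta = [A[a][-1] if a in swept else 0.0 for a in range(l)]
    return theta, A[-1][-1], swept   # estimates, minimum of Phi, order of entry


B = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # constant term and one regressor
y = [1.0, 2.0, 4.0]
theta_full, phi_full, order_full = forward_stepwise(normal_matrix(B, y), eps=1e-3)
theta_one, phi_one, order_one = forward_stepwise(normal_matrix(B, y), eps=1.0)
print(theta_full, phi_full, order_full)
print(theta_one, phi_one, order_one)
```

With the small threshold both parameters enter the regression; with the larger one the second sweep is judged insignificant and the first parameter is left at zero.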
A-4. Eigenvalues and Vectors of a Real Symmetric Matrix
Presented below are the computational details for an algorithm which
combines Givens-Householder reduction to tridiagonal form, the QR
algorithm with origin shifts for diagonalizing the tridiagonal matrix, and
successive orthogonal transformations of the unit matrix to obtain the
eigenvectors. For explanations, see Wilkinson (1965) and Ortega and Kaiser
(1963). Steps which are starred (*) can be omitted if only eigenvalues are to
be computed.
$\mathbf{A}$ is the $n \times n$ symmetric matrix whose eigenvalues are to be computed.
Step 2 is omitted if $n = 2$. Two constants $\varepsilon_1$ and $\varepsilon_2$ are used in termination
tests [steps 8 and 9 below]. The following method is suggested for selecting the
values of these constants: Let $\varepsilon$ be the desired relative accuracy of the largest
eigenvalue (this cannot exceed the precision of the computer; for instance,
if a $k$-bit word length is used, we must have $\varepsilon > 2^{-k}$). Let $S = \sum_{i,j=1}^{n} A_{ij}^2$.
Then let $\varepsilon_1 = \varepsilon(2S)^{1/2}$ and $\varepsilon_2 = \varepsilon_1/n^2$.
1*. Set $\mathbf{V} = \mathbf{I}_n$.
2. For $i = 1, 2, \ldots, n - 2$ in turn, perform the following steps:
   a. Let $\sigma = 1$ if $A_{i+1,i} \ge 0$, $\sigma = -1$ otherwise.
   b. Let $c = \sum_{j=i+1}^{n} A_{j,i}^2$, $s = \sigma c^{1/2}$.
   c. Let $b_i = -s$. If $s = 0$, proceed to step i.
   d. Let $\alpha = 1/(c + |A_{i+1,i}\,s|)$.
   e. Let $w_{i+1} = A_{i+1,i} + s$, $w_j = A_{j,i}$ ($j = i + 2, i + 3, \ldots, n$).
   f. Let $u_j = \alpha w_j$ ($j = i + 1, i + 2, \ldots, n$).
   g*. Let $p_k = \sum_{j=i+1}^{n} V_{kj} w_j$ ($k = 1, 2, \ldots, n$).
   h*. Replace $V_{kj}$ with $V_{kj} - p_k u_j$ ($k = 1, 2, \ldots, n$; $j = i + 1, i + 2, \ldots, n$).
   i. Let $q_k = \sum_{j=i+1}^{n} A_{kj} u_j$ ($k = i + 1, i + 2, \ldots, n$).
   j. Let $\beta = \tfrac{1}{2}\sum_{k=i+1}^{n} q_k u_k$.
   k. Let $q_k = q_k - \beta w_k$ ($k = i + 1, i + 2, \ldots, n$).
   l. Replace $A_{jk}$ with $A_{jk} - q_j w_k - w_j q_k$ ($j, k = i + 1, i + 2, \ldots, n$).
3. Let $b_{n-1} = A_{n-1,n}$, $a_i = A_{ii}$ ($i = 1, 2, \ldots, n$).
4. Let $m = n$, $g = 0$.
5. Let $c_1 = 1$, $\alpha = a_m$, $p = a_1$.
6. For $i = 1, 2, \ldots, m - 1$ in turn, perform the following steps:
   a. If $|b_i| \le \varepsilon_2$, replace $b_i$ by zero, set $s = 0$ and $c = \operatorname{sign}(p)$, and
      proceed to step e.
   b. Let $x = (p^2 + b_i^2)^{1/2}$.
   c. Let $s = b_i/x$, $c = p/x$.
   d*. For $j = 1, 2, \ldots, n$ in turn, let $\beta = cV_{j,i+1} - sV_{j,i}$. Replace
      $V_{j,i}$ with $sV_{j,i+1} + cV_{j,i}$ and $V_{j,i+1}$ with $\beta$.
   e. Let $r = cp + sb_i$, $d = cc_1$.
   f. Let $q = db_i + sa_{i+1}$.
   g. Replace $a_i$ with $dr + sq$.
   h. If $i > 1$, replace $b_{i-1}$ with $s_1 r$.
   i. Let $s_1 = s$, $p = ca_{i+1} - sc_1 b_i$, $c_1 = c$.
7. Replace $b_{m-1}$ with $s_1 p$ and $a_m$ with $c_1 p$.
8. If $\sum_{i=1}^{m-1} |b_i| \le \varepsilon_1$ proceed to step 16.
9. If $|b_{m-1}| \le \varepsilon_2$ proceed to step 13.
10. If $\bigl|\,|a_m/\alpha| - 1\,\bigr| > \tfrac{1}{2}$, return to step 5.
11. Replace $g$ with $g + a_m$.
12. Replace $a_i$ with $a_i - a_m$ ($i = 1, 2, \ldots, m$) and return to step 5.
13. Replace $a_m$ with $a_m + g$.
14. Replace $m$ with $m - 1$.
15. If $m \ge 2$, return to step 9.
16. Replace $a_i$ with $a_i + g$ ($i = 1, 2, \ldots, m$).
17. At this point, $a_i$ is the $i$th eigenvalue of $\mathbf{A}$, and (*) $V_{ji}$ ($j = 1, 2, \ldots, n$)
    is the $i$th eigenvector of $\mathbf{A}$ ($i = 1, 2, \ldots, n$). These eigenvectors form
    an orthonormal set, i.e., $\sum_i V_{ij}V_{ik} = \delta_{jk}$ ($j, k = 1, 2, \ldots, n$).
A-5. Spectral Decompositions
Let $\mathbf{A}$ be a symmetric $l \times l$ matrix. Suppose $\mathbf{D}$ and $\mathbf{E}$ are, respectively,
diagonal and nonsingular $l \times l$ matrices, satisfying
$$\mathbf{A} = \mathbf{EDE}^T \tag{A-5-1}$$
Then $\mathbf{EDE}^T$ is referred to as a spectral decomposition of $\mathbf{A}$. In component
form, a spectral decomposition is given by
$$A_{ij} = \sum_{k=1}^{l} d_k E_{ik} E_{jk} \tag{A-5-2}$$
where
$$d_k \equiv D_{kk} \tag{A-5-3}$$
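Let $\mathbf{x}$ be any $l$-dimensional column vector. The quantity
$$A(\mathbf{x}) \equiv \mathbf{x}^T\mathbf{Ax} = \sum_{i,j=1}^{l} A_{ij} x_i x_j$$
is called the quadratic form defined by $\mathbf{A}$. From Eq. (2) we have
$$A(\mathbf{x}) = \mathbf{x}^T\mathbf{Ax} = \mathbf{x}^T\mathbf{EDE}^T\mathbf{x} = \mathbf{y}^T\mathbf{Dy} = \sum_{i=1}^{l} d_i y_i^2 \tag{A-5-4}$$
where
$$\mathbf{y} \equiv \mathbf{E}^T\mathbf{x} \tag{A-5-5}$$
The matrix $\mathbf{A}$ is positive (negative) definite if $A(\mathbf{x}) > 0$ ($< 0$) for all $\mathbf{x} \ne \mathbf{0}$. It
follows from Eq. (4) that $\mathbf{A}$ is positive (negative) definite if and only if all $d_i$
are positive (negative).
If none of the $d_i$ are zero, the matrix $\mathbf{A}$ is nonsingular, and we can form
$$\mathbf{A}^{-1} = (\mathbf{E}^T)^{-1}\mathbf{D}^{-1}\mathbf{E}^{-1} \tag{A-5-6}$$
since $\mathbf{E}$ was assumed nonsingular, and $\mathbf{D}^{-1}$ is a diagonal matrix with $(\mathbf{D}^{-1})_{ii} =
d_i^{-1}$.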
Any symmetric matrix possesses infinitely many spectral decompositions.
Of these, the following play important roles:
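(a) The Eigenvalue Decomposition. Suppose $\mathbf{E}$ is a unitary matrix $\mathbf{V}$
satisfying
$$\mathbf{V}^T = \mathbf{V}^{-1} \tag{A-5-7}$$
In this case we denote $\mathbf{D}$ by $\boldsymbol{\Lambda}$ and $d_i$ by $\lambda_i$. Then we have from Eq. (1) and
Eq. (7)
$$\mathbf{AV} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda} \tag{A-5-8}$$
Let $\mathbf{v}_i$ denote the $i$th column of $\mathbf{V}$. Then Eq. (8) is equivalent to
$$\mathbf{Av}_i = \lambda_i\mathbf{v}_i \qquad (i = 1, 2, \ldots, l) \tag{A-5-9}$$
which states that the $\lambda_i$ and $\mathbf{v}_i$ are, respectively, the eigenvalues and
eigenvectors of $\mathbf{A}$. The equation
$$\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T = \sum_{i=1}^{l} \lambda_i\mathbf{v}_i\mathbf{v}_i^T \tag{A-5-10}$$
represents the eigenvalue decomposition of $\mathbf{A}$. Inverting Eq. (10) we find
$$\mathbf{A}^{-1} = (\mathbf{V}^T)^{-1}\boldsymbol{\Lambda}^{-1}\mathbf{V}^{-1} = \mathbf{V}\boldsymbol{\Lambda}^{-1}\mathbf{V}^T = \sum_{i=1}^{l} \lambda_i^{-1}\mathbf{v}_i\mathbf{v}_i^T \tag{A-5-11}$$
provided all $\lambda_i \ne 0$. If we omit from the summation in Eq. (11) all the terms
for which $\lambda_i = 0$, we obtain a matrix $\mathbf{A}^+$, called the pseudoinverse of $\mathbf{A}$. This
definition of the pseudoinverse applies only to symmetric matrices; for the
general case see Eq. (A-1-20).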
Equations (10) and (11) show how both a matrix and its inverse can be
reconstituted when the eigenvalues and vectors are known. We now consider
the quadratic form $A(\mathbf{x})$. We have
$$A(\mathbf{x}) = \mathbf{x}^T\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T\mathbf{x} = \mathbf{y}^T\boldsymbol{\Lambda}\mathbf{y} = \sum_{i=1}^{l} \lambda_i y_i^2 \tag{A-5-12}$$
where
$$\mathbf{y} = \mathbf{V}^T\mathbf{x} \tag{A-5-13}$$
Since $\mathbf{V}$ is unitary, the transformation of coordinates given by Eq. (13) does
not affect the shape of the contours of the function $A(\mathbf{x})$, i.e., the shape of the
surfaces on which $A(\mathbf{x}) = \text{constant}$. From Eq. (12) it is evident that these
surfaces are quadratics whose $i$th principal axis has length inversely proportional to
$|\lambda_i|^{1/2}$ and lies in the direction of $\mathbf{v}_i$. For instance, if $l = 2$ and $\lambda_1$ and $\lambda_2$ are
positive, the contours are ellipses whose principal axes are proportional in
length to $\lambda_1^{-1/2}$ and $\lambda_2^{-1/2}$. If the eigenvalues are nearly equal, the contours
are nearly circles; if they differ widely, the contours are very elongated.
The most extensive analysis of methods for computation of eigenvalues and
vectors can be found in Wilkinson (1965). A summary of a fast and convenient
method based on the QR algorithm appears in the previous section. If
computations are carried to $n$ digits of precision, then the error in any computed
eigenvalue is about $\pm 10^{-n}\lambda_{\max}$, where $\lambda_{\max}$ is the eigenvalue of largest
absolute value. It follows that eigenvalues much smaller than $\lambda_{\max}$ cannot be
computed with great precision. We define the condition number of a matrix as
the ratio of largest to smallest (in absolute value) eigenvalues. The
computation of the small eigenvalues (corresponding to long principal axes) of a
matrix with a large condition number poses a serious problem. Fortunately,
this problem can usually be eliminated if we use a different spectral
decomposition, as described below.
(b) The Scaled and Inverse Scaled Decompositions. An apparent
ill-conditioning (large condition number) of matrices encountered in practice is often
due to the scaling of the variables. For instance, consider the function
$\Phi(\boldsymbol{\theta}) = \tfrac{1}{2}(\theta_1^2 + \theta_2^2)$. This has the Hessian matrix
$$\mathbf{H} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
which is very well-conditioned indeed, having both eigenvalues equal to one.
Let us rescale the first variable by substituting $\eta_1 = 10^5\theta_1$. We leave the second
variable unchanged, setting $\eta_2 = \theta_2$. In terms of the new coordinates,
$$\Phi = \tfrac{1}{2}[(\eta_1/10^5)^2 + \eta_2^2], \qquad \mathbf{H} = \begin{bmatrix} 10^{-10} & 0 \\ 0 & 1 \end{bmatrix}$$
The condition number has been increased from 1 to $10^{10}$. This suggests that
before computing eigenvalues and vectors we should scale the matrix properly.
The simplest scaling is one which reduces all diagonal elements to unit
magnitude. If our matrix is a Hessian, this means that we are rescaling all
variables so that the curvature of the objective function at the minimum is
unity along all coordinate axes. If our matrix is the inverse of a Hessian, we
are scaling all variables to possess unit standard deviation (see Section 7-5).
If our matrix is positive or negative definite, the proposed scaling sets the
magnitude of all off-diagonal elements to less than unity. If the matrix is not
positive definite, this scaling method may fail by leaving very large off-diagonal
elements. On the whole, however, the method has given very good results.
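Given a matrix $\mathbf{A}$, we define a diagonal matrix $\mathbf{B}$ with
$$B_{ii} = \begin{cases} |A_{ii}|^{1/2} & (A_{ii} \ne 0) \\ 1 & (A_{ii} = 0) \end{cases} \tag{A-5-14}$$
then the matrix
$$\mathbf{C} = \mathbf{B}^{-1}\mathbf{AB}^{-1} \tag{A-5-15}$$
has elements $C_{ij} = A_{ij}/|A_{ii}A_{jj}|^{1/2}$ (except when $A_{ii}$ or $A_{jj} = 0$), and, in
particular, $C_{ii} = 1$. We refer to $\mathbf{C}$ as the scaled version of $\mathbf{A}$. If $\mathbf{A}$ is a covariance
matrix then $\mathbf{C}$ is the correlation matrix. Let the eigenvalue decomposition
of $\mathbf{C}$ be given by
$$\mathbf{C} = \mathbf{U}\boldsymbol{\Pi}\mathbf{U}^T \tag{A-5-16}$$
where $\boldsymbol{\Pi}$ is diagonal; $\Pi_{ii} = \pi_i$, the $i$th eigenvalue of $\mathbf{C}$; $\mathbf{U}$ is the matrix whose
$i$th column is $\mathbf{u}_i$, the $i$th eigenvector of $\mathbf{C}$; and $\mathbf{U}^T = \mathbf{U}^{-1}$.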
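Eqs. (15) and (16) may be combined to yield
$$\mathbf{A} = \mathbf{BU}\boldsymbol{\Pi}\mathbf{U}^T\mathbf{B} = \mathbf{F}\boldsymbol{\Pi}\mathbf{F}^T \tag{A-5-17}$$
where
$$\mathbf{F} \equiv \mathbf{BU}, \quad \text{i.e.,} \quad F_{ij} = B_{ii}U_{ij} \tag{A-5-18}$$
We call the relation $\mathbf{A} = \mathbf{F}\boldsymbol{\Pi}\mathbf{F}^T$ the scaled decomposition of $\mathbf{A}$. Inverting
Eq. (17) yields
$$\mathbf{A}^{-1} = \mathbf{B}^{-1}\mathbf{U}\boldsymbol{\Pi}^{-1}\mathbf{U}^T\mathbf{B}^{-1} = \mathbf{G}\boldsymbol{\Pi}^{-1}\mathbf{G}^T \tag{A-5-19}$$
where
$$\mathbf{G} = \mathbf{B}^{-1}\mathbf{U}, \quad \text{i.e.,} \quad G_{ij} = B_{ii}^{-1}U_{ij} \tag{A-5-20}$$
We call the relation $\mathbf{A}^{-1} = \mathbf{G}\boldsymbol{\Pi}^{-1}\mathbf{G}^T$ the inverse scaled decomposition of $\mathbf{A}$.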
The following is a summary of the steps required to compute the scaled
decompositions:
1. Divide each element $A_{ij}$ by $|A_{ii}A_{jj}|^{1/2}$, forming the matrix C.
2. Obtain the eigenvalues $\pi_i$ and eigenvectors $u_i$ of C.
3. Multiply the jth element of $u_i$ by $|A_{jj}|^{1/2}$ to form a vector $f_i$, which is
the ith column of F.
4. The scaled decomposition of A is given by
$$A = \sum_{i=1}^{l} \pi_i f_i f_i^T \tag{A-5-21}$$
This is equivalent to Eq. (17).
5. Divide the jth element of $u_i$ by $|A_{jj}|^{1/2}$ to form a vector $g_i$, which is
the ith column of G.
6. The inverse scaled decomposition of $A^{-1}$ is given by
$$A^{-1} = \sum_{i=1}^{l} \pi_i^{-1} g_i g_i^T \tag{A-5-22}$$
provided all $\pi_i \neq 0$. This is equivalent to Eq. (19).
Note: Replace any zero $A_{ii}$ by one in the above computations.
A numerical example appears in Section 5-21.
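The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: NumPy is used for the eigenvalue problem (`numpy.linalg.eigh`, which presumes A is symmetric), and the function name `scaled_decomposition` and the test matrix are hypothetical, not taken from the text.

```python
import numpy as np

def scaled_decomposition(A):
    """Return (pi, F, G) such that A = F diag(pi) F^T and,
    when no pi is zero, inv(A) = G diag(1/pi) G^T."""
    A = np.asarray(A, dtype=float)
    d = np.abs(np.diag(A))
    # B_ii of Eq. (A-5-14); zero diagonal entries are replaced by one
    b = np.where(d > 0.0, np.sqrt(d), 1.0)
    C = A / np.outer(b, b)          # step 1: scaled matrix C
    pi, U = np.linalg.eigh(C)       # step 2: eigenvalues and eigenvectors of C
    F = b[:, None] * U              # step 3: F = B U
    G = U / b[:, None]              # step 5: G = B^{-1} U
    return pi, F, G

# A badly scaled symmetric matrix (illustrative values only)
A = np.array([[1.0e10, 2.0e4],
              [2.0e4, 1.0]])
pi, F, G = scaled_decomposition(A)
A_rebuilt = (F * pi) @ F.T          # step 4, Eq. (A-5-21)
A_inv = (G / pi) @ G.T              # step 6, Eq. (A-5-22)
```

Replacing `pi` by modified values before forming `A_inv` gives the "almost inverses" discussed next.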
The above procedure [omitting steps 3 and 4] can be regarded as a method
for computing the inverse of a symmetric matrix. As such it is unlikely to
win any prizes for speed, but it is quite accurate and stable. It provides
insight into the nature of the matrix, and lets us generate "almost inverses" of
A. By this we mean matrices which (like the pseudoinverse) differ from Eq.
(22) only in the values of the $\pi_i$, the latter being chosen so as to confer certain
desirable properties (e.g., positive definiteness, or well-conditioning) on the
matrix. For examples, see Sections 5-7 and 5-8.
(c) The Square Root Decomposition. If A is positive definite, it is possible
to obtain spectral decompositions in which D = I, the identity matrix; i.e.,
$A = EE^T$. Of particular interest is the decomposition in which E is a symmetric
matrix S, whence $A = S^2$. The matrix S is called the square root of A.
If $A = V \Lambda V^T$ is the eigenvalue decomposition of A, then we have, because
$V^T V = I$,
$$A = (V \Lambda^{1/2} V^T)^2 \tag{A-5-23}$$
so that $A^{1/2} = S = V \Lambda^{1/2} V^T$. Here $\Lambda^{1/2}$ is a diagonal matrix with elements
$\lambda_i^{1/2}$.
(d) The Cholesky Decomposition [see, e.g., Fox (1964)]. Again we assume
that A is positive definite, and choose D = I. Now, however, we specify that
E should be a lower triangular matrix L, that is, a matrix whose elements above
the main diagonal are all zero:
$$L_{ij} = 0 \quad (j > i) \tag{A-5-24}$$
Since $A = LL^T$ we have $A_{ij} = \sum_{k=1}^{l} L_{ik}L_{jk}$, which in view of Eq. (24) becomes:
$$A_{ij} = \sum_{k=1}^{j} L_{ik}L_{jk} \quad (j < i) \tag{A-5-25}$$
$$A_{ii} = \sum_{k=1}^{i} L_{ik}^2 \tag{A-5-26}$$
These equations may be solved recursively for the $L_{ij}$. From Eq. (26)
$$L_{11} = A_{11}^{1/2} \tag{A-5-27}$$
From Eq. (25)
$$L_{i1} = A_{i1}/L_{11} \quad (i = 2, 3, \ldots, l) \tag{A-5-28}$$
Then, using Eq. (25) and Eq. (26) alternately for $i = 2, 3, \ldots, l$:
$$L_{ij} = \Bigl(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\Bigr)\Big/ L_{jj} \quad (j = 2, 3, \ldots, i-1;\ \text{skip for } i = 2) \tag{A-5-29}$$
$$L_{ii} = \Bigl(A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2\Bigr)^{1/2} \tag{A-5-30}$$
This procedure can be carried through provided all of the square root
arguments are positive. This occurs if and only if A is positive definite. Of all
the decompositions discussed, the last is the only one that can be accomplished
in a finite procedure, which requires approximately $l^3/6$ multiplications.
All the other decompositions depend on the evaluation of eigenvalues,
which requires an iterative procedure.
The Cholesky decomposition is particularly useful in solving for x the set
of linear equations
$$Ax = b \tag{A-5-31}$$
These can be rewritten as
$$Ly = b \tag{A-5-32}$$
where
$$y = L^T x \tag{A-5-33}$$
Now Eq. (32), on account of the triangular nature of L, has the form:
$$\begin{aligned}
L_{11}y_1 &= b_1 \\
L_{21}y_1 + L_{22}y_2 &= b_2 \\
L_{31}y_1 + L_{32}y_2 + L_{33}y_3 &= b_3 \\
&\;\;\vdots
\end{aligned} \tag{A-5-34}$$
These equations can easily be solved in turn for $y_1, y_2, \ldots, y_l$. Then Eq. (33),
which has the form:
$$\begin{aligned}
L_{11}x_1 + L_{21}x_2 + \cdots + L_{l1}x_l &= y_1 \\
L_{22}x_2 + \cdots + L_{l2}x_l &= y_2 \\
&\;\;\vdots \\
L_{ll}x_l &= y_l
\end{aligned} \tag{A-5-35}$$
can be solved in turn for $x_l, x_{l-1}, \ldots, x_1$. This is the fastest method for solving
Eq. (31) when A is positive definite.
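The recursion of Eqs. (A-5-27) to (A-5-30) and the two triangular solves of Eqs. (A-5-34) and (A-5-35) can be sketched as follows. The function names `cholesky` and `solve_spd` and the 3 x 3 test matrix are illustrative assumptions; indices are zero-based, so `L[i][j]` stands for $L_{i+1,j+1}$.

```python
import math

def cholesky(A):
    """Lower triangular L with A = L L^T, following Eqs. (A-5-27)-(A-5-30)."""
    l = len(A)
    L = [[0.0] * l for _ in range(l)]
    for i in range(l):
        for j in range(i):
            # Eq. (A-5-29) (Eq. (A-5-28) is the case j = 0, where the sum is empty)
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
        # Eq. (A-5-30); the square-root argument must be positive
        arg = A[i][i] - sum(L[i][k] ** 2 for k in range(i))
        if arg <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[i][i] = math.sqrt(arg)
    return L

def solve_spd(A, b):
    """Solve A x = b for positive definite A via L y = b, then L^T x = y."""
    L = cholesky(A)
    l = len(b)
    y = [0.0] * l
    for i in range(l):                      # forward substitution, Eq. (A-5-34)
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * l
    for i in reversed(range(l)):            # back substitution, Eq. (A-5-35)
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, l))) / L[i][i]
    return x

# Illustrative data: b was chosen so that the solution is (1, 1, 1)
A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
b = [8.0, 10.0, 11.0]
x = solve_spd(A, b)
```

The `ValueError` branch mirrors the remark that the procedure goes through if and only if A is positive definite, so the routine doubles as a definiteness test.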
We conclude this section with a computational note. In most applications
the matrix A is given only in decomposed form [Eq. (1)]. We are interested in
computing a vector y = Ax, where x is a known vector, but have no use for
A itself. The following procedure is much more economical than generating
A first and then computing Ax:
1. Suppose $A = EDE^T$. Compute the vector $z = E^T x$.
2. Compute the vector $u = Dz$.
This is done simply by applying the formula $u_j = d_j z_j$ to each component
of z.
3. Compute $y = Eu$.
In summary, the proper order of the calculations is given by
$$z_j = \sum_k E_{kj}x_k, \qquad u_j = d_j z_j, \qquad y_i = \sum_j E_{ij}u_j \tag{A-5-36}$$
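A minimal sketch of Eq. (A-5-36), assuming NumPy and a diagonal D stored as a vector `d`; the function name `apply_decomposed` and the random test data are hypothetical.

```python
import numpy as np

def apply_decomposed(E, d, x):
    """Compute y = (E diag(d) E^T) x without ever forming the matrix A."""
    z = E.T @ x      # step 1: z = E^T x
    u = d * z        # step 2: u_j = d_j z_j
    return E @ u     # step 3: y = E u

# Illustrative data only
rng = np.random.default_rng(0)
E = rng.standard_normal((5, 5))
d = rng.uniform(0.5, 2.0, size=5)
x = rng.standard_normal(5)
y = apply_decomposed(E, d, x)
```

For an l x l matrix this costs two matrix-vector products plus l scalar multiplications, whereas forming A explicitly first requires a matrix-matrix product.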
A numerical example appears in Section 5-21.
Appendix B
Probability
This section is not to be regarded as an introduction to or summary of
probability theory. It merely lists the probabilistic concepts and the notation
used in the book.
Pr(A) denotes the probability that event A occurs.
Pr(A I B) is the conditional probability of event A, provided B is known to
occur.
If $\xi$ is a random variable with continuous distribution, then
$$F(x) \equiv \Pr(\xi < x) \tag{B-1}$$
is the cumulative probability distribution function of $\xi$.
If F(x) is differentiable, then
$$p(x) \equiv dF/dx$$
is the probability density function (pdf) of $\xi$. In this case
$$\Pr(a < \xi < b) = F(b) - F(a) = \int_a^b p(x)\,dx \tag{B-2}$$
Let f be any function of the random variable. Then the expected value
of f is
$$E(f) \equiv \int_{-\infty}^{\infty} f(x)p(x)\,dx \tag{B-3}$$
In particular, the mean, or expected value of $\xi$ itself is
$$\bar{\xi} \equiv E(\xi) = \int_{-\infty}^{\infty} x\,p(x)\,dx \tag{B-4}$$
and the variance, or expected square deviation from the mean, is
$$v \equiv E[(\xi - \bar{\xi})^2] = \int_{-\infty}^{\infty} (x - \bar{\xi})^2 p(x)\,dx \tag{B-5}$$
The standard deviation of $\xi$ is the square root of the variance.
The mode of $\xi$ is the value of x at which p(x) is maximum.
These definitions generalize to the case where $\xi$ is a vector of random
variables $(\xi_1, \xi_2, \ldots, \xi_m)$. The joint cumulative distribution F(x) is given by
$$F(x) = \Pr(\xi_1 \le x_1, \xi_2 \le x_2, \ldots, \xi_m \le x_m) \tag{B-6}$$
the pdf by
$$p(x) = \partial^m F/\partial x_1\,\partial x_2 \cdots \partial x_m \tag{B-7}$$
and the expected value
$$\bar{\xi} \equiv E(\xi) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x\,p(x)\,dx_1\,dx_2 \cdots dx_m \tag{B-8}$$
which we write in shorthand notation as
$$\bar{\xi} = \int x\,p(x)\,dx \tag{B-9}$$
Using the same notation, we define the covariance matrix (sometimes
called variance-covariance matrix)
$$V \equiv E[(\xi - \bar{\xi})(\xi - \bar{\xi})^T] = \int (x - \bar{\xi})(x - \bar{\xi})^T p(x)\,dx \tag{B-10}$$
V is positive definite, or at least semidefinite. The variance of $\xi_i$ is the diagonal
element $V_{ii} = E(\xi_i - \bar{\xi}_i)^2$. The covariance of $\xi_i$ and $\xi_j$ ($i \neq j$) is the off-diagonal
element $V_{ij} = E(\xi_i - \bar{\xi}_i)(\xi_j - \bar{\xi}_j)$. The generalized variance of $\xi$ is
defined as det V.
The correlation of $\xi_i$ and $\xi_j$ is given by
$$\rho_{ij} \equiv V_{ij}/(\sigma_i \sigma_j) \tag{B-11}$$
where
$$\sigma_i = V_{ii}^{1/2} \tag{B-12}$$
is the standard deviation of $\xi_i$. Two variables are uncorrelated if their correlation
(or covariance) vanishes.
Let $\xi$ and $\eta$ be two random variables with pdf $p_1(x)$ and $p_2(y)$ respectively.
These variables are said to be statistically independent if their joint pdf p(x, y)
has the form
$$p(x, y) = p_1(x)\,p_2(y) \tag{B-13}$$
It can be shown that independent variables are uncorrelated. The converse
does not always hold. For normally distributed variables, however,
zero correlation implies independence.
If $\xi$ and $\eta$ are random variables with joint pdf p(x, y), then the marginal
distribution of $\eta$ is given by
$$p(y) = \int p(x, y)\,dx \tag{B-14}$$
If the distribution of $\xi$ depends on some other variables y, we write the
pdf as p(x | y). If y is a possible value of some other vector random variable
$\eta$, then we call p(x | y) the conditional pdf of $\xi$ given $\eta = y$.
The following equation relates the joint, marginal, and conditional distributions
$$p(x, y) = p(x \mid y)\,p(y) \tag{B-15}$$
hence
$$p(x \mid y) = p(x, y)\Big/\int p(u, y)\,du \tag{B-16}$$
provided the denominator does not vanish. If $\xi$ and $\eta$ are independent,
we find, in view of Eq. (13) and Eq. (15), that p(x | y) = p(x), i.e., knowledge
of $\eta$ does not affect the distribution of $\xi$.
If x is an m-dimensional vector random variable with pdf p(x), and f(x)
is an m-dimensional vector of continuous, differentiable single-valued
functions such that $f(x_1) = f(x_2)$ only if $x_1 = x_2$, then the vector y = f(x)
is a random variable with pdf
$$p(y) = p(x)\,\bigl|\det(\partial f/\partial x)\bigr|^{-1} \tag{B-17}$$
The Jacobian matrix of the transformation from x to y is defined as $\partial f/\partial x$,
and its determinant (which under the above conditions must be nonzero)
is the Jacobian.
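As a brief worked illustration of Eq. (B-17), consider the linear case. The matrix T and vector c below are hypothetical symbols introduced only for this example; T is assumed nonsingular so that the transformation is one-to-one.

```latex
% Linear change of variables: y = f(x) = T x + c, with T nonsingular.
\[
  \frac{\partial f}{\partial x} = T ,
  \qquad
  x = T^{-1}(y - c) ,
\]
% Substituting in Eq. (B-17), with p_y and p_x denoting the pdfs of y and x:
\[
  p_y(y) = \frac{p_x\bigl(T^{-1}(y - c)\bigr)}{\lvert \det T \rvert} .
\]
% Scalar special case: y = 2x doubles the spread of the variable,
% so the density is halved: p_y(y) = (1/2) p_x(y/2).
```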
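Appendix C
The Rao-Cramér Theorem
Let $p(Y \mid \phi)$ be the pdf of the sample Y. Then from the definition of a
pdf
$$\int p(Y \mid \phi)\,dY = 1 \tag{C-1}$$
Let t* be a vector-valued statistic of the sample, i.e.,
$$t^* = t^*(Y) \tag{C-2}$$
and let t be the expected value of t*, i.e.,
$$t(\phi) = \int t^*(Y)\,p(Y \mid \phi)\,dY \tag{C-3}$$
From Eq. (1) we have
$$\frac{\partial}{\partial\phi}\int p\,dY = \int \left(\frac{\partial p}{\partial\phi}\right)^T dY = 0 \tag{C-4}$$
Also, from Eq. (3)
$$\frac{\partial}{\partial\phi}\int t^* p\,dY = \int t^* \left(\frac{\partial p}{\partial\phi}\right)^T dY = P \tag{C-5}$$
where
$$P \equiv \partial t/\partial\phi \tag{C-6}$$
Thus, using Eq. (4) and Eq. (5)
$$\int (t^* - t)(\partial p/\partial\phi)^T dY = \int t^*(\partial p/\partial\phi)^T dY - t\int (\partial p/\partial\phi)^T dY = \partial t/\partial\phi = P \tag{C-7}$$
Now
$$\partial p/\partial\phi = p\,\partial \log p/\partial\phi$$
hence Eq. (7) may be rewritten as
$$\int u v^T dY = P \tag{C-8}$$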
where we define
$$u \equiv p^{1/2}(t^* - t); \qquad v \equiv p^{1/2}\,\partial \log p/\partial\phi \tag{C-9}$$
The covariance matrix of the statistic t* is
$$V_{t^*} \equiv \int uu^T dY = \int p\,(t^* - t)(t^* - t)^T dY = E(t^* - t)(t^* - t)^T \tag{C-10}$$
Let
$$R \equiv \int vv^T dY = \int p\,(\partial \log p/\partial\phi)(\partial \log p/\partial\phi)^T dY = E(\partial \log p/\partial\phi)(\partial \log p/\partial\phi)^T \tag{C-11}$$
Let $A(\phi)$ be an arbitrary matrix function of $\phi$ such that Av is a column
vector of the same dimension as u. The matrix $(u + Av)(u + Av)^T$ is clearly
positive semidefinite, and so is the sum of any number of matrices of this
form. Hence, if
$$B \equiv \int (u + Av)(u + Av)^T dY \tag{C-12}$$
then B is positive semidefinite. But
$$\begin{aligned}
B &= \int (uu^T + Avv^TA^T + Avu^T + uv^TA^T)\,dY = V_{t^*} + ARA^T + AP^T + PA^T \\
  &= V_{t^*} - PR^{-1}P^T + (A + PR^{-1})R(A^T + R^{-1}P^T)
\end{aligned} \tag{C-13}$$
Now, B must be positive semidefinite for any A; in particular it must be
so for $A = -PR^{-1}$, in which case $B = V_{t^*} - PR^{-1}P^T$. Hence, $V_{t^*} - PR^{-1}P^T$
must be positive semidefinite.
From Eq. (12), B = 0 if and only if u = -Av. In this case, from Eq. (8)
$$P = \int -Avv^T dY = -AR \tag{C-14}$$
and therefore Eq. (13) reduces to
$$0 = V_{t^*} - PR^{-1}P^T$$
Conversely, suppose $V_{t^*} - PR^{-1}P^T = 0$. Then, from Eq. (13)
$$B = (A + PR^{-1})R(A^T + R^{-1}P^T)$$
In particular, if we choose $A = -PR^{-1}$, we have B = 0. Hence, from
Eq. (12)
$$\int (u - PR^{-1}v)(u - PR^{-1}v)^T dY = 0 \tag{C-15}$$
and it follows that $u = PR^{-1}v$. Associating the estimate $\phi^*$ with t* and $\phi$
with t, we obtain the results stated in Section 3-2.
Note that the proof is valid only if p satisfies regularity conditions which
permit the differentiations under the integral sign in Eq. (4) and Eq. (5).
We have also assumed that R was nonsingular.
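A short worked case may help fix the roles of P and R. The setting below (n independent normal observations with known variance) is an illustrative assumption, not an example taken from the text.

```latex
% Sample Y = (y_1, ..., y_n), independent, each N(\phi, \sigma^2) with \sigma known.
\[
  \log p(Y \mid \phi) = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \phi)^2 ,
  \qquad
  \frac{\partial \log p}{\partial \phi} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \phi) .
\]
% R of Eq. (C-11):
\[
  R = E\!\left[\left(\frac{\partial \log p}{\partial \phi}\right)^{2}\right]
    = \frac{n\sigma^2}{\sigma^4} = \frac{n}{\sigma^2} .
\]
% Take the statistic t^* = \bar{y}, the sample mean. Then t(\phi) = \phi, so P = 1, and
\[
  V_{t^*} - P R^{-1} P^{T} = \frac{\sigma^2}{n} - \frac{\sigma^2}{n} = 0 ,
\]
% so the bound is attained. Consistently with u = P R^{-1} v:
\[
  \bar{y} - \phi = \frac{\sigma^2}{n}\cdot\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \phi) .
\]
```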
Appendix D
Generating a Sample from a Given
Multivariate Normal Distribution
We wish to generate on a computer a sample from the distribution
$N_k(a, V)$. That is, we need a vector z of numbers $z_1, z_2, \ldots, z_k$ derived from
a normal distribution with mean a and covariance matrix V. We proceed as
follows:
(a) Let m = k/2 if k is even, or m = (k + 1)/2 if k is odd.
(b) Generate 2m pseudorandom numbers $x_1, x_2, \ldots, x_{2m}$, uniformly and
independently distributed on the interval zero to one. For a discussion of
methods to accomplish this see Moshman (1967) and Lewis et al. (1969).
(c) From the $x_i$ we generate $y_j$ (j = 1, 2, ..., k) which are normally and
independently distributed with zero means and unit variances. For this
transformation many methods have been proposed, but only the following
two are easy to program and reasonably fast, yet produce the required
distribution exactly:
1. Method of Box and Muller (1958). Compute:
$$\begin{aligned}
y_{2i-1} &= (-2\log x_{2i-1})^{1/2}\cos 2\pi x_{2i} \\
y_{2i} &= (-2\log x_{2i-1})^{1/2}\sin 2\pi x_{2i}
\end{aligned} \qquad (i = 1, 2, \ldots, m) \tag{D-1}$$
If k is odd, $y_{2m}$ need not be computed.
2. Method of Marsaglia and Bray (1964). Compute $u_i = 2x_i - 1$ for
i = 1, 2, ..., 2m, so that each $u_i$ lies between -1 and 1. If for any j = 1, 2, ..., m
it happens that $u_{2j-1}^2 + u_{2j}^2 > 1$, replace $x_{2j-1}$ and $x_{2j}$ by a new pair of
uniform random numbers, recompute $u_{2j-1}$ and $u_{2j}$, and repeat until
$u_{2j-1}^2 + u_{2j}^2 \le 1$. Compute:
$$\begin{aligned}
y_{2i-1} &= u_{2i-1}\bigl[-2\log(u_{2i-1}^2 + u_{2i}^2)/(u_{2i-1}^2 + u_{2i}^2)\bigr]^{1/2} \\
y_{2i} &= u_{2i}\bigl[-2\log(u_{2i-1}^2 + u_{2i}^2)/(u_{2i-1}^2 + u_{2i}^2)\bigr]^{1/2}
\end{aligned} \qquad (i = 1, 2, \ldots, m) \tag{D-2}$$
The second method is probably faster, since it requires no evaluation of
trigonometric functions.
(d) Compute the eigenvalues $\lambda_i$ and eigenvectors $v_i$ (i = 1, 2, ..., k) of V.
Generate the matrix U whose ith column is $\lambda_i^{1/2} v_i$.
Note that the $\lambda_i$ must all be nonnegative. A faster method, useful if V is
known to be nonsingular, is to find a lower triangular matrix U such that
$UU^T = V$ by means of the Cholesky decomposition (see Section 5-5).
(e) Compute
$$z = a + Uy \tag{D-3}$$
to obtain the desired sample z.
If many samples are required from the same distribution, step (d) should
be performed once for all samples at the beginning.
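Steps (a) through (e) can be sketched as follows, using the Marsaglia-Bray method for step (c) and the Cholesky route for step (d). The sketch assumes Python's standard `random` module as the uniform generator and NumPy for the decomposition; the names `standard_normals` and `sample_mvn`, the seed, and the values of `a` and `V` are hypothetical. The guard against a zero sum of squares is an addition of this sketch, to avoid taking the logarithm of zero.

```python
import math
import random
import numpy as np

def standard_normals(k, rng):
    """Step (c), Marsaglia-Bray method: k independent N(0, 1) deviates."""
    y = []
    while len(y) < k:
        u1 = 2.0 * rng.random() - 1.0
        u2 = 2.0 * rng.random() - 1.0
        s = u1 * u1 + u2 * u2
        if s > 1.0 or s == 0.0:     # reject the pair and draw a new one
            continue
        w = math.sqrt(-2.0 * math.log(s) / s)   # common factor in Eq. (D-2)
        y.extend([u1 * w, u2 * w])
    return np.array(y[:k])          # if k is odd the last deviate is discarded

def sample_mvn(a, U, rng):
    """Step (e): z = a + U y, Eq. (D-3)."""
    return a + U @ standard_normals(len(a), rng)

# Illustrative mean and covariance (V is positive definite)
a = np.array([1.0, -2.0, 0.5])
V = np.array([[4.0, 1.2, 0.0],
              [1.2, 1.0, 0.3],
              [0.0, 0.3, 2.0]])
U = np.linalg.cholesky(V)           # step (d), performed once for all samples
rng = random.Random(12345)
Z = np.array([sample_mvn(a, U, rng) for _ in range(20000)])
```

With many samples, the sample mean and sample covariance of `Z` should lie close to `a` and `V`.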
Appendix E
The Gauss-Markov Theorem
Suppose we have a model
$$y - B\theta = \varepsilon \tag{E-1}$$
where $\varepsilon$ is a random vector with mean 0 and nonsingular covariance matrix
V, y is a vector of observations, and B a known matrix. We wish to find
the least-variance linear unbiased estimator $\theta^*$ for $\theta$. Linearity implies the
existence of a matrix A independent of y such that
$$\theta^* = Ay \tag{E-2}$$
Therefore
$$E(\theta^*) = E(Ay) = AE(y) = AB\theta \tag{E-3}$$
If the estimate is unbiased, we must have $E(\theta^*) = \theta$ for any $\theta$, hence
$$AB = I \tag{E-4}$$
The covariance matrix of $\theta^*$ is given by
$$\begin{aligned}
V_\theta &= E(\theta^* - \theta)(\theta^* - \theta)^T = E(Ay - \theta)(Ay - \theta)^T \\
 &= E[A(y - B\theta) + (AB\theta - \theta)][A(y - B\theta) + (AB\theta - \theta)]^T \\
 &= E[A(y - B\theta)(y - B\theta)^TA^T] = E(A\varepsilon\varepsilon^TA^T) = AVA^T
\end{aligned} \tag{E-5}$$
We wish to find the matrix A which minimizes some measure of $V_\theta$. The
following are possible measures:
1. The so-called "generalized variance," i.e., det $V_\theta$.
2. Some weighted average of the elements of $V_\theta$, e.g., Tr($GV_\theta$), where G
is an arbitrary positive definite matrix.
3. The spectral norm (largest eigenvalue) of $V_\theta$.
All measures lead to the same answer. We shall use the measure (1) here,
and the reader may derive the result for the other measures as an exercise.
We wish, then, to determine the matrix A which minimizes det $AVA^T$
while satisfying AB = I. We introduce a matrix of Lagrange multipliers $\Lambda$,
and construct the Lagrangian
$$\mathcal{L}(A, \Lambda) = \det AVA^T + \mathrm{Tr}[\Lambda(AB - I)] \tag{E-6}$$
We must find the stationary point of Eq. (6). By the methods of Section
A-2 we find
$$\partial\mathcal{L}/\partial A = 2(AVA^T)^{-1}AV + \Lambda^TB^T = 0 \tag{E-7}$$
Strictly, the derivative of det $AVA^T$ carries an additional positive scalar factor
det $AVA^T$; it can be absorbed into $\Lambda$ and does not affect what follows.
Postmultiplying Eq. (7) by $A^T$, we obtain in view of Eq. (4)
$$2I + \Lambda^T = 0 \tag{E-8}$$
Substituting $\Lambda^T = -2I$ in Eq. (7), we obtain
$$(AVA^T)^{-1}AV = B^T \tag{E-9}$$
Postmultiplying by $V^{-1}$ and then by B we find, successively:
$$(AVA^T)^{-1}A = B^TV^{-1} \tag{E-10}$$
$$(AVA^T)^{-1}AB = (AVA^T)^{-1} = B^TV^{-1}B \tag{E-11}$$
Substituting Eq. (11) in Eq. (10)
$$B^TV^{-1}BA = B^TV^{-1} \tag{E-12}$$
so that finally
$$A = (B^TV^{-1}B)^{-1}B^TV^{-1} \tag{E-13}$$
and
$$\theta^* = (B^TV^{-1}B)^{-1}B^TV^{-1}y \tag{E-14}$$
in agreement with Eq. (4-4-7). It is not difficult to verify that this solution is
indeed a minimum of det $AVA^T$.
A treatment of the general case where singular matrices may occur is
given by Price (1964). The results call for substitution of pseudoinverses for
inverses whenever needed.
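Equations (E-13) and (E-14) translate directly into code. The sketch below assumes NumPy and a nonsingular V; the function name `gauss_markov` and the numerical values of `B`, `V` and `theta_true` are hypothetical. Linear solves are used instead of explicit inverses, which is a choice of this sketch rather than something prescribed by the text.

```python
import numpy as np

def gauss_markov(B, V, y):
    """Return (theta_star, A) with A = (B^T V^-1 B)^-1 B^T V^-1, Eq. (E-13)."""
    Vinv_B = np.linalg.solve(V, B)                 # V^{-1} B
    # V is symmetric, so (V^{-1} B)^T equals B^T V^{-1}
    A = np.linalg.solve(B.T @ Vinv_B, Vinv_B.T)
    return A @ y, A                                # Eq. (E-14)

# Illustrative straight-line model with unequal error variances
B = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
V = np.diag([1.0, 2.0, 1.0, 0.5])
theta_true = np.array([2.0, -1.0])
y = B @ theta_true                                 # error-free observations
theta_star, A = gauss_markov(B, V, y)
V_theta = A @ V @ A.T                              # covariance of the estimator, Eq. (E-5)
```

With error-free data the estimator reproduces `theta_true`, which is one way to check the unbiasedness condition AB = I numerically.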
Appendix F
A Convergence Theorem
for Gradient Methods
Theorem. Given a continuous function $\Phi(\theta)$ with continuous differentiable
first derivatives. Let $\Phi_1 = \Phi(\theta_1)$, and let $\Omega$ be the set of all points $\theta$ such that
$\Phi(\theta) \le \Phi_1$. Define a sequence of points $\theta_1, \theta_2, \ldots$ by
$$\theta_{j+1} = \theta_j - \rho_j R_j q_j \tag{F-1}$$
where
$$q_j \equiv (\partial\Phi/\partial\theta)_{\theta = \theta_j}$$
We make the following further assumptions.
1. There exists a number M such that no eigenvalue of the Hessian
$H(\theta)$ exceeds M in absolute value for all $\theta \in \Omega$.
2. All $R_j$ are positive definite matrices whose eigenvalues fall between two
positive numbers $0 < \beta < \gamma$.
3. All $\rho_j$ are chosen so that
$$\min(\rho_0, \alpha\mu_j) \le \rho_j \le \mu_j \tag{F-2}$$
where $\rho_0$ is a positive constant, $\alpha$ a constant satisfying $0 < \alpha < 1$, and $\mu_j$ is
the smallest nonnegative $\rho$ at which $\Phi(\theta_j - \rho R_j q_j)$ is a stationary function
of $\rho$.
Let $\theta^*$ be a limit point of the sequence $\{\theta_j\}$. Then $\theta^*$ is a stationary point
of $\Phi$, i.e., $q^* \equiv q(\theta^*) = 0$.
Comment: Such a limit point (not necessarily unique) must exist if $\Omega$ is
bounded.
Proof. Clearly $\{\Phi_j\} \equiv \{\Phi(\theta_j)\}$ is a monotone nonincreasing sequence.
Because of continuity, we must have
$$\Phi^* \equiv \Phi(\theta^*) \le \Phi_j \quad (j = 1, 2, \ldots) \tag{F-3}$$
Suppose $\theta^*$ is not stationary. Then $\|q^*\| = a > 0$. Due to continuity of $\Phi$
and q, and the definition of a limit point, we can find an integer j such that
$$2a \ge \|q_j\| \ge \tfrac{1}{2}a \tag{F-4}$$
and
$$\Phi_j \le \Phi^* + \min\left[\frac{(2-\alpha)\alpha a^2\beta^2}{256\gamma^2M},\ \frac{a^2\beta^2}{256\gamma^2M},\ \frac{a^2\beta\rho_0}{16}\right] \tag{F-5}$$
Consider the function
$$\Psi(\rho) \equiv \Phi(\theta_j - \rho R_j q_j) \tag{F-6}$$
we have
$$d\Psi/d\rho = (\partial\Phi/\partial\theta)^T(\partial\theta/\partial\rho) = -q^TR_jq_j \tag{F-7}$$
and
$$(d\Psi/d\rho)_{\rho=0} = -q_j^TR_jq_j \le -\beta\|q_j\|^2 \le -\beta a^2/4 \tag{F-8}$$
which follows from Eq. (A-1-32) and Eq. (4).
Also
$$|d^2\Psi/d\rho^2| = |(\partial\theta/\partial\rho)^T(\partial^2\Phi/\partial\theta\,\partial\theta)(\partial\theta/\partial\rho)| = |q_j^TR_jHR_jq_j| \le M\|R_jq_j\|^2 \le 4a^2\gamma^2M \tag{F-9}$$
which follows from Eq. (A-1-31) and Eq. (A-1-32).
In view of Eq. (8) and Eq. (9) we have, for $\rho > 0$
$$d\Psi/d\rho \le -\beta a^2/4 + 4a^2\gamma^2M\rho \tag{F-10}$$
At $\rho = \mu_j$ we have $d\Psi/d\rho = 0$. Hence, from Eq. (10)
$$0 \le -\beta a^2/4 + 4a^2\gamma^2M\mu_j \tag{F-11}$$
or, $\mu_j \ge \beta/16\gamma^2M$ and $\alpha\mu_j \ge \alpha\beta/16\gamma^2M$. In view of Eq. (10) we have, for
$\rho > 0$
$$\Psi(\rho) \le \Phi_j - (\beta a^2/4)\rho + 2a^2\gamma^2M\rho^2 \tag{F-12}$$
Suppose $\rho_j$ has been chosen so that $\alpha\mu_j \le \rho_j \le \mu_j$ [see Eq. (2)]. Since
$\Psi(\rho)$ is monotonically nonincreasing for $0 \le \rho \le \mu_j$, we have because of
Eq. (12)
$$\begin{aligned}
\Phi_{j+1} = \Psi(\rho_j) &\le \Psi(\alpha\mu_j) \le \Psi(\alpha\beta/16\gamma^2M) \\
 &\le \Phi_j - (\beta a^2/4)(\alpha\beta/16\gamma^2M) + 2a^2\gamma^2M(\alpha\beta/16\gamma^2M)^2 \\
 &= \Phi_j - [(2-\alpha)\alpha a^2\beta^2/128\gamma^2M]
\end{aligned} \tag{F-13}$$
Employing Eq. (5) we find, then
$$\Phi_{j+1} \le \Phi^* - [(2-\alpha)\alpha a^2\beta^2/256\gamma^2M] < \Phi^* \tag{F-14}$$
This contradicts Eq. (3).
The other alternative is that $\rho_0 \le \rho_j \le \mu_j$. But then
$$\Phi_{j+1} \le \Phi_j - (\beta a^2/4)\rho_0 + 2a^2\gamma^2M\rho_0^2 \tag{F-15}$$
Now there are two possibilities:
(a) $\rho_0 \le \beta/16\gamma^2M$. Therefore
$$\begin{aligned}
\Phi_{j+1} &\le \Phi_j - a^2\rho_0(\beta/4 - 2\gamma^2M\rho_0) \le \Phi_j - a^2\rho_0[\beta/4 - 2\gamma^2M(\beta/16\gamma^2M)] \\
 &= \Phi_j - a^2\beta\rho_0/8 \le \Phi^* - a^2\beta\rho_0/16 < \Phi^*
\end{aligned} \tag{F-16}$$
in contradiction with Eq. (3).
(b) $\beta/16\gamma^2M \le \rho_0 \le \rho_j$. Therefore, because $\Psi(\rho)$ is monotonically
nonincreasing at $\rho = \rho_0$
$$\begin{aligned}
\Phi_{j+1} = \Psi(\rho_j) &\le \Psi(\beta/16\gamma^2M) \le \Phi_j - (\beta a^2/4)(\beta/16\gamma^2M) + 2a^2\gamma^2M(\beta/16\gamma^2M)^2 \\
 &= \Phi_j - (a^2\beta^2/128\gamma^2M) \le \Phi^* - (a^2\beta^2/256\gamma^2M) < \Phi^*
\end{aligned} \tag{F-17}$$
again contradicting Eq. (3). Hence $\theta^*$ must be stationary.
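The iteration of Eq. (F-1) can be tried out on a small quadratic objective, for which the step $\mu_j$ of assumption 3 has a closed form. Everything specific below is an assumption made for illustration: the objective $\Phi(\theta) = \frac{1}{2}\theta^T H\theta$, the matrix `H`, the choice $R_j = I$, the starting point, and the choice $\rho_j = \mu_j$ (which satisfies Eq. (F-2)).

```python
import numpy as np

# Hypothetical positive definite Hessian of Phi(theta) = 0.5 * theta^T H theta
H = np.array([[10.0, 2.0],
              [2.0, 1.0]])

def q(theta):
    """Gradient of the quadratic objective."""
    return H @ theta

theta = np.array([1.0, -3.0])       # arbitrary starting point theta_1
R = np.eye(2)                        # R_j = I gives steepest descent
for j in range(1000):
    g = q(theta)
    if np.linalg.norm(g) < 1e-8:     # stop once (numerically) stationary
        break
    s = R @ g                        # search direction R_j q_j
    # For a quadratic, Psi(rho) = Phi(theta - rho*s) is stationary where
    # s^T H (theta - rho*s) = 0, i.e. at rho = (s^T g) / (s^T H s) = mu_j
    mu = (s @ g) / (s @ H @ s)
    theta = theta - mu * s           # Eq. (F-1) with rho_j = mu_j
```

The iterates approach the unique stationary point of this objective, the origin, in line with the theorem.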
Appendix G
Some Estimation Programs
It is impossible to list all existing estimation programs. The ones listed
below are either of historical interest, possess special features, or are in
widespread use. The list is in chronological order.
1. G. W. Booth and T. I. Peterson (1958), Nonlinear estimation (IBM-
Princeton), 704 G2 3226 NLI (previously designated SHARE 687 WLNLI).
First generally available computer program for nonlinear parameter estima-
tion. Written in IBM 704 Assembly Language. Uses Gauss method with finite
difference approximation to solve single equation least squares problems.
2. M. A. Efroymson (1961), Nonlinear regression with differential
equations, 7090 G2 3146 NLR. Written in FORTRAN II for the IBM 7090
specifically to handle models which are in differential equation form. Uses
Gauss method with finite difference approximations.
3. L. Lapidus and T. I. Peterson (1964), Chemical reaction analysis by
nonlinear estimation, 7090 T2 IBM0014. A package combining program 1
above with a FORTRAN II interface for estimating parameters in chemical
kinetics models.
4. D. W. Marquardt (1965), Least squares estimation of nonlinear parameters,
7040 G2 3094 NLIN. Written in FORTRAN IV for the IBM 7040.
Uses Marquardt's method with analytic derivatives or finite difference
approximations to solve weighted least squares problems.
5. H. Eisenpress, A. Bomberault, and J. Greenstadt (1966), Nonlinear
regression equations and systems, estimation and prediction, 7090 G2
IBM0035. Written in FORTRAN IV-FORMAC for the IBM 7090. Performs
maximum likelihood estimation of multiple equation econometric
models. The FORMAC system automatically evaluates analytic derivatives
of all orders required for the full Newton method with rotational
discrimination.
6. D. F. Shanno, 1967, CREEP-Constrained nonlinear estimation
package, 7094 G2 3492. Written in FORTRAN IV for the IBM 7094.
Estimates parameters in least squares models with constraints, using a
modified Marquardt method combined with gradient projection. Requires
analytic derivatives.
7. Y. Bard, 1968, Nonlinear estimation and programming, 360D
13.6.003. Written in FORTRAN IV for the IBM System/360. Solves least
squares and multiple equation maximum likelihood problems with known
or unknown covariance matrix. Uses the Gauss method with analytic deriva-
tives. Includes provisions for constraints (penalty function method), prior
distributions, models in standard dynamic form (sensitivity equations), and
chemical kinetics models.
8. H. Eisenpress, 1968, Nonlinear regression equations and systems,
estimation and prediction, 360D 13.6.005. Program 5 above, rewritten in
PL/I-FORMAC for the IBM System/360.
9. F. S. Wood, 1971, Nonlinear least squares curve-fitting program, 360D
13.6.007. Written in FORTRAN IV for the IBM System/360. Solves least
squares problems, using a modified Marquardt method.
Note: At the time of writing, several of the above programs were available
from the SHARE Program Library Agency, Triangle Universities Computa-
tion Center, P.O. Box 12076, Research Triangle Park, North Carolina
27709.
References
Abadie, J., ed. (1967a). "Nonlinear Programming." Wiley, New York.
Abadie, J. (1967b). "Generalization of the Wolfe Reduced Gradient Method to the Case
of Nonlinear Constraints." Electricite de France, Direction des Etudes et Recherches,
Clamart, France.
Abadie, J., and Carpentier, J. (1966). Generalisation de la methode du gradient reduit de
Wolfe au cas de contraintes non lineaires. Internat. Congr. Operations Research, 4th,
Boston.
Acton, F. S. (1959). "Analysis of Straight Line Data." Wiley, New York.
Afifi, A. A., and Elashoff, R. M. (1966). Missing observations in multivariate statistics:
1. Review of the literature. J. Amer. Statist. Assoc. 61, 595-604.
Albert, A. E., and Gardner, L. A. (1967). "Stochastic Approximation and Nonlinear
Regression." MIT Press, Cambridge, Massachusetts.
Anderson, T. W. (1958). "An Introduction to Multivariate Statistical Analysis." Wiley,
New York.
Anscombe, F. J. (1960). Rejection of outliers. Technometrics 2, 123-147.
Arndt, R. A., and MacGregor, M. H. (1966). Nucleon-nucleon phase shift analyses by
chi-squared minimization. In "Methods in Computational Physics" (B. Alder, F.
Fernbach, and M. Rotenberg, eds.), Vol. 6. Academic Press, New York.
Atkinson, A. C., and Hunter, W. G. (1968). The design of experiments for parameter
estimation. Technometrics 10, 271-289.
Bard, Y. (1967). "A Function Maximization Method with Application to Parameter
Estimation." New York Scientific Center Report 322.0902, IBM, New York.
Bard, Y. (1968). On a numerical instability of Davidon-like methods. Math. Comp. 22,
665-666.
Bard, Y. (1970). Comparison of gradient methods for the solution of nonlinear parameter
estimation problems. SIAM J. Numer. Anal. 7, 157-186.
Bard, Y. (1971). An eclectic approach to nonlinear programming. Proc. ANU Sem. Optimization,
Canberra, Austral. Nat. Univ.
Bard, Y., and Lapidus, L. (1968). Kinetics analysis by digital parameter estimation. Catal.
Rev. 2, 67-112.
Barnett, V. D. (1967). A note on linear structural relationships when both residual variances
are known. Biometrika 54, 670-672.
Bartels, R. H., and Golub, G. H. (1968). Chebyshev solution to an overdetermined linear
system. Comm. ACM 11, 428.
Bayes, T. (1763). Essay towards solving a problem in the doctrine of chances. Philos.
Trans. Roy. Soc. 53, 370-418. [Reprinted in Biometrika 45, 293-315 (1958)].
Beale, E. M. L. (1960). Confidence regions in nonlinear estimation. J. Roy. Statist. Soc.
Ser. B 22, 41-76.
Beaton, A. E. (1964). "The Use of Special Matrix Operators in Statistical Calculus."
Research Bulletin RB-64-51, Educational Testing Service, Princeton, New Jersey.
Beauchamp, J. J., and Cornell, R. G. (1966). Simultaneous nonlinear estimation.
Technometrics 8, 319-326.
Behnken, D. W. (1964). Estimation of copolymer reactivity ratios: an example of nonlinear
estimation. J. Polym. Sci. Part A 2, 645-668.
Bellman, R., Collier, C., Kagiwada, H., Kalaba, R., and Selvester, R. (1964). Estimation of
heart parameters using skin potential measurements. Comm. ACM 7, 666-668.
Bellman, R., Jacquez, J., Kalaba, R., and Schwimmer, S. (1967). Quasilinearization and
the estimation of chemical rate constants from raw kinetic data. Math. Biosci. 1, 71-76.
Bellman, R. E., Kagiwada, H. H., Kalaba, R. E., and Sridhar, R. (1964). "Invariant
Imbedding and Nonlinear Filtering Theory." Memorandum RM-4374-PR, The Rand
Corporation, Santa Monica, California.
Berman, M., Weiss, M. F., and Shahn, E. (1962). Some formal approaches to the analysis
of kinetic data in terms of linear compartmental systems. Biophys. J. 2, 289-316.
Blakemore, J. W., and Hoerl, A. E. (1963). Fitting nonlinear reaction rate equations to
data. Chem. Eng. Progr. Symp. Ser. 59 (42), 14-27.
Bodkin, R. G., and Klein, L. R. (1967). Nonlinear estimation of aggregate production
functions. Rev. Econom. Statist. 49, 28-44.
Bond, E., Auslander, M., Grisoff, S., Kenney, R., Myszewski, M., Sammet, J., Tobey, R.,
and Zilles, S. (1964). FORMAC, An experimental FORmula MAnipulation Compiler.
Proc. Nat. Conf. Ass. Comput. Mach. 19th.
Booth, G. W., and Peterson, T. I. (1958). "Nonlinear Estimation." IBM SHARE Program
Pa. No. 687 WLNLI.
Box, G. E. P. (1957). Use of statistical methods in the elucidation of basic mechanisms.
Bull. Inst. Internat. Statist. 36, 215-225.
Box, G. E. P., and Hill, W. J. (1967). Discrimination among mechanistic models.
Technometrics 9, 57-71.
Box, G. E. P., and Hunter, W. G. (1962). A useful method for model-building. Technometrics
4, 301-318.
Box, G. E. P., and Hunter, W. G. (1963). Sequential design of experiments for nonlinear
models. Proc. IBM Sci. Comput. Symp. Statist., IBM, White Plains, New York.
Box, G. E. P., and Hunter, W. G. (1965). The experimental study of physical mechanisms.
Technometrics 7, 23-42.
Box, G. E. P., and Lucas, H. L. (1959). Design of experiments in nonlinear situations.
Biometrika 46, 77-90.
Box, G. E. P., and Muller, M. E. (1958). A note on the generation of random normal
deviates. Ann. Math. Statist. 29, 610-611.
Box, G. E. P., and Youle, P. V. (1955). The exploration and exploitation of response
surfaces: an example of the link between the fitted surface and the basic mechanism of
the system. Biometrics 11, 287-323.
Box, M. J. (1966). A comparison of several current optimization methods, and the use of
transformations in constrained problems. Comput. J. 9, 67-77.
Box, M. J. (1968). The occurrence of replications in optimal designs of experiments to
estimate parameters in nonlinear models. J. Roy. Statist. Soc. Ser. B 30, 290-302.
Brent, R. P. (1971). "Algorithms for Finding Zeros and Extrema of Functions without
Calculating Derivatives." Computer Science Report STAN-CS-71-198, Stanford
University, Palo Alto, California.
Broyden, C. G. (1967). Quasi-Newton methods and their application to function
minimization. Math. Comp. 21, 368-381.
Buzzi Ferraris, G. (1968). Metodo automatico per trovare l'ottimo di una funzione. Ing.
Chim. Ital. 4, 171-192.
Carney, T. M., and Goldwyn, R. M. (1967). Numerical experiments with various optimal
estimators. J. Optimization Theory Appl. 1, 113-130.
Carroll, C. W. (1961). The created response surface technique for optimizing nonlinear,
restrained, systems. Operations Res. 9, 169-184.
Chow, G. C. (1964). A comparison of alternative estimators for simultaneous equations.
Econometrica 32, 532-553.
Colville, A. R. (1968). "A Comparative Study of Nonlinear Programming Codes." IBM
N.Y. Scientific Center Report No. 320-2949, New York.
Cornfield, J. (1967). Bayes Theorem. Rev. Inst. Internat. Statist. 35, 34-49.
Cottle, R. W., and Dantzig, G. B. (1968). Complementary pivot theory of mathematical
programming. Linear Algebra and Appl. 1, 103-125.
Cragg, J. G. (1967). On the relative small sample properties of several structural-equation
estimators. Econometrica 35, 89-110.
Cramer, H. (1946). "Mathematical Methods of Statistics." Princeton Univ. Press, Princeton,
New Jersey.
Daniel, J. W. (1971). "The Approximate Minimization of Functionals." Prentice-Hall,
Englewood Cliffs, New Jersey.
Dantzig, G. B., and Cottle, R. W. (1967). Positive (semi-) definite programming. In
"Nonlinear Programming" (J. Abadie, ed.), pp. 55-73. Wiley, New York.
Davidon, W. C. (1959). "Variable Metric Method for Minimization." A.E.C. Research
and Development Report ANL-5990 (Rev.).
Davidon, W. C. (1968). Variance algorithm for minimization. Comput. J. 10, 406-410.
Davies, D. (1970). Some practical methods of optimization. In "Integer and Nonlinear
Programming" (J. Abadie, ed.). North-Holland Publ., Amsterdam.
Davies, O. L. D. (1954). "The Design and Analysis of Industrial Experiments." Oliver &
Boyd, Edinburgh.
Deming, W. E. (1943). "Statistical Adjustment of Data." Wiley, New York.
Deutsch, R. (1965). "Estimation Theory." Prentice-Hall, Englewood Cliffs, New Jersey.
Draper, N. R., and Hunter, W. G. (1966). Design of experiments for parameter estimation
in multiresponse situations. Biometrika 53, 525-533.
Draper, N. R., and Hunter, W. G. (1967a). The use of prior distributions in the design of
experiments for parameter estimation in nonlinear situations. Biometrika 54, 147-153.
Draper, N. R., and Hunter, W. G. (1967b). The use of prior distributions in the design of
experiments for parameter estimation in nonlinear situations: multiresponse case.
Biometrika 54, 662-665.
Draper, N. R., and Smith, H. (1966). "Applied Regression Analysis." Wiley, New York.
Eisenpress, H., Bomberault, A., and Greenstadt, J. (1966). "Nonlinear Regression Equations
and Systems, Estimation and Prediction (IBM) 7090." Computer program 7090-G2
IBM0035 G2, IBM, Hawthorne, New York.
Eisenpress, H., and Greenstadt, J. (1966). The estimation of nonlinear econometric systems.
Econometrica 34, 851-861.
Eisenpress, H., and Surkan, A. (1966). "Fitting Ore Deposit Models to Geophysical
Survey Data." Personal Communication.
Fariss, R. H., and Law, V. J. (1967). Practical tactics for overcoming difficulties in nonlinear
regression and equation solving. AIChE Meet., Houston, Feb. 1967.
Faure, P., and Huard, P. (1965). Resolution des programmes mathematiques a fonction
non lineaire par la methode du gradient reduit. Rev. Française Recherche Operationelle
9, 167-205.
Feller, W. (1966). "An Introduction to Probability Theory and Its Applications," Vol. II.
Wiley, New York.
Ferguson, T. S. (1967). "Mathematical Statistics, A Decision Theoretic Approach."
Academic Press, New York.
Fiacco, A. V. (1968). Second-order sufficient conditions for weak and strict constrained
minima. SIAM J. Appl. Math. 16, 105-108.
Fiacco, A. V., and McCormick, G. P. (1964). The sequential unconstrained minimization
technique for nonlinear programming: A primal-dual method. Management Sci. 10,
360-366.
Fiacco, A. V., and McCormick, G. P. (1965). "The Sequential Unconstrained Minimization
Technique for Convex Programming with Equality Constraints." RAC-TP-155, Research
Analysis Corporation, McLean, Virginia.
Fiacco, A. V., and McCormick, G. P. (1967). The slacked unconstrained minimization
technique for convex programming. SIAM J. Appl. Math. 15, 505-515.
Fiacco, A. V., and McCormick, G. P. (1968). "Nonlinear Programming: Sequential
Unconstrained Minimization Techniques." Wiley, New York.
Fisher, R. A. (1935). "The Design of Experiments." Oliver & Boyd, Edinburgh.
Fisher, R. A. (1950). "Contributions to Mathematical Statistics" (Collection of papers
published 1920-1943). Wiley, New York.
Flanagan, P. D., Vitale, P. A., and Mendelsohn, J. (1969). A numerical investigation of
several one-dimensional search procedures in nonlinear regression problems.
Technometrics 11, 265-284.
Fletcher, R. (1965). Function minimization without evaluating derivatives-a review.
Comput. J. 8, 33-41.
Fletcher, R. (1970). A new approach to variable metric algorithms. Comput. J. 13, 317-322.
Fletcher, R., and Powell, M. J. D. (1963). A rapidly convergent descent method for mini-
mization. Comput. J. 6, 163-168.
Fox, L. (1964). "An Introduction to Numerical Linear Algebra." Oxford Univ. Press
(Clarendon), London and New York.
Freudenstein, F., and Woo, L. S. (1968). "Kinematics of the Human Knee Joint." IBM
New York Scientific Center Report No. 320-2928, New York.
Galambos, J. T., and Cornell, R. G. (1962). Mathematical models for the study of the
metabolic pattern of sulfate. J. Lab. Clin. Med. 60, 53-63.
Gauss, K. F. (1809). "Theoria Motus Corporum Coelestium." In "Werke," Vol. 7, 240-
254.
Goldfarb, D., and Lapidus, L. (1968). A conjugate gradient method for nonlinear pro-
gramming problems with linear constraints. Ind. Eng. Chem. Fundam. 7, 142-151.
Goldfeld, S. M., Quandt, R. E., and Trotter, H. F. (1966). Maximization by quadratic
hill climbing. Econometrica 34, 541-551.
Goldstein, A. A., and Price, J. F. (1967). An effective algorithm for minimization. Numer.
Math. 10, 184-189.
Golub, G. (1965). Numerical methods for solving linear least squares problems. Numer.
Math. 7, 206-216.
Golub, G. H. (1969). Matrix decompositions and statistical calculations. In "Statistical
Computation" (R. C. Milton and J. A. Nelder, eds.). Academic Press, New York.
Golub, G. H., and Pereyra, V. (1972). "The Differentiation of Pseudoinverses and Nonlinear
Least Squares Problems Whose Variables Separate." Rep. No. STAN-CS-72-261,
Computer Science Dept., Stanford University, Palo Alto, California.
Grant, F. S., and West, G. F. (1965). "Interpretation Theory in Applied Geophysics."
McGraw-Hill, New York.
Greenstadt, J. (1967). On the relative efficiencies of gradient methods. Math. Comp. 21,
360-367.
Greenstadt, J. (1970). Variations on variable metric methods. Math. Comp. 24, 1-22.
Guttman, I., and Meeter, D. A. (1965). On Beale's measures of nonlinearity. Technometrics
7, 623-637.
Hadley, G. (1964). "Nonlinear and Dynamic Programming." Addison-Wesley, Reading,
Massachusetts.
Hammersley, J. M., and Hanscomb, D. C. (1964). "Monte Carlo Methods." Methuen,
London.
Hartley, H. O. (1961). The modified Gauss-Newton method for the fitting of nonlinear
regression functions by least squares. Technometrics 3, 269-280.
Hartley, H. O. (1964). Exact confidence regions for the parameters in nonlinear regression
laws. Biometrika 51, 347-353.
Healy, M. J. R. (1968). Multiple regression with a singular matrix. J. Roy. Statist. Soc.
Ser. C Appl. Statist. 17, 110-117.
Heineken, F. G., Tsuchiya, H. M., and Aris, R. (1967a). On the mathematical status of the
pseudosteady state hypothesis of biochemical kinetics. Math. Biosci. 1, 95-113.
Heineken, F. G., Tsuchiya, H. M., and Aris, R. (1967b). On the accuracy of determining
rate constants in enzymatic reactions. Math. Biosci. 1, 115-141.
Hicks, J. S., and Wei, J. (1967). Numerical solution of parabolic partial differential equations
with two-point boundary conditions by use of the method of lines. J. Assoc. Comput.
Mach. 14, 549-562.
Hill, W. J., Hunter, W. G., and Wichern, D. W. (1968). A joint design criterion for the dual
problem of model discrimination and parameter estimation. Technometrics 10, 145-160.
Himmelblau, D. M., Jones, C. R., and Bischoff, K. B. (1967). Determination of rate con-
stants for complex kinetics models. Ind. Eng. Chem. Fundam. 6, 539-543.
Hoerl, A. E. (1962). Application of ridge analysis to regression problems. Chem. Eng.
Progr. 58, 54-59.
Hoerl, A. E., and Kennard, R. W. (1970). Ridge regression: biased estimation for non-
orthogonal problems. Technometrics 12, 55-67.
Hood, W. C., and Koopmans, T. C., eds. (1953). "Studies in Econometric Method."
Wiley, New York.
Hooke, R., and Jeeves, T. A. (1961). "Direct search" solution of numerical and statistical
problems. J. Assoc. Comput. Mach. 8, 212-229.
Hougen, O. A., and Watson, K. M. (1947). "Chemical Process Principles, Part Three:
Kinetics and Catalysis." Wiley, New York.
Howland, J. L., and Vaillancourt, R. (1961). A generalized curve fitting method. SIAM
J. Appl. Math. 9, 165-168.
Hunter, W. G., and Mezaki, R. (1964). A model building technique for chemical engineering
kinetics. AIChE J. 10, 315-322.
Hunter, W. G., and Mezaki, R. (1967). An experimental design strategy for distinguishing
among rival mechanistic models: an application to the catalytic hydrogenation of pro-
pylene. Can. J. Chem. Eng. 45, 247-249.
Jeffreys, H. (1961). "Theory of Probability," 3rd ed. Oxford Univ. Press, London and
New York.
Jennrich, R. I., and Sampson, P. F. (1968). Application of stepwise regression to nonlinear
estimation. Technometrics 10, 63-72.
John, F. (1948). Extremum problems with inequalities as subsidiary conditions. In "Studies
and Essays," 187-204. Wiley (Interscience), New York.
Johnston, J. (1963). "Econometric Methods." McGraw-Hill, New York.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. J. Basic
Eng. 82, 33-45.
Kelley, H. J., and Denham, W. F. (1966). Orbit determination with the Davidon Method.
Joint Automat. Control Conf., Seattle, Washington.
Kelley, H. J., and Denham, W. F. (1969). Modeling and adjoints for continuous systems.
J. Optimization Theory Appl. 3, 174-183.
Kelley, Jr., J. E. (1958). An application of linear programming to curve fitting. SIAM J.
Appl. Math. 6, 15-22.
Kittrell, J. R., Hunter, W. G., and Mezaki, R. (1966). The use of diagnostic parameters
for kinetic model building. AIChE J. 12, 1014-1017.
Kittrell, J. R., Hunter, W. G., and Watson, C. C. (1966). Obtaining precise parameter
estimates for nonlinear catalytic rate models. AIChE J. 12, 5-10.
Kittrell, J. R., Mezaki, R., and Watson, C. C. (1965). Estimation of parameters for non-
linear least squares analysis. Ind. Eng. Chem. 57, 18-27.
Koopmans, T. C., and Hood, W. C. (1953). The estimation of simultaneous linear economic
relationships. In "Studies in Econometric Method" (W. C. Hood and T. C. Koopmans,
eds.). Wiley, New York.
Korin, B. P. (1968). On the distribution of a statistic used for testing a covariance matrix.
Biometrika 55, 171-178.
Kowalik, J., and Osborne, M. R. (1968). "Methods for Unconstrained Optimization
Problems." American Elsevier, New York.
Kuhn, H. W., and Tucker, A. W. (1951). Nonlinear programming. In Proc. Berkeley Symp.
Math. Statist. and Probability, 2nd (J. Neyman, ed.). Univ. of California Press, Berkeley,
California.
Kullback, S. (1959). "Information Theory and Statistics." Wiley, New York.
Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist.
22, 79-86.
Künzi, H. P., and Krelle, W. (1966). "Nonlinear Programming." Ginn (Blaisdell), Boston,
Massachusetts.
Lawton, W. H., and Sylvestre, E. A. (1971). Elimination of linear parameters in nonlinear
regression. Technometrics 13, 461-467.
Legendre, A. M. (1805). "Nouvelles Methodes pour la Determination des Orbites des
Cometes." Paris.
Lehman, E. L. (1959). "Testing Statistical Hypotheses." Wiley, New York.
Levenberg, K. (1944). A method for the solution of certain nonlinear problems in least
squares. Quart. Appl. Math. 2, 164-168.
Lewis, P. A. W., Goodman, A. S., and Miller, J. M. (1969). A pseudorandom number
generator for the System/360. IBM Systems J. 8, 136-145.
Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann.
Math. Statist. 27, 986-1005.
Longley, J. W. (1967). An appraisal of least squares programs for the electronic computer
from the point of view of the user. J. Amer. Statist. Assoc. 62, 819-841.
Mangasarian, O. L. (1969). "Nonlinear Programming." McGraw-Hill, New York.
Marquardt, D. W. (1963). An algorithm for least squares estimation of nonlinear parameters.
SIAM J. 11, 431-441.
Marsaglia, G., and Bray, T. A. (1964). A convenient method for generating normal variables.
SIAM Rev. 6, 260-264.
McCormick, G. P. (1967). Second-order conditions for constrained minima. SIAM J.
Appl. Math. 15, 641-652.
McGhee, R. B. (1963). "Identification of nonlinear dynamic systems by regression analysis
methods." Doctoral dissertation, Univ. Southern California, Los Angeles, California
(University Microfilms 64-2588, Ann Arbor, Mich.).
Melkanoff, M. A., Sawada, T., and Raynal, J. (1966). Nuclear optical model calculations.
In "Methods in Computational Physics" (B. Alder, S. Fernbach, and M. Rotenberg,
eds.), Vol. 6. Academic Press, New York.
von Mises, R. (1919). Fundamentalsätze der Wahrscheinlichkeitsrechnung. Math. Zeitschrift
4, 1-97.
Moshman, J. (1967). Random number generation. In "Mathematical Methods for Digital
Computers" (A. Ralston and H. S. Wilf, eds.), Volume II. Wiley, New York.
Murtagh, B. A., and Sargent, R. W. H. (1969). A constrained minimization method with
quadratic convergence. In "Optimization" (R. Fletcher, ed.). Academic Press, New York.
Nelder, J. A., and Mead, R. (1965). A simplex method for function minimization. Comput.
J. 7, 308-313.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory
of probability. Phil. Trans. Roy. Soc. London Ser. A, 231, 333-380.
Neyman, J. (1962). Two breakthroughs in the theory of statistical decision making. Rev.
Inst. Internat. Statist. 30, 11-27.
Ortega, J. M., and Kaiser, H. F. (1963). The LL^T and QR methods for symmetric tri-
diagonal matrices. Comput. J. 6, 99-101.
Osborne, M. R., and Watson, G. A. (1969). An algorithm for minimax approximation in
the nonlinear case. Comput. J. 12, 63-68.
Pearson, J. D. (1969). Variable metric methods of minimization. Comput. J. 12, 171-178.
Penrose, R. (1955). A generalized inverse for matrices. Proc. Cambridge Philos. Soc. 51,
406-413.
Perdreauville, F. J., and Goodson, R. E. (1966). Identification of systems described by
partial differential equations. Trans. ASME, Ser. D, 88, 463-468.
Peterson, T. I. (1962). Kinetics and mechanism of naphthalene oxidation by nonlinear
estimation. Chem. Eng. Sci. 17, 203-219.
Pontryagin, L. S., Bolyanskii, V. G., Gamkrelidze, R. V., and Mishchenko, E. F. (1962).
"The Mathematical Theory of Optimal Processes." K. N. Trigoroff (transl.). Wiley
(Interscience), New York.
Powell, M. J. D. (1964). An efficient method for finding the minimum of a function of
several variables without calculating derivatives. Comput. J. 7, 155-162.
Powell, M. J. D. (1965). A method for minimizing a sum of squares of nonlinear functions
without calculating derivatives. Comput. J. 7, 303-307.
Powell, M. J. D. (1969). A theorem on rank one modification to a matrix and its inverse.
Comput. J. 12, 288-290.
Price, C. M. (1964). The matrix pseudoinverse and minimal variance estimates. SIAM
Rev. 6, 115-120.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika 43, 353-360.
Raiffa, H., and Schlaifer, R. (1961). "Applied Statistical Decision Theory." Graduate
School of Business Administration, Harvard Univ., Boston.
Rao, C. R. (1957). Theory of the method of estimation by minimum chi-square. Bull.
Internat. Statist. Inst. 35, 25-32.
Robbins, H. (1955). An empirical Bayes' approach to statistics. In Proc. Berkeley Symp.
Statist. and Probability, 3rd 1, 157-164. Univ. of California Press, Berkeley, California.
Robbins, H. (1964). The empirical Bayes' approach to statistical decision problems. Ann.
Math. Statist. 35, 1-20.
Rosen, J. B. (1960). The gradient projection method for nonlinear programming: I. Linear
constraints. SIAM J. 8, 181-217.
Rosen, J. B. (1961). The gradient projection method for nonlinear programming: II.
Nonlinear constraints. SIAM J. 9, 514-532.
Rosenbrock, H. H. (1960). An automatic method for finding the greatest or least value of a
function. Comput. J. 3, 175-184.
Rosenbrock, H. H., and Storey, C. (1966). "Computational Methods for Chemical Engi-
neers." Pergamon, Oxford.
Rutemiller, H. C., and Bowers, D. A. (1968). Estimation in a heteroscedastic regression
model. J. Amer. Statist. Assoc. 63, 552-557.
Sammet, J. E. (1966). Survey of formula manipulation. Comm. ACM 9, 555-569.
Savage, L. J. (1954). "The Foundations of Statistics." Wiley, New York.
Scheffe, H. (1959). "The Analysis of Variance." Wiley, New York.
Seal, H. L. (1967). The historical development of the Gauss linear model. Biometrika 54, 1-24.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J.
27, 379-423 and 623-656.
Shinbrot, M. (1954). "On the Analysis of Linear and Nonlinear Dynamical Systems from
Transient-Response Data." NACA Technical Notes, TN 3288.
Smith, Jr., F. B., and Shanno, D. F. (1971). An improved Marquardt procedure for non-
linear regressions. Technometrics 13, 63-74.
Solow, R. M. (1957). Technical change and the aggregate production function. Rev.
Econom. Statist. 39, 312-320.
Sorenson, H. W. (1966). Kalman filtering techniques. In "Advances in Control Systems"
(C. T. Leondes, ed.), Vol. 3. Academic Press, New York.
Spendley, W. (1969). Nonlinear least squares fitting using a modified simplex minimization
method. In "Optimization" (R. Fletcher, ed.). Academic Press, New York.
Stewart III, G. W. (1967). A modification of Davidon's minimization method to accept
difference approximations of derivatives. J. Assoc. Comput. Mach. 14, 72-83.
Swed, F. S., and Eisenhart, C. (1943). Tables for testing randomness of grouping in a
sequence of alternatives. Ann. Math. Statist. 14, 66-87.
Tomovic, R. (1963). "Sensitivity Analysis of Dynamic Systems." McGraw-Hill, New York.
Turner, M. E., Monroe, R. J., and Homer, L. D. (1963). Generalized kinetic regression
analysis: hypergeometric kinetics. Biometrics 19, 406-428.
Wagner, H. M. (1959). Linear programming techniques for regression analysis. J. Amer.
Statist. Assoc. 54, 206-212.
Wald, A. (1947). "Sequential Analysis." Wiley, New York.
Wiener, N. (1949). "Extrapolation, Interpolation and Smoothing of Stationary Time
Series." MIT Press, Cambridge, Massachusetts and Wiley, New York.
Wilde, D. J., and Beightler, C. S. (1967). "Foundations of Optimization." Prentice-Hall,
Englewood Cliffs, New Jersey.
Wilkinson, J. H. (1965). "The Algebraic Eigenvalue Problem." Oxford Univ. Press (Claren-
don), London and New York.
Winkler, R. L. (1967). The assessment of prior distributions in Bayesian analysis. J. Amer.
Statist. Assoc. 62, 776-800.
Wolfe, P. (1963). Methods of nonlinear programming. In "Recent Advances in Mathe-
matical Programming" (R. L. Graves and P. Wolfe, eds.). McGraw-Hill, New York.
Zangwill, W. I. (1967a). Nonlinear programming via penalty functions. Management Sci.
5, 344-358.
Zangwill, W. I. (1967b). Minimizing a function without calculating derivatives. Comput.
J. 10, 293-296.
Zoutendijk, G. (1960). "Methods of Feasible Directions." Elsevier, Amsterdam.
Author Index
Numbers in italics refer to the pages on which the complete references are listed.
A
Abadie, J., 83, 146, 325
Acton, F. S., 201, 325
Afifi, A. A., 245, 325
Albert, A. E., 251, 325
Anderson, T. W., 170, 200, 325
Anscombe, F. J., 202, 325
Aris, R., 273, 329
Arndt, R. A., 15, 325
Atkinson, A. C., 265, 272, 325
Auslander, M., 116, 326
B
Bard, Y., 91, 96, 107, 109, 110, 148, 151,
277, 324, 325
Barnett, V. D., 81, 325
Bartels, R. H., 77, 325
Bayes, T., 36, 325
Beale, E. M. L., 170, 191, 325
Beaton, A. E., 296, 362
Beauchamp, J. J., 16, 253, 326
Behnken, D. W., 265, 326
Beightler, C. S., 83, 332
Bellman, R., 16, 226, 230, 242, 326
Berman, M., 16, 326
Bischoff, K. B., 220, 329
Blakemore, J. W., 277, 326
Bodkin, R. G., 25, 133, 138, 326
Bolyanskii, V. G., 225, 331
Bomberault, A., 116, 133, 323, 327
Bond, E., 116, 326
Booth, G. W., 7, 323, 326
Bowers, D. A., 247, 332
Box, G. E. P., 100, 123, 204, 261, 265,
267, 269, 316, 326
Box, M. J., 119, 120, 153, 274, 326
Bray, T. A., 316, 330
Brent, R. P., 120, 326
Broyden, C. G., 107, 108, 326
Buzzi Ferraris, G., 120, 327
C
Carney, T. M., 61, 327
Carpentier, J., 146, 325
Carroll, C. W., 141, 327
Chow, G. C., 61, 327
Collier, C., 16, 326
Colville, A. R., 117, 327
Cornell, R. G., 16, 253, 326, 328
Cornfield, J., 33, 327
Cottle, R. W., 148, 327
Cragg, J. G., 61, 327
Cramer, H., 19, 80, 178, 185, 186, 188,
201, 327
D
Daniel, J. W., 83, 327
Dantzig, G. G., 148, 327
Davidon, W. C., 106, 107, 108, 110, 327
Davies, D., 146, 327
Davies, O. L. D., 260, 327
Deming, W. E., 154, 327
Denham, W. F., 16, 242, 330
Deutsch, R., 77, 225, 327
Draper, N. R., 201, 265, 327
E
Efroymson, M. A., 323
Eisenhart, C., 201, 332
Eisenpress, H., 15, 116, 133, 323, 324, 327
Elashoff, R. M., 245, 325
F
Fariss, R. H., 91, 327
Faure, P., 146, 327
Feller, W., 19, 328
Ferguson, T. S., 33, 328
Fiacco, A. V., 52, 107, 108, 142, 145,
159, 328
Fisher, R. A., 7, 260, 328
Flanagan, P. D., 91, 328
Fletcher, R., 107, 110, 120, 328
Fox, L., 307, 328
Freudenstein, F., 16, 328
G
Galambos, J. T., 253, 328
Gamkrelidze, R. V., 225, 331
Gardner, L. A., 251, 325
Gauss, K. F., 6, 97, 328
Goldfarb, D., 110, 146, 328
Goldfeld, S. M., 94, 328
Goldstein, A. A., 90, 328
Goldwyn, R. M., 61, 327
Golub, G. H., 77, 102, 103, 122, 325, 328
Goodman, A. S., 316, 330
Goodson, R. E., 221, 331
Grant, F. S., 15, 328
Greenstadt, J., 89, 92, 107, 116, 133, 323,
327, 329
Grisoff, S., 116, 326
Guttman, I., 170, 191, 329
H
Hadley, G., 84, 329
Hammersley, J. M., 46, 329
Hanscomb, D. C., 46, 329
Hartley, H. O., 100, 170, 191, 329
Healy, M. J. R., 102, 329
Heineken, F. G., 273, 329
Hicks, J. S., 224, 329
Hill, W. J., 267, 269, 326, 329
Himmelblau, D. M., 220, 329
Hoerl, A. E., 60, 277, 326, 329
Homer, L. D., 16, 332
Hood, W. C., 7, 64, 329, 330
Hooke, R., 119, 120, 329
Hougen, O. A., 277, 329
Howland, J. L., 226, 329
Huard, P., 146, 327
Hunter, W. G., 123, 204, 265, 269, 272,
281, 325, 326, 327, 329, 330
J
Jacquez, J., 226, 230, 326
Jeeves, T. A., 119, 120, 329
Jeffreys, H., 35, 329
Jennrich, R. I., 91, 94, 102, 329
John, F., 52, 329
Johnston, J., 16, 329
Jones, C. R., 220, 329
K
Kagiwada, H., 16, 242, 326
Kaiser, H. F., 302, 331
Kalaba, R., 16, 226, 230, 242, 326
Kalman, R. E., 225, 330
Kelley, H. J., 16, 242, 330
Kelley, Jr., J. E., 71, 330
Kennard, R. W., 60, 329
Kenney, R., 116, 326
Kittrell, J. R., 120, 123, 265, 330
Klein, L. R., 25, 133, 138, 326
Koopmans, T. C., 7, 64, 329, 330
Korin, B. P., 200, 330
Kowalik, J., 84, 330
Krelle, W., 83, 330
Kuhn, H. W., 52, 330
Kullback, S., 267, 330
Künzi, H. P., 83, 330
L
Lapidus, L., 110, 146, 277, 323, 325, 328
Law, V. J., 91, 327
Lawton, W. H., 122, 330
Legendre, A. M., 6, 330
Lehman, E. L., 170, 330
Leibler, R. A., 267, 330
Levenberg, K., 94, 330
Lewis, P. A. W., 316, 330
Lindley, D. V., 261, 286, 330
Longley, J. W., 103, 330
Lucas, H. L., 261, 265, 326
Mc
McCormick, G. P., 52, 107, 108, 142,
145, 159, 328, 330
McGhee, R. B., 100, 226, 330
MacGregor, M. H., 15, 325
M
Mangasarian, O. L., 53, 330
Marquardt, D. W., 94, 114, 323, 330
Marsaglia, G., 316, 330
Mead, R., 120, 331
Meeter, D. A., 170, 191, 329
Melkanoff, M. A., 15, 331
Mendelsohn, J., 91, 328
Mezaki, R., 120, 123, 204, 281, 329, 330
Miller, J. M., 316, 330
von Mises, R., 73, 331
Mishchenko, E. F., 225, 331
Monroe, R. J., 16, 332
Moshman, J., 316, 331
Muller, M. E., 316, 326
Murtagh, B. A., 146, 331
Myszewski, M., 116, 326
N
Nelder, J. A., 120, 331
Neyman, J., 35, 185, 331
O
Ortega, J. M., 302, 331
Osborne, M. R., 84, 154, 330, 331
P
Pearson, J. D., 107, 331
Penrose, R., 290, 331
Perdreauville, F. J., 221, 331
Pereyra, V., 122, 328
Peterson, T. I., 7, 123, 323, 326, 331
Pontryagin, L. S., 225, 331
Powell, M. J. D., 107, 110, 120, 251,
328, 331
Price, C. M., 319, 331
Price, J. F., 90, 328
Q
Quandt, R. E., 94, 328
Quenouille, M. H., 187, 331
R
Raiffa, H., 33, 35, 77, 283, 331
Rao, C. R., 80, 331
Raynal, J., 15, 331
Robbins, H., 35, 331
Rosen, J. B., 146, 331, 332
Rosenbrock, H. H., 120, 224, 226, 332
Rutemiller, H. C., 247, 332
S
Sammet, J. E., 116, 326, 332
Sampson, P. F., 91, 94, 102, 329
Sargent, R. W. H., 146, 331
Savage, L. J., 33, 332
Sawada, T., 15, 331
Scheffe, H., 190, 332
Schlaifer, R., 33, 35, 77, 283, 331
Schwimmer, S., 126, 230, 326
Seal, H. L., 7, 332
Selvester, R., 16, 326
Shahn, E., 16, 326
Shanno, D. F., 95, 323, 332
Shannon, C. E., 19, 261, 332
Shinbrot, M., 220, 332
Smith, H., 201, 327
Smith, F. B., Jr., 95, 332
Solow, R. M., 134, 332
Sorenson, H. W., 225, 332
Spendley, W., 120, 332
Sridhar, R., 142, 326
Stewart, III, G. W., 119, 332
Storey, C., 120, 224, 226, 332
Surkan, A., 15, 327
Swed, F. S., 201, 332
Sylvestre, E. A., 122, 330
T
Tobey, R., 116, 326
Tomovic, R., 226, 332
Trotter, H. F., 94, 328
Tsuchiya, H. M., 273, 329
Tucker, A. W., 52, 330
Turner, M. E., 16, 332
V
Vaillancourt, R., 226, 329
Vitale, P. A., 91, 328
W
Wagner, H. M., 71, 332
Wald, A., 260, 269, 270, 271, 332
Watson, C. C., 120, 265, 330
Watson, G. A., 154, 331
Watson, K. M., 277, 329
Wei, J., 224, 329
Weiss, M. F., 16, 326
West, G. F., 15, 328
Wichern, D. W., 269, 329
Wiener, N., 16, 332
Wilde, D. J., 83, 332
Wilkinson, J. H., 293, 302, 305, 332
Winkler, R. L., 35, 332
Wolfe, P., 146, 332
Woo, L. S., 16, 328
Wood, F. S., 324
Y
Youle, P. V., 123, 326
Z
Zangwill, W. I., 120, 145, 332
Zilles, S., 116, 326
Zoutendijk, G., 148, 332
B
Bayes' theorem, 36-37
Bayesian estimation, 72-77
Bias, 40-41
estimation of, 47
reduction of, 187
Bienayme-Chebyshev inequality
for multiple parameters, 188-189
for single parameter, 186
Bounds on parameters, 151-153
effect on sampling distribution, 182-183
need for, 141
Bounds on state variables, 232
C
Canonical form, 174-175
Chebyshev estimate, see Minimax
deviation estimate
Chemical kinetics models, 15, 222,
229-230, 233-241
Chi square distribution, 21
Cholesky decomposition, 307-309
modified for Marquardt's method, 95
Complementary pivot problem, 148
Complementary slackness, 52
Computer role in experiments, 175
Computer programs for parameter
estimation, 323-324
Conditional distribution, 312
Confidence interval, 6, 184-187
Confidence region, 187-191
illustration, 208
for linearized model, 190-191
minimizing volume of, 263
Subject Index
Constraints, 49-53, see also Bounds on
parameters, Bounds on state
variables, Equality constraints,
Inequality constraints
arising from prior information, 32-33
effect on sampling distribution, 180-183
Control theory, problems of, 225
Correlation, 311
test for, 200-201, 216
Curve fitting, 1-2
D
Data
errors in, see Errors
randomness of, 18
requirements for estimation, 69-70
Data matrix, 17
Davidon-Fletcher-Powell method, 110
Decision theory
applied to design of experiments,
283-285
applied to parameter estimation, 74-76
Deming's method, 154-159
Dependent variables, 13
Derivative-free methods, 117-120
Design criteria, locating maximum of,
273-276
Design of experiments, 258
for decision making, 283-285
for discriminating among models,
266-269
for parameter estimation, 261-265
for prediction, 265-266
termination criteria, 269-271
Determinant, 291
computation of, 301
Deterministic model, 11-12
Differential equations, see also Dynamic
models
models formulated as, 218
numerical integration of, 230-231
stability of, 231-232
Differentiation
analytic, by computer, 116, 323
of dynamic model objective function,
226-228
importance of accuracy in, 116
of matrix functions, 293-296
numerical, 117-119
Direct search methods, 119-120
Directional discrimination methods, 91-94
Discrimination among models
design of experiments for, 266-269
illustration, 277-283
Dynamic models
computation of objective function,
225-230
difficulties associated with, 231-233
gradient of objective function, 226-230
illustration, 233-238
methods of solution, 218-221
reduction to standard form, 223-224
standard, 221-223
E
Econometric models, 25-26, 133-138,
167-168, 213-216
Eigenvalue decomposition, 304-305
Eigenvalues and vectors, 290
computation of, for real symmetric
matrix, 301-303
Equality constraints, 49-51
linear, application of projection method
to, 160
model equations viewed as, see Exact
structural model
penalty functions for, 159-160
Errors, 54
distribution of, 22-23
estimating parameters of distribution,
195
Estimate
asymptotically efficient, 43
consistent, 42
efficient, 41
ill-determined, 172, 203
linear, 44
robust, 44
statistical properties of, see Sampling
distribution
sufficient, 44
unbiased, 40
well determined, 172
Estimation procedures
desirable properties for, 44-45
reasons for failure, 202-204
Exact structural model, 24
computation of estimates for, 154-159
covariance matrix of parameter
estimates, 179, 212-214
estimating parameters of error
distribution, 196-197
illustration, 163-168
maximum likelihood method for, 68-69
Experimental conditions, 17, 258
Experiments, 17
cost of, 173, 284-285
design of, see Design of experiments
simulated by computer, 46, 276-277
F
F distribution, 21, 190
Fariss-Law method, 93
Feasible region, 48
Finite differences, 117-119
central, 119
determination of optimum length, 118
for dynamic models, 226
one-sided, 117-118
G
Gauss-Jordan pivot, 296
Gauss-Markov theorem, 59, 318-319
Gauss method, 97-106
implementation of, 101-106
with penalty functions, 106, 144
with prior distribution, 101, 106, 131-133
as sequence of linear regressions, 99-100
single-equation least squares, 97
illustration, 114-130
Gaussian distribution, see Normal
distribution
Givens-Householder transformation, 302
Goodness of fit criteria, 198-202
Gradient methods, 86
convergence of, 87-88, 115-117,
320-322
efficiency of, 89
step length determination in, 110-113
H
Hessian matrix, 88
Gauss approximation for, 97-99
variable metric approximations for,
106-110
I
Independent variables, 14, 221
subject to error, see Exact structural
model
Indifference region, 171-173
illustration, 207
Inequality constraints, 51-53
linear, application of projection method
to, 146-153
penalty functions for, 141-145
Inexact structural model, 24-25
Information
for discrimination, 267
in a distribution, 19, 261
gained from an experiment, 261
in normal distribution, 262
prior, 32-35
Inhomogeneous covariance, 246-248
Initial conditions, 211
Initial guess, 120-123
Interpolation-extrapolation methods,
111-113
illustrations, 117-129
Interval estimate, 6
Iterative methods, 84-88, see also
Derivative-free methods, Gradient
methods
acceptable, 85-86
initial guess for, 120-123
termination criteria for, 114-115
K
Kuhn-Tucker condition, 52
for quadratic program, 147
L
Lagrange multipliers, 50-53
Least squares method, 55-61, see also
Regression
unweighted, 55
weighted, 56-57
Levenberg method, see Marquardt method
Likelihood, 26-29
concentrated, 65-66
standard reduced model, 27
structural models, 27-29
Likelihood equations, 62
Likelihood ratio, 269
Linear equations, solution of, 299
Linearity, 5
Linearizing transformations, 78-80, 122
illustration, 131
Linearly dependent equations, 238-241
M
Marginal distribution, 311
Marquardt method, 94-96
illustration, 130-131
Matrix
condition of, 89, 305
improved by scaling, 306
rank of, 292, 300-301
spectral decompositions of, 303-309
square root of, 307
trace of, 291
Matrix algebra, 287-293
Matrix functions, differentiation of,
293-296
Matrix inverse, 289, 298-299
updating of, 250
Matrix pseudoinverse, 290, 304
Maximization, see Optimization
Maximum likelihood method, 61-71
exact structural model, 68-69
illustrations, 133-138
independent variables subject to error,
67-68
normal distribution, 63-70
two-sided exponential distribution,
70-71
uniform distribution, 70
unknown covariance matrix, 64-66
Measurement errors, see Errors
Minimax deviation estimate, 77
computational method for, 154
Minimization, see Optimization
Minimum chi-square method, 80
Minimum risk estimate, 74-76
Minimum variance bound, 41
Missing observations, 244-245
illustrations, 251-255
Mode-of-posterior-distribution estimate,
73-74
illustration, 131-133
sampling distribution of, 192
Model
deterministic form, 11-12
estimation of parameters of, 2-4
formulation of, illustration, 29-31
stochastic form, 24-26
Moment matrix, 64
likelihood expressed as function of,
97-99
Monte Carlo method, 46
illustration, 210-212
N
Newton-Greenstadt method, 92
Newton's method, 89-91
Nonlinear programming, 83
Normal distribution, 18-21
generating pseudorandom sample from,
316-317
information in, 261
maximum likelihood method for, 63-70
multivariate, 20-21
univariate, 20
Normal equations, 49
O
Objective function, 47
computation for standard dynamic
model, 225-230
as function of moment matrix, 97-99
Observed variables, 221
Optimality conditions
constrained problems, 49-53
unconstrained problems, 48-49
Optimization, 47
Orthogonalization, 103-106
P
Parameter estimation, 4
applications of, 14-16
computer programs for, 323-324
design of experiments for, 262-265
history of, 6-7
in a probability distribution, 3, 80
problem formulation, 37
Parameters, 11-12
Penalty functions
equality constraints, 159-160
illustration, 160-161
inequality constraints, 141-145
as a prior distribution, 145
Pivoting, 296
Point estimate, 6, 39
Positive definite matrix, 290
role in gradient methods, 86, 116
Posterior distribution, 36-37
mode of, 73-74
Prediction, 13-14
design of experiments for, 265-266
errors in, 204-205
Principal components, 183-184, 208
Prior distribution, 33-35
informative, 35
noninformative, 34-35
Probability distribution, 310
estimating parameters of, 3, 80
Projection method, 146-153
for bounded parameters, 151-153
illustration, 162-163
for linear equality constraints, 160
step length determination, 149
Pseudomaximum likelihood method, 78
Pseudorandom numbers, 316-317
Q
QR method, 302-303
Quadratic programming, 147
Quasilinearization, 230
R
Rank one correction method, 107-109
Rao-Cramer theorem, 41, 313-315
Reduced model, 13-14
standard, 26
Regression
backward selection, 302
forward selection, 301
multiple linear, 58-61
methods of solution, 102-106
ridge, 60
stepwise, 59, 101, 301-302
Reparametrization, see Transformation of
variables
Residuals, 54
analysis of, illustration, 209-210,
214-216
outliers, 202
run tests, 201-202
statistical properties of, 193-196
statistical tests on, 199-202
Risk, 74, 283-284
Rotational discrimination methods, 91
s
Sampling distribution, 39-45
covariance matrix of, 176-179, 207
effect of constraints on, 180-183
estimation of statistical properties of,
175-183
evaluation by Monte Carlo method, 46,
210-212
evaluation of statistical properties of,
45-47
statistics of, 40
Scaled decomposition, 305-307
Scientific investigation, goals of, 258-260
Sensitivity equations, 226-230
Sequential reestimation, 248-251
illustration, 255-257
Serial correlation, 247-248
Simulation of experiments, 46, 276-277
Spectral decompositions, 303-309
State variables, 221
bounds on, 231
Steepest descent method, 88
Stochastic approximation, 251
Stochastic model, 14-16
Structural model, 12-13, see also Exact
structural model, Inexact structural
model
Sufficient statistic, 43, 61
SUMT method, 141
Sweeping, 296
T
Termination criteria
for iterative methods, 114-115
for sequential experiments, 269-271
Transformation of variables, see also
Linearizing transformations
effect on sampling distribution, 205-206
to eliminate constraints, 153
invariance of estimates under, 44
to simplify model, 133
U
Uncertainty, 261
Uniform distribution, 21-22
maximum likelihood method for, 70
V
Variable metric methods, 106-110
Vectors, linear independence of, 300-301