Nonlinear Parameter
Estimation


YONATHAN BARD


International Business Machines Corporation
Cambridge, Massachusetts




ACADEMIC PRESS New York and London 1974


A Subsidiary of Harcourt Brace Jovanovich, Publishers





Contents

Preface

Chapter I  Introduction
1-1. Curve Fitting
1-2. Model Fitting
1-3. Estimation
1-4. Linearity
1-5. Point and Interval Estimation
1-6. Historical Background
1-7. Notation

Chapter II  Problem Formulation
A. DETERMINISTIC MODELS
2-1. Basic Concepts
2-2. Structural Model
2-3. Parameter Evaluation
2-4. Reduced Model
2-5. Application Areas
B. DATA
2-6. Experiments and Data Matrix
C. PROBABILISTIC MODELS AND LIKELIHOOD
2-7. Randomness in Data
2-8. The Normal Distribution
2-9. The Uniform Distribution
2-10. Distribution of Errors
2-11. Stochastic Form of the Model
2-12. Likelihood-Standard Reduced Model
2-13. Likelihood-Structural Models
2-14. An Example
2-15. Utility of Distribution Assumptions
D. PRIOR INFORMATION AND POSTERIOR DISTRIBUTION
2-16. Prior Information
2-17. Prior Distribution
2-18. Informative and Noninformative Priors
2-19. Bayes' Theorem
2-20. Problems

Chapter III  Estimators and Their Properties
A. STATISTICAL PROPERTIES
3-1. The Sampling Distribution
3-2. Properties of the Sampling Distribution
3-3. Evaluation of Statistical Properties
B. MATHEMATICAL PROPERTIES
3-4. Optimization
3-5. Unconstrained Optimization
3-6. Equality Constraints
3-7. Inequality Constraints
3-8. Problems

Chapter IV  Methods of Estimation
4-1. Residuals
A. LEAST SQUARES
4-2. Unweighted Least Squares
4-3. Weighted Least Squares
4-4. Multiple Linear Regression
B. MAXIMUM LIKELIHOOD
4-5. Definition
4-6. Likelihood Equations
4-7. Normal Distribution
4-8. Unknown Diagonal Covariance
4-9. Unknown General Covariance
4-10. Independent Variables Subject to Error
4-11. Exact Structural Models
4-12. Data Requirements
4-13. Some Other Distributions
C. BAYESIAN ESTIMATION
4-14. Definition
4-15. Mode of the Posterior Distribution
4-16. Minimum Risk Estimates
D. OTHER METHODS
4-17. Minimax Deviation
4-18. Pseudomaximum Likelihood
4-19. Linearizing Transformations
4-20. Minimum Chi-Square Method
4-21. Problems

Chapter V  Computation of the Estimates I: Unconstrained Problems
5-1. Introduction
5-2. Iterative Scheme
5-3. Acceptability
5-4. Convergence
5-5. Steepest Descent
5-6. Newton's Method
5-7. Directional Discrimination
5-8. The Marquardt Method
5-9. The Gauss Method
5-10. The Gauss Method as a Sequence of Linear Regression Problems
5-11. The Implementation of the Gauss Method
5-12. Variable Metric Methods
5-13. Step Size
5-14. Interpolation-Extrapolation
5-15. Termination
5-16. Remarks on Convergence
5-17. Derivative Free Methods
5-18. Finite Differences
5-19. Direct Search Methods
5-20. The Initial Guess
5-21. A Single-Equation Least Squares Problem
5-22. Adding Prior Information
5-23. A Two-Equation Maximum Likelihood Problem
5-24. Problems

Chapter VI  Computation of the Estimates II: Problems with Constraints
A. INEQUALITY CONSTRAINTS
6-1. Penalty Functions
6-2. Projection Methods
6-3. Projection with Bounded Parameters
6-4. Transformation of Variables
6-5. Minimax Problems
B. EQUALITY CONSTRAINTS
6-6. Exact Structural Models
6-7. Convergence Monitoring
6-8. Some Special Cases
6-9. Penalty Functions
6-10. Linear Equality Constraints
6-11. Least Squares Problem with Penalty Functions
6-12. Least Squares Problem-Projection Method
6-13. Independent Variables Subject to Error
6-14. An Implicit Equations Model
6-15. Problems

Chapter VII  Interpretation of the Estimates
7-1. Introduction
7-2. Response Surface Techniques
7-3. Canonical Form
7-4. The Sampling Distribution
7-5. The Covariance Matrix of the Estimates
7-6. Exact Structural Model
7-7. Constraints
7-8. Principal Components
7-9. Confidence Intervals
7-10. Confidence Regions
7-11. Linearization
7-12. The Posterior Distribution
7-13. The Residuals
7-14. The Independent Variables Subject to Error
7-15. Goodness of Fit
7-16. Tests on Residuals
7-17. Runs and Outliers
7-18. Causes of Failure
7-19. Prediction
7-20. Parameter Transformation
7-21. Single-Equation Least Squares Problem
7-22. A Monte Carlo Study
7-23. Independent Variables Subject to Error
7-24. Two-Equation Maximum Likelihood Problem
7-25. Problems

Chapter VIII  Dynamic Models
8-1. Models Involving Differential Equations
8-2. The Standard Dynamic Model
8-3. Models Reducible to Standard Form
8-4. Computation of the Objective Function and Its Gradient
8-5. Numerical Integration
8-6. Some Difficulties Associated with Dynamic Systems
8-7. A Chemical Kinetics Problem
8-8. Linearly Dependent Equations
8-9. Problems

Chapter IX  Some Special Problems
9-1. Missing Observations
9-2. Inhomogeneous Covariance
9-3. Sequential Reestimation
9-4. Computational Aspects
9-5. Stochastic Approximation
9-6. A Missing Data Problem
9-7. Further Problem with Missing Data
9-8. A Sequential Reestimation Problem
9-9. Problems

Chapter X  Design of Experiments
10-1. Introduction
10-2. Information and Uncertainty
10-3. Design Criterion for Parameter Estimation
10-4. Design Criterion for Prediction
10-5. Design Criterion for Model Discrimination
10-6. Termination Criteria
10-7. Some Practical Considerations
10-8. Computational Considerations
10-9. Computer Simulated Experiments
10-10. Design for Decision Making
10-11. Problems

Appendix A  Matrix Analysis
A-1. Matrix Algebra
A-2. Matrix Differentiation
A-3. Pivoting and Sweeping
A-4. Eigenvalues and Vectors of a Real Symmetric Matrix
A-5. Spectral Decompositions

Appendix B  Probability
Appendix C  The Rao-Cramer Theorem
Appendix D  Generating a Sample from a Given Multivariate Normal Distribution
Appendix E  The Gauss-Markov Theorem
Appendix F  A Convergence Theorem for Gradient Methods
Appendix G  Some Estimation Programs

References
Author Index
Subject Index
Preface

This book is intended primarily for use by the scientist or engineer who is concerned with fitting mathematical models to numerical data, and for use in courses on data analysis which deal with that subject. Such fitting is frequently done by the method of least squares, with no regard paid to previous knowledge concerning the values of the parameters (coefficients), nor to the statistical nature of the measurement errors. In Chapters II-IV we show how the problem can be formulated so as to take all these factors into account. In Chapters V-VI we discuss the computational methods used to solve the problem, once its formulation has been completed. Chapter VII is devoted to the question of what conclusions can be drawn, after the estimates have been computed, concerning the validity of the estimates, or of the model which has been fitted. In Chapter VIII we discuss the important special case of models which are stated in the form of differential equations. Other special problems are treated in Chapter IX. Finally, in Chapter X we suggest methods for planning the experiments in such a way that the data will shed the greatest possible light on the model and its parameters. We cannot stress too strongly the point that if data are to be gathered for the purpose of establishing a mathematical model, then the experiments should be designed with this purpose in mind. Hence the importance of Chapter X.

A practical, rather than theoretical, point of view has been taken throughout this book. We describe computational algorithms which have performed well on a variety of problems, even if their convergence has not been proven, and even if they have failed on some other problems. We have as yet no foolproof efficient methods for solving nonlinear problems; hence we cannot afford to throw away useful tools just because they are not perfect.

The presentation uses matrix algebra and probability theory on a very elementary level.
Reviews of the needed concepts and proofs of some important theorems will be found in the appendixes. Some supplementary material has been included in the form of problems at the ends of chapters. Problems requiring actual computation have not been included; the reader is likely to have his own data to compute with, and additional data may be found in many of the cited references. Several numerical problems have, however, been worked out in great detail in separate sections at the ends of Chapters V-IX for the purpose of illustrating the methods discussed in those chapters.
The author is deeply indebted to the IBM Corporation, and in particular to the managements of the New York and Cambridge Scientific Centers, who have supported the writing of this book and provided all the necessary resources. The author is also grateful to Professor L. Lapidus of Princeton University, and to his colleagues J. G. Greenstadt, P. G. Comba, H. Eisenpress, K. Spielberg, and P. Backer, for many helpful discussions, and for reviewing portions of the manuscript.
Chapter I

Introduction

1-1. Curve Fitting

A scientist who has compiled tables of data wishes to reduce them to a more convenient and comprehensible form. He accomplishes this by representing the data in graphical or functional form. In the first case, he plots his data points, and then draws some curve through them. In the second case, he selects a class of functions, and chooses from this class the one that best fits his data. This is called curve fitting.

In the simplest case, the data consist of values $y_1, y_2, \ldots, y_n$ of a dependent variable $y$ measured for various values $x_1, x_2, \ldots, x_n$ of an independent variable $x$. A frequently chosen class of functions is the set of all polynomials of order not exceeding $m$

$$ y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_m x^m \tag{1-1-1} $$

The values of the parameters $\theta_0, \theta_1, \ldots, \theta_m$ are chosen so as to get the best possible fit to the data. The most commonly used technique for accomplishing this is the least squares method, in which those values of the $\theta_i$ are selected which minimize the sum of squares of the residuals, i.e.,

$$ S = \sum_{\mu=1}^{n} \Bigl( y_\mu - \sum_{\alpha=0}^{m} \theta_\alpha x_\mu^{\alpha} \Bigr)^2 \tag{1-1-2} $$

Curve fitting procedures are characterized by two degrees of arbitrariness. First, the class of functions used is arbitrary, being dictated only to a minor extent by the physical nature of the process from which the data came. Second, the best fit criterion is arbitrary, being independent of statistical considerations. This arbitrariness can be exploited to make the fitting job easy. Choosing equations which, like Eq. (1),† are linear functions of the parameters; using orthogonal or Fourier polynomials (in place of ordinary polynomials) as the functions to fit; employing the least squares criterion: all these contribute to making the computation of the parameters a mathematically easy job.

† This reference is to the first equation of the current section, i.e., Eq. (1-1-1).

On the other hand, due to their arbitrary nature, the equations that we get are useful only for summarizing the data and for interpolating between tabulated values. They cannot be used to extrapolate, i.e., to predict the outcome of experiments removed from the region of already available data. Also, the equations and the parameters occurring in them shed little light on the nature of the process being measured, except to answer such questions as to whether variable $x$ has an influence on variable $y$.

Curve fitting techniques have widespread applications in situations that go far beyond the simple $y$ vs. $x$ table. An example is the identification of dynamic systems by means of rational transfer functions or Volterra series. Most multiple linear regression, analysis of variance, and econometric time-series problems are also of a curve fitting nature, since the equations used are not derived from "laws of nature." In most of these applications, however, assumptions are made concerning the statistical behavior of the errors, thereby elevating them at least partly to the status of estimation problems as discussed in Section 1-3.

1-2. Model Fitting

Often the scientist is, to a certain extent, familiar with the laws which govern the behavior of the physical system under observation. He can then derive equations describing the relationships among the observed quantities. For instance, the fraction $y$ of a radioactive isotope remaining $x$ seconds after the isotope's formation is given by

$$ y = e^{-\theta x} \tag{1-2-1} $$

where the parameter $\theta$ is a physical constant proportional to the instantaneous rate of decay of the isotope. The magnitude of $\theta$ is unknown, but we wish to assign to it a value which makes Eq. (1) fit the data $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$ as well as possible, e.g., by the least squares criterion.

An equation such as Eq. (1) which is derived from theoretical considerations is called a model, and the procedure just described constitutes model fitting. In principle, model fitting is not much different from curve fitting, except that we can no longer guide the selection of a functional form by considerations of computational convenience. For instance, Eq. (1) is not a linear function of the parameter, and because of this the computation of the "best fit" is more difficult than the computation of the $\theta_i$ in Eq. (1-1-1).
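To illustrate why the nonlinear case calls for iteration, the following is a minimal sketch of a least squares fit of the decay model $y = e^{-\theta x}$. It uses a one-parameter Gauss-type step with step halving, chosen here only for illustration; the function names, tolerance, and sample data are assumptions and are not taken from the text.

```python
import math


def sum_of_squares(theta, xs, ys):
    """Sum of squared residuals for the model y = exp(-theta * x)."""
    return sum((y - math.exp(-theta * x)) ** 2 for x, y in zip(xs, ys))


def fit_decay(xs, ys, theta0, tol=1e-12, max_iter=100):
    """Illustrative least squares fit of theta, starting from theta0."""
    theta = theta0
    for _ in range(max_iter):
        num = 0.0
        den = 0.0
        for x, y in zip(xs, ys):
            f = math.exp(-theta * x)
            j = -x * f              # derivative of the model w.r.t. theta
            num += j * (y - f)
            den += j * j
        if den == 0.0:
            break
        step = num / den            # linearized least squares correction
        s_old = sum_of_squares(theta, xs, ys)
        lam = 1.0
        # Halve the step until the sum of squares does not increase.
        while lam > 1e-10 and sum_of_squares(theta + lam * step, xs, ys) > s_old:
            lam *= 0.5
        theta += lam * step
        if abs(lam * step) < tol:
            break
    return theta


# Hypothetical noise-free data generated with theta = 0.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [math.exp(-0.5 * x) for x in xs]
theta_fit = fit_decay(xs, ys, theta0=0.3)
```

With noisy data the fitted value would differ from the generating value, which is the situation taken up in the next section.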
1-3. Estimation

A new consideration arises in model fitting that does not exist in curve fitting. The parameters occurring in a model, e.g., $\theta$ in Eq. (1-2-1), usually represent quantities that have physical significance. If the model is a correct one, then it is meaningful to ask what is the true value of $\theta$ in nature. Because of the generally imprecise nature of measurements we can never hope to determine the true values with absolute certainty. Also, due to the random nature of the errors in measurements, the value of $\theta$ that best fits one series of measurements differs from the value that fits another series, even though both series are performed on the same isotope. However, we can look for procedures to obtain values of the parameters that not only fit the data well but also come on the average fairly close to the true values, and do not vary excessively from one set of experiments to the next. The process of determining parameter values with these statistical considerations in mind is termed model estimation.

The classical problem of statistical estimation differs somewhat from the model estimation problem that we have just defined. The statistician observes a sequence of values ("realizations") that a random variable assumes. For instance, he may obtain a sequence of numbers such as 1, 5, 6, 3, ... denoting successive throws of a die. The statistician assumes a "model" in the form of a probability distribution which may depend on some unknown parameters. In our case, the statistician who suspects the die may be loaded assigns probabilities $[\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, 1 - \sum_{i=1}^{5} \theta_i]$ to the six possible outcomes of a throw. He then attempts to estimate the $\theta_i$ from the observed values of the random variable. Here he will probably use the estimate

$$ \theta_i = n_i \Big/ \sum_{j=1}^{6} n_j \tag{1-3-1} $$

where $n_i$ is the number of throws on which the number $i$ showed up $(i = 1, \ldots, 6)$.

As a further example, the observed value of the random variable may be the height $h$ of adults in a community.
If we assume that this variable $h$ has a normal (Gaussian) distribution with mean $h_0$ and standard deviation $\sigma$, then the probability density function is given by

$$ p(h) = [1/(2\pi)^{1/2}\sigma] \exp[-(1/2\sigma^2)(h - h_0)^2] \tag{1-3-2} $$

If we measure the heights $h_1, h_2, \ldots, h_n$ of $n$ randomly chosen individuals from the community, we form the usual estimates:

$$ h_0 = (1/n) \sum_{\mu=1}^{n} h_\mu \tag{1-3-3} $$

$$ \sigma^2 = [1/(n - 1)] \sum_{\mu=1}^{n} (h_\mu - h_0)^2 \tag{1-3-4} $$
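The two classical estimates above are simple enough to compute directly. The sketch below shows the relative-frequency estimate of Eq. (1-3-1) and the mean and variance estimates of Eqs. (1-3-3) and (1-3-4); the function names and the sample values are hypothetical and serve only as an illustration.

```python
from collections import Counter


def die_probabilities(throws):
    """Relative frequency n_i / sum_j n_j for each face i = 1..6, Eq. (1-3-1)."""
    counts = Counter(throws)
    total = len(throws)
    return [counts.get(face, 0) / total for face in range(1, 7)]


def normal_estimates(heights):
    """Sample mean, Eq. (1-3-3), and variance with divisor n - 1, Eq. (1-3-4)."""
    n = len(heights)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(heights) / n
    variance = sum((h - mean) ** 2 for h in heights) / (n - 1)
    return mean, variance


# Hypothetical observations, for illustration only.
probs = die_probabilities([1, 5, 6, 3, 1, 1])
h0, var = normal_estimates([160.0, 170.0, 180.0])
```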
The model estimation problem can be embedded in the statistical estimation problem in the following way: It is reasonable to suppose that the outcome $y_\mu$ of a measurement taken at time $x_\mu$ (we shall phrase our discussion in terms of the radioactive decay model of Section 1-2) is a random variable whose mean value is given by Eq. (1-2-1) as $\exp(-\theta x_\mu)$. If many measurements were to be taken at the same $x_\mu$, we would discover that the observed values $y_\mu$ fluctuate around their mean value with standard deviation $\sigma$. Suppose these fluctuations have a normal probability distribution. The probability density function for $y_\mu$ would then have the form similar to Eq. (2)

$$ p(y_\mu) = [1/(2\pi)^{1/2}\sigma] \exp\{-(1/2\sigma^2)[y_\mu - \exp(-\theta x_\mu)]^2\} \tag{1-3-5} $$

In fact we only take one measurement at any specific $x_\mu$. What we have are realizations $y_1, y_2, \ldots, y_n$, each of a different random variable whose distribution depends on the parameter $x_\mu$, which varies from one variable to the next, and on some other parameters $(\theta, \sigma)$ which are common to all these distributions. The parameter estimation problem which is the primary concern of this book is the problem of estimating these common parameters.

At first glance, the parameter estimation problem appears more general than the classical statistical estimation problem, since in the latter all samples are taken from the same distribution. The distinction between the two problems disappears if we choose to regard all the data as being a single multivariate sample from the joint distribution of all the observations made in the course of the series of experiments. It follows that many of the statistical estimation methods can be applied to our parameter estimation problems. The single sample point of view is, however, rather awkward when one examines, say, the asymptotic properties of these estimates (see Chapter III for definitions) since it requires that the entire set of experiments be repeated over and over again.
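The "single multivariate sample" point of view can be made concrete with a short sketch. Assuming, purely for illustration, that the measurement errors are independent, the joint density of all observations is the product of the individual densities of Eq. (1-3-5), and its logarithm is a sum. The function name and data below are hypothetical.

```python
import math


def log_joint_density(theta, sigma, xs, ys):
    """Log of the product of the densities in Eq. (1-3-5).

    Assumes independent errors (an assumption of this sketch), so the
    joint density factors into one normal density per observation.
    """
    n = len(xs)
    ss = sum((y - math.exp(-theta * x)) ** 2 for x, y in zip(xs, ys))
    return -n * math.log(math.sqrt(2.0 * math.pi) * sigma) - ss / (2.0 * sigma ** 2)


# Hypothetical noise-free data generated with theta = 0.5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [math.exp(-0.5 * x) for x in xs]
at_truth = log_joint_density(0.5, 0.1, xs, ys)
off_truth = log_joint_density(0.4, 0.1, xs, ys)
```

For a fixed value of sigma, the only term that depends on theta is the sum of squared residuals, so the joint density is largest where that sum is smallest.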
Parameter estimation techniques may be applied as computational tools to pure curve fitting problems. One must remember, however, that the statistical properties of these estimates (e.g., those described in Chapters III and VII) sometimes lose their meaning in the curve fitting context.

Clearly, parameter estimation is a more difficult operation than curve fitting, calling for more sophisticated analysis and more extensive computation. The effort is worthwhile since a well established model and precisely estimated physical parameters are much more versatile tools, both for illuminating the present situation and for prediction in new situations, than arbitrarily fitted curves can ever be. To bring home this point, one need only observe that a physical parameter estimated from one model can always be used in another model to which it is relevant. For instance, the viscosity of a liquid estimated from viscometer data can be used to predict the required pumping load for a piping system being designed.

There are other mathematical problems which may be solved by means of parameter estimation or curve fitting techniques. These techniques may be regarded as attempts to solve (as best one can) an overdetermined (more equations than unknowns) system of simultaneous equations. Solving a system of $n$ equations in $n$ unknowns may, therefore, be regarded as fitting to $n$ data points a model involving $n$ unknown parameters. Two-point boundary value problems in ordinary differential equations may be treated as models in which the known terminal conditions are the data, and the missing initial conditions are the unknown parameters. Some optimal control problems may be solved by regarding the control actions as unknown parameters, and the desired trajectory of the system as the data to be fitted. Similarly, some engineering design problems may be posed as requiring parameter values which induce the systems to meet prescribed conditions as closely as possible.

1-4. Linearity

To understand what we mean by the term "nonlinear estimation" we must first make the following definitions: An expression is said to be linear in a set of variables $\varphi_1, \varphi_2, \ldots, \varphi_n$ if it has the form $a_0 + \sum_{i=1}^{n} a_i \varphi_i$, where the coefficients $a_i$ $(i = 0, 1, \ldots, n)$ are not functions of the $\varphi_i$. An expression is quadratic in the $\varphi_i$ if it has the form $a_0 + \sum_{i=1}^{n} a_i \varphi_i + \sum_{i,j=1}^{n} b_{ij} \varphi_i \varphi_j$, again with all coefficients not depending on the $\varphi_i$. If we differentiate a quadratic expression with respect to one of the $\varphi_i$, we obtain a linear expression.

Linear estimation problems are ones in which the model equations are linear expressions in the unknown parameters, e.g., Eq. (1-1-1). When the model equations are not linear, as in Eq. (1-2-1), we speak of nonlinear estimation. However, even some apparently linear problems are essentially nonlinear.
This is so because in order to estimate the parameters we usually minimize some function, such as the sum of squares of residuals. To find the minimum we equate the derivatives of the function to zero and solve for the values of the parameters. Now when the model equations are linear, the sum of squares function is quadratic, and the derivatives are again linear. The estimates are obtained, therefore, by solving a set of simultaneous linear equations, and all is well. But if some other functions which are not quadratic are chosen to be minimized, then the equations to be solved are no longer linear, even when the model equations are linear. Such problems should also be regarded as nonlinear estimation problems. Examples of such problems are given in Sections 4-8 and 4-9.
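The linear case described above can be sketched in a few lines. For the polynomial model of Eq. (1-1-1), setting the derivatives of the sum of squares of Eq. (1-1-2) to zero gives simultaneous linear equations in the parameters, which are solved here by plain Gaussian elimination. The function names and data are hypothetical, and forming these equations directly is used only for illustration; it is not presented as the numerically preferred procedure.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def fit_polynomial(xs, ys, m):
    """Least squares coefficients theta_0..theta_m of Eq. (1-1-1).

    Equating dS/dtheta_alpha to zero gives, for each alpha,
    sum_beta (sum_mu x_mu^(alpha+beta)) theta_beta = sum_mu y_mu x_mu^alpha.
    """
    k = m + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear(a, b)


# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
theta = fit_polynomial(xs, ys, m=2)
```

No iteration is needed here, in contrast with a model that is nonlinear in its parameters.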
1-5. Point and Interval Estimation

There exist many methods (e.g., least squares) which calculate specific numbers representing estimates for the parameter values. Such numbers are called point estimates. A point estimate for the parameters $\theta$, $\sigma$ appearing in Eq. (1-3-5) may take the form

$$ \theta^* = 4, \qquad \sigma^* = 0.1 \tag{1-5-1} $$

A point estimate standing alone is not very satisfactory. Random errors are present in all measurements, and no mathematical model accounts for all facets of a physical situation. Therefore we cannot hope to obtain point estimates exactly equal to the true values of the parameters (if such exist). Nor can we expect point estimates calculated from different data samples to be equal, even if the samples were obtained under similar conditions. Therefore we need to augment the point estimate with some information on its variability. For instance, in place of Eq. (1) we wish to have a statement such as

$$ \theta^* = 4 \pm 0.2, \qquad \sigma^* = 0.1 \pm 0.02 \tag{1-5-2} $$

The numbers 0.2 and 0.02 are meant to represent the standard deviations of the variability of the estimates for $\theta$, $\sigma$. The information contained in Eq. (2) may be translated into a statement of the type† "We are 75% sure that $\theta$ is between 3.6 and 4.4, and we are 75% sure that $\sigma$ is between 0.06 and 0.14." This statement constitutes an interval estimate for our parameters.

Interval estimates can be computed directly, without first calculating point estimates and their variability. In fact, many statisticians prefer interval estimates, because they feel one is not justified in picking out one specific preferred value to be used as a point estimate. We feel, however, that the needs of the scientist or engineer are best served by point estimates with measures of their reliability, so we will not discuss any direct interval estimation procedures. The calculation of interval estimates (called confidence intervals in this context) from point estimates is discussed in Sections 7-9 and 7-10.
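The arithmetic behind the 75% statement can be checked with a short sketch. It applies the Bienaymé-Chebyshev bound, under which a random variable lies within k standard deviations of its mean with probability at least 1 - 1/k^2; the function name is hypothetical.

```python
def chebyshev_interval(estimate, std_dev, k):
    """Interval estimate +/- k * std_dev and the Chebyshev lower bound
    1 - 1/k^2 on the probability that the interval covers the value."""
    if k <= 1:
        raise ValueError("k must exceed 1 for a nontrivial bound")
    half_width = k * std_dev
    return estimate - half_width, estimate + half_width, 1.0 - 1.0 / k ** 2


# The numbers of Eq. (1-5-2) with k = 2.
theta_low, theta_high, level = chebyshev_interval(4.0, 0.2, 2)
sigma_low, sigma_high, _ = chebyshev_interval(0.1, 0.02, 2)
```

With k = 2 the bound is 1 - 1/4 = 0.75, and the intervals are 4 ± 0.4 and 0.1 ± 0.04, which reproduces the quoted limits.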
† The statement is derived from Eq. (2) using the Bienaymé-Chebyshev inequality with k = 2. See Eq. (7-9-11).

1-6. Historical Background

Legendre (1805) was the first to suggest in print the use of the least squares criterion for estimating coefficients in linear curve fitting. Gauss (1809) laid the statistical foundation for parameter estimation by showing that least squares estimates maximized the probability density for a normal (Gaussian) distribution of errors. In this, Gauss anticipated the maximum likelihood method. Gauss and his contemporaries seemed to prefer, however, purely heuristic justifications for the least squares method. Further work in the 19th and early 20th centuries, by Gauss himself, Cauchy, Bienaymé, Chebyshev, Gram, Schmidt, and others‡ concentrated on computational aspects of linear least squares curve fitting, including the introduction of orthogonal polynomials.

The development of statistical estimation methods received its impetus from the work of Karl Pearson around the turn of the century and R. A. Fisher in the 1920s and 1930s. The latter revived the maximum likelihood method and studied estimator properties such as consistency, efficiency, and sufficiency [see the collection of Fisher's (1950) papers]. The development of decision theory by Wald and others has, in the post-World War II years, introduced a new basis for selecting estimation criteria. The practical impact of these methods in the area of nonlinear parameter estimation has so far been slight, except for causing increased awareness of the uses of prior distributions.

The first modern applications of statistical estimation theory to model estimation were made in the field of econometrics by Koopmans and others, starting in the 1930s. Their work is summarized in the Cowles Commission Reports (Hood and Koopmans, 1953). The main contributions to the application of statistical techniques in the construction and estimation of mathematical models in the physical sciences have come from Professor G. E. P. Box and his coworkers at Princeton University and the University of Wisconsin.

The computation of estimates for nonlinear models usually requires finding the maximum or minimum of a nonlinear function.
Computational methods bearing the names of Newton, Gauss, and Cauchy have been known for a long time, but their extensive application to practical problems had to await the arrival of the electronic computer. The first general purpose computer program for solving nonlinear least squares problems was written by Booth and Peterson (1958) in collaboration with Box. The program employed a modified Gauss method. It has since been followed by many other programs, some more general in nature and some dealing with more specific estimation problems. A list of such programs can be found in Appendix G.

‡ References to this work, along with a more detailed historical survey, are given by Seal (1967).

1-7. Notation

Matrix and vector notation are used throughout this book.

A boldface capital letter denotes a matrix: $\mathbf{A}$, $\boldsymbol{\Gamma}$.
A boldface lower case letter denotes a column vector: $\mathbf{a}$, $\boldsymbol{\gamma}$.
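The $(i, j)$ element, appearing in the $i$th row and $j$th column of $\mathbf{A}$, is denoted $A_{ij}$ or $[\mathbf{A}]_{ij}$.
The $i$th element of $\mathbf{a}$ is denoted $a_i$ or $[\mathbf{a}]_i$.
$\mathbf{A}_\mu$ is the $\mu$th in a sequence of matrices $\mathbf{A}_1, \mathbf{A}_2, \mathbf{A}_3, \ldots$. The $(i, j)$ element of $\mathbf{A}_\mu$ is denoted $A_{\mu ij}$ or $[\mathbf{A}_\mu]_{ij}$. Analogously for vectors.
$\mathbf{A}^T$ is the transpose of $\mathbf{A}$, i.e., $[\mathbf{A}^T]_{ij} = [\mathbf{A}]_{ji}$.
$\mathbf{a}^T$ is the row vector with the same elements as $\mathbf{a}$.
$\mathbf{A}^{-1}$ is the inverse of $\mathbf{A}$ if such exists.
$\mathbf{A}^{+}$ is the pseudoinverse of $\mathbf{A}$.
$\det(\mathbf{A})$ is the determinant of $\mathbf{A}$.
$\mathrm{Tr}(\mathbf{A}) = \sum_i A_{ii}$ is the trace of $\mathbf{A}$.
$\mathbf{A}$ is said to be $m \times n$ if it has $m$ rows and $n$ columns. A column vector is $m \times 1$ and a row vector $1 \times n$.
$\mathbf{I}$ is the identity matrix, i.e.,

$$ I_{ij} = \delta_{ij} = \begin{cases} 1 & (i = j) \\ 0 & (i \neq j) \end{cases} $$

$\mathbf{I}_m$ is the $m \times m$ identity matrix.
$\mathbf{A} = \mathrm{diag}(\mathbf{a})$ means that $\mathbf{A}$ is a matrix with elements $A_{ij} = a_i \delta_{ij}$.

Suppose $\alpha$ is a function of the vectors $\mathbf{a}$ and $\mathbf{b}$ and the matrix $\mathbf{A}$. Then:

$\partial\alpha/\partial\mathbf{a}$ is the column vector $[\partial\alpha/\partial\mathbf{a}]_i = \partial\alpha/\partial a_i$
$\partial\alpha/\partial\mathbf{A}$ is the matrix $[\partial\alpha/\partial\mathbf{A}]_{ij} = \partial\alpha/\partial A_{ij}$
$\partial^2\alpha/\partial\mathbf{a}\,\partial\mathbf{b}$ is the matrix $[\partial^2\alpha/\partial\mathbf{a}\,\partial\mathbf{b}]_{ij} = \partial^2\alpha/\partial a_i\,\partial b_j$

Suppose $\mathbf{a}$ is a vector function of the scalar $\beta$ and the vector $\mathbf{b}$. Then:

$\partial\mathbf{a}/\partial\beta$ is the column vector $[\partial\mathbf{a}/\partial\beta]_i = \partial a_i/\partial\beta$
$\partial\mathbf{a}/\partial\mathbf{b}$ is the matrix $[\partial\mathbf{a}/\partial\mathbf{b}]_{ij} = \partial a_i/\partial b_j$

Suppose $\mathbf{A}$ is a matrix function of the scalar $\alpha$. Then:

$\partial\mathbf{A}/\partial\alpha$ is the matrix $[\partial\mathbf{A}/\partial\alpha]_{ij} = \partial A_{ij}/\partial\alpha$

Derivatives of matrices with respect to vectors and matrices, or of vectors with respect to matrices, give rise to arrays with more than two dimensions. Rules for differentiating vector and matrix expressions are given in Section A-2 of Appendix A.

We also make use of some notation associated with probability concepts.

$\Pr(A)$ is the probability of event $A$.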
If $x$ is a random variable, then $p(x)$ is the probability density function of $x$.
$p(x \mid A)$ is the probability density of $x$ given that $A$ occurred.
$E(x)$ is the expected value of $x$.
$E(x \mid A)$ is the expected value of $x$ given that $A$ occurred.
$V_x = E\{[x - E(x)]^2\}$ is the variance of $x$.
$\sigma_x = V_x^{1/2}$ is the standard deviation of $x$.
The notation $p(x \mid y)$ is meant to indicate that the probability density of $x$ is also a function of the variable $y$.

The reader totally unfamiliar with matrix and probability theories is urged to study texts on these subjects. The reader who merely wishes to refresh his memory may consult Appendixes A and B, which contain skeleton definitions of the terms involved and the operations applying to them.

Other notation:

$A \equiv B$ means that $A$ equals $B$ by definition.
$A \approx B$ means that $A$ equals $B$ approximately, or to within the order of approximation being considered (e.g., up to second-order terms in a Taylor series).
$\log x$ is the natural logarithm of $x$.
$\exp(x) \equiv e^x$.
$N_k(\mathbf{a}, \mathbf{V})$ is the $k$-dimensional normal distribution with mean $\mathbf{a}$ and covariance matrix $\mathbf{V}$.
Unless otherwise stated, the notation $x = a \pm b$ denotes that $x$ is a random variable or estimate with mean $a$ and standard deviation $b$.
The estimated value of some quantity $x$ is denoted $x^*$, and its true (though unknown) value is denoted x.

Formulas and equations are numbered by chapter and section. For instance, Eq. (5-3-6) is the sixth equation in Section 5-3. The chapter and section numbers are omitted from references to equations within the same section.

Subscripts:

$a, b, c, \ldots$ refer to model equations or dependent variables. The usual range is 1 to $m$. Example: in $y_a = f_a(\mathbf{x}, \boldsymbol{\theta})$ the $a$th dependent variable $y_a$ is a function $f_a$ of $\mathbf{x}$ and $\boldsymbol{\theta}$.
$\alpha, \beta, \gamma, \ldots$ refer to parameters. The usual range is 1 to $l$. Example: $q_\alpha = \partial\Phi/\partial\theta_\alpha$ is the $\alpha$th component of the gradient of $\Phi$ with respect to $\boldsymbol{\theta}$.
$\mu, \eta, \varphi$ refer to experiments. The usual range is 1 to $n$. Example: $\mathbf{y}_\mu$ is the vector of dependent variables measured in the $\mu$th experiment. Its $a$th component is $y_{\mu a}$.
$i$ frequently (but not always) refers to iteration number. Example: $\boldsymbol{\theta}_i$ is the vector $\boldsymbol{\theta}$ appearing in the $i$th iteration. Its $\alpha$th component is $\theta_{i\alpha}$.
Chapter II Problem Formulation A. Determimstic Models 2-1, Basic Concepts The scientist often expresses his theones In the form of mathematical re- lationships among certain quantities. Similarly, the engineer derives equatiom that describe the properties of his structures or the workings of his processes We refer to the relations which supposedly describe a certain physical situa- tion, as a model. Typically, a model consists of one or more equations. The quantities appearing in the equations we classify into variables and param- eters. The distinction between these is 110t always clear cut, and it frequently depends on the context in which the variables appear. Usually a model is de- signed to explain the relationships that exist among quantities which can be measured independently in an experiment; these are the variables of the model. To formulate these relationships, however, one frequently introduces "constants" which stand for inherent properties of nature (or of the ma-terials and equipment used in a given experiment). These are the parameters. We illustrate by means of an examp]e: A cylindrical vessel of cross-sec- tional area A is filled with a liquid of density p and viscosity II. It is allowed to drain through a capillary tube of radius R and length L. Let 11 and 110 denote the depth of the liquid in the vessel at times t and to, respectively. The equations of laminar flow yield, for this case, the relation log(hofh) = (ngR 4 f8AqJL)(t - to) (2-]-1) where g is the acceleration due to gravIty, and qJ = !llp IS the kinematic vis- cosity of the liquid. If we interpret Eq. (I) as a relationship between the height of the liquid and the time, then we shall regard h, 11 0 , t, and to as the variables, and g, A, R, L, and qJ as the parameters. Among the latter, the first is a con- stant of nature, the next three reflect the properties of the apparatus, and the 
last one a property of the material used. If we performed experiments on several different vessels, we might add $R$, $A$, and $L$ to the list of variables, leaving $g$ and $\phi$ as the sole parameters.

On the other hand, suppose our instrument is to be used as a viscometer. We place two marks on the vessel, at heights $h_0$ and $h$ from the bottom, and measure the time $\Delta t$ that it takes for the surface of the liquid to pass from the higher to the lower mark. The kinematic viscosity of the liquid can then be calculated from the following rearrangement of Eq. (1)

$$\phi = \alpha\,\Delta t \tag{2-1-2}$$

where $\alpha = \pi g R^4 / [8AL \log(h_0/h)]$. We calibrate the instrument with liquids whose viscosities are known. For the purposes of calibration, then, Eq. (2) contains the variables $\Delta t$ (directly measurable) and $\phi$ (which can be found in published tables), and the parameter $\alpha$ (in whose physical significance we are not at the moment interested).

The values of some of the parameters which appear in a model may be known with great precision (e.g., the gravitational constant $g$ in Eq. (1)). The role of such parameters does not differ, at least for our purposes, from that of purely numerical constants, such as $\pi$ or 8 in Eq. (1). We exclude such parameters from further considerations.

2-2. Structural Model

The models we have so far considered take the general functional form

$$\mathbf{g}(\mathbf{z}, \boldsymbol{\theta}) = \mathbf{0} \tag{2-2-1}$$

where:

$\mathbf{g} = [g_1, g_2, \ldots, g_m]^T$ is an $m$-dimensional vector of functions.
$\mathbf{z} = [z_1, z_2, \ldots, z_k]^T$ is a $k$-dimensional vector of variables.
$\boldsymbol{\theta} = [\theta_1, \theta_2, \ldots, \theta_l]^T$ is an $l$-dimensional vector of parameters whose values are not precisely known.

Equations (1) are referred to as the structural equations of the model. Looking at the model represented by Eq. (2-1-1), we find that there is only one equation, hence $m = 1$; there are four variables $z_1 = h$, $z_2 = h_0$, $z_3 = t$,
and $z_4 = t_0$; and there are four unknown parameters $\theta_1 = A$, $\theta_2 = R$, $\theta_3 = L$, $\theta_4 = \phi$. Eq. (1) then takes the form

$$g_1(\mathbf{z}, \boldsymbol{\theta}) \equiv \log(z_2/z_1) - (\pi g/8)(\theta_2^4/\theta_1\theta_3\theta_4)(z_3 - z_4) = 0 \tag{2-2-2}$$

A model for which $m = 1$ is called a single equation model.

We refer to a model as linear if each one of the model equations has the form

$$g_i(\mathbf{z}, \boldsymbol{\theta}) = B_{i0}(\mathbf{z}) + \sum_{j=1}^{l} B_{ij}(\mathbf{z})\,\theta_j = 0 \qquad (i = 1, 2, \ldots, m) \tag{2-2-3}$$

where the $B_{ij}$ ($i = 1, 2, \ldots, m$; $j = 0, 1, \ldots, l$) are known functions of the $\mathbf{z}$. Models which are not linear are referred to as nonlinear. Equation (2-1-2) is a linear model (with $\alpha$ as the parameter), whereas Eq. (2) is nonlinear.

2-3. Parameter Evaluation

A model whose form corresponds to Eq. (2-2-1) is called a deterministic model, since all the quantities appearing in it are assumed to be well determined, at least in principle. The model can be of little practical value, however, unless the values of its parameters are known. There are two principal methods by which we may establish the values of the parameters:

1. Calculate the value of a parameter by applying established laws of nature to already known quantities. For example, if $R$, $A$, $L$, $h$, and $h_0$ have been measured, we can compute $\alpha = \pi g R^4 / [8AL \log(h_0/h)]$ as the value of the parameter to be used in Eq. (2-1-2).

2. Measure the values of the model variables that occur in actual physical situations, and then seek parameter values which cause the model equations to be satisfied, at least approximately. We are concerned here with the implementation of this second procedure.

2-4. Reduced Model

The structural Eqs. (2-2-1) are suitable for checking the validity of the model. If values can be found for the parameters such that the equations are at least approximately satisfied, then we do not reject the model.

The most important practical use to which the model may be put is that of prediction. For this purpose, the variables $\mathbf{z}$ are classified into two groups:

1. The $r$ variables $\mathbf{y} = [y_1, y_2, \ldots, y_r]^T$ whose values we wish to predict. These we call the dependent variables.
2. The $s$ variables $\mathbf{x} = [x_1, x_2, \ldots, x_s]^T$ on the basis of which we wish to do the prediction. We call these the independent variables.

The problem of prediction, then, is that of determining in advance the values that the dependent variables will take for given values of the independent variables. Rewriting the structural equations with $\mathbf{x}$ and $\mathbf{y}$ replacing $\mathbf{z}$

$$\mathbf{g}(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) = \mathbf{0} \tag{2-4-1}$$

we see that reasonable prediction is possible if all of the following conditions hold:

1. The model is reasonably correct.
2. The values of the parameters are known to a good approximation.
3. The structural equations can be solved for the dependent variables, yielding the reduced equations

$$\mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta}) \tag{2-4-2}$$

where $\mathbf{f} = [f_1, f_2, \ldots, f_r]^T$ is an $r$-dimensional vector of functions.

Since the number of structural equations is $m$, we can usually solve for the values of up to $m$ dependent variables, leaving $s = k - m$ independent variables.

A linear reduced model is one in which the functions $\mathbf{f}$ are linear in the $\boldsymbol{\theta}$. A linear structural model may result in a nonlinear reduced model. For instance, the linear structural model $\log y + \theta x = 0$ reduces to the nonlinear model $y = e^{-\theta x}$.

Strictly speaking, we should refer to the "structural form" or "reduced form" of the same model. In practice, however, we shall attach the designation "model" to whatever set of equations we happen to be dealing with at the moment.

2-5. Application Areas

There is nothing in Eqs. (2-2-1) or (2-4-2) to imply that we need have explicit analytic expressions for the functions $\mathbf{g}$ and $\mathbf{f}$. All that is required is that given the values of their arguments ($\mathbf{z}$ and $\boldsymbol{\theta}$, or $\mathbf{x}$ and $\boldsymbol{\theta}$), one can calculate the values of the functions. This may require solution of a system of differential equations, or a complicated system simulation. When the structural equations cannot be solved explicitly, we may still obtain predicted values of the dependent variables by solving the equations numerically.
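As an illustration of obtaining a predicted value numerically, the following minimal sketch treats the draining-vessel relation of Section 2-1 as a structural equation and solves it for the depth $h$ by bisection. The function names, the bisection scheme, the SI units, and all numerical values are assumptions made for this example only; the explicit solution exists here and is used solely as a check.

```python
import math

G = 9.81  # acceleration due to gravity in m/s^2 (SI units assumed)

def structural(h, h0, t, t0, A, R, L, phi):
    """Left-hand side of the structural equation; zero when the model holds."""
    rate = math.pi * G * R**4 / (8.0 * A * phi * L)
    return math.log(h0 / h) - rate * (t - t0)

def predict_h(h0, t, t0, A, R, L, phi, rel_tol=1e-12):
    """Solve structural(h, ...) = 0 for h by bisection (requires t >= t0)."""
    args = (h0, t, t0, A, R, L, phi)
    lo = hi = h0                       # structural(h0) <= 0 when t >= t0
    while structural(lo, *args) <= 0.0:
        lo *= 0.5                      # move down until the sign changes
    while hi - lo > rel_tol * h0:
        mid = 0.5 * (lo + hi)
        if structural(mid, *args) > 0.0:
            lo = mid                   # root lies above mid
        else:
            hi = mid                   # root lies at or below mid
    return 0.5 * (lo + hi)

# Hypothetical apparatus and liquid, chosen only for illustration.
params = dict(A=1.0e-2, R=1.0e-3, L=0.1, phi=1.0e-6)
h_numeric = predict_h(h0=0.5, t=100.0, t0=0.0, **params)

# Explicit reduced form h = h0 * exp(-rate * (t - t0)), for comparison.
rate = math.pi * G * params["R"]**4 / (8.0 * params["A"] * params["phi"] * params["L"])
h_closed = 0.5 * math.exp(-rate * 100.0)
print(h_numeric, h_closed)
```

The same pattern applies when no explicit solution exists: only the ability to evaluate the structural function for given arguments is required.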
The following example of a model requiring the solution of differential equations is taken from the field of chemical reaction kinetics. Consider a chemical reaction in which molecules of a certain species (compound) A decompose spontaneously into molecules of B and C. In chemical notation, the reaction would be written as

$$\mathrm{A} \rightarrow \mathrm{B} + \mathrm{C} \tag{2-5-1}$$

The law of mass action states that the rate of decomposition is, at any moment, proportional to the concentration of A at that moment. This leads to the differential equation

$$dy_A/dt = -k_1 y_A \tag{2-5-2}$$

where $y_A$ is the concentration of A at time $t$, and $k_1$ is the so-called reaction rate constant. Eq. (2) may be integrated explicitly to yield

$$y_A = x_A \exp(-k_1 t) \tag{2-5-3}$$

where $x_A$ is the concentration of A at zero time. This is a reduced equation, with $y_A$ the dependent variable, $x_A$ and $t$ the independent variables, and $k_1$ a parameter.

While in this case the differential equation could be solved explicitly, it is not uncommon to find models where the integration can only be performed numerically. Such models are treated in detail in Chapter VIII.

We cannot show here how mathematical models are derived in the various branches of science, but we can cite a few examples to demonstrate that the utility of parameter estimation methods is not confined to the field of chemical reaction kinetics.

(a) Nuclear Physics. Scattering data have been used to estimate parameters referring to nuclear structure or nuclear-nuclear forces [see Melkanoff et al. (1966); Arndt and MacGregor (1966)].

(b) Geophysical Exploration. Geophysical surveys are often conducted by flying over the region of interest and recording measured values of variables such as magnetic and gravimetric field intensities. These records are then scanned for anomalies which may indicate the underground presence of valuable ore deposits.
Assuming the ore deposit to have given shape, size and location, it is possible to derive expressions for the magnetic and gravimetric fields along the flight paths [see Grant and West (1965)]. Although these expressions are very complicated, they can be used (Eisenpress and Surkan, 1966) to estimate ore deposit parameters from aerial survey data.

(c) Biophysics. To study the manner in which substances are transported from one part of an organism to another, biologists conceive of the body as
consisting of compartments separated by semipermeable membranes. A tracer substance is injected into one compartment, and its concentration in the other compartments is subsequently measured at various points in time. These data may be used to estimate intercompartmental transport rate parameters (Berman et al., 1962; Turner et al., 1963; Beauchamp and Cornell, 1966). Another interesting application is the determination of the dipole moments of various sections of the heart from measurements of skin potential (Bellman, Collier, Kagiwada, Kalaba, and Selvester, 1964).

(d) Probability. Given many samples of a random variable having a given probability distribution, we wish to determine parameters (e.g., mean, standard deviation, etc.) appearing in the distribution. This is the classical estimation problem in statistics. A "curve fitting" approach to the problem is to construct a histogram from the data, and fit to it the expression for the probability density function.

(e) Econometrics. Econometricians attempt to construct mathematical models for the national economy or certain segments of it. These models describe the dynamic relationships among variables such as income, sales, production and employment. Parameters appearing in the model may be estimated from past data, and used to predict future trends (Johnston, 1963).

(f) Orbit Calculations. The orbit of a satellite can be expressed as a function of parameters which describe the heavenly bodies that attract the satellite. These parameters can be estimated from the observed orbits (Kelley and Denham, 1966).

All these are examples in which the parameters to be estimated possessed (more or less) a physical significance, and the model equations attempted to represent true cause and effect relationships. The following examples are of a different kind. We attempt to determine design parameters which will confer desirable properties on a device to be constructed.
(g) A smoothing filter is to be installed in an electrical circuit. We calculate the ideal transfer function for the filter by solving the Wiener-Hopf equation (Wiener, 1949). The filter must be constructed from passive elements (resistors, capacitors, and inductors) so that its transfer function can only be a rational function, i.e., the ratio of two polynomials. Our task is to determine the coefficients in the two polynomials so that their ratio approximates the Wiener-Hopf solution as closely as possible.

(h) Designers of artificial limbs attempt to reproduce the observed kinematics of natural limbs. They must estimate the design parameters so as to best approximate the observed motions (Freudenstein and Woo, 1968).
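Returning to the reaction kinetics model at the start of this section: when a rate equation cannot be integrated explicitly, the reduced model is evaluated by numerical integration. A minimal sketch, assuming a classical fourth-order Runge-Kutta scheme (our choice, not prescribed by the text) applied to the first-order decomposition equation; the function name and the numerical values are hypothetical, and the explicit solution is used only to check the result.

```python
import math

def integrate_decay(x_a, k1, t_end, n_steps=1000):
    """Integrate dy/dt = -k1*y from y(0) = x_a to t_end with classical RK4."""
    def rate(y):
        return -k1 * y

    dt = t_end / n_steps
    y = x_a
    for _ in range(n_steps):
        s1 = rate(y)
        s2 = rate(y + 0.5 * dt * s1)
        s3 = rate(y + 0.5 * dt * s2)
        s4 = rate(y + dt * s3)
        y += dt * (s1 + 2.0 * s2 + 2.0 * s3 + s4) / 6.0
    return y

# Hypothetical initial concentration, rate constant, and final time.
x_a, k1, t_end = 2.0, 0.5, 4.0
y_numeric = integrate_decay(x_a, k1, t_end)
y_closed = x_a * math.exp(-k1 * t_end)   # explicit solution, for comparison
print(y_numeric, y_closed)
```

For a parameter estimation routine, `integrate_decay` plays the role of the reduced function: it returns the predicted dependent variable for given independent variables and a trial value of the parameter.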
B. Data

2-6. Experiments and Data Matrix

Parameter estimation is based on data, and the data consist of observed or measured values of the model variables. One may obtain the data by observing situations occurring in nature, or one may set up experiments in which conditions are controlled so as to favor the process of observation. In Chapter X we shall go into the question of what experiments should be performed for estimating a given model. For the present, however, it is immaterial where and how the data were obtained, except inasmuch as the measurement process affects the errors in the observations.

In most cases the data gathering process possesses a certain structure. Performing an experiment consists of recording the observed values of a set of variables under a given set of experimental conditions. Sometimes this means that the dependent variables are observed for given values of the independent variables. Sometimes, however, the experimental conditions themselves are not among the variables of the model. We may, for instance, wish to relate height and weight of individuals in a population. In this case, the individual chosen can be considered the "set of experimental conditions," whereas the height and weight are the model variables.

Frequently, in the course of an investigation, several experiments are performed, each under a different set of experimental conditions. A variable subscripted by a letter $\mu$, $\eta$, or $\phi$ denotes the value of that variable as measured in the corresponding experiment: $\mathbf{z}_\mu = [z_{\mu 1}, z_{\mu 2}, \ldots, z_{\mu k}]^T$ are the values of the model variables observed in the $\mu$th experiment. A function subscripted with one of these letters denotes that function computed for the values observed in the corresponding experiment

$$\mathbf{g}_\mu(\boldsymbol{\theta}) \equiv \mathbf{g}(\mathbf{z}_\mu, \boldsymbol{\theta}) \tag{2-6-1}$$

We shall use corresponding capital letters to designate the data matrix, i.e., the matrix whose $\mu$th row consists of the data vector for the $\mu$th experiment.
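Thus, $\mathbf{Z}$ and $\mathbf{G}$ are the matrices whose $\mu$th rows are $\mathbf{z}_\mu^T$ and $\mathbf{g}_\mu^T$ respectively, e.g.,

$$\mathbf{Z} = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1k} \\ z_{21} & z_{22} & \cdots & z_{2k} \\ \vdots & \vdots & & \vdots \\ z_{n1} & z_{n2} & \cdots & z_{nk} \end{bmatrix} \tag{2-6-2}$$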
where $n$ is the number of experiments. The definitions of $\mathbf{x}_\mu$, $\mathbf{y}_\mu$, $\mathbf{X}$, and $\mathbf{Y}$ are obvious.

In practice it happens frequently that not all variables are measured in every experiment, or even that the set of dependent variables measured differs completely from one set of experiments to the next. In most cases this will raise no undue difficulties; we simply use the appropriate set of model equations for each experiment. Some of the problems that do arise in this connection are discussed in Section 9-1.

C. Probabilistic Models and Likelihood

2-7. Randomness in Data

Deterministic models describe reality only in an idealized sense. If the values of all the variables were known exactly, and if no forces other than those explicitly considered were at work, then and only then could we expect to find parameter values that cause the model equations to be satisfied exactly. In practice, we know that measurement techniques possess limited accuracy, that repeated measurements of one and the same quantity yield different values, that the conditions for which the model was derived are never quite attainable, and that disturbances which could not be predicted or taken into account in the model always occur. Yet these unpredictable disturbances are as much parts of physical reality as are the underlying exact quantities which appear in the model. The model is not complete, then, unless it also describes in an appropriate manner these random elements of the situation.

The appropriate description of random phenomena is through probability statements. The following sections will demonstrate the manner in which the deterministic model can be imbedded in the probabilistic description of the data, but first we digress somewhat to describe some probability distributions that are applicable to experimental errors.
2-8. The Normal Distribution

The importance of the normal distribution (defined below) derives from several reasons:

(a) It has been found to approximate closely the behavior of many measurements in nature.
(b) It is the limit which many other distributions approach when the sample size is increased beyond bound. In particular, we have so-called central limit theorems (Feller, 1966) which state that, under fairly general conditions, the distribution of the sum of $n$ independent random variables approaches the normal distribution as $n$ is made sufficiently large. Central limit theorems are often used to explain the widespread occurrence of this distribution in nature: if the observed value of the random variable is the resultant of many additive, independent effects, the resulting distribution is likely to be normal. In some cases the normal distribution applies not to the variable itself, but to some function of it. For instance, if a given effect is built up over a period of time as the sum of many random effects, each of which has a standard deviation proportional to the magnitude of the overall effect at the time, then the distribution of the logarithm of the overall effect is likely to be normal. This phenomenon is observed in situations relating to the growth of individuals (Cramér, 1946, p. 220).

(c) By specifying the distribution of a random variable, we convey a certain amount of information concerning the values assumed by the variable. A suitable measure of the information contained in the distribution whose probability distribution function (pdf) is $p(x)$ is given by

$$I(p) \equiv E(\log p) = \int p(x) \log p(x)\, dx \tag{2-8-1}$$

(Shannon, 1948; see also Section 10-2).

Consider the following situation: A scientist knows that the measuring errors of some apparatus have mean $\mu$ and standard deviation $\sigma$. For certain reasons (which should become clear in the sequel) the scientist is compelled to assign a pdf $p(x)$ to the measurement errors. This pdf will later be used to make inferences concerning the true values of the measured variable. In the absence of any further information, what function $p(x)$ should be chosen?
The function $p(x)$ must satisfy the following conditions:

1. It is a pdf, i.e., $p(x) \geq 0$, and

$$\int_{-\infty}^{\infty} p(x)\, dx = 1 \tag{2-8-2}$$

2. Its mean is as specified, i.e.,

$$\int_{-\infty}^{\infty} x\, p(x)\, dx = \mu \tag{2-8-3}$$

3. Its variance is as specified, i.e.,

$$\int_{-\infty}^{\infty} (x - \mu)^2 p(x)\, dx = \sigma^2 \tag{2-8-4}$$
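The three conditions, and the information measure of Eq. (2-8-1), can be evaluated numerically for any candidate density. The following minimal sketch compares a normal density with a uniform density of the same mean and variance. The grid, the Riemann-sum integration, the helper names, and the values of the mean and standard deviation are assumptions made for illustration.

```python
import numpy as np

mu, sigma = 1.0, 2.0                       # hypothetical mean and standard deviation
x = np.linspace(mu - 12.0 * sigma, mu + 12.0 * sigma, 200001)
dx = x[1] - x[0]

def integral(values):
    """Riemann-sum approximation of the integral over the grid."""
    return float(np.sum(values) * dx)

def info(p):
    """I(p) = integral of p log p, with the convention 0 log 0 = 0."""
    safe = np.where(p > 0.0, p, 1.0)       # log(1) = 0 where p vanishes
    return integral(p * np.log(safe))

# Candidate 1: normal density with mean mu and variance sigma^2.
normal = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Candidate 2: uniform density on [mu - w, mu + w]; w = sqrt(3)*sigma gives variance sigma^2.
w = np.sqrt(3.0) * sigma
uniform = np.where(np.abs(x - mu) <= w, 1.0 / (2.0 * w), 0.0)

for name, p in (("normal", normal), ("uniform", uniform)):
    total = integral(p)                    # condition 1: should be 1
    mean = integral(x * p)                 # condition 2: should be mu
    var = integral((x - mu) ** 2 * p)      # condition 3: should be sigma^2
    print(name, total, mean, var, info(p))
```

Both candidates satisfy the three conditions to within discretization error, while the normal density yields the smaller value of $I(p)$.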
It is reasonable to select from those functions $p(x)$ satisfying these conditions the one whose information content is least. By doing so, we are adding the smallest possible additional information over and above what we legitimately know (i.e., the values of $\mu$ and $\sigma$).

Finding $p(x)$ such that $I(p)$ is minimized and Eqs. (2)-(4) are satisfied is an exercise in the calculus of variations. Following standard procedures, we form the Lagrangian functional

$$\Lambda(p) \equiv \int_{-\infty}^{\infty} [p \log p + \lambda_1 p + \lambda_2 x p + \lambda_3 (x - \mu)^2 p]\, dx \tag{2-8-5}$$

where the $\lambda_i$ are Lagrange multipliers (see Section 3-6). The Euler equation that $p$ must satisfy to make $\Lambda(p)$ stationary is obtained by differentiating with respect to $p$ the expression under the integral sign

$$\log p + 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0 \tag{2-8-6}$$

Hence

$$p(x) = \exp[-1 - \lambda_1 - \lambda_2 x - \lambda_3 (x - \mu)^2] \tag{2-8-7}$$

The values of the $\lambda_i$ can be determined by substituting Eq. (7) in Eqs. (2)-(4). Using the relation $\int_{-\infty}^{\infty} \exp(-\alpha u^2)\, du = (\pi/\alpha)^{1/2}$, we find ultimately that

$$\lambda_1 = \tfrac{1}{2} \log 2\pi + \log \sigma - 1, \qquad \lambda_2 = 0, \qquad \lambda_3 = 1/(2\sigma^2) \tag{2-8-8}$$

Hence

$$p(x) = [1/(2\pi)^{1/2}\sigma] \exp[-(1/2\sigma^2)(x - \mu)^2] \tag{2-8-9}$$

This is the univariate normal distribution with mean $\mu$ and variance $\sigma^2$. We designate this distribution $N_1(\mu, \sigma^2)$.

When $\mathbf{x}$ is an $n$-dimensional vector random variable with mean $\bar{\mathbf{x}}$ and nonsingular covariance matrix $\mathbf{V}$, we find by similar arguments that the least informative pdf has the form

$$p(\mathbf{x}) = (2\pi)^{-n/2} \det{}^{-1/2}\mathbf{V}\, \exp[-\tfrac{1}{2}(\mathbf{x} - \bar{\mathbf{x}})^T \mathbf{V}^{-1} (\mathbf{x} - \bar{\mathbf{x}})] \tag{2-8-10}$$

which is the multivariate normal distribution with mean $\bar{\mathbf{x}}$ and covariance matrix $\mathbf{V}$. We designate this distribution $N_n(\bar{\mathbf{x}}, \mathbf{V})$.

To summarize: when we specify only the mean and variance of a random variable, we have not determined the entire distribution. If an entire distribution is demanded, however, then by specifying the normal distribution we assume the least possible amount of extraneous information.

(d) The normal distribution is particularly tractable mathematically.
Many results can be worked out explicitly only for this distribution. Therefore,
it is frequently convenient to assume a normal distribution where no specific justification for it exists. This is unlikely to cause much harm, except when the estimation method selected is very sensitive to the shape of the tails of the distributions.

A normal distribution of an $n$-dimensional random vector $\mathbf{x}$ is completely characterized by the mean $\bar{\mathbf{x}}$ and the covariance matrix $\mathbf{V}$, as shown by Eq. (10). We assumed that $\mathbf{V}$ was nonsingular; otherwise $\mathbf{V}^{-1}$ could not have been formed. When $\mathbf{V}$ is singular, i.e., $\det \mathbf{V} = 0$, we speak of a singular normal distribution. If $m < n$ is the rank of $\mathbf{V}$, then there exist $m$ linear combinations of the $\mathbf{x}$ which possess a nonsingular normal distribution.†

In a normal distribution, uncorrelated variables are independent. The mean, mode, and median coincide. The following are additional useful properties of the normal distribution. We assume throughout that $\mathbf{x}$ is $N_n(\mathbf{0}, \mathbf{V})$ (this is just another way of saying that $\mathbf{x}$ is an $n$-dimensional normally distributed random vector with mean $\mathbf{0}$ and covariance $\mathbf{V}$). Then

1. $\mathbf{A}\mathbf{x}$ is $N_m(\mathbf{0}, \mathbf{A}\mathbf{V}\mathbf{A}^T)$, where $\mathbf{A}$ is any $m \times n$ matrix.

2. Let $\mathbf{C}$ be an $n \times n$ matrix, such that $\mathbf{C}^T\mathbf{C} = \mathbf{V}^{-1}$. Then $\mathbf{y} \equiv \mathbf{C}\mathbf{x}$ is $N_n(\mathbf{0}, \mathbf{I})$, i.e., the elements of $\mathbf{y}$ are $n$ independent normal variables with zero means and unit variances. Such variables are called standard normal deviates.

3. If $\mathbf{y}$ is $N_n(\mathbf{0}, \mathbf{I})$, then $\mathbf{y}^T\mathbf{y}$ is a random variable whose distribution is called chi-square with $n$ degrees of freedom, designated $\chi_n^2$. In other words, $\chi_n^2$ is the distribution of the sum of squares of $n$ independent standard normal deviates.

4. We have $\mathbf{y}^T\mathbf{y} = \mathbf{x}^T\mathbf{C}^T\mathbf{C}\mathbf{x} = \mathbf{x}^T\mathbf{V}^{-1}\mathbf{x}$. Hence, $\mathbf{x}^T\mathbf{V}^{-1}\mathbf{x}$ is $\chi_n^2$.

5. If $\phi$ and $\psi$ are independent random variables with distributions $\chi_p^2$ and $\chi_q^2$, respectively, then $q\phi/p\psi$ is a random variable whose distribution is designated $F_{p,q}$.
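Properties 2 and 4 can be verified numerically. In the following minimal sketch, the matrix $\mathbf{C}$ is taken to be the inverse of the lower Cholesky factor of $\mathbf{V}$; this is one convenient choice among many satisfying $\mathbf{C}^T\mathbf{C} = \mathbf{V}^{-1}$, and it, like the covariance matrix, sample size, and random seed, is an assumption made for illustration.

```python
import numpy as np

V = np.array([[4.0, 1.2],
              [1.2, 1.0]])                 # hypothetical nonsingular covariance matrix
n = V.shape[0]

# V = Lc Lc^T, so C = Lc^{-1} gives C^T C = (Lc Lc^T)^{-1} = V^{-1}.
Lc = np.linalg.cholesky(V)
C = np.linalg.inv(Lc)
V_inv = np.linalg.inv(V)

rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(n), V, size=20000)   # rows are samples of x
y = x @ C.T                                               # y = C x for every sample

quad_y = np.sum(y ** 2, axis=1)                           # y^T y
quad_x = np.einsum("ij,jk,ik->i", x, V_inv, x)            # x^T V^{-1} x

print(np.round(C @ V @ C.T, 12))     # covariance of y: the identity matrix
print(quad_y.mean())                 # close to n, the mean of chi-square with n d.o.f.
```

The identity $\mathbf{y}^T\mathbf{y} = \mathbf{x}^T\mathbf{V}^{-1}\mathbf{x}$ holds sample by sample, whereas the agreement of the sample mean with $n$ is only statistical.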
The $\chi^2$ and $F$ distributions play an important role in establishing confidence intervals for estimated parameters, and in testing the goodness of fit of the model to the data (see Chapter VII).

† These are the principal components corresponding to nonzero eigenvalues of $\mathbf{V}$. See Section 7-8.

2-9. The Uniform Distribution

The uniform distribution (also called rectangular) is one in which the range of possible values of each variable is confined to a finite interval, and all values in the interval are equally likely. Thus, if $\mathbf{a}$ and $\mathbf{b}$ are $n$-dimensional
vectors with $b_i > a_i$ for $i = 1, 2, \ldots, n$, then the uniform distribution within the $n$-dimensional rectangle $a_i \leq x_i \leq b_i$ is given by:

$$p(\mathbf{x}) = [(b_1 - a_1)(b_2 - a_2) \cdots (b_n - a_n)]^{-1} \quad \text{for } a_i \leq x_i \leq b_i,\ i = 1, 2, \ldots, n; \qquad p(\mathbf{x}) = 0 \quad \text{otherwise} \tag{2-9-1}$$

For this distribution we have:

$$\bar{\mathbf{x}} = \tfrac{1}{2}(\mathbf{a} + \mathbf{b}) \tag{2-9-2}$$

$$V_{ii} = (1/12)(b_i - a_i)^2, \qquad V_{ij} = 0 \quad (i \neq j) \tag{2-9-3}$$

The components of the vector $\mathbf{x}$ are independent.

The uniform distribution frequently describes the errors of measurement due to the limited number of significant digits that can be read on a scale, because all intermediate values between scale marks are equally probable.

The assumption of a uniform distribution with known bounds for all the errors in a model implies that all these errors are restricted in magnitude. This means that the model must be rejected if no parameter values can be found that keep all the errors within the permitted bounds. The use of this distribution can be justified only if one is willing to accept such drastic conclusions. One must be quite certain that the measurements really differ from the true values of the variables by no more than the specified error bounds, and that no other random factors have been overlooked in the model. In contrast, the normal distribution assigns nonzero (though small) probabilities to any error, no matter how large. It is more forgiving towards inadequacies in the model, and does not break down upon the appearance of an occasional unexpectedly large error. This objection to the uniform distribution does not apply if the upper bound on the error magnitude is not known in advance.

2-10. Distribution of Errors

Attempting to relate the deterministic model to the data gathered from $n$ experiments, we are led to the set of equations

$$\mathbf{g}_\mu(\boldsymbol{\theta}) = \mathbf{0} \qquad (\mu = 1, 2, \ldots, n) \tag{2-10-1}$$

The total number of equations in Eq. (1) usually far exceeds the number of unknown parameters $\boldsymbol{\theta}$. Only under exceptional circumstances do there exist
values of $\boldsymbol{\theta}$ which cause all the equations in Eq. (1) to be satisfied. Indeed we cannot expect all these equations to be satisfied, since

1. The measured values of the variables do not always represent their true values.
2. The model is not exactly accurate, various effects having been neglected in its formulation.

To account for errors of type 1, we scan the list of all the measured quantities $\mathbf{Z}$, and break it up into two sets: quantities $\mathbf{U}$ which are believed to be free of significant error, and quantities $\mathbf{W}$ whose measured values may differ significantly, in a random manner, from their true underlying values, which we designate $\hat{\mathbf{W}}$. The difference between the measured and true values of a variable we call the error

$$\mathbf{E} \equiv \mathbf{W} - \hat{\mathbf{W}} \tag{2-10-2}$$

that is

$$\boldsymbol{\varepsilon}_\mu \equiv \mathbf{w}_\mu - \hat{\mathbf{w}}_\mu \qquad (\mu = 1, 2, \ldots, n)$$

We now assume that each $w_{\mu a}$ is a realization of a random variable $\omega_{\mu a}$, or, equivalently, that $\mathbf{W}$ is a realization of a matrix random variable $\boldsymbol{\Omega}$. This means that $\mathbf{W}$ is one sample out of all possible results of our series of $n$ experiments. Furthermore, we assume that the random variables $\omega_{\mu a}$ possess a joint pdf which depends on the true values $\hat{w}_{\mu a}$, as well as on some parameters $\boldsymbol{\psi}$, whose values may or may not be known. Thus, the pdf has the form $p(\boldsymbol{\Omega} \mid \hat{\mathbf{W}}, \boldsymbol{\psi})$. It is usually the case that the pdf depends explicitly on the $\boldsymbol{\Omega}$ and $\hat{\mathbf{W}}$ only through their difference, i.e., it has the form $p(\mathbf{E} \mid \boldsymbol{\psi})$.

It is also frequently the case that the errors in different experiments are statistically independent. That means that we have a pdf $p_\mu(\boldsymbol{\varepsilon}_\mu \mid \boldsymbol{\psi}_\mu)$ associated with the errors $\boldsymbol{\varepsilon}_\mu$ in the $\mu$th experiment, and the joint pdf for all experiments is given by

$$p(\mathbf{E} \mid \boldsymbol{\psi}_1, \boldsymbol{\psi}_2, \ldots, \boldsymbol{\psi}_n) = \prod_{\mu=1}^{n} p_\mu(\boldsymbol{\varepsilon}_\mu \mid \boldsymbol{\psi}_\mu) \tag{2-10-3}$$

To illustrate, assume that the errors in the $\mu$th experiment are distributed as $N_m(\mathbf{0}, \mathbf{V}_\mu)$. Then

$$p_\mu(\boldsymbol{\varepsilon}_\mu \mid \mathbf{V}_\mu) = (2\pi)^{-m/2} \det{}^{-1/2}\mathbf{V}_\mu\, \exp(-\tfrac{1}{2}\boldsymbol{\varepsilon}_\mu^T \mathbf{V}_\mu^{-1} \boldsymbol{\varepsilon}_\mu) \tag{2-10-4}$$

Hence, the joint pdf is given by

$$p(\mathbf{E} \mid \mathbf{V}_1, \mathbf{V}_2, \ldots, \mathbf{V}_n) = (2\pi)^{-nm/2} \prod_{\mu=1}^{n} \det{}^{-1/2}\mathbf{V}_\mu\, \exp\Bigl(-\tfrac{1}{2}\sum_{\mu=1}^{n} \boldsymbol{\varepsilon}_\mu^T \mathbf{V}_\mu^{-1} \boldsymbol{\varepsilon}_\mu\Bigr) \tag{2-10-5}$$
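For independent normal errors, the joint pdf is most conveniently evaluated as a sum of per-experiment log densities. A minimal sketch follows; the function names, the error matrix, and the covariance matrices are hypothetical, and the result is compared with the product form written out directly.

```python
import numpy as np

def log_pdf_one_experiment(eps, V):
    """Log of the normal density N(0, V) evaluated at the error vector eps."""
    m = eps.size
    _, logdet = np.linalg.slogdet(V)          # log det V (V is positive definite)
    quad = eps @ np.linalg.solve(V, eps)      # eps^T V^{-1} eps
    return -0.5 * (m * np.log(2.0 * np.pi) + logdet + quad)

def log_joint_pdf(E, Vs):
    """Independence across experiments: the log densities simply add."""
    return sum(log_pdf_one_experiment(eps, V) for eps, V in zip(E, Vs))

# Hypothetical errors for n = 3 experiments with m = 2 error-prone variables each.
E = np.array([[0.1, -0.2],
              [0.0, 0.3],
              [-0.4, 0.1]])
Vs = [np.array([[1.0, 0.2], [0.2, 0.5]]),
      np.array([[0.8, 0.0], [0.0, 0.8]]),
      np.array([[2.0, -0.3], [-0.3, 1.0]])]

# Direct evaluation of the product form, for comparison.
n, m = E.shape
direct = (2.0 * np.pi) ** (-n * m / 2.0)
for eps, V in zip(E, Vs):
    direct *= np.linalg.det(V) ** -0.5 * np.exp(-0.5 * eps @ np.linalg.inv(V) @ eps)

print(np.exp(log_joint_pdf(E, Vs)), direct)
```

Working with logarithms avoids numerical underflow when the number of experiments is large.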
The vector of distribution parameters $\boldsymbol{\psi}$ here consists of the elements of the matrices $\mathbf{V}_\mu$.

It must be remembered that when we speak here of random variables we are referring to the results of the measurements, not to the choice of experimental conditions. In many cases, the experimental conditions are selected randomly, e.g., by drawing individuals at random from a population. This does not concern us here; once the individual has been chosen, he ceases to be random. What we are interested in are the random differences that may arise between repeated measurements on the same individual.

The values of $\mathbf{w}_\mu$ and $\mathbf{w}_\eta$ for $\mu \neq \eta$ are usually realizations of different vector random variables $\boldsymbol{\omega}_\mu$ and $\boldsymbol{\omega}_\eta$. Only in the case when experiments $\mu$ and $\eta$ are replications of each other are $\mathbf{w}_\mu$ and $\mathbf{w}_\eta$ realizations of one and the same random variable.

2-11. Stochastic Form of the Model

How can the deterministic model be modified so as to account for the variability in the data and model? There are several ways in which this can be done. The specific form chosen should depend on what we know about the system being described. Do we have strong confidence in the model but not in the data? Do we trust the data but not the model? Perhaps both are subject to significant errors? The type of model that is appropriate in a given situation depends on the answers to these questions. We list below some of the forms that the model may take. A typical example which illustrates the conditions under which these forms are appropriate follows in Section 2-14.

(a) Suppose the data are subject to measurement errors, but the model equations are thought to apply exactly to the true (though unknown) values of the variables

$$\mathbf{g}(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \boldsymbol{\theta}) = \mathbf{0} \qquad (\mu = 1, 2, \ldots, n) \tag{2-11-1}$$

We refer to Eq. (1) as an exact structural model. The measurement errors $\mathbf{E} = \mathbf{W} - \hat{\mathbf{W}}$ are assumed to have a joint pdf $p(\mathbf{E} \mid \boldsymbol{\psi})$.
(b) Suppose the model equations apply only approximately even to the true values of the variables. The error in the model at the $\mu$th experiment is assumed to be a random variable $\boldsymbol{\gamma}_\mu$

$$\mathbf{g}(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \boldsymbol{\theta}) = \boldsymbol{\gamma}_\mu \qquad (\mu = 1, 2, \ldots, n) \tag{2-11-2}$$

We refer to Eq. (2) as an inexact structural model. In conformity with our usual notation, we let $\boldsymbol{\Gamma}$ be the matrix whose $\mu$th row is $\boldsymbol{\gamma}_\mu^T$. The $\boldsymbol{\gamma}_\mu$ are supposed to account for forces that were neglected in the formulation of the
model. One usually assumes that the $\boldsymbol{\gamma}_\mu$ have zero means, and that they are statistically independent of the measurement errors. Then the overall pdf applicable to this model has the form $p(\mathbf{E} \mid \boldsymbol{\psi})\,p(\boldsymbol{\Gamma} \mid \boldsymbol{\psi}')$, where $\boldsymbol{\psi}'$ is an additional set of distribution parameters.

(c) A special case of (b) occurs when all variables are measured precisely so that $\mathbf{w}$ is vacuous. Then

$$\mathbf{g}(\mathbf{u}_\mu, \boldsymbol{\theta}) = \boldsymbol{\gamma}_\mu \qquad (\mu = 1, 2, \ldots, n) \tag{2-11-3}$$

The relevant pdf is simply $p(\boldsymbol{\Gamma} \mid \boldsymbol{\psi}')$. Let us introduce an artificial variable $\mathbf{y}_\mu$ to which we assign the "observed" value zero, and let us define $\boldsymbol{\varepsilon}_\mu \equiv -\boldsymbol{\gamma}_\mu$. Then Eq. (3) is equivalent to $\mathbf{y}_\mu = \mathbf{g}(\mathbf{u}_\mu, \boldsymbol{\theta}) + \boldsymbol{\varepsilon}_\mu$, which has the same form as the reduced model Eq. (9) discussed below.

(d) In some applications, particularly in the field of econometrics, it has been found appropriate not to introduce the "true value" $\hat{\mathbf{w}}_\mu$ explicitly, but rather to treat the model equations as applying approximately to the measured values

$$\mathbf{g}(\mathbf{u}_\mu, \mathbf{w}_\mu, \boldsymbol{\theta}) = \boldsymbol{\gamma}_\mu \qquad (\mu = 1, 2, \ldots, n) \tag{2-11-4}$$

where $\boldsymbol{\gamma}_\mu$ is an error term which is not treated as a random variable in its own right. Rather, one assumes that $\mathbf{w}_\mu$ is a random variable distributed in such a way that $\boldsymbol{\gamma}_\mu = \mathbf{g}(\mathbf{u}_\mu, \mathbf{w}_\mu, \boldsymbol{\theta})$ (regarded as a function of $\mathbf{w}_\mu$) has a given pdf $p(\boldsymbol{\gamma}_\mu)$. The pdf for the original variable $\mathbf{w}_\mu$ is then obtained according to the rules for transforming variables in probability distributions

$$p(\mathbf{w}_\mu) = p(\boldsymbol{\gamma}_\mu)\,|\det(\partial \mathbf{g}_\mu / \partial \mathbf{w}_\mu)| \tag{2-11-5}$$

The quantity $\det(\partial \mathbf{g}_\mu / \partial \mathbf{w}_\mu)$ is the Jacobian of the transformation from $\mathbf{w}_\mu$ to $\mathbf{g}_\mu$, and the dimension of $\mathbf{w}$ must be the same as that of $\mathbf{g}$. The econometricians refer to the $\mathbf{w}$ as the endogenous variables.

Example. The following two-equation production model is due to Bodkin and Klein [1967, Eqs. (12) and (18)]

$$\begin{aligned} g_{\mu 1} &\equiv w_{\mu 1} - \theta_3\,\theta_2^{u_{\mu 1}}\, w_{\mu 2}^{(1 - \theta_1)} = \gamma_{\mu 1} \\ g_{\mu 2} &\equiv w_{\mu 1} - u_{\mu 2}/\theta_1 = \gamma_{\mu 2} \end{aligned} \tag{2-11-6}$$

where $w_1$ is the ratio of real production output to labor input, $w_2$ the ratio of capital input to labor input, $u_1$ the time, and $u_2$ the ratio of wage rate to price of output. Here

$$\det(\partial \mathbf{g}_\mu / \partial \mathbf{w}_\mu) = \begin{vmatrix} 1 & -(1 - \theta_1)\theta_3\theta_2^{u_{\mu 1}} w_{\mu 2}^{-\theta_1} \\ 1 & 0 \end{vmatrix} = (1 - \theta_1)\theta_3\theta_2^{u_{\mu 1}} w_{\mu 2}^{-\theta_1} \tag{2-11-7}$$
If $[\gamma_{\mu 1}, \gamma_{\mu 2}]$ are assumed distributed as $N_2(\mathbf{0}, \mathbf{V})$, then $[w_{\mu 1}, w_{\mu 2}]$ have the pdf

$$p(\mathbf{w}_\mu) = |(1 - \theta_1)\theta_3\theta_2^{u_{\mu 1}} w_{\mu 2}^{-\theta_1}|\,(2\pi)^{-1}(\det \mathbf{V})^{-1/2} \exp(-\tfrac{1}{2}\mathbf{g}_\mu^T \mathbf{V}^{-1} \mathbf{g}_\mu) \tag{2-11-8}$$

(e) Suppose the dimension of $\mathbf{w}$ is equal to the dimension of $\mathbf{g}$, and suppose further that the structural equations $\mathbf{g}(\mathbf{u}, \mathbf{w}, \boldsymbol{\theta}) = \mathbf{0}$ can be solved for $\mathbf{w}$. Then we obtain the reduced model $\mathbf{w} = \mathbf{f}(\mathbf{u}, \boldsymbol{\theta})$. In the $\mu$th experiment there may be errors of two kinds: errors $\boldsymbol{\varepsilon}_{\mu 1}$ in the measurement of $\mathbf{w}$, and errors $\boldsymbol{\varepsilon}_{\mu 2}$ in the model equations. In conformity with usual practice when dealing with reduced models, we write $\mathbf{y}$ in place of $\mathbf{w}$ and $\mathbf{x}$ in place of $\mathbf{u}$. The model now takes the form

$$\mathbf{y}_\mu = \mathbf{f}(\mathbf{x}_\mu, \boldsymbol{\theta}) + \boldsymbol{\varepsilon}_\mu = \mathbf{f}_\mu(\boldsymbol{\theta}) + \boldsymbol{\varepsilon}_\mu \tag{2-11-9}$$

where $\boldsymbol{\varepsilon}_\mu \equiv \boldsymbol{\varepsilon}_{\mu 1} + \boldsymbol{\varepsilon}_{\mu 2}$. If we define $\hat{\mathbf{y}}_\mu \equiv \mathbf{y}_\mu - \boldsymbol{\varepsilon}_\mu$, then we may write the model as

$$\hat{\mathbf{y}}_\mu = \mathbf{f}(\mathbf{x}_\mu, \boldsymbol{\theta}) = \mathbf{f}_\mu(\boldsymbol{\theta}) \tag{2-11-10}$$

The quantity $\hat{\mathbf{y}}_\mu$ cannot legitimately be thought of as a "true" value of $\mathbf{y}_\mu$ unless $\boldsymbol{\varepsilon}_{\mu 2}$ is negligible. The relevant pdf for the model Eq. (9) has the form $p(\boldsymbol{\varepsilon}_{\mu 1}, \boldsymbol{\varepsilon}_{\mu 2})$, but in practice the dual nature of the errors is usually ignored, and the pdf is written simply as $p(\boldsymbol{\varepsilon}_\mu)$. The joint pdf for all the errors has the form $p(\mathbf{E} \mid \boldsymbol{\psi})$.

We refer to a reduced model in which the independent variables $\mathbf{x}$ are measured precisely as a standard reduced model. The appropriate representation of such a model is given by Eq. (9) or, equivalently, by Eq. (10). Of all nonlinear models, this is the one for which the calculation of the estimates is easiest. For this reason, it is tempting to neglect errors in $\mathbf{x}_\mu$ in any reduced model, regardless of whether this is justified in physical fact. The resulting errors in the estimates are difficult to predict, and a Monte Carlo study would be appropriate (see Section 3-3). Be that as it may, the vast majority of all nonlinear estimation calculations have in practice been undertaken on the implicit assumption that the model was in standard reduced form.
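The kind of Monte Carlo study mentioned above can be sketched in a few lines. The example below is a deliberately simple special case chosen by us, not taken from the text: a one-parameter model $y = \theta x$ fitted by unweighted least squares through the origin, with the independent variable observed with error but treated as exact. The function name, error levels, sample sizes, and seed are all hypothetical.

```python
import numpy as np

def monte_carlo_slope(theta=2.0, sigma_x=0.5, sigma_y=0.1,
                      n=50, reps=2000, seed=0):
    """Average least squares slope when errors in x are ignored."""
    rng = np.random.default_rng(seed)
    x_true = np.linspace(1.0, 3.0, n)          # true values of the independent variable
    estimates = np.empty(reps)
    for r in range(reps):
        x_obs = x_true + rng.normal(0.0, sigma_x, n)          # x measured with error
        y_obs = theta * x_true + rng.normal(0.0, sigma_y, n)  # y depends on true x
        # Least squares for y = theta*x, treating x_obs as if it were exact.
        estimates[r] = (x_obs @ y_obs) / (x_obs @ x_obs)
    return estimates.mean(), estimates.std()

mean_with_x_error, _ = monte_carlo_slope(sigma_x=0.5)
mean_without_x_error, _ = monte_carlo_slope(sigma_x=0.0)
print(mean_with_x_error, mean_without_x_error)
```

In this setting the estimate is systematically pulled toward zero when the errors in the independent variable are ignored, while it centers on the true value when they are absent. For a nonlinear model the direction and size of the effect would have to be established by a study of this kind rather than assumed.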
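2-12. Likelihood-Standard Reduced Model

The standard reduced model Eq. (2-11-9) can be put in the more concise form

$$
\mathbf{Y} = \mathbf{F}(\mathbf{X}, \boldsymbol{\theta}) + \mathbf{E}
\tag{2-12-1}
$$

Suppose the model is specified, along with the joint pdf $p(\mathbf{E} \mid \boldsymbol{\psi})$ and with the data $\mathbf{Y}$, $\mathbf{X}$. For any given values of the parameters $\boldsymbol{\theta}$ we can compute the residuals

$$
\mathbf{E}(\boldsymbol{\theta}) \equiv \mathbf{Y} - \mathbf{F}(\mathbf{X}, \boldsymbol{\theta})
\tag{2-12-2}
$$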
i.e., the differences $e_{\mu a}(\boldsymbol{\theta})$ between the observed values $y_{\mu a}$ and the "computed" values $f_a(\mathbf{x}_\mu, \boldsymbol{\theta})$ of the dependent variables. If $\boldsymbol{\theta}$ is close to its true value, then $\mathbf{E}(\boldsymbol{\theta})$ should be close to the true errors $\mathbf{E}$. In the joint pdf, let us replace the errors by the expressions for the residuals. The resulting expression, which is a function of $\boldsymbol{\theta}$ and $\boldsymbol{\psi}$ alone, is called the likelihood function of the sample

$$
L(\boldsymbol{\theta}, \boldsymbol{\psi}) \equiv p(\mathbf{E}(\boldsymbol{\theta}) \mid \boldsymbol{\psi}) = p(\mathbf{Y} - \mathbf{F}(\mathbf{X}, \boldsymbol{\theta}) \mid \boldsymbol{\psi})
\tag{2-12-3}
$$

Note that since $\mathbf{X}$ and $\mathbf{Y}$ are known quantities, they do not appear as variables among the arguments of the likelihood function.

As an example, suppose the pdf is given by Eq. (2-10-5), i.e., the errors in the $\mu$th experiment are distributed as $N_m(\mathbf{0}, \mathbf{V}_\mu)$, and errors in different experiments are independent. The likelihood is obtained by substituting $\mathbf{y}_\mu - \mathbf{f}(\mathbf{x}_\mu, \boldsymbol{\theta})$ for $\boldsymbol{\varepsilon}_\mu$ in Eq. (2-10-5)

$$
L(\boldsymbol{\theta}, \mathbf{V}_1, \mathbf{V}_2, \ldots, \mathbf{V}_n) = (2\pi)^{-nm/2} \prod_{\mu=1}^{n} (\det \mathbf{V}_\mu)^{-1/2}
\times \exp\left\{-\tfrac{1}{2} \sum_{\mu=1}^{n} [\mathbf{y}_\mu - \mathbf{f}(\mathbf{x}_\mu, \boldsymbol{\theta})]^T \mathbf{V}_\mu^{-1} [\mathbf{y}_\mu - \mathbf{f}(\mathbf{x}_\mu, \boldsymbol{\theta})]\right\}
\tag{2-12-4}
$$

The likelihood function can be defined in more generality as follows: take the joint pdf of the deviations or errors, and substitute for all random variables their sample values in the form of expressions involving measured variables and unknown parameters; the resulting expression is the likelihood function. In the next section we carry out this procedure for several additional models.

2-13. Likelihood-Structural Models

(a) Exact Structural Model. Referring to Eq. (2-11-1) we find that the $\hat{\mathbf{w}}_\mu$ appear as additional unknown parameters in the model. We define the residuals here as the differences between the measured values $\mathbf{w}_\mu$ and any particular assumed values for $\hat{\mathbf{w}}_\mu$, i.e.,

$$
\mathbf{E}(\hat{\mathbf{W}}) \equiv \mathbf{W} - \hat{\mathbf{W}}
\tag{2-13-1}
$$

Hence the likelihood function, as derived from the pdf $p(\mathbf{E} \mid \boldsymbol{\psi})$, has the form

$$
L(\hat{\mathbf{W}}, \boldsymbol{\psi}) \equiv p(\mathbf{W} - \hat{\mathbf{W}} \mid \boldsymbol{\psi})
\tag{2-13-2}
$$

The parameters $\hat{\mathbf{W}}$ are not free to assume any values whatsoever; they are constrained to satisfy the structural Eqs. (2-11-1).
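As an example, when the joint pdf is given by Eq. (2-10-5), all we have to do is substitute $\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu$ for $\boldsymbol{\varepsilon}_\mu$ to form the likelihood

$$
L(\mathbf{V}_1, \mathbf{V}_2, \ldots, \mathbf{V}_n, \hat{\mathbf{w}}_1, \hat{\mathbf{w}}_2, \ldots, \hat{\mathbf{w}}_n) = (2\pi)^{-nr/2} \prod_{\mu=1}^{n} (\det \mathbf{V}_\mu)^{-1/2}
\times \exp\left[-\tfrac{1}{2} \sum_{\mu=1}^{n} (\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu)^T \mathbf{V}_\mu^{-1} (\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu)\right]
\tag{2-13-3}
$$

Note again that $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_n$ are known vectors (being the measured data). Hence they do not appear among the arguments of $L$.

Since an exact structural model requires a large number of additional unknown parameters $\hat{\mathbf{W}}$, it is desirable to transform it to reduced form when possible.

(b) Inexact Structural Model. Let the model be described by Eq. (2-11-2). Once more, the residuals are defined by Eq. (1), and they take the place of $\mathbf{E}$ in the pdf $p(\mathbf{E} \mid \boldsymbol{\psi})\,p(\boldsymbol{\Gamma} \mid \boldsymbol{\psi}')$. An expression for $\boldsymbol{\Gamma}$ is obtained simply by evaluating Eq. (2-11-2) for specific values of $\hat{\mathbf{W}}$ and $\boldsymbol{\theta}$, i.e., by substituting $\mathbf{G}(\mathbf{U}, \hat{\mathbf{W}}, \boldsymbol{\theta})$ for $\boldsymbol{\Gamma}$. Thus

$$
L(\hat{\mathbf{W}}, \boldsymbol{\theta}, \boldsymbol{\psi}, \boldsymbol{\psi}') = p(\mathbf{W} - \hat{\mathbf{W}} \mid \boldsymbol{\psi})\; p[\mathbf{G}(\mathbf{U}, \hat{\mathbf{W}}, \boldsymbol{\theta}) \mid \boldsymbol{\psi}']
\tag{2-13-4}
$$

In this case, no restrictions apply to the $\boldsymbol{\theta}$ and $\hat{\mathbf{W}}$.

As an example, suppose the errors in $\mathbf{W}$ are again distributed as Eq. (2-10-5) and that the $\boldsymbol{\gamma}_\mu$ are similarly distributed as $N_m(\mathbf{0}, \mathbf{Q}_\mu)$. We obtain the likelihood function by substituting $\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu$ for $\boldsymbol{\varepsilon}_\mu$, and $\mathbf{g}_\mu$ for $\boldsymbol{\gamma}_\mu$

$$
L(\boldsymbol{\theta}, \mathbf{V}_1, \ldots, \mathbf{V}_n, \mathbf{Q}_1, \ldots, \mathbf{Q}_n, \hat{\mathbf{w}}_1, \ldots, \hat{\mathbf{w}}_n) = (2\pi)^{-(n/2)(r+m)} \prod_{\mu=1}^{n} \left[(\det \mathbf{Q}_\mu)^{-1/2} (\det \mathbf{V}_\mu)^{-1/2}\right]
$$
$$
\times \exp\left\{-\tfrac{1}{2} \sum_{\mu=1}^{n} \left[(\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu)^T \mathbf{V}_\mu^{-1} (\mathbf{w}_\mu - \hat{\mathbf{w}}_\mu) + \mathbf{g}^T(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \boldsymbol{\theta})\, \mathbf{Q}_\mu^{-1}\, \mathbf{g}(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \boldsymbol{\theta})\right]\right\}
\tag{2-13-5}
$$

Again, $\mathbf{u}_\mu$ and $\mathbf{w}_\mu$, being known vectors, do not appear as variables among the arguments of $L$.

(c) For the econometric models discussed under (d) in Section 2-11, the likelihood function is found by multiplying the terms Eq. (2-11-5) for all values of $\mu$, i.e.,

$$
L(\boldsymbol{\theta}) = \prod_{\mu=1}^{n} p(\mathbf{w}_\mu) = \prod_{\mu=1}^{n} p(\mathbf{g}_\mu)\, \left|\det(\partial \mathbf{g}_\mu/\partial \mathbf{w}_\mu)\right|
\tag{2-13-6}
$$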
For the model of Eqs. (2-11-6) this turns out to be

$$
L(\boldsymbol{\theta}, \mathbf{V}) = \left[(1-\theta_1)\theta_3 (2\pi)^{-1} (\det \mathbf{V})^{-1/2}\right]^n\, \theta_2^{\,\sum_{\mu=1}^{n} u_{\mu 1}} \left[\prod_{\mu=1}^{n} w_{\mu 2}\right]^{-\theta_1} \exp\left(-\tfrac{1}{2} \sum_{\mu=1}^{n} \mathbf{g}_\mu^T \mathbf{V}^{-1} \mathbf{g}_\mu\right)
\tag{2-13-7}
$$

The reducibility of an exact model $\mathbf{g}(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \boldsymbol{\theta}) = \mathbf{0}$ to the form $\hat{\mathbf{w}}_\mu = \mathbf{f}(\mathbf{u}_\mu, \boldsymbol{\theta})$ depends primarily on the relation between the number $m$ of equations and the number $r$ of random variables per experiment. We distinguish three cases:

1. $r = m$. Except in certain singular cases (vanishing Jacobian) the equations may, in principle, be solved for the $\hat{\mathbf{w}}_\mu$ (although the solution may not be unique). Even when the solution cannot be exhibited in explicit form, it can be computed numerically. Thus, at least in principle, the $\hat{\mathbf{w}}_\mu$ may be eliminated from the likelihood function, which remains a function of $\boldsymbol{\theta}$ and $\boldsymbol{\psi}$ alone.

2. $r > m$. Not all the $\hat{w}_{\mu a}$ can be eliminated. We can, however, choose $m$ of the $\hat{w}_{\mu a}$, solve for those, and substitute in the likelihood function. This leaves us with only $r - m$ unknown $\hat{w}_{\mu a}$ per experiment, and these are unrestricted in value.

3. $r < m$. In this case, we may solve $r$ of the equations for the $\hat{\mathbf{w}}_\mu$, and use these to eliminate the $\hat{\mathbf{w}}_\mu$ from the likelihood and from the remaining $m - r$ model equations. For each experiment there remain $m - r$ equations, containing only $\mathbf{u}_\mu$ and $\boldsymbol{\theta}$. Therefore, if $n$ is the number of experiments, the total number of equations is $(m - r)n$, and the model must contain at least $(m - r)n$ unknown parameters $\boldsymbol{\theta}$ for there to be a solution. In most cases we find that the number of random variables per experiment must at least equal the number of equations.

2-14. An Example

The following example should clarify the conditions under which the various types of model are appropriate. A sphere of radius $r$ and mass $m$ is dropping freely through an incompressible Newtonian fluid of viscosity $\mu$.
The force of gravity acting on the sphere is $g(m - m_0)$, where $m_0$ is the mass of the fluid displaced by the sphere. According to Stokes's law, the drag opposing the motion of the sphere (when the motion is slow) is $6\pi r \mu v$, where $v$ is the velocity of the sphere. Newton's second law of motion takes the form

$$
m\dot{v} = g(m - m_0) - 6\pi r \mu v
\tag{2-14-1}
$$
where $\dot{v} \equiv dv/dt$ is the acceleration. We may rewrite this as

$$
\dot{v} + \alpha v = \beta
\tag{2-14-2}
$$

where

$$
\alpha \equiv 6\pi r \mu/m, \qquad \beta \equiv g(m - m_0)/m
\tag{2-14-3}
$$

Assuming that the sphere was initially at rest, we can integrate Eq. (2) to find

$$
v(t) = (\beta/\alpha)(1 - e^{-\alpha t})
\tag{2-14-4}
$$

The distance $s$ traveled by the sphere since the inception of its motion is

$$
s = \int_0^t v(\tau)\, d\tau = (\beta/\alpha)t - (\beta/\alpha^2)(1 - e^{-\alpha t})
$$

which can be translated into the "model"

$$
g(s, t, \alpha, \gamma) \equiv s - \gamma[t - (1/\alpha)(1 - e^{-\alpha t})] = 0
\tag{2-14-5}
$$

where

$$
\gamma \equiv \beta/\alpha = g(m - m_0)/6\pi r \mu
\tag{2-14-6}
$$

Suppose we have measurements $s_1, s_2, \ldots, s_n$ recorded when the clock indicated times $t_1, t_2, \ldots, t_n$. We are interested in estimating some of the physical constants appearing in the model. At the outset it is clear that the model equation contains only two independent parameters. Hence only two of the physical constants $g$, $r$, $m$, $m_0$, $\mu$ appearing in the model can be estimated independently. Since all the information contained in Eq. (5) relative to these constants is derivable from the values of $\alpha$ and $\gamma$, we shall assume that these are the parameters to be estimated.

We examine the following cases, in all of which it is assumed that errors in different measurements are statistically independent. The parameters of the error distributions are represented as $\boldsymbol{\psi}$.

(a) The model equation Eq. (5) is exact, i.e., any systematic deviations from it are negligible compared to measurement errors.

1. The measurements of $t_\mu$ are precise, but those of $s_\mu$ are subject to errors with pdf $p(\varepsilon_\mu \mid \boldsymbol{\psi})$. Eq. (5) becomes in reduced form

$$
\hat{s}_\mu = \gamma[t_\mu - (1/\alpha)(1 - \exp(-\alpha t_\mu))] \qquad (\mu = 1, 2, \ldots, n)
\tag{2-14-7}
$$

and the likelihood is

$$
L(\alpha, \gamma, \boldsymbol{\psi}) = \prod_{\mu=1}^{n} p\{s_\mu - \gamma[t_\mu - (1/\alpha)(1 - \exp(-\alpha t_\mu))] \mid \boldsymbol{\psi}\}
\tag{2-14-8}
$$

2. Both $t_\mu$ and $s_\mu$ are subject to measurement errors $\varepsilon_t$ and $\varepsilon_s$, respectively, with pdf $p(\varepsilon_t, \varepsilon_s \mid \boldsymbol{\psi})$. We have the exact structural model

$$
\hat{s}_\mu - \gamma[\hat{t}_\mu - (1/\alpha)(1 - \exp(-\alpha \hat{t}_\mu))] = 0 \qquad (\mu = 1, 2, \ldots, n)
\tag{2-14-9}
$$
where $\hat{s}_\mu$ and $\hat{t}_\mu$ are the "true" values of $s$ and $t$ at the $\mu$th measurement. The likelihood is

$$
L[\hat{s}_\mu, \hat{t}_\mu\ (\mu = 1, 2, \ldots, n), \boldsymbol{\psi}] = \prod_{\mu=1}^{n} p\{t_\mu - \hat{t}_\mu,\ s_\mu - \hat{s}_\mu \mid \boldsymbol{\psi}\}
\tag{2-14-10}
$$

with $\hat{t}_\mu$ and $\hat{s}_\mu$ constrained by Eq. (9). Alternately, we can substitute $\hat{s}_\mu$ from Eq. (9) into Eq. (10) to obtain

$$
L(\hat{t}_\mu\ (\mu = 1, 2, \ldots, n), \alpha, \gamma, \boldsymbol{\psi}) = \prod_{\mu=1}^{n} p\{t_\mu - \hat{t}_\mu,\ s_\mu - \gamma[\hat{t}_\mu - (1/\alpha)(1 - \exp(-\alpha \hat{t}_\mu))] \mid \boldsymbol{\psi}\}
\tag{2-14-11}
$$

with no constraints applying to the $\hat{t}_\mu$.

3. Only $t_\mu$ is subject to significant errors. If Eq. (5) could be solved for $t$, we would have a standard reduced model. Since this is impossible, we again adopt an exact structural model, except that $\hat{s}_\mu$ is replaced by $s_\mu$ in Eq. (9), and all references to $\hat{s}_\mu$ and $s_\mu$ are deleted from Eq. (10).

(b) The model Eq. (5) is inexact. For instance, if the sphere is sufficiently small, then the drag force is randomly perturbed due to the impact of individual molecules. A Brownian motion is thereby superimposed on the falling motion of the sphere. Eq. (1) must be amended to read

$$
\dot{v} + \alpha v = \beta + \eta
\tag{2-14-12}
$$

where $\eta$ is a random variable whose distribution may be derived from the laws of statistical mechanics. When the equations are integrated, there will arise a perturbation on $s$, i.e., Eq. (5) will take the form

$$
s - \gamma[t - (1/\alpha)(1 - e^{-\alpha t})] = \varphi
\tag{2-14-13}
$$

where $\varphi$ is also a random variable. Let $p(\varphi_1, \varphi_2, \ldots, \varphi_n \mid \alpha, \gamma, \boldsymbol{\psi})$ be the joint pdf of the $\varphi_\mu$. If the measurements of $s_\mu$ and $t_\mu$ are precise relative to the standard deviation of $\varphi$, then we have the likelihood

$$
L(\alpha, \gamma, \boldsymbol{\psi}) = p\{s_1 - \gamma[t_1 - (1/\alpha)(1 - \exp(-\alpha t_1))],\ s_2 - \gamma[t_2 - (1/\alpha)(1 - \exp(-\alpha t_2))],\ \ldots \mid \alpha, \gamma, \boldsymbol{\psi}\}
\tag{2-14-14}
$$

We have used a joint pdf instead of the product of individual pdf's because the assumption of independence between observations is tenable here only if the experiment is restarted from rest for each measurement. Otherwise, if the disturbances up to time $t_2$ have conspired to make $s_2$ larger than expected from Eq.
(5), then $s$ is likely to remain too large in succeeding periods. See Problem 4 of Section 8-9.
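Case (a)1 of the falling-sphere example can be sketched numerically as follows. This is a minimal illustration, not a recommended estimation method: it assumes the error pdf is normal with known standard deviation, uses made-up values for $\alpha$, $\gamma$, and the observation times, and locates the maximum of the likelihood Eq. (2-14-8) by a crude search over a coarse grid. The function and variable names are ours.

```python
import math
import random


def distance(t, alpha, gamma):
    """Eq. (2-14-7): s = gamma * [t - (1/alpha) * (1 - exp(-alpha * t))]."""
    return gamma * (t - (1.0 - math.exp(-alpha * t)) / alpha)


def log_likelihood(alpha, gamma, sigma, times, s_obs):
    """Logarithm of Eq. (2-14-8), ASSUMING errors distributed as N(0, sigma^2)."""
    n = len(times)
    ss = sum((s - distance(t, alpha, gamma)) ** 2 for t, s in zip(times, s_obs))
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - ss / (2.0 * sigma ** 2)


def grid_search(times, s_obs, sigma):
    """Return (alpha, gamma, log L) at the best point of a coarse grid."""
    best = None
    for i in range(1, 41):
        for j in range(1, 41):
            alpha, gamma = 0.125 * i, 0.125 * j
            ll = log_likelihood(alpha, gamma, sigma, times, s_obs)
            if best is None or ll > best[2]:
                best = (alpha, gamma, ll)
    return best


# Hypothetical "true" values and simulated measurements.
ALPHA_TRUE, GAMMA_TRUE, SIGMA = 2.0, 3.0, 0.05
rng = random.Random(0)
TIMES = [0.25 * k for k in range(1, 21)]
S_OBS = [distance(t, ALPHA_TRUE, GAMMA_TRUE) + rng.gauss(0.0, SIGMA) for t in TIMES]

alpha_hat, gamma_hat, _ = grid_search(TIMES, S_OBS, SIGMA)
print(alpha_hat, gamma_hat)
```

With error-free data the grid point coinciding with the assumed true values gives zero residuals and therefore the largest likelihood; with simulated errors the maximum moves to a nearby point.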
2-15. Utility of Distribution Assumptions

The problem of estimating the parameters $\boldsymbol{\theta}$ has now been augmented, inasmuch as we must also estimate the parameters $\boldsymbol{\psi}$, and possibly $\hat{\mathbf{W}}$. We let $\boldsymbol{\phi}$ denote the entire set of unknown parameters, i.e.,

$$
\boldsymbol{\phi} \equiv \{\boldsymbol{\theta}, \boldsymbol{\psi}, \hat{\mathbf{W}}\}
\tag{2-15-1}
$$

Those $\hat{\mathbf{w}}$ which could be eliminated (see Section 2-13) are excluded. A scientist may be reluctant to erect the entire probabilistic superstructure, only to find himself with a larger problem than he started with. It is true that some parameter estimation procedures may be applied without making any probabilistic assumptions. The resulting estimates, however, are rather meaningless. They may suffice for curve fitting, but nothing will be known about the relationship between the estimated and true values of the parameters.

Frequently people make implicit assumptions concerning the probability distribution without realizing the fact. This happens, for instance, when weights are assigned in the least squares procedure (see Section 4-3). By recognizing the role of such weights as parameters $\boldsymbol{\psi}$ of a distribution, we are able to estimate the weights rather than assign them. Thus we are able to shift a burden from ourselves to the computer (see Sections 4-8 and 4-9).

It should be noted that in some cases (particularly with linear models) it suffices to specify the covariance matrix of the distribution without committing oneself to any specific form of the density functions.

D. Prior Information and Posterior Distribution

2-16. Prior Information

The scientist usually has some ideas concerning the values of his parameters even before any data have been gathered. He is frequently able to exclude entirely some values. For instance, the rate constant in a chemical reaction or the viscosity of a liquid must be positive. An estimation procedure that came up with negative values for such parameters should be entirely unacceptable.
His physical intuition may lead the scientist to reject some other values as being entirely implausible, even though they are strictly speaking not impossible. Even among the admissible values the scientist may regard some as more plausible than others. For instance, suppose a chemist knows with great precision the viscosities $\eta_6$ of n-hexane and $\eta_8$ of n-octane, and he
is trying to determine $\eta_7$, the viscosity of n-heptane. Experience with the properties of homologous series of organic compounds will lead him to reject entirely values of $\eta_7$ such that $\eta_7 \le \eta_6$ or $\eta_7 \ge \eta_8$. Among the remaining values, he will prefer those near $(\eta_6 + \eta_8)/2$ to those near $\eta_6$ or $\eta_8$.

2-17. Prior Distribution

The scientist may summarize his prior information in what is called the prior distribution of the parameters. The prior distribution may be characterized by means of the prior density function $p_0(\boldsymbol{\phi})$. The prior density function is required to be nonnegative, and to possess the property that if $\boldsymbol{\phi}_1$ and $\boldsymbol{\phi}_2$ are any two values of $\boldsymbol{\phi}$, then $p_0(\boldsymbol{\phi}_1)/p_0(\boldsymbol{\phi}_2)$ represents the ratio of the plausibility† of $\boldsymbol{\phi}_1$ to that of $\boldsymbol{\phi}_2$. Note that we do not require the normalization condition $\int p_0(\boldsymbol{\phi})\, d\boldsymbol{\phi} = 1$. In fact, we do not even require the integral $\int p_0(\boldsymbol{\phi})\, d\boldsymbol{\phi}$ to exist. Thus, we are permitted to assign the uniform prior density $p_0(\boldsymbol{\phi}) = 1$ to describe the case when all values of $\boldsymbol{\phi}$ are equally plausible. We always assign $p_0(\boldsymbol{\phi}) = 0$ to all values of $\boldsymbol{\phi}$ which are to be entirely excluded.

Controversy still rages around the question of whether or not the prior distribution may be regarded as a true probability distribution. For the statistician belonging to the frequentist school a probability distribution is meaningful only when applied to a random variable. When the parameters represent physical constants, their values are perfectly definite (although unknown), and they cannot be regarded as random variables. Proponents of subjective probability (e.g., Savage, 1954) and decision theory (e.g., Raiffa and Schlaifer, 1961; Ferguson, 1967), however, do indeed admit "degrees of belief" and subjective choices of plausibility as probability densities. Not only do they allow one to postulate prior densities, they actually insist that one do so.
They believe that any sensible subsequent action (e.g., parameter estimation) must be based on some choice of a prior distribution.

We shall not attempt to resolve the controversy here. For a concise discussion of the problem we refer the reader to Cornfield (1967). We feel that it is up to the scientist to decide for himself the extent of his commitment to a prior distribution. He should remember that introducing a prior distribution biases the results of the estimation process so as to favor parameter values for which $p_0(\boldsymbol{\phi})$ is relatively large. This bias diminishes as the number of experiments is increased. In other words, when the amount of data available is sufficiently large, the effect of the prior distribution on the parameter estimates is negligible, except that values of $\boldsymbol{\phi}$ for which $p_0(\boldsymbol{\phi}) = 0$ remain excluded.

† This is an intuitive concept, which we do not attempt to define here. Some authors have given definitions based on bets that the scientist is willing to lay on each value of $\boldsymbol{\phi}$.
There are several cases for which the use of a prior distribution is noncontroversial:

(a) Assigning $p_0(\boldsymbol{\phi}) = 0$ to physically impossible values of $\boldsymbol{\phi}$.

(b) If $\boldsymbol{\phi}$ is truly a random variable, its pdf should be used as the prior density. For instance, $\boldsymbol{\phi}$ may represent the physical properties of a batch of chemicals. If these properties are known to vary randomly from one batch to the next according to some pdf $p(\boldsymbol{\phi})$, it is entirely proper to use $p_0(\boldsymbol{\phi}) = p(\boldsymbol{\phi})$ when attempting to estimate the properties of one specific batch.

(c) Suppose a number of relevant experiments has already been conducted, but additional experiments are being planned. As will be seen shortly, the information on the parameters contained in the data may be expressed in the form of a so-called posterior distribution. It is entirely proper to use the posterior distribution from the already completed experiments as the prior distribution for the experiments yet to be conducted.

2-18. Informative and Noninformative Priors

In case (c) above the choice of the prior distribution is obvious. The same is true in case (b), provided $p(\boldsymbol{\phi})$ is known. How should one choose a distribution in other cases? We distinguish three situations:

(a) The noninformative case occurs when we really have no marked preferences for some values of $\boldsymbol{\phi}$ over others, at least within the relevant region in $\boldsymbol{\phi}$-space.‡ The simplest solution, and a very satisfactory one in practice,§ is to assume no prior distribution at all, and to use a parameter estimation method which does not require one. This avenue, however, is closed to practitioners who are irrevocably committed to decision theory and Bayesian statistics. They are forced to choose a prior distribution, and are likely to assume a uniform prior density. This is logically unsatisfactory, since if $\phi$ has a uniform prior distribution, any nontrivial function of $\phi$ (say $\phi^3$) has a nonuniform distribution.
As we could have written the model in terms of $\phi^3$ rather than $\phi$, we find ourselves in a situation where the choice of parametrization affects the outcome of the estimation.

‡ By the relevant region we mean that region in which the likelihood function is far from vanishing.

§ Author's opinion.
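The remark that a uniform density on $\phi$ implies a nonuniform density on $\phi^3$ is easy to check by simulation. The sketch below assumes, purely for illustration, that $\phi$ is uniform on the interval $(0, 1)$; then $\phi^3$ has density $\tfrac{1}{3}(\phi^3)^{-2/3}$ on the same interval, so the probability that $\phi^3 \le 1/2$ should be $(1/2)^{1/3} \approx 0.794$ rather than $1/2$.

```python
import random


def fraction_below(samples, threshold):
    """Fraction of the samples that do not exceed the threshold."""
    return sum(1 for x in samples if x <= threshold) / len(samples)


rng = random.Random(1)
# Draws from a uniform prior on (0, 1); the interval is an assumption.
phi = [rng.random() for _ in range(100_000)]
# The same draws expressed in the alternative parametrization phi**3.
phi_cubed = [x ** 3 for x in phi]

frac_phi = fraction_below(phi, 0.5)          # near 0.5 for a uniform density
frac_cubed = fraction_below(phi_cubed, 0.5)  # near 0.5 ** (1 / 3) instead
print(frac_phi, frac_cubed)
```

A prior that is "flat" in one parametrization therefore favors small values in the other, which is the logical difficulty noted in the text.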
Alternative procedures are available. Raiffa and Schlaifer (1961) introduce the concept of the conjugate distribution. This is a distribution which has the same mathematical form as the likelihood function derived from some hypothetical sample. A suitable noninformative prior distribution can sometimes be obtained by finding the function to which the conjugate distribution tends as the sample size (now assumed a continuous variable) tends to zero. Another suggestion, due to Jeffreys (1961), is to use $1/\phi$ as the noninformative prior density for nonnegative variables. The justification given is that this distribution is unaffected when $\phi$ is replaced by $\phi^n$.

(b) The informative case occurs when we do prefer some values of $\boldsymbol{\phi}$ to others within the relevant region. Generally the precise form of the prior distribution is immaterial as long as it has approximately the right shape. The method of conjugate distributions can be used here too; the scientist postulates what seems to him a likely data set, and the likelihood function corresponding to it is used as a prior density. The advantage this method offers is that the prior density, the likelihood function, and the posterior density all have the same mathematical form, which sometimes simplifies formal manipulations. In the case of nonlinear models, however, all these functions are complicated. Numerical evaluation replaces formal manipulation, and this method no longer offers any advantage. One is probably better off assuming a normal or other simple prior density with suitably chosen means and variances. Another approach is to attempt the graphical construction of a suitable pdf or cumulative distribution function. Winkler (1967) has carried out experiments in which students were made to construct prior densities to represent their beliefs concerning the values of certain parameters, using several of the above mentioned methods.
The feasibility of constructing such functions was demonstrated, but the question of whether they would be of practical value to the parameter estimator remains unanswered.

In practice, our prior information on a parameter $\theta$ often takes the form of a value $\theta_0 \pm \alpha$ reported in the literature. The number $\alpha$ may be a standard deviation, in which case we assign the prior density $N_1(\theta_0, \alpha^2)$; or, $\pm\alpha$ may represent absolute bounds on the deviation from $\theta_0$, so that the uniform distribution over the interval $\theta_0 - \alpha \le \theta \le \theta_0 + \alpha$ is appropriate. In both cases, the chosen distribution is the least informative one among all distributions satisfying the given conditions.

(c) It may happen that $\boldsymbol{\phi}$ is truly a random variable with pdf $p(\boldsymbol{\phi})$ (as in case (b), Section 2-17), yet the function $p(\boldsymbol{\phi})$ is not known. The empirical Bayes' method of Robbins (1955, 1964) (see also Neyman, 1962) provides an approach to the estimation of $p(\boldsymbol{\phi})$ on the basis of available data.
2-19. Bayes' Theorem

We have summarized the information contained in the data by means of the likelihood function $L$, and the prior information by means of the prior density $p_0(\boldsymbol{\phi})$. We combine the two in the so-called posterior density which is proportional to their product

$$
p^*(\boldsymbol{\phi}) = c\,L(\boldsymbol{\phi})\,p_0(\boldsymbol{\phi})
\tag{2-19-1}
$$

with

$$
c = \left[\int L(\boldsymbol{\phi})\,p_0(\boldsymbol{\phi})\, d\boldsymbol{\phi}\right]^{-1}
\tag{2-19-2}
$$

provided the integral exists.† If $p_0(\boldsymbol{\phi})$ is regarded as the probability density ascribed to $\boldsymbol{\phi}$ before the experiments were performed, then $p^*(\boldsymbol{\phi})$ is the density we must ascribe to $\boldsymbol{\phi}$ after the data were obtained. This follows from Bayes' (Bayes, 1763) theorem, which may be stated as follows:

Bayes' Theorem. Let $A$ and $B$ be two events whose probabilities of occurrence are $P(A)$ and $P(B) \ne 0$ respectively. Let $P(A \mid B)$ denote the conditional probability that $A$ occurs, given that $B$ has occurred, and let $P(B \mid A)$ be defined analogously. Then

$$
P(A \mid B) = P(B \mid A)\,P(A)/P(B)
\tag{2-19-3}
$$

Proof: The proof follows immediately from the definition of conditional probability

$$
P(A \,\&\, B) = P(A \mid B)\,P(B)
\tag{2-19-4}
$$

where $P(A \,\&\, B)$ is the probability of $A$ and $B$ both occurring. But also

$$
P(A \,\&\, B) = P(B \mid A)\,P(A)
\tag{2-19-5}
$$

Dividing Eq. (5) by Eq. (4) and solving for $P(A \mid B)$ yields Eq. (3) directly.

In our case we define $A$ to be the event "the true value of $\boldsymbol{\phi}$ is within a hypercube of volume‡ $d\boldsymbol{\phi}$ centered at $\boldsymbol{\phi}_0$" and $B$ to be the event "the true value of $\boldsymbol{\Omega}$ is within a hypercube of volume $d\boldsymbol{\Omega}$ centered at $\mathbf{W}$."

† In some applications the value of $c$ is immaterial, and we can proceed even when the integral does not exist. See Section 4-15.

‡ We use the notation $d\boldsymbol{\phi}$ as a shorthand for $d\phi_1\, d\phi_2 \cdots d\phi_l$ and $d\boldsymbol{\Omega}$ for $d\omega_{11}\, d\omega_{12} \cdots d\omega_{1r}\, d\omega_{21} \cdots d\omega_{nr}$.
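For a single parameter, Eqs. (2-19-1) and (2-19-2) can be evaluated numerically on a grid. The sketch below is an illustration under stated assumptions: the data are repeated measurements of one quantity with normal errors of known standard deviation, the prior is uniform over positive values and zero elsewhere (a physically impossible region), and the integral in Eq. (2-19-2) is approximated by the trapezoidal rule. The data values and all names are hypothetical.

```python
import math


def likelihood(theta, data, sigma):
    """Normal likelihood up to a constant factor, which cancels in c."""
    return math.exp(-sum((w - theta) ** 2 for w in data) / (2.0 * sigma ** 2))


def prior(theta):
    """Unnormalized prior: uniform over positive values, zero for excluded ones."""
    return 1.0 if theta > 0.0 else 0.0


def posterior_on_grid(grid, data, sigma):
    """Return (posterior values on the grid, normalizing constant c)."""
    unnorm = [likelihood(t, data, sigma) * prior(t) for t in grid]
    h = grid[1] - grid[0]
    integral = h * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))  # trapezoidal rule
    c = 1.0 / integral
    return [c * u for u in unnorm], c


DATA = [0.3, -0.1, 0.4]                       # hypothetical measurements
GRID = [-2.0 + 0.01 * k for k in range(501)]  # theta from -2 to 3
POSTERIOR, C = posterior_on_grid(GRID, DATA, sigma=0.5)

mode = GRID[max(range(len(GRID)), key=lambda k: POSTERIOR[k])]
print(mode)
```

The excluded values keep zero posterior density whatever the data say, while inside the admissible region the posterior has the shape of the likelihood, as expected for a uniform prior.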
By the definitions of the pdf, the prior distribution, and the likelihood function, we have

$$
P(A) = p_0(\boldsymbol{\phi}_0)\, d\boldsymbol{\phi}
\tag{2-19-6}
$$

$$
P(B \mid A) = L(\boldsymbol{\phi}_0)\, d\boldsymbol{\Omega}
\tag{2-19-7}
$$

The value of $P(B)$ is obtained by summing $P(A \,\&\, B) = P(B \mid A)\,P(A)$ over all possible $A$'s, i.e.,

$$
P(B) = \left[\int L(\boldsymbol{\phi})\,p_0(\boldsymbol{\phi})\, d\boldsymbol{\phi}\right] d\boldsymbol{\Omega}
\tag{2-19-8}
$$

Substitution in Eq. (3) yields

$$
P(A \mid B) = L(\boldsymbol{\phi}_0)\,p_0(\boldsymbol{\phi}_0)\, d\boldsymbol{\phi} \Big/ \int L(\boldsymbol{\phi})\,p_0(\boldsymbol{\phi})\, d\boldsymbol{\phi}
\tag{2-19-9}
$$

But $P(A \mid B)$ is the probability of $A$ occurring given that the experiments yielded the data $\mathbf{W}$. By definition, then

$$
P(A \mid B) = p^*(\boldsymbol{\phi}_0)\, d\boldsymbol{\phi}
\tag{2-19-10}
$$

from which Eq. (1) follows immediately.

Note that $p^*(\boldsymbol{\phi})$ is a meaningful pdf only if $p_0(\boldsymbol{\phi})$ was one. The frequentist who in a given case does not accept a prior distribution will not accept a posterior distribution either.

When the prior distribution is uniform, the posterior density is proportional to the likelihood function. If the results of two series of experiments are statistically independent, the joint likelihood function is the product of the two individual likelihoods. Formally it is then possible to regard the likelihood from the first series as the prior for the second series, and then the posterior for the second series equals (except for a constant factor) the joint likelihood function. This is the basis for the assertion made under case (c) in Section 2-17.

The posterior distribution (equal to the likelihood in the absence of prior information), combined with any constraints that may be applicable, embodies all four elements that enter into the parameter estimation problem, namely the model, the data, the probability distribution of the errors, and the prior information on the parameters. Formulating a parameter estimation problem is in many cases equivalent to writing down the posterior distribution.

2-20. Problems

1. Write down explicit expressions for all the likelihood functions appearing in Section 2-14, assuming that all error distributions are normal with zero means and given variances.
2. The gamma distribution with parameters $\alpha$ and $\nu$ has the pdf

$$
\gamma_{\alpha,\nu}(\theta) =
\begin{cases}
\Gamma^{-1}(\nu)\,\alpha^{\nu}\theta^{\nu-1}e^{-\alpha\theta} & (\theta \ge 0) \\
0 & (\theta < 0)
\end{cases}
$$

where

$$
\Gamma(\nu) \equiv \int_0^\infty x^{\nu-1}e^{-x}\, dx
$$

is the gamma function. Show that $E(\theta) = \nu/\alpha$ and $V(\theta) = \nu/\alpha^2$.

3. Suppose an object is measured $n$ times to determine its length $\theta$. The measurements are denoted $w_\mu$ ($\mu = 1, 2, \ldots, n$). The model equation is $w_\mu = \theta$. Assume the errors are distributed as $N_1(0, \sigma^2)$. Suppose $\sigma$ is known and $\theta$ is assigned the prior distribution $N_1(\theta_0, \sigma_0^2)$. Write down the likelihood and the posterior pdf $p^*(\theta)$. Show that $p^*(\theta)$ is normal, and find its mean and variance.

4. As in the previous problem, but assume $\theta$ is known and $\sigma$ is to be estimated. Let $\tau \equiv 1/\sigma^2$ and assume that $\tau$ has the prior distribution $\gamma_{\alpha_0,\nu_0}$. Show that the posterior distribution of $\tau$ is also gamma, and find its parameters.

5. Investigate the shape of the gamma distribution for various ranges of its parameter values. Under what circumstances is the gamma distribution a suitable prior for a parameter?

6. Show that the examples of Problems 3 and 4 contain conjugate distributions. Show how the parameters of the prior distribution are related to the sizes of hypothetical samples. Investigate the behavior of the prior distribution as the hypothetical sample size is reduced to zero.
Chapter III

Estimators and Their Properties

A. Statistical Properties

3-1. The Sampling Distribution

A point estimation method is a procedure which enables one to compute an estimate $\boldsymbol{\phi}^*$ for the parameter vector $\boldsymbol{\phi}$, given the data matrix $\mathbf{W}$. The estimation method defines (at least implicitly) a vector valued function $\mathbf{h}$

$$
\boldsymbol{\phi}^* = \mathbf{h}(\mathbf{W})
\tag{3-1-1}
$$

We shall use the expression "the estimator $\mathbf{h}$" to mean "the estimation procedure defined by the function $\mathbf{h}$."

If the experiments which yielded our data were to be repeated, we would obtain different values of the $\mathbf{W}$, i.e., different realizations of the random variables $\boldsymbol{\Omega}$. Application of the estimator $\mathbf{h}$ to the new data would yield different values of $\boldsymbol{\phi}^*$. We see then that the estimates $\boldsymbol{\phi}^*$ are themselves random variables, possessing a certain probability distribution, which depends both on the nature of $\mathbf{h}$ and on the distribution of the $\boldsymbol{\Omega}$. We refer to this distribution as the sampling distribution of the estimator $\mathbf{h}$, and denote its pdf (if one exists) $p_h(\boldsymbol{\phi}^*)$.

Note the fundamental difference between the sampling distribution and the posterior distribution. The sampling distribution refers to the estimate $\boldsymbol{\phi}^*$, which is truly a random variable. The sampling distribution is defined only once the estimation procedure is defined; different estimation procedures when applied to the same data generally give rise to different sampling distributions. On the other hand, the posterior distribution is independent of the estimation procedure. It applies to the true values $\boldsymbol{\phi}$, and its interpretation in cases where these are not random is, therefore, controversial.

A glance at Eq. (1) reveals that the sampling distribution of $\boldsymbol{\phi}^*$ depends on the actual distribution of $\boldsymbol{\Omega}$. This distribution, however, depends on the true values $\boldsymbol{\phi}$, which are generally unknown (or we would not be trying to estimate
them). Therefore, even when we can derive a formula for the sampling distribution, we can evaluate only the approximation obtained by substituting the estimated parameter values for the true ones. Still, one can frequently deduce some important properties of the distribution, as shown by the following example:

Suppose we measure an object $n$ times to determine its length $\theta$. The measurements will be denoted by $w_\mu$ ($\mu = 1, 2, \ldots, n$). The model equations take the form

$$
w_\mu = \theta
\tag{3-1-2}
$$

Assume the measurements to be independent, and normally distributed with variance $\sigma^2$ and mean $\theta$. Consider the estimator

$$
\theta^* = (1/n) \sum_{\mu=1}^{n} w_\mu
\tag{3-1-3}
$$

i.e., $\theta^*$ is the mean of the observations. It is well known that $\theta^*$ is normally distributed with mean $\theta$ and variance $\sigma^2/n$. Thus we have found that the mean of the sampling distribution is equal to the true value $\theta$, and its variance decreases as $1/n$ when the sample size is increased.

The above example introduced the concepts of the mean and variance of the sampling distribution, also referred to as the mean and variance of the estimate. In the general case, the mean $\bar{\boldsymbol{\phi}}$ and covariance matrix $\mathbf{V}_\phi$ of the estimate are given by:

$$
\bar{\boldsymbol{\phi}} \equiv E(\boldsymbol{\phi}^*) = \int \boldsymbol{\phi}^* p_h(\boldsymbol{\phi}^*)\, d\boldsymbol{\phi}^* = \int \mathbf{h}(\mathbf{W})\, p(\mathbf{W} \mid \boldsymbol{\phi})\, d\mathbf{W}
\tag{3-1-4}
$$

$$
\mathbf{V}_\phi \equiv E(\boldsymbol{\phi}^* - \bar{\boldsymbol{\phi}})(\boldsymbol{\phi}^* - \bar{\boldsymbol{\phi}})^T = \int (\mathbf{h}(\mathbf{W}) - \bar{\boldsymbol{\phi}})(\mathbf{h}(\mathbf{W}) - \bar{\boldsymbol{\phi}})^T p(\mathbf{W} \mid \boldsymbol{\phi})\, d\mathbf{W}
\tag{3-1-5}
$$

3-2. Properties of the Sampling Distribution

It is clearly desirable to have an estimator whose sampling distribution is concentrated in the neighborhood of the true values of the parameters. More formally, we define the following properties of estimators:

(a) The bias of an estimator is the difference between the expected value of the estimator and the true value of the parameter, i.e., $\mathbf{b} \equiv \bar{\boldsymbol{\phi}} - \boldsymbol{\phi}$. An estimator is unbiased if its bias vanishes, i.e., if $\bar{\boldsymbol{\phi}} = \boldsymbol{\phi}$. In the example of Section 3-1 we saw that the estimate Eq. (3-1-3) was unbiased.
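The sampling distribution of the estimator Eq. (3-1-3) can be approximated by repeating the hypothetical experiment many times on a computer. The following sketch assumes illustrative values for $\theta$, $\sigma$, and $n$ (they are not taken from the text) and compares the mean and variance of the simulated estimates with the values $\theta$ and $\sigma^2/n$ quoted above.

```python
import random
import statistics


def sample_mean_estimates(theta, sigma, n, replications, seed=0):
    """Apply the estimator of Eq. (3-1-3) to many simulated data sets."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        for _ in range(replications)
    ]


# Hypothetical true length, error standard deviation, and sample size.
THETA, SIGMA, N = 10.0, 2.0, 25
estimates = sample_mean_estimates(THETA, SIGMA, N, replications=20_000)

mc_mean = statistics.fmean(estimates)      # should be close to THETA
mc_var = statistics.pvariance(estimates)   # should be close to SIGMA**2 / N
print(mc_mean, mc_var, SIGMA ** 2 / N)
```

The same procedure applies to estimators whose sampling distribution cannot be derived analytically, which is the usual situation for nonlinear models.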
Clearly, we desire estimators with small (in absolute value) bias, but total unbiasedness is mostly unobtainable. Nor is unbiasedness particularly important, since the bias is not the only error in any given estimator. Furthermore, if an estimate is unbiased 
for some parameter $\phi$, it is generally biased for nontrivial functions of $\phi$. That is, even if $E(\phi^*) = \phi$, it need not be true that, say, $E(\phi^{*2}) = \phi^2$. Thus, the presence or absence of bias in an estimator is affected by a change in parametrization.

(b) While the bias is a measure of the "systematic" error in an estimator, the variance measures its random error. The Rao-Cramér theorem (see Appendix C) establishes a theoretical lower bound on the attainable covariance matrix $\mathbf{V}_\phi$ of an estimator.

We see from Eq. (3-1-4) that $\bar{\boldsymbol{\phi}}$ is a function of $\boldsymbol{\phi}$. If this function is differentiable, then we can form the matrix

$$
\mathbf{P} \equiv \partial \bar{\boldsymbol{\phi}}/\partial \boldsymbol{\phi} = \int \mathbf{h}(\mathbf{W})(\partial p/\partial \boldsymbol{\phi})^T\, d\mathbf{W}
\tag{3-2-1}
$$

We also define a matrix $\mathbf{R}$ by

$$
\mathbf{R} \equiv E\left((\partial \log p/\partial \boldsymbol{\phi})(\partial \log p/\partial \boldsymbol{\phi})^T\right) = \int (\partial \log p/\partial \boldsymbol{\phi})(\partial \log p/\partial \boldsymbol{\phi})^T p\, d\mathbf{W}
\tag{3-2-2}
$$

The theorem asserts that the matrix $\mathbf{V}_\phi - \mathbf{P}\mathbf{R}^{-1}\mathbf{P}^T$ is positive semidefinite, and that it is null if and only if there exists a matrix $\mathbf{A}$ (whose elements may be functions of $\boldsymbol{\phi}$) such that

$$
\boldsymbol{\phi}^* - \bar{\boldsymbol{\phi}} = \mathbf{A}(\boldsymbol{\phi})\, \partial \log p/\partial \boldsymbol{\phi}
\tag{3-2-3}
$$

The proof is given in Appendix C. The matrix $\mathbf{P}\mathbf{R}^{-1}\mathbf{P}^T$ is called the minimum variance bound (MVB). Since the diagonal elements of a positive semidefinite matrix must be nonnegative, we have for the variance of each $\phi_\alpha^*$

$$
V_{\phi,\alpha\alpha} \ge \sum_{\beta,\gamma} P_{\alpha\beta}\,[\mathbf{R}^{-1}]_{\beta\gamma}\, P_{\alpha\gamma}
\tag{3-2-4}
$$

with equality holding only when Eq. (3) is satisfied. An estimate is called efficient† if its variance is the lowest theoretically attainable, i.e., when

$$
\mathbf{V}_\phi = \mathbf{P}\mathbf{R}^{-1}\mathbf{P}^T
\tag{3-2-5}
$$

When the estimate is unbiased, we have $\bar{\boldsymbol{\phi}} = \boldsymbol{\phi}$, and hence $\mathbf{P} = \mathbf{I}$. The Rao-Cramér theorem reduces to the statement that $\mathbf{V}_\phi - \mathbf{R}^{-1}$ is positive semidefinite, and an efficient unbiased estimate has the covariance

$$
\mathbf{V}_\phi = \mathbf{R}^{-1}
\tag{3-2-6}
$$

† Some authors use the term "MVB estimate" instead, and reserve the term "efficient" for what we call "asymptotically efficient" here.
In the example of Section 3-1 we had an unbiased estimate for the single parameter $\theta$. Its variance was found to be $V_\theta = \sigma^2/n$. The likelihood function is

$$
p(\mathbf{W}, \theta) = (2\pi)^{-n/2}\sigma^{-n} \exp\left[-(1/2\sigma^2) \sum_{\mu=1}^{n} (w_\mu - \theta)^2\right]
\tag{3-2-7}
$$

whence

$$
\partial \log p/\partial \theta = (1/\sigma^2) \sum_{\mu=1}^{n} (w_\mu - \theta)
\tag{3-2-8}
$$

and

$$
R = E(\partial \log p/\partial \theta)^2 = (1/\sigma^4)\,E\left\{\left[\sum_{\mu=1}^{n} (w_\mu - \theta)\right]^2\right\} = (1/\sigma^4)\,E\left[\sum_{\mu=1}^{n} \sum_{\eta=1}^{n} (w_\mu - \theta)(w_\eta - \theta)\right]
\tag{3-2-9}
$$

Now the variables $w_\mu$ ($\mu = 1, 2, \ldots, n$) were assumed independent with means $\theta$ and variances $\sigma^2$. Hence

$$
E(w_\mu - \theta)(w_\eta - \theta) = \sigma^2 \delta_{\mu\eta}
\tag{3-2-10}
$$

and

$$
R = (1/\sigma^4) \sum_{\mu=1}^{n} E(w_\mu - \theta)^2 = n\sigma^2/\sigma^4 = n/\sigma^2
\tag{3-2-11}
$$

In this case, then, $V_\theta = R^{-1}$, and the estimate is efficient. We could have deduced this fact also from Eq. (3-1-3), which yields

$$
\theta^* - \theta = (1/n) \sum_{\mu=1}^{n} w_\mu - \theta = (1/n) \sum_{\mu=1}^{n} (w_\mu - \theta) = (\sigma^2/n)\, \partial \log p/\partial \theta
\tag{3-2-12}
$$

so that Eq. (3) is satisfied with $A = \sigma^2/n$.

As with unbiasedness, efficiency can be attained only in a small class of relatively simple models. It should be pointed out that in some cases, although no efficient estimate exists, among those estimates that do exist or among estimates of a certain class there may be one whose variance is least. For instance, if all errors in a linear model are identically and independently distributed, then least squares estimates have, among all linear unbiased estimates, the smallest variance. Yet, they are not necessarily efficient.

(c) While unbiased and efficient estimates cannot generally be found for samples of finite size, the situation changes drastically when the number of experiments increases beyond bound. Under this condition, the actual value of many estimates converges with probability one to the true parameter values. We refer to such estimates as consistent or asymptotically unbiased. The variances of most of the estimates that we shall deal with approach zero as $1/n$ when $n$ increases. Hence, if the estimator is consistent for $\phi$ it is consistent for any well-behaved function of $\phi$. Thus consistency is a more significant concept than unbiasedness.

(d) Both $\mathbf{V}_\phi$ and $\mathbf{R}^{-1}$ tend to zero as $1/n$ for most relevant estimators. Hence we call a consistent estimator asymptotically efficient if with probability one

$$
\lim_{n \to \infty} n(\mathbf{V}_\phi - \mathbf{R}^{-1}) = \mathbf{0}
\tag{3-2-13}
$$

(e) Our discussion of estimation criteria would not be complete without the mention of sufficient statistics. A statistic $\boldsymbol{\rho}$ of a sample is any function computed from the values of the sample for the purpose of extracting relevant information

$$
\boldsymbol{\rho} = \mathbf{r}(\mathbf{W})
\tag{3-2-14}
$$

In particular, any estimate $\boldsymbol{\phi}^*$ is a statistic defined by Eq. (3-1-1). A statistic $\boldsymbol{\rho}$ is deemed sufficient for the parameters $\boldsymbol{\phi}$ if the value of $\boldsymbol{\rho}$ conveys as much information concerning the value of $\boldsymbol{\phi}$ as did the original sample $\mathbf{W}$. In other words, we may compute the value of $\boldsymbol{\rho}$ from the sample, and then discard the data $\mathbf{W}$ without losing any information relevant to the estimation of $\boldsymbol{\phi}$. This statement should be interpreted as follows. A sample can contain information concerning the value of $\boldsymbol{\phi}$ only if the distribution of the sample is a function of $\boldsymbol{\phi}$. To say that $\boldsymbol{\rho}$ is a sufficient statistic for $\boldsymbol{\phi}$ implies, then, that once the value of $\boldsymbol{\rho}$ is determined, the distribution of the sample can be represented in terms of $\boldsymbol{\rho}$ alone, with no further dependence on $\boldsymbol{\phi}$.
Therefore, there must exist a function $q(W, \rho)$ such that

$$p(W \mid \phi, \rho) = q(W, \rho) \tag{3-2-15}$$

But we have, from the definition of conditional probability

$$p(W \mid \phi) = p(W \mid \phi, \rho)\,p(\rho \mid \phi) \tag{3-2-16}$$

Letting $p(\rho \mid \phi) \equiv s(\rho, \phi)$ (this is the sampling distribution of the statistic $\rho$) we are led to the factorization theorem for sufficient statistics $\rho$

$$p(W \mid \phi) = q(W, \rho)\,s(\rho, \phi) \tag{3-2-17}$$

Taking logarithms and differentiating both sides with respect to $\phi$ we obtain

$$\partial \log p(W \mid \phi)/\partial\phi = \partial \log s(\rho, \phi)/\partial\phi \equiv t(\rho, \phi) \tag{3-2-18}$$

In the example of Section 3-1, we have

$$p(W \mid \theta) = (2\pi)^{-n/2}\sigma^{-n}\exp\Big[-(1/2\sigma^2)\Big(\sum_{\mu=1}^{n}w_\mu^2 - 2\theta\sum_{\mu=1}^{n}w_\mu + n\theta^2\Big)\Big]$$

$$= \Big\{(2\pi)^{-n/2}\sigma^{-n}\exp\Big[-(1/2\sigma^2)\sum_{\mu=1}^{n}w_\mu^2\Big]\Big\}\Big\{\exp\Big[(1/2\sigma^2)\Big(2\theta\sum_{\mu=1}^{n}w_\mu - n\theta^2\Big)\Big]\Big\} \tag{3-2-19}$$
which is in the form Eq. (17) with $\rho = \sum_{\mu=1}^{n} w_\mu$. Thus, the sum of the observations is a sufficient statistic in this case.

An estimate which is a function of a sufficient statistic is a sufficient estimate. In our example, $\theta^* = (1/n)\rho$, hence $\theta^*$ is a sufficient estimate for $\theta$.

Supposing $\phi^*$ is an efficient estimate of $\phi$, then from Eq. (3) we find that

$$\partial \log p/\partial\phi = A^{-1}(\phi)(\phi^* - \phi) \tag{3-2-20}$$

Comparison of Eq. (20) with Eq. (18) shows that $\rho = \phi^*$ is sufficient. Thus we have proved that if an efficient estimate exists, it must also be sufficient. Conversely, if a sufficient estimate exists, some function of it is an efficient estimate of some function of the parameters.

(f) The value of an estimate usually depends on the form of the probability distribution that we assume. We rarely possess exact knowledge of the distribution, and must usually content ourselves with a rough approximation. We desire, therefore, that our estimate be robust, i.e., that it be only slightly affected by seemingly unimportant changes in the form of the assumed distribution.

(g) The choice of parameters appearing in a model is often arbitrary. We may replace the original parameters $\phi$ with a different set of parameters $\psi$ which are single valued functions of the $\phi$, e.g., $\psi = s(\phi)$ where $s$ is a vector of functions. It is desirable that our estimators be invariant under reparametrization. That is, if $\phi^*$ and $\psi^*$ are the estimates obtained when the model is represented respectively in terms of $\phi$ and $\psi$, then we expect to find that $\psi^* = s(\phi^*)$.

(h) An estimation procedure which cannot be implemented on available computing machinery is of little use. An estimate which is readily computable is, from a practical point of view, more valuable than a statistically more efficient estimate which is computable only with an excessive amount of labor.
In other words, an adherent of decision theory should make his cost function depend not only on the error of the estimate, but also on the cost of computing the estimate.

(i) A linear estimate is one which is a linear function of the data, i.e., $\phi^*$ is a linear estimate if there exists a matrix $A$ (which may be a function of the $x_\mu$) such that

$$\phi^* = AW \tag{3-2-21}$$

Useful linear estimates valid over a wide range of data values can be found only when the model itself is linear in the parameters.

Among the properties that we have defined, the most important in practice are small bias, small variance, robustness, and computability.
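Two of the statements made in items (a) and (b) can be checked numerically for the sample-mean example of Eqs. (3-2-7) to (3-2-12). The sketch below is only an illustration under assumed values ($\theta = 2$, $\sigma = 1.5$, $n = 10$, and the function name `simulate_sample_means` are our own choices, not taken from the text). It draws many samples, and compares the variance of $\theta^*$ with the bound $\sigma^2/n$ and the mean of $\theta^{*2}$ with $\theta^2$.

```python
import random
import statistics

def simulate_sample_means(theta, sigma, n, replications, seed=0):
    """Return the sample mean theta* of n normal observations, once per replication."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(theta, sigma) for _ in range(n))
        for _ in range(replications)
    ]

# Illustrative "true" values (assumptions made for this sketch only).
theta, sigma, n = 2.0, 1.5, 10
means = simulate_sample_means(theta, sigma, n, replications=20000)

mvb = sigma ** 2 / n                          # R^{-1} = sigma^2 / n, Eq. (3-2-11)
mean_of_mean = statistics.fmean(means)        # should be close to theta (unbiased)
var_of_mean = statistics.variance(means)      # should be close to the MVB (efficient)
mean_of_square = statistics.fmean(m * m for m in means)  # biased for theta^2

print(f"mean of theta*     : {mean_of_mean:.4f}  (theta = {theta})")
print(f"variance of theta* : {var_of_mean:.4f}  (MVB = {mvb:.4f})")
print(f"mean of theta*^2   : {mean_of_square:.4f}  (theta^2 = {theta ** 2:.4f})")
```

In this example the mean of $\theta^{*2}$ exceeds $\theta^2$ by about $\sigma^2/n$, which illustrates how an estimator that is unbiased for $\theta$ becomes biased for a nonlinear function of $\theta$.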
A true measure of the accuracy and precision of an estimate is given by the mean square error relative to the true, rather than mean, value. It is easily verified that

$$E(\phi^* - \hat{\phi})(\phi^* - \hat{\phi})^T = E(\phi^* - \bar{\phi})(\phi^* - \bar{\phi})^T + (\bar{\phi} - \hat{\phi})(\bar{\phi} - \hat{\phi})^T = V_\phi + bb^T \tag{3-2-22}$$

where $b$ is the bias. The root-mean-square error in the estimate $\phi_i^*$ of the $i$th parameter is given, therefore, by $(\sigma_i^2 + b_i^2)^{1/2}$, where $\sigma_i \equiv V_{\phi,ii}^{1/2}$ is the standard deviation of the estimate $\phi_i^*$. This points out the fact that we gain little in laboring to make a highly biased estimate (large $b_i$) very efficient (small $\sigma_i$), or to eliminate bias in an inefficient estimate.

Ideally, we would like to derive formulas for estimators having the desired properties. This, unfortunately, is not possible except in simple cases such as with linear models (see Appendix E). The best that can be done in practice is to propose reasonable estimators, and to test their properties as described in the next section.

3-3. Evaluation of Statistical Properties

Given an estimation procedure $h(W)$, how do we determine its statistical properties? How do we determine whether its bias and variance are within acceptable limits? How can we compare its performance to that of other estimators? How can we assess robustness? Several approaches to the answering of these questions suggest themselves:

(a) Theoretical Analysis. Once an estimation procedure is properly defined, it may be possible to derive the precise sampling distribution, or at least some of its relevant properties such as the mean and variance. In practice, such an analysis can be carried out only for linear models (see Appendix E), or for the asymptotic distribution when the sample size is increased beyond bound. This unfortunately leaves open the most common situation, i.e., a nonlinear model with moderate sample size.

(b) Replication.
If we repeat the whole series of experiments many times, and apply our estimator to each data set in turn, our estimates will form a large sample drawn from the sampling distribution. This sample can be used to estimate the mean, variance, and other properties of the sampling distribution. This procedure possesses some serious drawbacks. First, it is expensive, inasmuch as a very large number of experiments must be performed each time the estimator is applied to a new model. Second, although we may find the mean of the sampling distribution, we cannot determine the bias unless
the true values of the parameters are known. Therefore, we must first carry out a series of experiments on a system whose parameters are known. Such a system is not always available.

(c) Computer Simulation (Monte Carlo Method). The objections to the replication method disappear if the experiments are not carried out on a real physical system, but are simulated on a computer instead. To simulate a series of experiments on the computer we proceed as follows:

1. Define the system by prescribing the model equations, the probability distribution of the errors, and, where applicable, a prior distribution. Assign "true" values $\hat{\phi}$ to all the parameters $\phi$.

2. Assign "true" values $\hat{z}_\mu$ to the variables in the $\mu$th experiment ($\mu = 1, 2, \ldots, n$). Choose these values so that the model equations $g(\hat{z}_\mu, \hat{\theta}) = 0$ are satisfied. One way of doing this is to choose a set of independent variables, assign arbitrary values to these, and then solve the equations for the values of the remaining variables. The task is most easily performed if the equations are in reduced form.

3. Use the computer to produce a set of errors $\epsilon_\mu$ drawn from the prescribed probability distribution. For most computers there are available routines which generate a stream of numbers having the appearance of being random numbers uniformly distributed in the interval zero to one. They are referred to as pseudorandom numbers. From these, suitable transformations may be used to obtain samples from any other desired distribution. In Appendix D we show how a sample from a multivariate normal distribution may be obtained. For a more general treatment the reader is referred to the literature on Monte Carlo methods, e.g., Hammersley and Handscomb (1964). Having generated the values of the errors $\epsilon_\mu$, we add them to the $\hat{w}_\mu$ (previously generated among the $\hat{z}_\mu$) to obtain the actual "data" $w_\mu$.

4. The estimation procedure is now applied to the data generated by the computer as though they were obtained in real experiments. This yields an estimate $\phi^*$ for the parameters.

5. Replicate the series of experiments as many times as we please by repeating steps 3 and 4, each time with a new sample of errors.

6. The relevant properties of the sampling distribution are obtained by averaging over all replications. Let $\phi_i^*$ be the estimate of $\phi$ obtained in the $i$th replication of the series of experiments, and let $N$ be the total number of replications. Then, we estimate the mean of the sampling distribution as

$$\bar{\phi}^* = (1/N)\sum_{i=1}^{N}\phi_i^* \tag{3-3-1}$$

and its covariance matrix as

$$V_\phi^* = [1/(N-1)]\sum_{i=1}^{N}(\phi_i^* - \bar{\phi}^*)(\phi_i^* - \bar{\phi}^*)^T \tag{3-3-2}$$
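The six steps above can be sketched in a few lines of code. The following is a minimal illustration, not a prescribed implementation: the exponential model, the "true" value 0.7, the error standard deviation 0.05, the design points, and the golden-section least squares estimator are all assumptions chosen for this sketch. With a single parameter, the covariance matrix of Eq. (3-3-2) reduces to a scalar variance, and the bias is estimated as the replication mean minus the assumed true value.

```python
import math
import random
import statistics

def model(x, theta):
    """Reduced model y = f(x, theta); an exponential decay assumed for illustration."""
    return math.exp(-theta * x)

def least_squares_estimate(xs, ys, lo=0.0, hi=3.0, iterations=60):
    """Minimise the sum of squared residuals over [lo, hi] by golden-section search."""
    def ssq(theta):
        return sum((y - model(x, theta)) ** 2 for x, y in zip(xs, ys))
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iterations):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if ssq(c) < ssq(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

def monte_carlo(theta_true, xs, sigma, replications, seed=0):
    rng = random.Random(seed)
    y_true = [model(x, theta_true) for x in xs]            # step 2: "true" values
    estimates = []
    for _ in range(replications):                           # step 5: replicate
        ys = [y + rng.gauss(0.0, sigma) for y in y_true]    # step 3: add errors
        estimates.append(least_squares_estimate(xs, ys))    # step 4: estimate
    mean = statistics.fmean(estimates)                      # Eq. (3-3-1)
    variance = statistics.variance(estimates)               # Eq. (3-3-2), scalar case
    bias = mean - theta_true                                # mean minus assumed true value
    return mean, variance, bias

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]                         # step 1: define the system
mean, variance, bias = monte_carlo(theta_true=0.7, xs=xs, sigma=0.05, replications=2000)
print(f"mean = {mean:.4f}, variance = {variance:.6f}, bias = {bias:+.4f}")
```

Robustness can be probed with the same skeleton by drawing the errors in step 3 from a distribution other than the one the estimator assumes.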
Formulas (3-3-1) and (3-3-2) apply, of course, also when the experiments are real, not simulated. The bias $b$ of the estimator is then estimated as

$$b \approx \bar{\phi}^* - \hat{\phi} \tag{3-3-3}$$

The flexibility of the simulation method is endless. We can estimate the properties of the sampling distribution for any model, and for any values of the parameters within the model. We may examine the effects of errors in the formulation of the model, by deliberately using a slightly different model in the estimator than was used in the data generator. When the errors come from a distribution that is not the same as the one assumed in the estimation procedure, we obtain a measure of the robustness of the estimator. We can also compare the true sampling distributions to theoretically derived approximations (see Chapter VII). All this can be done on the modern computer at a small fraction of the cost, in time and money, of a comparable set of physical experiments.

The results of an actual Monte Carlo study appear in Section 7-22.

B. Mathematical Properties

3-4. Optimization

In most parameter estimation methods we proceed in two stages:

(a) Define a function $\Phi(\phi)$ which is a suitable measure of the departure of the data from the model, i.e., of the "lack of fit". We refer to this function as the objective function. For example, in the least squares method, the objective function is the sum of squares of residuals.

(b) Seek those values $\phi^*$ of the parameters $\phi$ at which the objective function attains its minimum or maximum, as appropriate. We accept $\phi^*$ as our estimate for $\phi$. The process of computing $\phi^*$ is called optimization.

Chapter IV is devoted to the realization of (a), i.e., to the definition of the objective functions. The remainder of this chapter is devoted to the analytic properties exhibited by the solutions to the optimization problem. Descriptions of the optimization process itself will be found in Chapters V and VI.
When the unknown parameters are free to assume any values whatsoever, we speak of unconstrained optimization. Sometimes, only parameter values satisfying certain constraints are permitted. We may have a vector of equality constraints

$$g(\phi) = 0 \tag{3-4-1}$$
and/or inequality constraints

$$h(\phi) \geq 0 \tag{3-4-2}$$

where $h \geq 0$ means that each component $h_i \geq 0$. The set of all values of $\phi$ satisfying all the constraints is called the feasible region. Feasible points satisfying Eq. (2) with strict inequality constitute the interior of the feasible region. A point satisfying $h_j(\phi) = 0$ for some $j$ is said to be on the $j$th constraint, and also on the boundary of the feasible region.

We examine conditions that characterize the minima‡ in the various cases that may arise. A point $\phi^*$ is said to be a local minimum of $\Phi(\phi)$ if in some neighborhood (e.g., sphere) around $\phi^*$ there is no feasible point $\phi^{**}$ such that $\Phi(\phi^{**}) < \Phi(\phi^*)$. A point is a global minimum if there is no feasible point $\phi^{**}$ such that $\Phi(\phi^{**}) < \Phi(\phi^*)$. Clearly, any global minimum is also a local minimum. Although we wish to find the global minimum, the conditions at our disposal usually characterize local minima, and there is generally no easy way to tell whether a given local minimum is the global minimum.

The problem of optimization, with or without constraints, is often referred to as the problem of mathematical programming. If both objective function and constraints are linear functions of the unknown parameters, we speak of linear programming. When constraints are present, but either they or the objective function are nonlinear, we speak of nonlinear programming.

In the sequel, we shall assume that the objective and all constraint functions are twice differentiable functions of the parameters.

3-5. Unconstrained Optimization

To characterize an unconstrained minimum we use the rules of elementary calculus. We state the results briefly. The following are necessary conditions for $\phi^*$ to be a minimum of $\Phi$:

N1. $\phi^*$ is a stationary point of $\Phi$, that is, the gradient of $\Phi$ vanishes at $\phi^*$

$$(\partial\Phi/\partial\phi)_{\phi=\phi^*} = 0 \tag{3-5-1}$$

N2. Let $H(\phi)$ be the Hessian matrix of $\Phi$, i.e., $H_{\alpha\beta} \equiv \partial^2\Phi/\partial\phi_\alpha\,\partial\phi_\beta$. Then $H(\phi^*)$ must be positive semidefinite, i.e., for any nonzero vector $y$ we must have

$$y^T H(\phi^*)\,y \geq 0 \tag{3-5-2}$$

‡ If a maximum of $\Phi$ is required, we seek the minimum of $-\Phi$ instead.
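Conditions N1 and N2 can be tested numerically at a candidate point. The sketch below is an illustration only: the two test functions (`bowl` and `saddle`), the finite-difference step sizes, and the tolerance are our own assumptions. It approximates the gradient and Hessian by central differences and, for two parameters, obtains the Hessian eigenvalues in closed form. The saddle shows a stationary point that satisfies N1 but fails N2.

```python
import math

def gradient(f, p, h=1e-5):
    """Central-difference approximation to the gradient of f at p."""
    g = []
    for i in range(len(p)):
        up = list(p); up[i] += h
        dn = list(p); dn[i] -= h
        g.append((f(up) - f(dn)) / (2.0 * h))
    return g

def hessian(f, p, h=1e-4):
    """Central-difference approximation to the Hessian of f at p."""
    n = len(p)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                q = list(p); q[i] += di; q[j] += dj
                return f(q)
            H[i][j] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return H

def symmetric_2x2_eigenvalues(H):
    """Eigenvalues (smallest first) of a symmetric 2 x 2 matrix."""
    a, b, d = H[0][0], 0.5 * (H[0][1] + H[1][0]), H[1][1]
    centre = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b ** 2)
    return centre - radius, centre + radius

def check_necessary_conditions(f, p, tol=1e-6):
    """Return (N1 holds, N2 holds) for a two-parameter function f at p."""
    n1 = all(abs(gi) < tol for gi in gradient(f, p))
    smallest, _ = symmetric_2x2_eigenvalues(hessian(f, p))
    n2 = smallest >= -tol
    return n1, n2

def bowl(p):      # minimum at (1, -0.5)
    return (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 0.5) ** 2

def saddle(p):    # stationary at the origin, but not a minimum
    return p[0] ** 2 - p[1] ** 2

print(check_necessary_conditions(bowl, [1.0, -0.5]))
print(check_necessary_conditions(saddle, [0.0, 0.0]))
```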
The following conditions are sufficient for $\phi^*$ to be a local minimum of $\Phi$:

S1. $\phi^*$ is a stationary point of $\Phi$.

S2. $H(\phi^*)$ is positive definite, i.e., Eq. (2) holds with strict inequality.

If $H(\phi)$ is positive definite for all $\phi$, and $\phi^*$ is stationary, then $\phi^*$ is the unique global minimum of $\Phi$. When $\phi^*$ satisfies N1 and N2 but not S2, it is impossible to determine whether $\phi^*$ is a local minimum without considering higher order derivatives.

Relations Eq. (1), regarded as equations in the unknown $\phi^*$, are called the normal equations. Condition N1 states that the minimum must be a solution of the normal equations. Any solution to the normal equations may be a local minimum only if it also satisfies N2, and it must be a local minimum if it satisfies S2.

3-6. Equality Constraints

Suppose $\Phi(\phi)$ is to be minimized subject to the equality constraints Eq. (3-4-1). If the solution is denoted $\phi^*$, then in the neighborhood of $\phi^*$ the function $\Phi(\phi)$ must be stationary for all variations in $\phi$ that stay within the constraints. Let $\delta\phi$ be such a variation, i.e.,

$$g(\phi^* + \delta\phi) = 0 \tag{3-6-1}$$

To a first-order approximation

$$g(\phi^* + \delta\phi) \approx g(\phi^*) + (\partial g/\partial\phi)\,\delta\phi \tag{3-6-2}$$

From Eq. (1) and Eq. (2) we conclude that

$$(\partial g/\partial\phi)\,\delta\phi = 0 \tag{3-6-3}$$

for all permissible variations $\delta\phi$. Expanding $\Phi(\phi)$ around $\phi^*$ we find, to a first-order approximation

$$\Phi(\phi^* + \delta\phi) \approx \Phi(\phi^*) + (\partial\Phi/\partial\phi)^T\,\delta\phi \tag{3-6-4}$$

Since $\Phi(\phi)$ is to be stationary, we must have

$$(\partial\Phi/\partial\phi)^T\,\delta\phi = 0 \tag{3-6-5}$$

for all $\delta\phi$ satisfying Eq. (3). This condition may be paraphrased as follows: The vector $\partial\Phi/\partial\phi$ must be orthogonal to all vectors $\delta\phi$ which are orthogonal to the rows of the matrix $\partial g/\partial\phi$. It follows that $\partial\Phi/\partial\phi$ must belong to the
subspace spanned by the rows of $\partial g/\partial\phi$, which in turn implies the existence of numbers $\lambda_1^*, \lambda_2^*, \ldots, \lambda_p^*$ such that

$$\partial\Phi/\partial\phi = \sum_{i=1}^{p}\lambda_i^*\,\partial g_i/\partial\phi \tag{3-6-6}$$

where $p$ is the number of constraints. The $\lambda_i^*$ are called Lagrange multipliers. Let us now construct the function

$$\Lambda(\phi, \lambda) \equiv \Phi(\phi) - \sum_{i=1}^{p}\lambda_i\,g_i(\phi) \tag{3-6-7}$$

which has a stationary point at $\phi = \phi^*$, $\lambda = \lambda^*$ if and only if

$$(\partial\Lambda/\partial\phi)_{\phi=\phi^*,\,\lambda=\lambda^*} = (\partial\Phi/\partial\phi)_{\phi=\phi^*} - \sum_{i=1}^{p}\lambda_i^*\,(\partial g_i/\partial\phi)_{\phi=\phi^*} = 0 \tag{3-6-8}$$

and

$$(\partial\Lambda/\partial\lambda_i)_{\phi=\phi^*} = -g_i(\phi^*) = 0 \qquad (i = 1, 2, \ldots, p) \tag{3-6-9}$$

But Eq. (8) and Eq. (9) correspond exactly to Eq. (6) and Eq. (3-4-1), respectively. We thus conclude that $\Phi(\phi)$ has a constrained stationary point at $\phi = \phi^*$ if and only if $\Lambda(\phi, \lambda)$ has an unconstrained stationary point at $\phi = \phi^*$ and $\lambda = \lambda^*$.

In some problems we have an $n \times m$ matrix of constraints $G(\phi) = 0$. In this case we need an $m \times n$ matrix of Lagrange multipliers $L$, and the Lagrangian takes the form

$$\Lambda(\phi, L) \equiv \Phi(\phi) + \mathrm{Tr}(LG) \tag{3-6-10}$$

To determine the nature of the stationary point $(\phi^*, \lambda^*)$, expand both $\Phi$ and the $g_i$ in Taylor series around $\phi^*$, retaining terms up to second order:

$$\Phi(\phi^* + \delta\phi) \approx \Phi(\phi^*) + (\partial\Phi/\partial\phi)^T\,\delta\phi + \tfrac{1}{2}\,\delta\phi^T(\partial^2\Phi/\partial\phi\,\partial\phi)\,\delta\phi \tag{3-6-11}$$

$$g_i(\phi^* + \delta\phi) \approx g_i(\phi^*) + (\partial g_i/\partial\phi)^T\,\delta\phi + \tfrac{1}{2}\,\delta\phi^T(\partial^2 g_i/\partial\phi\,\partial\phi)\,\delta\phi \qquad (i = 1, 2, \ldots, p) \tag{3-6-12}$$

If $\phi^*$ is a local constrained minimum of $\Phi$, it follows that $\Phi(\phi^* + \delta\phi) \geq \Phi(\phi^*)$ for all sufficiently small $\delta\phi$ for which

$$g_i(\phi^* + \delta\phi) = 0 \qquad (i = 1, 2, \ldots, p)$$

Such $\delta\phi$ must therefore satisfy (approximately) the equations

$$(\partial g_i/\partial\phi)^T\,\delta\phi + \tfrac{1}{2}\,\delta\phi^T(\partial^2 g_i/\partial\phi\,\partial\phi)\,\delta\phi = 0 \qquad (i = 1, 2, \ldots, p) \tag{3-6-13}$$

and the inequality

$$(\partial\Phi/\partial\phi)^T\,\delta\phi + \tfrac{1}{2}\,\delta\phi^T(\partial^2\Phi/\partial\phi\,\partial\phi)\,\delta\phi \geq 0 \tag{3-6-14}$$
Multiplying each Eq. (13) by $\lambda_i^*$ and subtracting from Eq. (14) we obtain, in view of Eq. (6)

$$\tfrac{1}{2}\,\delta\phi^T\Big[(\partial^2\Phi/\partial\phi\,\partial\phi) - \sum_{i}\lambda_i^*\,(\partial^2 g_i/\partial\phi\,\partial\phi)\Big]\delta\phi \geq 0 \tag{3-6-15}$$

This must hold for any $\delta\phi$ satisfying Eq. (13), i.e., by chords joining the solution point $\phi^*$ to neighboring points approximately on the constraint surfaces. Because of continuity, Eq. (15) must, therefore, also be satisfied by the tangents to the constraint surfaces at the solution, these being limits of such chords. These tangents satisfy

$$(\partial g/\partial\phi)\,\delta\phi = 0 \tag{3-6-16}$$

That is, they are the null vectors of the matrix $\partial g/\partial\phi$. Now let $B$ be a matrix whose columns span the null space of $\partial g/\partial\phi$. That is, every null vector $\delta\phi$ of $\partial g/\partial\phi$ can be expressed in the form $\delta\phi = Bx$, where $x$ is an arbitrary vector whose dimension is that of the null space. If the dimension of $\phi$ is $r$, and if $\partial g/\partial\phi$ has $p$ linearly independent rows, then the dimension of $x$ is $r - p$, and $B$ is an $r \times (r - p)$ matrix. Letting

$$A \equiv \frac{\partial^2\Phi}{\partial\phi\,\partial\phi} - \sum_{i=1}^{p}\lambda_i^*\,\frac{\partial^2 g_i}{\partial\phi\,\partial\phi} \tag{3-6-17}$$

we know, from Eq. (15), that

$$x^T B^T A B\,x \geq 0 \tag{3-6-18}$$

for all $r - p$ dimensional vectors $x$, i.e., $B^T A B$ must be positive semidefinite. Conversely, if $B^T A B$ is positive definite, it follows by continuity that Eq. (15) holds for all sufficiently small $\delta\phi$ satisfying Eq. (13). Hence, $\phi^*$ is a constrained minimum of $\Phi$.

3-7. Inequality Constraints

Suppose we wish to minimize $\Phi(\phi)$ subject to the inequality constraint Eq. (3-4-2). If the minimum occurs at a point $\phi^*$ which is in the interior of the feasible region, that is $h_i(\phi^*) > 0$ for all $i$, then the constraints are irrelevant as far as the local nature of the minimum is concerned. Therefore, $\phi^*$ must satisfy the conditions for an unconstrained minimum.

When $\phi^*$ lies on the boundary of the feasible region,
there will be some values of $i$ for which

$$h_i(\phi^*) = 0 \tag{3-7-1}$$

We refer to these $h_i$ as the active constraints. For the purpose of characterizing the point $\phi^*$, we may disregard all the inactive constraints, and in the sequel we shall consider the vector $h$ to consist of all the active constraints alone. We denote by $t$ the number of active constraints, and by $T$ the total number of constraints.

At an inequality constrained minimum it is required that the gradient of the objective function should point decisively into the feasible region. We observe that since the constraint functions are positive inside and negative outside the feasible region, their gradients also point into the feasible region. The necessary condition for minimality is, then, that the gradient of the objective function should be a linear combination with positive coefficients of the gradients of the constraint functions. More precisely, John (1948) has proven that $\phi^*$ is a minimum only if there exist nonnegative numbers $\lambda_0, \lambda_1, \lambda_2, \ldots, \lambda_t$ not all zero such that

$$\lambda_0\,\partial\Phi(\phi^*)/\partial\phi = \sum_{i=1}^{t}\lambda_i\,[\partial h_i(\phi^*)/\partial\phi] \tag{3-7-2}$$

The more famous Kuhn-Tucker condition (Kuhn and Tucker, 1951) asserts that we may choose $\lambda_0 = 1$ provided the constraints [Eq. (1)] meet a certain qualification,‡ which in practice is almost always satisfied.

Clearly, Eq. (2) is unaffected if we add on the right-hand side terms corresponding to the inactive constraints, provided their multipliers assume the value zero. The Kuhn-Tucker condition can then be stated as follows:

$$\partial\Phi(\phi^*)/\partial\phi = \sum_{i=1}^{T}\lambda_i\,\partial h_i/\partial\phi \tag{3-7-3}$$

$$h_i(\phi^*) \geq 0 \qquad (i = 1, 2, \ldots, T) \tag{3-7-4}$$

$$\lambda_i \geq 0 \qquad (i = 1, 2, \ldots, T) \tag{3-7-5}$$

$$\lambda_i\,h_i(\phi^*) = 0 \qquad (i = 1, 2, \ldots, T) \tag{3-7-6}$$

The last equation states that either a constraint is active ($h_i = 0$) or its multiplier vanishes. This is known as the principle of complementary slackness.

Sufficient conditions for local optimality have been derived by McCormick (1967) and Fiacco (1968).
They require the quantity $y^T A y$ to be positive for all vectors $y$ which point into the feasible region or are tangent to it at $\phi^*$. Here $A$ is the Hessian of $\Phi - \sum_i \lambda_i h_i$ ($\lambda_0$ having been set to one).

The role of the $\lambda_i$ here is analogous to that of the Lagrange multipliers in the case of equality constraints. The $\lambda_i$ are called dual variables or shadow prices.

When both equality and inequality constraints apply, the necessary condition for a minimum states that there exist scalars $\mu_1, \mu_2, \ldots$ and nonnegative scalars $\lambda_1, \lambda_2, \ldots$ such that

$$(\partial\Phi/\partial\phi)_{\phi=\phi^*} = \sum_{i}\mu_i\,(\partial g_i/\partial\phi)_{\phi=\phi^*} + \sum_{i}\lambda_i\,(\partial h_i/\partial\phi)_{\phi=\phi^*} \tag{3-7-7}$$

The subject of optimality conditions is discussed in great detail by Mangasarian (1969).

‡ The qualification requires that for every vector $u$ such that $u^T(\partial h_i/\partial\phi)_{\phi=\phi^*} \geq 0$ ($i = 1, 2, \ldots, t$), there should exist a vector of functions $\phi(\tau)$ such that $\phi(0) = \phi^*$, $\phi(\tau)$ is in the feasible region for $0 \leq \tau \leq 1$, and $(\partial\phi/\partial\tau)_{\phi=\phi^*} = u$. A case where the qualification is not met occurs when $\phi^*$ is at a cusp formed by constraints tangent to each other.

3-8. Problems

1. Consider the model whose likelihood is given by Eq. (3-2-7). Show that if $\theta$ and $\sigma$ are unknown parameters, then $\sum_{\mu=1}^{n} w_\mu$ and $\sum_{\mu=1}^{n} w_\mu^2$ form a pair of statistics which is jointly sufficient for $\theta$ and $\sigma$.

2. Show that the Lagrange multipliers represent the unit cost of the constraints. That is, if a constraint $g_i(\theta) = 0$ is replaced by $g_i(\theta) = \epsilon$, then the minimum attainable value of $\Phi$ is increased (to a first-order approximation) by an amount $\lambda_i\epsilon$.

3. As above, but for inequality constraints.

4. Let $A$ be a square matrix. Using Lagrange multipliers, show that of all vectors $x$ of unit length (i.e., $x^T x = 1$), the ones for which $x^T A x$ is either a minimum or a maximum are eigenvectors of $A$.
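As a numerical illustration of the Lagrangian conditions (3-6-8) and (3-6-9), and of the interpretation of the multiplier stated in Problem 2, consider minimizing $\phi_1^2 + \phi_2^2$ subject to $\phi_1 + \phi_2 - 1 = \epsilon$. The example problem, the function names, and the value $\epsilon = 0.01$ are assumptions made for this sketch. It illustrates the statement on one case and is not a substitute for the general argument the problem asks for. Because the objective is quadratic and the constraint linear, the stationarity conditions form a linear system.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def constrained_minimum(eps):
    """Stationary point of Lambda = phi1^2 + phi2^2 - lam * (phi1 + phi2 - 1 - eps)."""
    a = [[2.0, 0.0, -1.0],   # dLambda/dphi1 = 2 phi1 - lam = 0
         [0.0, 2.0, -1.0],   # dLambda/dphi2 = 2 phi2 - lam = 0
         [1.0, 1.0, 0.0]]    # constraint: phi1 + phi2 = 1 + eps
    phi1, phi2, lam = solve_linear(a, [0.0, 0.0, 1.0 + eps])
    return phi1 ** 2 + phi2 ** 2, lam

eps = 0.01
base_value, lam = constrained_minimum(0.0)
perturbed_value, _ = constrained_minimum(eps)
increase = perturbed_value - base_value

print(f"multiplier          : {lam:.6f}")
print(f"increase in minimum : {increase:.6f}")
print(f"lambda * epsilon    : {lam * eps:.6f}")
```

The two printed quantities differ only by a term of order $\epsilon^2$, in line with the first-order statement of the problem.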
Chapter IV

Methods of Estimation

4-1. Residuals

In the previous chapter we have discussed, in general terms, what desirable properties an estimator should possess, and how any specific estimator may be tested in order to determine whether it possesses these properties. We now proceed to describe specific estimators, or estimation methods, which are in widespread use.

We have defined the error $\epsilon_{\mu a}$ as the difference between the measured and true values of a variable. In the case of a reduced model $y = f(x, \theta)$, if we knew the true value $\hat{\theta}$ of $\theta$ we could compute the error

$$\epsilon_{\mu a} \equiv y_{\mu a} - f_a(x_\mu, \hat{\theta}) \tag{4-1-1}$$

We can also compute these differences for any other value of $\theta$. This defines functions

$$e_{\mu a}(\theta) \equiv y_{\mu a} - f_a(x_\mu, \theta) \tag{4-1-2}$$

to which we refer as the residuals. The errors are equal to the residuals evaluated with the true value $\theta = \hat{\theta}$.

The residuals in an inexact structural model $g(z, \theta) = \gamma$ are obtained simply by evaluation of the model equations

$$e_{\mu a}(\theta) = g_a(z_\mu, \theta) \tag{4-1-3}$$

When the structural model is exact, the residuals are the differences between the observed and estimated values of the variables

$$e_{\mu a}(\hat{w}) \equiv w_{\mu a} - \hat{w}_{\mu a} \tag{4-1-4}$$

A. Least Squares

4-2. Unweighted Least Squares

The method of least squares is the oldest and most widely used estimation procedure. At least some of its popularity is due to the fact that it can be applied in an ad hoc manner directly to the deterministic model, without any cognizance being taken of the probability distribution of the observations. Needless to say, estimates obtained in such a way may be very unsatisfactory indeed, although one can envision situations in which nothing better can be done. We do not wish to imply that least squares estimates are always merely ad hoc. Quite the contrary is true, and where the observations have certain probability distributions, these estimates even possess optimal statistical properties, which will be described in the sequel.

In cases of pure curve fitting, where the coefficients have no physical significance, the least squares method is usually adequate.

We employ the following notation: A small capital letter denotes the vector formed by adjoining to each other the rows of the matrix denoted by the same letter. Thus

$$\textsc{e}^T = [e_{11}, e_{12}, \ldots, e_{1m}, e_{21}, \ldots, e_{nm}]$$

The least squares procedure in its simplest form consists of finding the values of $\theta$ which minimize the function

$$\Phi(\theta) = \textsc{e}^T(\theta)\,\textsc{e}(\theta) = \mathrm{Tr}[E^T(\theta)E(\theta)] \tag{4-2-1}$$

which is, in component form

$$\Phi(\theta) = \sum_{a=1}^{m}\sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-2-2}$$

i.e., we minimize the sum of squares of the residuals.

When $m = 1$ we speak of single equation least squares. In practice, most estimation problems fall into this category. A typical problem is worked out in detail in Section 5-21.

We derive the normal equations easily

$$\partial\Phi/\partial\theta = 2\sum_{a=1}^{m}\sum_{\mu=1}^{n} e_{\mu a}\,\partial e_{\mu a}/\partial\theta = 0 \tag{4-2-3}$$
In the most common case of a single reduced equation, $e_\mu = y_\mu - f(x_\mu, \theta)$, and we have:

$$\Phi(\theta) = \sum_{\mu=1}^{n}[y_\mu - f(x_\mu, \theta)]^2 \tag{4-2-4}$$

$$\partial\Phi/\partial\theta_\alpha = -2\sum_{\mu=1}^{n} e_\mu\,\partial f(x_\mu, \theta)/\partial\theta_\alpha \qquad (\alpha = 1, 2, \ldots, l) \tag{4-2-5}$$

4-3. Weighted Least Squares

An objective function consisting of a simple sum of squares is often unsatisfactory for the following reasons:

(a) The various quantities $y_{\mu a}$ (or $g_{\mu a}$) may represent entities having different physical dimensions, or measured on different scales. For instance, $y_{\mu 1}$ may be the concentration of a chemical, expressed in mole fractions and falling in the range zero to one; at the same time, $y_{\mu 2}$ may be a temperature measured in degrees centigrade, and falling in the range 500-1000. It clearly makes no sense to sum together squares of numbers of such disparate orders of magnitude; the residuals of the temperatures are likely to dominate those of the mole fractions and any information contained in the latter will be lost.

(b) Some observations may be known to be less reliable than others, and we want to make sure that our parameter estimates will be less influenced by those than by the more accurate ones. (Note that we cannot, after all, escape the statistical structure of the data.)

The solution to both of these problems is one and the same; assign a nonnegative weight factor $b_{\mu a}$ to each $e_{\mu a}^2(\theta)$, and minimize

$$\Phi(\theta) \equiv \sum_{a=1}^{m}\sum_{\mu=1}^{n} b_{\mu a}\,e_{\mu a}^2(\theta) \tag{4-3-1}$$

We choose small $b_{\mu a}$ for $y_{\mu a}$ which are measured on a large scale, or which are highly unreliable, and conversely for large $b_{\mu a}$.

A more general formulation is one which assigns weights to cross product terms as well, i.e.,

$$\Phi(\theta) = \sum_{a=1}^{m}\sum_{b=1}^{m}\sum_{\mu=1}^{n}\sum_{\eta=1}^{n} B_{(\mu a)(\eta b)}\,e_{\mu a}(\theta)\,e_{\eta b}(\theta) \tag{4-3-2}$$

The weights $B_{(\mu a)(\eta b)}$ must be elements of a positive definite or semidefinite matrix, for otherwise $\Phi(\theta)$ can be made to approach $-\infty$. Clearly, Eq. (1) is a special case of Eq. (2), with $B_{(\mu a)(\eta b)} = b_{\mu a}\,\delta_{\mu\eta}\,\delta_{ab}$, and Eq. (4-2-2) is a special case of Eq. (1), with all $b_{\mu a} = 1$.
Additional important special cases of (2) are the following:

(a) Weighting by Variable. Where $B_{(\mu a)(\eta b)} = 0$ ($\mu \neq \eta$), and the weights are independent of $\mu$

$$\Phi(\theta) = \sum_{a=1}^{m}\sum_{b=1}^{m} B_{ab}\sum_{\mu=1}^{n} e_{\mu a}(\theta)\,e_{\mu b}(\theta) = \sum_{\mu=1}^{n} e_\mu^T B\,e_\mu \tag{4-3-3}$$

When $B$ is a diagonal matrix, this simplifies to

$$\Phi(\theta) = \sum_{a=1}^{m} b_a\sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-3-4}$$

(b) Weighting by Experiment. Applicable mostly to the single equation case

$$\Phi(\theta) = \sum_{\mu=1}^{n} b_\mu\,e_\mu^2(\theta) \tag{4-3-5}$$

(c) Weighting by Experiment and Variable.

$$\Phi(\theta) = \sum_{\mu=1}^{n}\sum_{a=1}^{m}\sum_{b=1}^{m} b_\mu B_{ab}\,e_{\mu a}(\theta)\,e_{\mu b}(\theta) = \sum_{\mu=1}^{n} b_\mu\,e_\mu^T B\,e_\mu \tag{4-3-6}$$

Does statistical theory tell us what values should be assigned to the weights, or when we are entitled to use the simpler formulas? The answer is at least partly in the affirmative. We shall see later that if the model equations are linear in the parameters (Section 4-4), or if the number of observations is large and the errors are normally distributed (Section 4-7), then the choice of weights leading to least-variance estimates is given by the elements of the inverse of the covariance matrix of the errors. That is

$$B_{(\mu a)(\eta b)} = (V^{-1})_{(\mu a)(\eta b)} \tag{4-3-7}$$

where

$$V_{(\mu a)(\eta b)} = E(\epsilon_{\mu a}\,\epsilon_{\eta b}) \tag{4-3-8}$$

Although we cannot prove optimal properties in the general case (nonnormal distributions with nonlinear models), it is still reasonable, and approximately optimal, to use weights which are the elements of the inverse of the covariance matrix. When the covariance matrix is not known, one may choose either to guess or to use a method such as maximum likelihood which sometimes enables one to estimate the weights along with the other parameters. Or, one can obtain a direct estimate of the covariance matrix by replicating at least some of the experiments.
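The effect of weighting by experiment, Eq. (4-3-5), with weights equal to inverse error variances can be seen in a small example. The sketch below rests on assumptions of our own: a straight-line model $f = \theta_1 + \theta_2 x$, made-up data lying on the line $1 + 2x$ except for one unreliable point, and the function name `weighted_line_fit`. Because the model is linear in the parameters, the weighted minimum is obtained by solving two normal equations in closed form.

```python
def weighted_line_fit(xs, ys, weights):
    """Minimise sum_mu b_mu * (y_mu - theta1 - theta2 * x_mu)^2, cf. Eq. (4-3-5)."""
    sw = sum(weights)
    sx = sum(w * x for w, x in zip(weights, xs))
    sy = sum(w * y for w, y in zip(weights, ys))
    sxx = sum(w * x * x for w, x in zip(weights, xs))
    sxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    det = sw * sxx - sx * sx
    theta1 = (sxx * sy - sx * sxy) / det
    theta2 = (sw * sxy - sx * sy) / det
    return theta1, theta2

# Made-up data: the last observation is assumed to be far less reliable.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 12.0]
sigmas = [0.1, 0.1, 0.1, 0.1, 5.0]

unweighted = weighted_line_fit(xs, ys, [1.0] * len(xs))
weighted = weighted_line_fit(xs, ys, [1.0 / s ** 2 for s in sigmas])

print("unweighted:", unweighted)
print("weighted  :", weighted)
```

With equal weights the unreliable point pulls the fitted slope well away from 2. With weights $1/\sigma_\mu^2$ its influence is almost removed.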
4-4. Multiple Linear Regression

When the model is linear, the choice of proper weights in Eq. (4-3-2) ensures optimal statistical properties for the corresponding estimators. The linear model takes the form

$$f_\mu = f(x_\mu, \theta) = B_\mu(x_\mu)\,\theta \tag{4-4-1}$$

where $B_\mu(x_\mu)$ is a matrix of given functions (polynomials and trigonometric functions are often used in curve fitting).† Adjoining the equations for all values of $\mu$, we obtain, in matrix form

$$F = B\theta \tag{4-4-2}$$

where $F^T \equiv [f_1^T, f_2^T, \ldots, f_n^T]$ and $B^T \equiv [B_1^T(x_1), B_2^T(x_2), \ldots, B_n^T(x_n)]$. For given data, $B$ is a constant matrix.

Suppose the $x_\mu$ are measured precisely, and each observation $y_\mu$ is a sample from a random variable whose mean value is $f_\mu$, and let the joint covariance matrix of all the elements of $Y$ be $V$, i.e.,

$$E(y_{\mu a} - f_{\mu a})(y_{\eta b} - f_{\eta b}) = V_{(\mu a)(\eta b)} \tag{4-4-3}$$

If we determine $\theta = \theta^*$ so as to minimize the function

$$\Phi(\theta) \equiv (Y - B\theta)^T V^{-1}(Y - B\theta) \tag{4-4-4}$$

then $\theta^*$ must satisfy:

$$\partial\Phi/\partial\theta = -2B^T V^{-1}(Y - B\theta^*) = 0 \tag{4-4-5}$$

This is equivalent to the normal equations

$$B^T V^{-1} B\,\theta^* = B^T V^{-1} Y \tag{4-4-6}$$

Solving for $\theta^*$, we find, provided $B^T V^{-1} B$ is nonsingular,‡

$$\theta^* = (B^T V^{-1} B)^{-1} B^T V^{-1} Y \tag{4-4-7}$$

† Suppose we are fitting the single equation model $f(x, \theta) = \theta_1 + \theta_2 x + \theta_3 x^2$. Then $B_\mu$ is the row vector $[1, x_\mu, x_\mu^2]$ and

$$B = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}$$

‡ If $B^T V^{-1} B$ is singular, the normal equations possess infinitely many solutions. Of these, the one for which $\theta^{*T}\theta^*$ is minimum is given by $\theta^* = (B^T V^{-1} B)^{+} B^T V^{-1} Y$ where $A^{+}$ is the pseudoinverse of $A$ (see Section A-1).
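Equation (4-4-7) can be tried out on the quadratic model of the footnote. The following is a minimal sketch under stated assumptions: the covariance matrix $V$ is taken to be diagonal, the data are noise-free values of $1 + 2x - 0.5x^2$ invented for the illustration, and the helper names `solve_linear` and `linear_regression` are our own. Rather than inverting $B^T V^{-1} B$ explicitly, the sketch solves the normal equations (4-4-6) by Gaussian elimination.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def linear_regression(xs, ys, variances):
    """Solve B^T V^-1 B theta = B^T V^-1 Y for f = th1 + th2 x + th3 x^2, diagonal V."""
    rows = [[1.0, x, x * x] for x in xs]          # the matrix B of the footnote
    k = 3
    normal = [[sum(r[i] * r[j] / v for r, v in zip(rows, variances))
               for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y / v for r, y, v in zip(rows, ys, variances))
           for i in range(k)]
    return solve_linear(normal, rhs)

# Noise-free illustrative data generated from theta = (1, 2, -0.5).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x - 0.5 * x * x for x in xs]
variances = [1.0, 1.0, 4.0, 4.0, 1.0]            # assumed error variances

theta_star = linear_regression(xs, ys, variances)
print(theta_star)
```

Since the data contain no errors, the estimate reproduces the generating parameters whatever the assumed variances. With noisy data the variances determine how strongly each observation is weighted.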
This is the well-known multiple linear regression formula. Clearly $\theta^*$ is a linear estimate, having the form $\theta^* = \mathbf{A}\mathbf{Y}$. By our assumption, $E(\mathbf{Y}) = \mathbf{B}\theta$. Hence, $\mathbf{B}$ and $\mathbf{V}$ being constant, we find from Eq. (7)

$$E(\theta^*) = \theta \tag{4-4-8}$$

That is, $\theta^*$ is an unbiased estimate of $\theta$. Now, Eq. (3) is equivalent to

$$E(\mathbf{Y} - \mathbf{B}\theta)(\mathbf{Y} - \mathbf{B}\theta)^T = \mathbf{V}$$

Also, it is easily seen that

$$\theta^* - \theta = (\mathbf{B}^T \mathbf{V}^{-1} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{B}\theta)$$

Hence the covariance of the sampling distribution of the estimate $\theta^*$ turns out to be

$$\mathbf{V}_\theta \equiv E(\theta^* - \theta)(\theta^* - \theta)^T = (\mathbf{B}^T \mathbf{V}^{-1} \mathbf{B})^{-1} \tag{4-4-9}$$

The Gauss-Markov Theorem (proved in Appendix E) asserts that among all linear unbiased estimates, Eq. (7) yields the one whose variance is smallest. If, furthermore, the distribution of the $\epsilon_\mu$ is normal, the estimate is efficient.

In the case where the errors of all the observations are independent and of equal variance $\sigma^2$, we have $\mathbf{V} = \sigma^2 \mathbf{I}$, and

$$\theta^* = (\mathbf{B}^T \mathbf{B})^{-1} \mathbf{B}^T \mathbf{Y} \tag{4-4-10}$$

which is the usual unweighted linear least squares estimate. The covariance matrix of this estimate is given by $\mathbf{V}_\theta = \sigma^2 (\mathbf{B}^T \mathbf{B})^{-1}$.

Computational methods for solving linear regression problems are discussed in Section 5-11. A question that often arises in connection with linear regression problems is which variables should be included in, and which should be excluded from, the model. Stated in another way, the question is which parameters should be left out (assumed to be zero) because they do not contribute significantly to the model. The method of stepwise regression (Section A-3) provides an answer to this question.

Before leaving the subject of linear models, let us examine briefly the question of how the optimal properties of the regression estimate are affected when the assumed model is incorrect. First, suppose some important terms were omitted from the model.
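As a result, it is no longer true that $E(\mathbf{Y}) = \mathbf{B}\theta$; rather, we have

$$E(\mathbf{Y}) = \mathbf{B}\theta + \mathbf{s} \tag{4-4-11}$$

where $\mathbf{s}$ is a fixed vector consisting of the omitted terms. If $\theta^*$ is computed from Eq. (7), we find

$$E(\theta^*) = \theta + (\mathbf{B}^T \mathbf{V}^{-1} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{V}^{-1} \mathbf{s} \tag{4-4-12}$$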
so that $\theta^*$ is no longer an unbiased estimate. The bias is precisely equal to $(\mathbf{B}^T \mathbf{V}^{-1} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{V}^{-1} \mathbf{s}$.

Secondly, consider the case where an erroneous value has been taken for $\mathbf{V}$. Suppose the true covariance matrix is $\mathbf{U} \neq \mathbf{V}$. Then the covariance matrix of the estimate Eq. (7) is given by

$$\mathbf{V}_\theta = (\mathbf{B}^T \mathbf{V}^{-1} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{V}^{-1} \mathbf{U} \mathbf{V}^{-1} \mathbf{B} (\mathbf{B}^T \mathbf{V}^{-1} \mathbf{B})^{-1} \tag{4-4-13}$$

We wish to determine how inefficient this estimate is relative to the best possible estimate, in which $\mathbf{V} = \mathbf{U}$. The covariance of the latter estimate is, according to Eq. (9), $(\mathbf{B}^T \mathbf{U}^{-1} \mathbf{B})^{-1}$. We define the relative inefficiency $e$ of Eq. (7) as the ratio of its generalized covariance to the minimum attainable generalized covariance, i.e.,

$$e = \det\left[(\mathbf{B}^T \mathbf{V}^{-1} \mathbf{B})^{-1} \mathbf{B}^T \mathbf{V}^{-1} \mathbf{U} \mathbf{V}^{-1} \mathbf{B} (\mathbf{B}^T \mathbf{V}^{-1} \mathbf{B})^{-1}\right] \Big/ \det (\mathbf{B}^T \mathbf{U}^{-1} \mathbf{B})^{-1} \tag{4-4-14}$$

Clearly, $e = 1$ if $\mathbf{V} = \mathbf{U}$. In other cases, it can be shown that $e$ can assume any value in the range given below; its actual value depends on $\mathbf{B}$:

$$1 \le e \le (1 + \alpha)^2 / 4\alpha \tag{4-4-15}$$

where $\alpha$ is the condition number, i.e., the ratio of largest to smallest eigenvalues, of the matrix $\mathbf{V}^{-1/2} \mathbf{U} \mathbf{V}^{-1/2}$. To illustrate, suppose an unweighted least squares estimate is used where in fact the error variances range between 10 and 100. We have, then, $\mathbf{V}^{-1/2} = \mathbf{I}$, and $\mathbf{U} = \mathrm{diag}(\mathbf{u})$, where $\mathbf{u}$ is a vector of numbers in the range 10 to 100. It follows that $\mathbf{V}^{-1/2} \mathbf{U} \mathbf{V}^{-1/2} = \mathbf{U} = \mathrm{diag}(\mathbf{u})$, and $\alpha = 100/10 = 10$. The inefficiency of $\theta^*$ may be as high as $(1 + 10)^2/40 \approx 3$.

While an estimate of the form Eq. (7) or Eq. (10) is the best unbiased estimate, it is possible to construct biased estimates whose total expected squared error is less. For instance, in the method of ridge regression (Hoerl, 1962; Hoerl and Kennard, 1970), one substitutes for Eq. (10) the estimate

$$\theta^*(\lambda) = (\mathbf{B}^T \mathbf{B} + \lambda \mathbf{I})^{-1} \mathbf{B}^T \mathbf{Y} \tag{4-4-16}$$

where $\lambda$ is a positive parameter.
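The claim that a biased estimate can have a smaller expected squared error than the least squares estimate can be checked by hand in the one-parameter case. For the model $y_\mu = b_\mu \theta + \epsilon_\mu$ with independent errors of variance $\sigma^2$, Eq. (4-4-16) becomes $\theta^*(\lambda) = \sum_\mu b_\mu y_\mu / (S + \lambda)$ with $S = \sum_\mu b_\mu^2$, so the bias is $-\lambda\theta/(S + \lambda)$ and the variance is $\sigma^2 S/(S + \lambda)^2$. The following Python sketch evaluates bias squared plus variance; the numerical values and the function name are hypothetical, and the true $\theta$ is assumed known only so that the comparison can be made.

```python
def ridge_mse(lam, theta, sigma2, s):
    """Expected squared error of the scalar ridge estimate
    theta*(lam) = sum(b * y) / (s + lam), where s = sum(b**2) and the
    errors are independent with variance sigma2."""
    bias = -lam * theta / (s + lam)
    variance = sigma2 * s / (s + lam) ** 2
    return bias ** 2 + variance

# Hypothetical true parameter, error variance, and sum of squared regressors.
theta, sigma2, s = 2.0, 1.0, 5.0

mse_least_squares = ridge_mse(0.0, theta, sigma2, s)   # lam = 0 is ordinary least squares
lam_best = sigma2 / theta ** 2                         # stationary point of ridge_mse in lam
mse_ridge = ridge_mse(lam_best, theta, sigma2, s)

print("least squares:", mse_least_squares)
print("ridge, lam = sigma2/theta**2:", mse_ridge)
```

In this scalar setting the expected squared error decreases as $\lambda$ moves away from zero and reaches its minimum at $\lambda = \sigma^2/\theta^2$, a value that cannot be computed in practice because it depends on the unknown $\theta$.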
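It can be shown that the expected square error is

$$\mathbf{U}(\lambda) \equiv E(\theta^*(\lambda) - \theta)(\theta^*(\lambda) - \theta)^T = (\mathbf{B}^T \mathbf{B} + \lambda \mathbf{I})^{-1} (\sigma^2 \mathbf{B}^T \mathbf{B} + \lambda^2 \theta\theta^T) (\mathbf{B}^T \mathbf{B} + \lambda \mathbf{I})^{-1} \tag{4-4-17}$$

and that the quantity $\det \mathbf{U}(\lambda)$ is minimum when $\lambda$ satisfies the equation

$$\lambda\, \theta^T (\sigma^2 \mathbf{B}^T \mathbf{B} + \lambda^2 \theta\theta^T)^{-1} \theta = \mathrm{Tr}\, (\mathbf{B}^T \mathbf{B} + \lambda \mathbf{I})^{-1} \tag{4-4-18}$$

Since $\theta$ is unknown, the optimal $\lambda$ cannot be determined a priori. Hoerl and Kennard recommend construction of a so-called ridge trace, which is a plot of the components of $\theta^*(\lambda)$ versus $\lambda$ with $\lambda$ increasing from zero. One chooses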
a value of $\lambda$ where $\theta^*(\lambda)$ ceases to vary rapidly. Note that at $\lambda = 0$ we have the usual least squares estimate. Note also that $\lambda = 0$ does not satisfy Eq. (18). Hence, the least squares estimate is never the linear least squared error estimate.

B. Maximum Likelihood

4-5. Definition

In Sections 2-12 and 2-13 we have defined the likelihood function $L(\phi)$ of the sample as being the joint pdf of the observations, viewed as a function of the unknown parameters $\phi$. These unknown parameters were of three kinds:

1. $\theta$ represents the unknown parameters of the deterministic models.
2. $\hat{\mathbf{W}}$ represents the true values of the observed variables.
3. $\psi$ represents other distribution parameters.

In Section 2-13 we saw that the model equations could be regarded as equality constraints which limit the possible values that the $\theta$ and $\hat{\mathbf{W}}$ could attain. In addition, prior information may impose certain inequality constraints (e.g., nonnegativity) on the parameters. The maximum likelihood estimate (MLE) of $\phi$ is that value of $\phi$ satisfying all the equality and inequality constraints, for which the likelihood function attains its maximum value (if such a value exists).

Under relatively mild conditions on the form of the likelihood function, the MLE is consistent and asymptotically efficient. This is a strong argument for using the MLE when the sample is large. The MLE does not usually possess any optimal properties for small samples. It is generally neither unbiased nor efficient, although it is sufficient when a sufficient statistic exists. Sampling experiments [see, e.g., Chow (1964), Cragg (1967), Carney and Goldwyn (1967)] have shown, however, that the maximum likelihood method produces acceptable estimates in many situations. Whereas better methods may be available in specific cases,
a powerful argument for the use of the maximum likelihood method is the generality and relative ease of application.

Since the logarithm is a monotonic increasing function of its argument, the value of $\phi$ that maximizes $L(\phi)$ also maximizes $\log L(\phi)$. Since $\log L$ is frequently a simpler function than $L$, it is in terms of maximizing $\log L$ that we shall usually formulate the problem.
The following heuristic argument may make the maximum likelihood method seem plausible: the probability of observing a sample lying in a region $\delta\mathbf{W}$ around the actually observed sample $\mathbf{W}$ is given by $p(\mathbf{W} \mid \phi)\, \delta\mathbf{W} = L(\phi)\, \delta\mathbf{W}$. The value $\phi = \phi^*$ for which this probability is greatest is the MLE. We say that $\phi^*$ is the most likely value of $\phi$. Of all possible values of the parameters, $\phi^*$ is the one having the largest probability of giving rise to a sample within $\delta\mathbf{W}$ of the actually observed one.

4-6. Likelihood Equations

In this and subsequent sections we examine the application of MLE to various cases. We shall proceed as far as we can formally. In most applications, the final computation of the estimates requires numerical methods to be described in the next two chapters.

We first discuss the case where no constraints of any kind apply. This occurs when our model is of the reduced type, discussed in Section 2-12. Then the likelihood is a function of $\theta$ and $\psi$ alone, as shown by Eq. (2-12-3). We know from Section 3-5 that a free (unconstrained) maximum of the function $\log L(\phi)$ must satisfy the set of likelihood equations

$$\partial \log L(\phi)/\partial\phi = 0 \tag{4-6-1}$$

Since $L(\phi) = p(\mathbf{W} \mid \phi)$, we find from Eq. (3-2-18) that if p is a sufficient statistic for $\phi$, then Eq. (1) is equivalent to

$$f(\mathrm{p}, \phi) = 0 \tag{4-6-2}$$

Hence, to calculate the maximum likelihood estimate it is sufficient to know the value of the sufficient statistic p, and we may discard the original data $\mathbf{W}$. Unfortunately, there are few practical cases involving nonlinear models for which a sufficient statistic can be found.

One may approach the problem of finding the estimates in two ways:

(a) By solving the likelihood equations, and then determining whether the solution is indeed a maximum. This is the approach taken when the solution is to be found analytically.

(b) By attempting to find the maximum of the likelihood function directly, paying no regard to the likelihood equations.
This is the more fruitful approach when the solution is to be found numerically (see Chapter V). Even in this case, however, the likelihood equations can sometimes be used to eliminate some of the parameters, thus reducing the size of the problem to be solved numerically. This method, known as stagewise maximization, works out particularly well for the elimination of the distribution parameters $\psi$, and specific illustrations are given in Sections 4-8 and 4-9.
4-7. Normal Distribution

We consider the case of a normal distribution. In the most general case, we denote by $T_{(\mu a)(\eta b)}$ the covariance of $\epsilon_{\mu a}$ with $\epsilon_{\eta b}$, and by $B_{(\mu a)(\eta b)}$ the elements of the matrix inverse to $\mathbf{T}$. The errors $\epsilon_{\mu a}$ ($\mu = 1, 2, \ldots, n$; $a = 1, 2, \ldots, m$) possess the normal distribution $N_{nm}(\mathbf{0}, \mathbf{T})$. The logarithm of the pdf may be derived from Eq. (2-8-10) as being

$$\log L = -(nm/2) \log 2\pi - \tfrac{1}{2} \log \det \mathbf{T} - \tfrac{1}{2} \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{\mu=1}^{n} \sum_{\eta=1}^{n} B_{(\mu a)(\eta b)}\, e_{\mu a}(\theta)\, e_{\eta b}(\theta) \tag{4-7-1}$$

If all the elements of $\mathbf{T}$ are known, finding the values of $\theta$ that maximize Eq. (1) is equivalent to minimizing

$$\Phi(\theta) \equiv \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{\mu=1}^{n} \sum_{\eta=1}^{n} B_{(\mu a)(\eta b)}\, e_{\mu a}(\theta)\, e_{\eta b}(\theta) \tag{4-7-2}$$

which is the same as Eq. (4-3-2). Thus, for a normal distribution with known covariance, MLE reduces to weighted least squares, with the weights given by the elements of the inverse of the covariance matrix.

We now turn to the case when the covariance matrix is not known, and must be estimated from the data. Variances and covariances are measures of the magnitudes of the errors. The data themselves can tell us nothing about the magnitude of an error unless we have replications of the same error. If we measure the length of an object once, we can gain no idea of the error in the measurement; if we measure it twice, the difference between the measurements can be used to estimate the error.

In the general case described by Eq. (1) we have no replication: each error $\epsilon_{\mu a}$ is assumed to have its own variance $T_{(\mu a)(\mu a)}$. In order that estimation of the variances should be feasible, we must assume that several measurements (or quantities derived from them) possess identical variances. The following are typical assumptions:

(a) Errors in different experiments are independent.

(b) Errors in each experiment are distributed with the same covariance matrix $\mathbf{V}$.
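Both assumptions may be summarized by

$$T_{(\mu a)(\eta b)} = \delta_{\mu\eta} V_{ab}, \qquad V_{ab} \equiv E(\epsilon_{\mu a}\, \epsilon_{\mu b}) \tag{4-7-3}$$

The trace of a matrix $\mathbf{A}$ is the sum of its diagonal elements, i.e., $\mathrm{Tr}(\mathbf{A}) \equiv \sum_i A_{ii}$. It follows that $\mathrm{Tr}(\mathbf{A}\mathbf{B}) = \sum_{i,j} A_{ij} B_{ji}$. It is easily verified then that

$$\sum_{\mu=1}^{n} \mathbf{e}_\mu^T(\theta)\, \mathbf{V}^{-1} \mathbf{e}_\mu(\theta) = \mathrm{Tr}[\mathbf{V}^{-1} \mathbf{M}(\theta)] \tag{4-7-4}$$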
where $\mathbf{M}(\theta)$ is the moment matrix of the residuals, defined by $M_{ab} \equiv \sum_{\mu=1}^{n} e_{\mu a} e_{\mu b}$, i.e.,

$$\mathbf{M}(\theta) \equiv \sum_{\mu=1}^{n} \mathbf{e}_\mu(\theta)\, \mathbf{e}_\mu^T(\theta) \tag{4-7-5}$$

Under our assumptions (a) and (b) above, the likelihood function takes the form

$$\log L = -(nm/2) \log 2\pi - (n/2) \log \det \mathbf{V} - \tfrac{1}{2} \mathrm{Tr}[\mathbf{V}^{-1} \mathbf{M}(\theta)] \tag{4-7-6}$$

Clearly, maximizing Eq. (6) when $\mathbf{V}$ is a known matrix is equivalent to minimizing

$$\Phi(\theta) \equiv \tfrac{1}{2} \mathrm{Tr}[\mathbf{V}^{-1} \mathbf{M}(\theta)] \tag{4-7-7}$$

Retaining the factor $\tfrac{1}{2}$ in the above expression (and similar constant coefficients in other objective functions to be derived later) is important in Bayesian estimation, when $\log p_0(\theta)$ is added to $-\Phi(\theta)$ (see Section 4-15).

A further specialization of Eq. (6) is obtained when one adds the following assumption:

(c) All errors are independent, i.e., $\mathbf{V}$ is a diagonal matrix

$$V_{ab} = \delta_{ab} v_a, \qquad v_a \equiv E(\epsilon_{\mu a}^2) \equiv \sigma_a^2 \tag{4-7-8}$$

in which case

$$\log L = -(nm/2) \log 2\pi - (n/2) \sum_{a=1}^{m} \log v_a - \tfrac{1}{2} \sum_{a=1}^{m} v_a^{-1} M_{aa}(\theta) \tag{4-7-9}$$

In the single equation case

$$\log L = -(n/2) \log 2\pi - n \log \sigma - (1/2\sigma^2) M(\theta) \tag{4-7-10}$$

where

$$M(\theta) = \sum_{\mu=1}^{n} e_\mu^2(\theta) \tag{4-7-11}$$

Whether or not $\sigma$ is known, $\log L$ is maximized relative to $\theta$ by minimizing $M(\theta)$. Maximum likelihood here is equivalent to unweighted least squares.

4-8. Unknown Diagonal Covariance

We shall treat Eq. (4-7-9) first, and then generalize to Eq. (4-7-6). Assuming then that the $v_a$ are unknown, we seek those values of $\theta$ and the $v_a$ that maximize Eq. (4-7-9). We proceed by the method of stagewise maximization (Koopmans and Hood, 1953). This consists of finding, for any value of $\theta$, the values of the $v_a$ that maximize $\log L$. These will be some functions of $\theta$, say $\tilde{v}_a(\theta)$. Substitution of $\tilde{v}_a(\theta)$ for $v_a$ in Eq. (4-7-9) reduces $\log L$ to a function $\tilde{L}$ of
$\theta$ alone, and we seek $\theta^*$ so as to maximize $\tilde{L}(\theta)$. The first step, then, is to differentiate Eq. (4-7-9) with respect to each $v_a$, and equate the derivatives to zero:

$$\partial \log L/\partial v_a = -n/2v_a + (1/2v_a^2) \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) = 0 \qquad (a = 1, 2, \ldots, m) \tag{4-8-1}$$

This equation has the unique finite solution

$$\tilde{v}_a(\theta) = (1/n) \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-8-2}$$

Substituting Eq. (2) in Eq. (4-7-9) one obtains

$$\tilde{L}(\theta) = \log L(\theta, \tilde{v}_a(\theta)) = -(nm/2) \log 2\pi - (n/2) \sum_{a=1}^{m} \log\left[(1/n) \sum_{\mu=1}^{n} e_{\mu a}^2(\theta)\right] - \tfrac{1}{2} \sum_{a=1}^{m} \left[\sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \Big/ (1/n) \sum_{\mu=1}^{n} e_{\mu a}^2(\theta)\right]$$

which can be reduced to

$$\tilde{L}(\theta) = (nm/2)(\log (n/2\pi) - 1) - (n/2) \sum_{a=1}^{m} \log \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-8-3}$$

Maximizing Eq. (3) is clearly equivalent to minimizing

$$\Phi(\theta) \equiv (n/2) \sum_{a=1}^{m} \log M_{aa}(\theta) \tag{4-8-4}$$

where $\mathbf{M}(\theta)$ is the moment matrix of the residuals defined by Eq. (4-7-5). We refer to $\tilde{L}(\theta)$ as the concentrated likelihood function.

To solve our estimation problem, we proceed as follows:

1. Find $\theta^*$ which maximizes $\tilde{L}(\theta)$, or minimizes $\Phi(\theta)$.
2. Estimate $v_a^* = \tilde{v}_a(\theta^*)$, using Eq. (4-8-2).

This estimate for $v_a$ is biased but consistent. The bias may be eliminated approximately (exactly for certain linear models) by rescaling $v_a^*$ with a degrees-of-freedom correction (see Section 7-13 for the correction factor and further details).

4-9. Unknown General Covariance

The results for the case of a nondiagonal unknown covariance matrix (Eq. 4-7-6) are similar to the ones obtained in the preceding section, but require some additional matrix calculus. Let $f(\mathbf{V})$ be some scalar function of
a nonsingular matrix $\mathbf{V}$, and let $\partial f/\partial \mathbf{V}$ denote the matrix of partial derivatives of $f$ with respect to the elements of $\mathbf{V}$, i.e.,

$$(\partial f/\partial \mathbf{V})_{ab} = \partial f/\partial V_{ab} \tag{4-9-1}$$

Then the following formulas hold (see Appendix A-2):

$$\partial \log \det \mathbf{V}/\partial \mathbf{V} = (\mathbf{V}^T)^{-1} \tag{4-9-2}$$

$$\partial [\mathrm{Tr}(\mathbf{V}^{-1} \mathbf{M})]/\partial \mathbf{V} = -(\mathbf{V}^T)^{-1} \mathbf{M} (\mathbf{V}^T)^{-1} \tag{4-9-3}$$

Applying these formulas to Eq. (4-7-6) and remembering that $\mathbf{V}$ is symmetric, i.e., $\mathbf{V}^T = \mathbf{V}$, we obtain

$$\partial \log L/\partial \mathbf{V} = -(n/2) \mathbf{V}^{-1} + \tfrac{1}{2} \mathbf{V}^{-1} \mathbf{M}(\theta) \mathbf{V}^{-1} = 0 \tag{4-9-4}$$

We may rewrite Eq. (4) as

$$\mathbf{V}^{-1} = (1/n) \mathbf{V}^{-1} \mathbf{M}(\theta) \mathbf{V}^{-1} \tag{4-9-5}$$

Premultiplying and postmultiplying Eq. (5) by $\mathbf{V}$, we obtain

$$\tilde{\mathbf{V}}(\theta) = (1/n) \mathbf{M}(\theta) \tag{4-9-6}$$

We now have

$$\mathrm{Tr}[\tilde{\mathbf{V}}^{-1}(\theta) \mathbf{M}(\theta)] = n\, \mathrm{Tr}\, \mathbf{I}_m = nm \tag{4-9-7}$$

whence, by substituting Eq. (6) in Eq. (4-7-6), we are led, after simplification, to

$$\tilde{L}(\theta) = (nm/2)(\log (n/2\pi) - 1) - (n/2) \log \det \mathbf{M}(\theta) \tag{4-9-8}$$

Maximizing this is equivalent to minimizing

$$\Phi(\theta) \equiv (n/2) \log \det \mathbf{M}(\theta) \tag{4-9-9}$$

The two steps are:

1. Find $\theta^*$ to maximize $\tilde{L}(\theta)$ or minimize $\Phi(\theta)$.
2. Estimate $\mathbf{V}^* = \tilde{\mathbf{V}}(\theta^*)$ from Eq. (6).

Here, too, the estimate is biased. See Section 7-13 for possible bias removal.

If the off-diagonal elements of $\mathbf{M}(\theta)$ are neglected, we have $\det \mathbf{M} = M_{11} M_{22} \cdots M_{mm}$, and $\log \det \mathbf{M} = \sum_{a=1}^{m} \log M_{aa}$. In that case, Eq. (9) reduces to Eq. (4-8-4).

The cases dealt with in this and the preceding sections may be regarded as the solving of weighted least squares problems with unknown weights. Formulas (4-8-4) and (9) give maximum likelihood estimates in the case of a normal distribution. One is tempted, however, to recommend their use even where the form of the distribution is unknown, provided assumptions (a) and (b) in Section 4-7 are valid (see Section 4-18).

The use of Eq. (9) is illustrated by means of practical problems in Sections 5-23 and 9-7.
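The stagewise maximization of Eqs. (4-9-6) and (4-9-8) can be checked numerically for a fixed set of residuals. The Python sketch below, written for $m = 2$ variables with hypothetical residual values and hypothetical helper names, evaluates Eq. (4-7-6) at $\tilde{\mathbf{V}} = \mathbf{M}/n$, compares it with the concentrated form, and confirms that a different positive definite $\mathbf{V}$ gives a smaller log likelihood.

```python
import math

def moment_matrix(residuals):
    """M = sum over experiments of e_mu e_mu^T, Eq. (4-7-5), for 2-component residuals."""
    m11 = sum(e[0] * e[0] for e in residuals)
    m12 = sum(e[0] * e[1] for e in residuals)
    m22 = sum(e[1] * e[1] for e in residuals)
    return [[m11, m12], [m12, m22]]

def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def trace_inv_times(v, mom):
    """Tr(V^{-1} M) for 2x2 matrices."""
    d = det2(v)
    inv = [[v[1][1] / d, -v[0][1] / d], [-v[1][0] / d, v[0][0] / d]]
    return sum(inv[i][j] * mom[j][i] for i in range(2) for j in range(2))

def log_likelihood(v, mom, n):
    """Eq. (4-7-6) with m = 2 variables."""
    return (-(n * 2 / 2) * math.log(2 * math.pi)
            - (n / 2) * math.log(det2(v))
            - 0.5 * trace_inv_times(v, mom))

def concentrated_log_likelihood(mom, n):
    """Eq. (4-9-8) with m = 2 variables."""
    return (n * 2 / 2) * (math.log(n / (2 * math.pi)) - 1) - (n / 2) * math.log(det2(mom))

# Hypothetical residuals e_mu(theta) at some fixed theta: n = 5 experiments, m = 2 variables.
residuals = [(0.5, -0.2), (-0.3, 0.4), (0.1, 0.3), (-0.4, -0.1), (0.2, -0.5)]
n = len(residuals)

M = moment_matrix(residuals)
V_tilde = [[M[i][j] / n for j in range(2)] for i in range(2)]       # Eq. (4-9-6)
V_other = [[1.3 * V_tilde[0][0], V_tilde[0][1]],
           [V_tilde[1][0], 0.8 * V_tilde[1][1]]]                    # some other positive definite V

logL_at_V_tilde = log_likelihood(V_tilde, M, n)
logL_at_V_other = log_likelihood(V_other, M, n)
logL_concentrated = concentrated_log_likelihood(M, n)

print(logL_at_V_tilde, logL_concentrated, logL_at_V_other)
```

In an actual estimation problem the residuals depend on $\theta$, and the concentrated function would be passed to one of the minimization methods of Chapter V; here $\theta$ is held fixed so that only the elimination of $\mathbf{V}$ is exercised.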
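4-10. Independent Variables Subject to Error

Suppose our model is in reduced form, but the independent variables are also subject to error. It will be recalled that in this case the model equations take the form

$$\hat{\mathbf{y}}_\mu = \mathbf{f}(\hat{\mathbf{x}}_\mu, \theta) \tag{4-10-1}$$

We now have residuals of two kinds:

$$\mathbf{e}_{x\mu}(\hat{\mathbf{x}}_\mu) \equiv \mathbf{x}_\mu - \hat{\mathbf{x}}_\mu, \qquad \mathbf{e}_{y\mu}(\theta, \hat{\mathbf{x}}_\mu) \equiv \mathbf{y}_\mu - \mathbf{f}(\hat{\mathbf{x}}_\mu, \theta) \tag{4-10-2}$$

We adjoin the $s$-dimensional $\mathbf{e}_{x\mu}$ and $m$-dimensional $\mathbf{e}_{y\mu}$ into a single $(s + m)$-dimensional vector $\mathbf{e}_\mu$ of residuals

$$\mathbf{e}_\mu(\theta, \hat{\mathbf{x}}_\mu) \equiv \begin{pmatrix} \mathbf{e}_{x\mu} \\ \mathbf{e}_{y\mu} \end{pmatrix} \tag{4-10-3}$$

If the $\mathbf{e}_\mu$ are assumed normally distributed with zero means and covariance matrices $\mathbf{V}$, the likelihood function is given by

$$\log L(\theta, \hat{\mathbf{X}}, \mathbf{V}) = -[(s + m)n/2] \log 2\pi - (n/2) \log \det \mathbf{V} - \tfrac{1}{2} \sum_{\mu=1}^{n} \mathbf{e}_\mu^T(\theta, \hat{\mathbf{x}}_\mu)\, \mathbf{V}^{-1} \mathbf{e}_\mu(\theta, \hat{\mathbf{x}}_\mu) \tag{4-10-4}$$

If $\mathbf{V}$ is known in its entirety, the function to be minimized is

$$\Phi(\theta, \hat{\mathbf{X}}) \equiv \tfrac{1}{2} \sum_{\mu=1}^{n} \mathbf{e}_\mu^T(\theta, \hat{\mathbf{x}}_\mu)\, \mathbf{V}^{-1} \mathbf{e}_\mu(\theta, \hat{\mathbf{x}}_\mu) \tag{4-10-5}$$

Unfortunately, when $\mathbf{V}$ is entirely unknown, it cannot be estimated by the method of maximum likelihood. To see why this is so, we partition the matrix $\mathbf{V}$ as follows:

$$\mathbf{V} = \begin{bmatrix} \mathbf{V}_{xx} & \mathbf{V}_{xy} \\ \mathbf{V}_{xy}^T & \mathbf{V}_{yy} \end{bmatrix} \tag{4-10-6}$$

where

$$\mathbf{V}_{xx} \equiv E(\mathbf{e}_{x\mu} \mathbf{e}_{x\mu}^T), \qquad \mathbf{V}_{xy} \equiv E(\mathbf{e}_{x\mu} \mathbf{e}_{y\mu}^T), \qquad \mathbf{V}_{yy} \equiv E(\mathbf{e}_{y\mu} \mathbf{e}_{y\mu}^T) \tag{4-10-7}$$

Let us set $\mathbf{V}_{xy} = \mathbf{0}$, and $\mathbf{V}_{xx} = \varepsilon \mathbf{I}$, where $\varepsilon$ is a very small positive number, and let us set $\hat{\mathbf{x}}_\mu = \mathbf{x}_\mu$ ($\mu = 1, 2, \ldots, n$), i.e., $\mathbf{e}_{x\mu} = \mathbf{0}$. Then Eq. (4) is reduced to

$$\log L = -[(s + m)n/2] \log 2\pi - (n/2) \log \det \mathbf{V}_{yy} - (ns/2) \log \varepsilon - \tfrac{1}{2} \sum_{\mu=1}^{n} \mathbf{e}_{y\mu}^T \mathbf{V}_{yy}^{-1} \mathbf{e}_{y\mu} \tag{4-10-8}$$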
Because of the term $-(ns/2) \log \varepsilon$, the quantity $\log L$ may be made arbitrarily large by choosing $\varepsilon$ small enough. Thus, the likelihood function does not possess a maximum.

The above difficulty disappears when $\mathbf{V}_{xx}$ is known, say

$$\mathbf{V}_{xx} = \mathbf{P} \tag{4-10-9}$$

where $\mathbf{P}$ is a known positive definite matrix. If, in addition, we assume that the $x$ and $y$ errors are mutually uncorrelated, then the nonconstant part of Eq. (4) reduces to

$$\log L(\theta, \hat{\mathbf{X}}, \mathbf{V}_{yy}) = -(n/2) \log \det \mathbf{V}_{yy} - \tfrac{1}{2} \sum_{\mu=1}^{n} (\mathbf{e}_{x\mu}^T \mathbf{P}^{-1} \mathbf{e}_{x\mu} + \mathbf{e}_{y\mu}^T \mathbf{V}_{yy}^{-1} \mathbf{e}_{y\mu}) \tag{4-10-10}$$

One verifies easily that the MLE for $\mathbf{V}_{yy}$ is

$$\tilde{\mathbf{V}}_{yy} = (1/n) \mathbf{M}_{yy} = (1/n) \sum_{\mu=1}^{n} \mathbf{e}_{y\mu} \mathbf{e}_{y\mu}^T \tag{4-10-11}$$

so that the concentrated objective function to be minimized is

$$\Phi(\theta, \hat{\mathbf{X}}) = (n/2) \log \det \mathbf{M}_{yy} + \tfrac{1}{2} \sum_{\mu=1}^{n} \mathbf{e}_{x\mu}^T \mathbf{P}^{-1} \mathbf{e}_{x\mu} \tag{4-10-12}$$

The bias of the estimate Eq. (11) for $\mathbf{V}_{yy}$ can be considerable, and this estimate is not even consistent. A suitable correction factor is derived in Section 7-14.

Computationally it is often best to treat the problems discussed in this section as constrained minimization problems. That is, $\mathbf{f}(\hat{\mathbf{x}}_\mu, \theta)$ is not substituted for $\hat{\mathbf{y}}_\mu$ in the expression for $\Phi$. The $\hat{\mathbf{y}}_\mu$ are retained as explicit unknowns, and Eq. (1) is treated as a set of equality constraints. In this form, the problem is amenable to solution by the method of Sections 4-11 and 6-6 to 6-8.

4-11. Exact Structural Models

Recall that the model takes the form

$$\mathbf{g}(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \theta) = \mathbf{0} \tag{4-11-1}$$

The $\mathbf{u}_\mu$ have been measured precisely; the $\mathbf{w}_\mu$ are subject to measurement errors. The residuals are defined by Eq. (4-1-4). The likelihood function is given by Eq. (2-13-2). If the errors in each experiment are independently distributed as $N_r(\mathbf{0}, \mathbf{V})$, where $r$ is the dimension of $\mathbf{w}_\mu$, then the likelihood takes the form

$$\log L(\hat{\mathbf{W}}, \mathbf{V}) = -(n/2) \log \det \mathbf{V} - \tfrac{1}{2} \sum_{\mu=1}^{n} \mathbf{e}_\mu^T \mathbf{V}^{-1} \mathbf{e}_\mu \tag{4-11-2}$$
(constant terms have been dropped). The maximum likelihood estimate is found by determining the values of $\hat{\mathbf{W}}$ and $\theta$ which maximize $\log L$ while satisfying the constraints [Eq. (1)]. We introduce an $m$-dimensional vector of Lagrange multipliers $\lambda_\mu$ for each experiment to form the Lagrangian function

$$\Lambda(\hat{\mathbf{W}}, \theta, \mathbf{V}, \lambda_1, \ldots, \lambda_n) \equiv \log L + \sum_{\mu=1}^{n} \lambda_\mu^T\, \mathbf{g}(\mathbf{u}_\mu, \hat{\mathbf{w}}_\mu, \theta) \tag{4-11-3}$$

The solution to the estimation problem will be found at a stationary point of $\Lambda$. Numerical methods for finding the solution are described in Sections 6-6 to 6-8, and an example is worked out in detail in Section 6-13.

4-12. Data Requirements

In Sections 4-8 and 4-9, we saw that when unknown, the elements of the covariance matrix (or, equivalently, the weights for weighted least squares) could be estimated along with the model parameters. In the case of independent observations we found that we must minimize

$$\Phi_1(\theta) = \sum_{a=1}^{m} \sum_{\mu=1}^{n} v_a^{-1} e_{\mu a}^2(\theta) \tag{4-12-1}$$

when the $v_a$ are known, and

$$\Phi_2(\theta) = \sum_{a=1}^{m} \log \sum_{\mu=1}^{n} e_{\mu a}^2(\theta) \tag{4-12-2}$$

when the $v_a$ are not known. Clearly, $\Phi_1(\theta) \ge 0$ for all $\theta$, and the equality holds if and only if

$$y_{\mu a} = f_a(\mathbf{x}_\mu, \theta) \quad \text{or} \quad g_{\mu a}(\mathbf{z}_\mu, \theta) = 0 \tag{4-12-3}$$

for all $\mu$ and $a$. Thus, we have a total of $nm$ equations to be satisfied, and meaningful estimation can occur as soon as $nm$ at least equals $l$, the number of parameters to be estimated.

On the other hand, suppose we can find values of $\theta$ which satisfy Eq. (3) exactly just for one specific value of $a$. This could occur if $l_a \ge n$, where $l_a$ is the number of parameters appearing in the $a$th equation. But in this case the $a$th term in Eq. (2) is $-\infty$. For meaningful estimation, then, we must have $n > \max_a (l_a)$. In particular, if all $l$ parameters appear in every equation, we must have $n > l$.

The situation where $\mathbf{V}$ is not assumed to be diagonal is similar, but we have an additional restriction when Eq. (4-9-9) is used.
The $m \times m$ matrix $\mathbf{M}(\theta)$ is the sum of $n$ matrices $\mathbf{e}_\mu \mathbf{e}_\mu^T$, each of rank one. Hence, the rank of $\mathbf{M}$ cannot exceed $n$, and for $\mathbf{M}$ to be nonsingular it is necessary that $n \ge m$. If $\mathbf{M}$ is singular, its determinant vanishes and Eq. (4-9-9) is meaningless.
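The rank argument is easy to verify for $m = 2$. In the Python sketch below (residual values and function names are hypothetical), a single experiment gives a moment matrix whose determinant is exactly zero, while two experiments with linearly independent residual vectors give a positive determinant.

```python
def moment_matrix(residuals):
    """M = sum over experiments of e_mu e_mu^T for residual vectors of equal length."""
    m = len(residuals[0])
    return [[sum(e[a] * e[b] for e in residuals) for b in range(m)] for a in range(m)]

def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

one_experiment = [(3.0, -2.0)]                  # n = 1 < m = 2
two_experiments = [(3.0, -2.0), (1.0, 4.0)]     # n = 2 = m, independent residual vectors

det_one_experiment = det2(moment_matrix(one_experiment))
det_two_experiments = det2(moment_matrix(two_experiments))

print(det_one_experiment)    # rank-one matrix, so log det M is undefined
print(det_two_experiments)
```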
To summarize, the number $n$ of required experiments must satisfy:

1. $n > \max_a (l_a)$ if $\mathbf{V}$ is unknown. Also, $n \ge m$ if $\mathbf{V}$ is not known to be diagonal.
2. $n \ge l/m$ if $\mathbf{V}$ is known.†

More observations are usually required when $\mathbf{V}$ is unknown than when it is known. This is not surprising.

4-13. Some Other Distributions

Perhaps the greatest virtue of the maximum likelihood method is its straightforward applicability to the formulation of a wide variety of estimation problems. Given a distribution for the errors, it is an easy matter to write down the expression for the likelihood function. When this function is continuous and smooth, its maximum can be found by means of the same gradient methods of Chapter V as are applicable to the normal distribution problems. The situation is entirely different, however, in the case of a discontinuous distribution, such as the following:

Suppose our measurement errors all follow uniform distributions. Let the range of $\epsilon_{\mu a}$ be $\pm r_{\mu a}$. Any value $\theta$ for which $|e_{\mu a}(\theta)| > r_{\mu a}$ for even one $\mu, a$ has likelihood zero. All values of $\theta$ for which $|e_{\mu a}| \le r_{\mu a}$ for all $\mu, a$ possess the same positive likelihood, and are all equally acceptable as maximum likelihood estimates. It may easily happen that no such values exist. The best procedure is to find the value of $\theta$ for which

$$\Phi(\theta) \equiv \max_{\mu, a} |e_{\mu a}(\theta)/r_{\mu a}| \tag{4-13-1}$$

attains its minimum value. If this minimum value turns out to be no greater than unity, we have found a maximum likelihood estimate. Otherwise, we know that no such estimate exists. What we have found, then, is a minimax weighted deviation estimate, as described in Section 4-17.
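For the simplest possible model, $f(x, \theta) = \theta$ with a common range $r$, the minimizer of Eq. (4-13-1) is the midrange of the observations, and the test against unity can be carried out directly. The Python sketch below uses hypothetical data, ranges, and function names; it is meant only to illustrate the procedure just described.

```python
def max_weighted_deviation(y, theta, r):
    """Phi(theta) = max_mu |y_mu - theta| / r, Eq. (4-13-1), for the model f = theta."""
    return max(abs(yi - theta) for yi in y) / r

def minimax_constant(y):
    """Minimizer of max_mu |y_mu - theta| for the model f = theta: the midrange."""
    return (min(y) + max(y)) / 2.0

y = [4.1, 3.8, 4.4, 4.0]     # hypothetical observations
theta_star = minimax_constant(y)

phi_wide = max_weighted_deviation(y, theta_star, r=0.5)     # assumed error range of +/- 0.5
phi_narrow = max_weighted_deviation(y, theta_star, r=0.2)   # assumed error range of +/- 0.2

print(theta_star, phi_wide, phi_narrow)
print("MLE exists for r = 0.5:", phi_wide <= 1.0)
print("MLE exists for r = 0.2:", phi_narrow <= 1.0)
```

With the wider range every residual fits inside the error bounds, so the minimax estimate is also a maximum likelihood estimate; with the narrower range no value of $\theta$ has positive likelihood.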
When the range of $\epsilon_{\mu a}$ is not known, but all errors are assumed to have the same range, then minimizing $\Phi(\theta) \equiv \max |e_{\mu a}(\theta)|$ gives a maximum likelihood estimate, and the minimum value of $\Phi$ is an estimated value of all $r_{\mu a}$.

If the errors have the two-sided exponential distribution

$$p(\epsilon_{\mu a}) = c_{\mu a} \exp(-k_{\mu a} |\epsilon_{\mu a}|) \tag{4-13-2}$$

then, provided the $k_{\mu a}$ are known, the maximum likelihood estimate calls for minimizing the weighted sum of absolute values of the residuals

$$\Phi(\theta) \equiv \sum_{\mu, a} k_{\mu a} |e_{\mu a}(\theta)| \tag{4-13-3}$$

† In some exceptional circumstances, a smaller number of experiments may suffice.
One verifies easily that the constants $c_{\mu a}$ and $k_{\mu a}$ are related as follows to the standard deviation $\sigma_{\mu a}$:

$$c_{\mu a} = 1/\sqrt{2}\, \sigma_{\mu a}, \qquad k_{\mu a} = \sqrt{2}/\sigma_{\mu a} \tag{4-13-4}$$

If we assume $\sigma_{\mu a} = \sigma_a$ for all $\mu$, and if all errors are independent, the log likelihood is

$$\log L = -(nm/2) \log 2 - n \sum_{a=1}^{m} \log \sigma_a - \sqrt{2} \sum_{a=1}^{m} (1/\sigma_a) \sum_{\mu=1}^{n} |e_{\mu a}(\theta)| \tag{4-13-5}$$

Differentiating with respect to $\sigma_a$, equating to zero, and solving for $\sigma_a$ gives the maximum likelihood estimate

$$\tilde{\sigma}_a = (\sqrt{2}/n) \sum_{\mu=1}^{n} |e_{\mu a}(\theta)| \tag{4-13-6}$$

Substituting back in Eq. (5) we eventually find that to estimate $\theta$ when the $\sigma_a$ are unknown, we must minimize

$$\Phi(\theta) \equiv n \sum_{a=1}^{m} \log \sum_{\mu=1}^{n} |e_{\mu a}(\theta)| \tag{4-13-7}$$

The objective functions of Eqs. (3) and (7) may be brought into the realm of conventional mathematical programming problems by means of the following device: We define new variables $e_{\mu a}^{+}$ and $e_{\mu a}^{-}$ satisfying

$$e_{\mu a}(\theta) = e_{\mu a}^{+} - e_{\mu a}^{-}, \qquad e_{\mu a}^{+} \ge 0, \qquad e_{\mu a}^{-} \ge 0 \tag{4-13-8}$$

Equation (3) is replaced with

$$\Phi(\theta, e_{\mu a}^{+}, e_{\mu a}^{-}) = \sum_{\mu, a} k_{\mu a} (e_{\mu a}^{+} + e_{\mu a}^{-}) \tag{4-13-9}$$

Clearly, if $e_{\mu a}(\theta)$ is positive then $e_{\mu a}(\theta) = e_{\mu a}^{+}$ and $e_{\mu a}^{-} = 0$, and vice versa. The theory of mathematical programming leads us to expect that the number of nonzero variables in the solution will equal $mn$, i.e., the number of equality constraints in Eq. (8). Among these will be the $l$ parameters $\theta$, leaving only $mn - l$ nonzero residuals. This means that the fitted equations will pass exactly through at least $l$ of the observed data points. Hence this estimation method is relatively insensitive to the presence of a few observations with very large errors; these are simply ignored.

The mathematical programming formulations of the problems given by Eqs. (1) and (3) are discussed by Kelley (1958) and Wagner (1959), who deal specifically with the linear programming problems which arise when the model equations are linear in the parameters.
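Both properties noted above, the exact fit through $l$ data points and the insensitivity to a few large errors, can be seen in the one-parameter model $f(x, \theta) = \theta$, for which minimizing $\sum_\mu |y_\mu - \theta|$ yields a median of the observations. The Python sketch below uses hypothetical data and function names; it searches over the data points themselves, which is sufficient for this model because the objective is piecewise linear and convex with kinks only at the $y_\mu$.

```python
import math

def sum_abs(y, theta):
    """Sum of absolute residuals for the model f = theta, Eq. (4-13-3) with all k equal to 1."""
    return sum(abs(yi - theta) for yi in y)

def lad_constant(y):
    """Least-absolute-deviations estimate for f = theta, found by trying each data point."""
    return min(y, key=lambda t: sum_abs(y, t))

y_clean = [2.0, 2.3, 1.9, 2.1, 2.2]       # hypothetical observations
y_outlier = [2.0, 2.3, 1.9, 2.1, 50.0]    # same data with one grossly erroneous value

theta_clean = lad_constant(y_clean)
theta_outlier = lad_constant(y_outlier)
mean_outlier = sum(y_outlier) / len(y_outlier)   # least squares estimate, for comparison

# Eq. (4-13-6): maximum likelihood estimate of sigma for the clean data.
sigma_tilde = (math.sqrt(2) / len(y_clean)) * sum_abs(y_clean, theta_clean)

print(theta_clean, theta_outlier, mean_outlier, sigma_tilde)
```

The estimate coincides with one of the observations ($l = 1$) and does not move when one observation is replaced by a wild value, whereas the least squares estimate (the mean) is pulled far from the bulk of the data.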
C. Bayesian Estimation

4-14. Definition

In the estimation methods discussed so far we have made no use of the prior information, which in Sections 2-16 to 2-19 we have treated as an integral part of the problem. As we have seen, the posterior pdf $p^*(\phi)$ is given (Eq. 2-19-1) by Bayes' theorem as

$$p^*(\phi) = c\, L(\phi)\, p_0(\phi) \tag{4-14-1}$$

where $L(\phi)$ is the likelihood function, and $p_0(\phi)$ the prior pdf, which summarizes the prior information. Estimates which make use of the prior information are usually based on the posterior distribution, and are therefore known as Bayesian estimates.

If $p^*(\phi)$ is to be a pdf, we must have $\int p^*(\phi)\, d\phi = 1$, and hence $c$ must be $1/J$, where

$$J \equiv \int L(\phi)\, p_0(\phi)\, d\phi \tag{4-14-2}$$

We refer to the function $p^*(\phi)$ as a proper or improper posterior distribution if the integral $J$ does or does not exist, respectively. In the latter case, we let $c = 1$. The following are sufficient but by no means necessary conditions for the existence of $J$:

1. $L(\phi)$ is bounded and $p_0(\phi)$ is normal.
2. $L(\phi)$ and $p_0(\phi)$ are bounded, and $p_0(\phi)$ vanishes everywhere except in a bounded region of $\phi$ space.

To select an estimate for the parameters $\phi$, we pick some typical values of the posterior distribution, such as the mean, median, or mode. Such values are referred to as location parameters of the distribution since they locate the region in $\phi$ space where most realizations of the random variable occur. Some of these location parameters exist even if the posterior distribution is improper, while others may not exist even for proper distributions. Among those parameters which exist for a given problem, the choice is somewhat arbitrary. In the sequel we shall describe two distinct approaches toward making this choice.

At this point it is well to summarize some of the benefits that accrue from Bayesian estimation:

1. One is sure to obtain estimates which are physically meaningful.
It is guaranteed that estimates for parameters known to be positive are indeed positive. 
2. The model equations may be degenerate relative to some of the parameters. For instance, the model for the falling sphere [Eq. (2-14-5)] contains only two independent combinations of five distinct parameters. Non-Bayesian methods can be used to estimate at most two of these parameters, and exact knowledge of the others is required. But if inexact prior information is available on at least three of the parameters, then the posterior density, being a nondegenerate function of all five, can be used to estimate all five.

4-15. Mode of the Posterior Distribution

The natural extension of the maximum likelihood method to Bayesian estimation problems consists of looking for the mode of the posterior distribution. That is, we accept as our estimate the value of $\phi$ for which $p^*(\phi)$ is maximum. This method, to which we refer as MPD (maximum of posterior distribution), offers the following advantages:

1. The estimate coincides with the maximum likelihood estimate in case of a uniform prior distribution, since then $p^*(\phi)$ is proportional to $L(\phi)$. The estimates coincide even if $p_0(\phi)$ is uniform only within a bounded region and zero elsewhere, provided only the maximum of $L(\phi)$ occurs within this region. The practitioner who accepts the MLE when no prior information is given would naturally wish his estimates to be affected only slightly when a slight amount of prior information becomes available. MPD satisfies this requirement.

2. We know from a theorem by von Mises (1919) that if $p_0(\phi)$ is continuous and does not vanish at the maximum of $L(\phi)$, then the MPD converges to the MLE as the number of experiments is increased indefinitely. The MPD shares the consistency and asymptotic efficiency of the MLE.

3. The MPD can be obtained whether or not $p^*(\phi)$ is a proper distribution.

4.
It is usually much easier to compute the MPD than other Bayesian estimates.

In computing MPD estimates, we distinguish two cases:

(a) The prior distribution does not vanish anywhere. In this case, we maximize

$$\Psi(\phi) = \log L(\phi) + \log p_0(\phi) \tag{4-15-1}$$

The same techniques as are used for MLE can be applied here. In particular, if $L(\phi)$ is one of the normal cases discussed in Section 4-8 and Section 4-9, and if $p_0$ does not depend on the elements of $\mathbf{V}$, then those may be eliminated as before, and the concentrated likelihood may replace the likelihood in
Eq. (1). Care must be taken, however, to retain any constants multiplying the concentrated likelihood. For instance, if $L$ is given by Eq. (4-7-6), we may use Eq. (4-9-8) to replace Eq. (1) by

$$\Phi(\theta) = (n/2) \log \det \mathbf{M}(\theta) - \log p_0(\theta) \tag{4-15-2}$$

which is to be minimized. For numerical examples, see Sections 5-22 and 8-7. Note that in the presence of $\log p_0(\theta)$, the factor $n/2$ may not be dropped. In the case of single equation least squares with unknown $\sigma$, the term $(n/2) \log \det \mathbf{M}$ takes the form $(n/2) \log \sum_{\mu=1}^{n} e_\mu^2(\theta)$.

(b) The prior distribution vanishes outside the region defined by a set of constraints

$$\mathbf{h}(\phi) \ge \mathbf{0} \tag{4-15-3}$$

In this case we have a typical nonlinear programming problem: find the maximum of Eq. (1) subject to all the applicable constraints. Methods of dealing with this problem are described in Chapter VI.

4-16. Minimum Risk Estimates

So far our motive has been to find values of $\theta$ which are most likely to be close to the true values. Sometimes, however, the estimated value is required for a specific purpose, e.g., for designing a plant, and we are interested in finding the value of $\theta$ which is best for this particular purpose. In many situations, what is "best" is determined by economic considerations, and the choice of the best estimate can be made by means of decision theory.

In decision theory, a cost is assigned to any loss suffered because of an error in the estimate. That is, to the act of using the parameter value $\phi^*$ when the true value is $\phi$ we assign the cost $C(\phi^*, \phi)$. Since $\phi$ is unknown, the actual cost $C(\phi^*, \phi)$ cannot be computed. However, if we are willing to say that $\phi$ is distributed according to the posterior distribution, then we can compute the risk, defined as the expected value of the cost of assigning the value $\phi^*$ to $\phi$:

$$R(\phi^*) \equiv E\, C(\phi^*, \phi) = \int C(\phi^*, \phi)\, p^*(\phi)\, d\phi \tag{4-16-1}$$

The minimum risk estimate (MRE) is defined as the value of $\phi^*$ which minimizes $R(\phi^*)$.
Here $p^*(\phi)$ must be a proper pdf.

The following is a simple example: A manufacturer conducts experiments to measure the tensile strength $\theta$ of an alloy. He intends to use the alloy to manufacture a component whose size, and hence cost, will be inversely proportional to $\theta$. Let $\theta^*$ be the estimate
to be used for $\theta$. Then the cost of the component will be $a/\theta^*$ dollars (any additional fixed cost is irrelevant to the present discussion). The component will fail if the true value $\theta$ is less than $\theta^*$. If the component does fail, the manufacturer will have to pay a fine of $K$ dollars. His total cost will be

$$C(\theta^*, \theta) = \begin{cases} a/\theta^* & (\theta^* \leq \theta) \\ K + a/\theta^* & (\theta^* > \theta) \end{cases} \tag{4-16-2}$$

Assuming that the posterior density $p^*(\theta)$ summarizes all available information on $\theta$, the risk or expected cost is

$$R(\theta^*) = \int_{-\infty}^{\infty} C(\theta^*, \theta)\, p^*(\theta)\, d\theta = a/\theta^* + K \int_{-\infty}^{\theta^*} p^*(\theta)\, d\theta \tag{4-16-3}$$

To find the minimum risk estimate, we differentiate:

$$dR/d\theta^* = -a/(\theta^*)^2 + K p^*(\theta^*) = 0 \tag{4-16-4}$$

Hence, one should use the value $\theta^*$ which satisfies the equation

$$(\theta^*)^2 p^*(\theta^*) = a/K \tag{4-16-5}$$

In a sense, the MRE is not really an estimate. The value of $\theta^*$ which satisfies Eq. (5) cannot be considered the most likely to be true; it is merely the value which in the given economic situation involves the least risk.

Attempts to use decision-theory-like methods in pure (i.e., economics-free) estimation problems usually start with the assumption of quadratic cost functions taking the form

$$C(\phi^*, \phi) = (\phi^* - \phi)^T P (\phi^* - \phi) \tag{4-16-6}$$

where $P$ is a given positive definite weighting matrix. This essentially defines the cost as a weighted sum of squares of the estimation errors. Substituting Eq. (6) in Eq. (1) yields

$$R(\phi^*) = \int (\phi^* - \phi)^T P (\phi^* - \phi)\, p^*(\phi)\, d\phi \tag{4-16-7}$$

and

$$\partial R/\partial \phi^* = 2P \int (\phi^* - \phi)\, p^*(\phi)\, d\phi \tag{4-16-8}$$

Where $R(\phi^*)$ attains its minimum, $\partial R/\partial \phi^*$ vanishes, whence (assuming $P$ nonsingular)

$$\int \phi^* p^*(\phi)\, d\phi = \int \phi\, p^*(\phi)\, d\phi \tag{4-16-9}$$

Since $\phi^*$ is a constant and $\int p^*(\phi)\, d\phi = 1$, Eq. (9) reduces to

$$\phi^* = \int \phi\, p^*(\phi)\, d\phi \tag{4-16-10}$$
We conclude, then, that the MRE for a quadratic cost function is the mean of the posterior distribution. More explicitly, Eq. (10) can be written as

$$\phi^* = \int \phi\, L(\phi)\, p_0(\phi)\, d\phi \Big/ \int L(\phi)\, p_0(\phi)\, d\phi \tag{4-16-11}$$

Fortunately, the estimate Eq. (11) does not depend on the weights $P$. Hence, one need not worry about what values should be assigned to them.

There are many practical disadvantages associated with this MRE:

(a) The estimate does not exist if $p^*(\phi)$ is an improper distribution. Consider the case where our model takes the form

$$y_\mu = \alpha \exp(-\theta x_\mu) \tag{4-16-12}$$

$\alpha$ being a known constant. Assuming a normal distribution with standard deviation $\sigma$, we have

$$L(\theta) = (2\pi)^{-n/2} \sigma^{-n} \exp\Big\{ -(1/2\sigma^2) \sum_{\mu=1}^{n} [y_\mu - \alpha \exp(-\theta x_\mu)]^2 \Big\} \tag{4-16-13}$$

Suppose all $x_\mu$ are positive. As $\theta$ increases beyond bound, $L(\theta)$ becomes proportional to $\exp[-(1/2\sigma^2) \sum_{\mu=1}^{n} y_\mu^2] \neq 0$. Hence, if $p_0(\theta)$ is uniform for all values of $\theta$, the integrals in Eq. (11) diverge. What is even worse, however, is the fact that if we assume $p_0(\theta) = 0$ outside the region $0 \leq \theta \leq A$, the integrals in Eq. (11) exist, but their ratio tends to infinity with $A$. Thus, the estimate Eq. (11) is not robust under seemingly unimportant changes in $p_0(\theta)$. After all, the value chosen for $A$ is arbitrary, and the estimate should not depend strongly on this choice. The source of the difficulty here is the sensitivity of the MRE to the tails of the assumed distribution.

(b) Even when the integrals in Eq. (11) exist, their evaluation may be impractical. If $\phi$ is an $l$-dimensional vector, then $l + 1$ integrals must be evaluated: one in the denominator, and one for each component of $\phi$ in the numerator. Each one of these integrations must be carried out over an $l$-dimensional space, each dimension possibly extending from $-\infty$ to $+\infty$. No satisfactory methods for performing such integrations (unless $l = 1$) are available. In addition,
any reasonable approach to this integration problem requires finding, as a first step, the location of the mode of the posterior distribution. Thus, computation of the MPD is a prerequisite to the computation of the MRE.

(c) The MRE is not invariant under reparametrization, whereas the MPD is.

(d) The MRE does not generally converge to the MLE as $p_0(\phi)$ approaches the uniform distribution.
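As a rough numerical illustration of the tensile-strength example, Eqs. (4-16-2) through (4-16-5), the sketch below assumes a normal posterior density for $\theta$ and made-up values of $a$ and $K$; neither assumption comes from the text. It solves Eq. (4-16-5) by bisection on a bracket below the posterior mean and compares the resulting risk with the risk of simply using the posterior mean.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the assumed normal posterior p*(theta)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean, sd):
    """Pr(theta < x) under the assumed normal posterior."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def risk(theta_star, a, K, mean, sd):
    """Eq. (4-16-3): a/theta* + K * Pr(theta < theta*)."""
    return a / theta_star + K * normal_cdf(theta_star, mean, sd)

def mre_by_bisection(a, K, mean, sd, lo, hi, tol=1e-10):
    """Solve Eq. (4-16-5), theta*^2 p*(theta*) = a/K, on the bracket [lo, hi]."""
    def g(t):
        return t * t * normal_pdf(t, mean, sd) - a / K
    if not (g(lo) < 0.0 < g(hi)):
        raise ValueError("bracket must straddle the root of Eq. (4-16-5)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical numbers: unit cost constant a, fine K, posterior N(100, 5^2).
a, K, mean, sd = 1000.0, 500.0, 100.0, 5.0
theta_mre = mre_by_bisection(a, K, mean, sd, lo=50.0, hi=mean)
risk_at_mre = risk(theta_mre, a, K, mean, sd)
risk_at_mean = risk(mean, a, K, mean, sd)
print(theta_mre, risk_at_mre, risk_at_mean)
```

With these numbers the minimum risk value lies well below the posterior mean, which reflects the asymmetric penalty for overestimating the strength rather than any statement about which value of $\theta$ is most probable.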
In conclusion, the MRE can be recommended only where called for by true economic decision making purposes. For a further discussion of minimum risk estimates, not confined to quadratic cost functions, the reader is referred to Chapters 2 and 3 of Deutsch (1965) and Chapter 6 of Raiffa and Schlaifer (1961).

D. Other Methods

4-17. Minimax Deviation

The parameters are determined in such a way as to minimize the maximum deviation of the model from the data. This is particularly useful for design purposes (see Section 2-5), or for obtaining maximum likelihood estimates with uniform error distributions (Section 4-13). Such estimates are sometimes called Chebyshev estimates.

Let $\zeta$ denote the magnitude of the largest residual. Then the following conditions are satisfied:

$$e_{\mu a}(\theta) \leq \zeta, \qquad -e_{\mu a}(\theta) \leq \zeta \qquad (\mu = 1, 2, \ldots, n;\ a = 1, 2, \ldots, m) \tag{4-17-1}$$

or, equivalently,

$$e_{\mu a}^2(\theta) \leq \zeta^2 \qquad (\mu = 1, 2, \ldots, n;\ a = 1, 2, \ldots, m) \tag{4-17-2}$$

Our problem may then be formulated as follows: Find the values of the parameters $\theta$ and $\zeta$ which minimize the objective function

$$\Phi(\theta, \zeta) \equiv \zeta \tag{4-17-3}$$

while satisfying Eq. (1) or Eq. (2). This is a classical nonlinear programming problem. It is possible to attach different weights to different residuals, i.e., by replacing Eq. (2) with

$$\alpha_{\mu a} e_{\mu a}^2 \leq \zeta^2 \qquad (\mu = 1, 2, \ldots, n;\ a = 1, 2, \ldots, m) \tag{4-17-4}$$

with $\alpha_{\mu a}$ given positive numbers.

Numerical procedures for solving this problem are discussed in Section 6-5. If the model equations are linear, the algorithm of Bartels and Golub (1968) may be used.
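For a model that is linear in the parameters, Eqs. (4-17-1) and (4-17-3) form a linear program in the unknowns $(\theta, \zeta)$. The sketch below is only an illustration under that assumption: it uses SciPy's general-purpose `linprog` solver rather than the Bartels and Golub algorithm cited above, and the straight-line model and the five data points are invented.

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_fit(X, y):
    """Minimize zeta subject to |y - X theta| <= zeta (Eq. 4-17-1), X theta linear."""
    n, l = X.shape
    c = np.zeros(l + 1)
    c[-1] = 1.0                                   # objective: Phi(theta, zeta) = zeta
    ones = np.ones((n, 1))
    # e <= zeta  ->  -X theta - zeta <= -y ;  -e <= zeta  ->  X theta - zeta <= y
    A_ub = np.vstack([np.hstack([-X, -ones]), np.hstack([X, -ones])])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * l + [(0.0, None)]   # theta free, zeta nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:l], res.x[-1]

# Hypothetical data for the straight line y = theta_1 + theta_2 * x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.9, 4.1])
X = np.column_stack([np.ones_like(x), x])

theta_minimax, zeta = chebyshev_fit(X, y)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
max_dev_minimax = np.max(np.abs(y - X @ theta_minimax))
max_dev_ls = np.max(np.abs(y - X @ theta_ls))
print(theta_minimax, zeta, max_dev_ls)
```

The largest absolute residual of the minimax fit equals $\zeta$ and cannot exceed that of the least squares fit on the same data, since least squares minimizes a different criterion.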
4-18. Pseudomaximum Likelihood

In this method we employ the maximum likelihood equations derived on the assumption of a normal distribution, regardless of whether the distribution is or is not in fact normal. Since in practice we often assume normality even when we have no basis for doing so, pseudomaximum likelihood is perhaps the most widely used method. We may regard any use of nonlinear least squares or weighted least squares as an application of this method. Equations (4-8-4) and (4-9-9) are the most important extensions of pseudomaximum likelihood beyond the weighted least squares concept.

4-19. Linearizing Transformations

Consider the model equations $y = f(x, \theta)$. Suppose we were able to effect a transformation of variables $\tilde{y} = t(y)$ in such a way that the function $t[f(x, \theta)]$ is linear in $\theta$. Then we apply the method of multiple linear regression to estimate $\theta$. The advantage gained derives from the fact that this estimate may be obtained by direct calculations, whereas nonlinear estimation procedures require complicated iterative schemes. Understandably, this method was very popular before nonlinear estimation codes for electronic computers became available.

We illustrate by means of a simple example: Let

$$y = x_1 \exp(-\theta x_2) \tag{4-19-1}$$

be our model equation, with $\theta$ to be estimated. Letting $\tilde{y} \equiv \log y$ transforms the equation into

$$\tilde{y} = \log x_1 - \theta x_2 \tag{4-19-2}$$

This is linear in $\theta$, which can be estimated, say, by minimizing

$$\Phi(\theta) \equiv \sum_{\mu=1}^{n} (\log y_\mu - \log x_{\mu 1} + \theta x_{\mu 2})^2$$

The method can be applied equally well when the transformed equations are linear not in the $l$ parameters $\theta$, but in a set of $l$ independent functions of them, say $\pi(\theta)$. Then we can obtain linear regression estimates $\pi^*$ of the $\pi$, and estimates of $\theta$ by solving the equations $\pi(\theta^*) = \pi^*$. To illustrate by means of a trivial example, let

$$y = \theta_1 \exp(-\theta_2 x) \tag{4-19-3}$$
with $\theta_1$ and $\theta_2$ to be estimated. As before, let $\tilde{y} \equiv \log y$ so that

$$\tilde{y} = \log \theta_1 - \theta_2 x \tag{4-19-4}$$

Letting $\pi_1 \equiv \log \theta_1$, $\pi_2 \equiv \theta_2$, we have

$$\tilde{y} = \pi_1 - \pi_2 x \tag{4-19-5}$$

which is linear in $\pi_1$ and $\pi_2$. Estimates for $\theta$ may be obtained from those of $\pi$ by means of

$$\theta_1^* = \exp(\pi_1^*); \qquad \theta_2^* = \pi_2^*$$

Other examples, arising in chemical reaction kinetics, are

(a)

$$y = x_1 \exp[-\theta_1 x_3 \exp(-\theta_2/x_2)] \tag{4-19-6}$$

which, under $\tilde{y} \equiv \log[-\log(y/x_1)]$, $\pi_1 \equiv \log \theta_1$, $\pi_2 \equiv \theta_2$, transforms into

$$\tilde{y} = \pi_1 - (1/x_2)\pi_2 + \log x_3 \tag{4-19-7}$$

and (b)

$$y = \theta_1 x_1/(1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4)^2 \tag{4-19-8}$$

which under $\tilde{y} \equiv (x_1/y)^{1/2}$, $\pi_1 \equiv 1/\theta_1^{1/2}$, $\pi_2 \equiv \theta_2/\theta_1^{1/2}$, $\pi_3 \equiv \theta_3/\theta_1^{1/2}$, $\pi_4 \equiv \theta_4/\theta_1^{1/2}$ becomes

$$\tilde{y} = \pi_1 + x_2 \pi_2 + x_3 \pi_3 + x_4 \pi_4 \tag{4-19-9}$$

The main objection that can be raised to this method (other than its limited applicability) is that the statistical distribution of the errors in the calculated values of $\tilde{y}_\mu \equiv t(y_\mu)$ is not the same as that of the errors in $y_\mu$. Therefore it may be appropriate to apply the least squares criterion to the residuals in $y$ but not in $\tilde{y}$. We can overcome this problem in part by recognizing that if $V_\mu$ is the covariance matrix of the errors in $y_\mu$, then (provided these errors are small and the transformation $t$ and its derivatives are continuous and bounded) the covariance matrix of $\tilde{y}_\mu$ is approximately

$$\tilde{V}_\mu = (\partial t/\partial y_\mu)\, V_\mu\, (\partial t/\partial y_\mu)^T \tag{4-19-10}$$

Hence, in place of minimizing $\sum_{\mu=1}^{n} (y_\mu - f_\mu)^T V_\mu^{-1} (y_\mu - f_\mu)$ we should minimize $\sum_{\mu=1}^{n} [\tilde{y}_\mu - t(f_\mu)]^T \tilde{V}_\mu^{-1} [\tilde{y}_\mu - t(f_\mu)]$. In the single equation case, a variance $\sigma^2$ of $y_\mu$ is translated, approximately, into a variance $\tilde{\sigma}^2 = (\partial \tilde{y}/\partial y_\mu)^2 \sigma^2$ of $\tilde{y}_\mu$.
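The following sketch applies Eqs. (4-19-3) through (4-19-5) together with the single-equation form of Eq. (4-19-10). For $\tilde{y} = \log y$ one has $\partial \tilde{y}/\partial y = 1/y$, so a constant variance $\sigma^2$ in $y$ becomes approximately $\sigma^2/y^2$ in $\tilde{y}$, and each transformed residual is weighted by $y_\mu^2$. The data, the "true" parameter values, and the alternating perturbation are all invented for illustration.

```python
import numpy as np

def fit_log_linear(x, y, weighted=True):
    """Estimate theta_1, theta_2 in y = theta_1 * exp(-theta_2 * x) by regressing
    log y on x. With weighted=True each residual is scaled by y, i.e. weighted
    by y^2, which follows from Eq. (4-19-10) for a constant variance in y."""
    A = np.column_stack([np.ones_like(x), -x])    # y_tilde = pi_1 - pi_2 * x
    y_tilde = np.log(y)
    w = y if weighted else np.ones_like(y)        # square roots of the weights
    pi, *_ = np.linalg.lstsq(A * w[:, None], y_tilde * w, rcond=None)
    return np.exp(pi[0]), pi[1]                   # theta_1* = exp(pi_1*), theta_2* = pi_2*

# Hypothetical data: exact model values plus a small alternating perturbation.
theta_true = (2.0, 0.7)
x = np.linspace(0.0, 4.0, 9)
y_exact = theta_true[0] * np.exp(-theta_true[1] * x)
y_noisy = y_exact + 0.02 * (-1.0) ** np.arange(x.size)

theta_exact_fit = fit_log_linear(x, y_exact)
theta_weighted = fit_log_linear(x, y_noisy, weighted=True)
theta_unweighted = fit_log_linear(x, y_noisy, weighted=False)
print(theta_weighted, theta_unweighted)
```

With error-free data both variants recover the parameters exactly; with perturbed data the two variants generally give different estimates, because the unweighted fit gives the small values of $y$, whose logarithms are least precise, the same influence as the large ones.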
The transformation $t$ usually introduces a bias; i.e., if the errors in $y_\mu$ have zero means, those in $\tilde{y}_\mu$ do not. This bias can usually be neglected under the above assumptions.

4-20. Minimum Chi-Square Method

The minimum chi-square method (Cramér, 1946; Rao, 1957) is used to put statistical estimation problems (i.e., estimating the parameters in a probability distribution) in a least squares form. Suppose we have observed $N$ realizations $x_i$ of a random variable $\zeta$, and suppose $\zeta$ is assumed to have a pdf $p(\zeta \mid \theta)$. If we divide the entire range of $\zeta$ into $n$ disjoint intervals $[a_j, b_j]$, then the expected number of observations $x_i$ falling in each interval is

$$E_j(\theta) \equiv N \Pr(a_j < \zeta \leq b_j) = N \int_{a_j}^{b_j} p(x \mid \theta)\, dx$$

Let $N_j$ be the number of $x_i$ actually observed in this interval. The minimum chi-square method consists in finding the value of $\theta$ which minimizes

$$\Phi(\theta) \equiv \sum_j [N_j - E_j(\theta)]^2 / E_j(\theta) \tag{4-20-1}$$

In the modified minimum chi-square method, $1/N_j$ is used in place of $1/E_j$ as the weighting factor:

$$\Phi(\theta) \equiv \sum_j [N_j - E_j(\theta)]^2 / N_j \tag{4-20-2}$$

We must choose our intervals in such a way that $N_j \neq 0$ for all $j$. The modified form Eq. (2) is easier to use, since the denominators are constants.

Both estimates are consistent and asymptotically efficient (if each $N_j$ goes to infinity). These properties are identical to those of the maximum likelihood method, and the latter may be preferred on account of its greater simplicity. The minimum chi-square method, however, enables us to examine the fit of the proposed distribution to the data throughout the range of the distribution, by comparing the observed occurrences $N_j$ with the predicted values $E_j(\theta^*)$. When the number of observations is small, the loss of information due to the grouping of the data may be considerable, and the method cannot be recommended.

4-21. Problems

1. Work out the statistical assumptions on the distribution of errors which correspond to the weighting schemes of Eqs. (4-3-3)-(4-3-6).

2.
Verify the relation Eq. (4-4-15) for the two parameter case (i.e., V is
2 × 2) with U = I. Explain why the latter assumption entails no loss of generality.

3. Derive Eqs. (4-4-17)-(4-4-10).

4. Show that if the model equations are linear in the parameters, the covariance matrix is known, and the distribution of errors is normal, then the MLE is efficient.

5. Show that if the errors in each experiment are normally and independently distributed with covariance matrix $V = \tau Q$, where $Q$ is a known matrix and $\tau$ is an unknown constant, then the MLE can be found by minimizing the concentrated objective function

$$\Phi(\theta) = (nm/2) \log \mathrm{Tr}[Q^{-1} M(\theta)] \tag{4-21-1}$$

and $\tau$ can be estimated from

$$\hat{\tau}(\theta) = (1/nm)\, \mathrm{Tr}[Q^{-1} M(\theta)] \tag{4-21-2}$$

6. Consider the exact structural model $y_\mu - \theta x_\mu = 0$ $(\mu = 1, 2, \ldots, n)$. The measurement errors of $x_\mu$ and $y_\mu$ are normal and independent with known variances $\sigma_x^2$ and $\sigma_y^2$, respectively. Show that the MLE for $\theta$ is

$$\theta^* = \{\alpha S_{yy} - S_{xx} + [(\alpha S_{yy} - S_{xx})^2 + 4\alpha S_{xy}^2]^{1/2}\} / 2\alpha S_{xy}$$

where

$$\alpha \equiv \sigma_x^2/\sigma_y^2, \qquad S_{xx} \equiv \sum_{\mu=1}^{n} x_\mu^2, \qquad S_{yy} \equiv \sum_{\mu=1}^{n} y_\mu^2, \qquad S_{xy} \equiv \sum_{\mu=1}^{n} x_\mu y_\mu$$

Show that this estimate converges to the usual least squares estimate $S_{xy}/S_{xx}$ as $\sigma_x \to 0$. See also Barnett (1967) for a slightly more general case.

7. Consider a sequence of disjoint time intervals of lengths $t_1, t_2, \ldots, t_n$. Let the number of phone calls passing through an exchange in the $\mu$th interval be $N_\mu$. Under the assumption of a Poisson process, the probability that $N_\mu = k$ $(k = 0, 1, 2, \ldots)$ is $[(\lambda t_\mu)^k / k!] \exp(-\lambda t_\mu)$, where $\lambda$ is the average arrival rate of calls. Find the maximum likelihood and minimum chi-square estimates for $\lambda$.
8. Consider the following model: Certain "state" variables $s$ are functions of the independent variables $x$,

$$s_\mu = f(x_\mu, \theta) + \varepsilon_\mu$$

and the dependent variables $y_\mu$ are linear functions of the state variables ($B$ is a known matrix),

$$y_\mu = B s_\mu + \delta_\mu$$
Assume that $\varepsilon_\mu$ and $\delta_\mu$ are random variables independently distributed as $N_r(0, P)$ and $N_m(0, Q)$ respectively, that $Q$ is a known matrix, and that $m > r$. Show that the MLE of $\theta$ can be obtained as follows:

i. Compute the multiple linear regression estimates of the "true values" of the state variables

$$\hat{s}_\mu = (B^T Q^{-1} B)^{-1} B^T Q^{-1} y_\mu \qquad (\mu = 1, 2, \ldots, n)$$

ii. Use the computed $\hat{s}_\mu$ as though they were actually measured values of $s_\mu$, and apply the appropriate MLE estimate from Sections 4-7 through 4-9.

Note: For an application, see Section 8-8.

9. Suppose each experiment consists of measuring the same quantity $y$ several ($m$) times. Show how the results of the previous problem can be applied to the present case. Note that under a reasonable set of assumptions one has $Q = \sigma^2 I$, and one does not have to know the value of $\sigma^2$ in order to apply the method.

10. Let $S$ be a matrix such that $S^T S = V^{-1}$. Define $\tilde{B} = SB$ and $\tilde{Y} = SY$. Show that this transformation reduces Eq. (4-4-4) to an unweighted sum of squares.
Chapter V

Computation of the Estimates I: Unconstrained Problems

5-1. Introduction

Most parameter estimation methods require that we find values $\phi^*$ of the parameters $\phi$ for which some objective function $\Phi(\phi)$ attains its maximum or minimum. Typical objective functions to be minimized are the sum of squares, weighted sum of squares, and risk function. Those to be maximized are the likelihood, concentrated likelihood, and posterior density functions. All these were described in Chapter IV.

In most practical applications, any unknown distribution parameters $\psi$ are eliminated from the objective function by methods such as those described in Sections 4-8 and 4-9. Therefore, we shall write our objective function as $\Phi(\theta)$ rather than $\Phi(\phi)$ and we regard $\theta$ as a vector with $l$ components. Some of the methods of solution we shall discuss are easily extended to the case when distribution parameters remain present in the objective function.

Sometimes the unknown parameters are free to assume any values whatsoever, and we speak of unconstrained optimization.† In other cases, only values satisfying certain inequalities and/or equations are admissible. The problem is then to find $\theta$ such that $\Phi(\theta)$ is maximum (or minimum) subject to:

$$h(\theta) \geq 0 \tag{5-1-1}$$

$$g(\theta) = 0 \tag{5-1-2}$$

where $h$ and $g$ are vectors (possibly vacuous) of given functions.

† Minimization or maximization, as appropriate.

We sacrifice nothing by restricting our attention to minimization, for maximizing a function can always be accomplished by minimizing its negative. Minimizing a function subject to Eq. (1) and Eq. (2) constitutes the problem of nonlinear programming. In spite of extensive treatment in the literature [see Daniel (1971), Wilde and Beightler (1967), Abadie (1967a), Künzi and
Krelle (1966), Hadley (1964), ...] no single method has emerged which is best for the solution of all nonlinear programming problems. One cannot even hope that a "best" method will ever be found, since problems vary so much in size and nature. For parameter estimation problems we must seek methods which are particularly suitable to the special nature of these problems, which may be characterized as follows:

1. A relatively small number of unknowns, rarely exceeding a dozen or so.

2. A highly nonlinear (though continuous and differentiable) objective function, whose computation is often very time consuming.

3. A relatively small number (sometimes zero) of inequality constraints. Those are usually of a very simple nature, e.g., upper and lower bounds.

4. No equality constraints, except in the case of exact structural models (where, incidentally, the number of unknowns is large). These will be treated separately in Sections 6-6 through 6-8.

Since the constraints play a relatively minor role in most estimation problems, we shall first discuss some methods for unconstrained optimization. Additional methods can be found in Kowalik and Osborne (1968). In Section 6-1 we shall show how constrained problems can be converted into unconstrained ones, making the previously described methods applicable.

5-2. Iterative Scheme

The methods we shall discuss are iterative in nature. We start with a given point $\theta_1$, known as the initial guess, and proceed to generate a sequence of points $\theta_2, \theta_3, \ldots$ which we hope converges to the point $\theta^*$ at which $\Phi(\theta)$ is minimum. The computation of $\theta_{i+1}$ is called the $i$th iteration, and the point $\theta_i$ the $i$th iterate. In practice, one terminates the sequence after a finite number $N$ of iterations, and one accepts $\theta_N$ as an approximation to $\theta^*$.

The vector

$$\sigma_i \equiv \theta_{i+1} - \theta_i \tag{5-2-1}$$

is called the $i$th step. We wish each step to bring us closer to the minimum.
Since we do not know where the minimum is, we cannot test for this condition directly. In a sense, however, we may consider the $i$th step to have "improved" our situation (by bringing us closer to the minimum in $\Phi$ space, if not in $\theta$ space) if

$$\Phi_{i+1} < \Phi_i \tag{5-2-2}$$

where

$$\Phi_j \equiv \Phi(\theta_j) \qquad (j = 1, 2, \ldots) \tag{5-2-3}$$
We call the $i$th step acceptable if Eq. (2) holds. An iterative method is acceptable if all the steps it produces are acceptable. We shall only consider acceptable methods.

All the methods we shall discuss adhere to the following scheme:

1. Set $i = 1$. An initial guess $\theta_1$ must be provided externally.

2. Determine a vector $v_i$ in the direction of the proposed $i$th step.

3. Determine a scalar $\rho_i$ such that the step

$$\sigma_i = \rho_i v_i \tag{5-2-4}$$

is acceptable. That is, we take

$$\theta_{i+1} = \theta_i + \rho_i v_i \tag{5-2-5}$$

and require that $\rho_i$ be chosen so that Eq. (2) holds.

4. Test whether the termination criterion (see Section 5-15) is met. If not, increase $i$ by one and return to step 2. If yes, accept $\theta_{i+1}$ as the value of $\theta^*$.

The various methods to be described below differ only in the manner of choosing $v_i$ and $\rho_i$. We refer to these quantities as step direction and step size respectively. Since $v_i$ is not required to be a unit vector, $\rho_i$ is only proportional, but not necessarily equal, to the step length in the usual sense.

5-3. Acceptability

Consider the $i$th iteration of a minimization procedure. Suppose we strike out from $\theta_i$ along some direction $v$, generating the ray

$$\theta(\rho) \equiv \theta_i + \rho v \qquad (\rho \geq 0) \tag{5-3-1}$$

Along this ray, the objective function varies as $\rho$ is changed, thus becoming a function of $\rho$ alone. We designate this function

$$\Psi_{iv}(\rho) \equiv \Phi(\theta(\rho)) = \Phi(\theta_i + \rho v) \tag{5-3-2}$$

Its derivative is given by

$$d\Psi_{iv}/d\rho = (\partial \Phi/\partial \theta)^T (\partial \theta/\partial \rho) = (\partial \Phi/\partial \theta)^T v \tag{5-3-3}$$

The gradient vector of $\Phi(\theta)$ is $\partial \Phi/\partial \theta$, which we designate as $q(\theta)$. The $\alpha$th component of $q$ is the quantity $\partial \Phi/\partial \theta_\alpha$. Denoting by $q_i$ the gradient vector evaluated at $\theta = \theta_i$, we have

$$\Psi'_{iv} \equiv (d\Psi_{iv}/d\rho)_{\rho=0} = q_i^T v \tag{5-3-4}$$

In the sequel we assume $q_i \neq 0$.
The quantity $\Psi'_{iv}$ is called the directional derivative of $\Phi$ relative to $v$ at $\theta_i$. If $\Psi'_{iv}$ is negative, then $\Phi(\theta)$ decreases in value when one starts moving away from $\theta_i$ in the direction of $v$. Therefore, if $\rho$ is a sufficiently small positive number, the step $\rho v$ is acceptable. On the other hand, if $\Psi'_{iv} \geq 0$, there may not exist any positive value of $\rho$ for which $\rho v$ is an acceptable step. We call $v$ an acceptable direction if $\Psi'_{iv} < 0$. We can now prove the following:

Theorem. A direction $v$ is acceptable if and only if there exists a positive definite matrix $R$ such that

$$v = -R q_i \tag{5-3-5}$$

Proof.

1. Let $R$ be a positive definite matrix, and let $v$ be given by Eq. (5). Then, from Eq. (4) and the definition of positive definiteness,

$$\Psi'_{iv} = q_i^T v = -q_i^T R q_i < 0$$

2. Suppose $q_i^T v < 0$. Select

$$R = I - (q_i q_i^T / q_i^T q_i) - (v v^T / v^T q_i) \tag{5-3-6}$$

Then Eq. (5) holds, and $R$ is positive definite (we leave the details as an exercise for the reader).

The requirement $\Psi'_{iv} = q_i^T v < 0$ says that the direction $v$ leads downhill if it forms a greater than 90° angle with the gradient $q_i$. The theorem states that this condition can be ensured if the direction is determined by operating on the negative gradient with a positive definite matrix according to Eq. (5). A minimization method in which the directions are obtained in this manner is called an acceptable gradient method (it is simply a gradient method if $R$ is not required to be positive definite). The basic equation of the $i$th iteration in any gradient method is

$$\theta_{i+1} = \theta_i - \rho_i R_i q_i \tag{5-3-7}$$

Various gradient methods differ in the manner of choosing the $R_i$ and $\rho_i$.

In devising or choosing an optimization method one attempts to minimize the total computation time required for convergence to the minimum. This time is composed primarily of the following two factors:

1. Function and derivative evaluations.

2. Algebraic manipulations such as matrix inversions or eigenvalue determinations.
It is usually possible to trade off these factors against each other. A method employing more laborious algebraic procedures may require fewer iterations, and hence fewer function evaluations. This is likely to pay off if
the objective function is a complicated one. In parameter estimation problems, the objective function is synthesized from the model equations and from the data obtained in many experiments. Its computation is usually time consuming. We do not hesitate therefore to recommend methods which are sophisticated algebraically, as long as they are efficient in terms of the number of required function and derivative evaluations.

5-4. Convergence

One would like to be able to prove that the method one has selected converges to the true minimum of the objective function. Unfortunately, convergence proofs usually require that certain assumptions be made concerning the nature of the objective function, and the validity of these assumptions is difficult to verify on any given problem. Even more significantly, the existence of a convergence proof is no guarantee of reasonable performance in practice. A method may converge in theory, yet take an excessive number of iterations, or require computations to be carried out with an unreasonable number of significant digits. For this reason, our discussion of convergence theorems will be brief.

Let $\Phi_i$ denote the value of $\Phi(\theta_i)$. If at each iteration we select an acceptable point, then the sequence $\{\Phi_i\} \equiv \{\Phi_1, \Phi_2, \Phi_3, \ldots\}$ is monotone decreasing. If the values of the objective function possess a lower bound, then this sequence must converge to a limit $\Phi_\infty$. If the sequence $\{\theta_i\}$ is bounded (i.e., there exists a number $M$ such that $\theta_i^T \theta_i < M$ for all $i$) then it has at least one limit point. It follows from the continuity of $\Phi$ that $\Phi(\theta_\infty) = \Phi_\infty$, where $\theta_\infty$ is any limit point of $\{\theta_i\}$. Because of this, the sequence $\{\theta_i\}$ can have more than one limit point only by remarkable coincidence. In all practical cases, the sequence $\{\theta_i\}$ is either unbounded, or converges to a point $\theta_\infty$. The rate of convergence, however, may be so slow that the sequence appears nonconvergent.
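The scheme of Section 5-2 and the monotone decrease of $\{\Phi_i\}$ can be seen in a minimal sketch of an acceptable gradient method, Eq. (5-3-7). Several choices here are assumptions made only for illustration: the step size $\rho_i$ is found by repeated halving until Eq. (5-2-2) holds, the termination test is a simple gradient-norm threshold rather than the criterion of Section 5-15, and the test function is an invented two-parameter quadratic.

```python
import numpy as np

def acceptable_gradient_method(phi, grad, theta1, R=None, rho0=1.0,
                               max_iter=1000, gtol=1e-8):
    """Iterate theta_{i+1} = theta_i - rho_i R q_i, accepting only steps
    that decrease phi. Returns the final iterate and the list of phi values."""
    theta = np.asarray(theta1, dtype=float)
    R = np.eye(theta.size) if R is None else R    # R = I gives steepest descent
    history = [phi(theta)]
    for _ in range(max_iter):
        q = grad(theta)
        if np.linalg.norm(q) < gtol:              # assumed termination test
            break
        v = -R @ q                                # acceptable if R is positive definite
        rho = rho0
        while phi(theta + rho * v) >= history[-1]:   # enforce Phi_{i+1} < Phi_i
            rho *= 0.5
            if rho < 1e-16:                       # no acceptable step representable
                return theta, history
        theta = theta + rho * v
        history.append(phi(theta))
    return theta, history

# Hypothetical objective with minimum at (1, -2).
def phi(t):
    return (t[0] - 1.0) ** 2 + 10.0 * (t[1] + 2.0) ** 2

def grad(t):
    return np.array([2.0 * (t[0] - 1.0), 20.0 * (t[1] + 2.0)])

theta_final, phi_history = acceptable_gradient_method(phi, grad, [0.0, 0.0])
print(theta_final, len(phi_history))
```

Because a step is recorded only when it lowers the objective, the stored values form a strictly decreasing sequence regardless of how well the chosen $R$ suits the problem; only the number of iterations depends on that choice.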
A stationary point of the objective function is one at which $q(\theta) = 0$. If $\theta_i$ is stationary, i.e., $q_i = 0$, then Eq. (5-3-7) shows that all $\theta_j$ $(j \geq i)$ coincide with $\theta_i$. Therefore, the most that we can hope to prove about any gradient method is that it converges to a stationary point. Convergence to the true minimum can be guaranteed only if it can be shown that the objective function has no other stationary points. In practice, convergence to a local maximum or saddle point requires an improbable coincidence. One usually reaches at least a local minimum.

Convergence proofs require that the $\rho_i$ be chosen sufficiently large, and the matrices $R_i$ sufficiently positive definite. The following theorem is typical. Its proof is given in Appendix F:
Theorem. Let $\mathcal{G}$ denote the set of all $\theta$ such that $\Phi(\theta) < \Phi_1$. Suppose the following conditions are satisfied:

1. $\Phi$ has continuous first and bounded second derivatives in $\mathcal{G}$.

2. Let $\bar{\rho}_i$ be the smallest nonnegative value of $\rho$ at which $\Psi_{i v_i}(\rho)$ attains a local minimum, where $v_i = -R_i q_i$. Let $\alpha$ be a positive number less than one, and $\rho_0$ a positive number. We choose each $\rho_i$ so that either $\alpha \bar{\rho}_i \leq \rho_i \leq \bar{\rho}_i$ or $\min(\rho_0, \alpha \bar{\rho}_i) \leq \rho_i \leq \bar{\rho}_i$.

3. Let $\beta$ and $\gamma$ be constants satisfying $\gamma > \beta > 0$. We choose each $R_i$ so that all its eigenvalues lie between $\beta$ and $\gamma$.

Then all the limit points of $\{\theta_i\}$ are stationary points of $\Phi$.

These conditions are sufficient, but not necessary. In the algorithms that we shall describe the $R_i$ are usually chosen so as to satisfy condition 3. There does not seem to be any need, in practice, to trouble oneself with satisfying condition 2 precisely. Condition 1 is almost always satisfied in principle, but the limited accuracy of numerical calculations sometimes causes trouble.

5-5. Steepest Descent

The simplest gradient method employs $R_i = I$, so that $v_i = -q_i$ in all iterations. The direction $-q_i$ is the one in which the objective function decreases most rapidly, at least initially. Hence this method is called steepest descent. Unfortunately, as discussed more fully in subsequent sections, this method is often very inefficient, requiring a large number of steps which tend to zigzag in a so-called hemstitching pattern. The method is not recommended for practical applications, and is discussed here only for reference.

5-6. Newton's Method

The Hessian matrix $H(\theta)$ of the function $\Phi(\theta)$ is the matrix of second partial derivatives, i.e.,

$$H_{\alpha\beta}(\theta) \equiv \partial^2 \Phi/\partial \theta_\alpha \partial \theta_\beta \tag{5-6-1}$$

Let $H_i$ be the Hessian matrix of $\Phi$ evaluated at $\theta = \theta_i$.
We define the function

$$Q_i(\theta) = \Phi_i + q_i^T(\theta - \theta_i) + \tfrac{1}{2}(\theta - \theta_i)^T H_i (\theta - \theta_i) \tag{5-6-2}$$

which consists of the terms up to second order in the Taylor series expansion of $\Phi$ around the point $\theta_i$. In a sense, $Q_i(\theta)$ matches the behavior of $\Phi(\theta)$ at $\theta = \theta_i$ more closely than does any other second order surface.
Suppose we wish to find the point at which $Q_i(\theta)$ is stationary. We equate to zero the gradient of $Q_i$:

$$\partial Q_i/\partial \theta = q_i + H_i(\theta - \theta_i) = 0 \tag{5-6-3}$$

which, if $H_i$ is nonsingular, has the solution

$$\theta_{i+1} = \theta_i - H_i^{-1} q_i \tag{5-6-4}$$

Equation (4) defines the $i$th iteration of the Newton (also known as Newton-Raphson) method. It conforms to the general formula of gradient methods Eq. (5-3-7) with $\rho_i = 1$ and $R_i = H_i^{-1}$.

If $\Phi(\theta)$ coincides with $Q_i(\theta)$, i.e., if $\Phi$ is a quadratic function, then $\theta_{i+1}$ is a stationary point of $\Phi$. This is a minimum if $H_i$ is positive definite. In that case $R_i$ is positive definite, the method is acceptable, and it converges in a single iteration. If $H_i$ is negative definite, $\theta_{i+1}$ is a maximum, and if $H_i$ is indefinite, $\theta_{i+1}$ is a saddle point. In both cases the method is not acceptable.

When $\Phi$ is not quadratic, $\theta_{i+1}$ does not generally coincide with the stationary point, and the method does not converge in a single iteration. The method is acceptable, however, as long as $H_i$ is positive definite, as it should be at least in some neighborhood of the minimum. In this neighborhood, convergence is quadratic. This means that the number of correct digits in $\theta$ is approximately doubled by each iteration, until further improvement is barred by the rounding errors in the calculations. Outside this neighborhood, convergence cannot be guaranteed.

It is worth noting that the step $v = -Rq$ ($R$ positive definite) solves the problem of minimizing the function $q^T v + \tfrac{1}{2} v^T R^{-1} v$. The closer $R$ is to $H^{-1}$, the closer is this modified problem to the original one.

Returning to the case of a quadratic function $\Phi$ with a positive definite Hessian $H$, we find that the Newton step $-H^{-1}q$ is the only one that takes us to the minimum in a single iteration. Any other step $-Rq$ with $R \neq H^{-1}$ will miss the minimum. If we define the efficiency of a method as the decrease in function value obtained by a single step of the method, divided by the maximum possible decrease, then the efficiency of the Newton method is 1. It was shown by Greenstadt (1967) that the efficiency $e(R)$ of a method using the direction $-Rq$ is bounded by

$$4\gamma/(1 + \gamma)^2 \leq e(R) \leq 1 \tag{5-6-5}$$

where $\gamma$ is the condition number, i.e., the ratio of largest to smallest eigenvalues of the matrix $R^{1/2} H R^{1/2}$. With $R = I$ (the method of steepest descent), $\gamma = \gamma_H$ is the condition number of $H$ itself. It is quite common to find cases where $\gamma > 10^5$, so that the Newton method may be 25,000 times more efficient than steepest descent!
Even in the nonquadratic case, a large value of $\gamma_H$ indicates the presence of a long and narrow ridge or trough in the surface $\Phi(\theta)$. The ridge is aligned with the direction of the eigenvector of $H$ whose eigenvalue is very small in absolute value. The ridge is concave upwards if the eigenvalue is positive, and downwards if it is negative. In the first case, the Newton method attempts to reach a minimum or saddle point at the bottom of the ridge; in the second case a maximum or saddle point at the top of the ridge. In either case, as shown below, it proceeds to take long steps along the ridge in the direction of the expected stationary point. The results are good in the first case, but disastrous in the second. In contrast, a method employing a matrix $R$ which differs radically from $H^{-1}$ tends to crisscross the ridge in a so-called hemstitching pattern, and makes very slow progress.

Let $H = \sum_{i=1}^{l} \lambda_i v_i v_i^T$ be the eigenvalue decomposition of $H$ (see Section A-5), and let the gradient be expressible in terms of the eigenvectors as $q = \sum_{i=1}^{l} \alpha_i v_i$. Then, since $v_i^T v_j = \delta_{ij}$,

$$-H^{-1} q = -\sum_{i=1}^{l} \lambda_i^{-1} v_i v_i^T \sum_{j=1}^{l} \alpha_j v_j = -\sum_{i=1}^{l} (\alpha_i/\lambda_i) v_i \tag{5-6-6}$$

If $\lambda_i$ is very small, then unless $\alpha_i$ is accidentally also very small, the step $-H^{-1}q$ has a large component in the direction $v_i$. That is why the Newton step approximately parallels the direction of a ridge.

In spite of its splendid performance in those cases where it works, the Newton method is not a practical one, for the following two reasons:

1. It does not work in many cases, because $H(\theta)$ is not necessarily positive definite except near the minimum.

2. It requires the evaluation of second derivatives. This places a heavy burden on the user, particularly where the objective functions are as complicated as those to be found in parameter estimation problems.
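The quantitative statements of this section, namely the bound Eq. (5-6-5), the eigenvector form Eq. (5-6-6), and the first objection in the list above, can be checked on small invented examples. The sketch assumes that "a single step of the method" means an exact line search along $-Rq$ on a quadratic with Hessian $H$; that reading, like the 2 × 2 matrices and the gradient used, is an assumption made for illustration.

```python
import numpy as np

def line_search_efficiency(H, q, R):
    """Decrease from an exact line search along v = -R q, divided by the largest
    possible decrease, for a quadratic with positive definite Hessian H and
    gradient q at the current point."""
    v = -R @ q
    rho = -(q @ v) / (v @ H @ v)                  # minimizer of phi along the ray
    decrease = -(rho * (q @ v) + 0.5 * rho ** 2 * (v @ H @ v))
    best = 0.5 * (q @ np.linalg.solve(H, q))      # decrease of the full Newton step
    return decrease / best

def newton_step_by_eigen(H, q):
    """Eq. (5-6-6): -H^{-1} q = -sum_i (alpha_i / lambda_i) v_i."""
    lam, V = np.linalg.eigh(H)                    # columns of V: orthonormal eigenvectors
    alpha = V.T @ q                               # q = sum_i alpha_i v_i
    return -(V @ (alpha / lam))

q = np.array([1.0, 1.0])

# (i) Efficiency: condition number gamma = 100 for R = I.
H = np.diag([1.0, 100.0])
gamma = 100.0
bound = 4.0 * gamma / (1.0 + gamma) ** 2
e_newton = line_search_efficiency(H, q, np.linalg.inv(H))
e_steepest = line_search_efficiency(H, q, np.eye(2))

# (ii) Ridge: a very small eigenvalue dominates the Newton step.
H_ridge = np.diag([100.0, 0.01])
step_ridge = newton_step_by_eigen(H_ridge, q)

# (iii) Indefinite Hessian: the Newton direction need not be acceptable.
H_indef = np.diag([2.0, -1.0])
step_indef = newton_step_by_eigen(H_indef, q)
slope_indef = q @ step_indef                      # directional derivative q^T v

print(e_newton, e_steepest, bound, step_ridge, slope_indef)
```

In case (iii) the directional derivative is positive, so by Eq. (5-3-4) the unmodified Newton direction leads uphill.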
Various tricks have been devised for overcoming these difficulties, while retaining the advantages of the method. Directional discrimination and Marquardt's method are designed to overcome the problem of indefiniteness, whereas the Gauss and variable metric methods eliminate the need for computing second derivatives.† All these methods are described in succeeding sections.

We note that of the two deficiencies, the first one (nonconvergence) must be overcome if the method is to be useful. The second difficulty (second derivatives required) is merely a matter of convenience. One may raise the question of whether the Newton method (modified to guarantee convergence) is not so much more efficient than methods that do not require second derivatives, as to make the evaluation of these derivatives worthwhile. We have no definitive answers to this question, but a limited amount of experience has led to the following tentative conclusions:

1. If the model fits the data well, the Gauss method often requires no more iterations than the Newton method (Bard, 1967).
2. If the model does not fit well, the Newton method may require fewer iterations than the Gauss method, but the computing times for the two methods are roughly the same (Flanagan et al., 1969).

† One may also compute second derivatives directly by finite differences on $q$, as proposed by Goldstein and Price (1967).

5-7. Directional Discrimination

Assume that we have a matrix $A_i$ which is the Hessian $H_i$ or an approximation to it. We would like to obtain a positive definite (or at least semidefinite) matrix $R_i$ which is in some sense close to $A_i^{-1}$, so that $v = -R_i q_i$ is an acceptable direction. Furthermore, we wish to obtain a reasonable $v$ even if $A_i$ is singular, or nearly so.

The idea behind directional discrimination methods is to use different formulas for computing the various components (in a suitably chosen coordinate system) of $v$. Jennrich and Sampson (1968) use the original coordinate system, and set to zero those components of $v$ which do not seem, at the moment, to affect $\Phi$. The technique is most suitable for use in conjunction with the Gauss method, and will be discussed further in Section 5-11. Generally, however, it pays to transform the coordinates so as to eliminate "interaction" among the parameters, i.e., so as to obtain a diagonal Hessian. In such a coordinate system, the effect on $\Phi$ of varying one component of $v$ is approximately independent of the value of any other component of $v$. Fariss and Law (1967) have coined the term rotational discrimination to describe such methods.

To obtain a suitable transformation of coordinates, we may use any spectral decomposition of $A_i$ (see Appendix A-5).
Good numerical accuracy is obtained with scaled decompositions, and these will be used here. Let the inverse scaled decomposition of $A_i$ be given by Eq. (A-5-19), i.e.,

$$A_i = (G^T)^{-1}\,\Pi\, G^{-1} \tag{5-7-1}$$

The equation $A_i v = -q_i$ can, therefore, be written as

$$\Pi G^{-1} v = -G^T q_i \tag{5-7-2}$$

Let $\tilde\theta \equiv G^{-1}\theta$ and $\tilde v \equiv G^{-1}v$. We have

$$\tilde q_\alpha \equiv \partial\Phi/\partial\tilde\theta_\alpha = \sum_{\beta=1}^{l} (\partial\Phi/\partial\theta_\beta)\,\partial\theta_\beta/\partial\tilde\theta_\alpha$$
But $\theta = G\tilde\theta$, so that $\partial\theta_\beta/\partial\tilde\theta_\alpha = G_{\beta\alpha}$ and $\tilde q_\alpha = \sum_{\beta=1}^{l} (\partial\Phi/\partial\theta_\beta)\,G_{\beta\alpha}$; or in matrix notation

$$\tilde q = G^T q_i \tag{5-7-3}$$

Hence Eq. (2) may be written in the $\tilde\theta$ coordinate system as

$$\Pi\tilde v = -\tilde q \tag{5-7-4}$$

or, since $\Pi$ is diagonal with $\Pi_{\alpha\alpha} = \pi_\alpha$,

$$\pi_\alpha \tilde v_\alpha = -\tilde q_\alpha \qquad (\alpha = 1, 2, \ldots, l) \tag{5-7-5}$$

The solution is

$$\tilde v_\alpha = -\gamma_\alpha \tilde q_\alpha \qquad (\alpha = 1, 2, \ldots, l) \tag{5-7-6}$$

with

$$\gamma_\alpha = \pi_\alpha^{-1} \tag{5-7-7}$$

We are now ready to exercise directional discrimination by departing from Eq. (7) for some of the components. First of all, to guarantee an acceptable step, all $\gamma_\alpha$ must be positive. Hence we follow Greenstadt's suggestion (Greenstadt, 1967) and replace any negative eigenvalues by their absolute values. In place of Eq. (7) we take

$$\gamma_\alpha = |\pi_\alpha^{-1}| \qquad (\alpha = 1, 2, \ldots, l) \tag{5-7-8}$$

As noted in Section 5-6, the presence in $H$ of a small negative eigenvalue corresponds to a downward concave ridge or trough in the objective function. Taking $\gamma_\alpha = |\pi_\alpha^{-1}|$ merely aligns the direction of search with the direction pointing downwards along the ridge (see Fig. 5-1).

We still face the problem of what to do about zero or very small eigenvalues. In numerical computations, an eigenvalue is almost never exactly zero, so we need not worry unduly about that situation. If a $\pi_\alpha$ is very small, e.g., $|\pi_\alpha| < 10^{-5}\max_\beta|\pi_\beta|$, then we face a dilemma: on the one hand, the Newton method tells us that the direction corresponding to this eigenvalue is precisely the one leading most efficiently to the minimum, provided the objective function is quadratic. On the other hand, we know the function is not quadratic, and we are reluctant to take the very long step $|\pi_\alpha^{-1}|\tilde q_\alpha$ which is based on the quadratic assumption. Also, the convergence theorem of Section 5-4 requires that the range of eigenvalues of $R_i$ be limited. We can think of at least three basic strategies:

(a) Newton-Greenstadt Method

$$\gamma_\alpha = \max[\,|\pi_\alpha|^{-1}, \delta\,] \tag{5-7-9}$$

where $\delta$ is a small constant.
[Fig. 5-1. Ridges on objective function. Panels: small positive eigenvalue; small negative eigenvalue. Marked directions: Newton direction, Greenstadt direction.]

(b) Modified† Fariss-Law Method

$$\gamma_\alpha = \begin{cases} \max[\,|\pi_\alpha|^{-1}, \delta\,] & (|\pi_\alpha| > \varepsilon) \\ a\,|\pi_\alpha| & (|\pi_\alpha| \le \varepsilon) \end{cases} \tag{5-7-10}$$

where $\delta$, $\varepsilon$, and $a$ are constants. This method may be viewed as being Newton-like relative to the large eigenvalues, and steepest-descent-like relative to the small eigenvalues. If we take $\delta = \varepsilon = 0$ we get the pseudoinverse, so that this method may be characterized as an approximate pseudoinverse method.

(c) A "Neutral" Method

$$\gamma_\alpha = \begin{cases} \max[\,|\pi_\alpha|^{-1}, \delta\,] & (|\pi_\alpha| > \varepsilon) \\ \beta & (|\pi_\alpha| \le \varepsilon) \end{cases} \tag{5-7-11}$$

where $\delta$, $\varepsilon$, and $\beta$ are constants. This method is appealing because it is the only one which confines the eigenvalues of $R_i$ to a range bounded by positive numbers from above and below, as required by the convergence theorem of Section 5-4.

† In the original method, $\gamma_\alpha = 0$ for negative and very small $\pi_\alpha$.
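The three strategies differ only in how a single eigenvalue $\pi_\alpha$ is turned into a multiplier $\gamma_\alpha$, so they are easy to compare side by side. The Python sketch below is a minimal illustration under stated assumptions: the small-eigenvalue branch of strategy (b) is taken to be $a|\pi_\alpha|$, the function names are hypothetical, and the eigenvalues and constants $\delta$, $\varepsilon$, $a$, $\beta$ are arbitrary sample values.

```python
import math


def gamma_newton_greenstadt(pi, delta):
    """Strategy (a), Eq. (5-7-9): max(|pi|^-1, delta)."""
    inverse = math.inf if pi == 0.0 else 1.0 / abs(pi)
    return max(inverse, delta)


def gamma_modified_fariss_law(pi, delta, eps, a):
    """Strategy (b): Newton-like for |pi| > eps, otherwise a*|pi| (assumed reading)."""
    if abs(pi) > eps:
        return max(1.0 / abs(pi), delta)
    return a * abs(pi)


def gamma_neutral(pi, delta, eps, beta):
    """Strategy (c): Newton-like for |pi| > eps, otherwise the constant beta."""
    if abs(pi) > eps:
        return max(1.0 / abs(pi), delta)
    return beta


if __name__ == "__main__":
    pis = [4.0, -0.5, 1.0e-9]             # sample eigenvalues (assumed)
    eps = 1.0e-5 * max(abs(p) for p in pis)
    delta, a, beta = 1.0e-8, 1.0, 1.0     # illustrative constants (assumed)
    for p in pis:
        print(p,
              gamma_newton_greenstadt(p, delta),
              gamma_modified_fariss_law(p, delta, eps, a),
              gamma_neutral(p, delta, eps, beta))
```

For the two well-determined eigenvalues all three rules agree; they part ways only on the nearly zero one, where (a) takes a very long step, (b) a very short one, and (c) a step of fixed moderate size.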
Practical experience with these methods does not as yet provide us with a clear-cut choice among them. In some extremely ill-conditioned problems [e.g., Jennrich and Sampson's two exponentials (Jennrich and Sampson, 1968)] methods (b) and (c) performed splendidly while (a) failed. In several less artificial cases, however, method (a) has converged fastest.

Whatever our choices for the $\gamma_\alpha$, let $\Gamma$ be the diagonal matrix with $\Gamma_{\alpha\alpha} = \gamma_\alpha$. Then, from Eq. (6)

$$\tilde v = -\Gamma\tilde q \tag{5-7-12}$$

Replacing $\tilde v$ and $\tilde q$ by their definitions we obtain, after premultiplication by $G$,

$$v = -G\Gamma G^T q_i \tag{5-7-13}$$

so that $R_i \equiv G\Gamma G^T$ is the required "almost inverse" of $A_i$.

5-8. The Marquardt Method

An alternative method for converting an arbitrary matrix into a positive definite one is given in papers by Levenberg (1944), Marquardt (1963), and Goldfeld et al. (1966). The method rests on the observation that if $P$ is any positive definite matrix, then $A_i + \lambda P$ is positive definite for sufficiently large $\lambda$, no matter what $A_i$. Marquardt suggests the choice $P_i \equiv B_i^2$ where $B_i$ is defined by Eq. (A-5-14). That is, $P_i$ is a diagonal matrix whose elements coincide with the absolute values of the diagonal elements of $A_i$ (with, say, zero elements replaced by ones). Thus, we use

$$R_i \equiv (A_i + \lambda_i B_i^2)^{-1} \tag{5-8-1}$$

and take a step with $\rho_i = 1$, i.e.,

$$\sigma_i = -R_i q_i \tag{5-8-2}$$

Observe that as $\lambda_i \to \infty$, the term $\lambda_i B_i^2$ dominates $A_i$. The step then becomes

$$\sigma_i \to -\lambda_i^{-1} B_i^{-2} q_i \qquad (\lambda_i \to \infty) \tag{5-8-3}$$

This is an extremely short step in a downhill direction, $B_i^2$ being positive definite. A sufficiently large $\lambda_i$ always produces an acceptable step. On the other hand, when $\lambda_i$ is very small, $\sigma_i$ approaches the Newton direction $-A_i^{-1}q_i$.

Marquardt (1963) suggests the following algorithm for the selection of $\lambda_i$:

1. When $i = 1$, start, say with $\lambda = 0.01$.
2. At the start of the $i$th iteration, compute $v = -(A_i + \lambda B_i^2)^{-1}q_i$, $\theta^{(1)} = \theta_i + v$, and $\Phi^{(1)} \equiv \Phi(\theta^{(1)})$.
3. If $\Phi^{(1)} < \Phi_i$, accept $\theta_{i+1} = \theta^{(1)}$, and replace $\lambda$ with $\max(0.1\lambda, \varepsilon)$ where $\varepsilon$ is a small positive number, say $10^{-7}$. Otherwise
4. If $(v^T q_i)^2/[(q_i^T q_i)(v^T v)] < \tfrac{1}{2}$, replace $\lambda$ with $10\lambda$ and return to step 2. Otherwise, interpolate, i.e., find a value $\rho_i$ sufficiently small so that $\Phi(\theta_i + \rho_i v) < \Phi_i$ (see Section 5-14). Accept $\theta_{i+1} = \theta_i + \rho_i v$.

The preceding algorithm may require computation of $v$ for several values of $\lambda$ in a single iteration. This can be avoided by replacing step 4 with:

4'. Find a value $\rho_i$ sufficiently small so that $\Phi(\theta_i + \rho_i v) < \Phi_i$ (see Section 5-14). Accept $\theta_{i+1} = \theta_i + \rho_i v$. Replace $\lambda$ with $10\lambda$.

Yet another method for choosing $\lambda_i$ is described by Smith and Shanno (1971).

There are two ways in which the actual computation of $v$ may be undertaken:

(a) Solve the set of simultaneous linear equations

$$(A_i + \lambda_i B_i^2)\,v = -q_i \tag{5-8-4}$$

If several values of $\lambda_i$ must be tried in a single iteration, these equations must be re-solved each time. Note that if all diagonal elements of $A_i$ are positive, then $A_i + \lambda_i B_i^2$ is identical to $A_i$ except that each diagonal element is multiplied by $1 + \lambda_i$. If $A_i + \lambda_i B_i^2$ is known to be positive definite (as, e.g., in the Gauss method; see Section 5-9), then the Cholesky decomposition method (Section A-5) should be used to solve for $v$. Even if positive definiteness is not assured, it is a good idea to start the Cholesky decomposition. If a negative pivot occurs, the procedure is halted, $\lambda_i$ is multiplied by 10, a new matrix is generated, and the procedure is restarted. Alternately, a separate $\lambda$ may be maintained for each row of the matrix. If a negative pivot is encountered, say in the third row, then the corresponding $\lambda$ is increased sufficiently to make that pivot positive, and the Cholesky procedure can now be continued without restart.

(b) Let Eq. (A-5-22) be the inverse scaled decomposition of $A_i$. Then the reader may easily verify that

$$R_i = (A_i + \lambda_i B_i^2)^{-1} = \sum_{\alpha=1}^{l} (\pi_\alpha + \lambda_i)^{-1} g_\alpha g_\alpha^T \tag{5-8-5}$$

Once the decomposition of $A_i$ has been obtained, $R_i$ may be computed for any number of values of $\lambda_i$. We observe that we can restrict our choice to values of $\lambda_i$ satisfying

$$\lambda_i > -\min_\alpha \pi_\alpha \tag{5-8-6}$$
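The selection rule for $\lambda$ (steps 1 to 3 with the simplified step 4') can be sketched in a few lines of Python for a two-parameter problem. This is an illustration under several assumptions and not a transcription of any published program: the linear system of Eq. (5-8-4) is solved by Cramer's rule, the interpolation of Section 5-14 is replaced by plain step halving, the function names are hypothetical, and the stopping test on the gradient is our own addition.

```python
def solve_2x2(M, rhs):
    """Solve a 2x2 linear system by Cramer's rule; return None if it is singular."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0.0:
        return None
    return [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / det]


def marquardt_minimize(phi, grad, hess, theta, lam=0.01, eps=1.0e-7,
                       tol=1.0e-12, max_iter=200):
    """Minimize phi over two parameters with Marquardt's choice of lambda."""
    for _ in range(max_iter):
        q = grad(theta)
        if q[0] ** 2 + q[1] ** 2 < tol:        # stopping test (our assumption)
            break
        A = hess(theta)
        phi_i = phi(theta)
        # Diagonal of B^2: |A_kk|, with zero elements replaced by ones.
        b2 = [abs(A[k][k]) or 1.0 for k in range(2)]
        damped = [[A[0][0] + lam * b2[0], A[0][1]],
                  [A[1][0], A[1][1] + lam * b2[1]]]
        v = solve_2x2(damped, [-q[0], -q[1]])  # Eq. (5-8-4)
        if v is None:                          # singular system: increase lambda
            lam *= 10.0
            continue
        trial = [theta[0] + v[0], theta[1] + v[1]]
        if phi(trial) < phi_i:                 # step 3: accept, relax lambda
            theta, lam = trial, max(0.1 * lam, eps)
            continue
        rho = 1.0                              # step 4': shorten the step
        for _ in range(60):
            rho *= 0.5
            trial = [theta[0] + rho * v[0], theta[1] + rho * v[1]]
            if phi(trial) < phi_i:
                theta = trial
                break
        lam *= 10.0
    return theta, lam


# Hypothetical convex test objective with its minimum at theta = (1, -2).
def phi(t):
    u, w = t[0] - 1.0, t[1] + 2.0
    return u ** 4 + u * u + u * w + 10.0 * w * w


def grad(t):
    u, w = t[0] - 1.0, t[1] + 2.0
    return [4.0 * u ** 3 + 2.0 * u + w, u + 20.0 * w]


def hess(t):
    u = t[0] - 1.0
    return [[12.0 * u * u + 2.0, 1.0], [1.0, 20.0]]


if __name__ == "__main__":
    print(marquardt_minimize(phi, grad, hess, [3.0, 0.0]))
```

If the halving loop finds no improvement, the parameters are left unchanged and only $\lambda$ grows, so the next direction is shorter and closer to the negative gradient.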
It is worth remarking that Marquardt's method finds the step $v$ which minimizes the quadratic approximation to $\Phi$ given by

$$Q_i(v) \equiv \Phi_i + v^T q_i + \tfrac{1}{2} v^T A_i v \tag{5-8-7}$$

subject to the restriction that

$$v^T B_i^2 v = c \tag{5-8-8}$$

That is, the step $v$ takes us to that point on the ellipsoid defined by Eq. (8) at which the function $Q_i(v)$ attains its minimum. To prove this, we form the Lagrangian

$$\Lambda(v) = \Phi_i + v^T q_i + \tfrac{1}{2} v^T A_i v + \tfrac{1}{2}\lambda_i (v^T B_i^2 v - c) \tag{5-8-9}$$

We differentiate with respect to $v$, equate to zero

$$q_i + A_i v + \lambda_i B_i^2 v = 0 \tag{5-8-10}$$

and solve for $v$

$$v = -(A_i + \lambda_i B_i^2)^{-1} q_i \tag{5-8-11}$$

in agreement with Eq. (1). The particular ellipsoid chosen depends on $\lambda_i$, since by substituting Eq. (11) into Eq. (8) we find

$$c = q_i^T (A_i + \lambda_i B_i^2)^{-1} B_i^2 (A_i + \lambda_i B_i^2)^{-1} q_i \tag{5-8-12}$$

The larger $\lambda_i$ is, the smaller is $c$, and the smaller is the ellipsoid to which we are confined. The algorithm starts with an ellipsoid of a certain size, determined through Eq. (12) by the initial choice for $\lambda_i$. If the corresponding step $v$ fails to decrease the objective function, this is an indication that the chosen ellipsoid is much larger than the region within which the quadratic approximation Eq. (7) is valid. By increasing $\lambda_i$ we shrink the ellipsoid and try again.

The Marquardt method has proven very reliable in practice. In the problem of Section 5-21 with difficult initial guess, and in several cases of the problem of Section 5-23, the Marquardt method proved faster than directional discrimination methods. On the other hand, in a series of ten other test problems (Bard, 1970), the reverse held true. The need for further testing is evident.

5-9. The Gauss Method

In most parameter estimation problems, the unknown parameters appear only indirectly in the objective function. The latter depends explicitly on the model equations, which in turn depend on the parameters.
To compute derivatives of the objective function, we first differentiate it with respect to the model equations, and then differentiate those with respect to the parameters. The Gauss (1809) method, originally applied to least squares problems, consists of simply omitting the second derivatives of the model equations when the Hessian is being computed.

We illustrate by means of the simplest example, that of single equation nonlinear least squares. Here we minimize

$$\Phi(\theta) = \sum_{\mu=1}^{n} [y_\mu - f(x_\mu, \theta)]^2 = \sum_{\mu=1}^{n} (y_\mu - f_\mu)^2 = \sum_{\mu=1}^{n} e_\mu^2 \tag{5-9-1}$$

whence

$$q_\alpha = \partial\Phi/\partial\theta_\alpha = 2\sum_{\mu=1}^{n} e_\mu\,\partial e_\mu/\partial\theta_\alpha = -2\sum_{\mu=1}^{n} e_\mu\,\partial f_\mu/\partial\theta_\alpha \tag{5-9-2}$$

and

$$H_{\alpha\beta} = \partial^2\Phi/\partial\theta_\alpha\,\partial\theta_\beta = -2\sum_{\mu=1}^{n} e_\mu\,\partial^2 f_\mu/\partial\theta_\alpha\,\partial\theta_\beta + 2\sum_{\mu=1}^{n} (\partial f_\mu/\partial\theta_\alpha)(\partial f_\mu/\partial\theta_\beta) \tag{5-9-3}$$

In the Gauss method, we neglect the first term, and use $N$ in place of $H$, where $N$ is defined by

$$N_{\alpha\beta} = 2\sum_{\mu=1}^{n} (\partial f_\mu/\partial\theta_\alpha)(\partial f_\mu/\partial\theta_\beta) \tag{5-9-4}$$

A numerical example appears in Section 5-21.

In the preceding discussion we have derived $N$ as an approximation to $H$, and the Gauss method as an approximation to the Newton method. There is an alternative interpretation: suppose the model equations are replaced by their tangents; that is, the nonlinear (in $\theta$) model is approximated by one that is linear. If we now try to solve the corresponding linear least squares problem we find the solution to be precisely $\bar\theta = \theta_i - N^{-1}q$. Now $\bar\theta$ is usually not the correct solution to the nonlinear problem. Yet, if we accept $\theta_{i+1} = \bar\theta$, we may regard the Gauss method as solving a sequence of linear problems. This interpretation is pursued further in Section 5-10.

The term neglected in Eq. (3) contained the residual $e_\mu$ as a factor. Since the residuals are, hopefully, small, this provides some justification for regarding $N$ as a good approximation to $H$, particularly near the minimum. The same justification applies to all of the more general cases in which the objective function depends on the parameters only through the elements of the moment matrix of the residuals $M(\theta) = \sum_{\mu} e_\mu(\theta)\,e_\mu^T(\theta)$.
This includes most of the least squares and maximum likelihood estimates for normal distributions that were discussed in Chapter IV, e.g., Eqs. (4-7-7), (4-7-9), (4-8-4), (4-9-9), and (4-21-1). In all these cases we have

$$\Phi(\theta) = \Psi(M(\theta)) \tag{5-9-5}$$

where $\Psi$ is a suitable function.
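For the single-equation least squares case of Eqs. (5-9-1) to (5-9-4), the difference between the exact Hessian $H$ and the Gauss matrix $N$ is easy to see numerically. The sketch below assumes a hypothetical two-parameter model $f(x,\theta) = \theta_1 e^{-\theta_2 x}$ with synthetic, error-free data; the function names, the data, and the starting guess are all illustrative, and the Gauss steps are taken at full length without any step-size control.

```python
import math


def model(x, theta):
    """Hypothetical single-equation model f(x, theta) = theta1 * exp(-theta2 * x)."""
    return theta[0] * math.exp(-theta[1] * x)


def model_derivatives(x, theta):
    """First and second derivatives of f with respect to theta."""
    ex = math.exp(-theta[1] * x)
    first = [ex, -theta[0] * x * ex]
    second = [[0.0, -x * ex], [-x * ex, theta[0] * x * x * ex]]
    return first, second


def gradient_and_hessians(xs, ys, theta):
    """Return q (5-9-2), the exact H (5-9-3) and the Gauss matrix N (5-9-4)."""
    q = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    N = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        e = y - model(x, theta)                  # residual e_mu
        first, second = model_derivatives(x, theta)
        for a in range(2):
            q[a] -= 2.0 * e * first[a]
            for b in range(2):
                gauss_term = 2.0 * first[a] * first[b]
                N[a][b] += gauss_term
                H[a][b] += gauss_term - 2.0 * e * second[a][b]
    return q, H, N


def gauss_step(q, N):
    """Solve N v = -q for two parameters by Cramer's rule."""
    det = N[0][0] * N[1][1] - N[0][1] * N[1][0]
    return [(-q[0] * N[1][1] + N[0][1] * q[1]) / det,
            (-N[0][0] * q[1] + N[1][0] * q[0]) / det]


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    true_theta = [2.0, 0.5]                      # used only to generate the data
    ys = [model(x, true_theta) for x in xs]
    theta = [1.5, 0.3]                           # starting guess (assumed)
    for iteration in range(10):
        q, H, N = gradient_and_hessians(xs, ys, theta)
        v = gauss_step(q, N)
        theta = [theta[0] + v[0], theta[1] + v[1]]
        print(iteration, theta)
```

Because the data are error free, the residuals vanish at the solution and $H$ and $N$ coincide there; away from the solution they differ by the term that carries the residuals.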
Differentiation of Eq. (4-7-5) yields:

$$\partial M_{ab}/\partial\theta_\alpha = \sum_{\mu} (e_{\mu a}\,\partial e_{\mu b}/\partial\theta_\alpha + e_{\mu b}\,\partial e_{\mu a}/\partial\theta_\alpha) \tag{5-9-6}$$

where

$$\partial e_{\mu a}/\partial\theta_\alpha = \begin{cases} -\partial f_{\mu a}/\partial\theta_\alpha & \text{for reduced models} \\ \partial g_{\mu a}/\partial\theta_\alpha & \text{for inexact structural models} \end{cases} \tag{5-9-7}$$

Therefore, from Eq. (5) and because of the symmetry of $M$

$$q_\alpha = \partial\Phi/\partial\theta_\alpha = \sum_{a,b} (\partial\Psi/\partial M_{ab})(\partial M_{ab}/\partial\theta_\alpha) = \sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})(e_{\mu a}\,\partial e_{\mu b}/\partial\theta_\alpha + e_{\mu b}\,\partial e_{\mu a}/\partial\theta_\alpha) = 2\sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})\,e_{\mu a}\,(\partial e_{\mu b}/\partial\theta_\alpha) \tag{5-9-8}$$

One can further work out that

$$H_{\alpha\beta} = \partial^2\Phi/\partial\theta_\alpha\,\partial\theta_\beta = 2\sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})(\partial e_{\mu a}/\partial\theta_\alpha)(\partial e_{\mu b}/\partial\theta_\beta) + 2\sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})\,e_{\mu a}\,(\partial^2 e_{\mu b}/\partial\theta_\alpha\,\partial\theta_\beta) + 4\sum_{\mu,a,b}\sum_{\eta,c,d} (\partial^2\Psi/\partial M_{ab}\,\partial M_{cd})\,e_{\mu a}(\partial e_{\mu b}/\partial\theta_\alpha)\,e_{\eta c}(\partial e_{\eta d}/\partial\theta_\beta) \tag{5-9-9}$$

As we see, the second derivatives of the model equations are always multiplied by residuals $e_{\mu a}$, and the terms involving them are dropped in the Gauss method. We note also that the terms involving $\partial^2\Psi/\partial M_{ab}\,\partial M_{cd}$ contain residuals and we drop these too. This leaves us with the approximate Hessian

$$N_{\alpha\beta} \equiv 2\sum_{\mu,a,b} (\partial\Psi/\partial M_{ab})(\partial e_{\mu a}/\partial\theta_\alpha)(\partial e_{\mu b}/\partial\theta_\beta)$$

or, in matrix notation

$$N = 2\sum_{\mu=1}^{n} B_\mu^T\,\Gamma\,B_\mu \tag{5-9-10}$$

where $B_\mu$ and $\Gamma$ are matrices defined by

$$B_{\mu,a\alpha} \equiv -\partial e_{\mu a}/\partial\theta_\alpha = \partial f_{\mu a}/\partial\theta_\alpha, \qquad \Gamma_{ab} \equiv \partial\Psi/\partial M_{ab} \tag{5-9-11}$$

Using the same notation, the gradient Eq. (8) is given by

$$q = -2\sum_{\mu=1}^{n} B_\mu^T\,\Gamma\,e_\mu \tag{5-9-12}$$

In Table 5-1 we give formulas for $\Gamma$ in the cases of the normal distributions discussed in Sections 4-7 to 4-9. It is significant that in all these cases $\Gamma$ turns out to be positive definite (or at least semidefinite). It follows from Eq. (10) that $N$ and $R = N^{-1}$ are also positive definite. In particular, this is true in the single equation least squares case, Eq. (4).

Table 5-1. Some Log Likelihood Functions and Their Derivatives

| Objective function given by | Description | Objective function $\Psi(M)$ | $\Gamma = \partial\Psi/\partial M$ |
|---|---|---|---|
| Eq. (4-7-7) | Normal distribution, known covariance matrix, weighted least squares. | $\tfrac{1}{2}\operatorname{Tr}(V^{-1}M)$ | $\tfrac{1}{2}V^{-1}$ |
| Eq. (4-21-1) | Normal distribution, covariance matrix known except for multiplicative factor. | $\tfrac{nm}{2}\log\operatorname{Tr}(Q^{-1}M)$ | $\tfrac{nm}{2}\,Q^{-1}/\operatorname{Tr}(Q^{-1}M)$ |
| Eq. (4-9-9) | Normal distribution, unknown covariance matrix. | $\tfrac{n}{2}\log\det M$ | $\tfrac{n}{2}M^{-1}$ |

Application of the Gauss method to an objective function of the form Eq. (4-9-9) is illustrated in Section 5-23.

If the objective function is a posterior density as in Eq. (4-15-1) or Eq. (4-15-2), the Gauss method may still be used to approximate the Hessian of the log likelihood, to which the true Hessian of the log prior density must be added. The latter is frequently easy to compute. For instance, if $p_0(\theta)$ is normal with covariance matrix $V$, then the Hessian of $\log p_0$ is simply $-V^{-1}$. See Section 5-22 for a numerical example.

5-10. The Gauss Method as a Sequence of Linear Regression Problems

The essence of the Gauss method is to use in the $i$th iteration a step whose direction is given by

$$v_i = -N_i^{-1} q_i \tag{5-10-1}$$

Equivalently, $v_i$ is the solution of the set of simultaneous linear equations

$$N_i v_i = -q_i \tag{5-10-2}$$

which, in view of Eq. (5-9-10) and Eq. (5-9-12), may be written out as

$$\sum_{\mu=1}^{n} B_\mu^T\,\Gamma\,B_\mu\,v = \sum_{\mu=1}^{n} B_\mu^T\,\Gamma\,e_\mu \tag{5-10-3}$$

where the subscript $i$ has been dropped for convenience.
Referring back to Section 4-4, we find that if we want to determine by multiple linear regression the coefficients $v$ in the expressions

$$e_\mu = B_\mu v \qquad (\mu = 1, 2, \ldots, n) \tag{5-10-4}$$

and if we assume the covariance matrix

$$E(e_\mu - B_\mu v)(e_\eta - B_\eta v)^T = \delta_{\mu\eta}\,\Gamma^{-1} \tag{5-10-5}$$

we are led to minimizing the function

$$\Phi_1(v) \equiv \sum_{\mu=1}^{n} (e_\mu - B_\mu v)^T\,\Gamma\,(e_\mu - B_\mu v) \tag{5-10-6}$$

The normal equations corresponding to $\Phi_1(v)$ are given precisely by Eq. (3), with all quantities evaluated for $\theta = \theta_i$. Thus, each iteration of the Gauss method may be regarded as the solution of a multiple linear regression problem.

The above remark applies only to the determination of the direction $v_i$ but not of the length $\rho_i$. The solution of the linear problem is $\theta_i + v_i$, i.e., $\rho_i = 1$. This step may prove unacceptable in the nonlinear problem, and we use an interpolation-extrapolation scheme as described in Section 5-14 to determine a better value of $\rho_i$. Such schemes were originated by Box (1957), Hartley (1961), and McGhee (1963).

The linear regression problem represented by the objective function Eq. (6) and its normal equations Eq. (3) can be generated by inspection, without reference to Table 5-1 and the derivations of Section 5-9. All we have to do is replace the model equations by their linear approximations around the current value $\theta_i$

$$f_\mu(\theta) \approx f_\mu(\theta_i) + (\partial f_\mu/\partial\theta)(\theta - \theta_i) = f_\mu(\theta_i) + B_\mu v_i \tag{5-10-7}$$

so that $e_\mu = y_\mu - f_\mu(\theta_i) = B_\mu v_i$ as in Eq. (4). These are the linearized model equations. For the weighting matrix $\Gamma$ we take $V^{-1}$ when known, or its current estimate (e.g., $nM^{-1}$) when not known.

It goes without saying that various tricks that are useful for solving linear regression problems are applicable here too.
For instance, it is well known that the condition of the normal equations is usually improved if one "subtracts out the means." This strategy applies provided the model has a constant term. Suppose, for instance, that for a single-equation model, Eq. (4) has the form

$$e_\mu = v_1 + \sum_{\alpha=2}^{l} b_{\mu\alpha} v_\alpha \tag{5-10-8}$$

Let $\bar b_\alpha$ be the average value of $b_{\mu\alpha}$, i.e.,

$$\bar b_\alpha \equiv (1/n)\sum_{\mu=1}^{n} b_{\mu\alpha} \tag{5-10-9}$$
Then Eq. (8) may be rewritten as

$$e_\mu = \tilde v_1 + \sum_{\alpha=2}^{l} \tilde b_{\mu\alpha} v_\alpha \tag{5-10-10}$$

where

$$\tilde v_1 \equiv v_1 + \sum_{\alpha=2}^{l} \bar b_\alpha v_\alpha \tag{5-10-11}$$

$$\tilde b_{\mu\alpha} \equiv b_{\mu\alpha} - \bar b_\alpha \tag{5-10-12}$$

We now use the model Eq. (4), in the form of Eq. (10), to calculate $\tilde v_1, v_2, \ldots, v_l$, and from these and Eq. (11), we can compute $v_1$.

It is a remarkable property of the normal equations in regression problems that they always have a solution, even when $N$ is singular. In fact, when $N$ is singular there are infinitely many solutions. Among these, the one of minimum length is given by

$$v_i = -N_i^{+} q_i \tag{5-10-13}$$

Other solutions may be obtained, for instance, by means of stepwise regression (one continues pivoting until no nonzero pivots are left; see Section A-3). In this solution, the number of nonzero components in $v_i$ does not exceed the rank of $N_i$. We prefer the pseudoinverse solution because of its minimum length property. However, the presence of rounding errors makes use of the exact pseudoinverse undesirable, and it is best to make $N_i$ nonsingular by the directional discrimination or Marquardt methods.

5-11. The Implementation of the Gauss Method

There are several ways in which the direction $v_i$ given by Eq. (5-10-1) may be computed. Any method suitable for the solution of multiple linear regression problems is also useful here. In a sense, however, linear problems place more stringent requirements on methods of solution: We expect to obtain the correct answer in a single step, and must therefore compute $N^{-1}$ very precisely. A nonlinear problem, on the other hand, requires several iterations; slight errors in each iteration can be tolerated, as long as the chosen directions are acceptable. In other words, $N_i^{-1}$ need in principle only be positive definite for nonlinear problems. However, substantial errors in the computation of $N_i^{-1}$ may greatly increase the number of iterations required.
In the presence of a prior density (as when we seek the mode of the posterior distribution) or a penalty function (when inequality constraints apply, see Section 6-1), appropriate terms must be added to both $N_i$ and $q_i$. The terms added to $N$ are usually positive definite, and when $N$ is ill-conditioned, an improvement in its condition may result. The linear regression structure is
seemingly lost, but at the end of this section we indicate how it can sometimes be recovered.

Numerical techniques for computing the direction $v_i$ fall into two classes. First, methods for solving the normal equations, without taking account of their particular structure. These methods are obviously applicable whether or not the equations possess the linear regression structure. Second, methods that rely on the linear regression structure, and are sometimes inapplicable in the presence of a prior distribution. We may also classify methods into those which simply compute $v_i = -N_i^{-1}q_i$, and those which allow the inverse to be adjusted in favorable ways.

(a) Normal Equation Methods. The simplest method simply solves Eq. (5-10-2) for $v_i$, using standard simultaneous equation techniques. The fastest method is the Cholesky decomposition (see Section A-5), but it is not recommended unless $N_i$ is known to be positive definite and fairly well-conditioned.‡ In general, we recommend one of the directional discrimination methods (Section 5-7) or the Marquardt method (Section 5-8), all of which allow us to improve the condition of $N_i$.

(b) Regression Methods. The method of Jennrich and Sampson (1968) consists of applying the stepwise regression technique (see Appendix A-3) to the regression problem of Section 5-10. The normal equations are formed, but components of $v_i$ which cannot significantly reduce the value of the objective function are set to zero. This is a directional discrimination method, the directions coinciding with the coordinate axes. We hazard the guess that backward stepwise regression would be more efficient than forward regression in most nonlinear problems.

Methods which do not require formation of the normal equations are capable of greater numerical accuracy, and are particularly suitable when precise solutions to highly ill-conditioned linear regression problems are required. The main disadvantage of these methods is the need to keep in computer storage all the $B_\mu$ matrices and $e_\mu$ vectors (to form the normal equations, these may be generated and discarded one at a time). We mention here two of these methods: (1) Golub (1965) generates the Cholesky decomposition of $N$ directly from the $B_\mu$, using Householder transformations;§ and (2) modified Gram-Schmidt orthogonalization has been found by Longley (1967) to be considerably more accurate than solution of normal equations. Golub (1969) reports it to be slightly more accurate than the Householder procedure.

‡ The method can be adapted to the singular or near singular case, as shown by Healy (1968), but this adaptation has performed poorly in some test cases we tried. This is because although the Cholesky method gives an accurate solution to the normal equations even when they are nearly singular, the step direction thus generated is so far from the negative gradient as to be almost unacceptable.

§ Golub's method can be adapted to operate sequentially on small segments of the matrix $B$, thereby overcoming the computer storage problem. The resulting algorithm ...

We present here the details of orthogonalization: Let us adopt the following notation, similar to the one used in Section 4-4

$$B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix}, \qquad \Pi = \begin{bmatrix} \Gamma & & & 0 \\ & \Gamma & & \\ & & \ddots & \\ 0 & & & \Gamma \end{bmatrix}, \qquad E = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \tag{5-11-1}$$

$B$ has $mn$ rows and $l$ columns; we denote the latter as $b_1, b_2, \ldots, b_l$. $\Pi$ is $mn \times mn$, and $E$ is a column vector with $mn$ elements. The normal Eq. (5-10-3) may be rewritten as

$$B^T\Pi B\,v = B^T\Pi E \tag{5-11-2}$$

Suppose the $b_\alpha$ are linearly independent. Then we can find a set of $l$ vectors $p_1, p_2, \ldots, p_l$ which are orthonormal relative to $\Pi$, i.e.,

$$p_i^T\,\Pi\,p_j = \delta_{ij} \qquad (i, j = 1, 2, \ldots, l) \tag{5-11-3}$$

and which form a basis for the $b_\alpha$. This means that the $b_\alpha$ are independent linear combinations of the $p_i$, i.e., there exists a nonsingular $l \times l$ matrix $A$ such that

$$b_\alpha = \sum_{i=1}^{l} A_{i\alpha}\,p_i \qquad (\alpha = 1, 2, \ldots, l) \tag{5-11-4}$$

Let $P$ be the matrix whose columns are the $p_i$. Then Eq. (3) and Eq. (4) are equivalent to:

$$P^T\Pi P = I \tag{5-11-5}$$

$$B = PA \tag{5-11-6}$$

In the sequel, whenever we use the term "orthogonal," we mean orthogonal relative to $\Pi$. The vector $E$ can be decomposed into a component which is a linear combination of the $p_i$, and a component $D$ which is orthogonal to all of them. This can be stated concisely in the following equations

$$E = D + \sum_{i=1}^{l} t_i\,p_i = D + Pt \tag{5-11-7}$$

where $D$ satisfies $P^T\Pi D = 0$ and $t$ is an $l$-vector of coefficients. We verify easily that $P^T\Pi E = P^T\Pi D + P^T\Pi P t$, i.e.,

$$t = P^T\Pi E \tag{5-11-8}$$

Employing Eqs. (5)-(8), we can write the solution to Eq. (2) as

$$v = (B^T\Pi B)^{-1}B^T\Pi E = (A^TP^T\Pi PA)^{-1}A^TP^T\Pi E = (A^TA)^{-1}A^Tt = A^{-1}t \tag{5-11-9}$$
The computation of $v$ is particularly easy if $A$ is an easily inverted matrix, e.g., of upper triangular form. The following procedure generates such a matrix. It is known as modified Gram-Schmidt orthogonalization.

1. Form the matrix $C \equiv [B, E]$. This matrix, which has $mn$ rows and $l + 1$ columns, will be transformed as described below. We let $c_i$ denote the $i$th column of $C$.
2. Set $k = 1$.
3. Let $S_{kk} = (c_k^T\,\Pi\,c_k)^{1/2}$.
4. Replace $c_k$ with $c_k/S_{kk}$. Note that now $c_k$ is normalized in the sense that $c_k^T\,\Pi\,c_k = 1$.
5. Let $S_{ki} = c_i^T\,\Pi\,c_k$ for $i = k + 1, k + 2, \ldots, l + 1$.
6. Replace $c_i$ with $c_i - S_{ki}c_k$ for $i = k + 1, k + 2, \ldots, l + 1$. Note that thereby the $c_i$ $(i > k)$ are rendered orthogonal to $c_k$, without losing their previously established orthogonality to the $c_j$ $(j < k)$.
7. If $k = l$, terminate. Otherwise, replace $k$ by $k + 1$ and return to step 3.

It is clear from the remarks accompanying steps 4 and 6 that the first $l$ columns of the final $C$ tableau form an orthonormal (relative to $\Pi$) basis for $B$, and that the last column of $C$ contains the component of $E$ orthogonal to the vectors in that basis. In other words, we now have

$$C = [P, D] \tag{5-11-10}$$

It is easily verified by reference to our algorithm that

$$p_\alpha = (1/S_{\alpha\alpha})\Big(b_\alpha - \sum_{i=1}^{\alpha-1} S_{i\alpha}\,p_i\Big) \qquad (\alpha = 1, 2, \ldots, l) \tag{5-11-11}$$

and

$$D = E - \sum_{i=1}^{l} S_{i,l+1}\,p_i \tag{5-11-12}$$

Hence

$$b_\alpha = \sum_{i=1}^{\alpha-1} S_{i\alpha}\,p_i + S_{\alpha\alpha}\,p_\alpha \tag{5-11-13}$$

and

$$E = D + \sum_{i=1}^{l} S_{i,l+1}\,p_i \tag{5-11-14}$$

Comparing Eq. (13) to Eq. (4) yields:

$$A_{i\alpha} = S_{i\alpha} \quad (i = 1, 2, \ldots, \alpha - 1), \qquad A_{\alpha\alpha} = S_{\alpha\alpha}, \qquad A_{i\alpha} = 0 \quad (i = \alpha + 1, \alpha + 2, \ldots, l) \tag{5-11-15}$$

Similarly, from Eq. (14) and Eq. (7) it appears that

$$t_i = S_{i,l+1} \qquad (i = 1, 2, \ldots, l) \tag{5-11-16}$$
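Steps 1 to 7 translate almost line for line into code. The following Python sketch is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the columns of $B$ are assumed linearly independent (so that no $S_{kk}$ vanishes), the upper triangular system $Av = t$ is solved by back substitution, and the small data set (a four-row $B$, a four-element $E$, and a $2 \times 2$ weight block $\Gamma$ with $m = n = 2$) is chosen here purely as a test case.

```python
import math


def block_diagonal(gamma, n):
    """Build Pi = diag(Gamma, ..., Gamma) with n copies of the m x m block Gamma."""
    m = len(gamma)
    Pi = [[0.0] * (m * n) for _ in range(m * n)]
    for block in range(n):
        for i in range(m):
            for j in range(m):
                Pi[block * m + i][block * m + j] = gamma[i][j]
    return Pi


def weighted_dot(x, y, Pi):
    """Return x^T Pi y."""
    return sum(x[i] * Pi[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def modified_gram_schmidt(B, E, Pi):
    """Orthogonalize C = [B, E] relative to Pi; return S and the final columns."""
    rows, l = len(B), len(B[0])
    cols = [[B[r][c] for r in range(rows)] for c in range(l)] + [list(E)]  # step 1
    S = [[0.0] * (l + 1) for _ in range(l)]
    for k in range(l):                                                    # steps 2 and 7
        S[k][k] = math.sqrt(weighted_dot(cols[k], cols[k], Pi))           # step 3
        cols[k] = [x / S[k][k] for x in cols[k]]                          # step 4
        for i in range(k + 1, l + 1):
            S[k][i] = weighted_dot(cols[i], cols[k], Pi)                  # step 5
            cols[i] = [x - S[k][i] * p for x, p in zip(cols[i], cols[k])] # step 6
    return S, cols


def back_substitute(S, l):
    """Solve A v = t, where A is the leading l x l part of S and t its last column."""
    v = [0.0] * l
    for a in range(l - 1, -1, -1):
        v[a] = (S[a][l] - sum(S[a][b] * v[b] for b in range(a + 1, l))) / S[a][a]
    return v


if __name__ == "__main__":
    B = [[2.0, 12.0], [6.0, 19.0], [15.0, 8.0], [16.0, 10.0]]  # sample data (assumed)
    E = [14.0, 4.0, 9.0, 13.0]
    Gamma = [[2.0, 1.0], [1.0, 1.0]]
    S, cols = modified_gram_schmidt(B, E, block_diagonal(Gamma, 2))
    print("S =", S)
    print("v =", back_substitute(S, 2))
```

Steps 5 and 6 are interleaved inside one loop here; this gives the same result as computing all the $S_{ki}$ first, because each $c_i$ is updated using $c_k$ alone.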
The matrix $A$ and the vector $t$ are thus fully determined. The system of equations $Av = t$ can now be solved for $v$ by successive substitutions:

$$v_l = t_l/A_{ll}, \qquad v_\alpha = \Big(t_\alpha - \sum_{\beta=\alpha+1}^{l} A_{\alpha\beta}\,v_\beta\Big)\Big/A_{\alpha\alpha} \qquad (\alpha = l - 1, l - 2, \ldots, 1) \tag{5-11-17}$$

It may happen that the $b_\alpha$ are not linearly independent. Suppose the rank of $B$ (i.e., the number of linearly independent $b_\alpha$) is $h < l$. Then in $l - h$ of the iterations in the orthogonalization procedure it will turn out that $c_k = 0$, hence $S_{kk} = 0$ and steps 4-6 cannot be carried through. The simplest solution is to leave $C$ unchanged and set $S_{ki} = 0$ $(i = k + 1, k + 2, \ldots, l + 1)$. It follows then that $l - h$ rows of $A$ and the corresponding elements of $t$ will be zero, and the corresponding $v_\alpha$ will be indeterminate according to Eq. (17). To these $v_\alpha$ we may assign arbitrary values, e.g., zero, and the remaining $v_\alpha$ are computed using Eq. (17).

The following is a simple numerical example. Let

$$B = \begin{bmatrix} 2 & 12 \\ 6 & 19 \\ 15 & 8 \\ 16 & 10 \end{bmatrix}, \qquad E = \begin{bmatrix} 14 \\ 4 \\ 9 \\ 13 \end{bmatrix}, \qquad \Gamma = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}$$

Here follow the steps of the algorithm:

1. $C = \begin{bmatrix} 2 & 12 & 14 \\ 6 & 19 & 4 \\ 15 & 8 & 9 \\ 16 & 10 & 13 \end{bmatrix}$
2. $k = 1$
3. $S_{11} = 35.41186$
4. Replace the first column of $C$ with $[0.056478, 0.169435, 0.423587, 0.451826]^T$.
5. $S_{12} = 26.82717$, $S_{13} = 27.92849$
6. The second and third columns of $C$ are replaced by $[10.48485, 14.45455, -3.36364, -2.12121]^T$ and $[12.42265, -0.73206, -2.83014, 0.38118]^T$.
7. $k = 2$
3. $S_{22} = 27.80833$
4. Replace the second column of $C$ with $[0.37704, 0.51979, -0.12096, -0.07628]^T$.
5. $S_{23} = 15.99368$
6. Replace the third column of $C$ with $[6.39239, -9.04545, -0.89558, 1.60117]^T$.

Now we have

$$A = \begin{bmatrix} 35.41186 & 26.82717 \\ 0 & 27.80832 \end{bmatrix}, \qquad t = \begin{bmatrix} 27.92849 \\ 15.99368 \end{bmatrix}$$

so that $v = [0.35296, 0.57514]^T$. It is easily verified that this satisfies the normal equations

$$\begin{bmatrix} 1254 & 950 \\ 950 & 1493 \end{bmatrix} v = \begin{bmatrix} 989 \\ 1194 \end{bmatrix}$$

In some cases, the normal equations take the form

$$(B^T\Pi B + Q)\,v = B^T\Pi E + \boldsymbol{\Phi} \tag{5-11-18}$$

where $Q$ is a given positive definite matrix and $\boldsymbol{\Phi}$ is a given vector. For instance:

1. In the Marquardt method, $Q$ consists of the diagonal elements of $B^T\Pi B$ multiplied by a scalar $\lambda$, and $\boldsymbol{\Phi} = 0$.
2. If $\theta$ has a normal prior distribution with covariance $V_0$ and mean $\theta_0$, then $Q = V_0^{-1}$ and $\boldsymbol{\Phi} = V_0^{-1}(\theta_0 - \theta_i)$.
3. If $\theta$ is subject to inequality constraints and the penalty function method is used, then Eq. (6-1-9) supplies $Q$ and Eq. (6-1-7) supplies $-\boldsymbol{\Phi}$, both to be summed over all constraints.

By appending to the $mn$ model equations $l$ fictitious additional equations one can reduce Eq. (18) to the normal regression form Eq. (2). Let $S$ be a matrix such that $SS^T = Q$ (e.g., the Cholesky decomposition). Define

$$\bar B \equiv \begin{bmatrix} B \\ S^T \end{bmatrix}, \qquad \bar E \equiv \begin{bmatrix} E \\ S^{-1}\boldsymbol{\Phi} \end{bmatrix}, \qquad \bar\Pi \equiv \begin{bmatrix} \Pi & 0 \\ 0 & I \end{bmatrix}$$

One verifies easily that $\bar B^T\bar\Pi\bar B = B^T\Pi B + Q$, and $\bar B^T\bar\Pi\bar E = B^T\Pi E + \boldsymbol{\Phi}$. Therefore, performing a linear regression with $\bar B$, $\bar E$, and $\bar\Pi$ replacing $B$, $E$, and $\Pi$ is equivalent to solving Eq. (18).

5-12. Variable Metric Methods

The Gauss method in its various forms is undoubtedly the best available for the solution of those problems to which it applies. When the objective function is not one of those shown in Table 5-1, however, the method may not be applicable.
One of the so-called variable metric methods is recommended in such cases. The term variable metric methods was coined by Davidon (1959) to designate schemes in which the matrix $\mathbf{R}$ is systematically adjusted from iteration to iteration in such a way as to make it behave like $\mathbf{H}^{-1}$. These methods may be viewed as sophisticated finite difference schemes for computing the second derivatives of $\Phi$. The specific scheme proposed by Davidon has been modified slightly by Fletcher and Powell (1963), and has been widely used in this form, gaining a reputation of being the most efficient general unconstrained optimization method available. This particular implementation was admittedly arbitrary, and subsequent papers have come forth with alternative implementations, e.g., Broyden (1967), Greenstadt (1970), Fiacco and McCormick (1968), Davidon (1968), Pearson (1969), Bard (1970), and Fletcher (1970).

Following a general introduction to these methods, we shall describe in detail the ROC and IROC variations which we have found (Bard, 1970) somewhat more efficient than others. This will be followed by a brief description of the Davidon-Fletcher-Powell method, which is well documented in the literature.

The main idea behind the variable metric methods is the following: From the definitions of the gradient $\mathbf{q}$ and the Hessian $\mathbf{H}$ we have

$$\mathbf{H}_i = (\partial \mathbf{q} / \partial \boldsymbol{\theta})_{\boldsymbol{\theta} = \boldsymbol{\theta}_i} \tag{5-12-1}$$

Therefore, to a first-order approximation

$$\mathbf{H}_i \boldsymbol{\sigma}_i = \boldsymbol{\eta}_i \tag{5-12-2}$$

where $\boldsymbol{\sigma}_i \equiv \boldsymbol{\theta}_{i+1} - \boldsymbol{\theta}_i$, and $\boldsymbol{\eta}_i \equiv \mathbf{q}_{i+1} - \mathbf{q}_i$. This means that

$$\boldsymbol{\sigma}_i = \mathbf{H}_i^{-1} \boldsymbol{\eta}_i \tag{5-12-3}$$

Suppose that before the $i$th iteration we have a matrix $\mathbf{A}_i$ which is an approximation to $\mathbf{H}_i^{-1}$. We wish to add to it a correction $\Delta\mathbf{A}_i$ in such a way that the resulting matrix $\mathbf{A}_{i+1}$ satisfies Eq. (3) when replacing $\mathbf{H}_i^{-1}$. That is, with

$$\mathbf{A}_{i+1} \equiv \mathbf{A}_i + \Delta\mathbf{A}_i \tag{5-12-4}$$

we require that

$$\boldsymbol{\sigma}_i = \mathbf{A}_{i+1} \boldsymbol{\eta}_i = \mathbf{A}_i \boldsymbol{\eta}_i + \Delta\mathbf{A}_i \boldsymbol{\eta}_i \tag{5-12-5}$$

Hence

$$\Delta\mathbf{A}_i \boldsymbol{\eta}_i = \boldsymbol{\rho}_i \tag{5-12-6}$$

where

$$\boldsymbol{\rho}_i \equiv \boldsymbol{\sigma}_i - \mathbf{A}_i \boldsymbol{\eta}_i \tag{5-12-7}$$

Eq. (6) does not determine $\Delta\mathbf{A}_i$ uniquely, since it contains only $l$ conditions for the $l(l+1)/2$ independent elements of the symmetric matrix $\Delta\mathbf{A}_i$.
The simplest possible matrix $\Delta\mathbf{A}_i$ is of rank one, i.e., it has the form

$$\Delta\mathbf{A}_i = \mathbf{r}_i \mathbf{r}_i^T \tag{5-12-8}$$

where $\mathbf{r}_i$ is some vector. Substituting in Eq. (6) we obtain

$$\mathbf{r}_i \mathbf{r}_i^T \boldsymbol{\eta}_i = \boldsymbol{\rho}_i \tag{5-12-9}$$

that is

$$\mathbf{r}_i = (1 / \mathbf{r}_i^T \boldsymbol{\eta}_i)\, \boldsymbol{\rho}_i = \alpha \boldsymbol{\rho}_i \tag{5-12-10}$$

where $\alpha \equiv (\mathbf{r}_i^T \boldsymbol{\eta}_i)^{-1}$ is an unknown constant. Substituting in Eq. (9) we find

$$\alpha^2 \boldsymbol{\rho}_i (\boldsymbol{\rho}_i^T \boldsymbol{\eta}_i) = \boldsymbol{\rho}_i \tag{5-12-11}$$

Therefore

$$\alpha^2 = 1 / \boldsymbol{\rho}_i^T \boldsymbol{\eta}_i \tag{5-12-12}$$

Finally

$$\Delta\mathbf{A}_i = \mathbf{r}_i \mathbf{r}_i^T = \alpha^2 \boldsymbol{\rho}_i \boldsymbol{\rho}_i^T = (1 / \boldsymbol{\rho}_i^T \boldsymbol{\eta}_i)\, \boldsymbol{\rho}_i \boldsymbol{\rho}_i^T \tag{5-12-13}$$

Eq. (13) defines the Rank One Correction method (ROC). Broyden (1967), Davidon (1968), and Fiacco and McCormick (1968) have all proven the following:

Theorem. Suppose $\Phi(\boldsymbol{\theta})$ is a quadratic function with a constant nonsingular Hessian matrix $\mathbf{H}$. Let $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_{l+1}$ be a set of points such that the vectors $\boldsymbol{\sigma}_i \equiv \boldsymbol{\theta}_{i+1} - \boldsymbol{\theta}_i$ $(i = 1, 2, \ldots, l)$ are linearly independent. Let $\mathbf{A}_1$ be an arbitrary symmetric matrix, and let $\mathbf{A}_i$ $(i = 2, 3, \ldots, l+1)$ be defined recursively by means of Eq. (4) and Eq. (13). Then, provided $\boldsymbol{\rho}_i^T \boldsymbol{\eta}_i \neq 0$ for $i = 1, 2, \ldots, l$, we have

$$\mathbf{A}_{l+1} = \mathbf{H}^{-1} \tag{5-12-14}$$

The theorem says that if $\Phi$ is quadratic, the ROC method produces the exact inverse Hessian in $l$ steps. Once the inverse Hessian is known, a single Newton step converges to the minimum. When $\Phi$ is not quadratic, one expects $\mathbf{A}_i$ $(i > l)$ to represent an approximation to $\mathbf{H}^{-1}$ evaluated somewhere in the region of the last $l$ iterates. This should be particularly true near the minimum, where successive iterates lie close together. We expect the matrices $\mathbf{A}_i$ to converge to the value of $\mathbf{H}^{-1}$ at the minimum. Though no rigorous proof of this proposition has been found as yet, numerical tests have confirmed its validity.
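The theorem can be checked numerically on a small quadratic. The following is a minimal sketch, assuming NumPy; the function name `roc_update`, the test Hessian, the choice of unit steps, and the skip tolerance are all illustrative assumptions, not part of the text.

```python
import numpy as np

def roc_update(A, sigma, eta):
    """Rank-one correction of Eq. (5-12-13): A + rho rho^T / (rho^T eta)."""
    rho = sigma - A @ eta                      # Eq. (5-12-7)
    denom = rho @ eta
    # Assumed safeguard: skip the update when rho^T eta is (nearly) zero,
    # the case excluded by the theorem.
    if abs(denom) <= 1e-12 * np.linalg.norm(rho) * np.linalg.norm(eta):
        return A
    return A + np.outer(rho, rho) / denom

# Quadratic Phi = 0.5 * theta^T H theta, so the gradient is q = H theta
# and eta_i = q_{i+1} - q_i = H sigma_i exactly.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
A = np.eye(3)                                  # arbitrary symmetric A_1
for sigma in np.eye(3):                        # l linearly independent steps
    eta = H @ sigma
    A = roc_update(A, sigma, eta)

# Largest deviation of A_{l+1} from the exact inverse Hessian.
roc_error = np.max(np.abs(A - np.linalg.inv(H)))
```

After the three updates `roc_error` should be at rounding level, in line with Eq. (5-12-14). In an actual minimization the steps $\boldsymbol{\sigma}_i$ come from the iteration itself rather than being chosen as unit vectors.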
Although the theorem holds in principle for arbitrary $\mathbf{A}_1$, numerical stability of the calculations [see Bard (1968)] requires that the elements of $\mathbf{A}_1$ have the right order of magnitude. A good choice is a diagonal matrix with

$$A_{1,\alpha\alpha} = |\theta_{1,\alpha} / q_{1,\alpha}| \tag{5-12-15}$$

Since $\mathbf{A}_i$ is an approximation to $\mathbf{H}_i^{-1}$, we would like to take $\mathbf{A}_i$ for $\mathbf{R}_i$. There is no guarantee, however, that $\mathbf{A}_i$ is positive definite. This may be remedied by means of a slightly modified Greenstadt procedure: Let Eq. (A-5-21) be the scaled decomposition of $\mathbf{A}_i$, that is

$$\mathbf{A}_i = \sum_{j=1}^{l} \pi_j \mathbf{f}_j \mathbf{f}_j^T \tag{5-12-16}$$

Let us define

$$\gamma_j = \max[\beta, \min(|\pi_j|, \gamma)] \tag{5-12-17}$$

where $\beta$ and $\gamma$ are small and large positive constants, respectively. Then

$$\mathbf{R}_i \equiv \sum_{j=1}^{l} \gamma_j \mathbf{f}_j \mathbf{f}_j^T \tag{5-12-18}$$

is positive definite. It coincides with $\mathbf{A}_i$ when the latter is positive definite with all eigenvalues between $\beta$ and $\gamma$.

Marquardt type corrections to the diagonal elements could, in principle, also be used for rendering $\mathbf{A}_i$ positive definite. This type of correction does not appear to work very well when applied to a matrix that is an approximation to the inverse, rather than to the Hessian itself. It happens, however, that a procedure entirely analogous to the ROC method can be used to construct an approximation to the Hessian directly. We call this method inverse rank one correction (IROC) (Bard, 1970). In this case we wish to satisfy $(\mathbf{A}_i + \Delta\mathbf{A}_i)\boldsymbol{\sigma}_i = \boldsymbol{\eta}_i$ [see Eq. (2)], and we are led to

$$\Delta\mathbf{A}_i = (1 / \mathbf{s}_i^T \boldsymbol{\sigma}_i)\, \mathbf{s}_i \mathbf{s}_i^T \tag{5-12-19}$$

where

$$\mathbf{s}_i \equiv \boldsymbol{\eta}_i - \mathbf{A}_i \boldsymbol{\sigma}_i \tag{5-12-20}$$

We initialize $\mathbf{A}_1$ as the inverse of the matrix given by Eq. (15). The matrices $\mathbf{A}_i$ converge to $\mathbf{H}$ in the quadratic case. Since $\mathbf{A}_i$ is an approximation to $\mathbf{H}$, we can use the Cholesky decomposition with the Marquardt method to compute $\mathbf{v}_i$ efficiently, as described in Section 5-8.
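The eigenvalue modification of Eqs. (5-12-16) to (5-12-18) can be sketched as follows. This is only an illustration under a stated assumption: it uses the ordinary symmetric eigendecomposition from NumPy in place of the scaled decomposition of Eq. (A-5-21), which is not reproduced here. The function name and the default values of the small and large constants are likewise hypothetical.

```python
import numpy as np

def make_positive_definite(A, small=1e-6, large=1e6):
    """Replace each eigenvalue pi_j by max[small, min(|pi_j|, large)].

    Assumption: a plain (unscaled) eigendecomposition stands in for the
    scaled decomposition referred to in the text.
    """
    A = 0.5 * (A + A.T)                        # enforce symmetry
    pi, F = np.linalg.eigh(A)                  # A = sum_j pi_j f_j f_j^T
    modified = np.maximum(small, np.minimum(np.abs(pi), large))
    return (F * modified) @ F.T                # sum_j modified_j f_j f_j^T

# An indefinite matrix: one positive and one negative eigenvalue.
A = np.array([[2.0, 0.0],
              [0.0, -3.0]])
R = make_positive_definite(A)                  # the eigenvalue -3 becomes +3

# A matrix that is already positive definite (within bounds) is unchanged.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
P_out = make_positive_definite(P)
```

Taking absolute values keeps the magnitude of the curvature information along each eigenvector while reversing the sign of any direction that would otherwise lead uphill.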
In the Davidon-Fletcher-Powell method (DFP) (Davidon, 1959; Fletcher and Powell, 1963), the matrix $\Delta\mathbf{A}_i$ is of rank two, instead of rank one. The simplest such choice satisfying Eq. (6) is

$$\Delta\mathbf{A}_i = (1 / \boldsymbol{\sigma}_i^T \boldsymbol{\eta}_i)\, \boldsymbol{\sigma}_i \boldsymbol{\sigma}_i^T - (1 / \boldsymbol{\eta}_i^T \mathbf{A}_i \boldsymbol{\eta}_i)\, \mathbf{A}_i \boldsymbol{\eta}_i \boldsymbol{\eta}_i^T \mathbf{A}_i \tag{5-12-21}$$

Suppose we choose $\boldsymbol{\sigma}_i = -\hat{\rho}_i \mathbf{A}_i \mathbf{q}_i$, where $\hat{\rho}_i$ is a positive value of $\rho$ at which $\Phi(\boldsymbol{\theta}_i - \rho \mathbf{A}_i \mathbf{q}_i)$ attains a minimum. Fletcher and Powell have shown that under these conditions $\mathbf{A}_{i+1} = \mathbf{A}_i + \Delta\mathbf{A}_i$ is positive definite provided $\mathbf{A}_i$ was so. Therefore using $\mathbf{R}_i = \mathbf{A}_i$ always produces an acceptable step.

5-13. Step Size

In the preceding sections we were concerned primarily with choosing the direction of the step taken in the $i$th iteration, that is, with the choice of $\mathbf{R}_i$. We shall now turn our attention to the determination of step size, i.e., to the choice of $\rho_i$. The methods that have been used fall into three categories:

1. $\rho_i = 1$. Required by Newton's method (to guarantee quadratic convergence near the minimum) and by Marquardt's method. In the latter case, the step size is determined indirectly through the choice of $\lambda_i$.
2. $\rho_i = \hat{\rho}_i$, i.e., we proceed along the chosen direction to the point at which $\Phi$ ceases to decrease, as required by the Davidon-Fletcher-Powell method. Suitable methods of searching for $\hat{\rho}_i$ are given by Fletcher and Powell (1963), Bard (1970), Goldfarb and Lapidus (1968), and others.
3. Interpolation-extrapolation, employed in conjunction with the Gauss and ROC methods. Here one expends a certain amount of effort on finding a good, acceptable value of $\rho_i$, without bothering to locate $\hat{\rho}_i$ precisely.

It is true, on the whole, that the closer $\rho_i$ is to $\hat{\rho}_i$, the smaller is the total number of iterations required. On the other hand, the more precisely we wish to determine the value of $\hat{\rho}_i$, the larger is the number of times that we must evaluate the objective function in each iteration.
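Case 2 is easiest to illustrate on a quadratic objective, where the minimizing step along a given direction has a closed form and no search is needed. The sketch below, assuming NumPy, combines that exact step with the rank-two correction of Eq. (5-12-21). The test problem, the variable names, and the closed-form step are assumptions made for illustration; for a general objective the minimizing step would have to be found by a one-dimensional search.

```python
import numpy as np

def dfp_update(A, sigma, eta):
    """Rank-two correction of Eq. (5-12-21)."""
    A_eta = A @ eta
    return (A
            + np.outer(sigma, sigma) / (sigma @ eta)
            - np.outer(A_eta, A_eta) / (eta @ A_eta))

# Quadratic Phi = 0.5 theta^T H theta - g^T theta, gradient q = H theta - g.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
g = np.array([1.0, 2.0])

theta = np.zeros(2)
A = np.eye(2)                                  # positive definite start
min_eigs = []
for _ in range(2):                             # l = 2 iterations
    q = H @ theta - g
    d = A @ q                                  # step is taken along -d
    rho_hat = (q @ d) / (d @ H @ d)            # exact minimizer for a quadratic
    sigma = -rho_hat * d
    eta = H @ sigma                            # q_{i+1} - q_i for a quadratic
    A = dfp_update(A, sigma, eta)
    theta = theta + sigma
    min_eigs.append(np.linalg.eigvalsh(A).min())

dfp_theta_error = np.max(np.abs(theta - np.linalg.solve(H, g)))
```

In this example each updated matrix keeps a positive smallest eigenvalue, and the iterate reaches the minimum of the quadratic after two steps. The loop is deliberately stopped there, since a further pass would divide by a vanishing gradient.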
The difference between cases 2 and 3 is that in the former the best balance is struck when $\hat{\rho}_i$ is determined with much greater precision than is required in the latter. In the succeeding section we suggest a simple algorithm for determining $\rho_i$ in case 3. This has worked with a reasonable degree of success, but there is no evidence that it is the most efficient possible. There is no end to the degree of ingenuity that may be expended on devising such algorithms.

In all cases the search for $\rho_i$ proceeds without computation of derivatives. It would be wasteful to compute at each point $l + 1$ functions ($\Phi$ and the $l$ components of its gradient) in order to conduct a one-dimensional search. The gradient is required only at the main iterates $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots$.
In the algorithm of the succeeding section it is assumed that at each iteration we are given an upper bound $\rho_{i,\max}$ on the feasible values of $\rho_i$. When inequality constraints need to be satisfied (see Section 6-1), $\rho_{i,\max}$ is the smallest positive value of $\rho$ for which $\boldsymbol{\theta}_i - \rho \mathbf{R}_i \mathbf{q}_i$ lies on the boundary of the feasible region. If no inequality constraints apply, $\rho_{i,\max}$ can be chosen as an arbitrarily large number. We are also given a lower bound $\rho_{i,\min}$ (see Section 5-15). If no acceptable $\rho > \rho_{i,\min}$ can be found, the search is terminated.

5-14. Interpolation-Extrapolation

Assuming that we have chosen an acceptable direction, there always exists a number $\eta_i$ such that if $0 < \rho < \eta_i$, then $\Psi_i(\rho) \equiv \Phi(\boldsymbol{\theta}_i - \rho \mathbf{R}_i \mathbf{q}_i) < \Phi_i$. The basic idea of the interpolation method is that if we have initially picked a value $\rho = \rho^{(0)}$ such that $\Psi_i(\rho^{(0)}) \geq \Phi_i$, we next try a smaller value of $\rho$, and keep repeating the process until an acceptable value is found. The idea behind extrapolation is that if our initial choice $\rho = \rho^{(0)}$ turned out acceptable, it pays to try at least one other value of $\rho$ to see whether we cannot do even better. In both cases, the new trial value of $\rho$ is chosen so as to minimize a quadratic approximation to $\Psi_i(\rho)$.

We know that $\Psi_i(0) = \Phi_i$, and $(d\Psi_i / d\rho)_{\rho=0} = -\mathbf{q}_i^T \mathbf{R}_i \mathbf{q}_i$ (see Eq. (5-3-6)). Suppose we have computed $\Psi_i(\rho^{(0)})$. Let us define $\alpha \equiv \Phi_i$, $\beta \equiv \Psi_i(\rho^{(0)})$, $\gamma \equiv -\mathbf{q}_i^T \mathbf{R}_i \mathbf{q}_i$, and let us try to find a quadratic function $a + b\rho + c\rho^2$ whose values match those of $\Psi_i(\rho)$ at $\rho = 0$ and $\rho = \rho^{(0)}$, and whose slope matches that of $\Psi_i(\rho)$ at $\rho = 0$. We have then:

$$a = \alpha \tag{5-14-1}$$

$$b = \gamma \tag{5-14-2}$$

$$a + b\rho^{(0)} + c\rho^{(0)2} = \beta \tag{5-14-3}$$

Whence

$$c = (\beta - \alpha - \gamma\rho^{(0)}) / \rho^{(0)2} \tag{5-14-4}$$

The quadratic $a + b\rho + c\rho^2$ has a stationary point at

$$\rho^* = -b / 2c = \gamma\rho^{(0)2} / 2(\gamma\rho^{(0)} + \alpha - \beta) \tag{5-14-5}$$

The initial value of $\rho^{(0)}$ for each iteration is determined cautiously or optimistically depending on whether or not the previous iteration required interpolations. A detailed implementation of these ideas is given in the flowcharts of Fig. 5-2.
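The stationary point of Eq. (5-14-5) is simple enough to sketch in a few lines of Python. The function names are hypothetical. The safeguard that confines an interpolated step to between 0.25 and 0.75 of the previous trial value is an assumption modeled on the interpolation step of Fig. 5-2a, not a definitive reading of the flowchart.

```python
def quadratic_step(alpha, beta, gamma, rho0):
    """Stationary point of the fitted quadratic, Eq. (5-14-5).

    alpha = Phi_i, beta = Psi_i(rho0), gamma = slope of Psi_i at rho = 0
    (negative for an acceptable direction). Returns None when the fitted
    quadratic has no minimum, i.e. when its curvature c is not positive.
    """
    denom = 2.0 * (gamma * rho0 + alpha - beta)
    if denom >= 0.0:                   # equivalent to c <= 0 in Eq. (5-14-4)
        return None
    return gamma * rho0 ** 2 / denom


def safeguarded_interpolation(rho_star, rho_prev, low=0.25, high=0.75):
    """Assumed safeguard: keep the new trial value within
    [low * rho_prev, high * rho_prev]."""
    return max(low * rho_prev, min(high * rho_prev, rho_star))


# Worked check with Psi(rho) = (rho - 1)^2, which is exactly quadratic:
# Psi(0) = 1, slope at 0 is -2, and a first trial at rho0 = 3 gives Psi = 4.
rho_star = quadratic_step(alpha=1.0, beta=4.0, gamma=-2.0, rho0=3.0)
rho_next = safeguarded_interpolation(rho_star, rho_prev=3.0)
```

Because the test function is itself quadratic, `rho_star` lands on its true minimizer at 1, and the safeguard leaves it unchanged since 1 lies between 0.75 and 2.25.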
[Flowchart] Fig. 5-2a. Determination of step length, interpolation.

Entry: Given $\Phi_i$, $\boldsymbol{\theta}_i$, $\mathbf{R}_i$, $\mathbf{q}_i$, $\rho_{i,\max}$, $\rho_{i,\min}$, $J$ (an integer set = 1 in the first iteration), $\sigma$ (= 0.5 if penalty functions are used, = 1 otherwise). Set $\rho^{(0)} = 2^{-J} \min(1, \sigma\rho_{i,\max})$, $\gamma = -\mathbf{q}_i^T \mathbf{R}_i \mathbf{q}_i$. Compute $\Psi^{(0)}$ and $\rho^* = \gamma\rho^{(0)2} / 2(\gamma\rho^{(0)} + \Phi_i - \Psi^{(0)})$.

Interpolation step: Compute $\rho^{(2)} = \max[0.25\rho^{(1)}, \min(0.75\rho^{(1)}, \rho^*)]$. Compute† $\Psi^{(2)}$; increase $J$ by 1. Set $\rho^{(1)} = \rho^{(2)}$. Compute $\rho^* = \gamma\rho^{(1)2} / 2(\gamma\rho^{(1)} + \Phi_i - \Psi^{(1)})$.

Exits: Terminate, accept $\boldsymbol{\theta}^* = \boldsymbol{\theta}_i$; out (to next iteration); to Fig. 5-2b.

Notation: $\Psi^{(j)} \equiv \Psi_i(\rho^{(j)}) = \Phi(\boldsymbol{\theta}_i - \rho^{(j)} \mathbf{R}_i \mathbf{q}_i)$

† If the computation of $\Psi^{(2)}$ is impossible (due, e.g., to an excessively large argument in an exponential) ...
[Flowchart] Fig. 5-2b. Determination of step length, extrapolation.

Entry: From Fig. 5-2a. Replace $J$ by the largest integer not exceeding $J/2$.

Exits: Out (to next iteration).

Notation: $\Psi^{(j)} \equiv \Psi_i(\rho^{(j)}) = \Phi(\boldsymbol{\theta}_i - \rho^{(j)} \mathbf{R}_i \mathbf{q}_i)$
5-15. Termination

It is necessary to devise a criterion for stopping the iterative search for the minimum of $\Phi(\boldsymbol{\theta})$. As was stated before, all one can hope for is convergence to a stationary point of $\Phi$. It may seem natural, therefore, to adopt the vanishing of the gradient as the termination criterion. Unfortunately, rounding errors and poor scaling often make the goal of a vanishing gradient unattainable even approximately. In many cases, the computer comes up with parameter values very close to the minimum, yet the gradient is still sizable. In addition, if perchance the algorithm fails to converge at all, a termination rule based entirely on the gradient leaves the program to iterate endlessly.

A more practical criterion dictates stopping as soon as further iterations fail to change the parameter values significantly. That is, given a set of small numbers $\varepsilon_\alpha$ $(\alpha = 1, 2, \ldots, l)$, we accept $\boldsymbol{\theta}_{i+1}$ as the solution $\boldsymbol{\theta}^*$ provided

$$|\theta_{i+1,\alpha} - \theta_{i,\alpha}| \leq \varepsilon_\alpha \quad (\alpha = 1, 2, \ldots, l) \tag{5-15-1}$$

where $\theta_{i,\alpha}$ is the $\alpha$th component of $\boldsymbol{\theta}_i$. The numbers $\varepsilon_\alpha$ may either be prescribed in advance, or they may be computed by the program. In the latter case, following Marquardt (1963), we recommend

$$\varepsilon_\alpha = 10^{-4}(|\theta_{i,\alpha}| + 10^{-3}) \tag{5-15-2}$$

where the additive term $10^{-3}$ is designed to avoid embarrassment if $\theta_\alpha$ happens to be nearly zero. This criterion has worked very well in practice. It tends to be on the conservative side, sometimes allowing a few more iterations than are strictly necessary. The rationale for the criterion is that, convergence or no convergence, it does not pay to keep iterating if the parameter values cease changing.

Suppose in the $i$th iteration a step direction $\mathbf{v}_i$ has been determined. Then Eq. (1) is satisfied if for each $\alpha$, $\rho|v_{i,\alpha}| \leq \varepsilon_\alpha$, i.e., if $\rho \leq \min_\alpha[\varepsilon_\alpha / |v_{i,\alpha}|]$. Hence
the minimum admissible $\rho$ for the $i$th iteration is $\rho_{i,\min} = \min_\alpha[\varepsilon_\alpha / |v_{i,\alpha}|]$. As shown in Fig. 5-2a, termination occurs if the algorithm is forced to choose $\rho_i \leq \rho_{i,\min}$.

The above criterion does not offer an ironclad guarantee that the process will terminate in a finite number of steps. If the objective function is known to have a finite minimum, then termination can be guaranteed if we stop whenever $\Phi_{i-1} - \Phi_i < \varepsilon$ for some small prespecified positive number $\varepsilon$. That is, we stop as soon as no significant progress is made in reducing the value of the objective function. It may be safer, however, to require $\Phi_{i-l} - \Phi_i < \varepsilon$, i.e., to continue unless no significant progress has been made over a number of
iterations. The variable metric methods are particularly liable to stall over a number of iterations, then to make sudden progress.

Finally, an upper bound may be placed on the number of iterations allowed. This should be coupled with a restart procedure to permit continuation with possibly a different algorithm.

Once the iterative process is terminated at $\boldsymbol{\theta} = \boldsymbol{\theta}^*$, one would like to know whether or not one has arrived at a minimum. We assume that we know the gradient $\mathbf{q}^* = \mathbf{q}(\boldsymbol{\theta}^*)$ and at least some approximation $\mathbf{H}^*$ to the Hessian $\mathbf{H}(\boldsymbol{\theta}^*)$. If we cut a cross section of the $\Phi$ surface along the $\theta_\alpha$ axis, we have a curve whose approximate equation near $\boldsymbol{\theta}^*$ is given by

$$\Psi(\theta_\alpha) = \Phi^* + q_\alpha^*(\theta_\alpha - \theta_\alpha^*) + \tfrac{1}{2} H_{\alpha\alpha}^*(\theta_\alpha - \theta_\alpha^*)^2 \tag{5-15-3}$$

which has a stationary point at

$$\theta_\alpha = \theta_\alpha^* - q_\alpha^* / H_{\alpha\alpha}^* \tag{5-15-4}$$

The quantity $D_\alpha \equiv |q_\alpha^* / H_{\alpha\alpha}^*|$ is therefore a measure of the error in the determination of $\theta_\alpha^*$. If each $D_\alpha$ is small on the scale by which $\theta_\alpha$ is measured, then it is likely that $\boldsymbol{\theta}^*$ is very close to a stationary point of $\Phi$. Though not foolproof, this test works in most cases. Its reliability is improved if applied in the coordinate system of the canonical variables (see Section 7-3).

If $\mathbf{H}^*$ is indeed the Hessian of $\Phi$ at $\boldsymbol{\theta}^*$, then we may easily determine whether $\boldsymbol{\theta}^*$ (already known to be a stationary point) is really a minimum. All that is required is that $\mathbf{H}^*$ be positive definite (see Section 3-5), i.e., that all its eigenvalues be positive. When using the Gauss method, our approximation $\mathbf{N}^*$ is constructed so as to be automatically positive definite, regardless of whether or not $\mathbf{H}(\boldsymbol{\theta}^*)$ is so. In these cases, then, $\mathbf{N}^*$ contains no information pertaining to the nature of the point $\boldsymbol{\theta}^*$. Our only recourse is to explore directly the behavior of $\Phi$ around $\boldsymbol{\theta}^*$. In the ROC method, on the other hand, the matrix $\mathbf{R}_i$ from the
last iteration may be a true approximation to $[\mathbf{H}(\boldsymbol{\theta}^*)]^{-1}$. We cannot prove that $\mathbf{R}_i$ is or is not positive definite when $\boldsymbol{\theta}^*$ is or is not a minimum; however, if $\mathbf{R}_i$ is not positive definite we suspect that $\boldsymbol{\theta}^*$ is not a minimum, and vice versa.

If one has reason to doubt that $\boldsymbol{\theta}^*$ is a minimum, one should restart the iterative procedure from a point close but not identical to $\boldsymbol{\theta}^*$. If convergence to the same $\boldsymbol{\theta}^*$ is obtained, this is likely to be at least a local minimum.

5-16. Remarks on Convergence

Suppose our process has converged to a point $\boldsymbol{\theta}^*$ at which $\Phi$ turns out to be not stationary. We assume that we have used a gradient method that is at least nominally acceptable. The directional derivative $-\mathbf{q}_i^T \mathbf{R}_i \mathbf{q}_i$ of $\Phi$ in the
direction of the vector $-\mathbf{R}_i \mathbf{q}_i$ is supposed to be negative, yet even a very small step in this direction fails to produce a decrease in $\Phi$. The cause is probably one of the following:

(a) The gradient has not been calculated with sufficient accuracy. If an incorrect vector $\mathbf{p}$ is used in place of $\mathbf{q}_i$ in the calculation of the direction, the true directional derivative is $-\mathbf{q}_i^T \mathbf{R}_i \mathbf{p}$, which need not be negative even when $\mathbf{R}_i$ is positive definite. What is needed, then, is increased accuracy in the computation of the derivatives. This is why the derivatives should be computed from their analytic formulas. If this is impossible, one must revert to finite difference approximations. The question of finite differences is discussed in Section 5-18. When the model equations are thought to be too complicated for analytic differentiation by hand, one may use the computer to perform this task. Computer programs to perform analytic differentiation are available. In fact, the FORMAC system (Bond et al., 1964) is utilized by Eisenpress et al. (1966) to supply first and second analytic derivatives of nonlinear model equations in an implementation of the Newton-Greenstadt algorithm (see Eisenpress and Greenstadt (1966)) for parameter estimation of econometric models. For a survey of other schemes for analytic differentiation by computer see Sammet (1966).

(b) The matrix $\mathbf{R}_i$ is not sufficiently positive definite, due to accumulation of rounding errors. This cannot happen with either the Gauss or ROC methods, as long as we use the discrimination or Marquardt trick to insure positive definiteness. It can, however, happen in ill-conditioned cases as a result of rounding errors when conventional matrix inversion or simultaneous equations solution methods are used. These are to be avoided.

We do not possess any methods that guarantee convergence to the global minimum.
If we have computed a solution which we suspect is not the global minimum, we can restart the calculations from a radically different initial guess, and repeat the process until we are satisfied. Such procedures are rarely required in well-posed parameter estimation problems, that is, in problems where: (1) the errors in the data are not excessive, (2) the model fits the data well (with the proper values of the parameters), (3) the true parameter values are not outside the permitted range, and (4) the data were obtained from properly designed experiments (see Section 7-18 and Chapter X). When these conditions do not hold, almost anything can happen: the objective function may possess multiple minima, or may slope down asymptotically as certain parameters increase beyond bound. These things do occasionally happen even in well-posed problems, but not very frequently.

The reader should realize that the state of the art of nonlinear optimization is such that one cannot as yet write a computer program that will produce the correct answer to every parameter estimation problem in a single computer
run. All too often, the first run produces unacceptable results. By studying these results one can perhaps obtain better starting guesses; one can choose to impose bounds or a prior distribution on the variables, or to relax previously imposed bounds; one can search for errors in the coding of the model equations or their derivatives. By careful coaxing, the computer may be made to yield acceptable results in subsequent runs. An interactive computer system can be particularly useful for this purpose.

5-17. Derivative Free Methods

We have dwelt at length on gradient methods because these have proven to be fastest and most reliable for a large number of problems.‡ This is not surprising; precise knowledge of the objective function gradient at any point immediately puts at our disposal the totality of downhill directions at that point. To test whether a given direction belongs to this class, all we need to do is verify that it forms an obtuse angle with the gradient. As we have remarked in the previous section, we lose this crucial ability as soon as the true gradient is replaced by an approximation.

Nevertheless, the burden of differentiating the model equations may at times prove too onerous, and while precise derivatives may be crucial in some problems, other (perhaps most) problems can be solved without them. We discuss below some of the methods of doing so.

‡ Colville (1968) reports this to be the case even with some highly constrained nonlinear programming problems.

5-18. Finite Differences

The most obvious, and in our experience most successful, method for avoiding analytic differentiation of the objective function is to use a gradient method, with finite difference approximations supplying the required derivatives.

The simplest finite difference approximation to the gradient is given by the one-sided difference method

$$q_\alpha \approx \frac{\Phi(\theta_1, \theta_2, \ldots, \theta_\alpha + \delta\theta_\alpha, \ldots, \theta_l) - \Phi(\theta_1, \theta_2, \ldots, \theta_\alpha, \ldots, \theta_l)}{\delta\theta_\alpha} \quad (\alpha = 1, 2, \ldots, l) \tag{5-18-1}$$
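Eq. (5-18-1) translates directly into a short routine. The sketch below assumes NumPy; the function name, the test objective, and the step sizes are hypothetical choices made only to show the mechanics, and the steps are simply passed in by the caller.

```python
import numpy as np

def forward_difference_gradient(phi, theta, steps):
    """One-sided difference approximation to the gradient, Eq. (5-18-1).

    Costs one extra evaluation of phi per parameter, on top of phi(theta).
    """
    theta = np.asarray(theta, dtype=float)
    phi0 = phi(theta)
    q = np.empty_like(theta)
    for a in range(theta.size):
        shifted = theta.copy()
        shifted[a] += steps[a]                 # perturb one parameter only
        q[a] = (phi(shifted) - phi0) / steps[a]
    return q


def phi(theta):
    # Hypothetical smooth test objective with a known gradient.
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1] + np.exp(theta[1])


theta = np.array([1.0, 0.5])
q_fd = forward_difference_gradient(phi, theta, steps=[1e-6, 1e-6])
q_exact = np.array([2.0 * theta[0] + 3.0 * theta[1],
                    3.0 * theta[0] + np.exp(theta[1])])
gradient_error = np.max(np.abs(q_fd - q_exact))
```

Comparing `q_fd` with `q_exact` gives a direct view of the approximation error for a given choice of steps.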
Two sources of error contribute to the inaccuracy of $q_\alpha$: (1) the rounding error arising when two closely spaced values of $\Phi$ are subtracted from each other, and (2) the truncation error due to the inexact nature of Eq. (1), which is accurate only in the limit as $\delta\theta_\alpha \to 0$. The rounding error increases as $\delta\theta_\alpha$ decreases.

We shall henceforth write $q_\alpha = [\Phi(\boldsymbol{\theta} + \delta\theta_\alpha) - \Phi(\boldsymbol{\theta})] / \delta\theta_\alpha$ as shorthand for Eq. (1). Let $\varepsilon$ be the relative error in the computed values of $\Phi$ (at best $\varepsilon = 2^{-b}$ where $b$ is the number of binary digits carried by the computer in use). The actual error in $\Phi$ has magnitude $\varepsilon|\Phi|$, and the error in $\Phi(\boldsymbol{\theta} + \delta\theta_\alpha) - \Phi(\boldsymbol{\theta})$ can be as high as $2\varepsilon|\Phi|$, although the root-mean-square error is only $\sqrt{2}\,\varepsilon|\Phi|$. The maximum rounding error in Eq. (1) is, therefore

$$\delta_{R,\alpha} \equiv 2\varepsilon|\Phi / \delta\theta_\alpha| \tag{5-18-2}$$

On the other hand, we have the Taylor series expansion

$$\Phi(\boldsymbol{\theta} + \delta\theta_\alpha) = \Phi(\boldsymbol{\theta}) + q_\alpha \delta\theta_\alpha + \tfrac{1}{2} H_{\alpha\alpha} \delta\theta_\alpha^2 + \cdots \tag{5-18-3}$$

The truncation error in Eq. (1) is, therefore, approximately

$$\delta_{T,\alpha} \equiv \tfrac{1}{2}|H_{\alpha\alpha} \delta\theta_\alpha| \tag{5-18-4}$$

The maximum total error is approximately

$$\delta_\alpha \equiv \delta_{R,\alpha} + \delta_{T,\alpha} = 2\varepsilon|\Phi / \delta\theta_\alpha| + \tfrac{1}{2}|H_{\alpha\alpha} \delta\theta_\alpha| \tag{5-18-5}$$

This has a minimum at

$$|\delta\theta_\alpha| = (4\varepsilon|\Phi / H_{\alpha\alpha}|)^{1/2} \tag{5-18-6}$$

If we are interested in the mean square error instead, we would minimize $2\varepsilon^2\Phi^2 / \delta\theta_\alpha^2 + H_{\alpha\alpha}^2 \delta\theta_\alpha^2 / 4$, so that

$$|\delta\theta_\alpha| = 2^{3/4}(\varepsilon|\Phi / H_{\alpha\alpha}|)^{1/2} \tag{5-18-7}$$

Eq. (6) or Eq. (7) can be used as a basis for estimating the step sizes $\delta\theta_\alpha$ $(\alpha = 1, 2, \ldots, l)$ required for computing the differences. The same equations could have been obtained by requiring that the two error sources contribute equally. In the case of Eq. (6) the total maximum error turns out to be $2(\varepsilon|\Phi H_{\alpha\alpha}|)^{1/2}$, whereas the root-mean-square error attendant upon Eq. (7) is $2^{1/4}(\varepsilon|\Phi H_{\alpha\alpha}|)^{1/2}$.

To apply these formulas we need estimates of the Hessian $\mathbf{H}$. These are available in the Gauss and variable metric methods.
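In the latter, we usually have $\mathbf{A} \approx \mathbf{H}^{-1}$ rather than $\mathbf{H}$. However, if we start out with a diagonal $\mathbf{A}_1$ we can form $\mathbf{A}_1^{-1}$ easily. Then the easily verifiable formula

$$\mathbf{A}_{i+1}^{-1} = \mathbf{A}_i^{-1} - (1 / \boldsymbol{\sigma}_i^T \mathbf{A}_i^{-1} \boldsymbol{\rho}_i)\, \mathbf{A}_i^{-1} \boldsymbol{\rho}_i \boldsymbol{\rho}_i^T \mathbf{A}_i^{-1} \tag{5-18-8}$$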
enables us in the ROC method to compute successively the matrices $\mathbf{A}_i^{-1}$ which are approximations to $\mathbf{H}$. A similar procedure for the Davidon-Fletcher-Powell method is given by Stewart (1967).

In the Gauss method we need the derivatives of the individual model equations, rather than of the objective function directly. In the absence of anything better, we would still recommend using Eq. (6) or Eq. (7) for choosing $\delta\theta_\alpha$. In place of Eq. (1), however, we would apply similar equations to the model equations for each experiment in turn.

These formulas should not be used blindly. Gross errors in the estimated $\mathbf{H}$ may lead to absurd values of $\delta\theta_\alpha$. Lower and upper limits should be imposed on the $\delta\theta_\alpha$, e.g., $10^{-5}|\theta_\alpha| \leq |\delta\theta_\alpha| \leq 10^{-2}|\theta_\alpha|$. A smaller lower bound would be appropriate if the calculations are performed in double precision.

Eq. (1) represents the crudest possible estimate of $q_\alpha$. A better estimate is given by the central difference scheme

$$q_\alpha \approx \frac{\Phi(\theta_1, \theta_2, \ldots, \theta_\alpha + \delta\theta_\alpha, \ldots, \theta_l) - \Phi(\theta_1, \theta_2, \ldots, \theta_\alpha - \delta\theta_\alpha, \ldots, \theta_l)}{2\,\delta\theta_\alpha} \quad (\alpha = 1, 2, \ldots, l) \tag{5-18-9}$$

Unfortunately, this scheme requires computation of two additional function values for each gradient component, instead of the one required by the one-sided difference. The truncation error of Eq. (9) is $\Phi_\alpha^{(4)} \delta\theta_\alpha^3 / 24$, where $\Phi_\alpha^{(4)} \equiv \partial^4\Phi / \partial\theta_\alpha^4$. The step with least maximum error has length $|\delta\theta_\alpha| = (16\varepsilon|\Phi| / |\Phi_\alpha^{(4)}|)^{1/4}$, and the attendant error has magnitude $\tfrac{4}{3}[\varepsilon^3|\Phi|^3|\Phi_\alpha^{(4)}|]^{1/4}$. These formulas are not very useful, since we rarely know $\Phi_\alpha^{(4)}$. However, it is safe to say that somewhat larger values of $\delta\theta_\alpha$ may be used here than with the one-sided difference scheme.

For economy of calculation we suggest that one-sided differences be used for several iterations, until no further progress can be made. Then one may switch to central differences if one feels that the solution has not been attained.

5-19. Direct Search Methods

The term direct search was coined by Hooke and Jeeves (1961). It has come to be applied to methods which (like Hooke and Jeeves') search for the minimum without explicit evaluation of derivatives, analytic or numerical. The idea of direct search methods is appealing, and they have performed well in certain cases [see, e.g., the survey by Box (1966)]. Our own experience, however, has been disappointing; gradient
methods, even using finite difference approximations, have outperformed direct search methods on all but the most trivial parameter estimation problems, both in reliability and speed of convergence. For this reason we shall mention a few of the more promising or popular methods, but not describe any of them in detail.

The methods that performed best in the Box (1966) survey were those due to Powell (1964, 1965). The first one minimizes an arbitrary function; it has been amended by Zangwill (1967b), who also describes a method of his own. The second Powell method (Powell, 1965) is designed specifically for minimizing a sum of squares, but can be adapted easily to other problems which admit the Gauss approximation. This algorithm is related to the Gauss method, with finite differences taken along the search directions (instead of along the coordinate directions, as would be the case with the usual finite difference version of the Gauss method). The weakness of the method derives from the fact that the differences are taken only in a single direction per iteration, so that one's estimated derivatives in all other directions are perennially out of date. This effect worsens as the dimension of the parameter vector increases.

Other methods that have found considerable use in solving optimization problems are those of Hooke and Jeeves (1961), Rosenbrock (1960) [see also Rosenbrock and Storey (1966)], Buzzi Ferraris (1968), Brent (1971), and the Simplex method of Nelder and Mead (1965). The latter method was adapted to least squares problems by Spendley (1969). A review of direct search methods appears in Fletcher (1965).

5-20. The Initial Guess

All the optimization methods that we have described require that one supply an initial guess $\boldsymbol{\theta}_1$ for the values of the parameters.
The choice of a good initial guess can spell the difference between success and failure in locating the optimum, or between rapid and slow convergence to the solution. Unfortunately, while we can prescribe algorithms for proceeding from the initial guess, we must rely heavily on intuition and prior knowledge in selecting the initial guess. Nevertheless, we can provide some suggestions which may be helpful in many cases. A comprehensive discussion of such methods can be found in Kittrell et al. (1965).

At the outset we must caution the reader not to exaggerate the importance of finding a good initial guess. In many cases the proper solution has been obtained starting from the first initial guess that came to mind. In these cases, at the possible expense of a few additional minutes of computer time,
one has saved oneself a considerable amount of trouble. We suggest, therefore, that (unless computer time is exceptionally scarce or expensive) one attempt to estimate the parameters "by brute force". Only if this strategy fails should one resort to more delicate techniques.

The most obvious method for making the initial guesses is by the use of prior information. Estimates calculated from previous experiments, known values from similar systems, values computed from theoretical considerations: all these form ideal initial guesses.

On the opposite end of the spectrum stand problems in which our only information concerning the parameter values is given in the form of upper and lower bounds on their values. If we do not even have such bounds, we can transform our variables into bounded ones: e.g., a positive variable $\theta$ can be replaced by the bounded variable $\varphi = 1/(1 + \theta)$, or a completely free variable $\theta$ may be replaced by $\varphi = \arctan\theta$. Once we have all our parameters confined to a rectangular region in $\boldsymbol{\theta}$ space, we can conduct a grid search: compute the value of the objective function at every point on a regular rectangular grid, and choose the point with the best value as the initial guess. The main difficulty with this approach is "the curse of dimensionality": in a grid with $k$ levels in each one of the $l$ dimensions, the total number of points at which the objective function must be evaluated is $k^l$. This is a prohibitively large number for all but the smallest values of $k$ and $l$.

An alternative to the grid search is random search. Here a number of points within the feasible region are chosen at random, and the one giving the best value of the objective function is used as the initial guess. It is true that among a hundred points there is a good chance of finding one that is within 1% of the solution. However, this 1% applies to the volume of the feasible region.
If there are $l$ parameters, the relative accuracy of each parameter is only $0.01^{1/l}$, or 31.6% of the permitted range when $l = 4$. The random search method does not overcome the curse of dimensionality, but it does offer some advantages over grid search. One may bias the sampling so as to favor certain regions of parameter space (this can be regarded as sampling from a prior distribution), and one may use a sequential termination criterion: stop sampling as soon as a function value significantly better than the average has been found. Sometimes, a transformation of variables is called for prior to commencement of the search. For instance, if even the order of magnitude of a parameter is unknown, it should be replaced by its logarithm.

It is not always necessary to provide initial guesses for all the parameters in a model. If some of the parameters enter the model equations linearly, and an initial guess is provided for the other parameters, then the linear parameters can be estimated by linear multiple regression. Suppose, for instance,
y = kx exp( - EfT) (5-20-1) 122 V Computation of the Estimates I: Unconstrall1ed Problems that the model has the form Ji" = OJ exp( - Oz x). If we have the initial guess Oz = 6, and let :::" == exp( - 6x), then an initial guess for 8 1 can be found by solving the linear least squares problem min III (Y il - 0 1 :::,,)z. Special versions of the Gauss method to deal with partly linear models have been devised (Lawton and Sylvestre, 1971; Golub and Pereyra, 1972). The most fruitful approach to finding an initial guess is to substitute a simpler problem for the original estimation problem. The answers to the simpler problem can be used as initial guesses for the original problem, There is no systematic way of applying this idea to all problems. but the following is a parttal list of what may be attempted. (a) Linearization, We try, by means of transformation of variables, to change the model equations into ones that are linear in the parameters (see Section 4-19). The linear problem can be solved by multiple linear regression with no need for an initial guess. (b) Multistage Estimation. By breaking up the data into groups, we may estimate certain auxiliary parameters for each group; then we estimate the original parameters as functions of the auxiliary parameters. For example, the rate of a chemical reaction is given by the expression where Y is the rate, x the concentration, T the temperature, and k and E the parameters to be estimated. The rate y is measured as a function of x at several values of T, say T" Tz' ..., Tq. Suppose we use the data taken at T j to estimate the coeflicient Kj in the equation .r = A.;., (i = I, 2, . . . , q) ( 5-20-2) The estimated 1\., can then be llsed as data for estJmatll1g log k and E in the linearized model log F..', = log k EfT; (i = I, 2, . .. , q) (5-20-3) Of course, in this case we could have linearized Eq. 
(1) directly; however, in kinetics models involving simultaneous reactions, the original equations cannot be linearized, whereas the multistage procedure still applies.

(c) Model Simplification. It is frequently possible to approach the final model through a sequence of simpler ones, in which various effects are neglected and the corresponding parameters suppressed. After the parameters have been estimated for a simple model, analysis of the residuals (see Section 7-13) can provide an indication as to what terms should be added to the model next. This method serves not only to obtain initial guesses for a given model, but also (and perhaps more importantly) to synthesize a final
model where none is given. Examples for such syntheses are given by Box and Youle (1955), Peterson (1962), Box and Hunter (1962), Hunter and Mezaki (1964), and Kittrell, Hunter, and Mezaki (1966).

(d) Simpler Estimation Method. We replace the proper objective function by one which is easier to minimize. For instance:

(1) We linearize the model as under (a) above.

(2) In a multi-equation model, we use one of the equations to obtain preliminary estimates of the parameters. It is true, however, that sometimes it is easiest to obtain estimates when all equations are used simultaneously, since otherwise the information relating to the values of some of the parameters is lost. To give a trivial example, let the three model equations be

$$y_1 = (\theta_1 + \theta_2)x, \qquad y_2 = (\theta_2 + \theta_3)x, \qquad y_3 = (\theta_3 + \theta_1)x$$

Clearly, the three parameters can be estimated independently only if all three equations are used.

(3) In dynamic models we may use easy to apply data integration or differentiation methods to obtain initial guesses for the integration of equations method (see Section 8-1).

5-21. A Single-Equation Least Squares Problem‡

Let $y$ be the fraction remaining at time $x_1$ of a chemical compound A undergoing the first order reaction

$$A \rightarrow B \tag{5-21-1}$$

The variable $y$ satisfies the differential equation

$$dy/dx_1 = -ky \tag{5-21-2}$$

where $k$ is the rate constant. The solution to this equation with the initial condition $y = 1$ at $x_1 = 0$ is

$$y = \exp(-kx_1) \tag{5-21-3}$$

The rate constant $k$ depends on the absolute temperature $x_2$ as follows

$$k = \theta_1 \exp(-\theta_2/x_2) \tag{5-21-4}$$

where $\theta_1$ is the so-called frequency constant, and $\theta_2$ is the activation energy (expressed in suitable units). Our model equation takes the form

$$y = f(x_1, x_2, \theta_1, \theta_2) = \exp[-\theta_1 x_1 \exp(-\theta_2/x_2)] \tag{5-21-5}$$

‡ The numerical results quoted in the discussion of this and subsequent problems were obtained as output from calculations performed in single precision floating point arithmetic on an IBM System/360 computer. The results were converted to decimal from a binary representation inside the computer. Therefore, the results of performing the same calculations on a decimal desk calculator (or, for that matter, on any other computer, or even using a different program on the same computer) would differ slightly from those presented here. In a long iterative procedure such differences can build up to such an extent that a different number of iterations may be required to reach substantially the same end result.

Data for a set of fifteen observations on $x$ and $y$ are given in Table 5-2.

Table 5-2
Data for Least Squares Problem

Experiment    Time,            Temperature,      Fraction A
number, μ     x_μ1 (hr)        x_μ2 (°K)         remaining, y_μ
    1           0.1              100               0.980
    2           0.2              100               0.983
    3           0.3              100               0.955
    4           0.4              100               0.979
    5           0.5              100               0.993
    6           0.05             200               0.626
    7           0.1              200               0.544
    8           0.15             200               0.455
    9           0.2              200               0.225
   10           0.25             200               0.167
   11           0.02             300               0.566
   12           0.04             300               0.317
   13           0.06             300               0.034
   14           0.08             300               0.016
   15           0.1              300               0.066

Our aim is to estimate $\theta_1$ and $\theta_2$. As far as we know, the errors in the $x_\mu$ are negligible, whereas those of the $y_\mu$ are all independent and with equal standard deviations. The least squares criterion is, therefore, appropriate. We seek to minimize

$$\Phi(\theta) = \sum_{\mu=1}^{15} e_\mu^2(\theta) = \sum_{\mu=1}^{15} [y_\mu - f_\mu(\theta)]^2 \tag{5-21-6}$$

Following Eq. (5-9-2), we find that the gradient of $\Phi$ is given by:

$$q_1 \equiv \partial\Phi/\partial\theta_1 = -2\sum_{\mu=1}^{15} e_\mu\, \partial f_\mu/\partial\theta_1 = 2\sum_{\mu=1}^{15} e_\mu f_\mu \exp(-\theta_2/x_{\mu 2})\, x_{\mu 1} \tag{5-21-7}$$

$$q_2 \equiv \partial\Phi/\partial\theta_2 = -2\sum_{\mu=1}^{15} e_\mu\, \partial f_\mu/\partial\theta_2 = -2\sum_{\mu=1}^{15} e_\mu (\theta_1 x_{\mu 1}/x_{\mu 2}) f_\mu \exp(-\theta_2/x_{\mu 2}) \tag{5-21-8}$$
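The model, objective function, and gradient of Eqs. (5-21-5) to (5-21-8) can be written out directly. The following is a minimal sketch in double precision rather than the single precision arithmetic of the original calculations, so the last digits may differ from the printed ones; the names `DATA`, `model`, and `objective_and_gradient` are illustrative and not from the text.

```python
import math

# Table 5-2: (time x1 in hr, temperature x2 in deg K, fraction remaining y)
DATA = [
    (0.1, 100, 0.980), (0.2, 100, 0.983), (0.3, 100, 0.955),
    (0.4, 100, 0.979), (0.5, 100, 0.993),
    (0.05, 200, 0.626), (0.1, 200, 0.544), (0.15, 200, 0.455),
    (0.2, 200, 0.225), (0.25, 200, 0.167),
    (0.02, 300, 0.566), (0.04, 300, 0.317), (0.06, 300, 0.034),
    (0.08, 300, 0.016), (0.1, 300, 0.066),
]


def model(x1, x2, th1, th2):
    """Eq. (5-21-5): fraction remaining at time x1 and temperature x2."""
    return math.exp(-th1 * x1 * math.exp(-th2 / x2))


def objective_and_gradient(th1, th2):
    """Sum of squared residuals, Eq. (5-21-6), and its gradient (q1, q2)."""
    phi = q1 = q2 = 0.0
    for x1, x2, y in DATA:
        f = model(x1, x2, th1, th2)
        e = y - f                                  # residual e_mu
        arrhenius = math.exp(-th2 / x2)
        df_dth1 = -x1 * arrhenius * f              # df/dtheta1
        df_dth2 = (th1 * x1 / x2) * arrhenius * f  # df/dtheta2
        phi += e * e
        q1 += -2.0 * e * df_dth1
        q2 += -2.0 * e * df_dth2
    return phi, (q1, q2)


if __name__ == "__main__":
    print(objective_and_gradient(750.0, 1200.0))
```

Evaluating the function at the initial guess $\theta = [750, 1200]^T$ gives a convenient check against the values quoted for the first iteration.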
The approximate Hessian $N$ is given by Eq. (5-9-4)

$$N_{\alpha\beta} = 2\sum_{\mu=1}^{15} (\partial f_\mu/\partial\theta_\alpha)(\partial f_\mu/\partial\theta_\beta) \qquad (\alpha, \beta = 1, 2) \tag{5-21-9}$$

Let our initial guess be

$$\theta_1 = \begin{bmatrix} \theta_{1,1} \\ \theta_{1,2} \end{bmatrix} = \begin{bmatrix} 750 \\ 1200 \end{bmatrix}$$

Table 5-3 gives the values of the $f_\mu$, $e_\mu$, and $\partial f_\mu/\partial\theta$ for $\theta = \theta_1$. From Table 5-3 entries and Eqs. (6)-(9) one easily calculates

$$\Phi_1 = 1.090441, \qquad q_1 = \begin{bmatrix} -0.002230450 \\ 0.006863795 \end{bmatrix}, \qquad N_1 = \begin{bmatrix} 0.2689478 & -0.7730614 \\ -0.7730614 & 2.310325 \end{bmatrix} \times 10^{-5}$$

Table 5-3
Least Squares Problem Functions at θ^T = [750, 1200]

  μ      f_μ          e_μ = y_μ - f_μ    10^6 × ∂f_μ/∂θ_1    10^5 × ∂f_μ/∂θ_2
  1    0.9995393      -0.0195393           -0.6141379            0.4606032
  2    0.9990788      -0.0160788           -1.227710             0.9207821
  3    0.9986185      -0.0436185           -1.840716             1.380537
  4    0.9981585      -0.0191585           -2.453158             1.839868
  5    0.9976986      -0.0046986           -3.065035             2.298776
  6    0.9112362      -0.2852362         -112.9364              42.35113
  7    0.8303515      -0.2863515         -205.8234              77.18373
  8    0.7566463      -0.3016463         -281.3304             105.4990
  9    0.6894834      -0.4644834         -341.8115             128.1793
 10    0.6282821      -0.4612821         -389.3387             146.0020
 11    0.7597739      -0.1937739         -278.3146              69.57867
 12    0.5772563      -0.2602563         -422.9124             105.7281
 13    0.4385841      -0.4045841         -481.9767             120.4942
 14    0.3332248      -0.3172248         -488.2577             122.0644
 15    0.2531757      -0.1871757         -463.7071             115.9268

To determine the first step direction $v_1$ we must compute $v_1 = -N_1^{-1} q_1$, i.e., solve the set of simultaneous equations $N_1 v_1 = -q_1$. In our two-dimensional problem this can be done trivially on a desk calculator. However, for purposes of illustration, we shall apply the Greenstadt method. For this, we need to compute the inverse scaled decomposition of $N$ (we omit the subscript 1). We follow the steps outlined in Section A-5:

1. $N_{11}^{1/2} = 0.001639963$, $N_{22}^{1/2} = 0.004806584$. Therefore

$$N = BCB = \begin{bmatrix} 0.001639963 & 0 \\ 0 & 0.004806584 \end{bmatrix} \begin{bmatrix} 1 & -0.980716 \\ -0.980716 & 1 \end{bmatrix} \begin{bmatrix} 0.001639963 & 0 \\ 0 & 0.004806584 \end{bmatrix}$$

2. The matrix $C$ has the form $\begin{bmatrix} 1 & -a \\ -a & 1 \end{bmatrix}$. The reader may verify that such a matrix has eigenvalues $1 + a$ and $1 - a$ with corresponding eigenvectors $[1/\sqrt{2}, -1/\sqrt{2}]$ and $[1/\sqrt{2}, 1/\sqrt{2}]$ respectively. Hence the eigenvalue decomposition of $C$ is given by

$$C = U \Lambda U^T = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix} \begin{bmatrix} 1.980716 & 0 \\ 0 & 0.019284 \end{bmatrix} \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix}$$

3. The inverse decomposition of $N$ is given by $N^{-1} = G \Lambda^{-1} G^T$ where

$$G = B^{-1} U = \begin{bmatrix} (1/\sqrt{2})/0.001639963 & (1/\sqrt{2})/0.001639963 \\ -(1/\sqrt{2})/0.004806584 & (1/\sqrt{2})/0.004806584 \end{bmatrix} = \begin{bmatrix} 431.1723 & 431.1723 \\ -147.1122 & 147.1122 \end{bmatrix}$$

$$\Lambda^{-1} = \begin{bmatrix} 1/1.980716 & 0 \\ 0 & 1/0.019284 \end{bmatrix} = \begin{bmatrix} 0.504868 & 0 \\ 0 & 51.8573 \end{bmatrix}$$

Hence

$$N^{-1} = \begin{bmatrix} 431.1723 & 431.1723 \\ -147.1122 & 147.1122 \end{bmatrix} \begin{bmatrix} 0.504868 & 0 \\ 0 & 51.8573 \end{bmatrix} \begin{bmatrix} 431.1723 & -147.1122 \\ 431.1723 & 147.1122 \end{bmatrix}$$

The ratio of eigenvalues was $1.980716/0.019284 \approx 100$, indicating that $N$ is mildly ill-conditioned. We can tolerate, however, eigenvalue ratios of up to $10^4$ or $10^5$, hence there is no need to adjust the value of the smaller eigenvalue.

4. To compute $-N^{-1}q$ we proceed as follows:

$$G^T(-q) = \begin{bmatrix} 431.1723 & -147.1122 \\ 431.1723 & 147.1122 \end{bmatrix} \begin{bmatrix} 0.002230450 \\ -0.006863795 \end{bmatrix} = \begin{bmatrix} 1.971456 \\ -0.04803973 \end{bmatrix}$$

$$\Lambda^{-1} G^T(-q) = \begin{bmatrix} 0.5048679 & 0 \\ 0 & 51.85727 \end{bmatrix} \begin{bmatrix} 1.971456 \\ -0.04803973 \end{bmatrix} = \begin{bmatrix} 0.9953249 \\ -2.491209 \end{bmatrix}$$

$$v_1 = \begin{bmatrix} 431.1723 & 431.1723 \\ -147.1122 & 147.1122 \end{bmatrix} \begin{bmatrix} 0.9953249 \\ -2.491209 \end{bmatrix} = \begin{bmatrix} -644.9785 \\ -512.9099 \end{bmatrix}$$
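Steps 1 to 4 can be traced in a few lines of code. The sketch below is restricted to the two-parameter case, where the scaled matrix $C$ has the closed-form eigen-decomposition quoted in step 2. It omits the eigenvalue adjustment that the Greenstadt method applies to badly conditioned or indefinite matrices, and the function name is illustrative rather than taken from the text.

```python
import math


def greenstadt_step_2x2(N, q):
    """Return (v, eigenvalues) where v = -N^{-1} q is computed through the
    scaled eigen-decomposition N = B C B, C = U diag(lam) U^T (2 x 2 only)."""
    b1, b2 = math.sqrt(N[0][0]), math.sqrt(N[1][1])  # step 1: B = diag(b1, b2)
    a = -N[0][1] / (b1 * b2)                         # C = [[1, -a], [-a, 1]]
    lam = (1.0 + a, 1.0 - a)                         # step 2: eigenvalues of C
    s = 1.0 / math.sqrt(2.0)
    U = [[s, s], [-s, s]]                            # eigenvectors as columns
    # step 3: G = B^{-1} U
    G = [[U[0][0] / b1, U[0][1] / b1],
         [U[1][0] / b2, U[1][1] / b2]]
    # step 4: w = Lambda^{-1} G^T (-q), then v = G w
    w = [(G[0][0] * -q[0] + G[1][0] * -q[1]) / lam[0],
         (G[0][1] * -q[0] + G[1][1] * -q[1]) / lam[1]]
    v = [G[0][0] * w[0] + G[0][1] * w[1],
         G[1][0] * w[0] + G[1][1] * w[1]]
    return v, lam


if __name__ == "__main__":
    N1 = [[0.2689478e-5, -0.7730614e-5], [-0.7730614e-5, 2.310325e-5]]
    q1 = [-0.002230450, 0.006863795]
    print(greenstadt_step_2x2(N1, q1))
```

Because the smaller eigenvalue is obtained as a difference of nearly equal numbers, the step computed in double precision agrees with the printed $v_1$ only to about four significant figures.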
Along the ray $\theta = \theta_1 + \rho v_1$, the directional derivative at $\rho = 0$ is

$$\partial\Psi/\partial\rho = v_1^T q_1 = (-644.9785)(-0.002230450) + (-512.9099)(0.006863795) = -2.081916$$

This quantity being negative confirms the fact that $\Phi$ decreases, at least initially, as one proceeds from $\theta_1$ in this direction. Trying initially $\rho^{(0)} = 1$ we arrive at

$$\theta^{(0)} = \begin{bmatrix} 750 - 644.9785 \\ 1200 - 512.9099 \end{bmatrix} = \begin{bmatrix} 105.0215 \\ 687.0901 \end{bmatrix}$$

where $\Phi(\theta^{(0)}) = 0.9133969 < \Phi_1$. This indicates that we are at an acceptable point. We try to find an even better point by fitting a parabola to $\Phi(\theta_1 + \rho v_1)$. The equation of the parabola is

$$\Psi(\rho) = a + b\rho + c\rho^2$$

and it must assume the following values:

$$\Psi(0) = 1.090441 = \alpha, \qquad \Psi(1) = 0.9133969 = \beta, \qquad (d\Psi/d\rho)_{\rho=0} = -2.081916 = \gamma$$

Using Eq. (5-14-5) we find that $\Psi(\rho)$ has a stationary point at

$$\rho^* = \frac{-2.081916 \times 1}{2(-2.081916 \times 1 + 1.090441 - 0.9133969)} = 0.5464714$$

Trying $\theta^{(1)} = \theta_1 + \rho^* v_1 = [397.5376, 919.7092]^T$ we find $\Phi(\theta^{(1)}) = 0.3345645$, which is a great improvement over both $\theta_1$ and $\theta^{(0)}$. We accept $\theta^{(1)}$ then as $\theta_2$, the starting point for the next iteration.

A computer program using the flowchart of Fig. 5-2 produced the sequence of iterations given in Table 5-4.

Table 5-4
Least Squares Problem, Good Initial Guess (Gauss Iterations)

  i     Φ(θ_i)         θ_i
  1     1.090441       [750, 1200]^T
  2     0.3345645      [397.5376, 919.7092]^T
  3     0.05765885     [646.0847, 938.5288]^T
  4     0.04038005     [810.6260, 965.7625]^T
  5     0.03980731     [818.3628, 962.1228]^T
  6     0.03980599     [813.4583, 960.9063]^T

No further reduction of $\Phi(\theta)$ was obtained after six iterations, so that we took as our estimate $\theta^* = [813.4583, 960.9063]^T$ with $\Phi^* = 0.03980599$. At this point, the gradient was

$$q^* = \begin{bmatrix} -0.218524 \\ 0.631308 \end{bmatrix} \times 10^{-6}$$

and the approximate Hessian

$$N^* = \begin{bmatrix} 0.271890 & -0.957336 \\ -0.957336 & 3.50371 \end{bmatrix} \times 10^{-5}$$

Applying the test of Section 5-15 we find that

$$\delta_1 = |q_1^*/N_{11}^*| \approx 0.1, \qquad \delta_2 = |q_2^*/N_{22}^*| \approx 0.02$$

These values are negligible compared to $\theta_1^*$ and $\theta_2^*$, so we may assume that we have converged to a stationary point. The final residuals corresponding to this solution are given in Table 5-5.

Table 5-5
Least Squares Problem, Final Residuals

  μ     e_μ* = y_μ - f(x_μ, θ*)        μ     e_μ* = y_μ - f(x_μ, θ*)
  1     -0.0145552                      9     -0.0387225
  2     -0.00613993                    10     -0.0219878
  3     -0.0287542                     11      0.0497515
  4      0.000602186                   12      0.0504873
  5      0.0199295                     13     -0.103587
  6     -0.0906165                     14     -0.0550289
  7      0.0304608                     15      0.0293314
  8      0.0869893

In the preceding calculations we were fortunate enough to have started from a good initial guess: $\theta_1 = [750, 1200]^T$ as compared to the final estimate $\theta^* = [813.4583, 960.9063]^T$. Suppose now that we had started from the much poorer initial guess $\theta_1 = [100, 2000]^T$. Proceeding as before we find

$$\Phi_1 = 5.299502, \qquad q_1 = \begin{bmatrix} -0.0007098080 \\ 0.0002442936 \end{bmatrix}, \qquad N_1 = \begin{bmatrix} 0.7036033 & -0.2354773 \\ -0.2354773 & 0.07896382 \end{bmatrix} \times 10^{-7} \tag{5-21-10}$$

$$v_1 = -N_1^{-1} q_1 = \begin{bmatrix} -134608.0 \\ -432361.0 \end{bmatrix}, \qquad \theta^{(0)} = \theta_1 + v_1 = \begin{bmatrix} -134508.0 \\ -430361.0 \end{bmatrix} \tag{5-21-11}$$
When we attempt to compute $\Phi(\theta^{(0)})$ we note that the exponents occurring in the formulas for $f_\mu$ are so large that computer capacity is exceeded. We attempt to remedy things by halving $\rho$, but we have to repeat this process eight times before the exponentials are brought under control. We have then, with $\rho^{(0)} = 2^{-8} = 0.00390625$

$$\theta^{(0)} = \theta_1 + 0.00390625\, v_1 = \begin{bmatrix} -425.8140 \\ 311.0039 \end{bmatrix}$$

with $\Psi^{(0)} = \Phi(\theta^{(0)}) = 0.3366272 \times 10^{20}$. Since this exceeds $\Phi_1$, $\theta^{(0)}$ is not acceptable and we must interpolate. Here

$$\gamma = v_1^T q_1 = -10.08228$$

Hence, from Eq. (5-14-5)

$$\rho^* = \frac{-10.08228 \times (0.00390625)^2}{2(-10.08228 \times 0.00390625 + 5.299502 - 0.3366272 \times 10^{20})} \approx 0.5 \times 10^{-25}$$

Since this is less than $0.25\rho^{(0)}$, we follow Fig. 5-2A and set $\rho^{(2)} = 0.25\rho^{(0)} = 0.0009765625$, leading to

$$\theta^{(2)} = \begin{bmatrix} -31.45349 \\ 1577.782 \end{bmatrix}, \qquad \Psi^{(2)} = 5.471375$$

This is still unacceptable. Repetition of the interpolation procedure once more forces us to take $\rho^{(3)} = 0.25\rho^{(2)} = 0.000244141$ and

$$\theta^{(3)} = \begin{bmatrix} 67.13663 \\ 1894.446 \end{bmatrix}, \qquad \Psi^{(3)} = 5.301888$$

Once more we need to interpolate

$$\rho^* = \frac{-10.08228 \times (0.000244141)^2}{2(-10.08228 \times 0.000244141 + 5.299502 - 5.301888)} = 0.0000619701$$

This time, $0.75\rho^{(3)} > \rho^* > 0.25\rho^{(3)}$, so we take $\rho^{(4)} = \rho^*$ and

$$\theta^{(4)} = \begin{bmatrix} 91.65955 \\ 1973.211 \end{bmatrix}, \qquad \Psi^{(4)} = 5.299135$$
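The interpolation rule used in these trials can be condensed into two small functions. This is a sketch of the parabola formula as it is applied in the worked numbers above, not a transcription of Fig. 5-2A. The lower limit of $0.25\rho$ is taken from the text; clamping at $0.75\rho$ from above is an assumption suggested by the inequality quoted for $\rho^{(4)}$, and the function names are illustrative.

```python
def parabola_minimum(alpha, beta, gamma, rho0):
    """Stationary point of the parabola with Psi(0) = alpha, Psi'(0) = gamma
    and Psi(rho0) = beta."""
    return gamma * rho0 ** 2 / (2.0 * (gamma * rho0 + alpha - beta))


def interpolated_step(alpha, beta, gamma, rho0, lo=0.25, hi=0.75):
    """Parabola minimum kept inside [lo * rho0, hi * rho0].
    The upper clamp is an assumption; the text only shows the lower one."""
    rho_star = parabola_minimum(alpha, beta, gamma, rho0)
    return min(max(rho_star, lo * rho0), hi * rho0)


if __name__ == "__main__":
    # Good initial guess: first iteration of Section 5-21
    print(parabola_minimum(1.090441, 0.9133969, -2.081916, 1.0))
    # Poor initial guess: huge trial value, so the lower clamp takes over
    print(interpolated_step(5.299502, 0.3366272e20, -10.08228, 0.00390625))
    # Poor initial guess: final interpolation
    print(interpolated_step(5.299502, 5.301888, -10.08228, 0.000244141))
```

The clamp is what keeps a wildly large trial value from collapsing the step to practically zero: the parabola alone would return a negligible $\rho^*$, whereas the rule cuts the step by a factor of four and tries again.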
We have finally an acceptable point. We set $\theta_2 = \theta^{(4)}$ and proceed to the next iteration. The procedure converges, though rather slowly, in 25 iterations, which are plotted in Fig. 5-3. Also shown is the direction of the negative gradient at $\theta_1$. Since the vector pointing from $\theta_1$ to $\theta^*$ lies between the negative gradient and the Gauss direction from $\theta_1$ to $\theta_2$, it appears that the Marquardt method may prove efficient in this case, and indeed it turns out to be so.

[Fig. 5-3. Least squares problem. Iteration paths in the ($\theta_1$, $\theta_2$) plane from $\theta_1$ to $\theta^*$ for the unconstrained Gauss and Marquardt methods and for the constrained Gauss variants (penalty function, projection).]

Returning to the first iteration, the Marquardt step would be given by

$$v_1 = -10^7 \begin{bmatrix} 0.7036033(1 + \lambda_1) & -0.2354773 \\ -0.2354773 & 0.07896382(1 + \lambda_1) \end{bmatrix}^{-1} \begin{bmatrix} -0.0007098080 \\ 0.0002442936 \end{bmatrix}$$

Trying first $\lambda_1 = 0.01$, we obtain a step leading to a value of $\theta$ for which $\Phi$ cannot be computed, and similarly for $\lambda_1 = 0.1$. With $\lambda_1 = 1$, however, we find

$$v_1 = \begin{bmatrix} 3272.001 \\ -10589.988 \end{bmatrix}, \qquad \theta^{(0)} = \begin{bmatrix} 3372.001 \\ -8589.988 \end{bmatrix}, \qquad \Phi^{(0)} = 6.183162$$
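The Marquardt trial steps can be reproduced with a direct 2 x 2 solve. The sketch below follows the form of the displayed equation, in which each diagonal element of $N_1$ is multiplied by $(1 + \lambda_1)$; the function name is illustrative and not from the text.

```python
def marquardt_step_2x2(N, q, lam):
    """Solve [N with its diagonal scaled by (1 + lam)] v = -q for a 2 x 2 N."""
    a11 = N[0][0] * (1.0 + lam)
    a22 = N[1][1] * (1.0 + lam)
    a12 = N[0][1]
    det = a11 * a22 - a12 * a12
    v1 = -(a22 * q[0] - a12 * q[1]) / det
    v2 = -(-a12 * q[0] + a11 * q[1]) / det
    return [v1, v2]


if __name__ == "__main__":
    N1 = [[0.7036033e-7, -0.2354773e-7], [-0.2354773e-7, 0.07896382e-7]]
    q1 = [-0.0007098080, 0.0002442936]
    for lam in (1.0, 10.0, 100.0):
        v = marquardt_step_2x2(N1, q1, lam)
        print(lam, v, [100.0 + v[0], 2000.0 + v[1]])
```

As $\lambda_1$ grows the step shortens and turns from the Gauss direction toward a scaled negative gradient, which is why the larger values of $\lambda_1$ keep the trial point within the region where the model can be evaluated.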
This is still unacceptable, so we increase $\lambda_1$ to 10, and eventually to $\lambda_1 = 100$ with

$$v_1 = \begin{bmatrix} 98.8778 \\ -303.391 \end{bmatrix}, \qquad \theta_2 = \begin{bmatrix} 198.8778 \\ 1696.609 \end{bmatrix}, \qquad \Phi_2 = 4.979104$$

which is an acceptable starting point for the next iteration. The full process, which converges in 10 iterations, is also shown in Fig. 5-3.

If the initial guesses for the parameters are much too large, $f$ or its derivatives vanish, and the process does not converge. We can use the linearization method to obtain good initial guesses. For this purpose, we observe that Eq. (5) is equivalent to

$$\log(-\log y) = \log\theta_1 - \theta_2/x_2 + \log x_1 \tag{5-21-12}$$

or

$$y^+ = \theta_1^+ + \theta_2^+ x^+ \tag{5-21-13}$$

where

$$y^+ \equiv \log(-\log y) - \log x_1, \qquad x^+ \equiv -1/x_2, \qquad \theta_1^+ \equiv \log\theta_1, \qquad \theta_2^+ \equiv \theta_2 \tag{5-21-14}$$

We now have a model linear in the parameters $\theta_1^+$ and $\theta_2^+$, which may be estimated by linear least squares as

$$\theta_1^+ = 6.643963, \qquad \theta_2^+ = 928.6492$$

corresponding to

$$\theta_1 = \exp\theta_1^+ = 768.1331, \qquad \theta_2 = \theta_2^+ = 928.6492$$

These are obviously good initial guesses for estimating $\theta$ by the nonlinear least squares procedure.

5-22. Adding Prior Information

Assume that prior to having obtained the data of Table 5-2 we had some knowledge concerning the values of the parameters. Let us suppose that this knowledge could be summarized in the following equations

$$\theta_1 = 1000 \pm 200, \qquad \theta_2 = 1000 \pm 200 \tag{5-22-1}$$

The quantity 200 is meant to represent the standard deviation of the distributions of $\theta_1$ and $\theta_2$. Let us elect to assign to $\theta$ a normal prior distribution, so that apart from an additive constant

$$\log p_0(\theta) = -\tfrac{1}{2}[(1/200^2)(\theta_1 - 1000)^2 + (1/200^2)(\theta_2 - 1000)^2] \tag{5-22-2}$$
Assuming that the observation errors in $y$ are also normal with unknown variance $v$, then the log likelihood is

$$\log L(\theta) = -(1/2v) \sum_{\mu=1}^{15} [y_\mu - f_\mu(\theta)]^2 \tag{5-22-3}$$

and the log posterior distribution is the sum of the two. As in Section 4-8, we can eliminate $v$ and form the concentrated log posterior distribution, which with sign reversed, reduces to the following objective function to be minimized

$$\Phi(\theta) = (15/2) \log S(\theta) + (1/80{,}000)[(\theta_1 - 1000)^2 + (\theta_2 - 1000)^2] \tag{5-22-4}$$

where

$$S(\theta) \equiv \sum_{\mu=1}^{15} [y_\mu - f_\mu(\theta)]^2 = \sum_{\mu=1}^{15} e_\mu^2(\theta) \tag{5-22-5}$$

Hence

$$q = \begin{bmatrix} -\dfrac{15}{S(\theta)} \sum_{\mu=1}^{15} e_\mu \dfrac{\partial f_\mu}{\partial\theta_1} + \dfrac{1}{40{,}000}(\theta_1 - 1000) \\[2ex] -\dfrac{15}{S(\theta)} \sum_{\mu=1}^{15} e_\mu \dfrac{\partial f_\mu}{\partial\theta_2} + \dfrac{1}{40{,}000}(\theta_2 - 1000) \end{bmatrix} \tag{5-22-6}$$

$$N = \begin{bmatrix} \dfrac{15}{S(\theta)} \sum_{\mu=1}^{15} \left(\dfrac{\partial f_\mu}{\partial\theta_1}\right)^2 + \dfrac{1}{40{,}000} & \dfrac{15}{S(\theta)} \sum_{\mu=1}^{15} \dfrac{\partial f_\mu}{\partial\theta_1} \dfrac{\partial f_\mu}{\partial\theta_2} \\[2ex] \dfrac{15}{S(\theta)} \sum_{\mu=1}^{15} \dfrac{\partial f_\mu}{\partial\theta_1} \dfrac{\partial f_\mu}{\partial\theta_2} & \dfrac{15}{S(\theta)} \sum_{\mu=1}^{15} \left(\dfrac{\partial f_\mu}{\partial\theta_2}\right)^2 + \dfrac{1}{40{,}000} \end{bmatrix} \tag{5-22-7}$$

With this prior distribution it is natural to start with the initial guess $\theta_1 = [1000, 1000]^T$. From here, the Marquardt method converges in three iterations to

$$\theta^* = \begin{bmatrix} 929.7134 \\ 990.8511 \end{bmatrix} \qquad (\sigma_\theta = 200)$$

When the standard deviations of $\theta_1$ and $\theta_2$ are assumed to be 100 instead of 200, our estimates turn out to be

$$\theta^* = \begin{bmatrix} 976.2349 \\ 1000.1695 \end{bmatrix} \qquad (\sigma_\theta = 100)$$

The solution of Section 5-21 may be regarded as corresponding to a prior distribution with infinite standard deviations. We recall that the result was

$$\theta^* = \begin{bmatrix} 813.4583 \\ 960.9063 \end{bmatrix} \qquad (\sigma_\theta = \infty)$$
Observe how the solution progressively approaches the mode of the prior distribution $[1000, 1000]^T$ as the variance (i.e., uncertainty) of our prior information decreases.

5-23. A Two-Equation Maximum Likelihood Problem

We take a two equation econometric model which was used by Bodkin and Klein (1967) to fit U.S. production data for the years 1909-1949. The model is based on the constant elasticity of substitution (CES) theory of production, and it takes the form

$$g_1 = c_1 10^{c_2 z_4} [c_5 z_1^{-c_4} + (1 - c_5) z_2^{-c_4}]^{-c_3/c_4} - z_3 = 0$$
$$g_2 = [c_5/(1 - c_5)](z_1/z_2)^{-1-c_4} - z_5 = 0 \tag{5-23-1}$$

where $z_1$ is capital input, $z_2$ is labor input, $z_3$ is real output, $z_4$ is time (in years; 1929 taken as origin), $z_5$ is ratio of price of capital services to wage scale, and $c_1, c_2, c_3, c_4, c_5$ are unknown parameters.

The data, in the form of yearly values of $z_\mu$ for $\mu = 1, 2, \ldots, 41$, are given in Table 5-6. Of the variables involved, $z_1$ and $z_2$ are considered dependent (endogenous) whereas $z_3$, $z_4$ and $z_5$ are independent (exogenous). The treatment given by Bodkin and Klein is the standard one in econometrics, i.e., the distribution of the measured values of $z_1$ and $z_2$ is such as to give rise to normally distributed errors in Eq. (1). The likelihood is formed as in Eq. (2-13-6). The details of those calculations are given in Eisenpress et al. (1966) and Eisenpress and Greenstadt (1966).

For illustrative purposes, we shall adopt a different approach here. We note that the model equations can be solved explicitly for the dependent variables to give the reduced form equations

$$z_1 = A z_5^{-1/(1+c_4)}, \qquad z_2 = A[(1 - c_5)/c_5]^{1/(1+c_4)} \tag{5-23-2}$$

where

$$A = (z_3/c_1 10^{c_2 z_4})^{1/c_3} \{c_5[((1 - c_5)/c_5)^{1/(1+c_4)} + z_5^{c_4/(1+c_4)}]\}^{1/c_4} \tag{5-23-3}$$

To cast these equations into more tractable form we introduce the following new variables

$$y_1 \equiv z_1, \quad y_2 \equiv z_2, \quad x_1 \equiv z_4, \quad x_2 \equiv \log z_3, \quad x_3 \equiv \log z_5 \tag{5-23-4}$$

We reparametrize the problem by defining

$$\theta_1 \equiv (1/c_4) \log c_5 - (1/c_3) \log c_1, \qquad \theta_2 \equiv -(c_2/c_3) \log 10$$
$$\theta_3 \equiv 1/c_3, \qquad \theta_4 \equiv 1/c_4, \qquad \theta_5 \equiv [(1 - c_5)/c_5]^{1/(1+c_4)} \tag{5-23-5}$$
Table 5-6
U.S. Production Data^a

  μ      z_1        z_2        z_3      z_4      z_5
  1    1.33135    0.64619    0.4026    -20    0.24447
  2    1.39235    0.66302    0.4084    -19    0.23454
  3    1.41640    0.65172    0.4223    -18    0.23206
  4    1.48773    0.67318    0.4389    -17    0.22191
  5    1.51015    0.67720    0.4605    -16    0.22487
  6    1.43385    0.65175    0.4445    -15    0.21879
  7    1.48188    0.65570    0.4387    -14    0.23203
  8    1.67115    0.71417    0.4999    -13    0.23828
  9    1.71327    0.77524    0.5264    -12    0.26571
 10    1.76412    0.79465    0.5793    -11    0.23410
 11    1.76869    0.71607    0.5491    -10    0.22181
 12    1.80776    0.70068    0.5052     -9    0.18157
 13    1.54947    0.60764    0.4679     -8    0.22931
 14    1.66933    0.67041    0.5283     -7    0.20595
 15    1.93377    0.74091    0.5994     -6    0.19472
 16    1.95460    0.71336    0.5964     -5    0.17981
 17    2.11198    0.75159    0.6554     -4    0.18010
 18    2.26266    0.78838    0.6851     -3    0.16933
 19    2.33128    0.79600    0.6933     -2    0.16279
 20    2.43980    0.80788    0.7061     -1    0.16906
 21    2.58714    0.84547    0.7567      0    0.16239
 22    1.54865    0.77232    0.6796      1    0.16103
 23    2.26042    0.67880    0.6136      2    0.14456
 24    1.91974    0.58519    0.5145      3    0.20079
 25    1.80000    0.58065    0.5046      4    0.18307
 26    1.86020    0.62007    0.5711      5    0.18352
 27    1.88201    0.65575    0.6184      6    0.18847
 28    1.97018    0.72433    0.7113      7    0.20415
 29    2.08132    0.76838    0.7461      8    0.19006
 30    1.94062    0.69806    0.6981      9    0.17800
 31    1.98646    0.74679    0.7722     10    0.19979
 32    2.07987    0.79083    0.8557     11    0.21115
 33    2.18232    0.88461    0.9925     12    0.23453
 34    1.52779    0.95750    1.0877     13    0.20937
 35    2.62747    1.00285    1.1834     14    0.19843
 36    2.61235    0.99329    1.2565     15    0.18898
 37    2.52320    0.94857    1.2293     16    0.17203
 38    2.44632    0.97853    1.1889     17    0.18140
 39    2.56478    1.02591    1.2249     18    0.19431
 40    2.64588    1.03760    1.2669     19    0.19492
 41    2.69105    0.99669    1.2708     20    0.17912

^a Data adapted from Solow (1957).
Our reduced equations now take the form

$$y_1 = f_1(x, \theta) = \exp[a - \theta_4 x_3/(1 + \theta_4)], \qquad y_2 = f_2(x, \theta) = \exp(a + \log\theta_5) \tag{5-23-6}$$

where

$$a \equiv \theta_1 + \theta_2 x_1 + \theta_3 x_2 + \theta_4 \log\{\theta_5 + \exp[x_3/(1 + \theta_4)]\} \tag{5-23-7}$$

We shall solve the problem in terms of the $\theta$, and then convert the answer into $c$ using the inverse transformations

$$c_1 = [1 + \theta_5^{(1+\theta_4)/\theta_4}]^{-\theta_4/\theta_3} \exp(-\theta_1/\theta_3), \qquad c_2 = -\theta_2/(\theta_3 \log 10)$$
$$c_3 = 1/\theta_3, \qquad c_4 = 1/\theta_4, \qquad c_5 = 1/[1 + \theta_5^{(1+\theta_4)/\theta_4}] \tag{5-23-8}$$

Let us formulate some alternative likelihood functions to be maximized. Assuming the errors in the reduced equations to be normally distributed, independently for each year and with covariance matrix $V$, we have (apart from irrelevant constants)

$$\log L(\theta, V) = -(n/2) \log\det V - \tfrac{1}{2} \sum_{\mu=1}^{41} (y_\mu - f_\mu)^T V^{-1} (y_\mu - f_\mu) \tag{5-23-9}$$

We examine the following cases:

(a) Unknown $V$. The concentrated likelihood is equivalent to the objective function Eq. (4-9-9)

$$\Phi(\theta) = (n/2) \log\det M = (41/2) \log\left\{ \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2 - \left[ \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})(y_{\mu 2} - f_{\mu 2}) \right]^2 \right\} \tag{5-23-10}$$

(b) Unknown diagonal $V$. The objective function is given by Eq. (4-8-4)

$$\Phi(\theta) = (n/2) \sum_{a=1}^{2} \log M_{aa}(\theta) = (41/2) \log\left[ \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2 \right] \tag{5-23-11}$$

(c) Covariance matrix proportional to $Q \equiv \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$, i.e., the errors in $y_1$ are assumed twice as large as, and independent of, the errors in $y_2$. The relevant objective function is given by Eq. (4-21-1)

$$\Phi(\theta) = (nm/2) \log \mathrm{Tr}(Q^{-1}M) = (41 \times 2/2) \log\left[ \tfrac{1}{4} \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 + \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2 \right] \tag{5-23-12}$$
(d) As in (a) above, but $\log y_1$ and $\log y_2$ are the dependent variables. The objective function has the same form, with $\log y_{\mu a} - \log f_{\mu a}$ replacing $y_{\mu a} - f_{\mu a}$.

All these objective functions have the Gauss form, and the approximate Hessian has the form $N = 2\sum_{\mu=1}^{41} B_\mu^T \Gamma B_\mu$. Let us examine case (d) first. Here $B_{\mu a \alpha} = \partial \log f_{\mu a}/\partial\theta_\alpha$. The elements of $B_\mu$ are given in Table 5-7.

Table 5-7
Elements of $B_\mu$, Case (d)^a: $B_{\mu a \alpha} = \partial \log f_{\mu a}/\partial\theta_\alpha$

α = 1:  a = 1: $1$;  a = 2: $1$
α = 2:  a = 1: $x_{\mu 1}$;  a = 2: $x_{\mu 1}$
α = 3:  a = 1: $x_{\mu 2}$;  a = 2: $x_{\mu 2}$
α = 4:  a = 1: $\log\left(\theta_5 + \exp\dfrac{x_{\mu 3}}{1 + \theta_4}\right) - \dfrac{\theta_4 x_{\mu 3} \exp\dfrac{x_{\mu 3}}{1 + \theta_4}}{(1 + \theta_4)^2 \left(\theta_5 + \exp\dfrac{x_{\mu 3}}{1 + \theta_4}\right)} - \dfrac{x_{\mu 3}}{(1 + \theta_4)^2}$;
        a = 2: $\log\left(\theta_5 + \exp\dfrac{x_{\mu 3}}{1 + \theta_4}\right) - \dfrac{\theta_4 x_{\mu 3} \exp\dfrac{x_{\mu 3}}{1 + \theta_4}}{(1 + \theta_4)^2 \left(\theta_5 + \exp\dfrac{x_{\mu 3}}{1 + \theta_4}\right)}$
α = 5:  a = 1: $\dfrac{\theta_4}{\theta_5 + \exp\dfrac{x_{\mu 3}}{1 + \theta_4}}$;  a = 2: $\dfrac{\theta_4}{\theta_5 + \exp\dfrac{x_{\mu 3}}{1 + \theta_4}} + \dfrac{1}{\theta_5}$

^a In cases (a), (b), and (c) multiply table entries by $f_{\mu a}$.

Referring to the third row of Table 5-1, we find that

$$\Gamma = \frac{41}{2} \begin{bmatrix} \sum_{\mu=1}^{41} (\log y_{\mu 1} - \log f_{\mu 1})^2 & \sum_{\mu=1}^{41} (\log y_{\mu 1} - \log f_{\mu 1})(\log y_{\mu 2} - \log f_{\mu 2}) \\ \sum_{\mu=1}^{41} (\log y_{\mu 1} - \log f_{\mu 1})(\log y_{\mu 2} - \log f_{\mu 2}) & \sum_{\mu=1}^{41} (\log y_{\mu 2} - \log f_{\mu 2})^2 \end{bmatrix}^{-1} \tag{5-23-13}$$
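The objective functions of cases (a) to (d) differ only in how the 2 x 2 moment matrix $M$ of the residuals is reduced to a single number. The sketch below makes this explicit for two equal-length residual sequences. The helper names are illustrative and not from the text, and the default $Q = \mathrm{diag}(4, 1)$ follows the factor 1/4 in Eq. (5-23-12).

```python
import math


def moment_matrix(e1, e2):
    """Elements M11, M12, M22 of the residual moment matrix M."""
    m11 = sum(a * a for a in e1)
    m22 = sum(b * b for b in e2)
    m12 = sum(a * b for a, b in zip(e1, e2))
    return m11, m12, m22


def objective_a(e1, e2):
    """Case (a): (n/2) log det M, unknown general covariance."""
    m11, m12, m22 = moment_matrix(e1, e2)
    return 0.5 * len(e1) * math.log(m11 * m22 - m12 * m12)


def objective_b(e1, e2):
    """Case (b): (n/2) sum of log M_aa, unknown diagonal covariance."""
    m11, _, m22 = moment_matrix(e1, e2)
    return 0.5 * len(e1) * (math.log(m11) + math.log(m22))


def objective_c(e1, e2, q11=4.0, q22=1.0):
    """Case (c): (n m / 2) log Tr(Q^{-1} M) with m = 2 and diagonal Q."""
    m11, _, m22 = moment_matrix(e1, e2)
    return 0.5 * len(e1) * 2 * math.log(m11 / q11 + m22 / q22)


def objective_d(y1, f1, y2, f2):
    """Case (d): same form as case (a), applied to residuals of the logarithms."""
    e1 = [math.log(y) - math.log(f) for y, f in zip(y1, f1)]
    e2 = [math.log(y) - math.log(f) for y, f in zip(y2, f2)]
    return objective_a(e1, e2)
```

In use, `e1` and `e2` would hold the residuals $y_{\mu 1} - f_{\mu 1}$ and $y_{\mu 2} - f_{\mu 2}$ for the 41 observations of Table 5-6.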
In cases (a), (b), and (c) we obtain $B_{\mu a \alpha}$ by multiplying the corresponding entry in Table 5-7 by $f_{\mu a}$, since $\partial f_{\mu a}/\partial\theta_\alpha = f_{\mu a}\, \partial \log f_{\mu a}/\partial\theta_\alpha$. The expression for $\Gamma$ as given in Table 5-1 comes out to be:

Case (a)

$$\Gamma = \frac{41}{2} \begin{bmatrix} \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 & \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})(y_{\mu 2} - f_{\mu 2}) \\ \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})(y_{\mu 2} - f_{\mu 2}) & \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2 \end{bmatrix}^{-1} \tag{5-23-14}$$

Case (b)

$$\Gamma = \frac{41}{2} \begin{bmatrix} \left[ \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 \right]^{-1} & 0 \\ 0 & \left[ \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2 \right]^{-1} \end{bmatrix} \tag{5-23-15}$$

Case (c)

$$\Gamma = \frac{41}{\tfrac{1}{4} \sum_{\mu=1}^{41} (y_{\mu 1} - f_{\mu 1})^2 + \sum_{\mu=1}^{41} (y_{\mu 2} - f_{\mu 2})^2} \begin{bmatrix} \tfrac{1}{4} & 0 \\ 0 & 1 \end{bmatrix} \tag{5-23-16}$$

We shall omit the details of the calculation. The results for cases (a)-(d) are reported in Table 5-8 in the form of the final estimates $\theta^*$ and the minimum objective function values $\Phi^*$. In Table 5-9 the results are given in terms of the original variables $c^*$. Also reported are the results of Bodkin and Klein (1967), who used an objective function of the form Eq. (2-13-6):

(e) The model equations take the form of Eq. (1).

(f) The model equations are written as

$$g_1 \equiv \log c_1 + c_2 z_4 \log 10 - (c_3/c_4) \log[c_5 z_1^{-c_4} + (1 - c_5) z_2^{-c_4}] - \log z_3 = 0$$
$$g_2 \equiv \log c_5 - \log(1 - c_5) - (1 + c_4)(\log z_1 - \log z_2) - \log z_5 = 0$$

Table 5-8
Results of Parameter Estimation for Production Theory Problem (Estimates of θ and Minimum of Objective Function)

Case     θ_1*          θ_2*           θ_3*        θ_4*       θ_5*        Φ*
(a)     -0.0758463    -0.0115747     0.790686    1.00224    0.859255    -82.71488
(a)^a   -4.27288      -0.00882702    0.713410    4.85440    1.47252     -76.70853
(b)     -0.155586     -0.00978696    0.737126    1.05720    0.878024    -79.1353
(b)^a   -4.27489      -0.00834795    0.696653    4.85221    1.47180     -75.8870
(c)     -4.76163      -0.00569320    0.600316    5.27596    1.49747     -66.5601
(c)^a   -2.45064      -0.00556030    0.598743    3.13535    1.30821     -66.0913
(d)     -0.0409260    -0.0118384     0.802121    0.96870    0.850246    -99.3714
(d)^a   -3.54201      -0.00882492    0.724913    4.18994    1.41740     -95.2669

^a Local minimum of objective function.

Table 5-9
Results of Parameter Estimation for Production Theory Problem

Case     c_1*      c_2*       c_3*     c_4*     c_5*
(a)      0.5460    0.00636    1.265    0.998    0.5752
(a)^a    0.6074    0.00537    1.402    0.206    0.3854
(b)      0.5417    0.00577    1.357    0.946    0.5629
(b)^a    0.6051    0.00520    1.435    0.206    0.3855
(c)      0.5935    0.00412    1.666    0.189    0.3822
(c)^a    0.5791    0.00403    1.670    0.319    0.4123
(d)      0.5473    0.00640    1.247    1.027    0.5806
(d)^a    0.6049    0.00529    1.379    0.239    0.3936
(e)      0.5839    0.00589    1.362    0.475    0.4471
(f)      0.5410    0.00643    1.238    1.130    0.6037

^a Local minimum of objective function.

It is revealing to find such discrepancies in the estimates (particularly of $c_4$), since in all cases the same model equations were fitted to the same data, the only difference being in the assumptions concerning the distribution of errors. We must defer further discussion of this problem to the end of Chapter VII.

Matters are further complicated by the question of convergence. In attempting to solve these problems using various algorithms and starting values, we found that in each case there exists at least one local minimum of the objective function other than the global minimum. These local minima are also recorded in Tables 5-8 and 5-9. The other minima are, we think, global, but we have no way of proving that this is so. The performance of various algorithms is summarized in Table 5-10.
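The conversion between the two parametrizations, Eqs. (5-23-5) and (5-23-8), is easy to get wrong by hand, so a short sketch may help. The function names are illustrative and not from the text; the code assumes $\theta_5 > 0$ and $0 < c_5 < 1$, as the logarithms and fractional powers require.

```python
import math


def theta_to_c(theta):
    """Inverse transformation, Eq. (5-23-8)."""
    t1, t2, t3, t4, t5 = theta
    g = 1.0 + t5 ** ((1.0 + t4) / t4)
    c1 = g ** (-t4 / t3) * math.exp(-t1 / t3)
    c2 = -t2 / (t3 * math.log(10.0))
    return [c1, c2, 1.0 / t3, 1.0 / t4, 1.0 / g]


def c_to_theta(c):
    """Reparametrization, Eq. (5-23-5)."""
    c1, c2, c3, c4, c5 = c
    t1 = math.log(c5) / c4 - math.log(c1) / c3
    t2 = -(c2 / c3) * math.log(10.0)
    t5 = ((1.0 - c5) / c5) ** (1.0 / (1.0 + c4))
    return [t1, t2, 1.0 / c3, 1.0 / c4, t5]


if __name__ == "__main__":
    # Case (a) of Table 5-8, to be compared with case (a) of Table 5-9
    theta_a = [-0.0758463, -0.0115747, 0.790686, 1.00224, 0.859255]
    print(theta_to_c(theta_a))
```

Applying `theta_to_c` to the case (a) row of Table 5-8 gives values that can be set against the case (a) row of Table 5-9, and applying `c_to_theta` to the result recovers the original $\theta$.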
Table 5-10
Convergence of Various Algorithms for Problem 2

Columns: Case; Starting point^a; Algorithm^b; Convergence^c; Iterations; Objective function evaluations (including interpolation-extrapolation)

Case, Starting point:  (a) 2 3   (b) 1 4   (c) 4   (d)

Algorithm   Convergence   Iterations   Evaluations
    1            1            15           28
    2            1            27           55
    3            1            28           56
    4            1            12           16
    5            1            35           73
    6            1            36          350
    3            2            28           56
                              15           32
    3            2             8           15
    3                          8           17
    3            2            15           27
    1            1            38           78
    2            2             8           15
    3            2             8           15
    4                         10           15

^a Starting points:
1. [0, 0, 0, 0, 1]
2. [-4.27489, -0.00834795, 0.696653, 4.85221, 1.47180]
3. [-1.29970, -0.00995700, 0.734317, 2.10535, 1.15490]
4. [-0.0758463, -0.0115747, 0.790686, 1.00224, 0.859255]

^b Algorithms:
1. Gauss, directional discrimination, using Eq. (5-7-9).
2. Gauss, directional discrimination, using Eq. (5-7-10), $\varepsilon = 10^{-5}$, $\alpha = 1$.
3. Gauss, directional discrimination, using Eq. (5-7-11), $\varepsilon = 10^{-5}$, $\beta = 1$.
4. Gauss, Marquardt.
5. Variable metric, ROC.
6. Variable metric, DFP.

^c Convergence:
1. To best known solution.
2. To a local minimum.

5-24. Problems

1. Verify that the Gauss method is invariant under reparametrization; i.e., the sequence of iterations is the same whether we use the original set of variables $\theta$ or a transformed set $c = c(\theta)$, provided only that $c_1 = c(\theta_1)$, and that $c$ is a linear function of $\theta$.
2. Determine what conditions must be fulfilled for the Marquardt and variable metric methods to be invariant under reparametrization.

3. Suppose $\Phi(\theta) = \theta_1^2 + 100000\,\theta_2^2$. Let $\theta_1^T = [100, 1]$. Compare the progress towards the minimum of $\Phi$ that can be made in a single iteration of the steepest descent and Newton methods.

4. For the objective function of Problem 3, find initial guesses for which the upper and lower limits of Eq. (5-6-5) are satisfied. Show that these limits cannot be violated.

5. Devise an algorithm similar to the Gauss method for finding $\theta$ and $X$ to minimize the objective function of Eq. (4-10-12).
Chapter VI

Computation of the Estimates II: Problems with Constraints

A. Inequality Constraints

6-1. Penalty Functions

Inequality constraints of the form Eq. (5-1-1) limit the domain of parameter values within which the estimate is to be found. They often arise from prior information concerning the values of the parameters (see Section 2-16). The presence of inequality constraints, particularly in the form of upper and lower bounds on each parameter, often exerts a beneficial influence on the convergence of an optimization algorithm. In quite a few problems, convergence to a correct minimum results from imposition of somewhat arbitrary bounds, without which the algorithms bog down in irrelevant regions of parameter space. We would go so far as to recommend imposition of generous, though not unreasonably so, bounds in all nonlinear parameter estimation problems.

We possess several powerful algorithms for unconstrained optimization and would like to apply the same algorithms to the constrained problems. We need to modify the objective function in such a way that it remains almost unchanged well in the interior of the feasible region, but increases drastically as one approaches the constraints. To accomplish this, we assign a penalty function to each inequality constraint. This function is nearly zero when the constraint function is strongly positive, but increases sharply as the constraint function approaches zero from above. To the constraint

$$h_j(\theta) \geq 0 \tag{6-1-1}$$

we assign, following Carroll (1961), the penalty function

$$\zeta_j(\theta) \equiv \alpha_j/h_j(\theta) \tag{6-1-2}$$
where $\alpha_j$ is a small positive constant. We now modify the objective function by adding to it the penalty functions for all the constraints:†

$$\Phi^\dagger(\theta) \equiv \Phi(\theta) + \sum_j \alpha_j / h_j(\theta) \tag{6-1-3}$$

Let $\theta^\dagger$ and $\theta^*$ be the points at which $\Phi^\dagger$ and $\Phi$ attain their respective minima within the feasible region. Fiacco and McCormick (1964) have proven that under suitable conditions

$$\lim_{\alpha_j \to 0} \theta^\dagger = \theta^* \tag{6-1-4}$$

These concepts are employed in SUMT (sequential unconstrained minimization technique), originally presented in Carroll's paper but later amplified by Fiacco and McCormick:

1. Select the $\alpha_j$ and a feasible initial guess $\theta_1$.
2. Find $\theta^\dagger$ using one of the unconstrained optimization methods.
3. Reduce the values of the $\alpha_j$, and return to step 2, using $\theta^\dagger$ as the initial guess.

The process is continued until $\theta^\dagger$ does not change significantly upon reducing the $\alpha_j$. Then we accept $\theta^\dagger$ as our estimate of $\theta^*$.

The search for the minimum of $\Phi^\dagger$ in step 2 must still be confined to the feasible region. It may appear, therefore, that nothing has been gained. Why not minimize $\Phi$ directly? The answer is that we have created a situation where the objective function always starts increasing before one has a chance to leave the feasible region. Therefore, the procedures for determining step length always succeed in producing an acceptable feasible step. If we happen to be near a constraint, it is quite possible for a minimization method when applied to $\Phi$ to direct one towards the infeasible region, even though there exist feasible directions in which the function decreases. By adding the penalty functions we deflect our step to a feasible direction.

The point is illustrated in Fig. 6-1 where the contours of an objective function are drawn. The minimum occurs at point A. Starting out at point B, the steepest descent procedure carries us along the path BCDA‡ to the minimum.
If the feasible region is constrained to lie to the left of the line FG, the path is blocked at point C. Introducing a penalty function (Fig. 6-2) leaves the contours around A almost undisturbed, but distorts them near the constraint in the manner shown. We now have a feasible path BHIA‡ to the minimum. Although the example is given in terms of steepest descent, it applies equally well to other minimization methods. This example illustrates the important point that the path from a feasible starting point to a feasible minimum may pass through infeasible territory.

† Alternatively, we may use $-\alpha \sum_j \log h_j(\theta)$. This has the benefit of being unaffected by scaling the functions $h_j(\theta)$.

‡ The details of the hemstitching near the minimum were omitted from the figures.
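The three SUMT steps can be sketched in a few lines of Python. The sketch is illustrative only and is not the procedure as implemented by Carroll or by Fiacco and McCormick: it treats a single parameter and a single constraint, uses a damped Newton iteration as the unconstrained method, and the names `minimize_1d`, `sumt`, `shrink`, and `tol` are assumptions introduced here. The test problem, minimizing $(\theta - 2)^2$ subject to $1 - \theta \ge 0$, is made up; its constrained minimum lies on the boundary at $\theta = 1$.

```python
def minimize_1d(f, df, d2f, x, feasible, tol=1e-12, max_iter=200):
    """Damped Newton iteration that never leaves the feasible region.

    Assumes d2f(x) > 0 wherever it is evaluated (true for the example below).
    """
    for _ in range(max_iter):
        step = -df(x) / d2f(x)
        rho = 1.0
        # Shorten the step until the trial point is strictly feasible and
        # does not increase f. The penalty grows without bound near the
        # constraint, so an acceptable feasible step always exists.
        while not feasible(x + rho * step) or f(x + rho * step) > f(x):
            rho *= 0.5
            if rho < 1e-16:
                return x
        x_new = x + rho * step
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


def sumt(phi, dphi, d2phi, h, dh, theta, alpha=1.0, shrink=0.1, tol=1e-4,
         max_rounds=50):
    """Steps 1-3 of SUMT for one parameter and one constraint h(theta) >= 0."""
    for _ in range(max_rounds):
        # Penalized objective, Eq. (6-1-3), and its first two derivatives
        # (the second derivative is exact here because h is linear).
        f = lambda t, a=alpha: phi(t) + a / h(t)
        df = lambda t, a=alpha: dphi(t) - a * dh(t) / h(t) ** 2
        d2f = lambda t, a=alpha: d2phi(t) + 2.0 * a * dh(t) ** 2 / h(t) ** 3
        new = minimize_1d(f, df, d2f, theta, lambda t: h(t) > 0.0)
        if abs(new - theta) < tol:      # theta-dagger no longer changes
            return new
        theta, alpha = new, alpha * shrink
    return theta


theta_hat = sumt(
    phi=lambda t: (t - 2.0) ** 2, dphi=lambda t: 2.0 * (t - 2.0),
    d2phi=lambda t: 2.0,
    h=lambda t: 1.0 - t, dh=lambda t: -1.0,
    theta=0.0,
)
print(theta_hat)  # slightly below 1, the constrained minimum
```

Every iterate stays strictly inside the feasible region, and the successive minima approach the boundary only as the penalty coefficient is reduced.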
Fig. 6-1. Contours of $\Phi$: minimization without penalty functions.

Fig. 6-2. Contours of $\Phi^\dagger$: minimization with penalty functions.

If matters were always as simple as in our illustration, then we could simply do away with the constraint altogether. In practice it may happen, however, that there are local minima in the infeasible region, or that it is impossible even to compute the value of the function for infeasible parameter values (this occurs frequently with dynamic systems whose differential equations become unstable). Hence the importance of creating paths that lie entirely within the feasible region.

Suppose $\theta^\dagger$ is well in the interior of the feasible region. This is recognized to be the case when $\sum_j \alpha_j / h_j(\theta^\dagger)$ is very small. Then the minima with and without penalty functions nearly coincide. Having obtained $\theta^\dagger$ we may take $\theta^* = \theta^\dagger$, or perhaps go through an additional iteration of the minimization
procedure, starting at $\theta^\dagger$ and omitting the penalty functions entirely. If $\theta^\dagger$ turns out to be near some constraints [sizable value of $\sum_j \alpha_j / h_j(\theta^\dagger)$], a gradual reduction in the $\alpha_j$ is called for.

All iterates $\theta_i$ must be restricted to the interior of the feasible region, except for the last iterate which should be allowed to fall on the boundary. This means that in Eq. (5-3-7) we must have

$$\rho_i < \rho_{i,\max} \tag{6-1-5}$$

in all iterations but the last, and

$$\rho_i \le \rho_{i,\max} \tag{6-1-6}$$

in the last iteration, taken with no penalty functions. Here, $\rho_{i,\max}$ is the greatest lower bound on the positive values of $\rho$ for which $\theta_i + \rho v_i$ is not feasible. The flowcharts of Section 5-14 assume that $\rho_{i,\max}$ can be calculated at each iteration.

If we use a gradient method to minimize $\Phi^\dagger(\theta)$ we need to compute first and sometimes second derivatives of the penalty functions. From Eq. (2):

$$\partial \zeta_j / \partial \theta_\alpha = -[\alpha_j / h_j^2(\theta)]\, \partial h_j / \partial \theta_\alpha \tag{6-1-7}$$

$$\partial^2 \zeta_j / \partial \theta_\alpha\, \partial \theta_\beta = [\alpha_j / h_j^3(\theta)]\,[2(\partial h_j / \partial \theta_\alpha)(\partial h_j / \partial \theta_\beta) - h_j(\theta)\, \partial^2 h_j / \partial \theta_\alpha\, \partial \theta_\beta] \tag{6-1-8}$$

When $\theta$ is far from the jth constraint, the contribution of $\zeta_j$ to the objective function and its derivatives is very small. Near the jth constraint, $h_j$ is nearly zero, and the second term of Eq. (8) may be neglected relative to the first term. In either case, it is safe to replace Eq. (8) with

$$\partial^2 \zeta_j / \partial \theta_\alpha\, \partial \theta_\beta \approx [2\alpha_j / h_j^3(\theta)]\,(\partial h_j / \partial \theta_\alpha)(\partial h_j / \partial \theta_\beta) \tag{6-1-9}$$

which does not require computation of the second derivatives of the constraint functions. This is analogous to the way in which the second derivatives of the model equations are suppressed in the Gauss method. Note also that Eq. (9) is at least positive semidefinite, and does not spoil the definiteness of $N$ when added to the latter. Frequently (e.g., with upper or lower bounds) the constraints are linear functions whose second derivatives vanish anyway. Eq. (9) is then exact.
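As a rough illustration of Eqs. (6-1-7) and (6-1-9), the sketch below evaluates one penalty term together with its gradient and approximate second-derivative matrix, and compares the gradient with a central-difference estimate. The function name `penalty_terms`, the linear constraint, and the numerical values are all assumptions chosen for the example.

```python
def penalty_terms(alpha, h, grad_h):
    """Return alpha/h, its gradient (Eq. 6-1-7), and the second-derivative
    approximation of Eq. (6-1-9).

    h is the constraint value h_j(theta); grad_h is the gradient of h_j.
    """
    value = alpha / h
    grad = [-alpha / h ** 2 * ga for ga in grad_h]
    hess = [[2.0 * alpha / h ** 3 * ga * gb for gb in grad_h] for ga in grad_h]
    return value, grad, hess


# Hypothetical linear constraint h(theta) = 3 - theta_1 - 2*theta_2 >= 0.
def h(theta):
    return 3.0 - theta[0] - 2.0 * theta[1]


alpha = 0.1
theta = [0.5, 0.5]
value, grad, hess = penalty_terms(alpha, h(theta), [-1.0, -2.0])

# Central-difference check of the gradient of alpha / h(theta).
eps = 1e-6
fd_grad = []
for a in range(2):
    up = list(theta)
    up[a] += eps
    dn = list(theta)
    dn[a] -= eps
    fd_grad.append((alpha / h(up) - alpha / h(dn)) / (2.0 * eps))

print(grad, fd_grad)
print(hess)
```

Because the constraint is linear, the matrix returned here is the exact second derivative; it is a rank-one outer product and therefore positive semidefinite.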
The initial choice of $\alpha_j$ should be dictated by the range of values that $h_j(\theta)$ and $\Phi(\theta)$ can take in the feasible region. For instance, if we have two constraints reflecting the bounds $b_\alpha \le \theta_\alpha \le a_\alpha$:

$$h_j(\theta) \equiv a_\alpha - \theta_\alpha \ge 0 \tag{6-1-10}$$

$$h_{j+1}(\theta) \equiv \theta_\alpha - b_\alpha \ge 0 \tag{6-1-11}$$
we might set $\alpha_j = \alpha_{j+1} = 0.001\,(a_\alpha - b_\alpha)\,\Phi(\theta_1)$. If the initial guesses $\theta_1$ are well in the interior of the feasible region, then a good choice for $\alpha_j$ may be

$$\alpha_j = 0.001\, h_j(\theta_1)\, \Phi(\theta_1) \tag{6-1-12}$$

The penalty function method is easy to program, and has been found to work well when the solution is known (or expected) to be in the interior of the feasible region. A numerical illustration appears in Section 6-11. When the solution is likely to be on the boundary, then the projection method discussed below is preferable. Even an interior minimum may be reached faster by the projection method, but the complexity of the latter militates against its use.

Penalty function methods other than the one described here have been proposed, e.g., by Zangwill (1967a) and Fiacco and McCormick (1967). These methods possess the advantage (for general nonlinear programming problems) that the initial guess and intermediate iterates are not restricted to the interior of the feasible region; i.e., they are allowed to violate the constraints. In the case of parameter estimation problems, however, this is not at all an advantage. In the first place, it is usually easy to stay in the feasible region because of the simple nature of the constraints. Secondly, the objective function often behaves in an erratic manner, and may even be uncomputable outside the feasible region. The importance of staying within the feasible region was stressed above.

While penalty functions may appear to be merely a computational artifact, they do indeed possess a statistical interpretation. Suppose $\Phi(\theta)$ is the negative of a log likelihood, and let us assign to $\theta$ a discontinuous prior distribution, which is zero outside, and uniform inside the feasible region. Then the posterior density is zero outside, and proportional to the likelihood $\exp[-\Phi(\theta)]$ inside the region, and the problem can be formulated as being that of finding the maximum of the posterior density.
Let us, however, try to smooth out the prior distribution so that its density approaches zero continuously (though rapidly) as one goes out to the boundary of the feasible region. This can be accomplished precisely by making

$$-\log p_0(\theta) = \sum_j \zeta_j(\theta) \tag{6-1-13}$$

As an example, suppose $0 \le \theta_1 \le 1$, and we use $\zeta_j = -\alpha \log h_j(\theta)$. In this case

$$-\log p_0(\theta) = -\alpha \log \theta_1 - \alpha \log (1 - \theta_1) \tag{6-1-14}$$

so that

$$p_0(\theta) = \theta_1^{\alpha} (1 - \theta_1)^{\alpha} \tag{6-1-15}$$

which, apart from a normalizing constant, is the $\beta_{1+\alpha,\,1+\alpha}$ (beta) distribution.
6-2. Projection Methods

Another class of methods for optimizing with inequality constraints is variously known as gradient projection and reduced gradient (Rosen, 1960, 1961; Wolfe, 1963; Faure and Huard, 1965; Abadie and Carpentier, 1966; Abadie, 1967b). These methods, which in Fig. 6-1 would take us along the path BCEA, may be summarized as follows: At each iteration define the normal step as the one computed according to the gradient method of our choice, with the constraints ignored. We now face one of the following two situations:

1. If $\theta_i$ is in the interior of the feasible region, apply the normal step (e.g., BC in Fig. 6-1). If this results in an infeasible point, the step is truncated so as to leave us on the boundary of the feasible region.
2. If $\theta_i$ is on the boundary, take the normal step or a fraction thereof if this is feasible (e.g., EA in Fig. 6-1). Otherwise, treat some or all of the active constraints as equality constraints, and take a step along these constraints (e.g., CE in Fig. 6-1).

The question of which ones of the active constraints should be retained in any given situation is a difficult one. Although Rosen (1960, 1961) gives a working solution to this problem, this solution is not necessarily optimal. The quadratic programming solution listed below is a good one provided all the algebraic manipulations involved are less time-consuming than the function evaluations.

Efficient algorithms have been constructed by using these techniques in combination with variable metric methods for generating the step directions [Goldfarb and Lapidus (1968) and Murtagh and Sargent (1969) for linear constraints; Davies (1970) for nonlinear constraints]. These algorithms are usually superior to the penalty function method for finding minima that lie on a constraint.
In many parameter estimation problems, where we hope to find an interior minimum, the penalty function method seems preferable because of its greater simplicity. Exceptions to this rule do occur, however, and therefore we indicate how the ith iteration of a gradient method may be modified in the presence of linear inequality constraints (e.g., upper and lower bounds).

Note that when the penalty function method is used, all iterates $\theta_i$ (except possibly the last) are in the interior of the feasible region, and hence the method is immediately applicable even when some of the constraints Eq. (6-1-1) are in the form of strict inequalities $h_j(\theta) > 0$. Such situations arise, for instance, when terms like $1/\theta_\alpha$ or $\log \theta_\alpha$ appear in the model equations, requiring $\theta_\alpha > 0$. In the projection method some iterates may fall on the boundary. Hence the constraint $\theta_\alpha > 0$ must be replaced by $\theta_\alpha - \epsilon \ge 0$, where $\epsilon$ is a small positive number.
In an unconstrained gradient method we take a step in the direction $v = -R_i q_i$. The minimum would be attained in a single step if the objective function to be minimized were, in fact,

$$Q_i(v) \equiv \Phi_i + v^T q_i + \tfrac{1}{2} v^T R_i^{-1} v \tag{6-2-1}$$

which is close to $\Phi$ provided $R_i$ is a good approximation to $H_i^{-1}$. Suppose the constraints take the form

$$a_j^T \theta \ge b_j \qquad (j = 1, 2, \ldots) \tag{6-2-2}$$

Let $A_i$ be the matrix whose columns are precisely those vectors $a_j$ such that $a_j^T \theta_i = b_j$, i.e., those constraints which are active at the point $\theta_i$. Let $p$ be the number of such constraints; $A_i$ is $l \times p$. Any feasible step must satisfy

$$A_i^T v \ge 0 \tag{6-2-3}$$

The following strategy will be adopted for finding the direction $v_i$: Minimize the current approximation Eq. (1) to the objective function, subject to the currently active constraints (as will be seen later, currently inactive constraints will help determine the step length, but not its direction). The quadratic expression Eq. (1) therefore acts as a temporary objective function, and the problem of finding its minimum subject to the linear constraints Eq. (3) is called quadratic programming (QP). Since $R_i$, $q_i$, and $A_i$ differ from iteration to iteration, it follows that each iteration of the original problem requires the solution of a different QP problem. The algorithm described below is very efficient, however, and in many estimation problems the computation of $\Phi$ and its derivatives is more time consuming than the solution of the QP problem.

Let $v_i$ be the solution to the ith iteration QP problem. At $\theta = \theta_i + v_i$ the gradient of $Q_i(v)$ is obtained from Eq.
(1) as

$$q(v_i) = q_i + R_i^{-1} v_i \tag{6-2-4}$$

According to the Kuhn-Tucker conditions, there must exist a vector of Lagrange multipliers $\lambda_i$ satisfying Eq. (3-7-3), which becomes in our case

$$q_i + R_i^{-1} v_i = A_i \lambda_i \tag{6-2-5}$$

Let $w_i \equiv A_i^T v_i$ denote the vector of constraint functions evaluated at $\theta = \theta_i + v_i$. Then Eqs. (3-7-4)-(3-7-6) take the form

$$w_i \ge 0, \qquad \lambda_i \ge 0, \qquad \lambda_i^T w_i = 0 \tag{6-2-6}$$

Let $z_i \equiv -A_i^T R_i q_i$ and $W_i \equiv A_i^T R_i A_i$. Premultiply Eq. (5) by $A_i^T R_i$. Conditions Eqs. (5) and (6) are now transformed into the following problem (we henceforth drop the subscript i): Find $\lambda$ and $w$ satisfying

$$w = z + W\lambda, \qquad w \ge 0, \qquad \lambda \ge 0, \qquad \lambda^T w = 0 \tag{6-2-7}$$
This is known as the complementary pivot problem (Cottle and Dantzig, 1968) and can be solved by an algorithm given by Dantzig and Cottle (1967). We present here a simpler and faster algorithm (Zoutendijk, 1960; Bard, 1971) whose convergence has not been proven, but which has not failed in hundreds of applications.

Observe that from Eq. (7) for each j either $\lambda_j$ or $w_j$ vanishes. Since $w_j$ is the value of the jth constraint function at the solution to the QP problem, it follows that this constraint remains active if $w_j = 0$. In this case we refer to the jth constraint as binding. If $w_j > 0$ then the solution would not be affected by removal of this constraint, and it is called nonbinding. Let $w_B = 0$ and $w_N > 0$ be the vectors of binding and nonbinding constraints, respectively. Of course, we do not know as yet which constraints are going to be included in which set, but for the time being we ignore this difficulty. Let $\lambda_B$ and $\lambda_N$ be the corresponding partition of $\lambda$. Then from Eq. (7) $\lambda_N = 0$. Let $W$ be partitioned along the same lines into

$$W = \begin{bmatrix} W_{BB} & W_{BN} \\ W_{BN}^T & W_{NN} \end{bmatrix}$$

and $z^T$ into $[z_B^T, z_N^T]$. Then Eq. (7) becomes:

$$0 = z_B + W_{BB} \lambda_B \tag{6-2-8}$$

$$w_N = z_N + W_{BN}^T \lambda_B \tag{6-2-9}$$

$$w_N > 0, \qquad \lambda_B \ge 0 \tag{6-2-10}$$

From Eq. (8), $\lambda_B = -W_{BB}^{-1} z_B$, so that from Eq. (9) $w_N = z_N - W_{BN}^T W_{BB}^{-1} z_B$. Let us form the tableau

$$E \equiv [W, z] = \begin{bmatrix} W_{BB} & W_{BN} & z_B \\ W_{BN}^T & W_{NN} & z_N \end{bmatrix} \tag{6-2-11}$$

If Gauss-Jordan pivots (see Section A-3) were to be effected in turn on all the diagonal elements of $W_{BB}$, then the last column of $E$ would be transformed into

$$\begin{bmatrix} W_{BB}^{-1} z_B \\ z_N - W_{BN}^T W_{BB}^{-1} z_B \end{bmatrix} = \begin{bmatrix} -\lambda_B \\ w_N \end{bmatrix}$$

It follows that if the proper partitioning of the constraints into binding and nonbinding ones is at hand, and if we sweep those rows of $E$ which correspond to the binding constraints, then the last elements in those rows must become nonpositive (since $\lambda_B \ge 0$).
Conversely, the last element in each unswept row of $E$ must become nonnegative (since $w_N > 0$, but zero elements may appear
under certain conditions of degeneracy). In order to find the proper partition, the following algorithm is suggested:

1. Form the tableau $E$ which has $p$ rows and $p + 1$ columns ($p$ is the number of constraints active at $\theta_i$).
2. Assign to the jth row of $E$ ($j = 1, 2, \ldots, p$) the indicator $k_j = 1$.
3. Let $e_j$ denote the current value of the last element in the jth row. Find $a = \min_j k_j e_j$. If $a \ge -\epsilon$ (where $\epsilon$ is a small positive constant, say $\epsilon = 10^{-6}$ for single precision calculations), proceed to step 5. Otherwise:
4. Let $r$ be an index for which $a = k_r e_r$. Sweep the rth row (i.e., execute a Gauss-Jordan pivot on $E_{rr}$) and change the sign of $k_r$. Return to step 3.
5. The solution is now at hand. Consider the jth row of $E$. If $k_j = 1$, the jth row is unswept (or swept an even number of times). Hence the jth constraint is nonbinding. Therefore, $e_j = w_j = (A^T v)_j \ge 0$, and $\lambda_j = 0$. If $k_j = -1$, the jth row is swept and the jth constraint is binding. Hence $e_j = -\lambda_j \le 0$, and $w_j = (A^T v)_j = 0$. To compute $v_i$ (we now restore the subscript i), solve Eq. (5):

$$v_i = R_i (A_i \lambda_i - q_i) = R_i (A_B \lambda_B - q_i) \tag{6-2-12}$$

where $A_B$ consists of the columns of $A_i$ corresponding to the swept rows of $E$, and $\lambda_B$ consists of the last elements of those rows, with signs changed.
6. The actual step $\sigma_i = \rho_i v_i$ is computed by interpolation-extrapolation along the ray $\theta_i + \rho v_i$ ($\rho > 0$), with the additional proviso that $\theta_{i+1} = \theta_i + \rho_i v_i$ must also satisfy all the constraints inactive at $\theta_i$, and therefore not included in $A_i$. If we denote these constraints as $c_j^T \theta - b_j \ge 0$ ($j = 1, 2, \ldots$), then we must have $c_j^T \theta_i + \rho_i c_j^T v_i - b_j \ge 0$. Since $\rho_i > 0$ and $c_j^T \theta_i - b_j > 0$ (inactivity at $\theta_i$), it follows that only constraints for which $c_j^T v_i < 0$ threaten to become violated. Hence $\rho_i$ must satisfy the inequality

$$\rho_i \le \rho_{\max} \equiv \min_{j \in J}\, [(b_j - c_j^T \theta_i) / c_j^T v_i] \tag{6-2-13}$$

where $J$ is the set of indices j for which $c_j^T v_i < 0$. If $J$ is vacuous, then $\rho_{\max} = \infty$.
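Steps 1-5, together with the step-length limit of Eq. (6-2-13), can be sketched as follows. This is a minimal illustration using plain Python lists, not the author's program: the names `sweep`, `projected_direction`, and `rho_max` are assumptions, and the pivot is arranged so that the last column holds $w_j$ in unswept rows and $-\lambda_j$ in swept rows, as step 5 requires (the layout of the pivot in Section A-3 may differ). The data at the bottom are those of the small quadratic example that follows in the text, with $q^T = [1, 0]$, $R = I_2$, and active constraints $v_2 \ge 0$, $v_1 + v_2 \ge 0$, $-v_1 + v_2 \ge 0$.

```python
def sweep(E, r):
    """Gauss-Jordan pivot on E[r][r], performed in place."""
    d = E[r][r]
    for i in range(len(E)):
        if i == r:
            continue
        f = E[i][r] / d
        for j in range(len(E[i])):
            if j != r:
                E[i][j] -= f * E[r][j]
        E[i][r] = -f
    for j in range(len(E[r])):
        if j != r:
            E[r][j] /= d
    E[r][r] = 1.0 / d


def projected_direction(A, R, q, eps=1e-6, max_sweeps=100):
    """Steps 1-5: return the direction v and the multipliers lambda.

    A is l x p (columns are the normals of the p >= 1 active constraints),
    R is l x l, and q has length l.
    """
    l, p = len(A), len(A[0])
    Rq = [sum(R[i][m] * q[m] for m in range(l)) for i in range(l)]
    RA = [[sum(R[i][m] * A[m][j] for m in range(l)) for j in range(p)]
          for i in range(l)]
    # Tableau E = [W, z] with W = A^T R A and z = -A^T R q.
    E = [[sum(A[m][i] * RA[m][j] for m in range(l)) for j in range(p)]
         + [-sum(A[m][i] * Rq[m] for m in range(l))] for i in range(p)]
    k = [1] * p
    for _ in range(max_sweeps):
        vals = [k[j] * E[j][p] for j in range(p)]
        a = min(vals)
        if a >= -eps:                      # step 3: partition found
            break
        r = vals.index(a)                  # step 4: sweep the worst row
        sweep(E, r)
        k[r] = -k[r]
    else:
        raise RuntimeError("no partition found within max_sweeps")
    lam = [-E[j][p] if k[j] == -1 else 0.0 for j in range(p)]
    # Eq. (6-2-12): v = R (A lambda - q).
    v = [sum(RA[i][j] * lam[j] for j in range(p)) - Rq[i] for i in range(l)]
    return v, lam


def rho_max(theta, v, inactive):
    """Eq. (6-2-13); inactive holds (c, b) pairs for c.theta - b >= 0."""
    limit = float("inf")
    for c, b in inactive:
        cv = sum(ci * vi for ci, vi in zip(c, v))
        if cv < 0.0:
            c_theta = sum(ci * ti for ci, ti in zip(c, theta))
            limit = min(limit, (b - c_theta) / cv)
    return limit


A = [[0.0, 1.0, -1.0],
     [1.0, 1.0, 1.0]]
R = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 0.0]
v, lam = projected_direction(A, R, q)
print(v, lam)  # v = [-0.5, 0.5]; only the second constraint is binding
```

The safeguard `max_sweeps` is an addition of this sketch; the text notes that convergence of the sweeping scheme has not been proven.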
The following simple example illustrates the computation of $v_i$. Assume that the objective function is approximated locally by the quadratic function

$$Q(v) = v_1 + \tfrac{1}{2} v_1^2 + \tfrac{1}{2} v_2^2 \tag{6-2-14}$$

We start at $v_1 = v_2 = 0$, where the active constraints are

$$v_2 \ge 0, \qquad v_1 + v_2 \ge 0, \qquad -v_1 + v_2 \ge 0 \tag{6-2-15}$$
The constraints and the contours of the objective function are displayed in Fig. 6-3. From Eq. (14) and Eq. (15) we deduce $q^T = [1, 0]$, $H = I_2$, $R = H^{-1} = I_2$, and

$$A = \begin{bmatrix} 0 & 1 & -1 \\ 1 & 1 & 1 \end{bmatrix}$$

We form $z = -A^T R q = [0, -1, 1]^T$ and

$$W = A^T R A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 0 \\ 1 & 0 & 2 \end{bmatrix} \tag{6-2-16}$$

Fig. 6-3. Projection method with $Q(v) = v_1 + \tfrac{1}{2} v_1^2 + \tfrac{1}{2} v_2^2$.

The algorithm proceeds as follows:

1. $$E = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 1 & 2 & 0 & -1 \\ 1 & 0 & 2 & 1 \end{bmatrix}$$
2. $k^T = [1, 1, 1]$.
3. $a = e_2 k_2 = -1 < -10^{-6}$.
4. $r = 2$, hence we pivot on $E_{2,2}$. The resulting tableau is

$$E = \begin{bmatrix} \tfrac{1}{2} & -\tfrac{1}{2} & 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 & -\tfrac{1}{2} \\ 1 & 0 & 2 & 1 \end{bmatrix}, \qquad k = \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}$$

3. $a = e_1 k_1 = e_2 k_2 = \tfrac{1}{2} > -10^{-6}$.
5. Since $k_2 = -1$, the second constraint is binding, whereas the first and third are not. Hence $A_B = [1, 1]^T$ (the second column of $A$) and $\lambda_B = [\tfrac{1}{2}]$ (minus the second element in the last column of $E$). Hence, according to Eq. (12)

$$v = I_2 \left( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \times \tfrac{1}{2} - \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right) = \begin{bmatrix} -\tfrac{1}{2} \\ \tfrac{1}{2} \end{bmatrix}$$

which, as is evident from Fig. 6-3, is the correct solution.

An extension of this algorithm to the case of nonlinear constraints is given by Bard (1971).

6-3. Projection with Bounded Parameters

The reader may wonder at the need for a complicated algorithm when bounds are the only constraints imposed on the parameters. Why not simply suppress those components of the step $v$ which would violate the bounds? A simple example shows that such a procedure can produce erroneous results. Let, for instance

$$\theta_i = [0, 0]^T, \qquad q_i = [-1, \ldots]^T, \qquad R_i = [\,\ldots\,]$$

and suppose both components of $\theta$ have zero lower bounds. Since $q_1 = -1$, it is clear that the objective function can be reduced by increasing $\theta_1$ while keeping $\theta_2$ constant. However, if we compute $v$ we find $-R_i q_i = [-3, -6]^T$. According to the above suggestion, since both components of $\theta$ are to be reduced below their lower bounds, we would conclude erroneously that we are at the minimum and take no step at all. We shall proceed therefore to apply the algorithm of the preceding section.

At the ith iteration, several of the components of $\theta$ (we omit the subscript i for convenience) may be at their lower bounds, and others at their upper bounds, while the remainder are free to move in either direction. For simplicity in the algebraic exposition to follow, we shall treat lower bounds only, but later we give the arithmetic details for both cases.
If $\theta_\alpha$ is at its lower bound, the corresponding constraint takes the form

$$v_\alpha \ge 0 \tag{6-3-1}$$

Let $v^{(0)} \equiv -Rq$ be the step that would be taken in the absence of constraints. It is easy to see that forming $A^T R$ merely picks out of $R$ those rows which correspond to variables which are at a bound; postmultiplying this by $A$
picks out the corresponding columns. Thus, $W = A^T R A = \tilde{R}$ where $\tilde{R}$ is the set of elements of $R$ at the junction of active rows and columns. Similarly, $z = -A^T R q$ is obtained by taking the active elements of $v^{(0)}$. Hence, $E$ can be formed by inspection from $R$ and $v^{(0)}$:

$$E = [\tilde{R}, \tilde{v}^{(0)}] \tag{6-3-2}$$

It is easily verified that the following procedure is equivalent to the algorithm of Section 6-2 in the present context:

1. Set up the tableau $E$, which has $p$ rows and $p + 1$ columns.
2. For $j = 1, 2, \ldots, p$ let $m_j = 1$ if the jth row corresponds to a variable at its lower bound, and $m_j = -1$ if it corresponds to a variable at its upper bound.
3. For $j = 1, 2, \ldots, p$ let $k_j = 1$.
4. Let $e_j$ denote the current value of the last element in the jth row of $E$. Find $a = \min_j m_j k_j e_j$. If $a \ge -\epsilon$, proceed to step 6. Otherwise:
5. Let $r$ be an index for which $a = m_r k_r e_r$. Sweep the rth row (Gauss-Jordan pivot on $E_{rr}$) and change the sign of $k_r$. Return to step 4.
6. The solution is now at hand. Bounds for which $k_j = -1$ are binding, others are not. We construct $\lambda_B$ by taking the elements of $-e$ in the binding rows, and we construct $RA_B$ by taking the columns of $R$ corresponding to variables which are at a binding bound. It is thus easy to compute $v$ using Eq. (6-2-12):

$$v = (RA_B)\lambda_B + v^{(0)} \tag{6-3-3}$$

Since variables at binding bounds cannot change their values, the corresponding elements of $v$ automatically come out 0.

We illustrate the procedure by means of a numerical example (a further illustration appears in Section 6-12): Suppose the following values are current in the ith iteration:

$$\theta_i = \begin{bmatrix} 0 \\ \tfrac{1}{2} \\ 1 \end{bmatrix}, \qquad q_i = \begin{bmatrix} -3 \\ -1 \\ -2 \end{bmatrix}, \qquad R_i = \begin{bmatrix} 5 & -10 & -4 \\ -10 & 30 & 8 \\ -4 & 8 & 5 \end{bmatrix}$$

Hence, $v_i^{(0)} = -R_i q_i = [-3, 16, 6]^T$. Suppose all parameters are restricted to the range zero to one. Thus $\theta_1$ and $\theta_3$ are at their lower and upper bounds, respectively. The algorithm proceeds as follows:

1. $$E = \begin{bmatrix} 5 & -4 & -3 \\ -4 & 5 & 6 \end{bmatrix}$$
2. $m = [1, -1]^T$.
3. $k = [1, 1]^T$.
4. $a = m_2 k_2 e_2 = -6 < -10^{-6}$.
5. $r = 2$, hence we pivot on $E_{2,2}$ obtaining the new tableau

$$E = \begin{bmatrix} 1.8 & 0.8 & 1.8 \\ -0.8 & 0.2 & 1.2 \end{bmatrix}, \qquad k = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

4. $a = m_2 k_2 e_2 = 1.2 > -10^{-6}$.
6. Since $k_2 = -1$, the second constraint (the bound on $\theta_3$) is binding. We have $\lambda_B = [-1.2]$ (second element of $-e$) and $RA_B = [-4, 8, 5]^T$ (third column of $R$). Hence

$$v = \begin{bmatrix} -4 \\ 8 \\ 5^* \end{bmatrix} \times (-1.2) + \begin{bmatrix} -3 \\ 16 \\ 6^* \end{bmatrix} = \begin{bmatrix} 1.8 \\ 6.4 \\ 0 \end{bmatrix} \tag{6-3-4}$$

As expected $v_3 = 0$, and we could have replaced the starred elements in Eq. (4) by zeros without altering the result.

The actual step $\sigma$ will be some multiple of $v$, i.e., $\sigma = \rho v$. The multiplier $\rho$ is limited to the range $0 \le \rho \le \min(1, \rho_{\max})$, with $\rho_{\max}$ to be determined by the requirement that $\theta_i + \rho_{\max} v$ remains feasible. That is, we must have

$$0 + 1.8\rho_{\max} \le 1, \qquad 0.5 + 6.4\rho_{\max} \le 1$$

Hence

$$\rho_{\max} = \min[(1 - 0)/1.8,\ (1 - 0.5)/6.4] = 0.078125$$

The actual value of $\rho$ would be determined by interpolation-extrapolation (see Section 5-14) to guarantee a decrease in the objective function value.

6-4. Transformation of Variables

Sometimes a change of variables can transform a constrained problem into an unconstrained one. For instance, to minimize $\Phi(\theta)$ with $\theta$ required to be positive is equivalent to minimizing $\Phi(p^2)$ with $p$ free to assume any value. Similarly, if $\theta$ must satisfy $\beta \le \theta \le \alpha$, then we minimize $\Phi((\alpha + \beta)/2 + [(\alpha - \beta)/2] \sin p)$ with $p$ unconstrained, since as $p$ varies from $-\infty$ to $\infty$, the quantity $(\alpha + \beta)/2 + [(\alpha - \beta)/2] \sin p$ remains within the bounds $\alpha$ and $\beta$. Box (1966) has demonstrated that with ingenuity even more complicated constraints can sometimes be eliminated by means of such transformations.

We have some numerical evidence that the use of transformations is no more efficient than the use of penalty functions, and we prefer the latter because of their greater generality.
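The two transformations of Section 6-4 are easy to try out. The sketch below is only an illustration under assumed names (`to_positive`, `to_interval`) and a made-up problem: it minimizes $\Phi(\theta) = (\theta - 3)^2$ subject to $0 \le \theta \le 2$ by plain gradient descent on the unconstrained variable $p$, using the chain rule $d\Phi/dp = (d\Phi/d\theta)(d\theta/dp)$ and a fixed step length chosen for this example.

```python
import math


def to_positive(p):
    """theta = p**2 is nonnegative for every real p."""
    return p * p


def to_interval(p, lower, upper):
    """theta = (upper + lower)/2 + [(upper - lower)/2] sin p lies in [lower, upper]."""
    return 0.5 * (upper + lower) + 0.5 * (upper - lower) * math.sin(p)


# Made-up problem: minimize Phi(theta) = (theta - 3)^2 with 0 <= theta <= 2.
def dphi(theta):
    return 2.0 * (theta - 3.0)


lower, upper = 0.0, 2.0
p = 0.0                                   # unconstrained variable
for _ in range(200):
    theta = to_interval(p, lower, upper)
    dtheta_dp = 0.5 * (upper - lower) * math.cos(p)
    p -= 0.1 * dphi(theta) * dtheta_dp    # fixed-step gradient descent in p

theta = to_interval(p, lower, upper)
print(theta)  # approaches the upper bound 2
```

In this example the minimum lies on a bound, where $d\theta/dp = 0$; the gradient with respect to $p$ therefore vanishes there even though $d\Phi/d\theta$ does not.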
6-5. Minimax Problems

Some estimation problems (see Sections 4-13 and 4-17) led us to seek the value of $\theta$ that minimizes

$$\Phi(\theta) \equiv \max_\mu |e_\mu(\theta)| \tag{6-5-1}$$

This was shown to be equivalent to finding $z$, $\theta$ so as to minimize

$$\Psi(\theta, z) \equiv z \tag{6-5-2}$$

subject to

$$z - |e_\mu(\theta)| \ge 0 \qquad (\mu = 1, 2, \ldots, n) \tag{6-5-3}$$

An iterative scheme which is analogous to the Gauss method for least squares problems has been suggested by Osborne and Watson (1969). Let $\theta_i$ be the value of $\theta$ at the ith iteration. We approximate $e_\mu(\theta_i + v_i)$ by a linear expression $e_\mu(\theta_i) + b_\mu^T v_i$. The nonlinear programming problem Eq. (2), Eq. (3) is replaced by the linear programming problem: Find $z$, $v_i$ so as to minimize Eq. (2) subject to

$$z - e_\mu(\theta_i) - b_\mu^T v_i \ge 0, \qquad z + e_\mu(\theta_i) + b_\mu^T v_i \ge 0 \qquad (\mu = 1, 2, \ldots, n) \tag{6-5-4}$$

This problem can be solved by means of standard linear programming (LP) methods. Once $v_i$ has been computed, the length of the step taken in this direction is determined by interpolation-extrapolation.

B. Equality Constraints

6-6. Exact Structural Models

In Section 4-11 we have formulated some estimation problems as requiring the minimization of a function $\Phi(\hat{W})$ subject to the constraints $G(\hat{W}, \theta) = 0$. The "true" data $\hat{W}$ and the parameters $\theta$ are the unknowns. In the ensuing discussion we will treat $G$ and $\hat{W}$ as vectors, $g$ and $\hat{w}$. We suppose that at the ith iteration we have current values $\hat{w}_i$ and $\theta_i$. The method of solution given below follows suggestions by Deming (1943). We denote the current values of $\Phi$ and $g$ by $\Phi_i$ and $g_i$, respectively, and adopt the following notation:

$$q \equiv \partial \Phi / \partial \hat{w}, \qquad H \equiv \partial^2 \Phi / \partial \hat{w}\, \partial \hat{w}, \qquad A \equiv \partial g / \partial \hat{w}, \qquad B \equiv \partial g / \partial \theta \tag{6-6-1}$$
with subscript i denoting the value at $\hat{w} = \hat{w}_i$, $\theta = \theta_i$. We define the following functions:

$$\tilde{\Phi}_i(\delta w) \equiv \Phi_i + q_i^T\, \delta w + \tfrac{1}{2}\, \delta w^T H_i\, \delta w \tag{6-6-2}$$

$$\tilde{g}_i(\delta w, \delta\theta) \equiv g_i + A_i\, \delta w + B_i\, \delta\theta \tag{6-6-3}$$

$\tilde{\Phi}_i$ is the second-order Taylor series approximation to $\Phi$, and $\tilde{g}_i$ is the first-order Taylor series approximation to $g$. We now replace our original problem by the following: Find $\delta w$ and $\delta\theta$ so as to minimize $\tilde{\Phi}_i$, while satisfying the constraints $\tilde{g}_i = 0$. We introduce a vector of Lagrange multipliers $\lambda$ and seek the stationary point of the Lagrangian

$$\Lambda(\delta w, \delta\theta, \lambda) \equiv \tilde{\Phi}_i + \lambda^T \tilde{g}_i \tag{6-6-4}$$

Accordingly, we form the normal equations:

$$\partial \Lambda / \partial(\delta w) = q_i + H_i\, \delta w + A_i^T \lambda = 0 \tag{6-6-5}$$

$$\partial \Lambda / \partial \lambda = g_i + A_i\, \delta w + B_i\, \delta\theta = 0 \tag{6-6-6}$$

$$\partial \Lambda / \partial(\delta\theta) = B_i^T \lambda = 0 \tag{6-6-7}$$

From Eq. (5)

$$\delta w = -H_i^{-1}(q_i + A_i^T \lambda) \tag{6-6-8}$$

So that, from Eq. (6)

$$g_i - A_i H_i^{-1} q_i - A_i H_i^{-1} A_i^T \lambda + B_i\, \delta\theta = 0 \tag{6-6-9}$$

Solving for $\lambda$ we obtain

$$\lambda = C_i^{-1}(B_i\, \delta\theta - A_i H_i^{-1} q_i + g_i) \tag{6-6-10}$$

where

$$C_i \equiv A_i H_i^{-1} A_i^T \tag{6-6-11}$$

Substituting Eq. (10) in Eq. (7) and solving for $\delta\theta$ we obtain

$$\delta\theta = D_i^{-1} B_i^T C_i^{-1}(A_i H_i^{-1} q_i - g_i) \tag{6-6-12}$$

where

$$D_i \equiv B_i^T C_i^{-1} B_i \tag{6-6-13}$$

The matrix $D_i$ plays a role analogous to that of $N_i$ in the Gauss method, and the same "almost inversion" methods (e.g., directional discrimination or Marquardt's method) should be used where $D_i^{-1}$ is required.

Eq. (12) enables one to compute $\theta_{i+1} = \theta_i + \delta\theta$. Then Eq. (10) can be used to compute $\lambda$, which in turn can be substituted in Eq. (8) to compute $\delta w$ and the new approximation $\hat{w}_{i+1} = \hat{w}_i + \delta w$. Usually $\hat{w}$ is close to the observed values $w$, so that we naturally take $\hat{w} = w$ as the initial guess.
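One iteration of Eqs. (6-6-8), (6-6-10), and (6-6-12) might be coded along the following lines. This is a sketch under stated assumptions rather than a production routine: matrices are plain lists of rows, the helper names `solve`, `matmul`, `transpose`, `col`, and `deming_step` are introduced here, and ordinary Gaussian elimination stands in for the "almost inversion" methods recommended above. The test problem is made up: two observations $w = [1, 2]$ of the same quantity $\theta$, with $\Phi = \tfrac{1}{2}|\hat{w} - w|^2$ and constraints $g_\mu = \hat{w}_\mu - \theta = 0$. Because the constraints are linear and $\Phi$ is quadratic, a single step reaches the solution.

```python
def solve(M, rhs):
    """Solve M X = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(rhs[i]) for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [x / d for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, c)) for c in zip(*Y)] for row in X]


def transpose(X):
    return [list(c) for c in zip(*X)]


def col(v):
    return [[x] for x in v]


def deming_step(q, H, g, A, B):
    """Return (dtheta, dw, lam) from Eqs. (6-6-12), (6-6-8), (6-6-10)."""
    q, g = col(q), col(g)
    Hq = solve(H, q)                          # H^-1 q
    HAt = solve(H, transpose(A))              # H^-1 A^T
    C = matmul(A, HAt)                        # Eq. (6-6-11)
    AHq = matmul(A, Hq)
    resid = [[x[0] - y[0]] for x, y in zip(AHq, g)]   # A H^-1 q - g
    Cr = solve(C, resid)
    CB = solve(C, B)                          # C^-1 B
    D = matmul(transpose(B), CB)              # Eq. (6-6-13)
    dtheta = solve(D, matmul(transpose(B), Cr))       # Eq. (6-6-12)
    CBd = matmul(CB, dtheta)
    lam = [[x[0] - y[0]] for x, y in zip(CBd, Cr)]    # Eq. (6-6-10)
    HAl = matmul(HAt, lam)
    dw = [[-(x[0] + y[0])] for x, y in zip(Hq, HAl)]  # Eq. (6-6-8)
    return [x[0] for x in dtheta], [x[0] for x in dw], [x[0] for x in lam]


w_obs = [1.0, 2.0]
w_hat = list(w_obs)                           # initial guess: w_hat = w
theta = [0.0]
q = [wh - wo for wh, wo in zip(w_hat, w_obs)]
H = [[1.0, 0.0], [0.0, 1.0]]
g = [w_hat[0] - theta[0], w_hat[1] - theta[0]]
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[-1.0], [-1.0]]
dtheta, dw, lam = deming_step(q, H, g, A, B)
theta = [t + d for t, d in zip(theta, dtheta)]
w_hat = [x + d for x, d in zip(w_hat, dw)]
print(theta, w_hat)  # theta = [1.5], w_hat = [1.5, 1.5]
```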
6-7. Convergence Monitoring

When we apply the Deming procedure there is no natural way of telling whether or not progress towards the solution has occurred in any given iteration. There is no way of telling in advance whether the final value of the objective function or the Lagrangian must be less or greater than the current value. At the solution, however, the equations $g = 0$ must be satisfied, so that if $Q_i$ is some positive definite matrix, it is natural to require that the ith iteration must cause the value of $g^T Q_i g$ to decrease. Now we have:

$$\partial(g^T Q_i g) / \partial\theta = 2 B_i^T Q_i g \tag{6-7-1}$$

$$\partial(g^T Q_i g) / \partial\hat{w} = 2 A_i^T Q_i g \tag{6-7-2}$$

Hence, if $\delta\theta$ and $\delta w$ are given by Eq. (6-6-12) and Eq. (6-6-8), it turns out after much arithmetic that

$$\delta(g^T Q_i g) = -2 g^T Q_i g < 0 \tag{6-7-3}$$

This means that the quantity $g^T Q_i g$ decreases initially as we take a small step in the prescribed direction.

A natural choice of $Q_i$ is the inverse of the covariance matrix of $g$. If $V_w$ is the covariance matrix of the data $w$, then the covariance of $g$ is approximately $A_i V_w A_i^T$. Usually $V_w = H_i^{-1}$, so we choose

$$Q_i = (A_i H_i^{-1} A_i^T)^{-1} = C_i^{-1} \tag{6-7-4}$$

Our strategy at the ith iteration is the following:

1. Compute $Z_0 \equiv g_i^T C_i^{-1} g_i$.
2. Compute $\delta\theta$ and $\delta w$ using Eq. (6-6-12), Eq. (6-6-10), and Eq. (6-6-8).
3. Compute $Z_1 \equiv g^T(\theta_i + \delta\theta, \hat{w}_i + \delta w)\, C_i^{-1}\, g(\theta_i + \delta\theta, \hat{w}_i + \delta w)$.
4. If $Z_1 < Z_0$, set $\theta_{i+1} = \theta_i + \delta\theta$, $\hat{w}_{i+1} = \hat{w}_i + \delta w$ and proceed to the next iteration. Otherwise, use interpolation (Section 5-14) to shorten $\delta\theta$ and $\delta w$, and return to step 3.

Note that a different weighting matrix $C_i^{-1}$ is used in each iteration. Therefore, we can compare $Z_1$ and $Z_0$ in any given iteration, but not from one iteration to the next.

At the end of the calculations, we can apply the necessary condition Eq. (3-6-6) as a test of convergence.
In terms of our present variables, the conditions take the form

$$B^T \lambda = 0, \qquad q + A^T \lambda = 0 \tag{6-7-5}$$

These, along with the original equations

$$g = 0 \tag{6-7-6}$$

must be satisfied by the final values $\theta^*$, $\hat{w}^*$, $\lambda^*$.
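Steps 1, 3, and 4 of the monitoring strategy can be sketched as below. Two simplifications are assumed and should be read as such: the step is shortened by repeated halving rather than by the interpolation procedure of Section 5-14, and the trial step is supplied by the caller instead of being computed from Eqs. (6-6-8) to (6-6-12). The function name `monitored_step`, the constraint $g = \arctan\theta - \hat{w}$, and the trial step are all made up for the illustration; the example is chosen so that the full step overshoots.

```python
import math


def monitored_step(g, C_inv, theta, w, dtheta, dw, max_halvings=30):
    """Shorten (dtheta, dw) until Z = g^T C^-1 g falls below its current value.

    g(theta, w) returns the list of constraint values; C_inv is the fixed
    weighting matrix of the current iteration (Eq. 6-7-4).
    """
    def Z(t, ww):
        gv = g(t, ww)
        n = len(gv)
        return sum(gv[a] * C_inv[a][b] * gv[b]
                   for a in range(n) for b in range(n))

    z0 = Z(theta, w)                          # step 1
    rho = 1.0
    for _ in range(max_halvings):
        t_new = [t + rho * d for t, d in zip(theta, dtheta)]
        w_new = [x + rho * d for x, d in zip(w, dw)]
        if Z(t_new, w_new) < z0:              # steps 3 and 4
            return t_new, w_new, rho
        rho *= 0.5                            # halving replaces interpolation
    raise RuntimeError("no decrease in Z found")


# Made-up single constraint g = atan(theta) - w_hat.
g = lambda t, ww: [math.atan(t[0]) - ww[0]]
theta, w = [2.0], [0.0]
# Hypothetical trial step: a Newton step on g = 0 in theta alone.
dtheta = [-math.atan(2.0) * (1.0 + 2.0 ** 2)]
dw = [0.0]
theta_new, w_new, rho = monitored_step(g, [[1.0]], theta, w, dtheta, dw)
print(rho, theta_new)  # the full step is rejected; half of it is accepted
```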
6-8. Some Special Cases

Consider an objective function of the weighted least squares form

$$\Phi(\hat{w}) = \tfrac{1}{2} \sum_{\mu=1}^{n} (\hat{w}_\mu - w_\mu)^T V_\mu^{-1} (\hat{w}_\mu - w_\mu) \tag{6-8-1}$$

Therefore:

$$q = \begin{bmatrix} V_1^{-1}(\hat{w}_1 - w_1) \\ V_2^{-1}(\hat{w}_2 - w_2) \\ \vdots \\ V_n^{-1}(\hat{w}_n - w_n) \end{bmatrix} \tag{6-8-2}$$

$$H = \begin{bmatrix} V_1^{-1} & & & 0 \\ & V_2^{-1} & & \\ & & \ddots & \\ 0 & & & V_n^{-1} \end{bmatrix} \tag{6-8-3}$$

The other interesting objective function is typified by Eq. (4-10-12), and occurs when a subset of variables $y_\mu$ (not exceeding $m$ in number) have unknown covariance, while the remaining (if any) variables $x_\mu$ have known covariance $P_\mu$. We have then (assuming zero correlation between $x_\mu$ and $y_\mu$)

$$\Phi(\hat{x}, \hat{y}) = (n/2) \log \det \sum_{\mu=1}^{n} (\hat{y}_\mu - y_\mu)(\hat{y}_\mu - y_\mu)^T + \tfrac{1}{2} \sum_{\mu=1}^{n} (\hat{x}_\mu - x_\mu)^T P_\mu^{-1} (\hat{x}_\mu - x_\mu)$$

The vector $q$ is made up of elements

$$\partial\Phi / \partial\hat{x}_\mu = P_\mu^{-1}(\hat{x}_\mu - x_\mu), \qquad \partial\Phi / \partial\hat{y}_\mu = n M_{yy}^{-1}(\hat{y}_\mu - y_\mu) \tag{6-8-4}$$

The Hessian $H$ has a complicated form, but we may use the Gauss approximation in which $(1/n)M_{yy}$ replaces the covariance of $y_\mu$ where required. We take then

$$\partial^2\Phi / \partial\hat{x}_\mu\, \partial\hat{x}_\mu = P_\mu^{-1}, \qquad \partial^2\Phi / \partial\hat{x}_\mu\, \partial\hat{y}_\mu = 0, \qquad \partial^2\Phi / \partial\hat{y}_\mu\, \partial\hat{y}_\mu \approx n M_{yy}^{-1} = n \left[ \sum_{\mu=1}^{n} (\hat{y}_\mu - y_\mu)(\hat{y}_\mu - y_\mu)^T \right]^{-1} \tag{6-8-5}$$

In the sequel, then, we shall assume that Eqs. (2) and (3) apply always, with (in the case of partly unknown covariance)

$$V_\mu = \begin{bmatrix} P_\mu & 0 \\ 0 & (1/n) M_{yy} \end{bmatrix} \tag{6-8-6}$$

Note that $P_\mu$ remains the same, but $(1/n)M_{yy}$ varies from iteration to iteration. If the unknown portions of $V$ are assumed diagonal, then we only use the corresponding diagonal elements of $(1/n)M$, and substitute zeroes for the off-diagonal elements.
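We now work out the details of the algorithm. Letting

$$e_\mu \equiv \hat{w}_\mu - w_\mu \tag{6-8-7}$$

we note that:

$$q = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix} \quad \text{where } q_\mu \equiv V_\mu^{-1} e_\mu \tag{6-8-8}$$

$$H^{-1} = \begin{bmatrix} V_1 & & & 0 \\ & V_2 & & \\ & & \ddots & \\ 0 & & & V_n \end{bmatrix} \tag{6-8-9}$$

$$g = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{bmatrix} \tag{6-8-10}$$

$$A = \begin{bmatrix} A_1 & & & 0 \\ & A_2 & & \\ & & \ddots & \\ 0 & & & A_n \end{bmatrix} \quad \text{where } A_\mu \equiv \partial g_\mu / \partial \hat{w}_\mu \tag{6-8-11}$$

$$B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix} \quad \text{where } B_\mu \equiv \partial g_\mu / \partial\theta \tag{6-8-12}$$

so that:

$$C = A H^{-1} A^T = \begin{bmatrix} C_1 & & & 0 \\ & C_2 & & \\ & & \ddots & \\ 0 & & & C_n \end{bmatrix} \quad \text{where } C_\mu \equiv A_\mu V_\mu A_\mu^T \tag{6-8-13}$$

$$D = B^T C^{-1} B = \sum_{\mu=1}^{n} B_\mu^T C_\mu^{-1} B_\mu \tag{6-8-14}$$

Computations which we leave as an exercise result in:

$$\delta\theta = D^{-1} \sum_{\mu=1}^{n} B_\mu^T C_\mu^{-1} (A_\mu e_\mu - g_\mu) \tag{6-8-15}$$

$$\lambda_\mu = C_\mu^{-1}(B_\mu\, \delta\theta - A_\mu e_\mu + g_\mu) \tag{6-8-16}$$

$$\delta w_\mu = -e_\mu - V_\mu A_\mu^T \lambda_\mu \tag{6-8-17}$$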
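To make Eqs. (6-8-13) to (6-8-17) concrete, the sketch below applies them to a made-up errors-in-variables problem: a straight line through the origin, $\hat{y}_\mu = \theta \hat{x}_\mu$, with one constraint equation per experiment and a scalar $\theta$, so that each $C_\mu$ and $D$ reduce to plain numbers. The function name `deming_line`, the data, the diagonal covariance with variances `vx` and `vy`, and the fixed iteration count are all assumptions of this example. No step shortening is attempted. For equal variances the result is compared with the closed-form orthogonal-regression slope.

```python
import math


def deming_line(x, y, theta, vx=1.0, vy=1.0, iterations=25):
    """Fit y = theta * x with errors in both variables.

    Each experiment contributes one equation g = y_hat - theta * x_hat = 0,
    and V_mu = diag(vx, vy) for every experiment.
    """
    xh, yh = list(x), list(y)                  # initial guess: w_hat = w
    for _ in range(iterations):
        D = 0.0
        rhs = 0.0
        terms = []
        for xo, yo, xc, yc in zip(x, y, xh, yh):
            e = (xc - xo, yc - yo)             # Eq. (6-8-7)
            g = yc - theta * xc                # constraint value
            a = (-theta, 1.0)                  # A_mu = dg/dw_hat
            b = -xc                            # B_mu = dg/dtheta
            c = a[0] ** 2 * vx + a[1] ** 2 * vy    # C_mu, Eq. (6-8-13)
            ae = a[0] * e[0] + a[1] * e[1]     # A_mu e_mu
            D += b * b / c                     # Eq. (6-8-14)
            rhs += b * (ae - g) / c
            terms.append((e, g, a, b, c, ae))
        dtheta = rhs / D                       # Eq. (6-8-15)
        for i, (e, g, a, b, c, ae) in enumerate(terms):
            lam = (b * dtheta - ae + g) / c    # Eq. (6-8-16)
            xh[i] += -e[0] - vx * a[0] * lam   # Eq. (6-8-17)
            yh[i] += -e[1] - vy * a[1] * lam
        theta += dtheta
    return theta, xh, yh


x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]
theta, xh, yh = deming_line(x, y, theta=1.0)

# Closed-form slope of orthogonal regression through the origin (vx == vy).
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)
sxy = sum(a * b for a, b in zip(x, y))
theta_star = ((syy - sxx) + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
print(theta, theta_star)
```

The adjusted points `(xh, yh)` satisfy the constraint at convergence, which is the sense in which the model is "exact".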
The pseudo-objective function of Section 6-7 takes the form

$$Z = G^T Q G = \sum_{\mu=1}^{n} g_\mu^T C_\mu^{-1} g_\mu \tag{6-8-18}$$

Further simplifications occur in the single equation case. Here $g_\mu$ is a single number, while $A_\mu$ and $B_\mu$ are row vectors $a_\mu^T$ and $b_\mu^T$ respectively. Hence:

$$a_\mu \equiv \partial g_\mu/\partial\hat{w}_\mu, \qquad b_\mu \equiv \partial g_\mu/\partial\theta \tag{6-8-19}$$

$$C = \mathrm{diag}(c_\mu), \quad \text{where } c_\mu \equiv a_\mu^T V_\mu a_\mu \tag{6-8-20}$$

$$D = \sum_{\mu=1}^{n} (1/c_\mu)\, b_\mu b_\mu^T \tag{6-8-21}$$

$$\delta\theta = D^{-1}\sum_{\mu=1}^{n} (1/c_\mu)(a_\mu^T e_\mu - g_\mu)\, b_\mu \tag{6-8-22}$$

$$\lambda_\mu = (1/c_\mu)(b_\mu^T\,\delta\theta - a_\mu^T e_\mu + g_\mu) \tag{6-8-23}$$

$$\delta\hat{w}_\mu = -e_\mu - \lambda_\mu V_\mu a_\mu \tag{6-8-24}$$

$$Z = g^T Q g = \sum_{\mu=1}^{n} (1/c_\mu)\, g_\mu^2 \tag{6-8-25}$$

The algorithm is always started with some initial guess, $\theta_1$, and with $\hat{w}_\mu = w_\mu$. A difficulty arises when the covariance matrices $V_\mu$ are at least partly unknown. For then, from Eq. (6), certain rows and columns of $V_\mu$ are taken from $(1/n)M$; but in the first iteration $M = 0$, causing $V_\mu$ to be singular. Furthermore, a glance at Eq. (17) reveals that all components of $\delta\hat{w}_\mu$ corresponding to the $y$ variables will be zero. The difficulty is easily overcome by arbitrarily assigning to the unknown elements of $V$ some reasonable initial guesses for the first iteration only. The method is illustrated in Sections 6-13 and 6-14.

6-9. Penalty Functions

The idea of penalty functions can also be applied to equality constraints. Here we penalize values of the unknown $\phi$ (including both $\theta$ and $\hat{w}$) according to their deviations from the constraints. To the objective function $\Phi(\phi)$ we add a term proportional to $g_j^2(\phi)$ for each equality constraint $g_j = 0$. SUMT (sequential unconstrained minimization technique) as applied to problems including both equality and inequality constraints consists of defining the objective functions (Fiacco and McCormick 1965, 1967)

$$\Phi_k^\dagger(\phi) \equiv \Phi(\phi) + \alpha_k \sum_j 1/h_j(\phi) + \alpha_k^{-1/2} \sum_j g_j^2(\phi) \tag{6-9-1}$$
where $\alpha_1, \alpha_2, \ldots$ is a sequence of decreasing positive numbers converging to zero. Let $\phi_k^*$ be the value of $\phi$ which minimizes $\Phi_k^\dagger$; then, under suitable assumptions on the convexity of the feasible region and the concavity of the functions, Fiacco and McCormick prove the convergence of the sequence $\phi_1^*, \phi_2^*, \ldots$ to the minimum of $\Phi(\phi)$ satisfying the constraints

$$h_j(\phi) \geq 0 \quad (j = 1, 2, \ldots), \qquad g_j(\phi) = 0 \quad (j = 1, 2, \ldots) \tag{6-9-2}$$

The application of the Gauss method to the minimization of $\Phi_k^\dagger$ is obvious.

6-10. Linear Equality Constraints

If the unknown parameters are supposed to satisfy linear equality relationships, these can be handled by means of the projection method of Section 6-2. All we need to do is include permanently the equality constraints in the binding set by sweeping the corresponding rows of $E$. All tests on the sign of the last element in a row belonging to an equality constraint are omitted. The initial guess must be chosen so that it satisfies all the constraints.

6-11. Least Squares Problem with Penalty Functions

We return to the single equation least squares problem of Section 5-21. We recall that we encountered some difficulties in converging to the solution when starting from the initial guess $\theta_1 = [100, 2000]^T$. We shall attempt to overcome these difficulties by imposing bounds on the parameters. Specifically let us require that

$$0 \leq \theta_1 \leq 100{,}000, \qquad 0 \leq \theta_2 \leq 2{,}000{,}000$$

corresponding to the constraints:

$$h_1(\theta) \equiv \theta_1 \geq 0, \quad h_3(\theta) \equiv \theta_2 \geq 0, \quad h_2(\theta) \equiv 100{,}000 - \theta_1 \geq 0, \quad h_4(\theta) \equiv 2{,}000{,}000 - \theta_2 \geq 0 \tag{6-11-1}$$

According to Eq. (6-1-2) we form the penalty function

$$\zeta(\theta) \equiv \sum_{j=1}^{4} \alpha_j/h_j(\theta) = 0.01/\theta_1 + 0.01/(100{,}000 - \theta_1) + 0.2/\theta_2 + 0.2/(2{,}000{,}000 - \theta_2) \tag{6-11-2}$$

The coefficients $\alpha_j$ were determined as $10^{-4}$ times the initial guess for the corresponding parameter. Our new objective function is

$$\Phi^\dagger(\theta) = \sum_{\mu=1}^{15} e_\mu^2(\theta) + \zeta(\theta) \tag{6-11-3}$$
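The penalty term of Eq. (6-11-2), together with the first and second derivatives that a Gauss-type method needs, can be sketched as below. This is only an illustration under stated assumptions: the function name `bound_penalty` and its argument layout are hypothetical, and the same coefficient is applied to the lower and upper bound of each parameter, as in Eq. (6-11-2).

```python
def bound_penalty(theta, lower, upper, alpha):
    """Inverse-barrier penalty for simple bounds lower_i <= theta_i <= upper_i.

    Returns (value, gradient, diagonal of the Hessian). The Hessian of this
    penalty is diagonal because each term involves a single parameter.
    """
    value, grad, hess_diag = 0.0, [], []
    for t, lo, hi, a in zip(theta, lower, upper, alpha):
        d_lo, d_hi = t - lo, hi - t
        if d_lo <= 0.0 or d_hi <= 0.0:
            raise ValueError("theta must lie strictly inside the bounds")
        value += a / d_lo + a / d_hi
        grad.append(-a / d_lo**2 + a / d_hi**2)
        hess_diag.append(2.0 * a / d_lo**3 + 2.0 * a / d_hi**3)
    return value, grad, hess_diag


# Initial guess and bounds of this section.
zeta, dzeta, d2zeta = bound_penalty(
    [100.0, 2000.0], [0.0, 0.0], [1.0e5, 2.0e6], [0.01, 0.2]
)
```

At the initial guess the lower-bound terms dominate, because the upper bounds are several orders of magnitude farther away.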
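Using Eqs. (6-1-7) and (6-1-9) we find:

$$q^\dagger = \begin{bmatrix} -2\sum_{\mu=1}^{15} e_\mu \dfrac{\partial f_\mu}{\partial\theta_1} - \dfrac{0.01}{\theta_1^2} + \dfrac{0.01}{(100{,}000 - \theta_1)^2} \\[2ex] -2\sum_{\mu=1}^{15} e_\mu \dfrac{\partial f_\mu}{\partial\theta_2} - \dfrac{0.2}{\theta_2^2} + \dfrac{0.2}{(2{,}000{,}000 - \theta_2)^2} \end{bmatrix} \tag{6-11-4}$$

$$N^\dagger = \begin{bmatrix} 2\sum_{\mu=1}^{15} \Big(\dfrac{\partial f_\mu}{\partial\theta_1}\Big)^2 + \dfrac{0.02}{\theta_1^3} + \dfrac{0.02}{(100{,}000 - \theta_1)^3} & 2\sum_{\mu=1}^{15} \dfrac{\partial f_\mu}{\partial\theta_1}\dfrac{\partial f_\mu}{\partial\theta_2} \\[2ex] 2\sum_{\mu=1}^{15} \dfrac{\partial f_\mu}{\partial\theta_1}\dfrac{\partial f_\mu}{\partial\theta_2} & 2\sum_{\mu=1}^{15} \Big(\dfrac{\partial f_\mu}{\partial\theta_2}\Big)^2 + \dfrac{0.4}{\theta_2^3} + \dfrac{0.4}{(2{,}000{,}000 - \theta_2)^3} \end{bmatrix} \tag{6-11-5}$$

For the first iteration, using the data of Eq. (5-21-10), we find:

$$\Phi^\dagger = 5.299702$$

$$q_1^\dagger = \begin{bmatrix} -0.0007098080 - \dfrac{0.01}{100^2} \\[1.5ex] 0.0002442936 - \dfrac{0.2}{2000^2} \end{bmatrix} = \begin{bmatrix} -0.0007108080 \\ 0.0002442436 \end{bmatrix}$$

$$N_1^\dagger = \begin{bmatrix} 0.7036033 \times 10^{-7} + \dfrac{0.02}{100^3} & -0.2354773 \times 10^{-7} \\[1.5ex] -0.2354773 \times 10^{-7} & 0.07896382 \times 10^{-7} + \dfrac{0.4}{2000^3} \end{bmatrix} = \begin{bmatrix} 0.9036033 & -0.2354773 \\ -0.2354773 & 0.07946382 \end{bmatrix} \times 10^{-7}$$

$$v_1^\dagger = -(N_1^\dagger)^{-1} q_1^\dagger = \begin{bmatrix} -629.9713 \\ -32604.29 \end{bmatrix}$$

(The upper-bound terms are negligible here and have been omitted from the numerical values.) Comparing $v_1^\dagger$ to $v_1$ given by Eq. (5-21-11) we see that although the penalty function has but a small effect on the value of $\Phi$, it has the power to turn the step direction away from the troublesome $\theta_1 = 0$ axis. We compute now the largest value of $\rho$ for which $\theta_1 + \rho v_1^\dagger$ is feasible:

$$\rho_{\max} = \min\{(100 - 0)/629.9713,\ (2000 - 0)/32604.29\} = 0.06134162$$

Following the flowchart of Fig. 5-2a, we initially try

$$\rho^{(0)} = 0.5\rho_{\max} = 0.03067081$$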
for which

$$\theta = \begin{bmatrix} 100 - 0.03067081 \times 629.9713 \\ 2000 - 0.03067081 \times 32604.29 \end{bmatrix} = \begin{bmatrix} 80.67827 \\ 1000 \end{bmatrix}$$

where $\Phi^\dagger = 3.652150$, an acceptable value. We proceed, however, to extrapolate according to Fig. 5-2b, and obtain

$$\theta = \begin{bmatrix} 80.67828 - 0.5 \times 0.03067081 \times 629.9713 \\ 1000 - 0.5 \times 0.03067081 \times 32604.29 \end{bmatrix} = \begin{bmatrix} 71.01740 \\ 500 \end{bmatrix}$$

with $\Phi^\dagger = 0.5349855$. This much improved value is the basis for starting the second iteration. We converge after nine iterations (plotted in Fig. 5-3) of the Gauss method to

$$\theta = \begin{bmatrix} 814.4814 \\ 961.1797 \end{bmatrix}, \qquad \Phi^\dagger = 0.04002851$$

At this point, $\Phi = 0.03980605$. One further iteration without penalty functions leads to

$$\theta = \begin{bmatrix} 813.8381 \\ 960.9944 \end{bmatrix}, \qquad \Phi = 0.03980603$$

which is close to the solution obtained in 24 iterations without penalty functions.

6-12. Least Squares Problem-Projection Method

Suppose we try to solve the problem of the previous section, but using the projection method in place of penalty functions. A glance at Fig. 5-3 shows that even in the absence of any constraints, the Gauss method never carries one actually as far as the $\theta_1 = 0$ axis. Hence, no occasion to project a step into any of the constraints Eq. (6-11-1) arises, and the projection method does not affect the course of the iterations (other than avoiding some of the futile function evaluations in the unfeasible region). Let us, however, change the constraints to read

$$\theta_1 \geq 100, \qquad \theta_2 \geq 0 \tag{6-12-1}$$

We now apply the algorithm of Section 6-3:

1. $R = N^{-1}$ is obtained by inversion from Eq. (5-21-10) and $v^{(0)}$ from Eq. (5-21-11). Since only $\theta_1$ is at a bound, only the first row and column are taken from $R$, and only the first element from $v^{(0)}$, to form $E = [0.7199572 \times 10^{10},\ -134608.0]$.
2. Since $\theta_1$ is at its lower bound, $m_1 = 1$.
3. $k_1 = 1$.
4. $a = m_1 k_1 e_1 = -134608 < -10^{-6}$.
5. $r = 1$. Pivot on $E_{11}$ to obtain $E = [0.1388971 \times 10^{-9},\ -0.1869666 \times 10^{-4}]$, $k_1 = -1$.
4. $a = m_1 k_1 e_1 = 0.1869666 \times 10^{-4} > -10^{-6}$.
7. Since $k_1 = -1$, the bound on $\theta_1$ is binding. Hence $v_1 = 0$ ($\theta_1$ cannot change) and $v_2 = R_{12}\lambda_1 + v_2^{(0)} = 2.146975 \times 10^{10} \times 0.1869666 \times 10^{-4} - 432361.0 = -30948.37$.

We now find $\rho_{\max} = 2000/30948.37$, and try $\sigma = \rho_{\max} v_1 = [0, -2000]^T$. This brings us to $\theta = [100, 0]^T$ and $\Phi = 6.028293$, which is unacceptable. Interpolation forces $\rho = 0.25\rho_{\max}$ with consequent $\theta = [100, 1500]^T$, $\Phi = 4.975522$, which is acceptable. At this point the tableau reads $E = [0.5366083 \times 10^{8},\ -7951.062]$. Row 1 must be swept to produce $E = [0.1863556 \times 10^{-7},\ -0.1481725 \times 10^{-3}]$ and $v_2 = [0, -6144.531]^T$. Only in the fourth iteration with $\theta = [100, 656.7756]^T$ do we obtain the tableau $E = [93635.87,\ 196.3826]$ which does not require sweeping, so that the lower bound on $\theta_1$ ceases to be binding. In this iteration $v_4 = [196.3826, 269.1018]^T$. Convergence occurs in ten iterations, which are also plotted in Fig. 5-3.

6-13. Independent Variables Subject to Error

Let us take once more the model of Eq. (5-21-5), but assume now that in addition to $y$, also $x_1$ and $x_2$ are subject to measurement errors. We shall use the exact structural model approach, writing

$$g_\mu \equiv g(\hat{w}_{\mu 1}, \hat{w}_{\mu 2}, \hat{w}_{\mu 3}, \theta_1, \theta_2) = \exp[-\theta_1 \hat{w}_{\mu 1} \exp(-\theta_2/\hat{w}_{\mu 2})] - \hat{w}_{\mu 3} \tag{6-13-1}$$

where $\hat{w}_{\mu 1}$, $\hat{w}_{\mu 2}$, $\hat{w}_{\mu 3}$ represent the "true" but unknown values of the measured quantities $w_{\mu 1} \equiv x_{\mu 1}$, $w_{\mu 2} \equiv x_{\mu 2}$, $w_{\mu 3} \equiv y_\mu$.
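The constraint function of Eq. (6-13-1) and the derivative vectors $a_\mu = \partial g_\mu/\partial\hat{w}_\mu$ and $b_\mu = \partial g_\mu/\partial\theta$ required by the method of Section 6-8 can be sketched as follows. The function name `g_and_derivatives` is hypothetical, and the sample point (time 0.1, temperature 100, observed value 0.98) is an illustrative assumption, since the data of Section 5-21 are not reproduced here.

```python
import math


def g_and_derivatives(w_hat, theta):
    """Constraint g of Eq. (6-13-1) with its derivatives.

    w_hat = (w1, w2, w3) : current estimates of time, temperature, response
    theta = (theta1, theta2)
    Returns (g, a, b) with a = dg/dw_hat (3 values), b = dg/dtheta (2 values).
    """
    w1, w2, w3 = w_hat
    t1, t2 = theta
    k = math.exp(-t2 / w2)            # Arrhenius-type factor exp(-theta2 / w2)
    f = math.exp(-t1 * w1 * k)        # model prediction
    g = f - w3
    a = [
        -f * t1 * k,                        # dg/dw1
        -f * t1 * w1 * k * t2 / (w2 * w2),  # dg/dw2
        -1.0,                               # dg/dw3
    ]
    b = [
        -f * w1 * k,                        # dg/dtheta1
        f * t1 * w1 * k / w2,               # dg/dtheta2
    ]
    return g, a, b


g_1, a_1, b_1 = g_and_derivatives((0.1, 100.0, 0.98), (750.0, 1200.0))
```

Because $\hat{w}_{\mu 3}$ enters the constraint linearly, the third component of $a_\mu$ is always $-1$.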
We assume all measurement errors to be independent, and form the objective function

$$\tfrac{1}{2}\sum_{a=1}^{3} (1/v_a) \sum_{\mu=1}^{15} (\hat{w}_{\mu a} - w_{\mu a})^2$$

Since we have only one equation, it follows from the discussion in Section 4-10 that a maximum likelihood estimate can be obtained for only one of the variances $v_a$. Let us then assign standard deviations of 0.01 and 0.5 to the time ($w_{\mu 1}$) and temperature ($w_{\mu 2}$) measurements, respectively, and let the variance of the $w_{\mu 3}$ measurements remain unknown. Following Eq. (4-10-12), the concentrated objective function, from which $v_3$ has been eliminated, takes the form

$$\Phi(\hat{w}) = (15/2)\log S_3 + \tfrac{1}{2}\big((1/0.0001)S_1 + (1/0.25)S_2\big) \tag{6-13-2}$$
where

$$S_a \equiv \sum_{\mu=1}^{15} (\hat{w}_{\mu a} - w_{\mu a})^2 \qquad (a = 1, 2, 3) \tag{6-13-3}$$

Because we take as initial guesses $\hat{w}_{\mu a} = w_{\mu a}$, it follows that $S_3 = 0$ initially, and neither $\Phi$ nor its derivatives can be evaluated. The solution suggested in Section 6-8 is to take for the first iteration only the objective function

$$\Phi^{(0)}(\hat{w}) = \tfrac{1}{2}\big((1/0.0001)S_1 + (1/0.25)S_2 + (1/v^{(0)})S_3\big) \tag{6-13-4}$$

where $v^{(0)}$ is an initial guess for the variance of $w_{\mu 3}$. We take $v^{(0)} = 0.01^2 = 0.0001$. As our initial guess for $\theta$ we take $\theta_1 = [750, 1200]^T$.

The solution is obtained iteratively, using Eqs. (6-8-19)-(6-8-25). We present details of the first iteration. First we compute the vectors $a_\mu = (\partial g_\mu/\partial\hat{w}_\mu)$ and $b_\mu = (\partial g_\mu/\partial\theta)$ for $\mu = 1, 2, \ldots, 15$. The values of $b_\mu$ already appear in the last two columns of Table 5-3. In Table 6-1 we list the values of $a_\mu$, as well as $g_\mu$ (from Eq. (1)) and of $c_\mu$ which according to Eq. (6-8-20) is given by

$$c_\mu = a_\mu^T V_\mu a_\mu = 0.0001\,a_{\mu 1}^2 + 0.25\,a_{\mu 2}^2 + v^{(0)} a_{\mu 3}^2$$

Table 6-1
First Iteration Data (a_mu = dg_mu/dw_mu)

 mu   g_mu = g(theta, w_mu)   a_mu1          10^4 a_mu2    a_mu3   10^3 c_mu
  1   0.01953936              -0.004606035    -0.5527238    -1     0.1000029
  2   0.01607883              -0.004603911    -1.104938     -1     0.1000052
  3   0.04361850              -0.004601792    -1.656644     -1     0.1000090
  4   0.01915848              -0.004599672    -2.207842     -1     0.1000143
  5   0.004698634             -0.004597552    -2.758531     -1     0.1000211
  6   0.2852362               -1.694046      -25.41069      -1     0.3885932
  7   0.2863515               -1.543676      -46.31024      -1     0.3436550
  8   0.3016463               -1.406653      -63.29931      -1     0.3078843
  9   0.4644834               -1.281794      -76.90754      -1     0.2790862
 10   0.4612821               -1.168016      -87.60117      -1     0.2556110
 11   0.1937739              -10.43681       -27.83145      -1    10.9946
 12   0.2602563               -7.929612      -42.29121      -1     6.392341
 13   0.4045841               -6.024710      -48.19766      -1     3.735518
 14   0.3172247               -4.577417      -48.82574      -1     2.201234
 15   0.1871758               -3.477806      -46.37070      -1     1.314888
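As a small check on the last column of Table 6-1, the scalar $c_\mu$ of Eq. (6-8-20) can be evaluated for a diagonal covariance matrix as sketched below. The function name `c_scalar` is hypothetical; the input values are those of row 1 of the table, with the first-iteration variances 0.0001, 0.25 and $v^{(0)} = 0.0001$.

```python
def c_scalar(a, variances):
    """c = a^T V a for a diagonal V given by its diagonal entries."""
    return sum(v * ai * ai for ai, v in zip(a, variances))


# Row 1 of Table 6-1; the second entry is tabulated as 10^4 * a_mu2.
a_1 = [-0.004606035, -0.5527238e-4, -1.0]
variances = [0.0001, 0.25, 0.0001]
c_1 = c_scalar(a_1, variances)
```

For the low-temperature rows the $a_{\mu 3}$ term dominates, which is why $10^3 c_\mu$ stays so close to 0.1 there.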
We now compute $D = \sum_{\mu=1}^{15} (1/c_\mu)\, b_\mu b_\mu^T$. The first term ($\mu = 1$) in the sum is

$$\frac{1}{0.1000029 \times 10^{-3}} \begin{bmatrix} (0.6141379)^2 \times 10^{-12} & -0.6141379 \times 0.4606032 \times 10^{-11} \\ -0.6141379 \times 0.4606032 \times 10^{-11} & (0.4606032)^2 \times 10^{-10} \end{bmatrix}$$

and the result is

$$D = \begin{bmatrix} 0.001794073 & -0.006267238 \\ -0.006267238 & 0.02235472 \end{bmatrix}$$

Using Eq. (6-8-22) we compute $\delta\theta$. In the first iteration $e_\mu = 0$. The first term of the sum is thus $-(g_1/c_1)b_1$, and

$$\delta\theta = \begin{bmatrix} -846.9727 \\ -563.7969 \end{bmatrix}$$

It is now easy to compute $\lambda_\mu$ and $\delta\hat{w}_\mu$ using Eqs. (6-8-23) and (6-8-24) in turn. Table 6-2 contains the results.

Table 6-2
First Iteration Results

 mu   lambda_mu     dw_mu1            dw_mu2          dw_mu3
  1    174.6214      0.00008043119     0.002412934     0.01746213
  2    119.2671      0.00005490948     0.003294569     0.01192671
  3    373.9075      0.0001720644      0.01548579      0.03739074
  4    108.6156      0.00004995962     0.005995154     0.01086156
  5    -56.64597    -0.00002604326    -0.003906488    -0.005664594
  6    365.7187      0.06195441        0.2323291       0.03657187
  7     74.25694     0.01146286        0.08597142      0.007425692
  8   -178.2293     -0.02507068       -0.2820446      -0.01782293
  9    112.2150      0.01438364        0.2157544       0.01122149
 10   -125.6333     -0.01467417       -0.2751405      -0.01256333
 11      3.384982    0.003532839       0.002355224     0.0003384980
 12      3.497939    0.002773728       0.003698302     0.0003497938
 13     35.72807     0.02152512        0.04305023      0.003572807
 14     19.33928     0.008852392       0.02360636      0.001933928
 15    -56.02644    -0.01948491       -0.06494963     -0.005602643

Finally, we evaluate our test function $Z$ (Eq. (6-8-25)), whose value turns out to be 2508.496. After modifying $\theta$ and $\hat{w}_\mu$ by adding $\delta\theta$ and $\delta\hat{w}_\mu$ we recompute the value of $Z$, which has now increased to 59729.57. The step is unacceptable, and we must interpolate. The method of Fig. 5-2 should be used,
but for simplicity we just cut the step in half ($\rho = 0.5$) to obtain the new value $\theta_2 = [533.1387, 894.4258]^T$, and $\hat{w}_\mu$ as given in Table 6-3 along with the new values of $g_\mu$. The corresponding value of $Z$ is 908.2344, which is smaller than the initial value and hence acceptable. We are ready now for the second iteration, for which we first compute the new value of

$$v_3 = (1/15)\sum_{\mu=1}^{15} (\hat{w}_{\mu 3} - w_{\mu 3})^2 = 0.6729098 \times 10^{-4}$$

Table 6-3
Start of Second Iteration

 mu   g_mu            w_mu1         w_mu2      w_mu3
  1    0.007910669    0.1000401     100.0012   0.9887310
  2    0.004332781    0.2000274     100.0016   0.9889632
  3    0.01625860     0.3000859     100.0077   0.9736953
  4    0.002205729    0.4000249     100.0030   0.9844307
  5   -0.006835222    0.4999869      99.99805  0.9901676
  6    0.1198401      0.08097714    200.1161   0.6442859
  7    0.1565019      0.1057313     200.0430   0.5477127
  8    0.1889961      0.1374646     199.8590   0.4460884
  9    0.2718676      0.2071918     200.1079   0.2306107
 10    0.2879401      0.2426628     199.8624   0.1607183
 11    0.1505129      0.02176642    300.0010   0.5661692
 12    0.2136052      0.04138686    300.0017   0.3171748
 13    0.3027226      0.07076252    300.0212   0.03578640
 14    0.2576900      0.08442611    300.0118   0.01696696
 15    0.1881614      0.09025747    299.9625   0.06319863

Convergence is obtained in 21 iterations. The final results are given in Table 6-4. The value of the objective function Eq. (2) at the solution is $-32.09045$. The final value of $D$, which we shall need in Section 7-23, is

$$D = \begin{bmatrix} 0.0005472710 & -0.003062132 \\ -0.003062132 & 0.01747675 \end{bmatrix} \tag{6-13-5}$$

We shall also need later the moment matrix of the final residuals

$$M = \sum_{\mu=1}^{15} e_\mu^* e_\mu^{*T} = \sum_{\mu=1}^{15} (\hat{w}_\mu^* - w_\mu)(\hat{w}_\mu^* - w_\mu)^T$$

which, using the data of Table 6-4, we find to be

$$M = \begin{bmatrix} 0.001781781 & 0.009602152 & 0.001960129 \\ 0.009602152 & 0.07176667 & 0.01202689 \\ 0.001960129 & 0.01202689 & 0.0041455559 \end{bmatrix} \tag{6-13-6}$$
Table 6-4
Final Results, with theta* = [1170.861, 1027.773]^T

 mu   w*_mu1        w*_mu2      w*_mu3       lambda*_mu
  1   0.1002329     100.0060    0.9959696     57.79277
  2   0.2001303     100.0067    0.9919683     32.46887
  3   0.3004779     100.0369    0.9879283    119.1740
  4   0.4000721     100.0074    0.9840074     18.15489
  5   0.4998157      99.97632   0.9801232    -46.54858
  6   0.06580788    200.0668    0.6359499     36.13177
  7   0.09067100    199.9456    0.5370256    -25.30307
  8   0.1235170     199.7895    0.4301884    -90.03149
  9   0.2081143     200.1083    0.2386000     49.34750
 10   0.2534879     200.0567    0.1749817     28.97121
 11   0.01497819    299.9978    0.5653576    -23.33610
 12   0.03035963    299.9915    0.3147850     -8.046125
 13   0.07608849    300.0349    0.05511629    76.52017
 14   0.08861703    300.0217    0.03421704    65.96593
 15   0.08537388    299.9644    0.03879805   -98.72794

The slow convergence of the algorithm in this problem is not typical. In many different cases relating to the same problem convergence occurred in fewer than ten iterations.

6-14. An Implicit Equations Model

In Section 5-23 we estimated the parameters $c$ appearing in Eq. (5-23-1) by solving these equations explicitly for the dependent variables $z_1$ and $z_2$. Suppose, however, that explicit solutions were impossible. We could then use the methods of this chapter, by introducing as unknowns the "true" values $\hat{z}_{\mu 1}$ and $\hat{z}_{\mu 2}$, and using the model equations as constraints. Since we do not know the covariance of the errors, we take as our objective function

$$\Phi(\hat{z}) = (41/2)\log\det M = (41/2)\log\Big\{ \sum_{\mu=1}^{41} (\hat{z}_{\mu 1} - z_{\mu 1})^2 \sum_{\mu=1}^{41} (\hat{z}_{\mu 2} - z_{\mu 2})^2 - \Big[\sum_{\mu=1}^{41} (\hat{z}_{\mu 1} - z_{\mu 1})(\hat{z}_{\mu 2} - z_{\mu 2})\Big]^2 \Big\}$$
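The $\hat{z}_\mu$ and $c$ must satisfy the model equations, which become after taking logarithms:

$$\log c_1 + z_{\mu 4} c_2 \log 10 - (c_3/c_4)\log[c_5 \hat{z}_{\mu 1}^{-c_4} + (1 - c_5)\hat{z}_{\mu 2}^{-c_4}] - \log z_{\mu 3} = 0$$

$$\log c_5 - \log(1 - c_5) - (1 + c_4)\log\hat{z}_{\mu 1} + (1 + c_4)\log\hat{z}_{\mu 2} - \log z_{\mu 5} = 0$$

$$(\mu = 1, 2, 3, \ldots, 41)$$

Application of the method of Section 6-8 is now straightforward. We have

$$A_\mu = \frac{\partial g}{\partial\hat{z}_\mu} = \begin{bmatrix} \dfrac{c_3 c_5 \hat{z}_{\mu 1}^{-c_4 - 1}}{\zeta_\mu} & \dfrac{c_3 (1 - c_5)\hat{z}_{\mu 2}^{-c_4 - 1}}{\zeta_\mu} \\[2ex] -\dfrac{1 + c_4}{\hat{z}_{\mu 1}} & \dfrac{1 + c_4}{\hat{z}_{\mu 2}} \end{bmatrix}$$

where $\zeta_\mu \equiv c_5 \hat{z}_{\mu 1}^{-c_4} + (1 - c_5)\hat{z}_{\mu 2}^{-c_4}$, and

$$B_\mu^T \equiv \Big(\frac{\partial g}{\partial c}\Big)^T = \begin{bmatrix} \dfrac{1}{c_1} & 0 \\[1.5ex] z_{\mu 4}\log 10 & 0 \\[1.5ex] -\dfrac{1}{c_4}\log\zeta_\mu & 0 \\[1.5ex] \dfrac{c_3}{c_4^2}\log\zeta_\mu + \dfrac{c_3[c_5 \hat{z}_{\mu 1}^{-c_4}\log\hat{z}_{\mu 1} + (1 - c_5)\hat{z}_{\mu 2}^{-c_4}\log\hat{z}_{\mu 2}]}{c_4 \zeta_\mu} & \log\dfrac{\hat{z}_{\mu 2}}{\hat{z}_{\mu 1}} \\[2ex] -\dfrac{c_3(\hat{z}_{\mu 1}^{-c_4} - \hat{z}_{\mu 2}^{-c_4})}{c_4 \zeta_\mu} & \dfrac{1}{c_5} + \dfrac{1}{1 - c_5} \end{bmatrix}$$

For $V$ we take the value of $(1/41)M$ from the previous iteration. In the first iteration, we use $V_1 = I$. Starting from the initial guess $c = [1, 1, 1, 1, 0.5]^T$, we converged to the same value of $c^*$ as given under case (a), Table 5-8, in twelve iterations.

6-15. Problems

1. Verify Eqs. (6-8-15)-(6-8-17).

2. Prove that if the model equations take the form

$$g(\hat{w}_\mu, \theta) = \hat{w}_\mu - f_\mu(\theta) = 0$$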
then the iterations produced by the method of Section 6-8 are identical to those produced by the Gauss method.

3. Using the data of Section 5-21, find the parameter estimates which minimize the maximum residual (MLE for uniform error distribution). Compare the results to the least squares estimates.
Chapter VII

Interpretation of the Estimates

7-1. Introduction

It is not enough to compute a vector $\theta^*$ and to state that this is the estimated value of the unknown parameters $\theta$. We must also investigate the reliability and precision of our estimates. We wish to answer questions such as "what are the chances that the estimate is off by no more than 1%?" or "how much can we change the estimates and still fit the data well?" There are several ways in which one can go about answering these questions; some of these are of a heuristic nature, while others depend on statistical considerations. We shall present several alternative approaches in the succeeding sections.

Even more important than the question of the reliability of the estimate is that of the reliability of the model itself. This question is answered by goodness of fit criteria and statistical hypothesis testing. We cover these topics only very briefly here, since extensive treatments can be found in the statistical texts. In particular, the reader may consult Anderson (1958) on the topics that are of direct interest to us here, and Lehmann (1959) for a more general treatment.

Some of the statistical tests and estimates of variability that we discuss here apply only approximately to nonlinear models. Refinement of these approximations is often possible [see, e.g., Beale (1960), Hartley (1964), and Guttman and Meeter (1965)], but even with linear models, the tests are exact only if the measurement errors do indeed follow whatever distribution was assumed for them. Since this is rarely if ever so, even so-called "exact" tests are only approximate in practice. Furthermore, we do not feel that the statement "the probability that model A is incorrect is exactly 5%" has greater practical utility than the statement "the probability that model A
is incorrect is approximately 5%." For these reasons, we present only the simplest approximate tests, and leave the reader interested in more exact formulations to consult the cited references.
7-2. Response Surface Techniques

The estimate $\theta^*$ is usually obtained by minimizing or maximizing some function $\Phi(\theta)$. Then $\Phi^* \equiv \Phi(\theta^*)$ is the "best" attainable value of the objective function $\Phi$. Suppose for a moment that $\Phi(\theta)$ is a risk function of decision theory (see Section 4-16), i.e., the value of $\Phi(\theta)$ represents the economic loss that we expect to sustain if we act on the assumption that the parameters have the value $\theta$. In this case $\Phi^*$ is the minimum possible expected loss. However, should some parameter values $\theta \neq \theta^*$ give rise to a risk $\Phi(\theta)$ that is only insignificantly larger than $\Phi^*$, then we have no compelling reason to prefer $\theta^*$ over $\theta$. In fact, let $\varepsilon$ be the largest difference between risks that we are willing to consider insignificant. Then we have no reason to prefer $\theta^*$ over any other value of $\theta$ for which

$$|\Phi(\theta) - \Phi^*| \leq \varepsilon \tag{7-2-1}$$

We refer to the set of values of $\theta$ which satisfy Eq. (1) as the $\varepsilon$-indifference region.

The argument used here may be applied heuristically to any other objective function. The fact that we have elected to minimize a function $\Phi(\theta)$ means that we set some store by obtaining a low value of this function. It is not unreasonable to suppose that values of $\Phi$ almost as low as $\Phi^*$ would satisfy us almost as much as $\Phi^*$. This gives rise to an indifference region in $\theta$ space as described by Eq. (1). The choice of a suitable $\varepsilon$ may be more arbitrary when $\Phi$ is a sum of squares or a likelihood than when it is an economic risk. Once $\varepsilon$ is chosen, however, the analysis is the same in all cases.

When $\Phi$ is continuous and $\theta^*$ its unique unconstrained minimum, the $\varepsilon$-indifference region for a sufficiently small positive $\varepsilon$ is a simply-connected domain surrounding $\theta^*$ in the $l$-dimensional $\theta$ space. The region is bounded by the $(l-1)$-dimensional hypersurface whose equation is
$$\Phi(\theta) = \Phi^* + \varepsilon \tag{7-2-2}$$

We shall restrict our attention to regions of this nature; i.e., we shall ignore the possibility that for a given $\varepsilon$ there may be regions surrounding local minima other than $\theta^*$ in which Eq. (1) holds.

In a sufficiently small neighborhood of $\theta^*$ we may approximate $\Phi$ by means of the first few terms of its Taylor series expansion

$$\Phi(\theta) \approx \Phi^* + q^{*T}\,\delta\theta + \tfrac{1}{2}\,\delta\theta^T H^*\,\delta\theta \tag{7-2-3}$$

where $\delta\theta \equiv \theta - \theta^*$, and $q^*$ and $H^*$ are, respectively, the gradient and Hessian of $\Phi$ at $\theta = \theta^*$. If $\theta^*$ is an unconstrained optimum of $\Phi$, then $q^* = 0$ and Eq. (3) becomes

$$\Phi(\theta) \approx \Phi^* + \tfrac{1}{2}\,\delta\theta^T H^*\,\delta\theta \tag{7-2-4}$$
so that the $\varepsilon$-indifference region is defined, approximately, by

$$|\delta\theta^T H^*\,\delta\theta| \leq 2\varepsilon \tag{7-2-5}$$

Let $A = H^*$ if $\theta^*$ is a minimum ($H^*$ positive definite), and $A = -H^*$ if $\theta^*$ is a maximum ($H^*$ negative definite). In either case, $A$ is positive definite (semidefinite in exceptional cases), and Eq. (5) becomes

$$\delta\theta^T A\,\delta\theta \leq 2\varepsilon \tag{7-2-6}$$

which is the equation of an $l$-dimensional ellipsoid whose volume is $(2\varepsilon\pi)^{l/2} \det^{-1/2} A / \Gamma(l/2 + 1)$. The ellipsoids corresponding to different values of $\varepsilon$ are concentric and similar in shape and orientation, so that much information can be gained from the analysis of the matrix $A$, without regard to the actual value of $\varepsilon$.

We can now answer the question of how much the individual parameter $\theta_\alpha$ can be varied from its optimal value $\theta_\alpha^*$. If we let $\delta\theta_\beta = 0$ for all $\beta \neq \alpha$, Eq. (6) reduces to

$$A_{\alpha\alpha}\,\delta\theta_\alpha^2 \leq 2\varepsilon \tag{7-2-7}$$

so that

$$\theta_\alpha^* - (2\varepsilon/A_{\alpha\alpha})^{1/2} \leq \theta_\alpha \leq \theta_\alpha^* + (2\varepsilon/A_{\alpha\alpha})^{1/2} \tag{7-2-8}$$

This is often written in shorthand notation as $\theta_\alpha = \theta_\alpha^* \pm (2\varepsilon/A_{\alpha\alpha})^{1/2}$. We say that $\theta_\alpha$ is well-determined if the quantity $(2\varepsilon/A_{\alpha\alpha})^{1/2}$ is small on the scale by which $\theta_\alpha$ is measured; $\theta_\alpha$ is ill-determined if $(2\varepsilon/A_{\alpha\alpha})^{1/2}$ is large. Usually we wish the parameters to be well-determined. There are exceptions, though; a design parameter may advantageously be ill-determined for greater flexibility in the implementation of the design. It is important, however, that the ill-determination should be inherent in the design, and not merely the result of poor data.

It is not enough to determine how well the individual parameters are determined. Consider, for instance, the two-dimensional case depicted in Fig. 7-1, with

$$A = \begin{bmatrix} 0.505 & -0.495 \\ -0.495 & 0.505 \end{bmatrix}$$

and $\varepsilon = 0.5$. Here Eq. (6) reduces to

$$0.505\,\delta\theta_1^2 - 0.99\,\delta\theta_1\,\delta\theta_2 + 0.505\,\delta\theta_2^2 \leq 1 \tag{7-2-9}$$

so that with $\delta\theta_2 = 0$ we have $|\delta\theta_1| \leq 1.407$, and with $\delta\theta_1 = 0$ we have $|\delta\theta_2| \leq 1.407$. Thus $\theta_1$ and $\theta_2$ may be varied individually by $\pm 1.407$ without leaving the indifference region. It is clear from Fig. 7-1, however, that if we increase (or decrease) $\theta_1$ and $\theta_2$ simultaneously, much larger changes can be tolerated. In fact, $\delta\theta_1 = \delta\theta_2 = 7.071$ satisfies Eq. (9). On the other hand, if the changes in $\theta_1$ and $\theta_2$ are taken in opposite directions, their bound is lower: $\delta\theta_1 = -\delta\theta_2 = 0.7071$.

[Fig. 7-1. $\varepsilon = 0.5$ uncertainty region for $A = \begin{bmatrix} 0.505 & -0.495 \\ -0.495 & 0.505 \end{bmatrix}$; axes $\delta\theta_1$ and $\delta\theta_2$.]

While $\theta_1$ and $\theta_2$ appear individually well-determined, the quantity $\theta_1 - \theta_2$ is even better determined, but $\theta_1 + \theta_2$ is relatively ill-determined. This implies that we have a wide latitude in choosing, say, a value of $\theta_1$ as long as we adjust $\theta_2$ so that $\theta_1 - \theta_2$ is nearly equal to $\theta_1^* - \theta_2^*$. Another numerical example appears in Section 7-21.
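The three bounds quoted for the two-dimensional example follow from Eq. (7-2-6) applied along a fixed direction: for $\delta\theta = t\,d$ the condition becomes $t^2\,d^T A\,d \leq 2\varepsilon$. A minimal sketch of this calculation is given below; the function names `quad_form` and `max_step_along` are hypothetical.

```python
import math


def quad_form(A, d):
    """Return d^T A d for a square matrix A given as a list of rows."""
    n = len(d)
    return sum(d[i] * A[i][j] * d[j] for i in range(n) for j in range(n))


def max_step_along(A, direction, eps):
    """Largest t such that (t d)^T A (t d) <= 2 eps."""
    return math.sqrt(2.0 * eps / quad_form(A, direction))


A = [[0.505, -0.495], [-0.495, 0.505]]
eps = 0.5
single = max_step_along(A, [1.0, 0.0], eps)     # vary theta_1 alone
together = max_step_along(A, [1.0, 1.0], eps)   # equal changes, same sign
opposite = max_step_along(A, [1.0, -1.0], eps)  # equal changes, opposite sign
```

Here `single`, `together` and `opposite` correspond to the bounds 1.407, 7.071 and 0.7071 of the text, each being the allowed change per parameter along the given direction.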
7-3. Canonical Form

We wish to find the points on the ellipsoid $\delta\theta^T A\,\delta\theta = 2\varepsilon$ which are furthest away from the origin, and also those which are closest. These points determine, respectively, the least-determined and best-determined linear combinations of the parameters.

To find the vector $\delta\theta$ satisfying $\delta\theta^T A\,\delta\theta = 2\varepsilon$ which maximizes or minimizes $\delta\theta^T\delta\theta$, i.e., the squared distance from the origin, we introduce the Lagrange multiplier $\rho$, and look for the stationary points of

$$\Omega(\delta\theta, \rho) \equiv \delta\theta^T\delta\theta - \rho(\delta\theta^T A\,\delta\theta - 2\varepsilon) \tag{7-3-1}$$

Therefore

$$\partial\Omega/\partial(\delta\theta) = 2\,\delta\theta - 2\rho A\,\delta\theta = 0 \tag{7-3-2}$$

Letting $\lambda = 1/\rho$ and rearranging, we have

$$A\,\delta\theta = \lambda\,\delta\theta \tag{7-3-3}$$

Premultiplying by $\delta\theta^T$

$$\delta\theta^T A\,\delta\theta = \lambda\,\delta\theta^T\delta\theta \tag{7-3-4}$$

Hence

$$\delta\theta^T\delta\theta = 2\varepsilon/\lambda \tag{7-3-5}$$

Eq. (3) states that the desired vector $\delta\theta$ is an (unnormalized) eigenvector of $A$ with eigenvalue $\lambda$. Eq. (5) states that the length of the vector is $(2\varepsilon/\lambda)^{1/2}$. The $l$ eigenvectors form the $l$ principal axes of the ellipsoid. The longest axis, corresponding to the smallest eigenvalue, defines the worst-determined direction in $\theta$ space, and the shortest axis (largest eigenvalue) defines the best-determined direction.

Let the eigenvalue decomposition (see Section A-5) of $A$ be given by

$$A = U \Lambda U^T \tag{7-3-6}$$

where $U$ is the unitary matrix whose columns are the normalized eigenvectors of $A$, and $\Lambda$ is the diagonal matrix of eigenvalues. Then

$$\delta\theta^T A\,\delta\theta = \delta\theta^T U \Lambda U^T\,\delta\theta \tag{7-3-7}$$

Letting $\psi = U^T\,\delta\theta$ we obtain

$$\delta\theta^T A\,\delta\theta = \psi^T \Lambda \psi = \sum_{i=1}^{l} \lambda_i \psi_i^2 \tag{7-3-8}$$
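For a two-parameter problem the decomposition of Eq. (7-3-6) can be written in closed form. The sketch below computes the eigenvalues, the unit eigenvectors, and the corresponding principal-axis lengths $(2\varepsilon/\lambda)^{1/2}$ for the matrix $A$ of the two-dimensional example of Section 7-2. The function name `sym2_eig` is hypothetical, and the closed-form approach is merely a convenience for the 2 x 2 case; for larger problems a library eigenvalue routine would be the natural choice.

```python
import math


def sym2_eig(A):
    """Eigenpairs of a symmetric 2x2 matrix, sorted by ascending eigenvalue.

    Returns a list [(lam, (v1, v2)), ...] with unit-length eigenvectors.
    """
    a, b, c = A[0][0], A[0][1], A[1][1]
    if b == 0.0:
        # Already diagonal: the coordinate axes are the eigenvectors.
        return sorted([(a, (1.0, 0.0)), (c, (0.0, 1.0))])
    mean, half = 0.5 * (a + c), 0.5 * (a - c)
    root = math.hypot(half, b)
    pairs = []
    for lam in (mean - root, mean + root):
        v = (b, lam - a)                 # satisfies (A - lam I) v = 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


A = [[0.505, -0.495], [-0.495, 0.505]]
eps = 0.5
pairs = sym2_eig(A)
axis_lengths = [math.sqrt(2.0 * eps / lam) for lam, _ in pairs]
```

The overall sign of each eigenvector is arbitrary, so only the relative signs of its components are meaningful.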
Since $U^T$ is unitary, the transformation of coordinates given by $\psi = U^T\,\delta\theta$ is a rigid rotation (with possibly some reflections) which leaves distances and angles unaffected. The number $\psi_i$ ($i = 1, 2, \ldots, l$) is the $i$th component of the vector $\delta\theta$ expressed in the system of coordinates whose axes are the eigenvectors of $A$. Eq. (8) indicates that the principal axes of the ellipsoid coincide with the coordinate axes in the $\psi$ space, and displays clearly the inverse relationship between the lengths of the axes and the square roots of the eigenvalues.

The $\psi_i$ are referred to as the canonical variables, and the expression $\sum_{i=1}^{l} \lambda_i \psi_i^2$ as the canonical form of the quadratic expression $\delta\theta^T A\,\delta\theta$.

In the two-dimensional example of the preceding section, we had

$$A = \begin{bmatrix} 0.505 & -0.495 \\ -0.495 & 0.505 \end{bmatrix}$$

whose normalized eigenvectors are $[0.7071, -0.7071]$ and $[0.7071, 0.7071]$ with eigenvalues 1 and 0.01. Therefore

$$U^T = \begin{bmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{bmatrix}$$

$$\psi = \begin{bmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{bmatrix} \begin{bmatrix} \delta\theta_1 \\ \delta\theta_2 \end{bmatrix} = \begin{bmatrix} 0.7071(\delta\theta_1 - \delta\theta_2) \\ 0.7071(\delta\theta_1 + \delta\theta_2) \end{bmatrix}$$

The canonical form is $\psi_1^2 + 0.01\psi_2^2$. Thus, the principal axes have lengths 1 and 10, making the quantities $0.7071(\delta\theta_1 - \delta\theta_2)$ and $0.7071(\delta\theta_1 + \delta\theta_2)$ relatively well- and ill-determined, respectively.

Care must be taken that all the parameters be measured on compatible scales. It is evident from Fig. 7-1 that by drastically reducing or expanding the scale of one of the variables one can distort the ellipsoid to the point where it contains no useful information.

7-4. The Sampling Distribution

The sampling distribution of the estimates was defined in Section 3-1. It represents the manner in which the estimates would vary in response to the random variations we expect to occur from one data sample to another. The sampling distribution can shed light on the reliability of the estimates.
A parameter is ill-determined if its estimated value can be affected strongly by seemingly insignificant variations in the data. Such a situation is characterized by the estimate having a large variance. The feature of the sampling distribution that is of most interest to us is, therefore, its covariance matrix.
We have remarked before (Chapter III) that we cannot generally hope to determine the true sampling distribution. The best that we can hope to do on the basis of a single data sample, and in the absence of a Monte Carlo study, is to arrive at a rough approximation to the covariance matrix. We would also like to know the mean of the sampling distribution, and how it is related to the true values of the parameters, and the actual estimate $\theta^*$. Generally we accept $\theta^*$ as an estimate of the mean of the sampling distribution, and we neglect the bias of this mean relative to the true value. There exist some methods for reducing this bias (see Section 7-9). For the moment, however, we restrict our attention to approximating the covariance matrix. In essence, then, we attempt to answer the question "If we were to repeat our series of experiments many times, how would the estimates differ from one replication to the next?"

7-5. The Covariance Matrix of the Estimates

Suppose our estimate is the unconstrained minimum of some objective function $\Phi(\theta)$. This objective function also depends on the data; in particular, it depends on the measured values $w$ of the random variables. We indicate this dependence by writing $\Phi(\theta, w)$ in place of $\Phi(\theta)$. At the minimum we have

$$\partial\Phi(\theta^*, w)/\partial\theta = 0 \tag{7-5-1}$$

Suppose we varied the data slightly, replacing $w$ by $w + \delta w$. This would cause our minimum to shift from $\theta^*$ to $\theta^* + \delta\theta^*$, where we must have

$$\partial\Phi(\theta^* + \delta\theta^*, w + \delta w)/\partial\theta = 0 \tag{7-5-2}$$

Expanding Eq. (2) in Taylor series and retaining only terms up to first order, we find after subtracting Eq.
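(1)

$$(\partial^2\Phi/\partial\theta^2)\,\delta\theta^* + (\partial^2\Phi/\partial\theta\,\partial w)\,\delta w \approx 0 \tag{7-5-3}$$

so that approximately

$$\delta\theta^* = -H^{*-1}(\partial^2\Phi/\partial\theta\,\partial w)\,\delta w \tag{7-5-4}$$

where as usual $H^* = (\partial^2\Phi/\partial\theta^2)_{\theta = \theta^*}$. The desired covariance matrix $V_\theta$ is defined by

$$V_\theta \equiv E(\delta\theta^*\,\delta\theta^{*T}) \tag{7-5-5}$$

so that

$$V_\theta \approx E\big(H^{*-1}(\partial^2\Phi/\partial\theta\,\partial w)\,\delta w\,\delta w^T (\partial^2\Phi/\partial\theta\,\partial w)^T H^{*-1}\big) \tag{7-5-6}$$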
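As a check on Eqs. (7-5-4) and (7-5-5), consider the simplest possible case: a single-parameter linear least squares problem. The model $f_\mu = \theta x_\mu$, the observations $y_\mu$, and the error variance $\sigma^2$ are assumptions introduced here purely for illustration.

```latex
\begin{align*}
\Phi(\theta, y) &= \sum_{\mu=1}^{n} (y_\mu - \theta x_\mu)^2, \qquad
H^* = \frac{\partial^2\Phi}{\partial\theta^2} = 2\sum_{\mu} x_\mu^2, \qquad
\frac{\partial^2\Phi}{\partial\theta\,\partial y_\mu} = -2x_\mu \\
\delta\theta^* &= -H^{*-1}\sum_{\mu} (-2x_\mu)\,\delta y_\mu
  = \frac{\sum_\mu x_\mu\,\delta y_\mu}{\sum_\mu x_\mu^2} \\
V_\theta &= E\big[(\delta\theta^*)^2\big]
  = \frac{\sigma^2 \sum_\mu x_\mu^2}{\big(\sum_\mu x_\mu^2\big)^2}
  = \frac{\sigma^2}{\sum_\mu x_\mu^2}
  \qquad \text{(independent errors, } E[\delta y_\mu^2] = \sigma^2\text{)}
\end{align*}
```

In this linear case the first-order result is exact, since $\theta^* = \sum_\mu x_\mu y_\mu / \sum_\mu x_\mu^2$ is itself linear in the data; for nonlinear models it holds only approximately.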
(,:, 7-5, The Covariance Mmrix oj the Estil1lmes 177 J'' i.,: ;' The quantities H'" and a 2 (fJlaO aWare evaluated at 0 = 0* and at the aClllal sample W. Hence they are constants, and can be taken outside the expectation :.Y: :::;, sign in Eq. (6) :"",., (:\... j', - "h" "':,: V o ;::::: H*-I (a 2 (/Jfao a\-v) V w (a 2 (fJ180 l!W)T H I (7-5-7) --.: :'i[;:: ;: where V w is the covariance matrix of the data, i.e., ,;,.; ',1')'.:" V w == E(bw bw T ) (7-5-8) r..," o;..!,'  ?: ?;. ' r w-:' j This formula applies to any objective funclIon, whether or not It has a ir .pasis in statistics. More specific results can be obtained \vhen the objective r.::f;unction depends only on the moment matrix M of the residuals. This class ;>'bf functions, which includes sums of squares and log-likelihood for normal 11;.dlstributions, was shown to ad mit the Gauss approximation Eq. (5-9-10) ffc,';;I'6r H. We derive a similar approximation to V o. Eq. (5-9-10) can be rewritten : as '.l ; we assume that w lI (i.e., the results of the lth experiment) has covariance !patrix VI' and is independent of w" (I] =/= p) then Eq. (7) reduces to V o ;::::: H",-1 [ i (a 2 (/Jlae aWl,) VJI (a 2 (/J/co awll)T ] 1-1*-1 JI=1 (7-5-9) II H ;::::: 2 )" B/ rB 11 (7-5-10)  wb", JI=t BJI == - aelRO = afJcO, tP(MW)) == (/J(O), r == iilp/aM (7-5-11) , - f2':; Under assumptions similar to those made in deriving Eq. (I Q) it can be shown ,...".- r;1itp,at for standard reduced models with wJI = YII "..', a 2 qJ/ao aY'1 ;::::: 2B"Tr ;:>,:Siibstituting Eq. (10) and Eq, (12) in Eq. (9), we obtain l:' v o ;::::: (tIE/rBJI) -I (tIB/rVI,fBJI)(JIB/rBI')- (7-5-12) (7-5-13) ,/A derivation similar to the one given in Appendix E for the Gauss- Tic ,:.arkov theorem, shows that if VII = V (p = 1,2, . . 
., n), then choosing $\Gamma$ proportional to $V^{-1}$ leads to the least possible value of $\det V_\theta$. This, in fact, occurs in the cases listed below, and shows that these maximum likelihood and pseudomaximum likelihood estimates are at least approximately optimal.

In the case of single equation least squares, assuming observations with standard deviation $\sigma$, we have $B_\mu^T = b_\mu \equiv \partial f_\mu/\partial\theta$, $\Gamma = 1$, $V = \sigma^2$
so that Eq. (13) reduces to

$$V_\theta \approx \sigma^2\Bigl(\sum_{\mu=1}^{n} b_\mu b_\mu^T\Bigr)^{-1} = \sigma^2\Bigl(\sum_{\mu=1}^{n}(\partial f_\mu/\partial\theta)(\partial f_\mu/\partial\theta)^T\Bigr)^{-1} \tag{7-5-14}$$

Comparing this to Eq. (5-9-4) shows that here

$$V_\theta \approx 2\sigma^2 N^{-1} \tag{7-5-15}$$

When $\sigma$ is not known, we replace $\sigma^2$ with its estimate $[1/(n - l)]\sum_{\mu=1}^{n} e_\mu^{*2} = [1/(n - l)]\Phi^*$. A numerical illustration appears in Section 7-21.

We now treat some of the likelihood functions that were considered in Section 5-9.

1. Normal distribution with known $V_\mu = V$. According to row 1 of Table 5-1, $\Gamma = \tfrac{1}{2}V^{-1}$, so that Eq. (13) becomes, in view of Eq. (10),

$$V_\theta \approx \tfrac{1}{2}\Bigl(\sum_{\mu=1}^{n} B_\mu^T\Gamma B_\mu\Bigr)^{-1} = N^{-1} \approx H^{*-1} \tag{7-5-16}$$

2. As above, but with unknown $V_\mu = V$. From row 3 of Table 5-1, $\Gamma = (n/2)M^{-1}$. But according to Eq. (4-9-6), the maximum likelihood estimate for $V$ is given by $(1/n)M$, so that approximately $\Gamma = \tfrac{1}{2}V^{-1}$ and Eq. (16) is still valid.

For a wide class of maximum likelihood estimates with normal distributions we have then

$$V_\theta \approx H^{*-1} = -(\partial^2 \log L/\partial\theta\,\partial\theta)^{-1}_{\theta=\theta^*} \tag{7-5-17}$$

The quality of this approximation improves as the variance of the measurements decreases and the fit of the model to the data gets better.

For most unconstrained maximum likelihood estimates it can be shown (Cramer, 1946, p. 500 et seq.) that asymptotically (as the series of experiments is repeated ad infinitum) the sampling distribution approaches (with probability 1) the normal form, with means equal to the true values of the parameters and with covariance matrix given by

$$V_\theta = -[E(\partial^2 \log L/\partial\theta\,\partial\theta)]^{-1} \qquad (n \to \infty) \tag{7-5-18}$$

The computation of the required expectation is very tedious, if not altogether impossible; therefore, we generally replace the expected value by the most likely value, i.e., the value at $\theta = \theta^*$. This brings us back to Eq. (17). Again, the acceptability of this approximation depends on the goodness of fit. If the fit is very good, the likelihood function has a sharp peak, and the expected and most likely values nearly coincide.
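As a small numerical sketch of the single-equation least squares result, Eq. (7-5-14), with $\sigma^2$ replaced by its estimate $\Phi^*/(n - l)$: the function name `ls_covariance`, the straight-line model, and the four data points below are illustrative assumptions, not taken from the text. The rows of `jac` play the role of the vectors $b_\mu^T$.

```python
import numpy as np

def ls_covariance(jac, residuals):
    """Approximate covariance of least squares estimates, Eq. (7-5-14).

    jac       : (n, l) array whose mu-th row is b_mu^T = (df_mu/dtheta)^T at theta*
    residuals : (n,) array of final residuals e_mu*
    Returns (V_theta, sigma2_hat), with sigma2_hat = sum(e^2) / (n - l).
    """
    jac = np.asarray(jac, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    n, l = jac.shape
    sigma2_hat = residuals @ residuals / (n - l)
    V_theta = sigma2_hat * np.linalg.inv(jac.T @ jac)
    return V_theta, sigma2_hat

# Hypothetical data for the model f(x, theta) = theta_1 + theta_2 * x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
jac = np.column_stack([np.ones_like(x), x])      # b_mu^T = [1, x_mu]
theta_star, *_ = np.linalg.lstsq(jac, y, rcond=None)
residuals = y - jac @ theta_star
V_theta, sigma2_hat = ls_covariance(jac, residuals)
std_dev = np.sqrt(np.diag(V_theta))              # standard deviations of the estimates
```

For a model that is nonlinear in the parameters, the same function applies once `jac` is evaluated at $\theta^*$; the result is then only as good as the Gauss approximation itself.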
The estimates given here and in the sequel for $V_\theta$ and other statistical parameters are computed from the data. Hence they are in themselves random variables subject to sampling variations. As illustrated by the example of Section 7-22, these variations can be quite large even when a good fit to the data can be obtained. We shall not concern ourselves here with the computation of the sampling variances of $V_\theta$. Nevertheless, we point out that these variances can always be estimated by the Monte Carlo method (Section 3-3). Generally the $V_\theta$ computed from any given data sample can be regarded as no more than a rough estimate, correct to within an order of magnitude.

The fact that the approximations may break down when the fit is poor need not worry us too much, since in this case we would not place much reliance on the model anyway, and would attempt to improve either the model or the data. Even a very rough approximation to $V_\theta$ can be of considerable use, as will be seen in Chapter X.

7-6. Exact Structural Model

We can also derive an approximation to $V_\theta$ for the case of structural equations acting as equality constraints. Suppose we have obtained estimates $\theta^*$ and $\hat W^*$ given the data $W$ and using the method of Section 6-6. If the data are replaced by $W + \delta W$ there will be a correction $\delta\theta^*$ in $\theta^*$ given by Eq. (6-6-12). Now at the solution $G = 0$, and (as the reader may verify for himself) $B^TC^{-1}AH^{-1}q = 0$, so that Eq. (6-6-12) becomes approximately

$$\delta\theta^* = D^{-1}B^TC^{-1}AH^{-1}\,\delta q \tag{7-6-1}$$

where $\delta q = (\partial q/\partial W)\,\delta W = (\partial^2\Phi/\partial\hat w\,\partial w)\,\delta w$ is the change in $q$ due to the change in $W$. Setting

$$V_w \equiv E(\delta w\,\delta w^T) \tag{7-6-2}$$

we are led to

$$V_\theta \equiv E(\delta\theta^*\,\delta\theta^{*T}) \approx D^{-1}B^TC^{-1}AH^{-1}(\partial^2\Phi/\partial\hat w\,\partial w)\,V_w\,(\partial^2\Phi/\partial\hat w\,\partial w)^T H^{-1}A^TC^{-1}BD^{-1} \tag{7-6-3}$$

To achieve further progress we assume that $\Phi = \tfrac{1}{2}(w - \hat w)^T V_w^{-1}(w - \hat w)$. Hence $H^{-1} = V_w$, and $\partial^2\Phi/\partial\hat w\,\partial w = -V_w^{-1}$. Since $C = AH^{-1}A^T = AV_wA^T$ and $D = B^TC^{-1}B$, it follows that

$$V_\theta \approx D^{-1}B^TC^{-1}AV_wV_w^{-1}V_wV_w^{-1}V_wA^TC^{-1}BD^{-1} = D^{-1}B^TC^{-1}AV_wA^TC^{-1}BD^{-1} = D^{-1}B^TC^{-1}BD^{-1}$$

so that finally

$$V_\theta \approx D^{-1} = [B^T(AV_wA^T)^{-1}B]^{-1} \tag{7-6-4}$$

In particular, for the model treated in Section 6-8 we obtain $V_\theta$ by inverting the matrix of Eq. (6-8-14). A numerical illustration appears in Section 7-23.
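The final formula, Eq. (7-6-4), is easy to evaluate numerically. The sketch below assumes that $A$ and $B$ are the Jacobians of the structural equations with respect to the data and the parameters respectively (their definitions belong to Section 6-6 and are not repeated here); the function name `structural_covariance` and the small test matrices are hypothetical. The demo also checks the intermediate identity $D^{-1}B^TC^{-1}AV_wA^TC^{-1}BD^{-1} = D^{-1}$ used in the derivation.

```python
import numpy as np

def structural_covariance(A, B, V_w):
    """Evaluate V_theta ~ [B^T (A V_w A^T)^{-1} B]^{-1}, Eq. (7-6-4).

    A   : (m, k) Jacobian of the structural equations w.r.t. the data (assumed)
    B   : (m, l) Jacobian of the structural equations w.r.t. the parameters (assumed)
    V_w : (k, k) covariance matrix of the data
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    V_w = np.asarray(V_w, dtype=float)
    C = A @ V_w @ A.T                      # C = A V_w A^T
    D = B.T @ np.linalg.solve(C, B)        # D = B^T C^{-1} B
    return np.linalg.inv(D)

# Hypothetical example: four equations, two parameters, uncorrelated data errors.
B = np.column_stack([np.ones(4), np.arange(4.0)])
A = np.eye(4)
V_w = 0.016 * np.eye(4)
V_theta = structural_covariance(A, B, V_w)

# Check of the intermediate "sandwich" expression from the derivation.
C = A @ V_w @ A.T
D_inv = V_theta
sandwich = D_inv @ B.T @ np.linalg.solve(C, A @ V_w @ A.T) @ np.linalg.solve(C, B) @ D_inv
```

With $A = I$ and $V_w = \sigma^2 I$ the result collapses to $\sigma^2(B^TB)^{-1}$, the familiar least squares covariance.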
7-7. Constraints

Let us examine now how the covariance matrix is affected by inequality and equality constraints that do not depend on the data. Let $h(\theta) \ge 0$ and $g(\theta) = 0$ denote these constraints, which, of course, are satisfied by $\theta^*$. Let $\theta^{(i)}$ be the value of $\theta$ that optimizes $\Phi(\theta)$ with the constraint $h_i(\theta) \ge 0$ removed, but all other constraints retained. Clearly, if it happens that $\theta^{(i)}$ is feasible, i.e., if $h_i(\theta^{(i)}) \ge 0$, then $\theta^* = \theta^{(i)}$. Actually, we distinguish the following four cases (see Fig. 7-2).

[Fig. 7-2. Inequality constraint $h_i(\theta) \ge 0$, cases (a) to (d): $\theta^{(i)}$, optimum without constraint; $\theta^*$, optimum with constraint; solid curve, boundary of region containing 90% of all realizations of $\theta^{(i)}$; dashed curve, boundary of region containing 90% of all realizations of $\theta^*$.]

(a) $\theta^* = \theta^{(i)}$ is well in the interior of the feasible region relative to the constraint $h_i(\theta) \ge 0$. Therefore, different values of $\theta^{(i)}$ arising from different data samples are likely to remain feasible, and the constraint $h_i(\theta) \ge 0$ exerts no influence on $V_\theta$.

(b) $\theta^* = \theta^{(i)}$ is feasible, but it lies very close to the surface $h_i(\theta) = 0$. In this case, a significant number of data samples may give rise to infeasible values of $\theta^{(i)}$, causing $\theta^*$ to lie on the constraint. The density of the sampling distribution of $\theta^*$ is truncated: positive on one side of the constraint, infinite on the constraint, and zero on the other side. The computation of its covariance matrix is difficult, and will not be undertaken here.

(c) $\theta^{(i)}$ is infeasible, but only slightly so. $\theta^*$ is on the constraint. Some data samples make $\theta^{(i)}$ feasible, and therefore some realizations of $\theta^*$ fall in the interior of the feasible region. The distribution is similar to the one of case (b), and we will not treat it further.

(d) $\theta^{(i)}$ is extremely infeasible, so that $\theta^*$ remains on the constraint for all but a negligible proportion of all possible data samples. In this case, we may treat $h_i$ as an equality constraint $h_i(\theta) = 0$.

Let the vector $g(\theta)$ now represent all the equality constraints, including the inequality constraints of type (d). As we know from Section 3-6, the Lagrangian conditions

$$\partial\Phi/\partial\theta = \sum_j \lambda_j\,\partial g_j/\partial\theta \tag{7-7-1}$$

must be satisfied at $\theta = \theta^*$. If we change the data by an amount $\delta w$, we find $\theta^*$ changed by $\delta\theta^*$, and $\lambda_j$ changed by $\delta\lambda_j$. At the new optimum, Eq. (1) takes the form (approximately to first-order terms)

$$\frac{\partial\Phi}{\partial\theta} + \frac{\partial^2\Phi}{\partial\theta\,\partial\theta}\,\delta\theta^* + \frac{\partial^2\Phi}{\partial\theta\,\partial w}\,\delta w = \sum_j\Bigl(\lambda_j\frac{\partial g_j}{\partial\theta} + \delta\lambda_j\frac{\partial g_j}{\partial\theta} + \lambda_j\frac{\partial^2 g_j}{\partial\theta\,\partial\theta}\,\delta\theta^*\Bigr) \tag{7-7-2}$$

with all derivatives evaluated at $\theta = \theta^*$. Subtracting Eq.
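(1) from Eq. (2) leaves

$$A\,\delta\theta^* = (\partial g/\partial\theta)^T\,\delta\lambda - (\partial^2\Phi/\partial\theta\,\partial w)\,\delta w \tag{7-7-3}$$

where

$$A \equiv \partial^2\Phi/\partial\theta\,\partial\theta - \sum_j \lambda_j\,\partial^2 g_j/\partial\theta\,\partial\theta = H^* - \sum_j \lambda_j\,\partial^2 g_j/\partial\theta\,\partial\theta \tag{7-7-4}$$

so that

$$\delta\theta^* = A^{-1}\bigl((\partial g/\partial\theta)^T\,\delta\lambda - (\partial^2\Phi/\partial\theta\,\partial w)\,\delta w\bigr) \tag{7-7-5}$$

The variation $\delta\theta^*$ must leave the equations $g(\theta^* + \delta\theta^*) = 0$ satisfied. Hence

$$\delta g \equiv (\partial g/\partial\theta)\,\delta\theta^* = 0 \tag{7-7-6}$$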
Substituting Eq. (5) in Eq. (6) we find after solving for $\delta\lambda$

$$\delta\lambda = \Bigl[\frac{\partial g}{\partial\theta}A^{-1}\frac{\partial g}{\partial\theta}^T\Bigr]^{-1}\frac{\partial g}{\partial\theta}A^{-1}\frac{\partial^2\Phi}{\partial\theta\,\partial w}\,\delta w \tag{7-7-7}$$

Finally, after inserting Eq. (7) into Eq. (5) we find

$$\delta\theta^* = -\Bigl\{I - A^{-1}\frac{\partial g}{\partial\theta}^T\Bigl[\frac{\partial g}{\partial\theta}A^{-1}\frac{\partial g}{\partial\theta}^T\Bigr]^{-1}\frac{\partial g}{\partial\theta}\Bigr\}A^{-1}\frac{\partial^2\Phi}{\partial\theta\,\partial w}\,\delta w \tag{7-7-8}$$

Comparing Eq. (8) with Eq. (7-5-4) we find two changes:

1. $H^*$ is replaced by $A$. From Eq. (4) we see that if all the constraints $g_j$ are linear, then actually $A = H^*$. This occurs, e.g., when all the constraints are upper and lower bounds.

2. The expression $A^{-1}(\partial^2\Phi/\partial\theta\,\partial w)\,\delta w$ is premultiplied by the projection matrix

$$P = I - A^{-1}\frac{\partial g}{\partial\theta}^T\Bigl[\frac{\partial g}{\partial\theta}A^{-1}\frac{\partial g}{\partial\theta}^T\Bigr]^{-1}\frac{\partial g}{\partial\theta} \tag{7-7-9}$$

which has the property that if $x$ is any $l$-dimensional vector, then $y = Px$ satisfies $(\partial g/\partial\theta)\,y = 0$, i.e., $y$ lies in the tangent plane to the constraints. The matrix $P$ thus projects any vector into this tangent plane.

The expression analogous to Eq. (7-5-7) is now

$$V_\theta = PA^{-1}(\partial^2\Phi/\partial\theta\,\partial w)\,V_w\,(\partial^2\Phi/\partial\theta\,\partial w)^T A^{-1}P^T \tag{7-7-10}$$

If all $g_j$ are linear, i.e., $A = H^*$, then under the appropriate conditions Eq. (7-5-16) remains valid with the following modification

$$V_\theta \approx PH^{*-1}P^T \tag{7-7-11}$$

Let us compute $P$ and $V_\theta$ for the simple though common case in which all active constraints are derived from upper or lower bounds on the parameters. For simplicity of representation, we assume that the parameters $\theta_1, \theta_2, \ldots, \theta_{l_1}$ are actively constrained to equal $a_1, a_2, \ldots, a_{l_1}$, respectively. The remaining parameters $\theta_{l_1+1}, \theta_{l_1+2}, \ldots, \theta_l$ are not actively constrained. The active constraints can be written as

$$[\,I \;\; 0\,]\,\theta = a \tag{7-7-12}$$

with $I$ being the $l_1 \times l_1$ identity, and $0$ the $l_1 \times (l - l_1)$ null matrix. We have, then

$$\partial g/\partial\theta = [\,I \;\; 0\,]$$
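Since our constraints are linear, we have $A = H^*$. Let $H^{*-1}$ be partitioned as follows

$$A^{-1} = H^{*-1} = \begin{bmatrix} B & C^T \\ C & D \end{bmatrix} \tag{7-7-13}$$

where $B$ is $l_1 \times l_1$, $C$ is $(l - l_1) \times l_1$, and $D$ is $(l - l_1) \times (l - l_1)$. Therefore:

$$\Bigl[\frac{\partial g}{\partial\theta}A^{-1}\frac{\partial g}{\partial\theta}^T\Bigr]^{-1} = \Bigl([\,I \;\; 0\,]\begin{bmatrix} B & C^T \\ C & D \end{bmatrix}\begin{bmatrix} I \\ 0 \end{bmatrix}\Bigr)^{-1} = B^{-1} \tag{7-7-14}$$

$$P = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} - \begin{bmatrix} B & C^T \\ C & D \end{bmatrix}\begin{bmatrix} I \\ 0 \end{bmatrix}B^{-1}[\,I \;\; 0\,] = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} - \begin{bmatrix} I & 0 \\ CB^{-1} & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ -CB^{-1} & I \end{bmatrix} \tag{7-7-15}$$

and finally

$$V_\theta = \begin{bmatrix} 0 & 0 \\ -CB^{-1} & I \end{bmatrix}\begin{bmatrix} B & C^T \\ C & D \end{bmatrix}\begin{bmatrix} 0 & -B^{-1}C^T \\ 0 & I \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & D - CB^{-1}C^T \end{bmatrix} \tag{7-7-16}$$

As expected, the variances and covariances of $\theta_1, \theta_2, \ldots, \theta_{l_1}$ are zero, since these parameters are constrained to equal fixed values. The covariance matrix of the unconstrained $\theta$'s is reduced from $D$ to $D - CB^{-1}C^T$. The matrix $D - CB^{-1}C^T$ is simply the inverse of the lower right-hand partition of $A$, so that the covariance of the unconstrained parameters is obtained by inverting the corresponding part of $A$.

7-8. Principal Components

Given the matrix $V_\theta$ or an approximation to it, we can determine which parameters, or linear combinations of parameters, are well determined (small variance), and which are poorly determined (large variance). The variance of the estimate for the $\alpha$th parameter is given by $V_{\theta\alpha\alpha}$, and its standard deviation by $\sigma_\alpha \equiv V_{\theta\alpha\alpha}^{1/2}$. As in Section 7-3, the full picture is obtained by finding the eigenvalue decomposition of $V_\theta$, say

$$V_\theta = U\Pi U^T, \qquad \Pi = U^TV_\theta U \tag{7-8-1}$$

where $\Pi$ is the diagonal matrix of the eigenvalues $\pi_i$ of $V_\theta$. Suppose we define a vector of new variables $\rho = U^T\theta$, $\delta\rho = U^T\delta\theta$. The covariance matrix of the $\rho$'s is given by

$$V_\rho \equiv E(\delta\rho\,\delta\rho^T) = E(U^T\,\delta\theta\,\delta\theta^T\,U) = U^TV_\theta U = \Pi \tag{7-8-2}$$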
We have then a set of new variables $\rho_1, \rho_2, \ldots, \rho_l$ which replace the original parameters $\theta_1, \theta_2, \ldots, \theta_l$. Our estimated value of $\rho_i$ is given by $\rho_i^* = \sum_{\alpha=1}^{l} U_{\alpha i}\theta_\alpha^*$. Since $V_\rho = \Pi$ is a diagonal matrix, the sampling variations in $\rho_i^*$ and $\rho_j^*$ are uncorrelated when $i \neq j$. The standard deviation of the estimate $\rho_i^*$ is $\pi_i^{1/2}$. The $\rho_i$ are called the principal components.

The advantage of dealing with the uncorrelated principal components rather than with the correlated original parameters is particularly great when (as in the normal distribution) lack of correlation implies statistical independence. For then we can establish confidence intervals and statistical tests for each component individually.

When Eq. (7-5-17) holds, we have $V_\theta \approx H^{*-1} = A^{-1}$. The eigenvectors of $V_\theta$ and $A$ coincide, hence the $\delta\rho_i$ coincide with the canonical variables. The eigenvalues of $V_\theta$ are the reciprocals of those of $A$, i.e., $\pi_i = 1/\lambda_i$. Hence the $\varepsilon$-indifference interval of $\delta\rho_i = \psi_i$ is given by $\pm(2\varepsilon/\lambda_i)^{1/2} = \pm(2\varepsilon\pi_i)^{1/2}$. The length of the interval is proportional to the standard deviation. For single-equation least squares, Eq. (7-5-15) implies that $\pi_i = 2\sigma^2/\lambda_i$.

Like the canonical variables, the principal components depend on the scaling of the variables. A natural scaling is one which adjusts each variable so as to have unit standard deviation. This can be achieved by defining $v_\alpha = \theta_\alpha V_{\theta\alpha\alpha}^{-1/2}$, so that

$$E(\delta v_\alpha\,\delta v_\beta) = V_{\theta\alpha\beta}\,V_{\theta\alpha\alpha}^{-1/2}\,V_{\theta\beta\beta}^{-1/2}$$

or

$$V_v = DV_\theta D \tag{7-8-3}$$

where $D_{\alpha\beta} = \delta_{\alpha\beta}V_{\theta\alpha\alpha}^{-1/2}$. The matrix $V_v$, whose main diagonal consists entirely of ones, is the correlation matrix of $\theta^*$. If $P$ is the matrix of eigenvectors of $V_v$, then the elements of the vector $P^Tv = P^TD\theta$ are the principal components of $V_v$. They are uncorrelated, and their variances are the eigenvalues of $V_v$. We call them the scaled principal components of $V_\theta$.
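The two decompositions just described, Eq. (7-8-1) for the principal components and Eq. (7-8-3) for the scaled ones, can be sketched in a few lines. The function names and the $2 \times 2$ covariance matrix below are hypothetical; `numpy.linalg.eigh` is used because both matrices are symmetric, and it returns the eigenvalues in ascending order.

```python
import numpy as np

def principal_components(V_theta):
    """Eigenvalues pi_i and eigenvector matrix U with V_theta = U diag(pi) U^T."""
    V = np.asarray(V_theta, dtype=float)
    pi, U = np.linalg.eigh(V)
    return pi, U

def scaled_principal_components(V_theta):
    """Correlation matrix V_v = D V_theta D and its eigen-decomposition."""
    V = np.asarray(V_theta, dtype=float)
    d = 1.0 / np.sqrt(np.diag(V))          # diagonal of D: V_aa^{-1/2}
    V_v = V * np.outer(d, d)               # same as D @ V @ D for diagonal D
    eigvals, P = np.linalg.eigh(V_v)
    return V_v, eigvals, P

# Hypothetical covariance matrix of two parameter estimates.
V_theta = np.array([[4.0, 1.2],
                    [1.2, 1.0]])
pi, U = principal_components(V_theta)
V_v, scaled_vals, P = scaled_principal_components(V_theta)
std_of_components = np.sqrt(pi)            # standard deviations pi_i^{1/2}
```

A component whose eigenvalue is much larger than the others marks a poorly determined linear combination of the parameters; the corresponding column of `U` (or of `P` in the scaled case) gives its coefficients.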
In practice one is interested mostly in determining the principal components having unusually large variances, since these hold clues to inadequacies in model or data. This point is discussed further in Section 7-18. If $V_\theta$ is known, but not its eigenvalues and vectors, then we can most easily determine its largest eigenvalue and corresponding eigenvector by means of the power method.

7-9. Confidence Intervals

Knowledge of the covariance matrix of an estimator gives an intuitive feeling of the degree to which the various parameters are well determined. We may wish, however, to make more explicit statements such as "the true
value of $\theta$ lies between the numbers $a$ and $b$ with 90% probability." From the point of view of classical statistics this statement is meaningless; $\theta$ is a constant, albeit unknown. The probability of its lying between $a$ and $b$ is unity if $a < \theta < b$, and zero otherwise; it can never be 90%.

This difficulty is overcome by Neyman's (Neyman, 1937) theory of confidence intervals. Suppose we had complete knowledge of the sampling distribution of the estimate $\theta^*$ (we discuss first the case of a single parameter $\theta$). This distribution depends on the true value $\hat\theta$, and we denote its density function by $p(\theta^* \mid \hat\theta)$. For any given value of $\hat\theta$ we can easily determine two numbers $a(\hat\theta)$ and $b(\hat\theta)$ such that, for a given $0 < \gamma < 1$, we have

$$\Pr[a(\hat\theta) \le \theta^* \le b(\hat\theta)] = \gamma \tag{7-9-1}$$

This is equivalent to demanding that for each $\hat\theta$

$$\int_{a(\hat\theta)}^{b(\hat\theta)} p(\theta^* \mid \hat\theta)\,d\theta^* = \gamma \tag{7-9-2}$$

It is clear that the choice of $a(\hat\theta)$ is quite arbitrary; given any $\varepsilon < 1 - \gamma$, we can choose $a(\hat\theta)$ so that $\Pr[\theta^* < a(\hat\theta)] = \varepsilon$; $b(\hat\theta)$ is then determined by the requirement that $\Pr[\theta^* > b(\hat\theta)] = 1 - \gamma - \varepsilon$. Suppose we are able to choose $a(\hat\theta)$ so that both it and $b(\hat\theta)$ are monotonically increasing functions of $\hat\theta$. Then there exist inverse functions $\alpha(\theta^*)$ and $\beta(\theta^*)$ such that

$$\hat\theta = \alpha[a(\hat\theta)] = \beta[b(\hat\theta)]$$

Then the statement $\theta^* \ge a(\hat\theta)$ is equivalent to $\hat\theta \le \alpha(\theta^*)$ and the statement $\theta^* \le b(\hat\theta)$ is equivalent to $\hat\theta \ge \beta(\theta^*)$. It follows that

$$\Pr[\beta(\theta^*) \le \hat\theta \le \alpha(\theta^*)] = \Pr[a(\hat\theta) \le \theta^* \le b(\hat\theta)] = \gamma \tag{7-9-3}$$

Note that $\theta^*$ as a function of the sample is a random variable, and therefore so are $\alpha(\theta^*)$ and $\beta(\theta^*)$. Eq. (3) asserts that the probability that the value of the random variable $\beta(\theta^*)$ does not exceed the true value $\hat\theta$, and that the random variable $\alpha(\theta^*)$ is no less than $\hat\theta$, is equal to $\gamma$. This is a perfectly meaningful statement within the confines of classical statistics. The interval $[\beta(\theta^*), \alpha(\theta^*)]$ is called a $\gamma$-confidence interval for $\hat\theta$.

The simple example of Section 3-1 will help clarify the situation.
From observations $w_\mu$ on a normally distributed random variable $w$ with mean $\hat\theta$ and known variance $\sigma^2$ we obtain the maximum likelihood estimate Eq. (3-1-3)

$$\theta^* = (1/n)\sum_{\mu=1}^{n} w_\mu$$

It is well known that the sampling distribution of $\theta^*$ is normal with mean $\hat\theta$ and variance $\sigma^2/n$. From tables of the normal distribution [see, e.g., Cramer (1946, Table 2)], we note that

$$\Pr(|\theta^* - \hat\theta| \le 1.6449\,\sigma/\sqrt{n}) = 0.9 \tag{7-9-4}$$
so that

$$\Pr(\hat\theta - 1.6449\,\sigma/\sqrt{n} \le \theta^* \le \hat\theta + 1.6449\,\sigma/\sqrt{n}) = 0.9 \tag{7-9-5}$$

that is, $a(\hat\theta) = \hat\theta - 1.6449\,\sigma/\sqrt{n}$ and $b(\hat\theta) = \hat\theta + 1.6449\,\sigma/\sqrt{n}$. Both $a(\hat\theta)$ and $b(\hat\theta)$ are monotonic in $\hat\theta$ with inverses $\alpha(\theta^*) = \theta^* + 1.6449\,\sigma/\sqrt{n}$ and $\beta(\theta^*) = \theta^* - 1.6449\,\sigma/\sqrt{n}$. Clearly, Eq. (5) is equivalent to

$$\Pr(\theta^* - 1.6449\,\sigma/\sqrt{n} \le \hat\theta \le \theta^* + 1.6449\,\sigma/\sqrt{n}) = 0.9 \tag{7-9-6}$$

so that $(\theta^* - 1.6449\,\sigma/\sqrt{n},\; \theta^* + 1.6449\,\sigma/\sqrt{n})$ is a 90% confidence interval for $\hat\theta$. This statement should be interpreted as follows: if the series of experiments were repeated one hundred times, each replication would yield a separate estimate $\theta^*$. For each such estimate we could form the interval given by Eq. (6). Then the true value $\hat\theta$ should be contained in about ninety of these intervals.

If $\sigma$ is unknown, the interval given by Eq. (6) is undefined. However, it is well known that in this case the quantity $\sqrt{n}(\theta^* - \hat\theta)/s$, where

$$s \equiv \Bigl\{[1/(n - 1)]\sum_{\mu=1}^{n}(w_\mu - \theta^*)^2\Bigr\}^{1/2} \tag{7-9-7}$$

follows the $t$-distribution with $n - 1$ degrees of freedom. Hence, if $n = 10$, we have from tables [e.g., Cramer (1946, Table 3)]

$$\Pr[\,|10^{1/2}(\theta^* - \hat\theta)/s| \le 1.833\,] = 0.9 \tag{7-9-8}$$

so that

$$\Pr(\theta^* - 1.833\,s/10^{1/2} \le \hat\theta \le \theta^* + 1.833\,s/10^{1/2}) = 0.9 \tag{7-9-9}$$

This is a proper confidence interval, since $s$ is computable from the sample. Since $s$ is a consistent estimate for $\sigma$, Eq. (6) can be used with $s$ replacing $\sigma$ when $n$ is large. In fact, the $t$-distribution approaches the normal as $n$ increases.

If the exact form of the sampling distribution is unknown except for its variance $V$, we may derive conservative confidence intervals for the mean $\bar\theta$ of the sampling distribution. According to the Bienaymé-Chebyshev inequality (see Section 7-10 for the derivation of a more general result)

$$\Pr(|\theta^* - \bar\theta| \ge kV^{1/2}) \le k^{-2} \tag{7-9-10}$$

which is equivalent to (set $\gamma = 1 - k^{-2}$)

$$\Pr\{\bar\theta - [V/(1 - \gamma)]^{1/2} \le \theta^* \le \bar\theta + [V/(1 - \gamma)]^{1/2}\} \ge \gamma \tag{7-9-11}$$

Hence

$$\Pr\{\theta^* - [V/(1 - \gamma)]^{1/2} \le \bar\theta \le \theta^* + [V/(1 - \gamma)]^{1/2}\} \ge \gamma \tag{7-9-12}$$
We know then that a $\gamma$-confidence interval for $\bar\theta$ is contained in the interval

$$\{\theta^* - [V/(1 - \gamma)]^{1/2},\; \theta^* + [V/(1 - \gamma)]^{1/2}\}$$

If $\theta^*$ is an unbiased estimate, then the true value $\hat\theta$ can be substituted for $\bar\theta$ in Eq. (12). Usually the bias is unknown, but when $n$ is fairly large, maximum likelihood estimates can be treated as having a normal sampling distribution with variance given by Eq. (7-5-17), and bias proportional to $1/n$. In this case one may employ the method of Quenouille (1956) to reduce the bias. Let $\theta^*$ be, as usual, the estimate of $\hat\theta$ based on $n$ experiments. Let $\theta_\mu^*$ ($\mu = 1, 2, \ldots, n$) be the estimate based on the $n - 1$ data points obtained by dropping the $\mu$th. Let $\tilde\theta \equiv (\sum_{\mu=1}^{n}\theta_\mu^*)/n$ be the mean of the $\theta_\mu^*$. The bias in $\theta^*$ is proportional to $1/n$, whereas the bias in each $\theta_\mu^*$ is proportional to $1/(n - 1)$, and so is the bias in $\tilde\theta$. Hence the following relations hold approximately

$$\theta^* \approx \hat\theta + \alpha/n, \qquad \tilde\theta \approx \hat\theta + \alpha/(n - 1) \tag{7-9-13}$$

where $\alpha$ is an unknown constant. Multiplying the first equation by $n$ and the second by $n - 1$, we obtain after subtraction

$$\hat\theta \approx n\theta^* - (n - 1)\tilde\theta \tag{7-9-14}$$

The bias of this estimate is of the order of $1/n^2$.

We have noted above that the choice of an interval for a given confidence level $\gamma$ is somewhat arbitrary. Commonly employed criteria for choosing among all possible functions $a(\hat\theta)$ and $b(\hat\theta)$ that satisfy Eq. (1) are the following:

1. Minimum length. Choose $a(\hat\theta)$ and $b(\hat\theta)$ so that $b(\hat\theta) - a(\hat\theta)$ is minimum.

2. Symmetry around the true value: $b(\hat\theta) - \hat\theta = \hat\theta - a(\hat\theta)$.

3. Symmetry in probability: $\Pr[\theta^* < a(\hat\theta)] = \Pr[\theta^* > b(\hat\theta)] = (1 - \gamma)/2$.

4. Equal probability at ends. The density function of the sampling distribution assumes equal values at $\theta^* = a(\hat\theta)$ and $\theta^* = b(\hat\theta)$.

When $p(\theta^* \mid \hat\theta)$ is unimodal and symmetric, criteria 1, 2, 3, and 4 coincide.

7-10. Confidence Regions

The idea of a confidence interval may be generalized to the case of several unknown parameters.
Suppose that for any data sample $W$ we are able to define a bounded closed subset $S(W)$ of the $l$-dimensional $\theta$ space in such a way that $S(W)$ contains the true value $\hat\theta$ in a fraction $\gamma$ of all possible data samples, i.e.,

$$\Pr[\hat\theta \in S(W)] = \gamma \tag{7-10-1}$$
Then $S(W)$ is called a $\gamma$-joint confidence region for the parameters $\theta$. In choosing a confidence region we have even more freedom than we had in the case of confidence intervals, since we may exercise control not only over the location, but also over the shape of the region. The most commonly used regions are $l$-dimensional rectangles or ellipsoids whose centers are at the estimate $\theta^*$.

In Section 7-8 we have introduced the principal components $\rho = U^T\theta$, which are uncorrelated linear combinations of the parameters. If (as would be implied when the sampling distribution were normal) the principal components are statistically independent, then confidence intervals for the individual components can be combined into a rectangular joint confidence region. Let $(\alpha_i, \beta_i)$ be a $\gamma$-confidence interval for $\rho_i$. Then

$$\Pr(\alpha_i \le \rho_i \le \beta_i,\; i = 1, 2, \ldots, l) = \gamma^l \tag{7-10-2}$$

so that we have a $\gamma^l$-joint confidence region for $\rho$. Unfortunately, it is difficult to transform Eq. (2) into a simple statement in terms of the $\theta_\alpha$.

It seems reasonable to choose confidence regions which coincide with the indifference regions of the objective function that was used to obtain the estimate $\theta^*$. We have seen in Section 7-2 that these take the approximate form

$$\tfrac{1}{2}\,|(\theta - \theta^*)^T H^*(\theta - \theta^*)| \le \varepsilon \tag{7-10-3}$$

which in view of Eq. (7-5-16) may in most cases be approximated by the ellipsoid

$$(\theta - \theta^*)^T V_\theta^{-1}(\theta - \theta^*) \le c \tag{7-10-4}$$

For a given confidence level $\gamma$, we need only determine $c$ in such a way that

$$\Pr[(\hat\theta - \theta^*)^T V_\theta^{-1}(\hat\theta - \theta^*) \le c] = \gamma \tag{7-10-5}$$

Such a value of $c$ always exists, regardless of the accuracy of the assumptions made in estimating $V_\theta^{-1}$. The actual determination of the value of $c$ does, however, depend on these assumptions, and on the form of the sampling distribution in general.
When the sampling distribution is normal, unbiased, and with known covariance matrix $V_\theta$, then as shown in Section 2-8 the quantity $(\theta^* - \hat\theta)^T V_\theta^{-1}(\theta^* - \hat\theta)$, where $\hat\theta$ is the true value, is distributed as $\chi^2$ with $l$ degrees of freedom. Therefore, the constant $c$ may be determined as the upper $\gamma$ point of that distribution. For instance, let $l = 10$ and $\gamma = 0.9$. We find then (Cramer, 1946, Table 3) that

$$\Pr[(\hat\theta - \theta^*)^T V_\theta^{-1}(\hat\theta - \theta^*) \le 15.987] = 0.9 \tag{7-10-6}$$

When the sampling distribution cannot be assumed normal, we may generalize the Bienaymé-Chebyshev inequality Eq. (7-9-12) to obtain (provided $\bar\theta = \hat\theta$)

$$\Pr[(\hat\theta - \theta^*)^T V_\theta^{-1}(\hat\theta - \theta^*) \le l/(1 - \gamma)] \ge \gamma \tag{7-10-7}$$
We derive this relation as follows: Let

$$Q(\theta^*) \equiv (\hat\theta - \theta^*)^T V_\theta^{-1}(\hat\theta - \theta^*)$$

and let $G_a$ be the region in which $Q(\theta^*) \le a$. Let

$$P_a(\theta^*) \equiv 1 - (1/a)\,Q(\theta^*)$$

Then $P_a(\theta^*) \le 0$ for $\theta^*$ outside $G_a$, and $P_a(\theta^*) \le 1$ for $\theta^*$ inside $G_a$. Hence

$$\int_{G_\infty} P_a(\theta^*)\,p(\theta^*)\,d\theta^* \le \int_{G_a} P_a(\theta^*)\,p(\theta^*)\,d\theta^* \le \int_{G_a} p(\theta^*)\,d\theta^* = \Pr(\theta^* \in G_a)$$

where $G_\infty$ is the entire space and $p(\theta^*)$ is the pdf of the sampling distribution. Now

$$\int_{G_\infty} P_a(\theta^*)\,p(\theta^*)\,d\theta^* = \int_{G_\infty}[p(\theta^*) - (1/a)\,Q(\theta^*)\,p(\theta^*)]\,d\theta^* = 1 - (1/a)\,E\,Q(\theta^*)$$

$$= 1 - (1/a)\,E\,\mathrm{Tr}\,V_\theta^{-1}(\hat\theta - \theta^*)(\hat\theta - \theta^*)^T = 1 - (1/a)\,\mathrm{Tr}\,V_\theta^{-1}V_\theta = 1 - l/a$$

Hence

$$\Pr(\theta^* \in G_a) \ge 1 - l/a$$

Taking $\gamma = 1 - l/a$ one obtains Eq. (7).

The method of Quenouille for reducing bias (see Section 7-9) is as applicable in the vector case as in the scalar case.

The volume of the region in $\theta$ space defined by inequality Eq. (4) is

$$\mathcal{V}(c) = (c\pi)^{l/2}\,\det{}^{1/2}V_\theta \,/\, \Gamma((l/2) + 1) \tag{7-10-8}$$

7-11. Linearization

Somewhat more incisive results than those of the last section are often obtainable when the model equations can be approximated by linear ones in the vicinity of the estimate $\theta^*$. The single equation model $y_\mu = f(x_\mu, \theta)$ may be approximated by

$$f_\mu = f(x_\mu, \theta) \approx f(x_\mu, \theta^*) + (\partial f_\mu/\partial\theta)^T_{\theta=\theta^*}(\theta - \theta^*) \tag{7-11-1}$$

This model resembles the multiple linear regression situation (Section 4-4), with $B$ designating the $n \times l$ matrix whose $\mu$th row is $\partial f_\mu/\partial\theta^T$, $F$ designating
the $n$-vector whose $\mu$th element is $y_\mu - f(x_\mu, \theta^*)$, and $\theta - \theta^*$ replacing $\theta$ in Eq. (4-4-2). According to Eq. (4-4-9), the vector $\theta - \theta^*$ has covariance $(B^TV^{-1}B)^{-1}$, where $V$ is the covariance matrix of the observations $y_\mu$ [this corresponds to Eq. (7-5-16)]. If the errors in the observations are normally distributed, so are the estimates $\theta^*$; and therefore, as stated above, the quantity

$$\mathcal{T} = (\hat\theta - \theta^*)^T B^TV^{-1}B\,(\hat\theta - \theta^*)$$

is distributed as $\chi^2$ with $l$ degrees of freedom. Hence, we can determine confidence regions for the true value $\hat\theta$ provided $V$ is known.

It is well known from the theory of multiple linear regression that the residual weighted sum of squares

$$\mathcal{S} = [Y - F(X, \theta^*)]^T V^{-1}[Y - F(X, \theta^*)] \tag{7-11-2}$$

is distributed independently of $\mathcal{T}$ as $\chi^2$ with $n - l$ degrees of freedom. Hence $(n - l)\mathcal{T}/l\mathcal{S}$ has the $F_{l,\,n-l}$ distribution. Suppose it is known that $V = \tau Q$, where $Q$ is a known matrix, and $\tau$ an unknown constant. Then

$$\frac{(n - l)\mathcal{T}}{l\mathcal{S}} = \frac{(n - l)(\hat\theta - \theta^*)^T B^TQ^{-1}B\,(\hat\theta - \theta^*)}{l\,[Y - F(X, \theta^*)]^T Q^{-1}[Y - F(X, \theta^*)]} \tag{7-11-3}$$

Thus $(n - l)\mathcal{T}/l\mathcal{S}$ may be computed without knowledge of $\tau$, and tables of the $F$ distribution [e.g., Scheffé (1959)] may be used to obtain confidence regions for $\hat\theta$. An important special case occurs when all observations are independent, so that $V = \sigma^2 I$ and

$$\frac{(n - l)\mathcal{T}}{l\mathcal{S}} = \frac{(n - l)(\hat\theta - \theta^*)^T B^TB\,(\hat\theta - \theta^*)}{l\,[Y - F(X, \theta^*)]^T[Y - F(X, \theta^*)]} \tag{7-11-4}$$

which corresponds to the case of unweighted least squares. For single equation least squares (illustrated in Section 7-21), this simplifies to

$$(n - l)\mathcal{T}/l\mathcal{S} = (\hat\theta - \theta^*)^T N^*(\hat\theta - \theta^*)/2l\hat\sigma^2 \tag{7-11-5}$$

where

$$\hat\sigma^2 = \sum_{\mu=1}^{n} e_\mu^{*2}/(n - l)$$

is the estimated variance of the residuals.

The foregoing discussion was based on the assumption that the model equations were nearly linear in the parameters around the estimate $\theta^*$, at least for variations in $\theta$ of the order of magnitude of the standard deviation of the estimates.
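For unweighted least squares, Eq. (7-11-4) gives a direct membership test for the linearized confidence region: a trial value $\theta$ belongs to the $\gamma$-region when the statistic does not exceed the tabulated upper $\gamma$ point of $F_{l,\,n-l}$. The sketch below takes that critical value as an input, as one would from tables; the function names, the straight-line data, and the trial points are hypothetical. The value 9.0 used in the demo is the upper 10% point of $F_{2,2}$, which follows from the closed form $1 - (1 + x)^{-1}$ of that particular distribution function.

```python
import numpy as np

def f_statistic(theta, theta_star, B, residuals):
    """(n - l) T / (l S) of Eq. (7-11-4) for unweighted least squares.

    B         : (n, l) matrix whose mu-th row is (df_mu/dtheta)^T at theta*
    residuals : (n,) vector Y - F(X, theta*)
    """
    B = np.asarray(B, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    n, l = B.shape
    delta = np.asarray(theta, dtype=float) - np.asarray(theta_star, dtype=float)
    sigma2_hat = residuals @ residuals / (n - l)     # estimated residual variance
    return (delta @ (B.T @ B) @ delta) / (l * sigma2_hat)

def in_confidence_region(theta, theta_star, B, residuals, f_crit):
    """True if theta lies inside the linearized confidence region."""
    return f_statistic(theta, theta_star, B, residuals) <= f_crit

# Hypothetical straight-line fit with n = 4 observations and l = 2 parameters.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
B = np.column_stack([np.ones_like(x), x])
theta_star, *_ = np.linalg.lstsq(B, y, rcond=None)
residuals = y - B @ theta_star

f_crit = 9.0                                   # upper 10% point of F(2, 2)
near = theta_star + np.array([0.1, 0.0])
far = theta_star + np.array([1.0, 0.0])
stat_near = f_statistic(near, theta_star, B, residuals)
stat_far = f_statistic(far, theta_star, B, residuals)
```

Because the region is an ellipsoid centered at $\theta^*$, scanning the statistic along the eigenvectors of $B^TB$ locates the endpoints of its principal axes.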
It may happen that a change of variables will improve the validity of the linearity assumption. A systematic procedure for effecting
such a change of variables is described by Hartley (1964). Procedures for determining the validity of the linearity assumption are described by Beale (1960) and illustrated by Guttman and Meeter (1965).

A very simple method for testing the linearity assumption is the following: Determine a confidence region based on that assumption. Calculate the actual values of the objective function at selected points on the boundary of the region, e.g., at the endpoints of the principal axes. If our assumption is valid, these values should differ only slightly from the approximation value

$$\Phi^* + \tfrac{1}{2}(\theta - \theta^*)^T B^TV^{-1}B\,(\theta - \theta^*)$$

In fact, it may be worthwhile to determine the extent of the region in which the linearity assumption is valid by computing the values of the objective function on the boundaries of a series of successively larger confidence regions, until a serious mismatch between the true and approximate values occurs. Knowledge of this region may be useful in certain applications, such as sequential estimation (see Section 9-3).

7-12. The Posterior Distribution

If we accept the posterior distribution as the probability distribution of our parameters, we are able to define confidence regions in a straightforward way. Let $p^*(\theta)$ be the posterior density. Then for any measurable region $\mathcal{R}$ in parameter space, the quantity $\int_{\mathcal{R}} p^*(\theta)\,d\theta$ is the probability that the true value of $\theta$ lies in $\mathcal{R}$. Hence, any region $\mathcal{R}$ such that $\int_{\mathcal{R}} p^*(\theta)\,d\theta = \gamma$ is a $\gamma$-confidence region (in the Bayesian sense) for $\theta$. If $p^*(\theta)$ is approximately normal with covariance matrix $V_\theta$, then the methods of Section 7-10 can be used to construct such confidence regions. In most other cases, the task is a difficult one.

Suppose our estimate $\theta^*$ has been arrived at by maximizing the logarithm of the posterior distribution. Around $\theta^*$ we then have,
approximately

$$\log p^*(\theta) = \log p^*(\theta^*) - \tfrac{1}{2}(\theta - \theta^*)^T H^*(\theta - \theta^*) \tag{7-12-1}$$

where $H^*$ is the Hessian of $-\log p^*(\theta)$ at $\theta = \theta^*$. It follows that

$$p^*(\theta) \approx c\,\exp[-\tfrac{1}{2}(\theta - \theta^*)^T H^*(\theta - \theta^*)] \tag{7-12-2}$$

which indicates that around $\theta^*$ the posterior distribution looks like a normal distribution with mean $\theta^*$ and covariance $V^* = (H^*)^{-1}$. If we had chosen a constant prior density, then the posterior density is proportional to the likelihood function. If then the model equations are linear in $\theta$, Eq. (2) holds exactly. In this case the posterior distribution of $\theta$ and the sampling distribution of the maximum likelihood estimate coincide, and inferences based on them are identical.
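The last statement can be checked numerically for a linear model with a constant prior. The sketch below assumes, for illustration only, a model $y = B\theta + \text{error}$ with independent errors of known variance $\sigma^2$; the design matrix and the variance are hypothetical. The posterior covariance is the inverse Hessian of $-\log p^*$, while the sampling covariance follows from the fact that the maximum likelihood estimate is a linear function $Gy$ of the data.

```python
import numpy as np

# Hypothetical linear model y = B theta + error, errors independent with variance sigma2.
B = np.column_stack([np.ones(5), np.arange(5.0)])
sigma2 = 0.25
n = B.shape[0]

# Posterior side: with a constant prior, -log p*(theta) = ||y - B theta||^2 / (2 sigma2) + const,
# whose Hessian is B^T B / sigma2 regardless of the data y.
H_star = B.T @ B / sigma2
V_posterior = np.linalg.inv(H_star)

# Sampling side: the maximum likelihood estimate is theta* = G y with G = (B^T B)^{-1} B^T,
# so its covariance over repeated samples is G (sigma2 I) G^T.
G = np.linalg.solve(B.T @ B, B.T)
V_sampling = G @ (sigma2 * np.eye(n)) @ G.T
```

The two matrices agree to rounding error. Once an informative normal prior is added, the Hessian gains the inverse prior covariance as an extra term and the agreement is lost.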
If the model equations are linear and the prior distribution is normal with covariance matrix $V_0$, then Eq. (2) still holds exactly, but the covariance of the posterior distribution no longer coincides with that of the sampling distribution. Apart from constant terms, the posterior distribution has the form

$$\log p^*(\theta) = -\tfrac{1}{2}\sum_{\mu=1}^{n}(y_\mu - B_\mu\theta)^T V^{-1}(y_\mu - B_\mu\theta) - \tfrac{1}{2}(\theta - \theta_0)^T V_0^{-1}(\theta - \theta_0) \tag{7-12-3}$$

This achieves its maximum at

$$\theta^* = \Bigl[\sum_{\mu=1}^{n}(B_\mu^TV^{-1}B_\mu) + V_0^{-1}\Bigr]^{-1}\Bigl[\sum_{\mu=1}^{n}(B_\mu^TV^{-1}y_\mu) + V_0^{-1}\theta_0\Bigr] \tag{7-12-4}$$

and has the negative inverse Hessian

$$V^* = \Bigl[\sum_{\mu=1}^{n}(B_\mu^TV^{-1}B_\mu) + V_0^{-1}\Bigr]^{-1} \tag{7-12-5}$$

It can be shown that the sampling distribution of $\theta^*$ [Eq. (4)] has mean

$$\bar\theta^* = V^*\Bigl[\sum_{\mu=1}^{n}(B_\mu^TV^{-1}B_\mu)\,\hat\theta + V_0^{-1}\theta_0\Bigr] \tag{7-12-6}$$

where $\hat\theta$ is the true value. This is unbiased only when $\theta_0 = \hat\theta$. Furthermore the covariance of the sampling distribution is given by

$$V_\theta = V^*\Bigl[\sum_{\mu=1}^{n}(B_\mu^TV^{-1}B_\mu)\Bigr]V^* \tag{7-12-7}$$

which differs from $V^*$.

7-13. The Residuals

After the estimates $\theta^*$ for the parameters have been obtained, we can compute the final residuals

$$e_\mu^* \equiv e_\mu(\theta^*) = \begin{cases} \hat w_\mu^* - w_\mu & \text{for exact structural model} \\ g(z_\mu, \theta^*) & \text{for inexact structural model} \\ y_\mu - f(x_\mu, \theta^*) & \text{for reduced model} \end{cases} \qquad (\mu = 1, 2, \ldots, n) \tag{7-13-1}$$

These residuals measure the departure of the data from the best curve or surface that could be fitted to them. If the model is exactly valid, these residuals must be related to the errors in the data. If such errors did not exist, there should be no residuals either. The residuals must on the average
be smaller than the errors because we have chosen the $\theta^*$ so as to make the residuals as small as possible. The errors, which should be the residuals obtained with the true values $\hat\theta$, must be larger (unless $\hat\theta = \theta^*$). The residuals, then, are biased estimates of the errors, the bias being toward smaller absolute values.

We now develop approximate expressions for this bias in some typical estimation situations. Suppose the errors in different experiments are uncorrelated, and have the same covariance matrix $V$ in all experiments. Further, suppose $\theta^*$ minimizes an objective function of the form

$$\Phi(\theta) = \Psi(M(\theta)) \tag{7-13-2}$$

as discussed in Section 5-9. At this minimum the gradient of $\Phi$ must vanish. Hence from Eq. (5-9-12) we have

$$q = -2\sum_{\mu=1}^{n} B_\mu^T\,\Gamma\,e_\mu^* = 0 \tag{7-13-3}$$

where

$$B_\mu \equiv -(\partial e_\mu/\partial\theta)_{\theta=\theta^*}, \qquad \Gamma \equiv (\partial\Psi/\partial M)_{\theta=\theta^*} \tag{7-13-4}$$

Assuming the model is correct, the error $\varepsilon_\mu$ is the residual computed for the true $\hat\theta$, i.e.,

$$\varepsilon_\mu = e_\mu(\hat\theta) \tag{7-13-5}$$

Assuming that $\theta^*$ does not differ much from $\hat\theta$, we have from Eq. (1), Eq. (4), and Eq. (5) approximately

$$e_\mu^* \approx \varepsilon_\mu - B_\mu(\theta^* - \hat\theta) \tag{7-13-6}$$

Substituting Eq. (6) in Eq. (3) yields

$$\sum_{\mu=1}^{n} B_\mu^T\,\Gamma\,[\varepsilon_\mu - B_\mu(\theta^* - \hat\theta)] = 0 \tag{7-13-7}$$

whence

$$\theta^* - \hat\theta = C^{-1}\sum_{\eta=1}^{n} B_\eta^T\,\Gamma\,\varepsilon_\eta \tag{7-13-8}$$

where

$$C \equiv \sum_{\mu=1}^{n} B_\mu^T\,\Gamma\,B_\mu \tag{7-13-9}$$

Substituting Eq. (8) in Eq. (6)

$$e_\mu^* = \varepsilon_\mu - B_\mu C^{-1}\sum_{\eta=1}^{n} B_\eta^T\,\Gamma\,\varepsilon_\eta \tag{7-13-10}$$

Since by assumption

$$E(\varepsilon_\mu\varepsilon_\eta^T) = \delta_{\mu\eta}V$$
we obtain from Eq. (10)

\[ E(e_\mu^* e_\mu^{*T}) = V - B_\mu C^{-1} B_\mu^T \Gamma V - V \Gamma B_\mu C^{-1} B_\mu^T + B_\mu C^{-1} \Big(\sum_{\eta=1}^{n} B_\eta^T \Gamma V \Gamma B_\eta\Big) C^{-1} B_\mu^T \tag{7-13-11} \]

An appropriate measure of the residuals is given by the (unadjusted) sample covariance matrix $V^*$, defined by

\[ V^* \equiv (1/n)\,M(\theta^*) = (1/n) \sum_{\mu=1}^{n} e_\mu^* e_\mu^{*T} \tag{7-13-12} \]

So that from Eq. (11)

\[ E(V^*) = V - (1/n)\Big(\sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T\Big)\Gamma V - (1/n)\,V\Gamma\Big(\sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T\Big) + (1/n) \sum_{\mu=1}^{n} B_\mu C^{-1} \Big(\sum_{\eta=1}^{n} B_\eta^T \Gamma V \Gamma B_\eta\Big) C^{-1} B_\mu^T \tag{7-13-13} \]

Of particular interest are the cases where $\Gamma$ is proportional to $V^{-1}$, as in rows 1 and 2 of Table 5-1 ($V$ known or proportional to a known matrix). It is easy to verify that Eq. (13) remains unchanged if $\Gamma$ is multiplied by a constant, since $C$ would be multiplied by the same constant. Hence, we may simply substitute $\Gamma = V^{-1}$, and obtain

\[ E(V^*) = V - (2/n) \sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T + (1/n) \sum_{\mu=1}^{n} B_\mu C^{-1} C C^{-1} B_\mu^T = V - (1/n) \sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T \tag{7-13-14} \]

As expected, $E(V^*)$ is "smaller" than $V$, since $V - E(V^*)$ is positive definite.

Consider the case of a single equation model, with $V = \sigma^2$. In this case $B_\mu^T$ is a vector $b_\mu$. Hence from Eq. (9)

\[ C = (1/\sigma^2) \sum_{\mu=1}^{n} b_\mu b_\mu^T \tag{7-13-15} \]

and, since $C$ is an $l \times l$ matrix ($l$ being the number of parameters)

\[ \sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T = \sum_{\mu=1}^{n} b_\mu^T C^{-1} b_\mu = \mathrm{Tr}\Big(C^{-1} \sum_{\mu=1}^{n} b_\mu b_\mu^T\Big) = \mathrm{Tr}(C^{-1} \sigma^2 C) = \sigma^2\,\mathrm{Tr}\,I_l = l\sigma^2 \tag{7-13-16} \]

Therefore, Eq. (14) becomes

\[ E(V^*) = (1 - l/n)\,\sigma^2 \tag{7-13-17} \]
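The bias factor of Eq. (7-13-17) can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation: it assumes a hypothetical straight-line model with $l = 2$ parameters, $n = 10$ equally spaced points, and normal errors with $\sigma = 1$ (all of these choices, and the function name, are illustrative assumptions). Because the model is linear, the factor $1 - l/n$ should hold up to sampling noise.

```python
import random

def mean_unadjusted_variance(n, sigma, replications, seed=0):
    """Average of V* = (1/n) * sum(e^2) over simulated straight-line fits."""
    rng = random.Random(seed)
    x = [float(i) for i in range(n)]
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    total = 0.0
    for _ in range(replications):
        # Hypothetical true parameters (1.0, 0.5); the bias does not depend on them.
        y = [1.0 + 0.5 * xi + rng.gauss(0.0, sigma) for xi in x]
        y_bar = sum(y) / n
        # Closed-form least squares estimates for intercept and slope.
        slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
        intercept = y_bar - slope * x_bar
        rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
        total += rss / n  # unadjusted sample variance V* of Eq. (7-13-12)
    return total / replications

n, l, sigma = 10, 2, 1.0  # l = 2 parameters: intercept and slope
v_star_mean = mean_unadjusted_variance(n, sigma, replications=20000)
predicted = (1 - l / n) * sigma ** 2   # Eq. (7-13-17)
adjusted = v_star_mean / (1 - l / n)   # dividing out the bias factor
print(v_star_mean, predicted, adjusted)
```

The simulated average of $V^*$ should sit near $0.8\sigma^2$ rather than $\sigma^2$, and dividing by $1 - l/n$ should bring it back close to $\sigma^2$.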
If we want to estimate $\sigma^2$ ($\sigma$ being the standard deviation of the errors) from the residuals, we use

\[ \tilde{\sigma}^2 = \frac{1}{1 - l/n}\,V^* = \frac{1}{1 - l/n}\,\frac{1}{n} \sum_{\mu=1}^{n} e_\mu^{*2} = \frac{1}{n - l} \sum_{\mu=1}^{n} e_\mu^{*2} \tag{7-13-18} \]

This is a well known formula, which states that the variance of the errors is estimated by dividing the sum of squares of the residuals by the number of degrees of freedom, which is the number of observations $n$ less the number of unknown parameters $l$.

If the model contains $m > 1$ equations, the situation is more complicated. Suppose, however, that we wish to estimate $V$ as some multiple of $V^*$; say we assume

\[ E(V^*) = \rho V \tag{7-13-19} \]

Then, substituting Eq. (19) in Eq. (14) and multiplying by $V^{-1}$, we obtain

\[ \rho I_m = I_m - (1/n) \sum_{\mu=1}^{n} B_\mu C^{-1} B_\mu^T V^{-1} \tag{7-13-20} \]

Taking traces on both sides, and remembering that if $AB$ is a square matrix then $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$, we obtain

\[ \rho m = m - (1/n)\,\mathrm{Tr}\Big[C^{-1}\Big(\sum_{\mu=1}^{n} B_\mu^T V^{-1} B_\mu\Big)\Big] = m - (1/n)\,\mathrm{Tr}(C^{-1} C) = m - l/n \]

so that

\[ \rho = 1 - l/mn \tag{7-13-21} \]

Hence, to estimate $V$ we use the (adjusted) sample covariance matrix

\[ \tilde{V} = \frac{1}{\rho}\,V^* = \frac{1}{1 - l/mn}\,\frac{1}{n}\,M(\theta^*) = \frac{1}{n - l/m} \sum_{\mu=1}^{n} e_\mu^* e_\mu^{*T} \tag{7-13-22} \]

Once again, to estimate the covariance matrix of the errors, we take the moment matrix of the residuals and divide by the number of degrees of freedom per equation, that is, the number of observations $n$ per variable less the "average" number of parameters per equation $l/m$. Clearly, Eq. (22) reduces to Eq. (18) in the single equation case $m = 1$.

In the case where $V$ is completely unknown, we have from row 3 of Table 5-1 $\Gamma$ proportional to $M^{-1}$. If we now make the further assumption that $M$ is proportional to $V$ we have once more $\Gamma$ proportional to $V^{-1}$, and Eq. (22) is still valid. The maximum likelihood estimate $V = (1/n)M(\theta^*)$ is biased by
the factor $1 - l/mn$. Since this factor approaches unity as $n \to \infty$, the maximum likelihood estimate is consistent.

The formulas derived here should be viewed with caution, since the residuals for one equation may turn out much smaller than predicted, while another equation has much larger ones. It is only on the average, in a certain sense, that our bias factors apply.

7-14. The Independent Variables Subject to Error

Even consistency does not hold when the independent variables are subject to error, so the number of unknown parameters is essentially proportional to $n$. A very rough indication of the bias to be expected even when the number of experiments is very large can be obtained as follows:

Suppose we have an $m$-equation model with $r$ variables subject to error. Let the observed, estimated, and true values of these variables for the $\mu$th experiment be designated $w_\mu$, $w_\mu^*$, and $\hat{w}_\mu$, respectively. We assume these three values are not far apart. By definition, the residuals are

\[ e_\mu^* \equiv w_\mu^* - w_\mu \tag{7-14-1} \]

and the errors are

\[ \varepsilon_\mu \equiv \hat{w}_\mu - w_\mu \tag{7-14-2} \]

Since we are dealing with the asymptotic case, we assume that our estimates $\theta^*$ are errorless, i.e., $\theta^* = \hat{\theta}$. The estimates $w_\mu^*$ must then minimize

\[ \tfrac{1}{2}(w_\mu^* - w_\mu)^T V_\mu^{-1} (w_\mu^* - w_\mu) \]

subject to $g(w_\mu^*, \theta^*) = 0$. Letting $A_\mu \equiv (\partial g/\partial w_\mu)_{\theta=\theta^*}$ we have approximately

\[ g(w_\mu^*, \theta^*) = g(w_\mu, \theta^*) + A_\mu(w_\mu^* - w_\mu) = 0 \tag{7-14-3} \]

Since $\theta^* = \hat{\theta}$, the true values $\hat{w}_\mu$ also satisfy $g(\hat{w}_\mu, \theta^*) = 0$, i.e., approximately

\[ g(\hat{w}_\mu, \theta^*) = g(w_\mu, \theta^*) + A_\mu(\hat{w}_\mu - w_\mu) = g(w_\mu, \theta^*) + A_\mu\varepsilon_\mu = 0 \tag{7-14-4} \]

Solving Eq. (4) for $g(w_\mu, \theta^*)$ and substituting in Eq. (3)

\[ -A_\mu\varepsilon_\mu + A_\mu(w_\mu^* - w_\mu) = 0 \tag{7-14-5} \]

Because of its nature as the solution to an equality-constrained minimization problem, $w_\mu^*$ must be a stationary point of the Lagrangian

\[ \tfrac{1}{2}(w_\mu^* - w_\mu)^T V_\mu^{-1} (w_\mu^* - w_\mu) + \lambda^T g(w_\mu^*, \theta^*) \]
Differentiating with respect to $w_\mu^*$

\[ V_\mu^{-1}(w_\mu^* - w_\mu) + A_\mu^T\lambda = 0 \tag{7-14-6} \]

whence

\[ w_\mu^* - w_\mu = -V_\mu A_\mu^T\lambda \tag{7-14-7} \]

so that Eq. (5) becomes

\[ -A_\mu\varepsilon_\mu - A_\mu V_\mu A_\mu^T\lambda = 0 \tag{7-14-8} \]

and

\[ \lambda = -(A_\mu V_\mu A_\mu^T)^{-1} A_\mu\varepsilon_\mu \tag{7-14-9} \]

Now Eq. (1), Eq. (7) and Eq. (9) combine to form

\[ e_\mu^* = V_\mu A_\mu^T (A_\mu V_\mu A_\mu^T)^{-1} A_\mu\varepsilon_\mu \tag{7-14-10} \]

and remembering that $E(\varepsilon_\mu\varepsilon_\mu^T) = V_\mu$, we obtain

\[ E(V_\mu^*) \equiv E(e_\mu^* e_\mu^{*T}) = V_\mu A_\mu^T (A_\mu V_\mu A_\mu^T)^{-1} A_\mu V_\mu \tag{7-14-11} \]

Assuming once more that $V_\mu^*$ is proportional to $V_\mu$, e.g., $V_\mu^* = \rho V_\mu$, Eq. (11) reduces to

\[ \rho I_r = V_\mu A_\mu^T (A_\mu V_\mu A_\mu^T)^{-1} A_\mu \tag{7-14-12} \]

Taking traces

\[ \rho r = \mathrm{Tr}[V_\mu A_\mu^T (A_\mu V_\mu A_\mu^T)^{-1} A_\mu] = \mathrm{Tr}[(A_\mu V_\mu A_\mu^T)(A_\mu V_\mu A_\mu^T)^{-1}] = \mathrm{Tr}\,I_m = m \tag{7-14-13} \]

so that

\[ \rho = m/r \tag{7-14-14} \]

If $m = r$, i.e., there is only one inaccurately measured variable per equation, $\rho = 1$. There is no asymptotic bias and $V_\mu^*$ is a consistent estimate of $V_\mu$. This corresponds to the situation discussed in Section 7-13. In the problem of Section 6-13, however, we had one equation with $x_1$, $x_2$, and $y$ subject to error. Here $\rho = 1/3$, and we expect the residual covariance to be only one-third the true covariance, no matter how many experiments are performed.

It is clear that if we assume all $V_\mu$ equal, then the matrix

\[ V^* = (1/n) \sum_{\mu=1}^{n} e_\mu^* e_\mu^{*T} \tag{7-14-15} \]

has the same bias factor $\rho$. If, in addition, we correct for the bias caused by the $l$ parameters $\theta$, we arrive at the estimate

\[ \tilde{V} = \frac{r}{m(n - l/m)} \sum_{\mu=1}^{n} e_\mu^* e_\mu^{*T} \tag{7-14-16} \]
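The trace argument leading to Eq. (7-14-14) can be illustrated for the simplest case of one equation ($m = 1$) with three variables subject to error ($r = 3$). The sketch below assumes a diagonal $V_\mu$ and a hypothetical constraint gradient $A_\mu$; the numerical values and the function name are illustrative assumptions, not data from the text.

```python
def projection_matrix(a, v_diag):
    """P = V A^T (A V A^T)^{-1} A for a single constraint row A = a
    and a diagonal covariance V = diag(v_diag)."""
    denom = sum(ai * ai * vi for ai, vi in zip(a, v_diag))  # scalar A V A^T
    return [[vi * ai * aj / denom for aj in a] for ai, vi in zip(a, v_diag)]

# Hypothetical gradient of g with respect to (x1, x2, y) and error variances.
a = [2.0, -1.0, 0.5]
v_diag = [0.04, 0.09, 0.01]

m, r = 1, len(a)
p = projection_matrix(a, v_diag)
trace = sum(p[i][i] for i in range(r))   # equals m, as in Eq. (7-14-13)
rho = trace / r                          # Eq. (7-14-14)
print(trace, rho)
```

Whatever gradient and variances are chosen, the trace comes out equal to $m = 1$, so the average bias factor is $1/3$, in line with the remark about the problem of Section 6-13.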
7-15. Goodness of Fit

The crucial question that arises after the estimates have been obtained is whether or not our model fits the data. The question can be answered in the affirmative if the residuals of the fitted model can be explained as errors in the observations. On the other hand, if the residuals are so large, or of such a nonrandom nature, that they cannot be ascribed to random observation errors, then we say that the model does not fit the data. We stress that whereas a lack of fit constitutes strong grounds for rejecting, or at least amending the model, a good fit does not prove that the model is correct. A good fit merely establishes the fact that there is no reason to reject the model on the basis of the data at hand. In fact, no amount of data can ever prove a model; all we can hope is that it does not disprove it.

Our least squares or maximum likelihood estimates were usually based on the assumption that the errors $\varepsilon_\mu$ in each experiment were realizations of a random variable with mean 0 and covariance matrix $V$. After estimating the parameters we have a set of residuals $e_\mu^*$ from which we compute an adjusted covariance matrix $\tilde{V}$ (see, e.g., Eq. (7-13-22)). To establish the goodness of fit is to test the hypothesis that, with certain reservations,† the residuals form a sample from the distribution that we have postulated for the errors, corrected for the bias discussed in Sections 7-13 and 7-14.

To test a statistical hypothesis we usually compute a certain relevant statistic $\lambda$ from the sample. We compare $\lambda$ to a certain reference value $\lambda_0$, and reject the hypothesis if $\lambda \ge \lambda_0$. In doing so we incur the danger of erring in one of the following ways:

1. Error of the first kind; we reject the hypothesis although it is true.
2. Error of the second kind; we accept the hypothesis although it is false.

On the assumption that the model is correct, the distribution of the statistic $\lambda$ may be determined,
and hence we can find the value $\lambda_0$ such that

\[ \Pr(\lambda \ge \lambda_0) = \alpha \tag{7-15-1} \]

where $\alpha$ is a suitably chosen small number, e.g., 0.05 or 0.01. If we reject the model when $\lambda \ge \lambda_0$, the probability of committing an error of the first kind is $\alpha$. The probability of committing an error of the second kind depends on what the true model actually is, and we shall not consider this question here.

† E.g., the residuals are serially correlated even when the errors are independent.

The statistics that we shall use are valid for a wide class of error distributions. The distributions of the statistics, however, are known and tabulated primarily for the case when the distribution of the errors is normal. Only
in this case is it easy to find the test value $\lambda_0$ associated with a given probability $\alpha$. On the other hand, it makes no difference here whether or not the model equations are linear. Residuals should be attributable to errors, no matter what kind of model they were computed from.

When the distribution of $\lambda$ is unknown and cannot be derived by analytic means, it is still easy to estimate the critical $\lambda_0$ by Monte Carlo techniques. Error samples with the proper distribution are generated on the computer, the statistic $\lambda$ is computed for each sample, and $\lambda_0$ is chosen so that $\lambda$ exceeds it in a fraction $\alpha$ of the samples.

Various commonly used statistics $\lambda$ are described in the next section.

7-16. Tests on Residuals

The tests we wish to perform on the residuals relate to their mean and covariance. In many cases questions concerning the mean of the residuals are of no significance, since the estimation process guarantees that (except for the effect of rounding errors) the mean is zero. For instance, whenever a model of the form

\[ y = \theta_1 + \varphi(x, \theta_2, \theta_3, \ldots, \theta_l) \tag{7-16-1} \]

is estimated by least squares, the average residual is zero. For, let the objective function be given by

\[ \Phi = \sum_{\mu=1}^{n} e_\mu^2 = \sum_{\mu=1}^{n} (y_\mu - \theta_1 - \varphi_\mu)^2 \tag{7-16-2} \]

To minimize $\Phi$ we form as one of the normal equations

\[ \partial\Phi/\partial\theta_1 = -2 \sum_{\mu=1}^{n} (y_\mu - \theta_1 - \varphi_\mu) = -2 \sum_{\mu} e_\mu = 0 \tag{7-16-3} \]

Thus the sum of the residuals is zero, and so is their average.

Suppose, however, that we have a model that does not guarantee zero average residuals. Now if the errors in each experiment have covariance matrix $V$, we expect the residuals to have covariance matrix $(1 - l/mn)V$ [see Eq. (7-13-22)].
The mean residual vector

\[ \bar{e} = (1/n) \sum_{\mu=1}^{n} e_\mu^* \tag{7-16-4} \]

should have covariance matrix $(1/n)(1 - l/mn)V$, since the variance of the mean of $n$ independent observations is $1/n$ times the variance of the observations (we neglect the fact that the residuals are correlated even when the
observations are not). If the errors $\varepsilon_\mu$ are assumed to be $N_m(0, V)$, then the average residuals $\bar{e}$ should be $N_m(0, (1/n)(1 - l/mn)V)$, and we may easily construct confidence regions for $\bar{e}$, as we have done for $\theta$ in Section 7-10. In particular, the statistic

\[ \lambda \equiv [n/(1 - l/mn)]\,\bar{e}^T V^{-1} \bar{e} \tag{7-16-5} \]

is distributed as $\chi_m^2$. If $V$ is not known, we must introduce the matrix

\[ S \equiv [1/(n - 1)] \sum_{\mu=1}^{n} (e_\mu^* - \bar{e})(e_\mu^* - \bar{e})^T \tag{7-16-6} \]

If our hypothesis is true, then the statistic

\[ \lambda \equiv \frac{(n - m)\,n}{(n - 1)\,m}\,\bar{e}^T S^{-1} \bar{e} \tag{7-16-7} \]

is distributed as $F_{m,n-m}$. The quantity $[(n - 1)m/(n - m)]\lambda$ is sometimes known as $T^2$ [see Anderson (1958, Ch. 5)]. For a single equation model $m = 1$ and $T^2 = \lambda$. In this case, the quantity $\lambda^{1/2}$ is known as $t$, and the associated test is the well-known $t$-test.

If the zero-mean hypothesis is accepted, we may wish to test the hypothesis that the errors in each experiment possess a given covariance matrix $V$. That is, we wish to compare the covariance of the residuals $\tilde{V}$ given by Eq. (7-13-22) with $V$. An appropriate statistic is given by

\[ \lambda = n\,[\log\det(V\tilde{V}^{-1}) - m + \mathrm{Tr}(\tilde{V}V^{-1})] \tag{7-16-8} \]

Its distribution in the normal case is tabulated by Korin (1968).

Another frequently tested hypothesis is that two sets of residuals are uncorrelated. The need for such a test arises when we base our estimates on an assumed diagonal covariance matrix of the errors. We then obtain a covariance matrix $\tilde{V}$ of the residuals, and wish to find out whether $\tilde{V}_{ab}$ ($a \ne b$) differs significantly from zero. We compute the correlation coefficient

\[ r_{ab} = \tilde{V}_{ab}/(\tilde{V}_{aa}\tilde{V}_{bb})^{1/2} \tag{7-16-9} \]

If $r_{ab}$ is the correlation coefficient computed from a sample of $n^*$ independent pairs of mutually independent normal deviates, then the quantity

\[ \lambda = r_{ab}\,[(n^* - 2)/(1 - r_{ab}^2)]^{1/2} \tag{7-16-10} \]

has the $t$-distribution with $n^* - 2$ degrees of freedom (Anderson, 1958, p. 65).
For our purposes, we should probably take $n^* = n - l/m$, but with $n^* > 10$ the $t$-distribution is quite insensitive to the number of degrees of freedom.
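The Monte Carlo procedure of Section 7-15 can be sketched for the correlation statistic of Eq. (7-16-10). This is a minimal illustration under stated assumptions: $n^* = 20$ pairs, $\alpha = 0.05$, a two-sided test on $|\lambda|$, and 20,000 simulated samples of independent standard normal deviates; the function names are hypothetical. The simulated critical value should fall close to the tabulated two-sided 5% point of $t$ with 18 degrees of freedom, about 2.10.

```python
import math
import random

def correlation_t_statistic(a, b):
    """lambda = r * sqrt((n - 2) / (1 - r^2)) for paired samples a, b."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    r = sab / math.sqrt(saa * sbb)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def monte_carlo_critical_value(n_star, alpha, samples, seed=0):
    """Value lambda_0 that |lambda| exceeds in roughly a fraction alpha
    of simulated samples of independent normal pairs."""
    rng = random.Random(seed)
    stats = []
    for _ in range(samples):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_star)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_star)]
        stats.append(abs(correlation_t_statistic(a, b)))
    stats.sort()
    index = int(math.ceil((1.0 - alpha) * samples)) - 1
    return stats[index]

lambda_0 = monte_carlo_critical_value(n_star=20, alpha=0.05, samples=20000)
print(lambda_0)
```

The same scheme applies to any statistic whose distribution is not tabulated: only the sample generator and the statistic function need to be replaced.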
Suppose $n^* = 20$ and $r_{ab} = -0.25$; we have $\lambda = -1.095$. According to Table 4 (Cramér, 1946), the probability that $|t_{18}|$ exceeds 1.095 is a sizable 29%, so we cannot reject the hypothesis that $V_{ab} = 0$. We could be 99% sure that $V_{ab} \ne 0$ if we have $|t_{18}| \ge 2.878$, corresponding to $|r_{ab}| \ge 0.561$. Further examples appear in Section 7-24.

7-17. Runs and Outliers

Residuals that have passed the tests of the previous section may still be unsatisfactory. Though of reasonable magnitude, they may display trends and other departures from randomness that call for modifications in the model. The reader is referred to Acton (1959, Ch. 3) or to Draper and Smith (1966) for an excellent treatment of this problem on a practical level. Briefly, the residuals should be plotted against the various variables that are included in the model, and also against the time at which the observations were taken. Linear, quadratic, or periodic trends may reveal themselves and will call for the inclusion of appropriate additional terms in the model. Trends in the variance of the errors may also be detected, and may shed some light on the measuring process. Finally we may test for randomness by counting the number of "runs" in the residuals, a run being a sequence of residuals of equal sign. If the number of runs is much lower than expected the randomness of the residuals is suspect.

If $n_1$ and $n_2$ are the numbers of negative and positive residuals respectively, then the expected number of runs (on the assumption of complete randomness) is

\[ \mu = 2n_1 n_2/(n_1 + n_2) + 1 \tag{7-17-1} \]

and the variance of the number of runs is

\[ \sigma^2 = 2n_1 n_2 (2n_1 n_2 - n_1 - n_2)/[(n_1 + n_2)^2 (n_1 + n_2 - 1)] \tag{7-17-2} \]

The actual distribution of the number of runs was derived and tabulated by Swed and Eisenhart (1943). A table also appears in Draper and Smith (1966, p. 98). When both $n_1$ and $n_2$ exceed 10, then the quantity

\[ z \equiv (r - \mu + \tfrac{1}{2})/\sigma \tag{7-17-3} \]

($r$ = number of actual runs) is distributed approximately as $N_1(0, 1)$. A numerical example appears in Section 7-24.

We stress that failure to pass the number of runs test is no reason for outright rejection of the model. Usually it is merely an indication that some possibly minor effects have been neglected. Particularly in cases where the
data are very accurate, neglected effects outweigh random errors in measurement. Consequently, nonrandomness of residuals is the rule, rather than the exception, when models are fitted to good data.

Many tests on residuals are best accomplished by graphical means. If the probability distribution of the errors is to be investigated, a histogram or a cumulative frequency plot is called for. Suppose the residuals are renumbered so that $e_1$ is the smallest (algebraically) and $e_n$ is the largest. Let $p_i \equiv (i - \tfrac{1}{2})/n$; then $p_i$ is an estimate of the quantity $\Pr(e \le e_i)$. A plot of $p_i$ versus $e_i$ then approximates the cumulative distribution function of the errors. When this plot is made on normal probability paper, the result should be a straight line if the error distribution is normal. If all the points are reasonably close to a straight line except for a few at the low and high ends, then the presence of outliers (see below) is suspected. If the points seem to fall into a few clusters rather than follow a smooth curve, then one may conclude that different sources of error were operative in different subsets of the observations.

It may happen that some gross error is committed in the conduct or recording of some experiment. Naturally, the erroneous observations give rise to unusually large residuals, called outliers. More seriously, such erroneous values can gravely distort the parameter estimates. Therefore, one wishes to eliminate such observations from the analysis, and the easiest way to spot them is by examination of the residuals. If there is a clear-cut differentiation between the "regular" residuals which fall on the smooth part of the probability plot, and the "outliers", then we should not hesitate to remove the latter and recompute the estimates in their absence. However, if the distinction is blurred, then the problem of diagnosing outliers is a difficult one.
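The runs test of Eqs. (7-17-1) to (7-17-3) is simple enough to compute directly. The sketch below is illustrative only: the function name is hypothetical, exact zeros are dropped before counting (an assumption the text does not address), and the sample residuals are made up to show an extreme case of two long runs.

```python
import math

def runs_test(residuals):
    """Return (runs, mu, variance, z) for the signs of the residuals,
    following Eqs. (7-17-1) to (7-17-3). Zero residuals are ignored."""
    signs = [e > 0 for e in residuals if e != 0]
    n2 = sum(signs)            # number of positive residuals
    n1 = len(signs) - n2       # number of negative residuals
    # A new run starts at every sign change.
    runs = 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    variance = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu + 0.5) / math.sqrt(variance)
    return runs, mu, variance, z

# Hypothetical residuals: twelve positive values followed by twelve negative.
residuals = [0.3] * 12 + [-0.2] * 12
runs, mu, variance, z = runs_test(residuals)
print(runs, mu, variance, z)
```

With only two runs where about thirteen are expected, $z$ is strongly negative, which is the pattern the text describes as casting suspicion on the randomness of the residuals. The normal approximation is used here because both counts exceed 10; for smaller samples the exact tables cited above would be needed.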
A procedure often adopted in practice is to remove all residuals whose magnitude exceeds the standard deviation (either known or estimated using all residuals) by a fixed factor, say 2.5 or 3. When setting such a threshold one should take into account the probability of residuals of such magnitude occurring by chance in a population of size $n$. For instance, with a normal distribution and 100 observations the probability of finding at least one residual exceeding $3\sigma$ in magnitude is 23.7%, giving one little reason for rejecting such a residual out-of-hand. For a more systematic approach, see Anscombe (1960).

7-18. Causes of Failure

If the parameters turn out well-determined and the residuals are acceptable, then our estimation problem is solved. Only too often, however, we run into one of the following less satisfactory situations:
(a) Parameters ill-determined, residuals large but acceptable, since the measurement errors are known to be large. Barring the possibility of reducing measurement errors, we can improve our estimates only by conducting many more experiments. As a rule, the standard deviation of estimates decreases roughly as $n^{-1/2}$ so that a tenfold improvement in the estimate requires a hundredfold increase in the number of experiments.

(b) Parameters (or some linear combinations of them) ill-determined, residuals small. This may be due to a degeneracy in the model. For example, in the model

\[ y = (\theta_1 + \theta_2)x \tag{7-18-1} \]

it is obviously impossible to estimate $\theta_1$ and $\theta_2$ separately. The degeneracy is not always quite so obvious. Consider for instance our falling sphere model Eq. (2-14-5). If we write out that equation in full we find that the distance $s$ travelled by the sphere in time $t$ is

\[ s = \frac{g(m - m_0)}{6\pi r\mu}\,t - \frac{g\,m\,(m - m_0)}{36\pi^2 r^2 \mu^2}\left(1 - e^{-(6\pi r\mu/m)t}\right) \tag{7-18-2} \]

Some study is required to see that among the parameters $g$, $m$, $m_0$, $r$, and $\mu$ appearing in the equation, only two can be estimated independently.

An even more common source of degeneracy are the data themselves. For example, suppose that for the model

\[ y = \theta_1 x_1 + \theta_2 x_2 \tag{7-18-3} \]

we have made many observations of $y$ at different values of $x_1$ and $x_2$, but by chance in each experiment $x_1$ turned out to be approximately equal to $x_2$. For these data, model Eq. (3) is indistinguishable from

\[ y = (\theta_1 + \theta_2)x_1 \tag{7-18-4} \]

in which $\theta_1$ and $\theta_2$ cannot be estimated independently. The above case appears trivial, but similar conditions obtain, perhaps in more subtle form, in many experimental situations. The only solution is to plan the experiments properly, as indicated in Chapter X.

(c) Parameters ill-determined and residuals unacceptable.
The model must be rejected, or at least amended to include those effects that were observed in the residuals.

One of the hypotheses underlying a model is that the unknown parameters are constants that do not depend on the model variables. It is clearly desirable to test this hypothesis, and this can be done if the data are sufficiently rich. To test, for instance, the hypothesis that $\theta$ is independent of some variable $z_1$, we break up the data into subsets each corresponding to a single value, or a narrow range of values, of $z_1$. We estimate the parameters separately
from each data subset, and employ the usual statistical techniques to test whether the estimates obtained from the subsets are significantly different from the estimate obtained from the whole sample, or whether the subset estimates show any trends or other functional relationships with $z_1$. If such relationships exist, they may be used to amend the original model. This technique has been described by Hunter and Mezaki (1964) and Box and Hunter (1965).

It is not always possible to apply this method directly. For instance, let the model be

\[ y = \theta_1 + \theta_2 x_1 + \theta_3 x_2 \tag{7-18-5} \]

It is impossible to estimate the parameters $\theta_1$ and $\theta_2$ separately if we restrict the data to a single value of $x_1$. However, we may still break up the entire range of $x_1$ values into a few fairly wide intervals, and obtain a separate estimate for each range.

7-19. Prediction

Perhaps the most important object of mathematical modeling of physical situations is that of predicting future responses to given conditions. The estimation procedures provide values for the parameters to be inserted into the prediction equations. These equations need not be the same as the model equations used for estimation, nor need the variables to be predicted coincide with the dependent variables of the model equations. For instance, we observe the time a liquid takes to flow through a capillary tube in order to estimate viscosity; we use the viscosity to predict damping factors for standing waves in a pool. At any rate, let us say that we wish to predict the value of some vector $\eta$, based on the value of a vector of independent variables $\xi$ and a vector of parameters $\theta$. The prediction is to be made on the basis of the model

\[ \eta_p = \phi(\xi, \theta^*) \tag{7-19-1} \]

where the subscript $p$ stands for the predicted value.

Assuming the model itself is correct, there are three possible sources of inaccuracy in the predictions: errors in the estimated $\theta^*$, errors in the setting
of $\xi$, and errors in the measurement of $\eta$. All three sources contribute to the difference between the predicted $\eta_p$ and the eventually observed $\eta$. Usually (except in purely linear models) there will be some bias in the predicted $\eta_p$, but there is little that we can say about it. Assuming, however, that this bias is small compared to the other errors involved, and that the errors from all three sources are statistically independent, then we can obtain an approximation to the covariance matrix of the prediction errors.
Suppose we denote the three errors by $\delta\theta$, $\delta\xi$ and $\delta\eta$, respectively. The observed value of $\eta$ will be given by

\[ \eta_o = \phi(\xi + \delta\xi, \theta^* + \delta\theta) + \delta\eta \tag{7-19-2} \]

A Taylor series expansion up to linear terms yields

\[ \eta_o - \eta_p = (\partial\phi/\partial\xi)\,\delta\xi + (\partial\phi/\partial\theta)\,\delta\theta + \delta\eta \tag{7-19-3} \]

The covariance matrix of the prediction errors is given by

\[ V_{\eta_p} \equiv E(\eta_o - \eta_p)(\eta_o - \eta_p)^T = (\partial\phi/\partial\xi)\,V_\xi\,(\partial\phi/\partial\xi)^T + (\partial\phi/\partial\theta)\,V_\theta\,(\partial\phi/\partial\theta)^T + V_\eta \tag{7-19-4} \]

where $V_\xi$, $V_\theta$ and $V_\eta$ are, respectively, the covariance matrices of $\delta\xi$, $\delta\theta$, and $\delta\eta$. The first term on the right hand side of Eq. (4) may be omitted if $\xi$ can be set (or is known) precisely. The matrix $V_\theta$ is obtained in the process of estimating the parameters, as shown in Section 7-5. If $\eta$ coincides with the $y$ in the model equations, then $V_\eta$ is estimated (if not known previously) from the residuals, as in Section 7-13.

7-20. Parameter Transformation

It is frequently convenient to perform the estimation not in terms of the original parameters of the model, but in terms of transformed variables which simplify the mathematical form of the model equations. Examples of this have been given in Section 4-19, in connection with linearizing transformations, and the point is also illustrated in the problem of Section 5-23.

Let us assume, then, that we have estimated a vector $\theta$ of $l$ parameters which are functions $\theta(c)$ of the original model parameters $c$. Let $\theta^*$ and $V_\theta$ be the estimates for $\theta$ and its covariance matrix, respectively. If the transformation from $c$ to $\theta$ is reversible around $\theta = \theta^*$, i.e., if in the neighborhood of $\theta^*$ there exist functions $\gamma(\theta)$ such that $c = \gamma(\theta)$ is a unique solution to the equations $\theta = \theta(c)$, then $J \equiv \partial\gamma/\partial\theta$ is nonsingular at $\theta = \theta^*$. A first-order Taylor series expansion of $\gamma$ has the form

\[ c = \gamma(\theta^*) + (\partial\gamma/\partial\theta)(\theta - \theta^*) = c^* + J^*\,\delta\theta \tag{7-20-1} \]

where $J^* \equiv J_{\theta=\theta^*}$. Hence, approximately

\[ V_c \equiv E(c - c^*)(c - c^*)^T \approx E(J^*\,\delta\theta\,\delta\theta^T J^{*T}) = J^* V_\theta J^{*T} \tag{7-20-2} \]

Eq. (2) may be regarded as a special case of Eq. (7-19-4), where the $\eta$ to be predicted are simply the $c$. See Section 7-24 for a numerical example.
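The propagation rule of Eq. (7-20-2) amounts to two small matrix products. The following sketch assumes a purely hypothetical transformation $c = \gamma(\theta) = (e^{\theta_1}, \theta_1\theta_2)$ and made-up values for $\theta^*$ and $V_\theta$; none of these come from the text, and the helper functions are written out only to keep the example self-contained.

```python
import math

def gamma(theta):
    """Hypothetical inverse transformation c = gamma(theta)."""
    t1, t2 = theta
    return [math.exp(t1), t1 * t2]

def jacobian(theta):
    """Analytic J = d(gamma)/d(theta) for the transformation above."""
    t1, t2 = theta
    return [[math.exp(t1), 0.0],
            [t2, t1]]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Assumed estimate and covariance in the transformed parameters.
theta_star = [1.0, 2.0]
v_theta = [[0.04, 0.01],
           [0.01, 0.09]]

c_star = gamma(theta_star)
j_star = jacobian(theta_star)                                  # J* at theta = theta*
v_c = matmul(matmul(j_star, v_theta), transpose(j_star))       # Eq. (7-20-2)
print(c_star, v_c)
```

The same three-factor product, with $\partial\phi/\partial\theta$ in place of $J^*$, gives the middle term of Eq. (7-19-4). When $\gamma$ has no closed form, $J^*$ could instead be obtained numerically, as the text goes on to discuss.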
If the equations $\theta^* = \theta(c^*)$ cannot be solved explicitly for $c^*$, we have to resort to a numerical solution. In this case, we can still calculate $J^* = (\partial\theta/\partial c)^{-1}_{c=c^*}$. If we use the Newton method to solve for $c^*$, we obtain $J^*$ as a by-product.

7-21. Single-Equation Least Squares Problem

We shall now interpret the results of Section 5-21 in the light of the techniques described in this chapter. We recall that the least squares solution to the model Eq. (5-21-5) with the data of Table 5-2 is given by:

\[ \theta^* = \begin{bmatrix} 813.4583 \\ 960.9063 \end{bmatrix}, \qquad \Phi^* = 0.03980599 \]

\[ H^* \approx N^* = \begin{bmatrix} 0.271890 \times 10^{-5} & -0.957336 \times 10^{-5} \\ -0.957336 \times 10^{-5} & 3.50371 \times 10^{-5} \end{bmatrix} \]

According to Eq. (7-2-4) we may represent $\Phi(\theta)$ approximately by means of the equation

\[ \Phi(\theta) \approx 0.03980599 + \tfrac{1}{2}\,10^{-5}\,[\,0.271890\,(\theta_1 - 813.4583)^2 - 1.914672\,(\theta_1 - 813.4583)(\theta_2 - 960.9063) + 3.50371\,(\theta_2 - 960.9063)^2\,] \tag{7-21-1} \]

[Fig. 7-3. Contours of objective function. Contours of $\Phi - \Phi^*$: true; quadratic approximation; limits of 5% error region.]

How good is this approximation? In Fig. 7-3 we compare the contours of the true objective function to those of the approximation Eq. (1). We also show the boundary of the region in which the approximate value of $\Phi$
is in error by no more than 5%. We find that there is excellent agreement between the true and approximate values within the region $|\Phi(\theta) - \Phi^*| \le 0.005$, and in some areas the agreement extends far beyond this region.

The eigenvalues and vectors of $N^*$ are:

\[ \lambda_1 = 3.7660 \times 10^{-5}, \quad u_1 = \begin{bmatrix} 0.2642 \\ -0.9645 \end{bmatrix}; \qquad \lambda_2 = 0.0096 \times 10^{-5}, \quad u_2 = \begin{bmatrix} 0.9645 \\ 0.2642 \end{bmatrix} \]

Accordingly, Eq. (1) can be rewritten in canonical form as

\[ \Phi - 0.03980599 = \tfrac{1}{2}\,10^{-5}\,(3.7660\,\psi_1^2 + 0.0096\,\psi_2^2) \tag{7-21-2} \]

where

\[ \begin{aligned} \psi_1 &= 0.2642\,(\theta_1 - 813.4583) - 0.9645\,(\theta_2 - 960.9063) \\ \psi_2 &= 0.9645\,(\theta_1 - 813.4583) + 0.2642\,(\theta_2 - 960.9063) \end{aligned} \tag{7-21-3} \]

If we choose $\varepsilon = 0.005$, then the indifference region $|\Phi - \Phi^*| \le 0.005$ is defined (approximately) by

\[ 3.7660\,\psi_1^2 + 0.0096\,\psi_2^2 \le 2 \times 0.005/10^{-5} = 1000 \tag{7-21-4} \]

With $\psi_2$ held constant at zero, this corresponds to

\[ |\psi_1| \le (1000/3.7660)^{1/2} = 16.3 \qquad (\psi_2 = 0) \]

and with $\psi_1$ held constant at zero

\[ |\psi_2| \le (1000/0.0096)^{1/2} = 323 \qquad (\psi_1 = 0) \]

Thus $\psi_1$ (the short axis of the ellipse Eq. (4)) is relatively well-determined, but $\psi_2$ (the long axis of the ellipse) is poorly-determined. In Fig. 7-3, the $\psi_1$ and $\psi_2$ axes do not appear perpendicular to each other, because $\theta_1$ and $\theta_2$ are drawn to different scales.

To estimate the covariance matrix of the estimate, we use Eq. (7-5-15), but we must estimate $\sigma^2$ first. The residual sum of squares is $\Phi^* = 0.03980599$, and:

\[ \sigma^2 = [1/(15 - 2)]\,0.03980599 = 0.00306200, \qquad \sigma = 0.05533533 \]

\[ V_\theta \approx 2 \times 0.00306200\,N^{*-1} = \begin{bmatrix} 60561.7 & 16547.6 \\ 16547.6 & 4696.17 \end{bmatrix} \tag{7-21-5} \]

The standard deviations of the individual parameter estimates are

\[ \sigma_1 = 60561.7^{1/2} = 246.093, \qquad \sigma_2 = 4696.17^{1/2} = 68.5286 \]

The correlation between the estimates is

\[ \rho_{1,2} = 16547.6/(246.093 \times 68.5286) = 0.981214 \]
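The canonical-form quantities above follow from the printed entries of $N^*$ by a closed-form $2 \times 2$ eigenvalue computation. The sketch below is a cross-check under one stated caveat: it uses only the six-digit values printed in the text, and because $\lambda_2$ is the difference of two nearly equal numbers, quantities that depend on it (including the inverse of $N^*$) are sensitive to that rounding and may differ from the text in the later digits. Variable and function names are illustrative.

```python
import math

# Printed entries of N* = [[a, b], [b, d]] from Section 7-21.
a, b, d = 0.271890e-5, -0.957336e-5, 3.50371e-5
phi_star, n_obs, n_par = 0.03980599, 15, 2
eps = 0.005  # half-width of the indifference region in Phi

# Eigenvalues of a symmetric 2x2 matrix in closed form.
mean = 0.5 * (a + d)
rad = math.hypot(0.5 * (a - d), b)
lam1, lam2 = mean + rad, mean - rad

def unit_eigenvector(lam):
    """Unit vector solving (a - lam) x + b y = 0 (valid since b != 0).
    The overall sign of an eigenvector is arbitrary."""
    x, y = b, lam - a
    norm = math.hypot(x, y)
    return (x / norm, y / norm)

u1, u2 = unit_eigenvector(lam1), unit_eigenvector(lam2)

# Extent of the indifference region along each canonical axis, Eq. (7-21-4).
half_psi1 = math.sqrt(2.0 * eps / lam1)
half_psi2 = math.sqrt(2.0 * eps / lam2)

# Error variance estimate and variance of the well-determined component.
sigma2 = phi_star / (n_obs - n_par)
var_psi1 = 2.0 * sigma2 / lam1

print(lam1, lam2, u1, u2, half_psi1, half_psi2, sigma2, var_psi1)
```

The ratio $\lambda_1/\lambda_2$ of roughly 400 is what makes the indifference ellipse about twenty times longer along $\psi_2$ than along $\psi_1$.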
The principal components, of course, coincide with $\psi_1$ and $\psi_2$. Their variances are given by

\[ \mathrm{Var}(\psi_1) = 2\sigma^2/\lambda_1 = 162.61, \qquad \mathrm{Var}(\psi_2) = 2\sigma^2/\lambda_2 = 65095.3 \]

with standard deviations 12.752 and 255.14, respectively. Again, we see that $\psi_1$ is well-determined, $\psi_2$ less so.

To compute the scaled principal components, we scale each parameter to have unit standard deviation, i.e., we define

\[ v_1 = \theta_1/246.093, \qquad v_2 = \theta_2/68.5286 \]

The covariance matrix of $v$ is simply the correlation matrix of $\theta$, i.e.,

\[ V_v = \begin{bmatrix} 1 & 0.981214 \\ 0.981214 & 1 \end{bmatrix} \]

whose eigenvalues and vectors (in the $v$ coordinates) are:

\[ \mu_1 = 1.981214, \quad p_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}; \qquad \mu_2 = 0.018786, \quad p_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix} \]

To express $p_1$ and $p_2$ in terms of $\theta$, we have to unscale, i.e.,

\[ p_1 = \begin{bmatrix} 1/(\sqrt{2} \times 246.093) \\ 1/(\sqrt{2} \times 68.5286) \end{bmatrix} = \begin{bmatrix} 0.00287333 \\ 0.0103184 \end{bmatrix}, \qquad p_2 = \begin{bmatrix} 0.00287333 \\ -0.0103184 \end{bmatrix} \]

Thus the quantity

\[ \rho_1 = 0.00287333\,(\theta_1 - \theta_1^*) + 0.0103184\,(\theta_2 - \theta_2^*) \]

has variance 1.981214, the quantity

\[ \rho_2 = 0.00287333\,(\theta_1 - \theta_1^*) - 0.0103184\,(\theta_2 - \theta_2^*) \]

has variance 0.018786, and the two are uncorrelated.

To obtain a 95% confidence region for $\theta$ we use the statistic of Eq. (7-11-5)

\[ F_{2,13} \ge \frac{13}{2 \times 0.03980599}\,\tfrac{1}{2}\,10^{-5}\,\delta\theta^T \begin{bmatrix} 0.271890 & -0.957336 \\ -0.957336 & 3.50371 \end{bmatrix} \delta\theta = 163.292 \times \tfrac{1}{2}\,10^{-5}\,(0.271890\,\delta\theta_1^2 - 1.914672\,\delta\theta_1\,\delta\theta_2 + 3.50371\,\delta\theta_2^2) \tag{7-21-6} \]

The upper 0.05 point of $F$ with 2 and 13 degrees of freedom is, according to the tables, 3.81. Our confidence region thus has the equation

\[ \tfrac{1}{2}\,10^{-5}\,(0.271890\,\delta\theta_1^2 - 1.914672\,\delta\theta_1\,\delta\theta_2 + 3.50371\,\delta\theta_2^2) \le 3.81/163.292 = 0.023332 \]
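The scaled principal components and the confidence contour level can be reproduced from the printed $V_\theta$ of Eq. (7-21-5) with a few lines of arithmetic. This is a minimal sketch: it takes the printed covariance entries and the quoted $F$ value as given, and it assumes that the factor $13/2$ in Eq. (7-21-6) is $(n - l)/l$ with $n = 15$ and $l = 2$. Variable names are illustrative.

```python
import math

# Printed covariance matrix of the estimates, Eq. (7-21-5).
V = [[60561.7, 16547.6],
     [16547.6, 4696.17]]
phi_star, n_obs, n_par = 0.03980599, 15, 2
F_CRIT = 3.81  # upper 0.05 point of F(2, 13), as quoted from the tables

s1, s2 = math.sqrt(V[0][0]), math.sqrt(V[1][1])
corr = V[0][1] / (s1 * s2)

# For a 2x2 correlation matrix the eigenvectors are (1, 1)/sqrt(2) and
# (1, -1)/sqrt(2), with eigenvalues 1 + corr and 1 - corr. Dividing each
# component by the matching standard deviation expresses them in theta units.
p1 = (1.0 / (math.sqrt(2.0) * s1), 1.0 / (math.sqrt(2.0) * s2))
p2 = (p1[0], -p1[1])

def quad(u, w):
    """Bilinear form u^T V w for 2-vectors."""
    return sum(u[i] * V[i][j] * w[j] for i in range(2) for j in range(2))

var_rho1 = quad(p1, p1)   # variance of rho_1
var_rho2 = quad(p2, p2)   # variance of rho_2
cov_rho = quad(p1, p2)    # should vanish: the components are uncorrelated

# Contour level Phi - Phi* bounding the approximate 95% confidence region.
level = F_CRIT * n_par * phi_star / (n_obs - n_par)

print(corr, p1, var_rho1, var_rho2, cov_rho, level)
```

The variances come out as $1 \pm \rho_{1,2}$, which is why a correlation this close to one leaves one scaled component with almost all of the variance and the other with almost none.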
Comparison with Eq. (1) indicates that this region is bounded by the contour

\[ \Phi - \Phi^* = 0.023332 \]

According to Fig. 7-3, this contour is partly outside the region where Eq. (1) is a reliable approximation. The fact that the exact contour is inside the approximate contour suggests, however, that the latter should be a conservative estimate of the confidence region.

Finally, we examine the residuals, given in Table 7-1 and Fig. 7-4. A glance at the latter suggests that the residuals at $T = 100$ are considerably smaller than those at $T = 200$ and $T = 300$. An $F$-test on the ratio of the sum of squares of the last ten residuals to that of the first five residuals indicates a significant difference even at the 99.5% confidence level. We shall deal with this problem further in Chapter IX, although this small body of data probably does not merit further analysis.

Table 7-1. Residuals at $\theta^* = [813.4583,\ 960.9063]^T$

  mu    x_mu1 = t    x_mu2 = T    e_mu* = y_mu - f(theta*, x_mu)
   1    0.1          100          -0.0145552
   2    0.2          100          -0.00613993
   3    0.3          100          -0.0287542
   4    0.4          100           0.000602186
   5    0.5          100           0.0199295
   6    0.05         200          -0.0906165
   7    0.1          200           0.0304608
   8    0.15         200           0.0869893
   9    0.2          200          -0.0387225
  10    0.25         200          -0.0219878
  11    0.02         300           0.0497515
  12    0.04         300           0.0504873
  13    0.06         300          -0.103587
  14    0.08         300          -0.0550289
  15    0.1          300           0.0293314

[Fig. 7-4. Residuals (least squares problem); separate curves for T = 100, 200, and 300.]

7-22. A Monte Carlo Study

To investigate the reliability of statistics obtained in the previous section, we used the simulation technique suggested in Section 3-3. We assumed that model Eq. (5-21-5) was correct, with $\theta_1 = \theta_2 = 1000$. We used this to compute $y_\mu$ for the fifteen data points, and added as "experimental error" a pseudorandom number drawn from an appropriate distribution. Six distributions were studied: normal and uniform distributions, each with $\sigma = 0.01$, 0.03, 0.05. The estimation procedures were always carried out, however, as though the errors were thought to be normally distributed.

For each one of the six cases, 100 replications (samples) of fifteen observations were generated. The parameters were estimated in each one of these samples, and

\[ \text{bias} = \theta^* - \hat{\theta} \]

was calculated, as well as the estimated covariance matrix. These were averaged over all samples. We also calculated the actual covariance of the estimates around their means [Eq. (3-3-2)]. The results appear in Table 7-2.

The following conclusions can be drawn from the table:

1. The average bias in all cases is small compared to the standard deviations of the estimates.
2. The estimated covariance matrix is, on the average, an acceptable estimate of the true covariance, particularly at small values of experimental error. Even at $\sigma = 0.05$, the estimates are not unreasonable, particularly when one takes square roots to obtain the standard deviations of the estimates.
3.
The estimates are reasonably robust, at least as far as the difference between normal and uniform distributions is concerned.
4. The supposition that for this model the estimated variance is conservative (too large) is confirmed.
5. The standard deviations of the residuals (corrected for bias) provide, on the average, excellent estimates of the experimental error.

While these conclusions hold for the average of many replications, the results for individual replications vary quite sharply. In fact, the specific
problem that we have solved in Sections 5-21 and 7-21 is one of the replications (with $y_\mu$ rounded to three decimal places) of the normal distribution with $\sigma = 0.05$. The bias on this particular replication is

$$\begin{bmatrix} 813.4583 - 1000 \\ 960.9063 - 1000 \end{bmatrix} = \begin{bmatrix} -186.5 \\ -39.1 \end{bmatrix}$$

Table 7-2
Monte Carlo Study of Single-Equation Least Squares Model (a)

Experimental error                             sigma = 0.01        sigma = 0.03         sigma = 0.05
                                               theta1  theta2      theta1  theta2       theta1   theta2

Normal distribution
  Average bias                                 6.10    0.79        29.22   2.67         66.68    5.05
  True covariance, Eq. (3-3-2)                 [3264   751]        [28167  6468]        [79169   17566]
                                               [ 751   179]        [ 6468  1560]        [17566    4193]
  Average estimated covariance, Eq. (7-5-17)   [3315   725]        [32752  6766]        [105175  19828]
                                               [ 725   165]        [ 6766  1487]        [ 19828   4131]
  Standard deviation of residuals              0.01002497          0.03008677           0.05016445

Uniform distribution
  Average bias                                 -3.07   -1.01       0.83    -2.67        17.96    -3.82
  True covariance, Eq. (3-3-2)                 [2743   622]        [23911  5447]        [67208   15061]
                                               [ 622   146]        [ 5447  1300]        [15061    3597]
  Average estimated covariance, Eq. (7-5-17)   [3328   735]        [31396  6695]        [96441   19173]
                                               [ 735   169]        [ 6695  1514]        [19173    4191]
  Standard deviation of residuals              0.01015331          0.03046174           0.05076649

(a) Averages are over 100 replications.

The true value $\theta = (1000, 1000)^T$ is marked on Fig. 7-3. It lies just within the region of good approximation, and corresponds to an $F_{2,13}$ value given by Eq. (7-21-6)

$$F_{2,13} = 163.292 \times \tfrac{1}{2}\,10^{-5}\,(0.271890 \times 186.5^2 - 1.914672 \times 186.5 \times 39.1 + 3.50371 \times 39.1^2) = 0.695$$
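The arithmetic of this last evaluation can be reproduced in a few lines. This is only a check of the numbers quoted from Eq. (7-21-6), under the assumption that the factor 163.292 multiplies one half of the quadratic form; the function name `f_statistic` is hypothetical.

```python
def f_statistic(d1, d2):
    """F value of Eq. (7-21-6) for a deviation (d1, d2) from theta*."""
    quad = 0.271890 * d1 ** 2 - 1.914672 * d1 * d2 + 3.50371 * d2 ** 2
    return 163.292 * 0.5e-5 * quad

# Deviation of theta* from the true value in this replication.
f_value = f_statistic(-186.5, -39.1)

# Upper 0.05 point of F with 2 and 13 degrees of freedom, as quoted in the text.
inside_95_region = f_value <= 3.81

print(round(f_value, 3), inside_95_region)
```

The same function can be evaluated at any other trial deviation to see whether the corresponding point lies inside the approximate 95% region.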
Though this value is far from excessive, the bias in this replication is much larger than the average bias of (66.68, 5.05) in Table 7-2. On the other hand, the covariance estimate Eq. (7-21-5) is quite a lot closer to the true covariance than is the average estimate of Table 7-2. In some other replications this estimate is much worse. In one case, for instance (still with $\sigma = 0.05$), the estimated covariance is

$$V_\theta = \begin{bmatrix} 149984 & 33191.4 \\ 33191.4 & 7616.09 \end{bmatrix}$$

which is off approximately by a factor of two (still not very significant in an F-test). Oddly enough, this replication yielded the almost unbiased estimate

$$\theta^* = \begin{bmatrix} 1000.512 \\ 1000.942 \end{bmatrix}$$

We conclude that in this particular problem our estimates for $\theta$, $V_\theta$, and the confidence regions are quite reasonable.

7-23. Independent Variables Subject to Error

We shall now interpret the estimates obtained in Section 6-13 for the same model, but with all variables subject to error. According to Eq. (7-6-4) and using Eq. (6-13-5), we have

$$V_\theta \approx D^{-1} = \begin{bmatrix} 93021.94 & 16298.55 \\ 16298.55 & 2912.917 \end{bmatrix}$$

This is not very different from what we found previously in Section 7-21 under different assumptions. We shall try to determine whether the assumptions underlying the estimate of Section 6-13 are validated by the data. From the matrix $M$ [Eq. (6-13-6)] we obtain, using Eq. (7-14-10), the following estimate for the covariance matrix of the residuals

$$\frac{3}{1\,(15 - 2/1)}\,M = \frac{3}{13}\,M = \begin{bmatrix} 0.000411 & 0.00224 & 0.000452 \\ 0.00224 & 0.0162 & 0.00278 \\ 0.000452 & 0.00278 & 0.000957 \end{bmatrix}$$

On the other hand, to obtain our estimate we had assumed an error covariance of

$$V = \begin{bmatrix} 0.0001 & 0 & \cdots \\ 0 & 0.25 & \cdots \\ 0 & 0 & \cdots \end{bmatrix}$$

The quantities $13 \times 0.000411/0.0001$ and $13 \times 0.0162/0.25$ should both be samples from a $\chi^2$ distribution with 13 degrees of freedom. A glance at
tg: 3 i 7-24, Tlvo-Equationlvfaximum Likelihood Problen1 2]3  . f.;.':the tables shows the first to be much too large, the second too small to be tf:'acceptable; the odds for rejecting each are greater than 99:1. Even summing i. 'the two quantities (this is equivalent to evaluating Tr(y- I Y*) as in Section .. 7-14), we obtain a number too large for Xi6 . Our assumed covariance matrix f)s contradicted by the data, [r- .,", 7-24. Two-Equation Maximum Likelihood Problem  ;;' .... },::: We now treat the estimates obtained in Section 5-23. Let us examine case :;'(a), unknown V. The estimate 8* is given in the first row of Table 5-8, and i:\ ,the corresponding value c* is found in Table 5-9. Since all calculations :>:\vere performed in terms of 8, the inverse approximate Hessian with respect ;; to e is found to be I V,(N")-' -;;:;;-' r.;,;.  I tifcrhe notation 0.834966 - I is used to represent 0.834966 x 10- 1 .) We Y;y.,rite below the values of Oa* along with their standard deviations, the latter i:lX:ing the square roots of the diagonal elements of V 0 k;:!io:', [ 0.834%6. - I . symmetnc ¥ 0.12]944 - 2 -0.360968 - 4 0.122197 - 2 -0.76] 164 - 1 0.517330 - 4 -0.704885 - 3 0.698556 - 1 ( -0.0758463 ::!:: 0.288958 ] -0.0115747 + 0.0011026 8* = 0.790686::!:: 0.034957 1.00224 ::!:: 0.26430 0.859255 ::!:: 0.093578 -0.269914- 1 1 0.181755 - 4 0.47651 - 3 0.4607] - 1 0.875678 - 2 ,-" We are interested, however, in c, not in 8. According to Section 7-20 !iy need therefore to calculate the matrix ;. . ','  J* = [lc / ?O) o = o * !iiI' trhis can be readily obtained from Eq. (5-23-8) by differentiation Ii = ;: I '. .0 6905560  I :f .  :j i 0 'X;; 0 Ii";::.::,; o - O. 5492625 o o o 0.4 178637 -0.008040566 - 1.599525 o o - 0.4263207 o o -0.9955402 -0.0368951 -Of3583' ] -0.5681071 
Following Eq. (7-20-2) we now compute

$$V_c = J^* V_\theta J^{*T} = \begin{bmatrix}
0.543786 \times 10^{-3} & & & & \text{symmetric} \\
-0.412978 \times 10^{-6} & 0.126945 \times 10^{-6} & & & \\
0.221255 \times 10^{-3} & -0.159973 \times 10^{-4} & 0.312639 \times 10^{-2} & & \\
0.564779 \times 10^{-2} & 0.226459 \times 10^{-4} & 0.112245 \times 10^{-2} & 0.692338 \times 10^{-1} & \\
0.137911 \times 10^{-2} & 0.537960 \times 10^{-5} & 0.266644 \times 10^{-3} & 0.164833 \times 10^{-1} & 0.395299 \times 10^{-2}
\end{bmatrix}$$

and the estimate $c^*$ is represented as

$$c^* = \begin{bmatrix}
0.5460134 \pm 0.0233192 \\
0.006357569 \pm 0.000356293 \\
1.264724 \pm 0.0559141 \\
0.9977676 \pm 0.263123 \\
0.8592555 \pm 0.0628728
\end{bmatrix}$$

All the parameters are fairly well-determined, with $c_4$ less so than the others.

The residuals corresponding to our estimate are listed in Table 7-3.

Table 7-3
Residuals $e_\mu^* = y_\mu - f(x_\mu, \theta^*)$ for Case (a)

 $\mu$   $e_{\mu 1}$   $e_{\mu 2}$       $\mu$   $e_{\mu 1}$   $e_{\mu 2}$
  1   -0.22944   -0.01629     22    0.42388    0.04043
  2   -0.18880    0.00559     23    0.24983    0.02265
  3   -0.19394   -0.01330     24    0.37242   -0.00994
  4   -0.17473   -0.00069     25    0.24696    0.01022
  5   -0.19199   -0.01578     26    0.16855   -0.00204
  6   -0.21667   -0.01114     27    0.11696   -0.00205
  7   -0.10269    0.00038     28    0.07203   -0.01195
  8   -0.05086   -0.00752     29    0.08727    0.02173
  9    0.00012    0.01701     30    0.02814    0.00542
 10   -0.13722    0.00483     31    0.01613    0.00927
 11   -0.06465   -0.02522     32    0.00542    0.02753
 12   -0.00414    0.03791     33    0.05353   -0.04208
 13   -0.01195   -0.03430     34    0.07066   -0.00772
 14   -0.08990   -0.01499     35   -0.01496   -0.00765
 15   -0.02357   -0.00057     36   -0.17103   -0.04543
 16   -0.02433   -0.00699     37   -0.26740   -0.04499
 17    0.00547   -0.01582     38   -0.19278    0.01363
 18    0.06089    0.01065     39   -0.04582    0.03801
 19    0.10571    0.02486     40   -0.00165    0.03415
 20    0.23525    0.02979     41   -0.00726    0.01637
 21    0.25371    0.03832
",:.> c l :'>:: t':: ';.:' :.; ....i. 7-24_ Two-Equal ion Iv! aXil1llll1l Likelihood Problem 215 J:,- ;,." f! Their moment matrix is :'{:;;.; :,. iVI* [ 1.066369 0.06834212 ] = 0.06834212 0.02096695 ;:>: and the estimated covariance matrix of the errors is '." 'f.;., i.. [ 0.0276979 0.00177512 ] 0,00177512 0.000544596 V corresponding to standard deviations of 0.166427 and 0.0233366 of the Yt f0;; and)'2 errors, respectively. We do not know enough about econometrics to decide whether errors :;} of this magnitude are reasonable, and whether they can be ascribed to  . ; . ,; . ' . - measurement errors alone, A glance at Table 7-3, however, reveals at once , that at least the YI residuals are not random. They have been plotted in ;. Fig, 7-5. It appears that Eq. (5-23-2) fails to account for certain strong v* = 1/(41 - 5/2)M L.. ;.',:." 04 L oJ oJ I "" ,... ;:\'! ..... :"{p .;:. t.C;  r(.jf:. ;  :o 1 1 -02 -03 / rP 1929 o 1939 10 ;::..' .'.t. f.:.; .,..' \,. :':t. .::<: ...._ 0 I  1909 -20 1919 -10 i>;:. Fig, 7-5. First cquation rcslduals, production modcl. r:;..' :-:',:,'" :;;.i:,xariations of 2 1 with respect to time. The equation for 2 2 seems somewhat !.ore satisfactory. However, there are 21 negative residuals, 20 positive ones 0!;i,':(both numbers exceed 10) and 16 runs. From Eqs. (7- 17-1), (7-17-2), and f':; :::V;/(20 + 21) + I  21488 ;; = expected number of runs if residuals were random ; 0'2 = 2 x 20 x 21(2 x 20 x 2] - 20 - 21)/[(20 + 21)2(20 + 21 - I)] = 9.982 ; ,z = (16 - 21.488 + 0,5)/(9,982)1 / 2 = - 1.579 F,"'' :;' i';were z is approximately a standard normal deviate. ""'i:'> 
The probability of finding 16 or fewer runs is approximately $\Pr(z \le -1.579)$, which according to tables of the normal distribution is only about 6%. Hence there is strong, though not conclusive, evidence to indicate that the $z_2$ residuals are also not random.

In case (b) we assumed that $V$ is diagonal. The estimated covariance of the residuals was

$$V^* = \begin{bmatrix} 0.0247422 & 0.00131260 \\ 0.00131260 & 0.000571524 \end{bmatrix}$$

The correlation between the residuals of $z_1$ and $z_2$ is

$$r_{12} = 0.00131260/(0.0247422 \times 0.000571524)^{1/2} = 0.349$$

Letting $n^* = 41 - 5/2 = 38.5$, we have from Eq. (7-16-10)

$$\lambda = (36.5)^{1/2}\,0.349/(1 - 0.349^2)^{1/2} = 2.31$$

According to tables of the $t$-distribution, the chance of encountering a value of $|\lambda|$ as large or larger than 2.31 with 36.5 degrees of freedom is only about 3%. We reject therefore the hypothesis that $V$ is diagonal.

In case (c) we assumed $V$ proportional to $Q = \mathrm{diag}(4, 1)$. The covariance of the residuals turns out to be

$$V^* = \begin{bmatrix} 0.0156715 & 0.000679944 \\ 0.000679944 & 0.00120482 \end{bmatrix}$$

The correlation $r_{12} = 0.157$ leads to $\lambda = 0.943$, which is not incompatible with the supposition that $V_{12} = 0$. On the other hand, if $V_{11}^*$ is an estimate of a variance four times as large as the variance estimated by $V_{22}^*$ (each based on 38.5 degrees of freedom), then

$$\lambda = 0.0156715/(4 \times 0.00120482) = 3.25$$

would be an $F_{38.5,\,38.5}$ variate. The probability of encountering such a value is less than 0.5%, so hypothesis (c) stands refuted.

In case (d) we made no assumptions concerning the value of $V$. The residuals of $\log z_1$, however, behave no better than those of $z_1$ (Fig. 7-5). The same is true of the residuals in cases (e) and (f). At this time, on the basis of the data alone, we have no reason to prefer any of the models (a), (d), (e), and (f), and none of them accounts sufficiently for variations in $z_1$.

7-25. Problems

1. Verify Eqs. (7-12-6)-(7-12-7).

2. Show that Eq. (7-5-16) holds when the covariance matrix is proportional to a known matrix $Q$ [see Eq. (4-21-2) and row 2 of Table 5-1].

3. Suppose $P$ in Problem 8 of Section 4-21 is an unknown matrix. Show that its MLE is given by

$$P^* = (1/n)\,M(\theta^*) - (B^T Q^{-1} B)^{-1}$$

where

$$M(\theta^*) = \sum_{\mu=1}^{n} [y_\mu - f(\theta^*, x_\mu)][y_\mu - f(\theta^*, x_\mu)]^T$$

Derive an expression for $P$ applicable to Problem 9 of Section 4-21.

4. Using the Monte Carlo technique, investigate the robustness of the test for correlation Eq. (7-16-10) for nonnormal distributions.

5. Suppose observations $y_\mu$, $x_\mu$ ($\mu = 1, 2, \ldots, n$) are to be fitted by the model

$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots \tag{7-25-1}$$

Let the error distribution be such as to justify estimation of $\theta$ by least squares. Suppose Eq. (1) is to be used for predicting $y$ at given values of $x$. Show that the prediction error variance is minimum at the centroid of the observations used for estimating $\theta$, i.e., at $x = \sum_{\mu=1}^{n} x_\mu / n$.
Chapter VIII

Dynamic Models

8-1. Models Involving Differential Equations

Models are often formulated in terms of differential equations. That is, the model equations contain not only dependent and independent variables, but also derivatives of the former with respect to the latter. The model equations thus take the form

$$g(x, y, \partial y/\partial x, \partial^2 y/\partial x\,\partial x, \ldots, \theta) = 0 \tag{8-1-1}$$

When experiments are conducted, we measure values of $y$ for given values of $x$, but we do not usually directly measure the values of the derivatives. Hence the model equations cannot be used directly for the estimation of the parameters $\theta$. However, this difficulty may be overcome in one of the following ways:

(a) Differentiation of Data. Approximate values of the derivatives appearing in the equation can be calculated by differencing adjacent data values. If $x_\mu$ and $x_{\mu+1}$ are neighboring points differing only in the $i$th coordinate, then $(y_{\mu+1} - y_\mu)/(x_{\mu+1,i} - x_{\mu,i})$ is an approximation to $\partial y/\partial x_i$ in that region. Even though more accurate approximations are available in some cases, the maximum accuracy attainable with this method is severely limited, and its errors difficult to assess. The main advantage of this method (in those cases where it is feasible) is that the estimation is performed using Eqs. (1) directly, and these are usually much simpler than the integrated equations. Therefore, the computation can usually be performed much faster than in the method to be described next. We feel, however, that this advantage does not outweigh the disadvantage of limited accuracy. The method cannot be used at all if the separation between data points is large, but may be useful when experiments are specially planned for it, e.g., by the use of differential reactors.

(b) Integration of Equations. In principle, the differential Eqs. (1) may be integrated to yield expressions of the form

$$y = f(x, \theta) \tag{8-1-2}$$

which is identical to Eq. (2-4-2). Therefore, all standard estimation methods may be applied. If Eq. (1) can be solved analytically in closed form, we end up with explicit formulas for the functions $f$ in Eq. (2), and their origin as solutions of differential equations need no longer concern us. The problem we shall deal with in the succeeding sections is that of estimating $\theta$ when Eq. (1) must be integrated numerically, so that the functions $f$ are only implicitly defined.

A special problem associated with method (b) is that of the initial or boundary values required for integrating the differential equations. These are frequently defined by the experimental conditions, in which case no further problem exists. When these conditions are entirely or partly unknown, they must be included in the problem as additional unknown parameters.

(c) Integration of Data. It is sometimes possible to integrate out all the derivatives appearing in the differential equations. The differential equations are, thereby, transformed into integral equations. If our observations cover the region of interest densely, we may integrate the data numerically to obtain the values of the integrals appearing in the equations, which can now be regarded as algebraic equations in $\theta$. This method, like (a), requires dense data, and gives rise to unassessable errors. In addition, it is applicable only in a limited number of cases. Its advantage over (a) is that numerical integration is generally more accurate than differentiation. Like (a), it is computationally faster than (b).

On the whole, we recommend method (b) whenever the required computations are not beyond the capability of available machinery. Method (a) or (c) may be used to obtain an initial guess for (b).

We illustrate these methods by means of a simple example, that of a system (e.g., a radioactive material) undergoing first-order decay.
Here we have

$$dy/dx + \theta_1 y = 0 \tag{8-1-3}$$

as the model equation. Upon integration, this becomes

$$y = y_0 \exp(-\theta_1 x) \tag{8-1-4}$$

We have measured values $y_\mu$ at an ascending sequence of values $x_\mu$ ($\mu = 1, 2, \ldots, n$). If the initial value $y_0$ is known, we may use our data directly, in conjunction with Eq. (4), to estimate $\theta_1$. If $y_0$ is unknown, we treat it as a parameter $\theta_2$ and use

$$y = \theta_2 \exp(-\theta_1 x) \tag{8-1-5}$$

to estimate both $\theta_1$ and $\theta_2$. This constitutes method (b). Here we were able to integrate the equations analytically. The method applies equally well when the equations must be solved numerically.
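Method (b) for this example can be sketched as an ordinary least squares fit of Eq. (8-1-5). The code below is a minimal illustration under stated assumptions, not a reproduction of any program in the book: the data are synthetic (true values 0.7 and 2.0 with an alternating 1% perturbation), the starting guess comes from a straight-line fit of $\ln y$ against $x$, and plain Gauss iterations are used without any step-size control. The function names are hypothetical.

```python
import math

def loglinear_start(xs, ys):
    """Initial guess from the straight-line fit of ln y against x."""
    n = len(xs)
    xbar = sum(xs) / n
    lbar = sum(math.log(y) for y in ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxl = sum((x - xbar) * (math.log(y) - lbar) for x, y in zip(xs, ys))
    slope = sxl / sxx
    return -slope, math.exp(lbar - slope * xbar)

def fit_decay(xs, ys, theta1, theta2, iterations=20):
    """Gauss iterations for least squares on y = theta2 * exp(-theta1 * x)."""
    for _ in range(iterations):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            ex = math.exp(-theta1 * x)
            e = y - theta2 * ex            # residual
            j1 = -theta2 * x * ex          # d f / d theta1
            j2 = ex                        # d f / d theta2
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            g1 += j1 * e
            g2 += j2 * e
        # Solve the 2 x 2 normal equations (J^T J) step = J^T e.
        det = a11 * a22 - a12 * a12
        theta1 += (a22 * g1 - a12 * g2) / det
        theta2 += (a11 * g2 - a12 * g1) / det
    return theta1, theta2

# Synthetic data (an assumption for illustration only).
xs = [0.5 * k for k in range(1, 11)]
ys = [2.0 * math.exp(-0.7 * x) * (1.0 + 0.01 * (-1) ** k)
      for k, x in enumerate(xs, start=1)]

start = loglinear_start(xs, ys)
theta1_hat, theta2_hat = fit_decay(xs, ys, *start)
print(theta1_hat, theta2_hat)
```

When the model must be integrated numerically, only the evaluation of the predicted values and their derivatives changes; the iteration itself stays the same.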
To apply method (a), we could define

$$z_\mu \equiv (y_{\mu+1} - y_{\mu-1})/(x_{\mu+1} - x_{\mu-1}) \qquad (\mu = 2, 3, \ldots, n-1)$$

as an approximation to $dy/dx$ at $x = x_\mu$. We then use

$$z + \theta_1 y = 0 \tag{8-1-6}$$

as the model equation, from which we estimate $\theta_1$, by minimizing, say

$$\sum_{\mu=2}^{n-1} (z_\mu + \theta_1 y_\mu)^2$$

To apply method (c), we follow Himmelblau et al. (1967) and integrate Eq. (3) from $x = 0$ to $x = x_\mu$

$$y_\mu - y_0 + \theta_1 \int_0^{x_\mu} y(x)\,dx = 0 \tag{8-1-7}$$

If we have measured sufficiently many values of $y$ between $x = 0$ and $x = x_\mu$, then we can obtain an approximate value $I_\mu$ of the integral in Eq. (7), say by using the trapezoidal rule

$$I_\mu = \tfrac{1}{2} \sum_{\eta=1}^{\mu} (y_\eta + y_{\eta-1})(x_\eta - x_{\eta-1}) \tag{8-1-8}$$

Then $\theta_1$ may be estimated from the linear model

$$y_\mu - y_0 + \theta_1 I_\mu = 0 \tag{8-1-9}$$

say, by least squares.

In an alternative data integration method, due to Shinbrot (1954), we multiply Eq. (3) by $\sin \alpha x$, and integrate the result over the range of $x$ values, say from $x = 0$ to $x = A$, using the integration by parts technique

$$\begin{aligned}
0 &= \int_0^A (dy/dx) \sin \alpha x\,dx + \theta_1 \int_0^A y \sin \alpha x\,dx \\
  &= [y \sin \alpha x]_0^A - \alpha \int_0^A y \cos \alpha x\,dx + \theta_1 \int_0^A y \sin \alpha x\,dx \\
  &= y(A) \sin \alpha A + \int_0^A (\theta_1 \sin \alpha x - \alpha \cos \alpha x)\,y\,dx
\end{aligned} \tag{8-1-10}$$

If we choose $\alpha = k\pi/A$, where $k$ is any integer, then $y(A) \sin \alpha A$ vanishes. Hence we have

$$\theta_1 \int_0^A y \sin(k\pi x/A)\,dx = (k\pi/A) \int_0^A y \cos(k\pi x/A)\,dx \tag{8-1-11}$$
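The three data-based estimates, Eqs. (8-1-6), (8-1-9), and (8-1-11), can be compared on a densely sampled, error-free decay curve. The sketch below is only an illustration under assumed values ($\theta_1 = 0.7$, $y_0 = 2$, $A = 4$, 400 equal intervals); the trapezoidal rule is used for every integral, and the several Shinbrot equations are combined by least squares over $k = 1, 2, 3$.

```python
import math

theta_true, y0, A, n = 0.7, 2.0, 4.0, 400      # assumed test values
xs = [A * i / n for i in range(n + 1)]          # xs[0] = 0 carries the initial value
ys = [y0 * math.exp(-theta_true * x) for x in xs]

# Method (a): central differences, then minimize sum (z + theta1 * y)^2.
num = den = 0.0
for i in range(1, n):
    z = (ys[i + 1] - ys[i - 1]) / (xs[i + 1] - xs[i - 1])
    num += z * ys[i]
    den += ys[i] ** 2
theta_a = -num / den

# Method (c): running trapezoidal integral I_mu, then least squares on
# y_mu - y0 + theta1 * I_mu = 0.
integral = num = den = 0.0
for i in range(1, n + 1):
    integral += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    num += (ys[i] - y0) * integral
    den += integral ** 2
theta_c = -num / den

def trapezoid(values):
    """Trapezoidal integral of tabulated values over the grid xs."""
    return sum(0.5 * (values[i] + values[i - 1]) * (xs[i] - xs[i - 1])
               for i in range(1, n + 1))

# Shinbrot's method: theta1 * S_k = C_k for each k, solved in the least squares sense.
num = den = 0.0
for k in (1, 2, 3):
    alpha = k * math.pi / A
    s_k = trapezoid([y * math.sin(alpha * x) for x, y in zip(xs, ys)])
    c_k = alpha * trapezoid([y * math.cos(alpha * x) for x, y in zip(xs, ys)])
    num += s_k * c_k
    den += s_k ** 2
theta_s = num / den

print(theta_a, theta_c, theta_s)
```

With exact, dense data all three agree closely with the assumed value; adding noise or thinning the grid is a quick way to see how the differentiation-based estimate degrades first.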
If $y$ is known at a sufficiently dense set of points, then we can integrate both sides of Eq. (11) numerically for various values of $k$. This gives us several equations for the unknown $\theta_1$, and we may choose that value of $\theta_1$ which satisfies these equations in the least squares sense. By choosing appropriate multiplier functions we can apply this method to problems involving higher derivatives, as well as to models involving partial differential equations [see Perdreauville and Goodson (1966)].

8-2. The Standard Dynamic Model

We do not propose to treat models represented by Eq. (8-1-1) in complete generality. Rather, we restrict ourselves to a subclass of models which are particularly tractable, yet at the same time extremely important in practice. These are the so-called standard dynamic models, which we define below by listing the variables included and the relations among them:

(a) A vector of independent variables $x$.

(b) An additional independent variable $t$, usually referred to as time, although it need not represent the actual physical dimension of time.

(c) A vector of unknown parameters $\theta$.

(d) A vector of state variables $s$, which are functions of $t$, $x$, and $\theta$. The functions are defined implicitly by means of

1. A set of simultaneous first-order ordinary differential equations

$$\dot{s} \equiv ds/dt = h(t, x, s, \theta) \tag{8-2-1}$$

where $h$ is a vector of given functions, and

2. A set of initial conditions

$$s(0) \equiv s|_{t=0} = s_0(x, \theta) \tag{8-2-2}$$

where $s_0$ is a vector of given functions. Note that Eq. (2) includes the possibilities that some or all $s(0)$ are given numbers (which are independent variables), or that they are themselves unknown parameters.

(e) A vector of observed variables $y$, whose exact values are given functions of the state variables, and possibly of the other variables as well

$$y = y(s, t, x, \theta) \tag{8-2-3}$$

A common special case is that in which the state variables are observed directly, i.e., $y = s$. Note that of the set of parameters making up the vector $\theta$, some may appear explicitly only in Eq. (1), others only in Eq. (2) or Eq. (3).
By solving (numerically, if necessary) the differential Eqs. (1) with the initial conditions Eq. (2), and substituting these solutions for $s$ in Eq. (3), we bring Eq. (3) into the form Eq. (8-1-2), with $x$ and $t$ jointly playing the role of $x$ in the latter expression. Hence, the model we have defined conforms to our general form, though in a somewhat roundabout fashion.

In essence, a dynamic system is characterized by a set of state variables which change with time (or some other independent variable) according to certain first-order differential equations. The initial conditions may or may not be fully known. The state of the system is observed at various points in time, but sometimes the state variables are not directly measurable, and we have to measure the related observed variables instead. Unknown parameters may appear in the initial conditions Eq. (2), in the differential equations (1), and in the observation equations (3). In the last case they usually represent unknown characteristics of the measuring devices, e.g., calibration constants. Our main interest usually lies in estimating the parameters that appear in the differential equations, but we cannot escape estimating the others as well. Fortunately, good initial estimates for these are frequently available. Any inexact knowledge we have concerning the values of these parameters should be included in the form of a prior distribution.

We illustrate the concept of a dynamic system by means of a chemical reaction involving three species whose concentrations $c_1$, $c_2$, $c_3$ satisfy the following differential equations:

$$\begin{aligned}
dc_1/dt &= -k_1 c_1^2 + k_2 c_2 c_3 \\
dc_2/dt &= k_1 c_1^2 - k_2 c_2 c_3 - k_3 c_2 \\
dc_3/dt &= k_1 c_1^2 - k_2 c_2 c_3 + k_3 c_2
\end{aligned} \tag{8-2-4}$$

The initial concentrations $c_2$ and $c_3$ are not known exactly, but all concentrations must add up to unity, so that we may write

$$c_1(0) = \alpha, \qquad c_2(0) = \beta, \qquad c_3(0) = 1 - \alpha - \beta \tag{8-2-5}$$

where $\alpha$ and $\beta$ are respectively a known and unknown quantity.

At time $t$ we withdraw three samples from our reactor. In two of these we determine $c_1$ directly by titration. The third sample is passed through an optical instrument, which measures the light absorptivity of the mixture. This is believed to be a linear function with unknown coefficients of $c_1$ and $c_2$. Denoting the results of the measurements on the three samples as $y_1$, $y_2$, and $y_3$, we may write

$$y_1 = c_1, \qquad y_2 = c_1, \qquad y_3 = p + q c_1 + r c_2 \tag{8-2-6}$$

where $p$, $q$, and $r$ are unknown quantities.

In this model, $c_1$, $c_2$, and $c_3$ are the state variables; $y_1$, $y_2$, and $y_3$ are the observed variables; $\alpha$ and $t$ are the independent variables; $\beta$, $k_1$, $k_2$, $k_3$, $p$, $q$, and $r$ are the unknown parameters. Eq. (4), Eq. (5), and Eq. (6) correspond to Eq. (1), Eq. (2), and Eq. (3), respectively. We are primarily concerned with estimating the reaction rate constants $k_1$, $k_2$, and $k_3$. Good initial guesses for $\beta$ may be known from the manner in which the solution was made up, and for $p$, $q$, and $r$ from previous experiments on the same apparatus.

An experiment performed on a dynamic system consists of measuring the values of the observed variables $y$ for given values of the independent variables $x$ and $t$. A group of experiments performed with identical values of $x$ and identical initial conditions, and differing only in the values of $t$, constitute a run. For our purposes it does not matter whether all the experiments in a given run were actually performed as part of a physical run, or whether the apparatus was reset to the same conditions on separate occasions. If several runs each have unknown initial conditions, these constitute separate unknown parameters. In the above example, we may have unknown parameters $\beta_1, \beta_2, \ldots$ corresponding to distinct runs.

The covariance between the errors in different experiments may, however, depend on whether or not the experiments belong to the same physical run (see Problem 4 in Section 8-9).

8-3. Models Reducible to Standard Form

Our definition of a standard dynamic model is not as restrictive as might appear at first glance. Many problems not originally in this form may be recast to fit the definition. We show how this can be done in several cases.

(a) Suppose a model corresponds to our definition in all respects, except that it contains some second- or higher-order derivatives.
If we can rearrange the differential equations in such a way that we have explicit equations for the highest-order derivative of each variable, then we can reformulate the model using the method illustrated by the following example.

Let a model be defined by means of the following two differential equations in the variables $z_1$ and $z_2$

$$\begin{aligned}
\log\,(d^2 z_1/dt^2) + \theta_1\,d^2 z_2/dt^2 + \theta_2 (dz_1/dt)^3 + \theta_3\,dz_2/dt + z_1^2 &= 0 \\
d^3 z_2/dt^3 + (d^2 z_1/dt^2)^2 + \theta_4 \sin z_1 z_2 &= 0
\end{aligned} \tag{8-3-1}$$

The highest-order derivatives of each variable are $d^2 z_1/dt^2$ and $d^3 z_2/dt^3$. We may solve for these:

$$\begin{aligned}
d^2 z_1/dt^2 &= \exp\{-[\theta_1\,d^2 z_2/dt^2 + \theta_2 (dz_1/dt)^3 + \theta_3\,dz_2/dt + z_1^2]\} \\
d^3 z_2/dt^3 &= -\exp\{-2[\theta_1\,d^2 z_2/dt^2 + \theta_2 (dz_1/dt)^3 + \theta_3\,dz_2/dt + z_1^2]\} - \theta_4 \sin z_1 z_2
\end{aligned} \tag{8-3-2}$$
Let us introduce the following state variables

$$s_1 \equiv z_1, \quad s_2 \equiv dz_1/dt, \quad s_3 \equiv z_2, \quad s_4 \equiv dz_2/dt, \quad s_5 \equiv d^2 z_2/dt^2$$

whereupon Eq. (2) are equivalent to:

$$\begin{aligned}
\dot{s}_1 &= s_2, \qquad \dot{s}_2 = \exp[-(\theta_1 s_5 + \theta_2 s_2^3 + \theta_3 s_4 + s_1^2)], \qquad \dot{s}_3 = s_4, \\
\dot{s}_4 &= s_5, \qquad \dot{s}_5 = -\exp[-2(\theta_1 s_5 + \theta_2 s_2^3 + \theta_3 s_4 + s_1^2)] - \theta_4 \sin s_1 s_3
\end{aligned} \tag{8-3-3}$$

which is in the desired form [Eq. (8-2-1)]. Initial conditions on $z_1$, $z_2$, and their derivatives are immediately translatable into conditions on the state variables.

One need not be able to solve the differential equations explicitly for the highest-order derivatives. It is sufficient that one have a numerical procedure for computing these derivatives if the values of all other quantities appearing in the equations are given.

We note that most computer programs for the numerical integration of ordinary differential equations require that the problem be formulated as a system of first-order equations.

(b) Some partial differential equations, particularly of the parabolic type, may be approximated by a dynamic model. For instance, consider the diffusion or heat conduction equation

$$\partial s/\partial t = \alpha \nabla^2 s \tag{8-3-4}$$

where $\alpha$ is a constant and $\nabla^2$ is the Laplace operator

$$\nabla^2 s = \sum_{i=1}^{k} \partial^2 s/\partial x_i^2$$

and where $k = 1$, 2, or 3 depending on the number of dimensions of the object we are considering. Suppose we select a grid of points within the object, and let $s_j(t)$ be the value of $s(t)$ at the $j$th grid point. Furthermore, let $\delta^2 s_j$ be some finite difference approximation to $\nabla^2 s$ at the $j$th grid point. For instance, in the one-dimensional case, if $x_{1,j} - x_{1,j-1} = \Delta x_1$ and we take $\delta^2 s_j = (s_{j+1} - 2 s_j + s_{j-1})/(\Delta x_1)^2$ [better approximations are discussed by Hicks and Wei (1967)], then

$$\dot{s}_j = \alpha\,\delta^2 s_j \tag{8-3-7}$$

has the desired form. Again, this is the way in which parabolic equations are often formulated for numerical solution (Rosenbrock and Storey, 1966, Ch. 7). Unfortunately, an excessive number of state variables may be required.

(c) A large variety of problems which are already in the desired form arise from the theory of process control. Linear control theory usually deals with models whose state variables satisfy the differential equations

$$\dot{s} = A s + B u(t) + \varepsilon(t) \tag{8-3-8}$$

where $A$ and $B$ are matrices, $u(t)$ is a known function (the control signal), and $\varepsilon(t)$ is an unknown function (the noise) possessing certain statistical properties. The observed variables, in turn, are given by

$$y = C s + \delta(t)$$

where $\delta(t)$ is another noise function, and $C$ is a matrix. Generalization to nonlinear systems is obvious. When $\varepsilon(t) = 0$, we have a dynamic system that conforms to our definition.

Commonly arising problems are those of identification, in which unknown elements of $A$, $B$, $C$ are to be determined, and of tracking, in which $s(t)$ is to be estimated from the measured values of $y(\tau)$ ($\tau \le t$). The former problem belongs directly to the class of parameter estimation problems that we are considering here. The tracking problem is essentially one of filtering, and the methods for dealing with it, mostly due to Kalman (1960), are discussed extensively in the literature [for lucid expositions with many additional references, see Deutsch (1965) and Sorenson (1966)]. Here we only wish to point out that if $\varepsilon(t) = 0$, then the initial conditions completely determine the values of $s(t)$ at any time. Once the initial conditions and the matrices $A$ and $B$ are known, $s(t)$ can be obtained by straightforward integration. Hence, the tracking problem is equivalent to the problem of estimating unknown initial conditions and elements of $A$ and $B$, which is a special case of the parameter estimation problems that we shall treat.

The central problem of control theory is the determination of control functions $u(t)$ that will cause the state variables to behave in a desired way. Even this problem may sometimes be treated within the parameter estimation framework. For practical reasons, one must usually restrict oneself to functions $u(t)$ which depend on a finite number of parameters (e.g., polynomials with coefficients to be determined, or piecewise-constant functions).
We then wish :'io, determine the optimal values of these parameters, i.e" those values that ;{inaximize some performance index of the system, This is entirely analogous :; to a parameter estimation problem in which the performance index plays the (role of the objective function. The general problem of determining u(1) can :::also be reduced to a two point boundary value problem by using the maximum ;,'principle (Pontryagin et aI., 1962). This problem can now be formulated as a .parameter estimation problem, in which the missing initial conditions are the i'upknown parameters and the available final conditions act as the observations. trn this form the problem can be solved using, say. the Gauss method. !:i-tt. Computation of the Objective Function and Its Gradient In order to proceed with the estImation of the model parameters 0 of a 'qynamic system, we must be able to calculate the value of the objective func- i. . ::;:.tibn f!J for any given feasible values of the parameters, Now, once the parameter ;:",'3t 
values have been prescribed, the initial conditions are determined by means of Eq. (8-2-2). The differential equation Eq. (8-2-1) can now be integrated, numerically if necessary, from $t = 0$ to $t = t_\mu$ (the time of the $\mu$th experiment) for $\mu = 1, 2, \ldots, n$. This determines $s_\mu$, the predicted values of the state variables at the $\mu$th experiment. Now we are in position to determine the predicted values $\tilde{y}_\mu$ from Eq. (8-2-3), which in turn are used to compute the residuals $e_\mu = y_\mu - \tilde{y}_\mu$. From these, most objective functions (sum of squares, likelihood, etc.) can be calculated directly.

If we wish to use a gradient method (Chapter V) for the estimation of parameters in a dynamic system, we must compute not only the objective function $\Phi(\theta)$, but also its derivatives $q \equiv \partial\Phi/\partial\theta$. As we have stated before, gradient methods are the most efficient among currently available methods. The incentive to use an efficient (in terms of total number of function evaluations) method is particularly great in the case of dynamic systems, where each function evaluation is itself a complex procedure requiring the solution of a set of differential equations. We detail below several ways for calculating the required derivatives.

(a) Finite Differences. Finite difference methods, discussed in Section 5-18, are applicable to dynamic systems. As usual, we must face the problem of balancing the truncation error† (increasing with $\Delta\theta$) against the rounding error in differencing (decreasing with $\Delta\theta$). There is, however, an additional difficulty associated with dynamic models. Taking small $\Delta\theta$ and avoiding the concomitant rounding errors by using multiple-precision arithmetic is ineffective in itself, since the accuracy of $\Phi$ is limited not only by rounding errors, but primarily by the truncation errors of the integration method.
Increased precision in $\Phi$ can be acquired only by combining multiple-precision arithmetic with decreased integration steps, or by using a higher-order integration method. Both solutions are costly in computer time. The finite difference method in its raw form works satisfactorily in many problems. In many others, however, it fails to provide the accurate derivative values that are required for convergence of the gradient method.

† We are talking here of the truncation error incurred by representing $\partial\Phi/\partial\theta$ as $\Delta\Phi/\Delta\theta$. This is quite different from the truncation error of the integration method, which affects the accuracy of $\Phi$ itself.

(b) Sensitivity Equations. Several methods, variously referred to as quasilinearization, sensitivity analysis, perturbations, etc. (Howland and Vaillancourt, 1961; Tomovic, 1963; McGhee, 1963; Bellman et al., 1967; Rosenbrock and Storey, 1966, Ch. 8), are based (at least implicitly) on the fact that the required derivatives must satisfy certain linear differential equations. These may be integrated along with the model Eq. (8-2-1) to yield the desired
gradient. In this way, the gradient can be computed with essentially the same degree of accuracy as the function itself without undue difficulties.

In order to apply the method, we must trace step-by-step the dependency of the objective function on the various model variables and parameters. We only list those dependencies which are relevant to our purposes:

1. $\Phi$ depends on $e_\mu = y_\mu - \tilde{y}_\mu$. The $y_\mu$ are measured ($\mu = 1, 2, \ldots, n$). $\Phi$ may also depend directly on $\theta$, e.g., when there is a prior density function. This requires addition of the appropriate terms to Eq. (1).
2. $\tilde{y}_\mu$ depends on $s_\mu = s(t_\mu)$ and $\theta$ [Eq. (8-2-3)].
3. $s(t_\mu)$ depends on $s_0$ for the run containing the $\mu$th experiment, and on $\theta$ [through integration of Eq. (8-2-1)].
4. $s_0$ depends on $\theta$ [Eq. (8-2-2)].

Using the chain rule of differentiation we find that

$$q = \frac{\partial\Phi}{\partial\theta} = \sum_{\mu=1}^{n} \frac{\partial\Phi}{\partial e_\mu}\frac{\partial e_\mu}{\partial\theta} = -\sum_{\mu=1}^{n} \frac{\partial\Phi}{\partial e_\mu}\frac{D\tilde{y}_\mu}{D\theta} \tag{8-4-1}$$

where we have used $D\tilde{y}_\mu/D\theta$ to indicate the total derivative of $\tilde{y}_\mu$ with respect to $\theta$, given by

$$\frac{D\tilde{y}_\mu}{D\theta} = \frac{\partial\tilde{y}_\mu}{\partial\theta} + \frac{\partial\tilde{y}_\mu}{\partial s_\mu}\frac{\partial s_\mu}{\partial\theta} \tag{8-4-2}$$

so that altogether

$$q = -\sum_{\mu=1}^{n} \frac{\partial\Phi}{\partial e_\mu}\left(\frac{\partial\tilde{y}_\mu}{\partial\theta} + \frac{\partial\tilde{y}_\mu}{\partial s_\mu}\frac{\partial s_\mu}{\partial\theta}\right) \tag{8-4-3}$$

The quantities $\partial\Phi/\partial e_\mu$, $\partial\tilde{y}_\mu/\partial\theta$, and $\partial\tilde{y}_\mu/\partial s_\mu$ are easily computed, the latter two by differentiation of Eq. (8-2-3). That leaves us the problem of determining $\partial s_\mu/\partial\theta$. Let us write down the original differential equation Eq. (8-2-1)

$$\frac{ds}{dt} = h(t, x, s, \theta) \tag{8-4-4}$$

Differentiating both sides with respect to $\theta$, and employing the chain rule, we find

$$\frac{\partial}{\partial\theta}\left(\frac{ds}{dt}\right) = \frac{Dh}{D\theta} = \frac{\partial h}{\partial\theta} + \frac{\partial h}{\partial s}\frac{\partial s}{\partial\theta} \tag{8-4-5}$$

Interchanging the order of differentiations on the left-hand side of Eq. (5)

$$\frac{d}{dt}\left(\frac{\partial s}{\partial\theta}\right) = \frac{\partial h}{\partial\theta} + \frac{\partial h}{\partial s}\frac{\partial s}{\partial\theta} \tag{8-4-6}$$

The quantities $\partial h/\partial\theta$ and $\partial h/\partial s$ are easily determined by differentiation. We have, then, in Eq. (6) a set of simultaneous linear first-order ordinary differential equations in the unknown functions $\partial s/\partial\theta$.
These are called the
sensitivity equations, since their solutions indicate how sensitive the state variables are to changes in the parameters. The functions $\partial s/\partial\theta$ themselves are called sensitivity coefficients. To find $\partial s_\mu/\partial\theta$ we must integrate these equations, jointly with Eq. (4), from $t = 0$ to $t = t_\mu$. To do this, we need initial values, i.e., $(\partial s/\partial\theta)_{t=0}$. These are obtained simply by differentiating the initial conditions Eq. (8-2-2), i.e.,

$$\left(\frac{\partial s}{\partial\theta}\right)_{t=0} = \frac{\partial s_0}{\partial\theta} \tag{8-4-7}$$

If $p$ is the number of state variables and $l$ the number of parameters appearing either in the initial conditions or in the differential equations, then the number of quantities $\partial s/\partial\theta$ is $pl$, and the total number of equations to be integrated [Eq. (4) and (6)] is $p(1 + l)$. On the other hand, if we were to use one-sided differences to estimate $\partial\Phi/\partial\theta$ we would need to integrate the $p$ Eqs. (4) for $1 + l$ different values of $\theta$, again resulting in a total of $p(1 + l)$ equations. The computational effort involved in the two methods is roughly the same, but the accuracy attainable in the sensitivity-equations method is much higher, and more easily controlled. Admittedly, a greater effort is required to prepare a problem for treatment by the sensitivity equation method.

In the simplest case, the initial conditions are known and the state variables are observed directly. The initial conditions Eq. (7) then read simply $(\partial s/\partial\theta)_{t=0} = 0$ and Eq. (2) reduces to $D\tilde{y}_\mu/D\theta = \partial s_\mu/\partial\theta$.

Let us examine a simple example. There is one state variable, with known initial condition

$$\frac{ds}{dt} = -\theta s, \qquad s(0) = 1 \tag{8-4-8}$$

We know the solution to be $s = e^{-\theta t}$, so that $\partial s/\partial\theta = -t e^{-\theta t}$. However, let us form the sensitivity equation

$$\frac{d}{dt}\left(\frac{\partial s}{\partial\theta}\right) = \frac{\partial h}{\partial\theta} + \frac{\partial h}{\partial s}\frac{\partial s}{\partial\theta} = -s - \theta\frac{\partial s}{\partial\theta} \tag{8-4-9}$$

Substituting for $s$ its value, we have to solve for $\partial s/\partial\theta$ the differential equation

$$\frac{d}{dt}\left(\frac{\partial s}{\partial\theta}\right) + \theta\left(\frac{\partial s}{\partial\theta}\right) = -e^{-\theta t}, \qquad \left(\frac{\partial s}{\partial\theta}\right)_{t=0} = 0 \tag{8-4-10}$$

The solution is $\partial s/\partial\theta = -t e^{-\theta t}$, in agreement with the previous result.
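The simple example of Eqs. (8-4-8) to (8-4-10) can be checked numerically. The following is a minimal sketch, not the author's program: it integrates the model equation and its sensitivity equation jointly with a fixed-step classical Runge-Kutta scheme and compares the result with the closed-form expressions. The function names, the value $\theta = 0.7$, the end time, and the step count are illustrative assumptions.

```python
import math

def joint_rhs(state, theta):
    """Right-hand side of the joint system for ds/dt = -theta*s.

    state[0] is s; state[1] is the sensitivity coefficient ds/dtheta.
    """
    s, z = state
    return (-theta * s,        # model equation, Eq. (8-4-8)
            -s - theta * z)    # sensitivity equation, Eq. (8-4-9)

def rk4_step(state, theta, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    def shifted(base, slope, factor):
        return tuple(b + factor * k for b, k in zip(base, slope))
    k1 = joint_rhs(state, theta)
    k2 = joint_rhs(shifted(state, k1, 0.5 * h), theta)
    k3 = joint_rhs(shifted(state, k2, 0.5 * h), theta)
    k4 = joint_rhs(shifted(state, k3, h), theta)
    return tuple(x + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4))

def integrate(theta, t_end, n_steps=200):
    """Integrate from t = 0 with s(0) = 1 and (ds/dtheta)(0) = 0."""
    state = (1.0, 0.0)
    h = t_end / n_steps
    for _ in range(n_steps):
        state = rk4_step(state, theta, h)
    return state

# Illustrative (assumed) parameter value and end time.
theta, t_end = 0.7, 2.0
s_num, z_num = integrate(theta, t_end)
s_exact = math.exp(-theta * t_end)            # s = exp(-theta*t)
z_exact = -t_end * math.exp(-theta * t_end)   # ds/dtheta = -t*exp(-theta*t)
print(s_num - s_exact, z_num - z_exact)
```

Because both equations are advanced with the same steps, the state and its sensitivity coefficient are obtained to comparable accuracy, which is the property claimed for the method in the text.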
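The required steps in a more general case are illustrated on the example of Section 8-2. The unknown parameters involved in the initial conditions and differential equations are $\beta$, $k_1$, $k_2$, and $k_3$. From Eq. (8-2-5) we have, at $t = 0$,

$$\frac{\partial c_1}{\partial\beta} = 0, \qquad \frac{\partial c_2}{\partial\beta} = 1, \qquad \frac{\partial c_3}{\partial\beta} = -1, \qquad \frac{\partial c_i}{\partial k_j} = 0 \quad (i, j = 1, 2, 3) \tag{8-4-11}$$

And from Eq. (8-2-4):

$$\begin{aligned}
\frac{d}{dt}\left(\frac{\partial c_1}{\partial\beta}\right) &= -2k_1c_1\frac{\partial c_1}{\partial\beta} + k_2c_3\frac{\partial c_2}{\partial\beta} + k_2c_2\frac{\partial c_3}{\partial\beta} \\
\frac{d}{dt}\left(\frac{\partial c_2}{\partial\beta}\right) &= 2k_1c_1\frac{\partial c_1}{\partial\beta} - (k_2c_3 + k_3)\frac{\partial c_2}{\partial\beta} - k_2c_2\frac{\partial c_3}{\partial\beta} \\
\frac{d}{dt}\left(\frac{\partial c_3}{\partial\beta}\right) &= 2k_1c_1\frac{\partial c_1}{\partial\beta} - (k_2c_3 - k_3)\frac{\partial c_2}{\partial\beta} - k_2c_2\frac{\partial c_3}{\partial\beta} \\
\frac{d}{dt}\left(\frac{\partial c_1}{\partial k_1}\right) &= -c_1^2 - 2k_1c_1\frac{\partial c_1}{\partial k_1} + k_2c_3\frac{\partial c_2}{\partial k_1} + k_2c_2\frac{\partial c_3}{\partial k_1} \\
\frac{d}{dt}\left(\frac{\partial c_2}{\partial k_1}\right) &= c_1^2 + 2k_1c_1\frac{\partial c_1}{\partial k_1} - (k_2c_3 + k_3)\frac{\partial c_2}{\partial k_1} - k_2c_2\frac{\partial c_3}{\partial k_1} \\
\frac{d}{dt}\left(\frac{\partial c_3}{\partial k_1}\right) &= c_1^2 + 2k_1c_1\frac{\partial c_1}{\partial k_1} - (k_2c_3 - k_3)\frac{\partial c_2}{\partial k_1} - k_2c_2\frac{\partial c_3}{\partial k_1} \\
\frac{d}{dt}\left(\frac{\partial c_1}{\partial k_2}\right) &= c_2c_3 - 2k_1c_1\frac{\partial c_1}{\partial k_2} + k_2c_3\frac{\partial c_2}{\partial k_2} + k_2c_2\frac{\partial c_3}{\partial k_2} \\
\frac{d}{dt}\left(\frac{\partial c_2}{\partial k_2}\right) &= -c_2c_3 + 2k_1c_1\frac{\partial c_1}{\partial k_2} - (k_2c_3 + k_3)\frac{\partial c_2}{\partial k_2} - k_2c_2\frac{\partial c_3}{\partial k_2} \\
\frac{d}{dt}\left(\frac{\partial c_3}{\partial k_2}\right) &= -c_2c_3 + 2k_1c_1\frac{\partial c_1}{\partial k_2} - (k_2c_3 - k_3)\frac{\partial c_2}{\partial k_2} - k_2c_2\frac{\partial c_3}{\partial k_2} \\
\frac{d}{dt}\left(\frac{\partial c_1}{\partial k_3}\right) &= -2k_1c_1\frac{\partial c_1}{\partial k_3} + k_2c_3\frac{\partial c_2}{\partial k_3} + k_2c_2\frac{\partial c_3}{\partial k_3} \\
\frac{d}{dt}\left(\frac{\partial c_2}{\partial k_3}\right) &= -c_2 + 2k_1c_1\frac{\partial c_1}{\partial k_3} - (k_2c_3 + k_3)\frac{\partial c_2}{\partial k_3} - k_2c_2\frac{\partial c_3}{\partial k_3} \\
\frac{d}{dt}\left(\frac{\partial c_3}{\partial k_3}\right) &= c_2 + 2k_1c_1\frac{\partial c_1}{\partial k_3} - (k_2c_3 - k_3)\frac{\partial c_2}{\partial k_3} - k_2c_2\frac{\partial c_3}{\partial k_3}
\end{aligned} \tag{8-4-12}$$

Eq. (8-2-5) and (11) provide initial conditions to the differential equations Eq. (8-2-4) and Eq. (12), which may be integrated simultaneously from $t = 0$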
to $t = t_\mu$. A separate integration is required for each run, each integration going up to the largest $t_\mu$ belonging to the run.

Setting up the Eq. (12) is a rather tedious task. The computer can perform this job, using Eq. (6), provided it is given subroutines that compute $\partial h/\partial\theta$ and $\partial h/\partial s$.

Once $\partial c_i/\partial\beta$ and $\partial c_i/\partial k_j$ ($i, j = 1, 2, 3$) have been computed for $t = t_\mu$, we derive from Eq. (8-2-6):

$$\begin{aligned}
D\tilde{y}_1/D\beta &= D\tilde{y}_2/D\beta = \partial c_1/\partial\beta \\
D\tilde{y}_1/Dk_j &= D\tilde{y}_2/Dk_j = \partial c_1/\partial k_j \quad (j = 1, 2, 3) \\
D\tilde{y}_3/D\beta &= q\,\partial c_1/\partial\beta + r\,\partial c_2/\partial\beta \\
D\tilde{y}_3/Dk_j &= q\,\partial c_1/\partial k_j + r\,\partial c_2/\partial k_j \quad (j = 1, 2, 3)
\end{aligned} \tag{8-4-13}$$

and also, for the additional parameters, $p$, $q$, and $r$:

$$\begin{aligned}
D\tilde{y}_a/Dp &= D\tilde{y}_a/Dq = D\tilde{y}_a/Dr = 0 \quad (a = 1, 2) \\
D\tilde{y}_3/Dp &= 1, \qquad D\tilde{y}_3/Dq = c_1, \qquad D\tilde{y}_3/Dr = c_2
\end{aligned} \tag{8-4-14}$$

This gives us all the quantities needed to evaluate $q$, using Eq. (1). The quantities $D\tilde{y}_\mu/D\theta$ are also used to generate $N$, the Gauss approximation to the Hessian. A numerical illustration of this method appears in Section 8-7.

It is possible to formulate the problem in such a way that only unknown initial conditions need be determined. This is done by replacing each parameter $\theta$ appearing in the differential equations by a new state variable $s_\theta$ subject to

$$\dot{s}_\theta = 0, \qquad s_\theta(0) = \theta \tag{8-4-15}$$

This procedure has been advocated by several authors (e.g., Bellman et al. (1967) within the framework of quasilinearization), but it serves no purpose other than to increase the number of differential equations that must be integrated.

8-5. Numerical Integration

The methods described in the preceding section require the numerical integration of a set of simultaneous first-order ordinary differential equations. Methods for performing this task are described in textbooks on the subject, to which the reader is referred. Routines for evaluating the integrals are available at most computer installations. The following are some remarks
pertinent to the choice of integration method in parameter estimation problems.
Most integration methods are either of the fixed- or the variable-step type. The former methods (e.g., Runge-Kutta) are easier to implement, but the latter provide better control over the truncation errors incurred in the calculations. Intelligent adjustment of step size can save a great deal of computer time. On the other hand, if we use a variable-step method, we must observe precautions. In such methods, the step length at any time is governed by the behavior of the equations. Two slightly different values of $\theta$ may give rise to different sequences of step lengths, resulting in slight discontinuities in the computed functions. These may cause severe errors in derivatives obtained by differencing. It is suggested, therefore, that all $p(l + 1)$ equations required for obtaining a complete set of differences be integrated simultaneously, all using the same integration step sizes.

In the algorithm for minimizing the objective function there occur some points (the main iterates) at which both the function and its derivatives are required, while at other points only the function is required. It is essential that the same function value should be obtained at a point whether or not derivatives are also required. Hence, regardless of the method used for computing derivatives, the integration step size should be determined by the behavior of the state equations at the point $\theta$ alone. The sensitivity equations, or the state equations at the perturbed points $\theta + \Delta\theta_a$, should have no effect on the integration step size. While this runs a certain risk of getting wrong values of the derivatives, in practice the alternative of computing a nonreproducible objective function has been found to give more trouble.

8-6.
Some Difficulties Associated with Dynamic Systems

Solutions of differential equations behave in a variety of ways: some are stable and converge to a steady state; some are unstable and diverge to infinity; others oscillate or enter into limit cycles. What concerns us here is the fact that the nature of the solutions to a given set of equations may change drastically when one changes the values of the parameters. For instance, the solution of $ds/dt + \theta s = 0$ is stable when $\theta$ is positive or zero, and unstable when $\theta$ is negative. For this reason, we may find it difficult to estimate parameters if the initial guesses or any subsequent iterates give rise to solutions of the wrong type.

In a few cases, we can overcome this problem easily enough. If the system described by $ds/dt + \theta s = 0$ is known from physical considerations to be stable, all we need do is impose the constraint $\theta \ge 0$. Again, if a system is described by the set of equations:

$$\begin{aligned}
ds_1/dt + h_{11}(\theta)s_1 + h_{12}(\theta)s_2 &= 0 \\
ds_2/dt + h_{21}(\theta)s_1 + h_{22}(\theta)s_2 &= 0
\end{aligned} \tag{8-6-1}$$
where $h_{ij}(\theta)$ are known functions, then the constraints

$$h_{11}(\theta) > 0, \qquad h_{22}(\theta) > 0, \qquad h_{11}(\theta)h_{22}(\theta) - h_{12}(\theta)h_{21}(\theta) > 0 \tag{8-6-2}$$

guarantee stability by making the matrix of coefficients positive definite.

Unfortunately, in most practical situations such conditions turn out to be unwieldy. Besides, unless the equations are of the linear time-invariant type it is difficult to formulate stability conditions which hold at all times. We don't even have any reason to believe that the solutions must be locally stable at all times, since although appearing unstable at one time, they may eventually pass into a stable region.

In addition to unstable solutions in which the state variables increase rapidly beyond bounds, we may be troubled by solutions which are too stable, i.e., in which the state variables rapidly reach steady state values which are independent at least of some of the parameters. Take, for instance, the system:

$$\begin{aligned}
ds_1/dt &= -k_1s_1 + k_2s_2, \qquad s_1(0) = c_1 \\
ds_2/dt &= k_1s_1 - k_2s_2, \qquad s_2(0) = c_2
\end{aligned} \tag{8-6-3}$$

the solution of which is:

$$\begin{aligned}
s_1 &= [(c_1 - Kc_2)/(1 + K)] \exp[-(k_1 + k_2)t] + K(c_1 + c_2)/(1 + K) \\
s_2 &= [(Kc_2 - c_1)/(1 + K)] \exp[-(k_1 + k_2)t] + (c_1 + c_2)/(1 + K)
\end{aligned} \tag{8-6-4}$$

where $K \equiv k_2/k_1$. Suppose we have assigned to $k_1$ and $k_2$ initial guesses that are much too large, so that exponential terms are already negligible for the smallest $t_\mu$ at which measurements of $s$ are available. Then

$$s_1 = K(c_1 + c_2)/(1 + K), \qquad s_2 = (c_1 + c_2)/(1 + K) \tag{8-6-5}$$

Clearly, we have lost all information pertaining to $k_1$ and $k_2$ individually, and we can hope to determine only their ratio $K$. In other words, the values $k_1 = 10{,}000$, $k_2 = 20{,}000$ will fit the data just as well (or just as poorly) as $k_1 = 100{,}000$, $k_2 = 200{,}000$, and the estimation procedure would have no incentive to reduce the values of $k_1$ and $k_2$, but only to adjust their ratio.
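The loss of information described above is easy to reproduce. The sketch below evaluates the closed-form solution (8-6-4) for the two pairs of rate coefficients quoted in the text and for a moderate pair with the same ratio. The initial conditions $c_1 = 1$, $c_2 = 0$, the earliest measurement time 0.01, and the moderate pair $(1, 2)$ are assumptions chosen only for illustration.

```python
import math

def two_state_solution(k1, k2, c1, c2, t):
    """Closed-form solution, Eq. (8-6-4), of the system in Eq. (8-6-3)."""
    K = k2 / k1
    decay = math.exp(-(k1 + k2) * t)
    s1 = (c1 - K * c2) / (1.0 + K) * decay + K * (c1 + c2) / (1.0 + K)
    s2 = (K * c2 - c1) / (1.0 + K) * decay + (c1 + c2) / (1.0 + K)
    return s1, s2

# Assumed initial conditions and earliest measurement time.
c1, c2 = 1.0, 0.0
t_first = 0.01

# The two parameter pairs from the text share the ratio K = 2.
large = two_state_solution(1.0e4, 2.0e4, c1, c2, t_first)
larger = two_state_solution(1.0e5, 2.0e5, c1, c2, t_first)
# A moderate pair with the same ratio still shows the transient.
moderate = two_state_solution(1.0, 2.0, c1, c2, t_first)

print(large, larger, moderate)
```

With the large coefficients the exponential term is far below machine precision at the first measurement, so both pairs give the steady state of Eq. (8-6-5); only the moderate pair produces a solution that still depends on the individual values of $k_1$ and $k_2$.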
It seems clear then that we are most likely to avoid both instability and overstability if we start out with very small values of any unknown parameters which are rate coefficients. This gives us the best chance of obtaining solutions whose magnitude remains reasonable throughout the time intervals for which observations are available, and which are sensitive to the values of the parameters. In many cases it pays to place reasonable bounds on the magnitudes of the state variables. Should these be exceeded in the course of an integration, we reject the current value of $\theta$ as infeasible. If we already have a feasible $\theta$ from the previous iteration, then we can interpolate, i.e.,
return to a value of $\theta$ halfway between the current and previous values. If necessary, this procedure may be repeated several times. If the infeasibility occurs in the course of the first iteration, simply reducing the magnitudes of all parameters by successive halvings often produces feasible values.

Alternatively, we may temporarily assign fictitious observed values zero to the state variables at time $t$ equal to the value at which these variables exceed their bounds, and ignore for the present iteration observations taken at later $t$. For example, suppose $s_1$ is observed at $t = 1, 2, \ldots, 10$, and $|s_1| \le 1000$ is the bound. If we integrate the equations for current values of $\theta$ and find that $s_1 = 1000$ at $t = 4.5$, then we act as though we only had the observations at $t = 1, 2, 3, 4$, and in addition we add an "observation" $s_1 = 0$ at $t = 4.5$. It may be profitable to attach a large weight to this last observation in forming the objective function.

Degeneracies of various types may arise when the differential equations are linearly dependent. In the chemical reaction scheme of Eq. (3) for instance, we have $ds_1/dt = -ds_2/dt$, hence $s_1 + s_2 = c_1 + c_2$ remains constant. Suppose all our observations were taken in runs for which the initial conditions $c_1$ and $c_2$ always added up to the same value $\gamma$. Suppose, further, that the observed variable $y$ is a linear function of $s_1$ and $s_2$ with unknown coefficients $b_0$, $b_1$, and $b_2$. Thus

$$y = b_0 + b_1s_1 + b_2s_2 = b_0 + b_1(s_1 + s_2) + (b_2 - b_1)s_2 = b_0 + b_1\gamma + (b_2 - b_1)s_2 = \theta_0 + \theta_2s_2 \tag{8-6-6}$$

Under these conditions, then, $y$ appears to be a linear function of $s_2$ (or $s_1$) alone. Any attempt to determine three coefficients independently will fail unless new observations with different values of $c_1 + c_2$ are made. Additional problems associated with linearly dependent systems are discussed in Section 8-8.

8-7.
A Chemical Kinetics Problem

The following example is somewhat artificially concocted, but it serves to illustrate many points.

We consider a heterogeneous catalytic reaction in which a molecule of species A is reversibly transformed into two molecules of species B

$$A \rightleftharpoons 2B$$

If it were not for the catalyst, then the rate of the forward reaction $A \to 2B$ would be proportional to $s_1$, the concentration of A

$$R_F = k_Fs_1$$
".i: 234 :l VIII Dynamic Models The rate of the reverse reaction would be proportional to the square of S2 J the concentration of B R H = kHs/ If the reaction reaches a state of equIlibnum. no further changes 111 concentra- tions occur because Rr = R H ; hence kH!k r = s//(S/')2 where SI Land s/ are the concentrations at equilibirum, and K == k R/k I' is called an equilibrium constant. It is determined by thermodynamic considera- tions alone, and is unaffected by the catalyst. The net forward reaction rate is given by R = Rr - R H = krs] - kRs/ = kAsl - Ks/) The presence of the catalyst affects the value of R. The nature of the effect depends on the mechanism of the reaction. We shall adopt the following expression for the rate of the catalyzed reaction R = "r(s] - Ks/)/(I + MS])2 ;:jl The three constants k 1" K, and M are functions of the temperature T, usually:.'j: " 'j assumed to be of the form: ')! kr=O] exp(-Oz/T) K = CI. exp( - fJ/T) M = OJ exp( -04/T) We assume that K has been determined accurately from thermOdynamic, : ': . ' . ) : ' data as . K = exp( -IOOO/T) Species A is disappearing at a rate equal to R, and B appears at a rate of 2R. Hence the differential equations goverping the system are: dl/d( = I1 I (s,O) = -0 1 exp(-02/T)(SI - e-IOOO/Ts/)/(1 + OJ ex p(-04/ T )S])z "i d2Id( = 11 2 (s, = 0) 20 1 exp( - 02/T)(St - e - I OOO/T s/)/( I + OJ exp( - 04/T)Sl)Z,;j (8-7-1) , .:J To estimate 0 1 , O 2 . OJ' and 0 4 we conduct three runs, at temperatures!::i T = 200°. 400°. and 600 :'. The second run is started with pure A, and the,; third run with pure B. Otherwise. the initial concentrations are known only\ approximately. In the first run ;;J '\.1(0) = 0 5 = I :t 0.05, In the second run S2(0) = 0 6 = I :t 0.05 SI(O) = 0 7 = I :!: 0,05, S2(0) = 0 In the third run s](O) = 0, S2(0) = Os = I :!: 0,05 ".;  '. L :<.< ..,'. 
In the course of each run, samples are withdrawn at ten different times (including initially, at $t = 0$), and analyzed in a densitometer. The instrument's readings are linear in the concentrations of A and B, i.e.,

$$y = 1 + \theta_9s_1 + \theta_{10}s_2 \tag{8-7-2}$$

The coefficients $\theta_9$ and $\theta_{10}$ are known approximately from past experience

$$\theta_9 = 1 \pm 0.05, \qquad \theta_{10} = 2 \pm 0.05$$

The data are given in Table 8-1.

Table 8-1 Data for Kinetics Problem

Run   T     s1(0)   s2(0)   μ     t_μ    y_μ
1     200   θ5      θ6      1     0      3.988
                            2     10     4.073
                            3     20     4.153
                            4     30     4.231
                            5     40     4.309
                            6     50     4.376
                            7     60     4.457
                            8     70     4.522
                            9     80     4.615
                            10    90     4.667
2     400   θ7      0       11    0      1.997
                            12    2      2.149
                            13    4      2.320
                            14    6      2.465
                            15    8      2.611
                            16    10     2.754
                            17    12     2.896
                            18    14     3.034
                            19    16     3.166
                            20    18     3.278
3     600   0       θ8      21    0      3.012
                            22    0.5    2.956
                            23    1      2.926
                            24    1.5    2.877
                            25    2      2.853
                            26    2.5    2.823
                            27    3      2.800
                            28    3.5    2.776
                            29    4      2.767
                            30    4.5    2.760
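The model of Eqs. (8-7-1) and (8-7-2) can be sketched in a few lines of code. The following is an illustrative sketch rather than the program used for the results in this section: it codes the right-hand side of Eq. (8-7-1) and integrates the first run ($T = 200$) with a fixed-step classical Runge-Kutta scheme. The function names and the step count are assumptions; the parameter values $[2, 500, 0.5, 50]$ and the initial concentrations $(1, 1)$ are the initial guesses adopted later in this section.

```python
import math

def rate_rhs(s, theta, T):
    """Right-hand side (h1, h2) of Eq. (8-7-1) at temperature T."""
    s1, s2 = s
    k_f = theta[0] * math.exp(-theta[1] / T)   # forward rate constant k_F
    K = math.exp(-1000.0 / T)                  # equilibrium constant
    M = theta[2] * math.exp(-theta[3] / T)     # catalyst term
    R = k_f * (s1 - K * s2 ** 2) / (1.0 + M * s1) ** 2
    return (-R, 2.0 * R)

def integrate_run(theta, T, s0, t_end, n_steps=1000):
    """Fixed-step classical Runge-Kutta integration from t = 0 to t_end."""
    s = tuple(s0)
    h = t_end / n_steps
    for _ in range(n_steps):
        k1 = rate_rhs(s, theta, T)
        k2 = rate_rhs((s[0] + 0.5 * h * k1[0], s[1] + 0.5 * h * k1[1]), theta, T)
        k3 = rate_rhs((s[0] + 0.5 * h * k2[0], s[1] + 0.5 * h * k2[1]), theta, T)
        k4 = rate_rhs((s[0] + h * k3[0], s[1] + h * k3[1]), theta, T)
        s = (s[0] + h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0,
             s[1] + h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0)
    return s

# Initial guesses for theta_1..theta_4 and for the run 1 concentrations.
theta_guess = [2.0, 500.0, 0.5, 50.0]
s1_10, s2_10 = integrate_run(theta_guess, 200.0, (1.0, 1.0), 10.0)

# Predicted densitometer reading, Eq. (8-7-2), with theta_9 = 1, theta_10 = 2.
predicted_reading = 1.0 + 1.0 * s1_10 + 2.0 * s2_10
print(s1_10, s2_10, predicted_reading)
```

Since $h_2 = -2h_1$, the combination $2s_1 + s_2$ is conserved by the integration, which gives a simple check on any implementation of the model.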
Along with $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$, we must also estimate the values of the unknown initial concentrations $\theta_5$, $\theta_6$, $\theta_7$, and $\theta_8$, and of the unknown coefficients $\theta_9$ and $\theta_{10}$. To account for our partial knowledge of the latter values, we assign to the six last parameters independent normal prior distributions with means $[1, 1, 1, 1, 1, 2]$ and standard deviations 0.05. If we do not know the standard deviation of the measurement errors, we are led to the objective function

$$\Phi(\theta) = (30/2) \log \sum_{\mu=1}^{30} e_\mu^2(\theta) + \frac{1}{2 \times 0.05^2}\left[(\theta_5 - 1)^2 + (\theta_6 - 1)^2 + (\theta_7 - 1)^2 + (\theta_8 - 1)^2 + (\theta_9 - 1)^2 + (\theta_{10} - 2)^2\right]$$

where

$$e_\mu(\theta) = y_\mu - 1 - \theta_9s_1(t_\mu, \theta') - \theta_{10}s_2(t_\mu, \theta')$$

Here $\theta'$ denotes the vector consisting of the first eight elements of $\theta$. The functions $s(t_\mu, \theta')$ are to be determined by integrating Eq. (1) from $t = 0$ to $t = t_\mu$, using initial conditions $(\theta_5, \theta_6)$, $(\theta_7, 0)$, or $(0, \theta_8)$ depending on whether the $\mu$th experiment belongs to run 1, 2, or 3.

We shall estimate $\theta$ by means of the method of sensitivity equations. That is, along with the two Eqs. (1) we integrate at each iteration the set of sixteen differential equations for the functions $\partial s/\partial\theta'$. The initial conditions for these equations are

$$\left(\frac{\partial s}{\partial\theta'}\right)_{t=0} = \begin{cases}
\begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix} & \text{(run 1)} \\[2ex]
\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} & \text{(run 2)} \\[2ex]
\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} & \text{(run 3)}
\end{cases}$$

To form the differential Eqs. (8-4-6) we need the matrices $\partial h/\partial\theta'$ and $\partial h/\partial s$. The first row of each matrix is given by:

$$\frac{\partial h_1}{\partial\theta'} = \left[\frac{h_1}{\theta_1},\ -\frac{h_1}{T},\ -\frac{2h_1 \exp(-\theta_4/T)s_1}{1 + \theta_3 \exp(-\theta_4/T)s_1},\ \frac{2h_1\theta_3 \exp(-\theta_4/T)s_1}{T(1 + \theta_3 \exp(-\theta_4/T)s_1)},\ 0,\ 0,\ 0,\ 0\right]$$

$$\frac{\partial h_1}{\partial s} = \left[-\frac{\theta_1 \exp(-\theta_2/T)}{(1 + \theta_3 \exp(-\theta_4/T)s_1)^2} - \frac{2h_1\theta_3 \exp(-\theta_4/T)}{1 + \theta_3 \exp(-\theta_4/T)s_1},\ \frac{2\theta_1 \exp[-(\theta_2 + 1000)/T]s_2}{(1 + \theta_3 \exp(-\theta_4/T)s_1)^2}\right]$$
To obtain the second row in each case we multiply the first row by $-2$. The reader may write out in full the differential equations for the sixteen functions $\partial s/\partial\theta'$.

We use the initial guesses $[2, 500, 0.5, 50]$ for the first four parameters. The guesses $[1, 1, 1, 1, 1, 2]$ for the remaining parameters are obvious. In Table 8-2 we give the results of integrating the equations in $s$ and $\partial s/\partial\theta'$ for the first run from $t = 0$ to $t = 10$, using the initial guess values for $\theta$. The values of $\partial s/\partial\theta_7$ and $\partial s/\partial\theta_8$ were omitted, being all zero for the first run.

Table 8-2 Integrated First Run Data (a), Using Initial Guess θ_1

             t = 0            t = 10
             s1     s2        s1               s2
s            1      1         0.3623495        2.275271
∂s/∂θ1       0      0         -0.2064785       0.4129581
∂s/∂θ2       0      0         0.002064790      -0.004129570
∂s/∂θ3       0      0         0.3270462        -0.6540943
∂s/∂θ4       0      0         -0.0008176151    0.001635234
∂s/∂θ5       1      0         0.5249249        0.950146
∂s/∂θ6       0      1         0.01804298       0.9639123

(a) Accurate to about five decimal places.

From these values it is easy to compute the residuals and their derivatives for the first two observations. The residuals can be found from Eq. (2). Writing $f_\mu = 1 + \theta_9s_1 + \theta_{10}s_2$ for the predicted reading, we have first, at $\mu = 1$ ($t = 0$):

$$e_1 = y_1 - (1 + \theta_9s_1 + \theta_{10}s_2) = 3.988 - 1 - 1 \times 1 - 2 \times 1 = -0.012$$

$$\frac{\partial f_1}{\partial\theta} = -\frac{\partial e_1}{\partial\theta} = \begin{bmatrix} \theta_9\,\partial s_1/\partial\theta' + \theta_{10}\,\partial s_2/\partial\theta' \\ s_1 \\ s_2 \end{bmatrix} = [0,\ 0,\ 0,\ 0,\ 1,\ 2,\ 0,\ 0,\ 1,\ 1]^T$$
and, at $\mu = 2$ ($t = 10$):

$$e_2 = 4.073 - 1 - 1 \times 0.3623495 - 2 \times 2.275272 = -1.83989$$

$$\frac{\partial f_2}{\partial\theta} = \begin{bmatrix} 1 \times (-0.2064785) + 2 \times (0.4129581) \\ 1 \times (0.002064790) + 2 \times (-0.004129570) \\ \vdots \\ 0 \\ 0 \\ 0.3623495 \\ 2.275272 \end{bmatrix} = \begin{bmatrix} 0.6194376 \\ -0.006194349 \\ -0.9811422 \\ 0.002452854 \\ 2.425216 \\ 1.945867 \\ 0 \\ 0 \\ 0.3623495 \\ 2.275272 \end{bmatrix}$$

In similar manner, we can compute $e_\mu$ and $\partial f_\mu/\partial\theta$ for $\mu = 3, 4, \ldots, 30$. From these, $q$ and $N$ can be computed and the Gauss method applied. The process does not converge unless penalty functions are used to maintain all parameters (or at least the first four) strictly positive. The solution is

$$\theta^* = \begin{bmatrix} 1.39266 \pm 0.20891 \\ 1140.01 \pm 75.16 \\ 1.82052 \pm 0.80081 \\ 366.524 \pm 194.213 \\ 1.00601 \pm 0.04502 \\ 0.998853 \pm 0.02988 \\ 0.986829 \pm 0.02606 \\ 1.01898 \pm 0.01677 \\ 1.01086 \pm 0.02620 \\ 1.97541 \pm 0.03198 \end{bmatrix}$$

One would judge all parameters well-determined except for $\theta_3$ and $\theta_4$. The amount of information concerning the values of $\theta_5$ through $\theta_{10}$ that was gained from the data can be gauged by comparing the present standard deviations of these parameters with the prior standard deviations of 0.05. There is substantial improvement in all cases but $\theta_5$.

8-8. Linearly Dependent Equations

We return to the kinetics problem of the previous section, but now the initial conditions for each run are known precisely (see Table 8-3). This time the concentrations of A and B are measured directly in each experiment, and appear under the headings $y_{\mu 1}$ and $y_{\mu 2}$ respectively in Table 8-4. We recall
Table 8-3 Run Data for Kinetics Problem

       Initial conditions
Run    s1(0)      s2(0)      α = 2s1(0) + s2(0)    T
1      1.00830    0.99662    3.01322               200°
2      0.98862    0          1.97724               400°
3      0          1.01731    1.01731               600°

Table 8-4 Data for Kinetics Problem

Run   μ     t_μ    y_μ1      y_μ2      ŷ_μ1 = ŝ_μ1   ŷ_μ2
1     1     10     0.98040   1.05134   0.980832      1.051556
      2     20     0.95262   1.10796   0.952628      1.107964
      3     30     0.92703   1.16017   0.926626      1.159968
      4     40     0.90120   1.21002   0.901520      1.210180
      5     50     0.87706   1.26056   0.876476      1.260268
      6     60     0.84883   1.31534   0.848918      1.315384
      7     70     0.82480   1.36299   0.825052      1.363116
      8     80     0.80163   1.41035   0.801474      1.410272
      9     90     0.77726   1.45822   0.777452      1.458316
2     10    2      0.93445   0.10857   0.934358      0.108524
      11    4      0.88192   0.21395   0.881700      0.213840
      12    6      0.82997   0.31716   0.830026      0.317188
      13    8      0.78015   0.41627   0.780418      0.416404
      14    10     0.73047   0.51561   0.730746      0.515748
      15    12     0.68271   0.61219   0.682562      0.612116
      16    14     0.63816   0.70085   0.638188      0.700864
      17    16     0.59445   0.78925   0.594086      0.789068
      18    18     0.55375   0.86976   0.553742      0.869756
3     19    0.5    0.01688   0.98263   0.017248      0.982814
      20    1      0.03331   0.94991   0.033622      0.950066
      21    1.5    0.04726   0.92359   0.046940      0.923430
      22    2      0.05604   0.90485   0.056192      0.904926
      23    2.5    0.06330   0.89059   0.063348      0.890614
      24    3      0.07073   0.87613   0.070618      0.876074
      25    3.5    0.07811   0.86138   0.077994      0.861322
      26    4      0.08214   0.85298   0.082160      0.852990
      27    4.5    0.08782   0.84250   0.087488      0.842334
that for each molecule of A that disappears, two molecules of B are created. Hence the quantity $\alpha = 2s_1(t) + s_2(t)$ remains constant throughout any run. We can compute its value from the known initial conditions $\alpha = 2s_1(0) + s_2(0)$ (see Table 8-3), and it does not depend on what values of $\theta$ we choose.

Suppose $\hat\theta$ is the true value of $\theta$. The observed concentrations are given by

$$y_{\mu 1} = s_1(t_\mu, \hat\theta) + \epsilon_{\mu 1}, \qquad y_{\mu 2} = s_2(t_\mu, \hat\theta) + \epsilon_{\mu 2}$$

where $\epsilon_{\mu 1}$ and $\epsilon_{\mu 2}$ are errors, hopefully small. Hence

$$2y_{\mu 1} + y_{\mu 2} = 2s_1 + s_2 + 2\epsilon_{\mu 1} + \epsilon_{\mu 2} = \alpha + 2\epsilon_{\mu 1} + \epsilon_{\mu 2} \tag{8-8-1}$$

But also, for any trial value $\theta$

$$2s_1(t_\mu, \theta) + s_2(t_\mu, \theta) = \alpha \tag{8-8-2}$$

Subtracting Eq. (2) from Eq. (1) and remembering that $e_\mu(\theta) \equiv y_\mu - s(t_\mu, \theta)$ we obtain

$$2e_{\mu 1}(\theta) + e_{\mu 2}(\theta) = 2\epsilon_{\mu 1} + \epsilon_{\mu 2} \tag{8-8-3}$$

Unless $\theta$ is close to $\hat\theta$, the residuals $e_\mu(\theta)$ are large compared to the errors $\epsilon_\mu$, hence Eq. (3) takes the approximate form

$$e_{\mu 2}(\theta) \approx -2e_{\mu 1}(\theta) \qquad (\theta \ne \hat\theta)$$

From this it follows that the moment matrix $M(\theta)$ is nearly singular

$$M(\theta) \approx \begin{bmatrix} \sum_\mu e_{\mu 1}^2 & -2\sum_\mu e_{\mu 1}^2 \\ -2\sum_\mu e_{\mu 1}^2 & 4\sum_\mu e_{\mu 1}^2 \end{bmatrix} \qquad (\theta \ne \hat\theta)$$

Indeed, an attempt to estimate $\theta = [\theta_1, \theta_2, \theta_3, \theta_4]^T$ by minimizing $\log \det M(\theta)$ fails when one starts with $\theta_1 = [2, 500, 0.5, 50]^T$. One simply finds $\det M(\theta_1) = 0$, and no progress can be made. However, using the results of Problem 8, Section 4-21, let us take $s_1$ as our sole state variable (if we know $s_1$ we can always compute $s_2 = \alpha - 2s_1$). If we define $y_1' \equiv y_1$ and $y_2' \equiv \alpha - y_2$, then we have the representation

$$y_1' = s_1, \quad y_2' = 2s_1 \qquad \text{or} \qquad y' = Bs_1 \tag{8-8-4}$$

where $B = [1, 2]^T$. Let us assume, further, that the measurement errors of $y_{\mu 1}$ and $y_{\mu 2}$ are independent and have the same standard deviation $\sigma$. Since $\alpha$ is a known constant, the error in $y_2'$ likewise has standard deviation $\sigma$.
Hence, at each experiment we can obtain the least squares estimate of $s_{\mu 1}$ on the basis of the measured $y_{\mu 1}$ and $y_{\mu 2}$

$$\hat{s}_{\mu 1} = (B^TB)^{-1}B^Ty_\mu' \tag{8-8-5}$$

or

$$\hat{s}_{\mu 1} = \left([1, 2]\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right)^{-1}[1, 2]\begin{bmatrix} y_1' \\ y_2' \end{bmatrix} = \tfrac{1}{5}y_1' + \tfrac{2}{5}y_2' = \tfrac{1}{5}y_{\mu 1} + \tfrac{2}{5}(\alpha - y_{\mu 2})$$

The values of $\hat{s}_{\mu 1}$ appear in Table 8-4, as do the values of $\hat{y}_\mu$ computed according to

$$\hat{y}_{\mu 1} = \hat{s}_{\mu 1}, \qquad \hat{y}_{\mu 2} = \alpha - 2\hat{s}_{\mu 1} \tag{8-8-6}$$

The standard deviation of the measurement errors may be estimated from

$$\hat\sigma = \left\{\frac{1}{27(2 - 1)}\sum_{\mu=1}^{27}\left[(y_{\mu 1} - \hat{y}_{\mu 1})^2 + (y_{\mu 2} - \hat{y}_{\mu 2})^2\right]\right\}^{1/2} = 0.0002876$$

The values $\hat{s}_{\mu 1}$ can now be used as "data" for estimating $\theta$. We do this by minimizing

$$\Phi(\theta) = \sum_{\mu=1}^{27}\left[\hat{s}_{\mu 1} - s_1(t_\mu, \theta)\right]^2$$

We employ the Gauss method, and use penalty functions to keep all $\theta_a$ positive. Starting from $\theta_1 = [2, 500, 0.5, 50]^T$ we arrive in 19 iterations at the solution

$$\theta^* = \begin{bmatrix} 1.?8393 \pm 0.05?8515 \\ 1175.?6 \pm 19.5?9 \\ 2.29692 \pm 0.344311 \\ 471.100 \pm 66.5655 \end{bmatrix} \tag{8-8-7}$$

The estimates of all parameters are fairly well-determined. The minimum sum of squares is

$$\Phi(\theta^*) = 0.1353722 \times 10^{-4}$$

The estimate Eq. (7) should be sufficiently close to $\hat\theta$ so as to make $M(\theta^*)$ nonsingular. Hence we can use $\theta^*$ as the initial guess for minimizing

$$\Phi(\theta) = (27/2) \log \det M(\theta) \tag{8-8-8}$$

Indeed, three iterations bring us to

$$\theta^{**} = \begin{bmatrix} 1.48555 \pm 0.0395848 \\ 1176.30 \pm 13.5234 \\ 2.31028 \pm 0.241886 \\ 473.880 \pm 46.4845 \end{bmatrix}$$
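The smoothing step of Eqs. (8-8-5) and (8-8-6) and the estimate of $\sigma$ can be reproduced directly from Tables 8-3 and 8-4. The sketch below is illustrative only; the function and variable names are assumptions, while the numbers are the tabulated measurements.

```python
import math

# Invariant alpha = 2*s1(0) + s2(0) for each run (Table 8-3).
ALPHA = {1: 3.01322, 2: 1.97724, 3: 1.01731}

# (run, y_mu1, y_mu2) for the 27 experiments of Table 8-4.
DATA = [
    (1, 0.98040, 1.05134), (1, 0.95262, 1.10796), (1, 0.92703, 1.16017),
    (1, 0.90120, 1.21002), (1, 0.87706, 1.26056), (1, 0.84883, 1.31534),
    (1, 0.82480, 1.36299), (1, 0.80163, 1.41035), (1, 0.77726, 1.45822),
    (2, 0.93445, 0.10857), (2, 0.88192, 0.21395), (2, 0.82997, 0.31716),
    (2, 0.78015, 0.41627), (2, 0.73047, 0.51561), (2, 0.68271, 0.61219),
    (2, 0.63816, 0.70085), (2, 0.59445, 0.78925), (2, 0.55375, 0.86976),
    (3, 0.01688, 0.98263), (3, 0.03331, 0.94991), (3, 0.04726, 0.92359),
    (3, 0.05604, 0.90485), (3, 0.06330, 0.89059), (3, 0.07073, 0.87613),
    (3, 0.07811, 0.86138), (3, 0.08214, 0.85298), (3, 0.08782, 0.84250),
]

def smooth(run, y1, y2):
    """Return (y_hat_1, y_hat_2) from Eqs. (8-8-5) and (8-8-6).

    With B = [1, 2]^T the least squares estimate of s1 is
    (y1' + 2*y2')/5, where y1' = y1 and y2' = alpha - y2.
    """
    alpha = ALPHA[run]
    s1_hat = 0.2 * y1 + 0.4 * (alpha - y2)
    return s1_hat, alpha - 2.0 * s1_hat

def sigma_estimate(data):
    """Estimated standard deviation of the measurement errors."""
    total = 0.0
    for run, y1, y2 in data:
        y1_hat, y2_hat = smooth(run, y1, y2)
        total += (y1 - y1_hat) ** 2 + (y2 - y2_hat) ** 2
    degrees_of_freedom = len(data) * (2 - 1)
    return math.sqrt(total / degrees_of_freedom)

first_fit = smooth(*DATA[0])
sigma_hat = sigma_estimate(DATA)
print(first_fit, sigma_hat)
```

The first fitted pair agrees with the first row of Table 8-4, and the estimate of $\sigma$ agrees with the value 0.0002876 quoted in the text to the digits shown.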
8-9. Problems

1. State variables are usually computed by solving finite difference equations which are approximations to the differential equations. It is suggested by Kelley and Denham (1969) that one ought to obtain exact derivatives of the approximate $s$ (by differentiating the difference equations with respect to $\theta$), rather than approximate derivatives of the exact $s$ (by solving the sensitivity equations approximately with a finite difference method). Show that with the Runge-Kutta method both approaches lead to precisely the same results. Illustrate with the models of Eqs. (8-2-4) and (8-4-8).

2. Using the method suggested at the end of Section 8-3, solve the following two point boundary value problem: Find $s_1(0)$ for the system

$$\dot{s}_1 = s_1 - 2ts_1/(1 + s_2), \qquad \dot{s}_2 = 2\log[s_1(1 + s_2)], \qquad s_2(0) = 0, \quad s_2(1) = 1$$

Use the initial guess $s_1(0) = 1.5$.

3. Suppose the observed variables are measured continuously, so that a record exists of $y(t)$ ($0 \le t \le T$). Let the objective function be

$$\Phi(\theta) = \int_0^T e^T(t, \theta)Qe(t, \theta)\,dt$$

where $Q$ is a given positive definite matrix. Derive the sensitivity equation, i.e., a differential equation for $\partial\Phi/\partial\theta$. Apply a variable metric method to finding the minimum of $\Phi$ for an example given by Bellman, Kagiwada, et al. (1964): There are two parameters and one state variable

$$\dot{s} = -s + \theta_1s^3, \qquad s(0) = \theta_2$$

The observed data are given as

$$y(t) = s(t) + 0.5\cos 60t$$

where $s(t)$ is the solution of the differential equation with $\theta_1 = 1/30$ and $\theta_2 = 1$. Use $T = 5$ and $Q = 1$, i.e.,

$$\Phi(\theta) = \int_0^5 [y(t) - s(t)]^2\,dt$$

4. The errors in measurements taken in the course of one physical run are generally not independent. Using the theory of power spectra, one can obtain expressions for the covariance between the errors in different measurements, provided the differential equations are linear. Specifically, suppose

$$ds/dt = \theta s + \epsilon(t), \qquad s(0) = s_0$$
where $\varepsilon(t)$ is a random noise with given power spectrum. Let $\hat{s}$ satisfy

\[ d\hat{s}/dt = \theta\hat{s}, \qquad \hat{s}(0) = s_0 \]

and let $u(t) = s(t) - \hat{s}(t)$ be the error function. Derive expressions for the power spectrum of $u(t)$, and for the autocovariance function $V(t, \tau) = E[u(t)u(\tau)]$. Generalize to the case of a vector of state variables, i.e.,

\[ ds/dt = A(\theta)s + \varepsilon(t) \]
Chapter IX
Some Special Problems

9-1. Missing Observations

It is not uncommon to find that one is missing one or more data items, i.e., elements of the matrix $W$. We distinguish between two situations, which are illustrated by the following cases:

(a) A single equation model, with the objective function $\sum_{\mu=1}^{n} e_\mu^2(\theta)$. The value of $x_{1,1}$ is missing. It is clear that we cannot do better than determine $\theta^*$ so as to minimize $\sum_{\mu=2}^{n} e_\mu^2(\theta)$. The entire first experiment contributes nothing to the estimation of $\theta$, and may be dropped from the objective function.

Another example in which a term may be dropped occurs when $y_{1,1}$ is missing and the objective function is

\[ \sum_{\mu=1}^{n} \sum_{a=1}^{m} b_{\mu a}\,[y_{\mu a} - f_a(x_\mu, \theta)]^2 \]

(b) On the other hand, with $y_{1,1}$ missing, suppose the objective function has the form

\[ \sum_{\mu=1}^{n} \sum_{a,b=1}^{m} B_{\mu a b}\,[y_{\mu a} - f_a(x_\mu, \theta)][y_{\mu b} - f_b(x_\mu, \theta)] \]

Now $y_{1,1}$ appears in several terms, which cannot all be dropped.

Similarly, with $x_{1,1}$ missing, let the model consist of two equations. The values of $y_{1,1}$ and $y_{1,2}$ jointly do contain some information on $\theta$, since we cannot in general solve the equations

\[ y_{1,1} - f_1(x_{1,1}, \ldots, \theta^*) = 0, \qquad y_{1,2} - f_2(x_{1,1}, \ldots, \theta^*) = 0 \tag{9-1-1} \]

simultaneously for $x_{1,1}$. The first experiment residuals should not be dropped from the objective function. Instead, the missing value $x_{1,1}$ or $y_{1,1}$ can be regarded as an additional unknown parameter, whose value is to be determined together with $\theta$ so as to optimize the objective function.
When several noneliminable items are missing, they can all be treated as unknown parameters. However, from a practical computational point of view only a few such parameters can be handled in this manner. The following is a systematic approach:

1. Write down the objective function, with all missing data items represented as unknown parameters.
2. Differentiate this expression with respect to all the missing-data parameters, and equate the derivatives to zero.
3. Solve the equations thus formed for the missing-data parameters.
4. Substitute the solutions in the objective function for those parameters where the substitution results in a simplification of the expression. Retain other unknown parameters in the objective function.

Examples:

1. $\Phi(\theta) = \sum_{\mu=1}^{n} e_\mu^2(\theta)$, with $x_{1,1}$ unknown.

\[ \partial\Phi/\partial x_{1,1} = -2e_1(\theta)\,\partial f_1/\partial x_{1,1} = 0 \quad\Longrightarrow\quad e_1(\theta) = 0 \]

Substituting in $\Phi(\theta)$, we find $\Phi(\theta) = \sum_{\mu=2}^{n} e_\mu^2(\theta)$, i.e., we drop the first term.

2. $\Phi(\theta) = \sum_{\mu=1}^{n} \sum_{a,b=1}^{m} B_{\mu a b}\, e_{\mu a} e_{\mu b}$, with $y_{1,1}$ missing.

\[ \partial\Phi/\partial y_{1,1} = 2\sum_{a=1}^{m} B_{1,1,a}\, e_{1,a} = 0 \quad\Longrightarrow\quad e_{1,1} = -B_{1,1,1}^{-1} \sum_{a=2}^{m} B_{1,1,a}\, e_{1,a} \]

Substitution of this expression in $\Phi(\theta)$ only complicates matters, so we may as well retain $y_{1,1}$ as an unknown parameter. Incidentally, the same result would be obtained if $e_{1,1}$ is made an unknown parameter, and then $y_{1,1}$ can be computed from $y_{1,1} = e_{1,1}^* + f_1(x_1, \theta^*)$, where $e_{1,1}^*$ and $\theta^*$ are the estimated values.

A survey of the literature on the problem of missing observations is given by Afifi and Elashoff (1966), but most of the reported results pertain only to multiple linear regression. Numerical illustrations appear in Sections 9-6 and 9-7.
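The stationarity condition of Example 2 can be checked numerically. The following is a minimal sketch, assuming a single experiment with $m = 2$ equations; the weight matrix `B` and the residual value in `e_rest` are hypothetical numbers chosen only for illustration, and the function names are not from the text.

```python
def phi_first_experiment(e, B):
    """Quadratic form sum_{a,b} B[a][b] * e[a] * e[b] for one experiment."""
    m = len(e)
    return sum(B[a][b] * e[a] * e[b] for a in range(m) for b in range(m))


def optimal_missing_residual(e_rest, B):
    """Residual e_{1,1} that zeroes the derivative with respect to y_{1,1}.

    e_rest holds the known residuals e_{1,2}, ..., e_{1,m}; B is assumed
    symmetric, as in Example 2.
    """
    return -sum(B[0][a + 1] * ea for a, ea in enumerate(e_rest)) / B[0][0]


# Hypothetical symmetric positive definite weights and one known residual.
B = [[2.0, 0.5], [0.5, 1.0]]
e_rest = [0.4]

e11 = optimal_missing_residual(e_rest, B)
best = phi_first_experiment([e11] + e_rest, B)

# Moving e_{1,1} away from the stationary value can only increase the term.
worse_up = phi_first_experiment([e11 + 0.01] + e_rest, B)
worse_down = phi_first_experiment([e11 - 0.01] + e_rest, B)
```

Because the first-experiment term does not vanish at the stationary point, the experiment still contributes to the objective function, which is why $y_{1,1}$ is retained as a parameter rather than dropped.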
9-2. Inhomogeneous Covariance

Most of the estimation formulas were derived on the assumption that the covariance matrices $V_\mu$ $(\mu = 1, 2, \ldots, n)$ of the errors in the $\mu$th experiment were all equal to a fixed (though possibly unknown) matrix $V$. The modifications required when the $V_\mu$ differ from each other are trivial, provided the manner of the variation is known. If the $V_\mu$ vary in an entirely unknown manner, nothing much can be done; we simply cannot estimate a variance from a single observation. The following three cases can be treated easily:

(a) Suppose the following holds

\[ V_\mu = A_\mu V A_\mu^T \tag{9-2-1} \]

where $V$ is an $m \times m$ positive definite known or unknown matrix, and the $A_\mu$ are known $m \times m$ nonsingular matrices. This includes the case where the $V_\mu$ are known matrices, since then $V = I$ and there exist $A_\mu$ such that $V_\mu = A_\mu A_\mu^T$. In the single equation case, Eq. (1) amounts to

\[ \sigma_\mu^2 = a_\mu^2 \sigma^2 \tag{9-2-2} \]

where the $a_\mu$ are known constants. For a normal distribution, the objective function takes the form

\[ \Phi(\theta, V) = \sum_{\mu=1}^{n} \log\det A_\mu + (n/2)\log\det V + \tfrac{1}{2}\sum_{\mu=1}^{n} e_\mu^T (A_\mu^T)^{-1} V^{-1} A_\mu^{-1} e_\mu \tag{9-2-3} \]

We may drop the first term on the right-hand side, which is a constant. Let us redefine our model equations so that instead of $y_\mu = f(x_\mu, \theta)$ we write $A_\mu^{-1} y_\mu = A_\mu^{-1} f(x_\mu, \theta)$. We obtain new residuals

\[ \tilde{e}_\mu \equiv A_\mu^{-1} e_\mu = A_\mu^{-1} y_\mu - A_\mu^{-1} f(x_\mu, \theta) \tag{9-2-4} \]

The objective function now becomes

\[ \Phi(\theta, V) = (n/2)\log\det V + \tfrac{1}{2}\sum_{\mu=1}^{n} \tilde{e}_\mu^T V^{-1} \tilde{e}_\mu \tag{9-2-5} \]

which is identical to the expressions derived in Chapter IV.

Example: Suppose, in a single-equation model, the standard deviation is proportional to the magnitude of the measurement, that is

\[ \sigma_\mu = y_\mu \sigma, \qquad V_\mu = \sigma_\mu^2 = y_\mu^2 \sigma^2 \tag{9-2-6} \]

The redefined residuals are

\[ \tilde{e}_\mu = (1/y_\mu)[y_\mu - f(x_\mu, \theta)] = 1 - (1/y_\mu) f(x_\mu, \theta) \tag{9-2-7} \]
These have variance $\sigma^2$. If instead the errors are assumed proportional to the true values of the measured variables, we should use

\[ \tilde{e}_\mu = y_\mu / f(x_\mu, \theta) - 1 \tag{9-2-8} \]

Eq. (7) is easier to deal with, and the error committed in using it in place of Eq. (8) is likely to be small.

(b) Suppose we know that experiments $\mu = 1, 2, \ldots, n_1$ have the unknown covariance matrix $V_1$, experiments $\mu = n_1 + 1, n_1 + 2, \ldots, n_1 + n_2$ the matrix $V_2$, and so on. The objective function has the form

\[ \Phi(\theta, V_1, V_2, \ldots) = (n_1/2)\log\det V_1 + (n_2/2)\log\det V_2 + \cdots + \tfrac{1}{2}\mathrm{Tr}(V_1^{-1} M_1) + \tfrac{1}{2}\mathrm{Tr}(V_2^{-1} M_2) + \cdots \tag{9-2-9} \]

where $M_i$ is the moment matrix of the residuals in the experiments with covariance $V_i$. Proceeding as in Section 4-9, we differentiate with respect to $V_i$ and obtain eventually

\[ V_i = (1/n_i)\, M_i(\theta) \tag{9-2-10} \]

and the concentrated objective function becomes

\[ \Phi(\theta) = \tfrac{1}{2}\sum_i n_i \log\det M_i(\theta) \tag{9-2-11} \]

Should there be some experiments with known covariances, the objective function would be

\[ \Phi(\theta) = \tfrac{1}{2}\sum_i n_i \log\det M_i(\theta) + \tfrac{1}{2}{\sum_\mu}' e_\mu^T(\theta) V_\mu^{-1} e_\mu(\theta) \tag{9-2-12} \]

where the summation $\sum'$ is extended only over those experiments.

The minimum number of required experiments given in Section 4-12 now applies to each $n_i$ separately, since Eq. (11) cannot be used unless none of the $M_i$ are singular. Hence we must usually have

\[ n_i \ge \max(l + 1,\ m) \qquad (i = 1, 2, \ldots) \tag{9-2-13} \]

Another problem that can be solved by means of the maximum likelihood method is one in which the covariance matrix varies regularly as some function of the independent variables. These functions may depend on unknown parameters, which can be estimated. The case where both the model equation and the standard deviation of the errors are linear functions of the independent variables is treated by Rutemiller and Bowers (1968).

(c) Known Serial Correlations. Correlations between errors in different experiments are called serial correlations.
Suppose, in a single-equation model, the covariance matrix of all errors is given by $R$, where

\[ R_{\mu\eta} = E(\varepsilon_\mu \varepsilon_\eta) \tag{9-2-14} \]
The likelihood takes the form

\[ \log L(\theta) = -(n/2)\log 2\pi - \tfrac{1}{2}\log\det R - \tfrac{1}{2}\sum_{\mu,\eta=1}^{n} [R^{-1}]_{\mu\eta}\, e_\mu(\theta) e_\eta(\theta) \tag{9-2-15} \]

If $R$ is known, we can find a matrix $S$ such that $SS^T = R$. Defining a new set of residuals

\[ \tilde{e}_\eta(\theta) \equiv \sum_{\mu=1}^{n} [S^{-1}]_{\eta\mu}\, e_\mu(\theta) \tag{9-2-16} \]

we find that maximizing Eq. (15) is equivalent to minimizing the sum of squares of the $\tilde{e}_\eta(\theta)$.

The estimation of $R$ when the serial correlations are unknown is relatively difficult, and we shall not consider this problem here. We observe, however, that residuals almost always show serial correlations even when the errors possess none. For instance, in the case of the linear model Eq. (4-4-2) with $V = \sigma^2 I$, we find that the covariance matrix of the residuals is given by $V_r = \sigma^2[I - B(B^TB)^{-1}B^T]$. Thus while $V$ is diagonal (no serial correlations), $V_r$ is nondiagonal. Furthermore, $V_r$ is singular since clearly $V_r B = 0$.

9-3. Sequential Reestimation

Suppose a series of experiments is being conducted, and we wish to reestimate the parameters as the results of each experiment come in. Many of the objective functions that we have studied consist of a sum of terms, each containing the results of a single experiment. Examples are sums of squares, weighted sums of squares, and log-likelihood functions with known covariance matrices. In such a case, let us denote by $\phi_\mu(\theta)$ the term corresponding to the $\mu$th experiment, and by $\Phi^{(n)}(\theta)$ the objective function for $n$ experiments. Then

\[ \Phi^{(n)}(\theta) \equiv \sum_{\mu=1}^{n} \phi_\mu(\theta) \tag{9-3-1} \]

It follows that

\[ \Phi^{(n+1)}(\theta) = \Phi^{(n)}(\theta) + \phi_{n+1}(\theta) \tag{9-3-2} \]

If we had estimated the parameters after the $n$th experiment, we would have found $\theta^{(n)}$ which minimizes $\Phi^{(n)}$. We have also obtained the matrix $H^{(n)}$ which is an approximation to the Hessian of $\Phi^{(n)}$ at $\theta = \theta^{(n)}$. A Taylor series approximation to $\Phi^{(n)}$ in the neighborhood of $\theta^{(n)}$ is given by

\[ \Phi^{(n)}(\theta) \approx \Phi^{(n)}(\theta^{(n)}) + \tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)}) \tag{9-3-3} \]
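Eqs. (9-3-1) to (9-3-3) can be illustrated with a one-parameter sum of squares. The sketch below assumes the hypothetical linear model $f(x, \theta) = \theta x$ and made-up data; for such a model the objective is exactly quadratic, so the Taylor approximation of Eq. (9-3-3) reproduces $\Phi^{(n)}$ up to rounding. All names and numbers are illustrative assumptions.

```python
# Hypothetical data for the assumed model f(x, theta) = theta * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]


def phi_term(theta, x, y):
    """Contribution of a single experiment: squared residual."""
    return (y - theta * x) ** 2


def Phi(theta, n):
    """Objective for the first n experiments, Eq. (9-3-1)."""
    return sum(phi_term(theta, x, y) for x, y in zip(xs[:n], ys[:n]))


n = 3
# Closed-form least squares estimate after n experiments (linear model only).
theta_n = sum(x * y for x, y in zip(xs[:n], ys[:n])) / sum(x * x for x in xs[:n])
# Hessian of the sum of squares for the linear model (a 1 x 1 "matrix").
H_n = 2.0 * sum(x * x for x in xs[:n])


def Phi_quadratic(theta):
    """Right-hand side of Eq. (9-3-3)."""
    return Phi(theta_n, n) + 0.5 * H_n * (theta - theta_n) ** 2


trial = 1.3
taylor_error = abs(Phi(trial, n) - Phi_quadratic(trial))
additivity_error = abs(
    Phi(trial, n + 1) - (Phi(trial, n) + phi_term(trial, xs[n], ys[n]))
)
```

For a nonlinear model the same construction holds only approximately, and only near $\theta^{(n)}$.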
When the results of the $(n+1)$st experiment are available, we wish to find $\theta^{(n+1)}$ which minimizes $\Phi^{(n+1)}$. It is reasonable to expect that $\theta^{(n+1)}$ will not differ very much from $\theta^{(n)}$, so that the approximation Eq. (3) is valid, and may be substituted in Eq. (2). Instead of minimizing $\Phi^{(n+1)}$ we may then minimize

\[ \tilde{\Phi}^{(n+1)}(\theta) \equiv \tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)}) + \phi_{n+1}(\theta) \tag{9-3-4} \]

The function $\tilde{\Phi}^{(n+1)}$ is much simpler and easier to calculate than $\Phi^{(n+1)}$, so that a great deal of computer time may be saved by this substitution. Of course, if it turns out that $\tilde{\Phi}^{(n+1)}$ is minimized at a point so far removed from $\theta^{(n)}$ that Eq. (3) cannot be accepted, then we may have to revert to $\Phi^{(n+1)}$. We use $\theta^{(n)}$ as the initial guess for the minimization of $\tilde{\Phi}^{(n+1)}$, and the result of that minimization as the initial guess for minimizing $\Phi^{(n+1)}$ if required. A single iteration may suffice for the latter.

If $\phi_\mu(\theta)$ is the negative log-likelihood for the $\mu$th experiment, then Eq. (4) may be regarded as the negative logarithm of a posterior density function, in which $\tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)})$ plays the part of the negative log prior density. It corresponds to a normal distribution of the parameters $\theta$ with mean $\theta^{(n)}$ and covariance matrix $(H^{(n)})^{-1}$. This accords well with the fact (established in Chapter VII) that $(H^{(n)})^{-1}$ is an approximation to the covariance matrix of the estimate $\theta^{(n)}$, and that $\tfrac{1}{2}(\theta - \theta^{(n)})^T H^{(n)} (\theta - \theta^{(n)})$ is approximately (apart from an additive constant) the negative logarithm of the posterior distribution after $n$ experiments. Minimizing $\tilde{\Phi}^{(n+1)}$ corresponds to finding the mode of the posterior distribution after $n + 1$ experiments, where the posterior distribution after $n$ experiments is taken as the prior distribution.

These results may be extended to the case where the parameters are to be reestimated only after $\nu$ additional experiments have been completed. Here
" \' \' ifj(II+V)(O) = (Jillt(O) + I 4>n+I'(O) ;:::: 1(0 - o(nYH(II)(O - Oln») + I CPII+P(O) p=t p=1 (9-3-5) Sequential estimation procedures are of particular interest when the com- puter designs. controls, and analyzes the results of experiments on line (see Chapter X), A numerical illustration appears in Section 9-8. ,9-4. Computational Aspects Let us consider a single new observation in the single equation least squares case. Here cp(III(O) ;:::: CP{II)(O{II») + (0 - O(II»)T AII(O _ 0(11») (9-4- I) 
where

\[ A_n \equiv \sum_{\mu=1}^{n} b_\mu b_\mu^T, \qquad b_\mu \equiv \partial f_\mu / \partial\theta \tag{9-4-2} \]

At the same time

\[ \phi_{n+1}(\theta) = [y_{n+1} - f(x_{n+1}, \theta)]^2 = e_{n+1}^2(\theta) \tag{9-4-3} \]

Now, for $\theta$ close to $\theta^{(n)}$, we have approximately

\[ f(x_{n+1}, \theta) = f(x_{n+1}, \theta^{(n)}) + b_{n+1}^T (\theta - \theta^{(n)}) \]

so that the Gauss approximation to Eq. (3) is

\[ \phi_{n+1}(\theta) \approx e_{n+1}^2(\theta^{(n)}) - 2e_{n+1}(\theta^{(n)})\, b_{n+1}^T (\theta - \theta^{(n)}) + [(\theta - \theta^{(n)})^T b_{n+1}]^2 \tag{9-4-4} \]

and Eq. (1) becomes, after dropping constant terms

\[ \Phi^{(n+1)}(\theta) \approx (\theta - \theta^{(n)})^T A_{n+1} (\theta - \theta^{(n)}) - 2e_{n+1}(\theta^{(n)})\, b_{n+1}^T (\theta - \theta^{(n)}) \tag{9-4-5} \]

where

\[ A_{n+1} = A_n + b_{n+1} b_{n+1}^T \tag{9-4-6} \]

The minimum of Eq. (5) is easily seen to occur at

\[ \theta^{(n+1)} \equiv \theta^{(n)} + A_{n+1}^{-1} b_{n+1}\, e_{n+1}(\theta^{(n)}) \tag{9-4-7} \]

Having computed $\theta^{(n)}$ and $A_n$ after $n$ experiments, the updating procedure after the $(n+1)$st experiment may be summarized as follows:

1. Compute $e_{n+1}(\theta^{(n)}) = y_{n+1} - f(x_{n+1}, \theta^{(n)})$.
2. Compute $b_{n+1} = [\partial f(x_{n+1}, \theta)/\partial\theta]_{\theta = \theta^{(n)}}$.
3. Compute $A_{n+1}$ (Eq. 6).
4. Compute $A_{n+1}^{-1}$ (see below).
5. Use Eq. (7) to estimate $\theta^{(n+1)}$.

Step 4 requires elaboration. One does not wish to invert an $l \times l$ matrix at each step. Fortunately, this is not necessary. Suppose that $A_n^{-1}$ has already been computed. Define:

\[ a_{n+1} = A_n^{-1} b_{n+1} \tag{9-4-8} \]

\[ \beta_{n+1} = b_{n+1}^T a_{n+1} \tag{9-4-9} \]

Then, as may be verified by multiplying Eq. (6) by Eq. (10)

\[ A_{n+1}^{-1} = A_n^{-1} - a_{n+1} a_{n+1}^T / (1 + \beta_{n+1}) \tag{9-4-10} \]
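The five-step procedure, with step 4 carried out through Eqs. (9-4-8) to (9-4-10), can be sketched as follows. This is an illustration under stated assumptions rather than the text's own program: the model $f(x, \theta) = \theta_1 + \theta_2 x$, the data, and all function names are hypothetical, and plain Python lists stand in for matrices.

```python
def mat_vec(A, v):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def update_inverse(A_inv, b):
    """Inverse of (A + b b^T) from the inverse of A, Eq. (9-4-10)."""
    a = mat_vec(A_inv, b)                          # Eq. (9-4-8)
    beta = sum(bi * ai for bi, ai in zip(b, a))    # Eq. (9-4-9)
    size = len(b)
    return [[A_inv[i][j] - a[i] * a[j] / (1.0 + beta) for j in range(size)]
            for i in range(size)]


def sequential_step(theta, A_inv, x_new, y_new, f, grad_f):
    """One reestimation step following the five-step summary."""
    e = y_new - f(x_new, theta)                    # step 1
    b = grad_f(x_new, theta)                       # step 2
    A_inv_new = update_inverse(A_inv, b)           # steps 3 and 4 combined
    correction = mat_vec(A_inv_new, b)
    theta_new = [t + c * e for t, c in zip(theta, correction)]  # Eq. (9-4-7)
    return theta_new, A_inv_new


# Hypothetical linear model and its gradient with respect to theta.
def f(x, theta):
    return theta[0] + theta[1] * x


def grad_f(x, theta):
    return [1.0, x]


# State after two observations (x, y) = (0, 1) and (1, 3):
# theta = [1, 2] fits them exactly, A = [[2, 1], [1, 1]].
theta_n = [1.0, 2.0]
A_inv_n = [[1.0, -1.0], [-1.0, 2.0]]

# A third observation (2, 6) arrives.
theta_new, A_inv_new = sequential_step(theta_n, A_inv_n, 2.0, 6.0, f, grad_f)
```

Since the assumed model is linear in $\theta$, the Gauss approximation is exact here and `theta_new` coincides with the ordinary least squares fit to all three points; for a nonlinear model the step is only approximate.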
Thus $A_{n+1}^{-1}$ can be calculated without explicit reinversion. Somewhat more complicated inverse updating formulas are given by Powell (1969); these help reduce the accumulation of rounding errors.

9-5. Stochastic Approximation

When reestimations have to be performed at a very rapid pace, even the formulas of the preceding section may be too cumbersome. It is the aim of stochastic approximation methods to introduce further simplifications. Equation (9-4-7) is a special case of the general stochastic approximation formula

\[ \theta^{(n+1)} = \theta^{(n)} + c_n\, e_{n+1}(\theta^{(n)}) \tag{9-5-1} \]

where $c_n$ is some suitably chosen vector. This formula states that the correction to be applied to $\theta^{(n)}$ is proportional to $e_{n+1}(\theta^{(n)})$, i.e., to the error incurred in predicting $y_{n+1}$ given $x_{n+1}$ and $\theta^{(n)}$. In Eq. (9-4-7) we used

\[ c_n = A_{n+1}^{-1} b_{n+1} \]

Sometimes it is preferable to multiply this value by some positive constant less than one. The following variations represent progressive simplifications:

1. The procedure of Section 9-4 can only be started after enough observations have been accumulated to make $A_n$ nonsingular (at least $n = l$). Instead, we can start with an arbitrarily chosen positive definite $A_0$.

2. Use

\[ c_n = b_{n+1} \Big/ \sum_{\mu=1}^{n+1} b_\mu^T b_\mu \tag{9-5-2} \]

These methods, and some others, are discussed in detail by Albert and Gardner (1967).

9-6. A Missing Data Problem

We return to case (a) of the two equation maximum likelihood problem of Section 5-23. We assume, however, that measurements on $z_{1,1}$ and $z_{2,3}$ are missing. It is clear that relevant data still remain in data points $\mu = 1$ and 2, and these should not be discarded. Instead, we treat $z_{1,1}$ as an unknown parameter $\theta_6$, and $z_{2,3}$ as an unknown parameter $\theta_7$. The first equation of Eq. (5-23-6) now takes the form

\[ 0 = -\theta_6 + f_1(x_\mu, \theta) \]
for $\mu = 1$ only, and Eq. (5-23-7) becomes

\[ 0 = \theta_1 + \theta_2 x_{\mu 1} + \theta_3 \log\theta_7 + \theta_4 \log\{\theta_5 + \exp[x_{\mu 3}/(1 + \theta_4)]\} \]

for $\mu = 2$ only. The model equations for $\mu = 3, 4, \ldots, 41$ remain unchanged. The matrices $B_\mu = \partial f/\partial\theta$ have zero sixth and seventh columns for $\mu = 3, 4, \ldots, 41$. For $\mu = 1$ the sixth row is $[-1, 0]$ and the seventh row is zero; for $\mu = 2$, the sixth row is zero and the seventh row is $[(\theta_3/\theta_7)\,l_{2,1},\ (\theta_3/\theta_7)\,l_{2,2}]$.

As initial guesses for $\theta_6$ and $\theta_7$ we take 1.3 and 0.4 respectively, values which are reasonable in view of Table 5-6. The estimate, obtained by means of the Marquardt method, is

\[ \theta^* = [-0.00551000,\ -0.0115201,\ 0.789967,\ 0.939189,\ 0.835258,\ 1.51783,\ 0.417195]^T \]

The estimate for $\theta_6$, i.e., $z_{1,1}$, differs considerably from the known value 1.33135. However, this value leads to a residual of $-0.22944$ (Table 7-3), whereas the present estimate has a residual of only $-0.0479336$. The estimate $\theta_7$ is quite close to the true value 0.4084 of $z_{2,3}$. The parameters $\theta_1, \theta_2, \ldots, \theta_5$ can be converted into $c$ as usual

\[ c^* = [0.5397926,\ 0.006333321,\ 1.265875,\ 1.064749,\ 0.5918664]^T \]

The covariance of $\theta^*$ can be obtained as usual; the marginal covariance of $\theta_1^*, \theta_2^*, \ldots, \theta_5^*$ consists of the $5 \times 5$ upper left hand corner of the $7 \times 7$ matrix $V_\theta$, and $V_c$ can be computed from it as usual.

We wish to see how much information has been lost due to our ignorance of $z_{1,1}$ and $z_{2,3}$, and also how much more would be lost if we dropped the first two observations completely. For this purpose we also obtained the estimate based on observations $\mu = 3, 4, \ldots, 41$. We computed $\det V_c$ for all three cases; this quantity is proportional to the square of the volume of any confidence ellipsoid, and is also a measure of the uncertainty in the sampling distribution (see Section 10-2). The results are given in Table 9-1. We see that the more data are lost, the more uncertain are our estimates.
Table 9-1
Comparison of Information in Data

Case                                                        $\det V_c$
41 full observations                                        $0.246678 \times 10^{-20}$
41 observations, $z_{1,1}$ and $z_{2,3}$ missing            $0.291663 \times 10^{-20}$
39 observations                                             $0.384105 \times 10^{-20}$
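Since $\det V_c$ is proportional to the square of the confidence ellipsoid volume, the entries of Table 9-1 translate directly into relative volumes. The short sketch below does this arithmetic; the dictionary keys and function names are illustrative choices, and the numbers are those transcribed from the table.

```python
import math

# det V_c values from Table 9-1.
det_vc = {
    "full": 0.246678e-20,
    "two_missing": 0.291663e-20,
    "two_dropped": 0.384105e-20,
}


def volume_ratio(det_case, det_reference):
    """Ratio of confidence ellipsoid volumes, using volume ~ sqrt(det V_c)."""
    return math.sqrt(det_case / det_reference)


def log_det_increase(det_case, det_reference):
    """Increase in log det V_c relative to the reference case."""
    return math.log(det_case) - math.log(det_reference)


ratio_missing = volume_ratio(det_vc["two_missing"], det_vc["full"])
ratio_dropped = volume_ratio(det_vc["two_dropped"], det_vc["full"])
increase_missing = log_det_increase(det_vc["two_missing"], det_vc["full"])
increase_dropped = log_det_increase(det_vc["two_dropped"], det_vc["full"])
```

On these figures the ellipsoid volume grows by roughly 9% when the two items are treated as unknown parameters and by roughly 25% when the two observations are discarded altogether.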
9-7. Further Problem with Missing Data

Galambos and Cornell (1962) supply data on the observed proportions $y_{\mu 1}$ and $y_{\mu 2}$ of radioactive tracer in two human body compartments at times $x_\mu$ after injection. These data are presented in Table 9-2. The value $y_{1,1}$ is missing. The model equations are

\[ y_1 = \theta_1 \exp(-\theta_2 x) + (1 - \theta_1)\exp(-\theta_3 x) \tag{9-7-1} \]

\[ y_2 = 1 - \frac{\theta_1\theta_3}{\theta_1(\theta_3 - \theta_2) + \theta_2}\exp(-\theta_2 x) + \left[\frac{\theta_1\theta_3}{\theta_1(\theta_3 - \theta_2) + \theta_2} - 1\right]\exp(-\theta_3 x) \tag{9-7-2} \]

Table 9-2
Data for Radioactive Tracer Problem

                         Proportion radioactive tracer
$\mu$   Time $x_\mu$ (hr)   Compartment 1, $y_{\mu 1}$   Compartment 2, $y_{\mu 2}$
1       0.33                missing                      0.03
2       2                   0.84                         0.10
3       3                   0.79                         0.14
4       5                   0.64                         0.21
5       8                   0.55                         0.30
6       12                  0.44                         0.40
7       24                  0.27                         0.54
8       46                  0.12                         0.66
9       72                  0.06                         0.71

Beauchamp and Cornell (1966) used the following method to estimate $\theta$: First, least squares estimates were obtained for $\theta$ using the $y_1$ data alone, giving

\[ \theta^{(1)} = \begin{bmatrix} 0.555524 \pm 0.072741 \\ 0.0314238 \pm 0.0038325 \\ 0.171109 \pm 0.027847 \end{bmatrix} \]

Corresponding to this estimate is a minimum sum of squares of residuals of 0.0009510273, a variance of $0.0009510273/(8 - 3) = 0.0001902054$, and a standard deviation $\sigma_1 = 0.01379$. The same procedure was applied to the $y_2$ data alone, yielding

\[ \theta^{(2)} = \begin{bmatrix} 0.0606528 \pm 0.0124113 \\ 0.00680764 \pm 0.00107404 \\ 0.09316 \pm 0.00593128 \end{bmatrix} \]
with residuals having a variance of $0.0002860781/(9 - 3) = 0.0000476797$ and standard deviation $\sigma_2 = 0.006905$. The residuals of each equation are close to the rounding error of the data. The two estimates of $\theta$, however, are so far apart (measured by the scale of their standard deviations) as to cast doubt on the hypothesis that the same parameters apply to both equations.

Nevertheless, let us proceed with the joint fitting of the two equations. Beauchamp and Cornell compute the residuals from the two separate fits, and form their covariance matrix (neglecting the first observation on $y_2$) without compensating for degrees of freedom. They quote the matrix as being

\[ V = \begin{bmatrix} 0.1189 & 0.009753 \\ 0.009753 & 0.03179 \end{bmatrix} \times 10^{-3} \tag{9-7-3} \]

They now use the inverse of this matrix as a weight in the objective function

\[ \sum_{\mu=2}^{9} e_\mu^T(\theta)\, V^{-1} e_\mu(\theta) \]

The minimum occurs at

\[ \theta^* = [0.06751,\ 0.00706,\ 0.08393]^T \]

We shall now proceed to calculate an estimate based on the method used in the preceding section. We let $\theta_4$ denote the missing value of $y_{1,1}$, and we assume $V$ unknown. Our objective function then is

\[ \Phi(\theta) = (9/2)\log\det \sum_{\mu=1}^{9} e_\mu(\theta)\, e_\mu^T(\theta) \]

where

\[ e_{1,1}(\theta) = \theta_4 - \theta_1\exp(-\theta_2 x_1) - (1 - \theta_1)\exp(-\theta_3 x_1) \]

and all other residuals are defined as usual. Using Beauchamp and Cornell's initial guess (with a value for $\theta_4$ appended)

\[ \theta_1 = (0.381,\ 0.21,\ 0.197,\ 1)^T \]

the Gauss method (with nonnegativity constraints, using penalty functions) converged to

\[ \theta^* = [0.0782549,\ 0.00792904,\ 0.0975048,\ 1.04852]^T \tag{9-7-4} \]

This result is unacceptable since it requires $y_{1,1} = 1.04852$, but no value of $y$
can exceed one. In fact, we must have $y_{\mu 1} + y_{\mu 2} \le 1$; therefore, since $y_{1,2} = 0.03$, we impose the additional constraint $\theta_4 \le 0.97$.† The result this time is

\[ \theta^* = \begin{bmatrix} 0.0358913 \pm 0.0073400 \\ 0.00463630 \pm 0.0007897 \\ 0.0812979 \pm 0.0037664 \\ 0.910611 \pm 0.013837 \end{bmatrix} \tag{9-7-5} \]

Note that this is an interior minimum ($\theta_4$ is below its constraint). Curiously, the objective function attains a lower value at Eq. (5) than at Eq. (4), indicating that the latter is only a local minimum, and that Eq. (5) is the proper estimate even in the absence of the constraint on $\theta_4$. Our estimate is quite well determined, and significantly different from Beauchamp and Cornell's estimate. This may be accounted for by the fact that the estimated covariance of residuals corresponding to Eq. (5)

\[ V = \begin{bmatrix} 3.42423 & -0.565601 \\ -0.565601 & 0.118719 \end{bmatrix} \times 10^{-3} \]

is very different from, and much larger than, Eq. (3). In other words, the combined fit attainable for both equations is much worse than the fit obtainable for each equation separately. The residuals found in fitting the individual equations are very poor measures for the errors in the simultaneous fit. If, however, the residuals from the joint fit, which have standard deviations of $\sigma_1 = 0.0582$ and $\sigma_2 = 0.01090$, are still considered not in excess of experimental error, then we have no compelling reason for rejecting the joint model even though the separate models give much better fits.

† Since 0.03 is not an exact value, we really should have used the constraint

\[ \theta_4 + 1 - \frac{\theta_1\theta_3}{\theta_1(\theta_3 - \theta_2) + \theta_2}\exp(-\theta_2 x_1) + \left[\frac{\theta_1\theta_3}{\theta_1(\theta_3 - \theta_2) + \theta_2} - 1\right]\exp(-\theta_3 x_1) \le 1 \]

but in practice, the simpler constraint suffices.

9-8. A Sequential Reestimation Problem

In Section 5-21 we solved a single equation least squares problem. On the basis of fifteen observations we found:

\[ \Phi^* = 0.03980599, \qquad \theta^* = \begin{bmatrix} 813.4583 \\ 960.9063 \end{bmatrix}, \qquad N^* = \begin{bmatrix} 0.271890 \times 10^{-5} & -0.957336 \times 10^{-5} \\ -0.957336 \times 10^{-5} & 3.50371 \times 10^{-5} \end{bmatrix} \]
Therefore, the objective function Eq. (5-21-6) has the approximate representation Eq. (7-21-1). Suppose four additional observations were made. Our new objective function could be written approximately as

\[ \Phi^{(19)}(\theta) = \Phi^{(15)}(\theta) + \sum_{\mu=16}^{19} e_\mu^2(\theta) \approx 0.03980599 + (10^{-5}/2)\left[0.271890(\theta_1 - 813.4583)^2 - 1.914672(\theta_1 - 813.4583)(\theta_2 - 960.9063) + 3.50371(\theta_2 - 960.9063)^2\right] + \sum_{\mu=16}^{19} [y_\mu - f(x_\mu, \theta)]^2 \]

The data for the four new observations are given in Table 9-3.

Table 9-3
Additional Good Data

$\mu$   $x_{\mu 1}$   $x_{\mu 2}$   $y_\mu$
16      0.1           150           0.851
17      0.1           250           0.176
18      0.2           150           0.825
19      0.2           250           0.011

Starting with $[813.4583,\ 960.9063]$ as the initial guess, a single Gauss iteration takes us to

\[ \theta = \begin{bmatrix} 891.4626 \\ 984.3818 \end{bmatrix} \]

A total of three iterations bring us to the minimum at

\[ \theta = \begin{bmatrix} 895.2656 \\ 985.1655 \end{bmatrix} \]

On the other hand, the true minimum of $\Phi^{(19)}(\theta) = \sum_{\mu=1}^{19} e_\mu^2(\theta)$ occurs at

\[ \theta = \begin{bmatrix} 892.9341 \pm 213.702 \\ 983.7429 \pm 53.0124 \end{bmatrix} \]

We see then that the single Gauss iteration on the approximate objective function produced very acceptable results.

The new estimate is very close to the old one because the data of Table 9-3 were generated by the same model as the previous data. The data of Table 9-4, however, came from a different model. Nevertheless, when these
data are used in place of those of Table 9-3, we find after one Gauss iteration on the approximate objective function

\[ \theta = \begin{bmatrix} 462.8711 \\ 863.4094 \end{bmatrix} \]

The minimum of the approximate function is found in six iterations at

\[ \theta = \begin{bmatrix} 448.9961 \\ 852.5732 \end{bmatrix} \]

And the minimum of the exact function is at

\[ \theta = \begin{bmatrix} 484.7656 \pm 135.943 \\ 841.6172 \pm 61.2455 \end{bmatrix} \]

Even here, where the new estimate is very far from the old, we obtain an acceptable result in the single iteration.

Table 9-4
Additional Poor Data

$\mu$   $x_{\mu 1}$   $x_{\mu 2}$   $y_\mu$
16      0.1           150           0.760
17      0.1           250           0.300
18      0.2           150           0.608
19      0.2           250           0.095

9-9. Problems

Show that Eq. (9-4-7) can be generalized for a multiple equation model as follows

\[ \theta^{(n+1)} = \theta^{(n)} + A_{n+1}^{-1} B_{n+1}^T V^{-1} e_{n+1}(\theta^{(n)}) \]

where

\[ B_{n+1} = \partial f_{n+1}/\partial\theta, \qquad A_n = \sum_{\mu=1}^{n} B_\mu^T V^{-1} B_\mu \]

Show that

\[ A_{n+1}^{-1} = A_n^{-1} - A_n^{-1} C_{n+1}\left(I + C_{n+1}^T A_n^{-1} C_{n+1}\right)^{-1} C_{n+1}^T A_n^{-1} \]

where $C_\mu = B_\mu^T S$, and $S$ is a matrix such that $SS^T = V^{-1}$, e.g., the Cholesky decomposition of $V^{-1}$.
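Before attempting a proof of the second identity, it may help to confirm it numerically. The sketch below is only a sanity check, not a derivation; it assumes $l = m = 2$, and the matrices `A_n` and `C` are hypothetical values standing in for $A_n$ and $C_{n+1}$.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inv2(A):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def add(A, B, sign=1.0):
    """Elementwise A + sign * B."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


# Hypothetical values: A_n symmetric positive definite, C of size l x m.
A_n = [[4.0, 1.0], [1.0, 3.0]]
C = [[1.0, 0.5], [2.0, -1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

A_inv = inv2(A_n)
C_t = transpose(C)

# Right-hand side of the identity: only an m x m matrix is inverted.
inner = inv2(add(identity, matmul(matmul(C_t, A_inv), C)))
correction = matmul(matmul(matmul(matmul(A_inv, C), inner), C_t), A_inv)
updated = add(A_inv, correction, sign=-1.0)

# Left-hand side: direct inversion of A_{n+1} = A_n + C C^T.
direct = inv2(add(A_n, matmul(C, C_t)))

max_diff = max(abs(u - d) for ru, rd in zip(updated, direct)
               for u, d in zip(ru, rd))
```

The practical appeal of the identity is that the matrix inverted at each step is $m \times m$ rather than $l \times l$.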
Chapter X
Design of Experiments

10-1. Introduction

Parameters are estimated on the basis of data obtained in experiments. It is natural to ask whether we can plan experiments so as to facilitate the task of estimating the parameters. The answer is generally in the affirmative, and this chapter is devoted to the study of suitable experimental strategies.

For our present purposes, we define an experiment as the act of observing the value of certain dependent variables $y_\mu$ at given values of the independent variables $x_\mu$. We design an experiment by choosing in some rational way the values of $x$ at which $y$ is to be observed. We shall use the phrase "the experiment $x$" to denote the experiment whose independent variables take the value of $x$. The values of the independent variables are referred to as the experimental conditions.

The design of the experiments and the estimation of the parameters form but two stages in a scientific investigation. What constitutes a "rational way" of choosing experimental conditions can be decided only on the basis of the overall aims of the investigation. A somewhat idealized scheme of a typical investigation is depicted in Fig. 10-1. In practice, investigators rarely adopt such a scheme explicitly, but nevertheless they adhere to it in a loose informal way.

This book is concerned with investigations in which parameter estimation plays a crucial role, forming the contents of box 3 in Fig. 10-1. Such investigations are naturally concerned with the development of mathematical models to represent physical situations. To devise a formal scheme for producing such a model in a general situation is, as yet, beyond our capabilities. Therefore, we place somewhat more modest goals into box 1 of our scheme. Typically, the goal may be one of the following:

(a) The estimation of the parameters in a given model to a specified degree of precision. For instance, we may wish to estimate the kinematic viscosity of a liquid, using Eq.
(2-1-1) as the model.
[Fig. 10-1. A scheme for scientific investigations. Flowchart boxes: 1. Define goal of investigation. 2. Collect prior available relevant data and information. 3. Analyze data available to date. 4. Has the goal of the investigation been met? 5. Is there a reasonable chance of attaining the goal with available resources? 6. Design the next experiment or series of experiments. 7. Perform the specified experiment(s). Boxes 4 and 5 each have Yes and No branches.]
(b) The prediction of the values of certain variables which depend on some unknown parameters. For instance, we may wish to predict the power required to pump the liquid at a specified rate through a given pipe. To do this, we need to determine the liquid's viscosity.

(c) The selection of which one of several proposed models best accords with reality. Returning to the liquid and its viscosity, we may wish to determine whether the liquid is Newtonian (viscosity constant) or non-Newtonian (viscosity depending on shear rate, past history, or other factors).

(d) The determination of a course of action in a situation where the optimal action depends on what the correct model is and what the values of the parameters are. For instance, the proper design of a structure depends on the tensile strength of the materials used; the proper design of a chemical reactor depends on whether or not the reaction can be catalyzed; and the inventory required in a stockroom depends on the predicted demand, which in turn depends on the values of parameters appearing in an econometric model.

The method of selecting the experiments to be performed must be tailored to the goal of the investigation. A simple example suffices to illustrate this point: Suppose we propose the model $y = \theta_1 + \theta_2 x$. For physical reasons, measurements are restricted to the range $-1 \le x \le 1$. It is intuitively obvious (and we shall later derive this fact rigorously) that best estimates for $\theta_1$ and $\theta_2$ will be obtained if all experiments are performed at the two extreme points of the range, $x = -1$ and $x = 1$. On the other hand, if our main concern is to prove that the model is as given and not, say, $y = \theta_1 + \theta_2 x + \theta_3 x^2$, it becomes imperative to perform experiments with at least three distinct values of $x$. In fact, the best three experiments are at $x = -1$, 0, and 1. It is meaningful, then, to ask "what is the best experiment for the attainment of our goal?"
but not simply "what is the best experiment?"

The classical methods of experimental design were devised by Fisher (1935), Davies and coworkers (1954), and others to satisfy goals different from those we are concerned with here. They referred to agricultural or industrial situations where no a priori mathematical models were available. Generally, one designed in advance a large number of experiments to be performed simultaneously (a necessary condition in agriculture, where an experiment takes months). In the scientific laboratory, on the other hand, an experiment usually takes only a short time, but requires expensive apparatus of which not many specimens are available. Experiments are perforce carried out in sequence, one (or at most several) at a time.

Wald (1947) has demonstrated that when experiments are carried out in sequence a smaller number of them is required, on the average, than when they are performed simultaneously. This is true even where no use is made of
information gained in one experiment for planning the next one; the gain in this case accrues entirely from the ability to terminate the experimentation precisely at the point at which one's goal has been met. If, in addition, one is able to design each experiment in the light of the results of the previous ones, the gain in efficiency can be even more impressive. Informally, this strategy is adopted by most chemists, physicists, and other experimental scientists. What we are seeking here is the formalization of well-established intuitive procedures. The major contributions to the attainment of this goal are those of G.E.P. Box and his coworkers, starting with Box and Lucas (1959). Many more of their papers will be cited in the sequel.

10-2. Information and Uncertainty

It is the purpose of an experiment to gain relevant information. The best experiment is the one that is most informative. It is only natural that we should turn to information theory in our quest for a quantitative criterion for selecting the experiments to be performed.

Suppose $\xi$ is a random vector. From the probability distribution of $\xi$ we can gain a picture of the uncertainty associated with $\xi$; the more disperse the distribution of $\xi$, the more uncertain is the value any specific realization of $\xi$ will assume. These intuitive notions of uncertainty have been formalized by Shannon (1948), who showed that the unique (except for a positive multiplicative factor) suitable measure of uncertainty associated with the probability density function $p(\xi)$ is given by

\[ H(p) \equiv -E(\log p) = -\int p(\xi)\log p(\xi)\, d\xi \tag{10-2-1} \]

We gain information by reducing the uncertainty. Suppose $p_0(\xi)$ and $p^*(\xi)$ are, respectively, the prior density of $\xi$, and the posterior density after an experiment has been performed.
According to Lindley (1956), the amount of information $J$ that is gained by the experiment equals the reduction in uncertainty from the prior to the posterior distributions

$$J = H(p_0) - H(p^*) \tag{10-2-2}$$

Our aim is to find that experiment which maximizes $J$. Since $H(p_0)$ is unaffected by the experiment, we may equally well look for the experiment that minimizes $H(p^*)$.

When $\zeta$ is the vector of unknown parameters $\theta$, $p_0$ and $p^*$ may be the prior and posterior densities in the usual Bayesian sense. If we wish to eschew this interpretation, we may take $p_0$ and $p^*$ to be the estimated sampling
distribution densities before and after the experiment is conducted.† When the normal approximations are adopted, the two interpretations yield identical results.

We shall need to evaluate $H(p)$ for the multivariate normal distribution. Let $p(\zeta) = N_n(a, V)$. We have

$$\begin{aligned}
H(p) = -E(\log p) &= -E\left[-(n/2)\log 2\pi - \tfrac{1}{2}\log\det V - \tfrac{1}{2}(\zeta - a)^T V^{-1}(\zeta - a)\right] \\
&= (n/2)\log 2\pi + \tfrac{1}{2}\log\det V + \tfrac{1}{2}\operatorname{Tr}\left[V^{-1}E(\zeta - a)(\zeta - a)^T\right] \\
&= (n/2)\log 2\pi + \tfrac{1}{2}\log\det V + \tfrac{1}{2}\operatorname{Tr} V^{-1}V \\
&= (n/2)(1 + \log 2\pi) + \tfrac{1}{2}\log\det V
\end{aligned} \tag{10-2-3}$$

Discarding irrelevant constants, we can say that

$$H^*(p) \equiv \log\det V \tag{10-2-4}$$

is a measure of the uncertainty in the distribution $N_n(a, V)$. We have remarked previously (Section 7-10) that for a normal distribution $\det^{1/2} V$ is proportional to the volume of a confidence region in $\zeta$ space. Eq. (4) tells us that the uncertainty increases linearly with the logarithm of the volume of the confidence region. An experiment that seeks to minimize uncertainty also seeks to shrink the volume of the confidence region as much as possible.

10-3. Design Criterion for Parameter Estimation

Suppose our current state of knowledge concerning the value of the parameters $\theta$ may be summarized in a normal prior distribution $N_l(\theta_0, V_0)$. Typically, this is the posterior (or sampling) distribution relative to experiments already performed. We are contemplating performing $n$ additional experiments, in which $y_\mu$ ($\mu = 1, 2, \ldots, n$) are to be measured. Our task is to determine the values of $x_\mu$ ($\mu = 1, 2, \ldots, n$) at which these measurements are to be taken. We assume that the errors $y_\mu - f(x_\mu, \theta)$ are distributed as $N_m(0, V)$.

After these experiments are performed, we shall be able to construct a new posterior distribution. Let the normal approximation to that distribution be $N_l(\tilde{\theta}, \tilde{V})$. The vector $\tilde{\theta}$ will be obtained as the mode of the posterior density; the posterior covariance $\tilde{V}$ is given by Eq.
(7-12-5)

$$\tilde{V} = \left[\sum_{\mu=1}^{n} B_\mu^T V^{-1} B_\mu + V_0^{-1}\right]^{-1} \tag{10-3-1}$$

† We assume that several experiments have already been performed, and $p_0$ estimated from the results.
where, as usual, $B_\mu \equiv \partial f_\mu/\partial\theta$ evaluated at $x = x_\mu$, $\theta = \tilde{\theta}$. Since the experiments have not been performed yet, we cannot tell what value $\tilde{\theta}$ will take; but in trying to calculate $\tilde{V}$ we can use $\theta_0$ in place of $\tilde{\theta}$ when evaluating $\partial f_\mu/\partial\theta$. Given any proposed set of experimental conditions $x_1, x_2, \ldots, x_n$ we are thus able, by using Eq. (1), to estimate what the value of $\tilde{V}$ will be after the proposed experiments are conducted. This is the same thing as saying that the estimated $\tilde{V}$ is a function of $x_1, x_2, \ldots, x_n$. To maximize the amount of information gained by the experiments, we wish to select the $x_\mu$ ($\mu = 1, 2, \ldots, n$) in such a way that the uncertainty is minimized, i.e., so that

$$R = \log\det\tilde{V} \tag{10-3-2}$$

is minimized. This also minimizes the volume of the confidence region for the parameters. Clearly, minimizing $\log\det\tilde{V}$ is equivalent to minimizing $\det\tilde{V}$, or maximizing $\det(\tilde{V})^{-1}$.

Let us reintroduce the notation

$$B = \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix}, \qquad \Pi = \begin{bmatrix} V & & & \\ & V & & \\ & & \ddots & \\ & & & V \end{bmatrix} \tag{10-3-3}$$

The matrix $\Pi$ is the joint covariance matrix‡ of the errors $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$.

‡ If the $\epsilon_\mu$ are serially correlated, we introduce the proper off-diagonal elements into the definition of $\Pi$.

Then, Eq. (1) becomes

$$\tilde{V} = (B^T\Pi^{-1}B + V_0^{-1})^{-1} \tag{10-3-4}$$

and

$$\det(\tilde{V})^{-1} = \det(V_0^{-1} + B^T\Pi^{-1}B) = \det V_0^{-1}\,\det(I + V_0 B^T\Pi^{-1}B) \tag{10-3-5}$$

We recall now [see Eq. (A-1-33)] that $\det(I + AB) = \det(I + BA)$ so that

$$\det(\tilde{V})^{-1} = \det V_0^{-1}\,\det(I + BV_0B^T\Pi^{-1}) = \det V_0^{-1}\,\det\Pi^{-1}\,\det(\Pi + BV_0B^T) \tag{10-3-6}$$

Since $\det V_0^{-1}\det\Pi^{-1}$ is a positive constant, we may simply maximize the function

$$T(x_1, x_2, \ldots, x_n) \equiv \det(\Pi + BV_0B^T) \tag{10-3-7}$$

Let us examine the matrix $\Pi + BV_0B^T$. As stated before, $\Pi$ is the joint covariance matrix of the errors in all the measurements to be taken in the
course of the $n$ proposed experiments, and $BV_0B^T$ is the covariance matrix of the errors incurred in computing the predicted outcomes $f(x_\mu, \theta_0)$ of the proposed experiments due to the current uncertainty in the values of the parameters. Therefore, $\Pi + BV_0B^T$ is the total covariance of the predicted outcome (see Eq. (7-19-4), with $V_0$ and $\Pi$ playing the roles of $V_\theta$ and $V_\eta$, respectively; $V_\xi$ is assumed zero). Eq. (7) is then a measure of the joint uncertainty of the predicted outcomes.† We have shown that to obtain maximum information we must perform those experiments whose outcome is the most uncertain. This result is not surprising; experiments whose outcomes are most uncertain represent the greatest gaps in our knowledge of the system under consideration; to fill the gaps we must perform those experiments.

We have a choice of minimizing $\det(V_0^{-1} + B^T\Pi^{-1}B)^{-1}$ or, equivalently, maximizing $\det(\Pi + BV_0B^T)$. Our choice should depend on the relative dimensions of the two matrices, which are $l \times l$ and $nm \times nm$, respectively. We should obviously choose the determinant of lower dimension.

The case that is most favorable to the second formulation is one where a single experiment is to be conducted on a single equation model. Here $\Pi$ reduces to a single number $\sigma^2$, and $B$ is a row vector $b^T \equiv [\partial f/\partial\theta]^T$. Hence Eq. (7) reduces to

$$T(x_1) = \sigma^2 + b^TV_0b \tag{10-3-8}$$

If $\sigma^2$ is a constant, we need only find the $x_1$ which maximizes the error of prediction variance $b^TV_0b$.

We cite the following simple example: Suppose the model is linear

$$y = f(x, \theta) = \theta_1 + \theta_2 x \tag{10-3-9}$$

We have $b^T = [1, x]$. Let $\sigma^2 = 0.1$ and suppose the current estimates are $\theta_1 = 2$, $\theta_2 = 1$, with covariance matrix $V_0 = \operatorname{diag}(0.1, 0.5)$. The predicted outcome $y$ of any experiment $x$ is given by $2 + x$, and the variance of this prediction is, according to Eq.
(8)

$$T(x) = 0.1 + [1, x]\begin{bmatrix} 0.1 & 0 \\ 0 & 0.5 \end{bmatrix}\begin{bmatrix} 1 \\ x \end{bmatrix} = 0.2 + 0.5x^2 \tag{10-3-10}$$

To improve the estimates of $\theta_1$ and $\theta_2$ we should perform an experiment maximizing $T(x)$, that is, we should choose as large (in absolute value) an $x$ as is practically feasible. The situation is illustrated in Fig. 10-2, where the predicted curve $2 + x$ is plotted surrounded by a confidence curve of width $\pm(0.2 + 0.5x^2)^{1/2}$. We choose for our experiment a value of $x$ at which the confidence band is as wide as possible.

† The determinant of a covariance matrix is sometimes referred to as the generalized variance.
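The linear example can be checked numerically. The following is a minimal sketch in Python, assuming the values quoted in the example ($\sigma^2 = 0.1$, $V_0 = \operatorname{diag}(0.1, 0.5)$); the function names and the candidate range $[-3, 3]$ are hypothetical and serve only for illustration. It evaluates Eq. (10-3-8) over the candidates and also computes the $2 \times 2$ determinant of Eq. (10-3-5), so that the two formulations related by Eq. (10-3-6) can be compared.

```python
def prediction_variance(x, sigma2, v0_diag):
    """Eq. (10-3-8) with b = [1, x] and a diagonal V0: sigma^2 + b^T V0 b."""
    b = (1.0, x)
    return sigma2 + sum(bi * vi * bi for bi, vi in zip(b, v0_diag))


def posterior_precision_det(x, sigma2, v0_diag):
    """det(V0^-1 + b b^T / sigma^2), the determinant of Eq. (10-3-5), 2 x 2 case."""
    b = (1.0, x)
    m = [[b[i] * b[j] / sigma2 + (1.0 / v0_diag[i] if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


sigma2 = 0.1
v0_diag = (0.1, 0.5)                               # V0 = diag(0.1, 0.5)
candidates = [-3.0 + 0.5 * k for k in range(13)]   # hypothetical feasible range [-3, 3]

# The design criterion selects the candidate with the largest prediction variance.
best_x = max(candidates, key=lambda x: prediction_variance(x, sigma2, v0_diag))
```

With a symmetric candidate range both end points give the same value of $T$, so either extreme is an equally good choice; `max` simply returns the first one it meets.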
"I i ''i1 , 10-4. Design Criterion for Prediction 265 ; . <". F 4 1 f.- I 1 I o ] 1iI .,.... .;- f.i ;. . t I ,,', <.1-" "" < t }t;, '- -I Fig, 10-2, Prcdicted J' with confidcncc band5. g'i ;:':: .;::. i '. Ji. "':." :..,. .j. The desIgn criterion that was described here has been arrived at from different points of view by Box and Lucas (I 959) and Box and Hunter (1963), with further details supplied by Draper and Hunter (1966, 1967a, 1967b) and Atkinson and Hunter (1968), The use of the method on a computer-simulated chemical kinetics model is described by Kittrell, Hunter, and Watson (1966), and details for the estimation of polymerization parameters are worked out by Behnken ( 1964). ',!,;.. ':. ' 10-4, Design Criterion for Prediction .. Ife is to be estimated purely for the purpose of predicting certain quanti- . ti,,"  <l>1, e), then th, u ""''' i oIy nf t 10, p"d ictio" i, gi"co by dol V. ' wh", I V p is defined in Eq. (7-19-4) and V is used in place of V o . Choose X/I I : , : ,  , ; , : .  :, , :  : . ' : : ' , . (/I = I. 2. . . . , n) so as to minimize the uncertainty of the prediction _. R = det V p = det[(acpj[11;)V «(!cpji 1 1;)T + (c1cpjae)V (ccpj(le)T + V'I] (10-4-1) Here V, given by Eq. (10-3-1), is the only quantity that depends on the X/I' 
In the special case when the parameters $\theta$ themselves are the $\eta$ to be predicted, Eq. (1) reduces to the criterion of Section 10-3. We may be interested only in a subset of the parameters, in which case we associate that subset with $\eta$, and minimize the determinant of the matrix obtained from $\tilde{V}$ by deleting the rows and columns corresponding to the unwanted parameters.

10-5. Design Criterion for Model Discrimination

Sometimes several alternative models are proposed for the same physical situation. We wish to conduct experiments that will enable us to select the "best" model, i.e., the one that best fits the data.

Each one of our models attempts to predict $y$ as a function of $x$ and $\theta$. What varies from model to model is the mathematical form of the function, and the set of parameters involved (although some of the parameters appearing in different models may possess the same physical interpretation). We attach a superscript $(i)$ to quantities pertaining to the $i$th model. The $i$th model equation reads

$$y = f^{(i)}(x, \theta^{(i)}) \tag{10-5-1}$$

Suppose that we already have estimates $\theta_0^{(i)}$ for the parameters appearing in the $i$th model, and estimates $V_0^{(i)}$ for the associated covariance matrices. Typically, these are obtained by fitting each model in turn to data from previously performed experiments. Using the parameter values $\theta_0^{(i)}$ we can predict the outcome $\hat{y}^{(i)}$ of any proposed experiment $x$, assuming the $i$th model is the correct one. This prediction is given by

$$\hat{y}^{(i)}(x) \equiv f^{(i)}(x, \theta_0^{(i)}) \tag{10-5-2}$$

Still assuming that the $i$th model is correct, we can compute the covariance of the prediction error in Eq. (2). Following Eq. (7-19-4), this is

$$\hat{V}^{(i)}(x) = V + B^{(i)}(x)V_0^{(i)}B^{(i)T}(x) \tag{10-5-3}$$

where $B^{(i)} \equiv \partial f^{(i)}/\partial\theta^{(i)}$ and $V$ is the covariance of the measurement errors of $y$ (which may also be a function of $x$).

The hypothesis that the $i$th model is correct leads us to regard the outcome of a proposed experiment $x$ as a random variable $\eta$ with pdf $p^{(i)}(\eta \mid x)$ having mean and covariance given by Eq. (2) and Eq. (3), respectively. Suppose the experiment has actually yielded an outcome $y$. Then we can compute the number $p^{(i)}(y \mid x)$, which is the likelihood associated with the $i$th hypothesis.

For the moment we restrict ourselves to the case of two alternative models. The quantity

$$a_{12}(x) \equiv \log[p^{(1)}(y \mid x)/p^{(2)}(y \mid x)]$$
is a measure of how much the observed $y$ supports model 1 in preference to model 2 (it is related to the likelihood ratio, see Section 10-6). In advance of performing the experiment we do not know $y$, so we cannot compute $a_{12}$, but we can compute its expected value under the assumption that model 1 is correct (the symbol $E^{(1)}$ denoting expectation under this assumption)

$$E^{(1)}[a_{12}(x)] = \int p^{(1)}(y \mid x)\log[p^{(1)}(y \mid x)/p^{(2)}(y \mid x)]\, dy \tag{10-5-4}$$

If indeed model 1 is correct, we wish to conduct an experiment $x$ which is likely to confirm this, i.e., is expected to produce a large value of $a_{12}$. Conversely, if model 2 is correct, we wish our experiment to have a large value of the corresponding quantity $E^{(2)}[a_{21}(x)]$. Since we do not know which model is correct, we form the sum of these two quantities

$$J_{1,2}(x) \equiv E^{(1)}[a_{12}(x)] + E^{(2)}[a_{21}(x)] = \int [p^{(1)}(y \mid x) - p^{(2)}(y \mid x)]\log[p^{(1)}(y \mid x)/p^{(2)}(y \mid x)]\, dy \tag{10-5-5}$$

The experiment to be selected is the one that maximizes $J_{1,2}(x)$; a large value of $J_{1,2}$ can only be obtained if $p^{(2)}$ is much larger than $p^{(1)}$, or vice versa. In either case, the outcome shows a strong preference for one model as opposed to the other.

The quantity $J_{1,2}$ is called the divergence or the information for discrimination (Kullback and Leibler, 1951; Kullback, 1959). Its similarity to Eq. (10-2-1) is evident. If both models assume normal error distributions with covariance matrices $\hat{V}^{(1)}$ and $\hat{V}^{(2)}$, respectively, then it can be shown that

$$J_{1,2}(x) = -m + \tfrac{1}{2}\operatorname{Tr}(Q^{(1)}\hat{V}^{(2)} + Q^{(2)}\hat{V}^{(1)}) + \tfrac{1}{2}(\hat{y}^{(2)} - \hat{y}^{(1)})^T(Q^{(1)} + Q^{(2)})(\hat{y}^{(2)} - \hat{y}^{(1)}) \tag{10-5-6}$$

where $Q^{(i)} \equiv (\hat{V}^{(i)})^{-1}$. The dependence of $J_{1,2}$ on the experimental conditions $x$ comes about through Eqs. (2) and (3).

An important special case occurs when the models are of the single-equation type, with $m = 1$. Then $\hat{V}^{(i)} = \sigma_i^2$, $Q^{(i)} = \sigma_i^{-2}$ ($i = 1, 2$), and

$$J_{1,2}(x) = -1 + \tfrac{1}{2}[(\sigma_2/\sigma_1)^2 + (\sigma_1/\sigma_2)^2] + \tfrac{1}{2}[(1/\sigma_1^2) + (1/\sigma_2^2)](\hat{y}^{(2)} - \hat{y}^{(1)})^2 \tag{10-5-7}$$

This equation was derived by Box and Hill (1967). The analogues to Eqs. (2) and (3) are in this case:

$$\hat{y}^{(i)}(x) = f^{(i)}(x, \theta_0^{(i)}) \tag{10-5-8}$$

$$\sigma_i^2(x) = \sigma^2 + b^{(i)T}(x)V_0^{(i)}b^{(i)}(x) \tag{10-5-9}$$
where $\sigma$ is the standard deviation of the measurement errors and $b^{(i)}(x) = \partial f^{(i)}/\partial\theta^{(i)}$.

Equations (6) and (7) have a simple heuristic interpretation, particularly in the single-equation case. Let us plot the predicted values $\hat{y}^{(1)}$ and $\hat{y}^{(2)}$ as functions of $x$; this is done for a hypothetical situation in Fig. 10-3, where $x$ is assumed one-dimensional. If we chose to perform the experiment $x_1$, where $\hat{y}^{(1)}$ and $\hat{y}^{(2)}$ coincide, the results of the experiment will tell us nothing about which prediction was the better one. On the other hand, the two predictions are most divergent at $x_2$, and the result of the experiment (unless it happens to fall exactly midway between the two predictions) is likely to confirm one or the other of the two models depending on which prediction it falls nearer to. It seems reasonable to select, then, the experiment $x$ for which $(\hat{y}^{(2)} - \hat{y}^{(1)})^2$ is maximum. It may happen, however, that at that value of $x$ ($x_2$ in Fig. 10-3) one or both of the predictions are particularly uncertain, possessing large values of $\sigma_i$. Performing this experiment then is likely to be inconclusive, and we may prefer another experiment for which $(\hat{y}^{(2)} - \hat{y}^{(1)})^2$ is somewhat smaller, but where the uncertainty is much smaller. Therefore we must attach to the term $(\hat{y}^{(2)} - \hat{y}^{(1)})^2$ a weight which is small when even one $\sigma_i$ is large, and large when both $\sigma_i$ are small. Eq. (7) provides the right weight, and the same is true of Eq. (6) in the multiresponse case.

Fig. 10-3. Discrimination between two predicted responses.

It frequently turns out that the $\sigma_i$ do not vary strongly with $x$, so the weights are nearly the same for all values of $x$. In this case we need only find the maximum of $(\hat{y}^{(2)} - \hat{y}^{(1)})^2$ or $|\hat{y}^{(2)} - \hat{y}^{(1)}|$.
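The interplay between the separation of the predictions and their uncertainty can be illustrated with a short Python sketch of Eqs. (10-5-7) and (10-5-9). The two models, their parameter estimates, the diagonal covariance matrices, and the candidate range used here are all hypothetical; they are chosen only to show how the criterion would be evaluated on a set of proposed experiments.

```python
import math


def divergence(y1, y2, s1_sq, s2_sq):
    """Eq. (10-5-7): y1, y2 are predictions, s1_sq, s2_sq their variances sigma_i^2."""
    return (-1.0
            + 0.5 * (s2_sq / s1_sq + s1_sq / s2_sq)
            + 0.5 * (1.0 / s1_sq + 1.0 / s2_sq) * (y2 - y1) ** 2)


def pred_variance(grad, v0_diag, sigma2):
    """Eq. (10-5-9) for a diagonal parameter covariance matrix."""
    return sigma2 + sum(g * v * g for g, v in zip(grad, v0_diag))


# Each hypothetical model returns its prediction and its gradient w.r.t. the parameters.
def model1(x, th):                 # straight line
    return th[0] + th[1] * x, (1.0, x)


def model2(x, th):                 # exponential growth
    e = math.exp(th[1] * x)
    return th[0] * e, (e, th[0] * x * e)


def criterion(x, sigma2=0.1):
    y1, g1 = model1(x, (1.0, 1.0))             # assumed current estimates
    y2, g2 = model2(x, (1.0, 0.7))
    s1 = pred_variance(g1, (0.01, 0.01), sigma2)
    s2 = pred_variance(g2, (0.01, 0.01), sigma2)
    return divergence(y1, y2, s1, s2)


grid = [0.1 * k for k in range(31)]            # hypothetical feasible range [0, 3]
x_best = max(grid, key=criterion)              # proposed next experiment
```

When the two predictions and their variances coincide the divergence is exactly zero, which corresponds to the uninformative experiment $x_1$ of Fig. 10-3.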
Our results can be generalized in several directions. To design several experiments simultaneously, we maximize $J_{1,2}$ constructed with $\hat{y}^{(i)}$ and $\hat{V}^{(i)}$ augmented to include the responses from all the planned experiments; in Eq. (3) $B^{(i)}$ takes the meaning defined in Eq. (10-3-3), and $\Pi$ of Eq. (10-3-3) replaces $V$.

There are several ways in which we may treat more than two models. After each experiment is performed, we can compute the likelihood $L^{(i)}$ associated with each model and the best current estimate of its parameters. We then design the next experiment so as to discriminate specifically between the two models with largest values of the likelihood. Or, following Box and Hill (1967), we may form a joint divergence as a linear combination of the pairwise divergences

$$J_{1,2,3,\ldots}(x) \equiv \sum_{i \neq j} L^{(i)}L^{(j)}J_{i,j}(x) \tag{10-5-10}$$

We have at this point no experience to guide us in the choice of the method to use, but it is obvious that the first one requires fewer calculations.

Our aim may be both to find the best among alternative models, and at the same time find good estimates for the parameters in the best model. A solution to this problem suggested by Hill et al. (1968) is to use as design criterion a weighted sum of Eq. (6) and Eq. (10-3-2), the latter quantity being evaluated for the currently best model. Initially, a relatively large weight is placed on Eq. (6), but as one model becomes increasingly preferred, the relative weight given to Eq. (10-3-2) is progressively increased.

10-6. Termination Criteria

We now turn our attention to box 4 of Fig. 10-1. How do we decide whether more experiments are needed?
How and when do we decide that a given model is better than the alternatives?

We have advocated the use of the maximum likelihood method for estimating parameters. We preferred to assign to our parameters $\theta$ the value $\theta_1$ rather than $\theta_2$, provided that the likelihood associated with $\theta_1$ was greater than that associated with $\theta_2$. The same idea applies to the choice of models; we prefer model 1 to model 2 if the maximum likelihood attainable with model 1 is greater than that attainable with model 2. These considerations lead to Wald's (Wald, 1947) sequential probability ratio (or likelihood ratio) test.

Suppose our aim is to choose one of two alternative hypotheses, $H_1$ (model 1 is correct), or $H_2$ (model 2 is correct). Let $L^{(i)}(y, \theta_0^{(i)})$ be the likelihood (i.e., the value of the joint probability density function) associated with the data obtained to date, and with the current best estimate $\theta_0^{(i)}$ for the parameters based on the $i$th hypothesis ($i = 1, 2$).
Let $A$ and $B$ be two constants satisfying

$$0 < B < 1 < A \tag{10-6-1}$$

Then the likelihood ratio test proceeds as follows:

1. If $L^{(1)}/L^{(2)} \le B$ accept hypothesis 2.
2. If $L^{(1)}/L^{(2)} \ge A$ accept hypothesis 1.
3. If $B < L^{(1)}/L^{(2)} < A$ continue experimentation.

The choice of the constants $A$ and $B$ is determined by what confidence we desire to place on the results. Let $\alpha$ be the probability that $H_1$ is accepted when $H_2$ is true, and $\beta$ the probability that $H_2$ is accepted when $H_1$ is true. It was shown by Wald (1947) that the following relations hold approximately (the last two being consequences of the first two)

$$A \approx (1 - \beta)/\alpha, \quad B \approx \beta/(1 - \alpha), \quad \alpha \approx (1 - B)/(A - B), \quad \beta \approx B(A - 1)/(A - B) \tag{10-6-2}$$

If we want, say, to be 95% certain that we accept $H_1$ only if $H_1$ is true, and 90% certain that we accept $H_2$ only if $H_2$ is true, then $\alpha = 0.05$ and $\beta = 0.1$ so that $A = 0.9/0.05 = 18$ and $B = 0.1/0.95 = 0.105$. Conversely, suppose we choose $A = 10$, $B = 0.1$. This is tantamount to accepting error probabilities $\alpha = 0.9/9.9 = 0.0909$ and $\beta = 0.1 \times 9/9.9 = 0.0909$. The choice $\alpha = \beta$ leads to $B = 1/A$, and hence $\alpha = \beta = 1/(1 + A)$.

When more than two alternatives are present, we need only apply the test to the two currently most likely models.

It is instructive to derive an expression for the likelihood ratio after $n$ experiments in the single equation case. Assuming normal distributions, we have

$$L^{(i)} = (2\pi)^{-n/2}\sigma^{-n}\exp\left\{-(1/2\sigma^2)\sum_{\mu=1}^{n}[y_\mu - f_\mu^{(i)}(\theta^{(i)})]^2\right\} \tag{10-6-3}$$

For the $i$th model, $L^{(i)}$ is maximized if we estimate $\sigma$ to be

$$\sigma^{(i)} = \left\{(1/n)\sum_{\mu=1}^{n}[y_\mu - f_\mu^{(i)}(\theta^{(i)})]^2\right\}^{1/2} \tag{10-6-4}$$

Hence

$$L^{(i)} = (2\pi)^{-n/2}(\sigma^{(i)})^{-n}\exp(-n/2) \tag{10-6-5}$$

and the likelihood ratio is

$$\frac{L^{(1)}}{L^{(2)}} = \left(\frac{\sigma^{(2)}}{\sigma^{(1)}}\right)^n = \left\{\frac{\sum_{\mu=1}^{n}[y_\mu - f_\mu^{(2)}(\theta^{(2)})]^2}{\sum_{\mu=1}^{n}[y_\mu - f_\mu^{(1)}(\theta^{(1)})]^2}\right\}^{n/2} \tag{10-6-6}$$
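The relations of Eq. (10-6-2) and the ratio of Eq. (10-6-6) are simple enough to compute directly. The Python sketch below uses hypothetical function names and, in the last lines, made-up residuals purely for illustration; it is one way the three-way decision rule might be coded, not a prescribed implementation.

```python
def wald_thresholds(alpha, beta):
    """First two relations of Eq. (10-6-2): returns the constants (A, B)."""
    return (1.0 - beta) / alpha, beta / (1.0 - alpha)


def error_probabilities(A, B):
    """Last two relations of Eq. (10-6-2): returns (alpha, beta)."""
    return (1.0 - B) / (A - B), B * (A - 1.0) / (A - B)


def likelihood_ratio(res1, res2):
    """Eq. (10-6-6): L1/L2 from the residuals of model 1 and model 2."""
    n = len(res1)
    ss1 = sum(r * r for r in res1)
    ss2 = sum(r * r for r in res2)
    return (ss2 / ss1) ** (n / 2.0)


def sprt_decision(ratio, A, B):
    """Three-way decision rule of the likelihood ratio test."""
    if ratio >= A:
        return "accept model 1"
    if ratio <= B:
        return "accept model 2"
    return "continue"


A, B = wald_thresholds(alpha=0.05, beta=0.1)
# Hypothetical residuals after four experiments; model 1 fits much better.
decision = sprt_decision(
    likelihood_ratio([0.1, -0.1, 0.1, -0.1], [1.0, -1.0, 1.0, -1.0]), A, B)
```

Working with the logarithm of the ratio would be safer numerically when $n$ is large, since the ratio itself can overflow or underflow.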
If after $n$ experiments $\sigma^{(2)} > \sigma^{(1)}$, we expect to find ultimately that model 1 is to be preferred. If $[\sigma^{(2)}/\sigma^{(1)}]^n < A$, we must defer final conclusions until some more experiments are performed. Having no reason to expect the estimates of $\sigma^{(2)}$ or $\sigma^{(1)}$ to be changed much by the results of future experiments, we can predict that $L^{(1)}/L^{(2)}$ will exceed $A$ after we conduct $n_0$ additional experiments with

$$[\sigma^{(2)}/\sigma^{(1)}]^{n + n_0} \ge A, \qquad n_0 \ge (\log A)/(\log \sigma^{(2)}/\sigma^{(1)}) - n \tag{10-6-7}$$

The smallest integer $n_0$ satisfying Eq. (7) is an estimate of the number of additional experiments required to reach the conclusion that model 1 is the better one. If $\sigma^{(2)} < \sigma^{(1)}$, then

$$n_0 \ge -(\log B)/(\log \sigma^{(1)}/\sigma^{(2)}) - n \tag{10-6-8}$$

provides an estimate for the number of additional experiments required to establish a preference for model 2. The reliability of these estimates, which is very small when $n \ll n_0$, increases steadily as $n_0$ approaches zero. For further discussion of the expected number of experiments, the reader is referred to Wald (1947).

When experiments are being conducted for the purpose of estimating parameters in a single model, the termination criterion is usually formulated in terms of the variance of the estimates. One demands that $\det V_\theta$ fall below a specified value, or that the individual parameter variances $V_{\theta ii}$ ($i = 1, 2, \ldots, l$) all fall below specified levels $\sigma_i^2$. The number of additional experiments required at any stage may be estimated easily from the fact that the elements of $V_\theta$ are roughly proportional to $(n - l)^{-1}$. If $\det V_\theta = a$ after $n$ experiments, and the number of additional experiments $n_0$ required to reach $\det V_\theta = b < a$ is to be determined, then we must solve the equation

$$(n + n_0 - l)^l\, b = (n - l)^l\, a \tag{10-6-9}$$

for $n_0$.

10-7. Some Practical Considerations

We have derived several experimental design criteria, given by Eqs.
(10-3-2), (10-3-7), (10-4-1), (10-5-6), and (10-5-7) for the various cases that may arise. Let $D(x)$ denote the criterion adopted in a given situation. The experimental conditions $x$ are to be chosen so as to maximize $D(x)$. We discuss here some of the problems associated with finding these experimental conditions.
In the first place, we must realize that the choice of experimental conditions is generally not unrestricted. Mole fractions can only range from zero to one, the temperature of a liquid is constrained between its freezing and boiling points, and the pressure in a vessel is limited by the strength of its walls. Therefore, searching for the maximum of $D(x)$ involves constrained optimization, with the variables (experimental conditions) confined to a bounded feasible region. Experience has shown that the maximum usually falls on the boundary of the feasible region (Atkinson and Hunter (1968) have derived conditions under which this must be so). The experimenter must apply the design criterion with caution; the extreme values of the experimental conditions prescribed by the criterion may be far removed from the region of interest, and it may be well to impose stricter bounds on the variables than is required by physical or technical limitations. There is also the danger that the properties (i.e., the model equations or parameter values) of the system under investigation are not the same at the boundary as in the center of the feasible region. We recommend therefore that occasional experiments be chosen in the interior of the region, even when not prescribed by the design criterion.

The reader will have noticed that the design criterion cannot be computed unless initial estimates $\theta_0$ and $V_0$ are given for the parameters and their covariance matrix. At the start of the investigation such estimates may not be available, and some initial experiments must be performed to get things going. The number of such experiments must exceed somewhat the number of unknown parameters, so that the estimates $\theta_0$ and $V_0$ can be obtained. The initial experiments may be selected by standard methods such as factorial, fractional factorial, or rotatable designs covering the feasible range of the experimental conditions.
An experimenter using these designs must remember that he cannot expect to get more out of the procedure than he has put into it. He cannot expect to obtain clear-cut preference for one model or one value of the parameters, if major effects have been neglected. For example, suppose a compound A is converted into a product C according to the consecutive reaction scheme

$$A \to B, \qquad B \to C \tag{10-7-1}$$

However, the experimenter has set down models involving only the reaction

$$A \to C \tag{10-7-2}$$

He should not be disappointed then if the design criterion does not tell him to run experiments with varying initial concentrations of B.

In our derivations, expected information was the sole criterion for selecting experiments. In practice, considerations of economics and convenience in experimental setup must also play a role. In many situations,
particularly those involving dynamic systems, experiments are conducted in runs; several observations are made at different times on a process starting from given initial conditions. In such cases one should design whole runs, rather than single observations. We must select, then, a set of initial conditions $s_0$, and times $t_1, t_2, \ldots, t_n$ at which observations are to be made. Computing the total information obtainable in each possible run is a formidable task because of the high correlation between the predicted values of successive observations of a run. It is, however, easy to calculate the expected information in any single observation taken at time $t_\mu$, with initial conditions $s_0$. If we plot the expected information $I$ as a function of $t$ for given $s_0$ we usually find that there is a definite time $t_M(s_0)$ at which the expected information attains a maximum value.† Let $I_M(s_0)$ be the expected information at $t_M(s_0)$. It is reasonable to choose that run (i.e., the value of $s_0$) whose $I_M(s_0)$ has the largest value. The actual observations to be made during the run, i.e., the values of the $t_\mu$, are chosen in that portion of the $I(t)$ curve where $I$ is not much below $I_M$. The problem of determining values of $t_\mu$ is further treated by Heineken et al. (1967a,b).

Other complications arise when the cost of an experiment depends strongly on the experimental conditions. It may then be cheaper to gain a certain amount of information by performing several cheap though inefficient experiments, rather than a single efficient though expensive one. The simplest solution is to divide the expected information gain in an experiment by the cost of that experiment, and maximize the expected information per unit cost. Design criteria based purely on economic considerations can be derived from decision theory, as shown in Section 10-10.
10-8. Computational Considerations

The problem is to locate the maximum of the design criterion, which is a complicated nonlinear function $D$ of the experimental conditions $x$. The function is often so complicated that analytic computation of its derivatives is out of the question. Additional factors which contribute to the difficulty of the problem are the following:

1. The maximum is usually located on the boundary of the feasible region.
2. There are usually several local maxima. In the cases that have been studied in detail, the number of local maxima tended to be close to the number of unknown parameters in the model.

† It is possible for the maximum to be approached asymptotically as $t \to \infty$. We then select $t_M$ as the time at which $I = I_M - \epsilon$.
As indicated in Chapter V, maximization of a nonlinear function is easiest when derivatives can be calculated, no constraints apply, and there is a unique local maximum. On all these scores our problem is a difficult one. Furthermore, if we wish to obtain the most information in each experiment, we must repeat the maximization procedure before each experiment is performed. Fortunately, there are some mitigating circumstances:

1. There is no need to locate the maximum with a great deal of precision.
2. The locations of the local maxima do not seem to vary much from one experiment to the next. What do change are the relative heights of the various maxima, so that the conditions chosen for a sequence of experiments cycle among the several local maxima.‡ Indeed, Box (1968) shows that if a sequence of $n$ experiments is designed (nonsequentially) to estimate $l < n$ parameters, then an optimal or near-optimal design is usually obtained if the $l$ best experiments are each replicated $n/l$ (as closely as possible) times.

It seems, therefore, that we need search for the local maxima throughout the entire feasible region only the first time around, i.e., after the initial experiments have been performed. After that, when the results of each new experiment come in, we need only search in the neighborhood of each already established local maximum so as to locate its current position (which may shift slightly after each experiment).

The safest way to conduct the initial thorough search for local maxima is to evaluate the design criterion at all points on a dense grid throughout the feasible region. Those grid points where the design criterion exceeds the values at all neighbors are selected as approximate locations of the local maxima. Further refinement can then be achieved by starting hill climbing procedures (e.g., direct search optimization, see Section 5-19) at these points.
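The grid evaluation and neighbor comparison just described might be sketched as follows in Python. The function name `grid_local_maxima` and the two-peaked test criterion are hypothetical; a real design criterion $D(x)$ would replace `two_bumps`, and all adjacent grid points, including diagonal ones, are treated as neighbors here, which is only one possible convention.

```python
import itertools
import math


def grid_local_maxima(D, axes):
    """Return the grid points at which D exceeds its value at every neighboring grid point."""
    shape = [len(a) for a in axes]
    values = {idx: D([axes[d][i] for d, i in enumerate(idx)])
              for idx in itertools.product(*(range(n) for n in shape))}
    # All index offsets of -1, 0, +1 per dimension, excluding the point itself.
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=len(axes)) if any(o)]
    maxima = []
    for idx, v in values.items():
        neighbors = (tuple(i + o for i, o in zip(idx, off)) for off in offsets)
        # Neighbors outside the feasible grid are simply ignored.
        if all(v > values[nb] for nb in neighbors if nb in values):
            maxima.append([axes[d][i] for d, i in enumerate(idx)])
    return maxima


def two_bumps(x):
    """Hypothetical design criterion with peaks near (0.2, 0.2) and (0.8, 0.7)."""
    return (math.exp(-50.0 * ((x[0] - 0.2) ** 2 + (x[1] - 0.2) ** 2))
            + 0.8 * math.exp(-50.0 * ((x[0] - 0.8) ** 2 + (x[1] - 0.7) ** 2)))


levels = [k / 10 for k in range(11)]        # eleven levels in each dimension
maxima = grid_local_maxima(two_bumps, [levels, levels])
```

The number of criterion evaluations grows as the product of the numbers of levels, which is the reason the technique becomes impractical beyond a few dimensions.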
A sufficiently fine initial grid makes this step superfluous.

‡ This statement, like most others in this section, is based solely on a limited amount of experience with computer-simulated experiments.

The grid search technique is feasible only when the number of independent experimental conditions is small. With three variables, a ten-level grid in each dimension results in a thousand points, which is not excessive if the model equations are simple. With four or more dimensions, the grid search technique is likely to be impractical. In this case we suggest the following procedure:

1. Select a feasible point at random.
2. Starting from this point, apply a direct search optimization procedure until a local maximum is reached.
3. Repeat 1 and 2 until at least $l$ (= maximum number of parameters in any of the models considered) distinct local maxima have been obtained, or until a certain number of tries has failed to uncover a new local maximum.

Let $x^{(1)}, x^{(2)}, \ldots, x^{(p)}$ be the known local maxima after $n$ experiments have been performed. The $(n + 1)$st experiment is, of course, conducted at the highest local maximum, i.e., at the $x^{(j)}$ whose design criterion is largest. After this experiment has been performed, we establish new values of the $x^{(j)}$ by applying the direct search technique, starting at each of the old $x^{(j)}$. It is not uncommon for some of the new $x^{(j)}$ to coalesce; i.e., searches starting at several of the old $x^{(j)}$ may lead to the same (within some tolerance $\epsilon$) new $x^{(j)}$. To guard against the possibility that some maxima are being overlooked, one may also include after each experiment additional random starting points for searches.

Fig. 10-4 contains a proposed flow diagram for the procedure to be followed. The diagram is divided into two sections, dealing respectively with the functions of the computer (estimation and design), and of the laboratory (execution of specified experiments). This raises the question of how to implement the links between the computer and the laboratory apparatus. The answer depends on the circumstances; if the experiments are of very short duration and suitable instrumentation is available, the computer may be connected directly on-line to the apparatus. Otherwise, manual transfer of data is required. Note that the computer functions described here are quite distinct from actual on-line control of experiments, where the question is not what experiments to perform, but how to insure that the specified experiment is carried out properly. Of course, on-line design cannot perform unless the control function is also implemented, but the latter is outside the scope of this book.

Fig. 10-4. A sequential experimental procedure. Symbols next to arrows indicate transmitted data.
Tasks performed by laboratory: (1) Perform initial experiments $x_1, \ldots, x_n$, i.e., measure $y_1, \ldots, y_n$. (5) Perform the experiment $x_\mu$, i.e., measure $y_\mu$.
Tasks performed by computer: (2) Estimate parameters for all proposed models. (3) By grid or random search, locate $x^{(1)}, \ldots, x^{(p)}$, the set of local maxima of $D(x)$. (4) Choose the local maximum with largest value of $D$. (6) Estimate parameters for all proposed models. (8) Starting from the old local maxima (and possibly some additional randomly chosen points), locate the new set of maxima by direct search.

10-9. Computer Simulated Experiments

Before applying our design methods to real experiments it may be wise to test them on computer simulated experiments. In this way we can determine economically whether the method is likely to succeed.

How do we simulate an experiment on the computer? An experiment is, from our point of view, merely a device for generating the value of $y_\mu$ for a given value of $x_\mu$. To simulate the experiment, all we need then is a computer routine which accepts a value of $x_\mu$ and returns a value of $y_\mu$. Internally, this subroutine should compute $y_\mu$ using a formula such as

$$y_\mu = f^{(0)}(x_\mu, \theta^{(0)}) + \epsilon_\mu \tag{10-9-1}$$

where $f^{(0)}$ is one of the models proposed for the phenomenon under study, and $\theta^{(0)}$ is a specific set of values for the parameters appearing in this model.
The error term $\epsilon_\mu$ consists of pseudorandom numbers with the proper probability distribution (see Section 3-3). In addition, we may include a systematic error, to test what happens if none of the proposed models is really correct.

The experimental design procedure is tested by applying the procedure of Fig. 10-4, with the functions of boxes (1) and (5) performed by the computer routine just discussed. Note that only this particular routine "knows" which model has been selected, and what parameter values have been assigned; precisely as in nature the laboratory apparatus "knows" the model and the parameters. The only way in which other computer routines (e.g., those performing the functions of boxes (2) and (6)) can guess at the right model and parameter values is by analyzing the data ($y_\mu$ values) supplied by boxes (1) and (5).

We present, now, a numerical example (Bard and Lapidus, 1968) in which the design method for discrimination among models is applied to computer-simulated experiments. The example clearly illustrates the potential power of the method.

Hougen and Watson (1947, pp. 943-958) have proposed eighteen alternative models for determining the rate of catalytic hydrogenation of mixed isooctenes into isooctane

$$\mathrm{C_8H_{16} + H_2 \rightarrow C_8H_{18}}$$ (10-9-2)

Blakemore and Hoerl (1963) have attempted to fit all these models, and two additional ones, to data that were available in the literature. They found that all but two of the models could be rejected immediately. There was no conclusive evidence to choose between these two, which have the forms

$$y = \frac{\theta_1^{(1)} x_1 x_2}{\left(1 + \theta_2^{(1)} x_1^{1/2} + \theta_3^{(1)} x_2 + \theta_4^{(1)} x_3\right)^3}$$ (10-9-3)

and

$$y = \frac{\theta_1^{(2)} x_1 x_2}{\left(1 + \theta_2^{(2)} x_1 + \theta_3^{(2)} x_2 + \theta_4^{(2)} x_3\right)^2}$$ (10-9-4)

where $y$ is the rate of reaction and $x_1$, $x_2$, and $x_3$ are the partial pressures of hydrogen, isooctene, and isooctane, respectively.
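The simulated-experiment routine described above can be sketched in a few lines. This is a minimal illustration, not the routine used in the study: the function names `model_3`, `model_4`, and `make_experiment` are hypothetical, and the error term is assumed here to be additive and normally distributed, following the form of Eq. (10-9-1).

```python
import math
import random


def model_3(x, theta):
    """Rate model of Eq. (10-9-3): sqrt(x1) term, cubed denominator."""
    x1, x2, x3 = x
    t1, t2, t3, t4 = theta
    return t1 * x1 * x2 / (1.0 + t2 * math.sqrt(x1) + t3 * x2 + t4 * x3) ** 3


def model_4(x, theta):
    """Rate model of Eq. (10-9-4): linear x1 term, squared denominator."""
    x1, x2, x3 = x
    t1, t2, t3, t4 = theta
    return t1 * x1 * x2 / (1.0 + t2 * x1 + t3 * x2 + t4 * x3) ** 2


def make_experiment(model, theta, sigma, seed=0):
    """Return a routine x -> y that hides the 'true' model and parameters.

    Assumption: additive N(0, sigma^2) error, as in Eq. (10-9-1).
    """
    rng = random.Random(seed)

    def run(x):
        return model(x, theta) + rng.gauss(0.0, sigma)

    return run
```

Only the closure returned by `make_experiment` holds the chosen model and parameter values; estimation and design code sees nothing but the returned $y$ values, which mirrors the separation between laboratory and computer in Fig. 10-4.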
Blakemore and Hoerl conclude, in part: "Carefully designed experiments are necessary . . . there are no fitting techniques which can overcome the deficiencies of poorly-designed experiments . . ."

This system was, therefore, considered a good one for testing the experimental design procedure. To simulate the reaction on the computer, the following relations were used. For experiments $\mu = 1, 2, \ldots, 6$

$$y = \frac{0.0653477\,x_1 x_2\,[1 + \epsilon(\sigma)]}{\left(1 + 0.128246\,x_1^{1/2} + 0.159038\,x_2 + 0.0206618\,x_3\right)^3}$$ (10-9-5)
and for experiments $\mu = 7, 8, \ldots$

$$y = \frac{0.0558\,x_1 x_2\,[1 + \epsilon(\sigma)]}{\left(1 + 0.104\,x_1 + 0.264\,x_2 + 0.0151\,x_3\right)^2}$$ (10-9-6)

where $\epsilon(\sigma)$ is a pseudorandom number with distribution $N_1(0, \sigma^2)$. Note that $\sigma$ is the standard deviation of the relative error in $y$. This choice of model is to be interpreted as follows: model Eq. (4) is the correct one, but by chance the first six experiments happen to give wrong results which appear to be closer to model Eq. (3). The aim was to see how soon the experimental design procedure could pick out Eq. (4) as the correct model, in spite of the handicap posed by the first six observations.

The parameter values used in Eqs. (5) and (6) were those that gave the best least squares fits to the literature data used by Blakemore and Hoerl. The permitted ranges of the independent variables were the same as in the literature data, i.e.,

$$0.1 \le x_1 \le 2.5, \qquad 0.1 \le x_2 \le 3, \qquad 0.05 \le x_3 \le 2.7$$ (10-9-7)

The flow chart of Fig. 10-4 was implemented in the following way:

Box 1 The initial experiments, six in number, formed a fractional factorial design. They consisted of the centers of the six surfaces bounding the region defined by Eq. (7). The conditions for these experiments are listed in Table 10-1, along with the results [computed from Eq. (5)] for the case $\sigma = 0.03$, i.e., 3% relative error.

Table 10-1 Initial Experiments

  mu    x1     x2      x3       y (sigma = 0.03)
  1     0.1    1.55    1.375    0.00441
  2     2.5    1.55    1.375    0.07932
  3     1.3    0.1     1.375    0.00508
  4     1.3    3       1.375    0.05633
  5     1.3    1.55    0.05     0.04912
  6     1.3    1.55    2.7      0.04292

Box 2 The least squares criterion was used to estimate parameters for both models. The fact that the relative rather than absolute error remained constant from experiment to experiment was ignored (i.e., it was assumed that the experimenter did not know that the error standard deviation varied from experiment to experiment). The parameter estimates with their standard deviations and the residual standard deviations for the data of Table 10-1 are presented in Table 10-2.

Table 10-2 Parameter Estimates for Initial Experiments (sigma = 0.03)

  Model           theta1        theta2        theta3        theta4        Standard deviation of residuals
  Eq. (10-9-3)    0.064372      0.116329      0.160034      0.024028      0.488055 x 10^-4
                  ± 0.000294    ± 0.001065    ± 0.000544    ± 0.000272
  Eq. (10-9-4)    0.056738      0.071874      0.277537      0.040166      3.75137 x 10^-4
                  ± 0.001808    ± 0.005881    ± 0.009287    ± 0.003611

It is not surprising that at this point model Eq. (3) gives much the better fit, and its parameters are the better-determined ones.

Box 3 Since there are only three independent variables, a complete grid search was considered feasible. The design criterion function $J_{1,2}(x)$ of Eq. (10-5-7) was evaluated at all points on an 11 x 11 x 11 grid encompassing the feasible region defined by Eq. (7). Local maxima are taken to be those grid points at which $J_{1,2}$ exceeds values at all direct neighbors. The local maxima after the six preliminary experiments are listed in Table 10-3, with the highest maximum marked.

Table 10-3 Local Maxima of Design Criterion Function After Initial Experiments (sigma = 0.03)

  x1      x2      x3      J_{1,2}
  2.02    0.68    0.05    1.01445
  2.5     2.13    0.05    0.1536827
  0.58    2.42    0.05    0.9023107
  2.5     3       2.7     1.287048   (highest maximum)

Box 4 We choose the highest maximum of $J_{1,2}$ for our next experiment. According to Table 10-3, then, we perform the seventh experiment at $x_7 = (2.5, 3, 2.7)$.
Box 5 Eq. (6) is used to generate $y_\mu$. In our example, $y_7$ turns out to be 0.09769.

Box 6 Procedure identical to Box 2.

Box 7 The simulation runs were terminated after 30 experiments. However, the likelihood ratio Eq. (10-6-6) was evaluated and printed out after each experiment, so that the number of experiments that would have been required for given confidence levels $\alpha$, $\beta$ could be determined easily. Let $R_\mu = L^{(2)}/L^{(1)}$ after $\mu$ experiments, and assume $\alpha = \beta$. Then quitting after $\mu$ experiments would be correct had we set $\beta = 1/(1 + R_\mu)$, and our confidence in preferring the second model after $\mu$ experiments is given by

$$C^{(2)} \equiv 1 - \beta = \frac{R_\mu}{1 + R_\mu} = \frac{L^{(2)}}{L^{(1)} + L^{(2)}} = \frac{(\sigma^{(1)})^\mu}{(\sigma^{(1)})^\mu + (\sigma^{(2)})^\mu}$$ (10-9-8)

Box 8 The entire grid search of box 3 was repeated after each experiment. This, of course, would be impractical in larger problems. The procedure described in Fig. 10-4 was also tried, and led to results that were very nearly as good.

Table 10-4 gives the details of experiments 7-30 for the case $\sigma = 0.03$. In addition to $x_\mu$ and $y_\mu$ we list the logarithm of the likelihood ratio and the confidence $C^{(2)}$ in preferring model Eq. (4) over Eq. (3) after each experiment has been processed.

Similar runs were made with relative error standard deviations of 1%, 3%, and 6%. Fig. 10-5 summarizes the results of the three runs. It should be noted that to establish a preference for model Eq. (4) with 95% confidence, we needed 17 experiments with $\sigma = 0.01$, 21 experiments with $\sigma = 0.03$, and by the method of Section 10-6 we predicted that 36 experiments would be required with $\sigma = 0.06$.

For this problem, the use of $\max |y^{(2)} - y^{(1)}|$ as the design criterion worked just as well as using Eq. (10-5-7).

To determine whether the sequential design procedure employed here provides any improvement over classical design procedures, the 27 experiments of a 3 x 3 x 3 factorial design were simulated.
These are formed by taking all possible combinations of the independent variables at the following levels:

  x1 = 0.1, 1.3, 2.5
  x2 = 0.1, 1.55, 3
  x3 = 0.05, 1.375, 2.7

These include the six initial experiments of Table 10-1. The results are compared to those obtained in the sequential design procedure in Table 10-5.
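The grid search used in Box 3 can be sketched as follows. This is an illustrative outline only: `local_maxima` and `grid_axis` are hypothetical names, the criterion is passed in as an arbitrary function (Eq. (10-5-7) is not reproduced here), and "direct neighbors" is assumed to mean the grid points one step away along a single axis.

```python
import itertools


def grid_axis(lo, hi, n=11):
    """n equally spaced levels from lo to hi, inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]


def local_maxima(criterion, bounds, n=11):
    """Grid points whose criterion value exceeds that of every direct neighbor.

    Assumption: direct neighbors differ by one grid step along one axis.
    Returns (value, point) pairs sorted with the highest maximum first.
    """
    axes = [grid_axis(lo, hi, n) for lo, hi in bounds]
    shape = [len(axis) for axis in axes]

    def point_at(idx):
        return tuple(axes[d][i] for d, i in enumerate(idx))

    # Evaluate the criterion once at every grid point.
    values = {
        idx: criterion(point_at(idx))
        for idx in itertools.product(*(range(s) for s in shape))
    }

    maxima = []
    for idx, value in values.items():
        neighbors = []
        for d in range(len(idx)):
            for delta in (-1, 1):
                j = list(idx)
                j[d] += delta
                if 0 <= j[d] < shape[d]:
                    neighbors.append(values[tuple(j)])
        if all(value > w for w in neighbors):
            maxima.append((value, point_at(idx)))
    maxima.sort(reverse=True)
    return maxima
```

With `bounds = [(0.1, 2.5), (0.1, 3.0), (0.05, 2.7)]` the grid has the 11 x 11 x 11 layout described in Box 3, and the first entry of the returned list is the candidate for the next experiment. The cost grows as $n^3$ criterion evaluations here, which is why the text calls a repeated full search impractical for larger problems.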
Table 10-4 Computer Designed Experiments (sigma = 0.03)

  mu     x1      x2      x3       y          log(L(2)/L(1))   Confidence in preferring model Eq. (10-9-4)
  1-6    (see Table 10-1)                    -12.24           0.00005
  7      2.5     3       2.7      0.09769     1.29            0.563
  8      1.78    0.68    0.05     0.03709    -0.443           0.391
  9      0.58    2.42    0.05     0.02649     0.635           0.654
  10     1.78    0.39    0.05     0.02243     1.392           0.801
  11     2.5     1.84    1.375    0.08126     1.029           0.737
  12     0.34    2.42    1.11     0.01589     1.148           0.759
  13     1.54    0.68    0.05     0.03363     1.630           0.836
  14     0.34    2.42    3.15     0.01588     1.936           0.874
  15     2.02    3       2.17     0.08211     2.209           0.901
  16     2.5     1.84    1.64     0.08494     1.825           0.861
  17     1.54    0.68    0.05     0.03364     2.325           0.911
  18     0.34    2.42    0.05     0.01638     2.503           0.924
  19     2.02    3       2.435    0.07961     1.757           0.853
  20     2.5     1.84    1.905    0.08108     2.499           0.924
  21     1.54    0.68    0.05     0.03411     3.081           0.956
  22     0.34    2.42    0.05     0.01525     3.677           0.975
  23     1.54    0.68    0.05     0.03146     3.077           0.956
  24     0.34    2.42    0.05     0.01627     3.280           0.964
  25     2.02    3       2.7      0.08174     3.884           0.980
  26     2.5     1.84    2.17     0.07816     4.308           0.987
  27     1.54    0.68    0.05     0.03096     3.783           0.978
  28     2.5     1.84    1.11     0.07775     3.636           0.974
  29     0.34    2.42    1.11     0.01610     3.812           0.978
  30     2.5     1.84    2.7      0.08084     4.053           0.983

To interpret the numbers in the table, remember that a 0.5 preference level indicates complete indifference between the two models. Thus, at error levels of 3% or more, the factorial design completely fails to differentiate between the models, whereas the sequential procedure generates 83.3% confidence in the correct model even with a 6% error. At a 1% error level, the factorial design barely prefers the correct model, whereas the sequential design selects the proper model with almost complete certainty.

Admittedly, systematic errors and other complications that may be expected in practice were absent from this study. Still, the benefits of the sequential approach turned out to be very substantial.
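The relation between the likelihood ratio and the confidence level in Eq. (10-9-8) is easy to check numerically. The sketch below assumes natural logarithms, and the function names are hypothetical. The last equality of Eq. (10-9-8) implies $\log R_\mu = \mu \log(\sigma^{(1)}/\sigma^{(2)})$, where $\sigma^{(1)}$ and $\sigma^{(2)}$ are the residual standard deviations of the two fitted models.

```python
import math


def log_likelihood_ratio(sigma_1, sigma_2, n_experiments):
    """log(L2/L1) implied by Eq. (10-9-8): n * log(sigma_1 / sigma_2)."""
    return n_experiments * math.log(sigma_1 / sigma_2)


def confidence_in_model_2(log_ratio):
    """C2 = R / (1 + R) with R = exp(log_ratio), written as a logistic."""
    return 1.0 / (1.0 + math.exp(-log_ratio))
```

With the residual standard deviations of Table 10-2 and six experiments, `log_likelihood_ratio(0.488055e-4, 3.75137e-4, 6)` rounds to the tabulated $-12.24$. A value of exactly zero gives a confidence of 0.5, the indifference level mentioned above.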
One has reason to hope that even under less favorable circumstances, at least some of these benefits will be retained. In fact, Hunter and Mezaki (1967) have reported
[Figure 10-5: plot of the logarithm of the likelihood ratio, with the corresponding confidence level in percent, against the number of experiments (6 to 30); regions favoring each model are marked.]

Fig. 10-5. Sequential discrimination between models. Standard deviation of measurement errors: o, 1%; x, 3%; +, 6%.

Table 10-5 Comparison of Experimental Design Procedures

  Preference for model Eq. (10-9-4) after 27 experiments

  sigma    Factorial design    Sequential design
  0.01     0.589               0.9983
  0.03     0.481               0.978
  0.06     0.500               0.833
successful application of the sequential design procedure to the discrimination between two alternative models for the kinetics of the catalytic hydrogenation of propylene. Nine experiments previously performed yielded a likelihood ratio of $L^{(1)}/L^{(2)} = 1.22$. After a mere four additional properly designed experiments a firm preference for model 1 was established, with a likelihood ratio $L^{(1)}/L^{(2)}$ of about 99.

10-10. Design for Decision Making

So far we have been concerned with the somewhat abstract aim of elucidating the "true" model or parameter values. Consequently, we used an abstract measure of information to select the experiments. When the parameter values are required for some specific purpose it may be more appropriate to minimize the total expected cost of achieving that purpose. We have already (Section 4-16) introduced the loss function $c(\theta^*, \theta)$ which represents the cost of using the value $\theta^*$ where $\theta$ is the true value. Similarly, we introduce a function $d(X)$ which represents the cost of performing the series of experiments $X \equiv [x_1, x_2, \ldots, x_n]^T$.

The outcome of the as yet unperformed experiments is the random variable $Y \equiv [y_1, y_2, \ldots, y_n]^T$ whose pdf is $p(Y \mid X, \theta)$. The latter function is by definition also the likelihood† $L(\theta \mid X, Y)$ of any hypothetical sample $Y$. Hence, given a prior density $p_0(\theta)$, we can form the posterior density $p^*(\theta \mid X, Y) = k\,p_0(\theta) L(\theta \mid X, Y)$ for any possible outcome $Y$. We can also form the expected (marginal) pdf of $Y$ by assigning the weight $p_0(\theta)$ to each possible value of $p(Y \mid X, \theta)$

$$p(Y \mid X) = \int p(Y \mid X, \theta)\, p_0(\theta)\, d\theta$$ (10-10-1)

Using Eq. (4-16-1) we can evaluate the risk associated with using the value $\theta^*$ on the assumption that the outcome of the experiments to be performed will be some specific value $Y$

$$R(\theta^* \mid X, Y) = \int c(\theta^*, \theta)\, p^*(\theta \mid X, Y)\, d\theta$$ (10-10-2)

Once the outcome $Y$ becomes known, we shall of course choose $\theta^*$ so as to minimize the risk. We denote this minimum risk $R^*(X, Y)$

$$R^*(X, Y) \equiv \min_{\theta^*} R(\theta^* \mid X, Y)$$ (10-10-3)

We cannot yet evaluate $R^*$ because we have not measured $Y$. However, following Raiffa and Schlaifer (1961), we can find the expected value of $R^*$

† Contrary to previous practice we retain the arguments $X$ and $Y$ in the expression for the likelihood because the data have not yet been taken.
by averaging over all possible outcomes of the proposed experiments $X$, using $p(Y \mid X)$ as defined in Eq. (1)

$$R(X) \equiv \int R^*(X, Y)\, p(Y \mid X)\, dY$$ (10-10-4)

$R(X)$ is the expected risk associated with performing the experiments $X$. To this we add the cost of experimentation $d(X)$ to obtain the total expected cost of $X$

$$C(X) \equiv d(X) + R(X)$$ (10-10-5)

We shall perform the set of experiments $X$ for which $C(X)$ is minimum. Among the possible sets of experiments is the null set, i.e., no experiments at all. In this case $d(X) = 0$ and $p^*(\theta \mid X, Y) = p_0(\theta)$. Hence $R$ does not depend on $Y$, and

$$C = R = \min_{\theta^*} R(\theta^*) = \min_{\theta^*} \int c(\theta^*, \theta)\, p_0(\theta)\, d\theta$$

We now analyze the case in which both $p_0(\theta)$ and $p(Y \mid X, \theta)$ are normal. Let:

$$p_0(\theta) = N_l(\theta_0, V_0)$$ (10-10-6)

$$p(\textsc{y} \mid X, \theta) = N_{nm}[F(X, \theta), \Omega]$$ (10-10-7)

where, as usual, $\textsc{y}$ denotes the $nm$-dimensional vector obtained by adjoining to each other the $n$ rows of $Y$, and $\Omega$ is the joint covariance matrix of the errors in all projected experiments, usually given by Eq. (10-3-3). We now assume that the model equations $F(X, \theta)$ can be reasonably approximated in the region of interest by a first-order Taylor series expansion around $\theta = \theta_0$, i.e.,

$$F(X, \theta) \approx \textsc{y}_0 + B(\theta - \theta_0)$$ (10-10-8)

where $\textsc{y}_0 \equiv F(X, \theta_0)$ and $B \equiv (\partial F/\partial \theta)_{\theta = \theta_0}$. Note that $B$ is a function of $X$. Now Eq. (7) can be rewritten as

$$p(\textsc{y} \mid X, \theta) = N_{nm}[\textsc{y}_0 + B(\theta - \theta_0), \Omega]$$ (10-10-9)

We leave it as an exercise for the reader to show that both the posterior density of $\theta$ and the marginal density of $\textsc{y}$ are also normal. Specifically

$$p^*(\theta \mid X, \textsc{y}) = N_l(\bar{\theta}, V)$$ (10-10-10)

where

$$\bar{\theta} = \theta_0 + (V_0^{-1} + B^T \Omega^{-1} B)^{-1} B^T \Omega^{-1} (\textsc{y} - \textsc{y}_0)$$ (10-10-11)

$$V = (V_0^{-1} + B^T \Omega^{-1} B)^{-1}$$ (10-10-12)

and

$$p(\textsc{y} \mid X) = N_{nm}(\textsc{y}_0, \Pi)$$ (10-10-13)
where

$$\Pi = [\Omega^{-1} - \Omega^{-1} B V B^T \Omega^{-1}]^{-1}$$ (10-10-14)

The situation is particularly tractable if the loss function is quadratic, i.e., as in Eq. (4-16-6)

$$c(\theta^*, \theta) = (\theta^* - \theta)^T P (\theta^* - \theta)$$ (10-10-15)

where $P$ is a given positive definite (or at least semidefinite) matrix. As was shown in Section 4-16, this leads to an optimal choice of $\theta^* = \bar{\theta}$ (the mean of the posterior distribution). Then the minimum risk is the expected value of $(\theta - \bar{\theta})^T P (\theta - \bar{\theta})$ under the posterior distribution, i.e.,

$$R^*(X, Y) = E[(\theta - \bar{\theta})^T P (\theta - \bar{\theta})] = \mathrm{Tr}\, PV$$ (10-10-16)

A glance at Eq. (12) indicates that $V$ and hence $R^*$ are independent of $Y$, hence $R(X) = R^*(X, Y)$ and

$$C(X) = d(X) + \mathrm{Tr}\, P (V_0^{-1} + B^T \Omega^{-1} B)^{-1}$$ (10-10-17)

When Eq. (10-3-3) applies

$$C(X) = d(X) + \mathrm{Tr}\, P \Big(V_0^{-1} + \sum_{\mu=1}^{n} B_\mu^T V_\mu^{-1} B_\mu\Big)^{-1}$$ (10-10-18)

When minimizing $C(X)$ we can seek to find the optimal number of experiments as well as the conditions under which they are to be performed. If the experiments are to be performed in sequence, it is only necessary at any given time to find the optimal conditions for a single experiment $x_1$, and compare the associated cost $\min_{x_1} C(x_1)$ with the expected cost of performing no experiment at all (which is $\mathrm{Tr}\, PV_0$ when Eq. (18) holds, since with no experiments the sum in Eq. (18) is empty and $d = 0$). If the outcome is favorable to the additional experiment, we perform that experiment, replace $p_0(\theta)$ by $p^*(\theta)$, and repeat the procedure. The stopping rule is obvious: cease experimentation when the expected cost of no experiment falls below the minimum expected cost of the next experiment.

It must be admitted that while the procedures outlined above are conceptually simple and appealing, their implementation is difficult in most practical situations. While the minimization of Eq. (18) is no more difficult than the minimization of Eq. (10-3-2), almost any other loss function leads to severe computational difficulties which arise from the need to evaluate multiple infinite integrals for all possible values of $X$, $Y$, and $\theta$. To this must be added the difficulty of assigning realistic cost functions, a by no means trivial task.
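For the normal, linearized, quadratic-loss case, the comparison behind the stopping rule reduces to a few matrix operations. The following sketch evaluates Eq. (10-10-17) and compares it with the no-experiment cost $\mathrm{Tr}\,PV_0$; the function names are hypothetical, and the matrices `V0`, `B`, `Omega`, and `P` are assumed to be supplied by the user as NumPy arrays of compatible shapes.

```python
import numpy as np


def posterior_covariance(V0, B, Omega):
    """V = (V0^-1 + B^T Omega^-1 B)^-1, as in Eq. (10-10-12)."""
    information = np.linalg.inv(V0) + B.T @ np.linalg.solve(Omega, B)
    return np.linalg.inv(information)


def expected_total_cost(d, P, V0, B, Omega):
    """C(X) = d(X) + Tr P V, as in Eq. (10-10-17)."""
    return d + np.trace(P @ posterior_covariance(V0, B, Omega))


def worth_running(d, P, V0, B, Omega):
    """True if the proposed experiment costs less, in expectation,
    than stopping now (whose expected cost is Tr P V0)."""
    return expected_total_cost(d, P, V0, B, Omega) < np.trace(P @ V0)
```

As a small illustration, take two parameters with `V0 = P = np.eye(2)`, a single scalar observation with `B = np.array([[1.0, 0.0]])` and `Omega = np.array([[1.0]])`. The experiment halves the variance of the first parameter, so the risk drops from 2 to 1.5, and the experiment is worthwhile only if its cost `d` is below 0.5.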
10-11. Problems

1. Verify Eq. (10-10-10)-(10-10-12).

2. Using Eq. (10-2-2), Eq. (10-2-4), and Eq. (10-10-12), show that in the case of normal prior and error distributions and a linear model, the value of $I$ is positive; i.e., one gains information no matter what the outcome of the experiment. For more general results, see Lindley (1956).

3. Derive Eq. (10-5-6).

4. Derive a decision-theoretic design criterion for discriminating between alternative models. Assume that one is given prior probabilities $\pi^{(i)}$ that the $i$th model is correct, and that the loss function has the form $c_{ij}(\theta^*, \theta)$ which represents the cost of assuming that model $i$ holds with parameter values $\theta^*$, when in fact model $j$ holds with parameter values $\theta$.
Appendix A Matrix Analysis

A-1. Matrix Algebra

The reader unfamiliar with matrix notation may prefer to write out matrix expressions in full. But he will soon develop facility in manipulating matrices and will no longer need subscripts and summations. This will greatly enhance his insight and enjoyment of the subject.

Throughout the book, boldface normal size capital letters (both Latin and Greek) denote matrices, e.g.,

$$A = [A_{ij}] = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix}$$ (A-1-1)

is an $m \times n$ matrix. $A$ is square if $m = n$. A matrix all of whose elements are zero is denoted $0$ and is called the null matrix.

Boldface small capital letters denote column vectors obtained by adjoining to each other the rows of the corresponding matrix. Thus, if $A$ is defined by Eq. (1), then

$$\textsc{a} = [A_{11}, A_{12}, \ldots, A_{1n}, A_{21}, A_{22}, \ldots, A_{mn}]^T$$ (A-1-2)

Boldface lower case letters denote column vectors, e.g.,

$$a = [a_i] = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}$$ (A-1-3)
is an $m$-dimensional vector. A vector all of whose elements are zero is denoted $0$. All nonboldface characters are scalars. Capital or lower case nonboldface letters with subscripts may be elements of the corresponding matrix or vector. A subscripted boldface letter indicates one in a set of vectors or matrices.

The superscript $T$ denotes transposition. If $A$ is defined by Eq. (1), then

$$A^T = [A_{ji}] = \begin{bmatrix} A_{11} & A_{21} & \cdots & A_{m1} \\ A_{12} & A_{22} & \cdots & A_{m2} \\ \vdots & \vdots & & \vdots \\ A_{1n} & A_{2n} & \cdots & A_{mn} \end{bmatrix}$$ (A-1-4)

is an $n \times m$ matrix. A square matrix $A$ is symmetric if $A^T = A$; i.e., $A_{ij} = A_{ji}$ for all $i$ and $j$. If $a$ is defined by Eq. (3), then

$$a^T = [a_1, a_2, \ldots, a_m]$$ (A-1-5)

is an $m$-dimensional row vector.

If $A$ and $B$ are both $m \times n$, then $[A + B]_{ij} = A_{ij} + B_{ij}$.

We define the following matrix products:

(a) $A$ is $m \times n$ and $B$ is $n \times k$. Then $AB$ is the $m \times k$ matrix whose $i, j$ element is

$$[AB]_{ij} = \sum_{l=1}^{n} A_{il} B_{lj} \qquad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, k)$$ (A-1-6)

(b) $A$ is $m \times n$ and $b$ is $n$-dimensional. Then $Ab$ is the $m$-dimensional column vector whose $i$th element is

$$[Ab]_i = \sum_{l=1}^{n} A_{il} b_l \qquad (i = 1, 2, \ldots, m)$$ (A-1-7)

(c) $A$ is $m \times n$ and $b$ is $m$-dimensional. Then $b^T A$ is the $n$-dimensional row vector whose $i$th element is

$$[b^T A]_i = \sum_{l=1}^{m} b_l A_{li}$$ (A-1-8)

(d) $a$ and $b$ are $m$-dimensional. Then the inner product $a^T b = b^T a$ is the scalar

$$a^T b = \sum_{i=1}^{m} a_i b_i$$ (A-1-9)

The inner product of a vector with itself, i.e., $a^T a$, is the square of the length (also called norm) of $a$. We use the notation $\|a\|$ to designate the norm of $a$.
(e) $a$ is $m$-dimensional and $b$ is $n$-dimensional. Then the outer product $ab^T$ is the $m \times n$ matrix whose $i, j$ element is

$$[ab^T]_{ij} = a_i b_j \qquad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n)$$ (A-1-10)

If we regard an $m$-dimensional column vector as an $m \times 1$ matrix and a similar row vector as a $1 \times m$ matrix, then all the above products become special cases of (a).

From these definitions, one can work out the product of any number of terms. For example, the quantity $a^T A b$ is the scalar

$$a^T A b = \sum_{i,j} a_i A_{ij} b_j$$ (A-1-11)

which may be verified by applying Eq. (7) first, and then Eq. (9). This is permissible because matrix and vector products are associative, i.e.,

$$a^T A b = a^T (Ab) = (a^T A) b$$

Let $A$ be a square $m \times m$ matrix. The main diagonal of $A$ is the set of elements $A_{11}, A_{22}, \ldots, A_{mm}$. A diagonal matrix is one whose only nonzero elements are on the main diagonal. The identity matrix $I$ is a diagonal matrix, all of whose diagonal elements are unity, i.e.,

$$I = \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$ (A-1-12)

or

$$I_{ij} = \delta_{ij} \equiv \begin{cases} 1 & (i = j) \\ 0 & (i \neq j) \end{cases}$$ (A-1-13)

The symbol $\delta_{ij}$ is called the Kronecker delta. Clearly

$$IA = A, \qquad BI = B, \qquad Ia = a, \qquad b^T I = b^T$$

for any suitable matrices $A$ and $B$, and vectors $a$ and $b$.

If $A$ is a square matrix, then $A^{-1}$ designates a matrix (if one exists) such that

$$A^{-1} A = A A^{-1} = I$$ (A-1-14)

$A^{-1}$ is called the inverse of $A$. A matrix $A$ can possess at most one inverse. If $A$ has no inverse, it is said to be singular.

The following relations may be derived easily

$$(Ab)^T = b^T A^T, \qquad (AB)^T = B^T A^T, \qquad (AB)^{-1} = B^{-1} A^{-1}, \qquad (A^T)^{-1} = (A^{-1})^T$$ (A-1-15)
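The product definitions and the relations in Eq. (A-1-15) can be spot-checked numerically. The snippet below is only an illustration using NumPy; the matrices and vectors are arbitrary nonsingular examples chosen for this sketch, and `check_identities` is a hypothetical helper name.

```python
import numpy as np

# Arbitrary example data (both matrices are nonsingular).
A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]])
B = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 2.0]])
a = np.array([1.0, 2.0, 3.0])
b = np.array([-1.0, 0.5, 2.0])


def check_identities():
    """Compare matrix-notation results with their subscript definitions."""
    inv = np.linalg.inv
    return {
        # Eq. (A-1-11): a^T A b as an explicit double sum over i, j.
        "bilinear form": bool(np.isclose(a @ A @ b, np.einsum("i,ij,j->", a, A, b))),
        # Eq. (A-1-10): outer product element a_i * b_j.
        "outer product": bool(np.allclose(np.outer(a, b), a[:, None] * b[None, :])),
        # Eq. (A-1-15).
        "(AB)^T = B^T A^T": bool(np.allclose((A @ B).T, B.T @ A.T)),
        "(AB)^-1 = B^-1 A^-1": bool(np.allclose(inv(A @ B), inv(B) @ inv(A))),
        "(A^T)^-1 = (A^-1)^T": bool(np.allclose(inv(A.T), inv(A).T)),
    }
```

Calling `check_identities()` returns a dictionary mapping each relation to a Boolean, which makes it easy to see which identity fails if the example data are changed to something ill-conditioned.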
A nonzero vector $v$ is an eigenvector of the square matrix $A$, and $\lambda$ is the associated eigenvalue, if

$$Av = \lambda v$$ (A-1-16)

Vectors $a$ and $b$ are orthogonal to each other if $a^T b = 0$. If $A$ is symmetric $m \times m$, then one can find $m$ mutually orthogonal eigenvectors $v_1, v_2, \ldots, v_m$ of $A$. Usually, we normalize the vectors so that

$$v_i^T v_j = \delta_{ij} \qquad (i, j = 1, 2, \ldots, m)$$ (A-1-17)

The $v_i$ then form a set of orthonormal eigenvectors of $A$. Let $V$ be the $m \times m$ matrix whose $i$th column is $v_i$. In view of Eq. (17), we have $V^T V = V V^T = I$, i.e., $V^T = V^{-1}$. The matrix $V$ is said to be unitary.

If $Ax = 0$ ($x \neq 0$), then $x$ is called a null vector of $A$. If $A$ is square, then it can possess null vectors only if it is singular. A singular matrix has at least one zero eigenvalue.

Let $x$ be a vector and $A$ a symmetric matrix. The scalar $x^T A x$ may be regarded as a function of $x$. It is called the quadratic form associated with $A$. The matrix $A$ is positive definite if $x^T A x > 0$ for all $x \neq 0$, and positive semidefinite if $x^T A x \geq 0$ for all $x$. Negative definiteness is defined analogously. All eigenvalues of a positive definite or positive semidefinite matrix are positive or nonnegative, respectively.

The symbol $A^{-1}_{ij}$ is used to denote the $i, j$ element of $A^{-1}$, and not the reciprocal of $A_{ij}$.

If $A$ is a square nonsingular matrix and $y$ is a known vector, then the solution to the set of simultaneous linear equations

$$Ax = y$$ (A-1-18)

is given by

$$x = A^{-1} y$$ (A-1-19)

Suppose $A$ is any matrix, not necessarily square. Then there exists [see Penrose (1955)] a unique matrix $A^+$, called the pseudoinverse of $A$, satisfying the relations

$$A A^+ A = A, \qquad A^+ A A^+ = A^+, \qquad A^+ A = (A^+ A)^T, \qquad A A^+ = (A A^+)^T$$ (A-1-20)

If $A$ is square nonsingular, then $A^+ = A^{-1}$. If $A$ is $m \times n$, then $A^+$ is $n \times m$. If the equations $Ax = y$ have a solution, then $x = A^+ y$ is the solution of minimum length.
If $Ax = y$ has no solution, then $x = A^+ y$ minimizes the sum of squares of the deviations $y - Ax$; and of all vectors having this property, $x = A^+ y$ has minimum length.
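The pseudoinverse properties above can be illustrated with a small singular matrix. The sketch uses NumPy's `numpy.linalg.pinv`; the rank-one matrix and the two right-hand sides are arbitrary choices made for this example.

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 4.0]])   # rank 1, hence singular
A_plus = np.linalg.pinv(A)

# The four Penrose relations of Eq. (A-1-20).
penrose_ok = bool(
    np.allclose(A @ A_plus @ A, A)
    and np.allclose(A_plus @ A @ A_plus, A_plus)
    and np.allclose(A_plus @ A, (A_plus @ A).T)
    and np.allclose(A @ A_plus, (A @ A_plus).T)
)

# Consistent system: every x with x1 + 2*x2 = 1 solves A x = y;
# A+ y picks the solution of minimum length.
y_consistent = np.array([1.0, 2.0])
x_min_length = A_plus @ y_consistent

# Inconsistent system: A+ y is the minimum-length least squares solution.
y_inconsistent = np.array([1.0, 1.0])
x_least_squares = A_plus @ y_inconsistent
```

For this matrix the minimum-length solution of the consistent system is $(0.2, 0.4)$, which is the point on the line $x_1 + 2x_2 = 1$ closest to the origin.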
The trace of an $m \times m$ matrix $A$ is the scalar

$$\mathrm{Tr}(A) \equiv \sum_{i=1}^{m} A_{ii}$$ (A-1-21)

The trace of a matrix is equal to the sum of its eigenvalues and the determinant of a matrix is equal to the product of its eigenvalues. One verifies easily that

$$\mathrm{Tr}(AB) = \sum_{i,j} A_{ij} B_{ji} = \mathrm{Tr}(BA)$$

Hence

$$\mathrm{Tr}(ab^T) = b^T a = a^T b$$

and

$$a^T A a = \mathrm{Tr}(A a a^T)$$

If $I$ is the $m \times m$ identity matrix, then $\mathrm{Tr}(I) = m$ and $\det(I) = 1$.

Let $A$ be the $m \times n$ matrix defined by Eq. (1). Suppose $k$ and $l$ are positive integers satisfying $k < m$ and $l < n$. Define the following matrices:

$$B \equiv \begin{bmatrix} A_{11} & \cdots & A_{1l} \\ \vdots & & \vdots \\ A_{k1} & \cdots & A_{kl} \end{bmatrix}, \qquad C \equiv \begin{bmatrix} A_{1,l+1} & \cdots & A_{1n} \\ \vdots & & \vdots \\ A_{k,l+1} & \cdots & A_{kn} \end{bmatrix}$$

$$D \equiv \begin{bmatrix} A_{k+1,1} & \cdots & A_{k+1,l} \\ \vdots & & \vdots \\ A_{m1} & \cdots & A_{ml} \end{bmatrix}, \qquad E \equiv \begin{bmatrix} A_{k+1,l+1} & \cdots & A_{k+1,n} \\ \vdots & & \vdots \\ A_{m,l+1} & \cdots & A_{mn} \end{bmatrix}$$ (A-1-22)

We write the matrix $A$ in partitioned form as

$$A = \begin{bmatrix} B & C \\ D & E \end{bmatrix}$$ (A-1-23)
Matrices in partitioned form may be multiplied as though the submatrices were elements, provided the resulting expressions make sense. For instance, let $x$ be an $n$-dimensional vector partitioned as follows

$$x = \begin{bmatrix} a \\ b \end{bmatrix}$$ (A-1-24)

where

$$a = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_l \end{bmatrix}, \qquad b = \begin{bmatrix} x_{l+1} \\ x_{l+2} \\ \vdots \\ x_n \end{bmatrix}$$ (A-1-25)

Then one may easily verify that

$$Ax = \begin{bmatrix} B & C \\ D & E \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} Ba + Cb \\ Da + Eb \end{bmatrix}$$ (A-1-26)

Note that this makes sense only if $x$ is partitioned so that the dimension of $a$ equals the number of columns in $B$ and $D$. The partitioning of a matrix into more than four submatrices proceeds analogously.

The rank of a matrix is the maximum number of linearly independent columns or rows in the matrix (it makes no difference whether we take rows or columns). A nonzero vector has rank 1. The rank of a square matrix equals the number of nonzero eigenvalues. We have:

$$\mathrm{rank}(A + B) \leq \mathrm{rank}\, A + \mathrm{rank}\, B$$ (A-1-27)

$$\mathrm{rank}(AB) \leq \min(\mathrm{rank}\, A, \mathrm{rank}\, B)$$ (A-1-28)

It follows that

$$\mathrm{rank}(ab^T) = \begin{cases} 1 & (a \neq 0 \neq b) \\ 0 & (a = 0 \text{ or } b = 0) \end{cases}$$ (A-1-29)

and

$$\mathrm{rank} \sum_{i=1}^{n} a_i b_i^T \leq n$$ (A-1-30)

A matrix whose rank equals the number of rows or columns (whichever is less) is said to be of full rank. A square matrix of full rank is nonsingular, and vice versa.

A matrix of the form $A = aa^T$ is positive semidefinite, because for every vector $x$

$$x^T A x = (x^T a)^2 \geq 0$$
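Partitioned multiplication, the rank of an outer product, and the semidefiniteness of $aa^T$ can all be confirmed on a small example. The sketch below uses NumPy; the matrix, the split point ($k = l = 2$), and the vectors are arbitrary values chosen for illustration.

```python
import numpy as np

A = np.arange(1.0, 13.0).reshape(3, 4)     # m = 3, n = 4
k, l = 2, 2                                # partition after row k and column l
B, C = A[:k, :l], A[:k, l:]
D, E = A[k:, :l], A[k:, l:]

x = np.array([1.0, -1.0, 2.0, 0.5])
a, b = x[:l], x[l:]                        # dimension of a matches columns of B, D

# Eq. (A-1-26): multiply block by block and stack the two results.
blockwise = np.concatenate([B @ a + C @ b, D @ a + E @ b])
blocks_ok = bool(np.allclose(blockwise, A @ x))

# Eq. (A-1-29): an outer product has rank 1 unless a factor is zero.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, -1.0])
rank_outer = int(np.linalg.matrix_rank(np.outer(u, v)))
rank_zero = int(np.linalg.matrix_rank(np.outer(u, np.zeros(3))))

# u u^T is positive semidefinite: no eigenvalue is negative (up to roundoff).
psd_ok = bool(np.all(np.linalg.eigvalsh(np.outer(u, u)) >= -1e-12))
```

Slicing with `A[:k, :l]` and its companions returns views of the same array, so the partition costs no extra memory; the check would fail with a shape error if `x` were split at a position other than `l`, which is the compatibility condition noted after Eq. (A-1-26).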
The sum of positive semidefinite matrices is positive semidefinite. Hence $\sum_{i=1}^{n} a_i a_i^T$ is positive semidefinite. If $A$ is positive semidefinite, then so is $B^T A B$ where $B$ is any matrix or vector.

Let $A$ be a square matrix. Suppose $\lambda_{\min}$ and $\lambda_{\max}$ are the eigenvalues of $A$ with smallest and largest absolute values, respectively. Then for any vector $b$:

$$|\lambda_{\min}|\, \|b\| \leq \|Ab\| \leq |\lambda_{\max}|\, \|b\|$$ (A-1-31)

$$|\lambda_{\min}|\, \|b\|^2 \leq |b^T A b| \leq |\lambda_{\max}|\, \|b\|^2$$ (A-1-32)

If $A$ and $B$ are $m \times n$ and $n \times m$ matrices, respectively, then (Wilkinson, 1965, p. 54)

$$\det(I_m + AB) = \det(I_n + BA)$$ (A-1-33)

A-2. Matrix Differentiation

Let $\alpha$ be a scalar function of a vector $a$ and a matrix $A$; let $b$ be a vector function of a scalar $\beta$ and a vector $c$, and let $C$ be a matrix function of a scalar $\gamma$. Table A-1 lists the various derivatives that may be formed. Derivatives of vectors with respect to matrices, and matrices with respect to vectors and matrices, require more than two subscripts. They cannot, therefore, be represented in matrix notation. On the rare occasion when they are needed, subscript notation will be used.

Table A-1 Matrix Derivatives

  The symbol                    is a                whose elements are
  $\partial\alpha/\partial a$   column vector (a)   $(\partial\alpha/\partial a)_i \equiv \partial\alpha/\partial a_i$
  $\partial\alpha/\partial A$   matrix              $(\partial\alpha/\partial A)_{ij} \equiv \partial\alpha/\partial A_{ij}$
  $\partial b/\partial\beta$    column vector       $(\partial b/\partial\beta)_i \equiv \partial b_i/\partial\beta$
  $\partial b/\partial c$       matrix              $(\partial b/\partial c)_{ij} \equiv \partial b_i/\partial c_j$
  $\partial C/\partial\gamma$   matrix              $(\partial C/\partial\gamma)_{ij} \equiv \partial C_{ij}/\partial\gamma$

  (a) A case may be made for defining $\partial\alpha/\partial a$ as a row vector, but we prefer to regard all vectors that do not carry the symbol $T$ as column vectors.

To differentiate a product of vectors and matrices with respect to one term, we proceed as follows (assume we are computing $\partial\alpha/\partial A$):

1. Write the expression out in terms of subscripts and summations. Do not use the symbols $i$ and $j$ as subscripts.
2. Suppose the term $A_{kl}$ appears in the summation. Remove this term, replace the remaining appearances of subscripts $k$ and $l$ with $i$ and $j$, respectively, and remove summations with respect to $k$ and $l$. The result is the derivative with respect to $A_{ij}$.

3. Reorder the expression so that the term containing $i$ appears first and the term containing $j$ appears last. Reorder the other terms so that any two occurrences of other indices are in consecutive terms. It may happen that some of the terms are left over. These terms can be grouped to form a scalar, which can be placed in front of the remaining matrix expression, as in example (e) below.

4. Drop all summations and indices. Add $T$ symbols where necessary.

Examples

(a) $\alpha = a^T A b$.
  1. $\alpha = \sum_{k,l} a_k A_{kl} b_l$.
  2. $\partial\alpha/\partial A_{ij} = a_i b_j$.
  3. $\partial\alpha/\partial A_{ij} = a_i b_j$.
  4. $\partial\alpha/\partial A = ab^T$.

(b) $\alpha = \mathrm{Tr}(B A^T C)$.
  1. $\alpha = \sum_{k,l,m} B_{ml} A_{kl} C_{km}$.
  2. $\partial\alpha/\partial A_{ij} = \sum_m B_{mj} C_{im}$.
  3. $\partial\alpha/\partial A_{ij} = \sum_m C_{im} B_{mj}$.
  4. $\partial\alpha/\partial A = CB$.

If the matrix $A$ appears more than once, each appearance should be treated separately and the results added.

Example

(c) $\alpha = a^T A B A^T b$.
  1. $\alpha = \sum_{k,l,m,n} a_k A_{kl} B_{lm} A_{nm} b_n$.
  2. $\partial\alpha/\partial A_{ij} = \sum_{m,n} a_i B_{jm} A_{nm} b_n + \sum_{k,l} a_k A_{kl} B_{lj} b_i$.
  3. $\partial\alpha/\partial A_{ij} = \sum_{m,n} a_i b_n A_{nm} B_{jm} + \sum_{k,l} b_i a_k A_{kl} B_{lj}$.
  4. $\partial\alpha/\partial A = ab^T A B^T + ba^T A B$.

The handling of other derivatives is analogous.

Examples

(d) Compute $\partial\alpha/\partial a$, where $\alpha = a^T A a$.
  1. $\alpha = \sum_{k,l} a_k A_{kl} a_l$.
  2. $\partial\alpha/\partial a_i = \sum_l A_{il} a_l + \sum_k a_k A_{ki}$.
  3. $\partial\alpha/\partial a_i = \sum_l A_{il} a_l + \sum_k A_{ki} a_k$.
  4. $\partial\alpha/\partial a = Aa + A^T a$.
  If $A$ is symmetric, $\partial\alpha/\partial a = 2Aa$.
(e) Compute $\partial b/\partial c$, where $b = A c\, a^T B c$.
  1. $b_i = \sum_{k,l,m} A_{ik} c_k a_l B_{lm} c_m$.
  2. $\partial b_i/\partial c_j = \sum_{l,m} A_{ij} a_l B_{lm} c_m + \sum_{k,l} A_{ik} c_k a_l B_{lj}$.
  3. $\partial b_i/\partial c_j = \sum_{l,m} (a_l B_{lm} c_m) A_{ij} + \sum_{k,l} A_{ik} c_k a_l B_{lj}$.
  4. $\partial b/\partial c = (a^T B c) A + A c a^T B$.
  (Note that the term $\sum_{l,m} a_l B_{lm} c_m = a^T B c$ is a scalar and can be placed anywhere in a product.)

We shall also need the following derivatives:

(f) We wish to compute $\partial A^{-1}_{kl}/\partial A_{ij}$, where $A^{-1}_{kl}$ is the $k, l$ element of $A^{-1}$. By definition

$$\partial A^{-1}_{kl}/\partial A_{ij} = \lim_{\epsilon \to 0} (1/\epsilon)\left[(A + \epsilon B)^{-1} - A^{-1}\right]_{kl}$$ (A-2-1)

where $B$ is a matrix whose $m, n$ element is $\delta_{mi}\delta_{nj}$; i.e., the $i, j$ element is unity, and all other elements are zero. Now

$$(A + \epsilon B)^{-1} = [A(I + \epsilon A^{-1} B)]^{-1} = (I + \epsilon A^{-1} B)^{-1} A^{-1}$$ (A-2-2)

For sufficiently small $\epsilon$ the following series expansion is valid

$$(I + \epsilon A^{-1} B)^{-1} A^{-1} = (I - \epsilon A^{-1} B + \epsilon^2 A^{-1} B A^{-1} B - \cdots) A^{-1} = A^{-1} - \epsilon A^{-1} B A^{-1} + \epsilon^2 A^{-1} B A^{-1} B A^{-1} - \cdots$$ (A-2-3)

and we can prove easily that

$$\lim_{\epsilon \to 0} (1/\epsilon)\left[(A + \epsilon B)^{-1} - A^{-1}\right] = -A^{-1} B A^{-1}$$ (A-2-4)

Therefore

$$\partial A^{-1}_{kl}/\partial A_{ij} = -[A^{-1} B A^{-1}]_{kl} = -\sum_{m,n} A^{-1}_{km} B_{mn} A^{-1}_{nl} = -\sum_{m,n} A^{-1}_{km} \delta_{mi} \delta_{nj} A^{-1}_{nl} = -A^{-1}_{ki} A^{-1}_{jl}$$ (A-2-5)

which is the desired result.

(g) Now we can evaluate, for example, $\partial\alpha/\partial A$ where $\alpha \equiv x^T A^{-1} x$. Indeed:

$$\alpha = \sum_{k,l} x_k A^{-1}_{kl} x_l$$

$$\partial\alpha/\partial A_{ij} = \sum_{k,l} x_k (\partial A^{-1}_{kl}/\partial A_{ij}) x_l = -\sum_{k,l} x_k A^{-1}_{ki} A^{-1}_{jl} x_l = -\sum_{k,l} A^{-1}_{ki} x_k x_l A^{-1}_{jl}$$ (A-2-6)

so that

$$\partial(x^T A^{-1} x)/\partial A = -(A^{-1})^T x x^T (A^{-1})^T$$ (A-2-7)
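Formulas obtained by the four-step rule are easy to get wrong by a transpose, so a finite-difference check is a useful safeguard. The sketch below compares example (a) and Eq. (A-2-7) with central differences; `numerical_gradient` is a hypothetical helper, and the matrix and vectors are arbitrary (the matrix is deliberately nonsymmetric so that a misplaced transpose would show up).

```python
import numpy as np


def numerical_gradient(f, A, h=1e-6):
    """Central-difference estimate of the matrix of d f / d A_ij."""
    A = np.asarray(A, dtype=float)
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            step = np.zeros_like(A)
            step[i, j] = h
            grad[i, j] = (f(A + step) - f(A - step)) / (2.0 * h)
    return grad


A = np.array([[2.0, 1.0], [0.5, 3.0]])    # nonsymmetric and nonsingular
a = np.array([1.0, -2.0])
b = np.array([0.5, 4.0])
x = np.array([1.0, 2.0])

# Example (a): d(a^T A b)/dA = a b^T.
grad_bilinear = numerical_gradient(lambda M: a @ M @ b, A)
bilinear_ok = bool(np.allclose(grad_bilinear, np.outer(a, b), atol=1e-6))

# Eq. (A-2-7): d(x^T A^-1 x)/dA = -(A^-1)^T x x^T (A^-1)^T.
A_inv_T = np.linalg.inv(A).T
grad_inverse_form = numerical_gradient(lambda M: x @ np.linalg.solve(M, x), A)
inverse_form_ok = bool(
    np.allclose(grad_inverse_form, -A_inv_T @ np.outer(x, x) @ A_inv_T, atol=1e-6)
)
```

The same helper can be pointed at any scalar function of a matrix, for instance to check examples (b) and (c), at the cost of two function evaluations per matrix element.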
(h) Let $\alpha = \det A$. We wish to evaluate $\partial\alpha/\partial A$. Let us expand the determinant in cofactors of the $i$th row, i.e.,

$$\det A = \sum_k A_{ik} \bar{A}_{ik}$$ (A-2-8)

where $\bar{A}_{ik}$ is the cofactor of $A_{ik}$. $\bar{A}_{ik}$ does not depend on any of the elements in the $i$th row. Therefore

$$\partial \det A/\partial A_{ik} = \bar{A}_{ik}$$ (A-2-9)

As is well known

$$A^{-1}_{ki} = \bar{A}_{ik}/\det A$$ (A-2-10)

Hence, $\bar{A}_{ik} = A^{-1}_{ki} \det A$ and

$$\partial \det A/\partial A = (A^{-1})^T \det A$$ (A-2-11)

Furthermore

$$\partial \log \det A/\partial A = (1/\det A)\, \partial \det A/\partial A = (A^{-1})^T$$ (A-2-12)

A-3. Pivoting and Sweeping

Many computations involving matrices may be viewed as a sequence of operations called pivoting. It is useful to examine the pivoting operation in detail, and list some of its applications. In the sequel we always assume that we start with a given matrix $B$ which is progressively modified by successive pivotings. Unless otherwise stated, whenever we refer to $B$ or to any of its elements we mean the current, rather than the original values.

Definition Suppose $B_{ij} \neq 0$ for some pair of indices $i, j$. Then performing a Gauss-Jordan pivot, or simply pivoting on $(i, j)$, means changing the elements of $B$ according to the following scheme:

1. Replace $B_{pq}$ by $B_{pq} - B_{iq} B_{pj}/B_{ij}$ for all $p \neq i$, $q \neq j$.
2. Replace $B_{iq}$ by $B_{iq}/B_{ij}$ for all $q \neq j$.
3. Replace $B_{pj}$ by $-B_{pj}/B_{ij}$ for all $p \neq i$.
4. Replace $B_{ij}$ by $1/B_{ij}$.

The element $B_{ij}$ (before pivoting) is referred to as the pivot. Pivoting on $(i, i)$, i.e., with a pivot on the main diagonal, is referred to as sweeping (Beaton, 1964) row $i$. Two pivots are unrelated if they differ in both row and column, i.e., $B_{ij}$ and $B_{kl}$ are unrelated if $i \neq k$ and $j \neq l$.

The following properties are easily verified:

1. Pivoting is reversible, i.e., pivoting on $(i, j)$ twice restores the original matrix.
2. Pivoting on unrelated pivots is commutative, i.e., pivoting first on $(i, j)$ and then on $(k, l)$ produces the same matrix as pivoting first on $(k, l)$ and then on $(i, j)$, provided $i \neq k$ and $j \neq l$. Since different elements on the main diagonal are unrelated, it follows that sweeps are always commutative.
3. From 1 and 2 we deduce that pivoting in sequence on $(i, j)$, $(k, l)$, and $(i, j)$ is equivalent to pivoting on $(k, l)$ alone if $(i, j)$ and $(k, l)$ are unrelated.

The following applications will motivate the definition of pivoting:

(a) Exchange of Variables. Suppose $\mathbf{B}$ is $m \times n$, $\mathbf{x}$ is an $n$-vector, and $\mathbf{y}$ is an $m$-vector satisfying

$$\mathbf{y} + \mathbf{B}\mathbf{x} = \mathbf{0} \tag{A-3-1}$$

The elements of $\mathbf{x}$ and $\mathbf{y}$ may be regarded as independent and dependent variables, respectively. Suppose we wish to interchange the roles of, say, $x_1$ and $y_1$. That is, we wish to express the variables $x_1, y_2, y_3, \ldots, y_m$ as functions of $y_1, x_2, x_3, \ldots, x_n$. The first row of Eq. (1) reads

$$y_1 + B_{11}x_1 + B_{12}x_2 + \cdots = 0 \tag{A-3-2}$$

If $B_{11} \neq 0$, then this is equivalent to

$$x_1 + B_{11}^{-1}y_1 + B_{11}^{-1}B_{12}x_2 + \cdots = 0 \tag{A-3-3}$$

Solving for $x_1$ and substituting in the $i$th row of Eq. (1) we find, after collecting terms

$$y_i - B_{i1}B_{11}^{-1}y_1 + (B_{i2} - B_{i1}B_{12}/B_{11})x_2 + \cdots = 0 \tag{A-3-4}$$

Consider the following tableau as a schematic representation of Eq. (1)

$$\begin{array}{c|cccc}
 & x_1 & x_2 & \cdots & x_n \\ \hline
y_1 & B_{11} & B_{12} & \cdots & B_{1n} \\
y_2 & B_{21} & B_{22} & \cdots & B_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
y_m & B_{m1} & B_{m2} & \cdots & B_{mn}
\end{array} \tag{A-3-5}$$

Then, after exchanging $x_1$ with $y_1$ we can represent Eq. (3) and Eq. (4) in a new tableau

$$\begin{array}{c|cccc}
 & y_1 & x_2 & \cdots & x_n \\ \hline
x_1 & 1/B_{11} & B_{12}/B_{11} & \cdots & B_{1n}/B_{11} \\
y_2 & -B_{21}/B_{11} & B_{22} - B_{21}B_{12}/B_{11} & \cdots & B_{2n} - B_{21}B_{1n}/B_{11} \\
\vdots & \vdots & \vdots & & \vdots \\
y_m & -B_{m1}/B_{11} & B_{m2} - B_{m1}B_{12}/B_{11} & \cdots & B_{mn} - B_{m1}B_{1n}/B_{11}
\end{array} \tag{A-3-6}$$
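It is evident that the elements of $\mathbf{B}$ have been transformed as by pivoting on $(1, 1)$. Generally, exchanging $y_i$ for $x_j$ is accomplished by pivoting on $(i, j)$.

(b) Partial Elimination. Instead of interchanging just one pair of variables, we may wish to interchange several. Let the equations of Eq. (1) be partitioned as follows

$$\mathbf{y}_1 + \mathbf{B}_{11}\mathbf{x}_1 + \mathbf{B}_{12}\mathbf{x}_2 = \mathbf{0}, \qquad \mathbf{y}_2 + \mathbf{B}_{21}\mathbf{x}_1 + \mathbf{B}_{22}\mathbf{x}_2 = \mathbf{0} \tag{A-3-7}$$

The corresponding tableau is

$$\begin{array}{c|cc}
 & \mathbf{x}_1^T & \mathbf{x}_2^T \\ \hline
\mathbf{y}_1 & \mathbf{B}_{11} & \mathbf{B}_{12} \\
\mathbf{y}_2 & \mathbf{B}_{21} & \mathbf{B}_{22}
\end{array} \tag{A-3-8}$$

Let $\mathbf{B}_{11}$ be a $k \times k$ nonsingular submatrix of $\mathbf{B}$. Then we can solve the first $k$ equations in Eq. (7) for $\mathbf{x}_1$, and substitute in the remaining equations to obtain

$$\mathbf{x}_1 + \mathbf{B}_{11}^{-1}\mathbf{y}_1 + \mathbf{B}_{11}^{-1}\mathbf{B}_{12}\mathbf{x}_2 = \mathbf{0}, \qquad \mathbf{y}_2 - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{y}_1 + (\mathbf{B}_{22} - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{B}_{12})\mathbf{x}_2 = \mathbf{0} \tag{A-3-9}$$

Suppose it is possible to exchange, in sequence, $y_1$ for $x_1$, $y_2$ for $x_2$, ..., $y_k$ for $x_k$. The result is the same as exchanging the entire vector $\mathbf{y}_1$ for $\mathbf{x}_1$. According to Eq. (9), then, sweeping (if possible) rows $1, 2, \ldots, k$ of tableau Eq. (8) produces

$$\begin{array}{c|cc}
 & \mathbf{y}_1^T & \mathbf{x}_2^T \\ \hline
\mathbf{x}_1 & \mathbf{B}_{11}^{-1} & \mathbf{B}_{11}^{-1}\mathbf{B}_{12} \\
\mathbf{y}_2 & -\mathbf{B}_{21}\mathbf{B}_{11}^{-1} & \mathbf{B}_{22} - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{B}_{12}
\end{array} \tag{A-3-10}$$

This property is used in the projection method (Sections 6-2 and 6-3). It can be shown that if $\mathbf{B}_{11}$ is positive definite, then the required sweeps can always be executed, i.e., no $B_{ii}$ ever turns zero.

(c) Matrix Inversion. When $\mathbf{B}_{11}$ is the entire matrix $\mathbf{B}$, then sweeping all rows transforms $\mathbf{B}$ into $\mathbf{B}^{-1}$, since $\mathbf{y} + \mathbf{B}\mathbf{x} = \mathbf{0}$ is changed into $\mathbf{x} + \mathbf{B}^{-1}\mathbf{y} = \mathbf{0}$. This procedure cannot be carried out if zero diagonal elements are encountered. For instance [ ]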
cannot be swept though it is nonsingular. However, we can always proceed as follows:

1. Write out the tableau Eq. (5). The $y_i$ and $x_j$ are symbolic headings, whereas the $B_{ij}$ are numerical values.
2. Among all the elements whose rows are headed by a $y_i$ and whose columns are headed by an $x_j$, find the one, say $B_{pq}$, with largest absolute value. If no such elements exist, proceed to step 5. Otherwise:
3. If $B_{pq} = 0$ the matrix $\mathbf{B}$ is singular, and the process is terminated. Otherwise:
4. Pivot on $(p, q)$ and interchange the headings $y_p$ and $x_q$. Return to step 2.
5. Rearrange the rows so that their headings appear in the natural order $x_1, x_2, \ldots, x_m$.
6. Rearrange the columns so that their headings appear in the natural order $y_1, y_2, \ldots, y_m$. Our tableau now contains $\mathbf{B}^{-1}$.

(d) Simultaneous Linear Equations. We wish to solve for $\mathbf{x}$ the set of simultaneous equations

$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{A-3-11}$$

where $\mathbf{A}$ is $m \times m$. Let us define $\mathbf{B}$ as the $m \times (m + 1)$ matrix $[\mathbf{A}, \mathbf{b}]$ and let us apply to $\mathbf{B}$ the algorithm of the preceding section, except that no pivots are allowed in the last column. If $\mathbf{A}$ is nonsingular, one ends up with the matrix $[\mathbf{A}^{-1}, \mathbf{A}^{-1}\mathbf{b}]$, i.e., the solution $\mathbf{x}$ is found in the last column. If one is only interested in $\mathbf{x}$, then step 6 may be omitted. Also, in step 5 only the elements of the last column need be rearranged. This method of solving equations is known as Gauss-Jordan elimination. Ordinary Gaussian elimination is faster, but Gauss-Jordan elimination is very convenient and economical in storage space when the inverse too is desired.

If $\mathbf{A}$ is singular, the process terminates in step 3 with all eligible pivots equal to zero.
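The pivoting rules and the six-step inversion procedure of (c) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's program: the function names `pivot` and `invert`, the zero-based indices, and the representation of the tableau headings as `("y", k)` / `("x", k)` tuples are all choices made for this example.

```python
def pivot(B, i, j):
    """Gauss-Jordan pivot on (i, j), in place, following rules 1-4 of the definition."""
    piv = B[i][j]
    if piv == 0:
        raise ZeroDivisionError("pivot element is zero")
    rows, cols = len(B), len(B[0])
    # Rule 1: elements outside the pivot row and column (uses the old row i, column j)
    for p in range(rows):
        for q in range(cols):
            if p != i and q != j:
                B[p][q] -= B[i][q] * B[p][j] / piv
    # Rule 2: pivot row
    for q in range(cols):
        if q != j:
            B[i][q] /= piv
    # Rule 3: pivot column (note the sign change)
    for p in range(rows):
        if p != i:
            B[p][j] = -B[p][j] / piv
    # Rule 4: the pivot itself
    B[i][j] = 1.0 / piv


def invert(A):
    """Invert a square matrix by steps 1-6 of (c), choosing the largest eligible pivot."""
    m = len(A)
    B = [[float(v) for v in row] for row in A]
    row_head = [("y", k) for k in range(m)]  # step 1: symbolic headings
    col_head = [("x", k) for k in range(m)]
    while True:
        # Step 2: eligible pivots sit in a y-headed row and an x-headed column
        eligible = [(abs(B[p][q]), p, q)
                    for p in range(m) for q in range(m)
                    if row_head[p][0] == "y" and col_head[q][0] == "x"]
        if not eligible:
            break
        biggest, p, q = max(eligible)
        if biggest == 0:  # step 3
            raise ValueError("matrix is singular")
        pivot(B, p, q)    # step 4
        row_head[p], col_head[q] = col_head[q], row_head[p]
    # Steps 5 and 6: restore the natural order of the row and column headings
    row_of = {head[1]: r for r, head in enumerate(row_head)}
    col_of = {head[1]: c for c, head in enumerate(col_head)}
    return [[B[row_of[r]][col_of[c]] for c in range(m)] for r in range(m)]


A = [[2.0, 1.0], [1.0, 3.0]]
A_inv = invert(A)  # approximately [[0.6, -0.2], [-0.2, 0.4]]
```

Because the pivot search is not restricted to the diagonal, a permutation matrix such as `[[0, 1], [1, 0]]`, whose diagonal elements are all zero, is inverted without difficulty.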
Let us partition $\mathbf{x}$ and $\mathbf{y}$ into vectors $\mathbf{x}_1, \mathbf{x}_2$ and $\mathbf{y}_1, \mathbf{y}_2$, respectively, where subscript 1 refers to those elements which have been exchanged, and subscript 2 to those elements which have not been exchanged. For instance, $\mathbf{x}_2$ consists of elements of $\mathbf{x}$ which appear as column headings in the final tableau. The final tableau takes the form (we have rearranged rows and columns suitably)

$$\begin{array}{c|ccc}
 & \mathbf{y}_1^T & \mathbf{x}_2^T & 1 \\ \hline
\mathbf{x}_1 & \mathbf{C}_{11} & \mathbf{C}_{12} & \mathbf{c}_1 \\
\mathbf{y}_2 & \mathbf{C}_{21} & \mathbf{C}_{22} = \mathbf{0} & \mathbf{c}_2
\end{array} \tag{A-3-12}$$
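The matrix $\mathbf{C}_{22}$ must vanish, for otherwise we could have continued pivoting. Let the partitioning of Eq. (11) that corresponds to the partitioning of $\mathbf{x}$ and $\mathbf{y}$ be

$$\mathbf{A}_{11}\mathbf{x}_1 + \mathbf{A}_{12}\mathbf{x}_2 = \mathbf{b}_1, \qquad \mathbf{A}_{21}\mathbf{x}_1 + \mathbf{A}_{22}\mathbf{x}_2 = \mathbf{b}_2 \tag{A-3-13}$$

Then, from Eq. (10) and Eq. (12) we must have:

$$\begin{aligned}
\mathbf{C}_{11} &= \mathbf{A}_{11}^{-1}, & \mathbf{C}_{21} &= -\mathbf{A}_{21}\mathbf{A}_{11}^{-1}, & \mathbf{C}_{12} &= \mathbf{A}_{11}^{-1}\mathbf{A}_{12}, \\
\mathbf{C}_{22} &= \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} = \mathbf{0}, & \mathbf{c}_1 &= \mathbf{A}_{11}^{-1}\mathbf{b}_1, & \mathbf{c}_2 &= \mathbf{b}_2 - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{b}_1
\end{aligned} \tag{A-3-14}$$

Now, if we eliminate $\mathbf{x}_1$ directly from Eq. (13) we find

$$\mathbf{x}_1 = \mathbf{A}_{11}^{-1}\mathbf{b}_1 - \mathbf{A}_{11}^{-1}\mathbf{A}_{12}\mathbf{x}_2, \qquad (\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12})\mathbf{x}_2 = \mathbf{b}_2 - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{b}_1 \tag{A-3-15}$$

which, in view of Eq. (14), can be written as

$$\mathbf{x}_1 = \mathbf{c}_1 - \mathbf{C}_{12}\mathbf{x}_2, \qquad \mathbf{0}\,\mathbf{x}_2 = \mathbf{c}_2 \tag{A-3-16}$$

From this we deduce the following:

1. If $\mathbf{c}_2 \neq \mathbf{0}$ then the equations $\mathbf{A}\mathbf{x} = \mathbf{b}$ have no solution. Note that $\mathbf{c}_2$ is the set of elements in the last column which belong to rows with $y$ headings.
2. If $\mathbf{c}_2 = \mathbf{0}$, then the equations $\mathbf{A}\mathbf{x} = \mathbf{b}$ have infinitely many solutions. These can be obtained by assigning arbitrary values to $\mathbf{x}_2$ and letting $\mathbf{x}_1 = \mathbf{c}_1 - \mathbf{C}_{12}\mathbf{x}_2$.

(e) Rank of Matrix and Linear Independence of Vectors. Let $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$ be a set of $n$-vectors, and let $\mathbf{A}$ be the $m \times n$ matrix whose $i$th row is $\mathbf{a}_i^T$. We wish to determine the rank of $\mathbf{A}$, or what is the same, the number of linearly independent $\mathbf{a}_i$. We write down $\mathbf{A}$ in tableau form, and proceed to apply the algorithm of (c) above. The number of pivots executed before the process had to be halted equals the rank of the matrix. Referring to Eq. (14), the condition $\mathbf{C}_{22} = \mathbf{0}$ may be rewritten as $\mathbf{A}_{22} = -\mathbf{C}_{21}\mathbf{A}_{12} = \mathbf{A}_{21}\mathbf{C}_{12}$. But also $\mathbf{A}_{21} = -\mathbf{C}_{21}\mathbf{A}_{11}$ and $\mathbf{A}_{12} = \mathbf{A}_{11}\mathbf{C}_{12}$. Combining these, we find:

$$\begin{bmatrix}\mathbf{A}_{12} \\ \mathbf{A}_{22}\end{bmatrix} = \begin{bmatrix}\mathbf{A}_{11} \\ \mathbf{A}_{21}\end{bmatrix}\mathbf{C}_{12} \tag{A-3-17}$$

$$[\mathbf{A}_{21}, \mathbf{A}_{22}] = -\mathbf{C}_{21}[\mathbf{A}_{11}, \mathbf{A}_{12}] \tag{A-3-18}$$

Thus, $\mathbf{C}_{12}$ and $\mathbf{C}_{21}$ contain the coefficients for the linear dependence among the columns and rows of $\mathbf{A}$, respectively. The rows of $[\mathbf{A}_{11}, \mathbf{A}_{12}]$ form a maximal linearly independent subset of the $\mathbf{a}_i$. The columns of $\begin{bmatrix}\mathbf{A}_{11} \\ \mathbf{A}_{21}\end{bmatrix}$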
form a maximal linearly independent subset of the columns of $\mathbf{A}$. When the rank of $\mathbf{A}$ equals the number of rows, then $\mathbf{A}_{21}$, $\mathbf{A}_{22}$, $\mathbf{C}_{21}$, and $\mathbf{C}_{22}$ are vacuous; when the rank of $\mathbf{A}$ equals the number of columns, then $\mathbf{A}_{12}$, $\mathbf{A}_{22}$, $\mathbf{C}_{12}$, and $\mathbf{C}_{22}$ are vacuous. When $\mathbf{A}$ is square and nonsingular, only $\mathbf{A}_{11}$ and $\mathbf{C}_{11} = \mathbf{A}_{11}^{-1}$ exist.

(f) Determinant. To compute the determinant of $\mathbf{B}$, we follow the procedure of (c); step 6 may be omitted. If the process cannot be completed, then $\det\mathbf{B} = 0$. Otherwise, the determinant equals the product of all pivots times $(-1)^r$, where $r$ is the number of row interchanges required in step 5.

(g) Stepwise Linear Regression. We wish to find the $l$-vector $\boldsymbol{\theta}$ which minimizes

$$\Phi(\boldsymbol{\theta}) = (\mathbf{y} - \mathbf{B}\boldsymbol{\theta})^T\mathbf{V}^{-1}(\mathbf{y} - \mathbf{B}\boldsymbol{\theta})$$

Let us form the $(l + 1) \times (l + 1)$ matrix

$$\mathbf{A} \equiv \begin{bmatrix}\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B} & \mathbf{B}^T\mathbf{V}^{-1}\mathbf{y} \\ (\mathbf{B}^T\mathbf{V}^{-1}\mathbf{y})^T & \mathbf{y}^T\mathbf{V}^{-1}\mathbf{y}\end{bmatrix}$$

Suppose we sweep some of the first $l$ rows of $\mathbf{A}$, producing a modified matrix $\mathbf{A}$ (whenever we speak of $\mathbf{A}$ we are referring to its current form). Let $I$ denote the set of indices of the swept rows, and $J$ the set of indices of the unswept rows (excluding row $l + 1$). Let $\mathbf{a}$ be the last column of $\mathbf{A}$. Then $a_\alpha$ $(\alpha \in I)$ is the optimal value of $\theta_\alpha$ provided all $\theta_\beta$ $(\beta \in J)$ are restricted to vanish. Furthermore, $a_{l+1}$ is the minimum of $\Phi(\boldsymbol{\theta})$ under the above restriction, and $t_\beta^2 \equiv a_\beta^2/|A_{\beta\beta}|$ $(\beta \in J)$ is the reduction in $\Phi(\boldsymbol{\theta})$ that would ensue if $\theta_\beta$ were to be included in the regression, i.e., if row $\beta$ were to be swept. Therefore, the following algorithm is suggested for forward stepwise regression:

1. Choose a small positive number $\varepsilon$ such that a change $\varepsilon$ in $\Phi(\boldsymbol{\theta})$ is considered insignificant.
2. Construct the matrix $\mathbf{A}$. Let $I$ be the empty set, and let $J = \{1, 2, \ldots, l\}$.
3. Of the elements $\beta \in J$ for which $t_\beta^2 > \varepsilon$, find the one, say $\beta^*$, for which $t_\beta^2$ is largest. Sweep row $\beta^*$, and transfer $\beta^*$ from $J$ to $I$.
4. Repeat step 3 until no $t_\beta^2$ exceeds $\varepsilon$.
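At this point, the model is represented by the equation

$$y_\mu = \sum_{\alpha \in I} B_{\mu\alpha}\theta_\alpha, \qquad \theta_\alpha = \begin{cases} a_\alpha = A_{\alpha,l+1} & (\alpha \in I) \\ 0 & (\alpha \in J) \end{cases}$$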
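The forward stepwise procedure can be sketched as follows. This is an illustrative reading of steps 1-4, with several assumptions that are not in the text: $\mathbf{V}$ is taken to be the identity matrix, indices are zero-based, and the names `sweep` and `forward_stepwise` are invented for the example.

```python
def sweep(A, i):
    """Sweep row i of a square matrix in place (Gauss-Jordan pivot on (i, i))."""
    piv = A[i][i]
    n = len(A)
    for p in range(n):
        for q in range(n):
            if p != i and q != i:
                A[p][q] -= A[i][q] * A[p][i] / piv
    for q in range(n):
        if q != i:
            A[i][q] /= piv             # pivot row
            A[q][i] = -A[q][i] / piv   # pivot column, with sign change
    A[i][i] = 1.0 / piv


def forward_stepwise(B, y, eps):
    """Forward stepwise regression by sweeping, assuming V = I.

    Returns (theta, minimum of Phi, sorted list of selected column indices).
    """
    l = len(B[0])
    cols = [[row[k] for row in B] for k in range(l)] + [list(y)]
    # A = [[B^T B, B^T y], [y^T B, y^T y]]
    A = [[sum(u * v for u, v in zip(cu, cv)) for cv in cols] for cu in cols]
    I, J = [], list(range(l))
    while J:
        # Reduction in Phi that sweeping each unswept row would give
        gain = {b: A[b][l] ** 2 / abs(A[b][b]) for b in J if A[b][b] != 0}
        if not gain:
            break
        best = max(gain, key=gain.get)
        if gain[best] <= eps:
            break
        sweep(A, best)
        J.remove(best)
        I.append(best)
    theta = [A[a][l] if a in I else 0.0 for a in range(l)]
    return theta, A[l][l], sorted(I)


B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
y = [3.0, 0.0, 3.0, 6.0]  # exactly 3 times the first column of B
theta, phi_min, selected = forward_stepwise(B, y, eps=1e-9)
```

In this small data set the second column brings no further reduction once the first has been swept in, so only the first parameter is selected.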
In backward stepwise regression, we start with

$$\mathbf{A} = \begin{bmatrix}(\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B})^{-1} & \boldsymbol{\theta}^* \\ -\boldsymbol{\theta}^{*T} & \mathbf{y}^T\mathbf{V}^{-1}\mathbf{y} - \boldsymbol{\theta}^{*T}\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B}\boldsymbol{\theta}^*\end{bmatrix}$$

where $\boldsymbol{\theta}^* = (\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B})^{-1}\mathbf{B}^T\mathbf{V}^{-1}\mathbf{y}$. We let $I = \{1, 2, \ldots, l\}$ and $J$ is the empty set. Step 3 above becomes:

3'. Of the elements $\alpha \in I$ for which $t_\alpha^2 \leq \varepsilon$, find the one, say $\alpha^*$, for which $t_\alpha^2$ is smallest. Sweep row $\alpha^*$, and transfer $\alpha^*$ from $I$ to $J$.

A-4. Eigenvalues and Vectors of a Real Symmetric Matrix

Presented below are the computational details for an algorithm which combines Givens-Householder reduction to tridiagonal form, the QR algorithm with origin shifts for diagonalizing the tridiagonal matrix, and successive orthogonal transformations of the unit matrix to obtain the eigenvectors. For explanations, see Wilkinson (1965) and Ortega and Kaiser (1963).

Steps which are starred (*) can be omitted if only eigenvalues are to be computed. $\mathbf{A}$ is the $n \times n$ symmetric matrix whose eigenvalues are to be computed. Step 2 is omitted if $n = 2$. Two constants $\varepsilon_1$ and $\varepsilon_2$ are used in termination tests [steps 8 and 9 below]. The following method is suggested for selecting the values of these constants: Let $\varepsilon$ be the desired relative accuracy of the largest eigenvalue (this cannot exceed the precision of the computer. For instance, if a $k$-bit word length is used, we must have $\varepsilon > 2^{-k}$). Let $S = \sum_{i,j=1}^{n} A_{ij}^2$. Then let $\varepsilon_1 = \varepsilon(2S)^{1/2}$ and $\varepsilon_2 = \varepsilon_1/n^2$.

1*. Set $\mathbf{V} = \mathbf{I}_n$.
2. For $i = 1, 2, \ldots, n - 2$ in turn, perform the following steps:
   a. Let $\sigma = 1$ if $A_{i+1,i} \geq 0$, $\sigma = -1$ otherwise.
   b. Let $c = \sum_{j=i+1}^{n} A_{j,i}^2$, $s = \sigma c^{1/2}$.
   c. Let $b_i = -s$. If $s = 0$, proceed to the next value of $i$.
   d. Let $\alpha = 1/(c + |A_{i+1,i}s|)$.
   e. Let $w_{i+1} = A_{i+1,i} + s$, $w_j = A_{j,i}$ $(j = i + 2, i + 3, \ldots, n)$.
   f. Let $u_j = \alpha w_j$ $(j = i + 1, i + 2, \ldots, n)$.
   g*. Let $p_k = \sum_{j=i+1}^{n} V_{kj}w_j$ $(k = 1, 2, \ldots, n)$.
   h*. Replace $V_{kj}$ with $V_{kj} - p_k u_j$ $(k = 1, 2, \ldots, n;\ j = i + 1, i + 2, \ldots, n)$.
   i. Let $q_k = \sum_{j=i+1}^{n} A_{kj}u_j$ $(k = i + 1, i + 2, \ldots, n)$.
   j. Let $\beta = \tfrac{1}{2}\sum_{k=i+1}^{n} q_k u_k$.
   k. Let $q_k = q_k - \beta w_k$ $(k = i + 1, i + 2, \ldots, n)$.
   l. Replace $A_{jk}$ with $A_{jk} - q_j w_k - w_j q_k$ $(j, k = i + 1, i + 2, \ldots, n)$.
3. Let $b_{n-1} = A_{n-1,n}$, $a_i = A_{i,i}$ $(i = 1, 2, \ldots, n)$.
4. Let $m = n$, $g = 0$.
5. Let $c_1 = 1$, $\alpha = a_m$, $p = a_1$.
6. For $i = 1, 2, \ldots, m - 1$ in turn, perform the following steps:
   a. If $|b_i| \leq \varepsilon_2$, replace $b_i$ by zero, set $s = 0$ and $c = \operatorname{sign}(p)$, and proceed to step e.
   b. Let $x = (p^2 + b_i^2)^{1/2}$.
   c. Let $s = b_i/x$, $c = p/x$.
   d*. For $j = 1, 2, \ldots, n$ in turn, let $\beta = cV_{j,i+1} - sV_{j,i}$. Replace $V_{j,i}$ with $sV_{j,i+1} + cV_{j,i}$ and $V_{j,i+1}$ with $\beta$.
   e. Let $r = cp + sb_i$, $d = cc_1$.
   f. Let $q = db_i + sa_{i+1}$.
   g. Replace $a_i$ with $dr + sq$.
   h. If $i > 1$, replace $b_{i-1}$ with $s_1 r$.
   i. Let $s_1 = s$, $p = ca_{i+1} - sc_1b_i$, $c_1 = c$.
7. Replace $b_{m-1}$ with $s_1 p$ and $a_m$ with $c_1 p$.
8. If $\sum_{i=1}^{m-1} |b_i| \leq \varepsilon_1$ proceed to step 16.
9. If $|b_{m-1}| \leq \varepsilon_2$ proceed to step 13.
10. If $\bigl||a_m/\alpha| - 1\bigr| > \tfrac{1}{2}$, return to step 5.
11. Replace $g$ with $g + a_m$.
12. Replace $a_i$ with $a_i - a_m$ $(i = 1, 2, \ldots, m)$ and return to step 5.
13. Replace $a_m$ with $a_m + g$.
14. Replace $m$ with $m - 1$.
15. If $m \geq 2$, return to step 9.
16. Replace $a_i$ with $a_i + g$ $(i = 1, 2, \ldots, m)$.
17. At this point, $a_i$ is the $i$th eigenvalue of $\mathbf{A}$, and (*) $V_{ji}$ $(j = 1, 2, \ldots, n)$ is the $i$th eigenvector of $\mathbf{A}$ $(i = 1, 2, \ldots, n)$. These eigenvectors form an orthonormal set, i.e., $\sum_i V_{ij}V_{ik} = \delta_{jk}$ $(j, k = 1, 2, \ldots, n)$.

A-5. Spectral Decompositions

Let $\mathbf{A}$ be a symmetric $l \times l$ matrix. Suppose $\mathbf{D}$ and $\mathbf{E}$ are, respectively, diagonal and nonsingular $l \times l$ matrices, satisfying

$$\mathbf{A} = \mathbf{E}\mathbf{D}\mathbf{E}^T \tag{A-5-1}$$

Then $\mathbf{E}\mathbf{D}\mathbf{E}^T$ is referred to as a spectral decomposition of $\mathbf{A}$. In component form, a spectral decomposition is given by

$$A_{ij} = \sum_{k=1}^{l} d_k E_{ik}E_{jk} \tag{A-5-2}$$

where

$$d_k \equiv D_{kk} \tag{A-5-3}$$
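Let $\mathbf{x}$ be any $l$-dimensional column vector. The quantity

$$A(\mathbf{x}) \equiv \mathbf{x}^T\mathbf{A}\mathbf{x} = \sum_{i,j=1}^{l} A_{ij}x_i x_j$$

is called the quadratic form defined by $\mathbf{A}$. From Eq. (2) we have

$$A(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x} = \mathbf{x}^T\mathbf{E}\mathbf{D}\mathbf{E}^T\mathbf{x} = \mathbf{y}^T\mathbf{D}\mathbf{y} = \sum_{i=1}^{l} d_i y_i^2 \tag{A-5-4}$$

where

$$\mathbf{y} \equiv \mathbf{E}^T\mathbf{x} \tag{A-5-5}$$

The matrix $\mathbf{A}$ is positive (negative) definite if $A(\mathbf{x}) > 0$ $(< 0)$ for all $\mathbf{x} \neq \mathbf{0}$. It follows from Eq. (4) that $\mathbf{A}$ is positive (negative) definite if and only if all $d_i$ are positive (negative). If none of the $d_i$ are zero, the matrix $\mathbf{A}$ is nonsingular, and we can form

$$\mathbf{A}^{-1} = (\mathbf{E}^T)^{-1}\mathbf{D}^{-1}\mathbf{E}^{-1} \tag{A-5-6}$$

since $\mathbf{E}$ was assumed nonsingular, and $\mathbf{D}^{-1}$ is a diagonal matrix with $(\mathbf{D}^{-1})_{ii} = d_i^{-1}$.

Any symmetric matrix possesses infinitely many spectral decompositions. Of these, the following play important roles:

(a) The Eigenvalue Decomposition. Suppose $\mathbf{E}$ is a unitary matrix $\mathbf{V}$ satisfying

$$\mathbf{V}^T = \mathbf{V}^{-1} \tag{A-5-7}$$

In this case we denote $\mathbf{D}$ by $\boldsymbol{\Lambda}$ and $d_i$ by $\lambda_i$. Then we have from Eq. (1) and Eq. (7)

$$\mathbf{A}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda} \tag{A-5-8}$$

Let $\mathbf{v}_i$ denote the $i$th column of $\mathbf{V}$. Then Eq. (8) is equivalent to

$$\mathbf{A}\mathbf{v}_i = \lambda_i\mathbf{v}_i \qquad (i = 1, 2, \ldots, l) \tag{A-5-9}$$

which states that the $\lambda_i$ and $\mathbf{v}_i$ are, respectively, the eigenvalues and eigenvectors of $\mathbf{A}$. The equation

$$\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T = \sum_{i=1}^{l} \lambda_i\mathbf{v}_i\mathbf{v}_i^T \tag{A-5-10}$$

represents the eigenvalue decomposition of $\mathbf{A}$. Inverting Eq. (10) we find

$$\mathbf{A}^{-1} = (\mathbf{V}^T)^{-1}\boldsymbol{\Lambda}^{-1}\mathbf{V}^{-1} = \mathbf{V}\boldsymbol{\Lambda}^{-1}\mathbf{V}^T = \sum_{i=1}^{l} \lambda_i^{-1}\mathbf{v}_i\mathbf{v}_i^T \tag{A-5-11}$$

provided all $\lambda_i \neq 0$. If we omit from the summation in Eq. (11) all the terms for which $\lambda_i = 0$, we obtain a matrix $\mathbf{A}^+$, called the pseudoinverse of $\mathbf{A}$. This definition of the pseudoinverse applies only to symmetric matrices; for the general case see Eq. (A-1-20).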
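As a small worked illustration of Eqs. (10) and (11) (the matrix is chosen here for convenience and does not come from the text), consider a singular symmetric $2 \times 2$ matrix whose pseudoinverse is obtained by dropping the zero eigenvalue from the sum:

```latex
\mathbf{A} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad
\lambda_1 = 2,\ \mathbf{v}_1 = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad
\lambda_2 = 0,\ \mathbf{v}_2 = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}

\mathbf{A} = \lambda_1 \mathbf{v}_1\mathbf{v}_1^T + \lambda_2 \mathbf{v}_2\mathbf{v}_2^T
           = 2 \cdot \tfrac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}

\mathbf{A}^{+} = \lambda_1^{-1}\mathbf{v}_1\mathbf{v}_1^T
               = \tfrac{1}{2} \cdot \tfrac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
               = \begin{bmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{bmatrix}
```

One can verify directly that $\mathbf{A}\mathbf{A}^+\mathbf{A} = \mathbf{A}$ for this example, as expected of a pseudoinverse.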
Equations (10) and (11) show how both a matrix and its inverse can be reconstituted when the eigenvalues and vectors are known.

We now consider the quadratic form $A(\mathbf{x})$. We have

$$A(\mathbf{x}) = \mathbf{x}^T\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T\mathbf{x} = \mathbf{y}^T\boldsymbol{\Lambda}\mathbf{y} = \sum_{i=1}^{l} \lambda_i y_i^2 \tag{A-5-12}$$

where

$$\mathbf{y} = \mathbf{V}^T\mathbf{x} \tag{A-5-13}$$

Since $\mathbf{V}$ is unitary, the transformation of coordinates given by Eq. (13) does not affect the shape of the contours of the function $A(\mathbf{x})$, i.e., the shape of the surfaces on which $A(\mathbf{x}) = \text{constant}$. From Eq. (12) it is evident that these surfaces are quadrics whose $i$th principal axis is inversely proportional to $|\lambda_i|^{1/2}$ and lies in the direction of $\mathbf{v}_i$. For instance, if $l = 2$ and $\lambda_1$ and $\lambda_2$ are positive, the contours are ellipses whose principal axes are proportional in length to $\lambda_1^{-1/2}$ and $\lambda_2^{-1/2}$. If the eigenvalues are nearly equal, the contours are nearly circles; if they differ widely, the contours are very elongated.

The most extensive analysis of methods for computation of eigenvalues and vectors can be found in Wilkinson (1965). A summary of a fast and convenient method based on the QR algorithm appears in the previous section. If computations are carried to $n$ digits of precision, then the error in any computed eigenvalue is about $\pm 10^{-n}\lambda_{\max}$, where $\lambda_{\max}$ is the eigenvalue of largest absolute value. It follows that eigenvalues much smaller than $\lambda_{\max}$ cannot be computed with great precision. We define the condition number of a matrix as the ratio of largest to smallest (in absolute value) eigenvalues. The computation of the small eigenvalues (corresponding to long principal axes) of a matrix with a large condition number poses a serious problem. Fortunately, this problem can usually be eliminated if we use a different spectral decomposition, as described below.

(b) The Scaled and Inverse Scaled Decompositions. An apparent ill-conditioning (large condition number) of matrices encountered in practice is often due to the scaling of the variables.
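For instance, consider the function $\Phi(\boldsymbol{\theta}) = \tfrac{1}{2}(\theta_1^2 + \theta_2^2)$. This has the Hessian matrix

$$\mathbf{H} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

which is very well-conditioned indeed, having both eigenvalues equal to one. Let us rescale the first variable by substituting $\eta_1 = 10^5\theta_1$. We leave the second variable unchanged, setting $\eta_2 = \theta_2$. In terms of the new coordinates,

$$\Phi = \tfrac{1}{2}\left[(\eta_1/10^5)^2 + \eta_2^2\right], \qquad \mathbf{H} = \begin{bmatrix} 10^{-10} & 0 \\ 0 & 1 \end{bmatrix}$$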
The condition number has been increased from 1 to $10^{10}$. This suggests that before computing eigenvalues and vectors we should scale the matrix properly. The simplest scaling is one which reduces all diagonal elements to unit magnitude. If our matrix is a Hessian, this means that we are rescaling all variables so that the curvature of the objective function at the minimum is unity along all coordinate axes. If our matrix is the inverse of a Hessian, we are scaling all variables to possess unit standard deviation (see Section 7-5). If our matrix is positive or negative definite, the proposed scaling sets the magnitude of all off-diagonal elements to less than unity. If the matrix is not positive definite, this scaling method may fail by leaving very large off-diagonal elements. On the whole, however, the method has given very good results.

Given a matrix $\mathbf{A}$, we define a diagonal matrix $\mathbf{B}$ with

$$B_{ii} = \begin{cases} |A_{ii}|^{1/2} & (A_{ii} \neq 0) \\ 1 & (A_{ii} = 0) \end{cases} \tag{A-5-14}$$

then the matrix

$$\mathbf{C} = \mathbf{B}^{-1}\mathbf{A}\mathbf{B}^{-1} \tag{A-5-15}$$

has elements $C_{ij} = A_{ij}/|A_{ii}A_{jj}|^{1/2}$ (except when $A_{ii}$ or $A_{jj} = 0$), and, in particular, $C_{ii} = 1$. We refer to $\mathbf{C}$ as the scaled version of $\mathbf{A}$. If $\mathbf{A}$ is a covariance matrix then $\mathbf{C}$ is the correlation matrix.

Let the eigenvalue decomposition of $\mathbf{C}$ be given by

$$\mathbf{C} = \mathbf{U}\boldsymbol{\Pi}\mathbf{U}^T \tag{A-5-16}$$

where $\boldsymbol{\Pi}$ is diagonal; $\Pi_{ii} = \pi_i$, the $i$th eigenvalue of $\mathbf{C}$; $\mathbf{U}$ is the matrix whose $i$th column is $\mathbf{u}_i$, the $i$th eigenvector of $\mathbf{C}$; and $\mathbf{U}^T = \mathbf{U}^{-1}$. Eqs. (15) and (16) may be combined to yield

$$\mathbf{A} = \mathbf{B}\mathbf{U}\boldsymbol{\Pi}\mathbf{U}^T\mathbf{B} = \mathbf{F}\boldsymbol{\Pi}\mathbf{F}^T \tag{A-5-17}$$

where

$$\mathbf{F} \equiv \mathbf{B}\mathbf{U}, \quad \text{i.e.,} \quad F_{ij} = B_{ii}U_{ij} \tag{A-5-18}$$

We call the relation $\mathbf{A} = \mathbf{F}\boldsymbol{\Pi}\mathbf{F}^T$ the scaled decomposition of $\mathbf{A}$. Inverting Eq. (17) yields

$$\mathbf{A}^{-1} = \mathbf{B}^{-1}\mathbf{U}\boldsymbol{\Pi}^{-1}\mathbf{U}^T\mathbf{B}^{-1} = \mathbf{G}\boldsymbol{\Pi}^{-1}\mathbf{G}^T \tag{A-5-19}$$

where

$$\mathbf{G} = \mathbf{B}^{-1}\mathbf{U}, \quad \text{i.e.,} \quad G_{ij} = B_{ii}^{-1}U_{ij} \tag{A-5-20}$$

We call the relation $\mathbf{A}^{-1} = \mathbf{G}\boldsymbol{\Pi}^{-1}\mathbf{G}^T$ the inverse scaled decomposition of $\mathbf{A}$.

The following is a summary of the steps required to compute the scaled decompositions:

1. Divide each element $A_{ij}$ by $|A_{ii}A_{jj}|^{1/2}$, forming the matrix $\mathbf{C}$.
2. Obtain the eigenvalues $\pi_i$ and eigenvectors $\mathbf{u}_i$ of $\mathbf{C}$.
3. Multiply the $j$th element of $\mathbf{u}_i$ by $|A_{jj}|^{1/2}$ to form a vector $\mathbf{f}_i$, which is the $i$th column of $\mathbf{F}$.
4. The scaled decomposition of $\mathbf{A}$ is given by

$$\mathbf{A} = \sum_{i=1}^{l} \pi_i\mathbf{f}_i\mathbf{f}_i^T \tag{A-5-21}$$

   This is equivalent to Eq. (17).
5. Divide the $j$th element of $\mathbf{u}_i$ by $|A_{jj}|^{1/2}$ to form a vector $\mathbf{g}_i$, which is the $i$th column of $\mathbf{G}$.
6. The inverse scaled decomposition of $\mathbf{A}^{-1}$ is given by

$$\mathbf{A}^{-1} = \sum_{i=1}^{l} \pi_i^{-1}\mathbf{g}_i\mathbf{g}_i^T \tag{A-5-22}$$

   provided all $\pi_i \neq 0$. This is equivalent to Eq. (19).

Note: Replace any zero $A_{ii}$ by one in the above computations. A numerical example appears in Section 5-21.

The above procedure [omitting steps 3 and 4] can be regarded as a method for computing the inverse of a symmetric matrix. As such it is unlikely to win any prizes for speed, but it is quite accurate and stable. It provides insight into the nature of the matrix, and lets us generate "almost inverses" of $\mathbf{A}$. By this we mean matrices which (like the pseudoinverse) differ from Eq. (22) only in the values of the $\pi_i$, the latter being chosen so as to confer certain desirable properties (e.g., positive definiteness, or well-conditioning) on the matrix. For examples, see Sections 5-7 and 5-8.

(c) The Square Root Decomposition. If $\mathbf{A}$ is positive definite, it is possible to obtain spectral decompositions in which $\mathbf{D} = \mathbf{I}$, the identity matrix; i.e., $\mathbf{A} = \mathbf{E}\mathbf{E}^T$. Of particular interest is the decomposition in which $\mathbf{E}$ is a symmetric matrix $\mathbf{S}$, whence $\mathbf{A} = \mathbf{S}^2$. The matrix $\mathbf{S}$ is called the square root of $\mathbf{A}$. If $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T$ is the eigenvalue decomposition of $\mathbf{A}$, then we have, because $\mathbf{V}^T\mathbf{V} = \mathbf{I}$

$$\mathbf{A} = (\mathbf{V}\boldsymbol{\Lambda}^{1/2}\mathbf{V}^T)^2 \tag{A-5-23}$$

so that $\mathbf{A}^{1/2} = \mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}^{1/2}\mathbf{V}^T$. Here $\boldsymbol{\Lambda}^{1/2}$ is a diagonal matrix with elements $\lambda_i^{1/2}$.

(d) The Cholesky Decomposition [See, e.g., Fox (1964)]. Again we assume that $\mathbf{A}$ is positive definite, and choose $\mathbf{D} = \mathbf{I}$.
Now, however, we specify that $\mathbf{E}$ should be a lower triangular matrix $\mathbf{L}$, that is, a matrix whose elements above the main diagonal are all zero

$$L_{ij} = 0 \qquad (j > i) \tag{A-5-24}$$
Since $\mathbf{A} = \mathbf{L}\mathbf{L}^T$ we have $A_{ij} = \sum_{k=1}^{l} L_{ik}L_{jk}$, which in view of Eq. (24) becomes:

$$A_{ij} = \sum_{k=1}^{j} L_{ik}L_{jk} \qquad (j < i) \tag{A-5-25}$$

$$A_{ii} = \sum_{k=1}^{i} L_{ik}^2 \tag{A-5-26}$$

These equations may be solved recursively for the $L_{ij}$. From Eq. (26)

$$L_{11} = A_{11}^{1/2} \tag{A-5-27}$$

From Eq. (25)

$$L_{i1} = A_{i1}/L_{11} \qquad (i = 2, 3, \ldots, l) \tag{A-5-28}$$

Then, using Eq. (25) and Eq. (26) alternately for $i = 2, 3, \ldots, l$:

$$L_{ij} = \Bigl(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\Bigr)\Big/L_{jj} \qquad (j = 2, 3, \ldots, i - 1;\ \text{skip for } i = 2) \tag{A-5-29}$$

$$L_{ii} = \Bigl(A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2\Bigr)^{1/2} \tag{A-5-30}$$

This procedure can be carried through provided all of the square root arguments are positive. This occurs if and only if $\mathbf{A}$ is positive definite.

Of all the decompositions discussed, the last is the only one that can be accomplished in a finite procedure, which requires approximately $l^3/6$ multiplications. All the other decompositions depend on the evaluation of eigenvalues, which requires an iterative procedure.

The Cholesky decomposition is particularly useful in solving for $\mathbf{x}$ the set of linear equations

$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{A-5-31}$$

These can be rewritten as

$$\mathbf{L}\mathbf{y} = \mathbf{b} \tag{A-5-32}$$

where

$$\mathbf{y} = \mathbf{L}^T\mathbf{x} \tag{A-5-33}$$

Now Eq. (32), on account of the triangular nature of $\mathbf{L}$, has the form:

$$\begin{aligned}
L_{11}y_1 &= b_1 \\
L_{21}y_1 + L_{22}y_2 &= b_2 \\
L_{31}y_1 + L_{32}y_2 + L_{33}y_3 &= b_3 \\
&\ \,\vdots
\end{aligned} \tag{A-5-34}$$
These equations can easily be solved in turn for $y_1, y_2, \ldots, y_l$. Then Eq. (33), which has the form:

$$\begin{aligned}
L_{11}x_1 + L_{21}x_2 + \cdots + L_{l1}x_l &= y_1 \\
L_{22}x_2 + \cdots + L_{l2}x_l &= y_2 \\
&\ \,\vdots \\
L_{ll}x_l &= y_l
\end{aligned} \tag{A-5-35}$$

can be solved in turn for $x_l, x_{l-1}, \ldots, x_1$. This is the fastest method for solving Eq. (31) when $\mathbf{A}$ is positive definite.

We conclude this section with a computational note. In most applications the matrix $\mathbf{A}$ is given only in decomposed form [Eq. (1)]. We are interested in computing a vector $\mathbf{y} = \mathbf{A}\mathbf{x}$, where $\mathbf{x}$ is a known vector, but have no use for $\mathbf{A}$ itself. The following procedure is much more economical than generating $\mathbf{A}$ first and then computing $\mathbf{A}\mathbf{x}$:

1. Suppose $\mathbf{A} = \mathbf{E}\mathbf{D}\mathbf{E}^T$. Compute the vector $\mathbf{z} = \mathbf{E}^T\mathbf{x}$.
2. Compute the vector $\mathbf{u} = \mathbf{D}\mathbf{z}$. This is done simply by applying the formula $u_j = d_j z_j$ to each component of $\mathbf{z}$.
3. Compute $\mathbf{y} = \mathbf{E}\mathbf{u}$.

In summary, the proper order of the calculations is given by

$$z_j = \sum_k E_{kj}x_k, \qquad u_j = d_j z_j, \qquad y_i = \sum_j E_{ij}u_j \tag{A-5-36}$$

A numerical example appears in Section 5-21.
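The Cholesky recursion of Eqs. (A-5-27) to (A-5-30) and the two triangular solves of Eqs. (A-5-34) and (A-5-35) might be coded as in the following sketch. The function names `cholesky` and `solve_spd`, the zero-based indexing, and the test matrix are assumptions made for illustration.

```python
import math


def cholesky(A):
    """Return lower triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # A_ij minus the terms already accounted for, as in Eqs. (A-5-29) and (A-5-30)
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def solve_spd(A, b):
    """Solve A x = b via L y = b (forward) followed by L^T x = y (backward)."""
    L = cholesky(A)
    n = len(b)
    y = [0.0] * n
    for i in range(n):                 # Eq. (A-5-34): solve for y_1, y_2, ...
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):       # Eq. (A-5-35): solve for x_l, x_{l-1}, ...
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


A = [[4.0, 2.0], [2.0, 3.0]]
b = [8.0, 8.0]              # equals A times [1, 2]
x = solve_spd(A, b)
```

The `ValueError` branch mirrors the remark in the text that the procedure goes through only when every square-root argument is positive.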
Appendix B

Probability

This section is not to be regarded as an introduction to or summary of probability theory. It merely lists the probabilistic concepts and the notation used in the book.

$\Pr(A)$ denotes the probability that event $A$ occurs. $\Pr(A \mid B)$ is the conditional probability of event $A$, provided $B$ is known to occur.

If $\xi$ is a random variable with continuous distribution, then

$$F(x) \equiv \Pr(\xi < x) \tag{B-1}$$

is the cumulative probability distribution function of $\xi$. If $F(x)$ is differentiable, then $p(x) \equiv dF/dx$ is the probability density function (pdf) of $\xi$. In this case

$$\Pr(a < \xi < b) = F(b) - F(a) = \int_a^b p(x)\,dx \tag{B-2}$$

Let $f$ be any function of the random variable. Then the expected value of $f$ is

$$E(f) \equiv \int_{-\infty}^{\infty} f(x)p(x)\,dx \tag{B-3}$$

In particular, the mean, or expected value of $\xi$ itself is

$$\bar{\xi} \equiv E(\xi) = \int_{-\infty}^{\infty} xp(x)\,dx \tag{B-4}$$

and the variance, or expected square deviation from the mean, is

$$v \equiv E[(\xi - \bar{\xi})^2] = \int_{-\infty}^{\infty} (x - \bar{\xi})^2 p(x)\,dx \tag{B-5}$$

The standard deviation of $\xi$ is the square root of the variance. The mode of $\xi$ is the value of $x$ at which $p(x)$ is maximum.
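These definitions generalize to the case where $\boldsymbol{\xi}$ is a vector of random variables $\xi_1, \xi_2, \ldots, \xi_m$. The joint cumulative distribution $F(\mathbf{x})$ is given by

$$F(\mathbf{x}) = \Pr(\xi_1 \leq x_1, \xi_2 \leq x_2, \ldots, \xi_m \leq x_m) \tag{B-6}$$

the pdf by

$$p(\mathbf{x}) = \partial^m F/\partial x_1\,\partial x_2 \cdots \partial x_m \tag{B-7}$$

and the expected value

$$\bar{\boldsymbol{\xi}} \equiv E(\boldsymbol{\xi}) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \mathbf{x}\,p(\mathbf{x})\,dx_1\,dx_2 \cdots dx_m \tag{B-8}$$

which we write in shorthand notation as

$$\bar{\boldsymbol{\xi}} = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x} \tag{B-9}$$

Using the same notation, we define the covariance matrix (sometimes called variance-covariance matrix)

$$\mathbf{V}_\xi \equiv E[(\boldsymbol{\xi} - \bar{\boldsymbol{\xi}})(\boldsymbol{\xi} - \bar{\boldsymbol{\xi}})^T] = \int (\mathbf{x} - \bar{\boldsymbol{\xi}})(\mathbf{x} - \bar{\boldsymbol{\xi}})^T p(\mathbf{x})\,d\mathbf{x} \tag{B-10}$$

$\mathbf{V}_\xi$ is positive definite, or at least semidefinite. The variance of $\xi_i$ is the diagonal element $V_{ii} = E(\xi_i - \bar{\xi}_i)^2$. The covariance of $\xi_i$ and $\xi_j$ $(i \neq j)$ is the off-diagonal element $V_{ij} = E(\xi_i - \bar{\xi}_i)(\xi_j - \bar{\xi}_j)$. The generalized variance of $\boldsymbol{\xi}$ is defined as $\det\mathbf{V}_\xi$. The correlation of $\xi_i$ and $\xi_j$ is given by

$$\rho_{ij} \equiv V_{ij}/(\sigma_i\sigma_j) \tag{B-11}$$

where

$$\sigma_i = V_{ii}^{1/2} \tag{B-12}$$

is the standard deviation of $\xi_i$. Two variables are uncorrelated if their correlation (or covariance) vanishes.

Let $\xi$ and $\eta$ be two random variables with pdf $p_1(x)$ and $p_2(y)$ respectively. These variables are said to be statistically independent if their joint pdf $p(x, y)$ has the form

$$p(x, y) = p_1(x)p_2(y) \tag{B-13}$$

It can be shown that independent variables are uncorrelated. The converse does not always hold. For normally distributed variables, however, zero correlation implies independence.

If $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ are random variables with joint pdf $p(\mathbf{x}, \mathbf{y})$, then the marginal distribution of $\boldsymbol{\eta}$ is given by

$$p(\mathbf{y}) = \int p(\mathbf{x}, \mathbf{y})\,d\mathbf{x} \tag{B-14}$$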
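If the distribution of $\boldsymbol{\xi}$ depends on some other variables $\mathbf{y}$, we write the pdf as $p(\mathbf{x} \mid \mathbf{y})$. If $\mathbf{y}$ is a possible value of some other vector random variable $\boldsymbol{\eta}$, then we call $p(\mathbf{x} \mid \mathbf{y})$ the conditional pdf of $\boldsymbol{\xi}$ given $\boldsymbol{\eta} = \mathbf{y}$.

The following equation relates the joint, marginal, and conditional distributions

$$p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x} \mid \mathbf{y})p(\mathbf{y}) \tag{B-15}$$

hence

$$p(\mathbf{x} \mid \mathbf{y}) = p(\mathbf{x}, \mathbf{y})\Big/\int p(\mathbf{u}, \mathbf{y})\,d\mathbf{u} \tag{B-16}$$

provided the denominator does not vanish.

If $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ are independent, we find, in view of Eq. (13) and Eq. (15), that $p(\mathbf{x} \mid \mathbf{y}) = p(\mathbf{x})$, i.e., knowledge of $\boldsymbol{\eta}$ does not affect the distribution of $\boldsymbol{\xi}$.

If $\mathbf{x}$ is an $m$-dimensional vector random variable with pdf $p(\mathbf{x})$, and $\mathbf{f}(\mathbf{x})$ is an $m$-dimensional vector of continuous, differentiable single-valued functions such that $\mathbf{f}(\mathbf{x}_1) = \mathbf{f}(\mathbf{x}_2)$ only if $\mathbf{x}_1 = \mathbf{x}_2$, then the vector $\mathbf{y} = \mathbf{f}(\mathbf{x})$ is a random variable with pdf

$$p(\mathbf{y}) = p(\mathbf{x})\,\bigl|\det{}^{-1}\,\partial\mathbf{f}/\partial\mathbf{x}\bigr| \tag{B-17}$$

The Jacobian matrix of the transformation from $\mathbf{x}$ to $\mathbf{y}$ is defined as $\partial\mathbf{f}/\partial\mathbf{x}$, and its determinant (which under the above conditions must be nonzero) is the Jacobian.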
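Appendix C

The Rao-Cramér Theorem

Let $p(\mathbf{Y} \mid \boldsymbol{\phi})$ be the pdf of the sample $\mathbf{Y}$. Then from the definition of a pdf

$$\int p(\mathbf{Y} \mid \boldsymbol{\phi})\,d\mathbf{Y} = 1 \tag{C-1}$$

Let $\mathbf{t}^*$ be a vector-valued statistic of the sample, i.e.,

$$\mathbf{t}^* = \mathbf{t}^*(\mathbf{Y}) \tag{C-2}$$

and let $\mathbf{t}$ be the expected value of $\mathbf{t}^*$, i.e.,

$$\mathbf{t}(\boldsymbol{\phi}) = \int \mathbf{t}^*(\mathbf{Y})p(\mathbf{Y} \mid \boldsymbol{\phi})\,d\mathbf{Y} \tag{C-3}$$

From Eq. (1) we have

$$\frac{\partial}{\partial\boldsymbol{\phi}}\int p\,d\mathbf{Y} = \int \left(\frac{\partial p}{\partial\boldsymbol{\phi}}\right)^T d\mathbf{Y} = \mathbf{0} \tag{C-4}$$

Also, from Eq. (3)

$$\frac{\partial}{\partial\boldsymbol{\phi}}\int \mathbf{t}^*p\,d\mathbf{Y} = \int \mathbf{t}^*\left(\frac{\partial p}{\partial\boldsymbol{\phi}}\right)^T d\mathbf{Y} = \mathbf{P} \tag{C-5}$$

where

$$\mathbf{P} \equiv \partial\mathbf{t}/\partial\boldsymbol{\phi} \tag{C-6}$$

Thus, using Eq. (4) and Eq. (5)

$$\int (\mathbf{t}^* - \mathbf{t})(\partial p/\partial\boldsymbol{\phi})^T\,d\mathbf{Y} = \int \mathbf{t}^*(\partial p/\partial\boldsymbol{\phi})^T\,d\mathbf{Y} - \mathbf{t}\int (\partial p/\partial\boldsymbol{\phi})^T\,d\mathbf{Y} = \partial\mathbf{t}/\partial\boldsymbol{\phi} = \mathbf{P} \tag{C-7}$$

Now $\partial p/\partial\boldsymbol{\phi} = p\,\partial\log p/\partial\boldsymbol{\phi}$, hence Eq. (7) may be rewritten as

$$\int \mathbf{u}\mathbf{v}^T\,d\mathbf{Y} = \mathbf{P} \tag{C-8}$$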
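where we define

$$\mathbf{u} \equiv p^{1/2}(\mathbf{t}^* - \mathbf{t}), \qquad \mathbf{v} \equiv p^{1/2}\,\partial\log p/\partial\boldsymbol{\phi} \tag{C-9}$$

The covariance matrix of the statistic $\mathbf{t}^*$ is

$$\mathbf{V}_{t^*} \equiv \int \mathbf{u}\mathbf{u}^T\,d\mathbf{Y} = \int p(\mathbf{t}^* - \mathbf{t})(\mathbf{t}^* - \mathbf{t})^T\,d\mathbf{Y} = E(\mathbf{t}^* - \mathbf{t})(\mathbf{t}^* - \mathbf{t})^T \tag{C-10}$$

Let

$$\mathbf{R} \equiv \int \mathbf{v}\mathbf{v}^T\,d\mathbf{Y} = \int p(\partial\log p/\partial\boldsymbol{\phi})(\partial\log p/\partial\boldsymbol{\phi})^T\,d\mathbf{Y} = E(\partial\log p/\partial\boldsymbol{\phi})(\partial\log p/\partial\boldsymbol{\phi})^T \tag{C-11}$$

Let $\mathbf{A}(\boldsymbol{\phi})$ be an arbitrary matrix function of $\boldsymbol{\phi}$ such that $\mathbf{A}\mathbf{v}$ is a column vector of the same dimension as $\mathbf{u}$. The matrix $(\mathbf{u} + \mathbf{A}\mathbf{v})(\mathbf{u} + \mathbf{A}\mathbf{v})^T$ is clearly positive semidefinite, and so is the sum of any number of matrices of this form. Hence, if

$$\mathbf{B} \equiv \int (\mathbf{u} + \mathbf{A}\mathbf{v})(\mathbf{u} + \mathbf{A}\mathbf{v})^T\,d\mathbf{Y} \tag{C-12}$$

then $\mathbf{B}$ is positive semidefinite. But

$$\begin{aligned}
\mathbf{B} &= \int (\mathbf{u}\mathbf{u}^T + \mathbf{A}\mathbf{v}\mathbf{v}^T\mathbf{A}^T + \mathbf{A}\mathbf{v}\mathbf{u}^T + \mathbf{u}\mathbf{v}^T\mathbf{A}^T)\,d\mathbf{Y} = \mathbf{V}_{t^*} + \mathbf{A}\mathbf{R}\mathbf{A}^T + \mathbf{A}\mathbf{P}^T + \mathbf{P}\mathbf{A}^T \\
&= \mathbf{V}_{t^*} - \mathbf{P}\mathbf{R}^{-1}\mathbf{P}^T + (\mathbf{A} + \mathbf{P}\mathbf{R}^{-1})\mathbf{R}(\mathbf{A}^T + \mathbf{R}^{-1}\mathbf{P}^T)
\end{aligned} \tag{C-13}$$

Now, $\mathbf{B}$ must be positive semidefinite for any $\mathbf{A}$; in particular it must be so for $\mathbf{A} = -\mathbf{P}\mathbf{R}^{-1}$, in which case $\mathbf{B} = \mathbf{V}_{t^*} - \mathbf{P}\mathbf{R}^{-1}\mathbf{P}^T$. Hence, $\mathbf{V}_{t^*} - \mathbf{P}\mathbf{R}^{-1}\mathbf{P}^T$ must be positive semidefinite.

From Eq. (12), $\mathbf{B} = \mathbf{0}$ if and only if $\mathbf{u} = -\mathbf{A}\mathbf{v}$. In this case, from Eq. (8)

$$\mathbf{P} = \int -\mathbf{A}\mathbf{v}\mathbf{v}^T\,d\mathbf{Y} = -\mathbf{A}\mathbf{R} \tag{C-14}$$

and therefore Eq. (13) reduces to

$$\mathbf{0} = \mathbf{V}_{t^*} - \mathbf{P}\mathbf{R}^{-1}\mathbf{P}^T$$

Conversely, suppose $\mathbf{V}_{t^*} - \mathbf{P}\mathbf{R}^{-1}\mathbf{P}^T = \mathbf{0}$. Then, from Eq. (13)

$$\mathbf{B} = (\mathbf{A} + \mathbf{P}\mathbf{R}^{-1})\mathbf{R}(\mathbf{A}^T + \mathbf{R}^{-1}\mathbf{P}^T)$$

In particular, if we choose $\mathbf{A} = -\mathbf{P}\mathbf{R}^{-1}$, we have $\mathbf{B} = \mathbf{0}$. Hence, from Eq. (12)

$$\int (\mathbf{u} - \mathbf{P}\mathbf{R}^{-1}\mathbf{v})(\mathbf{u} - \mathbf{P}\mathbf{R}^{-1}\mathbf{v})^T\,d\mathbf{Y} = \mathbf{0} \tag{C-15}$$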
and it follows that $\mathbf{u} = \mathbf{P}\mathbf{R}^{-1}\mathbf{v}$.

Associating the estimate $\boldsymbol{\phi}^*$ with $\mathbf{t}^*$ and $\boldsymbol{\phi}$ with $\mathbf{t}$, we obtain the results stated in Section 3-2.

Note that the proof is valid only if $p$ satisfies regularity conditions which permit the differentiations under the integral sign in Eq. (4) and Eq. (5). We have also assumed that $\mathbf{R}$ was nonsingular.
Appendix D

Generating a Sample from a Given Multivariate Normal Distribution

We wish to generate on a computer a sample from the distribution $N_k(\mathbf{a}, \mathbf{V})$. That is, we need a vector $\mathbf{z}$ of numbers $z_1, z_2, \ldots, z_k$ derived from a normal distribution with mean $\mathbf{a}$ and covariance matrix $\mathbf{V}$. We proceed as follows:

(a) Let $m = k/2$ if $k$ is even, or $m = (k + 1)/2$ if $k$ is odd.

(b) Generate $2m$ pseudorandom numbers $x_1, x_2, \ldots, x_{2m}$ uniformly and independently distributed on the interval zero to one. For a discussion of methods to accomplish this see Moshman (1967) and Lewis et al. (1969).

(c) From the $x_i$ we generate $y_j$ $(j = 1, 2, \ldots, k)$ which are normally and independently distributed with zero means and unit variances. For this transformation many methods have been proposed, but only the following two are easy to program and reasonably fast, yet produce the required distribution exactly:

1. Method of Box and Muller (1958). Compute:

$$\begin{aligned}
y_{2i-1} &= (-2\log x_{2i-1})^{1/2}\cos 2\pi x_{2i} \\
y_{2i} &= (-2\log x_{2i-1})^{1/2}\sin 2\pi x_{2i}
\end{aligned} \qquad (i = 1, 2, \ldots, m) \tag{D-1}$$

   If $k$ is odd, $y_{2m}$ need not be computed.

2. Method of Marsaglia and Bray (1964). Compute $u_i = 2(x_i - \tfrac{1}{2})$ for $i = 1, 2, \ldots, 2m$. If for any $j = 1, 2, \ldots, m$ it happens that $u_{2j-1}^2 + u_{2j}^2 > 1$, replace $x_{2j-1}$ and $x_{2j}$ by a new pair of uniform random numbers, recompute $u_{2j-1}$ and $u_{2j}$, and repeat until $u_{2j-1}^2 + u_{2j}^2 \leq 1$. Compute:

$$\begin{aligned}
y_{2i-1} &= u_{2i-1}\left[-2\log(u_{2i-1}^2 + u_{2i}^2)/(u_{2i-1}^2 + u_{2i}^2)\right]^{1/2} \\
y_{2i} &= u_{2i}\left[-2\log(u_{2i-1}^2 + u_{2i}^2)/(u_{2i-1}^2 + u_{2i}^2)\right]^{1/2}
\end{aligned} \qquad (i = 1, 2, \ldots, m) \tag{D-2}$$
The second method is probably faster, since it requires no evaluation of trigonometric functions.

(d) Compute the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{v}_i$ $(i = 1, 2, \ldots, k)$ of $\mathbf{V}$. Generate the matrix $\mathbf{U}$ whose $i$th column is $\lambda_i^{1/2}\mathbf{v}_i$. Note that the $\lambda_i$ must all be nonnegative. A faster method, useful if $\mathbf{V}$ is known to be nonsingular, is to find a lower triangular matrix $\mathbf{U}$ such that $\mathbf{U}\mathbf{U}^T = \mathbf{V}$ by means of the Cholesky decomposition (see Section A-5).

(e) Compute

$$\mathbf{z} = \mathbf{a} + \mathbf{U}\mathbf{y} \tag{D-3}$$

to obtain the desired sample $\mathbf{z}$.

If many samples are required from the same distribution, step (d) should be performed once for all samples at the beginning.
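Steps (a) through (e) can be put together as in the sketch below, which uses the Box-Muller transformation of Eq. (D-1) and the Cholesky route of step (d). Everything named here is an assumption made for illustration: the functions `box_muller`, `cholesky_factor`, and `sample`, the use of Python's `random.Random` as the uniform generator, and the example mean and covariance.

```python
import math
import random


def box_muller(k, rng):
    """Return k independent standard normal deviates using Eq. (D-1)."""
    m = (k + 1) // 2                     # step (a), covers even and odd k
    y = []
    for _ in range(m):
        x1 = 1.0 - rng.random()          # in (0, 1], so the logarithm stays finite
        x2 = rng.random()
        r = math.sqrt(-2.0 * math.log(x1))
        y.append(r * math.cos(2.0 * math.pi * x2))
        y.append(r * math.sin(2.0 * math.pi * x2))
    return y[:k]                         # drop the unused deviate when k is odd


def cholesky_factor(V):
    """Lower triangular U with U U^T = V (V positive definite), as in step (d)."""
    n = len(V)
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = V[i][j] - sum(U[i][k] * U[j][k] for k in range(j))
            U[i][j] = math.sqrt(s) if i == j else s / U[j][j]
    return U


def sample(a, U, rng):
    """One draw z = a + U y, Eq. (D-3)."""
    k = len(a)
    y = box_muller(k, rng)
    return [a[i] + sum(U[i][j] * y[j] for j in range(i + 1)) for i in range(k)]


rng = random.Random(0)
a = [1.0, -1.0]
V = [[4.0, 2.0], [2.0, 3.0]]
U = cholesky_factor(V)                   # done once, then reused for every sample
draws = [sample(a, U, rng) for _ in range(20000)]
```

Computing `U` once outside the sampling loop follows the closing remark of the appendix about performing step (d) a single time.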
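Appendix E
The Gauss-Markov Theorem

Suppose we have a model

$$\mathbf{y} - \mathbf{B}\boldsymbol{\theta} = \boldsymbol{\epsilon} \tag{E-1}$$

where $\boldsymbol{\epsilon}$ is a random vector with mean $\mathbf{0}$ and nonsingular covariance matrix $\mathbf{V}$, $\mathbf{y}$ is a vector of observations, and $\mathbf{B}$ a known matrix. We wish to find the least-variance linear unbiased estimator $\boldsymbol{\theta}^*$ for $\boldsymbol{\theta}$. Linearity implies the existence of a matrix $\mathbf{A}$ independent of $\mathbf{y}$ such that

$$\boldsymbol{\theta}^* = \mathbf{A}\mathbf{y} \tag{E-2}$$

Therefore

$$E(\boldsymbol{\theta}^*) = E(\mathbf{A}\mathbf{y}) = \mathbf{A}E(\mathbf{y}) = \mathbf{A}\mathbf{B}\boldsymbol{\theta} \tag{E-3}$$

If the estimate is unbiased, we must have $E(\boldsymbol{\theta}^*) = \boldsymbol{\theta}$ for any $\boldsymbol{\theta}$, hence

$$\mathbf{A}\mathbf{B} = \mathbf{I} \tag{E-4}$$

The covariance matrix of $\boldsymbol{\theta}^*$ is given by

$$\begin{aligned}
\mathbf{V}_\theta &= E(\boldsymbol{\theta}^* - \boldsymbol{\theta})(\boldsymbol{\theta}^* - \boldsymbol{\theta})^T = E(\mathbf{A}\mathbf{y} - \boldsymbol{\theta})(\mathbf{A}\mathbf{y} - \boldsymbol{\theta})^T \\
&= E[\mathbf{A}(\mathbf{y} - \mathbf{B}\boldsymbol{\theta}) + (\mathbf{A}\mathbf{B}\boldsymbol{\theta} - \boldsymbol{\theta})][\mathbf{A}(\mathbf{y} - \mathbf{B}\boldsymbol{\theta}) + (\mathbf{A}\mathbf{B}\boldsymbol{\theta} - \boldsymbol{\theta})]^T \\
&= E[\mathbf{A}(\mathbf{y} - \mathbf{B}\boldsymbol{\theta})(\mathbf{y} - \mathbf{B}\boldsymbol{\theta})^T\mathbf{A}^T] = E(\mathbf{A}\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T\mathbf{A}^T) = \mathbf{A}\mathbf{V}\mathbf{A}^T
\end{aligned} \tag{E-5}$$

We wish to find the matrix $\mathbf{A}$ which minimizes some measure of $\mathbf{V}_\theta$. The following are possible measures:

1. The so-called "generalized variance," i.e., $\det \mathbf{V}_\theta$.
2. Some weighted average of the elements of $\mathbf{V}_\theta$, e.g., $\mathrm{Tr}(\mathbf{G}\mathbf{V}_\theta)$, where $\mathbf{G}$ is an arbitrary positive definite matrix.
3. The spectral norm (largest eigenvalue) of $\mathbf{V}_\theta$.

All measures lead to the same answer. We shall use measure (1) here, and the reader may derive the result for the other measures as an exercise.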
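Equations (E-3) to (E-5) can be checked by simulation. The following is a minimal sketch under assumptions that are not in the text: a single parameter (so $\mathbf{B}$ is one column and $\mathbf{A}$ one row), a diagonal $\mathbf{V}$, normally distributed errors, and illustrative numerical values. The function name `simulate` is hypothetical.

```python
import random


def simulate(theta, b, var, a, n_rep, rng):
    """Draw y = b*theta + eps repeatedly; return mean and variance of a.y."""
    est = []
    for _ in range(n_rep):
        y = [bi * theta + rng.gauss(0.0, vi ** 0.5) for bi, vi in zip(b, var)]
        est.append(sum(ai * yi for ai, yi in zip(a, y)))   # theta* = A y, Eq. (E-2)
    mean = sum(est) / n_rep
    variance = sum((e - mean) ** 2 for e in est) / (n_rep - 1)
    return mean, variance


b = [1.0, 2.0, 3.0]            # the known matrix B (a single column)
var = [1.0, 4.0, 0.25]         # diagonal of V
bb = sum(bi * bi for bi in b)
a = [bi / bb for bi in b]      # ordinary least squares weights, so A B = 1
predicted = sum(ai * ai * vi for ai, vi in zip(a, var))    # A V A^T, Eq. (E-5)
mean, variance = simulate(2.5, b, var, a, 50000, random.Random(1))
```

With these weights the sample mean of the estimates should lie close to the true value 2.5, and their sample variance close to `predicted`, up to Monte Carlo error.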
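We wish, then, to determine the matrix $\mathbf{A}$ which minimizes $\det \mathbf{A}\mathbf{V}\mathbf{A}^T$ while satisfying $\mathbf{A}\mathbf{B} = \mathbf{I}$. We introduce a matrix of Lagrange multipliers $\boldsymbol{\Lambda}$, and construct the Lagrangian

$$\mathcal{L}(\mathbf{A}, \boldsymbol{\Lambda}) = \det \mathbf{A}\mathbf{V}\mathbf{A}^T + \mathrm{Tr}[\boldsymbol{\Lambda}(\mathbf{A}\mathbf{B} - \mathbf{I})] \tag{E-6}$$

We must find the stationary point of Eq. (6). By the methods of Section A-2, and after dividing out the positive scalar factor $\det \mathbf{A}\mathbf{V}\mathbf{A}^T$ (which merely rescales $\boldsymbol{\Lambda}$), the condition $\partial\mathcal{L}/\partial\mathbf{A} = \mathbf{0}$ becomes

$$2(\mathbf{A}\mathbf{V}\mathbf{A}^T)^{-1}\mathbf{A}\mathbf{V} + \boldsymbol{\Lambda}^T\mathbf{B}^T = \mathbf{0} \tag{E-7}$$

Postmultiplying Eq. (7) by $\mathbf{A}^T$, we obtain in view of Eq. (4)

$$2\mathbf{I} + \boldsymbol{\Lambda}^T = \mathbf{0} \tag{E-8}$$

Substituting $\boldsymbol{\Lambda}^T = -2\mathbf{I}$ in Eq. (7), we obtain

$$(\mathbf{A}\mathbf{V}\mathbf{A}^T)^{-1}\mathbf{A}\mathbf{V} = \mathbf{B}^T \tag{E-9}$$

Postmultiplying by $\mathbf{V}^{-1}$ and then by $\mathbf{B}$ we find, successively:

$$(\mathbf{A}\mathbf{V}\mathbf{A}^T)^{-1}\mathbf{A} = \mathbf{B}^T\mathbf{V}^{-1} \tag{E-10}$$

$$(\mathbf{A}\mathbf{V}\mathbf{A}^T)^{-1}\mathbf{A}\mathbf{B} = (\mathbf{A}\mathbf{V}\mathbf{A}^T)^{-1} = \mathbf{B}^T\mathbf{V}^{-1}\mathbf{B} \tag{E-11}$$

Substituting Eq. (11) in Eq. (10)

$$\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B}\mathbf{A} = \mathbf{B}^T\mathbf{V}^{-1} \tag{E-12}$$

so that finally

$$\mathbf{A} = (\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B})^{-1}\mathbf{B}^T\mathbf{V}^{-1} \tag{E-13}$$

and

$$\boldsymbol{\theta}^* = (\mathbf{B}^T\mathbf{V}^{-1}\mathbf{B})^{-1}\mathbf{B}^T\mathbf{V}^{-1}\mathbf{y} \tag{E-14}$$

in agreement with Eq. (4-4-7). It is not difficult to verify that this solution is indeed a minimum of $\det \mathbf{A}\mathbf{V}\mathbf{A}^T$. A treatment of the general case where singular matrices may occur is given by Price (1964). The results call for substitution of pseudoinverses for inverses whenever needed.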
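The optimality of Eq. (E-13) can be illustrated numerically. The sketch below again assumes a single parameter and a diagonal $\mathbf{V}$, in which case $\det \mathbf{A}\mathbf{V}\mathbf{A}^T$ is simply the scalar variance of the estimator. The function names and the numerical values are hypothetical and chosen only for illustration. Every competing weight vector is built so that it still satisfies the unbiasedness condition (E-4).

```python
def gls_weights(b, var):
    """A = (B^T V^-1 B)^-1 B^T V^-1 for one parameter and diagonal V, Eq. (E-13)."""
    info = sum(bi * bi / vi for bi, vi in zip(b, var))     # B^T V^-1 B
    return [bi / (vi * info) for bi, vi in zip(b, var)]


def estimator_variance(a, var):
    """A V A^T for a row vector A and diagonal V, Eq. (E-5)."""
    return sum(ai * ai * vi for ai, vi in zip(a, var))


def shifted(a, b, c, t):
    """Return a + t*(c - (c.b / b.b) b); the added term is orthogonal to b,
    so the result still satisfies A B = 1, Eq. (E-4)."""
    cb = sum(ci * bi for ci, bi in zip(c, b))
    bb = sum(bi * bi for bi in b)
    return [ai + t * (ci - cb / bb * bi) for ai, bi, ci in zip(a, b, c)]


b = [1.0, 2.0, 3.0]
var = [1.0, 4.0, 0.25]

a_gls = gls_weights(b, var)
v_gls = estimator_variance(a_gls, var)

bb = sum(bi * bi for bi in b)
a_ols = [bi / bb for bi in b]                 # unweighted least squares weights
v_ols = estimator_variance(a_ols, var)

rivals = [shifted(a_gls, b, c, t)
          for c in ([1.0, 0.0, 0.0], [0.0, 1.0, -1.0])
          for t in (-0.5, 0.1, 2.0)]
rival_variances = [estimator_variance(r, var) for r in rivals]
```

In this example both the unweighted least squares weights and every perturbed weight vector give a larger variance than the weights of Eq. (E-13), as the theorem asserts.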
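Appendix F
A Convergence Theorem for Gradient Methods

Theorem. Given a continuous function $\Phi(\boldsymbol{\theta})$ with continuously differentiable first derivatives. Let $\Phi_1 = \Phi(\boldsymbol{\theta}_1)$, and let $\Omega$ be the set of all points $\boldsymbol{\theta}$ such that $\Phi(\boldsymbol{\theta}) \le \Phi_1$. Define a sequence of points $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots$ by

$$\boldsymbol{\theta}_{j+1} = \boldsymbol{\theta}_j - \rho_j\mathbf{R}_j\mathbf{q}_j \tag{F-1}$$

where $\mathbf{q}_j \equiv (\partial\Phi/\partial\boldsymbol{\theta})_{\boldsymbol{\theta}=\boldsymbol{\theta}_j}$. We make the following further assumptions.

1. There exists a number $M$ such that no eigenvalue of the Hessian $\mathbf{H}(\boldsymbol{\theta})$ exceeds $M$ in absolute value for all $\boldsymbol{\theta} \in \Omega$.
2. All $\mathbf{R}_j$ are positive definite matrices whose eigenvalues fall between two positive numbers $0 < \beta < \gamma$.
3. All $\rho_j$ are chosen so that

$$\min(\rho_0, \alpha\mu_j) \le \rho_j \le \mu_j \tag{F-2}$$

where $\rho_0$ is a positive constant, $\alpha$ a constant satisfying $0 < \alpha < 1$, and $\mu_j$ is the smallest nonnegative $\rho$ at which $\Phi(\boldsymbol{\theta}_j - \rho\mathbf{R}_j\mathbf{q}_j)$ is a stationary function of $\rho$.

Let $\boldsymbol{\theta}^*$ be a limit point of the sequence $\{\boldsymbol{\theta}_j\}$. Then $\boldsymbol{\theta}^*$ is a stationary point of $\Phi$, i.e., $\mathbf{q}^* \equiv \mathbf{q}(\boldsymbol{\theta}^*) = \mathbf{0}$.

Comment: Such a limit point (not necessarily unique) must exist if $\Omega$ is bounded.

Proof. Clearly $\{\Phi_j\} \equiv \{\Phi(\boldsymbol{\theta}_j)\}$ is a monotone nonincreasing sequence. Because of continuity, we must have

$$\Phi^* \equiv \Phi(\boldsymbol{\theta}^*) \le \Phi_j \qquad (j = 1, 2, \ldots) \tag{F-3}$$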
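Suppose $\boldsymbol{\theta}^*$ is not stationary. Then $\|\mathbf{q}^*\| = a > 0$. Due to continuity of $\Phi$ and $\mathbf{q}$, and the definition of a limit point, we can find an integer $j$ such that

$$2a \ge \|\mathbf{q}_j\| \ge \tfrac{1}{2}a \tag{F-4}$$

and

$$\Phi_j \le \Phi^* + \min\left[\frac{(2 - \alpha)\alpha a^2\beta^2}{256\gamma^2 M},\ \frac{a^2\beta^2}{256\gamma^2 M},\ \frac{a^2\beta\rho_0}{16}\right] \tag{F-5}$$

Consider the function

$$\Psi(\rho) \equiv \Phi(\boldsymbol{\theta}_j - \rho\mathbf{R}_j\mathbf{q}_j) \tag{F-6}$$

We have

$$d\Psi/d\rho = (\partial\Phi/\partial\boldsymbol{\theta})^T(\partial\boldsymbol{\theta}/\partial\rho) = -\mathbf{q}^T\mathbf{R}_j\mathbf{q}_j \tag{F-7}$$

and

$$(d\Psi/d\rho)_{\rho=0} = -\mathbf{q}_j^T\mathbf{R}_j\mathbf{q}_j \le -\beta\|\mathbf{q}_j\|^2 \le -\beta a^2/4 \tag{F-8}$$

which follows from Eq. (A-1-32) and Eq. (4). Also

$$|d^2\Psi/d\rho^2| = |(\partial\boldsymbol{\theta}/\partial\rho)^T(\partial^2\Phi/\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta})(\partial\boldsymbol{\theta}/\partial\rho)| = |\mathbf{q}_j^T\mathbf{R}_j\mathbf{H}\mathbf{R}_j\mathbf{q}_j| \le M\|\mathbf{R}_j\mathbf{q}_j\|^2 \le 4a^2\gamma^2 M \tag{F-9}$$

which follows from Eq. (A-1-31) and Eq. (A-1-32). In view of Eq. (8) and Eq. (9) we have, for $\rho > 0$

$$d\Psi/d\rho \le -\beta a^2/4 + 4a^2\gamma^2 M\rho \tag{F-10}$$

At $\rho = \mu_j$ we have $d\Psi/d\rho = 0$. Hence, from Eq. (10)

$$0 \le -\beta a^2/4 + 4a^2\gamma^2 M\mu_j \tag{F-11}$$

or $\mu_j \ge \beta/(16\gamma^2 M)$ and $\alpha\mu_j \ge \alpha\beta/(16\gamma^2 M)$. In view of Eq. (10) we have, for $\rho > 0$

$$\Psi(\rho) \le \Phi_j - (\beta a^2/4)\rho + 2a^2\gamma^2 M\rho^2 \tag{F-12}$$

Suppose $\rho_j$ has been chosen so that $\alpha\mu_j \le \rho_j \le \mu_j$ [see Eq. (2)]. Since $\Psi(\rho)$ is monotonically nonincreasing for $0 \le \rho \le \mu_j$, we have because of Eq. (12)

$$\begin{aligned}
\Phi_{j+1} = \Psi(\rho_j) \le \Psi(\alpha\mu_j) \le \Psi(\alpha\beta/16\gamma^2 M) &\le \Phi_j - (\beta a^2/4)(\alpha\beta/16\gamma^2 M) + 2a^2\gamma^2 M(\alpha\beta/16\gamma^2 M)^2 \\
&= \Phi_j - [(2 - \alpha)\alpha a^2\beta^2/128\gamma^2 M]
\end{aligned} \tag{F-13}$$

Employing Eq. (5) we find, then

$$\Phi_{j+1} \le \Phi^* - [(2 - \alpha)\alpha a^2\beta^2/256\gamma^2 M] < \Phi^* \tag{F-14}$$

This contradicts Eq. (3).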
The other alternative is that $\rho_0 \le \rho_j \le \mu_j$. But then

$$\Phi_{j+1} \le \Phi_j - (\beta a^2/4)\rho_0 + 2a^2\gamma^2 M\rho_0^2 \tag{F-15}$$

Now there are two possibilities:

(a) $\rho_0 \le \beta/16\gamma^2 M$. Therefore

$$\Phi_{j+1} \le \Phi_j - a^2\rho_0(\beta/4 - 2\gamma^2 M\rho_0) \le \Phi_j - a^2\rho_0[\beta/4 - 2\gamma^2 M(\beta/16\gamma^2 M)] = \Phi_j - a^2\beta\rho_0/8 \le \Phi^* - a^2\beta\rho_0/16 < \Phi^* \tag{F-16}$$

in contradiction with Eq. (3).

(b) $\beta/16\gamma^2 M \le \rho_0 \le \rho_j$. Therefore, because $\Psi(\rho)$ is monotonically nonincreasing for $0 \le \rho \le \mu_j$,

$$\begin{aligned}
\Phi_{j+1} = \Psi(\rho_j) \le \Psi(\beta/16\gamma^2 M) &\le \Phi_j - (\beta a^2/4)(\beta/16\gamma^2 M) + 2a^2\gamma^2 M(\beta/16\gamma^2 M)^2 \\
&= \Phi_j - (a^2\beta^2/128\gamma^2 M) \le \Phi^* - (a^2\beta^2/256\gamma^2 M) < \Phi^*
\end{aligned} \tag{F-17}$$

again contradicting Eq. (3). Hence $\boldsymbol{\theta}^*$ must be stationary.
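The iteration (F-1) with a step rule of the form (F-2) can be tried on a small example. The sketch below is illustrative only and rests on assumptions that go beyond the theorem: $\Phi$ is taken to be a convex quadratic $\tfrac{1}{2}\boldsymbol{\theta}^T\mathbf{H}\boldsymbol{\theta} - \mathbf{b}^T\boldsymbol{\theta}$ (so $\mu_j$ has the closed form $\mathbf{q}_j^T\mathbf{d}_j/\mathbf{d}_j^T\mathbf{H}\mathbf{d}_j$ with $\mathbf{d}_j = \mathbf{R}_j\mathbf{q}_j$), $\mathbf{R}_j = \mathbf{I}$ (steepest descent), and $\rho_j$ is a fixed fraction of $\mu_j$, which satisfies (F-2) for any $\alpha$ not exceeding that fraction. All names and numbers are hypothetical.

```python
def matvec(H, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(hij * xj for hij, xj in zip(row, x)) for row in H]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def phi(theta, H, b):
    """Objective 0.5 * theta^T H theta - b^T theta."""
    return 0.5 * dot(theta, matvec(H, theta)) - dot(b, theta)


def gradient_method(H, b, theta, fraction=0.7, tol=1e-6, max_iter=1000):
    """Iterate Eq. (F-1) with R_j = I and rho_j = fraction * mu_j."""
    history = [phi(theta, H, b)]
    for _ in range(max_iter):
        q = [g - bi for g, bi in zip(matvec(H, theta), b)]   # gradient q_j
        if dot(q, q) ** 0.5 < tol:
            break
        d = q                                   # d_j = R_j q_j with R_j = I
        mu = dot(q, d) / dot(d, matvec(H, d))   # stationary point of Psi(rho)
        rho = fraction * mu                     # alpha*mu_j <= rho_j <= mu_j
        theta = [t - rho * di for t, di in zip(theta, d)]
        history.append(phi(theta, H, b))
    return theta, history


H = [[3.0, 1.0], [1.0, 2.0]]     # positive definite Hessian
b = [1.0, 1.0]
theta, history = gradient_method(H, b, [5.0, -3.0])
```

The recorded values in `history` form a nonincreasing sequence, and the iterates approach the stationary point $\mathbf{H}^{-1}\mathbf{b} = (0.2, 0.4)$, which is the behavior the theorem guarantees under its assumptions.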
Appendix G
Some Estimation Programs

It is impossible to list all existing estimation programs. The ones listed below are either of historical interest, possess special features, or are in widespread use. The list is in chronological order.

1. G. W. Booth and T. I. Peterson (1958), Nonlinear estimation (IBM-Princeton), 704 G2 3226 NLI (previously designated SHARE 687 WLNLI). First generally available computer program for nonlinear parameter estimation. Written in IBM 704 Assembly Language. Uses Gauss method with finite difference approximation to solve single equation least squares problems.

2. M. A. Efroymson (1961), Nonlinear regression with differential equations, 7090 G2 3146 NLR. Written in FORTRAN II for the IBM 7090 specifically to handle models which are in differential equation form. Uses Gauss method with finite difference approximations.

3. L. Lapidus and T. I. Peterson (1964), Chemical reaction analysis by nonlinear estimation, 7090 T2 IBM0014. A package combining program 1 above with a FORTRAN II interface for estimating parameters in chemical kinetics models.

4. D. W. Marquardt (1965), Least squares estimation of nonlinear parameters, 7040 G2 3094 NLIN. Written in FORTRAN IV for the IBM 7040. Uses Marquardt's method with analytic derivatives or finite difference approximations to solve weighted least squares problems.

5. H. Eisenpress, A. Bomberault, and J. Greenstadt (1966), Nonlinear regression equations and systems, estimation and prediction, 7090 G2 IBM0035. Written in FORTRAN IV-FORMAC for the IBM 7090. Performs maximum likelihood estimation of multiple equation econometric models. The FORMAC system automatically evaluates analytic derivatives of all orders required for the full Newton method with rotational discrimination.

6. D. F. Shanno (1967), CREEP - Constrained nonlinear estimation package, 7094 G2 3492. Written in FORTRAN IV for the IBM 7094. Estimates parameters in least squares models with constraints, using a modified Marquardt method combined with gradient projection. Requires analytic derivatives.

7. Y. Bard (1968), Nonlinear estimation and programming, 360D 13.6.003. Written in FORTRAN IV for the IBM System/360. Solves least squares and multiple equation maximum likelihood problems with known or unknown covariance matrix. Uses the Gauss method with analytic derivatives. Includes provisions for constraints (penalty function method), prior distributions, models in standard dynamic form (sensitivity equations), and chemical kinetics models.

8. H. Eisenpress (1968), Nonlinear regression equations and systems, estimation and prediction, 360D 13.6.005. Program 5 above, rewritten in PL/I-FORMAC for the IBM System/360.

9. F. S. Wood (1971), Nonlinear least squares curve-fitting program, 360D 13.6.007. Written in FORTRAN IV for the IBM System/360. Solves least squares problems, using a modified Marquardt method.

Note: At the time of writing, several of the above programs were available from the SHARE Program Library Agency, Triangle Universities Computation Center, P.O. Box 12076, Research Triangle Park, North Carolina 27709.
References

Abadie, J., ed. (1967a). "Nonlinear Programming." Wiley, New York.
Abadie, J. (1967b). "Generalization of the Wolfe Reduced Gradient Method to the Case of Nonlinear Constraints." Electricite de France, Direction des Etudes et Recherches, Clamart, France.
Abadie, J., and Carpentier, J. (1966). Generalisation de la methode du gradient reduit de Wolfe au cas de contraintes non lineaires. Internat. Congr. Operations Research, 4th, Boston.
Acton, F. S. (1959). "Analysis of Straight Line Data." Wiley, New York.
Afifi, A. A., and Elashoff, R. M. (1966). Missing observations in multivariate statistics: I. Review of the literature. J. Amer. Statist. Assoc. 61, 595-604.
Albert, A. E., and Gardner, L. A. (1967). "Stochastic Approximation and Nonlinear Regression." MIT Press, Cambridge, Massachusetts.
Anderson, T. W. (1958). "An Introduction to Multivariate Statistical Analysis." Wiley, New York.
Anscombe, F. J. (1960). Rejection of outliers. Technometrics 2, 123-147.
Arndt, R. A., and MacGregor, M. H. (1966). Nucleon-nucleon phase shift analyses by chi-squared minimization. In "Methods in Computational Physics" (B. Alder, S. Fernbach, and M. Rotenberg, eds.), Vol. 6. Academic Press, New York.
Atkinson, A. C., and Hunter, W. G. (1968). The design of experiments for parameter estimation. Technometrics 10, 271-289.
Bard, Y. (1967). "A Function Maximization Method with Application to Parameter Estimation." New York Scientific Center Report 322.0902, IBM, New York.
Bard, Y. (1968). On a numerical instability of Davidon-like methods. Math. Comp. 22, 665-666.
Bard, Y. (1970). Comparison of gradient methods for the solution of nonlinear parameter estimation problems. SIAM J. Numer. Anal. 7, 157-186.
Bard, Y. (1971). An eclectic approach to nonlinear programming. Proc. ANU Sem. Optimization, Canberra, Austral. Nat. Univ.
Bard, Y., and Lapidus, L. (1968). Kinetics analysis by digital parameter estimation. Catal. Rev. 2, 67-112.
Barnett, V. D. (1967). A note on linear structural relationships when both residual variances are known. Biometrika 54, 670-672.
Bartels, R. H., and Golub, G. H. (1968). Chebyshev solution to an overdetermined linear system. Comm. ACM 11, 428.
Bayes, T. (1763). Essay towards solving a problem in the doctrine of chances. Philos. Trans. Roy. Soc. 53, 370-418. [Reprinted in Biometrika 45, 293-315 (1958)].
Beale, E. M. L. (1960). Confidence regions in nonlinear estimation. J. Roy. Statist. Soc. Ser. B 22, 41-76.
Beaton, A. E. (1964). "The Use of Special Matrix Operators in Statistical Calculus." Research Bulletin RB-64-51, Educational Testing Service, Princeton, New Jersey.
Beauchamp, J. J., and Cornell, R. G. (1966). Simultaneous nonlinear estimation. Technometrics 8, 319-326.
Behnken, D. W. (1964). Estimation of copolymer reactivity ratios: an example of nonlinear estimation. J. Polym. Sci. Part A 2, 645-668.
Bellman, R., Collier, C., Kagiwada, H., Kalaba, R., and Selvester, R. (1964). Estimation of heart parameters using skin potential measurements. Comm. ACM 7, 666-668.
Bellman, R., Jacquez, J., Kalaba, R., and Schwimmer, S. (1967). Quasilinearization and the estimation of chemical rate constants from raw kinetic data. Math. Biosci. 1, 71-76.
Bellman, R. E., Kagiwada, H. H., Kalaba, R. E., and Sridhar, R. (1964). "Invariant Imbedding and Nonlinear Filtering Theory." Memorandum RM-4374-PR, The Rand Corporation, Santa Monica, California.
Berman, M., Weiss, M. F., and Shahn, E. (1962). Some formal approaches to the analysis of kinetic data in terms of linear compartmental systems. Biophys. J. 2, 289-316.
Blakemore, J. W., and Hoerl, A. E. (1963). Fitting nonlinear reaction rate equations to data. Chem. Eng. Progr. Symp. Ser. 59 (42), 14-27.
Bodkin, R. G., and Klein, L. R. (1967). Nonlinear estimation of aggregate production functions. Rev. Econom. Statist. 49, 28-44.
Bond, E., Auslander, M., Grisoff, S., Kenney, R., Myszewski, M., Sammet, J., Tobey, R., and Zilles, S. (1964). FORMAC, An experimental FORmula MAnipulation Compiler. Proc. Nat. Conf. Ass. Comput. Mach., 19th.
Booth, G. W., and Peterson, T. I. (1958). "Nonlinear Estimation." IBM SHARE Program Pa. No. 687 WLNLI.
Box, G. E. P. (1957). Use of statistical methods in the elucidation of basic mechanisms. Bull. Inst. Internat. Statist. 36, 215-225.
Box, G. E. P., and Hill, W. J. (1967). Discrimination among mechanistic models. Technometrics 9, 57-71.
Box, G. E. P., and Hunter, W. G. (1962). A useful method for model-building. Technometrics 4, 301-318.
Box, G. E. P., and Hunter, W. G. (1963). Sequential design of experiments for nonlinear models. Proc. IBM Sci. Comput. Symp. Statist., IBM, White Plains, New York.
Box, G. E. P., and Hunter, W. G. (1965). The experimental study of physical mechanisms. Technometrics 7, 23-42.
Box, G. E. P., and Lucas, H. L. (1959). Design of experiments in nonlinear situations. Biometrika 46, 77-90.
Box, G. E. P., and Muller, M. E. (1958). A note on the generation of random normal deviates. Ann. Math. Statist. 29, 610-611.
Box, G. E. P., and Youle, P. V. (1955). The exploration and exploitation of response surfaces: an example of the link between the fitted surface and the basic mechanism of the system. Biometrics 11, 287-323.
Box, M. J. (1966). A comparison of several current optimization methods, and the use of transformations in constrained problems. Comput. J. 9, 67-77.
Box, M. J. (1968). The occurrence of replications in optimal designs of experiments to estimate parameters in nonlinear models. J. Roy. Statist. Soc. Ser. B 30, 290-302.
Brent, R. P. (1971). "Algorithms for Finding Zeros and Extrema of Functions without Calculating Derivatives." Computer Science Report STAN-CS-71-198, Stanford University, Palo Alto, California.
Broyden, C. G. (1967). Quasi-Newton methods and their application to function minimization. Math. Comp. 21, 368-381.
Buzzi Ferraris, G. (1968). Metodo automatico per trovare l'ottimo di una funzione. Ing. Chim. Ital. 4, 171-192.
Carney, T. M., and Goldwyn, R. M. (1967). Numerical experiments with various optimal estimators. J. Optimization Theory Appl. 1, 113-130.
Carroll, C. W. (1961). The created response surface technique for optimizing nonlinear, restrained, systems. Operations Res. 9, 169-184.
Chow, G. C. (1964). A comparison of alternative estimators for simultaneous equations. Econometrica 32, 532-553.
Colville, A. R. (1968). "A Comparative Study of Nonlinear Programming Codes." IBM N.Y. Scientific Center Report No. 320-2949, New York.
Cornfield, J. (1967). Bayes Theorem. Rev. Inst. Internat. Statist. 35, 34-49.
Cottle, R. W., and Dantzig, G. B. (1968). Complementary pivot theory of mathematical programming. Linear Algebra and Appl. 1, 103-125.
Cragg, J. G. (1967). On the relative small sample properties of several structural-equation estimators. Econometrica 35, 89-110.
Cramer, H. (1946). "Mathematical Methods of Statistics." Princeton Univ. Press, Princeton, New Jersey.
Daniel, J. W. (1971). "The Approximate Minimization of Functionals." Prentice-Hall, Englewood Cliffs, New Jersey.
Dantzig, G. B., and Cottle, R. W. (1967). Positive (semi-) definite programming. In "Nonlinear Programming" (J. Abadie, ed.), pp. 55-73. Wiley, New York.
Davidon, W. C. (1959). "Variable Metric Method for Minimization." A.E.C. Research and Development Report ANL-5990 (Rev.).
Davidon, W. C. (1968). Variance algorithm for minimization. Comput. J. 10, 406-410.
Davies, D. (1970). Some practical methods of optimization. In "Integer and Nonlinear Programming" (J. Abadie, ed.). North-Holland Publ., Amsterdam.
Davies, O. L. D. (1954). "The Design and Analysis of Industrial Experiments." Oliver & Boyd, Edinburgh.
Deming, W. E. (1943). "Statistical Adjustment of Data." Wiley, New York.
Deutsch, R. (1965). "Estimation Theory." Prentice-Hall, Englewood Cliffs, New Jersey.
Draper, N. R., and Hunter, W. G. (1966). Design of experiments for parameter estimation in multiresponse situations. Biometrika 53, 525-533.
Draper, N. R., and Hunter, W. G. (1967a). The use of prior distributions in the design of experiments for parameter estimation in nonlinear situations. Biometrika 54, 147-153.
Draper, N. R., and Hunter, W. G. (1967b). The use of prior distributions in the design of experiments for parameter estimation in nonlinear situations: multiresponse case. Biometrika 54, 662-665.
Draper, N. R., and Smith, H. (1966). "Applied Regression Analysis." Wiley, New York.
Eisenpress, H., Bomberault, A., and Greenstadt, J. (1966). "Nonlinear Regression Equations and Systems, Estimation and Prediction (IBM) 7090." Computer program 7090-G2 IBM0035 G2, IBM, Hawthorne, New York.
Eisenpress, H., and Greenstadt, J. (1966). The estimation of nonlinear econometric systems. Econometrica 34, 851-861.
Eisenpress, H., and Surkan, A. (1966). "Fitting Ore Deposit Models to Geophysical Survey Data." Personal Communication.
Fariss, R. H., and Law, V. J. (1967). Practical tactics for overcoming difficulties in nonlinear regression and equation solving. AIChE Meet., Houston, Feb. 1967.
Faure, P., and Huard, P. (1965). Resolution des programmes mathematiques a fonction nonlineaire par la methode du gradient reduit. Rev. Francaise Recherche Operationelle 9, 167-205.
Feller, W. (1966). "An Introduction to Probability Theory and Its Applications," Vol. 2. Wiley, New York.
Ferguson, T. S. (1967). "Mathematical Statistics, A Decision Theoretic Approach." Academic Press, New York.
Fiacco, A. V. (1968). Second-order sufficient conditions for weak and strict constrained minima. SIAM J. Appl. Math. 16, 105-108.
Fiacco, A. V., and McCormick, G. P. (1964). The sequential unconstrained minimization technique for nonlinear programming: A primal-dual method. Management Sci. 10, 360-366.
Fiacco, A. V., and McCormick, G. P. (1965). "The Sequential Unconstrained Minimization Technique for Convex Programming with Equality Constraints." RAC-TP-155, Research Analysis Corporation, McLean, Virginia.
Fiacco, A. V., and McCormick, G. P. (1967). The slacked unconstrained minimization technique for convex programming. SIAM J. Appl. Math. 15, 505-515.
Fiacco, A. V., and McCormick, G. P. (1968). "Nonlinear Programming: Sequential Unconstrained Minimization Techniques." Wiley, New York.
Fisher, R. A. (1935). "The Design of Experiments." Oliver & Boyd, Edinburgh.
Fisher, R. A. (1950). "Contributions to Mathematical Statistics" (Collection of papers published 1920-1943). Wiley, New York.
Flanagan, P. D., Vitale, P. A., and Mendelsohn, J. (1969). A numerical investigation of several one-dimensional search procedures in nonlinear regression problems. Technometrics 11, 265-284.
Fletcher, R. (1965). Function minimization without evaluating derivatives - a review. Comput. J. 8, 33-41.
Fletcher, R. (1970). A new approach to variable metric algorithms. Comput. J. 13, 317-322.
Fletcher, R., and Powell, M. J. D. (1963). A rapidly convergent descent method for minimization. Comput. J. 6, 163-168.
Fox, L. (1964). "An Introduction to Numerical Linear Algebra." Oxford Univ. Press (Clarendon), London and New York.
Freudenstein, F., and Woo, L. S. (1968). "Kinematics of the Human Knee Joint." IBM New York Scientific Center Report No. 320-2928, New York.
Galambos, J. T., and Cornell, R. G. (1962). Mathematical models for the study of the metabolic pattern of sulfate. J. Lab. Clin. Med. 60, 53-63.
Gauss, K. F. (1809). "Theoria Motus Corporum Coelestium." In "Werke," Vol. 7, 240-254.
Goldfarb, D., and Lapidus, L. (1968). A conjugate gradient method for nonlinear programming problems with linear constraints. Ind. Eng. Chem. Fundam. 7, 142-151.
Goldfeld, S. M., Quandt, R. E., and Trotter, H. F. (1966). Maximization by quadratic hill climbing. Econometrica 34, 541-551.
Goldstein, A. A., and Price, J. F. (1967). An effective algorithm for minimization. Numer. Math. 10, 184-189.
Golub, G. (1965). Numerical methods for solving linear least squares problems. Numer. Math. 7, 206-216.
Golub, G. H. (1969). Matrix decompositions and statistical calculations. In "Statistical Computation" (C. Milton and J. A. Nelder, eds.). Academic Press, New York.
Golub, G. H., and Pereyra, V. (1972). "The Differentiation of Pseudoinverses and Nonlinear Least Squares Problems Whose Variables Separate." Rep. No. STAN-CS-72-261, Computer Science Dept., Stanford University, Palo Alto, California.
Grant, F. S., and West, G. F. (1965). "Interpretation Theory in Applied Geophysics." McGraw-Hill, New York.
Greenstadt, J. (1967). On the relative efficiencies of gradient methods. Math. Comp. 21, 360-367.
Greenstadt, J. (1970). Variations on variable metric methods. Math. Comp. 24, 1-22.
Guttman, I., and Meeter, D. A. (1965). On Beale's measures of nonlinearity. Technometrics 7, 623-637.
Hadley, G. (1964). "Nonlinear and Dynamic Programming." Addison-Wesley, Reading, Massachusetts.
Hammersley, J. M., and Handscomb, D. C. (1964). "Monte Carlo Methods." Methuen, London.
Hartley, H. O. (1961). The modified Gauss-Newton method for the fitting of nonlinear regression functions by least squares. Technometrics 3, 269-280.
Hartley, H. O. (1964). Exact confidence regions for the parameters in nonlinear regression laws. Biometrika 51, 347-353.
Healy, M. J. R. (1968). Multiple regression with a singular matrix. J. Roy. Statist. Soc. Ser. C Appl. Statist. 17, 110-117.
Heineken, F. G., Tsuchiya, H. M., and Aris, R. (1967a). On the mathematical status of the pseudosteady state hypothesis of biochemical kinetics. Math. Biosci. 1, 95-113.
Heineken, F. G., Tsuchiya, H. M., and Aris, R. (1967b). On the accuracy of determining rate constants in enzymatic reactions. Math. Biosci. 1, 115-141.
Hicks, J. S., and Wei, J. (1967). Numerical solution of parabolic partial differential equations with two-point boundary conditions by use of the method of lines. J. Assoc. Comput. Mach. 14, 549-562.
Hill, W. J., Hunter, W. G., and Wichern, D. W. (1968). A joint design criterion for the dual problem of model discrimination and parameter estimation. Technometrics 10, 145-160.
Himmelblau, D. M., Jones, C. R., and Bischoff, K. B. (1967). Determination of rate constants for complex kinetics models. Ind. Eng. Chem. Fundam. 6, 539-543.
Hoerl, A. E. (1962). Application of ridge analysis to regression problems. Chem. Eng. Progr. 58, 54-59.
Hoerl, A. E., and Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67.
Hood, W. C., and Koopmans, T. C., eds. (1953). "Studies in Econometric Method." Wiley, New York.
Hooke, R., and Jeeves, T. A. (1961). "Direct search" solution of numerical and statistical problems. J. Assoc. Comput. Mach. 8, 212-229.
Hougen, O. A., and Watson, K. M. (1947). "Chemical Process Principles, Part Three: Kinetics and Catalysis." Wiley, New York.
Howland, J. L., and Vaillancourt, R. (1961). A generalized curve fitting method. SIAM J. Appl. Math. 9, 165-168.
Hunter, W. G., and Mezaki, R. (1964). A model building technique for chemical engineering kinetics. AIChE J. 10, 315-322.
Hunter, W. G., and Mezaki, R. (1967). An experimental design strategy for distinguishing among rival mechanistic models: an application to the catalytic hydrogenation of propylene. Can. J. Chem. Eng. 45, 247-249.
Jeffreys, H. (1961). "Theory of Probability," 3rd ed. Oxford Univ. Press, London and New York.
Jennrich, R. I., and Sampson, P. F. (1968). Application of stepwise regression to nonlinear estimation. Technometrics 10, 63-72.
John, F. (1948). Extremum problems with inequalities as subsidiary conditions. In "Studies and Essays," 187-204. Wiley (Interscience), New York.
Johnston, J. (1963). "Econometric Methods." McGraw-Hill, New York.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. J. Basic Eng. 82, 33-45.
Kelley, H. J., and Denham, W. F. (1966). Orbit determination with the Davidon Method. Joint Automat. Control Conf., Seattle, Washington.
Kelley, H. J., and Denham, W. F. (1969). Modeling and adjoints for continuous systems. J. Optimization Theory Appl. 3, 174-183.
Kelley, Jr., J. E. (1958). An application of linear programming to curve fitting. SIAM J. Appl. Math. 6, 15-22.
Kittrell, J. R., Hunter, W. G., and Mezaki, R. (1966). The use of diagnostic parameters for kinetic model building. AIChE J. 12, 1014-1017.
Kittrell, J. R., Hunter, W. G., and Watson, C. C. (1966). Obtaining precise parameter estimates for nonlinear catalytic rate models. AIChE J. 12, 5-10.
Kittrell, J. R., Mezaki, R., and Watson, C. C. (1965). Estimation of parameters for nonlinear least squares analysis. Ind. Eng. Chem. 57, 18-27.
Koopmans, T. C., and Hood, W. C. (1953). The estimation of simultaneous linear economic relationships. In "Studies in Econometric Method" (W. C. Hood and T. C. Koopmans, eds.). Wiley, New York.
Korin, B. P. (1968). On the distribution of a statistic used for testing a covariance matrix. Biometrika 55, 171-178.
Kowalik, J., and Osborne, M. R. (1968). "Methods for Unconstrained Optimization Problems." American Elsevier, New York.
Kuhn, H. W., and Tucker, A. W. (1951). Nonlinear programming. In Proc. Berkeley Symp. Math. Statist. and Probability, 2nd (J. Neyman, ed.). Univ. of California Press, Berkeley, California.
Kullback, S. (1959). "Information Theory and Statistics." Wiley, New York.
Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist. 22, 79-86.
Künzi, H. P., and Krelle, W. (1966). "Nonlinear Programming." Ginn (Blaisdell), Boston, Massachusetts.
Lawton, W. H., and Sylvestre, E. A. (1971). Elimination of linear parameters in nonlinear regression. Technometrics 13, 461-467.
Legendre, A. M. (1805). "Nouvelles Methodes pour la Determination des Orbites des Cometes." Paris.
Lehmann, E. L. (1959). "Testing Statistical Hypotheses." Wiley, New York.
Levenberg, K. (1944). A method for the solution of certain nonlinear problems in least squares. Quart. Appl. Math. 2, 164-168.
Lewis, P. A. W., Goodman, A. S., and Miller, J. M. (1969). A pseudorandom number generator for the System/360. IBM Systems J. 8, 136-145.
Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist. 27, 986-1005.
Longley, J. W. (1967). An appraisal of least squares programs for the electronic computer from the point of view of the user. J. Amer. Statist. Assoc. 62, 819-841.
Mangasarian, O. L. (1969). "Nonlinear Programming." McGraw-Hill, New York.
Marquardt, D. W. (1963). An algorithm for least squares estimation of nonlinear parameters. SIAM J. 11, 431-441.
Marsaglia, G., and Bray, T. A. (1964). A convenient method for generating normal variables. SIAM Rev. 6, 260-264.
McCormick, G. P. (1967). Second-order conditions for constrained minima. SIAM J. Appl. Math. 15, 641-652.
McGhee, R. B. (1963). "Identification of nonlinear dynamic systems by regression analysis methods." Doctoral dissertation, Univ. Southern California, Los Angeles, California (University Microfilms 64-2588, Ann Arbor, Mich.).
Melkanoff, M. A., Sawada, T., and Raynal, J. (1966). Nuclear optical model calculations. In "Methods in Computational Physics" (B. Alder, S. Fernbach, and M. Rotenberg, eds.), Vol. 6. Academic Press, New York.
von Mises, R. (1919). Fundamentalsätze der Wahrscheinlichkeitsrechnung. Math. Zeitschrift 4, 1-97.
Moshman, J. (1967). Random number generation. In "Mathematical Methods for Digital Computers" (A. Ralston and H. S. Wilf, eds.), Volume II. Wiley, New York.
Murtagh, B. A., and Sargent, R. W. H. (1969). A constrained minimization method with quadratic convergence. In "Optimization" (R. Fletcher, ed.). Academic Press, New York.
Nelder, J. A., and Mead, R. (1965). A simplex method for function minimization. Comput. J. 7, 308-313.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Phil. Trans. Roy. Soc. London Ser. A 231, 333-380.
Neyman, J. (1962). Two breakthroughs in the theory of statistical decision making. Rev. Inst. Internat. Statist. 30, 11-27.
Ortega, J. M., and Kaiser, H. F. (1963). The $LL^T$ and $QR$ methods for symmetric tridiagonal matrices. Comput. J. 6, 99-101.
Osborne, M. R., and Watson, G. A. (1969). An algorithm for minimax approximation in the nonlinear case. Comput. J. 12, 63-68.
Pearson, J. D. (1969). Variable metric methods of minimization. Comput. J. 12, 171-178.
Penrose, R. (1955). A generalized inverse for matrices. Proc. Cambridge Philos. Soc. 51, 406-413.
Perdreauville, F. J., and Goodson, R. E. (1966). Identification of systems described by partial differential equations. Trans. ASME, Ser. D 88, 463-468.
Peterson, T. I. (1962). Kinetics and mechanism of naphthalene oxidation by nonlinear estimation. Chem. Eng. Sci. 17, 203-219.
Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., and Mishchenko, E. F. (1962). "The Mathematical Theory of Optimal Processes." K. N. Trigoroff (transl.). Wiley (Interscience), New York.
Powell, M. J. D. (1964). An efficient method for finding the minimum of a function of several variables without calculating derivatives. Comput. J. 7, 155-162.
Powell, M. J. D. (1965). A method for minimizing a sum of squares of nonlinear functions without calculating derivatives. Comput. J. 7, 303-307.
Powell, M. J. D. (1969). A theorem on rank one modification to a matrix and its inverse. Comput. J. 12, 288-290.
Price, C. M. (1964). The matrix pseudoinverse and minimal variance estimates. SIAM Rev. 6, 115-120.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika 43, 353-360.
Raiffa, H., and Schlaifer, R. (1961). "Applied Statistical Decision Theory." Graduate School of Business Administration, Harvard Univ., Boston.
Rao, C. R. (1957). Theory of the method of estimation by minimum chi-square. Bull. Internat. Statist. Inst. 35, 25-32.
Robbins, H. (1955). An empirical Bayes' approach to statistics. In Proc. Berkeley Symp. Statist. and Probability, 3rd 1, 157-164. Univ. of California Press, Berkeley, California.
Robbins, H. (1964). The empirical Bayes' approach to statistical decision problems. Ann. Math. Statist. 35, 1-20.
Rosen, J. B. (1960). The gradient projection method for nonlinear programming: I. Linear constraints. SIAM J. 8, 181-217.
Rosen, J. B. (1961). The gradient projection method for nonlinear programming: II. Nonlinear constraints. SIAM J. 9, 514-532.
Rosenbrock, H. H. (1960). An automatic method for finding the greatest or least value of a function. Comput. J. 3, 175-184.
Rosenbrock, H. H., and Storey, C. (1966). "Computational Methods for Chemical Engineers." Pergamon, Oxford.
Rutemiller, H. C., and Bowers, D. A. (1968). Estimation in a heteroscedastic regression model. J. Amer. Statist. Assoc. 63, 552-557.
Sammet, J. E. (1966). Survey of formula manipulation. Comm. ACM 9, 555-569.
Savage, L. J. (1954). "The Foundations of Statistics." Wiley, New York.
Scheffe, H. (1959). "The Analysis of Variance." Wiley, New York.
Seal, H. L. (1967). The historical development of the Gauss linear model. Biometrika 54, 1-24.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423 and 623-656.
Shinbrot, M. (1954). "On the Analysis of Linear and Nonlinear Dynamical Systems from Transient-Response Data." NACA Technical Notes, TN 3288.
Smith, Jr., F. B., and Shanno, D. F. (1971). An improved Marquardt procedure for nonlinear regressions. Technometrics 13, 63-74.
Solow, R. M. (1957). Technical change and the aggregate production function. Rev. Econom. Statist. 39, 312-320.
Sorenson, H. W. (1966). Kalman filtering techniques. In "Advances in Control Systems" (C. T. Leondes, ed.), Vol. 3. Academic Press, New York.
Spendley, W. (1969). Nonlinear least squares fitting using a modified simplex minimization method. In "Optimization" (R. Fletcher, ed.). Academic Press, New York.
Stewart III, G. W. (1967). A modification of Davidon's minimization method to accept difference approximations of derivatives. J. Assoc. Comput. Mach. 14, 72-83.
Swed, F. S., and Eisenhart, C. (1943). Tables for testing randomness of grouping in a sequence of alternatives. Ann. Math. Statist. 14, 66-87.
Tomovic, R. (1963). "Sensitivity Analysis of Dynamic Systems." McGraw-Hill, New York.
Turner, M. E., Monroe, R. J., and Homer, L. D. (1963). Generalized kinetic regression analysis: hypergeometric kinetics. Biometrics 19, 406-428.
Wagner, H. M. (1959). Linear programming techniques for regression analysis. J. Amer. Statist. Assoc. 54, 206-212.
Wald, A. (1947). "Sequential Analysis." Wiley, New York.
Wiener, N. (1949). "Extrapolation, Interpolation and Smoothing of Stationary Time Series." MIT Press, Cambridge, Massachusetts and Wiley, New York.
Wilde, D. J., and Beightler, C. S. (1967). "Foundations of Optimization." Prentice-Hall, Englewood Cliffs, New Jersey.
Wilkinson, J. H. (1965). "The Algebraic Eigenvalue Problem." Oxford Univ. Press (Clarendon), London and New York.
Winkler, R. L. (1967). The assessment of prior distributions in Bayesian analysis. J. Amer. Statist. Assoc. 62, 776-800.
Wolfe, P. (1963). Methods of nonlinear programming. In "Recent Advances in Mathematical Programming" (R. L. Graves and P. Wolfe, eds.). McGraw-Hill, New York.
Zangwill, W. I. (1967a). Nonlinear programming via penalty functions. Management Sci. 5, 344-358.
Zangwill, W. I. (1967b). Minimizing a function without calculating derivatives. Comput. J. 10, 293-296.
Zoutendijk, G. (1960). "Methods of Feasible Directions." Elsevier, Amsterdam.
Author Index

Numbers in italics refer to the pages on which the complete references are listed.

A
Abadie, J., 83, 146, 325
Acton, F. S., 201, 325
Afifi, A. A., 245, 325
Albert, A. E., 251, 325
Anderson, T. W., 170, 200, 325
Anscombe, F. J., 202, 325
Aris, R., 273, 329
Arndt, R. A., 15, 325
Atkinson, A. C., 265, 272, 325
Auslander, M., 116, 326

B
Bard, Y., 91, 96, 107, 109, 110, 148, 151, 277, 324, 325
Barnett, V. D., 81, 325
Bartels, R. H., 77, 325
Bayes, T., 36, 325
Beale, E. M. L., 170, 191, 325
Beaton, A. E., 296, 326
Beauchamp, J. J., 16, 253, 326
Behnken, D. W., 265, 326
Beightler, C. S., 83, 332
Bellman, R., 16, 226, 230, 242, 326
Berman, M., 16, 326
Bischoff, K. B., 220, 329
Blakemore, J. W., 277, 326
Bodkin, R. G., 25, 133, 138, 326
Boltyanskii, V. G., 225, 331
Bomberault, A., 116, 133, 323, 327
Bond, E., 116, 326
Booth, G. W., 7, 323, 326
Bowers, D. A., 247, 332
Box, G. E. P., 100, 123, 204, 261, 265, 267, 269, 316, 326
Box, M. J., 119, 120, 153, 274, 326
Bray, T. A., 316, 330
Brent, R. P., 120, 326
Broyden, C. G., 107, 108, 326
Buzzi Ferraris, G., 120, 327

C
Carney, T. M., 61, 327
Carpentier, J., 146, 325
Carroll, C. W., 141, 327
Chow, G. C., 61, 327
Collier, C., 16, 326
Colville, A. R., 117, 327
Cornell, R. G., 16, 253, 326, 328
Cornfield, J., 33, 327
Cottle, R. W., 148, 327
Cragg, J. G., 61, 327
Cramer, H., 19, 80, 178, 185, 186, 188, 201, 327

D
Daniel, J. W., 83, 327
Dantzig, G. B., 148, 327
Davidon, W. C., 106, 107, 108, 110, 327
Davies, D., 146, 327
Davies, O. L. D., 260, 327
Deming, W. E., 154, 327
Denham, W. F., 16, 242, 330
Deutsch, R., 77, 225, 327
Draper, N. R., 201, 265, 327

E
Efroymson, M. A., 323
Eisenhart, C., 201, 332
Eisenpress, H., 15, 116, 133, 323, 324, 327
Elashoff, R. M., 245, 325

F
Fariss, R. H., 91, 327
Faure, P., 146, 327
Feller, W., 19, 328
Ferguson, T. S., 33, 328
Fiacco, A. V., 52, 107, 108, 142, 145, 159, 328
Fisher, R. A., 7, 260, 328
Flanagan, P. D., 91, 328
Fletcher, R., 107, 110, 120, 328
Fox, L., 307, 328
Freudenstein, F., 16, 328

G
Galambos, J. T., 253, 328
Gamkrelidze, R. V., 225, 331
Gardner, L. A., 251, 325
Gauss, K. F., 6, 97, 328
Goldfarb, D., 110, 146, 328
Goldfeld, S. M., 94, 328
Goldstein, A. A., 90, 328
Goldwyn, R. M., 61, 327
Golub, G. H., 77, 102, 103, 122, 325, 328
Goodman, A. S., 316, 330
Goodson, R. E., 221, 331
Grant, F. S., 15, 328
Greenstadt, J., 89, 92, 107, 116, 133, 323, 327, 329
Grisoff, S., 116, 326
Guttman, I., 170, 191, 329

H
Hadley, G., 84, 329
Hammersley, J. M., 46, 329
Handscomb, D. C., 46, 329
Hartley, H. O., 100, 170, 191, 329
Healy, M. J. R., 102, 329
Heineken, F. G., 273, 329
Hicks, J. S., 224, 329
Hill, W. J., 267, 269, 326, 329
Himmelblau, D. M., 220, 329
Hoerl, A. E., 60, 277, 326, 329
Homer, L. D., 16, 332
Hood, W. C., 7, 64, 329, 330
Hooke, R., 119, 120, 329
Hougen, O. A., 277, 329
Howland, J. L., 226, 329
Huard, P., 146, 327
Hunter, W. G., 123, 204, 265, 269, 272, 281, 325, 326, 327, 329, 330

J
Jacquez, J., 226, 230, 326
Jeeves, T. A., 119, 120, 329
Jeffreys, H., 35, 329
Jennrich, R. I., 91, 94, 102, 329
John, F., 52, 329
Johnston, J., 16, 329
Jones, C. R., 220, 329

K
Kagiwada, H., 16, 242, 326
Kaiser, H. F., 302, 331
Kalaba, R., 16, 226, 230, 242, 326
Kalman, R. E., 225, 330
Kelley, H. J., 16, 242, 330
Kelley, Jr., J. E., 71, 330
Kennard, R. W., 60, 329
Kenney, R., 116, 326
Kittrell, J. R., 120, 123, 265, 330
Klein, L. R., 25, 133, 138, 326
Koopmans, T. C., 7, 64, 329, 330
Korin, B. P., 200, 330
Kowalik, J., 84, 330
Krelle, W., 83, 330
Kuhn, H. W., 52, 330
Kullback, S., 267, 330
Künzi, H. P., 83, 330

L
Lapidus, L., 110, 146, 277, 323, 325, 328
Law, V. J., 91, 327
Lawton, W. H., 122, 330
Legendre, A. M., 6, 330
Lehmann, E. L., 170, 330
Leibler, R. A., 267, 330
Levenberg, K., 94, 330
Lewis, P. A. W., 316, 330
Lindley, D. V., 261, 286, 330
Longley, J. W., 103, 330
Lucas, H. L., 261, 265, 326

Mc
McCormick, G. P., 52, 107, 108, 142, 145, 159, 328, 330
McGhee, R. B., 100, 226, 330
MacGregor, M. H., 15, 325

M
Mangasarian, O. L., 53, 330
Marquardt, D. W., 94, 114, 323, 330
Marsaglia, G., 316, 330
Mead, R., 120, 331
Meeter, D. A., 170, 191, 329
Melkanoff, M. A., 15, 331
Mendelsohn, J., 91, 328
Mezaki, R., 120, 123, 204, 281, 329, 330
Miller, J. M., 316, 330
von Mises, R., 73, 331
Mishchenko, E. F., 225, 331
Monroe, R. J., 16, 332
Moshman, J., 316, 331
Muller, M. E., 316, 326
Murtagh, B. A., 146, 331
Myszewski, M., 116, 326

N
Nelder, J. A., 120, 331
Neyman, J., 35, 185, 331

O
Ortega, J. M., 302, 331
Osborne, M. R., 84, 154, 330, 331

P
Pearson, J. D., 107, 331
Penrose, R., 290, 331
Perdreauville, F. J., 221, 331
Pereyra, V., 122, 328
Peterson, T. I., 7, 123, 323, 326, 331
Pontryagin, L. S., 225, 331
Powell, M. J. D., 107, 110, 120, 251, 328, 331
Price, C. M., 319, 331
Price, J. F., 90, 328

Q
Quandt, R. E., 94, 328
Quenouille, M. H., 187, 331

R
Raiffa, H., 33, 35, 77, 283, 331
Rao, C. R., 80, 331
Raynal, J., 15, 331
Robbins, H., 35, 331
Rosen, J. B., 146, 331, 332
Rosenbrock, H. H., 120, 224, 226, 332
Rutemiller, H. C., 247, 332

S
Sammet, J. E., 116, 326, 332
Sampson, P. F., 91, 94, 102, 329
Sargent, R. W. H., 146, 331
Savage, L. J., 33, 332
Sawada, T., 15, 331
Scheffe, H., 190, 332
Schlaifer, R., 33, 35, 77, 283, 331
Schwimmer, S., 226, 230, 326
Seal, H. L., 7, 332
Selvester, R., 16, 326
Shahn, E., 16, 326
Shanno, D. F., 95, 323, 332
Shannon, C. E., 19, 261, 332
Shinbrot, M., 220, 332
Smith, H., 201, 327
Smith, F. B., Jr., 95, 332
Solow, R. M., 134, 332
Sorenson, H. W., 225, 332
Spendley, W., 120, 332
Sridhar, R., 242, 326
Stewart, III, G. W., 119, 332
Storey, C., 120, 224, 226, 332
Surkan, A., 15, 327
Swed, F. S., 201, 332
Sylvestre, E. A., 122, 330

T
Tobey, R., 116, 326
Tomovic, R., 226, 332
Trotter, H. F., 94, 328
Tsuchiya, H. M., 273, 329
Tucker, A. W., 52, 330
Turner, M. E., 16, 332

V
Vaillancourt, R., 226, 329
Vitale, P. A., 91, 328

W
Wagner, H. M., 71, 332
Wald, A., 260, 269, 270, 271, 332
Watson, C. C., 120, 265, 330
Watson, G. A., 154, 331
Watson, K. M., 277, 329
Wei, J., 224, 329
Weiss, M. F., 16, 326
West, G. F., 15, 328
Wichern, D. W., 269, 329
Wiener, N., 16, 332
Wilde, D. J., 83, 332
Wilkinson, J. H., 293, 302, 305, 332
Winkler, R. L., 35, 332
Wolfe, P., 146, 332
Woo, L. S., 16, 328
Wood, F. S., 324

Y
Youle, P. V., 123, 326

Z
Zangwill, W. I., 120, 145, 332
Zilles, S., 116, 326
Zoutendijk, G., 148, 332

Subject Index

B
Bayes' theorem, 36-37
Bayesian estimation, 72-77
Bias, 40-41
    estimation of, 47
    reduction of, 187
Bienayme-Chebyshev inequality
    for multiple parameters, 188-189
    for single parameter, 186
Bounds on parameters, 151-153
    effect on sampling distribution, 182-183
    need for, 141
Bounds on state variables, 232

C
Canonical form, 174-175
Chebyshev estimate, see Minimax deviation estimate
Chemical kinetics models, 15, 222, 229-230, 233-241
Chi square distribution, 21
Cholesky decomposition, 307-309
    modified for Marquardt's method, 95
Complementary pivot problem, 148
Complementary slackness, 52
Computer role in experiments, 175
Computer programs for parameter estimation, 323-324
Conditional distribution, 312
Confidence interval, 6, 184-187
Confidence region, 187-191
    illustration, 208
    for linearized model, 190-191
    minimizing volume of, 263
Constraints, 49-53, see also Bounds on parameters, Bounds on state variables, Equality constraints, Inequality constraints
    arising from prior information, 32-33
    effect on sampling distribution, 180-183
Control theory, problems of, 225
Correlation, 311
    test for, 200-201, 216
Curve fitting, 1-2

D
Data
    errors in, see Errors
    randomness of, 18
    requirements for estimation, 69-70
Data matrix, 17
Davidon-Fletcher-Powell method, 110
Decision theory
    applied to design of experiments, 283-285
    applied to parameter estimation, 74-76
Deming's method, 154-159
Dependent variables, 13
Derivative-free methods, 117-120
Design criteria, locating maximum of, 273-276
Design of experiments, 258
    for decision making, 283-285
    for discriminating among models, 266-269
    for parameter estimation, 261-265
    for prediction, 265-266
    termination criteria, 269-271
Determinant, 291
    computation of, 301
Deterministic model, 11-12
Differential equations, see also Dynamic models
    models formulated as, 218
    numerical integration of, 230-231
    stability of, 231-232
Differentiation
    analytic, by computer, 116, 323
    of dynamic model objective function, 226-228
    importance of accuracy in, 116
    of matrix functions, 293-296
    numerical, 117-119
Direct search methods, 119-120
Directional discrimination methods, 91-94
Discrimination among models
    design of experiments for, 266-269
    illustration, 277-283
Dynamic models
    computation of objective function, 225-230
    difficulties associated with, 231-233
    gradient of objective function, 226-230
    illustration, 233-238
    methods of solution, 218-221
    reduction to standard form, 223-224
    standard, 221-223

E
Econometric models, 25-26, 133-138, 167-168, 213-216
Eigenvalue decomposition, 304-305
Eigenvalues and vectors, 290
    computation of, for real symmetric matrix, 301-303
Equality constraints, 49-51
    linear, application of projection method to, 160
    model equations viewed as, see Exact structural model
    penalty functions for, 159-160
Errors, 54
    distribution of, 22-23
    estimating parameters of distribution, 195
Estimate
    asymptotically efficient, 43
    consistent, 42
    efficient, 41
    ill-determined, 172, 203
    linear, 44
    robust, 44
    statistical properties of, see Sampling distribution
    sufficient, 44
    unbiased, 40
    well determined, 172
Estimation procedures
    desirable properties for, 44-45
    reasons for failure, 202-204
Exact structural model, 24
    computation of estimates for, 154-159
    covariance matrix of parameter estimates, 179, 212-214
    estimating parameters of error distribution, 196-197
    illustration, 163-168
    maximum likelihood method for, 68-69
Experimental conditions, 17, 258
Experiments, 17
    cost of, 173, 284-285
    design of, see Design of experiments
    simulated by computer, 46, 276-277

F
F distribution, 21, 190
Fariss-Law method, 93
Feasible region, 48
Finite differences, 117-119
    central, 119
    determination of optimum length, 118
    for dynamic models, 226
    one-sided, 117-118

G
Gauss-Jordan pivot, 296
Gauss-Markov theorem, 59, 318-319
Gauss method, 97-106
    implementation of, 101-106
    with penalty functions, 106, 144
    with prior distribution, 101, 106, 131-133
    as sequence of linear regressions, 99-100
    single-equation least squares, 97
        illustration, 124-130
Gaussian distribution, see Normal distribution
Givens-Householder transformation, 302
Goodness of fit criteria, 198-202
Gradient methods, 86
    convergence of, 87-88, 115-117, 320-322
    efficiency of, 89
    step length determination in, 110-113

H
Hessian matrix, 88
    Gauss approximation for, 97-99
    variable metric approximations for, 106-110

I
Independent variables, 14, 221
    subject to error, see Exact structural model
Indifference region, 171-173
    illustration, 207
Inequality constraints, 51-53
    linear, application of projection method to, 146-153
    penalty functions for, 141-145
Inexact structural model, 24-25
Information
    for discrimination, 267
    in a distribution, 19, 261
    gained from an experiment, 261
    in normal distribution, 262
    prior, 32-35
Inhomogeneous covariance, 246-248
Initial conditions, 211
Initial guess, 120-123
Interpolation-extrapolation methods, 111-113
    illustrations, 127-129
Interval estimate, 6
Iterative methods, 84-88, see also Derivative-free methods, Gradient methods
    acceptable, 85-86
    initial guess for, 120-123
    termination criteria for, 114-115

K
Kuhn-Tucker condition, 52
    for quadratic program, 147

L
Lagrange multipliers, 50-53
Least squares method, 55-61, see also Regression
    unweighted, 55
    weighted, 56-57
Levenberg method, see Marquardt method
Likelihood, 26-29
    concentrated, 65-66
    standard reduced model, 27
    structural models, 27-29
Likelihood equations, 62
Likelihood ratio, 269
Linear equations, solution of, 299
Linearity, 5
Linearizing transformations, 78-80, 122
    illustration, 131
Linearly dependent equations, 238-241

M
Marginal distribution, 311
Marquardt method, 94-96
    illustration, 130-131
Matrix
    condition of, 89, 305
        improved by scaling, 306
    rank of, 292, 300-301
    spectral decompositions of, 303-309
    square root of, 307
    trace of, 291
Matrix algebra, 287-293
Matrix functions, differentiation of, 293-296
Matrix inverse, 289, 298-299
    updating of, 250
Matrix pseudoinverse, 290, 304
Maximization, see Optimization
Maximum likelihood method, 61-71
    exact structural model, 68-69
    illustrations, 133-138
    independent variables subject to error, 67-68
    normal distribution, 63-70
    two-sided exponential distribution, 70-71
    uniform distribution, 70
    unknown covariance matrix, 64-66
Measurement errors, see Errors
Minimax deviation estimate, 77
    computational method for, 154
Minimization, see Optimization
Minimum chi-square method, 80
Minimum risk estimate, 74-76
Minimum variance bound, 41
Missing observations, 244-245
    illustrations, 251-255
Mode-of-posterior-distribution estimate, 73-74
    illustration, 131-133
    sampling distribution of, 192
Model
    deterministic form, 11-12
    estimation of parameters of, 2-4
    formulation of, illustration, 29-31
    stochastic form, 24-26
Moment matrix, 64
    likelihood expressed as function of, 97-99
Monte Carlo method, 46
    illustration, 210-212

N
Newton-Greenstadt method, 92
Newton's method, 89-91
Nonlinear programming, 83
Normal distribution, 18-21
    generating pseudorandom sample from, 316-317
    information in, 261
    maximum likelihood method for, 63-70
    multivariate, 20-21
    univariate, 20
Normal equations, 49

O
Objective function, 47
    computation for standard dynamic model, 225-230
    as function of moment matrix, 97-99
Observed variables, 221
Optimality conditions
    constrained problems, 49-53
    unconstrained problems, 48-49
Optimization, 47
Orthogonalization, 103-106

P
Parameter estimation, 4
    applications of, 14-16
    computer programs for, 323-324
    design of experiments for, 262-265
    history of, 6-7
    in a probability distribution, 3, 80
    problem formulation, 37
Parameters, 11-12
Penalty functions
    equality constraints, 159-160
    illustration, 160-161
    inequality constraints, 141-145
    as a prior distribution, 145
Pivoting, 296
Point estimate, 6, 39
Positive definite matrix, 290
    role in gradient methods, 86, 116
Posterior distribution, 36-37
    mode of, 73-74
Prediction, 13-14
    design of experiments for, 265-266
    errors in, 204-205
Principal components, 183-184, 208
Prior distribution, 33-35
    informative, 35
    noninformative, 34-35
Probability distribution, 310
    estimating parameters of, 3, 80
Projection method, 146-153
    for bounded parameters, 151-153
    illustration, 162-163
    for linear equality constraints, 160
    step length determination, 149
Pseudomaximum likelihood method, 78
Pseudorandom numbers, 316-317

Q
QR method, 302-303
Quadratic programming, 147
Quasilinearization, 230

R
Rank one correction method, 107-109
Rao-Cramer theorem, 41, 313-315
Reduced model, 13-14
    standard, 26
Regression
    backward selection, 302
    forward selection, 301
    multiple linear, 58-61
        methods of solution, 102-106
    ridge, 60
    stepwise, 59, 101, 301-302
Reparametrization, see Transformation of variables
Residuals, 54
    analysis of, illustration, 209-210, 214-216
    outliers, 202
    run tests, 201-202
    statistical properties of, 193-196
    statistical tests on, 199-202
Risk, 74, 283-284
Rotational discrimination methods, 91

S
Sampling distribution, 39-45
    covariance matrix of, 176-179, 207
    effect of constraints on, 180-183
    estimation of statistical properties of, 175-183
    evaluation by Monte Carlo method, 46, 210-212
    evaluation of statistical properties of, 45-47
    statistics of, 40
Scaled decomposition, 305-307
Scientific investigation, goals of, 258-260
Sensitivity equations, 226-230
Sequential reestimation, 248-251
    illustration, 255-257
Serial correlation, 247-248
Simulation of experiments, 46, 276-277
Spectral decompositions, 303-309
State variables, 221
    bounds on, 231
Steepest descent method, 88
Stochastic approximation, 251
Stochastic model, 14-16
Structural model, 12-13, see also Exact structural model, Inexact structural model
Sufficient statistic, 43, 61
SUMT method, 141
Sweeping, 296

T
Termination criteria
    for iterative methods, 114-115
    for sequential experiments, 269-271
Transformation of variables, see also Linearizing transformations
    effect on sampling distribution, 205-206
    to eliminate constraints, 153
    invariance of estimates under, 44
    to simplify model, 133

U
Uncertainty, 261
Uniform distribution, 21-22
    maximum likelihood method for, 70

V
Variable metric methods, 106-110
Vectors, linear independence of, 300-301