Notes on Statistics for Physicists, Revised

APPENDIX IV. NUMERICAL METHODS FOR MAXIMUM LIKELIHOOD AND LEAST SQUARES SOLUTIONS

In many cases the likelihood function is not analytical or else, if analytical, the procedure for finding the alpha _j* and the errors is too cumbersome and time consuming compared to numerical methods using modern computers.

For reasons of clarity we shall first discuss an inefficient, cumbersome method called the grid method. After such an introduction we shall be equipped to go on to a more efficient and practical method called the method of steepest descent.

The grid method

If there are M parameters alpha ₁, ... , alpha _M to be determined one could in principle map out a fine grid in M-dimensional space evaluating w( alpha ) (or S( alpha )) at each point. The maximum value obtained for w is the maximum likelihood solution w*. One could then map out contour surfaces of w = (w* - ½), (w* - 1), etc. This is illustrated for M = 2 in Fig. 6.

Figure 6. Contours of fixed w enclosing the max. likelihood solution w*.

In the case of good statistics the contours would be small ellipsoids. Fig. 7 illustrates a case of poor statistics.

Figure 7. A poor statistics case of Fig. 6.

Here it is better to present the (w* - ½) contour surface (or the (S* + 1) surface) than to try to quote errors on alpha . If one is to quote errors it should be in the form alpha ₁^- < alpha ₁ < alpha ₁⁺ where alpha ₁^- and alpha ₁⁺ are the extreme excursions the surface makes in alpha ₁ (see Fig. 7). It could be a serious mistake to quote a^- or a⁺ as the errors in alpha ₁.

In the case of good statistics the second derivatives ð²w / ð alpha _a ð alpha _b = - H_ab could be found numerically in the region near w*. The errors in the alpha 's are then found by inverting the H-matrix to obtain the error matrix for alpha ; i.e., = (H^-1)_ij. The second derivatives can be found numerically by using

In the case of least squares use H_ij = ½ ðS / ð alpha _i ð alpha _j .

So far we have for the sake of simplicity talked in terms of evaluating w( alpha ) over a fine grid in M-dimensional space. In most cases this would be much too time consuming. A rather extensive methodology has been developed for finding maxima or minima numerically. In this appendix we shall outline just one such approach called the method of steepest descent. We shall show how to find the least squares minimum of S( alpha ). (This is the same as finding a maximum in w( alpha )).

Method of Steepest Descent

At first thought one might be tempted to vary alpha ₁ (keeping the other alpha 's fixed) until a minimum is found. Then vary alpha ₂ (keeping the others fixed) until a mew minimum is found, and so on. This is illustrated in Fig. 8 where M = 2 and the errors are strongly correlated. But in Fig. 8 many trials are needed. This stepwise procedure does converge, but in the case of Fig. 8, much too slowly. In the method of steepest descent one moves against the gradient in alpha -space:

Figure 8. Contours of constant S vs. alpha ₁ and alpha ₂. Stepwise search for the minimum.

So we change all the 's simultaneously in the ratio ðS / ð₁ : ðS / ð₂ : ðS / ð₃ : ... . In order to find the minimum along this line in -space one should use an efficient step size. An effective method is to assume S(s) varies quadratically from the minimum position s* where s is the distance along this line. Then the step size to the minimum is

where S₁, S₂, and S₃ are equally spaced evaluations of S(s) along s with step size Delta s starting from s₁; i.e., s₂ = s₁ + Delta s, s₃ = s₁ + 2 Delta s. One or two iterations using the above formula will reach the minimum along s shown as point (2) in Fig. 9. The next repetition of the above procedure takes us to point (3) in Fig. 9. It is clear by comparing Fig. 9 with Fig. 8 that the method of steepest descent requires much fewer computer evaluations of S( alpha ) than does the one variable at a time method.

Figure 9. Same as Fig. 8, but using the method of steepest descent.

Least Squares with Constraints

In some problems the possible values of the _j are restricted by subsidiary constraint relations. For example, consider an elastic scattering event in a bubble chamber where the measurements y_j are track coordinates and the _i are track directions and momenta. However, the combinations of _i that are physically possible are restricted by energy-momentum conservation. The most common way of handling this situation is to use the 4 constraint equations to eliminate 4 of the 's in S(). Then S is minimized with respect to the remaining 's. In this example there would be (9 - 4) = 5 independent 's: two for orientation of the scattering plane, one for direction of incoming track in this plane, one for momentum of incoming track, and one for scattering angle. There could also be constraint relations among the measurable quantities y_i. In either case, if the method of substitution is too cumbersome, one can use the method of Lagrange multipliers.

In some cases the constraining relations are inequalities rather than equations. For example, suppose it is known that ₁ must be a positive quantity. Then one could define a new set of 's where (₁')² = ₁, ₂' = ₂, etc. Now if S(') is minimized no non-physical values of a will be used in the search for the minimum.