Practical Statistics for Astronomers II

1. INTRODUCTION

This paper is the second and last in a short series describing some aspects of statistics and statistical inference which may be useful to astronomers. The ancient disclaimers of Paper I (Wall 1979) stand: there is no attempt at completeness, rigour or indeed any theoretical justification at all. The emphasis is on non-parametric (distribution-free) statistics, primarily because astronomers rarely know enough about what they are measuring to know the form of the underlying population.

In the gap between the first and last papers of this series, some progress has taken place in the awareness of astronomers of the existence and power of statistical techniques; and it would be unfortunate if it were otherwise. Much of that progress has been highly specialized. Entire astronomical industries have grown, creating their own jargons and barriers. Some examples of these industries, a few of which have ancient roots, are as follows.

Statistical techniques in spectral analysis, including the Fourier quotient method (Sargent et al. 1977), cross-correlation analysis (Tonry & Davis 1979), and the Fourier difference method (Efstathiou, Ellis & Carter 1980).
Space-density and luminosity-function analysis via the V / V_max or ``luminosity-volume'' test (Schmidt 1968; Rowan-Robinson 1968; Avni & Bahcall 1980).
Time-series analysis and period finding (Jenkins & Watts 1968; Anderson 1971; Kendall 1976).
Galaxy and source clustering analysis via power spectrum techniques (Webster 1976), two-point to N-point correlation analyses (Peebles 1980), percolation theory (Zeldovich, Einasto & Shandarin 1982) and minimal spanning trees (Barrow, Bhavsar & Sonoda 1985).
Image processing/restoration via various algorithms: for example, the iterative Clean process (Högbom 1974), the Bayesian maximum-entropy method (Gull & Daniel 1978; Gull & Skilling 1984), and the Lucy-Richardson iterative algorithm (Richardson 1972; Lucy 1974), which gained a new lease of life in image processing for the pre-COSTAR Hubble Space Telescope (e.g. Adorf 1992).
Survival analysis, the analysis of ``censored'' or incomplete data (upper limits) based on actuarial techniques of survival/mortality (Avni et al. 1980; Miller 1981; Pfleiderer & Krommidas 1982; Feigelson & Nelson 1985; Isobe, Feigelson & Nelson 1986; Sadler, Jenkins & Kotanyi 1989).

Here I leave these industries to their own devices for the most part; there remains the need for the general industrialist, and Paper I plus this paper are for such people rather than for the industrial specialist.

Paper I mentioned a few basic references. It might be useful to discuss updated references of a general nature, the specialized references appearing in the course of this paper. The general references I have binned into five types: popular, the basic text, the rigorous text, the data analysis manual and the books of specialist interest to astronomers.

The classic popular books have legendary titles: How to Lie with Statistics (Huff 1973), Facts from Figures (Moroney 1965), Statistics in Action (Sprent 1977) and Statistics without Tears (Rowntree 1981). They are all fun. A modern version with a twist in the title is Seeing through Statistics (Utts 1996), which entertains, serves as a statistics primer, and is almost a member of my next group.
Textbooks come in sub-types (a) and (b), both of which almost inevitably cover similar material, at least for the first two-thirds of the book. They start with descriptive or summarizing statistics (mean, standard deviation), the distributions of these statistics, then move to the concept of probability and hence statistical inference and hypothesis testing, including correlation of two variables. They then diverge, choosing from a menu including analysis of variance (ANOVA), regression analysis, non-parametric statistics, etc. Modern versions come in bright colours and flavours, perhaps to help presentation to undergraduates of a subject with which excitement is not always associated; the value in many such books is exceptional because of the huge sales they generate. They are complete with tables, ready summaries of tests and formulae inside covers or in coloured insets, and frequently with floppy disks including test data sets. Those of sub-type (a) are essentially devoid of any mathematics but with much algebra and arithmetic in the form of worked examples, and are statistics primers for undergraduates in non-scientific disciplines. Those of sub-type (b) have basic mathematics which may run through to simple calculus. A wonderfully readable example of the former is Statistics by Freedman et al. (1995), in which a non-conventional approach is adopted, very successfully. Another which goes substantially further, for example including ANOVA and non-parametric tests, is Introductory Statistics by Weiss (1995), which is entertaining through its inclusion of short biographical sketches of the founding fathers of statistical science.
Of sub-type (b), more appropriate in the present context but not necessarily so entertaining, an outstanding example is Mathematical Statistics and Data Analysis by Rice (1995), basic but erudite and thorough; it goes so far as to discuss covariance matrices, Bayesian inference, moment-generating functions, multiple linear regression and computer-intensive methods such as the bootstrap; it includes a floppy disk with examples, and all for under £20 in hardback. Unfortunately non-parametric tests are not mentioned. They are in other basic texts of sub-type (b), such as that by Hogg & Tanis (1993: Probability and Statistical Inference), a tried-and-true serious textbook with excellent presentation, now in its 4th edition.
The serious books which go beyond the undergraduate level include Statistics: Concepts and Applications by Frank & Althoen (1994), a thorough and well set-out description of classical general statistics; and Statistical Inference by Casella & Berger (1990), where the theory is presented in a highly accessible manner. As a very complete reference, there is the modern version of Kendall & Stuart in the form of the three-volume set of Kendall's Advanced Theory of Statistics, the three volumes being Distribution Theory (Stuart & Ord 1994), Classical Inference and Relationship (Stuart & Ord 1991), and Bayesian Inference (O'Hagan 1994).
The data analysis books are led by Bevington's (1969) highly practical Data Reduction and Error Analysis for the Physical Sciences. A useful little monograph is A Practical Guide to Data Analysis for Physical Science Students by Lyons (1991). Monographs which simply discuss the application of statistical tests might also be considered in this class, and among these 100 Statistical Tests by Kanji (1993) stands out for the sheer baldness with which the tests are presented, one page plus a page for the worked example. Classic in its simplicity it may be, but the lethal nature of the availability of a large number of unconsidered tests must be emphasized. With regard to applying non-parametric statistical tests, the books by Conover (1980: Practical Non-parametric Statistics) and Siegel & Castellan (1988: Non-parametric Statistics for the Behavioral Sciences) are very straightforward, the latter being particularly recommended. Manuals of the now highly developed statistics program packages, e.g. MINITAB, SPSS, GENSTAT, S-PLUS, contain much practical advice. The dominant force in physical analysis books is, however, Numerical Recipes (Press et al. 1992), which contains unparalleled breadth, much common sense and subroutines in our favourite computer languages which invariably work. Finally note the two books by Tufte, The Visual Display of Quantitative Information and Envisioning Information (1983 and 1990, respectively), magnificent in their presentation and representing essential browsing for anybody wishing to present data in graphical form.
The growth of interest by astronomers in statistical methods, perhaps driven by the data explosion, is demonstrated by a series of specialist conferences which have resulted in the collection of much useful information. The first of these, Statistical Methods in Astronomy (Rolfe 1983), contains useful background bibliographies in time-series analysis and in non-parametric statistics. The two later conferences, Errors, Bias and Uncertainties in Astronomy (Jaschek & Murtagh 1990) and Statistical Challenges in Modern Astronomy (Feigelson & Babu 1992) reflect the dramatic change in what we consider to be the important data sets over a 15-yr period, and are instructive reading for this alone.

Paper I described some basic concepts, summarized the most common probability distributions, discussed the Normal distribution and its relevance to signal detection, and described filtering to improve the signal-to-noise ratio in the detection process. This paper continues by considering correlation in the various guises in which astronomers find it, regression analysis, data modelling and sample comparison. The statistical tables supporting the tests are in Appendix A; these are illustrative rather than exhaustive, and are modelled on the tables in Siegel & Castellan (1988), which are more extensive and contain full references.