ASURV
Astronomy SURVival Analysis
A Software Package for Statistical Analysis of
Astronomical Data Containing Nondetections
Developed by:
Takashi Isobe (Center for Space Research, MIT)
Michael LaValley (Dept. of Statistics, Penn State)
Eric Feigelson (Dept. of Astronomy & Astrophysics,
Penn State)
Available from:
Code@stat.psu.edu
or,
Eric Feigelson
Dept. of Astronomy & Astrophysics
Pennsylvania State University
University Park PA 16802
(814) 865-0162
Email: edf@astro.psu.edu (Internet)
Rev. 1.1, Winter 1992
TABLE OF CONTENTS
1 Introduction ............................................ 3
2 Overview of ASURV ....................................... 4
2.1 Statistical Functions and Organization .......... 4
2.2 Cautions and caveats ............................ 6
3 How to run ASURV ........................................ 8
3.1 Data Input Formats .............................. 8
3.2 KMESTM instructions and example ................. 9
3.3 TWOST instructions and example .................. 10
3.4 BIVAR instructions and example .................. 12
4 Acknowledgements ........................................ 20
Appendices ................................................ 21
A1 Overview of survival analysis .................... 21
A2 Annotated Bibliography on Survival Analysis ...... 22
A3 How Rev 1.1 is Different From Rev 0.0 ............ 25
A4 Errors Removed in Rev 1.1 ........................ 27
A5 Errors removed in Rev 1.2 ........................ 27
A6 Obtaining and Installing ASURV ................... 27
A7 User Adjustable Parameters ....................... 28
A8 List of subroutines used in ASURV Rev 1.1 ........ 31
NOTE
This program and its documentation are provided `AS IS' without
warranty of any kind. The entire risk as to the results and performance
of the program is assumed by the user. Should the program prove
defective, the user assumes the entire cost of all necessary corrections.
Neither the Pennsylvania State University nor the authors of the program
warrant, guarantee or make any representations regarding use of, or the
results of the use of, the program in terms of correctness, accuracy,
reliability, or otherwise. Neither shall they be liable for any direct or
indirect damages arising out of the use, results of use or inability to
use this product, even if the University has been advised of the possibility
of such damages or claim. The use of any registered trademark depicted
in this work is merely ancillary; the authors have no affiliation with
or approval from these trademark owners.
1 Introduction
Observational astronomers frequently encounter the situation where they
observe a particular property (e.g. far infrared emission, calcium line
absorption, CO emission) of a previously defined sample of objects, but
fail to detect all of the objects. The data set then contains nondetections
as well as detections, preventing the use of simple and familiar statistical
techniques in the analysis.
A number of astronomers have recently recognized the existence
of statistical methods, or have derived similar methods, to deal with these
problems. The methods are collectively called `survival analysis' and
nondetections are called `censored' data points. ASURV is a menu-driven
stand-alone computer package designed to assist astronomers in using methods
from survival analysis. Rev. 1.1 of ASURV provides the maximum-likelihood
estimator of the censored distribution, several two-sample tests, correlation
tests and linear regressions as described in our papers in the Astrophysical
Journal (Feigelson and Nelson, 1985; Isobe, Feigelson, and Nelson, 1986).
No statistical procedure can magically recover information that was
never measured at the telescope. However, there is frequently important
information implicit in the failure to detect some objects which can be
partially recovered under reasonable assumptions. We purposely provide
a variety of statistical tests - each based on different models of where upper
limits truly lie - so that the astronomer can judge the importance of the
different assumptions. There are also reasons for not applying these methods.
We describe some of their known limitations in Section 2.2.
Because astronomers are frequently unfamiliar with the field of
statistics, we devote Appendix A1 to background material. Both general issues
concerning censored data and specifics regarding the methods used in ASURV are
discussed. More mathematical presentations can be found in the references
given in Appendix A2. Users of Rev 0.0, distributed between 1988 and 1991, are
strongly encouraged to read Appendix A3 to become familiar with the changes made in
the package. Appendices A4-A8 are needed only by code installers and those
needing to modify the I/O or array sizes. Users wishing to get right down to
work should read Section 2.1 to find the appropriate program, and follow the
instructions in Section 3.
Acknowledging ASURV
We would appreciate it if studies using this package included phrasing
similar to `... using ASURV Rev 1.2 (B.A.A.S. reference), which implements the
methods presented in (Ap. J. reference)', where the B.A.A.S. reference is the
most recent published ASURV Software Report (Isobe and Feigelson 1990;
LaValley, Isobe and Feigelson 1992) and the Ap. J. reference is Feigelson and
Nelson (1985) for univariate problems and Isobe, Feigelson and Nelson (1986)
for bivariate problems. Other general works appropriate for referral include
the review of survival methods for astronomers by Schmitt (1985), and the
survival analysis monographs by Miller (1981) or Lawless (1982).
2 Overview of ASURV
2.1 Statistical Functions and Organization
The statistical methods for dealing with censored data might be
divided into a 2x2 grid: parametric vs. nonparametric, and univariate vs.
bivariate. Parametric methods assume that the censored survival times
are drawn from a parent population with a known distribution function (e.g.
Gaussian, exponential), and this is the principal assumption of survival
models for industrial reliability. Nonparametric models make no assumption
about the parent population, and frequently rely on maximum-likelihood
techniques. Univariate methods are devoted to determining the characteristics
of the population from which the censored sample is drawn, and comparing such
characteristics for two or more censored samples. Bivariate methods measure
whether the censored property of the sample is correlated with another property
of the objects and, if so, perform a regression that quantifies the
relationship between the two variables.
We have chosen to concentrate on nonparametric models, since the
underlying distribution of astronomical populations is usually unknown.
The linear regression methods, however, are either fully parametric (e.g. EM
algorithm regression) or semi-parametric (e.g. Cox regression, Buckley-James
regression). Most of the methods are described in more detail in the
astronomical literature by Schmitt (1985), Feigelson and Nelson (1985) and
Isobe et al. (1986). The generalized Spearman's rho used in ASURV
Rev 1.2 is derived by Akritas (1990).
The program within ASURV giving univariate methods is UNIVAR. Its first
subprogram is KMESTM, giving the Kaplan-Meier estimator for the distribution
function of a randomly censored sample. First derived in 1958, it lies at the
root of non-parametric survival analysis. It is the unique, self-consistent,
generalized maximum-likelihood estimator for the population from which the
sample was drawn. When formulated in cumulative form, it has analytic
asymptotic (for large N) error bars. The median is always well-defined, though
the mean is not if the lowest point in the sample is an upper limit. It is
identical to the differential `redistribute-to-the-right' algorithm,
independently derived by Avni et al. (1980) and others, but the differential
form does not have simple analytic error analysis.
The second univariate program is TWOST, giving a variety of ways to test
whether two censored samples are drawn from the same parent population. They
are mostly generalizations of standard tests for uncensored data, such as the
Wilcoxon and logrank nonparametric two-sample tests. They differ in how the
censored data are weighted or `scored' in calculating the statistic. Thus each
is more sensitive to differences at one end or the other of the distribution.
The Gehan and logrank tests are widely used in biometrics, while some of the
others are not. The tests offered in Rev 1.1 differ significantly from those
offered in Rev 0.0 and details of the changes are in Section A3.
BIVAR provides methods for bivariate data, giving three correlation
tests and three linear regression analyses. Cox hazard model (correlation
test), EM algorithm, and Buckley-James method (linear regressions) can treat
several independent variables if the dependent variable contains only one kind
of censoring (i.e. upper or lower limits). Generalized Kendall's tau
(correlation test), Spearman's rho (correlation test), and Schmitt's binned
linear regression can treat mixed censoring including censoring in the
independent variable, but can have only one independent variable.
COXREG computes the correlation probabilities by Cox's proportional
hazard model and BHK computes the generalized Kendall's tau. SPRMAN computes
correlation probabilities based on a generalized version of Spearman's rank
order correlation coefficient. EM calculates linear regression coefficients
by the EM algorithm assuming a normal distribution of residuals, BJ
calculates the Buckley-James regression with Kaplan-Meier residuals, and
TWOKM calculates the binned two-dimensional Kaplan-Meier distribution and
associated linear regression coefficients derived by Schmitt (1985). A
bootstrapping procedure provides error analysis for Schmitt's method in
Rev 1.1. The code for the EM algorithm follows that given by Wolynetz (1979)
except for minor changes. The code for the Buckley-James method is adapted from
Halpern (Stanford Dept. of Statistics; private communication). Code for the
Kaplan-Meier estimator and some of the two-sample tests was adapted from that
given in Lee (1980). COXREG, BHK, SPRMAN, and the TWOKM code were developed
entirely by us.
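
   To make the pairwise logic behind the generalized Kendall's tau concrete,
the short Python sketch below counts the definite concordances and
discordances between two censored variables. It is an illustration only, not
a transcription of the BHK Fortran subroutine: it keeps a separate censoring
code for each variable instead of ASURV's combined bivariate indicator, and
it estimates significance with a simple permutation test rather than the
asymptotic variance formula used by BHK.

import numpy as np

# Censoring code for a single variable, following the ASURV convention:
#   -1 = upper limit (true value <= x), 0 = detection, +1 = lower limit.

def bounds(x, c):
    """Lower and upper bounds on the true value implied by the censoring code."""
    lo = -np.inf if c == -1 else x
    hi = np.inf if c == 1 else x
    return lo, hi

def definite_sign(xi, ci, xj, cj):
    """+1 if point i is certainly larger than j, -1 if certainly smaller, 0 if ambiguous."""
    lo_i, hi_i = bounds(xi, ci)
    lo_j, hi_j = bounds(xj, cj)
    if lo_i > hi_j:
        return 1
    if hi_i < lo_j:
        return -1
    return 0

def kendall_statistic(x, cx, y, cy):
    """Sum over all pairs of the product of definite ordering signs in x and in y."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (definite_sign(x[i], cx[i], x[j], cx[j]) *
                      definite_sign(y[i], cy[i], y[j], cy[j]))
    return total

def permutation_probability(x, cx, y, cy, n_perm=2000, seed=1):
    """Two-sided permutation probability for independence of x and y."""
    x, cx, y, cy = map(np.asarray, (x, cx, y, cy))
    rng = np.random.default_rng(seed)
    observed = abs(kendall_statistic(x, cx, y, cy))
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(len(x))
        if abs(kendall_statistic(x, cx, y[p], cy[p])) >= observed:
            exceed += 1
    return exceed / n_perm

Applied to data in the form of Table 2 (after decoding the combined
indicator), the statistic behaves like the BHK test, though the permutation
probability will generally differ somewhat from ASURV's asymptotic value.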
Detailed remarks on specific subroutines can be found in the comments at
the beginning of each subroutine. The reference for the source of the code
for the subroutine is given there, as well as an annotated list of the
variables used in the subroutine.
2.2 Cautions and Caveats
The Kaplan-Meier estimator works with any underlying distribution
(e.g. Gaussian, power law, bimodal), but only if the censoring is `random'.
That is, the probability that the measurement of an object is censored cannot
depend on the value of the censored variable. At first glance, this
may seem to be inapplicable to most astronomical problems: we detect the
brighter objects in a sample, so the distribution of upper limits always
depends on brightness. However, two factors often serve to randomize the
censoring distribution. First, the censored variable may not be correlated
with the variable by which the sample was initially identified. Thus,
infrared observations of a sample of radio bright objects will be randomly
censored if the radio and infrared emission are unrelated to each other.
Second, astronomical objects in a sample usually lie at different distances,
so that brighter objects are not always the most luminous. In cases where the
variable of interest is censored at a particular value (e.g. flux, spectral
line equivalent width, stellar rotation rate) rather than randomly censored,
then parametric methods appropriate to `Type 1' censoring should be used.
These are described by Lawless (1982) and Schneider (1986), but are not
included in this package.
Thus, the censoring mechanism of each study MUST be understood
individually to judge whether the censoring is or is not likely to be
random. A good example of this judgment process is provided by Magri et al.
(1988). The appearance of the data, even if the upper limits are clustered
at one end of the distribution, is NOT a reliable measure. A frequent (if
philosophically distasteful) escape from the difficulty of determining the
nature of the censoring in a given experiment is to define the population of
interest to be the observed sample. The Kaplan-Meier estimator then always
gives a valid redistribution of the upper limits, though the result may not be
applicable in wider contexts.
The two-sample tests are somewhat less restrictive than the
Kaplan-Meier estimator, since they seek only to compare two samples rather
than determine the true underlying distribution. Because of this, the
two-sample tests do not require that the censoring patterns of the two samples
are random. The two versions of the Gehan test in ASURV assume that the
censoring patterns of the two samples are the same, but the version with
hypergeometric variance is more reliable when the censoring patterns differ.
The logrank test results appear to be correct as long as the
censoring patterns are not very different. Peto-Prentice seems to be the test
least affected by differences in the censoring patterns. There is little
known about the limitations of the Peto-Peto test. These issues are discussed
in Prentice and Marek (1979), Latta (1981) and Lawless (1982).
Two of the bivariate correlation tests, generalized Kendall's tau
and Cox regression, are known to be inaccurate when many tied values
are present in the data. Ties are particularly common when the data are
binned. Caution should be used in these cases. It is not known how the
third correlation method, generalized Spearman's rho, responds to ties in the
data. However, there is no reason to believe that it is more accurate than
Kendall's tau in this case, and it should also be used with caution in the
presence of tied values.
Cox regression, though widely used in biostatistical applications,
formally applies only if the `hazard rate' is constant; that is, the
cumulative distribution function of the censored variable falls
exponentially with increasing values. Astronomical luminosity functions,
in contrast, are frequently modeled by power law distributions. It is not
clear whether or not the results of a Cox regression are significantly
affected by this difference.
There are a variety of limitations to the three linear regression
methods -- EM, BJ, and TWOKM -- presented here. First, only Schmitt's binned
method implemented in TWOKM works when censoring is present in both variables.
Second, EM requires that the residuals about the fitted line follow a
Gaussian distribution. BJ and TWOKM are less restrictive, requiring only that
the censoring distribution about the fitted line is random. Both we and
Schneider (1986) find little difference in the regression coefficients derived
from these two methods. Third, the Buckley-James method has a defect in that
the final solution occasionally oscillates rather than converges smoothly.
Fourth, there is considerable uncertainty regarding the error analysis of the
regression coefficients of all three models. EM gives analytic formulae
based on asymptotic normality, while Avni and Tananbaum (1986) numerically
calculate and examine the likelihood surface. BJ gives an analytic formula
for the slope only, but it lies on a weak theoretical foundation. We now
provide Schmitt's bootstrap error analysis for TWOKM, although this may be
subject to high computational expense for large data sets. Users may thus
wish to run all methods and evaluate the uncertainty with caution. Fifth,
Schmitt's binned regression implemented in TWOKM has a number of drawbacks
discussed by Sadler et al. (1989), including slow or failed convergence to
the two-dimensional Kaplan-Meier distribution, arbitrary choice of bin size
and origin, and problems if either too many or too few bins are selected.
In our judgment, the most reliable linear regression method may be the
Buckley-James regression, and we suggest that Schmitt's regression be reserved
for problems with censoring present in both variables.
If we consider the field of survival analysis from a broader perspective,
we note a number of deficiencies with respect to censored statistical problems
in astronomy (see Feigelson, 1990). Most importantly, survival analysis
assumes the upper limits in a given experiment are precisely known, while in
astronomy they frequently represent n times the RMS noise level in the
experimental detector, where n = 2, 3, 5 or whatever. It is possible that all
existing survival methods will be inaccurate for astronomical data sets
containing many points very close to the detection limit. Methods that
combine the virtues of survival analysis (which treat censoring) and
measurement error models (which treat known measurement errors in both
uncensored and censored points) are needed. See the discussion by Bickel and
Ritov (1992) on this important matter. A related deficiency is the absence
of weighted means or regressions associated with censored samples. Survival
analysis also has not yet produced any true multivariate techniques, such as
a Principal Components Analysis that permits censoring. There also seems to
be a dearth of nonparametric goodness-of-fit tests like the Kolmogorov-
Smirnov test.
Finally, we note that this package - ASURV - is not unique. Nearly a
dozen software packages for analyzing censored data have been developed
(Wagner and Meeker 1985). Four are part of large multi-purpose commercial
statistical software systems: SAS, SPSS, BMDP, and GLIM. These packages are
available on many university central computers. We have found BMDP to be the
most useful for astronomers (see Appendices to Feigelson and Nelson 1985,
Isobe et al. 1986 for instructions on its use). It provides most of the
functions in KMESTM and TWOST as well as a full Cox regression, but no linear
regressions. Other packages (CENSOR, DASH, LIMDEP, STATPAC, STAR, SURVAN,
SURVCALC, SURVREG) were written at various universities, medical institutes,
publishing houses and industrial labs; they have not been evaluated by us.
3 How to Run ASURV
3.1 Data Input Formats
ASURV is designed to be menu-driven. There are two basic input
structures: a data file, and a command file. The data file is assumed to
reside on disk. For each observed object, it contains the measured value
of the variable of interest and an indicator as to whether it is detected
or not. Listed below are the possible values that the censoring indicator
can take on. Upper limits are most common in astronomical applications and
lower limits are most common in lifetime applications.
For univariate data: 1 : Lower Limit
0 : Detection
-1 : Upper Limit
For bivariate data: 3 : Both Independent and Dependent Variables are
Lower Limits
2 : Only independent Variable is a Lower Limit
1 : Only Dependent Variable is a Lower Limit
0 : Detected Point
-1 : Only Dependent Variable is an Upper Limit
-2 : Only Independent Variable is an Upper Limit
-3 : Both Independent and Dependent Variables are
Upper Limits
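
     As a purely illustrative aid (ASURV itself reads the indicator
directly), the bivariate codes can be decoded into per-variable censoring
flags with a small Python mapping:

# Decode an ASURV bivariate censoring indicator into separate flags for the
# independent and dependent variables (illustrative Python, not ASURV code).
BIVARIATE_CODES = {
     3: ("lower limit", "lower limit"),
     2: ("lower limit", "detection"),
     1: ("detection",   "lower limit"),
     0: ("detection",   "detection"),
    -1: ("detection",   "upper limit"),
    -2: ("upper limit", "detection"),
    -3: ("upper limit", "upper limit"),
}

# Example: -1 means only the dependent variable is an upper limit.
assert BIVARIATE_CODES[-1] == ("detection", "upper limit")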
The command file may reside on disk, but is more frequently
given interactively at the terminal.
For the univariate program UNIVAR, the format of the data file is
determined in the subroutine DATAIN. It is currently set at 10*(I4,F10.3)
for KMESTM, where each column represents a distinct subsample. For TWOST, it
is set at I4,10*(I4,F10.3), where the first column gives the sample
identification number. Table 1 gives an example of a TWOST data file called
'gal2.dat'. It shows a hypothetical study where six normal galaxies, six
starburst galaxies and four Seyfert galaxies were observed with the IRAS
satellite. The variable is the log of the 60-micron luminosity. The problems
we address are the estimation of the luminosity function of each sample, and
a determination of whether the samples differ from each other at a
statistically significant level. Command file input formats are given in subroutine UNIVAR,
and inputs are parsed in subroutines DATA1 and DATA2. All data input and
output formats can be changed by the user as described in Appendix A7.
For the bivariate program BIVAR, the data file format is determined by
the subroutine DATREG. It is currently set at I4,10F10.3. In most cases,
only two variables are used with input format I4,2F10.3. Table 2 shows an
example of a bivariate censored problem. Here one wishes to investigate
possible relations between infrared and optical luminosities in a sample of
galaxies. BIVAR command file input formats are sometimes a bit complex. The
examples below illustrate various common cases. The formats for the basic
command inputs are given in subroutine BIVAR. Additional inputs for the
Spearman's rho correlation, EM algorithm, Buckley-James method, and Schmitt's
method are given in subroutines R3, R4, R5, and R6 respectively.
The current version of ASURV is set up for data sets with fewer than 500
data points and occupies about 0.46 MBy of core memory. For problems with
more than 500 points, the parameter values in the subroutines UNIVAR and
BIVAR must be changed as described in appendix A7. Note that for the
generalized Kendall's tau correlation coefficient, and possibly other programs,
the computation time goes up as a high power of the number of data points.
ASURV has been designed to work with data that can contain either upper
limits (left censoring) or lower limits (right censoring). Most of these
procedures assume restrictions on the type of censoring present in the data.
Kaplan-Meier estimation and the two-sample tests presented here can work with
data that has either upper limits or lower limits, but not with data containing
both. Cox regression, the EM algorithm, and Buckley-James regression can use
several independent variables if the dependent variable contains only one type
of censored point (it can be either upper or lower limits). Kendall's tau,
Spearman's rho, and Schmitt's binned regression can have mixed censoring,
including censoring in the independent variable, but they can only have one
independent variable.
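
     Because the input files use fixed-width Fortran formats, they are also
easy to read in other environments for plotting or cross-checks. The Python
sketch below is an illustration only, not part of ASURV; it reads a
one-variable TWOST file such as 'gal2.dat' using the I4,(I4,F10.3) layout
described above. KMESTM files simply omit the leading group column, and the
column slices must be adjusted if the formats in DATAIN have been changed
(see Appendix A7).

# Minimal reader for a one-variable TWOST data file (Fortran format I4,I4,F10.3).
# Illustrative sketch only; adjust the column slices if DATAIN has been modified.
def read_twost(filename):
    groups, censors, values = [], [], []
    with open(filename) as f:
        for line in f:
            if not line.strip():
                continue                      # skip blank lines
            groups.append(int(line[0:4]))     # subsample indicator (I4)
            censors.append(int(line[4:8]))    # censoring indicator (I4)
            values.append(float(line[8:18]))  # measured value (F10.3)
    return groups, censors, values

# Example: pull out the normal galaxies (group 0) from the Table 1 file.
# groups, censors, values = read_twost("gal2.dat")
# normal = [(c, v) for g, c, v in zip(groups, censors, values) if g == 0]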
3.2 KMESTM Instructions and Example
Suppose one wishes to see the Kaplan-Meier estimator for the infrared
luminosities of the normal galaxies in Table 1. If one runs ASURV
interactively from the terminal, the conversation looks as follows:
Data type : 1 [Univariate data]
Problem : 1 [Kaplan-Meier]
Command file? : N [Instructions from terminal]
Data file : gal1.dat [up to 9 characters]
Title : IR Luminosities of Galaxies
[up to 80 characters]
Number of variables : 1 [if several variables in data file]
Name of variable : IR [ up to 9 characters]
Print K-M? : Y
Print out differential
form of K-M? : Y
25.0 [Starting point is set to 25]
5 [5 bins set]
2.0 [Bin size is set equal to 2]
Print original data? : Y
Need output file? : Y
Output file : gal1.out [up to 9 characters]
Other analysis? : N
If an output file is not specified, the results will be sent to the
terminal screen.
If a command file residing on disk is preferred, run ASURV
interactively as above, specifying 'Y' or 'y' when asked if a command file is
available. The command file would then look as follows:
gal1.dat ... data file
IR Luminosities of Galaxies ... problem title
1 ... number of variables
1 ... which variable?
IR ... name of the variable
1 ... printing K-M (yes=1, no=0)
1 ... printing differential K-M (yes=1, no=0)
25.0 ... starting point of differential K-M
5 ... number of bins
2.0 ... bin size
1 ... printing data (yes=1, no=0)
gal1.out ... output file
All inputs are read starting from the first column.
The resulting output is shown in Table 3. It shows, for example,
that the estimated normal galaxy cumulative IR luminosity function is 0.0
below log(L) = 26.9, 0.63 +/- 0.21 for 26.9 < log(L) < 28.5, 0.83 +/- 0.15
for 28.5 < log(L) < 30.1, and 1.00 above log(L) = 30.1. The estimated mean
for the sample is 27.8 +/- 0.5. The 'C' beside three of the values indicates these are
nondetections; the Kaplan-Meier function does not change at these values.
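
     For readers who wish to see the product-limit arithmetic behind Table 3,
the following Python sketch reproduces the Kaplan-Meier column and the mean
for the normal-galaxy data. It is an independent illustration, not the ASURV
Fortran code: the upper limits are turned into ordinary right-censored points
by negating the variable, after which the standard Kaplan-Meier formula
applies.

import numpy as np

# Normal-galaxy data from Table 1 (file gal1.dat).
# Censoring code: 0 = detection, -1 = upper limit.
x      = np.array([28.5, 26.9, 29.7, 28.1, 30.1, 27.6])
censor = np.array([0, 0, -1, -1, 0, -1])

# Negating the data turns upper limits (left censoring in x) into right
# censoring in t = -x, so the usual product-limit formula can be used.
t     = -x
event = (censor == 0)

# Sort by t; at tied values, process detections before censored points.
order = np.lexsort((~event, t))
t, event = t[order], event[order]

surv = 1.0          # Kaplan-Meier survivor function of t = -x
mean = 0.0          # mean of x, built from the mass placed on each detection
n = len(t)
for k in range(n):
    if event[k]:
        at_risk  = n - k
        new_surv = surv * (at_risk - 1) / at_risk
        mean    += (-t[k]) * (surv - new_surv)   # mass redistributed here
        surv     = new_surv
        # 1 - surv is the estimated P(X >= x) for x in the interval that
        # ends at this detection; compare the KM column of Table 3.
        print(f"up to x = {-t[k]:6.3f}:  KM = {1.0 - surv:5.3f}")
print(f"mean = {mean:6.3f}")   # 27.767, as in Table 3

Running this prints the three luminosity intervals in decreasing order with
the same values as the KM column of Table 3, together with the mean 27.767.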
3.3 TWOST Instructions and Example
Suppose one wishes to see two sample tests comparing the subsamples
in Table 1. If one runs ASURV interactively from the terminal, the
conversation goes as follows:
Data type : 1 [Univariate data]
Problem : 2 [Two sample test]
Command file? : N [Instructions from terminal]
Data file : gal2.dat [up to 9 characters]
Problem Title : IR Luminosities of Galaxies
[up to 80 characters]
Number of variables : 1
[if the preceding answer is more
than 1, the number of the variable
to be tested is requested]
Name of variable : IR [ up to 9 characters]
Number of groups : 3
Which combination ?       : 0        [by the indicators in column
                            1         one of the data set]
Name of subsamples : Normal [up to 9 characters]
Starburst
Need detail print out ? : N
Print full K-M? : Y
Print out differential
form of K-M? : N
Print original data? : N
Output file? : Y
Output file name? : gal2.out [up to 9 characters]
Other analysis? : N
A command file that performs the same operations goes as follows, after
answering 'Y' or 'y' where it asks for a command file:
gal2.dat ... data file
IR Luminosities of Galaxies ... title
1 ... number of variables
1 ... which variable?
IR ... name of the variable
3 ... number of groups
0 1 2 ... indicators of the groups
0                           ... first group for testing
1                           ... second group for testing
0                           ... need detail print out ? (if Y:1, N:0)
1                           ... need full K-M print out? (if Y:1, N:0)
0 ... printing differential K-M (yes=1, no=0)
... starting point of differential K-M
... number of bins
... bin size
(Include the three lines above only if
the differential KM is used.)
0 ... print original data? (if Y:1, N:0)
Normal ... name of first group
Starburst ... name of second group
gal2.out ... output file
The resulting output is shown in Table 4. It shows that the
hypothesis that the IR luminosity distributions of the normal and starburst
galaxies are the same cannot be rejected at the 5% level by either the Gehan
or the logrank test; the computed probabilities lie between about 7% and 10%.
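
     As an independent cross-check (not part of ASURV), the same two samples
can be fed to a general-purpose survival package. The sketch below assumes
the Python lifelines library is available; upper limits are again negated
into right-censored values, and a constant shift keeps the values positive
(rank tests are unaffected by the shift). Because tie-handling conventions
differ between implementations, the resulting probability should be close
to, but not necessarily identical to, the logrank value of 0.0696 in Table 4.

import numpy as np
from lifelines.statistics import logrank_test   # assumes lifelines is installed

# Normal (group 0) and starburst (group 1) galaxies from Table 1.
# Censoring code: 0 = detection, -1 = upper limit.
x_norm = np.array([28.5, 26.9, 29.7, 28.1, 30.1, 27.6])
c_norm = np.array([0, 0, -1, -1, 0, -1])
x_star = np.array([29.0, 29.0, 30.2, 32.4, 28.5, 31.1])
c_star = np.array([-1, 0, 0, -1, 0, 0])

# Negate to turn upper limits into right censoring; shift to keep values positive.
shift = 100.0
t_norm, e_norm = shift - x_norm, (c_norm == 0)
t_star, e_star = shift - x_star, (c_star == 0)

result = logrank_test(t_norm, t_star,
                      event_observed_A=e_norm, event_observed_B=e_star)
print(result.p_value)   # expected to be near the logrank probability in Table 4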
3.4 BIVAR Instructions and Example
Suppose one wishes to test for correlation and perform regression
between the optical and infrared luminosities for the galaxy sample in
Table 2. If one runs ASURV interactively from the terminal, the conversation
looks as follows:
Data type : 2 [Bivariate data]
Command file? : N [Instructions from terminal]
Title : Optical-Infrared Relation
[up to 80 characters]
Data file : gal3.dat [up to 9 characters]
Number of Indep. variable: 1
Which methods? : 1 [Cox hazard model]
another one ? : Y
: 4 [EM algorithm regression]
: N
Name of Indep. variable : Optical
Name of Dep. variable : Infrared
Print original data? : N
Save output ? : Y
Name of Output file : gal3.out
Tolerance level ? : N [Default : 1.0E-5]
Initial estimate ? : N
Iteration limits ? : Y
Max. iterations : 50
Other analysis? : N
The user may notice that the above test run includes several input
parameters specific to the EM algorithm (tolerance level through maximum
iterations). All of the regression procedures require such extra inputs,
particularly Schmitt's two-dimensional Kaplan-Meier estimator method, which
requires specification of the bins.
A command file that performs the same operations goes as follows,
following the request for a command file name:
Optical-Infrared Relation .... title
gal3.dat .... data file
1 1 2 .... 1. number of independent variables
2. which independent variable
3. number of methods
1 4 .... method numbers (Cox, and EM)
Optical Infrared .... name of indep.& dep
variables
0 .... no print of original data
gal3.out .... output file name
1.0E-5 .... tolerance
0.0 0.0 0.0 0.0 .... estimates of coefficients
50 .... iteration
The resulting output is shown in Table 5. It shows that the correlation
between optical and IR luminosities is significant at a confidence level
P < 0.01%, and the linear fit is log(L{IR})~(1.05 +/- 0.08)*log(L{OPT}). It
is important to run all of the methods in order to get a more complete
understanding of the uncertainties in these estimates.
If you want to use the Buckley-James method, Spearman's rho, or Schmitt's
method from a command file, you need the following extra inputs:
(for B-J method)
1.0e-5 tolerance
30 iteration
(for Spearman's Rho)
1 Print out the ranks used in computation;
if you do not want, set to 0
(for Schmitt's)
10 10 bin # for X & Y
10 if you want to set the binning
information, set this to a positive
integer; if not, set it to 0.
1.e-5 tolerance
30 iteration
0.5 0.5 bin size for X & Y
26.0 27.0 origins for X & Y
1 print out two dim KM estm;
if you do not need, set to 0.
100 # of iterations for bootstrap error
analysis; if you do not want error
analysis, set to 0
-37 negative integer seed for random number
generator used in error analysis.
Table 1
Example of UNIVAR Data Input for TWOST
IR,Optical and Radio Luminosities of Normal,
Starburst and Seyfert Galaxies
____
0 0 28.5 |
0 0 26.9 |
0 -1 29.7 Normal galaxies
0 -1 28.1 |
0 0 30.1 |
0 -1 27.6 ____
1 -1 29.0 |
1 0 29.0 |
1 0 30.2 Starburst galaxies
1 -1 32.4 |
1 0 28.5 |
1 0 31.1 ____
2 0 31.9 |
2 -1 32.3 Seyfert galaxies
2 0 30.4 |
2 0 31.8 ____
| | |
#1 #2 #3
---I---I---------I--
Column # 4 8 17
Note : #1 : indicator of subsamples (or groups)
If you need only the K-M estimator, leave out this indicator.
#2 : indicator of censoring
#3 : variable (in this case, the values are in log form)
Table 2
Example of Bivariate Data Input for BIVAR
Optical and IR luminosity relation of IRAS galaxies
0 27.2 28.5
0 25.4 26.9
-1 27.2 29.7
-1 25.9 28.1
0 28.8 30.1
-1 25.3 27.6
-1 26.5 29.0
0 27.1 29.0
0 28.9 30.2
-1 29.9 32.4
0 27.0 28.5
0 29.8 31.1
0 30.1 31.9
-1 29.7 32.3
0 28.4 30.4
0 29.3 31.8
| | |
#1 #2 #3
_________ ______
Optical IR
---I---------I---------I--
Column # 4 13 22
Note : #1 : indicator of censoring
#2 : independent variable (may be more than one)
#3 : dependent variable
Table 3
Example of KMESTM Output
KAPLAN-MEIER ESTIMATOR
TITLE : IR Luminosities of Galaxies
DATA FILE : gal1.dat
VARIABLE : IR
# OF DATA POINTS : 6 # OF UPPER LIMITS : 3
VARIABLE RANGE KM ESTIMATOR ERROR
FROM 0.000 TO 26.900 1.000
FROM 26.900 TO 28.500 0.375 0.213
27.600 C
28.100 C
FROM 28.500 TO 30.100 0.167 0.152
29.700 C
FROM 30.100 ONWARDS 0.000 0.000
SINCE THERE ARE LESS THAN 4 UNCENSORED POINTS,
NO PERCENTILES WERE COMPUTED.
MEAN= 27.767 +/- 0.515
DIFFERENTIAL KM ESTIMATOR
BIN CENTER D
26.000 3.750
28.000 1.250
30.000 1.000
32.000 0.000
34.000 0.000
-------
SUM = 6.000
(D GIVES THE ESTIMATED DATA POINTS IN EACH BIN)
INPUT DATA FILE: gal1.dat
CENSORSHIP X
0 28.500
0 26.900
-1 29.700
-1 28.100
0 30.100
-1 27.600
Table 4
Example of TWOST Output
******* TWO SAMPLE TEST ******
TITLE : IR Luminosities of Galaxies
DATA SET : gal2.dat
VARIABLE : IR
TESTED GROUPS ARE Normal AND Starburst
# OF DATA POINTS : 12, # OF UPPER LIMITS : 5
# OF DATA POINTS IN GROUP I : 6
# OF DATA POINTS IN GROUP II : 6
GEHAN`S GENERALIZED WILCOXON TEST -- PERMUTATION VARIANCE
TEST STATISTIC = 1.652
PROBABILITY = 0.0986
GEHAN`S GENERALIZED WILCOXON TEST -- HYPERGEOMETRIC VARIANCE
TEST STATISTIC = 1.687
PROBABILITY = 0.0917
LOGRANK TEST
TEST STATISTIC = 1.814
PROBABILITY = 0.0696
PETO & PETO GENERALIZED WILCOXON TEST
TEST STATISTIC = 1.730
PROBABILITY = 0.0837
PETO & PRENTICE GENERALIZED WILCOXON TEST
TEST STATISTIC = 1.706
PROBABILITY = 0.0881
KAPLAN-MEIER ESTIMATOR
DATA FILE : gal2.dat
VARIABLE : Normal
# OF DATA POINTS : 6 # OF UPPER LIMITS : 3
VARIABLE RANGE KM ESTIMATOR ERROR
FROM 0.000 TO 26.900 1.000
FROM 26.900 TO 28.500 0.375 0.213
27.600 C
28.100 C
FROM 28.500 TO 30.100 0.167 0.152
29.700 C
FROM 30.100 ONWARDS 0.000 0.000
SINCE THERE ARE LESS THAN 4 UNCENSORED POINTS,
NO PERCENTILES WERE COMPUTED.
MEAN= 27.767 +/- 0.515
KAPLAN-MEIER ESTIMATOR
DATA FILE : gal2.dat
VARIABLE : Starburst
# OF DATA POINTS : 6 # OF UPPER LIMITS : 2
VARIABLE RANGE KM ESTIMATOR ERROR
FROM 0.000 TO 28.500 1.000
FROM 28.500 TO 29.000 0.600 0.219
29.000 C
FROM 29.000 TO 30.200 0.400 0.219
FROM 30.200 TO 31.100 0.200 0.179
FROM 31.100 ONWARDS 0.000 0.000
32.400 C
PERCENTILES
75 TH 50 TH 25 TH
17.812 28.750 29.900
MEAN= 29.460 +/- 0.460
Table 5
Example of BIVAR Output
CORRELATION AND REGRESSION PROBLEM
TITLE IS Optical-Infrared Relation
DATA FILE IS gal3.dat
INDEPENDENT DEPENDENT
Optical AND Infrared
NUMBER OF DATA POINTS : 16
UPPER LIMITS IN Y X BOTH MIX
6 0 0 0
CORRELATION TEST BY COX PROPORTIONAL HAZARD MODEL
GLOBAL CHI SQUARE = 18.458 WITH
1 DEGREES OF FREEDOM
PROBABILITY = 0.0000
LINEAR REGRESSION BY PARAMETRIC EM ALGORITHM
INTERCEPT COEFF : 0.1703 +/- 2.2356
SLOPE COEFF 1 : 1.0519 +/- 0.0793
STANDARD DEVIATION : 0.3687
ITERATIONS : 27
4 Acknowledgements
The production and distribution of ASURV Rev 1.1 was principally funded
at Penn State by a grant from the Center for Excellence in Space Data and
Information Sciences (operated by the Universities Space Research Association
in cooperation with NASA), NASA grants NAGW-1917 and NAGW-2120, and NSF grant
DMS-9007717. T.I. was supported by NASA grant NAS8-37716. We are grateful to
Prof. Michael Akritas (Dept. of Statistics, Penn State) for advice and guidance
on mathematical issues, and to Drs. Mark Wardle (Dept. of Physics and
Astronomy, Northwestern University), Paul Eskridge (Harvard-Smithsonian Center
for Astrophysics), and Eric Smith (Laboratory for Astronomy and Solar Physics,
NASA-Goddard Space Flight Center) for generous consultation and assistance on
the coding. We would also like to thank Drs. Peter Allan (Rutherford Appleton
Laboratory), Simon Morris (Carnegie Observatories), Karen Strom (UMASS), and
Marco Lolli (Bologna) for their help in debugging ASURV Rev 1.0.
APPENDICES
A1 Overview of Survival Analysis
Statistical methods dealing with censored data have a long and
confusing history. It began in the 17th century with the development of
actuarial mathematics for the life insurance industry and demographic
studies. Astronomer Edmund Halley was one of the founders. In the mid-20th
century, this application grew along with biomedical and clinical research
into a major field of biometrics. Similar (and sometimes identical) statistical
methods were being developed in two other fields: the theory of reliability,
with industrial and engineering applications; and econometrics, with attempts
to understand the relations between economic forces in groups of people.
Finally, we find the same mathematical problems and some of the same solutions
arising in modern astronomy as outlined in Section 1 above.
Let us give an illustration of the use of survival analysis in these
disparate fields. Consider a linear regression problem. First, an
epidemiologist wishes to determine how the human longevity or `survival time'
(dependent variable) depends on the number of cigarettes smoked per day
(independent variable). The experiment lasts 10 years, during which some
individuals die and others do not. The survival time of the living
individuals is only known to be greater than their age when the experiment
ends. These values are therefore `right censored data points'. Second,
an engine manufacturing company wishes to know the average time
between breakdowns as a function of engine speed to determine the optimal
operating range. Test engines are set running until 20% of them fail; the
average `lifetime' and dependence on speed is then calculated with 80% of
the data points right-censored. Third, an astronomer observes a sample of
nearby galaxies in the far-infrared to determine the relation between dust
and molecular gas. Half have detected infrared luminosities but half are
upper limits (left-censored data points). The astronomer then seeks the
relationship between infrared luminosities and the CO traces of molecular
material to investigate star formation efficiency. The CO observations may
also contain censored values.
Astronomers have dealt with their upper limits in a number of fashions.
One is to consider only detections in the analysis; while possibly acceptable
for some purposes (e.g. correlation), this will clearly bias the results in
others (e.g. estimating luminosity functions). A second procedure considers
the ratio of detected objects to observed objects in a given sample. When
plotted in a range of luminosity bins, this has been called the `fractional
luminosity function' and has been frequently used in extragalactic radio
astronomy. This is mathematically the same as actuarial life tables. But it
is sometimes used incorrectly in comparing different samples: the detected
fraction clearly depends on the (usually uninteresting) distance to the objects
as well as their (interesting) luminosity. Also, contrary to what is frequently
assumed, simple sqrt(N) error bars do not apply to these fractional luminosity
functions.
A third procedure is to take all of the data, including the exact values
of the upper limits, and model the properties of the parent population under
certain mathematical constraints, such as maximizing the likelihood that the
censored points follow the same distribution as the uncensored points. This
technique, which is at the root of much of modern survival analysis, was
independently developed by astrophysicists (Avni et al. 1980; Avni and
Tananbaum 1986) and is now commonly used in observational astronomy.
The advantage accrued in learning and using survival analysis methods
developed in biometrics, econometrics, actuarial and reliability mathematics
is the wide range of useful techniques developed in these fields. In general,
astronomers can have faith in the mathematical validity of survival analysis.
Issues of great social importance (e.g. establishing Social Security benefits,