**HKA**
-- A COMPUTER PROGRAM FOR TESTS OF NATURAL SELECTION --
DOCUMENTATION
Jody Hey
Department
of Genetics
Rutgers
University
Nelson
Biological Labs
604
Allison Rd.
Piscataway,
NJ� 08854-8082
732-445-5272
fax
732-445-5870
�* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.
_______________________
_______________________
______________________
_______________________
HKA is a computer program that carries out the widely used statistical test for natural selection that was developed by Hudson, R. R., M. Kreitman and M. Aguad� (1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159). This program can handle very large numbers of loci and sample sizes, and conducts tests via coalescent simulation as well as by the conventional chi square approximation. The simulations can also be used to conduct other tests of natural selection, including tests of Tajima's D statistic (1989) and the D statistic of Fu and Li (1993).
RATIONAL
The
HKA test is based on the most basic prediction of the neutral model of molecular
evolution, thatDNA sequence polymorphism within a species, and DNA sequence
divergence between species, will be proportional to the neutral mutation rate
(Kimura, 1968, 1969). To this basic
expectations we must add that polymorphism also depends on the effective
population size and that divergence also depends on the time since speciation.
When we consider data collected
from two species, and from multiple loci, we expect that all of the loci
within a species will share the same effective population size, and we expect
that each locus, regardless of species, will have a characteristic neutral
mutation rate (depending on its length and level of selective constraint). Thus
we can consider data on polymorphism and divergence cast in form of a
contingency table (Figure 1). We
can also build a similar table of the expected values, each a function of the
basic model parameters that are fit to the data. Those
parameters are:
T
- the time since the species divergence
f
- the ratio of the the two species effective population sizes.
Theta_i
= 4N1 u_i - the population
mutation rate for locus i in species 1, where N1
- the effective population size of species 1
There
is one Theta parameter for every locus.
Despite
the presence of observations and corresponding expectations in parallel tables,
one cannot conduct a conventional contingency table test (e.g. a
chi-square test). Such a test relies upon a sampling variance within each cell
that is roughly multinomial. However, the sampling variances within the cells of
an HKA table are those of the coalescent process.
Fortunately these can be calculated (Watterson, 1975).
Using these variances, Hudson et. al., (1987) developed a
chi-square-like test statistic, that they called
X^2. If divergence has been
sufficiently long, and polymorphism levels are not very low, then
X^2 should be approximately chi-square distributed with 2*K-2 degrees of
freedom where K is the number of loci in the test.
Fig
1.
Loci |
||||||
Species |
1 |
2 |
3 |
. |
. |
. |
1 |
Polymorphism Species 1 Locus 1 |
Polymorphism Species 1 Locus 2 |
Polymorphism Species 1 Locus 3 |
. |
. |
. |
2 |
Polymorphism Species 2 Locus 1 |
Polymorphism Species 2 Locus 2 |
Polymorphism Species 2 Locus 3 |
. |
. |
. |
1 vs 2 |
Divergence Locus 1 |
Divergence Locus 2 |
Divergence Locus 3 |
. |
. |
. |
Constant population sizes since the time of common ancestry of the two species.
The size of the common ancestral species is either the average of that of the two descendant species or (if only one species is represented by multiple sequences) then the ancestor is assumed to be the same size as that descendant species for which multiple sequences have been sampled. This assumption is not critical if T is so large that there is little chance that any of the variation within species dates from the time of common ancestry.
No recombination within loci. This is a conservative assumption, as the presence of recombination will reduce the variance of the observations under the model.
Free recombination between loci. Again this is a conservative assumption. If the loci are not evolving independently then they assumption of stochastic variance among loci does not hold.
All mutations are selectively neutral.
The variance of departures of observations of polymorphisms and divergence, from expectations under the model, follow normal distributions. This assumption is required if the X^2 test statistic is to be well approximated by a chi square distribution. The assumption will hold approximately if Theta, T and sample sizes are large. The program conducts a test by simulation, and compares the observed distribution of the X^2 statistic with the simulated distribution. Thus failure of the assumption of normality of deviations will not lead to biased estimates of the distribution of the test statistic. However failure of the assumption of normality can be expected to reduce the power of the test.
The
basic HKA protocol has several attributes that
permit a variety of applications in addition to an overall test of departure
from the neutral model.
Calculation
of X^2 test
statistic requires calculation, for each cell of the matrix,
of departure between observation and expectation under the model. When just 2 loci are examined, these cells are
highly non-independent and so these individual departure statistics are not
very informative. But when many
loci are included in the test, there arises scope for inferring which loci,
and in what ways, the departure from neutrality occurs.
The HKA procedure generates estimates for each of several basic neutral model parameters that are necessary for calculating the expected values for the data set. Furthermore, since these parameters are estimated jointly from as many loci as are included in the test, they can be expected to have lower variance than comparable estimates that are based on individual loci. These parameter estimates can then also be used for other purposes.
The
parameter estimates can be used to conduction simulations in which the null
model assumptions of the test are taken to be correct.
These are coalescent simulations,
and it is a straightforward matter to have them follow the null model
HKA assumptions exactly (Hudson, 1990).
Because
the model underlying the HKA test is readily simulated, it is not difficult
to generate new data sets from the same parameters as fit the actual data,
in order to determine the actual distribution of X^2 under the model.
In other words, if conditions are such that the X^2
may not follow a chi-square distribution, then there is no need to
assume that it does. One can
simply conduct the simulations and use the simulated distribution of X^2
for test purposes.
The simulated data sets, that are based on parameters estimated from the actual data, can also be used to examine other features of the data. For example, the computer program will conduct tests of Tajima's and Fu and Li's D statistics (Tajima, 1989; Fu and Li, 1993). These tests begin with a calculation of the difference between two distinct estimates (i.e. using different estimators) of the parameter Theta, and proceed to consider whether the observed value is greater than expected by chance. One difficulty with the methods is that the test statistics are a bit sensitive to the true value of Theta, which is generally not known (Hudson, 1993). The HKA protocol generates estimates of Theta, and does so using considerably more information (and more assumptions) than any of the conventional estimators used by the D statistics.
The program can handle a very large number of loci (default compilation is for up to 20), with sample sizes per species per locus up to 350 (192 on the PowerPC).
The sample size for the second species can be 1 sequence, as in the original implementation of the test (Hudson et al., 1987)
The program will conduct rapid coalescent simulations using parameters estimated from the data.
The program will conduct a variety of tests based on Tajima's D statistic (Tajima, 1989) and Fu and Li's D statistic (Fu and Li, 1993)
______________________
Downloadable Files������������� Return to Contents
______________________
WIN32 package (including documentation, sample data file, and WIN32 program)
PowerPC package (including documentation, sample data file, and PowerPC program)
_______________________
�Input File
Format��������������� Return to Contents
_______________________
Input
Data File Format:
Line
1 - a line of text. If
desired, up to 10 extra lines of commentary can be added following line 1.
Each additional line of commentary must have a '#' at the very beginning of
the line. All comment lines will be reproduced in the output file.
line
2 - the number of loci = numloci
line
3 - the names of the two species, species 1 and species 2
the
next numloci lines each have the following items
locus
identifier
inheritance
scalar (1.0 for autosomal 0.75 for X linked 0.25 for extrachromosomal or
Y)
mean
sequence length for species 1
mean
sequence length species 2
mean
sequence length in comparisons between species
#
of sequences for species 1
#
of sequences for species 2
#
of polymorphic sites species1
#
of polymorphic sites species2
mean
pairwise divergence between species
Tajima
D for species 1
(Tajima, 1989)
Tajima
D for species 2
(Tajima, 1989)
Fu
& Li D (rooted) for species 1 (Fu and Li, 1993)
Fu
& Li D (rooted) for species 2 (Fu and Li, 1993)
Thus
the basic format is of a table. With
K loci the table would have K rows. Each row begins with a locus name and is
followed by 13 numbers.
If one species has only a single sequence at one, at one or any of the loci, then that data set should be reduced to just one sequence for all of the loci for that species. In addition the species represented by just a single sequence for each locus should be species 2.
Inclusion
of Tajima's and Fu&Li's statistics is optional and they need not be
included. However if they are the program will test their significance.
It will also conduct tests of the overall mean among these statistics and
of their variances among the loci.
If
tests of D statistics are desired, but the values for some loci are not known or
are not calculable, then the corresponding value in the input table should be
set to -10 or less.
_______________________
�Running the
Program������������� Return to Contents
_______________________
The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. The program can be run using command line parameters, or by simply typing the name of the program ('hka').
Command
line parameters:
-D
(or -I) Input file name
e.g. -Dmydata
-R
Output file name e.g. -RMyresults
-S
Number of simulations e.g.
-S1000 (the maximum is 10000)
-T
Do tests of Tajima's D statistic
-F
Do tests of Fu and Li's D statistic
-M
comment line e.g. -Mtest_my_data
To
start the program with command line parameters:
For
example, type:
hka
-Dmydata -Rmyresults -S1000 -Mtest_my_data -T -F
To
start the program without command line parameters, simply enter 'hka'.
The program then asks for all of the necessary information
On a PowerPC, clicking on the program icon opens a small
window in which command line parameters can be entered (e.g. -Dmydata
-Rmyresults -S1000 -Mtest_my_data -T -F
_______________________
Output������������������������� Return
to Contents
_______________________
Basic information about input data, including comments from input file, and numbers and sizes of loci.
Tables of Observed and Expected Values. There are two tables, one for polymorphism within species and one for divergence between species. These tables include the observations, as well as the expected values after fitting the HKA model, the variance under the model, and the deviation for each observation (i.e. (observed - expected)^2/variance) )/
The X^2 statistic (i.e. the sum of deviations) is compared to the chi square distribution with appropriate number of degrees of freedom. For two species and K loci there are 2*K-2 degrees of freedom. When one species is represented by just a single sequence at each locus, there are L-1 degrees of freedom.
If simulations were done:
The observed value of X^2 is also placed within the simulated distribution of X^2.
The cumulative distribution of the X^2 statistic is given together with the corresponding chi square distribution.
a TEST OF MAXIMUM CELL VALUE is conducted (Wang and Hey, 1996). The result is very simply the location of the largest observed deviation value in the distribution of largest deviation values, regardless of species, or locus and regardless of whether the observation corresponds to polymorphism or divergence. If the distributions of all cells of the data matrices are equally distributed, as would be the case under the assumptions of the method, then the properties of this test could be predicted. If that does not hold, then the test will be based primarily on those portions of the simulated data sets that tend to have positively skewed deviations from expectations.
ESTIMATED PARAMETER VALUES AND SIMULATION STATISTICS. This is a table of parameter estimates, and, if simulations are done, of 95% confidence intervals and of simulation means. The simulation means should be close to the parameter estimates.
TESTS OF MEANS AND VARIANCES OF TAJIMA D VALUES. If the input data contained Tajima D values, then the mean values, among loci, are compared with the distribution of simulated mean values.
TESTS OF INDIVIDUAL TAJIMA D VALUES. These are tests of observed values of Tajima's D for each locus against the simulated distributions.
The same types of tests are done using Fu and Li's D statistic.
_______________________
�Program
Limitations������������ Return to Contents
_______________________
The program can handle very large data sets. The distributed version will handle 20 loci and up to 350 individuals per species per locus. When sample sizes are modest it does simulations quickly.
The most glaring limitation is that the program does not incorporate recombination into the simulations. Recombination reduces the variance of polymorphism levels, and for some purposes it would be nice to include this.
_______________________
�Literature Cited��������������� Return to
Contents
_______________________
Fu,
Y. X., and W. H. Li, 1993 Statistical tests of neutrality of mutations. Genetics
133: 693-709.
Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D.
Futuyma
and J. Antonovics. Oxford University Press, New York.
Hudson,
R. R., M. Kreitman and M. Aguad�, 1987 A test of neutral molecular evolution
based on nucleotide data. Genetics 116: 153-159.
Kimura,
M., 1968 Evolutionary rate at the molecular level. Nature 217: 624-626.
Kimura,
M., 1969 The number of heterozygous nucleotide sites maintained in a finite
population due to steady flux of mutations. Genetics 61: 893-903.
Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.
Wang, R. L., and J. Hey, 1996 The speciation history of Drosophila pseudoobscura and close relatives: inferences from DNA sequence variation at the period locus. Genetics 144: 1113-26.
Watterson,
G. A., 1975 On the number of segregating sites in genetical models without
recombination. Theor. Pop. Biol. 7: 256-275.
This page was last changed September 04, 2013