**HKA** -- A COMPUTER PROGRAM FOR TESTS OF NATURAL SELECTION --

 DOCUMENTATION

Jody Hey 

Department of Genetics

Rutgers University

Nelson Biological Labs

604 Allison Rd.

Piscataway, NJ08854-8082

732-445-5272

fax 732-445-5870

hey@biology.rutgers.edu

http://lifesci.rutgers.edu/~heylab

 

* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it. 

 

_______________________

Contents

_______________________  

______________________

Overview ��

_______________________

 

HKA is a computer program that carries out the widely used statistical test for natural selection that was developed by Hudson, R. R., M. Kreitman and M. Aguad� (1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159).   This program can handle very large numbers of loci and sample sizes, and conducts tests via coalescent simulation as well as by the conventional chi square approximation.   The simulations can also be used to conduct other tests of natural selection, including tests of Tajima's D statistic (1989) and the D statistic of Fu and Li (1993).

 

RATIONAL

 

The HKA test is based on the most basic prediction of the neutral model of molecular evolution, thatDNA sequence polymorphism within a species, and DNA sequence divergence between species, will be proportional to the neutral mutation rate (Kimura, 1968, 1969).  To this basic expectations we must add that polymorphism also depends on the effective population size and that divergence also depends on the time since speciation.  When we consider data  collected from two species, and from multiple loci,  we expect that all of the loci within a species will share the same effective population size, and we expect that each locus, regardless of species, will have a characteristic neutral mutation rate (depending on its length and level of selective constraint). Thus we can consider data on polymorphism and divergence cast in form of a contingency table  (Figure 1). We can also build a similar table of the expected values, each a function of the basic model parameters that are fit to the data.  Those parameters are:

T - the time since the species divergence

f - the ratio of the the two species effective population sizes.

Theta_i = 4N1 u_i   - the population mutation rate for locus i in species 1, where N1 - the effective population size of species 1 and u_i is the neutral mutation rate at locus i.

There is one Theta parameter for every locus.

 

 

Despite the presence of observations and corresponding expectations in parallel tables, one  cannot conduct a conventional contingency table test (e.g. a chi-square test). Such a test relies upon a sampling variance within each cell that is roughly multinomial. However, the sampling variances within the cells of an HKA table are those of the coalescent process.  Fortunately these can be calculated (Watterson, 1975).  Using these variances, Hudson et. al., (1987)  developed a chi-square-like test statistic, that they called   X^2.  If divergence has been sufficiently long, and polymorphism levels are not very low, then  X^2 should be approximately chi-square distributed with 2*K-2 degrees of freedom where K is the number of loci in the test.

 

Fig 1.

Loci

Species 

1

2

3

.            

.           

.        

1

Polymorphism

Species 1

Locus 1

Polymorphism

Species 1

Locus 2

Polymorphism

Species 1

Locus 3

.

.

.

2

Polymorphism

Species 2

Locus 1

Polymorphism

Species 2

Locus 2

Polymorphism

Species 2

Locus 3

.

.

.

1 vs 2

Divergence

Locus 1

Divergence

Locus 2

Divergence

Locus 3

.

.

.

 

 

ASSUMPTIONS OF THE HKA TEST

 

ADDITIONAL FEATURES OF THE HKA PROTOCOL

 

The basic HKA protocol has several attributes  that permit a variety of applications in addition to an overall test of departure from the neutral model. 

 

BASIC FEATURES OF THE HKA PROGRAM

 

The program can handle a very large number of  loci (default compilation is for up to 20), with sample sizes per species per locus up to 350 (192 on the PowerPC).

 

The sample size for the second species can be 1 sequence, as in the original implementation of the test (Hudson et al., 1987)

 

The program will conduct rapid coalescent simulations using parameters estimated from the data. 

 

The program will conduct a variety of tests based on Tajima's D statistic (Tajima, 1989) and Fu and Li's D statistic (Fu and Li, 1993)

 

______________________

 

Downloadable Files������������� Return to Contents

______________________

 

 

_______________________

 

Input File Format��������������� Return to Contents

_______________________

 

Input Data File Format:

 

Thus the basic format is of a table.  With K loci the table would have K rows. Each row begins with a locus name and is followed by 13 numbers.

 

If one species has only a single sequence at one, at one or any of the loci,  then that data set should be reduced to just one sequence for all of the loci for that species.  In addition the species represented by just a single sequence for each locus should be species 2.

 

Inclusion of Tajima's and Fu&Li's statistics is optional and they need not be included. However if they are the program will test their significance.  It will also conduct tests of the overall mean among these statistics and of their variances among the loci.

 

If tests of D statistics are desired, but the values for some loci are not known or are not calculable, then the corresponding value in the input table should be set to -10 or less. 

_______________________

 

Running the Program������������� Return to Contents

_______________________

 

The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system.  The program can be run using command line parameters, or by simply typing the name of the program ('hka'). 

 

Command line parameters:

 

-D (or -I)  Input file name   e.g.  -Dmydata

-R  Output file name   e.g. -RMyresults

-S   Number of simulations  e.g. -S1000  (the maximum is 10000)

-T   Do tests of Tajima's D statistic

-F   Do tests of Fu and Li's D statistic

-M   comment line   e.g. -Mtest_my_data

 

To start the program with command line parameters:

For example, type:

hka -Dmydata -Rmyresults -S1000 -Mtest_my_data -T -F

 

To start the program without  command line parameters, simply enter 'hka'.  The program then asks for all of the necessary information

 

On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered (e.g. -Dmydata -Rmyresults -S1000 -Mtest_my_data -T -F ).  The user can also just hit return at this point and the program will request runtime parameters.

 

_______________________

 

Output������������������������� Return to Contents

_______________________

 

The output is all included in the Output file, and includes the following:

 

_______________________

 

Program Limitations������������ Return to Contents

_______________________

 

The program can handle very large data sets.  The distributed version will handle 20 loci and up to 350 individuals per species per locus.  When sample sizes are modest it does simulations quickly. 

 

The most glaring limitation is that the program does not incorporate recombination into the simulations.   Recombination reduces the variance of polymorphism levels, and for some purposes it would be nice to include this. 

 

_______________________

 

Literature Cited��������������� Return to Contents

_______________________

 

Fu, Y. X., and W. H. Li, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693-709.

 

Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D.

 

Futuyma and J. Antonovics. Oxford University Press, New York.

 

Hudson, R. R., M. Kreitman and M. Aguad�, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159.

 

Kimura, M., 1968 Evolutionary rate at the molecular level. Nature 217: 624-626.

 

Kimura, M., 1969 The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893-903.

 

Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.

Wang, R. L., and J. Hey, 1996 The speciation history of Drosophila pseudoobscura and close relatives: inferences from DNA sequence variation at the period locus. Genetics 144: 1113-26.

 

Watterson, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-275.

 

This page was last changed September 04, 2013