**SITES** -- DNA POLYMORPHISM ANALYSIS PROGRAM --

 

DOCUMENTATION  

Jody Hey 

Department of Genetics

Rutgers University

Nelson Biological Labs

604 Allison Rd.

Piscataway, NJ 08854-8082

732-445-5272

fax 732-445-5870

hey@biology.rutgers.edu

http://lifesci.rutgers.edu/~heylab

 

* This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it. 

_______________________

Contents

_______________________

______________________

Overview

_______________________

 

SITES is a computer program for the analysis of comparative DNA sequence data.  Basic analyses include: data summaries by polymorphism class;  polymorphism estimates within and between groups (species); estimates of migration, neutral model, and recombination parameters; and linkage disequilibrium analyses.  SITES is primarily intended for data sets with multiple closely related sequences. It is especially useful when multiple sequences have been obtained from each of one or several closely related populations or species.  

 

SITES can handle large data sets, and is flexible with regard to the input data. With a few commands on the command line, or in response to menus, one can tailor a particular data set in different ways to suit different questions, including working on only a subset of the data and regrouping the data. 

 

If you find the need to mention the program in a publication, you may cite the following reference which mentions the program:  

Hey, J and J. Wakeley. 1997. A coalescent estimator of the population recombination rate. GENETICS 145: 833-846.

______________________

 

Downloadable Files Return to Contents

______________________

 

_______________________

 

New Features Return to Contents

_______________________

The major recent addition, as of 2001, is a suite of linkage disequilibrium analyses.

SITES will also generate lines suitable for input to the HKA  and WH computer programs. 

SITES will also now handle alternative genetic codes.

Note to previous users - some command line flags have changed - check MENUS.. 

_______________________

 

Analyses

_______________________

 

 

 

 

 

 

 

 

 

 

 

 

 

_______________________

 

Input File Format Return to Contents

_______________________

 

There are three kinds of input format. SITES format, PHYLIP sequential format and PHYLIP interleaved format. PHYLIP is Joe Felsenstein's package of phylogenetic programs. The first line of a PHYLIP format file begins with two integers. SITES will check to see if the first character of a file is a number (i.e., 0..9) and if so it will assume that the file is in PHYLIP format. If the first character is not a number it will assume a SITES format. SITES format is very similar to PHYLIP sequential format. If multiple analyses may be run, it is most efficient to set up a SITES format file.

 

Data Restrictions.

 

The only characters that are allowed in the sequences are 'A','a','G','g','C','c','T','t','N','n','-','.', and '*'. 'N' and 'n' represent base positions where the sequence is not known. '-','*', and '.' represent base positions where one sequence has a gap relative to another sequence. Other characters cause the program to insert an 'N' in their place, and to display a warning at runtime. Each DNA sequence must have exactly the same number of characters. The only exception to this is that the data can have spaces (i.e. ' ') which are ignored.

 

SITES format:

 

Sample SITES Format

Below is a sample SITES input file for 10 sequences each of 50 characters. There is one noncoding region extending between bases 20 to 44, inclusive, and the first base of the coding region (i.e. base 1) is in frame 3 of the codon. There are four groups of sequences.

 

****

SITES sample input

10 50

3 1

20 44

4

simulans 3

mauritiana 2

sechellia 2

melanogaster 3

SI-CA1 CAGGGTGTCCGACTCGGCCTACTCGAGCA......GCAACAGCCAGTCAC

SI-CA2 CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCAC

SI-K1 CAGGGTGTCCGACTCGGCCTACTCGAACA......GTAACAGCCAGTCAC

MA-1 CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATAGCCAGTCGC

MA-2 CAGGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAATCGCCAGTCGC

SE-C1 CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGTCGC

SE-P1 CAAGGTGTCCGACTCGGCCTACTCGAACAGCTGCAGCAACAGCCAGTCGC

ME-NJ1 CAAGGTGTCCGACTCGGCCTACCCGAACGGCTGCAGCAACAGCCAGCCGC

ME-K1 CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC

ME-LI1 CAAGGTGTCCGACTCGGCCTACCCGAACAGCTGCAGCAACAGCCAGCCGC

****

For many questions, intron and exon boundaries are not relevant. To run SITES without providing information on introns and exons, simply indicate that the entire sequence is one large noncoding sequence. In this case, the value of the first coding base is irrelevant. For example to do this with the sample data set above, lines 3 and 4 of the data file would be:

1 1

1 50

 

PHYLIP sequential and interleaved formats:

_______________________

 

Running the Program Return to Contents

_______________________

 

The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system. The user starts the program simply by going to the folder where the data file and the program exist and typing the name of the program (e.g. 'sites'). The program asks several questions about the data file and the desired analysis. Nearly all commands and options can also be entered using command line parameters. 

 

On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered.  The user can also just hit return at this point and the program will request runtime parameters.

 

_______________________

 

Menus Return to Contents

_______________________

 

Site choice menu.

-----------------------

 

If the 'S' flag is not used when the program is started, then the following menu will appear.

 

SITE OPTIONS ARE : a for All site types

i for all noncoding base changes

e for all coding base changes

s for synonymous coding sites

r for replacement coding sites

o for ambiguous coding sites

d for insertion/deletion (indel) sites

f for only informative sites

n for only transitions

v for only transversions

 

z to skip all positions with more than 2 base values

x to skip all positions within indels

TYPE THE LETTERS OF THE DESIRED SITE TYPES (no spaces):

 

These options determine the information that is included in the polymorphism table, and they may affect which kinds of polymorphic sites are subject to other analyses. The user should enter a string of characters corresponding to those types of polymorphic sites that are to be included in the analyses. For example, an analysis on only informative synonymous sites would be called for by entering 'fs' or 'sf'. The most common analysis is done on all types of polymorphic sites (i.e. 'a'); 'ambiguous coding sites' are those for which the program could not determine whether a coding base change was a replacement or a synonymous change. 

'z' will cause all those positions with more than two types of base values to be excluded from the analysis.

'x' will cause all those positions that are within regions for which some of the lines differ by indels to be excluded from the analyses.

By invoking both 'z' and 'x', the data set can be reduced to just those sites that fit a simple infinite sites model of mutation.

 

Analysis choice menu.

-----------------------------

 

If the 'A' command is not used when the program is started, then the following menu will appear.

 

ANALYSIS TYPES ARE:

a for all basic types (except r,m,g,l)

s for basic site table

e for coding change details

c for codon usage tables

p for polymorphism analysis

i for indel site table & indels in recombination analysis

 

r for recombination analysis

m for population model fitting

l (L) for linkage disequilibrium analyses (invokes r)

g for GC analyses

TYPE THE LETTERS OF THE DESIRED ANALYSES (no spaces):

 

        'a' calls for the most basic analyses, including s,e,c,p & i

        's' calls for a table of polymorphic base positions to appear in the output file.

        'e' calls for a table that contains the codons that are the sites of synonymous, replacement, and ambiguous polymorphisms. This is especially useful for determining the likely sequence of mutations when multiple polymorphisms occur within a codon.

        'c' calls for a table of codon usage. Only the 'universal' code is used. Two tables are printed, one for the first sequence in the data file, and one for all of the sequences. Also included are counts of the numbers of synonymous and replacement bases (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes). These counts of synonymous and replacement bases can be used to calculate the number of synonymous polymorphisms per synonymous base. Similarly, the counts of replacement bases can be used to calculate the number of replacement polymorphisms per replacement base.

        'p' calls for a set of analyses on polymorphism within and among groups.

        'i' calls for a polymorphism table for indel variation. This option also causes indels to be included in some recombination analyses, if the 'r' option is also used.

        'r' calls for a set of recombination analyses.

        'm' calls for fitting the ISOLATION SPECIATION MODEL to pairs of groups of sequences, and the POPULATION SIZE CHANGE MODEL to each group of sequences. These analyses are not applicable for many data sets, and they can be slow for data sets with large groups.

        'l' ( L ) calls for the Linkage Disequilibrium Options Menu to appear.

        'g' calls for a table of GC content by codon position and for noncoding regions, for each of the sequences.

 

Data And Output Options Menu

-----------------------------------

 

If the 'O' command is not used when the program is started, then the following menu will appear. Several of these commands require additional information that must be entered in response to queries, or provided with other command line flags.

 

DATA OPTIONS: g for comparisons within groups

n for new group designations

d for dropping some sites

m for data limited to some sites

x for dropping some sequences

                c for non-canonical genetic code

OUTPUT OPTIONS: s for no screen output during analysis

p suppress large pairwise difference table

h text line of input values for HKA program

w text line of input values for WH program

f for first sequence reference in site table

t for table style: (. and -) replace (- and *)

TYPE THE LETTERS OF THE DESIRED OPTIONS (no spaces):

 

 

DROPPING SOME SITES, enter option : r for range

f for file

i for individual entry

The user should enter 'r','f', or 'i', and follow the directions. In the case of 'f', the user should previously have created a file that contains base positions that are to be dropped. The format is simply a list of numbers, one number per line. In the case of 'r' or 'i', the program asks the user for information on the range limits, or individual base positions, respectively.

 

 

Linkage Disequilibrium Options Menu

-------------------------------------------

 

If the 'L' command is not used on the command line, and the 'L' option is listed in response to the Analysis Choice Menu then the following menu will appear. Some of these commands require additional information that must be entered in response to queries, or provided with other command line flags.

 

Linkage Disequilibrium (LD) Analyses:

m print matrix of pairwise values and significance tests

y measure and compare average LD by regions

t randomization tests of average LD by regions

s analyze LD among polymorphism shared between groups

x exclude singletons from all LD analyses

 

Apply Analyses to the Following LD Measures

d D - standard linkage disequilibrium (default)

p D' - D prime = D/Dmax

b |D| - absolute value of D

a |D'| - absolute value of D prime

r correlation coefficient

q r^2 - squared correlation coefficient

 

TYPE THE LETTERS OF DESIRED ANALYSES AND MEASURES (no spaces):

 

 

_______________________

 

Command line usage Return to Contents

_______________________

 

Nearly all commands and options can be given at the command line.  Usage of the command line permits many analyses to be automated, so that they can be very easily repeated and, if desired, given in batch files. 

 

Primary Command Line Flags:

 

Each command line parameter flag may be upper or lower case and may be preceded by a '-', '\' or '/'. Following a command line parameter is a string of characters appropriate to that command.

 

Flag parameter string example

---------------------------------- ----------------

I the name of the data file -Imydata.seq

R the name of the output file -Rmyresult

M message up to 78 characters (no blanks) -Mmy_data

S kinds of polymorphic sites to analyze -Sa   (see Site Options Menu above)

A kinds of analyses to perform -Aspr (see Analysis Menu above)

O data and output options                  -Ogc  (see Data and Ouput Menu above)

L linkage disequilibrium options -Lyrp (see LD Analysis Menu)

C    alternate genetic code                   -Cf   (see Alternative Genetic Code Menu above)

 

Any of these parameters can be given at the command line when the program is started. Either uppercase or lowercase letters may be used for flags and for their parameter strings.  For example, the following command line, entered in a command prompt window will generate a file called myresult.sit.

 

sites -isitestestdata -rmyresult -mmy_data -sa -aspr -ogc -cf

 

For each of the parameter flags that are not included at the command line when the program is started, the program will ask for the information. It is not required that any of the command line parameter flags be used. The program will also ask for more information if phylip style data formats are used, and if some of the analysis options are used. The data file can have any name, though it should not include the folder information. This usually means that the program and the data need to reside in the same folder. The name of the output file should not have an extension. The characters '.SIT' will be added to the output file name that is given to the program. Thus, for the example above, the program would produce a file called 'myresult.sit'.

 

Three of the flags ('S', 'A', and 'O') correspond to menus (see below). If the flag is not used at the command line, a menu will appear, requesting user input.

 

Secondary Command Line Flags:

 

Additional command line options can be used to limit the size of the data set if the 'O' flag or the L flag is used. These secondary flags should be placed after the respective primary flag.

 

Note that command line flags for dropping sequences can be used either in conjunction with flags for excluding base positions, or with flags for keeping certain base positions.

 

 

 

_______________________

 

Output Return to Contents

_______________________

 

When running, the program writes some messages to the screen (unless the 's' ANALYSIS OPTION is used). While reading in the data, the program writes the base position of each polymorphic site as it is found. Following this, the program writes a brief message for each category of analysis being conducted.

 

The output of the program is all contained in one file that has a '.SIT' extension. This file can be quite long for large data sets and complete analyses. If there are many sequences, the 'p' option in the OUPUT OPTIONS menu can cut down on the size of the data set.

 

For the most part, the formatting of the output is not hard to follow. At the top of the file, the run parameters are listed, as are the sequence and group names. For a full analysis of a data set with multiple groups, the following headings are generated:

 

Table of Polymorphic Sites

-------------------------------

This table provides the base position and type of polymorphism for all variable base positions in the data set (excluding those sequences and regions of sequences that were not included in the analysis).

 

Counts of Site Types

-----------------------

An approximate list of counts of the number of base positions associated with each kind of polymorphisms in the table. These counts are rough because some base positions may have multiple kinds of polymorphisms.

 

Indel Tables

--------------

A set of two tables: The first is very similar to that for polymorphic sites, except that each distinct indel (regardless of length) gets just one position. The second table lists the sequence and position of each distinct indel.

 

Table of Synonymous, Replacement and Uncertain Exon Changes

--------------------------------------------------------------------------

A list of all (or most) coding region base changes. With each polymorphism the different codon states are shown. This can be used to resolve cases where multiple changes have occurred within the same codon, and where it was not clear to the program whether a change was synonymous or replacement. This analysis may not be complete if there are three or four bases segregating at a position, or in cases where a single aligned codon has many amino acids segregating.

 

Codon Usage Table

---------------------

A table of codon usage for the first sequence. Also included are counts of the numbers of synonymous and replacement sites (i.e. relative proportions of random mutations expected to cause synonymous or replacement changes). The calculation of these values is best explained with an example. Consider a codon (e.g. AAA for lysine) and consider each base in the codon. For each base, the fraction of all mutations that will change the codon in such a way that the amino acid does not change is counted. For AAA, 0 of the three possible mutations at the first base will lead to a synonymous change, similarly 0 for the second base, and 1/3 of the mutations for the third base (because an A-> G change leads to a AAG which is also lysine). So the number of synonymous sites in an AAA codon is 1/3. The number of replacement sites is 3-1/3 = 2 2/3. Every codon gets a score this way, and the final tally is just the sum of scores for all codons.

 

Codon Usage Table For All Lines

-------------------------------------

A table of codon usage for all sequences and counts of synonymous and replacement sites summed across all sequences.

 

Polymorphism Analyses

--------------------------

Most analyses are applied only to those types of sites that are specified in the SITES OPTIONS menu, but there are exceptions - see below.

        BASE PAIR COMPARISONS - not including N's or indels - counts of the number of bases compared and the number of differences for all pairs of sequences. Essentially a distance matrix. The counts of the numbers of bases compared are based on the entire sequence, or a shorter region, depending on the 'x' SITE OPTION, and the 'l' and 'm' ANALYSIS OPTIONS. The counts of the number of bases compared are not reduced by other SITE OPTIONS choices. However, the counts of site differences are affected by SITE OPTION choices. This table is not generated if the 'p' option is used in the OUTPUT OPTIONS.

        GROUP DIFFERENCES - not including N's or indels -a matrix with group by group comparisons. Above and on the diagonal are the average pairwise differences for those sites specified under SITE OPTIONS. Below the diagonal is the net average pairwise divergence (e.g. Nei, 1987, p.276).

        GROUP DIFFERENCES PER BASE PAIR - same as above but numbers are per base pair. Note that the divisor is the average # of base pairs compared, calculated from the same numbers above the diagonal in the BASE PAIRS COMPARISON table. Thus these per base pair measures may be misleading for some SITE OPTIONS choices. For example, if only synonymous sites are analyzed (option 's' under SITE OPTIONS), a per base pair measure of divergence can be obtained by dividing numbers in the GROUP DIFFERENCES matrix, by the estimated number of synonymous sites, that are given beneath the CODON USAGE TABLE.

        FIXED DIFFERENCES - A fixed difference is a polymorphic site at which all of the sequences of one group are different from all of the sequences of a second group.

        SHARED POLYMORPHISMS - A shared polymorphism is a polymorphic site at which each of two groups of sequences are found to have at least two of the same bases.

        Fst AND POPULATION MIGRATION RATES - Fst values, between pairs of populations, and estimates of Nm, assuming diploidy (i.e. N is the effective number of diploid individuals) calculated according to equation 4 (except using a factor of 1/4) of Hudson et al., 1992. This estimate should be multiplied by 4/3 to get the corresponding number for an X linked locus. It should be multiplied by 2 to get the corresponding number for a haploid locus. Also if the locus is haploid and sex-limited (e.g. mitochondria), the estimate should be multiplied by 2, and then it applies only to the number of individuals of the sex that carry the locus.

        POLYMORPHIC SITE FREQUENCIES PER GROUP FOLDED & ROOTED -These tables give the counts of the number of lines that carry a polymorphism of a certain frequency. For example, a site in which two sequences are different from the remaining n-2 sequences is counted in either category 2 or category n-2 (depending on whether the distribution is folded, or if not depending on the value of the root sequence. There are two tables, one for the folded distribution which shows only the frequencies for the rarest bases, and one for the rooted distribution. A rooted distribution is not shown if the analysis includes only one group. The table for the rooted distribution is based on an outgroup sequence chosen from one of the other sequence groups. The method for picking outgroups is very simple. The outgroup sequence for one group is the first sequence listed in the most divergent other group. This may not be an ideal outgroup for various reasons. Some properties of these distributions are known for some models and the distributions are useful for considering questions about changing population size and natural selection. See papers by Tajima and also by Fu.

        SITES WITH MORE THAN TWO BASES SEGREGATING - A number of analyses assume an infinite sites model, under which a polymorphic site is caused by exactly one mutation and there can be no more than two bases segregating. This table lists those sites that are clearly not consistent with this assumption. If desired, the program can be rerun using analysis option 'd' to exclude these sites.

        D STATISTICS - These indices are measures of departure from a neutral Fisher-Wright model. See Tajima (1989) and Fu and Li (1993). These statistics rely on a count of the number of polymorphic sites. The counts that are used come directly from the site frequency distribution. However these counts will underestimate the actual number of mutations, particularly if some sites are segregating more than two bases within a sequence sample group. For Fu and Li D, an outgroup sequence is picked from among the other groups in the data set (if any occur - see above for POLYMORPHIC SITE FREQUENCIES). Note, the outgroup sequence that is picked may not be ideal, depending on the divergence among groups. Also given are counts of the different classes of mutations, as defined by Fu and Li.

        THETA (4Nu) ESTIMATES - Two different estimates of the neutral mutation parameter 4Nu: Watterson s and nucleotide diversity or pi. See, for example, Hudson (1990) or Tajima (1993). Also listed are the number of sequences for each group and the number of bases, which is calculated by taking the average of the number of bases compared in the pairwise comparison matrix.

Historical Population Model Fitting

----------------------------------------

This section provides the results of the fitting of specific historical models to the data. These models are described in Wakeley and Hey (1997). The methods work best for data sets for which the models are a close fit to reality, and for which there are many polymorphisms and lots of recombination. For most data sets, the models do not fit very well, and the fitting methods are not able to return a reasonable solution.

 

Recombination

-----------------

 

        POPULATION RECOMBINATION PARAMETER (4Nc) ESTIMATES - Two estimates of 4Nc the population recombination rate are generated: gamma (Hey & Wakeley, 1997) and Hud4Nc (Hudson 1987). c is the crossing over rate for the sequenced region per generation. These estimates assume diploidy, and must be adjusted otherwise. Also listed is the estimated ratio of the number of recombination events per mutation (c/u). This is calculated simply by dividing gamma by Theta (i.e. estimated 4Nc, divided by estimated 4Nu).

 

A very important assumption of these methods is that just a single mutation has caused each polymorphism within a group. gamma in particular is quite sensitive to a violation of this infinite sites assumption. Hud4Nc may not be as sensitive, however this estimator suffers from extreme bias (estimates tend to be way too high) if the numbers of sequences or polymorphisms are low (Hudson, 1987).

 

        CONGRUENCY AND INTERVAL TABLES For each group, two tables are printed out.

o       TABLE OF INFORMATIVE SITES. - This is a matrix that I sometimes find useful for thinking about recombination. It lists all informative polymorphic sites and the values carried by each sequence. A site is considered informative if there are at least two bases that occur in at least two of the sequences. 'rarecount' is the number of times the less common base occurs. The matrix tells whether each pair of sites is congruent. Two sites are congruent if together they do not reveal all four gametic types. For example a G-C polymorphism and an A-T polymorphism would be congruent if there occur individuals of just three of the possible gametic types (e.g. GT, GA, and CT, but not CA). Under the infinite sites model all four types can only occur if there has been crossing over or gene conversion. The matrix has a '*' or a '+' at pairs of congruent sites, and a blank for incongruent pairs. Each site with itself gets an 'S'. Thus a completely congruent data set would be all S's and *'s. The more recombination the more there is a tendency for blocks of *'s and +'s to be broken up. It can be a handy visual aid, but is unwieldy with too many polymorphisms. It also assumes an infinite sites mutation model. The matrix can be enormous if there are a great many polymorphisms. The program will only print up to 245 columns (i.e. 245 informative polymorphic sites). The matrix does not deal well with sites that exhibit more than 2 bases in a group. One of the bases gets arbitrarily lumped with one of the others.

 

o       MINIMAL SET OF RECOMBINATION INTERVALS - This table lists the smallest set of locations of recombination events in the history of the sample. See Hudson and Kaplan (1987) for full details. Under the model, each interval must bracket at least one recombination event.

 

Linkage Disequilibrium Analyses

-------------------------------------

        AVERAGE LINKAGE DISEQUILIBRIUM BETWEEN REGIONS - If the 'y' or 't' option is listed in the LD Menu, then an analysis is conducted of the mean LD between regions of the sequence. These regions are listed in a separate file, the name of which is either given on the command line (-G flag) or requested. For each region, and each pair of regions the mean LD is calculated between all pairs of sites. These values are listed in a matrix that also shows separately for each group how many sites could be compared for each mean (if there are less than 10 pairs of sites, the mean is not shown).

 

A different table is generated for every LD measure that was called for (LD MENU or '-L' command line flag).

 

If the 't' option is listed in the LD Menu, then a randomization test is conducted of the mean LD values. For each randomization mean LD is calculated within and between each region, and the simulated mean values are compared to the actual values. For randomizations within regions, the base values for each polymorphic site within a group are randomly shuffled among the sequences of that group. For randomizations between regions the complete sequences for each region, within each group, are randomly shuffled so that the sequences remain intact but they linkage with respect to other regions is randomized.

 

A total of 1000 LD region randomizations are done. If screen output is not suppressed the number of simulations are listed on the screen. These tests can take seconds or days depending on the size of the data set and the speed of the computer, so it is best not to turn of screen output.

the test results are shown in the tables of means, with symbols to indicate the level and direction of significant departure from random.

 

        MATRICES OF LINKAGE DISEQUILIBRIUM VALUES AND TESTS -These matrices will be large if there are many polymorphic sites within a group. They are printed out in a separate file, the name of which is either given on the command line ('-E' flag) or is requested by the program in response to the 'm' option in the LD OPTION MENU. For each group there is a matrix of tests of association. Fisher's Exact Test is shown above the diagonal, and a chi-square test is shown below the diagonal.

 

For each group and each LD measure requested, a matrix is given that shows the value of LD between all pairs of sites for which values could be calculated. Values are shown to just 1 figure so as to have the matrix in manageable space.

 

Following each matrix is given the mean LD value for that matrix.

 

        LD ANALYSES AMONG SHARED POLYMORPHISMS - A series of matrices are listed, each containing LD calculations involving shared polymorphisms between species. These can be useful in trying to figure out whether shared polymorphisms are due to gene flow or due to shared ancestry.

 

GC Content Analysis

-----------------------

A table of the number of bases at each codon position, and the GC percentage for each DNA sequence.

 

Formatted Values for Input to HKA Program

--------------------------------------------------

The HKA program carries out the HKA test (Hudson et al., 1987) for multiple loci. It can be conducted for pairs of species or for one species and a single outgroup sequence. HKA program input lines are generated for all possible pairs of groups.

 

Formatted Values for Input to WH Program

------------------------------------------------

The WH program carries out the analyses described in Wakeley and Hey (1997) and in Wang, Wakeley and Hey (1997). WH input lines are generated for all possible pairs of species.

_______________________

 

Program Limitations Return to Contents

_______________________

 

The program requires that the DNA sequences be aligned prior to analysis. SITES does not carry out any phylogenetic analysis. It does not estimate trees, and it does not endeavor any assessment of multiple hits. In general the situations for which SITES is intended are ones in which there have not been many cases of multiple mutations at base positions and where there have not been so many insertion/deletion events that alignment is difficult.

 

The program can handle large amounts of data, depending on available RAM.  If there is sufficient RAM then the distributed version of the program  can handle up to 200 sequences, each of 20,000 base pairs, and up to 20 groups.  These values can be increased by changing constants in sites.h and recompiling the program.  

The program is quick for basic polymorphism analyses. Recombination, Population Model Fitting, and Linkage Disequilibrium analyses can be time consuming for large data sets.

 

If you have a data set that crashes or hangs the program, you can send it to me with a description of the problem. Depending on time, I'll try and find the problem.

 

_______________________

 

Literature Cited Return to Contents

_______________________

 

 

Fu, Y. X., and W. H. Li, 1993 Statistical tests of neutrality of mutations.

Genetics 133: 693-709.

 

Hey, J. and J. Wakeley. 1997. A coalescent estimator of the population

recombination rate. Genetics 145, 833-846.

 

Hedrick, P. W., 2000 Genetics of Populations. Jones and Bartlett, Sudbury, MA.

 

Hudson, R. R., 1987 Estimating the recombination parameter of a finite

population model without selection. Genet. Res. Camb. 50: 245-250.

 

Hudson,R.R. 1990 Gene genealogies and the coalescent process. In: Oxford

Surveys in Evolutionary Biology. Vol. 7. (Eds: Harvey,PH; Partridge,L)

Oxford University Press, New York, (1-44)

 

Hudson, R. R., and N. L. KAPLAN, 1985 Statistical properties of the number of

recombination events in the history of a sample of DNA sequences.

Genetics 111: 147-164.

 

Hudson, R. R., M. Kreitman and M. Aguad , 1987 A test of neutral molecular

evolution based on nucleotide data. Genetics 116: 153-159.

 

Hudson,R.R., M.Slatkin and W.P. Maddison 1992 Estimation of levels of gene

flow from DNA sequence data. Genetics 132: 583-589.

 

Kliman,R.M. and J. Hey 1993 DNA sequence variation at the period locus within

and among species of the Drosophila melanogaster complex. Genetics 133:

375-387.

 

Nei,M 1987. Molecular Evolutionary Genetics. Columbia University

Press, New York.

 

Schaeffer,S.W. and E.L. Miller 1992 Estimates of gene flow in Drosophila

pseudoobscura determined from nucleotide sequence analysis of the

alcohol dehydrogenase region. Genetics 132: 471-480.

 

Tajima, F., 1989 Statistical method for testing the neutral mutation

hypothesis by DNA polymorphism. Genetics 123: 585-595.

 

Tajima,F 1993 Measurement of DNA polymorphism. In: Mechanisms of Molecular

Evolution. (Eds: Takahata,N; Clarke,AG) Sinauer Associates, Sunderland,

MA, 37-59.

 

Watterson, G. A., 1975 On the number of segregating sites in genetical models

without recombination. Theor. Pop. Biol. 7: 256-275.

 

Wakeley, J. and J. Hey. 1997 Estimating ancestral population parameters.

Genetics 145, 847-855.

 

Wang, R. L., J. Wakeley and J. Hey, 1997 Gene flow and natural selection

in the origin of Drosophila pseudoobscura and close relatives.

Genetics 147: 1091-106.

 

This page last changed November 10, 2015