**FPG** -- A COMPUTER PROGRAM FOR FORWARD POPULATION GENETIC SIMULATION--

 

DOCUMENTATION 

 

Jody Hey

Department of Genetics

Rutgers University

Nelson Biological Labs

604 Allison Rd.

Piscataway, NJ  08854-8082

732-445-5272

fax 732-445-5870

hey@biology.rutgers.edu

http://lifesci.rutgers.edu/~heylab

 

 * This computer program and documentation may be freely copied and used by anyone, provided no fee is charged for it.

_______________________

 Contents

_______________________  

______________________

 Overview   

_______________________

 

FPG (for Forward Population Genetic simulation) simulates a population of constant size that is undergoing mutation, recombination, natural selection, and (if called for) migration between populations.   The meaning of "forward" in this context is simply that time, within the simulation, moves forward just as it does in the real world.  This is in contrast to coalescent population genetic simulation in which time, as represented within the simulation, proceeds back into the past.  Coalescent simulations have many advantages, but they are unwieldy if they incorporate natural selection on multiple sites.

 

FPG is useful for assessing the impact of natural selection on patterns of genetic variation.  Since many selected sites can be segregating at one, the program is useful for understanding the Hill-Robsertson effect (Hill and Robertson, 1966) under genetic hitchhiking (Maynard Smith and Haigh, 1974) or background selection (Charlesworth, Morgan and Charlesworth, 1993).   It is designed so as to be able to approximate real world situations with fairly large population sizes and high mutation rates over long stretches of DNA.  The mutation model is an infinite sites model, meaning that no site that is segregating in the population can receive another mutation.  

Life Cycle

The basic life cycle alternates between haploid genomes (gametes) and diploid genomes (zygotes), and proceeds as follows:

 

Generations are non-overlapping.

Fitness Schemes

 

The current mutation model accommodates three mutation rates for mutations that are neutral, deleterious, and beneficial.  The deleterious and beneficial mutations have identical selection coefficients that are used with opposite effect.  

 

Three different fitness schemes are incorporated: additive, multiplicative, and epistatic. In each scheme, the basic parameters are a selection coefficient, s >= 0,  and the dominance, 0<= h<=1.

 

Four quantities are measured for every zygote:

Additive Fitness Scheme

W = 1 + (op - on) * s + (ep - en) * h*s

Multiplicative Fitness Scheme

W= ps^op * ns^on * (ph*ps)^ep * (ns*nh)^en

For values of s near 0, this is well approximated by 

 

W = Exp(s*((op-on) + h*(ep-en)))


Epistatic Fitness Scheme

The epistatic model requires an additional parameter, y, that is the exponent in a power function.

 

The development of a general epistatic function begins by considering the exponential form of the multiplicative fitness scheme.   Consider, in the absence of dominance,  W=Exp(s*(op-on)).  Then if we include the epistatic factor, y, we get

W=Exp(s*(op-on)^y)

 

However, this expression will not work if the term to be raised to the y power has a negative value. This can be dealt with directly, if awkwardly  by using 

 

W=Exp( SIGN(op-on) s *|op-on|^y))

 

In other words, for every increase in (op-on) the effect on fitness is more than multiplicative if y>1 (positive or synergistic epistatis) and less than multiplicative  if  0<y<1 (negative epistasis).

 

To include dominance we simply use the corresponding expression of the multiplicative fitness scheme with dominance.

 

W=Exp( SIGN((op-on) + h(ep-en)) s * |(op-on) + h(ep-en)|^y )

 

Genome Structure

Each gamete consists of a haploid copy of the genome.  The genome consists of one or more chromosomes, each consisting of a number of segments.  Each segment can hold up to 32 mutations. Large genomes with hundreds to thousands of segments can be used to allow for an overall high mutation rate, corresponding perhaps to long stretches of DNA.   

There are two key considerations in selecting a genome size:  simulation speed and population mutation rate.  The larger the genome the slower the simulation.  Also if the mutation rate is too high, for a given genome size, then all available sites become segregating and no more mutations can be added to the population, and the program halts. 

Recombination occurs within chromosomes according to the value of the recombination parameter.  Chromosomes sort randomly into gametes, so there is, of course, high recombination between chromosomes.  

Usually there will be little need for more than 2 chromosomes in the genome, as this is sufficient to compare regions that are entirely unlinked. 

Making Measurements

 

The fitness and polymorphism patterns within the simulated population can be quite dynamic depending on the selection parameters and fitness scheme.  In particular, population mean fitnesses may fluctuate fairly dramatically over time, particularly if recombination is low.  In order to get an accurate assessment of the results for any particular parameter set, it may be necessary to run the population for a long period of time and to make many measurements. 

 

Most processes that go on within the simulation will have the population size (G - the number of gametes) as the key rate parameter.  In larger populations things happen more slowly (i.e. it takes more generations for something to happen).  Thus it is usually simplest to think of time in units of G generations, where G is the number of haploid genome copies in the population. 

 

At the beginning of any simulation the population is not segregating any mutations, and it is expected to take about 2G generations on average for a neutral mutation to reach the highest frequencies.  Thus the time until the first measurement can be taken from the population should generally not be less than 2G generations.  

 

The same considerations lead to the point that the time between measurements should also be of the order of G generations, at least if it is desired that successive measurements not be strongly correlated with one another. 

 

The Practicalities of Population Size

 

G,  the number of gametes in the population is an awkward parameter.  Often we do not know effective population size values very well for natural populations, and so have little idea what values to use in simulations that are intended to approximate a natural situation.  Another consideration is that we often expect the quantities that are to be measured in the simulation to  vary in proportion to the population size, in which case it would be suitable to make the simulated population size as small as possible as size does not matter and the simulations will be fast if the size is small.    However size often does matter,  particularly with selection models in which the selection coefficients are near zero.  In these cases it will often be necessary to consider a range of population sizes to see if the effect varies nonlinearly with G.  

 

Together these considerations often leave an investigator feeling as if they should consider all possible population sizes.  This would be fine except that increasing population size increases the simulation time (i.e. makes it slower).  An increase in G entails three types of speed reductions:

 

______________________

 

Downloadable Files              Return to Contents

______________________

The following files are available 

 

_______________________

 

 Running the Program              Return to Contents

_______________________

The program file should reside either in the same folder as the data file or in a folder automatically searched by the operating system.   It will usually be the case that a user will require several runs of the program using several different parameter values.  For this reason the program requires just two  parameters.

 

For example, to run the program type:

fpg   -Dmydata -Rmyresults

 

Alternatively, just enter the program name, and it will ask for the file names.

 

On a PowerPC, clicking on the program icon opens a small window in which command line parameters can be entered.  The user can also just hit return at this point and the program will request runtime parameters.

 

During the run the program will output a line of text to the screen.  For a population with G genome copies, this text will be written every G generations.  It contains the data set number, the generation number, the population mean fitness and the amount of room left for mutations.

 

The program generates a file with an extension of '.fpg'  (e.g. in this example the file would be myresults.fpg).

The input file is essentially a batch file, with multiple lines, each corresponding to a command to the program. Each line of the input file will run a simulation under a set of parameters.  

 

Key Parameters

There are three key parameters that play a large role in whether or not a simulation is workable.  

 

The downloadabe version of the program calls for 5 or 6 Megabytes of memory while it is running.  This memory model can accommodate a population of  haploid size 1000, each with a genome length of 1000 segments.  Larger or smaller memory models can be run, but require that parameters in the source code be changed, and the program recompiled.

 

Runtime parameters

Each line of the input file consists of a string of runtime parameters, most of which are required. The input file can have as many lines as desired. All results are saved to the output file, the name of which was given when the program was started.  After the run for one parameter set is complete, the output file is saved, so that a cancellation of the program will not destroy results from completed simulations.

The parameter flags, and their meanings, are as follows  (upper or lower case letters may be used):

Cautions

FPG has a large number of parameters and options.  Thus even for a very simple question, such has how much variation is present in the absence of natural selection, recombination and migration, there are still  many decisions to be made.   It is quite easy to input a set of parameters that are not workable for one reason or another or that generate an output that you cannot make much sense of.  In order for results to be informative, it is usually best  to: (1) begin with a simple model; (2) familiarize yourself with the results; and (3) build up from there.  For example here is  a good starting point :

-g100 -c1 -x2 -u2 -v0 -r0 -s0 -n100 -i4 -j1 -oN -k1 

This runs with  a small genome (e.g. 2 chromosomes, each of 1 segments, -x2 -c1),  low mutation rates (-u2 -v0), no selection (-s0), no recombination (-r0), small to moderate population size (e.g. -g100), and makes enough measurements (-n100) so that the results should not be subject to too much variance.

The population model implemented in FPG  is somewhat arbitrary in a number of ways.  In particular, there are a number of other reasonable ways to invoke selection and migration, in particular.  One basic constraint of having populations with 100's to 1000's of genomes is that many classic population genetic results that hold best when populations are very large do not hold so well when populations are small.  This makes it hard to check for bugs in the code.

_______________________

 

Output                                    Return to Contents

_______________________

 

All of the results are placed into a single text file, which can be quite large depending on the analyses invoked and the number of line sin the input file. 

 

At the top of the output file is basic information regarding file names, and parameter definitions.  Thereafter the results for each parameter set are presented in order.

 

_______________________

 

 Program Limitations               Return to Contents

_______________________

 

The distributed version of the program is set to handle populations of up to G=1000  (i.e. 1000 gametes, 500 zygotes) and total genome length of up to 1000 segments.  Since each segment has a length of 32 bits,  this means that the population can hold up to 32000 mutations.  With these lengths,  it is possible to simulate situations that model real world genome sizes.  For example you can assume that each segment corresponds to 100 base pairs and model a genome of 3.2 Megabases.  This is quite workable - so long as the  mutation rate is appropriate such that the mutation rate you use does not overload the genome (see Key Parameters).

 

For sample sizes of 100 and genome sizes of 10 segments,  the simulations are rapid, and it takes most computers only a few minutes to do 50G generations (i.e. 5000 generations).   Increasing the population size has a drastic effect on speed (see The Practicalities of Population Size).

 

If a very large number of simulations are desired for a particular population size and genome size, then it may be possible to increase speed by recompiling the program with reset values of key constants in the source code.

 

 

_______________________

 

Literature Cited                Return to Contents

_______________________

 

 

Charlesworth, B., M. T. Morgan and D. Charlesworth, 1993 The effect of deleterious mutations on neutral molecular evolution. Genetics 134: 1289-1303.

 

Hill, W. G., and A. Robertson, 1966 The effect of linkage on limits to artificial selection. Genet. Res. 8: 269-294.

 

Hudson, R. R., M. Slatkin and W. P. Maddison, 1992 Estimation of levels of gene flow from DNA sequence data. Genetics 132: 583-589.

 

Maynard Smith, J., and J. Haigh, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23-35.

Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.

 

Watterson, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-275.


 

This page was last changed March 31, 2014