GenePlot

GenePlot Help

The basic procedure for using GenePlot on the Web is as follows:

1. Upload a data file containing allele data for reference populations and any additional individuals you wish to assign.
2. Choose two or more populations from the data to act as the reference populations.
3. [Optional] Choose additional groups of individuals to assign.
4. Run GenePlot.
5. View results (as a graph or a table).

You can repeat the analysis for different populations from the same file without uploading the file again. To do this, change your selection of reference populations and/or groups to be assigned, or change the method to use, and click "Run GenePlot" once you have finished your selections.

You can also open up a PDF and save individual plots as separate pages of that PDF (see Saving Plots below).

Note that if you make any changes to your selection of reference populations, groups to be assigned, loci, method or prior, then the previous plot will vanish until you click on "Run GenePlot" to rerun the plot. This is to ensure that you don't get confused about whether or not the plot you're seeing reflects your currently selected options.

Uploading Files

The app can accept Genepop format files, or the file format defined on the "Example File" tab. Your data file can include data from many populations, and you can then analyse them in different combinations. For Genepop format files, the populations will be named "Pop1", "Pop2", etc. in the order that they appear in the file.
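The naming rule for Genepop files can be sketched as follows. This is only an illustration of the "Pop1", "Pop2" labelling, not GenePlot's own file reader; the function name `label_genepop_populations` is hypothetical, and the sketch assumes the usual Genepop layout (a title line, locus names, then a "Pop" line before each population).

```python
def label_genepop_populations(lines):
    """Return (individual_id, population_label) pairs from Genepop-style lines.

    Hypothetical helper: populations are labelled Pop1, Pop2, ... in the
    order in which their "Pop" separator lines appear.
    """
    labelled = []
    pop_index = 0
    for raw in lines[1:]:                  # the first line is a free-text title
        line = raw.strip()
        if not line:
            continue
        if line.lower() == "pop":          # a "Pop" line starts a new population
            pop_index += 1
            continue
        if pop_index == 0:
            continue                       # still inside the locus-name header
        ind_id = line.split(",", 1)[0].strip()
        labelled.append((ind_id, f"Pop{pop_index}"))
    return labelled
```

Every individual listed after the first "Pop" line is labelled "Pop1", every individual after the second is labelled "Pop2", and so on.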

Reference Populations

After you have uploaded a file, GenePlot will detect the populations listed in the data. You can then choose any of those populations to be reference populations. You can use GenePlot to analyse the genetic structure of the reference populations, or to assign individuals by comparing them to the reference populations.

Choose reference populations by holding down the "Ctrl" key and then using the mouse to click on multiple populations. Scroll down the list to see more populations. If you want to deselect a population, hold down "Ctrl" again and click on that population; the other populations you have chosen will remain selected.

You must choose at least two reference populations. If you choose 2 reference populations, GenePlot will display a graph with the fit for one population on the x-axis and the fit for the other population on the y-axis. If you choose more than 2 reference populations, GenePlot will carry out PCA (principal component analysis) on the fit results with respect to all the populations, and display the first two principal components as the x- and y-axes of the graph. That is, the x-axis will show the linear combination of population fits that gives the best separation of individuals. If you have chosen 4 or more reference populations, then you can choose to show the 1st and 2nd principal components or the 3rd and 4th principal components.
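The PCA step for three or more reference populations can be sketched in plain Python. This is a rough illustration under stated assumptions, not GenePlot's implementation: the fit values are assumed to be on a log10 scale, the centring choice is a guess, and the helper names (`first_two_components` and friends) are hypothetical. It finds the first two principal components of a small matrix of fits by power iteration on the covariance matrix.

```python
import math

def _normalise(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def _leading_eigenvector(matrix, orthogonal_to=None, iterations=500):
    """Power iteration, optionally kept orthogonal to an earlier component."""
    p = len(matrix)
    vec = _normalise([float(i + 1) for i in range(p)])   # arbitrary start vector
    for _ in range(iterations):
        vec = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        if orthogonal_to is not None:
            dot = sum(a * b for a, b in zip(vec, orthogonal_to))
            vec = [a - dot * b for a, b in zip(vec, orthogonal_to)]
        vec = _normalise(vec)
    return vec

def first_two_components(fits):
    """fits: one row per individual, one column per reference population."""
    n, p = len(fits), len(fits[0])
    means = [sum(row[j] for row in fits) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in fits]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    pc1 = _leading_eigenvector(cov)
    pc2 = _leading_eigenvector(cov, orthogonal_to=pc1)
    # Each individual's plot coordinates are its projections onto pc1 and pc2.
    scores = [(sum(a * b for a, b in zip(r, pc1)),
               sum(a * b for a, b in zip(r, pc2))) for r in centred]
    return pc1, pc2, scores

# Hypothetical log10 fits for six individuals against three populations.
example_fits = [[-8, -15, -12], [-9, -16, -12], [-15, -8, -12],
                [-16, -9, -13], [-12, -12, -10], [-12, -13, -11]]
pc1, pc2, scores = first_two_components(example_fits)
```

Each entry of `pc1` is the weight given to one population's fit in the linear combination shown on the x-axis; `pc2` plays the same role for the y-axis.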

Groups to be Assigned

After selecting reference populations, you can also select groups of individuals to be assigned. GenePlot will list the population labels found in the data file, so choose labels for your individuals to be assigned that will help you to identify them. If you need to be able to distinguish between specific individuals on the graph, list them in the data file with a different population label for each individual.

You can choose as many groups to be assigned as you like.

Method

We strongly recommend that you run GenePlot using leave-one-out, especially for data containing small population samples (i.e. fewer than about 30 samples).

The principle of leave-one-out is that when we calculate the fit for a reference individual with respect to their own reference population, we leave them out of the population. When we calculate the fit of other individuals, we include this one in its population again.

As an example, imagine that you have two reference populations, Pop A and Pop B, and individual A1 was sampled in Pop A. When you calculate the fit of individual A1 with respect to Pop A, you temporarily remove individual A1 from Pop A before estimating the allele frequencies for Pop A.

The other method, Basic, calculates the fit of individuals in the reference populations whilst also including those individuals' alleles in the pool of alleles from their population. Using this method, when we calculate the fit of a reference sample, the fit will be positively biased because the individual's data is used both to characterise the population and to calculate the fit. If the reference population only has a few samples, every one of those samples can have a big effect on the estimated allele frequencies for the population. Overall, the level of separation between the populations will tend to be exaggerated, because each reference sample will have a better fit to their own population than they should.
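The difference between the two methods can be sketched for a single locus. This is a simplified illustration, not GenePlot's code: it assumes a Dirichlet-multinomial genotype probability in the spirit of Rannala & Mountain (1997), uses hypothetical genotypes, and the function names are invented for the example.

```python
import math
from collections import Counter

def log10_genotype_prob(genotype, allele_counts, alleles, prior):
    """Log10 Dirichlet-multinomial probability of one diploid genotype at one locus."""
    a, b = genotype
    total = sum(allele_counts[x] for x in alleles) + prior * len(alleles)
    na = allele_counts[a] + prior
    nb = allele_counts[b] + prior
    if a == b:
        prob = na * (na + 1) / (total * (total + 1))
    else:
        prob = 2 * na * nb / (total * (total + 1))
    return math.log10(prob)

def fit_to_own_population(index, genotypes, alleles, prior, leave_one_out=True):
    """Fit of reference individual `index` to the population it was sampled from."""
    counts = Counter(allele for g in genotypes for allele in g)
    if leave_one_out:
        counts.subtract(genotypes[index])    # temporarily remove this individual's alleles
    return log10_genotype_prob(genotypes[index], counts, alleles, prior)

# Hypothetical Pop A: the last individual carries the only copy of allele 122.
pop_a = [(96, 126), (96, 96), (126, 126), (96, 122)]
alleles = [96, 122, 126]
prior = 1 / len(alleles)
basic_fit = fit_to_own_population(3, pop_a, alleles, prior, leave_one_out=False)
loo_fit = fit_to_own_population(3, pop_a, alleles, prior, leave_one_out=True)
```

Here `basic_fit` is larger than `loo_fit`, because under the Basic method the individual's own rare allele is counted as evidence that the allele is present in Pop A.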

Prior

If an allele was not found in the sampled data from one of the populations, that could be simply because it is rare in the population, rather than non-existent. In other words, the estimated allele frequencies for the populations are subject to sampling error.

The assignment algorithm uses a Bayesian method to infer the allele frequencies of each population, based on the sample data. It is necessary to set a prior on the allele frequencies at each locus. By setting non-zero prior frequencies for all possible alleles at that locus (i.e. any alleles found within the chosen reference populations), all of these alleles will have a non-zero estimated posterior frequency, which allows for sampling error.

There are two standard priors used for assignment. One is defined in Rannala & Mountain (1997), and takes the value 1/k for every allele at a locus, where k is the number of distinct alleles at that locus. This is the default prior. The other is defined in Baudouin & Lebrun (2001), and takes the value 1 for every allele at a locus.

The prior defined by Baudouin & Lebrun may be more suitable for small samples, especially when the true population is thought to be much larger. Small samples from large populations are subject to much greater sampling error. The Baudouin & Lebrun prior penalizes rare alleles less than the Rannala & Mountain prior does, and thus compensates more for the sampling error.
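The effect of the prior value on an allele that was not observed in the sample can be illustrated with posterior mean allele frequencies. This is a minimal sketch with hypothetical allele counts; it compares the two prior values mentioned above (1 per allele, and 1/k per allele) and is not taken from GenePlot's code.

```python
def posterior_mean_frequencies(allele_counts, prior_value):
    """Posterior mean frequency of each allele under a symmetric Dirichlet prior."""
    total = sum(allele_counts.values()) + prior_value * len(allele_counts)
    return {allele: (n + prior_value) / total for allele, n in allele_counts.items()}

# Hypothetical counts at one locus; alleles 122 and 128 were seen only in
# other reference populations, so their counts here are zero.
counts = {"96": 5, "122": 0, "126": 3, "128": 0}
k = len(counts)

with_one = posterior_mean_frequencies(counts, 1.0)
with_one_over_k = posterior_mean_frequencies(counts, 1.0 / k)
```

With these counts, the unobserved allele "122" gets posterior mean frequency 1/12 under a prior value of 1, but only 0.25/9 under a prior value of 1/k, so the larger prior value leaves more room for alleles that were missed by chance.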

Marking Imputed Individuals

The option "mark imputed individuals" determines whether individuals with missing data, who have their probabilities imputed, are marked on the plot or not. If you check this option, then individuals with missing data will be shown on the plots with asterisks inside their shapes.

Swapping Axes

Whenever you have plotted only two reference populations, there will be an option available to "Swap axes". If you tick this box, then the GenePlot will be replotted with the axes switched. Additional plots will remain in switched mode until you untick the box.

The default axis order is determined by the order of the populations in your file. Whichever of the two populations appears first in the list of populations in your file will be the one plotted on the x-axis by default. If "Swap axes" is ticked then that population will be plotted on the y-axis instead.

Results

The results of the analysis are shown as a graph (on the Graph tab) and as a table (on the Results tab).

Individuals from different populations and groups are displayed on the graph with different colours.

Individuals who have missing loci are marked on the graph with an asterisk "*" inside their symbol, and are marked in the results with "impute" status. GenePlot also reports how many individuals were excluded from the analysis because they had data at too few loci. For individuals with missing data, the results show the "raw" fit for each population based on the loci that were present for the individual, and also the final fit for all the loci, obtained via the saddlepoint method.

If you have selected two reference populations, the x-axis will show the fit of each individual with respect to one of the populations, and the y-axis will show the fit of each individual with respect to the other population. The thick diagonal line is the line of equal fit in both populations. The thinner diagonal lines on either side are the lines at which the fit for one population is 9 times larger than the fit for the other population.
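The diagonal guide lines can be expressed as a simple comparison. The sketch below assumes that the fits are plotted on a log10 scale, in which case the thin lines sit log10(9) above and below the thick diagonal; that scale, and the function name `compare_fits`, are assumptions made for the illustration.

```python
import math

def compare_fits(log10_fit_x, log10_fit_y, ratio=9.0):
    """Say where a point falls relative to the diagonal guide lines."""
    offset = math.log10(ratio)            # distance of each thin line from the thick diagonal
    difference = log10_fit_y - log10_fit_x
    if difference > offset:
        return "beyond the thin line on the y-population side"
    if difference < -offset:
        return "beyond the thin line on the x-population side"
    return "between the thin lines"
```

For example, an individual with a log10 fit of -10 to the x-axis population and -12 to the y-axis population has a fit 100 times larger for the x-axis population, so it lies beyond the thin line on that side.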

On a graph for two populations, the vertical lines show the 1% and 99% quantiles for the population on the x-axis, and the horizontal lines show the 1% and 99% quantiles for the population on the y-axis. This means that an individual on the 99% quantile for Pop A has a better fit for Pop A than 99% of all theoretical individuals that could possibly come from Pop A; in other words, the individual has a very good fit to Pop A. An individual on the 1% quantile for Pop A has a worse fit for Pop A than 99% of all theoretical individuals that could possibly come from Pop A; in other words, the individual has a very poor fit to Pop A. The method for calculating these quantiles is explained briefly on the Background tab.
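The meaning of the quantile lines can be illustrated by simulation. GenePlot itself obtains the quantiles with the saddlepoint method; the sketch below swaps in a plain Monte Carlo approach instead, with hypothetical allele frequencies, Hardy-Weinberg genotype probabilities, and invented function names, purely to show what "all theoretical individuals that could possibly come from Pop A" means.

```python
import math
import random

def simulate_log10_fits(freqs_by_locus, n_sim=5000, seed=1):
    """Simulate individuals from a population and return their sorted log10 fits."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_sim):
        total = 0.0
        for freqs in freqs_by_locus:
            alleles = list(freqs)
            weights = [freqs[a] for a in alleles]
            a, b = rng.choices(alleles, weights=weights, k=2)   # draw two alleles
            prob = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
            total += math.log10(prob)
        fits.append(total)
    return sorted(fits)

def empirical_quantile(sorted_values, q):
    """Value below which roughly a fraction q of the simulated fits fall."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]

# Hypothetical allele frequencies for Pop A at two loci.
pop_a_freqs = [{"96": 0.6, "126": 0.3, "122": 0.1}, {"276": 0.5, "280": 0.5}]
fits = simulate_log10_fits(pop_a_freqs, n_sim=2000)
q01 = empirical_quantile(fits, 0.01)
q99 = empirical_quantile(fits, 0.99)
```

An individual whose fit to Pop A is below `q01` fits Pop A worse than about 99% of the simulated Pop A individuals.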

If you have selected more than two reference populations, GenePlot performs PCA (principal component analysis) on the fit results for all the populations, and will then display the first two principal components as the x- and y-axes. That is, the x-axis will show the linear combination of fits with respect to different populations that gives the best separation of fit results for the individuals in the data (including reference individuals and individuals to be assigned).

Saving Plots

You can save a particular plot by right-clicking on the plot and selecting "Save Image". You will need to save the file as a .PNG file, e.g. "Results_plot.png".

Alternatively, you can save multiple plots as separate pages of a PDF file. If you want to do this, first click on "Open PDF to save plots", which will start a blank PDF. Then, whenever you have a plot you want to save, click on "Add this plot to PDF", which will add the plot to a new page of the PDF.

As soon as you have saved a particular plot to the PDF, the "Add this plot to PDF" button will be disabled, so that you can't accidentally add the same plot multiple times to the same PDF. If you have selected new options, such as new populations to plot, make sure to click "Run GenePlot" to get the new plot, and this will re-enable the "Add this plot to PDF" button.

If you tick or untick "Swap axes" then the result of that is treated as a new plot, and you can save it to the PDF separately to the previous one.

Once you have all the plots you need, then you can download the complete PDF file by clicking on "Download plots PDF". This will open up a dialog that lets you choose whether to open or download the file, and you can now rename it if you want, and save it to your local machine.

The "Add plot comments" option, if checked, adds a text box under the plot, where you can type in comments about the current plot, and those comments will be saved to the PDF along with the plot. The comments for a given plot can be up to 500 characters (the maximum is required in order to keep the text within the limits of the PDF page). Be warned! If you change any of the options such as changing the reference pops, etc. before saving the plot to PDF, the comments will disappear!

While the "Add plot comments" option is selected, the text box for comments will appear below every single plot; unticking the option will hide the text box again.

Data Files

The GenePlot app can accept Genepop format files: http://genepop.curtin.edu.au/help_input.html. As an alternative, you can also use our GenePlot format directly.

GenePlot Format Files

An example file in GenePlot format can be found at https://www.stat.auckland.ac.nz/~fewster/GenePlot/

Data uploaded to GenePlot has to be in CSV (comma-separated) format.

The file must include a Data section and a Locnames section (which lists the names of the loci).

Every individual in the data file must have an ID and a population label. Individuals from the reference populations will have the name of the population they were sampled from, but additional individuals that you want to assign must also be given a population label. If you want to be able to distinguish specific individuals on the GenePlot graph, give each one a different population label. You can use any alphanumeric label for each population, group or individual.

Each of the keywords "DATA", "END" and "LOCNAMES" must appear only once in the file; please do NOT repeat them.

Genetic Data

The main table of genetic data must be preceded by a single line containing the word "DATA", in capitals. The following lines should contain the data, using one row per individual. The line after the end of the data should contain the word "END" in capitals.

Do NOT include a header row in the data table. The line after the word "DATA" should contain the first entry.

The first two columns of the data table must give the IDs and populations of the individuals. The remaining columns should contain the allele names. IDs and population names can contain letters and/or numbers. There should be 2 allele columns for every locus name listed under LOCNAMES. Every row should end with a standard carriage return.

Use "0" for missing alleles.

Population Names

You can use any text you like for the population names, but do not use quotes within a population name. For example, it is fine to enter "PopA" in the population column, but not "Pop A 'Main'".

Locus Names

The locus names must be preceded by a single line containing the word "LOCNAMES", in capitals. The next line should list the locus names as a row of strings. Do NOT have spaces between the strings, only commas, as in the example data file.
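If your data is already held in a script, the layout rules above can be applied programmatically. The following is a minimal sketch using Python's standard csv module; the function name `to_geneplot_text` and the record layout are hypothetical, and the output has only been designed to follow the rules described here, not checked against the GenePlot app.

```python
import csv
import io

def to_geneplot_text(individuals, locus_names):
    """Build GenePlot-format text.

    individuals: list of (id, population, [allele1, allele2, ...]) tuples,
    with two integer alleles per locus and 0 for a missing allele.
    """
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC quotes the text fields and leaves integer alleles bare.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    buffer.write("DATA\n")                       # no header row after DATA
    for ind_id, population, alleles in individuals:
        if len(alleles) != 2 * len(locus_names):
            raise ValueError(f"{ind_id}: expected {2 * len(locus_names)} alleles")
        writer.writerow([ind_id, population, *alleles])
    buffer.write("END\n")
    buffer.write("LOCNAMES\n")
    writer.writerow(locus_names)                 # one line, commas only, no spaces
    return buffer.getvalue()

text = to_geneplot_text(
    [("Bi05", "Mahu", [96, 122, 0, 0])],         # second locus is missing
    ["D10Rat20", "D11Mgh5"],
)
```

The length check catches the most common layout mistake, a row whose number of allele columns does not match twice the number of locus names.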

Example Data File

The following is an example data file. Note that the locus names are a single line, with no carriage return.

DATA
"Bi01","Mahu",96,126,280,280,236,250,165,165,232,246,231,231,185,187,89,89,170,176,154,164
"Bi02","Mahu",96,126,280,280,250,262,155,155,232,232,231,233,149,185,127,127,174,174,164,166
"Bi03","Mahu",96,126,280,280,258,262,165,165,232,232,231,231,185,187,89,127,174,174,164,164
"Bi04","Mahu",96,126,280,280,238,262,155,155,232,232,231,233,149,185,127,127,174,174,164,164
"Bi05","Mahu",96,122,280,280,250,258,0,0,226,244,231,231,187,187,107,127,174,176,164,164
"Bi06","Mahu",96,96,280,280,238,262,155,155,232,232,231,231,187,187,123,127,174,174,164,164
"Bi07","BM",122,126,276,280,236,238,155,165,242,242,219,225,161,187,123,127,176,176,164,182
"Bi08","Flat",126,126,280,280,238,262,155,165,226,232,231,231,187,187,89,89,174,174,164,164
"Bi10","Flat",96,96,280,280,250,250,165,165,226,246,219,231,0,0,107,127,174,174,154,166
"Bi11","Taik",96,96,278,280,234,250,165,165,226,240,231,231,149,187,89,99,170,170,154,164
"Bi12","Taik",96,96,276,280,234,250,165,165,240,240,231,231,187,187,89,99,170,174,154,164
"Bi13","Taik",96,126,276,276,246,250,165,165,226,244,231,231,149,187,99,99,174,174,164,164
"Bi14","Taik",96,126,276,276,262,262,155,165,226,244,231,231,149,187,89,107,170,174,154,164
"Bi15","Taik",96,96,276,280,234,262,155,165,232,240,231,231,149,187,99,99,170,174,154,164
END

LOCNAMES
"D10Rat20","D11Mgh5","D15Rat77","D16Rat81","D18Rat96","D19Mit2","D20Rat46","D2Rat234","D5Rat83","D7Rat13"

The data includes 4 populations (Mahu, BM, Flat and Taik), 10 loci (2 alleles per locus) and 14 individuals (Bi01 to Bi15, with no Bi09).
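A file in this format can be read back and summarised with a few lines of Python. This is a sketch of one possible reader, not the one used by GenePlot; the function name `parse_geneplot_text` and the returned structure are hypothetical, and alleles are kept as strings with "0" meaning missing.

```python
import csv

def parse_geneplot_text(text):
    """Return (locus_names, individuals) from GenePlot-format text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    data_start = lines.index("DATA") + 1
    data_end = lines.index("END")
    locus_names = next(csv.reader([lines[lines.index("LOCNAMES") + 1]]))
    individuals = []
    for row in csv.reader(lines[data_start:data_end]):
        ind_id, population, alleles = row[0], row[1], row[2:]
        if len(alleles) != 2 * len(locus_names):
            raise ValueError(f"{ind_id}: expected {2 * len(locus_names)} allele columns")
        # Pair up the two allele columns belonging to each locus.
        genotype = {name: (alleles[2 * i], alleles[2 * i + 1])
                    for i, name in enumerate(locus_names)}
        individuals.append({"id": ind_id, "population": population,
                            "genotype": genotype})
    return locus_names, individuals

def population_labels(individuals):
    """Population labels in order of first appearance in the file."""
    return list(dict.fromkeys(ind["population"] for ind in individuals))
```

Counting the returned individuals, locus names and population labels gives a quick check that a file contains what you expect before you upload it.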

R package

The GenePlot R package is available on GitHub: https://github.com/lfmcmillan/geneplot. The R package has the same functionality as the web app, but has additional options for customizing the GenePlots. It can be installed from R using the install_github() function from the devtools package: devtools::install_github("lfmcmillan/geneplot"), or using the same function from the more lightweight remotes package: remotes::install_github("lfmcmillan/geneplot").

Authors

GenePlot was created by Rachel Fewster and Louise McMillan.

Citations

Please cite as follows:

McMillan, L. and Fewster, R. "Visualizations for genetic assignment analyses using the saddlepoint approximation method" (2017) Biometrics.

The online publication is available here.

Background

Population genetics is the study of multiple populations, typically of a single species, and the level of migration and gene flow or separation between them. Genetic assignment is a process by which the genetic data of an individual is compared with genetic data of samples from two or more reference populations, to assess how likely it is that the individual might have come from any of those populations.

GenePlot performs assignment via the algorithm proposed by Rannala and Mountain (1997) but also improves upon that by adjusting the results for individuals with missing data so that they can be visualized on the same graph as results from individuals with complete data. This is achieved by characterizing the genetic distribution of each reference population using the saddlepoint algorithm (specifically the formula proposed by Lugannani and Rice, 1980). This characterization also enables us to calculate the quantiles of each population, to indicate the shape of the distribution and get a better feel for how well a particular individual fits within a population.

References

Lugannani, R., and Rice, S. (1980). Saddle point approximation for the distribution of the sum of independent random variables. Advances in Applied Probability 12, 475-490.

Rannala, B., and Mountain, J. L. (1997). Detecting immigration by using multilocus genotypes. Proceedings of the National Academy of Sciences 94, 9197-9201.

Baudouin, L., and Lebrun, P. (2001). An operational Bayesian approach for the identification of sexually reproduced cross-fertilized populations using molecular markers. ISHS Acta Horticulturae 546, 81-93.

Contact Details

For more questions or additional help, please contact geneplotontheweb"at"gmail.com.