Intro to assignPOP

assignPOP is an R package for population assignment within a machine-learning framework. It uses supervised machine-learning methods to evaluate the discriminatory power of data collected from source populations, and it can analyze large genetic, non-genetic, or integrated (genetic plus non-genetic) data sets. The framework is designed to avoid the upward bias in estimated assignment accuracy discussed in previous studies¹, ². Other features include:

Use principal component analysis (PCA) for dimensionality reduction (or data transformation)
Use Monte-Carlo cross-validation to estimate mean and variance of assignment accuracy
Use K-fold cross-validation to estimate membership probability
Resample various proportions or numbers of training individuals
Resample various proportions of training loci, either randomly or based on locus F_ST value
Provide several machine-learning classification algorithms, including LDA, SVM, naive Bayes, decision tree, and random forest, to build tunable predictive models
Output results in publication-quality plots that can be edited using ggplot2 functions
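
The two cross-validation schemes in the list above differ in how test individuals are drawn. The Python sketch below illustrates that difference using only the standard library. It is not assignPOP's implementation (the package itself is written in R); the function names `monte_carlo_split` and `kfold_splits` and the toy population labels are hypothetical.

```python
import random
from collections import defaultdict

def group_by_population(labels):
    """Map each population label to the indices of its individuals."""
    groups = defaultdict(list)
    for idx, pop in enumerate(labels):
        groups[pop].append(idx)
    return groups

def monte_carlo_split(labels, train_prop, rng):
    """One Monte-Carlo iteration: randomly draw a proportion of each
    population as training individuals; the rest form the test set."""
    train, test = [], []
    for members in group_by_population(labels).values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        n_train = round(train_prop * len(shuffled))
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:])
    return sorted(train), sorted(test)

def kfold_splits(labels, k, rng):
    """K-fold: partition each population into k groups; each group is the
    test set exactly once, so every individual is tested exactly once."""
    folds = [[] for _ in range(k)]
    for members in group_by_population(labels).values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        for i, idx in enumerate(shuffled):
            folds[i % k].append(idx)
    everyone = set(range(len(labels)))
    return [(sorted(everyone - set(fold)), sorted(fold)) for fold in folds]

# Toy baseline (made-up): 3 populations with 10 individuals each.
labels = ["A"] * 10 + ["B"] * 10 + ["C"] * 10
rng = random.Random(1)

mc_train, mc_test = monte_carlo_split(labels, train_prop=0.7, rng=rng)
kf = kfold_splits(labels, k=5, rng=rng)
```

Repeating `monte_carlo_split` many times yields a distribution of accuracies from which a mean and variance can be estimated, whereas the K-fold partition tests every individual once, which is what allows a membership probability to be reported for each individual.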

Conceptual Framework

The ability to assign individuals of unknown origin to source populations relies on a robust baseline (training data) collected from those source populations. However, noise in the training data or a lack of distinct features can lower assignment ability. It is therefore important to evaluate baseline data via cross-validation before using the baseline to predict the source populations of unknown individuals. The diagrams below illustrate how assignPOP fits into such an assignment framework, and the workflow for evaluating baseline data as well as predicting the source populations of unknown individuals.

Analytical Workflow

Resampling Cross-validation Workflow

The diagram below illustrates the workflow of resampling cross-validation, in which individuals from each population are divided into training and test sets, and the assignment test is repeated by resampling the training individuals. Multiple proportions of training individuals can be crossed with multiple proportions of training loci (when analyzing genetic data) in a single analysis (assign.MC() or assign.kfold()). For example, the diagram shows the top 10% or 50% of high-F_ST loci, or all loci, being used as training loci. This helps evaluate whether using the top 10% of loci, the top 50%, or all loci yields similar assignment accuracies. For more details, see Perform population assignment in the Data Analysis page.
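
The grid of training proportions and F_ST-based locus selection described above can be sketched roughly as follows. This is an illustrative Python sketch rather than the package's code: the helper names are hypothetical, the genotypes are simulated, and per-locus F_ST is approximated with the simple variance ratio var(p) / (p_bar * (1 - p_bar)) for biallelic allele frequencies, which may differ from the estimator assignPOP uses. Loci are ranked using training individuals only, so that test individuals do not influence which loci are selected; choosing loci with the same individuals later used for testing is the source of the upward bias described by Anderson (2010).

```python
import random

def allele_freq(genotypes, inds, locus):
    """Frequency of the counted allele; each genotype holds 0, 1 or 2 copies."""
    return sum(genotypes[i][locus] for i in inds) / (2 * len(inds))

def fst_per_locus(genotypes, labels, train, n_loci):
    """Simplified per-locus F_ST, computed from training individuals only."""
    pops = sorted({labels[i] for i in train})
    scores = []
    for locus in range(n_loci):
        freqs = [allele_freq(genotypes, [i for i in train if labels[i] == pop], locus)
                 for pop in pops]
        p_bar = sum(freqs) / len(freqs)
        var_p = sum((p - p_bar) ** 2 for p in freqs) / len(freqs)
        denom = p_bar * (1 - p_bar)
        scores.append(var_p / denom if denom > 0 else 0.0)
    return scores

def top_fst_loci(scores, prop):
    """Indices of the top `prop` fraction of loci ranked by F_ST."""
    n_keep = max(1, round(prop * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])

# Simulated data (assumption): 2 populations, 20 individuals each, 20 loci.
# Only the first two loci differ in allele frequency between populations.
rng = random.Random(42)
n_loci = 20
labels = ["A"] * 20 + ["B"] * 20

def draw(p):
    """Draw a diploid genotype as the number of copies of the counted allele."""
    return (rng.random() < p) + (rng.random() < p)

genotypes = []
for pop in labels:
    informative = 0.9 if pop == "A" else 0.1
    genotypes.append([draw(informative) if j < 2 else draw(0.5)
                      for j in range(n_loci)])

train_props = [0.5, 0.7, 0.9]   # proportions of training individuals
loci_props = [0.1, 0.5, 1.0]    # top 10%, top 50%, or all loci
design = {}
for tp in train_props:
    # Draw training individuals separately within each population.
    train = []
    for pop in ("A", "B"):
        members = [i for i, lab in enumerate(labels) if lab == pop]
        train += rng.sample(members, round(tp * len(members)))
    scores = fst_per_locus(genotypes, labels, train, n_loci)
    for lp in loci_props:
        design[(tp, lp)] = top_fst_loci(scores, lp)
```

Each `(training proportion, locus proportion)` cell in `design` would then be used to fit a classifier on the training individuals with the selected loci and to score it on the held-out individuals; a full analysis repeats this over many resampling iterations.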

Package Citation

Chen, K-Y, Marschall, E.A., Sovic, M.G., Fries, A.C., Gibbs, H.L., Ludsin, S.A. (2018). assignPOP: An R package for population assignment using genetic, non-genetic, or integrated data in a machine-learning framework. Methods in Ecology and Evolution. 9:439–446. https://doi.org/10.1111/2041-210X.12897

References

Anderson, E. C. (2010). Assessing the Power of Informative Subsets of Loci for Population Assignment: Standard Methods Are Upwardly Biased. Molecular Ecology Resources 10(4): 701–710.
Waples, R. S. (2010). High-Grading Bias: Subtle Problems with Assessing Power of Selected Subsets of Loci for Population Assignment. Molecular Ecology 19(13): 2599–2601.