On this page, we demonstrate how to visualize assignment accuracy and membership probability results using box plots and stacked bar plots.
After the assignment accuracy of the Monte-Carlo cross-validation results has been calculated (using accuracy.MC()), we can use the following function to create a box plot of the assignment accuracy.
# accuMC is the object (a data frame) returned from the function accuracy.MC()
accuracy.plot(accuMC, pop = "all")
The argument pop specifies which populations to include in the plot. Multiple populations can be specified (e.g., pop=c("pop_A","pop_B","pop_C")) to create a faceted plot. If pop is not specified, the assignment accuracy across all populations (pop="all") is used by default.
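A box plot condenses many per-iteration accuracy values into a few summary statistics. The sketch below shows the same idea numerically in base R. The data frame accu_toy is a made-up stand-in for the object returned by accuracy.MC(), and its column names (train.inds, train.loci, iter, assign.rate.all) are assumptions for illustration, not a documented structure.

```r
# Hypothetical stand-in for the data frame returned by accuracy.MC().
# All column names and values here are assumptions for illustration only.
set.seed(1)
accu_toy <- expand.grid(
  train.inds = c(0.5, 0.7, 0.9),  # proportion of training individuals
  train.loci = c(0.5, 1),         # proportion of training loci
  iter       = 1:10               # Monte-Carlo iteration
)
accu_toy$assign.rate.all <- runif(nrow(accu_toy), min = 0.4, max = 0.9)

# Median overall accuracy for each combination of training settings,
# i.e. the centre line that a box plot would draw for each box
accu_summary <- aggregate(
  assign.rate.all ~ train.inds + train.loci,
  data = accu_toy,
  FUN  = median
)
print(accu_summary)
```

Each row of accu_summary corresponds to one box in a plot grouped by the two training settings.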
The function accuracy.plot() is built on the ggplot2 library, so the plot can be modified using ggplot2 functions. Below we make another plot that includes the assignment accuracy of the overall and individual populations. We also adjust its y-axis limits (ylim), draw a horizontal line (annotate), and add a plot title (ggtitle).
library(ggplot2)
accuracy.plot(accuMC, pop=c("all", "pop_A", "pop_B", "pop_C")) +
ylim(0, 1) + #Set y limit between 0 and 1
annotate("segment",x=0.4,xend=3.6,y=0.33,yend=0.33,colour="red",size=1) + #Add a red horizontal line at y = 0.33 (null assignment rate for 3 populations)
ggtitle("Monte-Carlo cross-validation using genetic loci")+ #Add a plot title
theme(plot.title = element_text(size=20, face="bold")) #Edit plot title text size
The above results were estimated from the genetic data (1,000 simulated SNPs). Assignment accuracies of populations A and B are relatively low, whereas those of population C are high, indicating that the genetic data can distinguish population C from populations A and B but not A and B from each other.
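The red reference line at y = 0.33 in the plot code is described as the null assignment rate for three populations. A minimal sketch of that arithmetic follows, assuming the null rate means assigning each individual to one of the populations uniformly at random. The variable names are illustrative.

```r
# Null assignment rate under uniform random assignment (an assumption
# about what "null" means here): one over the number of populations
pops <- c("pop_A", "pop_B", "pop_C")
null_rate <- 1 / length(pops)

# Rounded to two decimals, this is the y value used for the reference line
null_rate_rounded <- round(null_rate, 2)
print(null_rate_rounded)
```

With a different number of populations, the reference line would move accordingly (for example, 0.25 for four populations).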
Next, we created the assignment accuracy plot for the integrated data (genetics plus morphometrics, the object comin returned from compile.data). We omit the code and show the results below.
When using the genetic-morphometric data, the assignment accuracies of populations A and B increased and that of population C remained high, resulting in higher overall assignment accuracy. These results demonstrate the potential of using multiple data types to improve assignment success. Additionally, we can use the same analytical methods (i.e., Monte-Carlo cross-validation with the same proportions of training individuals, iterations, and classifier) to evaluate the discriminatory power of the morphometric data alone.
The above results indicate that the morphometric data help distinguish population A from the other two. Therefore, the integrated data are expected to best discriminate among the three populations (Figure 3), despite the fact that neither genetics nor morphometrics alone was able to distinguish population B from the other two (see population B in Figures 2 and 4).
In addition to estimating assignment accuracy, we can use membership probabilities to understand how individuals are assigned to the populations. To visualize membership probability, we use the results from K-fold cross-validation and create a stacked bar plot (like the STRUCTURE plot) that is commonly used in molecular biology papers.
# The folder 'Result-folder2' contains the results generated from K-fold cross-validation
membership.plot(dir = "Result-folder2/")
After you run the above function, it asks a few questions that let you choose which data set (results estimated from which combination of training data) and which plot style to use. The interactive dialogue is shown below.
The first question asks which K (number of folds) to use.
## K = 3 4 5 are found.
## Please enter one of the K numbers: (You will enter your answer here)
The second question asks which level of training loci to use (if the data include genetic loci).
## 4 levels of training loci are found.
## Levels[train.loci]: 0.1 0.25 0.5 1
## Please enter one of the levels: (You will enter your answer here)
Lastly, if you didn’t specify the output style (e.g., style = 1) in membership.plot(), it prints the following message and asks you to choose an output style.
## Finally, select one of the output styles.
## [1] Random order (Individuals on x-axis are in random order)
## [2] Sorted by probability (Individuals are sorted by probabilities within each group)
## [3] Separated by fold (Individuals of different folds are in separate plots)
## [4] Separated and Sorted (Individuals are separated by fold and sorted by probability)
## Please enter 1, 2, 3, or 4: (You will enter your answer here)
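To make the difference between a random-order and a sorted stacked bar plot concrete, here is a base R sketch using made-up membership probabilities. It does not call membership.plot(). The data are simulated, and the sorting rule (descending probability of belonging to the individual's own population, within each origin group) is an assumption about what "sorted by probability" means.

```r
# Made-up membership probabilities for illustration only
set.seed(42)
pops      <- c("pop_A", "pop_B", "pop_C")
n_per_pop <- 5
origin    <- rep(pops, each = n_per_pop)  # true origin of each individual

# Random positive scores, normalized so that each row sums to 1
raw  <- matrix(runif(length(origin) * length(pops)),
               ncol = length(pops), dimnames = list(NULL, pops))
prob <- raw / rowSums(raw)

# Probability that each individual belongs to its own origin population
own_prob <- prob[cbind(seq_along(origin), match(origin, pops))]

# Sort within each origin group by own-population probability (descending)
ord           <- order(match(origin, pops), -own_prob)
prob_sorted   <- prob[ord, , drop = FALSE]
origin_sorted <- origin[ord]

# Draw the stacked bars only in an interactive session
if (interactive()) {
  barplot(t(prob_sorted), col = c("tomato", "steelblue", "gold"),
          border = NA, space = 0, names.arg = origin_sorted, las = 2,
          ylab = "Membership probability")
}
```

Plotting t(prob) instead of t(prob_sorted) gives the unsorted counterpart, with individuals in their original order.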
Below we use the results of 3-fold cross-validation with all loci (train.loci = 1) to demonstrate membership probability plots in each of the four output styles.
Figures 5 to 8 above show the results from the genetic data. Here we also created the membership probability plot for the results estimated from the integrated data.
As expected, using the integrated data allowed more individuals to be correctly assigned to their populations, particularly in populations A and B (compare Figure 5 or 6 with Figure 9).
To check how test individuals from each population are assigned to their own and to other populations, users can use the following function to print the mean and standard deviation of assignment results across Monte-Carlo or K-fold cross-validation tests in the R console.
# By default, the function reads all the files in the specified result folder
assign.matrix(dir = "Result-folder/")
# Users can also restrict the assignment matrix to certain results
assign.matrix(dir = "Result-folder/", train.inds = c(0.7, 0.9), train.loci = c(0.5, 1))
Once the function is executed, it prints two tables in your R console. One contains the means of the assignment results across the tests; the other contains the corresponding standard deviations. An example of assign.matrix(dir="Result-folder/") is shown below.
## Assignment across 360 tests from Monte-Carlo cross-validation
## Mean
##        assignment
## origin  pop_A pop_B pop_C
##   pop_A  0.57  0.43  0.01
##   pop_B  0.42  0.48  0.10
##   pop_C  0.02  0.06  0.92
##
## Standard Deviation
##        assignment
## origin  pop_A pop_B pop_C
##   pop_A  0.24  0.24  0.02
##   pop_B  0.25  0.25  0.12
##   pop_C  0.06  0.11  0.13
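The two tables can be read as element-wise summaries of one origin-by-assignment proportion matrix per test. The base R sketch below illustrates that reading with simulated assignments. It is not the internal implementation of assign.matrix(), and the helper one_test() and all values are hypothetical.

```r
# Simulated example: mean and SD of assignment proportions across tests
set.seed(7)
pops    <- c("pop_A", "pop_B", "pop_C")
n_tests <- 20

# Hypothetical helper: one test's row-normalized origin-by-assignment matrix
one_test <- function() {
  origin   <- factor(rep(pops, each = 10), levels = pops)
  assigned <- factor(sample(pops, length(origin), replace = TRUE),
                     levels = pops)
  counts <- table(origin = origin, assignment = assigned)
  prop.table(counts, margin = 1)  # each origin row sums to 1
}

# Stack all tests into a 3 x 3 x n_tests array
tests <- lapply(seq_len(n_tests), function(i) one_test())
arr <- array(unlist(tests),
             dim = c(length(pops), length(pops), n_tests),
             dimnames = list(origin = pops, assignment = pops, NULL))

# Element-wise mean and standard deviation across tests
mean_mat <- apply(arr, c(1, 2), mean)
sd_mat   <- apply(arr, c(1, 2), sd)

print(round(mean_mat, 2))
print(round(sd_mat, 2))
```

Under this reading, each row of the mean table sums to 1 before rounding, so a row that sums to 1.01 in the printed output, such as pop_A above, is consistent with rounding alone.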