The FSTruct package provides an \(F_{ST}\)-based tool to quantify the variability of \(Q\) matrices (matrices with row vectors of ancestry coefficients, the default output of population structure inference programs such as ADMIXTURE and STRUCTURE).
The package includes four functions:
Q_stat
calculates \(F_{ST}/F_{ST}^{max}\) (a normalized measure of variability of membership vectors across individuals) for a \(Q\) matrix.Q_bootstrap
generates bootstrap replicates of one or more \(Q\) matrices along with associated statistics, including \(F_{ST}/F_{ST}^{max}\), as well as plots and statistical tests to compare bootstrap distributions of \(F_{ST}/F_{ST}^{max}\) for different matrices.Q_plot
plots \(Q\) matrices using ggplot2
.Q_simulate
generates \(Q\) matrices by drawing vectors of membership coefficients (each representing an individual) from the Dirichlet distribution.This package accompanies the paper “FSTruct: an FST-based tool for measuring ancestry variation in inference of population structure” by Maike Morrison, Nicolas Alcala, and Noah Rosenberg. You can access the paper in Molecular Ecology Resources at this link.
Load or simulate \(Q\) matrices
Visualize \(Q\) matrices
Calculate the \(F_{ST}/F_{ST}^{max}\) variability statistic for each \(Q\) matrix
Generate & analyze bootstrap distributions of \(F_{ST}/F_{ST}^{max}\) for each matrix
Use statistical tests to compare the variability of multiple \(Q\) matrices
All of these tasks can be accomplished using the four functions in the package FSTruct.
A \(Q\) matrix represents the ancestry of a population; each row represents an individual and the last \(K\) columns contain proportions called membership coefficients (the other columns give details about the individual, such as its name and population identifier). Membership coefficients describe how much of an individual’s ancestry derives from a given ancestral cluster. An individual’s membership coefficients sum to 1.
A \(Q\) matrix generated by the program ADMIXTURE with \(K=3\) ancestral clusters looks like this:
1 | 1 | (x) | 1 | : | 1.0000 | 0.0000 | 0 |
2 | 2 | (x) | 1 | : | 1.0000 | 0.0000 | 0 |
3 | 3 | (x) | 2 | : | 0.0137 | 0.7567 | 0.2296 |
4 | 4 | (x) | 2 | : | 0.0000 | 0.7325 | 0.2674 |
5 | 5 | (x) | 2 | : | 0.0224 | 0.7279 | 0.2496 |
6 | 6 | (x) | 3 | : | 0.0222 | 0.7372 | 0.2406 |
7 | 7 | (x) | 3 | : | 0.0000 | 0.7396 | 0.2603 |
8 | 8 | (x) | 3 | : | 0.0231 | 0.7496 | 0.2272 |
9 | 9 | (x) | 3 | : | 0.0460 | 0.6861 | 0.2678 |
10 | 10 | (x) | 3 | : | 0.0843 | 0.6739 | 0.2418 |
The last \(K=3\) columns contain the membership coefficients.
The FSTruct package can be downloaded from GitHub using the code below. Be sure to first install the package devtools
if it is not yet installed.
# install.packages("devtools") # Run if devtools is not yet installed
# devtools::install_github("MaikeMorrison/FSTruct") # Run if FSTruct is not yet installed
To follow along with this vignette, make sure the following packages are also installed:
# install.packages("dplyr") # Run if dplyr is not yet installed
# install.packages("ggplot2") # Run if ggplot2 is not yet installed
# install.packages("cowplot") # Run if cowplot is not yet installed
# install.packages("data.table") # Run if data.table is not yet installed
Then load the package FSTruct
and the package dplyr
, which contains functions such as arrange
, which reorders the rows of a data frame.
library(FSTruct)
library(dplyr)
To begin the analysis, read one or more \(Q\) matrices into R. For each \(Q\) matrix, this task can be accomplished using:
<- data.table::fread("file path to your Q matrix.output") Q_matrix_name
or other functions, such as read.table()
.
Note that the default column names of these \(Q\) matrices will likely be V1
, V2
, V3
, … If you wish to give the columns more informative names, use the function colnames
, listing on the right-hand side your desired column names (the example below uses the same names as Q_simulate
for \(K=2\) ancestral clusters):
colnames(Q_matrix_name) <- c("rep", "ind", "alpha", "Pop", "spacer", "q1", "q2")
Repeat this process for each \(Q\) matrix you wish to analyze.
Alternatively, \(Q\) matrices can be simulated using Q_simulate
, which draws individuals from a Dirichlet probability distribution (for more details on the Dirichlet distribution, see the “Simulation examples” section of the paper).
The function Q_simulate
has four parameters:
alpha
, a number that controls the variability. Higher values of \(\alpha\) result in lower variances across membership coefficients.*
lambda
, a vector that controls the mean membership of each ancestral cluster.
popsize
, the number of individuals to put in the \(Q\) matrix.
rep
, the number of populations to simulate. Default is 1.
seed
, an optional number which, if included, is used as the random seed to make the random results reproducible.
For this example, I chose to simulate four \(Q\) matrices (A
, B
, C
, and D
), each with 20 individuals and with the same vector of mean membership coefficients, \(\lambda = (.75, .25)\). This means that each \(Q\) matrix will have 20 rows and, on average, individuals from any of the four matrices will possess three-quarters of their ancestry from the first ancestral cluster, and one-quarter of their ancestry from the second ancestral cluster.
I chose to simulate matrices A
and B
using \(\alpha = 0.1\), which results in fairly high variability across membership coefficients, and matrices C
and D
using \(\alpha = 10\), which results in lower variability. Properties of the Dirichlet distribution tell us that the variance of the \(k^{th}\) ancestral cluster, \(q_k\), is \(Var[q_k] = \lambda_k(1-\lambda_k)/(\alpha+1)\). It follows that, for matrices A
and B
, \(Var[q_1] = Var[q_2]=.75\times.25/(.1+1) \approx 0.17\). For matrices C
and D
, \(Var[q_1] = Var[q_2] = .75\times.25/(10+1) \approx 0.02\).
*alpha
can also be a numeric vector. See the “Advanced note on the usage of Q_bootstrap
” at end of document for more details.
= Q_simulate(alpha = .1, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 1)
A
= Q_simulate(alpha = .1, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 2)
B
= Q_simulate(alpha = 10, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 3)
C
= Q_simulate(alpha = 10, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 4) D
Here is what matrix A
looks like:
rep | ind | alpha | Pop | spacer | q1 | q2 |
---|---|---|---|---|---|---|
1 | 1 | 0.1 | 0.1_1 | : | 0.9900510 | 0.0099490 |
1 | 2 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 3 | 0.1 | 0.1_1 | : | 0.0001171 | 0.9998829 |
1 | 4 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 5 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 6 | 0.1 | 0.1_1 | : | 0.0018261 | 0.9981739 |
1 | 7 | 0.1 | 0.1_1 | : | 0.9971413 | 0.0028587 |
1 | 8 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 9 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 10 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 11 | 0.1 | 0.1_1 | : | 0.0108406 | 0.9891594 |
1 | 12 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 13 | 0.1 | 0.1_1 | : | 0.0000029 | 0.9999971 |
1 | 14 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 15 | 0.1 | 0.1_1 | : | 0.0000175 | 0.9999825 |
1 | 16 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 17 | 0.1 | 0.1_1 | : | 0.0000311 | 0.9999689 |
1 | 18 | 0.1 | 0.1_1 | : | 0.1082434 | 0.8917566 |
1 | 19 | 0.1 | 0.1_1 | : | 1.0000000 | 0.0000000 |
1 | 20 | 0.1 | 0.1_1 | : | 0.9339119 | 0.0660881 |
The format of this \(Q\) matrix is very similar to the format of the ADMIXTURE \(Q\) matrix above. The last \(K=2\) columns contain the membership coefficients.
We visualize these matrices using the function Q_plot
, which has one required parameter: a \(Q\) matrix (Q
). If Q
contains any columns that are not ancestry coefficients (as is the case in our example), we must also specify the number of ancestral clusters (K
). If K
is not provided, the function will use all of the columns of Q
, which is a problem if Q
contains information other than ancestry vectors. The third parameter, arrange
, is optional. If arrange = TRUE
, the individuals are horizontally ordered by the clusters with the greatest mean membership. Below, we accomplish this ordering manually for plot A, and use arrange = TRUE
for all subsequent plots. These two methods accomplish the same ordering.
# Generate and modify a plot for each Q matrix
# Manual arranging of individuals
<- Q_plot(Q = dplyr::arrange(A, q1),
plot_A K=2)
# Using the Q_plot arrange option
<- Q_plot(Q = B,
plot_B K=2,
arrange = TRUE) +
::scale_fill_brewer(palette = "Blues") +
ggplot2::scale_color_manual(values = rep("white", 2))
ggplot2
<- Q_plot(Q = C,
plot_C K=2,
arrange = TRUE) +
::scale_fill_brewer(palette = "Greens")+
ggplot2::scale_color_brewer(palette = "Greens")
ggplot2
<- Q_plot(Q = D,
plot_D K=2,
arrange = TRUE) +
::scale_fill_brewer(palette = "Reds") +
ggplot2::scale_color_manual(values = rep("white", 2)) ggplot2
arrange(matrix, var)
, a function from dplyr, sorts the rows in matrix
according to a column named var
. Here, I order individuals in each \(Q\) matrix according to their value of q1
, but other variables can also be used. Use colnames(Q_matrix_name)
to see the column names and decide which to sort by. Matrices can also be sorted by multiple columns (e.g., arrange(matrix, var1, var2)
).
scale_fill_brewer()
, scale_color_brewer()
, and scale_color_manual
, functions from ggplot2, modify the color scheme of the plot. Because Q_plot()
outputs a ggplot object, you can do lots of modifications! For example, you can use scale_fill_brewer()
to set the fill color scheme, but use scale_color_manual
to set the outline color.
# Display these plots in a grid
::plot_grid(plot_A, plot_B, plot_C, plot_D,
cowplotlabels = "AUTO",
nrow = 1,
vjust = 2)
We observe that matrices A and B are visually similar, with most individuals from each matrix possessing all of their ancestry in just one of the two clusters (e.g., for B, all light blue or all dark blue). For a \(Q\) matrix with two ancestral clusters, the maximum variability case occurs when every individual (or all but one) has a membership vector of \((0,1)\) or \((1,0)\) (possessing all of its ancestry in just one of the two clusters). Matrices A and B are close to this maximum case, as they contain just a few individuals with ancestry in multiple clusters.
Matrices C and D, on the other hand, are comprised of individuals with some ancestry from each cluster. These matrices are closer to the minimum variability case, in which every individual’s membership coefficients are the same as the mean (e.g., for C, 3/4 dark green and 1/4 light green).
We have now built intuition about what high and low variability \(Q\) matrices look like. The statistic \(F_{ST}/F_{ST}^{max}\) quantifies this variability.
As a preliminary analysis, let’s calculate \(F_{ST}/F_{ST}^{max}\) for each of our four simulated \(Q\) matrices. We will do this analysis using the function Q_stat
, which accepts two parameters: a \(Q\) matrix (Q
) and the number of ancestral clusters (K
).
Q_stat(Q = A, K = 2)
#> $Fst
#> [1] 0.9595213
#>
#> $FstMax
#> [1] 0.9927522
#>
#> $ratio
#> [1] 0.9665264
Q_stat(Q = B, K = 2)
#> $Fst
#> [1] 0.8301276
#>
#> $FstMax
#> [1] 0.9942423
#>
#> $ratio
#> [1] 0.8349349
Q_stat(Q = C, K = 2)
#> $Fst
#> [1] 0.08282438
#>
#> $FstMax
#> [1] 0.9640783
#>
#> $ratio
#> [1] 0.08591043
Q_stat(Q = D, K = 2)
#> $Fst
#> [1] 0.06545711
#>
#> $FstMax
#> [1] 0.9514734
#>
#> $ratio
#> [1] 0.06879552
For each matrix, the function Q_stat
returns \(F_{ST}\), \(F_{ST}^{max}\), and their ratio \((F_{ST}/F_{ST}^{max})\). In population genetics, the statistic \(F_{ST}\) is conventionally used to quantify variability in allele frequencies among a collection of subpopulations. Here, we use this statistic to quantify the variability in membership coefficients among a collection of individuals. (See the methods section of the paper for a more thorough discussion of this analogy.) Importantly, when the \(Q\) matrix contains few individuals, or the mean membership of the most prevalent ancestral cluster is close to \(1/K\) (the minimum value) or 1 (the maximum value), the maximum possible value of \(F_{ST}\) \((F_{ST}^{max})\) can be much less than 1. For this reason, we use the normalized statistic, \(F_{ST}/F_{ST}^{max}\), which always ranges between 0 and 1, and can thus be used to compare the variability of multiple \(Q\) matrices.
We observe from the results of Q_stat
that matrices A
and B
have values of \(F_{ST}/F_{ST}^{max}\) greater than 0.8, whereas matrices C
and D
have values less than 0.1. Are the differences in variability between these matrices statistically significant?
To answer these questions, we employ bootstrapping.
A bootstrap replicate of a \(Q\) matrix is a second matrix with the same dimensions whose rows were drawn with replacement from those of the original matrix. By generating many bootstrap replicate matrices and computing \(F_{ST}/F_{ST}^{max}\) for each, we generate a bootstrap distribution of \(F_{ST}/F_{ST}^{max}\). This bootstrap distribution is an estimate of the sampling distribution (the distribution obtained by drawing multiple \(Q\) matrices from the full population). We can perform statistical tests of identity of the bootstrap means to draw conclusions about whether the original \(Q\) matrices have significantly different variabilities.
To generate a bootstrap distribution for the four simulated \(Q\) matrices, I used the function Q_bootstrap
, which has four core parameters:
matrices
, a named list of \(Q\) matrices to analyze*
n_replicates
, the number of bootstrap replicates to generate for each \(Q\) matrix
K
, the number of ancestral clusters. Can also be a vector if the number of ancestral clusters differs among the provided \(Q\) matrices.**
seed
, an optional number which, if included, is used as the random seed to make the random results reproducible
*Q_bootstrap
can also take as its matrices
parameter a single matrix, possibly containing multiple groups for comparison. If the matrix contains individuals from multiple groups which you wish to compare, you must also provide a group
parameter, the name of the column specifying which group each row (individual) belongs to. See the “Advanced note on the usage of Q_bootstrap
” at the end of this document for more details.
**If K
is not provided, Q_bootstrap
will assume that K
is equal to the number of columns in the first matrix provided.
<- Q_bootstrap(matrices = list("Matrix A" = A, "Matrix B" = B,
bootstrap "Matrix C" = C, "Matrix D" = D),
n_replicates = 100,
K = 2,
seed = 1)
The output of Q_bootstrap
, which I named bootstrap
, is a list of the following objects:
bootstrap_replicates
: A list of bootstrap replicate matrices for each \(Q\) matrix provided
statistics
: A dataframe containing the values of \(F_{ST}\), \(F_{ST}^{max}\), and \(F_{ST}/F_{ST}^{max}\) for each bootstrap replicate matrix
plot_boxplot
, plot_violin
, plot_ecdf
: A box plot, violin plot, or empirical cumulative distribution function (ECDF) plot of the bootstrap distribution of \(F_{ST}/F_{ST}^{max}\) for each \(Q\) matrix
test_kruskal_wallis
: The results of a Kruksal-Wallis test, which tests the identity of all provided bootstrap means
test_pairwise_wilcox
: The results of a Wilcoxon rank-sum test, which tests the identity of pairwise bootstrap means for each pair of matrices
All three options for visualizing the bootstrap distributions are ggplot objects and can be modified in many ways using tools from the package ggplot2
:
::plot_grid(bootstrap$plot_boxplot + ggplot2::ggtitle("Box Plot"),
cowplot$plot_violin + ggplot2::ggtitle("Violin Plot"),
bootstrap$plot_ecdf + ggplot2::ggtitle("ECDF Plot") +
bootstrap::scale_color_brewer(palette = "Dark2"),
ggplot2nrow = 2)
Comparing these distributions by eye, it seems like the bootstrap distributions of matrices A
and B
are each quite different from all of the other matrices. C
and D
, on the other hand, seem more similar to each other than A
and B
, but are still not identical. Rather than relying on our intuition, we can statistically test the similarity of these distributions. Included in the output of Q_bootstrap
are the output of two such tests:
The Kruskal-Wallis test determines if statistically significant differences exist among a collection of two or more distributions. Its output includes a chi-squared statistic and a p-value.
The Wilcoxon rank sum test looks at every pairwise combination of distributions, and for each determines if a statistically significant difference exists between the two distributions. Its output is a matrix of p-values, one for each pair of distributions.
For both tests, the null hypothesis is that the distributions are the same.
$test_kruskal_wallis
bootstrap#>
#> Kruskal-Wallis rank sum test
#>
#> data: all_stats$ratio by all_stats$Matrix
#> Kruskal-Wallis chi-squared = 327.71, df = 3, p-value < 2.2e-16
$test_pairwise_wilcox
bootstrap#>
#> Pairwise comparisons using Wilcoxon rank sum test with continuity correction
#>
#> data: all_stats$ratio and all_stats$Matrix
#>
#> Matrix A Matrix B Matrix C
#> Matrix B <2e-16 - -
#> Matrix C <2e-16 <2e-16 -
#> Matrix D <2e-16 <2e-16 2e-04
#>
#> P value adjustment method: holm
The results of these tests support the claim that the variabilities of all four matrices are significantly different, but that C
and D
are more similar than A
and B
.
You cannot make comparisons if only one \(Q\) matrix is provided. If only one matrix is provided to Q_bootstrap
, the entries test_kruskal_wallis
and test_pairwise_wilcox
will contain the text “This statistical test can only be performed if a list of matrices is provided.” instead of the results of a statistical test.
Q_bootstrap
We noted above that Q_bootstrap
can take as its matrices
parameter one matrix which contains individuals from multiple groups we wish to compare. This analysis is relevant if we want to compare the ancestry variability among multiple groups based on the output of programs such as STRUCTURE or ADMIXTURE, which generate one \(Q\) matrix for all individuals, with a column corresponding to the group each individual belongs to.
The function Q_simulate
generates a matrix of this same form. If rep>1
and/or alpha
is a vector with more than one entry, Q_simulate
will generate one matrix containing rep
times length(alpha)
total groups of popsize
individuals, stacked vertically. The column Pop
of the output is defined as alpha_rep
(e.g., for the second repetition of a matrix with alpha = 10
, Pop=10_2
), and specifies what alpha/rep combination each simulated individual belongs to.
For example, we can use the following code to generate a \(Q\) matrix with four groups of 5 individuals and mean membership \((1/3, 1/3, 1/3)\), two with \(\alpha = 1\) and two with \(\alpha = 10\).
<- Q_simulate(alpha = c(1,10), lambda = c(1/3, 1/3, 1/3), popsize = 5, rep = 2, seed = 1) Q
Here is the matrix we have simulated:
rep | ind | alpha | Pop | spacer | q1 | q2 | q3 |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1_1 | : | 0.0577009 | 0.7675994 | 0.1746997 |
1 | 2 | 1 | 1_1 | : | 0.0461490 | 0.9531101 | 0.0007409 |
1 | 3 | 1 | 1_1 | : | 0.5564586 | 0.3811179 | 0.0624235 |
1 | 4 | 1 | 1_1 | : | 0.8083399 | 0.0034037 | 0.1882564 |
1 | 5 | 1 | 1_1 | : | 0.3108832 | 0.5643938 | 0.1247230 |
1 | 1 | 10 | 10_1 | : | 0.3569613 | 0.4724304 | 0.1706083 |
1 | 2 | 10 | 10_1 | : | 0.2897101 | 0.3403869 | 0.3699029 |
1 | 3 | 10 | 10_1 | : | 0.2433330 | 0.1304592 | 0.6262078 |
1 | 4 | 10 | 10_1 | : | 0.1890401 | 0.4372094 | 0.3737505 |
1 | 5 | 10 | 10_1 | : | 0.3003683 | 0.2487210 | 0.4509107 |
2 | 1 | 1 | 1_2 | : | 0.1203557 | 0.1119421 | 0.7677023 |
2 | 2 | 1 | 1_2 | : | 0.6850412 | 0.0481824 | 0.2667764 |
2 | 3 | 1 | 1_2 | : | 0.8298894 | 0.1701059 | 0.0000047 |
2 | 4 | 1 | 1_2 | : | 0.0042849 | 0.4747644 | 0.5209507 |
2 | 5 | 1 | 1_2 | : | 0.0591896 | 0.9405881 | 0.0002223 |
2 | 1 | 10 | 10_2 | : | 0.3515363 | 0.1865224 | 0.4619412 |
2 | 2 | 10 | 10_2 | : | 0.4561017 | 0.2180631 | 0.3258352 |
2 | 3 | 10 | 10_2 | : | 0.1632815 | 0.4506858 | 0.3860327 |
2 | 4 | 10 | 10_2 | : | 0.2894660 | 0.1885444 | 0.5219895 |
2 | 5 | 10 | 10_2 | : | 0.4009728 | 0.2318401 | 0.3671871 |
We can generate bootstrap distributions comparing the four groups of matrices with the following code:
<- Q_bootstrap(matrices = Q, n_replicates = 100, K = 3, seed = 1, group = "Pop")
bootstrap2
$plot_boxplot bootstrap2
$test_pairwise_wilcox
bootstrap2#>
#> Pairwise comparisons using Wilcoxon rank sum test with continuity correction
#>
#> data: all_stats$ratio and all_stats$Matrix
#>
#> 1_1 10_1 1_2
#> 10_1 < 2e-16 - -
#> 1_2 1.1e-08 < 2e-16 -
#> 10_2 < 2e-16 3.3e-07 < 2e-16
#>
#> P value adjustment method: holm
This concludes the introduction to the package FSTruct–we hope you found it helpful. For more information, see the paper “FSTruct: an FST-based tool for measuring ancestry variation in inference of population structure” or the FSTruct manual. Or, in the R console, type ?FSTruct::Q_...
to read the documentation for particular functions (e.g. ?FSTruct::Q_bootstrap
).