Introduction to FSTruct

The FSTruct package provides an \(F_{ST}\)-based tool to quantify the variability of \(Q\) matrices (matrices with row vectors of ancestry coefficients, the default output of population structure inference programs such as ADMIXTURE and STRUCTURE).

The package includes four functions:

This package accompanies the paper “FSTruct: an FST-based tool for measuring ancestry variation in inference of population structure” by Maike Morrison, Nicolas Alcala, and Noah Rosenberg. You can access the paper in Molecular Ecology Resources at this link.

In this document, we walk through an example analysis, including how to:

All of these tasks can be accomplished using the four functions in the package FSTruct.

What is a \(Q\) matrix?

A \(Q\) matrix represents the ancestry of a population; each row represents an individual and the last \(K\) columns contain proportions called membership coefficients (the other columns give details about the individual, such as its name and population identifier). Membership coefficients describe how much of an individual’s ancestry derives from a given ancestral cluster. An individual’s membership coefficients sum to 1.

A \(Q\) matrix generated by the program ADMIXTURE with \(K=3\) ancestral clusters looks like this:

1 1 (x) 1 : 1.0000 0.0000 0
2 2 (x) 1 : 1.0000 0.0000 0
3 3 (x) 2 : 0.0137 0.7567 0.2296
4 4 (x) 2 : 0.0000 0.7325 0.2674
5 5 (x) 2 : 0.0224 0.7279 0.2496
6 6 (x) 3 : 0.0222 0.7372 0.2406
7 7 (x) 3 : 0.0000 0.7396 0.2603
8 8 (x) 3 : 0.0231 0.7496 0.2272
9 9 (x) 3 : 0.0460 0.6861 0.2678
10 10 (x) 3 : 0.0843 0.6739 0.2418

The last \(K=3\) columns contain the membership coefficients.

Install and load FSTruct

The FSTruct package can be downloaded from GitHub using the code below. Be sure to first install the package devtools if it is not yet installed.

# install.packages("devtools") # Run if devtools is not yet installed
# devtools::install_github("MaikeMorrison/FSTruct") # Run if FSTruct is not yet installed

To follow along with this vignette, make sure the following packages are also installed:

# install.packages("dplyr") # Run if dplyr is not yet installed
# install.packages("ggplot2") # Run if ggplot2 is not yet installed
# install.packages("cowplot") # Run if cowplot is not yet installed
# install.packages("data.table") # Run if data.table is not yet installed

Then load the package FSTruct and the package dplyr, which contains functions such as arrange, which reorders the rows of a data frame.

library(FSTruct)
library(dplyr)

Load or simulate \(Q\) matrices

To begin the analysis, read one or more \(Q\) matrices into R. For each \(Q\) matrix, this task can be accomplished using:

Q_matrix_name <- data.table::fread("file path to your Q matrix.output")

or other functions, such as read.table().

Note that the default column names of these \(Q\) matrices will likely be V1, V2, V3, … If you wish to give the columns more informative names, use the function colnames, listing on the right-hand side your desired column names (the example below uses the same names as Q_simulate for \(K=2\) ancestral clusters):

colnames(Q_matrix_name) <- c("rep", "ind", "alpha", "Pop", "spacer", "q1", "q2")

Repeat this process for each \(Q\) matrix you wish to analyze.

Alternatively, \(Q\) matrices can be simulated using Q_simulate, which draws individuals from a Dirichlet probability distribution (for more details on the Dirichlet distribution, see the “Simulation examples” section of the paper).

The function Q_simulate has four parameters:

For this example, I chose to simulate four \(Q\) matrices (A, B, C, and D), each with 20 individuals and with the same vector of mean membership coefficients, \(\lambda = (.75, .25)\). This means that each \(Q\) matrix will have 20 rows and, on average, individuals from any of the four matrices will possess three-quarters of their ancestry from the first ancestral cluster, and one-quarter of their ancestry from the second ancestral cluster.

I chose to simulate matrices A and B using \(\alpha = 0.1\), which results in fairly high variability across membership coefficients, and matrices C and D using \(\alpha = 10\), which results in lower variability. Properties of the Dirichlet distribution tell us that the variance of the \(k^{th}\) ancestral cluster, \(q_k\), is \(Var[q_k] = \lambda_k(1-\lambda_k)/(\alpha+1)\). It follows that, for matrices A and B, \(Var[q_1] = Var[q_2]=.75\times.25/(.1+1) \approx 0.17\). For matrices C and D, \(Var[q_1] = Var[q_2] = .75\times.25/(10+1) \approx 0.02\).

*alpha can also be a numeric vector. See the “Advanced note on the usage of Q_bootstrap” at end of document for more details.

A = Q_simulate(alpha = .1, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 1)

B = Q_simulate(alpha = .1, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 2)

C = Q_simulate(alpha = 10, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 3)

D = Q_simulate(alpha = 10, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 4)

Here is what matrix A looks like:

rep ind alpha Pop spacer q1 q2
1 1 0.1 0.1_1 : 0.9900510 0.0099490
1 2 0.1 0.1_1 : 1.0000000 0.0000000
1 3 0.1 0.1_1 : 0.0001171 0.9998829
1 4 0.1 0.1_1 : 1.0000000 0.0000000
1 5 0.1 0.1_1 : 1.0000000 0.0000000
1 6 0.1 0.1_1 : 0.0018261 0.9981739
1 7 0.1 0.1_1 : 0.9971413 0.0028587
1 8 0.1 0.1_1 : 1.0000000 0.0000000
1 9 0.1 0.1_1 : 1.0000000 0.0000000
1 10 0.1 0.1_1 : 1.0000000 0.0000000
1 11 0.1 0.1_1 : 0.0108406 0.9891594
1 12 0.1 0.1_1 : 1.0000000 0.0000000
1 13 0.1 0.1_1 : 0.0000029 0.9999971
1 14 0.1 0.1_1 : 1.0000000 0.0000000
1 15 0.1 0.1_1 : 0.0000175 0.9999825
1 16 0.1 0.1_1 : 1.0000000 0.0000000
1 17 0.1 0.1_1 : 0.0000311 0.9999689
1 18 0.1 0.1_1 : 0.1082434 0.8917566
1 19 0.1 0.1_1 : 1.0000000 0.0000000
1 20 0.1 0.1_1 : 0.9339119 0.0660881

The format of this \(Q\) matrix is very similar to the format of the ADMIXTURE \(Q\) matrix above. The last \(K=2\) columns contain the membership coefficients.

Visualize \(Q\) Matrices

We visualize these matrices using the function Q_plot, which has one required parameter: a \(Q\) matrix (Q). If Q contains any columns that are not ancestry coefficients (as is the case in our example), we must also specify the number of ancestral clusters (K). If K is not provided, the function will use all of the columns of Q, which is a problem if Q contains information other than ancestry vectors. The third parameter, arrange, is optional. If arrange = TRUE, the individuals are horizontally ordered by the clusters with the greatest mean membership. Below, we accomplish this ordering manually for plot A, and use arrange = TRUE for all subsequent plots. These two methods accomplish the same ordering.

# Generate and modify a plot for each Q matrix

# Manual arranging of individuals
plot_A <- Q_plot(Q = dplyr::arrange(A, q1),
                 K=2)

# Using the Q_plot arrange option
plot_B <- Q_plot(Q = B, 
                 K=2,
                 arrange = TRUE) + 
  ggplot2::scale_fill_brewer(palette = "Blues") + 
  ggplot2::scale_color_manual(values = rep("white", 2))

plot_C <- Q_plot(Q = C, 
                 K=2,
                 arrange = TRUE) + 
  ggplot2::scale_fill_brewer(palette = "Greens")+ 
  ggplot2::scale_color_brewer(palette = "Greens")

plot_D <- Q_plot(Q = D, 
                 K=2,
                 arrange = TRUE) + 
  ggplot2::scale_fill_brewer(palette = "Reds") + 
  ggplot2::scale_color_manual(values = rep("white", 2))

arrange(matrix, var), a function from dplyr, sorts the rows in matrix according to a column named var. Here, I order individuals in each \(Q\) matrix according to their value of q1, but other variables can also be used. Use colnames(Q_matrix_name) to see the column names and decide which to sort by. Matrices can also be sorted by multiple columns (e.g., arrange(matrix, var1, var2)).

scale_fill_brewer(), scale_color_brewer(), and scale_color_manual, functions from ggplot2, modify the color scheme of the plot. Because Q_plot() outputs a ggplot object, you can do lots of modifications! For example, you can use scale_fill_brewer() to set the fill color scheme, but use scale_color_manual to set the outline color.

# Display these plots in a grid
cowplot::plot_grid(plot_A, plot_B, plot_C, plot_D,
                   labels = "AUTO", 
                   nrow = 1,
                   vjust = 2) 

We observe that matrices A and B are visually similar, with most individuals from each matrix possessing all of their ancestry in just one of the two clusters (e.g., for B, all light blue or all dark blue). For a \(Q\) matrix with two ancestral clusters, the maximum variability case occurs when every individual (or all but one) has a membership vector of \((0,1)\) or \((1,0)\) (possessing all of its ancestry in just one of the two clusters). Matrices A and B are close to this maximum case, as they contain just a few individuals with ancestry in multiple clusters.

Matrices C and D, on the other hand, are comprised of individuals with some ancestry from each cluster. These matrices are closer to the minimum variability case, in which every individual’s membership coefficients are the same as the mean (e.g., for C, 3/4 dark green and 1/4 light green).

We have now built intuition about what high and low variability \(Q\) matrices look like. The statistic \(F_{ST}/F_{ST}^{max}\) quantifies this variability.

Calculate the variability of each \(Q\) matrix

As a preliminary analysis, let’s calculate \(F_{ST}/F_{ST}^{max}\) for each of our four simulated \(Q\) matrices. We will do this analysis using the function Q_stat, which accepts two parameters: a \(Q\) matrix (Q) and the number of ancestral clusters (K).

Q_stat(Q = A, K = 2)
#> $Fst
#> [1] 0.9595213
#> 
#> $FstMax
#> [1] 0.9927522
#> 
#> $ratio
#> [1] 0.9665264
Q_stat(Q = B, K = 2)
#> $Fst
#> [1] 0.8301276
#> 
#> $FstMax
#> [1] 0.9942423
#> 
#> $ratio
#> [1] 0.8349349
Q_stat(Q = C, K = 2)
#> $Fst
#> [1] 0.08282438
#> 
#> $FstMax
#> [1] 0.9640783
#> 
#> $ratio
#> [1] 0.08591043
Q_stat(Q = D, K = 2)
#> $Fst
#> [1] 0.06545711
#> 
#> $FstMax
#> [1] 0.9514734
#> 
#> $ratio
#> [1] 0.06879552

For each matrix, the function Q_stat returns \(F_{ST}\), \(F_{ST}^{max}\), and their ratio \((F_{ST}/F_{ST}^{max})\). In population genetics, the statistic \(F_{ST}\) is conventionally used to quantify variability in allele frequencies among a collection of subpopulations. Here, we use this statistic to quantify the variability in membership coefficients among a collection of individuals. (See the methods section of the paper for a more thorough discussion of this analogy.) Importantly, when the \(Q\) matrix contains few individuals, or the mean membership of the most prevalent ancestral cluster is close to \(1/K\) (the minimum value) or 1 (the maximum value), the maximum possible value of \(F_{ST}\) \((F_{ST}^{max})\) can be much less than 1. For this reason, we use the normalized statistic, \(F_{ST}/F_{ST}^{max}\), which always ranges between 0 and 1, and can thus be used to compare the variability of multiple \(Q\) matrices.

We observe from the results of Q_stat that matrices A and B have values of \(F_{ST}/F_{ST}^{max}\) greater than 0.8, whereas matrices C and D have values less than 0.1. Are the differences in variability between these matrices statistically significant?

To answer these questions, we employ bootstrapping.

Generate & analyze bootstrap distributions of \(F_{ST}/F_{ST}^{max}\) for each matrix

A bootstrap replicate of a \(Q\) matrix is a second matrix with the same dimensions whose rows were drawn with replacement from those of the original matrix. By generating many bootstrap replicate matrices and computing \(F_{ST}/F_{ST}^{max}\) for each, we generate a bootstrap distribution of \(F_{ST}/F_{ST}^{max}\). This bootstrap distribution is an estimate of the sampling distribution (the distribution obtained by drawing multiple \(Q\) matrices from the full population). We can perform statistical tests of identity of the bootstrap means to draw conclusions about whether the original \(Q\) matrices have significantly different variabilities.

To generate a bootstrap distribution for the four simulated \(Q\) matrices, I used the function Q_bootstrap, which has four core parameters:

*Q_bootstrap can also take as its matrices parameter a single matrix, possibly containing multiple groups for comparison. If the matrix contains individuals from multiple groups which you wish to compare, you must also provide a group parameter, the name of the column specifying which group each row (individual) belongs to. See the “Advanced note on the usage of Q_bootstrap” at the end of this document for more details.

**If K is not provided, Q_bootstrap will assume that K is equal to the number of columns in the first matrix provided.

bootstrap <- Q_bootstrap(matrices = list("Matrix A" = A, "Matrix B" = B, 
                                         "Matrix C" = C, "Matrix D" = D), 
                  n_replicates = 100,
                  K = 2, 
                  seed = 1)

The output of Q_bootstrap, which I named bootstrap, is a list of the following objects:

All three options for visualizing the bootstrap distributions are ggplot objects and can be modified in many ways using tools from the package ggplot2:

cowplot::plot_grid(bootstrap$plot_boxplot + ggplot2::ggtitle("Box Plot"), 
                   bootstrap$plot_violin + ggplot2::ggtitle("Violin Plot"), 
                   bootstrap$plot_ecdf + ggplot2::ggtitle("ECDF Plot") +
                     ggplot2::scale_color_brewer(palette = "Dark2"), 
                   nrow = 2)

Use statistical tests to compare the variability of multiple \(Q\) matrices

Comparing these distributions by eye, it seems like the bootstrap distributions of matrices A and B are each quite different from all of the other matrices. C and D, on the other hand, seem more similar to each other than A and B, but are still not identical. Rather than relying on our intuition, we can statistically test the similarity of these distributions. Included in the output of Q_bootstrap are the output of two such tests:

For both tests, the null hypothesis is that the distributions are the same.

bootstrap$test_kruskal_wallis
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  all_stats$ratio by all_stats$Matrix
#> Kruskal-Wallis chi-squared = 327.71, df = 3, p-value < 2.2e-16

bootstrap$test_pairwise_wilcox
#> 
#>  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
#> 
#> data:  all_stats$ratio and all_stats$Matrix 
#> 
#>          Matrix A Matrix B Matrix C
#> Matrix B <2e-16   -        -       
#> Matrix C <2e-16   <2e-16   -       
#> Matrix D <2e-16   <2e-16   2e-04   
#> 
#> P value adjustment method: holm

The results of these tests support the claim that the variabilities of all four matrices are significantly different, but that C and D are more similar than A and B.

You cannot make comparisons if only one \(Q\) matrix is provided. If only one matrix is provided to Q_bootstrap, the entries test_kruskal_wallis and test_pairwise_wilcox will contain the text “This statistical test can only be performed if a list of matrices is provided.” instead of the results of a statistical test.

Advanced note on the usage of Q_bootstrap

We noted above that Q_bootstrap can take as its matrices parameter one matrix which contains individuals from multiple groups we wish to compare. This analysis is relevant if we want to compare the ancestry variability among multiple groups based on the output of programs such as STRUCTURE or ADMIXTURE, which generate one \(Q\) matrix for all individuals, with a column corresponding to the group each individual belongs to.

The function Q_simulate generates a matrix of this same form. If rep>1 and/or alpha is a vector with more than one entry, Q_simulate will generate one matrix containing rep times length(alpha) total groups of popsize individuals, stacked vertically. The column Pop of the output is defined as alpha_rep (e.g., for the second repetition of a matrix with alpha = 10, Pop=10_2), and specifies what alpha/rep combination each simulated individual belongs to.

For example, we can use the following code to generate a \(Q\) matrix with four groups of 5 individuals and mean membership \((1/3, 1/3, 1/3)\), two with \(\alpha = 1\) and two with \(\alpha = 10\).

Q <- Q_simulate(alpha = c(1,10), lambda = c(1/3, 1/3, 1/3), popsize = 5, rep = 2, seed = 1)

Here is the matrix we have simulated:

rep ind alpha Pop spacer q1 q2 q3
1 1 1 1_1 : 0.0577009 0.7675994 0.1746997
1 2 1 1_1 : 0.0461490 0.9531101 0.0007409
1 3 1 1_1 : 0.5564586 0.3811179 0.0624235
1 4 1 1_1 : 0.8083399 0.0034037 0.1882564
1 5 1 1_1 : 0.3108832 0.5643938 0.1247230
1 1 10 10_1 : 0.3569613 0.4724304 0.1706083
1 2 10 10_1 : 0.2897101 0.3403869 0.3699029
1 3 10 10_1 : 0.2433330 0.1304592 0.6262078
1 4 10 10_1 : 0.1890401 0.4372094 0.3737505
1 5 10 10_1 : 0.3003683 0.2487210 0.4509107
2 1 1 1_2 : 0.1203557 0.1119421 0.7677023
2 2 1 1_2 : 0.6850412 0.0481824 0.2667764
2 3 1 1_2 : 0.8298894 0.1701059 0.0000047
2 4 1 1_2 : 0.0042849 0.4747644 0.5209507
2 5 1 1_2 : 0.0591896 0.9405881 0.0002223
2 1 10 10_2 : 0.3515363 0.1865224 0.4619412
2 2 10 10_2 : 0.4561017 0.2180631 0.3258352
2 3 10 10_2 : 0.1632815 0.4506858 0.3860327
2 4 10 10_2 : 0.2894660 0.1885444 0.5219895
2 5 10 10_2 : 0.4009728 0.2318401 0.3671871

We can generate bootstrap distributions comparing the four groups of matrices with the following code:

bootstrap2 <- Q_bootstrap(matrices = Q, n_replicates = 100, K = 3, seed = 1, group = "Pop")

bootstrap2$plot_boxplot

bootstrap2$test_pairwise_wilcox
#> 
#>  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
#> 
#> data:  all_stats$ratio and all_stats$Matrix 
#> 
#>      1_1     10_1    1_2    
#> 10_1 < 2e-16 -       -      
#> 1_2  1.1e-08 < 2e-16 -      
#> 10_2 < 2e-16 3.3e-07 < 2e-16
#> 
#> P value adjustment method: holm

Concluding Thoughts

This concludes the introduction to the package FSTruct–we hope you found it helpful. For more information, see the paper “FSTruct: an FST-based tool for measuring ancestry variation in inference of population structure” or the FSTruct manual. Or, in the R console, type ?FSTruct::Q_... to read the documentation for particular functions (e.g. ?FSTruct::Q_bootstrap).