Introduction to FSTruct

The FSTruct package provides an

-based tool to quantify the variability of

matrices (matrices with row vectors of ancestry coefficients, the default output of population structure inference programs such as ADMIXTURE and STRUCTURE).

In this document, we walk through an example analysis, including how to:

Load or simulate matrices
Visualize matrices
Calculate the variability statistic for each matrix
Generate & analyze bootstrap distributions of for each matrix
Use statistical tests to compare the variability of multiple matrices

All of these tasks can be accomplished using the four functions in the package FSTruct.

What is a matrix?

A matrix represents the ancestry of a population; each row represents an individual and the last columns contain proportions called membership coefficients (the other columns give details about the individual, such as its name and population identifier). Membership coefficients describe how much of an individual’s ancestry derives from a given ancestral cluster. An individual’s membership coefficients sum to 1.

A matrix generated by the program ADMIXTURE with ancestral clusters looks like this:

1	1	(x)	1	:	1.0000	0.0000	0
2	2	(x)	1	:	1.0000	0.0000	0
3	3	(x)	2	:	0.0137	0.7567	0.2296
4	4	(x)	2	:	0.0000	0.7325	0.2674
5	5	(x)	2	:	0.0224	0.7279	0.2496
6	6	(x)	3	:	0.0222	0.7372	0.2406
7	7	(x)	3	:	0.0000	0.7396	0.2603
8	8	(x)	3	:	0.0231	0.7496	0.2272
9	9	(x)	3	:	0.0460	0.6861	0.2678
10	10	(x)	3	:	0.0843	0.6739	0.2418

The last columns contain the membership coefficients.

Install and load FSTruct

The FSTruct package can be downloaded from GitHub using the code below. Be sure to first install the package devtools if it is not yet installed.

# install.packages("devtools") # Run if devtools is not yet installed
# devtools::install_github("MaikeMorrison/FSTruct") # Run if FSTruct is not yet installed

To follow along with this vignette, make sure the following packages are also installed:

# install.packages("dplyr") # Run if dplyr is not yet installed
# install.packages("ggplot2") # Run if ggplot2 is not yet installed
# install.packages("cowplot") # Run if cowplot is not yet installed
# install.packages("data.table") # Run if data.table is not yet installed

Then load the package FSTruct and the package dplyr, which contains functions such as arrange, which reorders the rows of a data frame.

library(FSTruct)
library(dplyr)

Load or simulate matrices

To begin the analysis, read one or more matrices into R. For each matrix, this task can be accomplished using:

Q_matrix_name <- data.table::fread("file path to your Q matrix.output")

or other functions, such as read.table().

Note that the default column names of these matrices will likely be V1, V2, V3, … If you wish to give the columns more informative names, use the function colnames, listing on the right-hand side your desired column names (the example below uses the same names as Q_simulate for ancestral clusters):

colnames(Q_matrix_name) <- c("rep", "ind", "alpha", "Pop", "spacer", "q1", "q2")

Repeat this process for each matrix you wish to analyze.

Alternatively, matrices can be simulated using Q_simulate, which draws individuals from a Dirichlet probability distribution (for more details on the Dirichlet distribution, see the “Simulation examples” section of the paper).

The function Q_simulate has four parameters:

alpha, a number that controls the variability. Higher values of result in lower variances across membership coefficients.*
lambda, a vector that controls the mean membership of each ancestral cluster.
popsize, the number of individuals to put in the matrix.
rep, the number of populations to simulate. Default is 1.
seed, an optional number which, if included, is used as the random seed to make the random results reproducible.

For this example, I chose to simulate four matrices (A, B, C, and D), each with 20 individuals and with the same vector of mean membership coefficients, . This means that each matrix will have 20 rows and, on average, individuals from any of the four matrices will possess three-quarters of their ancestry from the first ancestral cluster, and one-quarter of their ancestry from the second ancestral cluster.

I chose to simulate matrices A and B using , which results in fairly high variability across membership coefficients, and matrices C and D using , which results in lower variability. Properties of the Dirichlet distribution tell us that the variance of the ancestral cluster, , is . It follows that, for matrices A and B, . For matrices C and D, .

*alpha can also be a numeric vector. See the “Advanced note on the usage of Q_bootstrap” at end of document for more details.

A = Q_simulate(alpha = .1, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 1)

B = Q_simulate(alpha = .1, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 2)

C = Q_simulate(alpha = 10, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 3)

D = Q_simulate(alpha = 10, lambda = c(.75, .25), popsize = 20, rep = 1, seed = 4)

Here is what matrix A looks like:

rep	ind	alpha	Pop	spacer	q1	q2
1	1	0.1	0.1_1	:	0.9900510	0.0099490
1	2	0.1	0.1_1	:	1.0000000	0.0000000
1	3	0.1	0.1_1	:	0.0001171	0.9998829
1	4	0.1	0.1_1	:	1.0000000	0.0000000
1	5	0.1	0.1_1	:	1.0000000	0.0000000
1	6	0.1	0.1_1	:	0.0018261	0.9981739
1	7	0.1	0.1_1	:	0.9971413	0.0028587
1	8	0.1	0.1_1	:	1.0000000	0.0000000
1	9	0.1	0.1_1	:	1.0000000	0.0000000
1	10	0.1	0.1_1	:	1.0000000	0.0000000
1	11	0.1	0.1_1	:	0.0108406	0.9891594
1	12	0.1	0.1_1	:	1.0000000	0.0000000
1	13	0.1	0.1_1	:	0.0000029	0.9999971
1	14	0.1	0.1_1	:	1.0000000	0.0000000
1	15	0.1	0.1_1	:	0.0000175	0.9999825
1	16	0.1	0.1_1	:	1.0000000	0.0000000
1	17	0.1	0.1_1	:	0.0000311	0.9999689
1	18	0.1	0.1_1	:	0.1082434	0.8917566
1	19	0.1	0.1_1	:	1.0000000	0.0000000
1	20	0.1	0.1_1	:	0.9339119	0.0660881

The format of this matrix is very similar to the format of the ADMIXTURE matrix above. The last columns contain the membership coefficients.

Visualize Matrices

We visualize these matrices using the function Q_plot, which has one required parameter: a matrix (Q). If Q contains any columns that are not ancestry coefficients (as is the case in our example), we must also specify the number of ancestral clusters (K). If K is not provided, the function will use all of the columns of Q, which is a problem if Q contains information other than ancestry vectors. The third parameter, arrange, is optional. If arrange = TRUE, the individuals are horizontally ordered by the clusters with the greatest mean membership. Below, we accomplish this ordering manually for plot A, and use arrange = TRUE for all subsequent plots. These two methods accomplish the same ordering.

# Generate and modify a plot for each Q matrix

# Manual arranging of individuals
plot_A <- Q_plot(Q = dplyr::arrange(A, q1),
                 K=2)

# Using the Q_plot arrange option
plot_B <- Q_plot(Q = B, 
                 K=2,
                 arrange = TRUE) + 
  ggplot2::scale_fill_brewer(palette = "Blues") + 
  ggplot2::scale_color_manual(values = rep("white", 2))

plot_C <- Q_plot(Q = C, 
                 K=2,
                 arrange = TRUE) + 
  ggplot2::scale_fill_brewer(palette = "Greens")+ 
  ggplot2::scale_color_brewer(palette = "Greens")

plot_D <- Q_plot(Q = D, 
                 K=2,
                 arrange = TRUE) + 
  ggplot2::scale_fill_brewer(palette = "Reds") + 
  ggplot2::scale_color_manual(values = rep("white", 2))

arrange(matrix, var), a function from dplyr, sorts the rows in matrix according to a column named var. Here, I order individuals in each matrix according to their value of q1, but other variables can also be used. Use colnames(Q_matrix_name) to see the column names and decide which to sort by. Matrices can also be sorted by multiple columns (e.g., arrange(matrix, var1, var2)).

scale_fill_brewer(), scale_color_brewer(), and scale_color_manual, functions from ggplot2, modify the color scheme of the plot. Because Q_plot() outputs a ggplot object, you can do lots of modifications! For example, you can use scale_fill_brewer() to set the fill color scheme, but use scale_color_manual to set the outline color.

# Display these plots in a grid
cowplot::plot_grid(plot_A, plot_B, plot_C, plot_D,
                   labels = "AUTO", 
                   nrow = 1,
                   vjust = 2)

We observe that matrices A and B are visually similar, with most individuals from each matrix possessing all of their ancestry in just one of the two clusters (e.g., for B, all light blue or all dark blue). For a matrix with two ancestral clusters, the maximum variability case occurs when every individual (or all but one) has a membership vector of or (possessing all of its ancestry in just one of the two clusters). Matrices A and B are close to this maximum case, as they contain just a few individuals with ancestry in multiple clusters.

Matrices C and D, on the other hand, are comprised of individuals with some ancestry from each cluster. These matrices are closer to the minimum variability case, in which every individual’s membership coefficients are the same as the mean (e.g., for C, 3/4 dark green and 1/4 light green).

We have now built intuition about what high and low variability matrices look like. The statistic quantifies this variability.

Calculate the variability of each matrix

As a preliminary analysis, let’s calculate for each of our four simulated matrices. We will do this analysis using the function Q_stat, which accepts two parameters: a matrix (Q) and the number of ancestral clusters (K).

Q_stat(Q = A, K = 2)
#> $Fst
#> [1] 0.9595213
#> 
#> $FstMax
#> [1] 0.9927522
#> 
#> $ratio
#> [1] 0.9665264
Q_stat(Q = B, K = 2)
#> $Fst
#> [1] 0.8301276
#> 
#> $FstMax
#> [1] 0.9942423
#> 
#> $ratio
#> [1] 0.8349349
Q_stat(Q = C, K = 2)
#> $Fst
#> [1] 0.08282438
#> 
#> $FstMax
#> [1] 0.9640783
#> 
#> $ratio
#> [1] 0.08591043
Q_stat(Q = D, K = 2)
#> $Fst
#> [1] 0.06545711
#> 
#> $FstMax
#> [1] 0.9514734
#> 
#> $ratio
#> [1] 0.06879552

For each matrix, the function Q_stat returns , , and their ratio . In population genetics, the statistic is conventionally used to quantify variability in allele frequencies among a collection of subpopulations. Here, we use this statistic to quantify the variability in membership coefficients among a collection of individuals. (See the methods section of the paper for a more thorough discussion of this analogy.) Importantly, when the matrix contains few individuals, or the mean membership of the most prevalent ancestral cluster is close to (the minimum value) or 1 (the maximum value), the maximum possible value of can be much less than 1. For this reason, we use the normalized statistic, , which always ranges between 0 and 1, and can thus be used to compare the variability of multiple matrices.

We observe from the results of Q_stat that matrices A and B have values of greater than 0.8, whereas matrices C and D have values less than 0.1. Are the differences in variability between these matrices statistically significant?

To answer these questions, we employ bootstrapping.

Generate & analyze bootstrap distributions of for each matrix

A bootstrap replicate of a matrix is a second matrix with the same dimensions whose rows were drawn with replacement from those of the original matrix. By generating many bootstrap replicate matrices and computing for each, we generate a bootstrap distribution of . This bootstrap distribution is an estimate of the sampling distribution (the distribution obtained by drawing multiple matrices from the full population). We can perform statistical tests of identity of the bootstrap means to draw conclusions about whether the original matrices have significantly different variabilities.

To generate a bootstrap distribution for the four simulated matrices, I used the function Q_bootstrap, which has four core parameters:

matrices, a named list of matrices to analyze*
n_replicates, the number of bootstrap replicates to generate for each matrix
K, the number of ancestral clusters. Can also be a vector if the number of ancestral clusters differs among the provided matrices.**
seed, an optional number which, if included, is used as the random seed to make the random results reproducible

*Q_bootstrap can also take as its matrices parameter a single matrix, possibly containing multiple groups for comparison. If the matrix contains individuals from multiple groups which you wish to compare, you must also provide a group parameter, the name of the column specifying which group each row (individual) belongs to. See the “Advanced note on the usage of Q_bootstrap” at the end of this document for more details.

**If K is not provided, Q_bootstrap will assume that K is equal to the number of columns in the first matrix provided.

bootstrap <- Q_bootstrap(matrices = list("Matrix A" = A, "Matrix B" = B, 
                                         "Matrix C" = C, "Matrix D" = D), 
                  n_replicates = 100,
                  K = 2, 
                  seed = 1)

The output of Q_bootstrap, which I named bootstrap, is a list of the following objects:

bootstrap_replicates: A list of bootstrap replicate matrices for each matrix provided
statistics: A dataframe containing the values of , , and for each bootstrap replicate matrix
plot_boxplot, plot_violin, plot_ecdf: A box plot, violin plot, or empirical cumulative distribution function (ECDF) plot of the bootstrap distribution of for each matrix
test_kruskal_wallis: The results of a Kruksal-Wallis test, which tests the identity of all provided bootstrap means
test_pairwise_wilcox: The results of a Wilcoxon rank-sum test, which tests the identity of pairwise bootstrap means for each pair of matrices

All three options for visualizing the bootstrap distributions are ggplot objects and can be modified in many ways using tools from the package ggplot2:

cowplot::plot_grid(bootstrap$plot_boxplot + ggplot2::ggtitle("Box Plot"), 
                   bootstrap$plot_violin + ggplot2::ggtitle("Violin Plot"), 
                   bootstrap$plot_ecdf + ggplot2::ggtitle("ECDF Plot") +
                     ggplot2::scale_color_brewer(palette = "Dark2"), 
                   nrow = 2)

Use statistical tests to compare the variability of multiple matrices

Comparing these distributions by eye, it seems like the bootstrap distributions of matrices A and B are each quite different from all of the other matrices. C and D, on the other hand, seem more similar to each other than A and B, but are still not identical. Rather than relying on our intuition, we can statistically test the similarity of these distributions. Included in the output of Q_bootstrap are the output of two such tests:

The Kruskal-Wallis test determines if statistically significant differences exist among a collection of two or more distributions. Its output includes a chi-squared statistic and a p-value.
The Wilcoxon rank sum test looks at every pairwise combination of distributions, and for each determines if a statistically significant difference exists between the two distributions. Its output is a matrix of p-values, one for each pair of distributions.

For both tests, the null hypothesis is that the distributions are the same.

bootstrap$test_kruskal_wallis
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  all_stats$ratio by all_stats$Matrix
#> Kruskal-Wallis chi-squared = 327.71, df = 3, p-value < 2.2e-16

bootstrap$test_pairwise_wilcox
#> 
#>  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
#> 
#> data:  all_stats$ratio and all_stats$Matrix 
#> 
#>          Matrix A Matrix B Matrix C
#> Matrix B <2e-16   -        -       
#> Matrix C <2e-16   <2e-16   -       
#> Matrix D <2e-16   <2e-16   2e-04   
#> 
#> P value adjustment method: holm

The results of these tests support the claim that the variabilities of all four matrices are significantly different, but that C and D are more similar than A and B.

You cannot make comparisons if only one matrix is provided. If only one matrix is provided to Q_bootstrap, the entries test_kruskal_wallis and test_pairwise_wilcox will contain the text “This statistical test can only be performed if a list of matrices is provided.” instead of the results of a statistical test.

Advanced note on the usage of `Q_bootstrap`

We noted above that Q_bootstrap can take as its matrices parameter one matrix which contains individuals from multiple groups we wish to compare. This analysis is relevant if we want to compare the ancestry variability among multiple groups based on the output of programs such as STRUCTURE or ADMIXTURE, which generate one matrix for all individuals, with a column corresponding to the group each individual belongs to.

The function Q_simulate generates a matrix of this same form. If rep>1 and/or alpha is a vector with more than one entry, Q_simulate will generate one matrix containing rep times length(alpha) total groups of popsize individuals, stacked vertically. The column Pop of the output is defined as alpha_rep (e.g., for the second repetition of a matrix with alpha = 10, Pop=10_2), and specifies what alpha/rep combination each simulated individual belongs to.

For example, we can use the following code to generate a matrix with four groups of 5 individuals and mean membership , two with and two with .

Q <- Q_simulate(alpha = c(1,10), lambda = c(1/3, 1/3, 1/3), popsize = 5, rep = 2, seed = 1)

Here is the matrix we have simulated:

rep	ind	alpha	Pop	spacer	q1	q2	q3
1	1	1	1_1	:	0.0577009	0.7675994	0.1746997
1	2	1	1_1	:	0.0461490	0.9531101	0.0007409
1	3	1	1_1	:	0.5564586	0.3811179	0.0624235
1	4	1	1_1	:	0.8083399	0.0034037	0.1882564
1	5	1	1_1	:	0.3108832	0.5643938	0.1247230
1	1	10	10_1	:	0.3569613	0.4724304	0.1706083
1	2	10	10_1	:	0.2897101	0.3403869	0.3699029
1	3	10	10_1	:	0.2433330	0.1304592	0.6262078
1	4	10	10_1	:	0.1890401	0.4372094	0.3737505
1	5	10	10_1	:	0.3003683	0.2487210	0.4509107
2	1	1	1_2	:	0.1203557	0.1119421	0.7677023
2	2	1	1_2	:	0.6850412	0.0481824	0.2667764
2	3	1	1_2	:	0.8298894	0.1701059	0.0000047
2	4	1	1_2	:	0.0042849	0.4747644	0.5209507
2	5	1	1_2	:	0.0591896	0.9405881	0.0002223
2	1	10	10_2	:	0.3515363	0.1865224	0.4619412
2	2	10	10_2	:	0.4561017	0.2180631	0.3258352
2	3	10	10_2	:	0.1632815	0.4506858	0.3860327
2	4	10	10_2	:	0.2894660	0.1885444	0.5219895
2	5	10	10_2	:	0.4009728	0.2318401	0.3671871

We can generate bootstrap distributions comparing the four groups of matrices with the following code:

bootstrap2 <- Q_bootstrap(matrices = Q, n_replicates = 100, K = 3, seed = 1, group = "Pop")

bootstrap2$plot_boxplot

bootstrap2$test_pairwise_wilcox
#> 
#>  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
#> 
#> data:  all_stats$ratio and all_stats$Matrix 
#> 
#>      1_1     10_1    1_2    
#> 10_1 < 2e-16 -       -      
#> 1_2  1.1e-08 < 2e-16 -      
#> 10_2 < 2e-16 3.3e-07 < 2e-16
#> 
#> P value adjustment method: holm

Introduction to FSTruct

In this document, we walk through an example analysis, including how to:

What is a Q matrix?

Install and load FSTruct

Load or simulate Q matrices

Visualize Q Matrices

Calculate the variability of each Q matrix

Generate & analyze bootstrap distributions of FST/FmaxST for each matrix

Use statistical tests to compare the variability of multiple Q matrices

Advanced note on the usage of Q_bootstrap

Concluding Thoughts