Alternate methods of analysis using ADMIXTURE & PCAs

PCA and the ADMIXTURE program are widely used to study population histories. The most common approach, however, uses all available SNPs, including common variants, pruned only to maximize the genotyping rate among the samples in the dataset, and this may not be the best way to accurately infer shared drift among populations. Here I outline some of the benefits of placing upper thresholds on minor allele frequencies (MAF) in both ADMIXTURE and PCA.

ADMIXTURE is a very useful tool for studying population structure, but I have noticed that it is often used improperly when designing admixture calculators, including the calculators that have been uploaded to Gedmatch.com. The main source of inaccuracy is the way calculator creators run the tests: the number of test subjects far outweighs the number of population sources/references, which shifts the allele frequencies of the references.

The following papers corroborate some of the issues I have raised:

  1. In “Fast individual ancestry inference from DNA sequence data leveraging allele frequencies for multiple populations” [4], the authors determined that “the allele frequencies are determined not only by the genotypes of the reference individuals but also by the individual(s) that are analyzed for admixture”;
  2. AncestryDNA is also cognizant of this problem. Here is an excerpt from page 24 of their white paper: “It should also be noted that the approach we use is not entirely “supervised,” although we use a supervised version of the algorithm. While the reference populations are set as the “source” populations, genotypes of the tested samples can also influence the allele frequency estimates in the source clusters; i.e., the approach is not fully supervised.
    This is because the model not only estimates Q, but also P, as a function of both the reference samples and the customer samples (a total of N samples). While ideally the P values should remain stable regardless of the customer samples, the customer samples could slightly change the P estimates from their “true” values.
    Customer samples are run in batches of varying sizes; due to the details of the algorithm described above, in theory a customer’s results could vary by batch.
    Extensive tests have shown that the effect of batch on customer estimates is minimal. This is because the batch size is very small compared to the size of the reference panel. Also, removing related samples from the same batch, as described above, ensures minimal effects on customer ethnicity estimates.”
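To see why the number of non-reference samples matters, here is a minimal sketch in Python. It uses a simplified weighted-average approximation of how a cluster's allele frequencies are estimated, not the actual ADMIXTURE update. The function name `cluster_allele_freq`, the sample sizes, the frequencies, and the 30% membership figure are all hypothetical and chosen only for illustration.

```python
import numpy as np

def cluster_allele_freq(genotypes, membership):
    """Weighted allele frequency of one cluster (simplified illustration).

    genotypes:  (N, J) array of 0/1/2 allele counts
    membership: (N,) fraction of each individual's ancestry assigned to the cluster
    """
    g = np.asarray(genotypes, dtype=float)
    w = np.asarray(membership, dtype=float)
    return (w @ g) / (2.0 * w.sum())

rng = np.random.default_rng(0)
n_snps = 2000

# Hypothetical data: 20 references (true frequency 0.2) and 200 test
# samples drawn from a different population (true frequency 0.6).
refs = rng.binomial(2, 0.2, size=(20, n_snps))
tests = rng.binomial(2, 0.6, size=(200, n_snps))

# References alone define the cluster.
ref_only = cluster_allele_freq(refs, np.ones(20))

# All 200 test samples are in the run, each assigned 30% to this cluster.
many = cluster_allele_freq(
    np.vstack([refs, tests]),
    np.concatenate([np.ones(20), np.full(200, 0.3)]),
)

# Only one test sample is in the run, also assigned 30%.
one = cluster_allele_freq(
    np.vstack([refs, tests[:1]]),
    np.concatenate([np.ones(20), [0.3]]),
)

print(ref_only.mean(), many.mean(), one.mean())
```

Under these made-up numbers the expected cluster frequency is 0.2 with references alone, about 0.5 with all 200 test samples in the run, and about 0.206 with a single test sample, which is the kind of shift the excerpts above describe.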

One of the limitations of ADMIXTURE-based calculators is that they are not informative as to the direction of gene flow. This is fundamentally because the ADMIXTURE program, like many other programs, including ones based on formal methods, is allele frequency based, and therefore the output is subject to how we interpret allele frequencies. Most ADMIXTURE-based calculators are developed with no limitations on the range of minor allele frequencies (MAF), except perhaps to filter out positions with MAF < 1%, at which nearly all samples are homozygous for the major allele. This matters because mathematical estimates suggest that most common SNPs originated thousands of years ago and therefore have a wide geographic distribution, in contrast to rare variants, which are mostly more recent and geographically restricted. Rare and common variants therefore allow us to investigate events at different time scales of demographic history [1].
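As a rough illustration of what a MAF band filter does, here is a minimal Python sketch. It assumes genotypes are already loaded as a samples-by-SNPs matrix of 0/1/2 allele counts with `np.nan` for missing calls. The function names and the thresholds in the usage lines are placeholders, not the settings of any particular calculator.

```python
import numpy as np

def minor_allele_freq(genotypes):
    """Per-SNP minor allele frequency from a (samples x SNPs) 0/1/2 matrix.

    Missing calls are expected as np.nan and are ignored.
    """
    g = np.asarray(genotypes, dtype=float)
    alt_freq = np.nanmean(g, axis=0) / 2.0
    # The minor allele is whichever of the two alleles is rarer.
    return np.minimum(alt_freq, 1.0 - alt_freq)

def maf_band_mask(genotypes, lower, upper):
    """Boolean mask of SNPs with lower < MAF < upper."""
    maf = minor_allele_freq(genotypes)
    return (maf > lower) & (maf < upper)

# Toy example: 10 samples, 3 SNPs.
toy = np.zeros((10, 3))
toy[0, 1] = 1      # one heterozygote -> MAF 0.05
toy[:, 2] = 1      # everyone heterozygous -> MAF 0.5
keep = maf_band_mask(toy, lower=0.01, upper=0.08)
rare_only = toy[:, keep]
```

A conventional filter would apply only the lower bound; the upper bound is what restricts the analysis to rarer variants.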

For example, let’s assume that using an ADMIXTURE-based calculator an individual scores 10% Yamnaya. Would that necessarily mean that the individual has 10% admixture attributable to the Yamnaya culture? The answer is no, because ADMIXTURE-based calculators are generally created with no upper thresholds on MAF, which means that the results are based on minor allele frequencies ranging from 1% all the way up to 50%, the maximum a minor allele can reach. The allele frequencies also vary with the specific panel used to create the calculator. High MAF translates into Common Population Specific (CPS) variants, which roughly translates into general sharing of an allele by two or more populations on a deeper time scale, likely via a common ancestor.

For those with a genetics background, you may remember that the time to fixation or loss of an allele depends on the allele frequency and the population size. If the frequency of an allele, say G, is 80%, and the frequency of the alternative allele T is 20%, then under neutral drift the probability that G will eventually become fixed in the population is 80%, whereas the probability that T will become fixed is 20%. Additionally, the time to fixation depends on the population size: fixation is predicted to take less time in small populations than in large ones. So although differences in allele frequencies are influenced by various factors such as selection and population size, time is the major factor in the rise or fall of allele frequencies.
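The fixation argument above can be checked with a small Wright-Fisher simulation. This is a minimal Python sketch assuming pure neutral drift in a constant-size diploid population with no mutation, selection, or migration. The function name, population sizes, and replicate count are arbitrary choices for illustration.

```python
import numpy as np

def simulate_drift(p0, n_diploid, n_reps=2000, seed=0, max_gen=100_000):
    """Neutral Wright-Fisher drift for one biallelic locus.

    Returns (fraction of replicates in which the allele fixed,
             mean number of generations until fixation or loss).
    """
    rng = np.random.default_rng(seed)
    n_copies = 2 * n_diploid
    counts = np.full(n_reps, round(p0 * n_copies), dtype=np.int64)
    absorbed_at = np.zeros(n_reps, dtype=np.int64)
    for gen in range(1, max_gen + 1):
        active = (counts > 0) & (counts < n_copies)
        if not active.any():
            break
        # Each generation resamples 2N allele copies from the current frequency.
        counts[active] = rng.binomial(n_copies, counts[active] / n_copies)
        newly_done = active & ((counts == 0) | (counts == n_copies))
        absorbed_at[newly_done] = gen
    fixed_fraction = float(np.mean(counts == n_copies))
    return fixed_fraction, float(absorbed_at.mean())

# Allele G starts at 80% in a small and in a larger population.
frac_small, time_small = simulate_drift(0.8, n_diploid=20)
frac_large, time_large = simulate_drift(0.8, n_diploid=100)
print(frac_small, time_small)
print(frac_large, time_large)
```

In both runs the fraction of replicates in which G fixes should land near 0.8, while the mean time to fixation or loss should be several times longer in the larger population.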

So if you think about it, this implies that rare alleles with a MAF of, say, under 5% are relatively more recent, which in turn implies that they are a better indicator of relatedness between two individuals or populations due to more recent gene flow than CPS alleles with a frequency of, say, 80%, which would indicate more distant relatedness due to a distant common ancestor. The main catch, based on my investigations, is that panels such as HO don’t overlap enough with Illumina-based platforms for enough variants to survive a deep pruning to MAF < 0.05. I can see this when a PCA I generate “collapses”, meaning that the number of variants surviving the pruning process is too low to account for much of the intra-Eurasian population variation. I have been able to work around this problem by using other published panels that have more overlapping SNPs with Illumina, such as those from the Estonian Biocentre, my own project members, and elsewhere.

Returning to the Yamnaya example, it is possible that the test sample and Yamnaya agree on an allele at a specific locus because both have a common ancestor, and not necessarily because the test sample is descended from Yamnaya, especially if the agreement involves an allele that has a high frequency at that locus in that dataset. By contrast, if the agreement involves an allele with a low frequency at that locus, then the probability is much higher that the test sample is descended from Yamnaya.

PCAs and ADMIXTURE tests based on common vs rare alleles

To illustrate some of the concepts I have discussed so far, here are a couple of PCAs. I have only included samples that have the highest marker overlap with Illumina platforms, so that I can prune down to MAF < 8% and still have enough markers left to account for adequate variation among Eurasians. I have also carefully selected the samples to mitigate some of the unpleasant inherent issues with PCAs, namely projection bias, sample bias, skewing due to close relatedness of samples, and so on.

First, a “traditional” PCA using common alleles, in other words, filtered only for MAF > 0.1% and based on about 376K SNPs. The following PCAs use PLINK bed/bim/fam files as inputs, and are not based on ADMIXTURE outputs.
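For readers who want to see what a genotype PCA does under the hood, here is a minimal Python sketch. It assumes the genotypes have already been read into a samples-by-SNPs matrix of 0/1/2 counts with no missing data (reading the PLINK files is not shown), and it uses the common per-SNP standardization by allele frequency. The function name and the simulated two-population data are hypothetical; this is not the pipeline behind the figures here.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA of a (samples x SNPs) 0/1/2 genotype matrix via SVD."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)           # drop monomorphic SNPs
    g, p = g[:, keep], p[keep]
    # Center each SNP at 2p and scale by its binomial standard deviation.
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulated example: two strongly differentiated populations, 20 samples each.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=300)
freq_b = rng.uniform(0.1, 0.9, size=300)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(20, 300)),
    rng.binomial(2, freq_b, size=(20, 300)),
])
coords = genotype_pca(geno)
pc1 = coords[:, 0]
```

Restricting the columns of the matrix to a MAF band before calling the function is all it takes to switch between a common-allele PCA and a rare-allele PCA.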

Fig 1 – PCA with MAF > 0.001, with 376K SNPs

In the above figure, the positions of the various samples are primarily based on common variants, which likely reflect ancient as well as more recent admixture events. Notice the close proximity of the Arab samples to other West Asians. This is the PCA most commonly seen, with no restriction to rare alleles.

For comparison, here is one based on relatively rare alleles (0.1% < MAF < 8%). This PCA likely reflects more accurately the specific relationships between populations that result from more recent direct gene flow, in contrast to the previous one, which likely reflects more generalized relationships based on more distant common ancestors. Notice how the distances between Arab populations and other W Asians have become larger here, suggesting more recent gene flow into Kurds and other W Asians from the north. The distance between Kurds and both Pashtuns and Tajiks has also shrunk, suggesting some more recent gene flow from C/SC Asia or vice versa.

Fig 2 – PCA based on mostly rare alleles only

 

This brings me to the gist of what I am trying to convey, which is that I have invested quite some time in designing an ADMIXTURE-based calculator that is more informative and precise on admixture percentages, with those percentages more likely to reflect direct relatedness or recent admixture events with the population source (component) rather than general relatedness with the population source due to distant common ancestry. Accomplishing this is not easy, and there is more to it than what I have just explained.

A novel approach to designing calculators using the ADMIXTURE program

With these types of tests, I have ventured into somewhat uncharted waters, as there do not appear to be any papers specifically outlining some of the aforementioned issues. After dozens of tests, the results seem promising and reasonable.

To test my theories, I simplified matters by taking drift and pseudo-diploid sequences out of the equation, thus removing ancients altogether, and designed a K14 ADMIXTURE based calculator, using only the highest overlapping sequences with Illumina. This limited my components/population sources, but I am ok with that. I ended up with a supervised test, with components based on various W/S/SC/E Asian and European population sources. I also introduced test subjects into the runs only one at a time, to circumvent the over-fitting issues that I have previously addressed concerning other ADMIXTURE based calculators out there, whereby the component allele frequencies are skewed by the numerous non-reference test subjects in the runs.
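One way to picture a run in which the test subject cannot skew the reference allele frequencies is to hold those frequencies fixed and estimate only the test subject’s ancestry proportions. The Python sketch below does that with a standard EM update for a single individual. Note that this is a stricter variant, named plainly: it is not the ADMIXTURE program itself and not necessarily what happens in the one-at-a-time runs described above. The function name, the three simulated reference populations, and the 50/50 test individual are hypothetical.

```python
import numpy as np

def estimate_q_fixed_p(genotype, ref_freqs, n_iter=200):
    """EM estimate of ancestry proportions for ONE individual.

    genotype:  (J,) array of 0/1/2 counts of the tracked allele
    ref_freqs: (K, J) allele frequencies of K reference populations,
               held fixed throughout (the test sample never alters them)
    """
    g = np.asarray(genotype, dtype=float)
    p = np.clip(np.asarray(ref_freqs, dtype=float), 1e-6, 1 - 1e-6)
    k = p.shape[0]
    q = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Probability that each allele copy came from each population.
        resp_alt = (q[:, None] * p) / (q @ p)
        resp_other = (q[:, None] * (1 - p)) / (q @ (1 - p))
        q = (resp_alt * g + resp_other * (2 - g)).sum(axis=1) / (2 * g.size)
    return q

# Simulated check: an individual who is 50% population 0 and 50% population 1.
rng = np.random.default_rng(2)
ref_freqs = rng.uniform(0.05, 0.95, size=(3, 5000))
true_q = np.array([0.5, 0.5, 0.0])
genotype = rng.binomial(2, true_q @ ref_freqs)
q_hat = estimate_q_fixed_p(genotype, ref_freqs)
print(q_hat)
```

Because the reference frequencies never change, the estimate for one person is the same no matter how many other people are tested, which is the property the one-at-a-time design is trying to approximate.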

Some of the population sources/references are self-explanatory; others are based on the following:

  1. NE Caucasus: Kumyks, Lezgins, N Ossetians;
  2. Finno-Ugric: Saami, Finns, and Russian Karelians;
  3. Indian: Asur, Gond, Ho, Kapu, Kol, Kurmi, Marwadi, Orissans, Santhal, & Balija;
  4. NE European: Latvians, Lithuanians, Poles, & Belarusians;
  5. SE European: Albanians & Montenegrins;
  6. SW Asian: Saudis & Jordanians;
  7. Pashtun: 10 Pakistani & Afghan Pashtuns;
  8. Tajik: 18 Tajiks from Tajikistan & Afghanistan. Pamiri Tajiks are included;
  9. C Asian: Turkmen & Uzbeks from Uzbekistan, and Kazakh.

Test subjects with slight E/NE/SE Asian admixture don’t show it with this calculator, because that ancestry is absorbed into the Tajik, Pashtun, Finno-Ugric, and NE Caucasus components. Also, “noise” is not an issue here, as evidenced by the near absence of percentages under 1-2%.

Here are some sample test results, based on 386K SNPs with allele frequencies above 1%, which includes common alleles. The results are sorted with the Tajik component increasing from top to bottom. I was fortunate to have a project member who happens to be 50% Tajik / 50% Pashtun to test the robustness of the Tajik and Pashtun components. His results, which are shown at the bottom of the chart, came out as 51.63% Tajik, 47.05% Pashtun, and 1.31% Indian.

Fig 3 – Admixture results for various test subjects using a total of 386K overlapping SNPs, and including common alleles – MAF >0.01

 

Fig 4 – ADMIXTURE results using 386K overlapping SNPs, which include common alleles. MAF>0.01.

 

Fig 5 – FST matrix – 386K SNPs – MAF > 0.1%

 

The following are test results using mostly rare alleles only. I believe these results are better for gauging more specific genetic relationships between populations. The cross-validation errors are considerably lower here than in the runs above.

Fig 6 – Admixture results using mostly rare alleles only, sorted by increasing Tajik score from top to bottom, MAF < 12%

 

Fig 8 – Admixture results using mostly rare alleles

 

Fig 9 – FST Matrix using mostly rare alleles

Conclusions

The algorithms used here result in admixture proportions more representative of the allele frequencies of the component references, which in turn results in more accurate admixture proportions. Likewise, the component FST distances would be more accurate here.
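For readers unfamiliar with FST, here is a minimal Python sketch of one common way to compute it between two populations from allele frequencies: the Hudson estimator, taken as a ratio of sums across SNPs. This is an assumption chosen for illustration; it is not necessarily the estimator behind the FST matrices shown here. The function name and the toy frequencies are hypothetical.

```python
import numpy as np

def hudson_fst(p1, p2, n1, n2):
    """Hudson FST between two populations, ratio of sums over SNPs.

    p1, p2: per-SNP frequencies of the same allele in each population
    n1, n2: number of allele copies sampled (2 x number of diploid samples)
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Squared frequency difference, corrected for finite sample size.
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return float(num.sum() / den.sum())

# Toy checks: identical frequencies versus a fixed difference.
fst_same = hudson_fst([0.5, 0.3], [0.5, 0.3], n1=1000, n2=1000)
fst_fixed = hudson_fst([1.0], [0.0], n1=40, n2=40)
print(fst_same, fst_fixed)
```

Applying the same function to every pair of components, once on the common-allele SNP set and once on the rare-allele set, would produce two pairwise matrices that can be compared side by side.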

Higher-density sequencing of some of the populations enabled me to conduct tests based on rarer alleles, and those in turn show more specific relationships based on more recent gene flow than conventional ADMIXTURE-based tests, which show connections based on more common alleles that reflect more distant common relationships between populations.

The algorithm behind my rarer-alleles test is very different from the algorithms typically used with ADMIXTURE-based tests, and my test results showed that it was able to assign admixture proportions with a higher degree of accuracy, with the results reflecting more recent gene flow than conventional ADMIXTURE-based calculators.

Additionally, my calculators are affected far less by spurious minor signal levels (1-3%), commonly referred to as noise, than conventional ADMIXTURE-based calculators. This is evident from the numerous “0” component values in the test results. Most ADMIXTURE-based calculators out there use common alleles to source allele frequencies, and also include hundreds of non-reference samples (samples that are not population sources) in the run. This causes a couple of issues:

1- Common alleles are, as the name suggests, common to many populations and in many cases date back 1000s of years, leading the program to pick up on common signals dating back 1000s of years. Thus the results are reflective of admixture events from long ago, and not necessarily of specific, more recent gene flow;

2- The numerous non-references in the runs “bridge” the allele frequencies between the population sources/references and the test subject, leading to test scores based not purely on the allele frequencies of the population references, but rather on hybrid allele frequencies of non-references and population references. I’ll try to illustrate with an example. Let’s say a W Asian scores 10% and a European scores 2% Indian or S Asian on a calculator that supposedly has a component based on Indian tribal references. I am suggesting that those numbers aren’t accurate, because if you look at the spreadsheet for that calculator you may notice that Pashtuns, for example, score 20-30%. So here Pashtuns act as a bridge for W Asians, who in turn act as a bridge for Europeans. Take away those bridges and the European will score 0 Indian, and the W Asian much less as well. That is what I meant by skewing of allele frequencies by the numerous non-references in the run. In my test practically no one but Indians scores Indian.

An example of the accuracy of my new algorithms is that a known 50% Tajik / 50% Pashtun test subject was properly parsed as almost 50% Tajik / 50% Pashtun using my calculator.

REFERENCES:

  1. Population-specific common SNPs reflect demographic histories and highlight regions of genomic plasticity with functional relevance, Choudhury et al, BMC Genomics, 2014.
  2. Population structure analysis using rare and common functional variants, Baye et al, BMC Genomics, 2011.
  3. Fast model-based estimation of ancestry in unrelated individuals, D.H. Alexander, J. Novembre, and K. Lange, Genome Research, 19:1655–1664, 2009.
  4. Fast individual ancestry inference from DNA sequence data leveraging allele frequencies for multiple populations, V Bansal et al, BMC Bioinformatics, 2015.