Controlling for population stratification through PCs and replicability

I recently came across this paper on controlling for population stratification with principal components, which claims the practice is highly problematic for genetic studies due to the black-box nature of the technique (i.e., we are not measuring ancestry directly, the choice of PCs is arbitrary, etc.). I have mixed feelings about the paper, but I can see how the technique relies heavily on the sample to estimate eigenvalues, which might threaten replicability and complicate combining results from heterogeneous cohorts.

What are your thoughts on this? Is there any best practice or alternative that would avoid those issues? Thanks!

Hi Giulio,

Great question. There are several distinct issues here. The first is the notion of population, which I spent some time on this morning. The structure present in a given sample may not reflect the relevant structure in the entire population. This is a limitation of how you define the population rather than of the PCA technique per se. On the other hand, even when your sample is representative of the population (with respect to genetic structure), PCs will still only partially capture that structure if your sample size is too small. What counts as too small a sample size (N) depends on the level of differentiation (i.e., Fst) between subgroups in the population and on the effective number of SNPs (M). Random Matrix Theory predicts that you start to detect the structure once Fst > 1/sqrt(N*M). You can check this paper for more details.
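
To get a feel for the numbers, here is a minimal Python sketch of that threshold. The function names are my own (hypothetical), and it simply rearranges Fst > 1/sqrt(N*M); it takes M as given and does not estimate the effective number of SNPs for you.

```python
import math


def detection_threshold_fst(n_samples: int, n_snps: int) -> float:
    """Approximate Fst above which PCA starts to pick up structure: 1/sqrt(N*M)."""
    if n_samples <= 0 or n_snps <= 0:
        raise ValueError("n_samples and n_snps must be positive")
    return 1.0 / math.sqrt(n_samples * n_snps)


def min_samples_to_detect(fst: float, n_snps: int) -> int:
    """Smallest integer N satisfying fst > 1/sqrt(N*M), i.e. N > 1/(fst^2 * M)."""
    if fst <= 0 or n_snps <= 0:
        raise ValueError("fst and n_snps must be positive")
    # floor(...) + 1 gives the smallest integer strictly above the bound.
    return math.floor(1.0 / (fst ** 2 * n_snps)) + 1


if __name__ == "__main__":
    # Illustrative values only: Fst = 0.001 with 30,000 effective SNPs.
    print(detection_threshold_fst(100, 10_000))
    print(min_samples_to_detect(0.001, 30_000))
```

For example, with Fst = 0.001 and M = 30,000 the second function returns 34, since 1/sqrt(34 * 30000) is just under 0.001 while 1/sqrt(33 * 30000) is just over it.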

To conclude, I’d say that there is no need to panic and think that all PCA analyses are rubbish; instead, be aware that there are caveats (some listed above).


Hi Giulio,

Thanks, good question. I need to read that paper in more detail to really understand all the implications of what the author is claiming, but the main conclusion you mention, that many of the outcomes of the PCA depend on sample characteristics, is of course true. I see that he also claims that GWAS results are therefore unreliable, which I don’t agree with. Using the PCs of the specific dataset you are conducting a GWAS on seems to me an appropriate way to control for the population structure within that specific dataset.

Of course, PCs do not perfectly capture and remove all of the population stratification (see the points made in slide 30 regarding sample size and allele frequencies, as well as the points made in the rest of the lecture about how difficult it is to capture ancestry in a homogeneous dataset when other patterns of variation are present). However, we have additional approaches to measure the extent to which the PC correction works (e.g., LDSC regression), and those do suggest that PCs help remove much of the inflation due to population stratification. There may still be residual population stratification in many GWASs, but for most GWASs that are large enough and have done their PCA carefully and/or used mixed linear modeling approaches, I think the results contain enough polygenic signal to do meaningful follow-up analyses on.
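
To make the idea of in-sample PC correction concrete, here is a toy sketch in Python with NumPy. Everything in it is an assumption for illustration: the simulated two-group sample, the function names, and the N * r^2 association statistic. It is not what any particular GWAS pipeline does. The phenotype differs between groups for non-genetic reasons and no SNP is causal, so any excess in the average test statistic reflects stratification.

```python
import numpy as np


def simulate_stratified_sample(n_per_group=500, n_snps=300, seed=0):
    """Two subpopulations with shifted allele frequencies; no SNP is causal."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, n_snps), 0.05, 0.95)
    geno_a = rng.binomial(2, freq_a, size=(n_per_group, n_snps))
    geno_b = rng.binomial(2, freq_b, size=(n_per_group, n_snps))
    genotypes = np.vstack([geno_a, geno_b]).astype(float)
    group = np.repeat([0.0, 1.0], n_per_group)
    # The phenotype mean differs by group only (e.g. an environmental effect).
    phenotype = group + rng.normal(0.0, 1.0, 2 * n_per_group)
    return genotypes, phenotype


def top_pcs(genotypes, n_pcs):
    """PC scores from an SVD of the column-standardized genotype matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    scale = centered.std(axis=0)
    scale[scale == 0] = 1.0  # guard against monomorphic SNPs
    u, s, _ = np.linalg.svd(centered / scale, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def residualize(values, covariates):
    """Remove an intercept and the covariates from values by least squares."""
    design = np.column_stack([np.ones(covariates.shape[0]), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def mean_association_chi2(genotypes, phenotype, covariates=None):
    """Mean of N * r^2 across SNPs after adjusting both sides for covariates."""
    n = genotypes.shape[0]
    if covariates is None:
        covariates = np.empty((n, 0))  # intercept-only adjustment
    g_res = residualize(genotypes, covariates)
    y_res = residualize(phenotype, covariates)
    r2 = (g_res.T @ y_res) ** 2 / ((g_res ** 2).sum(axis=0) * (y_res ** 2).sum())
    return float(np.mean(n * r2))


if __name__ == "__main__":
    genotypes, phenotype = simulate_stratified_sample()
    pcs = top_pcs(genotypes, n_pcs=2)
    print("unadjusted:", mean_association_chi2(genotypes, phenotype))
    print("PC-adjusted:", mean_association_chi2(genotypes, phenotype, pcs))
```

Under the null the mean statistic should sit near 1, so the gap between the two printed values gives a rough sense of how much inflation the PCs absorb in this toy setting. Real data would also need LD pruning, relatedness filtering, and a considered choice of how many PCs to include.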

Best wishes,

Hi Giulio,

In addition to Abdel and Loic’s excellent answers - there is this paper by Gil McVean which describes how to interpret PCA in the context of genealogical analyses.

I also would like to amplify Abdel’s comment about the utility of PCA in the conduct of GWAS - there is no doubt in my mind that this still reflects best practices and improves the replicability of genetic associations.

The other thing I would add to the discussion around the paper is that it leans heavily on treating everything as discrete populations - one of the most important aspects of PCA is that it naturally allows for continuous solutions, and much of human population structure lies along a cline [i.e., a continuous gradient] rather than being truly “discrete”.