QC Genotypic Data by Heterozygosity

Re: filtering samples by the proportion of heterozygosity:
(If I understand this step correctly) if we define “failed QC” as samples with F more than 3 SD from the mean, then we always expect ~0.3% of individuals to fail QC, regardless of their actual proportion of heterozygosity (at least if F is roughly normally distributed). Is that not a bit arbitrary, and also likely too conservative in larger datasets? E.g., we can expect (a priori) that ~1,350 individuals in the UKB will not pass this QC step.
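The ~0.3% and ~1,350 figures can be checked with a few lines of Python. This is only a sketch: it assumes F is normally distributed, and it assumes a cohort of roughly 500,000 samples, a round number not stated above that should be replaced with the real sample count. The function names are made up for illustration.

```python
import math

def two_sided_tail_fraction(k):
    """Fraction of a normal distribution lying more than k SD from the mean."""
    # P(|Z| > k) = erfc(k / sqrt(2)) for a standard normal Z.
    return math.erfc(k / math.sqrt(2))

def expected_failures(n_samples, k=3.0):
    """Expected number of samples flagged purely by the k-SD rule."""
    return n_samples * two_sided_tail_fraction(k)

# Assumed cohort size (~500,000); not taken from the text above.
print(round(two_sided_tail_fraction(3.0), 4))  # 0.0027
print(round(expected_failures(500_000)))       # 1350
```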

It’s a good point. I don’t think there’s a hard-and-fast rule about when F becomes a signal of a poor sample vs. someone who is highly outbred or admixed. One could compare the observed excess heterozygosity with what is expected under a normal distribution with the observed mean and variance to get some idea of this. You could also check whether these individuals appear more likely to be admixed (using PCA plots), which is what we should expect if they are truly heterozygous. Ultimately, however, these individuals will probably be a mix of samples that are legitimately heterozygous and samples that have low genotyping quality. Given that you still have plenty of power in the UKB after removing 0.3% of the sample, it is probably safest to remove them.
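One way to do the comparison against the normal expectation is sketched below. It counts the samples beyond 3 SD in each tail and sets those counts beside the number a normal distribution with the observed mean and SD would predict. The function name and the returned keys are hypothetical. The sketch also assumes F is an inbreeding-style coefficient in which low (negative) values mean excess heterozygosity, so check this against how your F was computed.

```python
import math
import statistics

def tail_excess(f_values, k=3.0):
    """Compare observed k-SD outlier counts with the normal expectation."""
    mean = statistics.fmean(f_values)
    sd = statistics.stdev(f_values)  # sample SD; needs at least two values

    # Assumption: low F = excess heterozygosity, high F = excess homozygosity.
    observed_low = sum(1 for f in f_values if f < mean - k * sd)
    observed_high = sum(1 for f in f_values if f > mean + k * sd)

    # One-sided normal tail probability: P(Z > k) = 0.5 * erfc(k / sqrt(2)).
    expected_per_tail = len(f_values) * 0.5 * math.erfc(k / math.sqrt(2))

    return {
        "observed_low": observed_low,
        "observed_high": observed_high,
        "expected_per_tail": expected_per_tail,
    }

# Toy data: 100 unremarkable samples plus one strongly negative F value.
result = tail_excess([0.0] * 100 + [-1.0])
print(result)
```

If one tail holds many more samples than `expected_per_tail`, those outliers are not just the tail of a normal distribution. The count alone cannot tell admixed samples from low-quality ones, so the PCA check is still needed.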