GWAS lectures

Hi Abdel,

While going through the course lectures, I had a few questions about the GWAS workflow. Apologies if some of these are very basic.

  1. PCA and covariate file for PLINK

    • When conducting a small GWAS in PLINK, we need to provide covariates such as principal components.

    • Do we usually compute PCA outside PLINK (e.g., in another software) and then format the PCs into a PLINK-compatible covariate file, or is there a standard PLINK command you recommend for running PCA directly and generating the covariate file?

  2. LD pruning for PCA vs association testing

    • When generating PCs, should we first perform LD pruning and then use only the pruned SNP set to calculate PCA, while still using all QC-passed SNPs for the association analysis?

    • Or does PLINK automatically remove SNPs in LD when computing PCA?

  3. GRM: timing and LD pruning

    • At what stage in the workflow do we typically construct the GRM: before association testing, after, or in parallel?

    • When building the GRM, do we usually LD-prune SNPs (and if so, how aggressively), or is it acceptable to use all QC-passed SNPs?

  4. Meta-analysis and phenotype definitions

    • When performing a meta-analysis across cohorts, how are differences in phenotype definitions usually handled? For example, if different cohorts use slightly different case/control definitions or scales.

    • In this context, is it common to run something like Genomic SEM directly on the summary statistics we receive from each cohort?

Thanks

Shalini

Hi Shalini,

Thank you for all the good questions! Please see my answers below. If any tutors or faculty see room for additions or improvements, please feel free to add them!

"Do we usually compute PCA outside PLINK (e.g., in another software) and then format the PCs into a PLINK-compatible covariate file, or is there a standard PLINK command you recommend for running PCA directly and generating the covariate file? "

There are many software packages you can use to compute PCs. PLINK has its own --pca option, but a good alternative that is easy to use is GCTA (see: https://yanglab.westlake.edu.cn/software/gcta/#PCA). You can then reformat the output file in R, or just in a Unix shell, so that PLINK can read it as a covariate file.
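As a minimal sketch of the reformatting step (in Python here, though R or bash work just as well): this assumes the GCTA .eigenvec file is whitespace-separated with no header and has the columns FID, IID, PC1, PC2, and so on. The function name and file names are hypothetical.

```python
from pathlib import Path


def eigenvec_to_covar(eigenvec_path, covar_path, n_pcs=10):
    """Copy FID, IID and the first n_pcs PCs into a covariate file with a header.

    Assumes a headerless, whitespace-separated input: FID IID PC1 PC2 ...
    """
    rows = []
    with open(eigenvec_path) as src:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) < 2 + n_pcs:
                raise ValueError(f"expected at least {2 + n_pcs} columns, got {len(fields)}")
            rows.append("\t".join(fields[: 2 + n_pcs]))

    # A header lets you select covariates by name on the PLINK side.
    header = "\t".join(["FID", "IID"] + [f"PC{i}" for i in range(1, n_pcs + 1)])
    Path(covar_path).write_text("\n".join([header] + rows) + "\n")
    return header


# Hypothetical usage:
# eigenvec_to_covar("mydata_pca.eigenvec", "mydata_covar.txt", n_pcs=10)
```

The resulting file can then be passed to PLINK with --covar.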

"When generating PCs, should we first perform LD pruning and then use only the pruned SNP set to calculate PCA, while still using all QC-passed SNPs for the association analysis? Or does PLINK automatically remove SNPs in LD when computing PCA?"

You need to do the LD pruning yourself before running the PCA; PLINK does not remove SNPs in LD automatically. The pruned SNP set is used only for computing the PCs, and the association analysis still uses all QC-passed SNPs. In addition to LD pruning, you should also remove long-range LD regions (both steps can be done in PLINK). You can find a list of long-range LD regions to exclude in the description of Table 1 of this paper: https://linkinghub.elsevier.com/retrieve/pii/S0002929708003534. For more information, see this presentation from a previous Boulder workshop: https://ibg.colorado.edu/cdrom2023/faculty/abdellaoui/abdel_pop_strat_boulder_2023.pdf. If it's helpful, you can also do the practical from that lecture in the workshop environment; you can find all the files here: Index of /cdrom2023/faculty/abdellaoui
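The two PLINK steps could look roughly like the sketch below, which only builds and prints the command lines. It assumes PLINK 1.9 syntax and an executable called plink; the file names are hypothetical, and the pruning parameters (50 5 0.2) are illustrative rather than a recommendation. The long-range LD file has to be created by hand from the regions in the paper, with one region per line as: chromosome, start, end, label.

```python
import shlex


def pca_snp_commands(bfile, long_range_ld_file, out_prefix,
                     window=50, step=5, r2=0.2, plink="plink"):
    """Return two PLINK commands: (1) prune, (2) write the pruned dataset."""
    # Step 1: drop long-range LD regions, then LD-prune what is left.
    # This writes <out_prefix>.prune.in with the SNPs to keep.
    prune = [plink, "--bfile", bfile,
             "--exclude", "range", long_range_ld_file,
             "--indep-pairwise", str(window), str(step), str(r2),
             "--out", out_prefix]
    # Step 2: keep only the pruned SNPs in a new binary fileset for the PCA.
    subset = [plink, "--bfile", bfile,
              "--extract", out_prefix + ".prune.in",
              "--make-bed", "--out", out_prefix + "_pruned"]
    return [prune, subset]


for cmd in pca_snp_commands("mydata", "long_range_ld.txt", "mydata_pca"):
    print(shlex.join(cmd))
```

Only the PCA uses mydata_pca_pruned; the association run still points at the full QC-passed fileset.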

"At what stage in the workflow do we typically construct the GRM: before association testing, after, or in parallel?"

For most software packages, you need to build it before the association testing. Some tools, such as REGENIE, do not need a precomputed GRM because they fit a whole-genome regression model in a first step instead (see the articles referenced for each software package in the third lecture video).

"When building the GRM, do we usually LD-prune SNPs (and if so, how aggressively), or is it acceptable to use all QC-passed SNPs?"

In general, it is standard practice to prune before you make a GRM. Using all QC-passed SNPs is possible (and perhaps even fine), but it adds redundant information to the GRM and slows computation.
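For GCTA, building the GRM from an already pruned fileset might look like the sketch below, which again only builds and prints the command lines. It assumes the executable is called gcta64 (this differs between versions); the file prefixes are hypothetical and the number of PCs is arbitrary. The second command shows how the same GRM can feed GCTA's PCA.

```python
import shlex


def grm_and_pca_commands(pruned_bfile, out_prefix, n_pcs=20, gcta="gcta64"):
    """Return two GCTA commands: (1) build the GRM, (2) run PCA on that GRM."""
    # Step 1: GRM from the pruned PLINK binary fileset.
    make_grm = [gcta, "--bfile", pruned_bfile, "--make-grm", "--out", out_prefix]
    # Step 2: PCA on the GRM; writes <out_prefix>.eigenval and <out_prefix>.eigenvec.
    pca = [gcta, "--grm", out_prefix, "--pca", str(n_pcs), "--out", out_prefix]
    return [make_grm, pca]


for cmd in grm_and_pca_commands("mydata_pca_pruned", "mydata_grm"):
    print(shlex.join(cmd))
```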

"When performing a meta-analysis across cohorts, how are differences in phenotype definitions usually handled? For example, if different cohorts use slightly different case/control definitions or scales."

Phenotype harmonisation is a tricky part of multi-cohort work, and there is no perfect solution. In practice, cohorts do their best to align definitions before running their GWAS (e.g., agreeing on a common case/control threshold or z-scoring a continuous trait), and any residual heterogeneity is monitored via the meta-analysis heterogeneity statistics (e.g., Cochran's Q or I²). If heterogeneity is high, it can be a signal to investigate whether phenotype differences are driving it.
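To make the heterogeneity check concrete, here is a small sketch for a single SNP: a fixed-effect inverse-variance-weighted meta-analysis with Cochran's Q and I². The function name and the toy effect sizes are made up for illustration; in practice the meta-analysis software reports these for you.

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis of one SNP across cohorts.

    Returns the pooled beta, its standard error, Cochran's Q and I^2 (in percent).
    """
    if len(betas) != len(ses) or len(betas) < 2:
        raise ValueError("need matching betas and SEs from at least two cohorts")
    weights = [1.0 / se ** 2 for se in ses]          # w_i = 1 / SE_i^2
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    # Q: weighted squared deviations of cohort effects from the pooled effect.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of variation beyond what chance alone would produce, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return beta, se, q, i2


# Toy example: two cohorts with the same SE but different effect estimates.
beta, se, q, i2 = ivw_meta([0.10, 0.20], [0.05, 0.05])
print(f"beta={beta:.3f} se={se:.4f} Q={q:.2f} I2={i2:.0f}%")
```

Under the null of no heterogeneity, Q follows a chi-square distribution with (number of cohorts - 1) degrees of freedom, which is where the heterogeneity p-value comes from.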

“In this context, is it common to run something like Genomic SEM directly on the summary statistics we receive from each cohort?”

Yes, people often estimate the heritability of cohort-specific GWAS summary statistics and compute genetic correlations between cohorts to see whether they are capturing the same signal. You can also do that with LD score regression, but Genomic SEM works too.
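With LD score regression, the between-cohort genetic correlations could be set up along these lines. This sketch only builds and prints one ldsc.py command per pair of cohorts. It assumes the summary statistics have already been munged into .sumstats.gz files and that a directory of precomputed LD scores matching the ancestry of the cohorts is available; the cohort names, file names and directory are hypothetical.

```python
import shlex
from itertools import combinations


def ldsc_rg_commands(sumstats_by_cohort, ld_dir, ldsc="ldsc.py"):
    """Return one LDSC --rg command for every pair of cohorts."""
    cmds = []
    for a, b in combinations(sorted(sumstats_by_cohort), 2):
        cmds.append([ldsc,
                     "--rg", f"{sumstats_by_cohort[a]},{sumstats_by_cohort[b]}",
                     "--ref-ld-chr", ld_dir,   # LD scores used as the regression predictor
                     "--w-ld-chr", ld_dir,     # LD scores used for the regression weights
                     "--out", f"rg_{a}_{b}"])
    return cmds


cohorts = {
    "cohortA": "cohortA.sumstats.gz",
    "cohortB": "cohortB.sumstats.gz",
    "cohortC": "cohortC.sumstats.gz",
}
for cmd in ldsc_rg_commands(cohorts, "ld_scores/"):
    print(shlex.join(cmd))
```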