About QC of data from different microarrays

Bruna · March 7, 2023, 7:36pm

Is it OK to merge datasets from different microarrays (for example, using the PLINK --bmerge command) before imputation? What additional QC steps should be run to check the quality of the merged dataset (e.g. strand-flip issues)? And what if the merged dataset has only a small number of overlapping SNPs, fewer than 30,000?

daniel.howrigan · March 7, 2023, 9:54pm

It is generally a good idea to merge on the intersection of SNPs across arrays prior to imputation, although you’ll want at least 250k SNPs shared between platforms, so 30K is too few. PLINK --bmerge is quite robust at identifying flipped or mismatched SNPs, so the final merge should work out. Once merged, you should QC the merged dataset again prior to imputation; a GWAS on the merged dataset will show what issues remain (hopefully the QQ plot has little to no inflation). If the different arrays have a decent balance of cases and controls, running a GWAS with chip type as the phenotype will let you identify and filter out chip-specific problematic SNPs. This can also be run after imputation as an additional filter.
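To make the chip-as-phenotype idea concrete, here is a minimal Python sketch of the underlying test: a 1-df allelic chi-square per SNP comparing allele counts between two chips. In practice this would be run in PLINK itself with chip type coded as the phenotype; the function names, the input layout, and the `p_threshold` of 1e-6 are illustrative assumptions, not values from this thread.

```python
import math


def allelic_chi2(a1_chip_a, a2_chip_a, a1_chip_b, a2_chip_b):
    """Return (chi2, p) for a 2x2 table of allele counts (chip x allele)."""
    table = [[a1_chip_a, a2_chip_a], [a1_chip_b, a2_chip_b]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    # A monomorphic SNP or an empty chip carries no information for this test.
    if total == 0 or 0 in row or 0 in col:
        return 0.0, 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


def flag_chip_specific(counts, p_threshold=1e-6):
    """counts maps SNP ID -> (a1_chip_a, a2_chip_a, a1_chip_b, a2_chip_b).

    p_threshold is an assumed cutoff for illustration; choose one that
    suits your sample size and number of SNPs.
    """
    return sorted(
        snp for snp, c in counts.items() if allelic_chi2(*c)[1] < p_threshold
    )
```

SNPs returned by `flag_chip_specific` differ in allele frequency between chips far more than chance would suggest, which makes them candidates for removal before the real association analysis.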

One thing to add: problems with SNP overlap and strand flips are minimal across versions of the same chip (e.g. GSAv1 vs GSAv3) and a lot worse between manufacturers (e.g. Affymetrix vs Illumina).
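Before merging, it can help to count how many SNPs two arrays share and how many of those look flipped. The following is a rough Python sketch, assuming the standard six-column PLINK .bim layout (chromosome, variant ID, cM position, bp position, allele 1, allele 2) and matching by variant ID; the function names and category labels are made up for illustration, and the 250,000 minimum is the figure suggested earlier in this thread.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
MIN_SHARED = 250_000  # rule of thumb quoted in this thread


def read_bim_lines(lines):
    """Parse .bim-style lines into {variant_id: (allele1, allele2)}."""
    variants = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 6:
            variants[fields[1]] = (fields[4].upper(), fields[5].upper())
    return variants


def classify(alleles_a, alleles_b):
    """Compare one SNP's allele pair across two chips, ignoring allele order."""
    a, b = frozenset(alleles_a), frozenset(alleles_b)
    # Anything that is not a biallelic A/C/G/T SNP (e.g. missing '0' or
    # indel codes) is lumped into 'mismatch' in this sketch.
    if len(a) != 2 or len(b) != 2 or not (a | b) <= set(COMPLEMENT):
        return "mismatch"
    flipped_b = frozenset(COMPLEMENT[x] for x in b)
    if a == b:
        # A/T and C/G SNPs look identical on both strands, so the
        # alleles alone cannot reveal a flip.
        return "ambiguous" if a == flipped_b else "match"
    if a == flipped_b:
        return "flip"
    return "mismatch"


def summarize(chip_a, chip_b):
    """Return (n_shared, Counter of categories, enough_overlap)."""
    shared = chip_a.keys() & chip_b.keys()
    tally = Counter(classify(chip_a[s], chip_b[s]) for s in shared)
    return len(shared), tally, len(shared) >= MIN_SHARED
```

A large "flip" or "mismatch" share relative to "match" is the pattern expected when combining chips from different manufacturers. Variant IDs can also differ between platforms, so matching by chromosome and position instead of ID may recover more overlap.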