Simulated Data

Hi all,

Thank you very much for the great session.

I found working with simulated data particularly useful and instructive, especially because it allows you to modify parameters such as heritability, MAF, and LD and observe how the results change.

I was wondering whether there is any script available for simulating genotype data where these parameters can be adjusted by the user.

I completely understand if such a resource is not available, but I thought it would be worth asking.

Thank you so much in advance!

Thanks Juan for the kind feedback! Glad this is helpful. Sure, please feel free to reuse that code and modify it however you like.

Thank you very much Loic, I may not have explained my question very clearly.

What I had in mind was a more general simulation framework for generating a “fake GWAS” dataset, where the user could specify parameters such as:

Number of participants
Number (or fraction) of causal SNPs
Target heritability (h²)
MAF distribution
LD structure
Effect size distribution (e.g. a standard polygenic architecture with normally distributed effects)
Population background / ancestry structure

and then simulate genotype and phenotype data under those assumptions.

I have been looking at the simulate-practical-data.R script, but I was not sure whether it was intended for that level of flexibility, or whether there is another script/resource available for this purpose.

My goal would mainly be to generate synthetic datasets and explore how factors such as sample size, heritability, polygenicity, LD, or population structure influence GWAS results.

Many thanks again for your help.

Hi Juan,

The script that I provided is not as flexible as what you describe. It would take some more work to get it there, but that is possible. The bottleneck is usually how to simulate from a multivariate binomial distribution (genotypes) given an LD matrix and expected allele frequencies. You may want to check the paper "rBahadur: efficient simulation of structured high-dimensional genotype data with applications to assortative mating" (available on PubMed), which proposes a tool to do that. Alternatively, you could just simulate from a multivariate Gaussian distribution to run some exploratory analyses. Finally, what we usually do in my group is run simulations conditional on real genotypes, so that you don’t have to specify the LD and MAF patterns explicitly but instead use ones that exist in human populations. GCTA has a function to do that (see the GCTA page on the Yang Lab website).
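
To make the multivariate Gaussian shortcut concrete, here is a minimal Python sketch, not the course script and not what rBahadur or GCTA do internally. It assumes an AR(1) LD pattern, uniformly distributed MAFs, normally distributed causal effects, and a single homogeneous population; all function names (ar1_ld, simulate_genotypes, simulate_phenotype, marginal_gwas) are made up for illustration. Haplotypes are obtained by thresholding correlated Gaussian draws, so the LD among the resulting genotypes is weaker than the latent correlation you specify.

```python
from statistics import NormalDist

import numpy as np


def ar1_ld(m, rho):
    """Assumed LD structure: latent correlation rho**|i - j| between SNPs i and j."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def simulate_genotypes(n, maf, ld, rng):
    """Threshold two correlated Gaussian 'haplotypes' per person at the MAF quantile."""
    m = len(maf)
    thresholds = np.array([NormalDist().inv_cdf(p) for p in maf])
    chol = np.linalg.cholesky(ld)
    geno = np.zeros((n, m), dtype=np.int8)
    for _ in range(2):  # two haplotypes per individual
        z = rng.standard_normal((n, m)) @ chol.T  # rows have correlation `ld`
        geno += (z < thresholds).astype(np.int8)  # P(z < threshold) = MAF
    return geno


def simulate_phenotype(geno, n_causal, h2, rng):
    """y = g + e, with the genetic value g rescaled to variance h2."""
    n, m = geno.shape
    causal = rng.choice(m, size=n_causal, replace=False)
    beta = np.zeros(m)
    beta[causal] = rng.standard_normal(n_causal)  # normally distributed effects
    g = geno @ beta
    g = (g - g.mean()) / g.std() * np.sqrt(h2)
    e = rng.standard_normal(n) * np.sqrt(1.0 - h2)
    return g + e, g, causal


def marginal_gwas(geno, y):
    """Per-SNP association z-scores, computed as sqrt(n) * correlation(SNP, y)."""
    n = geno.shape[0]
    xs = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    ys = (y - y.mean()) / y.std()
    return xs.T @ ys / np.sqrt(n)


rng = np.random.default_rng(42)
n, m = 2000, 200                        # participants, SNPs
maf = rng.uniform(0.05, 0.5, size=m)    # assumed MAF distribution
geno = simulate_genotypes(n, maf, ar1_ld(m, rho=0.8), rng)
y, g, causal = simulate_phenotype(geno, n_causal=10, h2=0.5, rng=rng)
z = marginal_gwas(geno, y)
print("realized h2:", g.var() / y.var())
print("SNPs with |z| > 4:", np.flatnonzero(np.abs(z) > 4))
print("causal SNPs:", np.sort(causal))
```

Changing n, rho, h2 or n_causal and rerunning shows how the z-scores respond. Population structure is not modelled here, and for realistic LD and MAF patterns simulating conditional on real genotypes remains the better option.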

Cheers,

Loic

Thank you very much Loic. This is extremely helpful!