Definition variables

I had a few questions about definition variables related to different Day 3 videos. I’m moving them to their own separate thread here because it seems like a big enough topic to need a thread to itself.

The questions I originally asked about definition variables were:

  • In the first video, when you do the first diagram for the SNP predicting height (where SNP is a definition variable), what is the circle with no writing in it?
    • The SNP has an arrow pointing from it to the circle, and then beta1 goes from the circle to height.

As answered during the tutorial, the circle is a latent variable that doesn’t have a name.

I still have some more questions about that latent variable below.

  • When/why might it be preferable to do one version of the model or the other (SNP as definition variable vs. SNP as observed variable)?

From what was discussed in the tutorial, it sounds like definition variables exist to let you condition on the value of a variable without including it in the variance/covariance matrix, which lets OpenMx run the model more efficiently. In effect, a definition variable gives you the variance/covariance matrix of the other variables conditional on the definition variable, without the variable itself appearing in the matrix.
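To check my understanding of "conditional on the definition variable", here is a minimal sketch in plain Python (not OpenMx code). The data are invented for illustration, and the names `beta0`, `beta1`, `resid_var` and `minus2ll` are my own. Approach A puts SNP in a 2x2 variance/covariance matrix with height and derives the slope and residual variance from it. Approach B treats SNP as a definition variable: SNP is not modeled at all, each row gets its own expected mean, and only the residual variance of height is estimated.

```python
import math

# Toy data invented for illustration only (not from the tutorial).
snp = [0, 0, 1, 1, 1, 2, 2, 0, 1, 2]            # allele counts
height = [165.0, 168.0, 170.0, 169.0, 172.0,
          175.0, 174.0, 166.0, 171.0, 176.0]     # cm
n = len(snp)


def mean(xs):
    return sum(xs) / len(xs)


# Approach A: SNP as an observed variable, so it sits in the 2x2
# variance/covariance matrix with height (ML estimates, divisor n).
m_s, m_h = mean(snp), mean(height)
var_s = sum((s - m_s) ** 2 for s in snp) / n
var_h = sum((h - m_h) ** 2 for h in height) / n
cov_sh = sum((s - m_s) * (h - m_h) for s, h in zip(snp, height)) / n

beta1 = cov_sh / var_s                    # slope implied by the joint matrix
beta0 = m_h - beta1 * m_s                 # intercept
resid_var = var_h - cov_sh ** 2 / var_s   # var(height | SNP)


# Approach B: SNP as a definition variable. Each row has its own expected
# mean, and the only variance in the model is the residual variance.
def minus2ll(b0, b1, v):
    total = 0.0
    for s, h in zip(snp, height):
        mu_i = b0 + b1 * s                # row-specific expected height
        total += math.log(2 * math.pi * v) + (h - mu_i) ** 2 / v
    return total


best = minus2ll(beta0, beta1, resid_var)

# Nudging any parameter away from the Approach A values worsens the fit,
# so both approaches agree on the slope and the residual variance.
perturbed = [
    minus2ll(beta0 + 0.5, beta1, resid_var),
    minus2ll(beta0 - 0.5, beta1, resid_var),
    minus2ll(beta0, beta1 + 0.5, resid_var),
    minus2ll(beta0, beta1 - 0.5, resid_var),
    minus2ll(beta0, beta1, resid_var * 1.5),
    minus2ll(beta0, beta1, resid_var * 0.5),
]
print(round(beta1, 3), round(resid_var, 3), all(p > best for p in perturbed))
```

If this sketch is right, the definition-variable version estimates fewer parameters (no mean or variance for SNP) but gives the same answer for the part of the model that involves height.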

I still have a question specific to the Height ~ SNP model with definition variables below.

  • On slide 41 (around 24:48 in the part 3 video) why are covariates treated as definition variables rather than as observed variables?

The answer in the tutorial (as I understood it):

When something is treated as a definition variable, it isn’t part of the variance/covariance matrix used in the model. Instead, you effectively get the variance/covariance matrix conditional on the values of the definition variable. This works well when the definition variable is exogenous (not expected to be correlated with the model variables), because it lets you adjust for the variable without having to build a very big and complicated model.
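For reference, this is the standard result for conditioning in a multivariate normal distribution, which is what I take "the variance/covariance matrix conditional on the definition variable" to mean. The notation is my own, not the tutorial's: y is the vector of modeled phenotypes, x is the vector of covariates, and the Sigma terms are the blocks of their joint covariance matrix.

```latex
\begin{aligned}
\mathrm{E}[\mathbf{y} \mid \mathbf{x}] &= \boldsymbol{\mu}_y + \Sigma_{yx}\,\Sigma_{xx}^{-1}\,(\mathbf{x} - \boldsymbol{\mu}_x) \\
\mathrm{Cov}[\mathbf{y} \mid \mathbf{x}] &= \Sigma_{yy} - \Sigma_{yx}\,\Sigma_{xx}^{-1}\,\Sigma_{xy}
\end{aligned}
```

Under this assumption, the conditional mean changes from row to row with the value of x, but the conditional covariance does not depend on x. That would explain why a covariate entered on the means only needs to be supplied per observation and never needs its own row and column in the matrix.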

It also works well for variables like age, on which twins are inherently matched. If you made a multivariate twin model with age as one of the phenotypes, the A/C/E estimates for age would be degenerate (all C, no A or E), because both twins in a pair always have the same age regardless of zygosity.
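A quick way to see the "all C" result is with Falconer's formulas, which estimate standardized A, C and E from the MZ and DZ twin correlations. This is a back-of-the-envelope stand-in for the maximum-likelihood fit OpenMx would do, and the function name `falconer` is my own.

```python
def falconer(r_mz, r_dz):
    """Falconer-style standardized A, C, E from twin correlations."""
    a = 2 * (r_mz - r_dz)   # additive genetic share
    c = 2 * r_dz - r_mz     # shared-environment share
    e = 1 - r_mz            # unique-environment share
    return a, c, e


# Both twins in a pair have the same age, so the twin correlation for age
# is 1 in MZ and DZ pairs alike.
age_a, age_c, age_e = falconer(r_mz=1.0, r_dz=1.0)
print(age_a, age_c, age_e)  # prints: 0.0 1.0 0.0
```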

The parts I still don’t understand:

I think I halfway get it now: definition variables give you a way to condition on the value of a variable without including it in the variance/covariance matrix and making a big, inefficient model. But there are still a few things I’m unsure about:

  1. Why does using a definition variable sometimes require inserting a little unnamed latent variable into the model, as in the Height ~ SNP model?
    • What is the definition variable doing to the latent variable? It’s not setting its mean, because the value of a definition variable differs by observation (unlike a coefficient).
    • Why have the unnamed latent variable in the model? Is anything related to it being estimated?
  2. Was making SNP a definition variable in the Height ~ SNP model in the first video just done to demonstrate definition variables? SNP was the main predictor in this model, so would you normally avoid using a definition variable for it? (Or maybe the main predictor being a definition variable works, because with a straight-arrow path between SNP and Height OpenMx won’t be modeling cov(SNP,height) anyway?)
  3. Definition variables let you get other estimates conditional on the definition variable. But is it only the path the definition variable is in that’s conditional on the definition variable, or does inserting a definition variable make everything conditional on it?
    • This question may be a moot point. If inserting a definition variable for zygosity on the covariance path between A1 and A2 in a univariate twin model affects the estimate of VA, that will indirectly affect the estimates of VC and VE too: the three terms have to sum to V(pheno), so anything that changes VA has to change at least one of the other two.
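To make the zygosity example in question 3 concrete, here is a minimal sketch of how I picture the expected covariance matrix for a twin pair under the usual ACE parameterization, with the A1-A2 correlation read from each row as a definition variable (1.0 for MZ pairs, 0.5 for DZ pairs). This is plain Python rather than OpenMx code; the function name `expected_twin_cov` and the values of `va`, `vc` and `ve` are hypothetical.

```python
def expected_twin_cov(va, vc, ve, r_a):
    """Expected 2x2 covariance matrix for one twin pair.

    r_a is the A1-A2 correlation supplied per row (the definition variable).
    """
    total = va + vc + ve     # V(pheno), the same for both twins
    cross = r_a * va + vc    # twin 1 - twin 2 covariance
    return [[total, cross], [cross, total]]


# Hypothetical variance components, chosen only for illustration.
va, vc, ve = 0.5, 0.2, 0.3

# Each row carries its own value of the definition variable.
rows = [{"zyg": "MZ", "r_a": 1.0}, {"zyg": "DZ", "r_a": 0.5}]
for row in rows:
    sigma = expected_twin_cov(va, vc, ve, row["r_a"])
    print(row["zyg"], sigma)
```

In this sketch the definition variable only enters the off-diagonal element, but VA appears in every element of the matrix, which is the reason I suspect the question may be moot.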