Using SNPknock in R

Matteo Sesia (msesia@stanford.edu)

2019-05-16

This vignette illustrates the usage of the SNPknock package for creating knockoffs of discrete Markov chains and hidden Markov models (Sesia, Sabatti, and Candès 2019). For simplicity, we will use synthetic data.

The SNPknock package also provides a simple interface to the genotype imputation software fastPhase, which can be used to fit hidden Markov models for genotype data. Since fastPhase is not available as an R package, this particular functionality of SNPknock cannot be demonstrated here. A tutorial showing how to use a combination of SNPknock and fastPhase to create knockoffs of genotype data can be found here.

Knockoffs of discrete Markov chains

First, we verify that the SNPknock can be loaded.

library(SNPknock)

We define a Markov chain model with 50 variables, each taking one of 5 possible values. We specify a uniform marginal distribution for the first variable in the chain and create 49 transition matrices with randomly sampled entries.

p=50; # Number of variables in the model
K=5;  # Number of possible states for each variable
# Marginal distribution for the first variable
pInit = rep(1/K,K)
# Create p-1 transition matrices
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
for(j in 1:(p-1)) {
  Q[j,,] = Q[j,,] + diag(rep(1,K))
  Q[j,,] = Q[j,,] / rowSums(Q[j,,])
}

We can sample 100 independent observations of this Markov chain using the SNPknock package.

set.seed(1234)
X = sampleDMC(pInit, Q, n=100)
print(X[1:5,1:10])
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    3    2    3    3    2    0    0     2
## [2,]    3    3    3    4    2    3    1    2    3     2
## [3,]    3    1    1    1    3    3    1    3    2     3
## [4,]    3    0    3    3    3    0    1    1    1     2
## [5,]    4    1    1    4    3    3    2    2    2     3

Above, each row of X contains an independent realization of the Markov chain.

A knockoff copy of X can be sampled as follows.

Xk = knockoffDMC(X, pInit, Q)
print(Xk[1:5,1:10])
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    3    3    3    3    3    3    2    0    0     2
## [2,]    3    3    3    3    0    0    2    3    2     2
## [3,]    0    1    1    2    3    1    1    4    3     2
## [4,]    3    3    3    4    0    1    1    1    1     1
## [5,]    0    0    1    2    3    3    2    2    4     2

Knockoffs of hidden Markov models

We define a hidden Markov chain model with 50 variables, each taking one of 3 possible values, and 50 latent variables distributed as a Markov chain, each taking one of 5 values. We specify a uniform marginal distribution for the first variable in the chain and create 49 transition matrices with randomly sampled entries. We also define an emission distributions over the 3 possible emission states, for each of the 50 latent variables and the 5 latent states.

p=200; # Number of variables in the model
K=5;  # Number of possible states for each variable
M=3;  # Number of possible emission states for each variable
# Marginal distribution for the first variable
pInit = rep(1/K,K)
# Create p-1 transition matrices
Q = array(stats::runif((p-1)*K*K),c(p-1,K,K))
rho = stats::runif(p-1, min = 1, max = 50)
for(j in 1:(p-1)) {
  Q[j,,] = Q[j,,] + rho[j] * diag(rep(1,K))
  Q[j,,] = Q[j,,] / rowSums(Q[j,,])
}
# Create p emission matrices
pEmit = array(stats::runif(p*M*K),c(p,M,K))
for(j in 1:p) { pEmit[j,,] = sweep(pEmit[j,,],2,colSums(pEmit[j,,]),`/`) }

We can sample 100 independent observations of this hidden Markov model using the SNPknock package.

set.seed(1234)
X = sampleHMM(pInit, Q, pEmit, n=100)
print(X[1:5,1:10])
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    2    2    2    0    1    1    2    0     0
## [2,]    2    1    2    0    1    2    0    0    0     0
## [3,]    2    2    0    1    0    1    2    0    1     2
## [4,]    1    0    2    2    0    0    2    0    1     0
## [5,]    1    0    2    2    2    1    2    2    2     1

Above, each row of X contains an independent realization of the hidden Markov model.

A knockoff copy of X can be sampled as follows.

Xk = knockoffHMM(X, pInit, Q, pEmit)
print(Xk[1:5,1:10])
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    2    2    0    2    1    1    2     0
## [2,]    0    0    1    0    0    2    1    0    2     1
## [3,]    1    1    2    1    0    1    1    0    2     2
## [4,]    1    0    2    2    0    1    0    1    1     1
## [5,]    1    0    1    2    2    2    2    2    2     2

See also

If you want to see how to use SNPknock to create knockoffs of genotype data, see the genotypes vignette.

If you want to learn about SNPknock for large genome-wide association studies (Sesia et al. 2019), see https://msesia.github.io/knockoffzoom/

Sesia, M., E. Katsevich, S. Bates, E. Candès, and C. Sabatti. 2019. “Multi-Resolution Localization of Causal Variants Across the Genome.” bioRxiv. Cold Spring Harbor Laboratory. https://doi.org/10.1101/631390.

Sesia, M., C. Sabatti, and E. J. Candès. 2019. “Gene Hunting with Hidden Markov Model Knockoffs.” Biometrika 106:1–18. https://doi.org/10.1093/biomet/asy033.