fusedlasso_inf.Rd
This functions tests the null hypothesis of no difference in means between
connected components c1
and c2
of the output of the graph fused
lasso solution. The ordering are numbered as per the results of the fusedlasso
function in the genlasso
package.
fusedlasso_inf( y, D, c1, c2, method, sigma, K = NULL, L = NULL, early_stop = NULL, compute_ci = FALSE, alpha_level = 0.05 )
y | Numeric vector; \(n\) dimensional observed data |
---|---|
D | Numeric matrix; \(m\) by \(n\) penalty matrix, i.e., the oriented incidence matrix over the underlying graph |
c1, c2 | Integers selecting the two connected components to test, as indexed by the results of
|
method | One of "K" or "CC", which indicates which conditioning set to use |
sigma | Numeric; noise standard deviation for the observed data, a non-negative number. |
K | Integer; number of steps to run the dual-path algorithm. It must be specified if method=="K". |
L | Integer; the targeted number of connected components. It must be specified if method=="CC". |
early_stop | Numeric; specify when the truncation set computation should be terminated. The default is NULL, which indicates infinity. |
compute_ci | Logical; the default is False. Specifying whether confidence intervals for \(\nu^{T}\beta\), the difference in means between the two estimated connected components, should be computed. |
alpha_level | Numeric; parameter for the 1- |
Returns a list with elements:
pval
the p-value in Chen et al. (2021+)
truncation_set
the conditioning set of Chen et al. (2021+) stored as Intervals
class
test_stats
test statistics: the difference in means of two connected components
beta_hat
Graph fused lasso estimates
connected_comp
Estimated connected component
Naive
the naive p-value using a z-test
Hyun
the p-value proposed in Hyun et al. (2018)
hyun_set
the conditioning set of Hyun et al. (2018) stored as Intervals
class
CI_result
confidence interval of level 1-alpha_level
if compute_ci=TRUE
Currently, we support two different conditioning sets: conditioning set 1 is based on the output after K steps dual-path algorithm; and conditioning set 2 is based on the output of the after the dual-path algorithm yields c connected components in the output.
Input:
Consider the generative model \(Y_j = \beta_j + \epsilon_j, \epsilon_j \sim N(0, \sigma^2). j=1,...,n\), where the underlying signal \(\beta\) is assumed to be piecewise constant with respect to an underlying graph. The fused lasso estimate minimizes the following objective function $$minimize_{\beta} \frac{1}{2} \sum_{j=1}^{n} ( y_j - \beta_j )^2 + \lambda \sum_{(i,j)\in E}|\beta_i-\beta_j|,$$ where E is the edge set of the underlying graph. The solution \(\hat{\beta}\) can then be segment into connected components; that is, the set of \(\hat{\beta}\) that takes on the same value, and are connected in the original graph.
Now suppose we want to test whether the means of two estimated connected components c1
and c2
are equal; or equivalently, the null hypothesis of the form \(H_{0}: \nu^T \beta = 0\) versus
\(H_{1}: \nu^T \beta \neq 0\) for suitably chosen \(\nu\).
This function computes the following p-value: $$P(|\nu^T Y| \ge |\nu^T y| \; | \; \hat{C}_1, \hat{C}_2 \in CC_K(Y), \Pi_\nu^\perp Y = \Pi_\nu^\perp y),$$ where \(CC_K(Y)\) is the set of estimated connected components from applying K steps of the dual path algorithm on data Y , and \(\Pi_\nu^\perp\) is the orthogonal projection to the orthogonal complement of \(\nu\). In particular, the test based on this p-value controls the selective Type I error and has higher power than an existing method by Hyun et al. (2018). Readers can refer to the Section 3 in Chen et al. (2021+) for more details.
Chen YT, Jewell SW, Witten DM. (2022+) More powerful selective inference for the graph fused lasso. arXiv preprint. https://arxiv.org/abs/2109.10451.
Hyun S, G’Sell M, Tibshirani RJ. (2018) Exact post-selection inference for the generalized lasso path. Electron J Stat.
lev1 <- 0 # mean for group 1 lev2 <- 3 # mean (absolute value) for group 2/3 sigma <- 1 # level of noise nn <- 8 # grid size Dmat <- genlasso::getD2d(nn, nn) # generate D matrix for the 2D fused lasso ### Create the underlying signal A <- matrix(lev1, ncol=nn, nrow = nn) A[1:round(nn/3),1:round(nn/3)] <- 1*lev2 A[(nn-2):(nn),(nn-2):(nn)] <- -1*lev2 ### Visualize the underlying signal lattice::levelplot(A) set.seed(2005) A.noisy <- A + rnorm(nn^2,mean=0,sd=sigma) y <- c(t(A.noisy)) ### Now use the fusedlasso function to obtain estimated connected components after K=13 ### steps of the dual path algorithm K = 13 complete_sol <- genlasso::fusedlasso(y=y,D=Dmat,maxsteps=K) beta_hat <- complete_sol$beta[,K] ### estimated connected components estimated_CC <- complete_sol$pathobjs$i estimated_CC #> [1] 2 2 2 3 3 3 3 3 2 2 2 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 #> [39] 3 3 3 3 3 3 3 1 1 1 3 3 3 3 3 1 1 1 3 3 3 3 3 1 1 1 ### Run a test for a difference in means between estimated connected components 1 and 2 result_demo <- fusedlasso_inf(y=y, D=Dmat, c1=1, c2=2, method="K", sigma=sigma, K=K) summary(result_demo) #> CC_1 CC_2 test_stats pval_c1c2 pval_hyun LCB UCB #> 1 1 2 -6.515453 6.854142e-30 0.1772895 NA NA