In this tutorial, we demonstrate basic use of the CADET
package for testing for a difference in feature means between clusters
obtained via hierarchical clustering.
First we load relevant packages:
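The original package-loading chunk is not shown; a plausible set for this tutorial (assuming CADET is installed, with ggplot2 for the figures and fastcluster for hierarchical clustering) would be:

```r
# Assumed packages: CADET provides the selective tests used below;
# ggplot2 is used for plotting; fastcluster supplies a faster
# drop-in replacement for stats::hclust().
library(CADET)
library(ggplot2)
library(fastcluster)
```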
We first generate data according to \(\mathbf{X} \sim {MN}_{n\times q}(\boldsymbol{\mu}, \textbf{I}_n, \sigma^2 \textbf{I}_q)\) with \(n=150,q=2,\sigma=1,\) and \[\begin{align} \label{eq:power_model} \boldsymbol{\mu}_1 =\ldots = \boldsymbol{\mu}_{50} = \begin{bmatrix} \delta/2 \\ 0_{q-1} \end{bmatrix}, \; {\boldsymbol\mu}_{51}=\ldots = \boldsymbol{\mu}_{100} = \begin{bmatrix} 0_{q-1} \\ \sqrt{3}\delta/2 \end{bmatrix} ,\; \boldsymbol{\mu}_{101}=\ldots = \boldsymbol{\mu}_{150} = \begin{bmatrix} -\delta/2 \\ 0_{q-1} \end{bmatrix}. \end{align}\] Here, we can think of \({C}_1 = \{1,\ldots,50\},{C}_2 = \{51,\ldots,100\},{C}_3 = \{101,\ldots,150\}\) as the “true clusters”. In the figure below, we display one such simulated data set \(\mathbf{x}\in\mathbb{R}^{150\times 2}\) with \(\delta=10\).
set.seed(2022)
n <- 150 # number of observations
q <- 2 # number of features
true_clusters <- c(rep(1, 50), rep(2, 50), rep(3, 50))
delta <- 10 # separation between the cluster means
sig <- 1 # noise standard deviation
# Cluster means, one row per true cluster
mu <- rbind(
  c(delta / 2, rep(0, q - 1)),
  c(rep(0, q - 1), sqrt(3) * delta / 2),
  c(-delta / 2, rep(0, q - 1))
)
X <- matrix(rnorm(n * q, sd = sig), n, q) + mu[true_clusters, ]
ggplot(data.frame(X), aes(x = X1, y = X2)) +
  geom_point(cex = 2) +
  xlab("Feature 1") +
  ylab("Feature 2") +
  theme_classic(base_size = 18)
In the code below, we call the hclust
function in the
fastcluster
package to estimate clusters using average
linkage on squared Euclidean distances. In the figure below, observations are colored by the clusters
obtained by cutting the dendrogram at \(K=3\).
k <- 3
hcl <- hclust(dist(X, method = "euclidean")^2, method = "average")
estimated_clusters <- cutree(hcl, k)
ggplot(data.frame(X), aes(x = X1, y = X2, col = as.factor(estimated_clusters))) +
  geom_point(cex = 2) +
  xlab("Feature 1") +
  ylab("Feature 2") +
  theme_classic(base_size = 18) +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("dodgerblue3", "rosybrown", "orange"))
In this section, we demonstrate how to use our software to obtain
\(p\)-values for testing for a
difference in the means of a single feature between a pair of clusters
identified via hierarchical clustering (using average linkage as an
example). Consider testing for a difference in the first
feature (x-axis) between the blue cluster (labeled 1 in
estimated_clusters
) and the pink cluster (labeled 3 in
estimated_clusters
).
The code below demonstrates how to use the function
test_hier_clusters_exact_1f
, which performs inference on
the two specified estimated clusters. After obtaining the inferential
result, we call the summary method to obtain a summary of the results
in the form of a data frame.
cluster_1 <- 1
cluster_2 <- 3
cl_inference_demo <- test_hier_clusters_exact_1f(X = X, link = "average", hcl = hcl, K = 3, k1 = cluster_1, k2 = cluster_2, feat = 1)
summary(cl_inference_demo)
#> cluster_1 cluster_2 test_stat p_selective p_naive
#> 1 1 3 9.910708 2.868783e-26 4.596766e-31
In the summary, we have the empirical difference in means of the
first feature between the two clusters, i.e., \(\sum_{i\in
{\hat{{G}}}}\mathbf{x}_{i,1}/|\hat{{G}}| - \sum_{i\in
\hat{G}'}\mathbf{x}_{i,1}/|\hat{G}'|\)
(test_stat
), the selective \(p\)-value (p_selective
), and the naive \(p\)-value based on a z-test
(p_naive
). In
this case, the test based on \(p_{1,\text{selective}}\) rejects the
null hypothesis that the blue and pink clusters have the same mean in
the first feature (\(p_{1,\text{selective}}<0.001\)).
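As a sanity check, the reported test statistic can be reproduced directly from the data; this short snippet assumes X and estimated_clusters from the earlier chunks, and the comparison holds up to the package's sign convention for the pair (k1, k2):

```r
# Empirical difference in means of feature 1 between clusters 1 and 3
diff_means <- mean(X[estimated_clusters == 1, 1]) -
  mean(X[estimated_clusters == 3, 1])
diff_means # should agree with the test_stat column of the summary
```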
In this section, we demonstrate that our proposed \(p\)-value yields reasonable results when the null hypothesis does hold. Consider the same data and clustering assignment as before: there is no difference in means in the second feature between the blue cluster (labeled 1) and the pink cluster (labeled 3).
As the output below shows, the selective \(p\)-value is now large, and the test based on it correctly fails to reject the null hypothesis when it holds. (In this particular case the naive \(p\)-value happens to coincide with the selective one; in general, however, the naive \(p\)-value ignores the fact that the clusters were estimated from the same data, and the resulting test can be badly anti-conservative.)
cluster_1 <- 1
cluster_2 <- 3
cl_inference_demo <- test_hier_clusters_exact_1f(
X = X, link = "average",
hcl = hcl, K = 3, k1 = cluster_1, k2 = cluster_2, feat = 2
)
summary(cl_inference_demo)
#> cluster_1 cluster_2 test_stat p_selective p_naive
#> 1 1 3 -0.1766818 0.8362984 0.8362984