Title: | Sparse Principal Component Analysis via Random Projections (SPCAvRP) |
---|---|
Description: | Implements the SPCAvRP algorithm, developed and analysed in "Sparse principal component analysis via random projections" Gataric, M., Wang, T. and Samworth, R. J. (2018) <arXiv:1712.05630>. The algorithm is based on the aggregation of eigenvector information from carefully-selected random projections of the sample covariance matrix. |
Authors: | Milana Gataric, Tengyao Wang and Richard J. Samworth |
Maintainer: | Milana Gataric <[email protected]> |
License: | GPL-3 |
Version: | 0.4 |
Built: | 2024-11-02 02:55:57 UTC |
Source: | https://github.com/cran/SPCAvRP |
Computes an l-sparse leading eigenvector of the sample covariance matrix, using A x B random axis-aligned projections of dimension d. For multiple-component estimation, use SPCAvRP_subspace or SPCAvRP_deflation.
SPCAvRP(data, cov = FALSE, l, d = 20, A = 600, B = 200, center_data = TRUE, parallel = FALSE, cluster_type = "PSOCK", cores = 1, machine_names = NULL)
data | Either the data matrix (n x p) or the sample covariance matrix (p x p). |
cov | TRUE if data is given as the sample covariance matrix, FALSE (default) if it is given as the data matrix. |
l | Desired sparsity level in the final estimator (see Details). |
d | The dimension of the random projections (see Details). |
A | Number of projections over which to aggregate (see Details). |
B | Number of projections in a group from which to select (see Details). |
center_data | TRUE (default) if the data matrix should be centered (see Details). |
parallel | TRUE if the computation should be parallelised (see Details); FALSE by default. |
cluster_type | If parallel == TRUE, the type of cluster to create: "PSOCK" (default) or "FORK" (cf. the parallel package). |
cores | If parallel == TRUE, the number of cores to use; default 1. |
machine_names | If parallel == TRUE, the names of the machines on the network to use; default NULL (the local machine). |
This function implements the SPCAvRP algorithm for principal component estimation (Algorithm 1 in the reference given below).
If the true sparsity level k is known, use l = k and d = k.
If the true sparsity level k is unknown, l can take an array of different values, in which case estimators for each of the corresponding sparsity levels are computed. The final choice of l can then be made by inspecting the explained variance returned in the output value, or by inspecting the output importance_scores. The default choice for d is 20, but we suggest choosing d equal to or slightly larger than l.
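To make the selection-and-aggregation step concrete, here is a minimal sketch in R of the scheme as described verbally above. It is an illustration only: the function name spcavrp_sketch is made up, and the exact selection rule and aggregation weights are assumptions that differ from the package's implementation in details.

# Sketch only: select one of B random axis-aligned projections per group
# by largest leading eigenvalue, aggregate over A groups, then compute
# the leading eigenvector restricted to the l highest-scoring variables.
spcavrp_sketch <- function(S, l, d, A, B) {
  p <- nrow(S)
  w <- numeric(p)                       # importance scores, one per variable
  for (a in seq_len(A)) {
    best_val <- -Inf; best_idx <- NULL; best_vec <- NULL
    for (b in seq_len(B)) {
      idx <- sort(sample(p, d))         # random axis-aligned projection
      e <- eigen(S[idx, idx], symmetric = TRUE)
      if (e$values[1] > best_val) {     # keep the projection with the
        best_val <- e$values[1]         # largest leading eigenvalue
        best_idx <- idx
        best_vec <- e$vectors[, 1]
      }
    }
    # aggregate eigenvector information from the selected projection
    w[best_idx] <- w[best_idx] + best_val * abs(best_vec) / A
  }
  top <- order(w, decreasing = TRUE)[seq_len(l)]  # l most important variables
  v <- numeric(p)
  v[top] <- eigen(S[top, top], symmetric = TRUE)$vectors[, 1]
  list(vector = v, importance_scores = w)
}

Given a sample covariance matrix S, spcavrp_sketch(S, l, d, A, B) returns an l-sparse vector supported on the variables with the highest aggregated scores.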
It is desirable to choose A (and B = ceiling(A/3)) as large as possible, subject to the computational budget. In general, we suggest using A = 300 and B = 100 when the dimension of the data is a few hundred, and A = 600 and B = 200 when the dimension is of order 1000.
If center_data == TRUE and data is given as a data matrix, the first step is to center it by executing scale(data, center_data, FALSE), which subtracts the column means of data from their corresponding columns.
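As a quick base-R illustration of this centering step (the toy matrix X below is not package data):

X <- matrix(rnorm(20), nrow = 5)
Xc <- scale(X, center = TRUE, scale = FALSE)  # subtract column means, no scaling
all.equal(Xc, sweep(X, 2, colMeans(X)), check.attributes = FALSE)  # TRUE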
If parallel == TRUE, the parallelised SPCAvRP algorithm is used. We recommend using this option when p, A and B are very large.
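A hedged sketch of such a call, commented out in the style of the examples below; the argument names are those documented above, while the particular values are illustrative assumptions rather than recommendations:

## sketch only: parallelised variant on a local "PSOCK" cluster
#spcavrp.par <- SPCAvRP(data = X, cov = FALSE, l = 10, d = 10,
#                       A = 600, B = 200, parallel = TRUE,
#                       cluster_type = "PSOCK", cores = 2)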
Returns a list of three elements:
vector | A matrix of dimension p x length(l) whose columns are the estimated eigenvectors, one for each sparsity level in l. |
value | An array of length(l) estimated eigenvalues (the explained variance corresponding to each sparsity level in l). |
importance_scores | An array of length p with importance scores for each variable 1 to p. |
Milana Gataric, Tengyao Wang and Richard J. Samworth
Milana Gataric, Tengyao Wang and Richard J. Samworth (2018) Sparse principal component analysis via random projections. https://arxiv.org/abs/1712.05630
library(MASS)  # for mvrnorm

p <- 100                                 # data dimension
k <- 10                                  # true sparsity level
n <- 1000                                # number of observations
v1 <- c(rep(1/sqrt(k), k), rep(0, p-k))  # true principal component
Sigma <- 2*tcrossprod(v1) + diag(p)      # population covariance
mu <- rep(0, p)                          # population mean
loss <- function(u, v){                  # the loss function
  sqrt(abs(1 - sum(v*u)^2))
}
set.seed(1)
X <- mvrnorm(n, mu, Sigma)               # data matrix

spcavrp <- SPCAvRP(data = X, cov = FALSE, l = k, d = k, A = 200, B = 70)
spcavrp.loss <- loss(v1, spcavrp$vector)
print(paste0("estimation loss when l=d=k=10, A=200, B=70: ", spcavrp.loss))

## choosing the sparsity level l if k is unknown:
#spcavrp.choosel <- SPCAvRP(data = X, cov = FALSE, l = c(1:30), d = 15, A = 200, B = 70)
#plot(1:p, spcavrp.choosel$importance_scores, xlab = 'variable', ylab = 'w',
#     main = 'choosing l when k unknown: \n importance scores w')
#plot(1:30, spcavrp.choosel$value, xlab = 'l', ylab = 'Var_l',
#     main = 'choosing l when k unknown: \n explained variance Var_l')
Computes m leading eigenvectors of the sample covariance matrix which are sparse and orthogonal, using the modified deflation scheme in conjunction with the SPCAvRP algorithm.
SPCAvRP_deflation(data, cov = FALSE, m, l, d = 20, A = 600, B = 200, center_data = TRUE)
data | Either the data matrix (n x p) or the sample covariance matrix (p x p). |
cov | TRUE if data is given as the sample covariance matrix, FALSE (default) if it is given as the data matrix. |
m | The number of principal components to estimate. |
l | An array of length m with the desired sparsity levels for each of the m components (see Details). |
d | The dimension of the random projections (see Details). |
A | Number of projections over which to aggregate (see Details). |
B | Number of projections in a group from which to select (see Details). |
center_data | TRUE (default) if the data matrix should be centered (see Details). |
This function implements the modified deflation scheme in conjunction with SPCAvRP (Algorithm 2 in the reference given below).
If the true sparsity level is known and is equal to k for each component, use d = k and l = rep(k, m). The sparsity levels of different components may take different values. If k is unknown, an appropriate value can be chosen from an array of candidates by inspecting the explained variance one component at a time, using SPCAvRP in combination with the deflation scheme implemented in SPCAvRP_deflation.
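For intuition only, here is a minimal sketch of classical projection deflation; Algorithm 2 in the reference modifies this scheme, so the helper deflate below illustrates the general idea, not the package's exact procedure.

# Classical projection deflation: remove an estimated component vhat
# from a covariance matrix S before estimating the next component.
deflate <- function(S, vhat) {
  P <- diag(nrow(S)) - tcrossprod(vhat)  # projector onto the complement of vhat
  P %*% S %*% P                          # deflated covariance matrix
}

Estimating components one at a time would then alternate between a sparse PCA step on S and S <- deflate(S, vhat).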
It is desirable to choose A (and B = ceiling(A/3)) as large as possible, subject to the computational budget. In general, we suggest using A = 300 and B = 100 when the dimension of the data is a few hundred, and A = 600 and B = 200 when the dimension is of order 1000.
If center_data == TRUE and data is given as a data matrix, the first step is to center it by executing scale(data, center_data, FALSE), which subtracts the column means of data from their corresponding columns.
Returns a list of two elements:
vector | A matrix whose m columns are the estimated eigenvectors. |
value | An array with the m corresponding estimated eigenvalues. |
Milana Gataric, Tengyao Wang and Richard J. Samworth
Milana Gataric, Tengyao Wang and Richard J. Samworth (2018) Sparse principal component analysis via random projections. https://arxiv.org/abs/1712.05630
library(MASS)  # for mvrnorm

p <- 50   # data dimension
k <- 8    # true sparsity of each component
v1 <- 1/sqrt(k)*c(rep(1, k), rep(0, p-k))                         # first principal component (PC)
v2 <- 1/sqrt(k)*c(rep(0,4), 1, -1, 1, -1, rep(1,4), rep(0,p-12))  # 2nd PC
v3 <- 1/sqrt(k)*c(rep(0,6), 1, -rep(1,4), rep(1,3), rep(0,p-14))  # 3rd PC
Sigma <- diag(p) + 40*tcrossprod(v1) + 20*tcrossprod(v2) +
  5*tcrossprod(v3)  # population covariance
mu <- rep(0, p)     # population mean
n <- 2000           # number of observations
loss <- function(u, v){
  sqrt(abs(1 - sum(v*u)^2))
}
loss_sub <- function(U, V){
  U <- qr.Q(qr(U)); V <- qr.Q(qr(V))
  norm(tcrossprod(U) - tcrossprod(V), "2")
}
set.seed(1)
X <- mvrnorm(n, mu, Sigma)  # data matrix

spcavrp.def <- SPCAvRP_deflation(data = X, cov = FALSE, m = 2, l = rep(k,2),
                                 d = k, A = 200, B = 70, center_data = FALSE)
subspace_estimation <- data.frame(
  loss_sub(matrix(c(v1,v2), ncol = 2), spcavrp.def$vector),
  loss(spcavrp.def$vector[,1], v1),
  loss(spcavrp.def$vector[,2], v2),
  crossprod(spcavrp.def$vector[,1], spcavrp.def$vector[,2]))
colnames(subspace_estimation) <- c("loss_sub","loss_v1","loss_v2","inner_prod")
rownames(subspace_estimation) <- c("")
print(subspace_estimation)
Computes m leading eigenvectors of the sample covariance matrix which are sparse and orthogonal, using A x B random axis-aligned projections of dimension d.
SPCAvRP_subspace(data, cov = FALSE, m, l, d = 20, A = 600, B = 200, center_data = TRUE)
data | Either the data matrix (n x p) or the sample covariance matrix (p x p). |
cov | TRUE if data is given as the sample covariance matrix, FALSE (default) if it is given as the data matrix. |
m | The dimension of the eigenspace, i.e. the number of principal components to compute. |
l | Desired sparsity level of the eigenspace, i.e. the number of non-zero rows in the returned matrix of eigenvectors (see Details). |
d | The dimension of the random projections (see Details). |
A | Number of projections over which to aggregate (see Details). |
B | Number of projections in a group from which to select (see Details). |
center_data | TRUE (default) if the data matrix should be centered (see Details). |
This function implements the SPCAvRP algorithm for eigenspace estimation (Algorithm 3 in the reference given below).
If the true sparsity level k of the eigenspace is known, use l = k and d = k.
If the true sparsity level k of the eigenspace is unknown, an appropriate choice of l can be made, for example, by running the algorithm (for any l) and inspecting the computed output importance_scores. The default choice for d is 20, but we suggest choosing d equal to or slightly larger than l.
It is desirable to choose A (and B = ceiling(A/3)) as large as possible, subject to the computational budget. In general, we suggest using A = 300 and B = 100 when the dimension of the data is a few hundred, and A = 600 and B = 200 when the dimension is of order 1000.
If center_data == TRUE and data is given as a data matrix, the first step is to center it by executing scale(data, center_data, FALSE), which subtracts the column means of data from their corresponding columns.
Returns a list of three elements:
vector | A matrix whose m columns are the estimated eigenvectors. |
value | An array with the m corresponding estimated eigenvalues. |
importance_scores | An array of length p with importance scores for each variable 1 to p. |
Milana Gataric, Tengyao Wang and Richard J. Samworth
Milana Gataric, Tengyao Wang and Richard J. Samworth (2018) Sparse principal component analysis via random projections. https://arxiv.org/abs/1712.05630
library(MASS)  # for mvrnorm

p <- 50   # data dimension
k1 <- 8   # sparsity of each individual component
v1 <- 1/sqrt(k1)*c(rep(1, k1), rep(0, p-k1))                       # first principal component (PC)
v2 <- 1/sqrt(k1)*c(rep(0,4), 1, -1, 1, -1, rep(1,4), rep(0,p-12))  # 2nd PC
v3 <- 1/sqrt(k1)*c(rep(0,6), 1, -rep(1,4), rep(1,3), rep(0,p-14))  # 3rd PC
Sigma <- diag(p) + 40*tcrossprod(v1) + 20*tcrossprod(v2) +
  5*tcrossprod(v3)  # population covariance
mu <- rep(0, p)     # population mean
n <- 2000           # number of observations
loss <- function(u, v){
  sqrt(abs(1 - sum(v*u)^2))
}
loss_sub <- function(U, V){
  U <- qr.Q(qr(U)); V <- qr.Q(qr(V))
  norm(tcrossprod(U) - tcrossprod(V), "2")
}
set.seed(1)
X <- mvrnorm(n, mu, Sigma)  # data matrix

spcavrp.sub <- SPCAvRP_subspace(data = X, cov = FALSE, m = 2, l = 12, d = 12,
                                A = 200, B = 70, center_data = FALSE)
subspace_estimation <- data.frame(
  loss_sub(matrix(c(v1,v2), ncol = 2), spcavrp.sub$vector),
  loss(spcavrp.sub$vector[,1], v1),
  loss(spcavrp.sub$vector[,2], v2),
  crossprod(spcavrp.sub$vector[,1], spcavrp.sub$vector[,2]))
colnames(subspace_estimation) <- c("loss_sub","loss_v1","loss_v2","inner_prod")
rownames(subspace_estimation) <- c("")
print(subspace_estimation)

plot(1:p, spcavrp.sub$importance_scores, xlab = 'variable', ylab = 'w',
     main = 'importance scores w \n (may use to choose l when k unknown)')