Title: | Nested Cross-Validation with 'glmnet' and 'caret' |
---|---|
Description: | Implements nested k*l-fold cross-validation for lasso and elastic-net regularised linear models via the 'glmnet' package and other machine learning models via the 'caret' package <doi:10.1093/bioadv/vbad048>. Cross-validation of 'glmnet' alpha mixing parameter and embedded fast filter functions for feature selection are provided. Described as double cross-validation by Stone (1977) <doi:10.1111/j.2517-6161.1977.tb01603.x>. Also implemented is a method using outer CV to measure unbiased model performance metrics when fitting Bayesian linear and logistic regression shrinkage models using the horseshoe prior over parameters to encourage a sparse model as described by Piironen & Vehtari (2017) <doi:10.1214/17-EJS1337SI>. |
Authors: | Myles Lewis [aut, cre] |
Maintainer: | Myles Lewis <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.7.13 |
Built: | 2025-02-17 11:25:49 UTC |
Source: | https://github.com/myles-lewis/nestedcv |
Produces a ggplot2 plot of the stability (as SEM) of variable importance across models trained and tested on the outer CV folds. Optionally overlays directionality for binary response or regression outcomes.
barplot_var_stability( x, final = TRUE, top = NULL, direction = 0, dir_labels = NULL, scheme = c("royalblue", "red"), breaks = NULL, percent = TRUE, level = 1, sort = TRUE )
x |
a |
final |
Logical whether to restrict variables to only those which ended up in the final fitted model or to include all variables selected across all outer folds. |
top |
Limits number of variables plotted. Set to |
direction |
Integer controlling plotting of directionality for binary or
regression models. |
dir_labels |
Character vector for controlling the legend when
|
scheme |
Vector of 2 colours for directionality when |
breaks |
Vector of continuous breaks for legend colour/size |
percent |
Logical for |
level |
For multinomial |
sort |
Logical whether to sort by mean variable importance. Passed to
|
A ggplot2 plot
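The original page has no example for this function. A minimal sketch (not from the package examples) using simulated data with a weak signal; the variable names and settings are illustrative and assume barplot_var_stability() accepts a fitted "nestcv.glmnet" object:

library(nestedcv)
set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("var", 1:20)
y <- factor(rbinom(100, 1, plogis(x[, 1] - x[, 2])))   # binary outcome with signal
fit <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1, n_outer_folds = 3)
## stability of variable importance across the 3 outer folds, top 10 variables
barplot_var_stability(fit, top = 10)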
Randomly samples predictors and averages the ranking to give an ensemble measure of predictor variable importance.
boot_filter(y, x, filterFUN, B = 50, nfilter = NULL, type = "index", ...)
y |
Response vector |
x |
Matrix of predictors |
filterFUN |
Filter function, e.g. |
B |
Number of times to bootstrap |
nfilter |
Number of predictors to return |
type |
Type of vector returned. Default "index" returns indices, "full" returns full output. |
... |
Optional arguments passed to the function specified by |
Integer vector of indices of filtered parameters (type = "index"), or if type = "full", a matrix of rankings from each bootstrap is returned.
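A minimal sketch (not from the original page) of bootstrapped ensemble filtering with a t-test filter; the simulated data and settings are illustrative:

library(nestedcv)
set.seed(2)
x <- matrix(rnorm(100 * 50), 100, 50)
colnames(x) <- paste0("gene", 1:50)
y <- factor(rbinom(100, 1, 0.5))
## indices of the top 10 predictors, ranking averaged over 20 bootstrap resamples
idx <- boot_filter(y, x, filterFUN = ttest_filter, B = 20, nfilter = 10)
x_filtered <- x[, idx]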
Randomly samples predictors and averages the ranking from filtering functions including ttest_filter(), wilcoxon_filter(), anova_filter(), correl_filter() and lm_filter() to give an ensemble measure of best predictors by repeated random sampling subjected to a statistical test.
boot_ttest(y, x, B = 50, ...) boot_wilcoxon(y, x, B = 50, ...) boot_anova(y, x, B = 50, ...) boot_correl(y, x, B = 50, ...) boot_lm(y, x, B = 50, ...)
y |
Response vector |
x |
Matrix of predictors |
B |
Number of times to bootstrap |
... |
Optional arguments passed to the filter function |
Integer vector of indices of filtered parameters (type = "index"), or if type = "full", a matrix of rankings from each bootstrap is returned.
ttest_filter(), wilcoxon_filter(), anova_filter(), correl_filter(), lm_filter() and boot_filter()
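A minimal sketch (not from the original page) of the convenience wrappers; simulated data, with nfilter assumed to be passed through to the underlying filter:

library(nestedcv)
set.seed(3)
x <- matrix(rnorm(100 * 50), 100, 50)
colnames(x) <- paste0("gene", 1:50)
y <- factor(rbinom(100, 1, 0.5))          # binary outcome: t-test based wrapper
idx <- boot_ttest(y, x, B = 20, nfilter = 10)
y_cont <- rnorm(100)                      # continuous outcome: correlation based wrapper
idx2 <- boot_correl(y_cont, x, B = 20, nfilter = 10)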
Filter using Boruta algorithm.
boruta_filter( y, x, select = c("Confirmed", "Tentative"), type = c("index", "names", "full"), ... )
y |
Response vector |
x |
Matrix of predictors |
select |
Which type of features to retain. Options include "Confirmed" and/or "Tentative". |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a named vector of variable importance. |
... |
Other arguments passed to |
Boruta works differently from other filters in that it does not rank variables by variable importance, but tries to determine relevant features and divides features into Rejected, Tentative or Confirmed.
Integer vector of indices of filtered parameters (type = "index") or character vector of names (type = "names") of filtered parameters. If type is "full", the full output from Boruta is returned.
Boxplots to show range of model predictors to identify exceptional predictors with excessively low or high values.
boxplot_expression(x, scheme = NULL, palette = "Dark 3", ...)
x |
a "nestedcv" object |
scheme |
colour scheme |
palette |
palette name (one of |
... |
other arguments passed to boxplot. |
No return value
Myles Lewis
Check class balance in training folds
class_balance(object) ## Default S3 method: class_balance(object) ## S3 method for class 'nestcv.train' class_balance(object)
object |
Object of class |
Invisibly a table of the response classes in the training folds
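A minimal sketch (not from the original page), assuming class_balance() accepts a fitted "nestcv.glmnet" object via the default method; simulated data:

library(nestedcv)
set.seed(4)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("var", 1:20)
y <- factor(rbinom(100, 1, 0.4))
fit <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1, n_outer_folds = 3)
class_balance(fit)   # tabulates response classes within each outer training fold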
Extracts model coefficients from a fitted cva.glmnet() object.
## S3 method for class 'cva.glmnet' coef(object, ...)
object |
Fitted |
... |
Other arguments passed to |
Sparse matrix containing coefficients from a cv.glmnet model.
Extracts coefficients from the final fit of a "nestcv.glmnet" object.
## S3 method for class 'nestcv.glmnet' coef(object, s = object$final_param["lambda"], ...)
object |
Object of class |
s |
Value of penalty parameter lambda. Default is the mean of lambda values selected across each outer fold. |
... |
Other arguments passed to |
Vector or list of coefficients ordered with the intercept first, followed by highest absolute value to lowest.
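A minimal sketch (not from the original page) of extracting coefficients from the final model of a nested CV fit; simulated data with a weak signal, default s (mean lambda across outer folds) assumed:

library(nestedcv)
set.seed(5)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("var", 1:20)
y <- factor(rbinom(100, 1, plogis(x[, 1] - x[, 2])))
fit <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1, n_outer_folds = 3)
coef(fit)   # intercept first, then coefficients ranked by absolute value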
This function identifies predictors with r^2 above a given cut-off and produces an index of predictors to be removed. The function takes a matrix or data.frame of predictors, whose columns must be ordered in terms of importance: the first column of any correlated pair is retained, and subsequent columns which correlate above the cut-off are flagged for removal.
collinear(x, rsq_cutoff = 0.9, rsq_method = "pearson", verbose = FALSE)
x |
A matrix or data.frame of values. The order of columns is used to
determine which columns to retain, so the columns in |
rsq_cutoff |
Value of cut-off for r-squared |
rsq_method |
character string indicating which correlation coefficient
is to be computed. One of "pearson" (default), "kendall", or "spearman".
See |
verbose |
Boolean whether to print details |
Integer vector of the indices of columns in x to remove due to collinearity.
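A minimal sketch (not from the original page): a near-duplicate column is added so that collinear() flags it; the cut-off and data are illustrative:

set.seed(6)
x <- matrix(rnorm(100 * 5), 100, 5)
x <- cbind(x, x[, 1] + rnorm(100, sd = 0.01))   # column 6 almost duplicates column 1
colnames(x) <- paste0("var", 1:6)
remove_idx <- collinear(x, rsq_cutoff = 0.9)
remove_idx                                       # expected to flag column 6
if (length(remove_idx)) x <- x[, -remove_idx, drop = FALSE]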
Filter combining univariate (t-test or anova) filtering and reliefF filtering in equal measure.
combo_filter(y, x, nfilter, type = c("index", "names", "full"), ...)
y |
Response vector |
x |
Matrix or dataframe of predictors |
nfilter |
Number of predictors to return, using 1/2 from |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns full output. |
... |
Optional arguments passed via relieff_filter to CORElearn::attrEval |
Integer vector of indices of filtered parameters (type = "index") or character vector of names (type = "names") of filtered parameters. If type is "full", a list containing full outputs from either ttest_filter or anova_filter and relieff_filter is returned.
Fast Pearson/Spearman correlation where y is a vector and x is a matrix, adapted from stats::cor.test.
correls2(y, x, method = "pearson", use = "complete.obs")
y |
Numerical vector |
x |
Matrix |
method |
Type of correlation, either "pearson" or "spearman". |
use |
Optional character string giving a method for computing covariances in the presence of missing values. See cor |
For speed, p-values for Spearman's test are computed by asymptotic t approximation, equivalent to cor.test with exact = FALSE.
Matrix with columns containing the correlation statistic, either Pearson r or Spearman rho, and p-values for each column of x correlated against vector y.
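A minimal sketch (not from the original page) correlating a continuous outcome against every column of a simulated predictor matrix:

library(nestedcv)
set.seed(7)
x <- matrix(rnorm(100 * 200), 100, 200)
colnames(x) <- paste0("var", 1:200)
y <- 0.5 * x[, 1] + rnorm(100)
res <- correls2(y, x, method = "pearson")
head(res)   # correlation statistic and p-value for each column of x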
Extracts coefficients from outer CV glmnet models from a nestcv.glmnet fitted object.
cv_coef(x, level = 1)
x |
a |
level |
For multinomial models only, either an integer specifying which level of outcome is being examined, or the level can be specified as a character value |
matrix of coefficients from outer CV glmnet models plus the final glmnet model. Coefficients for variables which are not present in a particular outer CV fold model are set to 0.
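A minimal sketch (not from the original page) comparing coefficients across outer folds and the final model; data are simulated and illustrative:

library(nestedcv)
set.seed(8)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("var", 1:20)
y <- factor(rbinom(100, 1, plogis(x[, 1] - x[, 2])))
fit <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1, n_outer_folds = 3)
cv_coef(fit)   # one column per outer fold plus the final model; absent variables are 0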
Extracts variable importance or coefficients from the outer CV models of a nestcv.train fitted object.
cv_varImp(x)
x |
a |
Note that caret::varImp() may require the model package to be fully loaded in order to function. During the fitting process caret often only loads the package by namespace.
matrix of variable importance from outer CV fold caret models as well as the final model. Variable importance for variables which are not present in a particular outer CV fold model is set to 0.
Performs k-fold cross-validation for glmnet, including alpha mixing parameter.
cva.glmnet(x, y, nfolds = 10, alphaSet = seq(0.1, 1, 0.1), foldid = NULL, ...)
x |
Matrix of predictors |
y |
Response vector |
nfolds |
Number of folds (default 10) |
alphaSet |
Sequence of alpha values to cross-validate |
foldid |
Optional vector of values between 1 and |
... |
Other arguments passed to glmnet::cv.glmnet |
Object of S3 class "cva.glmnet", which is a list of the cv.glmnet objects fitted for each value of alpha in alphaSet.
fits |
List of fitted glmnet::cv.glmnet objects |
alphaSet |
Sequence of alpha values used |
alpha_cvm |
The mean cross-validated error - a vector of length
|
best_alpha |
Value of alpha giving lowest |
which_alpha |
Index of |
Myles Lewis
glmnet::cv.glmnet, glmnet::glmnet
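A minimal sketch (not from the original page) cross-validating glmnet over a small grid of alpha values; the data and grid are illustrative:

library(nestedcv)
library(glmnet)
set.seed(9)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("var", 1:20)
y <- 0.5 * x[, 1] + rnorm(100)
cvafit <- cva.glmnet(x, y, nfolds = 5, alphaSet = seq(0.1, 1, 0.3))
cvafit$best_alpha   # alpha giving the lowest cross-validated error
coef(cvafit)        # coefficients at the best alpha/lambda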
Convenience function for retrieving coefficients from a glmnet::cv.glmnet model at a specified lambda. Sparsity is removed and non-intercept coefficients are ranked by absolute value.
glmnet_coefs(fit, s, ...)
fit |
A glmnet::cv.glmnet fitted model object. |
s |
Value of lambda. See glmnet::coef.glmnet and glmnet::predict.cv.glmnet |
... |
Other arguments passed to glmnet::coef.glmnet |
Vector or list of coefficients ordered with the intercept first, followed by highest absolute value to lowest.
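A minimal sketch (not from the original page) retrieving ranked coefficients from a cv.glmnet fit at lambda.min; simulated data:

library(nestedcv)
library(glmnet)
set.seed(10)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("var", 1:20)
y <- 0.5 * x[, 1] + rnorm(100)
fit <- cv.glmnet(x, y)
glmnet_coefs(fit, s = fit$lambda.min)   # intercept first, then ranked by absolute value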
Filter that uses the sparsity of elastic net regression (via glmnet) to calculate variable importance.
glmnet_filter( y, x, family = NULL, force_vars = NULL, nfilter = NULL, method = c("mean", "nonzero"), type = c("index", "names", "full"), ... )
y |
Response vector |
x |
Matrix of predictors |
family |
Either a character string representing one of the built-in
families, or else a |
force_vars |
Vector of column names |
nfilter |
Number of predictors to return |
method |
String indicating method of determining variable importance. "mean" (the default) uses the mean absolute coefficients across the range of lambdas; "nonzero" counts the number of times variables are retained in the model across all values of lambda. |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns full output. |
... |
Other arguments passed to glmnet::glmnet |
The glmnet elastic net mixing parameter alpha can be varied to
include a larger number of predictors. Default alpha = 1 is pure LASSO,
resulting in greatest sparsity, while alpha = 0 is pure ridge regression,
retaining all predictors in the regression model. Note, the family
argument is commonly needed, see glmnet::glmnet.
Integer vector of indices of filtered parameters (type = "index") or character vector of names (type = "names") of filtered parameters. If type is "full", a named vector of variable importance is returned.
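A minimal sketch (not from the original page) of elastic net based filtering for a binary outcome; the data and settings are illustrative:

library(nestedcv)
set.seed(11)
x <- matrix(rnorm(100 * 50), 100, 50)
colnames(x) <- paste0("var", 1:50)
y <- factor(rbinom(100, 1, plogis(x[, 1] - x[, 2])))
idx <- glmnet_filter(y, x, family = "binomial", nfilter = 10, method = "mean")
colnames(x)[idx]   # names of the retained predictors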
Obtain predictions on held-out test inner CV folds
innercv_preds(x) ## S3 method for class 'nestcv.glmnet' innercv_preds(x) ## S3 method for class 'nestcv.train' innercv_preds(x)
x |
a |
Dataframe with columns testy and predy, and for binomial and multinomial models additional columns containing probabilities or log likelihood values.
Build ROC (receiver operating characteristic) curve from left-out folds from inner CV. The object can be plotted using plot() or passed to functions such as pROC::auc().
innercv_roc(x, direction = "<", ...)
x |
a |
direction |
Set ROC directionality pROC::roc |
... |
Other arguments passed to pROC::roc |
"roc"
object, see pROC::roc
## Example binary classification problem with P >> n
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04)   # predictors
y <- factor(rbinom(150, 1, 0.5))              # binary response

## Partition data into 2/3 training set, 1/3 test set
trainSet <- caret::createDataPartition(y, p = 0.66, list = FALSE)

## t-test filter using whole dataset
filt <- ttest_filter(y, x, nfilter = 100)
filx <- x[, filt]

## Train glmnet on training set only using filtered predictor matrix
library(glmnet)
fit <- cv.glmnet(filx[trainSet, ], y[trainSet], family = "binomial")
plot(fit)

## Predict response on test partition
predy <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "class")
predy <- as.vector(predy)
predyp <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "response")
predyp <- as.vector(predyp)
output <- data.frame(testy = y[-trainSet], predy = predy, predyp = predyp)

## Results on test partition
## shows bias since univariate filtering was applied to whole dataset
predSummary(output)

## Nested CV
fit2 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1,
                      filterFUN = ttest_filter,
                      filter_options = list(nfilter = 100),
                      n_outer_folds = 3)
summary(fit2)

## ROC plots
library(pROC)
testroc <- roc(output$testy, output$predyp, direction = "<")
inroc <- innercv_roc(fit2)
plot(fit2$roc)
lines(inroc, col = 'blue')
lines(testroc, col = 'red')
legend('bottomright', legend = c("Nested CV", "Left-out inner CV folds",
                                 "Test partition, non-nested filtering"),
       col = c("black", "blue", "red"), lty = 1, lwd = 2, bty = "n")
Calculates performance metrics on inner CV held-out test folds: confusion matrix, accuracy and balanced accuracy for classification; ROC AUC for binary classification; RMSE, R^2 and mean absolute error (MAE) for regression.
innercv_summary(x)
x |
a |
Returns performance metrics from outer training folds, see predSummary.
data(iris)
x <- iris[, 1:4]
y <- iris[, 5]
fit <- nestcv.glmnet(y, x, family = "multinomial", alpha = 1, n_outer_folds = 3)
summary(fit)
innercv_summary(fit)
Adds a precision-recall curve to a base graphics plot. It accepts an S3 object of class 'prc', see prc().
## S3 method for class 'prc' lines(x, ...)
x |
An object of class 'prc' |
... |
Optional graphical arguments passed to |
No return value
Linear models are fitted on each predictor, with inclusion of variable names listed in force_vars in the model. Predictors are ranked by Akaike information criterion (AIC) value, or can be filtered by the p-value on the estimate of the coefficient for that predictor in its model.
lm_filter( y, x, force_vars = NULL, nfilter = NULL, p_cutoff = 0.05, rsq_cutoff = NULL, rsq_method = "pearson", type = c("index", "names", "full"), keep_factors = TRUE, method = 0L, mc.cores = 1 )
y |
Numeric or integer response vector |
x |
Matrix of predictors. If |
force_vars |
Vector of column names |
nfilter |
Number of predictors to return. If |
p_cutoff |
p-value cut-off. P-values are calculated by t-statistic on the estimated coefficient for the predictor being tested. |
rsq_cutoff |
r^2 cutoff for removing predictors due to collinearity.
Default |
rsq_method |
character string indicating which correlation coefficient
is to be computed. One of "pearson" (default), "kendall", or "spearman".
See |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a matrix of p values. |
keep_factors |
Logical affecting factors with 3 or more levels.
Dataframes are coerced to a matrix using data.matrix. Binary
factors are converted to numeric values 0/1 and analysed as such. If
|
method |
Integer determining linear model method. See
|
mc.cores |
Number of cores for parallelisation using
|
This filter is based on the model y ~ xvar + force_vars, where y is the response vector, xvar are variables in columns taken sequentially from x and force_vars are optional covariates extracted from x. It uses RcppEigen::fastLmPure() with method = 0 as default since it is rank-revealing. method = 3 is significantly faster but can give errors in estimation of p-value with variables of zero variance. The algorithm attempts to detect these and set their stats to NA. NA in x are not tolerated.
Parallelisation is available via mclapply(). This is provided mainly for the use case of the filter being used as standalone. Nesting parallelisation inside of parallelised nestcv.glmnet() or nestcv.train() loops is not recommended.
Integer vector of indices of filtered parameters (type = "index") or character vector of names (type = "names") of filtered parameters in order of linear model AIC. Any variables in force_vars which are incorporated into all models are listed first. If type = "full" a matrix of AIC value, sigma (residual standard error, see summary.lm), coefficient, t-statistic and p-value for each tested predictor is returned.
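A minimal sketch (not from the original page) of linear-model filtering on a simulated continuous outcome; the thresholds and data are illustrative:

library(nestedcv)
set.seed(12)
x <- matrix(rnorm(150 * 100), 150, 100)
colnames(x) <- paste0("var", 1:100)
y <- 0.5 * x[, 1] + 0.3 * x[, 2] + rnorm(150)
## top 5 predictors by AIC, keeping only those with coefficient p < 0.05
idx <- lm_filter(y, x, nfilter = 5, p_cutoff = 0.05)
colnames(x)[idx]
## full output: AIC, sigma, coefficient, t-statistic and p-value per predictor
head(lm_filter(y, x, type = "full"))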
Calculates Matthews correlation coefficient (MCC) which is in essence a correlation coefficient between the observed and predicted binary classifications. It has also been generalised to multi-class classification.
mcc(cm) mcc_multi(cm)
cm |
A contingency table or matrix of predicted vs observed classes with reference classes in columns and predicted classes in rows. |
Use mcc() for 2x2 tables (binary classification). mcc_multi() is for multi-class classification with k x k tables and is calculated using Gorodkin's method.
Returns a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and -1 indicates total disagreement between prediction and observation.
Gorodkin, J. (2004). Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry. 28 (5): 367–374.
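For the binary case MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch (not from the original page) with a made-up confusion matrix laid out as the documentation describes (reference classes in columns, predicted classes in rows):

library(nestedcv)
cm <- matrix(c(40, 10,
                8, 42), nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("neg", "pos"),
                             reference = c("neg", "pos")))
mcc(cm)          # binary MCC from the 2x2 table

## multi-class (Gorodkin) version for a k x k table
set.seed(13)
cm3 <- table(predicted = sample(1:3, 60, TRUE), reference = sample(1:3, 60, TRUE))
mcc_multi(cm3)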
Returns model metrics from nestedcv models. Extended metrics including area under the precision-recall curve, Cohen's kappa, F1 score and Matthews correlation coefficient can optionally be included.
metrics(object, extra = FALSE, innerCV = FALSE, positive = 2)
object |
A 'nestcv.glmnet', 'nestcv.train', 'nestcv.SuperLearner' or 'outercv' object. |
extra |
Logical whether additional performance metrics are gathered for classification models: area under precision recall curve (PR.AUC, binary classification only), Cohen's kappa, F1 score, Matthews correlation coefficient (MCC). |
innerCV |
Whether to calculate metrics for inner CV folds. Only available for 'nestcv.glmnet' and 'nestcv.train' objects. |
positive |
For binary classification, either an integer 1 or 2 for the
level of response factor considered to be 'positive' or 'relevant', or a
character value for that factor. This affects the F1 score. See
|
Area under the precision-recall curve is estimated by trapezoidal estimation using MLmetrics::PRAUC().
For multi-class classification models, Matthews correlation coefficient is calculated using Gorodkin's method. Multi-class F1 score (macro F1) is calculated as the arithmetic mean of the class-wise F1 scores.
A named numeric vector of performance metrics.
Gorodkin, J. (2004). Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry. 28 (5): 367–374.
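A minimal sketch (not from the original page) of extracting performance metrics from a fitted nestedcv model; data are simulated, and extra = TRUE is assumed to require the MLmetrics package for PR.AUC:

library(nestedcv)
set.seed(14)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("var", 1:20)
y <- factor(rbinom(100, 1, plogis(x[, 1])), labels = c("control", "case"))
fit <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1, n_outer_folds = 3)
metrics(fit)                                    # basic metrics from outer test folds
metrics(fit, extra = TRUE, positive = "case")   # adds PR.AUC, kappa, F1, MCC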
This function applies a cross-validation (CV) procedure for training Bayesian models with hierarchical shrinkage priors using the hsstan package. The function allows the option of embedded filtering of predictors for feature selection within the CV loop. Within each training fold, an optional filtering of predictors is performed, followed by fitting of an hsstan model. Predictions on the testing folds are brought back together and error estimation/accuracy determined. The default is 10-fold CV. The function is implemented within the nestedcv package. The hsstan models do not require tuning of meta-parameters and therefore only a single CV procedure is needed to evaluate performance. This is implemented using the outer CV procedure in the nestedcv package. Supports binary outcome (logistic regression) or continuous outcome. Multinomial models are currently not supported.
model.hsstan(y, x, unpenalized = NULL, ...)
y |
Response vector. For classification this should be a factor. |
x |
Matrix of predictors |
unpenalized |
Vector of column names |
... |
Optional arguments passed to |
Caution should be used when setting the number of cores available for parallelisation. The default setting in hsstan is to use 4 cores to parallelise the Markov chains of the Bayesian inference procedure. This can be switched off either by adding argument cores = 1 (passed on to rstan) or setting options(mc.cores = 1).
Argument cv.cores in outercv() controls parallelisation over the outer CV folds. On unix/mac, setting cv.cores to >1 will induce nested parallelisation which will generate an error, unless parallelisation of the chains is disabled using cores = 1 or setting options(mc.cores = 1).
Nested parallelisation is feasible if cv.cores is >1 and multicore_fork = FALSE is set, as this uses cluster based parallelisation instead. Beware that large numbers of processes will be spawned. If we are performing 10-fold cross-validation with 4 chains and set cv.cores = 10, then 40 processes will be invoked simultaneously.
An object of class hsstan
Athina Spiliopoulou
# Cross-validation is used to apply univariate filtering of predictors.
# Only one CV split is needed (outercv) as the Bayesian model does not
# require learning of meta-parameters.

# control number of cores used for parallelisation over chains
oldopt <- options(mc.cores = 2)

# load iris dataset and simulate a continuous outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
dt$outcome.cont <- -3 + 0.5 * dt$marker1 + 2 * dt$marker2 + rnorm(nrow(dt), 0, 2)

library(hsstan)
# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4")   # penalised covariates

# run cross-validation with univariate filter and hsstan
# dummy sampling for fast execution of example
# recommend 4 chains, warmup 1000, iter 2000 in practice
res.cv.hsstan <- outercv(y = dt$outcome.cont, x = dt[, c(uvars, pvars)],
                         model = "model.hsstan",
                         filterFUN = lm_filter,
                         filter_options = list(force_vars = uvars,
                                               nfilter = 2,
                                               p_cutoff = NULL,
                                               rsq_cutoff = 0.9),
                         n_outer_folds = 3, chains = 2, cv.cores = 1,
                         unpenalized = uvars, warmup = 100, iter = 200)

# view prediction performance based on testing folds
res.cv.hsstan$summary
# view coefficients for the final model
res.cv.hsstan$final_fit
# view covariates selected by the univariate filter
res.cv.hsstan$final_vars

# use hsstan package to examine the Bayesian model
sampler.stats(res.cv.hsstan$final_fit)
print(projsel(res.cv.hsstan$final_fit), digits = 4)   # adding marker2

options(oldopt)   # reset configuration

# Here adding `marker2` improves the model fit: substantial decrease of
# KL-divergence from the full model to the submodel. Adding `marker3` does
# not improve the model fit: no decrease of KL-divergence from the full model
# to the submodel.
This function enables nested cross-validation (CV) with glmnet, including tuning of the elastic net alpha parameter. The function also allows the option of embedded filtering of predictors for feature selection nested within the outer loop of CV. Predictions on the outer test folds are brought back together and error estimation/accuracy determined. The default is 10x10 nested CV.
nestcv.glmnet( y, x, family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"), filterFUN = NULL, filter_options = NULL, balance = NULL, balance_options = NULL, modifyX = NULL, modifyX_useY = FALSE, modifyX_options = NULL, outer_method = c("cv", "LOOCV"), n_outer_folds = 10, n_inner_folds = 10, outer_folds = NULL, pass_outer_folds = FALSE, alphaSet = seq(0.1, 1, 0.1), min_1se = 0, keep = TRUE, outer_train_predict = FALSE, weights = NULL, penalty.factor = rep(1, ncol(x)), cv.cores = 1, finalCV = TRUE, na.option = "omit", verbose = FALSE, ... )
y |
Response vector or matrix. Matrix is only used for
|
x |
Matrix of predictors. Dataframes will be coerced to a matrix as is necessary for glmnet. |
family |
Either a character string representing one of the built-in
families, or else a |
filterFUN |
Filter function, e.g. ttest_filter or relieff_filter.
Any function can be provided and is passed |
filter_options |
List of additional arguments passed to the filter
function specified by |
balance |
Specifies method for dealing with imbalanced class data.
Current options are |
balance_options |
List of additional arguments passed to the balancing function |
modifyX |
Character string specifying the name of a function to modify
|
modifyX_useY |
Logical value whether the |
modifyX_options |
List of additional arguments passed to the |
outer_method |
String of either |
n_outer_folds |
Number of outer CV folds |
n_inner_folds |
Number of inner CV folds |
outer_folds |
Optional list containing indices of test folds for outer
CV. If supplied, |
pass_outer_folds |
Logical indicating whether the same outer folds are
used for fitting of the final model when final CV is applied. Note this can
only be applied when |
alphaSet |
Vector of alphas to be tuned |
min_1se |
Value from 0 to 1 specifying choice of optimal lambda from 0=lambda.min to 1=lambda.1se |
keep |
Logical indicating whether inner CV predictions are retained for
calculating left-out inner CV fold accuracy etc. See argument |
outer_train_predict |
Logical whether to save predictions on outer training folds to calculate performance on outer training folds. |
weights |
Weights applied to each sample. Note |
penalty.factor |
Separate penalty factors can be applied to each
coefficient. Can be 0 for some variables, which implies no shrinkage, and
that variable is always included in the model. Default is 1 for all
variables. See glmnet::glmnet. Note this works separately from filtering.
For some |
cv.cores |
Number of cores for parallel processing of the outer loops.
NOTE: this uses |
finalCV |
Logical whether to perform one last round of CV on the whole
dataset to determine the final model parameters. If set to |
na.option |
Character value specifying how |
verbose |
Logical whether to print messages and show progress |
... |
Optional arguments passed to glmnet::cv.glmnet |
glmnet does not tolerate missing values, so na.option = "omit" is the default.
An object with S3 class "nestcv.glmnet"
call |
the matched call |
output |
Predictions on the left-out outer folds |
outer_result |
List object of results from each outer fold containing predictions on left-out outer folds, best lambda, best alpha, fitted glmnet coefficients, list object of inner fitted cv.glmnet and number of filtered predictors at each fold. |
outer_method |
the |
n_inner_folds |
number of inner folds |
outer_folds |
List of indices of outer test folds |
dimx |
dimensions of |
xsub |
subset of |
y |
original response vector |
yfinal |
final response vector (post-balancing) |
final_param |
Final mean best lambda and alpha from each fold |
final_fit |
Final fitted glmnet model |
final_coef |
Final model coefficients and mean expression. Variables with coefficients shrunk to 0 are removed. |
final_vars |
Column names of filtered predictors entering final model. This is useful for subsetting new data for predictions. |
roc |
ROC AUC for binary classification where available. |
summary |
Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression. |
Myles Lewis
## Example binary classification problem with P >> n
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04)   # predictors
y <- factor(rbinom(150, 1, 0.5))              # binary response

## Partition data into 2/3 training set, 1/3 test set
trainSet <- caret::createDataPartition(y, p = 0.66, list = FALSE)

## t-test filter using whole dataset
filt <- ttest_filter(y, x, nfilter = 100)
filx <- x[, filt]

## Train glmnet on training set only using filtered predictor matrix
library(glmnet)
fit <- cv.glmnet(filx[trainSet, ], y[trainSet], family = "binomial")
plot(fit)

## Predict response on test partition
predy <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "class")
predy <- as.vector(predy)
predyp <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "response")
predyp <- as.vector(predyp)
output <- data.frame(testy = y[-trainSet], predy = predy, predyp = predyp)

## Results on test partition
## shows bias since univariate filtering was applied to whole dataset
predSummary(output)

## Nested CV
## n_outer_folds reduced to speed up example
fit2 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1,
                      n_outer_folds = 3,
                      filterFUN = ttest_filter,
                      filter_options = list(nfilter = 100),
                      cv.cores = 2)
summary(fit2)
plot_lambdas(fit2, showLegend = "bottomright")

## ROC plots
library(pROC)
testroc <- roc(output$testy, output$predyp, direction = "<")
inroc <- innercv_roc(fit2)
plot(fit2$roc)
lines(inroc, col = 'blue')
lines(testroc, col = 'red')
legend('bottomright', legend = c("Nested CV", "Left-out inner CV folds",
                                 "Test partition, non-nested filtering"),
       col = c("black", "blue", "red"), lty = 1, lwd = 2, bty = "n")
Provides a single loop of outer cross-validation to evaluate performance of ensemble models from the SuperLearner package.
nestcv.SuperLearner( y, x, filterFUN = NULL, filter_options = NULL, weights = NULL, balance = NULL, balance_options = NULL, modifyX = NULL, modifyX_useY = FALSE, modifyX_options = NULL, outer_method = c("cv", "LOOCV"), n_outer_folds = 10, outer_folds = NULL, cv.cores = 1, final = TRUE, na.option = "pass", verbose = TRUE, ... )
y |
Response vector |
x |
Dataframe or matrix of predictors. Matrix will be coerced to dataframe as this is the default for SuperLearner. |
filterFUN |
Filter function, e.g. ttest_filter or relieff_filter.
Any function can be provided and is passed |
filter_options |
List of additional arguments passed to the filter
function specified by |
weights |
Weights applied to each sample for models which can use
weights. Note |
balance |
Specifies method for dealing with imbalanced class data.
Current options are |
balance_options |
List of additional arguments passed to the balancing function |
modifyX |
Character string specifying the name of a function to modify
|
modifyX_useY |
Logical value whether the |
modifyX_options |
List of additional arguments passed to the |
outer_method |
String of either |
n_outer_folds |
Number of outer CV folds |
outer_folds |
Optional list containing indices of test folds for outer
CV. If supplied, |
cv.cores |
Number of cores for parallel processing of the outer loops.
NOTE: this uses |
final |
Logical whether to fit final model. |
na.option |
Character value specifying how |
verbose |
Logical whether to print messages and show progress |
... |
Additional arguments passed to |
This performs an outer CV on SuperLearner package ensemble models to measure performance, allowing balancing of imbalanced datasets as well as filtering of predictors. SuperLearner prefers dataframes as inputs for the predictors. If x is a matrix it will be coerced to a dataframe and variable names adjusted by make.names().
Parallelisation of the outer CV folds is available on linux/mac, but not available on windows. On windows, snowSuperLearner() is called instead, so that parallelisation is performed across each call to SuperLearner.
An object with S3 class "nestcv.SuperLearner"
call |
the matched call |
output |
Predictions on the left-out outer folds |
outer_result |
List object of results from each outer fold containing predictions on left-out outer folds, model result and number of filtered predictors at each fold. |
dimx |
vector of number of observations and number of predictors |
y |
original response vector |
yfinal |
final response vector (post-balancing) |
outer_folds |
List of indices of outer test folds |
final_fit |
Final fitted model on whole data |
final_vars |
Column names of filtered predictors entering final model |
summary_vars |
Summary statistics of filtered predictors |
roc |
ROC AUC for binary classification where available. |
summary |
Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression. |
Care should be taken with some SuperLearner models, e.g. SL.gbm, as some models have multicore enabled by default, which can lead to huge numbers of processes being spawned.
This function applies nested cross-validation (CV) to training of models using the caret package. The function also allows the option of embedded filtering of predictors for feature selection nested within the outer loop of CV. Predictions on the outer test folds are brought back together and error estimation/accuracy determined. The default is 10x10 nested CV.
nestcv.train( y, x, method = "rf", filterFUN = NULL, filter_options = NULL, weights = NULL, balance = NULL, balance_options = NULL, modifyX = NULL, modifyX_useY = FALSE, modifyX_options = NULL, outer_method = c("cv", "LOOCV"), n_outer_folds = 10, n_inner_folds = 10, outer_folds = NULL, inner_folds = NULL, pass_outer_folds = FALSE, cv.cores = 1, multicore_fork = (Sys.info()["sysname"] != "Windows"), metric = ifelse(is.factor(y), "logLoss", "RMSE"), trControl = NULL, tuneGrid = NULL, savePredictions = "final", outer_train_predict = FALSE, finalCV = TRUE, na.option = "pass", verbose = TRUE, ... )
y |
Response vector. For classification this should be a factor. |
x |
Matrix or dataframe of predictors |
method |
String specifying which model to use. See |
filterFUN |
Filter function, e.g. |
filter_options |
List of additional arguments passed to the filter
function specified by |
weights |
Weights applied to each sample for models which can use
weights. Note |
balance |
Specifies method for dealing with imbalanced class data.
Current options are |
balance_options |
List of additional arguments passed to the balancing function |
modifyX |
Character string specifying the name of a function to modify
|
modifyX_useY |
Logical value whether the |
modifyX_options |
List of additional arguments passed to the |
outer_method |
String of either |
n_outer_folds |
Number of outer CV folds |
n_inner_folds |
Sets number of inner CV folds. Note if |
outer_folds |
Optional list containing indices of test folds for outer
CV. If supplied, |
inner_folds |
Optional list of test fold indices for inner CV. This must
be structured as a list of the outer folds each containing a list of inner
folds. Can only be supplied if balancing is not applied. If supplied,
|
pass_outer_folds |
Logical indicating whether the same outer folds are
used for fitting of the final model when final CV is applied. Note this can
only be applied when |
cv.cores |
Number of cores for parallel processing of the outer loops. |
multicore_fork |
Logical whether to use forked multicore parallel
processing. Forked multicore processing uses |
metric |
A string that specifies what summary metric will be used to select the optimal model. By default, "logLoss" is used for classification and "RMSE" is used for regression. Note this differs from the default setting in caret which uses "Accuracy" for classification. See details. |
trControl |
A list of values generated by the |
tuneGrid |
Data frame of tuning values, see |
savePredictions |
Indicates whether hold-out predictions for each inner
CV fold should be saved for ROC curves, accuracy etc see
caret::trainControl. Default is |
outer_train_predict |
Logical whether to save predictions on outer training folds to calculate performance on outer training folds. |
finalCV |
Logical whether to perform one last round of CV on the whole
dataset to determine the final model parameters. If set to |
na.option |
Character value specifying how |
verbose |
Logical whether to print messages and show progress |
... |
Arguments passed to |
When finalCV = TRUE, the final fit on the whole data is performed first. This helps flag errors generated by caret such as missing packages. Parallelisation of the final fit when finalCV = TRUE is performed in caret using registerDoParallel. caret itself uses foreach.
Parallelisation is performed on the outer CV folds using parallel::mclapply by default on unix/mac and parallel::parLapply on windows. mclapply uses forking which is faster, but some models use multi-threading which may cause issues in some circumstances with forked multicore processing. Setting multicore_fork to FALSE is slower but can alleviate some caret errors.
If the outer folds are run using parallelisation, then parallelisation in caret must be off, otherwise an error will be generated. Alternatively, if you wish to use parallelisation in caret, then parallelisation in nestcv.train can be fully disabled by leaving cv.cores = 1.
xgboost models fitted via caret using method = "xgbTree" or "xgbLinear" invoke openMP multithreading on linux/windows by default, which causes nestcv.train to fail when cv.cores > 1 (nested parallelisation). Mac OS is unaffected. In order to prevent this, nestcv.train() sets openMP threads to 1 if cv.cores > 1.
For classification, metric defaults to 'logLoss' with the trControl arguments classProbs = TRUE, summaryFunction = mnLogLoss, rather than 'Accuracy' which is the default classification metric in caret. See caret::trainControl(). LogLoss is arguably more consistent than Accuracy for tuning parameters in datasets with small sample size.
Models can be fitted with a single set of fixed parameters, in which case trControl defaults to trainControl(method = "none"), which disables inner CV as it is unnecessary. See https://topepo.github.io/caret/model-training-and-tuning.html#fitting-models-without-parameter-tuning
An object with S3 class "nestcv.train"
call |
the matched call |
output |
Predictions on the left-out outer folds |
outer_result |
List object of results from each outer fold containing predictions on left-out outer folds, caret result and number of filtered predictors at each fold. |
outer_folds |
List of indices of outer test folds |
dimx |
dimensions of |
xsub |
subset of |
y |
original response vector |
yfinal |
final response vector (post-balancing) |
final_fit |
Final fitted caret model using best tune parameters |
final_vars |
Column names of filtered predictors entering final model |
summary_vars |
Summary statistics of filtered predictors |
roc |
ROC AUC for binary classification where available. |
trControl |
|
bestTunes |
best tuned parameters from each outer fold |
finalTune |
final parameters used for final model |
summary |
Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression. |
Myles Lewis
## sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}

## load iris dataset and simulate a binary outcome
data(iris)
x <- iris[, 1:4]
colnames(x) <- c("marker1", "marker2", "marker3", "marker4")
x <- as.data.frame(apply(x, 2, scale))
y2 <- sigmoid(0.5 * x$marker1 + 2 * x$marker2) > runif(nrow(x))
y2 <- factor(y2, labels = c("class1", "class2"))

## Example using random forest with caret
cvrf <- nestcv.train(y2, x, method = "rf",
                     n_outer_folds = 3, cv.cores = 2)
summary(cvrf)

## Example of glmnet tuned using caret
## set up small tuning grid for quick execution
## length.out of 20-100 is usually recommended for lambda
## and more alpha values ranging from 0-1
tg <- expand.grid(lambda = exp(seq(log(2e-3), log(1e0), length.out = 5)),
                  alpha = 1)
ncv <- nestcv.train(y = y2, x = x, method = "glmnet",
                    n_outer_folds = 3, tuneGrid = tg, cv.cores = 2)
summary(ncv)

## plot tuning for outer fold #1
plot(ncv$outer_result[[1]]$fit, xTrans = log)

## plot final ROC curve
plot(ncv$roc)

## plot ROC for left-out inner folds
inroc <- innercv_roc(ncv)
plot(inroc)

## example to show use of custom fold indices for 5 x 5-fold nested CV
library(caret)
y <- iris$Species
out_folds <- createFolds(y, k = 5)
in_folds <- lapply(out_folds, function(i) {
  ytrain <- y[-i]
  createFolds(ytrain, k = 5)
})
res <- nestcv.train(y, x, method = "rf", cv.cores = 2,
                    pass_outer_folds = TRUE,
                    inner_folds = in_folds, outer_folds = out_folds)
summary(res)
res$outer_folds
res$final_fit$control$indexOut   # same as outer_folds
Fast one-hot encoding of all factor and character columns in a dataframe to convert it into a numeric matrix by creating dummy (binary) columns.
one_hot(x, all_levels = FALSE, rename_binary = TRUE, sep = ".")
x |
A dataframe, matrix or tibble. Matrices are returned untouched. |
all_levels |
Logical, whether to create dummy variables for all levels
of each factor. Default is |
rename_binary |
Logical, whether to rename binary factors by appending the 2nd level of the factor to aid interpretation of encoded factor levels and to allow consistency with naming. |
sep |
Character for separating factor variable names and levels for encoded columns. |
Binary factor columns and logical columns are converted to integers (0 or 1). Multi-level unordered factors are converted to multiple columns of 0/1 (dummy variables): if all_levels is set to FALSE (the default), then the first level is assumed to be a reference level and additional columns are created for each additional level; if all_levels is set to TRUE, one column is used for each level. Unused levels are dropped. Character columns are first converted to factors and then encoded. Ordered factors are replaced by their internal codes. Numeric or integer columns are left untouched.
Having dummy variables for all levels of a factor can cause problems with multicollinearity in regression (the dummy variable trap), so all_levels is set to FALSE by default, which is necessary for regression models such as glmnet (equivalent to full rank parameterisation). However, setting all_levels to TRUE can aid with interpretability (e.g. with SHAP values), and in some cases filtering might result in some dummy variables being excluded. Note this function is designed to quickly generate dummy variables for more general machine learning purposes. To create a proper design matrix object for regression models, use model.matrix().
A numeric matrix with the same number of rows as the input data. Dummy variable columns replace the input factor or character columns. Numeric columns are left intact.
caret::dummyVars(), model.matrix()
data(iris)
x <- iris
x2 <- one_hot(x)
head(x2)   # 3 columns for Species
x2 <- one_hot(x, all_levels = FALSE)
head(x2)   # 2 columns for Species
This is a convenience function designed to use a single loop of cross-validation to quickly evaluate performance of specific models (random forest, naive Bayes, lm, glm) with fixed hyperparameters and no tuning. If tuning of parameters on data is required, full nested CV with inner CV is needed to tune model hyperparameters (see nestcv.train).
outercv(y, ...) ## Default S3 method: outercv( y, x, model, filterFUN = NULL, filter_options = NULL, weights = NULL, balance = NULL, balance_options = NULL, modifyX = NULL, modifyX_useY = FALSE, modifyX_options = NULL, outer_method = c("cv", "LOOCV"), n_outer_folds = 10, outer_folds = NULL, cv.cores = 1, multicore_fork = (Sys.info()["sysname"] != "Windows"), predict_type = "prob", outer_train_predict = FALSE, returnList = FALSE, final = TRUE, na.option = "pass", verbose = FALSE, suppressMsg = verbose, ... ) ## S3 method for class 'formula' outercv( formula, data, model, outer_method = c("cv", "LOOCV"), n_outer_folds = 10, outer_folds = NULL, cv.cores = 1, multicore_fork = (Sys.info()["sysname"] != "Windows"), predict_type = "prob", outer_train_predict = FALSE, verbose = FALSE, suppressMsg = verbose, ..., na.action = na.fail )
y |
Response vector |
... |
Optional arguments passed to the function specified by |
x |
Matrix or dataframe of predictors |
model |
Character value or function of the model to be fitted. |
filterFUN |
Filter function, e.g. ttest_filter or relieff_filter.
Any function can be provided and is passed |
filter_options |
List of additional arguments passed to the filter
function specified by |
weights |
Weights applied to each sample for models which can use
weights. Note |
balance |
Specifies method for dealing with imbalanced class data.
Current options are |
balance_options |
List of additional arguments passed to the balancing function |
modifyX |
Character string specifying the name of a function to modify
|
modifyX_useY |
Logical value whether the |
modifyX_options |
List of additional arguments passed to the |
outer_method |
String of either |
n_outer_folds |
Number of outer CV folds |
outer_folds |
Optional list containing indices of test folds for outer
CV. If supplied, |
cv.cores |
Number of cores for parallel processing of the outer loops. |
multicore_fork |
Logical whether to use forked multicore parallel
processing. Forked multicore processing uses |
predict_type |
Only used with binary classification. Calculation of ROC
AUC requires predicted class probabilities from fitted models. Most model
functions use syntax of the form |
outer_train_predict |
Logical whether to save predictions on outer training folds to calculate performance on outer training folds. |
returnList |
Logical whether to return list of results after main outer CV loop without concatenating results. Useful for debugging. |
final |
Logical whether to fit final model. |
na.option |
Character value specifying how |
verbose |
Logical whether to print messages and show progress |
suppressMsg |
Logical whether to suppress messages and printed output from model functions. This is necessary when using forked multicore parallelisation. |
formula |
A formula describing the model to be fitted |
data |
A matrix or data frame containing variables in the model. |
na.action |
Formula S3 method only: a function to specify the action to
be taken if NAs are found. The default action is for the procedure to fail.
An alternative is |
Some predictive model functions do not have an x & y interface. If the function specified by model requires a formula, x & y will be merged into a dataframe and model() called with a formula equivalent to y ~ .
The S3 formula method for outercv is not really recommended for large data sets; it is envisaged primarily for comparing the performance of more basic models specified by formulae, e.g. lm() incorporating interactions. NOTE: filtering is not available if outercv is called with a formula - use the x-y interface instead.
An alternative method of tuning a single model with fixed parameters
is to use nestcv.train with tuneGrid
set as a single row of a
data.frame. The parameters which are needed for a specific model can be
identified using caret::modelLookup()
.
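As an illustrative sketch (the choice of method = "rf", a single mtry value and 3 outer folds is arbitrary and only for speed, not a package default):

caret::modelLookup("rf")   # lists 'mtry' as the only tuning parameter for "rf"
data(iris)
fit <- nestcv.train(iris$Species, iris[, 1:4], method = "rf",
                    tuneGrid = data.frame(mtry = 2),
                    n_outer_folds = 3)
summary(fit)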
Case weights can be passed to model functions which accept these; however, outercv assumes that they are passed to the model via an argument named weights.
Note that in the case of model = "lm"
, although additional arguments e.g.
subset
, weights
, offset
are passed into the model function via
"..."
the scoping is known to go awry. Avoid using these arguments with
model = "lm"
.
NA
handling differs between the default S3 method and the formula S3
method. The na.option
argument takes a character string, while the more
typical na.action
argument takes a function.
An object with S3 class "outercv"
call |
the matched call |
output |
Predictions on the left-out outer folds |
outer_result |
List object of results from each outer fold containing predictions on left-out outer folds, model result and number of filtered predictors at each fold. |
dimx |
vector of number of observations and number of predictors |
outer_folds |
List of indices of outer test folds |
final_fit |
Final fitted model on whole data |
final_vars |
Column names of filtered predictors entering final model |
roc |
ROC AUC for binary classification where available. |
summary |
Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression. |
## Classification example

## sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}

# load iris dataset and simulate a binary outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
x <- dt
y2 <- sigmoid(0.5 * dt$marker1 + 2 * dt$marker2) > runif(nrow(dt))
y2 <- factor(y2)

## Random forest
library(randomForest)
cvfit <- outercv(y2, x, "randomForest")
summary(cvfit)
plot(cvfit$roc)

## Mixture discriminant analysis (MDA)
if (requireNamespace("mda", quietly = TRUE)) {
  library(mda)
  cvfit <- outercv(y2, x, "mda", predict_type = "posterior")
  summary(cvfit)
}

## Example with continuous outcome
y <- -3 + 0.5 * dt$marker1 + 2 * dt$marker2 + rnorm(nrow(dt), 0, 2)
dt$outcome <- y

## simple linear model - formula interface
cvfit <- outercv(outcome ~ ., data = dt, model = "lm")
summary(cvfit)

## random forest for regression
cvfit <- outercv(y, x, "randomForest")
summary(cvfit)

## example with lm_filter() to reduce input predictors
cvfit <- outercv(y, x, "randomForest", filterFUN = lm_filter,
                 filter_options = list(nfilter = 2, p_cutoff = NULL))
summary(cvfit)
Plot of cross-validated glmnet alpha parameter against deviance for each outer CV fold.
plot_alphas(x, col = NULL, ...)
x |
Fitted "nestcv.glmnet" object |
col |
Optional vector of line colours for each fold |
... |
other arguments passed to plot |
No return value
Myles Lewis
Plots the main tuning parameter in models built using caret::train
plot_caret(x, error.col = "darkgrey", ...)
x |
Object of class 'train' generated by |
error.col |
Colour of error bars |
... |
Other arguments passed to |
No return value
Plot of cross-validated glmnet lambda parameter against deviance for each outer CV fold.
plot_lambdas( x, scheme = NULL, palette = "Dark 3", showLegend = if (x$outer_method == "cv") "topright" else NULL, ... )
x |
Fitted "nestcv.glmnet" object |
scheme |
colour scheme |
palette |
palette name (one of |
showLegend |
Either a keyword to position the legend or |
... |
other arguments passed to plot. Use |
No return value
Myles Lewis
SHAP importance bar plot
plot_shap_bar( shap, x, sort = TRUE, labels = c("Negative", "Positive"), top = NULL )
shap |
a matrix of SHAP values |
x |
a matrix or dataframe of feature values containing only feature
values from the training data. The rows must match rows in |
sort |
Logical whether to sort predictors by mean absolute SHAP value |
labels |
Character vector of labels for directionality |
top |
Sets a limit on the number of variables plotted or |
A ggplot2 plot
SHAP importance beeswarm plot
plot_shap_beeswarm( shap, x, cex = 0.25, corral = "random", corral.width = 0.7, scheme = c("deepskyblue2", "purple3", "red"), sort = TRUE, top = NULL, ... )
shap |
a matrix of SHAP values |
x |
a matrix or dataframe of feature values containing only feature
values from the training data. The rows must match rows in |
cex |
Scaling for adjusting point spacing. See
|
corral |
String specifying method used to corral points. See
|
corral.width |
Numeric specifying width of corral, passed to
|
scheme |
Colour scheme as a vector of 3 colours |
sort |
Logical whether to sort predictors by mean absolute SHAP value. |
top |
Sets a limit on the number of variables plotted or |
... |
Other arguments passed to |
A ggplot2 plot
Plots variables selected in models ranked by variable importance across the outer folds as well as the final model.
plot_var_ranks( x, sort = TRUE, scheme = NULL, cex = 1, corral.width = 0.75, ... ) hist_var_ranks(x, sort = TRUE, scheme = NULL)
x |
A |
sort |
Logical whether to sort variable by mean rank. |
scheme |
Optional vector of colours which is passed to
|
cex |
Scaling for adjusting point spacing. See
|
corral.width |
Numeric specifying width of corral, passed to
|
... |
Optional arguments passed to |
A ggplot2 plot.
Produces a ggplot2 plot of stability (as SEM) of variable importance across models trained and tested across outer CV folds. Overlays the frequency with which variables are selected across the outer folds and optionally overlays directionality for a binary response outcome.
plot_var_stability( x, final = TRUE, top = NULL, direction = 0, dir_labels = NULL, scheme = c("royalblue", "red"), breaks = NULL, percent = TRUE, level = 1, sort = TRUE )
x |
a |
final |
Logical whether to restrict variables to only those which ended up in the final fitted model or to include all variables selected across all outer folds. |
top |
Limits number of variables plotted. Set to |
direction |
Integer controlling plotting of directionality for binary or
regression models. |
dir_labels |
Character vector for controlling the legend when
|
scheme |
Vector of 2 colours for directionality when |
breaks |
Vector of continuous breaks for legend colour/size |
percent |
Logical for |
level |
For multinomial |
sort |
Logical whether to sort by mean variable importance. Passed to
|
A ggplot2 plot
Plot of variable importance of coefficients of a final fitted 'nestcv.glmnet' model using ggplot2. Mean expression can be overlaid as the size of points as this can be informative in models of biological attributes.
plot_varImp(x, abs = TRUE, size = TRUE)
x |
a 'nestcv.glmnet' class object |
abs |
Logical whether to show absolute value of glmnet coefficients |
size |
Logical whether to show mean expression by size of points |
Returns a ggplot2 plot
Different types of plot showing cross-validated tuning of alpha and lambda
from elastic net regression via glmnet::glmnet. If xaxis
is set to
"lambda"
, log lambda is on the x axis while the tuning metric (log loss,
deviance, accuracy, AUC etc) is on the y axis. Multiple alpha values are
shown by different colours. If xaxis
is set to "alpha"
, alpha is on the x
axis with the tuning metric on y, with error bars showing metric SD. If xaxis is set to "nvar", the number of non-zero coefficients is shown on x and how this relates to model deviance/accuracy on y.
## S3 method for class 'cva.glmnet' plot( x, xaxis = c("lambda", "alpha", "nvar"), errorBar = (xaxis == "alpha"), errorWidth = 0.015, min.pch = NULL, scheme = NULL, palette = "zissou", showLegend = "bottomright", ... )
x |
Object of class 'cva.glmnet'. |
xaxis |
String specifying what is plotted on the x axis, either log lambda, alpha or the number of non-zero coefficients. |
errorBar |
Logical whether to control error bars for the standard
deviation of model deviance when |
errorWidth |
Width of error bars. |
min.pch |
Plotting 'character' for the minimum point of each curve. Not
shown if set to |
scheme |
Colour scheme. Overrides the |
palette |
Palette name (one of |
showLegend |
Either a keyword to position the legend or |
... |
Other arguments passed to plot. Use |
No return value
Myles Lewis
Plots a precision-recall curve using base graphics. It accepts an S3 object
of class 'prc', see prc()
.
## S3 method for class 'prc' plot(x, ...)
x |
An object of class 'prc' |
... |
Optional graphical arguments passed to |
No return value
library(mlbench)
data(Sonar)
y <- Sonar$Class
x <- Sonar[, -61]

fit1 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1, cv.cores = 2)
fit1$prc <- prc(fit1)  # calculate precision-recall curve

fit2 <- nestcv.train(y, x, method = "gbm", cv.cores = 2)
fit2$prc <- prc(fit2)

plot(fit1$prc)
lines(fit2$prc, col = "red")
Filter using coefficients from partial least squares (PLS) regression to select optimal predictors.
pls_filter( y, x, force_vars = NULL, nfilter, ncomp = 5, scale_x = TRUE, type = c("index", "names", "full"), ... )
y |
Response vector |
x |
Matrix of predictors |
force_vars |
Vector of column names within |
nfilter |
Either a single value for the total number of predictors to
return. Or a vector of length |
ncomp |
the number of components to include in the PLS model. |
scale_x |
Logical whether to scale predictors before fitting the PLS model. This is recommended. |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a named vector of variable importance. |
... |
Other arguments passed to |
The best predictors may overlap between components, so if nfilter
is
specified as a vector, the total number of unique predictors returned may be
variable.
Integer vector of indices of filtered parameters (type = "index") or
character vector of names (type = "names") of filtered parameters. If
type
is "full"
full output of coefficients from plsr
is returned as a
list for each model component ordered by highest absolute coefficient.
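An illustrative sketch (Boston housing data as used elsewhere in this manual; assumes the pls package, which provides plsr, is installed):

library(mlbench)
data(BostonHousing2)
y <- BostonHousing2$cmedv
x <- data.matrix(subset(BostonHousing2, select = -c(cmedv, medv, town, chas)))
pls_filter(y, x, nfilter = 5, ncomp = 3, type = "names")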
Builds a precision-recall curve for a 'nestedcv' model using prediction()
and performance()
functions from the ROCR package and returns an object of
class 'prc' for plotting.
prc(...) ## Default S3 method: prc(response, predictor, positive = 2, ...) ## S3 method for class 'data.frame' prc(output, ...) ## S3 method for class 'nestcv.glmnet' prc(object, ...) ## S3 method for class 'nestcv.train' prc(object, ...) ## S3 method for class 'nestcv.SuperLearner' prc(object, ...) ## S3 method for class 'outercv' prc(object, ...) ## S3 method for class 'repeatcv' prc(object, ...)
... |
other arguments |
response |
Binary factor vector of response with default level order of controls, cases. |
predictor |
numeric vector of probabilities |
positive |
Either an integer 1 or 2 for the level of response factor considered to be 'positive' or 'relevant', or a character value for that factor. |
output |
data.frame with columns |
object |
a 'nestcv.glmnet', 'nestcv.train', 'nestcv.SuperLearner', 'outercv' or 'repeatcv' S3 class results object. |
An object of S3 class 'prc' containing the following fields:
recall |
vector of recall values |
precision |
vector of precision values |
auc |
area under precision-recall curve value using trapezoid method |
baseline |
baseline precision value |
library(mlbench)
data(Sonar)
y <- Sonar$Class
x <- Sonar[, -61]

fit1 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1, cv.cores = 2)
fit1$prc <- prc(fit1)  # calculate precision-recall curve
fit1$prc$auc  # precision-recall AUC value

fit2 <- nestcv.train(y, x, method = "gbm", cv.cores = 2)
fit2$prc <- prc(fit2)
fit2$prc$auc

plot(fit1$prc, ylim = c(0, 1))
lines(fit2$prc, col = "red")

res <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1) |>
  repeatcv(n = 4, rep.cores = 2)
res$prc <- prc(res)  # precision-recall curve on repeated predictions
plot(res$prc)
Prediction wrapper functions to enable the use of the fastshap
package for
generating SHAP values from nestedcv
trained models.
pred_nestcv_glmnet(x, newdata) pred_nestcv_glmnet_class1(x, newdata) pred_nestcv_glmnet_class2(x, newdata) pred_nestcv_glmnet_class3(x, newdata) pred_train(x, newdata) pred_train_class1(x, newdata) pred_train_class2(x, newdata) pred_train_class3(x, newdata) pred_SuperLearner(x, newdata)
x |
a |
newdata |
a matrix of new data |
These prediction wrapper functions are designed to be used with the
fastshap
package. The functions pred_nestcv_glmnet
and pred_train
work
for nestcv.glmnet
and nestcv.train
models respectively for either binary
classification or regression.
For multiclass classification use pred_nestcv_glmnet_class1
, 2
and 3
for the first 3 classes. Similarly pred_train_class1
etc for nestcv.train
objects. These functions can be inspected and easily modified to analyse
further classes.
Prediction wrapper function designed for use with fastshap::explain().
library(fastshap)

# Boston housing dataset
library(mlbench)
data(BostonHousing2)
dat <- BostonHousing2
y <- dat$cmedv
x <- subset(dat, select = -c(cmedv, medv, town, chas))

# Fit a glmnet model using nested CV
# Only 3 outer CV folds and 1 alpha value for speed
fit <- nestcv.glmnet(y, x, family = "gaussian", n_outer_folds = 3, alphaSet = 1)

# Generate SHAP values using fastshap::explain
# Only a small nsim is used here for speed; much higher values of nsim are recommended
sh <- explain(fit, X = x, pred_wrapper = pred_nestcv_glmnet, nsim = 1)

# Plot overall variable importance
plot_shap_bar(sh, x)

# Plot beeswarm plot
plot_shap_beeswarm(sh, x, size = 1)
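For the multiclass case, an illustrative sketch (iris with a multinomial glmnet model; fold numbers and nsim are kept deliberately small):

library(fastshap)
data(iris)
xi <- iris[, 1:4]
fit3 <- nestcv.glmnet(iris$Species, xi, family = "multinomial",
                      alphaSet = 1, n_outer_folds = 3)
# SHAP values for the first class only, via the class-specific wrapper
sh1 <- explain(fit3, X = xi, pred_wrapper = pred_nestcv_glmnet_class1, nsim = 5)
plot_shap_bar(sh1, xi)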
Makes predictions from a cross-validated glmnet model with optimal value of lambda and alpha.
## S3 method for class 'cva.glmnet' predict(object, newx, s = "lambda.1se", ...)
object |
Fitted |
newx |
Matrix of new values for |
s |
Value of penalty parameter lambda. Default value is |
... |
Other arguments passed to |
Object returned depends on arguments in ...
such as type
.
Draws from the posterior predictive distribution of the outcome.
## S3 method for class 'hsstan' predict(object, newdata = NULL, type = NULL, ...)
object |
An object of class |
newdata |
Optional data frame containing the variables to use to
predict. If |
type |
Option for binary outcomes only. Default |
... |
Optional arguments passed to |
For a binary outcome and type = NULL
, a character vector with the
name of the class that has the highest probability for each sample.
For a binary outcome and type = "prob"
, a 2-dimensional matrix with the
probability of class 0 and of class 1 for each sample.
For a continuous outcome a numeric vector with the predicted value for
each sample.
Athina Spiliopoulou
Obtains predictions from the final fitted model from a nestcv.glmnet object.
## S3 method for class 'nestcv.glmnet' predict(object, newdata, s = object$final_param["lambda"], modify = FALSE, ...)
object |
Fitted |
newdata |
New data to predict outcome on |
s |
Value of lambda for glmnet prediction |
modify |
Logical whether to modify |
... |
Other arguments passed to |
Checks for missing predictors; if these are sparse (i.e. have zero coefficients), columns of 0 are automatically added to enable prediction to proceed.
Object returned depends on the ...
argument passed to predict
method for glmnet
objects.
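An illustrative sketch of prediction on new data (Sonar data as in other examples in this manual; type is simply passed through to the glmnet predict method):

library(mlbench)
data(Sonar)
fit <- nestcv.glmnet(Sonar$Class, Sonar[, -61], family = "binomial",
                     alphaSet = 1, n_outer_folds = 3)
newx <- as.matrix(Sonar[1:5, -61])
predict(fit, newdata = newx, type = "response")  # predicted probabilities
predict(fit, newdata = newx, type = "class")     # predicted class labels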
Quick function to calculate performance metrics: confusion matrix, accuracy and balanced accuracy for classification; ROC AUC for binary classification; RMSE, R^2 and MAE for regression. Multi-class AUC is returned for multinomial classification.
predSummary(output, family = "")
output |
data.frame with columns |
family |
Optional character value to support specific glmnet models e.g. 'mgaussian', 'cox'. |
For multinomial classification, multi-class AUC as defined by Hand
and Till is calculated using pROC::multiclass.roc()
.
Multi-class balanced accuracy is calculated as the mean of the Recall for each class.
R^2 (coefficient of determination) is calculated as 1 - rss / tss, where rss = residual sum of squares, tss = total sum of squares. Pearson r^2 is also provided. Pearson r^2 can only range from 0 to 1, whereas R^2 can range from 1 to -Inf.
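The two quantities can be reproduced by hand on a toy vector (illustrative only, not package code):

obs  <- c(2, 4, 6, 8)   # observed values
pred <- c(3, 3, 7, 9)   # predicted values
R2 <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # coefficient of determination, 0.8
r2 <- cor(obs, pred)^2                                    # Pearson r^2, approx 0.90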
An object of class 'predSummary'. For classification a list is returned containing the confusion matrix table and a vector containing accuracy and balanced accuracy for classification, ROC AUC for classification. For regression a vector containing RMSE, R^2 and MAE is returned. For glmnet 'cox' models, Harrell's C-index is returned.
For glmnet 'mgaussian' models, an object of class 'predSummaryMulti' is
returned which is a list of vectors with regression metrics (RMSE, R^2,
MAE) for each response variable (i.e. each y
column).
Random oversampling of the minority group(s) or undersampling of the majority group to compensate for class imbalance in datasets.
randomsample(y, x, minor = NULL, major = 1, yminor = NULL)
y |
Vector of response outcome as a factor |
x |
Matrix of predictors |
minor |
Amount of oversampling of the minority class. If set to |
major |
Amount of undersampling of the majority class |
yminor |
Optional character value specifying the level in |
Values of minor < 1 and major > 1 are ignored.
List containing extended matrix x
of synthesised data and extended
response vector y
## Imbalanced dataset
set.seed(1, "L'Ecuyer-CMRG")
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04)  # predictors
y <- factor(rbinom(150, 1, 0.2))  # imbalanced binary response
table(y)

## first 30 parameters are weak predictors
x[, 1:30] <- rnorm(150 * 30, 0, 1) + as.numeric(y)*0.5

## Balance x & y outside of CV loop by random oversampling minority group
out <- randomsample(y, x)
y2 <- out$y
x2 <- out$x
table(y2)

## Nested CV glmnet with unnested balancing by random oversampling on
## whole dataset
fit1 <- nestcv.glmnet(y2, x2, family = "binomial", alphaSet = 1,
                      cv.cores = 2, filterFUN = ttest_filter)
fit1$summary

## Balance x & y outside of CV loop by random undersampling majority group
out <- randomsample(y, x, minor = 1, major = 0.4)
y2 <- out$y
x2 <- out$x
table(y2)

## Nested CV glmnet with unnested balancing by random undersampling on
## whole dataset
fit1b <- nestcv.glmnet(y2, x2, family = "binomial", alphaSet = 1,
                       cv.cores = 2, filterFUN = ttest_filter)
fit1b$summary

## Balance x & y outside of CV loop by SMOTE
out <- smote(y, x)
y2 <- out$y
x2 <- out$x
table(y2)

## Nested CV glmnet with unnested balancing by SMOTE on whole dataset
fit2 <- nestcv.glmnet(y2, x2, family = "binomial", alphaSet = 1,
                      cv.cores = 2, filterFUN = ttest_filter)
fit2$summary

## Nested CV glmnet with nested balancing by random oversampling
fit3 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1,
                      cv.cores = 2, balance = "randomsample",
                      filterFUN = ttest_filter)
fit3$summary
class_balance(fit3)

## Plot ROC curves
plot(fit1$roc, col = 'green')
lines(fit1b$roc, col = 'red')
lines(fit2$roc, col = 'blue')
lines(fit3$roc)
legend('bottomright', legend = c("Unnested random oversampling",
                                 "Unnested SMOTE",
                                 "Unnested random undersampling",
                                 "Nested balancing"),
       col = c("green", "blue", "red", "black"), lty = 1, lwd = 2)
Fits a random forest model via the ranger
package and ranks variables by
variable importance.
ranger_filter( y, x, nfilter = NULL, type = c("index", "names", "full"), num.trees = 1000, mtry = ncol(x) * 0.2, ... )
y |
Response vector |
x |
Matrix or dataframe of predictors |
nfilter |
Number of predictors to return. If |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a named vector of variable importance. |
num.trees |
Number of trees to grow. See ranger::ranger. |
mtry |
Number of predictors randomly sampled as candidates at each split. See ranger::ranger. |
... |
Optional arguments passed to ranger::ranger. |
This filter uses the ranger() function from the ranger package. Variable importance is calculated using mean decrease in Gini impurity.
Integer vector of indices of filtered parameters (type = "index") or
character vector of names (type = "names") of filtered parameters. If
type
is "full"
a named vector of variable importance is returned.
Uses ReliefF algorithm from the CORElearn package to rank predictors in order of importance.
relieff_filter( y, x, nfilter = NULL, estimator = "ReliefFequalK", type = c("index", "names", "full"), ... )
y |
Response vector |
x |
Matrix or dataframe of predictors |
nfilter |
Number of predictors to return. If |
estimator |
Type of algorithm used, see CORElearn::attrEval |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a named vector of variable importance. |
... |
Other arguments passed to CORElearn::attrEval |
Integer vector of indices of filtered parameters (type = "index") or
character vector of names (type = "names") of filtered parameters. If
type
is "full"
a named vector of variable importance is returned.
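A short illustrative sketch (requires the CORElearn package; iris is used purely as a small example):

if (requireNamespace("CORElearn", quietly = TRUE)) {
  data(iris)
  relieff_filter(iris$Species, iris[, 1:4], nfilter = 2, type = "names")
}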
Performs repeated calls to a nestedcv
model to determine performance across
repeated runs of nested CV.
repeatcv( expr, n = 5, repeat_folds = NULL, keep = FALSE, extra = FALSE, progress = TRUE, rep.cores = 1L )
expr |
An expression containing a call to |
n |
Number of repeats |
repeat_folds |
Optional list containing fold indices to be applied to the outer CV folds. |
keep |
Logical whether to save repeated outer CV fitted models for variable importance, SHAP etc. Note this can make the resulting object very large. |
extra |
Logical whether additional performance metrics are gathered for
binary classification models. See |
progress |
Logical whether to show progress. |
rep.cores |
Integer specifying number of cores/threads to invoke. |
We recommend using this with the R pipe |>
(see examples).
When comparing models, it is recommended to fix the sets of outer CV folds
used across each repeat for comparing performance between models. The
function repeatfolds()
can be used to create a fixed set of outer CV folds
for each repeat.
Parallelisation over repeats is performed using parallel::mclapply (not available on Windows). Beware that cv.cores can still be set within calls to nestedcv models (= nested parallelisation). This means that rep.cores x cv.cores processes/forks will be spawned, so be careful not to overload your CPU. In general, parallelisation of repeats using rep.cores is faster than parallelisation using cv.cores.
List of S3 class 'repeatcv' containing:
call |
the model call |
result |
matrix of performance metrics |
output |
a matrix or dataframe containing the outer CV predictions from each repeat |
roc |
(binary classification models only) a ROC curve object based on
predictions across all repeats as returned in |
fits |
(if |
data("iris") dat <- iris y <- dat$Species x <- dat[, 1:4] res <- nestcv.glmnet(y, x, family = "multinomial", alphaSet = 1, n_outer_folds = 4) |> repeatcv(3, rep.cores = 2) res summary(res) ## set up fixed fold indices set.seed(123, "L'Ecuyer-CMRG") folds <- repeatfolds(y, repeats = 3, n_outer_folds = 4) res <- nestcv.glmnet(y, x, family = "multinomial", alphaSet = 1, n_outer_folds = 4) |> repeatcv(3, repeat_folds = folds, rep.cores = 2) res
data("iris") dat <- iris y <- dat$Species x <- dat[, 1:4] res <- nestcv.glmnet(y, x, family = "multinomial", alphaSet = 1, n_outer_folds = 4) |> repeatcv(3, rep.cores = 2) res summary(res) ## set up fixed fold indices set.seed(123, "L'Ecuyer-CMRG") folds <- repeatfolds(y, repeats = 3, n_outer_folds = 4) res <- nestcv.glmnet(y, x, family = "multinomial", alphaSet = 1, n_outer_folds = 4) |> repeatcv(3, repeat_folds = folds, rep.cores = 2) res
Create folds for repeated nested CV
repeatfolds(y, repeats = 5, n_outer_folds = 10)
y |
Outcome vector |
repeats |
Number of repeats |
n_outer_folds |
Number of outer CV folds |
List containing indices of outer CV folds
data("iris") dat <- iris y <- dat$Species x <- dat[, 1:4] ## set up fixed fold indices set.seed(123, "L'Ecuyer-CMRG") folds <- repeatfolds(y, repeats = 3, n_outer_folds = 4) res <- nestcv.glmnet(y, x, family = "multinomial", alphaSet = 1, n_outer_folds = 4, cv.cores = 2) |> repeatcv(3, repeat_folds = folds) res
data("iris") dat <- iris y <- dat$Species x <- dat[, 1:4] ## set up fixed fold indices set.seed(123, "L'Ecuyer-CMRG") folds <- repeatfolds(y, repeats = 3, n_outer_folds = 4) res <- nestcv.glmnet(y, x, family = "multinomial", alphaSet = 1, n_outer_folds = 4, cv.cores = 2) |> repeatcv(3, repeat_folds = folds) res
Fits a random forest model and ranks variables by variable importance.
rf_filter( y, x, nfilter = NULL, type = c("index", "names", "full"), ntree = 1000, mtry = ncol(x) * 0.2, ... )
y |
Response vector |
x |
Matrix or dataframe of predictors |
nfilter |
Number of predictors to return. If |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a named vector of variable importance. |
ntree |
Number of trees to grow. See randomForest::randomForest. |
mtry |
Number of predictors randomly sampled as candidates at each split. See randomForest::randomForest. |
... |
Optional arguments passed to randomForest::randomForest. |
This filter uses the randomForest()
function from the randomForest
package. Variable importance is calculated using the
randomForest::importance function, specifying type 1 = mean decrease in
accuracy. See randomForest::importance.
Integer vector of indices of filtered parameters (type = "index") or
character vector of names (type = "names") of filtered parameters. If
type
is "full"
a named vector of variable importance is returned.
Slims nestedcv objects to only the models in the outer CV folds.
slim(x)
x |
A 'nestedcv' or 'cva.glmnet' fitted model object. |
For 'nestedcv' objects, a list object of the same class but only
containing outer_result
. For 'cva.glmnet' models, only the cv.glmnet
model with the best alpha value is kept. Models for all other values of
alpha are discarded.
nestcv.glmnet()
nestcv.train()
outercv()
cva.glmnet()
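An illustrative sketch of the size reduction (model and fold settings chosen only for speed):

data(iris)
fit <- nestcv.glmnet(iris$Species, iris[, 1:4], family = "multinomial",
                     alphaSet = 1, n_outer_folds = 3)
sfit <- slim(fit)
format(object.size(fit), units = "Kb")
format(object.size(sfit), units = "Kb")   # reduced object retaining only outer_result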
Synthetic Minority Oversampling Technique (SMOTE) algorithm for imbalanced classification data.
smote(y, x, k = 5, over = NULL, yminor = NULL)
y |
Vector of response outcome as a factor |
x |
Matrix of predictors |
k |
Range of KNN to consider for generation of new data |
over |
Amount of oversampling of the minority class. If set to |
yminor |
Optional character value specifying the level in |
List containing extended matrix x
of synthesised data and extended
response vector y
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.
Univariate statistic filter for dataframes of predictors with mixed numeric and categorical datatypes. Different statistical tests are used depending on the data type of response vector and predictors:
bin_stat_filter(): t-test for continuous data, chi-squared test for categorical data
class_stat_filter(): one-way ANOVA for continuous data, chi-squared test for categorical data
cor_stat_filter(): correlation (or linear regression) for continuous data and binary data, one-way ANOVA for categorical data
stat_filter(y, x, ...) bin_stat_filter( y, x, force_vars = NULL, nfilter = NULL, p_cutoff = 0.05, rsq_cutoff = NULL, type = c("index", "names", "full", "list"), ... ) class_stat_filter( y, x, force_vars = NULL, nfilter = NULL, p_cutoff = 0.05, rsq_cutoff = NULL, type = c("index", "names", "full", "list"), ... ) cor_stat_filter( y, x, cor_method = c("pearson", "spearman", "lm"), force_vars = NULL, nfilter = NULL, p_cutoff = 0.05, rsq_cutoff = NULL, rsq_method = "pearson", type = c("index", "names", "full", "list"), ... )
y |
Response vector |
x |
Matrix or dataframe of predictors |
... |
optional arguments, e.g. |
force_vars |
Vector of column names within |
nfilter |
Number of predictors to return. If |
p_cutoff |
p value cut-off |
rsq_cutoff |
r^2 cutoff for removing predictors due to collinearity.
Default |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a dataframe of statistics, "list" returns a list of 2 matrices of statistics, one for continuous predictors, one for categorical predictors. |
cor_method |
For |
rsq_method |
character string indicating which correlation coefficient
is to be computed. One of "pearson" (default), "kendall", or "spearman".
See |
stat_filter()
is a wrapper which calls bin_stat_filter()
,
class_stat_filter()
or cor_stat_filter()
depending on whether y
is
binary, multiclass or continuous respectively. Ordered factors are converted
to numeric (integer) levels and analysed as if continuous.
Integer vector of indices of filtered parameters (type = "index") or
character vector of names (type = "names") of filtered parameters in order
of test p-value. If type
is "full"
full output is
returned containing a dataframe of statistical results. If type
is
"list"
the output is returned as a list of 2 matrices containing
statistical results separated by continuous and categorical predictors.
library(mlbench)
data(BostonHousing2)
dat <- BostonHousing2
y <- dat$cmedv  ## continuous outcome
x <- subset(dat, select = -c(cmedv, medv, town))

stat_filter(y, x, type = "full")
stat_filter(y, x, nfilter = 5, type = "names")
stat_filter(y, x)

data(iris)
y <- iris$Species  ## 3 class outcome
x <- subset(iris, select = -Species)
stat_filter(y, x, type = "full")
Summarise variables
summary_vars(x)
x |
Matrix or dataframe with variables in columns |
A matrix with variables in rows and mean, median and SD for each
variable or number of levels if the variable is a factor. If NA
are
detected, an extra column n.NA
is added with the numbers of NA
for each
variable.
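An illustrative sketch on the built-in iris data:

data(iris)
summary_vars(iris[, 1:4])   # mean, median and SD per variable
summary_vars(iris)          # 'Species' is a factor so its number of levels is reported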
Performs supervised principal component analysis (PCA) after filtering the dataset to help determine whether filtering has been useful for separating samples according to the outcome variable.
supervisedPCA(y, x, filterFUN = NULL, filter_options = NULL, plot = TRUE, ...)
y |
Response vector |
x |
Matrix of predictors |
filterFUN |
Filter function, e.g. ttest_filter or relieff_filter.
Any function can be provided and is passed |
filter_options |
List of additional arguments passed to the filter
function specified by |
plot |
Logical whether to plot a ggplot2 object or return the PC scores |
... |
Optional arguments passed to |
If plot=TRUE returns a ggplot2 plot, otherwise returns the principal component scores.
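An illustrative sketch (iris with anova_filter as the multiclass filter; the cutoff of 2 predictors is arbitrary):

data(iris)
supervisedPCA(iris$Species, iris[, 1:4],
              filterFUN = anova_filter,
              filter_options = list(nfilter = 2))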
Obtain predictions on outer training folds which can be used for performance metrics and ROC curves.
train_preds(x)
x |
a |
Note: the argument outer_train_predict
must be set to TRUE
in
the original call to either nestcv.glmnet
, nestcv.train
or outercv
.
Dataframe with columns ytrain
and predy
containing observed and
predicted values from training folds. For binomial and multinomial models
additional columns are added with class probabilities or log likelihood
values.
Build ROC (receiver operating characteristic) curve from outer training
folds. Object can be plotted using plot()
or passed to functions
pROC::auc()
etc.
train_roc(x, direction = "<", ...)
x |
a |
direction |
Set ROC directionality pROC::roc |
... |
Other arguments passed to pROC::roc |
Note: the argument outer_train_predict
must be set to TRUE
in
the original call to either nestcv.glmnet
, nestcv.train
or outercv
.
"roc"
object, see pROC::roc
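An illustrative sketch (Sonar data as in other examples; note that outer_train_predict = TRUE must be set in the original fit):

library(mlbench)
data(Sonar)
fit <- nestcv.glmnet(Sonar$Class, Sonar[, -61], family = "binomial",
                     alphaSet = 1, n_outer_folds = 3,
                     outer_train_predict = TRUE)
troc <- train_roc(fit)
pROC::auc(troc)
plot(troc)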
Calculates performance metrics on outer training folds: confusion matrix, accuracy and balanced accuracy for classification; ROC AUC for binary classification; RMSE, R^2 and mean absolute error (MAE) for regression.
train_summary(x)
x |
a |
Note: the argument outer_train_predict
must be set to TRUE
in
the original call to either nestcv.glmnet
, nestcv.train
or outercv
.
Returns performance metrics from outer training folds, see predSummary
data(iris)
x <- iris[, 1:4]
y <- iris[, 5]

fit <- nestcv.glmnet(y, x, family = "multinomial", alpha = 1,
                     outer_train_predict = TRUE, n_outer_folds = 3)
summary(fit)
innercv_summary(fit)
train_summary(fit)

fit2 <- nestcv.train(y, x, model = "svm", outer_train_predict = TRUE,
                     n_outer_folds = 3, cv.cores = 2)
summary(fit2)
innercv_summary(fit2)
train_summary(fit2)
A selection of simple univariate filters using t-test, Wilcoxon test, one-way ANOVA or correlation (Pearson or Spearman) for ranking variables. These filters are designed for speed. ttest_filter uses the Rfast package, wilcoxon_filter (Mann-Whitney test) uses matrixTests::row_wilcoxon_twosample, and anova_filter uses matrixTests::col_oneway_welch (Welch's F-test) from the matrixTests package. Can be applied to all or a subset of predictors. For mixed datasets (combined continuous & categorical) see stat_filter().
ttest_filter( y, x, force_vars = NULL, nfilter = NULL, p_cutoff = 0.05, rsq_cutoff = NULL, type = c("index", "names", "full"), keep_factors = TRUE, ... ) anova_filter( y, x, force_vars = NULL, nfilter = NULL, p_cutoff = 0.05, rsq_cutoff = NULL, type = c("index", "names", "full"), keep_factors = TRUE, ... ) wilcoxon_filter( y, x, force_vars = NULL, nfilter = NULL, p_cutoff = 0.05, rsq_cutoff = NULL, type = c("index", "names", "full"), exact = FALSE, keep_factors = TRUE, ... ) correl_filter( y, x, method = "pearson", force_vars = NULL, nfilter = NULL, p_cutoff = 0.05, rsq_cutoff = NULL, type = c("index", "names", "full"), keep_factors = TRUE, ... )
y |
Response vector |
x |
Matrix or dataframe of predictors |
force_vars |
Vector of column names within |
nfilter |
Number of predictors to return. If |
p_cutoff |
p value cut-off |
rsq_cutoff |
r^2 cutoff for removing predictors due to collinearity.
Default |
type |
Type of vector returned. Default "index" returns indices, "names" returns predictor names, "full" returns a matrix of p values. |
keep_factors |
Logical affecting factors with 3 or more levels.
Dataframes are coerced to a matrix using data.matrix. Binary
factors are converted to numeric values 0/1 and analysed as such. If
|
... |
optional arguments, including |
exact |
Logical whether exact or approximate p-value is calculated.
Default is |
method |
Type of correlation, either "pearson" or "spearman". |
Integer vector of indices of filtered parameters (type = "index") or
character vector of names (type = "names") of filtered parameters in order
of t-test p-value. If type
is "full"
full output from
Rfast::ttests is returned.
## sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}

## load iris dataset and simulate a binary outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
y2 <- sigmoid(0.5 * dt$marker1 + 2 * dt$marker2) > runif(nrow(dt))
y2 <- factor(y2, labels = c("C1", "C2"))

ttest_filter(y2, dt)                 # returns index of filtered predictors
ttest_filter(y2, dt, type = "name")  # shows names of predictors
ttest_filter(y2, dt, type = "full")  # full results table

data(iris)
dt <- iris[, 1:4]
y3 <- iris[, 5]
anova_filter(y3, dt)                 # returns index of filtered predictors
anova_filter(y3, dt, type = "name")  # shows names of predictors
anova_filter(y3, dt, type = "full")  # full results table
Text progress bar in the R console. Modified from utils::txtProgressBar()
to include title and timing.
txtProgressBar2( min = 0, max = 1, initial = 0, char = "=", width = NA, title = "" )
min |
Numeric value for minimum of the progress bar. |
max |
Numeric value for maximum of the progress bar. |
initial |
Initial value for the progress bar. |
char |
The character (or character string) to form the progress bar. |
width |
The width of the progress bar, as a multiple of the width of
|
title |
Title for the progress bar. |
Use utils::setTxtProgressBar()
to set the progress bar and close()
to
close it.
An object of class "txtProgressBar
".
Determines directionality of final predictors for binary or regression models, using the sign of the t-statistic or correlation coefficient respectively for each variable compared to the outcomes.
var_direction(object)
object |
a |
Categorical features with >2 levels are assumed to have a meaningful order
for the purposes of directionality. Factors are coerced to ordinal using
data.matrix()
. If factors are multiclass then directionality results should
be ignored.
named vector showing the directionality of final predictors. If the
response vector is multinomial NULL
is returned.
Uses variable importance across models trained and tested across outer CV
folds to assess stability of variable importance. For glmnet, variable
importance is measured as the absolute model coefficients, optionally scaled
as a percentage. The frequency with which each variable is selected in outer
folds as well as the final model is also returned which is helpful for sparse
models or with filters to determine how often variables end up in the model
in each fold. For glmnet, the direction of effect is taken directly from the
sign of model coefficients. For caret
models, direction of effect is not
readily available, so as a substitute, the directionality of each predictor
is determined by the function var_direction()
using the sign of a t-test
for binary classification or the sign of regression coefficient for
continuous outcomes (not available for multiclass caret models). To better
understand direction of effect of each predictor within the final model, we
recommend using SHAP values - see the vignette "Explaining nestedcv models
with Shapley values". See pred_train()
for an example.
var_stability(x, ...) ## S3 method for class 'nestcv.glmnet' var_stability( x, ranks = FALSE, summary = TRUE, percent = TRUE, level = 1, sort = TRUE, ... ) ## S3 method for class 'nestcv.train' var_stability(x, ranks = FALSE, summary = TRUE, sort = TRUE, ...) ## S3 method for class 'repeatcv' var_stability(x, ...)
x |
a |
... |
Optional arguments for compatibility |
ranks |
Logical whether to rank variables by importance |
summary |
Logical whether to return summary statistics on variable
importance. Ignored if |
percent |
Logical for |
level |
For multinomial |
sort |
Logical whether to sort variables by mean importance |
Note that for caret models caret::varImp()
may require the model package to
be fully loaded in order to function. During the fitting process caret
often only loads the package by namespace.
If ranks
is FALSE
and summary
is TRUE
, returns a dataframe
containing mean, sd, sem of variable importance and frequency by which each
variable is selected in outer folds. If summary
is FALSE
, a matrix of
either variable importance or, if ranks = TRUE
, rankings across the
outer folds and the final model is returned, with variables in rows and
folds in columns.
cv_coef()
cv_varImp()
pred_train()
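An illustrative sketch (Sonar data with a t-test filter; the filter size and fold number are arbitrary choices for speed):

library(mlbench)
data(Sonar)
fit <- nestcv.glmnet(Sonar$Class, Sonar[, -61], family = "binomial",
                     alphaSet = 1, n_outer_folds = 3,
                     filterFUN = ttest_filter,
                     filter_options = list(nfilter = 20))
var_stability(fit)
plot_var_stability(fit, top = 10)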
Calculate weights for class imbalance
weight(y)
y |
Factor or character response vector. If a character vector is supplied it is coerced into a factor. Unused levels are dropped. |
Vector of weights
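An illustrative sketch (the commented-out model call is hypothetical, showing how the weights could be passed on; x is a placeholder predictor matrix):

y <- factor(rbinom(150, 1, 0.2))   # imbalanced binary outcome
w <- weight(y)
head(w)
# w can then be supplied to models accepting case weights, e.g.
# outercv(y, x, "glm", weights = w)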