autovar(raw_dataframe, selected_column_names, significance_levels = c(0.05, 0.01, 0.005), test_names = c("portmanteau", "portmanteau_squared", "skewness"), criterion = "AIC", imputation_iterations = 30, measurements_per_day = 1)
raw_dataframe: The raw, unimputed data frame. This can include columns other than the selected_column_names, as those may be helpful for the imputation.

selected_column_names: The names of the columns to be used as endogenous variables in the models, specified as a vector of character strings. The selected column names should be a subset of the column names of raw_dataframe.

significance_levels: A vector of descending p-levels used as cut-offs for placing models in significance buckets. If not specified, this parameter defaults to c(0.05, 0.01, 0.005). For example, with the default configuration, a model whose worst (lowest) p-level for any test is 0.03 is always seen as a better model than one whose worst p-level for any test is 0.009, no matter the AIC/BIC score of that model. Also, the lowest significance level indicates the minimum p-level for any test of a valid model. Thus, if a test for a model has a lower p-level than the minimum specified significance level, the model is considered invalid.

test_names: The residual tests that should be performed on the models, specified as a vector of character strings. If not specified, this parameter defaults to c('portmanteau', 'portmanteau_squared', 'skewness'). The possible tests are c('portmanteau', 'portmanteau_squared', 'skewness', 'kurtosis', 'joint_sktest'). In addition to the residual tests, please note that the Eigenvalue stability test is always performed.

criterion: The information criterion used to sort the models, either 'AIC' (the default) or 'BIC'.

imputation_iterations: The number of iterations used when imputing missing values in the data set. The default value for this parameter is 30.

measurements_per_day: The number of measurements per day in the time series data. The default value for this parameter is 1. If this value is 0, then daypart- and day-dummy variables are not included in any of the models.

Value: A sorted list of the best models found. Each model in this list is a list with the properties logtransformed, lag, varest, model_score, bucket, and nr_dummy_variables.
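For illustration, the following is a minimal sketch of inspecting the returned list (the data frame df is assumed to exist and to contain the selected columns; the varest property is assumed to hold a fitted VAR model compatible with summary()):

  library(AutovarCore)
  # 'df' is assumed to be a data frame containing at least the selected columns.
  models <- autovar(df, selected_column_names = c('rumination', 'happiness'))
  best <- models[[1]]          # the first item is the best model found
  best$logtransformed          # whether the data were log transformed
  best$lag                     # the lag of the model (1 or 2)
  best$bucket                  # the significance bucket, e.g. 0.05
  best$model_score             # the AIC or BIC score, depending on 'criterion'
  best$nr_dummy_variables      # the number of dummy variables in the model
  summary(best$varest)         # the underlying fitted VAR model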
This function evaluates possible VAR models for the given time series data set and returns a sorted list of the best models found. The first item in this list is the "best model" found.
AutovarCore evaluates eight kinds of models: models with and without log transforming the data, lag 1 and lag 2 models, and with and without day dummy variables. For each of these 8 model configurations, we evaluate all possible combinations for including outlier dummies (at 2.5x the standard deviation of the residuals) and retain the best model (the procedure for selecting the best model is described in more detail below).
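To make the enumeration concrete, here is a small illustrative sketch of the eight basic configurations (plain R, not part of the AutovarCore API):

  # The eight basic model configurations: every combination of
  # log transform, lag, and day dummies.
  configurations <- expand.grid(logtransformed = c(FALSE, TRUE),
                                lag            = c(1, 2),
                                day_dummies    = c(FALSE, TRUE))
  nrow(configurations)  # 8
  # For each configuration, all combinations of outlier dummies (at 2.5x the
  # standard deviation of the residuals) are additionally evaluated, and only
  # the best resulting model per configuration is retained.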
These eight models are further reduced to four, because AutovarCore determines whether adding day dummies improves the model fit (considering only the significance bucket and the AIC/BIC score, NOT the number of outlier dummy columns). Only when the best model found with day dummies is a better model than the best model found without day dummies (all other parameters kept the same) do we keep the model with day dummies and discard the one without; otherwise, we keep the model without day dummies and discard the one with day dummies.
Thus, AutovarCore always returns four models (assuming that we find enough models that pass the Eigenvalue stability test: models that do not pass this test are immediately discarded). There are three points in the code where we determine the best model, which is done according to what we refer to as algorithm A or algorithm B, which we explain below.
When evaluating all possible combinations of outlier dummies for otherwise identical model configurations, we use algorithm A to determine the best model. When comparing whether the best model found without day dummy columns is better than the best model found with day dummy columns, we use algorithm B. For sorting the two models that differ in lag but share the same logtransform setting, we again use algorithm A. Finally, we merge the two sorted lists of two models (with and without logtransform) to obtain the final list of four models; the sorting comparison in this merge uses algorithm B.
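As a toy illustration of the day-dummy decision described above (the model objects and the helper function are hypothetical, chosen only to mirror algorithm B; they are not AutovarCore internals):

  # Algorithm B compares by significance bucket first and AIC/BIC score second,
  # ignoring the number of outlier dummy columns.
  is_better_b <- function(a, b) {
    if (a$bucket != b$bucket) return(a$bucket > b$bucket)  # higher bucket wins
    a$model_score < b$model_score                          # lower AIC/BIC wins
  }
  best_with_day_dummies    <- list(bucket = 0.05, model_score = 312.4)
  best_without_day_dummies <- list(bucket = 0.05, model_score = 308.9)
  if (is_better_b(best_with_day_dummies, best_without_day_dummies)) {
    chosen <- best_with_day_dummies     # keep the day dummies
  } else {
    chosen <- best_without_day_dummies  # discard the day dummies
  }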
The reason for using different sorting algorithms is that in some cases we want to select the model with the fewest outlier dummy columns (i.e., the model that retains most of the original data), while in other cases we know that a certain operation (such as adding day dummies or logtransforming the data set) will affect the number of outlier dummy columns in the model, so that a fair comparison should exclude this property.
Algorithm A applies the following rules for comparing two models, in order (a code sketch of this comparison is given after the list):

1. We first compare the significance buckets of the two models: the model in the higher bucket wins. If both models are in the same bucket, we proceed to the next rule. The significance buckets are formed between each of the (decreasingly sorted) specified significance_levels in the parameters to the autovar function call. For example, if the significance_levels are c(0.05, 0.01, 0.005), then the significance buckets are (0.05 <= x), (0.01 <= x < 0.05), (0.005 <= x < 0.01), and (x < 0.005). The metric used to place a model into a bucket is the maximum p-level that can be chosen as cut-off for determining whether an outcome is statistically significant such that all residual tests will still pass ("pass" meaning not invalidating the assumption that the residuals are normally distributed). In other words: it is the minimum p-value of all three residual tests of all endogenous variables in the model.

2. We then compare the number of outlier columns of the two models: the model with the fewest outlier columns wins. If both models have the same number of outlier columns, we proceed to the next rule. For this count of outlier columns, the following rules apply: all day dummy columns together add one to the count, and each outlier dummy variable adds one to the count.

3. Finally, we compare the AIC or BIC scores of the two models: the model with the lowest score wins. Whether the AIC or the BIC is used depends on the criterion option specified in the parameters to the autovar function call.
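The following is a self-contained sketch of algorithm A (illustrative only; the function names and the flat list representation of a model are assumptions, not AutovarCore's internal API):

  # Place a model into a significance bucket: the bucket is the highest specified
  # significance level that does not exceed the model's worst (minimum)
  # residual-test p-value; models falling below the lowest level get bucket 0
  # and are considered invalid.
  significance_bucket <- function(test_p_values,
                                  significance_levels = c(0.05, 0.01, 0.005)) {
    worst_p <- min(test_p_values)
    passing <- significance_levels[significance_levels <= worst_p]
    if (length(passing) == 0) 0 else max(passing)
  }

  # Algorithm A: compare by bucket, then by outlier column count, then by
  # AIC/BIC score. Returns TRUE if model 'a' is better than model 'b'.
  is_better_a <- function(a, b) {
    if (a$bucket != b$bucket) return(a$bucket > b$bucket)              # rule 1
    if (a$nr_outlier_columns != b$nr_outlier_columns)
      return(a$nr_outlier_columns < b$nr_outlier_columns)              # rule 2
    a$model_score < b$model_score                                      # rule 3
  }

  # Toy usage: both models sit in the 0.05 bucket, so the model with fewer
  # outlier columns wins despite its worse (higher) AIC/BIC score.
  model_1 <- list(bucket = 0.05, nr_outlier_columns = 1, model_score = 310.2)
  model_2 <- list(bucket = 0.05, nr_outlier_columns = 3, model_score = 305.8)
  is_better_a(model_1, model_2)  # TRUE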
In the end, we should have one best logtransformed model and one best nonlogtransformed model. We then compare these two models in the same way as we have compared all other models up to this point with one exception: we do not compare the number of outlier columns. Comparing the number of outliers would have likely favored logtransformed models over models without logtransform, as logtransformations typically have the effect of reducing the outliers of a sample.
Algorithm B is identical to algorithm A, except that we skip the step comparing the number of outlier dummy variables. Thus, we instead compare by significance bucket first and AIC/BIC score second. Note that if we assume that the presence or absence of day dummies does not vary between the four models in any particular invocation of the autovar method (which is not an unreasonable assumption), then the arbitrary choice of letting all day dummy columns together add one to the outlier count does not matter at all, since outlier dummy counts are only compared between two models that either both have or both lack day dummy columns.
We are able to compare the AIC/BIC scores of logtransformed and nonlogtransformed models fairly because we compensate the AIC/BIC scores to account for the effect of the logtransformation. We compensate for the logtransformation by adjusting the loglikelihood score of the logtransformed models in the calculation of their AIC/BIC scores (by subtracting the sum of the logtransformed data).
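As a sketch of this compensation (based on the description above, not the package's internal code), the AIC of a logtransformed model can be computed from a loglikelihood that has the sum of the logtransformed observations subtracted from it, which corresponds to the Jacobian correction for a log transformation:

  # Compensated AIC for a model fitted on logtransformed data. All names are
  # illustrative: 'loglik' is the model's loglikelihood, 'n_params' its number
  # of estimated parameters, and 'logtransformed_data' the logtransformed
  # values of the endogenous variables used in the fit.
  compensated_aic <- function(loglik, n_params, logtransformed_data) {
    adjusted_loglik <- loglik - sum(logtransformed_data)
    2 * n_params - 2 * adjusted_loglik
  }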
## Not run:
data_matrix <- matrix(nrow = 40, ncol = 3)
data_matrix[, ] <- runif(ncol(data_matrix) * nrow(data_matrix), 1, nrow(data_matrix))
while (sum(is.na(data_matrix)) == 0)
  data_matrix[as.logical(round(runif(ncol(data_matrix) * nrow(data_matrix), -0.3, 0.7)))] <- NA
colnames(data_matrix) <- c('rumination', 'happiness', 'activity')
dataframe <- as.data.frame(data_matrix)
autovar(dataframe, selected_column_names = c('rumination', 'happiness'),
        significance_levels = c(0.05, 0.01, 0.005),
        test_names = c('portmanteau',
                       'portmanteau_squared',
                       'skewness'),
        criterion = 'AIC',
        imputation_iterations = 30,
        measurements_per_day = 1)
## End(Not run)