--- title: "GMWMX: Estimate linear models with dependent errors" output: rmarkdown::html_vignette bibliography: REFERENCES.bib vignette: > %\VignetteIndexEntry{GMWMX: Estimate linear models with dependent errors} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( fig.width = 10, fig.height = 6, fig.align = "center" ) # so as not to execute this vignette during package development # knitr::opts_chunk$set(eval = FALSE) ``` This vignette demonstrates how to use the GMWMX estimator to estimate linear models with dependent errors described by a composite stochastic process. Consider the model defined as: \begin{equation} \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{F}\left\{\mathbf{0}, \boldsymbol{\Sigma}\left(\boldsymbol{\gamma}\right)\right\}, \end{equation} where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is a design matrix of observed predictors, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the regression parameter vector and $\boldsymbol{\varepsilon} = \{\varepsilon_{i}\}_{i=1,\ldots,n}$ is a zero-mean process following an unspecified joint distribution $\mathcal{F}$ with positive-definite covariance function $\boldsymbol{\Sigma}(\boldsymbol{\gamma}) \in \mathbb{R}^{n\times n}$ characterizing the second-order dependence structure of the process and parameterized by the vector $\boldsymbol{\gamma} \in \boldsymbol{\Gamma} \subset \mathbb{R}^q$. The GMWMX estimator first estimates \eqn{\boldsymbol{\beta}} with the least-squares estimator: \begin{equation} \hat{\boldsymbol{\beta}} = \left( \mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \boldsymbol{Y}, \end{equation} It then estimates the parameters of the stochastic model \eqn{\boldsymbol{\gamma}} with a Generalized Method of Wavelet Moments approach [@guerrier2013wavelet] applied to the estimated residuals defined as $\hat{\boldsymbol{\varepsilon}} = {\boldsymbol{Y}} - { \mathbf{X}} \hat{\boldsymbol{\beta}}$. More precisely, the estimated stochastic parameters, denoted as $\hat{\boldsymbol{\gamma}}$, are defined as: \begin{equation} \hat{\boldsymbol{\gamma}} = \underset{\boldsymbol{\gamma} \in \boldsymbol{\Gamma}}{\arg\min} \left\{\hat{\boldsymbol{\nu}} - \boldsymbol{\nu}(\boldsymbol{\gamma})\right\}^T \boldsymbol{\Omega} \left\{\hat{\boldsymbol{\nu}} - \boldsymbol{\nu}(\boldsymbol{\gamma})\right\}, \end{equation} where $\hat{\boldsymbol{\nu}}$ is the vector of empirical wavelet variances computed on the estimated residuals $\hat{\boldsymbol{\varepsilon}}$, $\boldsymbol{\nu}(\boldsymbol{\gamma})$ is the vector of theoretical wavelet variance implied by the stochastic model with parameters $\boldsymbol{\gamma}$, and $\boldsymbol{\Omega}$ is a weighting matrix. The theoretical wavelet variance $\boldsymbol{\nu}(\boldsymbol{\gamma})$ is obtained using the results of @xu2017study and @zhang2008allan that provide a computationally efficient form of the theoretical Allan variance (equivalent to the Haar wavelet variance up to a constant) for zero-mean stochastic processes such as $\boldsymbol{\varepsilon}$. In @xu2017study they generalize the results in @zhang2008allan to zero-mean non-stationary processes by using averages of the diagonals and super-diagonals of the covariance matrix of $\boldsymbol{\varepsilon}$. This implies that GMWMX does not require storage of the $n \times n$ covariance matrix of $\boldsymbol{\varepsilon}$, but only a vector of length $n$ that is plugged into an explicit formula consisting of a linear combination of its elements. The variance-covariance matrix of the estimated regression parameters $\hat{\boldsymbol{\beta}}$ is then obtained as: \begin{equation} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} {\boldsymbol{\Sigma}}(\hat{\boldsymbol{\gamma}}) \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1}. \end{equation} This vignette is a detailed, self-contained simulation walkthrough that showcases `gmwmx2()` for an arbitrary design matrix `X` and response vector `y`. We build the signal, generate noise, fit models, and run Monte Carlo experiments to check coverage of confidence intervals. For each setting, we also plot the empirical distributions of the estimated functional and stochastic parameters. ```{r, message=FALSE,warning=FALSE} library(gmwmx2) library(wv) library(dplyr) boxplot_mean_dot <- function(x, ...) { boxplot(x, ...) x_mat <- as.matrix(x) mean_vals <- colMeans(x_mat, na.rm = TRUE) points(seq_along(mean_vals), mean_vals, pch = 16, col = "black") } ``` # Build an arbitrary design matrix X We use a generic design matrix that mirrors many geodetic or environmental signals: 1. Intercept. 2. Linear trend. 3. Annual sinusoid and cosine. This is arbitrary and can be replaced with any design matrix suitable to your application. ```{r} n = 15*365 X = matrix(NA, nrow = n, ncol = 4) # intercept X[, 1] = 1 # trend X[, 2] = 1:n # annual sinusoid omega_1 <- (1 / 365.25) * 2 * pi X[, 3] <- sin((1:n) * omega_1) X[, 4] <- cos((1:n) * omega_1) beta = c(0.5 , 0.00004, 0.005, 0.0008) # visualize the deterministic signal plot(x = X[, 2], y = X %*% beta, type = "l", main = "Signal", xlab = "Time", ylab = "Signal", las = 1) ``` We now define three different stochastic noise settings using the same `X` and `beta`. # Example 1: White noise + AR(1) ## Generate signal ```{r} phi_ar1 = 0.975 sigma2_ar1 = 5e-05 sigma2_wn = 7e-04 eps = gmwmx2::generate(ar1(phi = phi_ar1, sigma2 = sigma2_ar1) + wn(sigma2_wn), n = n, seed = 123)$series plot(wv::wvar(eps)) ``` ```{r} y = X %*% beta + eps plot(X[, 2], y, type = "l", main = "Simulated y (WN + AR1)") ``` ## Fit model We fit using `gmwmx2()` with a composite model composed of white noise and autoregressive process of order 1. ```{r} fit = gmwmx2(X = X, y = y, model = wn() + ar1() ) print(fit) ``` To assess uncertainty, we run a Monte Carlo simulation and check empirical coverage of the confidence intervals for the trend. ## Monte Carlo simulation ```{r} B = 100 mat_res = matrix(NA, nrow=B, ncol=19) for(b in seq(B)){ eps = generate(ar1(phi=phi_ar1, sigma2=sigma2_ar1) + wn(sigma2_wn), n=n, seed = (123 + b))$series y = X %*% beta + eps fit = gmwmx2(X = X, y = y, model = wn() + ar1() ) # misspecified model assuming white noise as the stochastic model fit2 = lm(y~X[,2] + X[,3] + X[,4]) mat_res[b, ] = c(fit$beta_hat, fit$std_beta_hat, summary(fit2)$coefficients[,1], summary(fit2)$coefficients[,2], fit$theta_domain$`AR(1)_2`, fit$theta_domain$`White Noise_1`) # cat("Iteration ", b, " completed.\n") } ``` ```{r} # compute empirical coverage mat_res_df = as.data.frame(mat_res) colnames(mat_res_df) = c("gmwmx_beta0_hat", "gmwmx_beta1_hat", "gmwmx_beta2_hat", "gmwmx_beta3_hat", "gmwmx_std_beta0_hat", "gmwmx_std_beta1_hat", "gmwmx_std_beta2_hat", "gmwmx_std_beta3_hat", "lm_beta0_hat", "lm_beta1_hat", "lm_beta2_hat", "lm_beta3_hat", "lm_std_beta0_hat", "lm_std_beta1_hat", "lm_std_beta2_hat", "lm_std_beta3_hat", "phi_ar1","sigma_2_ar1" ,"sigma_2_wn") ``` ## Plot empirical distributions of estimated parameters ```{r} par(mfrow = c(1, 4)) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta0_hat")],las=1, names = c("beta0"), main = expression(beta[0]), ylab = "Estimate") abline(h = beta[1], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta1_hat")],las=1, names = c("beta1"), main = expression(beta[1]), ylab = "Estimate") abline(h = beta[2], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta2_hat")],las=1, names = c("beta2"), main = expression(beta[2]), ylab = "Estimate") abline(h = beta[3], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta3_hat")],las=1, names = c("beta3"), main = expression(beta[3]), ylab = "Estimate") abline(h = beta[4], col = "black", lwd = 2) par(mfrow = c(1, 3)) boxplot_mean_dot(mat_res_df$phi_ar1,las=1, names = c("phi_ar1"), main = expression(phi["AR1"]), ylab = "Estimate") abline(h = phi_ar1, col = "black", lwd = 2) boxplot_mean_dot(mat_res_df$sigma_2_ar1,las=1, names = c("phi_ar1"), main = expression(sigma["AR1"]^2), ylab = "Estimate") abline(h = sigma2_ar1, col = "black", lwd = 2) boxplot_mean_dot(mat_res_df$sigma_2_wn,las=1, names = c("phi_ar1"), main = expression(sigma["WN"]^2), ylab = "Estimate") abline(h = sigma2_wn, col = "black", lwd = 2) par(mfrow = c(1, 1)) ``` ## Compute empirical coverage of confidence intervals ```{r} zval = qnorm(0.975) mat_res_df$upper_ci_gmwmx_beta1 = mat_res_df$gmwmx_beta1_hat + zval * mat_res_df$gmwmx_std_beta1_hat mat_res_df$lower_ci_gmwmx_beta1 = mat_res_df$gmwmx_beta1_hat - zval * mat_res_df$gmwmx_std_beta1_hat # empirical coverage of gmwmx beta dplyr::between(rep(beta[2], B), mat_res_df$lower_ci_gmwmx_beta1, mat_res_df$upper_ci_gmwmx_beta1) %>% mean() ``` ```{r} # do the same for lm beta mat_res_df$upper_ci_lm_beta1 = mat_res_df$lm_beta1_hat + zval * mat_res_df$lm_std_beta1_hat mat_res_df$lower_ci_lm_beta1 = mat_res_df$lm_beta1_hat - zval * mat_res_df$lm_std_beta1_hat dplyr::between(rep(beta[2], B), mat_res_df$lower_ci_lm_beta1, mat_res_df$upper_ci_lm_beta1) %>% mean() ``` # Example 2: White noise + stationary power-law ## Generate signal Here we replace AR(1) by a stationary power-law component. ```{r} kappa_pl <- -0.75 sigma2_wn <- 2e-07 sigma2_pl <- 2.25e-06 eps_pl = generate(wn(sigma2_wn) + pl(kappa = kappa_pl, sigma2 = sigma2_pl), n = n, seed = 123) plot(eps_pl$series, type = "l") plot(wv::wvar(eps_pl$series)) y_pl = X %*% beta + eps_pl$series plot(X[, 2], y_pl, type = "l", main = "Simulated y (WN + PL)") ``` ## Fit model ```{r} fit_pl = gmwmx2(X = X, y = y_pl, model = wn() + pl()) fit_pl ``` ## Monte Carlo simulation ```{r} B_pl = 100 mat_res_pl = matrix(NA, nrow = B_pl, ncol = 19) for(b in seq(B_pl)){ eps = generate(wn(sigma2_wn) + pl(kappa = kappa_pl, sigma2 = sigma2_pl), n = n, seed = (123 + b))$series y = X %*% beta + eps fit = gmwmx2(X = X, y = y, model = wn() + pl()) fit2 = lm(y~X[,2] + X[,3] + X[,4]) mat_res_pl[b, ] = c(fit$beta_hat, fit$std_beta_hat, summary(fit2)$coefficients[,1], summary(fit2)$coefficients[,2], fit$theta_domain$`Stationary PowerLaw_2`, fit$theta_domain$`White Noise_1`) # cat("Iteration ", b, " \n") } mat_res_pl_df = as.data.frame(mat_res_pl) colnames(mat_res_pl_df) = c("gmwmx_beta0_hat", "gmwmx_beta1_hat", "gmwmx_beta2_hat", "gmwmx_beta3_hat", "gmwmx_std_beta0_hat", "gmwmx_std_beta1_hat", "gmwmx_std_beta2_hat", "gmwmx_std_beta3_hat", "lm_beta0_hat", "lm_beta1_hat", "lm_beta2_hat", "lm_beta3_hat", "lm_std_beta0_hat", "lm_std_beta1_hat", "lm_std_beta2_hat", "lm_std_beta3_hat", "kappa_pl","sigma2_pl" ,"sigma2_wn") ``` ## Plot empirical distributions of estimated parameters ```{r} par(mfrow = c(1, 4)) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta0_hat")],las=1, names = c("beta0"), main = expression(beta[0]), ylab = "Estimate") abline(h = beta[1], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta1_hat")],las=1, names = c("beta1"), main = expression(beta[1]), ylab = "Estimate") abline(h = beta[2], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta2_hat")],las=1, names = c("beta2"), main = expression(beta[2]), ylab = "Estimate") abline(h = beta[3], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta3_hat")],las=1, names = c("beta3"), main = expression(beta[3]), ylab = "Estimate") abline(h = beta[4], col = "black", lwd = 2) par(mfrow = c(1, 1)) ``` ```{r} par(mfrow = c(1, 3)) boxplot_mean_dot(mat_res_pl_df$kappa_pl,las=1, main = expression(kappa["PL"]), ylab = "Estimate") abline(h = kappa_pl, col = "black", lwd = 2) boxplot_mean_dot(mat_res_pl_df$sigma2_pl,las=1, main = expression(sigma["PL"]^2), ylab = "Estimate") abline(h = sigma2_pl, col = "black", lwd = 2) boxplot_mean_dot(mat_res_pl_df$sigma2_wn,las=1, names = c("phi_ar1"), main = expression(sigma["WN"]^2), ylab = "Estimate") abline(h = sigma2_wn, col = "black", lwd = 2) par(mfrow = c(1, 1)) ``` ## Compute empirical coverage of confidence intervals ```{r} zval = qnorm(0.975) mat_res_pl_df$upper_ci_gmwmx_beta1 = mat_res_pl_df$gmwmx_beta1_hat + zval * mat_res_pl_df$gmwmx_std_beta1_hat mat_res_pl_df$lower_ci_gmwmx_beta1 = mat_res_pl_df$gmwmx_beta1_hat - zval * mat_res_pl_df$gmwmx_std_beta1_hat # empirical coverage of gmwmx beta dplyr::between(rep(beta[2], B_pl), mat_res_pl_df$lower_ci_gmwmx_beta1, mat_res_pl_df$upper_ci_gmwmx_beta1) %>% mean() # do the same for lm beta mat_res_pl_df$upper_ci_lm_beta1 = mat_res_pl_df$lm_beta1_hat + zval * mat_res_pl_df$lm_std_beta1_hat mat_res_pl_df$lower_ci_lm_beta1 = mat_res_pl_df$lm_beta1_hat - zval * mat_res_pl_df$lm_std_beta1_hat dplyr::between(rep(beta[2], B_pl), mat_res_pl_df$lower_ci_lm_beta1, mat_res_pl_df$upper_ci_lm_beta1) %>% mean() ``` # Example 3: White noise + flicker Flicker noise corresponds to a non stationary power-law with spectral index $\kappa = -1$ but is handled explicitly by `flicker()` in this package. ## Generate signal ```{r} # fix stochastic parameters sigma2_wn_fl <- 8e-07 sigma2_fl <- 1e-06 eps_fl = generate(wn(sigma2_wn_fl) + flicker(sigma2 = sigma2_fl), n = n, seed = 123) plot(eps_fl$series, type = "l") plot(wv::wvar(eps_fl$series)) y_fl = X %*% beta + eps_fl$series plot(X[, 2], y_fl, type = "l", main = "Simulated y (WN + Flicker)") ``` ## Fit model ```{r} fit_fl = gmwmx2(X = X, y = y_fl, model = wn() + flicker()) fit_fl ``` ## Monte Carlo simulation ```{r} B_fl = 100 mat_res_fl = matrix(NA, nrow = B_fl, ncol = 18) for(b in seq(B_fl)){ eps = generate(wn(sigma2_wn_fl) + flicker(sigma2 = sigma2_fl), n = n, seed = (123 + b))$series y = X %*% beta + eps fit = gmwmx2(X = X, y = y, model = wn() + flicker()) fit2 = lm(y~X[,2] + X[,3] + X[,4]) mat_res_fl[b, ] = c(fit$beta_hat, fit$std_beta_hat, summary(fit2)$coefficients[,1], summary(fit2)$coefficients[,2], fit$theta_domain$`Flicker_2`, fit$theta_domain$`White Noise_1`) # cat("Iteration ", b, " \n") } mat_res_fl_df = as.data.frame(mat_res_fl) colnames(mat_res_fl_df) = c("gmwmx_beta0_hat", "gmwmx_beta1_hat", "gmwmx_beta2_hat", "gmwmx_beta3_hat", "gmwmx_std_beta0_hat", "gmwmx_std_beta1_hat", "gmwmx_std_beta2_hat", "gmwmx_std_beta3_hat", "lm_beta0_hat", "lm_beta1_hat", "lm_beta2_hat", "lm_beta3_hat", "lm_std_beta0_hat", "lm_std_beta1_hat", "lm_std_beta2_hat", "lm_std_beta3_hat", "sigma2_fl" ,"sigma2_wn") ``` ## Plot empirical distributions of estimated parameters ```{r} par(mfrow = c(1, 4)) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta0_hat")],las=1, names = c("beta0"), main = expression(beta[0]), ylab = "Estimate") abline(h = beta[1], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta1_hat")],las=1, names = c("beta1"), main = expression(beta[1]), ylab = "Estimate") abline(h = beta[2], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta2_hat")],las=1, names = c("beta2"), main = expression(beta[2]), ylab = "Estimate") abline(h = beta[3], col = "black", lwd = 2) boxplot_mean_dot(mat_res_df[, c("gmwmx_beta3_hat")],las=1, names = c("beta3"), main = expression(beta[3]), ylab = "Estimate") abline(h = beta[4], col = "black", lwd = 2) par(mfrow = c(1, 1)) ``` ```{r} par(mfrow = c(1, 2)) boxplot_mean_dot(mat_res_fl_df$sigma2_fl,las=1, main = expression(sigma["FL"]^2), ylab = "Estimate") abline(h = sigma2_fl, col = "black", lwd = 2) boxplot_mean_dot(mat_res_fl_df$sigma2_wn,las=1, names = c("phi_ar1"), main = expression(sigma["WN"]^2), ylab = "Estimate") abline(h = sigma2_wn_fl, col = "black", lwd = 2) par(mfrow = c(1, 1)) ``` ## Compute empirical coverage of confidence intervals ```{r} zval = qnorm(0.975) mat_res_fl_df$upper_ci_gmwmx_beta1 = mat_res_fl_df$gmwmx_beta1_hat + zval * mat_res_fl_df$gmwmx_std_beta1_hat mat_res_fl_df$lower_ci_gmwmx_beta1 = mat_res_fl_df$gmwmx_beta1_hat - zval * mat_res_fl_df$gmwmx_std_beta1_hat # empirical coverage of gmwmx beta dplyr::between(rep(beta[2], B_pl), mat_res_fl_df$lower_ci_gmwmx_beta1, mat_res_fl_df$upper_ci_gmwmx_beta1) %>% mean() # do the same for lm beta mat_res_fl_df$upper_ci_lm_beta1 = mat_res_fl_df$lm_beta1_hat + zval * mat_res_fl_df$lm_std_beta1_hat mat_res_fl_df$lower_ci_lm_beta1 = mat_res_fl_df$lm_beta1_hat - zval * mat_res_fl_df$lm_std_beta1_hat dplyr::between(rep(beta[2], B_pl), mat_res_fl_df$lower_ci_lm_beta1, mat_res_fl_df$upper_ci_lm_beta1) %>% mean() ```