--- title: "Technical reference: design, contracts, and methodology" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Technical reference: design, contracts, and methodology} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE) ``` This vignette is the single technical reference for **missingmed**: the architecture, the cross-package contracts it depends on, the statistical methodology of every estimator and inference path, and the engineering decisions (and their rationale) made along the way. Code blocks are illustrative (`eval = FALSE`); runnable walkthroughs live in `vignette("missingmed")` and `vignette("mbco-mi")`. --- # 1. Architecture (S7, four verbs, three classes) missingmed is a **thin orchestration layer**: it runs mediation across the "missing-data middle" and delegates the statistics outward — **fitting** to [medfit](https://data-wise.github.io/medfit/), **inference** to [RMediation](https://data-wise.github.io/rmediation/), and (later) simulation to medsim. It is written in **S7** to match the newer ecosystem packages and the shared `medfit::MediationData` contract. The pipeline is four verbs over three classes: ``` set_md_mediation() -> run() -> pool() -> infer() MDMediationData MDMediationFit MDMediationResult CI / MBCO (data + spec) (per-imputation (pooled named (RMediation named Med.Data) Med.Data) or hosted D4) ``` | S7 class | Carries | S4 ancestor | |---|---|---| | `MDMediationData` | data (`mids` or `data.frame`) + mediation spec + estimator/mechanism axes | `SemImputedData` | | `MDMediationFit` | **list** of per-imputation named `medfit::MediationData` (+ IPW `weights`) | `SemResults` | | `MDMediationResult` | **pooled** named `medfit::MediationData` + within/between/total vcov | `PooledSEMResults` | **Orthogonal axes.** The estimator (`method = "mi" | "ipw"`) and the model (formulas + `engine`) are independent. The same `run() -> pool() -> infer()` chain serves both estimators; only `run()` branches internally. ```{r} md <- set_md_mediation(data, formula_y = Y ~ X + M + C, formula_m = M ~ X + C, treatment = "X", mediator = "M", method = "mi") # or method = "ipw" res <- pool(run(md)) infer(res, type = "mc") ``` --- # 2. Ecosystem contracts missingmed produces and consumes objects defined elsewhere. These contracts are load-bearing — getting the **names** right is what makes inference work. ## 2.1 `medfit::MediationData` (the fitting contract) `fit_mediation()` / `extract_mediation()` return a `MediationData` whose `@estimates` is a **named** numeric vector and `@vcov` a **named** matrix: ``` estimates: m_(Intercept), m_X, m_C, y_(Intercept), y_X, y_M, y_C, a, b, c_prime ``` The convenience aliases `a` (= `m_X`), `b` (= `y_M`), `c_prime` (= `y_X`) and the matching `@vcov` dimnames are the interface every downstream consumer resolves **by name**. ## 2.2 `RMediation` (the inference contract) > **Namespace note:** the package is `RMediation` (capital R-M), *not* > `rmediation` — the source repo is `data-wise/rmediation` but > `DESCRIPTION: Package: RMediation`. Code uses `RMediation::`. - `ci_mediation_data(mu, level, type, n.mc)` — Monte-Carlo / distribution-of-the- product CI from a **single named** `MediationData`. Resolves `a`/`b` strictly by name (errors if unnamed). This is the `type = "mc"` path. - `mbco(h0, h1, ...)` — likelihood-ratio MBCO, currently **OpenMx-only**; it has no per-imputation / MI entry point (see §4). ## 2.3 Dependency direction (no cycle) ``` missingmed -> medfit (fit_mediation, MediationData) missingmed -> RMediation (ci_mediation_data, medci) RMediation -> medfit (consumes MediationData) medsim (Suggests, later phases) ``` missingmed imports both medfit and RMediation; neither depends back on missingmed, so the graph is acyclic. --- # 3. The MI estimator and Rubin's rules Under multiple imputation, `run()` fits **every** imputed dataset with `medfit::fit_mediation()`, yielding a list of `m` named `MediationData`. `pool()` applies Rubin's (1987) rules on the **named** vectors so the path labels survive pooling: \[ \bar Q = \tfrac1m\sum_i Q_i,\quad \bar U = \tfrac1m\sum_i U_i,\quad B = \tfrac{1}{m-1}\sum_i (Q_i-\bar Q)(Q_i-\bar Q)^\top,\quad T = \bar U + \left(1+\tfrac1m\right)B . \] The pooled `MediationData` is built by **copy-modifying** a per-imputation one (S7 is copy-on-write), replacing only `@estimates = `\(\bar Q\) and `@vcov = `\(T\) — so all medfit metadata (roles, predictors, families) is inherited and the aliases stay name-addressable for `ci_mediation_data()`. ```{r} fit <- run(md_mi) # m named MediationData res <- pool(fit) # pooled named MediationData + within/between/total vcov res@tidy_table # term, estimate, std_error, var_w, var_b, var_tot ``` --- # 4. MBCO under MI: D4-stacking (and why it is hosted here) ## 4.1 Why MBCO does not commute with Rubin's rules MBCO tests \(H_0: a b = 0\). Since \(a b = 0 \iff a = 0 \lor b = 0\), the constrained log-likelihood is the **branch union**: \[ T = 2\big[\ell_{\text{full}} - \max(\ell_{a=0},\ \ell_{b=0})\big]. \] That `max()` is non-linear, so the MBCO statistic of the *pooled* estimate is **not** the pool of per-imputation MBCO statistics. You cannot pool first and test second — you need the per-imputation fits. ## 4.2 D4 combination Hence `MDMediationFit` retains the per-imputation list (exposed by `per_imputation_list()`), and `infer(type = "mbco")` combines the per-imputation LRT statistics with the **D4** rule (Chan & Meng 2022; Grund et al. 2021): \[ d_S = \tfrac{\text{LRT(stacked data)}}{K},\quad r_4 = \max\!\Big(0, \tfrac{K+1}{k(K-1)}(\bar d - d_S)\Big),\quad D_4 = \tfrac{d_S}{k(1+r_4)} \sim F_{k,\nu}. \] ## 4.3 Hosting decision `RMediation::mbco()` is OpenMx-only with no MI entry point, so the D4 machinery is **hosted in missingmed** (`R/mbco_mi.R`), ported from the Missing-Effect research prototype. It reproduces the prototype **exactly** (max abs diff \(\approx 5\times10^{-11}\) across design cells). When `RMediation` gains an MI entry point, this code should move upstream (`TODO` noted in source). --- # 5. The IPW estimator IPW is a robustness complement to MI (manuscript appendix): instead of imputing, it **reweights the observed complete cases** to represent the full sample under MAR. The whole estimator is a single internal branch, `.ipw_run()`; `pool()` and `infer(type = "mc")` are unchanged. ## 5.1 Pipeline ```{r} md <- set_md_mediation(df, Y ~ X + M + C, M ~ X + C, treatment = "X", mediator = "M", method = "ipw", weight_formula = NULL, weight_stabilize = TRUE, weight_trim = 0.99, se_type = "sandwich") res <- pool(run(md)) # MDMediationFit(m = 1) -> MDMediationResult (B = 0) infer(res, type = "mc") ``` Because there is a single weighted fit, `MDMediationFit@m = 1` and `pool()` reduces to \(\bar Q = Q_1,\ B = 0,\ T = \bar U\) — the result is structurally identical to an MI result, so nothing downstream changes. ## 5.2 Weight estimation Let \(R = 1\) mark a complete case over the model variables. - **Joint complete-case model (default).** `weight_formula = NULL` fits \(\text{logit }P(R=1\mid Z)\) on the **fully observed** model variables (the MAR drivers); a single `formula` overrides the predictors. - **Per-variable / sequential factorization.** A *named list* of formulas fits one model per incomplete variable and multiplies: \(P(\text{complete}) = \prod_V P(R_V = 1\mid \cdot)\), following the causal order \(X \to M \to Y\). **Stabilized weights** (`weight_stabilize = TRUE`, default) use a treatment-only numerator to reduce variance: \[ w_i = \frac{\hat P(R=1\mid X_i)}{\hat P(R=1\mid Z_i)} . \] **Trimming** (`weight_trim`) caps weights at an upper quantile of the untrimmed complete-case weights; `1` disables it. ## 5.3 Variance: sandwich vs model IPW's weighted-GLM model-based vcov is optimistic (it ignores that weights were estimated). `se_type = "sandwich"` (the IPW default) uses heteroskedasticity-consistent `sandwich::vcovHC` instead. This is implemented **in medfit** via an injectable `vcov_fun` threaded through `extract_mediation()` (so the named-vcov assembly stays in one place); missingmed just passes `se_type` through. > MBCO for IPW is **not** implemented (a weighted LRT is methodologically > distinct); `infer(type = "mbco")` on an IPW fit errors with guidance. --- # 6. Cross-package engineering decisions These changes were made upstream to satisfy missingmed's contracts; each is the "small upstream fix" that a new capability turned out to need. | Decision | Where | Why | |---|---|---| | `fit_mediation(weights=)` via `do.call` (value inlined), added only when non-NULL | medfit | `glm` evaluates `weights` by NSE in the **formula's** environment; passing it through `...` resolves to `stats::weights` (a function). `do.call` inlines the vector; gating on non-NULL keeps the unweighted path byte-identical. | | `se_type` -> injectable `vcov_fun` in `extract_mediation` | medfit | Keeps named-vcov assembly in one place; `sandwich::vcovHC` swaps in for `stats::vcov` without missingmed re-implementing medfit's alias expansion. | | Tidiers exported with `@exportS3Method broom::tidy` | missingmed | Plain `@export` registers `S3method(tidy, *)` against the wrong generic; the package-qualified form binds to broom's generic so `tidy()`/`broom::tidy()` dispatch. | | `S7::S4_register()` per S7 class | missingmed | The legacy S4 generics (`print`/`summary`) require S7 classes to be S4-registered before S7 methods can attach. | | `namespace` roclet enabled | missingmed | Lets roxygen regenerate NAMESPACE for the many new S7 exports (surfaced + fixed a latent `generics`/`broom` `tidy` import clash). | | Pooled `MediationData` via copy-modify | missingmed | Inherits all medfit metadata; only estimates/vcov/paths change, so names stay intact for RMediation. | --- # 7. Design-choice summary - **S7-first**, three classes mirroring the S4 ancestors; estimator and model as orthogonal axes. - **Delegate fitting to medfit** (`fit_mediation`) rather than re-implement GLM mediation — keeps missingmed thin and the `MediationData` naming canonical. - **Names are the API**: pooled estimates/vcov stay named so RMediation resolves paths by label. - **MBCO hosted locally** until RMediation offers an MI entry point; exact parity with the research prototype is the acceptance bar. - **IPW = thin `run()` branch + passthrough `pool()`**; weights support both joint and sequential models, stabilized and trimmed; sandwich SE by default. - **S4 deprecated** with `.Deprecated()` shims for one cycle. --- # 8. References - Rubin, D. B. (1987). *Multiple Imputation for Nonresponse in Surveys*. Wiley. - Chan, K. W., & Meng, X.-L. (2022). Multiple improvements of multiple imputation likelihood ratio tests. *Statistica Sinica*. - Grund, S., Lüdtke, O., & Robitzsch, A. (2021). Pooling methods for likelihood-ratio tests with multiply imputed data. *Psychological Methods*. - Seaman, S. R., & White, I. R. (2013). Review of inverse probability weighting for dealing with missing data. *Statistical Methods in Medical Research*.