This vignette is the single technical reference for missingmed: the architecture, the cross-package contracts it depends on, the statistical methodology of every estimator and inference path, and the engineering decisions (and their rationale) made along the way. Code blocks are illustrative (eval = FALSE); runnable walkthroughs live in vignette("missingmed") and vignette("mbco-mi").

1. Architecture (S7, four verbs, three classes)

missingmed is a thin orchestration layer: it runs mediation across the “missing-data middle” and delegates the statistics outward — fitting to medfit, inference to RMediation, and (later) simulation to medsim. It is written in S7 to match the newer ecosystem packages and the shared medfit::MediationData contract.

The pipeline is four verbs over three classes:

set_md_mediation()  ->  run()          ->  pool()             ->  infer()
   MDMediationData       MDMediationFit     MDMediationResult      CI / MBCO
   (data + spec)         (per-imputation    (pooled named          (RMediation
                          named Med.Data)    Med.Data)              or hosted D4)

S7 class	Carries	S4 ancestor
`MDMediationData`	data (`mids` or `data.frame`) + mediation spec + estimator/mechanism axes	`SemImputedData`
`MDMediationFit`	list of per-imputation named `medfit::MediationData` (+ IPW `weights`)	`SemResults`
`MDMediationResult`	pooled named `medfit::MediationData` + within/between/total vcov	`PooledSEMResults`

Orthogonal axes. The estimator (method = "mi" | "ipw") and the model (formulas + engine) are independent. The same run() -> pool() -> infer() chain serves both estimators; only run() branches internally.

md <- set_md_mediation(data, formula_y = Y ~ X + M + C, formula_m = M ~ X + C,
  treatment = "X", mediator = "M", method = "mi")   # or method = "ipw"
res <- pool(run(md))
infer(res, type = "mc")

2. Ecosystem contracts

missingmed produces and consumes objects defined elsewhere. These contracts are load-bearing — getting the names right is what makes inference work.

2.1 `medfit::MediationData` (the fitting contract)

fit_mediation() / extract_mediation() return a MediationData whose @estimates is a named numeric vector and @vcov a named matrix:

estimates: m_(Intercept), m_X, m_C, y_(Intercept), y_X, y_M, y_C, a, b, c_prime

The convenience aliases a (= m_X), b (= y_M), c_prime (= y_X) and the matching @vcov dimnames are the interface every downstream consumer resolves by name.

2.2 `RMediation` (the inference contract)

Namespace note: the package is RMediation (capital R-M), not rmediation — the source repo is data-wise/rmediation but DESCRIPTION: Package: RMediation. Code uses RMediation::.

ci_mediation_data(mu, level, type, n.mc) — Monte-Carlo / distribution-of-the- product CI from a single named MediationData. Resolves a/b strictly by name (errors if unnamed). This is the type = "mc" path.
mbco(h0, h1, ...) — likelihood-ratio MBCO, currently OpenMx-only; it has no per-imputation / MI entry point (see §4).

2.3 Dependency direction (no cycle)

missingmed  ->  medfit       (fit_mediation, MediationData)
missingmed  ->  RMediation   (ci_mediation_data, medci)
RMediation  ->  medfit       (consumes MediationData)
medsim          (Suggests, later phases)

missingmed imports both medfit and RMediation; neither depends back on missingmed, so the graph is acyclic.

3. The MI estimator and Rubin’s rules

Under multiple imputation, run() fits every imputed dataset with medfit::fit_mediation(), yielding a list of m named MediationData. pool() applies Rubin’s (1987) rules on the named vectors so the path labels survive pooling:

\[ \bar Q = \tfrac1m\sum_i Q_i,\quad \bar U = \tfrac1m\sum_i U_i,\quad B = \tfrac{1}{m-1}\sum_i (Q_i-\bar Q)(Q_i-\bar Q)^\top,\quad T = \bar U + \left(1+\tfrac1m\right)B . \]

The pooled MediationData is built by copy-modifying a per-imputation one (S7 is copy-on-write), replacing only @estimates =\(\bar Q\) and @vcov =\(T\) — so all medfit metadata (roles, predictors, families) is inherited and the aliases stay name-addressable for ci_mediation_data().

fit <- run(md_mi)            # m named MediationData
res <- pool(fit)             # pooled named MediationData + within/between/total vcov
res@tidy_table               # term, estimate, std_error, var_w, var_b, var_tot

4. MBCO under MI: D4-stacking (and why it is hosted here)

4.1 Why MBCO does not commute with Rubin’s rules

MBCO tests \(H_0: a b = 0\). Since \(a b = 0 \iff a = 0 \lor b = 0\), the constrained log-likelihood is the branch union:

\[ T = 2\big[\ell_{\text{full}} - \max(\ell_{a=0},\ \ell_{b=0})\big]. \]

That max() is non-linear, so the MBCO statistic of the pooled estimate is not the pool of per-imputation MBCO statistics. You cannot pool first and test second — you need the per-imputation fits.

4.2 D4 combination

Hence MDMediationFit retains the per-imputation list (exposed by per_imputation_list()), and infer(type = "mbco") combines the per-imputation LRT statistics with the D4 rule (Chan & Meng 2022; Grund et al. 2021):

\[ d_S = \tfrac{\text{LRT(stacked data)}}{K},\quad r_4 = \max\!\Big(0, \tfrac{K+1}{k(K-1)}(\bar d - d_S)\Big),\quad D_4 = \tfrac{d_S}{k(1+r_4)} \sim F_{k,\nu}. \]

4.3 Hosting decision

RMediation::mbco() is OpenMx-only with no MI entry point, so the D4 machinery is hosted in missingmed (R/mbco_mi.R), ported from the Missing-Effect research prototype. It reproduces the prototype exactly (max abs diff \(\approx 5\times10^{-11}\) across design cells). When RMediation gains an MI entry point, this code should move upstream (TODO noted in source).

5. The IPW estimator

IPW is a robustness complement to MI (manuscript appendix): instead of imputing, it reweights the observed complete cases to represent the full sample under MAR. The whole estimator is a single internal branch, .ipw_run(); pool() and infer(type = "mc") are unchanged.

5.1 Pipeline

md <- set_md_mediation(df, Y ~ X + M + C, M ~ X + C, treatment = "X",
  mediator = "M", method = "ipw",
  weight_formula = NULL, weight_stabilize = TRUE, weight_trim = 0.99,
  se_type = "sandwich")
res <- pool(run(md))         # MDMediationFit(m = 1) -> MDMediationResult (B = 0)
infer(res, type = "mc")

Because there is a single weighted fit, MDMediationFit@m = 1 and pool() reduces to \(\bar Q = Q_1,\ B = 0,\ T = \bar U\) — the result is structurally identical to an MI result, so nothing downstream changes.

5.2 Weight estimation

Let \(R = 1\) mark a complete case over the model variables.

Joint complete-case model (default). weight_formula = NULL fits \(\text{logit }P(R=1\mid Z)\) on the fully observed model variables (the MAR drivers); a single formula overrides the predictors.
Per-variable / sequential factorization. A named list of formulas fits one model per incomplete variable and multiplies: \(P(\text{complete}) = \prod_V P(R_V = 1\mid \cdot)\), following the causal order \(X \to M \to Y\).

Stabilized weights (weight_stabilize = TRUE, default) use a treatment-only numerator to reduce variance: \[ w_i = \frac{\hat P(R=1\mid X_i)}{\hat P(R=1\mid Z_i)} . \] Trimming (weight_trim) caps weights at an upper quantile of the untrimmed complete-case weights; 1 disables it.

5.3 Variance: sandwich vs model

IPW’s weighted-GLM model-based vcov is optimistic (it ignores that weights were estimated). se_type = "sandwich" (the IPW default) uses heteroskedasticity-consistent sandwich::vcovHC instead. This is implemented in medfit via an injectable vcov_fun threaded through extract_mediation() (so the named-vcov assembly stays in one place); missingmed just passes se_type through.

MBCO for IPW is not implemented (a weighted LRT is methodologically distinct); infer(type = "mbco") on an IPW fit errors with guidance.

6. Cross-package engineering decisions

These changes were made upstream to satisfy missingmed’s contracts; each is the “small upstream fix” that a new capability turned out to need.

Decision	Where	Why
`fit_mediation(weights=)` via `do.call` (value inlined), added only when non-NULL	medfit	`glm` evaluates `weights` by NSE in the formula’s environment; passing it through `...` resolves to `stats::weights` (a function). `do.call` inlines the vector; gating on non-NULL keeps the unweighted path byte-identical.
`se_type` -> injectable `vcov_fun` in `extract_mediation`	medfit	Keeps named-vcov assembly in one place; `sandwich::vcovHC` swaps in for `stats::vcov` without missingmed re-implementing medfit’s alias expansion.
Tidiers exported with `@exportS3Method broom::tidy`	missingmed	Plain `@export` registers `S3method(tidy, *)` against the wrong generic; the package-qualified form binds to broom’s generic so `tidy()`/`broom::tidy()` dispatch.
`S7::S4_register()` per S7 class	missingmed	The legacy S4 generics (`print`/`summary`) require S7 classes to be S4-registered before S7 methods can attach.
`namespace` roclet enabled	missingmed	Lets roxygen regenerate NAMESPACE for the many new S7 exports (surfaced + fixed a latent `generics`/`broom` `tidy` import clash).
Pooled `MediationData` via copy-modify	missingmed	Inherits all medfit metadata; only estimates/vcov/paths change, so names stay intact for RMediation.

7. Design-choice summary

S7-first, three classes mirroring the S4 ancestors; estimator and model as orthogonal axes.
Delegate fitting to medfit (fit_mediation) rather than re-implement GLM mediation — keeps missingmed thin and the MediationData naming canonical.
Names are the API: pooled estimates/vcov stay named so RMediation resolves paths by label.
MBCO hosted locally until RMediation offers an MI entry point; exact parity with the research prototype is the acceptance bar.
IPW = thin run() branch + passthrough pool(); weights support both joint and sequential models, stabilized and trimmed; sandwich SE by default.
S4 deprecated with .Deprecated() shims for one cycle.

8. References

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
Chan, K. W., & Meng, X.-L. (2022). Multiple improvements of multiple imputation likelihood ratio tests. Statistica Sinica.
Grund, S., Lüdtke, O., & Robitzsch, A. (2021). Pooling methods for likelihood-ratio tests with multiply imputed data. Psychological Methods.
Seaman, S. R., & White, I. R. (2013). Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research.

Technical reference: design, contracts, and methodology