This vignette is the single
technical reference for missingmed: the architecture,
the cross-package contracts it depends on, the statistical methodology
of every estimator and inference path, and the engineering decisions
(and their rationale) made along the way. Code blocks are illustrative
(eval = FALSE); runnable walkthroughs live in
vignette("missingmed") and
vignette("mbco-mi").
missingmed is a thin orchestration layer: it runs
mediation across the “missing-data middle” and delegates the statistics
outward — fitting to medfit,
inference to RMediation, and
(later) simulation to medsim. It is written in S7 to
match the newer ecosystem packages and the shared
medfit::MediationData contract.
The pipeline is four verbs over three classes:
set_md_mediation() -> run() -> pool() -> infer()
MDMediationData MDMediationFit MDMediationResult CI / MBCO
(data + spec) (per-imputation (pooled named (RMediation
named Med.Data) Med.Data) or hosted D4)
| S7 class | Carries | S4 ancestor |
|---|---|---|
MDMediationData |
data (mids or data.frame) + mediation spec
+ estimator/mechanism axes |
SemImputedData |
MDMediationFit |
list of per-imputation named
medfit::MediationData (+ IPW weights) |
SemResults |
MDMediationResult |
pooled named medfit::MediationData +
within/between/total vcov |
PooledSEMResults |
Orthogonal axes. The estimator
(method = "mi" | "ipw") and the model (formulas +
engine) are independent. The same
run() -> pool() -> infer() chain serves both
estimators; only run() branches internally.
md <- set_md_mediation(data, formula_y = Y ~ X + M + C, formula_m = M ~ X + C,
treatment = "X", mediator = "M", method = "mi") # or method = "ipw"
res <- pool(run(md))
infer(res, type = "mc")missingmed produces and consumes objects defined elsewhere. These contracts are load-bearing — getting the names right is what makes inference work.
medfit::MediationData (the fitting contract)fit_mediation() / extract_mediation()
return a MediationData whose @estimates is a
named numeric vector and @vcov a
named matrix:
estimates: m_(Intercept), m_X, m_C, y_(Intercept), y_X, y_M, y_C, a, b, c_prime
The convenience aliases a (= m_X),
b (= y_M), c_prime (=
y_X) and the matching @vcov dimnames are the
interface every downstream consumer resolves by
name.
RMediation (the inference contract)Namespace note: the package is
RMediation(capital R-M), notrmediation— the source repo isdata-wise/rmediationbutDESCRIPTION: Package: RMediation. Code usesRMediation::.
ci_mediation_data(mu, level, type, n.mc) — Monte-Carlo
/ distribution-of-the- product CI from a single named
MediationData. Resolves a/b
strictly by name (errors if unnamed). This is the
type = "mc" path.mbco(h0, h1, ...) — likelihood-ratio MBCO, currently
OpenMx-only; it has no per-imputation / MI entry point
(see §4).missingmed -> medfit (fit_mediation, MediationData)
missingmed -> RMediation (ci_mediation_data, medci)
RMediation -> medfit (consumes MediationData)
medsim (Suggests, later phases)
missingmed imports both medfit and RMediation; neither depends back on missingmed, so the graph is acyclic.
Under multiple imputation, run() fits
every imputed dataset with
medfit::fit_mediation(), yielding a list of m
named MediationData. pool() applies Rubin’s
(1987) rules on the named vectors so the path labels
survive pooling:
\[ \bar Q = \tfrac1m\sum_i Q_i,\quad \bar U = \tfrac1m\sum_i U_i,\quad B = \tfrac{1}{m-1}\sum_i (Q_i-\bar Q)(Q_i-\bar Q)^\top,\quad T = \bar U + \left(1+\tfrac1m\right)B . \]
The pooled MediationData is built by
copy-modifying a per-imputation one (S7 is
copy-on-write), replacing only @estimates =\(\bar Q\) and @vcov =\(T\) — so all medfit metadata (roles,
predictors, families) is inherited and the aliases stay name-addressable
for ci_mediation_data().
fit <- run(md_mi) # m named MediationData
res <- pool(fit) # pooled named MediationData + within/between/total vcov
res@tidy_table # term, estimate, std_error, var_w, var_b, var_totMBCO tests \(H_0: a b = 0\). Since \(a b = 0 \iff a = 0 \lor b = 0\), the constrained log-likelihood is the branch union:
\[ T = 2\big[\ell_{\text{full}} - \max(\ell_{a=0},\ \ell_{b=0})\big]. \]
That max() is non-linear, so the MBCO statistic of the
pooled estimate is not the pool of
per-imputation MBCO statistics. You cannot pool first and test second —
you need the per-imputation fits.
Hence MDMediationFit retains the per-imputation list
(exposed by per_imputation_list()), and
infer(type = "mbco") combines the per-imputation LRT
statistics with the D4 rule (Chan & Meng 2022;
Grund et al. 2021):
\[ d_S = \tfrac{\text{LRT(stacked data)}}{K},\quad r_4 = \max\!\Big(0, \tfrac{K+1}{k(K-1)}(\bar d - d_S)\Big),\quad D_4 = \tfrac{d_S}{k(1+r_4)} \sim F_{k,\nu}. \]
RMediation::mbco() is OpenMx-only with no MI entry
point, so the D4 machinery is hosted in missingmed
(R/mbco_mi.R), ported from the Missing-Effect research
prototype. It reproduces the prototype exactly (max abs
diff \(\approx 5\times10^{-11}\) across
design cells). When RMediation gains an MI entry point,
this code should move upstream (TODO noted in source).
IPW is a robustness complement to MI (manuscript appendix): instead
of imputing, it reweights the observed complete cases
to represent the full sample under MAR. The whole estimator is a single
internal branch, .ipw_run(); pool() and
infer(type = "mc") are unchanged.
md <- set_md_mediation(df, Y ~ X + M + C, M ~ X + C, treatment = "X",
mediator = "M", method = "ipw",
weight_formula = NULL, weight_stabilize = TRUE, weight_trim = 0.99,
se_type = "sandwich")
res <- pool(run(md)) # MDMediationFit(m = 1) -> MDMediationResult (B = 0)
infer(res, type = "mc")Because there is a single weighted fit,
MDMediationFit@m = 1 and pool() reduces to
\(\bar Q = Q_1,\ B = 0,\ T = \bar U\) —
the result is structurally identical to an MI result, so nothing
downstream changes.
Let \(R = 1\) mark a complete case over the model variables.
weight_formula = NULL fits \(\text{logit }P(R=1\mid Z)\) on the
fully observed model variables (the MAR drivers); a
single formula overrides the predictors.Stabilized weights
(weight_stabilize = TRUE, default) use a treatment-only
numerator to reduce variance: \[
w_i = \frac{\hat P(R=1\mid X_i)}{\hat P(R=1\mid Z_i)} .
\] Trimming (weight_trim) caps
weights at an upper quantile of the untrimmed complete-case weights;
1 disables it.
IPW’s weighted-GLM model-based vcov is optimistic (it ignores that
weights were estimated). se_type = "sandwich" (the IPW
default) uses heteroskedasticity-consistent
sandwich::vcovHC instead. This is implemented in
medfit via an injectable vcov_fun threaded through
extract_mediation() (so the named-vcov assembly stays in
one place); missingmed just passes se_type through.
MBCO for IPW is not implemented (a weighted LRT is methodologically distinct);
infer(type = "mbco")on an IPW fit errors with guidance.
These changes were made upstream to satisfy missingmed’s contracts; each is the “small upstream fix” that a new capability turned out to need.
| Decision | Where | Why |
|---|---|---|
fit_mediation(weights=) via do.call (value
inlined), added only when non-NULL |
medfit | glm evaluates weights by NSE in the
formula’s environment; passing it through
... resolves to stats::weights (a function).
do.call inlines the vector; gating on non-NULL keeps the
unweighted path byte-identical. |
se_type -> injectable vcov_fun in
extract_mediation |
medfit | Keeps named-vcov assembly in one place;
sandwich::vcovHC swaps in for stats::vcov
without missingmed re-implementing medfit’s alias expansion. |
Tidiers exported with @exportS3Method broom::tidy |
missingmed | Plain @export registers S3method(tidy, *)
against the wrong generic; the package-qualified form binds to broom’s
generic so tidy()/broom::tidy() dispatch. |
S7::S4_register() per S7 class |
missingmed | The legacy S4 generics (print/summary)
require S7 classes to be S4-registered before S7 methods can
attach. |
namespace roclet enabled |
missingmed | Lets roxygen regenerate NAMESPACE for the many new S7 exports
(surfaced + fixed a latent generics/broom
tidy import clash). |
Pooled MediationData via copy-modify |
missingmed | Inherits all medfit metadata; only estimates/vcov/paths change, so names stay intact for RMediation. |
fit_mediation) rather than re-implement GLM mediation —
keeps missingmed thin and the MediationData naming
canonical.run() branch + passthrough
pool(); weights support both joint and sequential
models, stabilized and trimmed; sandwich SE by default..Deprecated() shims
for one cycle.