Anthropometric datasets are rarely complete. Survey participants refuse to disclose weight. Measurements are skipped because of time constraints. Equipment failures leave records partially populated. Clinical datasets have selective missing patterns — patients who visit less frequently have fewer recorded measurements.
Before using an anthropometric dataset to train a prediction model or to analyze population trends, you need to handle these missing values. The choice of imputation method affects downstream model quality more than most practitioners expect.
## Why imputation matters for anthropometric data
The naive approach — deleting rows with any missing value — is rarely appropriate for anthropometric data because:
- **Missing is not random.** In anthropometric surveys, missing weight values are disproportionately common among participants who are either very heavy or underweight. Deleting these rows creates a biased dataset that underrepresents the population extremes where accurate predictions matter most.
- **Body dimensions are highly correlated.** A dataset with 93 dimensions (like ANSUR II) where each participant has 2–3 missing measurements has almost no complete rows under listwise deletion. With imputation, you retain the full sample.
- **Uncertainty is quantifiable.** Modern imputation methods (particularly MICE) provide multiple imputed datasets that can be analyzed to quantify how imputation uncertainty contributes to final model uncertainty.
## Understanding the missing data patterns
Before choosing an imputation method, characterize the missing data pattern:
```python
import pandas as pd
import numpy as np


def analyze_missing_pattern(df: pd.DataFrame) -> dict:
    """
    Analyze missing data patterns in an anthropometric dataset.
    """
    n_total = len(df)

    # Overall missing rate per column
    missing_pct = (df.isna().sum() / n_total * 100).sort_values(ascending=False)

    # Identify MCAR vs MAR vs MNAR candidates.
    # Test: is missingness in one variable correlated with other observed variables?
    correlations = {}
    for col in df.columns[df.isna().any()]:
        missing_indicator = df[col].isna().astype(int)
        for other_col in ["body_height", "age", "bmi"]:
            if other_col in df.columns and other_col != col:
                non_null = df[other_col].dropna()
                indicator_aligned = missing_indicator[non_null.index]
                corr = non_null.corr(indicator_aligned)
                if abs(corr) > 0.1:
                    correlations[f"{col}~{other_col}"] = round(corr, 3)

    return {
        "total_records": n_total,
        "complete_records": int(df.dropna().shape[0]),
        "pct_complete": round(df.dropna().shape[0] / n_total * 100, 1),
        "missing_by_column": missing_pct[missing_pct > 0].to_dict(),
        "missingness_correlations": correlations,
        "likely_mechanism": "MAR" if correlations else "MCAR (tentative)",
    }
```
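To see the diagnostic in action, here is a self-contained sketch on synthetic data. The column names (`body_height`, `weight_kg`), sample size, and the logistic drop-out mechanism are invented for illustration; taller participants are made more likely to skip the weight measurement, a MAR pattern that the missingness-indicator correlation should pick up:

```python
import numpy as np
import pandas as pd

# Synthetic cohort (hypothetical numbers, for illustration only)
rng = np.random.default_rng(0)
n = 2000
height = rng.normal(1700, 90, n)              # stature in mm
weight = 0.05 * height + rng.normal(0, 8, n)  # loosely height-dependent, kg

df = pd.DataFrame({"body_height": height, "weight_kg": weight})

# MAR mechanism: probability of skipping the weight measurement
# rises with stature (logistic in height, capped at 40%)
p_missing = 0.4 / (1 + np.exp(-(height - 1700) / 60))
df.loc[rng.random(n) < p_missing, "weight_kg"] = np.nan

# Same diagnostic as analyze_missing_pattern: does the missingness
# indicator correlate with an observed variable?
indicator = df["weight_kg"].isna().astype(int)
corr = df["body_height"].corr(indicator)
print(f"missing rate: {indicator.mean():.1%}, corr(missing, height): {corr:.2f}")
```

A clearly positive correlation here is exactly the signal that rules out MCAR and suggests MAR.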
- **MCAR (Missing Completely At Random):** absence is unrelated to any other variable. Safe for simpler methods.
- **MAR (Missing At Random):** absence depends on observed variables. MICE handles this well.
- **MNAR (Missing Not At Random):** absence depends on the missing value itself (e.g., very heavy people skipping weight measurement). Requires domain knowledge; no purely statistical fix.
## Method 1: MICE (Multiple Imputation by Chained Equations)
MICE imputes each missing variable using a separate predictive model, cycling through variables multiple times until convergence. It:
- Creates multiple complete datasets (typically 5–20), each with different imputed values
- Draws imputed values from the posterior predictive distribution, incorporating uncertainty
- Works well for MAR data in datasets with moderate missing rates (<40%)
- Is interpretable: you can examine which variables are used to impute each target
```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge


def mice_impute_anthropometric(
    df: pd.DataFrame,
    n_imputations: int = 5,
    random_state: int = 42,
) -> list[pd.DataFrame]:
    """
    Perform MICE imputation on an anthropometric dataset.
    Returns a list of n_imputations complete datasets.

    Note: sklearn's IterativeImputer implements MICE-like imputation.
    For full MICE with proper uncertainty, use the 'miceforest' package.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    imputed_datasets = []
    for i in range(n_imputations):
        imputer = IterativeImputer(
            estimator=BayesianRidge(),
            n_nearest_features=10,  # use the 10 most correlated features per target
            max_iter=10,
            random_state=random_state + i,  # different seed per imputation
            sample_posterior=True,  # critical: sample from the posterior for multiple imputation
        )
        imputed_array = imputer.fit_transform(df[numeric_cols])
        imputed_df = df.copy()
        imputed_df[numeric_cols] = imputed_array
        imputed_datasets.append(imputed_df)
    return imputed_datasets
```
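As a quick check of the approach, here is a minimal end-to-end sketch on synthetic data (column names and body-proportion ratios are invented for illustration). Because `sample_posterior=True`, two runs with different seeds should produce different draws for the same missing cells:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(42)
n = 300
height = rng.normal(1700, 80, n)
# Two dimensions loosely derived from stature (illustrative ratios, not real ones)
arm = 0.44 * height + rng.normal(0, 15, n)
leg = 0.47 * height + rng.normal(0, 15, n)
df = pd.DataFrame({"height": height, "arm": arm, "leg": leg})

# Knock out ~10% of arm measurements completely at random
mask = rng.random(n) < 0.10
df.loc[mask, "arm"] = np.nan


def impute_once(seed: int) -> pd.DataFrame:
    imp = IterativeImputer(
        estimator=BayesianRidge(),
        max_iter=10,
        sample_posterior=True,  # stochastic draws -> different results per seed
        random_state=seed,
    )
    out = df.copy()
    out[:] = imp.fit_transform(df)
    return out


d1, d2 = impute_once(0), impute_once(1)
# Different posterior draws for the same missing cells
print(float(np.abs(d1.loc[mask, "arm"] - d2.loc[mask, "arm"]).mean()))
```

The spread between imputations is precisely what Rubin's Rules later converts into a between-imputation variance term.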
```python
def pool_mice_estimates(
    results_per_imputation: list[dict],
    parameter: str,
) -> dict:
    """
    Pool estimates from multiple imputed datasets using Rubin's Rules.

    results_per_imputation: list of dicts, each containing estimates
    from a model fitted on one imputed dataset.
    """
    estimates = [r[parameter] for r in results_per_imputation]
    m = len(estimates)

    # Pooled estimate: simple mean across imputations
    pooled_mean = np.mean(estimates)

    # Within-imputation variance (average of the individual sampling variances)
    variances = [r.get(f"{parameter}_variance", 0) for r in results_per_imputation]
    within_var = np.mean(variances)

    # Between-imputation variance
    between_var = np.var(estimates, ddof=1)

    # Total variance (Rubin's Rules)
    total_var = within_var + (1 + 1 / m) * between_var

    return {
        "pooled_estimate": round(pooled_mean, 4),
        "pooled_se": round(np.sqrt(total_var), 4),
        "within_variance": round(within_var, 6),
        "between_variance": round(between_var, 6),
        "relative_efficiency": (
            round(1 / (1 + between_var / (m * within_var)), 3)
            if within_var > 0
            else None
        ),
    }
```
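To make the pooling arithmetic concrete, here is a toy walkthrough of Rubin's Rules with five hypothetical per-imputation estimates of mean stature (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical per-imputation results: point estimate and its sampling variance
estimates = np.array([1712.4, 1713.1, 1711.8, 1712.9, 1712.2])  # mm
variances = np.array([4.0, 4.2, 3.9, 4.1, 4.0])                 # mm^2

m = len(estimates)
pooled = estimates.mean()               # simple mean across imputations
within = variances.mean()               # average within-imputation variance
between = estimates.var(ddof=1)         # spread caused by imputation itself
total = within + (1 + 1 / m) * between  # Rubin's total variance
se = np.sqrt(total)
print(round(pooled, 2), round(se, 3))
```

Note that the pooled standard error is larger than the naive `sqrt(within)`; the gap is exactly the imputation uncertainty that a single-imputation workflow would silently discard.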
## Method 2: MissForest
MissForest uses Random Forest models to impute missing values. Each missing value is predicted by a forest trained on all non-missing observations. Advantages over MICE:
- Handles non-linear relationships and interactions between dimensions naturally
- Works well for datasets with complex correlation structures (body dimensions are nonlinearly related)
- Doesn’t require specifying a model form per variable
- Handles mixed data types (continuous dimensions + categorical sex/region)
Disadvantages:
- Slower (Random Forest training per imputed variable)
- Single imputation only (doesn’t provide the multiple datasets needed for uncertainty quantification via Rubin’s Rules)
- Less interpretable than MICE’s variable-specific models
```python
# Using the missingpy MissForest implementation
# Install: pip install missingpy
# (the separate 'missforest' PyPI package has a different API)
# Note: missingpy is unmaintained; if the import fails against a recent
# scikit-learn release, pin an older scikit-learn version.
import numpy as np
import pandas as pd
from missingpy import MissForest


def missforest_impute_anthropometric(
    df: pd.DataFrame,
    max_iter: int = 10,
    n_estimators: int = 100,
) -> pd.DataFrame:
    """
    Impute missing values using MissForest.
    Returns a single complete dataset.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    imputer = MissForest(
        max_iter=max_iter,
        n_estimators=n_estimators,
        random_state=42,
    )
    imputed_array = imputer.fit_transform(df[numeric_cols].values)
    result = df.copy()
    result[numeric_cols] = imputed_array
    return result
```
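If pinning an older scikit-learn is not an option, a MissForest-style imputation can be approximated with `IterativeImputer` wrapping a `RandomForestRegressor`. This is a sketch, not the reference algorithm; in particular, it stops after a fixed number of iterations rather than using MissForest's out-of-bag stopping criterion:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def rf_impute(df: pd.DataFrame, max_iter: int = 5) -> pd.DataFrame:
    """MissForest-style imputation: round-robin random-forest models
    predict each incomplete column from the others."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=42),
        max_iter=max_iter,
        random_state=42,
    )
    out = df.copy()
    out[numeric_cols] = imputer.fit_transform(df[numeric_cols])
    return out
```

The per-column forests capture the same non-linear structure as MissForest; expect it to be noticeably slower than the BayesianRidge variant shown earlier.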
```python
def evaluate_imputation_quality(
    original_df: pd.DataFrame,
    imputed_df: pd.DataFrame,
    test_mask: pd.DataFrame,  # boolean mask of where test values were removed
) -> dict:
    """
    Evaluate imputation quality by artificially masking known values
    and measuring how well imputation recovered them.
    """
    masked_true = original_df.values[test_mask.values]
    masked_imputed = imputed_df.values[test_mask.values]
    errors = masked_imputed - masked_true

    return {
        "mae_mm": round(float(np.mean(np.abs(errors))), 2),
        "rmse_mm": round(float(np.sqrt(np.mean(errors**2))), 2),
        "bias_mm": round(float(np.mean(errors)), 2),
        "n_values_evaluated": int(np.sum(test_mask.values)),
    }
```
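Any MAE or RMSE from this harness is only meaningful relative to a baseline, and column-mean imputation is the natural floor. A self-contained sketch (synthetic data, invented column names and proportions) of what that floor looks like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000
height = rng.normal(1700, 80, n)
chest = 0.55 * height + rng.normal(0, 30, n)  # correlated dimension (illustrative)
df = pd.DataFrame({"height": height, "chest": chest})

# Artificially mask ~10% of chest values
test_mask = pd.DataFrame(False, index=df.index, columns=df.columns)
test_mask.loc[rng.random(n) < 0.10, "chest"] = True
masked = df.where(~test_mask)

# Baseline: column-mean imputation ignores the height-chest correlation
mean_imputed = masked.fillna(masked.mean())

errors = mean_imputed.values[test_mask.values] - df.values[test_mask.values]
print("baseline MAE (mm):", round(float(np.mean(np.abs(errors))), 2))
```

A model-based imputer should beat this baseline comfortably; if it does not, the correlation structure it is supposed to exploit is probably not there.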
## Choosing between MICE and MissForest
| Factor | MICE | MissForest |
|---|---|---|
| Missing rate | Works well up to ~40% | Works up to ~60% |
| Linearity assumption | Assumes linear (per variable model) | No linearity assumption |
| Multiple imputation | Yes (uncertainty quantification) | No (single dataset) |
| Mixed data types | Requires careful setup per type | Handles natively |
| Speed | Faster for small datasets | Slower due to RF training |
| Anthropometric data fit | Good if dimensions are roughly linear with predictors | Better for complex correlation structures |
For anthropometric datasets specifically:
- **MICE** is preferred when you need uncertainty propagation — for example, when you’re fitting a prediction model on imputed data and want to correctly quantify uncertainty using Rubin’s Rules pooling.
- **MissForest** is preferred when you need a single high-quality complete dataset for analysis (population percentile computation, size chart development) and don’t need to propagate imputation uncertainty.
For a dataset like ANSUR II with 93 dimensions and typical missing rates of 2–10% per variable, MissForest typically produces lower imputation error because it captures the non-linear body dimension correlations more effectively.
## Validating imputation quality
Always validate imputation quality empirically by artificially masking known values:
```python
def cross_validate_imputation(
    df: pd.DataFrame,
    method: str = "missforest",
    mask_pct: float = 0.10,
    n_trials: int = 5,
) -> dict:
    """
    Cross-validate imputation quality by masking known values
    and measuring recovery error.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    all_errors = []

    for trial in range(n_trials):
        # Create artificial missing data -- only mask cells that are
        # actually observed, so a true value exists for scoring
        rng = np.random.default_rng(trial)
        mask = rng.random(df[numeric_cols].shape) < mask_pct
        mask &= df[numeric_cols].notna().values
        masked_df = df.copy()
        masked_df[numeric_cols] = masked_df[numeric_cols].where(~mask)

        # Impute
        if method == "missforest":
            imputed_df = missforest_impute_anthropometric(masked_df)
        else:
            imputed_df = mice_impute_anthropometric(masked_df, n_imputations=1)[0]

        # Evaluate on the masked positions
        true_vals = df[numeric_cols].values[mask]
        imputed_vals = imputed_df[numeric_cols].values[mask]
        all_errors.extend((imputed_vals - true_vals).tolist())

    all_errors = np.array(all_errors)
    return {
        "method": method,
        "mask_pct": mask_pct,
        "mae_mm": round(float(np.mean(np.abs(all_errors))), 2),
        "rmse_mm": round(float(np.sqrt(np.mean(all_errors**2))), 2),
        "bias_mm": round(float(np.mean(all_errors)), 2),
        "n_trials": n_trials,
    }
```
Typical imputation accuracy for well-conditioned anthropometric datasets: MICE achieves MAE of 15–25mm on circumference dimensions; MissForest achieves 10–18mm. The difference grows larger when missing rates are high or correlation structures are complex.
Imputation is not a magic fix for missing data — it recovers information that’s estimated rather than measured, and that uncertainty should flow through your downstream analysis. MICE lets you quantify this properly via Rubin’s Rules. MissForest gives you better point estimates but no built-in uncertainty propagation. For anthropometric model development, the choice depends on whether you’re building a final product (use MissForest) or conducting inference where uncertainty bounds matter (use MICE).