Anthropometric datasets are rarely complete. Survey participants refuse to disclose weight. Measurements are skipped because of time constraints. Equipment failures leave records partially populated. Clinical datasets have selective missing patterns — patients who visit less frequently have fewer recorded measurements.
Before using an anthropometric dataset to train a prediction model or to analyze population trends, you need to handle these missing values. The choice of imputation method affects downstream model quality more than most practitioners expect.
## Why imputation matters for anthropometric data
The naive approach — deleting rows with any missing value — is rarely appropriate for anthropometric data because:
- **Missing is not random.** In anthropometric surveys, missing weight values are disproportionately common among participants who are either very heavy or underweight. Deleting these rows creates a biased dataset that underrepresents the population extremes where accurate predictions matter most.
- **Body dimensions are highly correlated.** A dataset with 93 dimensions (like ANSUR II) where each participant has 2–3 missing measurements has almost no complete rows under listwise deletion. With imputation, you retain the full sample.
- **Uncertainty is quantifiable.** Modern imputation methods (particularly MICE) provide multiple imputed datasets that can be analyzed to quantify how imputation uncertainty contributes to final model uncertainty.
## Understanding the missing data patterns
Before choosing an imputation method, characterize the missing data pattern:
```python
import pandas as pd
import numpy as np


def analyze_missing_pattern(df: pd.DataFrame) -> dict:
    """
    Analyze missing data patterns in an anthropometric dataset.
    """
    n_total = len(df)

    # Overall missing rate per column
    missing_pct = (df.isna().sum() / n_total * 100).sort_values(ascending=False)

    # Identify MCAR vs MAR vs MNAR candidates.
    # Test: is missingness in one variable correlated with other observed variables?
    correlations = {}
    for col in df.columns[df.isna().any()]:
        missing_indicator = df[col].isna().astype(int)
        for other_col in ["body_height", "age", "bmi"]:
            if other_col in df.columns and other_col != col:
                non_null = df[other_col].dropna()
                indicator_aligned = missing_indicator[non_null.index]
                corr = non_null.corr(indicator_aligned)
                if abs(corr) > 0.1:
                    correlations[f"{col}~{other_col}"] = round(corr, 3)

    return {
        "total_records": n_total,
        "complete_records": int(df.dropna().shape[0]),
        "pct_complete": round(df.dropna().shape[0] / n_total * 100, 1),
        "missing_by_column": missing_pct[missing_pct > 0].to_dict(),
        "missingness_correlations": correlations,
        "likely_mechanism": "MAR" if correlations else "MCAR (tentative)",
    }
```
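To see the diagnostic in action, here is a self-contained sketch on synthetic data. The column names (`body_height`, `weight_kg`), sample size, and the logistic drop-out mechanism are invented for illustration; taller participants are made more likely to skip the weight measurement, a MAR pattern that the missingness-indicator correlation should pick up:

```python
import numpy as np
import pandas as pd

# Synthetic cohort (hypothetical numbers, for illustration only)
rng = np.random.default_rng(0)
n = 2000
height = rng.normal(1700, 90, n)              # stature in mm
weight = 0.05 * height + rng.normal(0, 8, n)  # loosely height-dependent, kg

df = pd.DataFrame({"body_height": height, "weight_kg": weight})

# MAR mechanism: probability of skipping the weight measurement
# rises with stature (logistic in height, capped at 40%)
p_missing = 0.4 / (1 + np.exp(-(height - 1700) / 60))
df.loc[rng.random(n) < p_missing, "weight_kg"] = np.nan

# Same diagnostic as analyze_missing_pattern: does the missingness
# indicator correlate with an observed variable?
indicator = df["weight_kg"].isna().astype(int)
corr = df["body_height"].corr(indicator)
print(f"missing rate: {indicator.mean():.1%}, corr(missing, height): {corr:.2f}")
```

A clearly positive correlation here is exactly the signal that rules out MCAR and suggests MAR.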
- **MCAR (Missing Completely At Random):** absence is unrelated to any other variable. Safe for simpler methods.
- **MAR (Missing At Random):** absence depends on observed variables. MICE handles this well.
- **MNAR (Missing Not At Random):** absence depends on the missing value itself (e.g., very heavy people skipping weight measurement). Requires domain knowledge; no purely statistical fix.
## Method 1: MICE (Multiple Imputation by Chained Equations)
MICE imputes each missing variable using a separate predictive model, cycling through variables multiple times until convergence. It:
- Creates multiple complete datasets (typically 5–20), each with different imputed values
- Draws imputed values from the posterior predictive distribution, incorporating uncertainty
- Works well for MAR data in datasets with moderate missing rates (<40%)
- Is interpretable: you can examine which variables are used to impute each target
```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge


def mice_impute_anthropometric(
    df: pd.DataFrame,
    n_imputations: int = 5,
    random_state: int = 42,
) -> list[pd.DataFrame]:
    """
    Perform MICE imputation on an anthropometric dataset.
    Returns a list of n_imputations complete datasets.

    Note: sklearn's IterativeImputer implements MICE-like imputation.
    For full MICE with proper uncertainty, use the 'miceforest' package.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    imputed_datasets = []
    for i in range(n_imputations):
        imputer = IterativeImputer(
            estimator=BayesianRidge(),
            n_nearest_features=10,  # use the 10 most correlated features per target
            max_iter=10,
            random_state=random_state + i,  # different seed per imputation
            sample_posterior=True,  # critical: sample from the posterior for multiple imputation
        )
        imputed_array = imputer.fit_transform(df[numeric_cols])
        imputed_df = df.copy()
        imputed_df[numeric_cols] = imputed_array
        imputed_datasets.append(imputed_df)
    return imputed_datasets
```
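As a quick check of the approach, here is a minimal end-to-end sketch on synthetic data (column names and body-proportion ratios are invented for illustration). Because `sample_posterior=True`, two runs with different seeds should produce different draws for the same missing cells:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(42)
n = 300
height = rng.normal(1700, 80, n)
# Two dimensions loosely derived from stature (illustrative ratios, not real ones)
arm = 0.44 * height + rng.normal(0, 15, n)
leg = 0.47 * height + rng.normal(0, 15, n)
df = pd.DataFrame({"height": height, "arm": arm, "leg": leg})

# Knock out ~10% of arm measurements completely at random
mask = rng.random(n) < 0.10
df.loc[mask, "arm"] = np.nan


def impute_once(seed: int) -> pd.DataFrame:
    imp = IterativeImputer(
        estimator=BayesianRidge(),
        max_iter=10,
        sample_posterior=True,  # stochastic draws -> different results per seed
        random_state=seed,
    )
    out = df.copy()
    out[:] = imp.fit_transform(df)
    return out


d1, d2 = impute_once(0), impute_once(1)
# Different posterior draws for the same missing cells
print(float(np.abs(d1.loc[mask, "arm"] - d2.loc[mask, "arm"]).mean()))
```

The spread between imputations is precisely what Rubin's Rules later converts into a between-imputation variance term.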
```python
def pool_mice_estimates(
    results_per_imputation: list[dict],
    parameter: str,
) -> dict:
    """
    Pool estimates from multiple imputed datasets using Rubin's Rules.

    results_per_imputation: list of dicts, each containing estimates
    from a model fitted on one imputed dataset.
    """
    estimates = [r[parameter] for r in results_per_imputation]
    m = len(estimates)

    # Pooled estimate: simple mean across imputations
    pooled_mean = np.mean(estimates)

    # Within-imputation variance (average of the individual sampling variances)
    variances = [r.get(f"{parameter}_variance", 0) for r in results_per_imputation]
    within_var = np.mean(variances)

    # Between-imputation variance
    between_var = np.var(estimates, ddof=1)

    # Total variance (Rubin's Rules)
    total_var = within_var + (1 + 1 / m) * between_var

    return {
        "pooled_estimate": round(pooled_mean, 4),
        "pooled_se": round(np.sqrt(total_var), 4),
        "within_variance": round(within_var, 6),
        "between_variance": round(between_var, 6),
        "relative_efficiency": (
            round(1 / (1 + between_var / (m * within_var)), 3)
            if within_var > 0
            else None
        ),
    }
```
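To make the pooling arithmetic concrete, here is a toy walkthrough of Rubin's Rules with five hypothetical per-imputation estimates of mean stature (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical per-imputation results: point estimate and its sampling variance
estimates = np.array([1712.4, 1713.1, 1711.8, 1712.9, 1712.2])  # mm
variances = np.array([4.0, 4.2, 3.9, 4.1, 4.0])                 # mm^2

m = len(estimates)
pooled = estimates.mean()               # simple mean across imputations
within = variances.mean()               # average within-imputation variance
between = estimates.var(ddof=1)         # spread caused by imputation itself
total = within + (1 + 1 / m) * between  # Rubin's total variance
se = np.sqrt(total)
print(round(pooled, 2), round(se, 3))
```

Note that the pooled standard error is larger than the naive `sqrt(within)`; the gap is exactly the imputation uncertainty that a single-imputation workflow would silently discard.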
## Method 2: MissForest
MissForest uses Random Forest models to impute missing values. Each missing value is predicted by a forest trained on all non-missing observations. Advantages over MICE:
- Handles non-linear relationships and interactions between dimensions naturally
- Works well for datasets with complex correlation structures (body dimensions are nonlinearly related)
- Doesn’t require specifying a model form per variable
- Handles mixed data types (continuous dimensions + categorical sex/region)
Disadvantages:
- Slower (Random Forest training per imputed variable)
- Single imputation only (doesn’t provide the multiple datasets needed for uncertainty quantification via Rubin’s Rules)
- Less interpretable than MICE’s variable-specific models
```python
# Using the missingpy MissForest implementation
# Install: pip install missingpy
# (the separate 'missforest' PyPI package has a different API)
# Note: missingpy is unmaintained; if the import fails against a recent
# scikit-learn release, pin an older scikit-learn version.
import numpy as np
import pandas as pd
from missingpy import MissForest


def missforest_impute_anthropometric(
    df: pd.DataFrame,
    max_iter: int = 10,
    n_estimators: int = 100,
) -> pd.DataFrame:
    """
    Impute missing values using MissForest.
    Returns a single complete dataset.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    imputer = MissForest(
        max_iter=max_iter,
        n_estimators=n_estimators,
        random_state=42,
    )
    imputed_array = imputer.fit_transform(df[numeric_cols].values)
    result = df.copy()
    result[numeric_cols] = imputed_array
    return result
```
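If pinning an older scikit-learn is not an option, a MissForest-style imputation can be approximated with `IterativeImputer` wrapping a `RandomForestRegressor`. This is a sketch, not the reference algorithm; in particular, it stops after a fixed number of iterations rather than using MissForest's out-of-bag stopping criterion:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def rf_impute(df: pd.DataFrame, max_iter: int = 5) -> pd.DataFrame:
    """MissForest-style imputation: round-robin random-forest models
    predict each incomplete column from the others."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=42),
        max_iter=max_iter,
        random_state=42,
    )
    out = df.copy()
    out[numeric_cols] = imputer.fit_transform(df[numeric_cols])
    return out
```

The per-column forests capture the same non-linear structure as MissForest; expect it to be noticeably slower than the BayesianRidge variant shown earlier.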
```python
def evaluate_imputation_quality(
    original_df: pd.DataFrame,
    imputed_df: pd.DataFrame,
    test_mask: pd.DataFrame,  # boolean mask of where test values were removed
) -> dict:
    """
    Evaluate imputation quality by artificially masking known values
    and measuring how well imputation recovered them.
    """
    masked_true = original_df.values[test_mask.values]
    masked_imputed = imputed_df.values[test_mask.values]
    errors = masked_imputed - masked_true

    return {
        "mae_mm": round(float(np.mean(np.abs(errors))), 2),
        "rmse_mm": round(float(np.sqrt(np.mean(errors**2))), 2),
        "bias_mm": round(float(np.mean(errors)), 2),
        "n_values_evaluated": int(np.sum(test_mask.values)),
    }
```
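Any MAE or RMSE from this harness is only meaningful relative to a baseline, and column-mean imputation is the natural floor. A self-contained sketch (synthetic data, invented column names and proportions) of what that floor looks like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000
height = rng.normal(1700, 80, n)
chest = 0.55 * height + rng.normal(0, 30, n)  # correlated dimension (illustrative)
df = pd.DataFrame({"height": height, "chest": chest})

# Artificially mask ~10% of chest values
test_mask = pd.DataFrame(False, index=df.index, columns=df.columns)
test_mask.loc[rng.random(n) < 0.10, "chest"] = True
masked = df.where(~test_mask)

# Baseline: column-mean imputation ignores the height-chest correlation
mean_imputed = masked.fillna(masked.mean())

errors = mean_imputed.values[test_mask.values] - df.values[test_mask.values]
print("baseline MAE (mm):", round(float(np.mean(np.abs(errors))), 2))
```

A model-based imputer should beat this baseline comfortably; if it does not, the correlation structure it is supposed to exploit is probably not there.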
## Choosing between MICE and MissForest
| Factor | MICE | MissForest |
|---|---|---|
| Missing rate | Works well up to ~40% | Works up to ~60% |
| Linearity assumption | Assumes linear (per variable model) | No linearity assumption |
| Multiple imputation | Yes (uncertainty quantification) | No (single dataset) |
| Mixed data types | Requires careful setup per type | Handles natively |
| Speed | Faster for small datasets | Slower due to RF training |
| Anthropometric data fit | Good if dimensions are roughly linear with predictors | Better for complex correlation structures |
For anthropometric datasets specifically:
- **MICE** is preferred when you need uncertainty propagation — for example, when you’re fitting a prediction model on imputed data and want to correctly quantify uncertainty using Rubin’s Rules pooling.
- **MissForest** is preferred when you need a single high-quality complete dataset for analysis (population percentile computation, size chart development) and don’t need to propagate imputation uncertainty.
For a dataset like ANSUR II with 93 dimensions and typical missing rates of 2–10% per variable, MissForest typically produces lower imputation error because it captures the non-linear body dimension correlations more effectively.
## Validating imputation quality
Always validate imputation quality empirically by artificially masking known values:
```python
def cross_validate_imputation(
    df: pd.DataFrame,
    method: str = "missforest",
    mask_pct: float = 0.10,
    n_trials: int = 5,
) -> dict:
    """
    Cross-validate imputation quality by masking known values
    and measuring recovery error.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    all_errors = []

    for trial in range(n_trials):
        # Create artificial missing data -- only mask cells that are
        # actually observed, so a true value exists for scoring
        rng = np.random.default_rng(trial)
        mask = rng.random(df[numeric_cols].shape) < mask_pct
        mask &= df[numeric_cols].notna().values
        masked_df = df.copy()
        masked_df[numeric_cols] = masked_df[numeric_cols].where(~mask)

        # Impute
        if method == "missforest":
            imputed_df = missforest_impute_anthropometric(masked_df)
        else:
            imputed_df = mice_impute_anthropometric(masked_df, n_imputations=1)[0]

        # Evaluate on the masked positions
        true_vals = df[numeric_cols].values[mask]
        imputed_vals = imputed_df[numeric_cols].values[mask]
        all_errors.extend((imputed_vals - true_vals).tolist())

    all_errors = np.array(all_errors)
    return {
        "method": method,
        "mask_pct": mask_pct,
        "mae_mm": round(float(np.mean(np.abs(all_errors))), 2),
        "rmse_mm": round(float(np.sqrt(np.mean(all_errors**2))), 2),
        "bias_mm": round(float(np.mean(all_errors)), 2),
        "n_trials": n_trials,
    }
```
Typical imputation accuracy for well-conditioned anthropometric datasets: MICE achieves MAE of 15–25mm on circumference dimensions; MissForest achieves 10–18mm. The difference grows larger when missing rates are high or correlation structures are complex.
Imputation is not a magic fix for missing data — it recovers information that’s estimated rather than measured, and that uncertainty should flow through your downstream analysis. MICE lets you quantify this properly via Rubin’s Rules. MissForest gives you better point estimates but no built-in uncertainty propagation. For anthropometric model development, the choice depends on whether you’re building a final product (use MissForest) or conducting inference where uncertainty bounds matter (use MICE).