The survlab package provides tools for imputing
non-detect values in environmental laboratory data using survival models
(including Tobit models). This is particularly useful for environmental
engineers, consultants, and laboratory professionals working with
analytical data where measurements fall below detection limits or limits
of quantification (LOQ).
Non-detect values are common in environmental monitoring programs, contamination assessments, and regulatory compliance testing. Traditional approaches like substitution with half the detection limit can introduce bias and affect statistical analyses.
The package automatically: - Selects the best-fitting distribution from multiple options - Generates realistic imputed values below their respective detection limits - Ensures all imputed values are unique and properly constrained - Provides validation and diagnostic tools specifically designed for laboratory data
The package includes a synthetic environmental laboratory dataset with non-detect values:
# Load example data
data(multi_censored_data)
# Explore the dataset
multi_censored_data[, .(
total_samples = .N,
non_detects = sum(censored == 0),
detects = sum(censored == 1),
min_value = min(value),
max_value = max(value)
)]
#> total_samples non_detects detects min_value max_value
#> <int> <int> <int> <num> <num>
#> 1: 200 33 167 5 61.62913The main function impute_nondetect() automatically
validates data quality, selects the best distribution, and generates
imputed values:
Use the validation function to check imputation quality:
# Look at the first 10 non-detect observations
result[censored == 0, .(
original_detection_limit = value,
imputed_value = round(value_imputed, 4),
final_value = round(value_final, 4)
)][1:10]
#> original_detection_limit imputed_value final_value
#> <num> <num> <num>
#> 1: 25 8.4872 8.4872
#> 2: 15 10.3007 10.3007
#> 3: 5 4.9371 4.9371
#> 4: 5 4.7981 4.7981
#> 5: 5 4.4084 4.4084
#> 6: 5 4.2841 4.2841
#> 7: 25 16.2662 16.2662
#> 8: 5 4.6366 4.6366
#> 9: 8 6.3738 6.3738
#> 10: 5 4.1271 4.1271You can create plots to visualize the imputation results:
# Prepare data for plotting
plot_data <- rbind(
result[censored == 1, .(value = value, type = "Detected")],
result[censored == 0, .(value = value_imputed, type = "Imputed")]
)
# Create histogram
ggplot(plot_data, aes(x = value, fill = type)) +
geom_histogram(alpha = 0.7, bins = 30, position = "identity") +
geom_vline(xintercept = attr(result, "max_detection_limit"),
linetype = "dashed", color = "red", linewidth = 1) +
labs(title = "Distribution Comparison: Detected vs Imputed Values",
subtitle = paste("Red line shows maximum detection limit =",
round(attr(result, "max_detection_limit"), 3)),
x = "Value", y = "Count", fill = "Type") +
theme_minimal() +
scale_fill_manual(values = c("Detected" = "blue", "Imputed" = "orange"))# Q-Q plot to check distribution fit
ggplot(result[censored == 0], aes(sample = value_imputed)) +
stat_qq() +
stat_qq_line() +
labs(title = "Q-Q Plot of Imputed Values",
subtitle = paste("Expected distribution:",
attr(result, "best_distribution"))) +
theme_minimal()You can specify which distributions to test and adjust validation thresholds:
The function returns useful model information as attributes:
# Extract model information
cat("Best distribution:", attr(result, "best_distribution"), "\n")
#> Best distribution: lognormal
cat("Model AIC:", round(attr(result, "aic"), 2), "\n")
#> Model AIC: 1220.47
cat("Parameter:", attr(result, "parameter"), "\n")
#> Parameter: Nitrate
cat("Unit:", attr(result, "unit"), "\n")
#> Unit: mg/l NO3
cat("Sample size:", attr(result, "sample_size"), "\n")
#> Sample size: 200
cat("Censoring percentage:", attr(result, "censored_pct"), "%\n")
#> Censoring percentage: 16.5 %
cat("Detection limits found:", paste(attr(result, "detection_limits"), collapse = ", "), "\n")
#> Detection limits found: 25, 15, 5, 8
cat("Maximum detection limit:", attr(result, "max_detection_limit"), "\n")
#> Maximum detection limit: 25
# Access the fitted model
model <- attr(result, "best_model")
summary(model)
#>
#> Call:
#> survival::survreg(formula = survival::Surv(x, cens, type = "left") ~
#> 1, dist = d)
#> Value Std. Error z p
#> (Intercept) 2.4672 0.0429 57.6 <2e-16
#> Log(scale) -0.5339 0.0556 -9.6 <2e-16
#>
#> Scale= 0.586
#>
#> Log Normal distribution
#> Loglik(model)= -608.2 Loglik(intercept only)= -608.2
#> Number of Newton-Raphson Iterations: 6
#> n= 200The package expects laboratory data with a specific structure:
For non-detect observations, the value in value_col is
treated as the detection limit for that specific analysis, allowing for
different detection limits across samples or analytical methods.
validate_imputation() to check resultsset.seed() for
reproducible results in reportsdata.table for efficient memory usageThe survlab package provides a robust solution for
imputing non-detect values in environmental laboratory data using
survival models. The automatic distribution selection and built-in
validation ensure reliable results for environmental monitoring,
contamination assessment, and regulatory compliance applications.