## install packages
<- c("spBayesSurv", "gamlss.cens", "sf", "stars")
pkg for(p in pkg) {
if(!(p %in% installed.packages())) install.packages(p)
}
## load required packages
library("gamlss2")
library("gamlss.cens")
library("sf")
library("stars")
## load the LeukSurv data
data("LeukSurv", package = "spBayesSurv")
Survival Models
In survival analysis, we aim to model the time until an event occurs, which is typically represented as a random variable \(T\). The survival function \(S(t)\) indicates the probability that the event has not occurred by time \(t\), mathematically defined as \[ S(t) = P(T > t) = 1 - F(t), \] where \(F(t)\) is the cumulative distribution function (CDF) of the random variable \(T\).
The hazard function \(\lambda(t)\), which describes the instantaneous risk of the event occurring at time \(t\), is defined as \[ \lambda(t) = \frac{d(t)}{S(t)}, \] where \(d(t)\) is the probability density function (PDF) of \(T\).
1 Survival models in the GAMLSS framework
Within the GAMLSS framework, we can specify a parametric form for the hazard function by selecting an appropriate distribution that fits the survival data, denoted as \[ d(t \mid \boldsymbol{\theta}), \] where \(d( \cdot )\) represents a parametric density suitable for modeling survival times and \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)^\top\) are the parameters that need to be estimated.
To incorporate covariates into this model, we can express the parameters as functions of explanatory variables \(\theta_k(\mathbf{x})\), \(k = 1, \ldots, K\). We utilize GAM-type predictors represented by \[ \eta_{k}(\mathbf{x}) = f_{1k}(\mathbf{x}) + \dots + f_{J_{k}k}(\mathbf{x}) \] which are linked to the parameters through suitable link functions \[ h_{k}(\theta_{k}(\mathbf{x})) = \eta_{k}(\mathbf{x}). \] The functions \(f_{jk}(\cdot)\) for \(j = 1, \ldots, J_{k}\) and \(k = 1, \ldots, K\), can be nonlinear smooth functions (typically estimated using regression splines), linear effects, or random effects, among others. This flexible modeling approach allows for a richer representation of the relationship between covariates and the distribution parameters compared to simple linear effects.
In GAMLSS, model estimation is generally performed using (penalized) maximum likelihood estimation (MLE). The likelihood function for a survival model with right-censored data for \(i = 1, \ldots, n\) observations can be expressed as \[ L(\boldsymbol{\beta}, \boldsymbol{\theta}) = \prod_{i=1}^{n} \left[ d(t_i \mid \boldsymbol{\theta}(\mathbf{x}_i)) \right]^{\delta_i} \left[ S(t_i \mid \boldsymbol{\theta}(\mathbf{x}_i)) \right]^{1 - \delta_i}, \] where \(\delta_i\) is the censoring indicator for the \(i\)-th observation. It takes the value of 1 if the event is observed (i.e., not censored) and 0 if it is censored. This likelihood function effectively accounts for both observed and censored data, facilitating the estimation of model parameters.
2 LeukSurv data example
In this example, we analyze the LeukSurv
data provided in the spBayesSurv
package. This dataset contains survival information on patients diagnosed with leukemia, where the main variables include time
(the survival time), cens
(the censoring indicator, with 1
indicating the event has occurred and 0
indicating right-censoring), age
, sex
, wbc
(white blood cell count), tpi
(tumor proliferative index), and spatial coordinates xcoord
and ycoord
representing the location of the patients. These covariates provide both demographic and clinical information that can be used to model the survival times and the associated risks.
The dataset is particularly useful for illustrating spatial survival modeling, as it allows us to incorporate spatial covariates (coordinates) into the survival model, capturing potential geographical variations in survival probabilities. We will first demonstrate how to estimate simple survival curves using gamlss2
, followed by a full spatial survival model where the geographical coordinates are explicitly modeled to investigate any spatial effects on survival.
We begin by loading the necessary packages and the LeukSurv
dataset. These packages provide functions for survival analysis, censored data handling, and spatial data manipulation, which are essential for the analysis that follows. Specifically, gamlss2
will be used for flexible survival model estimation, while gamlss.cens
manages the censored data structure. Additionally, sf
and stars
will facilitate spatial data processing and visualization.
3 Generating a censored family
To handle censored data in the gamlss2
framework, we first need to generate a censored family based on a parametric distribution. In this case, we use the Weibull distribution (denoted as WEI
in gamlss.dist
). This involves transforming the standard Weibull family into one that can handle censored observations.
We accomplish this by using the gen.cens()
function provided in the gamlss.cens
package, which adapts the distribution to accommodate censoring. Then, we create a censored family using the cens()
function.
gen.cens(WEI)
A censored family of distributions from WEI has been generated
and saved under the names:
dWEIrc pWEIrc qWEIrc WEIrc
The type of censoring is right
<- cens(WEI) fam
4 Estimating time-only models
To begin, we estimate a simple survival model where the only predictor is time. This is done both with gamlss2
for flexible distributional modeling, and with the more traditional survfit()
function from the survival
package for comparison.
In gamlss2
, the model is specified using the Surv(time, cens)
function, where time
is the survival time, and cens
is the censoring indicator. We fit the model assuming a Weibull distribution for survival times.
## using gamlss2
<- gamlss2(Surv(time, cens) ~ 1, family = fam, data = LeukSurv) b1
GAMLSS-RS iteration 1: Global Deviance = 12269.5961 eps = 0.009779
GAMLSS-RS iteration 2: Global Deviance = 12267.698 eps = 0.000154
GAMLSS-RS iteration 3: Global Deviance = 12267.6904 eps = 0.000000
## and now with survfit
<- survfit(Surv(time, cens) ~ 1, data = LeukSurv) s1
5 Predicting survival curves
Once the models are fitted, we can predict survival probabilities based on the estimated parameters. In gamlss2
, the predicted parameters from the model are used to compute survival probabilities. We then compare these with the survival probabilities obtained from the Kaplan-Meier estimate using survfit()
.
In the case of gamlss2
, we extract the predicted parameters and compute the survival probabilities with
## predict parameters from the gamlss2 model
<- predict(b1, newdata = LeukSurv)
par
## compute survival probabilities
<- 1 - family(b1)$p(Surv(LeukSurv$time, rep(1, nrow(LeukSurv))), par) p1
For the survfit()
model, we use the step function to extract the survival probabilities at each time point.
## step function for Kaplan-Meier survival estimates
<- stepfun(s1$time, c(1, s1$surv))
f <- f(LeukSurv$time) p2
6 Plotting the estimated survival curves
Finally, we plot the estimated survival curves from both models to visually compare the results. This will allow us to see how the flexible parametric model from gamlss2
compares with the non-parametric Kaplan-Meier estimate.
matplot(LeukSurv$time, cbind(p1, p2),
type = "l", lty = 1, lwd = 3, col = c(1, 4),
xlab = "Time", ylab = "Survival Prob(t > Time)",
main = "Estimated Survival Curves")
legend("topright", c("gamlss2", "survfit"),
lwd = 3, col = c(1, 4), bty = "n")
In this plot, we observe that the estimated survival curves are generally similar. However, the Kaplan-Meier estimate shows a slightly steeper decline in survival probabilities early on and flattens out more towards the end. This difference likely reflects the nature of the Kaplan-Meier estimator, which is non-parametric and purely empirical, capturing the observed data trends without any underlying assumptions about the distribution of survival times. In contrast, the parametric model from gamlss2
provides a more structured approach, allowing for smoother, distribution-driven survival estimates that incorporate the shape of the underlying Weibull distribution.
7 Spatial survival model
Next, we extend our analysis by estimating a full spatial survival model. This model incorporates both clinical covariates (such as age
, sex
, wbc
, and tpi
) and spatial information represented by the coordinates (xcoord
and ycoord
). The inclusion of spatial covariates allows us to assess potential geographic variation in survival probabilities, which could reveal spatial clustering or regional effects on survival outcomes.
The model is specified using the formula below, where the response is the censored survival time (Surv(time, cens)
), and smooth terms (s()
) are used to model nonlinear effects of continuous covariates like age
, wbc
, tpi
, and the spatial coordinates. The pipe symbol (|
) separates the formulas for the parameters of the distribution, allowing for flexible covariate effects on survival times.
## model formula
<- Surv(time, cens) ~ sex + s(age) + s(wbc) + s(tpi) + s(xcoord, ycoord) |
f + s(age) + s(wbc) + s(tpi) + s(xcoord, ycoord)
sex
## estimate the spatial survival model
<- gamlss2(f, family = fam, data = LeukSurv) b2
GAMLSS-RS iteration 1: Global Deviance = 11869.1992 eps = 0.042093
GAMLSS-RS iteration 2: Global Deviance = 11866.1148 eps = 0.000259
GAMLSS-RS iteration 3: Global Deviance = 11866.0959 eps = 0.000001
8 Visualizing effects & residuals
To explore the impact of the covariates and spatial coordinates, we can plot the estimated effects from the model. This visualization helps us understand how each covariate, including the spatial effect, influences survival outcomes.
par(mar = c(4, 4, 3, 1))
plot(b2)
To assess the model fit, we generate diagnostic plots of the quantile residuals. These plots allow us to evaluate the adequacy of the model and check for any potential issues.
plot(b2, which = "resid")
The plot suggests that the Weibull distribution is a reasonable choice for modeling the survival data. However, it is always advisable to explore alternative distributions to ensure the best possible model fit. In the GAMLSS framework, various other parametric distributions-such as the log-normal, gamma, or generalized gamma-can be easily tested to assess whether they provide a better fit to the data, especially if there are signs of deviations in the residuals or misspecification of the hazard function.
9 Visualize the spatial effect.
## read the map of new west england.
<- system.file("otherdata/nwengland.bnd", package = "spBayesSurv")
file <- readLines(file)
d
## transform the polygons to a list().
<- grep('\"', d, fixed = TRUE)
id <- list()
polys for(i in 1:length(id)) {
<- strsplit(d[id[i]], ",")[[1]][2]
j if(i < length(id))
<- d[(id[i] + 1):(id[i + 1] - 1)]
polys[[j]] else
<- d[(id[i] + 1):length(d)]
polys[[j]]
}
<- lapply(polys, function(x) {
polys <- tempfile()
tf writeLines(x, tf)
<- as.matrix(read.csv(tf, header = FALSE))
pol unlink(tf)
return(st_polygon(list(pol)))
})
## transform to sf object
<- st_sfc(polys)
polys_sfc <- st_sf(geometry = polys_sfc)
map $id <- names(polys)
map$district <- 1:nrow(map)
map
## plot the map
par(mar = rep(0, 4))
plot(st_geometry(map))
## sample coordinates for plotting
<- st_sample(map, size = 10000, type = "regular")
co
## create new data for prediction
<- as.data.frame(st_coordinates(co))
nd names(nd) <- c("xcoord", "ycoord")
$sex <- 1
nd$wbc <- mean(LeukSurv$wbc)
nd$tpi <- mean(LeukSurv$tpi)
nd$age <- mean(LeukSurv$age)
nd
## predict parameters
<- predict(b2, newdata = nd)
par
## compute survival probabilities
<- 1 - family(b2)$p(Surv(rep(365, nrow(nd)), rep(1, nrow(nd))), par)
p2
## plot on map
<- st_sf(geometry = co)
sp $survprob <- p2
sp
## reference system, only used for plotting
<- st_set_crs(sp, 4326)
sp <- st_set_crs(map, 4326)
map
## plot as raster map
<- st_as_stars(sp)
spr
par(mar = c(0, 0, 5, 0))
plot(spr, main = "Survival Probabilities Prob(t > 365)",
reset = FALSE, nbreaks = 100)
plot(st_geometry(map), add = TRUE)
box()