library("gamlss")
library("gamlss2")
library("gamlss.ggplots")
Supplement to Distribution Families in gamlss2
A comparison to families in gamlss
The distributions in the gamlss2
package differ from those in the original gamlss
family, as described by Rigby et al. (2019). The structure of the "gamlss.family"
object (see Stasinopoulos and Rigby 2007) differs from that of the "gamlss2.family"
. While both families contain the same core information about the distribution, it is represented in different formats. The key reason for this change is that the new "gamlss2.family"
structure is more flexible and can be more easily generalized to accommodate models with more than four parameters—something that was a limitation of the earlier GAMLSS implementation in the gamlss
package.
In this vignette, we aim to:
- Highlight the structure of a
"gamlss2.family"
object (Section 1); - Demonstrate how to fit a marginal distribution (Section 3) and a conditional distribution for the response \(y\) (Section 4);
- Fit a five-parameter distribution (Section 5).
First we import the packages we are going to need later.
1 The structure of "gamlss2.family"
object
A "gamlss2.family"
object is a list with the following components:
family
: The name of the family.names
: The names of the distribution parameters.links
: The link functions for the parameters.d
: The probability density function (PDF). If the log-likelihood function (loglik()
) is not provided, it will be evaluated usingd()
.scores
: A named list of the first derivatives of the log-likelihood with respect to the predictors. There is one list entry for each parameter. If these derivatives are not provided, they will be automatically approximated numerically.hess
: A named list of second derivatives with respect to the predictors. If these derivatives are missing or not supplied, they will be approximated numerically using information from thescores
component.p
: The cumulative distribution function (CDF).r
: The random number generation function.q
: The quantile function (inverse CDF).initialize
: A named list of functions used to set starting values for the distribution parameters.mean
: The mean function of the distribution (if it exists).variance
: The variance function of the distribution (if it exists).valid.response
: A function that checks the validity of the response range.
The first four components (family
, names
, links
, and d
) are mandatory for setting up a "gamlss2.family"
object. The remaining components are optional. However, for model diagnostics using (randomized) quantile residuals, it is recommended to also provide the p()
function. Additionally, to compute quantiles, the q()
function is necessary. In practice, it is advised to always include the q()
function in the family object.
2 The normal distribution as gamlss2
family
As an example, we demonstrate how to define the normal distribution as a "gamlss2.family"
object.
<- function(...) {
Normal <- list(
fam "family" = "Normal",
"names" = c("mu", "sigma"),
"links" = c("mu" = "identity", "sigma" = "log"),
## PDF and log-likelihood function
"d" = function(y, par, log = FALSE) {
dnorm(y, mean = par$mu, sd = par$sigma, log = log)
},"loglik" = function(y, par, ...) {
sum(dnorm(y, par$mu, par$sigma, log = TRUE))
},
## derivatives (first and second)
"scores" = list(
"mu" = function(y, par, ...) {
- par$mu) / (par$sigma^2)
(y
},"sigma" = function(y, par, ...) {
-1 + (y - par$mu)^2 / (par$sigma^2)
}
),
"hess" = list(
"mu" = function(y, par, ...) {
1 / (par$sigma^2)
},"sigma" = function(y, par, ...) {
rep(2, length(y))
},"mu.sigma" = function(y, par, ...) {
rep(0, length(y))
}
),
## CDF, quantile, and random functions
"p" = function(y, par, ...) {
pnorm(y, mean = par$mu, sd = par$sigma, ...)
},"r" = function(n, par) {
rnorm(n, mean = par$mu, sd = par$sigma)
},"q" = function(p, par) {
qnorm(p, mean = par$mu, sd = par$sigma)
},
## initialization functions
"initialize" = list(
"mu" = function(y, ...) {
+ mean(y)) / 2
(y
},"sigma" = function(y, ...) {
rep(sd(y), length(y))
}
),
## mean and variance functions
"mean" = function(par) {
$mu
par
},"variance" = function(par) {
$sigma^2
par
},
## valid response check
"valid.response" = function(x) {
if(is.factor(x) | is.character(x))
stop("The response should be numeric!")
return(TRUE)
}
)
class(fam) <- "gamlss2.family"
return(fam)
}
The family
and names
components define the name of the distribution and the names of the distribution’s parameters, respectively.
The links
component specifies the default link functions for the parameters. Link functions govern the range and behavior of the parameters. For example, using a log
link for the scale parameter \(\sigma\) ensures that it remains positive. Note that link functions alter the interpretation of the model. For instance, an identity
link implies an additive relationship on the parameter scale, such as \(b_1 x_1 + \dots + b_p x_p\), whereas a log
link is additive on the predictor scale but introduces a multiplicative structure on the original parameter scale.
The scores()
, hess()
, and loglik()
functions define the first and second derivatives and the log-likelihood of the distribution, respectively. These derivatives are crucial for model fitting. The initialize()
and valid.response()
functions also play important roles in model fitting: initialize()
specifies how the initial values for the parameters are determined, and valid.response()
defines the valid range for the response variable \(y\).
The d()
, p()
, q()
, and r()
functions define the probability density function (PDF), cumulative distribution function (CDF), quantile function (inverse CDF), and random number generation function for the distribution family.
The mean()
and variance()
functions define the mean and variance of the distribution (if they exist). Note that for some distributions—especially those with heavy tails—these moments may not exist.
Mikis commends
it needs to say here which of the options are compulsory and
which are not. By undestanding is that
family
, names
, links
, d
and p
are compulsory the rest not. Am I right?I am not clear when and why
rqres
is needed.3 Fitting a marginal
distribution
By marginal distribution, we refer to the situation where we are interested in fitting a theoretical distribution, \(f(y|\theta)\), to a single vector \(y\) without considering any explanatory variables. In contrast, we use the term conditional distribution to describe the scenario where the response variable \(y\) is modeled based on explanatory variables \(X\), such that the parameters \(\theta\) are a function of \(X\), i.e., \(f(y|\theta(X))\).
3.1 The function find_family()
In this section, we aim to find the “best” family for the variable \(y\) from a set of relevant available families:
<- find_family(rent$R, families = available_families("continuous")) p
The option "continuous"
limits the available distributions to those in the continuous category. The function find_family()
tries to fit all continuous distributions and stores the Generalized Akaike Information Criterion (GAIC) values in a vector. The values of this vector are sorted in decreasing order of GAIC.
print(p)
Note that similar results can be obtained in gamlss
using the fitDist()
function:
<- gamlss::fitDist(rent$R) a
Here, the results are sorted in increasing order of GAIC:
$fits a
3.2 The function fit_family()
The function fit_family()
is similar to the older gamlss
function fitDist()
. It fits a marginal distribution to the variable \(y\) and then generates a histogram of \(y\) alongside the fitted distribution.
fit_family(rent$R, family=BCTo)
Here is the output from the older fitDist()
function:
::histDist(rent$R, family=BCTo, nbins=30) gamlss
4 Fitting conditional distributions
4.1 Single conditional distributions
In this section, we use the Normal
family defined in Section 2 to fit a Normal distribution to a model specified by the following formula
## model formula
<- R ~ ti(Fl) + ti(A) + ti(Fl, A, bs = "ps") |
f ti(Fl) + ti(A) + ti(Fl, A, bs = "ps")
## estimate model
<- gamlss2(f, data = rent, family = Normal) b1
We can visualize the fitted smoothing terms using
plot(b1)
To inspect the residuals of the model, use
resid_plots(b1)
To improve the residual plots, we can fit a more sophisticated distribution with four parameters.
<- gamlss2(f, data = rent, family = BCTo)
b2 resid_plots(b2)
4.2 Fitting multiple conditional distributions
The function chooseDist()
from the gamlss
package can be used to compare multiple conditional distributions.
<- gamlss::chooseDist(m, type = "realplus") T1
This function generates a table of GAIC values, which can be printed as follows:
library(knitr)
|> kable(digits = c(2, 2, 3, 3 )) T1
Note that the function uses the same formula f
to fit all the distributional regression models. Also, note that f
does not include a model for the parameters \(\nu\) and \(\tau\); therefore, constants (intercepts) are fitted for these two parameters when the distributions have more than two parameters.
5 Fitting a 5-Parameter distribution
In this section, we demonstrate the advantage of using gamlss2
over gamlss
for fitting distribution families, specifically highlighting the ease with which a five-parameter distribution can be fitted. As an example, we use the Skew Generalized T (SGT) distribution.
The SGT distribution was introduced to R by the sgt
package (Davis 2015). Its key feature is that it includes a wide range of sub-distributions as special cases, making it highly versatile. The figure below, taken from the sgt
package vignette, illustrates the Skew Generalized T distribution and its various sub-distributions.
5.1 Importing the sgt
Package
First, load the sgt
package:
library("sgt")
5.2 Defining the SGT distribution as a "gamlss2.family"
We now define the Skew Generalized T distribution as a "gamlss2.family"
object using the d()
, p()
, and q()
functions from the sgt
package. The function below shares similarities with the Normal distribution created in Section 2, though it omits some elements, including:
score
hess
loglik()
mean()
variance()
Here is how the SGT()
function is defined:
## Skewed generalized T distribution
<- function(...) {
SGT stopifnot(requireNamespace("sgt"))
<- list(
fam "names" = c("mu", "sigma", "lambda", "p", "q"),
"links" = c("mu" = "identity", "sigma" = "log",
"lambda" = "rhogit", "p" = "log", "q" = "log"),
"d" = function(y, par, log = FALSE, ...) {
::dsgt(y, mu = par$mu, sigma = par$sigma, lambda = par$lambda,
sgtp = par$p, q = par$q, log = log)
},"p" = function(y, par, lower.tail = TRUE, log.p = FALSE) {
::psgt(y, mu = par$mu, sigma = par$sigma,
sgtlambda = par$lambda, p = par$p, q = par$q,
lower.tail = lower.tail, log.p = log.p)
},"q" = function(p, par, lower.tail = TRUE, log.p = FALSE) {
::qsgt(p, mu = par$mu, sigma = par$sigma,
sgtlambda = par$lambda, p = par$p, q = par$q,
lower.tail = lower.tail, log.p = log.p)
}
)
class(fam) <- "gamlss2.family"
return(fam)
}
data("abdom", package = "gamlss.data")
## specify the model formula
<- y ~ s(x) | s(x) | s(x) | s(x) | s(x)
f
## estimate SGT model
<- gamlss2(f, data = abdom, family = SGT,
b maxit = c(100, 100), eps = 1e-04)
## plot estimated effects
plot(b, which = "effects")