Supplement to Distribution Families in gamlss2

A comparison to families in gamlss

The GAMLSS working party

The distributions in the gamlss2 package differ from those in the original gamlss family, as described by Rigby et al. (2019). The structure of the "gamlss.family" object (see Stasinopoulos and Rigby 2007) differs from that of the "gamlss2.family". While both families contain the same core information about the distribution, it is represented in different formats. The key reason for this change is that the new "gamlss2.family" structure is more flexible and can be more easily generalized to accommodate models with more than four parameters—something that was a limitation of the earlier GAMLSS implementation in the gamlss package.

In this vignette, we aim to:

First we import the packages we are going to need later.

library("gamlss")
library("gamlss2")
library("gamlss.ggplots")

1 The structure of "gamlss2.family" object

A "gamlss2.family" object is a list with the following components:

  • family: The name of the family.

  • names: The names of the distribution parameters.

  • links: The link functions for the parameters.

  • d: The probability density function (PDF). If the log-likelihood function (loglik()) is not provided, it will be evaluated using d().

  • scores: A named list of the first derivatives of the log-likelihood with respect to the predictors. There is one list entry for each parameter. If these derivatives are not provided, they will be automatically approximated numerically.

  • hess: A named list of second derivatives with respect to the predictors. If these derivatives are missing or not supplied, they will be approximated numerically using information from the scores component.

  • p: The cumulative distribution function (CDF).

  • r: The random number generation function.

  • q: The quantile function (inverse CDF).

  • initialize: A named list of functions used to set starting values for the distribution parameters.

  • mean: The mean function of the distribution (if it exists).

  • variance: The variance function of the distribution (if it exists).

  • valid.response: A function that checks the validity of the response range.

The first four components (family, names, links, and d) are mandatory for setting up a "gamlss2.family" object. The remaining components are optional. However, for model diagnostics using (randomized) quantile residuals, it is recommended to also provide the p() function. Additionally, to compute quantiles, the q() function is necessary. In practice, it is advised to always include the q() function in the family object.

2 The normal distribution as gamlss2 family

As an example, we demonstrate how to define the normal distribution as a "gamlss2.family" object.

Normal <- function(...) {
  fam <- list(
    "family" = "Normal",
    "names" = c("mu", "sigma"),
    "links" = c("mu" = "identity", "sigma" = "log"),
    
    ## PDF and log-likelihood function
    "d" = function(y, par, log = FALSE) {
      dnorm(y, mean = par$mu, sd = par$sigma, log = log)
    },
    "loglik" = function(y, par, ...) {
      sum(dnorm(y, par$mu, par$sigma, log = TRUE))
    },
    
    ## derivatives (first and second)
    "scores" = list(
      "mu" = function(y, par, ...) {
        (y - par$mu) / (par$sigma^2)
      },
      "sigma" = function(y, par, ...) {
        -1 + (y - par$mu)^2 / (par$sigma^2)
      }
    ),
    
    "hess" = list(
      "mu" = function(y, par, ...) {
        1 / (par$sigma^2)
      },
      "sigma" = function(y, par, ...) {
        rep(2, length(y))
      },
      "mu.sigma" = function(y, par, ...) {
        rep(0, length(y))
      }
    ),
    
    ## CDF, quantile, and random functions
    "p" = function(y, par, ...) { 
      pnorm(y, mean = par$mu, sd = par$sigma, ...)
    },
    "r" = function(n, par) {
      rnorm(n, mean = par$mu, sd = par$sigma)
    },
    "q" = function(p, par) {
      qnorm(p, mean = par$mu, sd = par$sigma)
    },
    
    ## initialization functions
    "initialize" = list(
      "mu" = function(y, ...) {
        (y + mean(y)) / 2
      },
      "sigma" = function(y, ...) {
        rep(sd(y), length(y))
      }
    ),
    
    ## mean and variance functions
    "mean" = function(par) {
      par$mu
    },
    "variance" = function(par) {
      par$sigma^2
    },
    
    ## valid response check
    "valid.response" = function(x) {
      if(is.factor(x) | is.character(x))
        stop("The response should be numeric!")
      return(TRUE)
    }
  )
  
  class(fam) <- "gamlss2.family"

  return(fam)
}

The family and names components define the name of the distribution and the names of the distribution’s parameters, respectively.

The links component specifies the default link functions for the parameters. Link functions govern the range and behavior of the parameters. For example, using a log link for the scale parameter \(\sigma\) ensures that it remains positive. Note that link functions alter the interpretation of the model. For instance, an identity link implies an additive relationship on the parameter scale, such as \(b_1 x_1 + \dots + b_p x_p\), whereas a log link is additive on the predictor scale but introduces a multiplicative structure on the original parameter scale.

The scores(), hess(), and loglik() functions define the first and second derivatives and the log-likelihood of the distribution, respectively. These derivatives are crucial for model fitting. The initialize() and valid.response() functions also play important roles in model fitting: initialize() specifies how the initial values for the parameters are determined, and valid.response() defines the valid range for the response variable \(y\).

The d(), p(), q(), and r() functions define the probability density function (PDF), cumulative distribution function (CDF), quantile function (inverse CDF), and random number generation function for the distribution family.

The mean() and variance() functions define the mean and variance of the distribution (if they exist). Note that for some distributions—especially those with heavy tails—these moments may not exist.

 Mikis commends
it needs to say here which of the options are compulsory and
which are not. By undestanding is that
 family, names, links, d and p are compulsory the rest not. Am I right?
I am not clear when and why rqres is needed.

3 Fitting a marginal distribution

By marginal distribution, we refer to the situation where we are interested in fitting a theoretical distribution, \(f(y|\theta)\), to a single vector \(y\) without considering any explanatory variables. In contrast, we use the term conditional distribution to describe the scenario where the response variable \(y\) is modeled based on explanatory variables \(X\), such that the parameters \(\theta\) are a function of \(X\), i.e., \(f(y|\theta(X))\).

3.1 The function find_family()

In this section, we aim to find the “best” family for the variable \(y\) from a set of relevant available families:

p <- find_family(rent$R, families = available_families("continuous"))

The option "continuous" limits the available distributions to those in the continuous category. The function find_family() tries to fit all continuous distributions and stores the Generalized Akaike Information Criterion (GAIC) values in a vector. The values of this vector are sorted in decreasing order of GAIC.

print(p)

Note that similar results can be obtained in gamlss using the fitDist() function:

a <- gamlss::fitDist(rent$R)

Here, the results are sorted in increasing order of GAIC:

a$fits

3.2 The function fit_family()

The function fit_family() is similar to the older gamlss function fitDist(). It fits a marginal distribution to the variable \(y\) and then generates a histogram of \(y\) alongside the fitted distribution.

fit_family(rent$R, family=BCTo)

Here is the output from the older fitDist() function:

gamlss::histDist(rent$R, family=BCTo, nbins=30)

4 Fitting conditional distributions

4.1 Single conditional distributions

In this section, we use the Normal family defined in Section 2 to fit a Normal distribution to a model specified by the following formula

## model formula 
f <- R ~ ti(Fl) + ti(A) + ti(Fl, A, bs = "ps") | 
  ti(Fl) + ti(A) + ti(Fl, A, bs = "ps") 

## estimate model
b1 <- gamlss2(f, data = rent, family = Normal)

We can visualize the fitted smoothing terms using

plot(b1)

To inspect the residuals of the model, use

resid_plots(b1)

To improve the residual plots, we can fit a more sophisticated distribution with four parameters.

b2 <- gamlss2(f, data = rent, family = BCTo) 
resid_plots(b2)

4.2 Fitting multiple conditional distributions

The function chooseDist() from the gamlss package can be used to compare multiple conditional distributions.

T1 <- gamlss::chooseDist(m, type = "realplus")

This function generates a table of GAIC values, which can be printed as follows:

library(knitr)
T1  |> kable(digits = c(2, 2, 3, 3 ))

Note that the function uses the same formula f to fit all the distributional regression models. Also, note that f does not include a model for the parameters \(\nu\) and \(\tau\); therefore, constants (intercepts) are fitted for these two parameters when the distributions have more than two parameters.

5 Fitting a 5-Parameter distribution

In this section, we demonstrate the advantage of using gamlss2 over gamlss for fitting distribution families, specifically highlighting the ease with which a five-parameter distribution can be fitted. As an example, we use the Skew Generalized T (SGT) distribution.

The SGT distribution was introduced to R by the sgt package (Davis 2015). Its key feature is that it includes a wide range of sub-distributions as special cases, making it highly versatile. The figure below, taken from the sgt package vignette, illustrates the Skew Generalized T distribution and its various sub-distributions.

The Skew Generalized T distributions and its sub-distributions (Image from the vignette of the R package sgt)
Figure 1

5.1 Importing the sgt Package

First, load the sgt package:

library("sgt")

5.2 Defining the SGT distribution as a "gamlss2.family"

We now define the Skew Generalized T distribution as a "gamlss2.family" object using the d(), p(), and q() functions from the sgt package. The function below shares similarities with the Normal distribution created in Section 2, though it omits some elements, including:

  • score
  • hess
  • loglik()
  • mean()
  • variance()

Here is how the SGT() function is defined:

## Skewed generalized T distribution
SGT <- function(...) {
  stopifnot(requireNamespace("sgt"))

  fam <- list(
    "names" = c("mu", "sigma", "lambda", "p", "q"),
    "links" = c("mu" = "identity", "sigma" = "log",
       "lambda" = "rhogit", "p" = "log", "q" = "log"),
    "d" = function(y, par, log = FALSE, ...) {
      sgt::dsgt(y, mu = par$mu, sigma = par$sigma, lambda = par$lambda,
        p = par$p, q = par$q, log = log)
    },
    "p" = function(y, par, lower.tail = TRUE, log.p = FALSE) {
      sgt::psgt(y, mu = par$mu, sigma = par$sigma,
        lambda = par$lambda, p = par$p, q = par$q,
        lower.tail = lower.tail, log.p = log.p)
    },
    "q" = function(p, par, lower.tail = TRUE, log.p = FALSE) {
      sgt::qsgt(p, mu = par$mu, sigma = par$sigma,
        lambda = par$lambda, p = par$p, q = par$q,
        lower.tail = lower.tail, log.p = log.p)
    }
  )

  class(fam) <- "gamlss2.family"

  return(fam)
}
data("abdom", package = "gamlss.data")
## specify the model formula
f <- y ~ s(x) | s(x) | s(x) | s(x) | s(x)

## estimate SGT model
b <- gamlss2(f, data = abdom, family = SGT,
  maxit = c(100, 100), eps = 1e-04)

## plot estimated effects
plot(b, which = "effects")

References

Davis, Carter. 2015. Sgt: Skewed Generalized t Distribution Tree. https://CRAN.R-project.org/package=sgt.
Rigby, R. A., D. M. Stasinopoulos, G. Z. Heller, and F. De Bastiani. 2019. Distributions for Modeling Location, Scale, and Shape: Using GAMLSS in R. Boca Raton: Chapman & Hall/CRC. https://doi.org/10.1201/9780429298547.
Stasinopoulos, D. M., and R. A. Rigby. 2007. “Generalized Additive Models for Location Scale and Shape (GAMLSS) in R.” Journal of Statistical Software 23 (7): 1–46. https://doi.org/10.18637/jss.v023.i07.