First Steps

The package is designed to follow the workflow of well-established model fitting functions like lm() or glm(), so estimating full distributional regression models requires little extra effort.

To illustrate the workflow using gamlss2, we analyze the HarzTraffic data and model the number of motorcycles (response bikes) counted at Sonnenberg in the Harz region of Germany. The data can be loaded with

data("HarzTraffic")
head(HarzTraffic)
        date yday bikes cars trucks others tempmin tempmax  temp humidity
1 2021-01-01    0     2 3135     25      1     0.2     2.6  1.32       85
2 2021-01-02    1     7 6593     32     10     1.6     4.0  2.63       72
3 2021-01-03    2     0 3367     30      2    -0.4     1.2  0.19       94
4 2021-01-04    3     0 2186     75      0    -0.6    -0.1 -0.37       97
5 2021-01-05    4     3 2071     68      0    -0.6     0.5 -0.21       98
6 2021-01-06    5     1 2022     97      0    -0.2     0.5  0.17       93
  tempdew cloudiness rain sunshine wind windmax
1   -0.93         99  1.2       50 1.38     3.7
2   -1.99        100  0.0       13 1.35     2.4
3   -0.65         98 13.5        0 1.74     2.8
4   -0.78         99  3.8        0 1.39     2.1
5   -0.46         98  5.3        0 1.42     2.1
6   -0.84         97  4.5        0 3.02     4.6
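For a first impression of the data, one might plot the daily motorcycle counts against the day of the year. This is only a simple exploratory sketch; it is not required for the model fits below.

## Exploratory sketch: seasonal variation of the daily motorcycle counts.
plot(bikes ~ yday, data = HarzTraffic,
  xlab = "Day of the year", ylab = "Number of motorcycles")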

The data consist of seasonal time information (variable yday) along with a number of environmental variables (e.g., mean daily temperature). As a first model, we estimate a linear regression model with normal errors (the default family):

b <- gamlss2(bikes ~ temp + rain + sunshine + wind, data = HarzTraffic)
GAMLSS-RS iteration  1: Global Deviance = 14325.7146 eps = 0.046095     
GAMLSS-RS iteration  2: Global Deviance = 14325.7146 eps = 0.000000     
summary(b)
Call:
gamlss2(formula = bikes ~ temp + rain + sunshine + wind, data = HarzTraffic)
---
Family: NO 
Link function: mu = identity, sigma = log
*--------
Parameter: mu 
---
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15.33601   19.67683  -0.779   0.4359    
temp         17.68501    1.01429  17.436  < 2e-16 ***
rain         -3.76399    1.74873  -2.152   0.0316 *  
sunshine      0.36452    0.02995  12.172  < 2e-16 ***
wind        -25.26905    4.38768  -5.759 1.11e-08 ***
*--------
Parameter: sigma 
---
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.35765    0.02175   246.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*--------
n = 1057 df =  6 res.df =  1051
Deviance = 14325.7146 Null Dev. Red. = 4.61%
AIC = 14337.7146 elapsed =  0.02sec

Note that the summary output is very similar to that of lm() and glm(); the main difference is that coefficient tables are provided for all parameters of the distribution. In this case, the model is estimated using the NO family of the gamlss.dist package, a two-parameter normal distribution with parameters mu and sigma.
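The Gaussian default corresponds to stating the family explicitly. The following lines are a minimal sketch, assuming the family argument of gamlss2() and the usual coef() extractor; the call is equivalent to the one above.

## Same model with the family stated explicitly;
## NO is the normal distribution from gamlss.dist.
b <- gamlss2(bikes ~ temp + rain + sunshine + wind,
  data = HarzTraffic, family = NO)

## Estimated coefficients for all distribution parameters.
coef(b)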

Residual Diagnostics

So far we have estimated a simple linear model with Gaussian errors, i.e., we assume that the response variable, the number of motorcycles (bikes), follows a normal distribution with constant variance. However, this assumption may not hold, especially for count data, which often exhibits overdispersion or non-constant variance.
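As a quick, informal check (a sketch rather than a formal test), the empirical mean and variance of the counts can be compared; for a Poisson-type count process they should be of similar magnitude, so a much larger variance hints at overdispersion.

## Informal overdispersion check: compare mean and variance of the counts.
c(mean = mean(HarzTraffic$bikes), variance = var(HarzTraffic$bikes))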

To assess whether the normal distribution with constant variance is appropriate, we can start by examining diagnostic plots.

plot(b)

These plots help us visually inspect the residuals for any deviations from the assumptions of normality and constant variance.
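Beyond the default plots, the residuals can also be inspected directly. The following is a minimal sketch, assuming that residuals() returns (randomized) quantile residuals for gamlss2 objects.

## Extract quantile residuals and check them for approximate normality.
r <- residuals(b)
hist(r, freq = FALSE, main = "Histogram of quantile residuals")
qqnorm(r)
qqline(r)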
