A small guide to step.Gam (2024)

If you, like me, are learning how to handle generalised additive models, it may have crossed your mind to perform some automated variable selection, for example by varying the degrees of freedom of particular splines in your model.

I would caution you that unless you have been forced to do stepwise variable selection, or are very careful about interpreting its results, it is generally considered a bad idea, and you may want to consider alternatives. Here is a good place to start your search if so.

We will be using the gam package, which is a good place to start learning GAMs. If you are looking to use GAMs in a more serious capacity in R, you should investigate the mgcv package instead, which is favored in the community.

Disclaimer: do not load the mgcv and gam packages at the same time. Their namespaces overlap and are not compatible, which will lead to tears and horrible bugs.
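If mgcv happens to be attached in your session already, you can detach it before loading gam. A minimal sketch, assuming nothing else you have attached depends on mgcv:

# Detach mgcv first so the two namespaces never coexist on the search path
if ("package:mgcv" %in% search()) detach("package:mgcv", unload = TRUE)
library(gam)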

The function to do this, of course, is called step.Gam. The documentation is only somewhat helpful; access it via help(step.Gam). It includes a rather minimalistic example to get you started:

library(gam)
data(gam.data)  # toy dataset shipped with the gam package
Gam.object <- gam(y ~ x + z, data = gam.data)
step.object <- step.Gam(Gam.object,
                        scope = list("x" = ~1 + x + s(x, 4) + s(x, 6) + s(x, 12),
                                     "z" = ~1 + z + s(z, 4)))

Here, Gam.object is the starting model, the point from which step.Gam explores the variable space and builds candidate models. AIC is used internally to pick between models and is reported in the function output at every step.
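As a reminder, AIC trades goodness of fit against model complexity: AIC = 2k - 2 log(L̂), where k is the (effective) number of parameters and L̂ is the maximised likelihood. Lower values are better, so a wigglier spline only "wins" if its improvement in likelihood outweighs the extra degrees of freedom it spends.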

One of the most important and confusing parts is the scope=list(...) component of the step.Gam call. This list is crucial, as it defines which terms to explore while building up the model. Each list element must be a formula whose name matches, exactly, the predictor variable it applies to. In the call above, we have a list component called "x" corresponding to the formula ~1 + x + s(x, 4) + s(x, 6) + s(x, 12), and another called "z" corresponding to ~1 + z + s(z, 4). These names match the predictors in the Gam.object formula, which is why this scope works with this starting model. step.Gam uses the terms separated by plus signs as the building blocks of its candidate models, so if you want the stepwise selection to consider extra alternatives for a variable, simply append them after a plus sign; these can be, for example, natural splines such as ns(x, knots=2), as in the sketch below.
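For instance, a scope entry that also offers natural-spline alternatives could look like the following. This is a sketch, assuming the splines package for ns(), which is not part of gam itself:

library(splines)  # provides ns(), natural cubic splines
scope_x = ~1 + x + ns(x, df = 2) + s(x, 4) + s(x, 6)

Note that step.Gam treats each of these formulas as an ordered regimen, from least to most complex, and at every step tries moving each variable one position up or down its regimen; you can see exactly this happening in the Trial: lines of the trace output further below.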

Let’s do a slightly more advanced example to show further how this works. The code can be found here. We are using a dataset called prostate, which has a handful of variables and about a hundred observations, and we are trying to find a good model to predict the Cscore response variable. We want to fit a smoothing-spline model and select how many degrees of freedom we need for a good fit, as judged by AIC. We will do backward and forward selection to illustrate; by default, step.Gam scans in both directions.
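If you want to follow along, load the data into a data frame called prostate first. The file name here is hypothetical; point it at wherever your copy of the dataset lives:

prostate = read.csv("prostate.csv")  # hypothetical path
str(prostate)  # should contain Cscore plus the predictors used below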

First, let us define the scope list:

scope_list = list(
  "lcavol"  = ~1 + lcavol  + s(lcavol, df=2)  + s(lcavol, df=3)  + s(lcavol, df=4)  + s(lcavol, df=5),
  "lweight" = ~1 + lweight + s(lweight, df=2) + s(lweight, df=3) + s(lweight, df=4) + s(lweight, df=5),
  "age"     = ~1 + age     + s(age, df=2)     + s(age, df=3)     + s(age, df=4)     + s(age, df=5),
  "lbph"    = ~1 + lbph    + s(lbph, df=2)    + s(lbph, df=3)    + s(lbph, df=4)    + s(lbph, df=5),
  "lcp"     = ~1 + lcp     + s(lcp, df=2)     + s(lcp, df=3)     + s(lcp, df=4)     + s(lcp, df=5),
  "lpsa"    = ~1 + lpsa    + s(lpsa, df=2)    + s(lpsa, df=3)    + s(lpsa, df=4)    + s(lpsa, df=5)
)

One can see that we’re only considering splines up to five degrees of freedom.
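Typing out these near-identical entries by hand gets tedious. A small helper can generate them instead; this is a hypothetical convenience function, not something the gam package provides:

# Hypothetical helper: builds the scope formula for one predictor, offering
# drop (1), linear, and smoothing splines with df = 2 up to max_df
make_scope = function(var, max_df = 5) {
  terms = c("1", var, sprintf("s(%s, df=%d)", var, 2:max_df))
  as.formula(paste("~", paste(terms, collapse = " + ")))
}

scope_list = sapply(c("lcavol", "lweight", "age", "lbph", "lcp", "lpsa"),
                    make_scope, simplify = FALSE)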

Next, let’s pick a fairly arbitrary starting model to demonstrate how the function goes both ways by default.

start_model_rand = gam(Cscore~lpsa+s(lcavol, df=3), data=prostate)

And run the selection by typing:

step.Gam(start_model_rand, scope_list)

Which should produce the following output:

Start: Cscore ~ lpsa + s(lcavol, df = 3); AIC= 969 
Step:1 Cscore ~ s(lcavol, df = 3) + s(lpsa, df = 2) ; AIC= 924
Step:2 Cscore ~ s(lcavol, df = 3) + s(lpsa, df = 3) ; AIC= 906
Step:3 Cscore ~ s(lcavol, df = 3) + lcp + s(lpsa, df = 3) ; AIC= 899
Step:4 Cscore ~ s(lcavol, df = 3) + s(lcp, df = 2) + s(lpsa, df = 3) ; AIC= 895
Step:5 Cscore ~ s(lcavol, df = 3) + s(lcp, df = 3) + s(lpsa, df = 3) ; AIC= 889
Step:6 Cscore ~ s(lcavol, df = 3) + s(lcp, df = 3) + s(lpsa, df = 4) ; AIC= 885
Step:7 Cscore ~ s(lcavol, df = 3) + age + s(lcp, df = 3) + s(lpsa, df = 4) ; AIC= 882
Step:8 Cscore ~ s(lcavol, df = 3) + age + s(lcp, df = 4) + s(lpsa, df = 4) ; AIC= 879
Step:9 Cscore ~ s(lcavol, df = 3) + age + s(lcp, df = 4) + s(lpsa, df = 5) ; AIC= 878
Step:10 Cscore ~ s(lcavol, df = 2) + age + s(lcp, df = 4) + s(lpsa, df = 5) ; AIC= 877
Step:11 Cscore ~ lcavol + age + s(lcp, df = 4) + s(lpsa, df = 5) ; AIC= 876
Call:
gam(formula = Cscore ~ lcavol + age + s(lcp, df = 4) + s(lpsa,
df = 5), data = prostate, trace = FALSE)
Degrees of Freedom: 96 total; 85 Residual
Residual Deviance: 36186

Thus it returns the “best” model, best meaning the lowest AIC in the space it has searched; in this case, Cscore ~ lcavol + age + s(lcp, df = 4) + s(lpsa, df = 5).
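The return value is an ordinary Gam object, so you can store it and inspect it with the usual tools; a minimal sketch:

best_model = step.Gam(start_model_rand, scope_list)
summary(best_model)          # per-term deviance breakdown and significance
plot(best_model, se = TRUE)  # partial plot of each fitted term, with standard errors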

Now let’s attempt forward stepwise selection. For this, we can use a minimalistic starting model that includes each variable as a plain linear term (lpsa + lcavol, and so on), using the dot formula operator to fill these in for us. With direction = "forward", terms can only move up their regimens towards more complexity, so no variable can be dropped; variables not named in the scope list at all, like svi, are simply carried along unchanged.

start_model_minimal = gam(Cscore~., data=prostate)
step.Gam(start_model_minimal, scope_list, direction = "forward")

Which produces:

Start: Cscore ~ .; AIC= 974 
Step:1 Cscore ~ svi + lcavol + lweight + age + lbph + lcp + s(lpsa, df = 2) ; AIC= 917
Step:2 Cscore ~ svi + lcavol + lweight + age + lbph + lcp + s(lpsa, df = 3) ; AIC= 898
Step:3 Cscore ~ svi + lcavol + lweight + age + lbph + s(lcp, df = 2) + s(lpsa, df = 3) ; AIC= 894
Step:4 Cscore ~ svi + lcavol + lweight + age + lbph + s(lcp, df = 3) + s(lpsa, df = 3) ; AIC= 888
Step:5 Cscore ~ svi + lcavol + lweight + age + lbph + s(lcp, df = 3) + s(lpsa, df = 4) ; AIC= 884
Step:6 Cscore ~ svi + lcavol + lweight + age + lbph + s(lcp, df = 4) + s(lpsa, df = 4) ; AIC= 882
Step:7 Cscore ~ svi + lcavol + lweight + age + lbph + s(lcp, df = 4) + s(lpsa, df = 5) ; AIC= 880
Call:
gam(formula = Cscore ~ svi + lcavol + lweight + age + lbph +
s(lcp, df = 4) + s(lpsa, df = 5), data = prostate, trace = FALSE)
Degrees of Freedom: 96 total; 82 Residual
Residual Deviance: 35649

Quite a different model this time, which is a good reminder to be careful when interpreting the output of stepwise selection.
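If you want to compare the winners of different searches head to head, Gam fits work with the standard AIC() generic; a quick sketch, assuming both results from above were saved. (The values may sit on a slightly different scale than the criterion step.Gam prints in its trace, which it computes internally, so compare like with like.)

fit_both    = step.Gam(start_model_rand, scope_list)
fit_forward = step.Gam(start_model_minimal, scope_list, direction = "forward")
AIC(fit_both)     # lower is better
AIC(fit_forward)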

We can try backward selection too. In this case, let’s start from a beefy model so the function has something to cut off. We can also pass trace=2 to the function to see more information about the selection process as it happens.

start_model = gam(Cscore ~ s(lcavol, df=5) + s(lweight, df=5) + s(age, df=5) + s(lbph, df=5) + s(lcp, df=5) + s(lpsa, df=5), data = prostate)
step.Gam(start_model, scope = scope_list, direction = "backward", trace = 2)

In particular, this verbose trace option prints every model the selection procedure considers, which is very useful for seeing which corners of the space one could still explore manually. As the output is quite long, I provide only the last two steps here:

Step:18 Cscore ~ lcavol + age + s(lcp, df = 5) + s(lpsa, df = 5) ; AIC= 876 
Trial: Cscore ~ 1 + 1 + age + 1 + s(lcp, df = 5) + s(lpsa, df = 5) ; AIC= 880
Trial: Cscore ~ lcavol + 1 + 1 + 1 + s(lcp, df = 5) + s(lpsa, df = 5) ; AIC= 880
Trial: Cscore ~ lcavol + 1 + age + 1 + s(lcp, df = 4) + s(lpsa, df = 5) ; AIC= 876
Trial: Cscore ~ lcavol + 1 + age + 1 + s(lcp, df = 5) + s(lpsa, df = 4) ; AIC= 878
Step:19 Cscore ~ lcavol + age + s(lcp, df = 4) + s(lpsa, df = 5) ; AIC= 876
Trial: Cscore ~ 1 + 1 + age + 1 + s(lcp, df = 4) + s(lpsa, df = 5) ; AIC= 879
Trial: Cscore ~ lcavol + 1 + 1 + 1 + s(lcp, df = 4) + s(lpsa, df = 5) ; AIC= 880
Trial: Cscore ~ lcavol + 1 + age + 1 + s(lcp, df = 3) + s(lpsa, df = 5) ; AIC= 878
Trial: Cscore ~ lcavol + 1 + age + 1 + s(lcp, df = 4) + s(lpsa, df = 4) ; AIC= 877
Call:
gam(formula = Cscore ~ lcavol + age + s(lcp, df = 4) + s(lpsa,
df = 5), data = prostate, trace = FALSE)
Degrees of Freedom: 96 total; 85 Residual
Residual Deviance: 36186

I hope this step-by-step guide helped you understand how to work with this seemingly rarely used function. One important thing to note is that step.Gam is written entirely in R, which means you shouldn’t be afraid to put breakpoints into it and use your debugger to inspect variables at any given point in the code, to understand what it wants and expects from you.
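For example, base R’s debugger drops you straight into the function; a minimal sketch:

debugonce(step.Gam)                     # enter the browser on the next call only
step.Gam(start_model_rand, scope_list)  # step through with n, print any variable by name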
