Reference levels matter

When doing a regression analysis with categorical variables, which level is used as the reference level can be important. This is underappreciated, since most non-major classes on regression (or more precisely, regression classes that don’t show you the underlying matrix algebra) don’t talk about it. Software mostly hides this as well unless users want to dive deep into the options. Failing to consider your choice of reference level and how that choice can effect your analysis can lead you to erroneous (or at least dubious) conclusions.

What are reference levels?

Before we get into why reference levels matter, let’s take a step back and talk about what reference levels are. In regression analyses involving categorical variables, these variables are represented numerically by unique combinations of 1’s and 0’s. For a two-level categorical variable, 1’s are arbitrarily assigned to represent one level of the variable and 0’s another in a single column:

## # A tibble: 10 x 2
##    `Categorical Variable` `Coding of Categorical Variable`
##    <chr>                                             <dbl>
##  1 A                                                     0
##  2 A                                                     0
##  3 B                                                     1
##  4 A                                                     0
##  5 A                                                     0
##  6 A                                                     0
##  7 A                                                     0
##  8 A                                                     0
##  9 A                                                     0
## 10 B                                                     1

Categorical variables with more than two levels are represented by adding an extra column for level beyond two. Thus, a categorical variable with \(p\) levels is represented by \(p-1\) columns. Each column will take the value \(1\) if the categorical variable is a specific number and \(0\) otherwise:

## # A tibble: 10 x 4
##    `Categorical Variable` `Variable is A` `Variable is B` `Variable is C`
##    <chr>                            <dbl>           <dbl>           <dbl>
##  1 A                                    0               0               0
##  2 B                                    1               0               0
##  3 B                                    1               0               0
##  4 C                                    0               1               0
##  5 C                                    0               1               0
##  6 C                                    0               1               0
##  7 D                                    0               0               1
##  8 D                                    0               0               1
##  9 D                                    0               0               1
## 10 D                                    0               0               1

In this case, “D” is the reference level, since if all of the columns take the value 0, we know the level of our categorical variable is “D”.

(The type of factor coding I described above is known as “dummy coding” or “effect coding”. (Or if you’re in Machine Learning, “one-hot encoding”, which is a very silly name.) You can read about alternative approaches to variable coding here. Depending on what you actually want to do, it can often be beneficial to encode your categorical factors in a different way.)

Reference Levels in Regression

Now suppose I want to fit a regression model using these factors. I don’t have to manually generate the columns of 1’s and 0’s; that happens automatically in software. For example, I can use lm() to fit a simple model in R:

fit <- lm(`Response Variable` ~ `Categorical Variable`, data = mydata)

When running this code, R will call model.matrix() to generate a matrix that looks a lot like what we saw above:

model.matrix(~`Categorical Variable`, data = mydata)
##    (Intercept) `Categorical Variable`B `Categorical Variable`C
## 1            1                       0                       0
## 2            1                       1                       0
## 3            1                       1                       0
## 4            1                       0                       1
## 5            1                       0                       1
## 6            1                       0                       1
## 7            1                       0                       0
## 8            1                       0                       0
## 9            1                       0                       0
## 10           1                       0                       0
##    `Categorical Variable`D
## 1                        0
## 2                        0
## 3                        0
## 4                        0
## 5                        0
## 6                        0
## 7                        1
## 8                        1
## 9                        1
## 10                       1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`Categorical Variable`
## [1] "contr.treatment"

Note the presence of the Intercept column, which has a value of 1 for every row in our matrix.

These automatically-generated columns correspond to the parameters in the regression output:

summary(fit)
## 
## Call:
## lm(formula = `Response Variable` ~ `Categorical Variable`, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8735 -0.4155 -0.0004  0.2252  1.1856 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)             -0.38995    0.75362  -0.517    0.623
## `Categorical Variable`B  0.13970    0.92299   0.151    0.885
## `Categorical Variable`C  1.03921    0.87020   1.194    0.277
## `Categorical Variable`D  0.06177    0.84257   0.073    0.944
## 
## Residual standard error: 0.7536 on 6 degrees of freedom
## Multiple R-squared:  0.3659, Adjusted R-squared:  0.04883 
## F-statistic: 1.154 on 3 and 6 DF,  p-value: 0.4013

Interpreting this output is straight-forward now that we understand what those parameters are referring to: 'Categorical Variable'B is the difference between the average response for Level A and the average response for Level B. These parameters are differences.

WhyOne reason reference levels matter

One concern folks have when fitting models is selecting the best set of predictors. There are lots of ways to do this, but one of the simplest is to fit the full model, see which parameter estimates are significant, and then re-fit the model retaining on the statistically significant results. (Note that this is generally a bad way to do model selection, but I’m not focusing on that today. Here are some things to read about model selection for linear regression.) If you take this approach, your choice of reference level can actually effect the model you choose!

Suppose we want to fit a regression model to these data to estimate operating costs of aircraft based on their type and the rate at which they consume fuel:

## # A tibble: 20 x 3
##    `Aircraft Type` `Fuel Burn` `Operating Cost`
##    <chr>                 <dbl>            <dbl>
##  1 Fighter                  14             1194
##  2 Fighter                  42             1594
##  3 Fighter                  23             1189
##  4 Fighter                  23             1196
##  5 Fighter                  24             1093
##  6 Fighter                  44             1641
##  7 Helicopter               26              993
##  8 Bomber                   49             1515
##  9 Bomber                   28             1523
## 10 Fighter                  14             1050
## 11 Spy Plane                24             1447
## 12 Spy Plane                10             1155
## 13 Spy Plane                21             1275
## 14 Fighter                  44             1448
## 15 Helicopter               49             1397
## 16 Fighter                  47             1490
## 17 Bomber                   17             1155
## 18 Helicopter               29             1062
## 19 Bomber                   21             1097
## 20 Fighter                  30             1224

We fit a simple regression and check the results:

fit <- lm(`Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, tb)

summary(fit)
## 
## Call:
## lm(formula = `Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, 
##     data = tb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -123.10  -63.42  -13.44   44.10  211.55 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                898.769     77.396  11.613 6.76e-09 ***
## `Aircraft Type`Fighter     -36.392     60.186  -0.605  0.55444    
## `Aircraft Type`Helicopter -259.036     78.489  -3.300  0.00486 ** 
## `Aircraft Type`Spy Plane   123.359     80.399   1.534  0.14577    
## `Fuel Burn`                 14.738      2.032   7.254 2.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101.6 on 15 degrees of freedom
## Multiple R-squared:  0.7977, Adjusted R-squared:  0.7437 
## F-statistic: 14.78 on 4 and 15 DF,  p-value: 4.362e-05

The only dummy that’s significant is the Helicopter one, so we discard the other two. (Yay, model sparsity!)

tb_simplified <- tb %>% 
  mutate(`Aircraft Type` = ifelse(`Aircraft Type` == "Helicopter", "Helicopter", "Other"))

fit_simplified <- lm(`Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, tb_simplified)
summary(fit_simplified)
## 
## Call:
## lm(formula = `Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, 
##     data = tb_simplified)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -166.07  -73.39  -28.85   69.94  211.28 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           694.432     95.829   7.247 1.37e-06 ***
## `Aircraft Type`Other  248.786     70.407   3.534  0.00255 ** 
## `Fuel Burn`            13.161      2.067   6.367 7.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 110.2 on 17 degrees of freedom
## Multiple R-squared:  0.7299, Adjusted R-squared:  0.6981 
## F-statistic: 22.97 on 2 and 17 DF,  p-value: 1.473e-05

(One way to interpret this model is that there is no difference between Fighters, Bombers, and Spy Planes in terms of their operating costs. This is potentially an interesting result, but note that determining this wasn’t our goal. Instead, it just sorta happened accidentally. It’s also not necessarily correct, since we never actually compared Spy Planes with Fighters!)

But wait. We just let R arbitrarily choose a reference level. (In case you were wondering, unless the categorical variable is specifically coded as a factor, R assigns the first level alphabetically to be the reference level.) What if instead, we decided to use Helicopter as the reference level?

tb2 <- tb %>% 
  mutate(`Aircraft Type` = relevel(factor(`Aircraft Type`), ref = "Helicopter"))

fit2 <- lm(`Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, tb2)

summary(fit2)
## 
## Call:
## lm(formula = `Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, 
##     data = tb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -123.10  -63.42  -13.44   44.10  211.55 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               639.733     91.642   6.981 4.42e-06 ***
## `Aircraft Type`Bomber     259.036     78.489   3.300 0.004856 ** 
## `Aircraft Type`Fighter    222.644     67.385   3.304 0.004819 ** 
## `Aircraft Type`Spy Plane  382.395     89.312   4.282 0.000656 ***
## `Fuel Burn`                14.738      2.032   7.254 2.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101.6 on 15 degrees of freedom
## Multiple R-squared:  0.7977, Adjusted R-squared:  0.7437 
## F-statistic: 14.78 on 4 and 15 DF,  p-value: 4.362e-05

Now all of the parameters are significant! If we go this route, our final model will keep all of the groups separate. These models are not equivalent. They have different parameters, and they produce different predictions:

## # A tibble: 20 x 4
##    `Aircraft Type` `Reduced Type` `Fit 1 Predictions` `Fit 2 Predictions`
##    <fct>           <chr>                        <dbl>               <dbl>
##  1 Helicopter      Helicopter                   1037.               1023.
##  2 Helicopter      Helicopter                   1339.               1362.
##  3 Helicopter      Helicopter                   1076.               1067.
##  4 Bomber          Other                        1588.               1621.
##  5 Bomber          Other                        1312.               1311.
##  6 Bomber          Other                        1167.               1149.
##  7 Bomber          Other                        1220.               1208.
##  8 Fighter         Other                        1127.               1069.
##  9 Fighter         Other                        1496.               1481.
## 10 Fighter         Other                        1246.               1201.
## 11 Fighter         Other                        1246.               1201.
## 12 Fighter         Other                        1259.               1216.
## 13 Fighter         Other                        1522.               1511.
## 14 Fighter         Other                        1127.               1069.
## 15 Fighter         Other                        1522.               1511.
## 16 Fighter         Other                        1562.               1555.
## 17 Fighter         Other                        1338.               1305.
## 18 Spy Plane       Other                        1259.               1376.
## 19 Spy Plane       Other                        1075.               1170.
## 20 Spy Plane       Other                        1220.               1332.

But do reference levels really matter?

Sure, they seem to in the example above, but that’s because of our flawed model selection approach. A better method wouldn’t have had these issues and would return the same model regardless of our reference level.

The reason reference levels matter is because if you don’t think about them and don’t understand what’s going on under the hood, you may not realize when you’re taking a flawed model selection approach. Knowing what choices you software is making for you and how those choices show up in your output can help you recognize potential problems before they show up in your analysis.

Making good choices of reference levels can make your work more efficient. Suppose you actually wanted to compare Helicopter operating costs to those of different types of fixed-wing aircraft. If you used R’s default encoding, you’d have to figure out some weird contrast or something to make these comparisons or default to t-tests. But if you understand the importance of reference levels, you’ll be able to get these tests as a consequence of the regression you were already running.