Reference levels matter
When doing a regression analysis with categorical variables, which level is used as the reference level can be important. This is underappreciated, since most non-major classes on regression (or more precisely, regression classes that don’t show you the underlying matrix algebra) don’t talk about it. Software mostly hides this as well unless users want to dive deep into the options. Failing to consider your choice of reference level and how that choice can effect your analysis can lead you to erroneous (or at least dubious) conclusions.
What are reference levels?
Before we get into why reference levels matter, let’s take a step back and talk about what reference levels are. In regression analyses involving categorical variables, these variables are represented numerically by unique combinations of 1’s and 0’s. For a two-level categorical variable, 1’s are arbitrarily assigned to represent one level of the variable and 0’s another in a single column:
## # A tibble: 10 x 2
## `Categorical Variable` `Coding of Categorical Variable`
## <chr> <dbl>
## 1 A 0
## 2 A 0
## 3 B 1
## 4 A 0
## 5 A 0
## 6 A 0
## 7 A 0
## 8 A 0
## 9 A 0
## 10 B 1
Categorical variables with more than two levels are represented by adding an extra column for level beyond two. Thus, a categorical variable with \(p\) levels is represented by \(p-1\) columns. Each column will take the value \(1\) if the categorical variable is a specific number and \(0\) otherwise:
## # A tibble: 10 x 4
## `Categorical Variable` `Variable is A` `Variable is B` `Variable is C`
## <chr> <dbl> <dbl> <dbl>
## 1 A 0 0 0
## 2 B 1 0 0
## 3 B 1 0 0
## 4 C 0 1 0
## 5 C 0 1 0
## 6 C 0 1 0
## 7 D 0 0 1
## 8 D 0 0 1
## 9 D 0 0 1
## 10 D 0 0 1
In this case, “D” is the reference level, since if all of the columns take the value 0, we know the level of our categorical variable is “D”.
(The type of factor coding I described above is known as “dummy coding” or “effect coding”. (Or if you’re in Machine Learning, “one-hot encoding”, which is a very silly name.) You can read about alternative approaches to variable coding here. Depending on what you actually want to do, it can often be beneficial to encode your categorical factors in a different way.)
Reference Levels in Regression
Now suppose I want to fit a regression model using these factors. I don’t have to manually generate the columns of 1’s and 0’s; that happens automatically in software. For example, I can use lm()
to fit a simple model in R:
fit <- lm(`Response Variable` ~ `Categorical Variable`, data = mydata)
When running this code, R will call model.matrix()
to generate a matrix that looks a lot like what we saw above:
model.matrix(~`Categorical Variable`, data = mydata)
## (Intercept) `Categorical Variable`B `Categorical Variable`C
## 1 1 0 0
## 2 1 1 0
## 3 1 1 0
## 4 1 0 1
## 5 1 0 1
## 6 1 0 1
## 7 1 0 0
## 8 1 0 0
## 9 1 0 0
## 10 1 0 0
## `Categorical Variable`D
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 1
## 8 1
## 9 1
## 10 1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`Categorical Variable`
## [1] "contr.treatment"
Note the presence of the Intercept column, which has a value of 1
for every row in our matrix.
These automatically-generated columns correspond to the parameters in the regression output:
summary(fit)
##
## Call:
## lm(formula = `Response Variable` ~ `Categorical Variable`, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8735 -0.4155 -0.0004 0.2252 1.1856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.38995 0.75362 -0.517 0.623
## `Categorical Variable`B 0.13970 0.92299 0.151 0.885
## `Categorical Variable`C 1.03921 0.87020 1.194 0.277
## `Categorical Variable`D 0.06177 0.84257 0.073 0.944
##
## Residual standard error: 0.7536 on 6 degrees of freedom
## Multiple R-squared: 0.3659, Adjusted R-squared: 0.04883
## F-statistic: 1.154 on 3 and 6 DF, p-value: 0.4013
Interpreting this output is straight-forward now that we understand what those parameters are referring to: 'Categorical Variable'B
is the difference between the average response for Level A and the average response for Level B. These parameters are differences.
WhyOne reason reference levels matter
One concern folks have when fitting models is selecting the best set of predictors. There are lots of ways to do this, but one of the simplest is to fit the full model, see which parameter estimates are significant, and then re-fit the model retaining on the statistically significant results. (Note that this is generally a bad way to do model selection, but I’m not focusing on that today. Here are some things to read about model selection for linear regression.) If you take this approach, your choice of reference level can actually effect the model you choose!
Suppose we want to fit a regression model to these data to estimate operating costs of aircraft based on their type and the rate at which they consume fuel:
## # A tibble: 20 x 3
## `Aircraft Type` `Fuel Burn` `Operating Cost`
## <chr> <dbl> <dbl>
## 1 Fighter 14 1194
## 2 Fighter 42 1594
## 3 Fighter 23 1189
## 4 Fighter 23 1196
## 5 Fighter 24 1093
## 6 Fighter 44 1641
## 7 Helicopter 26 993
## 8 Bomber 49 1515
## 9 Bomber 28 1523
## 10 Fighter 14 1050
## 11 Spy Plane 24 1447
## 12 Spy Plane 10 1155
## 13 Spy Plane 21 1275
## 14 Fighter 44 1448
## 15 Helicopter 49 1397
## 16 Fighter 47 1490
## 17 Bomber 17 1155
## 18 Helicopter 29 1062
## 19 Bomber 21 1097
## 20 Fighter 30 1224
We fit a simple regression and check the results:
fit <- lm(`Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, tb)
summary(fit)
##
## Call:
## lm(formula = `Operating Cost` ~ `Aircraft Type` + `Fuel Burn`,
## data = tb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.10 -63.42 -13.44 44.10 211.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 898.769 77.396 11.613 6.76e-09 ***
## `Aircraft Type`Fighter -36.392 60.186 -0.605 0.55444
## `Aircraft Type`Helicopter -259.036 78.489 -3.300 0.00486 **
## `Aircraft Type`Spy Plane 123.359 80.399 1.534 0.14577
## `Fuel Burn` 14.738 2.032 7.254 2.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101.6 on 15 degrees of freedom
## Multiple R-squared: 0.7977, Adjusted R-squared: 0.7437
## F-statistic: 14.78 on 4 and 15 DF, p-value: 4.362e-05
The only dummy that’s significant is the Helicopter one, so we discard the other two. (Yay, model sparsity!)
tb_simplified <- tb %>%
mutate(`Aircraft Type` = ifelse(`Aircraft Type` == "Helicopter", "Helicopter", "Other"))
fit_simplified <- lm(`Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, tb_simplified)
summary(fit_simplified)
##
## Call:
## lm(formula = `Operating Cost` ~ `Aircraft Type` + `Fuel Burn`,
## data = tb_simplified)
##
## Residuals:
## Min 1Q Median 3Q Max
## -166.07 -73.39 -28.85 69.94 211.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 694.432 95.829 7.247 1.37e-06 ***
## `Aircraft Type`Other 248.786 70.407 3.534 0.00255 **
## `Fuel Burn` 13.161 2.067 6.367 7.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 110.2 on 17 degrees of freedom
## Multiple R-squared: 0.7299, Adjusted R-squared: 0.6981
## F-statistic: 22.97 on 2 and 17 DF, p-value: 1.473e-05
(One way to interpret this model is that there is no difference between Fighters, Bombers, and Spy Planes in terms of their operating costs. This is potentially an interesting result, but note that determining this wasn’t our goal. Instead, it just sorta happened accidentally. It’s also not necessarily correct, since we never actually compared Spy Planes with Fighters!)
But wait. We just let R arbitrarily choose a reference level. (In case you were wondering, unless the categorical variable is specifically coded as a factor, R assigns the first level alphabetically to be the reference level.) What if instead, we decided to use Helicopter
as the reference level?
tb2 <- tb %>%
mutate(`Aircraft Type` = relevel(factor(`Aircraft Type`), ref = "Helicopter"))
fit2 <- lm(`Operating Cost` ~ `Aircraft Type` + `Fuel Burn`, tb2)
summary(fit2)
##
## Call:
## lm(formula = `Operating Cost` ~ `Aircraft Type` + `Fuel Burn`,
## data = tb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.10 -63.42 -13.44 44.10 211.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 639.733 91.642 6.981 4.42e-06 ***
## `Aircraft Type`Bomber 259.036 78.489 3.300 0.004856 **
## `Aircraft Type`Fighter 222.644 67.385 3.304 0.004819 **
## `Aircraft Type`Spy Plane 382.395 89.312 4.282 0.000656 ***
## `Fuel Burn` 14.738 2.032 7.254 2.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101.6 on 15 degrees of freedom
## Multiple R-squared: 0.7977, Adjusted R-squared: 0.7437
## F-statistic: 14.78 on 4 and 15 DF, p-value: 4.362e-05
Now all of the parameters are significant! If we go this route, our final model will keep all of the groups separate. These models are not equivalent. They have different parameters, and they produce different predictions:
## # A tibble: 20 x 4
## `Aircraft Type` `Reduced Type` `Fit 1 Predictions` `Fit 2 Predictions`
## <fct> <chr> <dbl> <dbl>
## 1 Helicopter Helicopter 1037. 1023.
## 2 Helicopter Helicopter 1339. 1362.
## 3 Helicopter Helicopter 1076. 1067.
## 4 Bomber Other 1588. 1621.
## 5 Bomber Other 1312. 1311.
## 6 Bomber Other 1167. 1149.
## 7 Bomber Other 1220. 1208.
## 8 Fighter Other 1127. 1069.
## 9 Fighter Other 1496. 1481.
## 10 Fighter Other 1246. 1201.
## 11 Fighter Other 1246. 1201.
## 12 Fighter Other 1259. 1216.
## 13 Fighter Other 1522. 1511.
## 14 Fighter Other 1127. 1069.
## 15 Fighter Other 1522. 1511.
## 16 Fighter Other 1562. 1555.
## 17 Fighter Other 1338. 1305.
## 18 Spy Plane Other 1259. 1376.
## 19 Spy Plane Other 1075. 1170.
## 20 Spy Plane Other 1220. 1332.
But do reference levels really matter?
Sure, they seem to in the example above, but that’s because of our flawed model selection approach. A better method wouldn’t have had these issues and would return the same model regardless of our reference level.
The reason reference levels matter is because if you don’t think about them and don’t understand what’s going on under the hood, you may not realize when you’re taking a flawed model selection approach. Knowing what choices you software is making for you and how those choices show up in your output can help you recognize potential problems before they show up in your analysis.
Making good choices of reference levels can make your work more efficient. Suppose you actually wanted to compare Helicopter operating costs to those of different types of fixed-wing aircraft. If you used R’s default encoding, you’d have to figure out some weird contrast or something to make these comparisons or default to t-tests. But if you understand the importance of reference levels, you’ll be able to get these tests as a consequence of the regression you were already running.