Quick notes on binomial confidence intervals in R
I’m teaching a graduate-level intro stats course right now, and one thing that struck me as we move from calculating things “by hand” to doing things in R is that there’s no real reason to emphasize the normal approximation binomial confidence interval once you’re using software. Or at least far less reason.
The normal approximation
This is the basic interval that’s been taught in introductory statistics courses since time immemorial. Or at least for the past few decades; I’d have to know the history of stats education to give the real timeframe. Anyway, this confidence interval uses a consequence of the Central Limit Theorem: as \(n \rightarrow \infty\), the sampling distribution of \(\hat\pi = x/n\) closely resembles a Normal distribution.
Based on that, you get the equation:
\[\hat\pi \pm z_{\frac{\alpha}{2}} \sqrt{\frac{\hat\pi (1 - \hat\pi)}{n}}\]
Analog CI
We can build this CI in R pretty easily by inputting the values for the sample size, \(n\), and the number of “successes” or “1”s from our binary response variable. One example from class discusses a poll of 2500 people with 400 responding “Satisfactory”. For a 90% confidence interval, we have:
n <- 2500
x <- 400
pihat <- x/n
alpha <- 0.1 # 90% CI --> alpha = 1 - .9 = 0.1
lower_bound <- pihat + qnorm(alpha/2) * sqrt((pihat * (1 - pihat)/n))
upper_bound <- pihat + qnorm(1 - alpha/2) * sqrt((pihat * (1 - pihat)/n))
c(lower_bound, upper_bound)
## [1] 0.1479397 0.1720603
Easy mode
But it’s much easier to just use the binom package, which contains the function binom.confint():
# install.packages("binom")
library(binom)
binom.confint(x = 400, n = 2500, conf.level = 0.9, method = "asymptotic")
##       method   x    n mean     lower     upper
## 1 asymptotic 400 2500 0.16 0.1479397 0.1720603
Much easier! But now that we’re using binom.confint(), we discover that we have to specify method = "asymptotic". That implies there are alternatives! And indeed, if we just remove that argument, we see that there are almost a DOZEN different methods that binom.confint() will compute for you!
Other types of binomial confidence intervals
First off, most of these aren’t useful in most cases. They’re in there because (1) they’re not very hard to program, so the authors figured, “Why not?” and (2) in most cases, there is at least one circumstance where each one is the best option. (Or they’re included for historical reasons.)
Exact CIs, aka Clopper-Pearson
For one simple example, recall the rule of thumb that we always have to check before using our Normal approximation method: \(n \hat\pi > 5\) and \(n (1 - \hat\pi) > 5\). It means we can’t trust the Normal approximation CI for small-ish samples, or when \(\hat\pi\) is very close to 0 or 1. But other methods don’t have this problem!
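To make the rule of thumb concrete, here is a minimal base R sketch. The helper name normal_approx_ok() and the small sample of 2 successes out of 20 are hypothetical, made up purely for illustration; the threshold of 5 is the one quoted above.

```r
# Hypothetical helper: checks n * pihat > 5 and n * (1 - pihat) > 5
normal_approx_ok <- function(x, n, threshold = 5) {
  pihat <- x / n
  (n * pihat > threshold) && (n * (1 - pihat) > threshold)
}

normal_approx_ok(x = 400, n = 2500)  # the poll from class: 400 and 2100 both exceed 5
normal_approx_ok(x = 2, n = 20)      # made-up small sample: n * pihat is only 2

# Normal approximation 90% interval for the made-up small sample
n_small <- 20
x_small <- 2
pihat_small <- x_small / n_small
alpha <- 0.1
wald_small <- pihat_small + qnorm(c(alpha / 2, 1 - alpha / 2)) *
  sqrt(pihat_small * (1 - pihat_small) / n_small)
wald_small  # the lower bound falls below 0, which is impossible for a proportion
```

The negative lower bound is the kind of failure the rule of thumb is meant to warn about.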
method = "exact" uses what’s called the Clopper-Pearson method, which uses the Binomial distribution to calculate an “exact” confidence interval rather than rely on an approximation.
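As a rough illustration of what “using the Binomial distribution” means here, the Clopper-Pearson bounds can be written as quantiles of Beta distributions. This is a base R sketch of the textbook formula, not a copy of what binom.confint() does internally, and the variable names are my own. Base R’s binom.test() reports the same kind of interval, so it works as a cross-check.

```r
n <- 2500
x <- 400
alpha <- 0.1  # 90% CI

# Clopper-Pearson bounds as Beta quantiles (valid as written for 0 < x < n)
cp_by_hand <- c(
  qbeta(alpha / 2, x, n - x + 1),
  qbeta(1 - alpha / 2, x + 1, n - x)
)
cp_by_hand

# Cross-check against base R's exact binomial test
cp_binom_test <- as.numeric(binom.test(x, n, conf.level = 1 - alpha)$conf.int)
cp_binom_test
```

Both vectors should agree with each other, and with the exact row of the full binom.confint() output further down.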
While being “exact” sounds better than “approximate”, the truth of the matter is that the Clopper-Pearson interval is generally wider than it needs to be, meaning you get a less precise interval:
library(dplyr)
binom.confint(x = 400, n = 2500, conf.level = 0.9) %>%
  mutate(`CI Width` = upper - lower) %>%
  select(method, lower, upper, `CI Width`) %>%
  arrange(`CI Width`)
##           method     lower     upper   CI Width
## 1          bayes 0.1480550 0.1721635 0.02410856
## 2        cloglog 0.1481500 0.1722628 0.02411279
## 3        profile 0.1481871 0.1723036 0.02411651
## 4         wilson 0.1483082 0.1724269 0.02411870
## 5         probit 0.1482369 0.1723573 0.02412042
## 6     asymptotic 0.1479397 0.1720603 0.02412053
## 7          logit 0.1483044 0.1724312 0.02412679
## 8  agresti-coull 0.1483026 0.1724325 0.02412988
## 9            lrt 0.1481877 0.1723265 0.02413880
## 10         exact 0.1480388 0.1725544 0.02451559
## 11     prop.test 0.1459601 0.1750977 0.02913765
Since we have a large sample, the differences aren’t very large, but there are times when you want every ounce of precision you can get!
Bayesian intervals
Bayesian statistics is a school of thought that says we should try to incorporate our prior knowledge about a problem when making a decision instead of letting the data stand on its own. I don’t want to get into why some folks prefer Bayesian intervals, but if you want one, just specify method = "bayes" to get a Bayesian CI.
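For a sense of what “incorporating prior knowledge” looks like for a binomial proportion, here is a small base R sketch. With a Beta prior, the posterior for the proportion is also a Beta distribution, so an interval can be read off its quantiles. The Jeffreys prior Beta(0.5, 0.5) and the equal-tailed interval used here are my assumptions for illustration; binom.confint() may use a different prior or interval type by default, so don’t expect the numbers to match its bayes row exactly.

```r
n <- 2500
x <- 400
alpha <- 0.1  # 90% interval

# Assumed prior: Beta(0.5, 0.5), the Jeffreys prior
prior_shape1 <- 0.5
prior_shape2 <- 0.5

# Beta prior + binomial data gives a Beta posterior
post_shape1 <- prior_shape1 + x
post_shape2 <- prior_shape2 + (n - x)

# Equal-tailed 90% credible interval from the posterior quantiles
bayes_by_hand <- qbeta(c(alpha / 2, 1 - alpha / 2), post_shape1, post_shape2)
bayes_by_hand
```

Changing prior_shape1 and prior_shape2 is how stronger prior beliefs would enter; with 2500 observations, a weak prior like this one barely moves the interval.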
A good general-use CI
My go-to for a simple binomial confidence interval is the Agresti-Coull method, method = "agresti-coull". It’s one of the weirder ones (seriously, go look at the equation for it!), but it generally performs as well as or better than the competition across most scenarios. It’s more precise than method = "exact", doesn’t fail in small samples like method = "asymptotic", and doesn’t rely on a Bayesian approach.
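For anyone who does go look at the equation, here is a base R sketch of the Agresti-Coull interval as it’s usually written: add \(z^2\) pseudo-observations, half of them successes, then apply the Normal approximation formula to the adjusted values. The variable names (n_tilde, pi_tilde) are my own, and this follows the textbook formula rather than the package’s source code.

```r
n <- 2500
x <- 400
alpha <- 0.1  # 90% CI
z <- qnorm(1 - alpha / 2)

# Adjusted sample size and adjusted proportion
n_tilde <- n + z^2
pi_tilde <- (x + z^2 / 2) / n_tilde

# Normal approximation formula applied to the adjusted values
ac_by_hand <- pi_tilde + c(-1, 1) * z * sqrt(pi_tilde * (1 - pi_tilde) / n_tilde)
ac_by_hand
```

For the poll data this should land on the agresti-coull row of the table above, roughly 0.1483 to 0.1724.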