When doing a regression analysis with categorical variables, which level is used as the reference level can be important. This is underappreciated, since most non-major classes on regression (or more precisely, regression classes that don’t show you the underlying matrix algebra) don’t talk about it. Software mostly hides this as well unless users want to dive deep into the options. Failing to consider your choice of reference level and how that choice can effect your analysis can lead you to erroneous (or at least dubious) conclusions.

I’m teaching a graduate-level intro stats course right now, and one thing that struck me as we move from calculating things “by hand” to doing things in R is that there’s no real reason to emphasize the normal approximation binomail confidence interval once you’re using software. Or at least far less reason.
The normal approximation This is the basic interval they’ve taught in introductory statistics courses since time immamorial. Or at least the past few decades, I’d have to know the history of Stats Ed to give the real timeframe.

library(tidyverse) library(binom) Someone had a relatively straight-forward question: They had sets of binary outcomes for different response variables, and wanted to display them all in a simple way that highlighted both the probability of success and amount of data they had for each observation. There are more than a few ways to do it, and it can be hard to determine which is best without seeing them, so let’s look at a few examples and see which we like!

Both Economics and Statistics share a peculiar failure mode: Many critical results in both rely on “large sample”/“long run average” proofs.
The Central Limit Theorem is fundamental to much of classical statitics, including most (if not all) of the fundamental approaches that people are exposed to in their first few courses. The Efficient Market Hypothesis underpins much of the economic theory on which Western economies are based. Both are powerful tools for explaining common phenomena and often make complex problems simpler to understand and model.

So we’ll call that break a “summer hiatus”.
But now we’re back, and coming recently from the Joint Statistical Meetings (2019) in Denver, I’ve got Thoughts.
This year’s JSM was different for me, because I spent most of my time on recruitment, speaking with potential applicants during many of the sessions. As a result, I attended many fewer talks that I normally do. By happenstance, the topic of the p-value came up repeatedly in the talks I was able to attend.

As a newly-minted PhD Statistician, I was hired by a company that didn’t have a lot of native statistical expertise because they wanted to change that. As a result, I felt empowered to give lots of opinions on topics within my domain to anyone who happened to be in the room, including the head of the division. One of those opinions was that pie charts were the worst.
I viewed pie charts as the scarlet letter of bad analysis: Having one in your analysis should get you shamed and shunned.

This is an update to my Analysis Philosphy page, which is still working towards completion
Nonlinearity is a commonly-misunderstood problem when it comes to data analysis, mostly because our profession has once again managed to find a way to use a simple-sounding term in a way that’s counterintuitive to lay audiences. (See also Artificial Intelligence is Dumb.) When people think about nonlinear response variables, they think of functions that have non-linear relationships.

Previously on DIY Metircs… Last time in the DIY Metrics series, we had reached the point where we could extract a host of individual metrics from our data set using a function we’d named add_simple_stat_indicators:
add_simple_stat_indicators <- function(tb){ tb %>% mutate( gotblk = (description == "BLOCK"), gotstl = (description == "STEAL"), gotast = (description == "ASSIST"), gotreb = map_lgl(description, str_detect, "REBOUND"), tfoulu = map_lgl(description, str_detect, "T.FOUL"), tfoull = map_lgl(description, str_detect, "T.

With the recent success of the Rockets, people are trotting out that old saw about analytics nerds ruining sports. With the Houston Rockets specifically, the question is a combined referendum on the numbers-based approach of GM Daryl Morey and the foul-drawing proclivities of Houston’s two stars, James Harden and Chris Paul. Of course, the latter is linked with the former, since analytics shows us that drawing shooting fouls is extremely efficient offense.

Disclaimer: This post is at least tongue-half-way-in-cheek. I acutally like the article I’m lampooning.
A recent publication by academics and AI researchers titled “Data Sheets for Datasets” calls for the Machine Learning community to ensure that all of their datasets are accompanied by a “datasheet.” These datasheets would contain information the dataset’s “motivation, composition, collection process, recommended uses, and so on.” The authors, Gebru, et al., would you like to include more data about your dataset.