Using ordinal regression for signal detection theory

My lab from graduate school, which once studied cooperation, has pivoted to focusing on misinformation and fake news. Although I’ve taken only a remote interest in this topic, there is a particular theoretical discussion in the field that recently caught my attention because it bears more generally on statistics and signal detection theory. Without going into detail, the question is how we should think about an intervention that purports to reduce the sharing of fake news. As a recent article from Brian Guay and colleagues argues, some interventions that reduce the sharing of fake news also reduce the sharing of true news to the same degree, and hence do not seem to actually increase “discernment” of news quality. Moreover, discernment can be conceptualized in both additive and multiplicative terms, so an intervention can increase one kind of discernment while decreasing the other kind.

Within this discussion, however, it seems to have been taken for granted that you can treat a Likert rating for measures like intention to share a news story or belief that the news story is true as a cardinal value. Effectively, such approaches assume that, e.g., a “2” on a five-point scale can be represented as 25% belief in truth or probability of sharing. But this assumption is notoriously problematic. Instead, cumulative ordinal models, such as the ordinal probit model, are recommended. However, ordinal models are more challenging to interpret. What do the scale points mean if they can’t be directly mapped to probabilities of sharing or truth? And how can we think of discernment? My goal here is to provide a brief explanation of how these models can be used for this purpose and show how their deep connections to signal detection theory provide a natural measure of discernment1, known as d’.

Dividing up latent space

Suppose we asked people to rate the attractiveness of some faces on a five-point Likert scale ranging from “very unattractive” to “very attractive.” Although participants are given only five options, we might think that their true impression of a face’s attractiveness approximates something more continuous. Thus, when we model these ordinal judgments, we don’t want to take the values of 1, 2, 3, 4, or 5 as exact; instead, we want to think of them as pointers to intervals in some smooth latent attractiveness space. In other words, two faces that are given the same rating could still vary in their underlying attractiveness to the rater because the rater doesn’t have the ability to give more precise ratings.

Unfortunately, how people choose to divide up this latent space is not obvious. For example, some people might interpret the scale options as quintiles; i.e., they’d aim to place a roughly equal number of faces in each bin. Others might interpret the bins in terms of standard deviations, in which case the extreme bins (“1” or “5”) would be less common than the middle ones (“2” through “4”). In fact, the bins need not even be symmetric about the average: people could tend to think an ‘average’ face is a “2” or a “4” rather than a “3.”

Ordinal regression models assume that there is some continuous and unbounded latent space that is divided into k ordered intervals, where k is the number of possible response options (five in our example). To do this, the model must identify k - 1 cutpoints separating adjacent intervals. Any latent value that falls below the first cutpoint will be classified as the lowest category (“very unattractive”); values below the second cutpoint and above the first will be classified as the second lowest category; and so on.

To return to our example, let’s assume — as is customary in many statistical models — that people’s true latent impressions of facial attractiveness follow a bell curve.2 The mean and variance of this curve are often arbitrary, in which case it is mathematically convenient to center the distribution at zero and assume unit variance (i.e., use a “standard normal” distribution).3 The cutpoints could be distributed in a variety of ways, with three possible examples shown below.
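A quick note on setup: the code in this post assumes that the tidyverse (which supplies tibble, dplyr, tidyr, stringr, and ggplot2) and brms have been loaded.

library(tidyverse)
library(brms)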

cutpoints <- tibble(
  sds = (-2:2)[-3],
  sds_biased = sds - 0.5,
  qtile = qnorm(seq(.2, .8, .2)),
) |> 
  pivot_longer(
    everything(), 
    names_to = "type"
  )

ggplot() + 
  stat_function(
    fun = dnorm, xlim = c(-6, 6), 
    linewidth = 1, alpha = 0.7
  ) +
  geom_vline(
    aes(xintercept = value, color = type),
    data = cutpoints,
    linewidth = 1, linetype = "dashed", alpha = 0.7
  ) +
  labs(
    x = "Latent attractiveness",
    y = NULL,
    color = "Thresholds:"
  ) + 
  coord_cartesian(xlim = c(-4, 4)) +
  scale_x_continuous(breaks = -5:5) +
  scale_y_continuous(breaks = NULL) +
  scale_color_manual(
    values = c("darkblue", "green4", "green2"),
    labels = c("Quantiles", "SDs", "SDs with Positive Bias")
  ) +
  guides(
    color = guide_legend(override.aes = list(linetype = "solid"))
  ) +
  theme(legend.position = "top")

Inferring the cutpoints

Based on a participant’s Likert judgments, we’d first like to be able to infer the location of this participant’s latent cutpoints. To do so, we can fit an “intercept-only” ordinal regression model, where an “intercept” is just another word for “cutpoint.” Since there are four cutpoints in our example, our model will return four intercepts in increasing order.

To simulate this, we start by sampling 1000 latent facial attractiveness values from a standard normal distribution. Then, to convert these latent values to hypothetical Likert responses, we’ll bin them according to one of the schemes shown in the figure above and see whether the model can approximately recover the correct cutpoints. We’ll use the Bayesian regression package brms, which makes it easy to run ordinal models4, but this could be done in a frequentist way, as well.

Here are the first ten latent observations (z), along with the three types of Likert responses.

set.seed(883)
n <- 1000

df_faces <- tibble(
  z = rnorm(n),
  y_qtile = findInterval(z, qnorm(seq(.2, .8, .2))) + 1,
  y_sd = findInterval(z, (-2:2)[-3]) + 1,
  y_sd_biased = findInterval(z, (-2:2)[-3] - 0.5) + 1
) |> 
  mutate(across(starts_with("y_"), ~ as.ordered(.x)))

head(df_faces, 10)
## # A tibble: 10 × 4
##          z y_qtile y_sd  y_sd_biased
##      <dbl> <ord>   <ord> <ord>      
##  1  0.894  5       3     4          
##  2 -0.408  2       3     3          
##  3 -0.646  2       3     3          
##  4 -1.40   1       2     3          
##  5 -1.05   1       2     3          
##  6 -0.133  3       3     3          
##  7  0.0641 3       3     3          
##  8  0.488  4       3     3          
##  9 -1.95   1       2     2          
## 10 -0.585  2       3     3

Now let’s fit the intercept-only models. We will fit a cumulative ordinal probit model by specifying cumulative(link = "probit") as the family. Note that I override brms’s default prior for the intercepts to favor equal bin sizes; this detail isn’t essential, so don’t worry if you aren’t familiar with Bayesian methods. Here are the results of the quintile model.

m1_qtile <- brm(
  formula = y_qtile ~ 1, data = df_faces,
  family = cumulative(link = "probit"), 
  prior = prior(normal(0, 1), class = Intercept),
  file = "Models/m1_qtile"
)

fixef(m1_qtile)
##                Estimate  Est.Error       Q2.5      Q97.5
## Intercept[1] -0.8619719 0.04554015 -0.9508564 -0.7691847
## Intercept[2] -0.2339124 0.04029856 -0.3140439 -0.1555647
## Intercept[3]  0.2303742 0.04039558  0.1496655  0.3100270
## Intercept[4]  0.8576472 0.04544294  0.7702524  0.9464146

The cutpoints in our latent space closely correspond to the quintiles of the standard normal distribution (the 20th percentile, 40th percentile, and so on), which you can confirm for yourself by running qnorm(seq(.2, .8, by = .2)) in R.
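Running that line confirms the match:

qnorm(seq(.2, .8, by = .2))
## [1] -0.8416212 -0.2533471  0.2533471  0.8416212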

How about the other two models? We define our “SD” model as setting cutpoints at +/- 2 and +/- 1 standard deviations. In our biased version, we subtract 0.5 from all the cutpoints to reflect a positive tendency to rate faces as attractive; i.e., the cutpoints range from -2.5 to 1.5. (Don’t get confused: lowering thresholds in latent space corresponds to higher values of our dependent variable!)

m1_sd <- brm(
  formula = y_sd ~ 1, data = df_faces,
  family = cumulative(link = "probit"), 
  prior = prior(normal(0, 1), class = Intercept),
  file = "Models/m1_sd"
)

fixef(m1_sd)
##                Estimate  Est.Error       Q2.5      Q97.5
## Intercept[1] -1.9876183 0.08638546 -2.1602289 -1.8224368
## Intercept[2] -1.0077497 0.04827131 -1.1026272 -0.9152075
## Intercept[3]  0.9980941 0.04872041  0.9007998  1.0928559
## Intercept[4]  1.9852740 0.08659853  1.8179879  2.1584526
m1_sd_biased <- brm(
  formula = y_sd_biased ~ 1, data = df_faces,
  family = cumulative(link = "probit"), 
  prior = prior(normal(0, 1), class = Intercept),
  file = "Models/m1_sd_biased"
)

fixef(m1_sd_biased)
##                Estimate  Est.Error       Q2.5      Q97.5
## Intercept[1] -2.6094299 0.15972071 -2.9390715 -2.3181616
## Intercept[2] -1.5601216 0.06238403 -1.6822920 -1.4401399
## Intercept[3]  0.4811322 0.04136991  0.4008233  0.5611668
## Intercept[4]  1.4753094 0.05885064  1.3628954  1.5926317

Again, the fits are remarkably close to the true thresholds.

Measuring discriminability

Hopefully by now you have an intuition for what these cumulative ordinal models are trying to do. The construct of interest is assumed to have a latent distribution (e.g., a standard normal for probit models), and the ordinal variable is assumed to be generated via thresholds in this latent space. Formally, if we denote \(\tau_1, \tau_2, \dots, \tau_{k-1}\) as the thresholds in latent space from smallest to largest, then our probit model assigns the following probabilities to each of the ordinal responses:

\[ \begin{aligned} P(y=1) &= \Phi(\tau_1) \\ P(y=2) &= \Phi(\tau_2)-\Phi(\tau_1) \\ &\dots \\ P(y=k) &= 1-\Phi(\tau_{k-1}) \end{aligned} \]

where \(\Phi\) is the cumulative distribution function (CDF) of the standard normal distribution.5 The CDF gives the probability of observing a value less than or equal to its input, so \(\Phi(\tau_1)\) is the probability that the latent value falls below the first cutpoint, which is exactly the probability of responding in the lowest bin. To get the probability of the second-lowest bin, we need the probability of a latent value below the second threshold, \(\Phi(\tau_2)\), but not below the first, \(\Phi(\tau_1)\), so we subtract the latter from the former. The same logic yields the remaining probabilities. And since all of these probabilities must sum to one, the probability of responding in the highest bin, \(k\), is just the probability that the latent value does not fall at or below the highest cutpoint, i.e., \(1-\Phi(\tau_{k-1})\). (Alternatively, imagine a yet higher cutpoint at a value that approaches infinity, whose CDF approaches 1 in the limit.)
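As a quick sanity check on these formulas, and a small addition to the analysis above, we can push the estimated cutpoints from m1_qtile through pnorm (R’s \(\Phi\)) and compare the implied category probabilities with the observed response proportions; both should sit near 20% per bin.

tau_hat <- fixef(m1_qtile)[, "Estimate"]

# Phi(tau_i) - Phi(tau_{i-1}), with implicit cutpoints at -Inf and +Inf
p_implied <- diff(c(0, pnorm(tau_hat), 1))
round(p_implied, 3)

round(prop.table(table(df_faces$y_qtile)), 3)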

In short, the cumulative ordinal model gives us a procedure for inferring a set of thresholds in a latent space of interest. But we usually turn to regression to help us understand how certain observed variables alter the distribution of responses of our dependent variable. For example, we might be interested in whether a positive mood induction increases attractiveness ratings. We can do this by letting the cutpoints shift left or right depending on whether the participant had received the induction (\(x\)). Remember that a positive effect on the dependent variable implies that the latent thresholds decrease, so we subtract from the intercepts to get an estimate, \(b\), of how much the mood induction increases perceptions of attractiveness:

\[ P(y=i) = \Phi(\tau_i-bx)-\Phi(\tau_{i-1}-bx). \]

And, as you might have guessed, we can extend this logic to include as many predictor variables as we want, substituting a whole equation \(b_1x_1+b_2x_2+ \dots+b_nx_n\) for \(bx\) above.
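To make the shift concrete, here is a small illustration with made-up numbers: quintile cutpoints and a hypothetical mood-induction effect of b = 0.5 (with x coded 0/1). Subtracting b from every cutpoint pushes probability mass toward the higher response categories.

tau <- qnorm(seq(.2, .8, .2))

# P(y = i) = Phi(tau_i - b*x) - Phi(tau_{i-1} - b*x), here with x = 1
probs_given_shift <- function(b) diff(c(0, pnorm(tau - b), 1))

round(rbind(control = probs_given_shift(0), induction = probs_given_shift(0.5)), 3)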

With this background out of the way, let’s narrow in on a specific application: measuring a person’s ability to discriminate two types of stimuli. This is the fundamental goal of signal detection theory. In classic SDT, participants make binary judgments (e.g., I heard/didn’t hear a noise), which are modeled under the assumption that there are two latent normal distributions — one for signals (e.g., a faint noise is played) and one for the lack of a signal (a noise is not played). When making this binary judgment, the participant sets a threshold in latent space that determines whether they’ll judge the stimulus as present (above threshold) or absent (below threshold).

The ordinal probit model we’ve been discussing can be thought of as a generalization of this framework. Instead of dividing latent space into two bins, however, we divide it into an arbitrary number of bins, with thresholds that need not be equally spaced apart.

Which faces make it on reality TV?

Let’s make this concrete by imagining an extension of our face example from above. Imagine that the faces people are evaluating come from a set of contestants who applied for a reality TV show that recruits people with the hottest faces it can find. In an experiment, half of the faces shown are from people who made the cut, and the other half from people who didn’t make the cut. The only evidence that participants can rely on to guess who made it on the show is the attractiveness of the faces, so the participant’s five-point judgment of attractiveness is a proxy for belief that the applicant made it on the show. This is an imperfect proxy because the participant may have different taste than the judges of the show, and these ratings can also be influenced by other random factors. To capture this noisy relationship between the rater’s perceptions of facial attractiveness and making it on the show, we model the faces of those rejected vs. accepted on the show as coming from two separate but overlapping distributions.6

ggplot() + 
  stat_function(
    fun = \(x) dnorm(x, -.5), xlim = c(-6, 6), 
    linewidth = 1, linetype = "dashed", alpha = 0.7
  ) +
  stat_function(
    fun = \(x) dnorm(x, 1), xlim = c(-6, 6), 
    linewidth = 1, linetype = "solid", alpha = 0.7
  ) +
  annotate(
    "text", size = 6,
    x = c(-2, 2.5), y = c(.35, .35), 
    label = c("Rejected", "Accepted"),
  ) +
  labs(
    x = "Latent attractiveness",
    y = NULL
  ) + 
  coord_cartesian(xlim = c(-4, 4)) +
  scale_x_continuous(breaks = -5:5) +
  scale_y_continuous(breaks = NULL) +
  theme(legend.position = "top")

If this were a standard SDT task, we would ask the participant to give a binary judgment about whether they think each face corresponds to an accepted show contestant. In our task, we can’t think of the five-point judgment in quite the same way. While a 1 (“very unattractive”) indicates a strong belief that the person pictured did not become a contestant, and a 5 (“very attractive”) indicates the opposite, it’s less clear how to interpret the middle ratings. If we were to convert the ordinal judgments into binary beliefs about who made it on the show, we could consider four possible dividing lines, corresponding to the four cutpoints. That is, the participant could judge everybody above a 1 as a contestant, everybody above a 2, everybody above a 3, or everybody above a 4. As the threshold is moved upwards, there are fewer false positives (i.e., judging that someone made it on the show when they didn’t), but also fewer true positives (i.e., judging that someone made it on the show when they did).

Critically, though, the weighting of false vs. true positives via the choice of thresholds is distinct from the participant’s ability to discriminate contestant from non-contestant faces. That is, two people could use different thresholds for their Likert responses (e.g., quintiles vs. standard deviations) while being equally good at telling the two categories of faces apart. Fortunately, our model can identify discrimination ability directly by estimating the latent distance between the two curves shown above. In SDT language, this distance is known as d-prime (d’).

At this point, you might be wondering how our model could estimate the distance between two distributions when our regression equation assumes there is only one distribution. While ordinal models assume a fixed distribution and a set of cutpoints that can shift left or right, SDT assumes two distributions and a set of fixed thresholds. Fortunately, these two frameworks are mathematically equivalent (assuming equal variance of the two distributions): shifting a distribution rightward is the same as shifting the thresholds leftward, and vice versa. Hence, the amount that the thresholds negatively shift when moving from rejected contestants to accepted contestants is equal to the distance between the two distributions (d’).
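Here is a one-line check of that equivalence, using illustrative numbers rather than model output: shifting the latent distribution up by d’ yields exactly the same response probabilities as shifting every threshold down by d’.

tau <- qnorm(seq(.2, .8, .2))
d_prime <- 1.5

p_shift_distribution <- diff(c(0, pnorm(tau, mean = d_prime), 1))
p_shift_thresholds <- diff(c(0, pnorm(tau - d_prime), 1))
all.equal(p_shift_distribution, p_shift_thresholds)
## [1] TRUE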

Once again, we can confirm this by simulating data from these two hypothetical latent distributions and fitting an ordinal model with an indicator variable for whether the face was from someone who was accepted as a contestant on the show. Let’s assume that our participant uses quintile thresholds, and for clarity, I will fix the distribution of rejected contestants to center on 0 and therefore let the distribution of accepted contestants be centered at 1.5, indicating a d’ of 1.5.

set.seed(10101)
n <- 1000

df_reality_show <- tibble(
  z = c(rnorm(n/2, mean = 0), rnorm(n/2, mean = 1.5)),
  accepted = rep(0:1, each = n/2),
  y_qtile = findInterval(z, qnorm(seq(.2, .8, .2))) + 1
)

m2 <- brm(
  formula = y_qtile ~ 1 + accepted, data = df_reality_show,
  family = cumulative(link = "probit"), 
  prior = prior(normal(0, 1), class = Intercept),
  file = "Models/m2"
)

fixef(m2)
##                Estimate  Est.Error       Q2.5      Q97.5
## Intercept[1] -0.8733243 0.06169553 -0.9945042 -0.7534213
## Intercept[2] -0.2551831 0.05412615 -0.3630002 -0.1498126
## Intercept[3]  0.2647077 0.05403831  0.1563562  0.3672487
## Intercept[4]  0.7724518 0.05728562  0.6601024  0.8848581
## accepted      1.3784746 0.07707211  1.2292166  1.5306994

Direct your attention to the coefficient of the “accepted” dummy variable. Its estimate of roughly 1.38 is quite close to the true d’ value of 1.5.

Drawing the ROC curve

Another common SDT tool to represent the distance between two distributions is with a receiver operating characteristic (ROC) curve. This curve shows how the true positive rate (y-axis) and false positive rate (x-axis) trade off based on where a binary threshold is set in latent space. The better the participant is at discriminating the two distributions, the more that the true positive rate should exceed the false positive rate (except for extreme thresholds where everything is classified as a positive case or everything is classified as a negative case). The area under this ROC curve (AUC) is therefore another way to represent discriminability.

If we’re drawing the ROC curve based on the theoretical assumptions of our model (two normal distributions with equal variance separated by some distance), there is a one-to-one function mapping d’ to AUC7, so both d’ (which can take on any real value) and AUC (which is bounded between 0 and 1) yield the same information. In practice, though, it is also informative to draw an empirically derived ROC curve and see how it lines up with the theoretical one. Each has its advantages: the theoretical curve gives us more flexibility to adjust for covariates and smooth out noise, but it makes strong assumptions about the nature of the stimuli and cognition (e.g., normal distributions with equal variance) that may not match reality; the empirical curve makes no such assumptions.
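Given the relation in footnote 7, the theoretical AUC implied by the fitted d’ of about 1.38 is a one-liner; it comes out to roughly 0.84.

d_prime_hat <- fixef(m2)["accepted", "Estimate"]
pnorm(d_prime_hat / sqrt(2))  # theoretical AUC implied by the fitted d'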

Let’s explore how we can draw these curves in our reality show example. Our model yielded cutpoints for the distribution of rejected contestants and an offset parameter that tells us how much to shift these cutpoints leftward for the distribution of accepted contestants. As explained earlier, we can equivalently treat this offset parameter as a positive shift in the latent distribution.

For a given binary threshold \(\tau\) and a value \(z\) in latent standard normal space, we want to calculate the true positive rate (TPR) and the false positive rate (FPR). In our theoretical model, the TPR (where a “positive” here is getting accepted for the reality show) is defined as

\[ P(z + d'>\tau) = P(z > \tau - d') = 1 - \Phi(\tau-d'), \]

and the FPR as

\[ P(z > \tau) = 1 - \Phi(\tau). \]

In other words, if we fix the mean of the distribution of rejected contestants at zero, the TPR is the probability that a \(d'\)-shifted distribution exceeds \(\tau\) — or equivalently, the probability that the standard normal exceeds a \(-d'\)-shifted threshold. The FPR is the probability that the distribution of rejected faces exceeds \(\tau\).

Because our theoretical model assumes a parametric form, we need not restrict our attention to the specific cutpoints that our model inferred for the ordinal variable. We can calculate these quantities for any threshold and plot the resulting curve, as shown below (plugging in our observed \(d'\) of 1.38). The specific observed thresholds for ordinal values of \(Y\) exceeding 1, 2, 3, and 4 are also labeled.

obs_d <- fixef(m2)[, "Estimate"]["accepted"]

df_roc_all <- tibble(
  tau = seq(-5, 5, by = .1),
  tpr = 1 - pnorm(tau - obs_d),
  fpr = 1 - pnorm(tau)
)

df_roc_cutpoints <- tibble(
  tau = fixef(m2)[, "Estimate"][1:4],
  label = str_c("Y>", 1:4),
  tpr = 1 - pnorm(tau - obs_d),
  fpr = 1 - pnorm(tau)
)

fig_roc <- ggplot(mapping = aes(fpr, tpr)) +
  geom_abline(linewidth = 0.1) +
  geom_line(data = df_roc_all) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(
    x = "False Positive Rate", 
    y = "True Positive Rate"
  ) +
  theme(aspect.ratio = 1)

# add in labels separately since I remove them in next plot
fig_roc +
  geom_label(aes(label = label), data = df_roc_cutpoints)

To draw the empirical ROC curve, we make no assumptions about the structure of latent space. Instead, we simply calculate the observed true positive and false positive rates at each of the four ordinal thresholds. Because there are only four such points, I don’t draw a full empirical curve; I just plot the empirically derived points in red on top of the theoretical ROC from above so we can see how they line up.

df_roc_empirical <- df_reality_show |> 
  count(accepted, y_qtile) |> 
  pivot_wider(
    names_from = "accepted", names_prefix = "accepted_",
    values_from = "n"
  ) |> 
  filter(y_qtile > 1) |> 
  arrange(desc(y_qtile)) |> 
  mutate(
    label = str_c("Y>", 4:1),
    fpr = cumsum(accepted_0) / 500,
    tpr = cumsum(accepted_1) / 500,
  )

fig_roc +
  geom_point(
    data = df_roc_empirical, 
    size = 2, color = "darkred"
  ) +
  geom_text(
    aes(label = label), data = df_roc_empirical,
    nudge_y = 0.04, nudge_x = -0.02, size = 4
  )

The empirically derived points land very close to the theoretical line. This is not so surprising, as our simulations were derived from this theoretical model. In a real data set, we could find larger deviations.
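As one last check, and a small addition beyond the comparison above, we can approximate the empirical AUC by trapezoidal integration over the observed points, padded with the (0, 0) and (1, 1) endpoints. With only four interior points, this piecewise-linear approximation will understate the area under the smooth theoretical curve, so expect a value somewhat below the model-implied AUC computed earlier.

emp <- df_roc_empirical |> arrange(fpr)
x <- c(0, emp$fpr, 1)
y <- c(0, emp$tpr, 1)

# trapezoid rule over the piecewise-linear empirical ROC
sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)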

Conclusion

The cumulative ordinal probit model gives us a way to measure fake-news discernment — and discriminability of stimuli more generally — that doesn’t assume that an ordinal Likert scale is metric. Moreover, its connections with signal detection theory make it appealing for plotting ROC curves and doing decision theory. Of course, any model must make simplifying assumptions, and it’s always a good idea to compare model-derived predictions to nonparametric quantities like the empirical true and false positive rates. Nevertheless, I believe such models are much closer to the truth than models that assume that an ordinal variable is continuous.


  1. This kind of discernment is probably best thought of as “additive”. It’s more complicated to think about how these models could yield “multiplicative” discernment, but this metric could be calculated from the model post hoc.↩︎

  2. This is the assumption of probit models. If we used a logit model, we’d be assuming that facial attractiveness follows a logistic distribution. In practice, a properly scaled logistic distribution is very similar to a normal distribution, with slightly fatter tails, so the theoretical difference between logit and probit models is minimal. However, logit models are often preferred because the model parameters can be interpreted as multipliers on odds.↩︎

  3. Although I’ll be fitting ordinal probit models that make this standard-normal assumption, it’s also possible to have your model learn a mean and/or variance. This can be useful, for example, if you think the variance of the latent distribution might differ between groups. Unfortunately, it is not possible to jointly identify all k - 1 cutpoints and a mean and variance in the same model, so only k - 3 cutpoints can be estimated if you want to also estimate mean and variance, as is done here.↩︎

  4. Technically, we are exploring only one type of ordinal model called a “cumulative” ordinal model.↩︎

  5. See Bürkner’s paper for a more detailed treatment.↩︎

  6. We assume the variance of the two distributions is equal. SDT has tools to allow for unequal variances, but it’s not so easy to map this on to ordinal regression.↩︎

  7. For two normal latent distributions with equal variance, AUC is \(\Phi(d'/\sqrt2)\).↩︎
