Accuracy measures on conditional probabilities | Thoughts, Arguments and Rants

I just proved a result about probability aggregation that I found rather perplexing. The proof, and even the result, is a little too complicated to put in HTML, so here it is in PDF.

A Note on Probability Aggregation and Conditional Accuracy Measures

What started me thinking about this was Sarah Moss’s excellent paper Scoring Rules and Epistemic Compromise, which is about aggregating probability functions. Here’s, roughly, the kind of puzzle that Moss is interested in. We’ve got two probability functions Pr₁ and Pr₂, and we want to find some probability function Pr that ‘aggregates’ them. Perhaps that’s because Pr₁ is my credence function, Pr₂ is yours, and we need to find some basis for making choices about collective action. Perhaps it is because the only thing you know about a certain subject matter is that one expert’s credence function is Pr₁, another’s function is Pr₂, and each expert seems equally likely to be right, so you want to somehow defer equally to the two of them. (Or, perhaps, it is because you want to apply the Equal Weight View of disagreement. But don’t do that; the Equal Weight View is false.)

It seems there is an easy solution to this. For any X, let Pr(X) = (Pr₁(X) + Pr₂(X))/2. But as Barry Loewer noted many years ago, this solution has some costs. Let’s say we care about two propositions, _p_ and _q_, and Boolean combinations of them. And say that _p_ and _q_ are probabilistically independent according to both Pr₁ and Pr₂. Then this linear mixture approach will not in general preserve independence. So there are some costs to it.
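Here is a minimal Python sketch of the Loewer point. The two input functions are made up for illustration (one makes _p_ and _q_ each 1/5 likely, the other makes each 4/5 likely), and the helper names are hypothetical. Exact fractions are used so the independence check is not muddied by rounding.

```python
from fractions import Fraction as F

def independent_joint(prob_p, prob_q):
    """Joint distribution over (p, q) truth values, with p and q independent."""
    return {
        (True, True): prob_p * prob_q,
        (True, False): prob_p * (1 - prob_q),
        (False, True): (1 - prob_p) * prob_q,
        (False, False): (1 - prob_p) * (1 - prob_q),
    }

def linear_mixture(pr1, pr2):
    """Pr(X) = (Pr1(X) + Pr2(X)) / 2 for each cell of the partition."""
    return {world: (pr1[world] + pr2[world]) / 2 for world in pr1}

def prob_p(pr):
    return pr[(True, True)] + pr[(True, False)]

def prob_q(pr):
    return pr[(True, True)] + pr[(False, True)]

def is_independent(pr):
    """p and q are independent iff Pr(p and q) = Pr(p) * Pr(q)."""
    return pr[(True, True)] == prob_p(pr) * prob_q(pr)

# Illustrative inputs only (an assumption, not taken from any paper).
pr1 = independent_joint(F(1, 5), F(1, 5))
pr2 = independent_joint(F(4, 5), F(4, 5))
mix = linear_mixture(pr1, pr2)

print(is_independent(pr1), is_independent(pr2), is_independent(mix))
print(mix[(True, True)], prob_p(mix) * prob_q(mix))
```

With these inputs the mixture gives p ∧ q probability 17/50, while the product of its marginals is 1/4, so independence is lost even though both inputs had it.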

One of the things Moss does is come up with an independent argument for using linear mixtures. Her argument turns on various accuracy measures, or what are sometimes called scoring rules, for probability functions. (Note that I’m leaving out a lot of the interesting stuff in Moss’s paper, which goes into a lot of detail about what happens when we get further away from the Brier scores that are the focus here. Anyone who is at all interested in these aggregation issues, which are pretty central to current epistemological debates, should read her paper.)

Thanks to Jim Joyce’s work, there has been an upsurge of interest among philosophers in accuracy measures for probability functions. Here’s how the most commonly used scoring rule, called the Brier score, works. We start with a partition of possibility space, the partition that we’re currently interested in. In this case it would be {p ∧ q, p ∧ ¬q, ¬p ∧ q, ¬p ∧ ¬q}. For any proposition X, say V(X, w) is 1 if X is true in w, and 0 if X is false in w. Then we ‘score’ a function Pr in world w by summing (Pr(X) – V(X, w))², as X takes each value in the partition. This is a measure of how inaccurate Pr is in w: the higher this number is, the more inaccurate Pr is. Conversely, the lower it is, the more accurate Pr is. And accuracy is obviously a good thing, so this gives us a kind of goodness measure on probability functions.
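To make the scoring concrete, here is a small Python sketch of the Brier score over the four-cell partition. The cell labels, function names and the sample probability function are all illustrative assumptions.

```python
from fractions import Fraction as F

# The four cells of the partition; each also names the world where it obtains.
PARTITION = ["p&q", "p&~q", "~p&q", "~p&~q"]

def truth_value(cell, world):
    """V(X, w): 1 if partition cell X is the one that obtains in w, else 0."""
    return 1 if cell == world else 0

def brier_inaccuracy(pr, world):
    """Sum of (Pr(X) - V(X, w))^2 as X ranges over the partition."""
    return sum((pr[cell] - truth_value(cell, world)) ** 2 for cell in PARTITION)

# A made-up probability function, purely for illustration.
pr = {"p&q": F(1, 2), "p&~q": F(1, 4), "~p&q": F(1, 4), "~p&~q": F(0)}

for world in PARTITION:
    print(world, brier_inaccuracy(pr, world))
```

This function does best (3/8) in the world where p ∧ q obtains, the cell it rated most probable, and worst (11/8) in the world where ¬p ∧ ¬q obtains, the cell it gave probability 0.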

Now in the aggregation problem we’re interested in here, we don’t know what world we’re in, so this isn’t directly relevant. But instead of looking at the actual inaccuracy measure of Pr, we can look at its expected inaccuracy measure. ‘Expected’ according to what, you might ask. Well, first we look at the expectation according to Pr₁, and then the expectation according to Pr₂, then we average them. That gives a fair way of scoring Pr according to each of Pr₁ and Pr₂.

One of the things Moss shows is that this average of expected inaccuracy is minimised when Pr is the linear average of Pr₁ and Pr₂. And she offers good reasons to think this isn’t a quirk of the scoring rule we’re using. It doesn’t matter, that is, that we’re using squares of distance between Pr(X) and V(X, w); any ‘credence-eliciting’ scoring rule will plausibly have the same result.
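For the Brier score, the minimisation claim can be spot-checked in a few lines of Python. This is only a sketch under assumptions: the two input functions are invented, the names are hypothetical, and it compares the linear average against a handful of rivals rather than proving anything in general.

```python
from fractions import Fraction as F

WORLDS = [(True, True), (True, False), (False, True), (False, False)]

def independent_joint(prob_p, prob_q):
    """Joint distribution over (p, q) truth values, with p and q independent."""
    return {(p, q): (prob_p if p else 1 - prob_p) * (prob_q if q else 1 - prob_q)
            for (p, q) in WORLDS}

def brier_inaccuracy(pr, world):
    """Sum of (Pr(X) - V(X, w))^2 over the four cells of the partition."""
    return sum((pr[cell] - (1 if cell == world else 0)) ** 2 for cell in WORLDS)

def expected_inaccuracy(candidate, judge):
    """Expected Brier inaccuracy of `candidate`, by the lights of `judge`."""
    return sum(judge[w] * brier_inaccuracy(candidate, w) for w in WORLDS)

def average_expected_inaccuracy(candidate, pr1, pr2):
    """Average of the two expectations, one per input function."""
    return (expected_inaccuracy(candidate, pr1)
            + expected_inaccuracy(candidate, pr2)) / 2

# Illustrative inputs only (an assumption).
pr1 = independent_joint(F(1, 5), F(1, 5))
pr2 = independent_joint(F(4, 5), F(4, 5))
mix = {w: (pr1[w] + pr2[w]) / 2 for w in WORLDS}
uniform = {w: F(1, 4) for w in WORLDS}

for name, candidate in [("mix", mix), ("pr1", pr1), ("pr2", pr2), ("uniform", uniform)]:
    print(name, average_expected_inaccuracy(candidate, pr1, pr2))
```

For the Brier score specifically, a rival's score exceeds the linear average's score by exactly the squared distance between the two functions, summed over the cells. So any departure from the linear average costs something.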

But I was worried that this didn’t really address the Loewer concern directly. The point of that concern was that linear mixtures get the conditional probabilities wrong. So we might want instead to measure the accuracy of Pr’s *conditional* probability assignments. Here’s how I thought we’d go about that.

Consider the four values Pr(p | q), Pr(p | ¬q), Pr(q | p), Pr(q | ¬p). In any world _w_, two of the four ‘conditions’ in these conditional probabilities will be met. Let’s say they are p and ¬q. Then the *conditional* inaccuracy of Pr in that world will be (Pr(q | p) – V(q, w))² + (Pr(p | ¬q) – V(p, w))². In other words, we apply the same formula as for the Brier score, but we use conditional rather than unconditional probabilities, and we just look at the conditions that are satisfied.
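Here is how that conditional measure might look in Python, going by the description just given. The function names and the sample joint distribution are assumptions for illustration, and the sketch presupposes that every condition has positive probability, so that the conditional probabilities are defined.

```python
from fractions import Fraction as F

# Worlds are (p is true, q is true) pairs.
WORLDS = [(True, True), (True, False), (False, True), (False, False)]

def prob_p_given_q(pr, q_val):
    """Pr(p | q) if q_val is True, Pr(p | not-q) if q_val is False."""
    return pr[(True, q_val)] / (pr[(True, q_val)] + pr[(False, q_val)])

def prob_q_given_p(pr, p_val):
    """Pr(q | p) if p_val is True, Pr(q | not-p) if p_val is False."""
    return pr[(p_val, True)] / (pr[(p_val, True)] + pr[(p_val, False)])

def conditional_inaccuracy(pr, world):
    """Score only the two conditional probabilities whose conditions hold in `world`."""
    p_val, q_val = world
    return ((prob_q_given_p(pr, p_val) - int(q_val)) ** 2
            + (prob_p_given_q(pr, q_val) - int(p_val)) ** 2)

# A made-up joint distribution, purely for illustration.
pr = {(True, True): F(1, 2), (True, False): F(1, 4),
      (False, True): F(1, 8), (False, False): F(1, 8)}

# The world from the text: p is true and q is false.
print(conditional_inaccuracy(pr, (True, False)))
```

In the world where p and ¬q hold, this function has Pr(q | p) = 2/3 and Pr(p | ¬q) = 2/3, so its conditional inaccuracy is (2/3 – 0)² + (2/3 – 1)² = 5/9.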

From then on, I thought, we could use Moss’s technique. We’ll look for the value of Pr that minimises the expected conditional inaccuracy, and call that the compromise, or aggregated, function. I guessed that this would be the function we got by taking the linear mixtures of the original conditional probabilities. That is, we would want to have Pr(p | q) = (Pr₁(p | q) + Pr₂(p | q))/2. I thought that, at least roughly, the same reasoning that implied that linear mixtures of unconditional probabilities minimised the average expected unconditional inaccuracy would mean that linear mixtures of conditional probabilities minimised the average expected conditional inaccuracy.

I was wrong.

It turns out that, at least in the case where _p_ and _q_ are probabilistically independent according to both Pr₁ and Pr₂, the function that does best according to this new rule is the same linear mixture as does best under the measures Moss looks at. This was extremely surprising to me. We start with a whole bunch of conditional probabilities. We need to aggregate them into a joint conditional probability distribution that satisfies various nice constraints. Notably, these are all constraints on the resultant _conditional_ probabilities, and conditional probabilities are, at least relative to unconditional probabilities, fractions. Normally, one does not get nice results for ‘mixing’ fractions by simply averaging numerators and denominators. But that’s exactly what we do get here.
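A numerical spot check of this is easy to run. The sketch below rests on my reading of the conditional rule as described in this post, which may differ in detail from the formulation in the PDF. It uses one invented pair of independent functions, and it is a sanity check rather than a substitute for the proof. With these inputs, averaging the conditional probabilities gives 1/2 for all four of them, which is what the uniform distribution assigns, so the uniform distribution stands in for the guess that turned out wrong.

```python
import random
from fractions import Fraction as F

WORLDS = [(True, True), (True, False), (False, True), (False, False)]

def independent_joint(prob_p, prob_q):
    """Joint distribution over (p, q) truth values, with p and q independent."""
    return {(p, q): (prob_p if p else 1 - prob_p) * (prob_q if q else 1 - prob_q)
            for (p, q) in WORLDS}

def conditional_inaccuracy(pr, world):
    """Squared error of the two conditional probabilities whose conditions hold."""
    p_val, q_val = world
    q_given_p = pr[(p_val, True)] / (pr[(p_val, True)] + pr[(p_val, False)])
    p_given_q = pr[(True, q_val)] / (pr[(True, q_val)] + pr[(False, q_val)])
    return (q_given_p - int(q_val)) ** 2 + (p_given_q - int(p_val)) ** 2

def average_expected_conditional_inaccuracy(candidate, pr1, pr2):
    """Average, over the two inputs, of the candidate's expected conditional inaccuracy."""
    exp1 = sum(pr1[w] * conditional_inaccuracy(candidate, w) for w in WORLDS)
    exp2 = sum(pr2[w] * conditional_inaccuracy(candidate, w) for w in WORLDS)
    return (exp1 + exp2) / 2

def random_joint(rng):
    """A random joint distribution with every cell positive (hypothetical helper)."""
    weights = [rng.randint(1, 20) for _ in WORLDS]
    return {w: F(wt, sum(weights)) for w, wt in zip(WORLDS, weights)}

# Illustrative inputs only (an assumption).
pr1 = independent_joint(F(1, 5), F(1, 5))
pr2 = independent_joint(F(4, 5), F(4, 5))
mix = {w: (pr1[w] + pr2[w]) / 2 for w in WORLDS}
uniform = {w: F(1, 4) for w in WORLDS}  # matches the averaged conditionals here

mix_score = average_expected_conditional_inaccuracy(mix, pr1, pr2)
uniform_score = average_expected_conditional_inaccuracy(uniform, pr1, pr2)

rng = random.Random(0)
rivals = [random_joint(rng) for _ in range(1000)]
beaten_by = [r for r in rivals
             if average_expected_conditional_inaccuracy(r, pr1, pr2) < mix_score]

print(mix_score, uniform_score, len(beaten_by))
```

On this example the linear mixture scores 272/625, against 1/2 for the function built from averaged conditional probabilities, and none of the randomly drawn rivals does better than the mixture.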

I don’t have a very good sense of _why_ this result holds. I sort of do understand why Moss’s results hold, I think, though perhaps not well enough to explain! Just why this result obtains, though, is a bit of a mystery to me. Still, it seems to hold. And I think it’s one more reason to think that the obvious answer to our original question is the right one; if you want to aggregate two probability functions, just average them.