# Accuracy measures on conditional probabilities

I just proved a result about probability aggregation that I found rather perplexing. The proof, and even the result, is a little too complicated to put in HTML, so here it is in PDF.

What started me thinking about this was Sarah Moss’s excellent paper Scoring Rules and Epistemic Compromise, which is about aggregating probability functions. Here’s, roughly, the kind of puzzle that Moss is interested in. We’ve got two probability functions Pr1 and Pr2, and we want to find some probability function Pr that ‘aggregates’ them. Perhaps that’s because Pr1 is my credence function, Pr2 is yours, and we need to find some basis for making choices about collective action. Perhaps it is because the only thing you know about a certain subject matter is that one expert’s credence function is Pr1, another’s function is Pr2, and each expert seems equally likely, and you want to somehow defer equally to the two of them. (Or, perhaps, it is because you want to apply the Equal Weight View of disagreement. But don’t do that; the Equal Weight View is false.)

It seems there is an easy solution to this. For any X, let Pr(X) = (Pr1(X) + Pr2(X))/2. But as Barry Loewer noted many years ago, this solution has some costs. Let’s say we care about two propositions, p and q, and Boolean combinations of them. And say that p and q are probabilistically independent according to both Pr1 and Pr2. Then the linear mixture will not in general preserve that independence: p and q can fail to be independent according to Pr.
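Here is a minimal sketch of Loewer's point. The credences 2/3 and 1/3 are illustrative assumptions, not figures from Loewer or Moss, and the helper names are made up for the example. Each input function makes p and q independent, but their average does not.

```python
from fractions import Fraction as F

def independent(pr_p, pr_q):
    """Joint distribution over the four (p, q) worlds, with p and q independent."""
    return {
        (True, True): pr_p * pr_q,
        (True, False): pr_p * (1 - pr_q),
        (False, True): (1 - pr_p) * pr_q,
        (False, False): (1 - pr_p) * (1 - pr_q),
    }

def mix(pr1, pr2):
    """Equal-weight linear mixture, taken world by world."""
    return {w: (pr1[w] + pr2[w]) / 2 for w in pr1}

def marginal_p(pr):
    return pr[(True, True)] + pr[(True, False)]

def marginal_q(pr):
    return pr[(True, True)] + pr[(False, True)]

pr1 = independent(F(2, 3), F(2, 3))  # assumed credences, for illustration only
pr2 = independent(F(1, 3), F(1, 3))
pr = mix(pr1, pr2)

joint = pr[(True, True)]                    # Pr(p and q) under the mixture
product = marginal_p(pr) * marginal_q(pr)   # Pr(p) * Pr(q) under the mixture
print(joint, product)  # 5/18 1/4
```

Since 5/18 is not 1/4, the mixture treats p as some evidence for q even though neither input does.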

One of the things Moss does is come up with an independent argument for using linear mixtures. Her argument turns on various accuracy measures, or what are sometimes called scoring rules, for probability functions. (Note that I’m leaving out a lot of the interesting stuff in Moss’s paper, which goes into a lot of detail about what happens when we get further away from the Brier scores that are the focus here. Anyone who is at all interested in these aggregation issues, which are pretty central to current epistemological debates, should read her paper.)

Thanks to Jim Joyce’s work there has been an upsurge of interest in philosophy in accuracy measures of probability functions. Here’s how the most commonly used scoring rule, called the Brier score, works. We start with a partition of possibility space, the partition that we’re currently interested in. In this case it would be {p ∧ q, p ∧ ¬q, ¬p ∧ q, ¬p ∧ ¬q}. For any proposition X, say V(X, w) is 1 if X is true in w, and 0 if X is false in w. Then we ‘score’ a function Pr in world w by summing (Pr(X) – V(X, w))², as X takes each value in the partition. This is a measure of how inaccurate Pr is in w: the higher this number is, the more inaccurate Pr is. Conversely, the lower it is, the more accurate it is. And accuracy is obviously a good thing, so this gives us a kind of goodness measure on probability functions.
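To make the Brier score concrete, here is a small sketch over the four-cell partition. The two example probability functions are made up for illustration, and the function name `brier_inaccuracy` is mine.

```python
# The four cells of the partition, written as (truth value of p, truth value of q).
WORLDS = [(True, True), (True, False), (False, True), (False, False)]

def brier_inaccuracy(pr, actual):
    """Sum over cells X of (Pr(X) - V(X, w))^2, where w is the actual world."""
    return sum((pr[cell] - (1.0 if cell == actual else 0.0)) ** 2 for cell in WORLDS)

# Illustrative probability functions (assumptions, not from the post).
uniform = {w: 0.25 for w in WORLDS}
opinionated = {(True, True): 0.7, (True, False): 0.1,
               (False, True): 0.1, (False, False): 0.1}

print(brier_inaccuracy(uniform, (True, True)))         # 0.75
print(brier_inaccuracy(opinionated, (True, True)))     # small: it bet on the right cell
print(brier_inaccuracy(opinionated, (False, False)))   # large: it bet on the wrong cell
```

The opinionated function does better than the uniform one when p ∧ q is true and worse when ¬p ∧ ¬q is true, which is the sense in which a lower score means more accurate.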

Now in the aggregation problem we’re interested in here, we don’t know what world we’re in, so this isn’t directly relevant. But instead of looking at the actual inaccuracy measure of Pr, we can look at its expected inaccuracy measure. ‘Expected’ according to what, you might ask. Well, first we look at the expectation according to Pr1, and then the expectation according to Pr2, then we average them. That gives a fair way of scoring Pr according to each of Pr1 and Pr2.

One of the things Moss shows is that this average of expected inaccuracy is minimised when Pr is the linear average of Pr1 and Pr2. And she offers good reasons to think this isn’t a quirk of the scoring rule we’re using. It doesn’t matter, that is, that we’re using squares of distance between Pr(X) and V(X); any ‘credence-eliciting’ scoring rule will plausibly have the same result.
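A quick numerical sanity check of this claim for the Brier score can be sketched as follows. It is not Moss's proof. The two input functions are arbitrary assumptions, and the random search only samples candidates rather than covering them all.

```python
import random

WORLDS = [(True, True), (True, False), (False, True), (False, False)]

def brier(pr, actual):
    """Brier inaccuracy of pr in the world `actual`."""
    return sum((pr[c] - (1.0 if c == actual else 0.0)) ** 2 for c in WORLDS)

def expected_brier(pr, judge):
    """Expected inaccuracy of pr, with the expectation taken according to `judge`."""
    return sum(judge[w] * brier(pr, w) for w in WORLDS)

def average_expected(pr, pr1, pr2):
    """Average of the expectations according to pr1 and according to pr2."""
    return 0.5 * expected_brier(pr, pr1) + 0.5 * expected_brier(pr, pr2)

def random_distribution(rng):
    """A random probability function over the four worlds."""
    draws = [rng.random() for _ in WORLDS]
    total = sum(draws)
    return {w: d / total for w, d in zip(WORLDS, draws)}

# Illustrative inputs (assumptions).
pr1 = dict(zip(WORLDS, [0.4, 0.3, 0.2, 0.1]))
pr2 = dict(zip(WORLDS, [0.1, 0.2, 0.3, 0.4]))
mixture = {w: (pr1[w] + pr2[w]) / 2 for w in WORLDS}

best = average_expected(mixture, pr1, pr2)
rng = random.Random(0)
mixture_wins = all(
    average_expected(random_distribution(rng), pr1, pr2) >= best
    for _ in range(10_000)
)
print(mixture_wins)
```

For the Brier score the gap between a candidate and the mixture works out to the squared distance between them, which is why no sampled candidate beats the mixture.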

But I was worried that this didn’t really address the Loewer concern directly. The point of that concern was that linear mixtures get the conditional probabilities wrong. So we might want instead to measure the accuracy of Pr’s conditional probability assignments. Here’s how I thought we’d go about that.

Consider the four values Pr(p | q), Pr(p | ¬q), Pr(q | p), Pr(q | ¬p). In any world w, two of the four ‘conditions’ in these conditional probabilities will be met. Let’s say they are p and ¬q. Then the conditional inaccuracy of Pr in that world will be (Pr(q | p) – V(q, w))² + (Pr(p | ¬q) – V(p, w))². In other words, we apply the same formula as for the Brier score, but we use conditional rather than unconditional probabilities, and we just look at the conditions that are satisfied.
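In code, the conditional inaccuracy measure might look like the sketch below. The function names are mine, and it assumes every condition has positive probability so that the conditional probabilities are defined.

```python
def pr_p_given_q(pr, q_val):
    """Pr(p | q) if q_val is True, Pr(p | not-q) if q_val is False."""
    return pr[(True, q_val)] / (pr[(True, q_val)] + pr[(False, q_val)])

def pr_q_given_p(pr, p_val):
    """Pr(q | p) if p_val is True, Pr(q | not-p) if p_val is False."""
    return pr[(p_val, True)] / (pr[(p_val, True)] + pr[(p_val, False)])

def conditional_inaccuracy(pr, world):
    """Squared error of the two conditional probabilities whose conditions hold in `world`."""
    p_val, q_val = world
    v_p = 1.0 if p_val else 0.0
    v_q = 1.0 if q_val else 0.0
    return (pr_q_given_p(pr, p_val) - v_q) ** 2 + (pr_p_given_q(pr, q_val) - v_p) ** 2

# Illustrative function: p and q independent, each with probability 0.5.
even = {(True, True): 0.25, (True, False): 0.25,
        (False, True): 0.25, (False, False): 0.25}

# World where p is true and q is false: the satisfied conditions are p and not-q.
print(conditional_inaccuracy(even, (True, False)))  # 0.5
```

In the world where p is true and q is false this computes (Pr(q | p) – 0)² + (Pr(p | ¬q) – 1)², matching the formula in the text.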

From then on, I thought, we could use Moss’s technique. We’ll look for the value of Pr that minimises the expected conditional inaccuracy, and call that the compromise, or aggregated, function. I guessed that this would be the function we got by taking the linear mixtures of the original conditional probabilities. That is, we would want to have Pr(p | q) = (Pr1(p | q) + Pr2(p | q))/2. I thought that, at least roughly, the same reasoning that implied that linear mixtures of unconditional probabilities minimised the average expected unconditional inaccuracy would mean that linear mixtures of conditional probabilities minimised the average expected conditional inaccuracy.

I was wrong.

It turns out that, at least in the case where p and q are probabilistically independent according to both Pr1 and Pr2, the function that does best according to this new rule is the same linear mixture as does best under the measures Moss looks at. This was extremely surprising to me. We start with a whole bunch of conditional probabilities. We need to aggregate them into a joint conditional probability distribution that satisfies various nice constraints. Notably, these are all constraints on the resultant conditional probabilities, and conditional probabilities are, at least relative to unconditional probabilities, fractions. Normally, one does not get nice results for ‘mixing’ fractions by simply averaging numerators and denominators. But that’s exactly what we do get here.
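The comparison can be checked on a single case. The sketch below uses assumed credences of 2/3 and 1/3 (my choice, not from the PDF). It pits the linear mixture of the unconditional probabilities against the function you get by averaging the conditional probabilities. In this independent case that second function is just the independent one with Pr(p) = Pr(q) = 1/2.

```python
from fractions import Fraction as F

WORLDS = [(True, True), (True, False), (False, True), (False, False)]

def independent(pr_p, pr_q):
    """Joint distribution with p and q independent."""
    return {(p, q): (pr_p if p else 1 - pr_p) * (pr_q if q else 1 - pr_q)
            for p, q in WORLDS}

def conditional_inaccuracy(pr, world):
    """Squared error of the two conditional probabilities whose conditions hold."""
    p_val, q_val = world
    q_given_p = pr[(p_val, True)] / (pr[(p_val, True)] + pr[(p_val, False)])
    p_given_q = pr[(True, q_val)] / (pr[(True, q_val)] + pr[(False, q_val)])
    return (q_given_p - int(q_val)) ** 2 + (p_given_q - int(p_val)) ** 2

def expected(pr, judge):
    """Expected conditional inaccuracy of pr according to `judge`."""
    return sum(judge[w] * conditional_inaccuracy(pr, w) for w in WORLDS)

def average_expected(pr, pr1, pr2):
    return (expected(pr, pr1) + expected(pr, pr2)) / 2

pr1 = independent(F(2, 3), F(2, 3))  # assumed inputs
pr2 = independent(F(1, 3), F(1, 3))

unconditional_mix = {w: (pr1[w] + pr2[w]) / 2 for w in WORLDS}
conditional_mix = independent(F(1, 2), F(1, 2))  # averaged conditionals

score_unconditional = average_expected(unconditional_mix, pr1, pr2)
score_conditional = average_expected(conditional_mix, pr1, pr2)
print(score_unconditional, score_conditional)  # 40/81 1/2
```

The unconditional mixture has the lower average expected conditional inaccuracy (40/81 against 1/2). That is one instance of the result, not a proof of it.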

I don’t have a very good sense of why this result holds. I sort of do understand why Moss’s results hold, I think, though perhaps not well enough to explain! But just why this result obtains is a bit of a mystery to me. But it seems to hold. And I think it’s one more reason to think that the obvious answer to our original question is the right one; if you want to aggregate two probability functions, just average them.

## 14 Replies to “Accuracy measures on conditional probabilities”

1. Hey Brian. Something must be amiss here, because Dalkey (1972) showed that if both unconditional and conditional probabilities are aggregated using a linear average (in fact, using any common function of the individual conditional and unconditional probabilities), then the result must be a dictatorial aggregation (i.e., it just returns one of the two averaged functions back again). I’d have to look more closely to see what’s going wrong, but I recommend looking at that impossibility theorem.

2. I didn’t mean to aggregate both conditional and unconditional probabilities linearly. What I initially wanted to do was compare the function we get from unconditional linear averaging, and the function we get from conditional linear averaging, and find some measures by which the first does better, and some measures by which the second does better. But I was kinda shocked to find that even when I focussed on expected accuracy of the conditional probabilities themselves, the linear average of the unconditional probabilities still did better.

3. Sorry, Brian. Thanks, I think I see what you’re up to now. I’ll have to work this through and see what the more general cases look like.

4. BTW — Richard Pettigrew is here visiting (for the rest of the week). You should talk to him about this if you can — it’s really his cup of tea…

5. I’ve got a Mathematica notebook for the 3-proposition case. I think the result breaks down there. Let’s get together sometime and go over it, OK?

6. TomD says:

Hello Brian! Could you say something to give me a feel for what the “conditional accuracy” score is supposed to measure?

Here’s how I think about (unconditional) accuracy. There’s this space of probability functions. One of them (the, erm, true one) assigns 1 to every truth and 0 to every falsehood. An inaccuracy score measures how far a probability function is from the true one.

So what about conditional accuracy? “Proximity to the truth” doesn’t seem to make much sense here, because there are no truths corresponding to conditional probabilities…

7. That’s really interesting Branden – it would be great to see the example.

Tom,

I was thinking that it measured something like dispositions to be accurate upon learning. Maybe this example will help.

According to Pr1, p and q are probabilistically independent, and each has probability 0.5.

According to Pr2, Pr(p | q) = Pr(q | p) = 0.9. But Pr(p & q) is merely 0.1.

Now consider a world where p and q are both true. Someone whose credences track Pr1 will have more accurate unconditional probabilities than a person whose credences track Pr2; they assign 0.25 probability to the truth, not merely 0.1. But a person whose credences track Pr2 will be well placed to learn about the world. If they get either piece of information p or q, they will update to having credence 0.9 in the other proposition. And that’s good, since that other proposition is also true.

I can think of some circumstances where we’re interested both in how accurate some credences are, but also how well they converge to the truth when other true information comes in. So I initially wanted to add conditional accuracy measures in to the unconditional accuracy measures. But for the purpose of proving stuff, it was useful to start with a measure that only looked at the conditional probabilities.
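The numbers in that example can be worked through as a rough sketch. Pr2's joint distribution here is reconstructed from Pr2(p | q) = Pr2(q | p) = 0.9 and Pr2(p & q) = 0.1, which forces Pr2(p) = Pr2(q) = 1/9. The function names are just for illustration.

```python
from fractions import Fraction as F

WORLDS = [(True, True), (True, False), (False, True), (False, False)]

def brier(pr, actual):
    """Unconditional Brier inaccuracy in the world `actual`."""
    return sum((pr[c] - (1 if c == actual else 0)) ** 2 for c in WORLDS)

def conditional_inaccuracy(pr, world):
    """Squared error of the two conditional probabilities whose conditions hold."""
    p_val, q_val = world
    q_given_p = pr[(p_val, True)] / (pr[(p_val, True)] + pr[(p_val, False)])
    p_given_q = pr[(True, q_val)] / (pr[(True, q_val)] + pr[(False, q_val)])
    return (q_given_p - int(q_val)) ** 2 + (p_given_q - int(p_val)) ** 2

# Pr1: p and q independent, each with probability 1/2.
pr1 = {w: F(1, 4) for w in WORLDS}

# Pr2: Pr(p & q) = 1/10 and Pr(p) = Pr(q) = 1/9, so both conditionals are 9/10.
pr2 = {(True, True): F(1, 10),
       (True, False): F(1, 9) - F(1, 10),
       (False, True): F(1, 9) - F(1, 10),
       (False, False): 1 - F(1, 10) - 2 * (F(1, 9) - F(1, 10))}

actual = (True, True)  # the world where p and q are both true
print(brier(pr1, actual), brier(pr2, actual))
print(conditional_inaccuracy(pr1, actual), conditional_inaccuracy(pr2, actual))  # 1/2 1/50
```

In the world where both are true, Pr1 has the lower unconditional inaccuracy but Pr2 has by far the lower conditional inaccuracy, which is the contrast the example is meant to bring out.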

8. Correction: your result DOES hold for n=3 propositions! I’ve sent you the notebook establishing this via email (and I think I’m starting to see from that how the argument generalizes). Let’s talk about it…

9. smoss says:

I believe there’s a good explanation for your surprising result. A bit of Mathematica playing suggests that the conditional inaccuracy measure you introduced is a credence-eliciting scoring rule in the n = 2 case. And so the proof on page 9 of my paper shows why the mean of my friend’s expected conditional inaccuracy and my expected conditional inaccuracy is minimized by the mean of our credence functions. In a bit more detail for those who haven’t read the paper, but have read Brian’s note: let’s define Pr_3 as (Pr_1 + Pr_2)/2. Some simple algebra confirms that E_3(X) equals the expected value of X according to Pr_3. Since A is credence-eliciting, E_3(A(Pr)) is minimized when Pr = Pr_3. And that entails Brian’s result.

It’d be nice to have a proof or counterexample for the conjecture that if an accuracy measure R is credence-eliciting, then the conditional accuracy measure built from R in the way you suggest is also credence-eliciting. But I do have to grade a stack of final exams today.

10. Ah, I think I see what I was missing now.

In Sarah’s original presentation, the unconditional credences over a partition of possibility space play two roles. First, they are an input into the AEV function (what here becomes E_3). Second, they are an input into the scoring rule, since all the scoring rules discussed in her paper (like in most sensible discussions of this!) are defined over unconditional probabilities.

What I don’t think I’d really appreciated (though in retrospect it is kind of obvious) is that the proof on page 9 only uses the first of these properties. Even if the scoring rule uses conditional probabilities; heck, even if the scoring rule is massively holistic and gives bonus points for making independent propositions independent, that result will hold. And as a consequence any credence-eliciting scoring rule, even a massively holistic one, will say that we do best by taking linear averages.

However, this situation is somewhat unrealistic – it seems strange that both of us would be absolutely certain of the bias of the coins. And if we have any positive credence that the bias is something other than what we think it is, then the flips won’t be independent on either of our individual credence functions either – any heads result gives us some evidence (perhaps only a tiny amount) that the bias towards heads is slightly higher than we initially thought, and thus increases our credence that the next coin comes up heads.

The property our individual credence functions will have for the sequence of coin flips isn’t independence, but exchangeability. And exchangeability is preserved under linear combination of probability functions. (If each permutation of a sequence has the same probability for me, and they all have the same probability for you, then the average of our two probabilities will also be the same for each permutation.)
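A small sketch of this point, assuming three flips and that each of us is certain of a different bias (2/3 and 1/3, chosen only for illustration): the averaged function is exchangeable even though the flips are no longer independent on it.

```python
from fractions import Fraction as F
from itertools import permutations, product

def iid_flips(bias, n):
    """Distribution over length-n H/T sequences for independent flips with a known bias."""
    dist = {}
    for seq in product("HT", repeat=n):
        prob = F(1)
        for flip in seq:
            prob *= bias if flip == "H" else 1 - bias
        dist[seq] = prob
    return dist

def is_exchangeable(dist):
    """True if every reordering of a sequence has the same probability as the sequence."""
    return all(dist[seq] == dist[perm] for seq in dist for perm in permutations(seq))

mine = iid_flips(F(2, 3), 3)   # assumed bias
yours = iid_flips(F(1, 3), 3)  # assumed bias
average = {seq: (mine[seq] + yours[seq]) / 2 for seq in mine}

first_heads = sum(pr for seq, pr in average.items() if seq[0] == "H")
second_heads = sum(pr for seq, pr in average.items() if seq[1] == "H")
both_heads = sum(pr for seq, pr in average.items() if seq[0] == "H" and seq[1] == "H")

print(is_exchangeable(average))                   # True
print(both_heads == first_heads * second_heads)   # False
```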

I suppose the more troubling case is the one where we think two topics really are doxastically independent, and not just causally independent (as is normally the case for exchangeable variables). Say that proposition A is about life on Mars, and proposition B is about whether the bank is open on Saturdays. It seems more plausible that the propositions should be independent even in the aggregated credence function then.

But I think that the lack of independence in the aggregated function is again not that bad. If I started out with credence 2/3 in each of them, and you started out with credence 1/3 in each of them, then on our aggregated function, either one will turn out to be some amount of evidence in favor of the other. But one way to interpret this result is just to say that in some sense, finding out that the first proposition was true is evidence that I’m a better guide to the truth in general, and finding out that it was false is evidence that you’re a better guide to the truth in general. Thus, it’s not totally surprising that our aggregated function becomes closer to mine when conditionalizing on something that I thought was more likely than you thought it was.
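The arithmetic behind this can be sketched as follows, using the 2/3 and 1/3 credences from the paragraph above and assuming A and B are independent on each individual function. Conditionalizing the average on A gives the same answer as re-weighting the two of us in proportion to how likely each thought A was.

```python
from fractions import Fraction as F

# Credences from the example: mine are 2/3 in each proposition, yours 1/3 in each.
mine_a, mine_b = F(2, 3), F(2, 3)
yours_a, yours_b = F(1, 3), F(1, 3)

# Equal-weight average of the two functions (A and B independent on each, by assumption).
avg_a = (mine_a + yours_a) / 2
avg_a_and_b = (mine_a * mine_b + yours_a * yours_b) / 2

# Conditionalize the averaged function on A.
avg_b_given_a = avg_a_and_b / avg_a

# The same number, seen as a re-weighted average of our credences in B.
weight_mine = (mine_a / 2) / avg_a
reweighted_b = weight_mine * mine_b + (1 - weight_mine) * yours_b

print(avg_b_given_a, weight_mine, reweighted_b)  # 5/9 2/3 5/9
```

So learning A moves the effective weight on my credences from 1/2 to 2/3, and the aggregated credence in B rises from 1/2 to 5/9.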

So perhaps the lack of independence in the aggregated function is not a problem at all – in cases of causal independence of chance processes, exchangeability is what matters, and that is preserved; in other cases, independence is not preserved, but that’s because one proposition is evidence that one of our agents was a “better” judge than the other, so maybe it shouldn’t be surprising that the group converges towards that person’s credences.

12. Hi Kenny,

I agree that these considerations suggest independence is too strong a requirement. But there’s still something very odd about straight averaging I think.

Let’s say we have a lot of evidence that X and Y are equally reliable across a range of subjects, and we learn their credences about life on Mars and bank hours. We then learn X is right about bank hours. Now we have a little more evidence that X is more reliable. But straight averaging means we should act as if X has become much more reliable. That seems odd, even if you think independence is too strong a constraint.

13. Brian – I agree. I definitely think there should be at least considerations of taking geometric means, or averaging odds ratios, or any number of other averaging procedures beyond straight arithmetic means of probabilities – not to mention any number of non-equal-weight views. But I just think that the objection based on independence isn’t as pressing as it seems at first (or at least, needs to be qualified in terms of “too much non-independence” rather than “non-independence”).