While I'm doing all this probability blogging, I should have something to say about this, but actually I don't. Well, not really.
There's been an interesting exchange between John Quiggin and Kevin Drum about the value of data mining. Story to date: if we take a truckload of data about shopping patterns, then it is very probable that we will turn up some correlations that are, taken individually, antecedently very improbable. There's no paradox here. Deal five cards and it is very probable (in fact certain) that you'll get some combination of cards that is antecedently very improbable. The problem is that usually the best way to tell that a correlation is signal rather than noise is that it turns up in the data in a way that is antecedently very improbable. It seems that by data mining we are guaranteed to get evidence that some correlations exist, even if none exist at all. Should we believe that such a correlation is projectable, especially if we get data that, had it turned up in a survey designed to test whether that very correlation stably exists, would have been taken as very good evidence for just that conclusion? Should, in other words, the evidential value of evidence depend on the hypothesis we were trying to test? Intuitively, yes.
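To make the worry concrete, here is a rough back-of-the-envelope simulation of my own (nothing from Quiggin or Drum; the number of shoppers, the number of variables, and all the names are made up for illustration). It generates purely random "shopping" variables, tests every pair of them for correlation, and counts how many pairs look significant at the usual 5% level even though there is no real signal anywhere in the data.

```python
import random
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)
n_shoppers = 100      # observations per variable (arbitrary choice)
n_variables = 50      # purely random "shopping pattern" variables
data = [[random.gauss(0, 1) for _ in range(n_shoppers)]
        for _ in range(n_variables)]

# Rough significance cutoff for |r| at the 5% level with n = 100
# (about 1.96 / sqrt(n); an approximation, not an exact test).
cutoff = 1.96 / math.sqrt(n_shoppers)

hits = sum(1
           for i in range(n_variables)
           for j in range(i + 1, n_variables)
           if abs(pearson(data[i], data[j])) > cutoff)

pairs = n_variables * (n_variables - 1) // 2
print(f"{hits} of {pairs} pairs look 'significant' by chance alone")
```

Roughly one pair in twenty clears the bar by luck alone, so with over a thousand pairs tested we get dozens of "discoveries" from pure noise.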
But (long but worthwhile quote to follow):
But there's a more fundamental question at hand, and it's the point of this whole essay: is a correlation deduced from a huge multivariate analysis really less reliable than one deduced from a focused study? The argument seems to me to be this: if you have a hypothesis and test it, and you find a correlation, that's good. But if you don't have a hypothesis, and you find a correlation, then it's probably just by chance.

But it's not. The numbers don't care whether you have a hypothesis or not, and in both cases there's a 5% chance that the correlation is due to chance. In both cases you will have to reproduce the results independently if you want to increase your certainty.

Is this a trivial point? I don't think so, because I think it points to a serious flaw in a lot of statistical analyses: the feeling that if you test a specific hypothesis and find a strong correlation, it's probably real. Oh sure, you will make the usual disclaimers about 95% confidence intervals, but the reality is that the results get treated seriously.

I'm not sure they should be. Or rather, I'm not sure they should be treated any differently than the data mining techniques that produce masses of correlations. I suspect that the disillusionment among economists (and others) with data mining is real, but mostly because it punches you in the nose with the fact that correlations are often just artifacts of chance. The same is true of focused studies, but because these correlations back up a claim we wish to make, we mentally discount the possibility of random error.
This is wrong. Numbers are numbers, and no matter where they come from they should be treated with the same respect or lack thereof. To suggest otherwise, I think, is merely to admit that your conclusions are based not just on the numbers themselves, but also on some previous belief, a Bayesian argument that we will leave for another day.
This seems wrong to me, but I'm either too tired, or too busy, or most likely too thick-headed, to see just why. So instead I'll tell you a story. There's a computer with a 52-key keyboard, one key for each card in a deck of cards, and each key is labelled as such. Monkeys are trained to press five keys in a row, any keys, that's their choice, at regular time intervals.
Situation One. I deal myself
five cards face down, and a monkey, the only monkey in the room, presses the
five keys for those five cards.
Situation Two. I deal myself
five cards face down, and two monkeys, two of the three million monkeys in the
room, monkeys 1287548 and 2649702, press the keys for those five cards.
I think that in situation one I should suspect there's a connection between the monkey and the cards. Probably not some spooky connection, ESP or whatever, but that there's some causal connection or other between the cards being dealt and the monkey pressing the keys. Maybe he had a chance to see them, or maybe he ordered the deck before I dealt them (though I seem to remember shuffling them) or something. Maybe not, but my prior probability that there's some funny business going on between that monkey and the deck of cards is now non-negligible. On the other hand, in situation two I don't think there's any funny business going on between monkey 1287548 and the deck, nor between monkey 2649702 and the deck.
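For what it's worth, here is the raw arithmetic behind the two situations, as a sketch of my own, on the assumption that a "hit" means pressing the keys for the right five cards in any order:

```python
from math import comb

# Probability that one monkey, pressing five keys at random,
# names exactly the five cards dealt (order ignored -- an
# assumption about what counts as getting it "right").
p_hit = 1 / comb(52, 5)
print(f"Chance of a single monkey matching the hand: 1 in {comb(52, 5):,}")

# Situation One: one monkey, one chance.  A hit is astonishing.
# Situation Two: three million monkeys each get a chance.
n_monkeys = 3_000_000
expected_hits = n_monkeys * p_hit
p_at_least_one = 1 - (1 - p_hit) ** n_monkeys
print(f"Expected matches among {n_monkeys:,} monkeys: {expected_hits:.2f}")
print(f"Chance that at least one monkey matches: {p_at_least_one:.2%}")
# With roughly 2.6 million possible hands and 3 million monkeys,
# about one matching monkey is expected by luck alone, so a match
# or two is not much evidence of funny business about those monkeys.
```

Whether that arithmetic settles anything about monkey 1287548 in particular is, of course, part of what the questions below are asking.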
Take-home questions.
What are the analogies and disanalogies between situation
one/situation two and a planned experiment/data mining?
In situation one, what should be my prior probability that if this monkey gets the five cards right, there's funny business going on between this monkey and the deck? In situation two, what should be my prior probability that if monkey 1287548 gets the five cards right, there's funny business going on between monkey 1287548 and the deck? Could these be the same? What could justify their being different?
I have some thoughts about these questions, but they're not very well formed, not even for a blog. So being a good blogger I'll just post a link instead, to Eric Funkhouser's paper on coincidences.