## Weekend Reading: Cosma Shalizi (2009): On the Certainty of the Bayesian Fortune-Teller

**Cosma Shalizi** of the *Center for Evidence-Based Haruspicy* (2009): On the Certainty of the Bayesian Fortune-Teller: "**Three-Toed Sloth: Slow Takes from the Canopy (My Very Own Internet Tradition): June 16, 2009:**

Attention conservation notice: 2300 words of technical, yet pretentious and arrogant, dialogue on a point which came up in a manuscript-in-progress, as well as in my long-procrastinated review of *Plight of the Fortune Tellers*. Why don't you read that book instead?

Q: You really shouldn't write in library books, you know; and if you do, your marginalia should be more helpful, or less distracting, than just "wrong wrong wrong!"

A: No harm done; my pen and I are both transparent rhetorical devices. And besides, Rebonato is wrong in those passages.

Q: Really? Isn't his point that it's absurd to pretend you could actually estimate something like a probability of an interest rate jump so precisely that there's a real difference between calling it 0.500 000 and calling it 0.499 967? Isn't it yet more absurd to think that you could get the 99.5 percent annual value-at-risk--the amount of money you'd expect to lose once in two hundred years--down to four significant figures, from any data set, let alone one that covers just five years and so omits:

> not only the Black Death, the Thirty Years' War, the Barbarian invasions, and the fall of the Roman Empire, but even the economic recession of 1991--the only meaningful recession in the last twenty years

(as of 2006), to say nothing of the

> famous corporate loan book crises of the Paleochristian era

(p. 218)?

A: Of course all that's absurd, and Rebonato is right to call people on it. By the time his book came out it was too late to do much good, but if people had paid attention to such warnings I dare say we wouldn't be quite so badly off now, and they had better listen in the future.

Q: So what's your problem? Oh, wait, let me guess: you're upset because Rebonato's a Bayesian, aren't you? Don't bother, I can tell that that's it. Look, we all know that you've got objections to that approach, but at this point I'm starting to think that maybe you have issues. Isn't this sort of reflexive hostility towards a whole methodology--something you must run into every day of work--awkward and uncomfortable? Embarrassing, even? Have you thought about seeking help?

A: Actually, I have a serious point to make here. What Rebonato wants is entirely right-headed, but it fits very badly with his Bayesianism, because Bayesian agents are never uncertain about probabilities; at least, not about the probability of any observable event.

Q: But isn't Bayesianism about representing uncertainty, and making decisions under uncertainty?

A: Yes, but Bayesian agents never have the kind of uncertainty that Rebonato (sensibly) thinks people in finance should have.

Q: Let me try to pin you down in black and white. [*Opens notebook*] I have here on one side of the page our old friend, the well-known probability space (Ω, F, Prob). Coming out of it, in the middle, is a sequence of random variables X1, X2, ... , Xn, ... , which have some joint distribution or other. (And nothing really depends on its being a sequence, I could use a random field on a network or whatever you like, add in covariates, etc.)

On the other side of the random variables, looking at them, I have a standard-issue Bayesian agent. The agent has a hypothesis space, each point m of which is a probability distribution for the random sequence. This hypothesis space is measurable, and the agent also has a probability measure, a.k.a. prior distribution, on this space. The agent uses Bayes's rule to update the distribution by conditioning, so it has a sequence of measures D0, D1, etc.

A: I think you are missing an "As you know, Bob", but yes, this is the set-up I have in mind.

Q: Now I pick my favorite observable event f, a set in the joint sigma-field of the Xi. For each hypothesis m, the probability m(f) is well-defined. The Bayesian thinks this is a random variable M(f), since it has a distribution D on the hypothesis space. How is that not being uncertain about the probability of f?

A: Well, in the first place--

Q: I am not interested in quibbles about D being a Dirac delta function.

A: Fine, assume that D doesn't put unit mass on any single hypothesis, and that it gives non-zero weight to hypotheses with different values of m(f). But remember how Bayesian updating works: The Bayesian, by definition, believes in a joint distribution of the random sequence X and of the hypothesis M. (Otherwise, Bayes's rule makes no sense.) This means that by integrating over M, we get an unconditional, marginal probability for f:

$$P_n(f) = \mathbf{E}_{D_n}\left[ M(f \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) \right]$$

Q: Wait, isn't that the denominator in Bayes's rule?

A: Not quite, that equation defines a measure--the predictive distribution--and the denominator in Bayes's rule is the density of that measure (with n=0) at the observed sequence.

Q: Oh, right, go on.

A: As an expectation value, Pn(f) is a completely precise number. The Bayesian has no uncertainty whatsoever in the probabilities it gives to anything observable.
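A minimal sketch of this point, assuming a toy hypothesis space of three IID coins and a uniform prior (the names `hypotheses`, `prior`, `posterior`, and `predictive` are illustrative, not from the dialogue). The posterior spreads its weight over several values of m(f), yet the probability the agent assigns to the observable event "next toss is heads" is one exact rational number:

```python
from fractions import Fraction

# Hypothetical hypothesis space: each hypothesis m is an IID coin with bias m.
hypotheses = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = [Fraction(1, 3)] * 3  # D0, an assumed uniform prior


def posterior(weights, biases, data):
    """Condition the weights on a sequence of 0/1 observations (Bayes's rule)."""
    unnormalized = []
    for w, b in zip(weights, biases):
        likelihood = Fraction(1)
        for x in data:
            likelihood *= b if x == 1 else 1 - b
        unnormalized.append(w * likelihood)
    total = sum(unnormalized)  # prior predictive probability of the data
    return [u / total for u in unnormalized]


def predictive(weights, biases):
    """Pn(f) for f = 'next toss is heads': the Dn-expectation of m(f)."""
    return sum(w * b for w, b in zip(weights, biases))


data = [1, 1, 0, 1]
d_n = posterior(prior, hypotheses, data)
p_n = predictive(d_n, hypotheses)
print("posterior over hypotheses:", d_n)
print("predictive P(next = heads):", p_n)
```

Exact fractions are used so that nothing here can be blamed on rounding: the spread lives in `d_n`, while `p_n` has none.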

Q: But won't those probabilities change over time, as it gets new data?

A: Yes, but this just means that the random variables aren't independent (under the Bayesian's distribution over observables). Integrating m with respect to the prior D0 gives us the infinite-dimensional distribution of a stochastic process, one which is not (in general) equal to any particular hypothesis, though of course it lies in their convex hull; the simple hypotheses are extremal points. If the individual hypotheses are (laws of) independent, identically-distributed random sequences, their mixture will be exchangeable. If the individual hypotheses are ergodic, their mixture will be asymptotically mean-stationary.
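The exchangeable-but-dependent claim can be checked on a small case. This is a sketch under assumed values: two hypothetical IID coins with biases 1/4 and 3/4, mixed with equal prior weight (`biases`, `weights`, and `mixture_prob` are illustrative names).

```python
from fractions import Fraction
from itertools import product

biases = [Fraction(1, 4), Fraction(3, 4)]   # two hypothetical IID coins
weights = [Fraction(1, 2), Fraction(1, 2)]  # assumed prior D0


def mixture_prob(seq):
    """Probability of a 0/1 sequence under the prior-weighted mixture."""
    total = Fraction(0)
    for w, b in zip(weights, biases):
        p = Fraction(1)
        for x in seq:
            p *= b if x == 1 else 1 - b
        total += w * p
    return total


# Exchangeable: the probability depends only on the number of heads,
# not on the order in which they arrive.
for seq in product([0, 1], repeat=3):
    assert mixture_prob(seq) == mixture_prob(tuple(sorted(seq)))

# Not independent: seeing a head raises the probability of the next head.
p_head = mixture_prob((1,))
p_head_given_head = mixture_prob((1, 1)) / mixture_prob((1,))
print(p_head, p_head_given_head)
```

Under each individual coin the tosses are independent; the dependence appears only after the hypothesis has been integrated out, which is why the predictive probabilities move as data come in.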

Q: Don't you mean "stationary" rather than "asymptotically mean-stationary"?

A: No; see chapter 25 here, or better yet that trifler's authority.

Q: You were saying.

A: Right. The Bayesian integrates out m and gets a stochastic process where the Xi are dependent. As far as anything observable goes, the Bayesian's predictions, and therefore its actions, are those of an agent which treats this stochastic process as certainly correct.

Q: What happens if the Bayesian agent uses some kind of hierarchical model, or the individual hypotheses are themselves exchangeable/stationary?

A: The only thing that would change, for these purposes, is the exact process the Bayesian is committed to. Convex mixtures of convex mixtures of points in C are convex mixtures of points in C.
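One way to write out the collapse, as a sketch in notation that is assumed here rather than taken from the dialogue: let a hyperparameter $\theta$ have hyperprior $\pi$, and let $D(\cdot \mid \theta)$ be the prior over hypotheses given $\theta$. Then for any observable event $f$:

```latex
\int \left[ \int m(f) \, D(dm \mid \theta) \right] \pi(d\theta)
  = \int m(f) \, \bar{D}(dm),
\qquad
\bar{D}(A) = \int D(A \mid \theta) \, \pi(d\theta).
```

The hierarchical agent therefore behaves, as far as observables go, exactly like a one-level agent with the flattened prior $\bar{D}$.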

Q: So to sum up, you're saying that the Bayesian agent is uncertain about the truth of the unobservable hypotheses (that's their posterior distribution), and uncertain about exactly which observable events will happen (that's their predictive distribution), but not uncertain about the probabilities of observables.

A: Right. (Some other time I'll explain how that helps make Bayesian models testable.) And--here's where we get back to Rebonato--all the things he is worried about, like values-at-risk and so forth, are probabilities of observable events. Put a Bayesian agent in the risk-modeling situation he talks about, and it won't just say that the 99.5% VaR is 109.7 million euros rather than 110 million, it will give you as many significant digits as you have time for.

Q: So let me read you something from pp. 194--195:

> Once frequentists accept (at a given statistical level of confidence) the point estimate of a quantity (say, a percentile), they tend to act as if the estimated number were the true value of the parameter. Remember that, for a frequentist, a coin cannot have a 40% chance of being biased. Either the coin is fair or it is biased. Either we are in a recession or we are not. We simply accept or reject these black-or-white statements at a certain confidence level... A Bayesian approach automatically tells us that a parameter (say, a percentile) has a whole distribution of possible values attached to it, and that extracting a single number out of this distribution (as I suggested above, the average, the median, the mode, or whatever) is a possibly sensible, but always arbitrary, procedure. No single number distilled from the posterior distribution is a primus inter pares: only the full posterior distribution enjoys this privileged status, and it is our choice what use to make of it.

This seems entirely reasonable; where do you think it goes wrong?

A: You mean, other than the fact that point estimates do not have "statistical levels of confidence", and that Rebonato has apparently forgotten about actual confidence intervals?

Q: Let's come back to that.

A: He is running together parameters of the unobserved hypotheses, and the properties of the predictive distribution on which the Bayesian acts. I can take any function I like of the hypothesis, g(m) say, and use it as a parameter of the distribution. If I have enough parameters gi and they're (algebraically) independent of each other, there's a 1-1 map between hypotheses and parameter vectors--parameter vectors are unique names for hypotheses. I could make parts of those names be readily-interpretable aspects of the hypothetical distributions, like various percentiles or biases. The distribution over hypotheses then gives me a distribution over percentiles conditional on the hypothesis M. But we don't know the true hypothesis, and on the next page Rebonato goes on to cast "ontological" doubt about whether it even exists. (How he can be uncertain about the state of something he thinks doesn't exist is a nice question.) We only have the earlier observations, so we need to integrate or marginalize out M, and this collapses the distribution of percentiles down to a single exact value for that percentile.
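A rough illustration of the collapse, assuming three hypothetical Gaussian loss models and a made-up posterior over them (`sigmas`, `post`, `predictive_cdf`, and `predictive_quantile` are illustrative names, and bisection is just one convenient way to invert the mixture CDF). Each hypothesis has its own 99.5% quantile, so the posterior induces a distribution over quantiles; the predictive distribution, however, has exactly one:

```python
from statistics import NormalDist

# Hypothetical hypotheses: losses are N(0, sigma^2) for one of these sigmas.
sigmas = [1.0, 2.0, 3.0]
post = [0.5, 0.3, 0.2]   # an assumed posterior over the hypotheses
level = 0.995

# Under each hypothesis the 99.5% quantile is a definite number, so the
# posterior over hypotheses induces a distribution over quantiles.
quantiles = [NormalDist(0.0, s).inv_cdf(level) for s in sigmas]


def predictive_cdf(x):
    """CDF of the posterior-weighted mixture (the predictive distribution)."""
    return sum(w * NormalDist(0.0, s).cdf(x) for w, s in zip(post, sigmas))


def predictive_quantile(p, lo=-100.0, hi=100.0, tol=1e-10):
    """Invert the mixture CDF by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predictive_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


var_995 = predictive_quantile(level)
print("per-hypothesis quantiles:", quantiles)
print("predictive 99.5% quantile:", var_995)
```

The single number `var_995` is the quantile of the marginal distribution over observables, which is the object the agent acts on; it is not a summary, such as the mean or median, of the list `quantiles`.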

Q: Couldn't we avoid that integration somehow?

A: Integrating over the posterior distribution is the whole point of Bayesian decision theory.

Q: Let's go back to the VaR example. If you try estimating the size of once-in-two-hundred-year losses from five years of data, your posterior distribution has got to be pretty diffuse.

A: Actually, it can be arbitrarily concentrated by picking the right prior.

Q: Fine, for any reasonable prior it needs to be pretty diffuse. Shouldn't the Bayesian agent be able to use this information to avoid recklessness?

A: That depends on the loss function. If the loss involves which hypothesis happens to be true, sure, it'll make a difference. (That's how we get the classic proof that if the loss is the squared difference between the true parameter and the point estimate, the best decision is the posterior mean.) But if the loss function just involves what observable events actually take place, then no. Or, more exactly, it might make sense to show more caution if your posterior distribution is very diffuse, but that's not actually licensed by Bayesian decision theory; it is "irrational" and sets you up for a Dutch Book.
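Both cases can be written out in a few lines. The notation is assumed for illustration: $\theta$ is the true parameter, $a$ the action or point estimate, $D_n$ the posterior, $P_n$ the predictive distribution, and $X$ the future observables.

```latex
% Loss on the hypothesis: the spread of the posterior enters via the variance.
\mathbf{E}_{D_n}\left[ (\theta - a)^2 \right]
  = \mathrm{Var}_{D_n}(\theta) + \left( \mathbf{E}_{D_n}[\theta] - a \right)^2,
\quad \text{minimized at } a = \mathbf{E}_{D_n}[\theta].

% Loss on observables only: the posterior enters solely through P_n.
\mathbf{E}\left[ L(a, X) \right]
  = \int \left[ \int L(a, x) \, m(dx) \right] D_n(dm)
  = \int L(a, x) \, P_n(dx).
```

In the second case, two posteriors that yield the same $P_n$ lead to the same optimal action, however diffuse or concentrated each may be.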

Q: Should I be worried about having a Dutch Book made against me?

A: I can't see why, but some people seem to find the prospect worrying.

Q: So what should people do?

A: I wish I had a good answer. Many of Rebonato's actual suggestions--things like looking at a range of scenarios, robust strategies, not treating VaR as the only thing you need, etc.--make a lot of sense. (When he is making these practical recommendations, he does not counsel people to engage in a careful quantitative elicitation of their subjective prior probabilities, and then calculate posterior distributions via Bayes's rule; I wonder why.) I would also add that there are such things as confidence intervals, which do let you make probabilistic guarantees about parameters.

Q: What on Earth do you mean by a "probabilistic guarantee"?

A: That either the right value of the parameter is in the confidence set, or you get very unlucky with the data (how unlucky depends on the confidence level), or the model is wrong. Unlike coherence, coverage connects you to reality. This is basically why Haavelmo told the econometricians, back in the day, that they needed confidence intervals, not point estimates.
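A small simulation of what coverage means, as a sketch under strong assumptions: Gaussian data with a known standard deviation, so that the textbook interval for the mean applies (the function `coverage` and all its default values are illustrative). When the model is right, the interval misses the true value only about as often as the confidence level allows:

```python
import random
from statistics import NormalDist, fmean


def coverage(true_mean=3.0, sigma=2.0, n=25, level=0.95, trials=20000, seed=0):
    """Fraction of simulated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    half_width = z * sigma / n ** 0.5          # known-sigma interval
    hits = 0
    for _ in range(trials):
        sample_mean = fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / trials


rate = coverage()
print("empirical coverage:", rate)
```

Drawing the data from a different distribution than the one used to build the interval would be one way to see the third clause ("or the model is wrong") in action.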

Q: So how did the econometricians come to make fetishes of unbiased point-estimators and significance tests of equality constraints?

A: No doubt for the same reason they became convinced that linear and logistic regression were all you'd ever need to deal with any empirical data ever.

Q: Anyway, that "get the model right" part seems pretty tricky.

A: Everyone is going to have to deal with that. (You certainly still have to worry about mis-specification with Bayesian updating.) You can test your modeling assumptions, and you can weaken them so you are less susceptible to mis-specification.

Q: Don't you get weaker conclusions--in this case, bigger confidence intervals--from weaker modeling assumptions?

A: That's an unavoidable trade-off, and it's certainly not evaded by going Bayesian (as Rebonato knows full well). With very weak, and therefore very defensible, modeling assumptions, the confidence interval on, say, the 99.5% VaR may be so broad that you can't devise any sensible strategy which copes with that whole range of uncertainty, but that's the math's way of telling you that you don't have enough data, and enough understanding of the data, to talk about once-in-two-hundred-year events. I suppose that, if they have financial engineers in the stationary state, they might eventually be able to look back on enough sufficiently-converged data to do something at the 99% or even 99.5% level.

Q: Wait, doesn't that suggest that there is a much bigger problem with all of this? The economy is non-stationary, right?

A: Sure looks like it.

Q: So how can we use statistical models to forecast it?

A: If you want someone to solve the problem of induction, the philosophy department is down the stairs and to the left.