The Holy War – Bayesian vs. Frequentist

battle

When it comes to inference in computer science, there is a holy war between the good guys (Bayesian computer scientists) and the bad guys (the notorious frequentist gang). Bayesians try to keep track of the possibility of all different scenarios and how likely they are according to the observations and prior knowledge, and then check to see what the future is going to look like under each scenario (usually in a compact mathematical formulation but with concrete probabilistic interpretation). The frequentist gang, on the other hand, usually take the most likely scenario (or try to accept/reject different scenarios) and use heuristics all around the place, often times without concrete reasoning as to why these heuristics work.

Bayesian inference is, of course, computationally expensive. So the bad guys keep mocking the good guys that their methods are only theoritical and suitable for textbooks rather than real life. Now guess what… They are right! Exact Bayesian inference is not an option in most cases. This means that you need to do approximations. The funny thing is that these approximations usually lead to the exact same heuristics that are used by frequentists. But the good part of going this way is that you know what you are doing and why these methods work. As the great spritual leader Mahdi Milani Fard XIII once said:

Think Bayesian
Act frequentist
And live free

They should write that in gold and put it over the main enterance of all CS departments.

P.S.: This view is of course not exclusively mine. Many CS researchers like to think this way.

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • email
<< Subway evacuation | Vancouver 2009 >>

9 Comments

PardisOctober 22nd, 2009 at 01:45
What does the “live free” part mean? :)

http://planning.cs.uiuc.edu/node471.html kind of goes into comparing the two approaches. But compares them from a different perspective. It actually ends with, “The frequentist approach attempts to be more conservative and rigorous, with the result being that weaker statements are made regarding decisions.” I guess their weaker statements are the ones that are based on the heuristics, like you said. Do you know a book that talks about this?

MahdiOctober 22nd, 2009 at 08:43
I don’t know of any book that might have a good discussion on this (that’s because I don’t read books in general :D ). You will need to study both sides and compare things yourself. If you want to study them from a philosophical point of view, you can check this:
http://arxiv.org/abs/0804.0486

But who am I kidding? In my world almost always confidence intervals and credible intervals match perfectly. At any rate, these days the Bayesian view is getting more attention from the CS ppl, mainly because it better matches probabilistic inference methods known to them. We like to keep track of probability of both the null hypothesis and the alternative one.

To see why the frequentist approach is more conservative, think about the case that we want to compare different hypotheses. In a binary classification task, for instance, one would need “decide” which class the datapoint is likely to belong to. The frequentist approach will try to reject/fail to reject the hypothesis that the datapoint belongs to, say, class 0. That is not a decision, as you could potentially fail to reject on both sides. A frequentist would tell you that a decision cannot be made. That’s the place the conservative nature of the frequentist kicks in. A Bayesian, on the other hand, can tell you which is more likely to be the correct class.

PardisOctober 23rd, 2009 at 01:19
Thanks for the overview and pointers!

AminNovember 3rd, 2009 at 22:36
Ha ha… nice but I donno how come I always try to say “HABA 8-&” specially after quotations part…

SoloGenNovember 30th, 2009 at 01:48
This view on frequentist interpretation of probability and a statement such as “The frequentist gang … use heuristics all around the place, often times without concrete reasoning as to why these heuristics work” is distorted and far from the truth.

One should consider two aspects of this debate: (1) philosophical and (2) practical.

On the philosophical side, there are indeed important differences between these two viewpoints. I am not going to talk about this issue.

On the practical side, when we are going to do statistical inference, there are big differences between these two approaches. What you said on Bayesian approach seems more or less correct. But contrary to what you said, the difference is not that frequentists are lousy and use heuristics more often, and Bayesians are theoretically sound. Actually the usual practice is that Bayesian approaches are without any sound theoretical guarantee (mainly because of the infamous prior selection dilemma – and I am not even mentioning all sorts of computational approximations that people use), as opposed to frequentist approaches that give guarantees such as consistency, convergence rate, confidence interrvals, etc.
This poor practice of Bayesian statistics, however, should not necessarily be like this. There are people (e.g. Jayanta K. Ghosh) who study Bayesian approaches and give rigorous frequentistic guarantees. Unfortunately, they are in minority.

Final words: Bayesian statistical inference is fine, but one should be careful when s/he is using it.

There are certain philosophical differences between these two views, especially between frequentists and subjective Bayesian,

MahdiNovember 30th, 2009 at 02:13
Thanks a lot for your comment SoloGen. You do have a point in there. However, there is a little bit of humor involved in some of the statements in this post (that’s what I usually say when someone proves me wrong :D ).

At any rate, you are certainly correct in the fact that frequentist view provides rigorous guarantees regardless of the actual distribution of the data (there is usually some mild conditions though). But those guarantees give rise to notoriously conservative bounds. Take PAC learning, for instance, in the context of learning algorithms. They are uniformly approximately correct across all sampling distribution, but seems to be useless in practice with terribly conservative sampling complexities. The Bayesian view, on the other hand, relies heavily on the assumption that the prior is correct (consistent with the distribution of the data). So if the assumptions are correct, then acting according to the posterior is “the best” you can do. If, on the other hand, the assumptions are violated, you might even fail to converge to the optimal solution in the limit of infinite sample sizes (this is shown in a couple of works both empirically and and analytically by counter examples). However, if the prior is correct, you are going to see much faster convergence (optimally fastest convergence in some cases) than those with PAC learning.

There is the (relatively) new PAC-Bayesian literature that try combine the good aspects of both Bayesian and PAC analysis. You can get guarantees regardless of the correctness of your prior, but still get the speedup boost of incorporating domain knowledge into your prior. This is what I’m studying at the moment as survey for my PhD comprehensive exam.

Final words: I’m not a statistician and my view of Bayesian vs Frequentist is a little bit distorted because of the tainted literature (heavy, and I mean heavy, use of heuristics for instance) in computer science. I’m certainly willing to read more into this to fix the misunderstandings.

SoloGenNovember 30th, 2009 at 03:18
I agree with what you said that if the prior is correct, we are happy.
But consider the thought experiment that in the extreme case, the prior can simply be a degenerate distribution concentrated on the “true” solution of the problem (in the frequentist sense) and in this case, we don’t even need any data in order to be correct. Where is the borderline of this extreme case and being reasonable?

Regarding PAC-like guarantees, I agree with you that they might be conservative in some sense. But that’s not without reason: if the class of problems is actually difficult, then the bound should show the difficulty, and ignoring the constants (which are not tight merely because of the convenience of the mathematicians who’ve derived them), those bounds actually reflect the difficulty of the problem. Why?! because in many cases upper and lower bounds match. For example, we cannot learn one time differentiable function faster than some specific rate unless we know that it is actually, say, twice differentiable!

Nevertheless, if the bound is conservative because of constants, it doesn’t mean that the algorithm is performing badly in practice. Especially if the algorithm is “adaptive”.
(Maybe the exception is the Structural Risk Minimization algorithms that simply use VC dimension to penalize the complexity of model, and I believe they might not be very good in practice. Of course, there are tighter notions of complexity that can help.)

And yes, PAC-Bayes idea is interesting.

I noticed that you are working with Joelle. What is your PhD topic?

Good luck to you for your exam.

MahdiNovember 30th, 2009 at 03:41
Good point. You seem to have a great grasp of the topic (certainly need to talk to you on this topic this NIPS). My objection, mainly, is that if you have prior domain knowledge, there should be a way of incorporating that into your learning algorithm. The problem might be very difficult in the general case when you ignore the domain knowledge (and thus consistent with the loose bounds), but might be way easier when you take the prior into account. The injection of this prior knowledge is done using heuristics (e.g. complexity penalties) with the frequentist view and prior distribution in the Baysian view. Notice that both of them boil down to forms of SRM (though you might not agree with this definition).

“Adaptive”, “SRM”, “penalizing complexity”, etc, etc, these are what I call heuristics. And most of them (and I emphasize the “most” part) are without theoretical guarantees. The good thing about PAC-Bayes seems to be the fact that it is a form of SRM with theoretical guarantees. It penalizes both based on the prior and likelihood.

I haven’t fixed a topic yet (still a first year, first term PhD). But I’m thinking about model-based Bayesian RL with kernels (which could extend to continuous domains with GPs). A Bayesian version of non-parametric model-based RL by peter stone et al. I’m not even sure about the literature on that for now. There is a long road ahead.

NateDecember 25th, 2009 at 18:04
While I realize the great potential of Bayesian statistics (I’ve been actively researching MCMC methods myself), I still like the mathematical tractability that frequentist methods bring to the table. Currently, I feel that there is not enough emphasis on model validation amongst Bayesians; a marvelously complex model does little good if it is mis-specified. Also, Bayes factors are useless unless informative PROPER priors are specified. Merely specifying a flat noninformative prior will lead to misleading results via Bayes factors. I think the best way to verify a Bayesian model is through rigorous posterior predictive checking with various test statistics of interest. Anyway, there is no doubt that Bayesian methods offer tremendous promise, but they should be utilized thoughtfully.

Leave a comment

Your comment