The Holy War – Bayesian vs. Frequentist
When it comes to inference in computer science, there is a holy war between the good guys (Bayesian computer scientists) and the bad guys (the notorious frequentist gang). Bayesians try to keep track of the possibility of all different scenarios and how likely they are according to the observations and prior knowledge, and then check to see what the future is going to look like under each scenario (usually in a compact mathematical formulation but with concrete probabilistic interpretation). The frequentist gang, on the other hand, usually take the most likely scenario (or try to accept/reject different scenarios) and use heuristics all around the place, often times without concrete reasoning as to why these heuristics work.
Bayesian inference is, of course, computationally expensive. So the bad guys keep mocking the good guys that their methods are only theoritical and suitable for textbooks rather than real life. Now guess what… They are right! Exact Bayesian inference is not an option in most cases. This means that you need to do approximations. The funny thing is that these approximations usually lead to the exact same heuristics that are used by frequentists. But the good part of going this way is that you know what you are doing and why these methods work. As the great spritual leader Mahdi Milani Fard XIII once said:
Think Bayesian
Act frequentist
And live free
They should write that in gold and put it over the main enterance of all CS departments.
P.S.: This view is of course not exclusively mine. Many CS researchers like to think this way.
<< Subway evacuation | Vancouver 2009 >>
http://planning.cs.uiuc.edu/node471.html kind of goes into comparing the two approaches. But compares them from a different perspective. It actually ends with, “The frequentist approach attempts to be more conservative and rigorous, with the result being that weaker statements are made regarding decisions.” I guess their weaker statements are the ones that are based on the heuristics, like you said. Do you know a book that talks about this?
http://arxiv.org/abs/0804.0486
But who am I kidding? In my world almost always confidence intervals and credible intervals match perfectly. At any rate, these days the Bayesian view is getting more attention from the CS ppl, mainly because it better matches probabilistic inference methods known to them. We like to keep track of probability of both the null hypothesis and the alternative one.
To see why the frequentist approach is more conservative, think about the case that we want to compare different hypotheses. In a binary classification task, for instance, one would need “decide” which class the datapoint is likely to belong to. The frequentist approach will try to reject/fail to reject the hypothesis that the datapoint belongs to, say, class 0. That is not a decision, as you could potentially fail to reject on both sides. A frequentist would tell you that a decision cannot be made. That’s the place the conservative nature of the frequentist kicks in. A Bayesian, on the other hand, can tell you which is more likely to be the correct class.
One should consider two aspects of this debate: (1) philosophical and (2) practical.
On the philosophical side, there are indeed important differences between these two viewpoints. I am not going to talk about this issue.
On the practical side, when we are going to do statistical inference, there are big differences between these two approaches. What you said on Bayesian approach seems more or less correct. But contrary to what you said, the difference is not that frequentists are lousy and use heuristics more often, and Bayesians are theoretically sound. Actually the usual practice is that Bayesian approaches are without any sound theoretical guarantee (mainly because of the infamous prior selection dilemma – and I am not even mentioning all sorts of computational approximations that people use), as opposed to frequentist approaches that give guarantees such as consistency, convergence rate, confidence interrvals, etc.
This poor practice of Bayesian statistics, however, should not necessarily be like this. There are people (e.g. Jayanta K. Ghosh) who study Bayesian approaches and give rigorous frequentistic guarantees. Unfortunately, they are in minority.
Final words: Bayesian statistical inference is fine, but one should be careful when s/he is using it.
There are certain philosophical differences between these two views, especially between frequentists and subjective Bayesian,
At any rate, you are certainly correct in the fact that frequentist view provides rigorous guarantees regardless of the actual distribution of the data (there is usually some mild conditions though). But those guarantees give rise to notoriously conservative bounds. Take PAC learning, for instance, in the context of learning algorithms. They are uniformly approximately correct across all sampling distribution, but seems to be useless in practice with terribly conservative sampling complexities. The Bayesian view, on the other hand, relies heavily on the assumption that the prior is correct (consistent with the distribution of the data). So if the assumptions are correct, then acting according to the posterior is “the best” you can do. If, on the other hand, the assumptions are violated, you might even fail to converge to the optimal solution in the limit of infinite sample sizes (this is shown in a couple of works both empirically and and analytically by counter examples). However, if the prior is correct, you are going to see much faster convergence (optimally fastest convergence in some cases) than those with PAC learning.
There is the (relatively) new PAC-Bayesian literature that try combine the good aspects of both Bayesian and PAC analysis. You can get guarantees regardless of the correctness of your prior, but still get the speedup boost of incorporating domain knowledge into your prior. This is what I’m studying at the moment as survey for my PhD comprehensive exam.
Final words: I’m not a statistician and my view of Bayesian vs Frequentist is a little bit distorted because of the tainted literature (heavy, and I mean heavy, use of heuristics for instance) in computer science. I’m certainly willing to read more into this to fix the misunderstandings.
But consider the thought experiment that in the extreme case, the prior can simply be a degenerate distribution concentrated on the “true” solution of the problem (in the frequentist sense) and in this case, we don’t even need any data in order to be correct. Where is the borderline of this extreme case and being reasonable?
Regarding PAC-like guarantees, I agree with you that they might be conservative in some sense. But that’s not without reason: if the class of problems is actually difficult, then the bound should show the difficulty, and ignoring the constants (which are not tight merely because of the convenience of the mathematicians who’ve derived them), those bounds actually reflect the difficulty of the problem. Why?! because in many cases upper and lower bounds match. For example, we cannot learn one time differentiable function faster than some specific rate unless we know that it is actually, say, twice differentiable!
Nevertheless, if the bound is conservative because of constants, it doesn’t mean that the algorithm is performing badly in practice. Especially if the algorithm is “adaptive”.
(Maybe the exception is the Structural Risk Minimization algorithms that simply use VC dimension to penalize the complexity of model, and I believe they might not be very good in practice. Of course, there are tighter notions of complexity that can help.)
And yes, PAC-Bayes idea is interesting.
I noticed that you are working with Joelle. What is your PhD topic?
Good luck to you for your exam.
“Adaptive”, “SRM”, “penalizing complexity”, etc, etc, these are what I call heuristics. And most of them (and I emphasize the “most” part) are without theoretical guarantees. The good thing about PAC-Bayes seems to be the fact that it is a form of SRM with theoretical guarantees. It penalizes both based on the prior and likelihood.
I haven’t fixed a topic yet (still a first year, first term PhD). But I’m thinking about model-based Bayesian RL with kernels (which could extend to continuous domains with GPs). A Bayesian version of non-parametric model-based RL by peter stone et al. I’m not even sure about the literature on that for now. There is a long road ahead.