We discuss an example of statistical efficiency vs. computational efficiency in the context of shrinkage priors in some high dimensional problems. Most of the material is from the author's recent joint work (Bhattacharya et al., 2012). The author takes this opportunity to thank his collaborators Anirban Bhattacharya, Debdeep Pati and David Dunson for sharing their valuable insights. He also thanks Andrew Gelman, Eric Laber, Abel Rodriguez and James Scott for helpful discussions, but is solely responsible for the contents of this blog.
Penalization methods in high dimensional problems, being the flavor of the last decade, have enjoyed much attention, resulting in a rich
theoretical literature establishing their optimality properties. There
are also fast algorithms and compelling applied results underlining the
success story of these methods. The overwhelming emphasis has been on
rapidly producing a point estimator with good empirical and theoretical
properties. However, to the best of this author’s knowledge, a
satisfactory theory for uncertainty quantification for penalized
estimators is not yet in place.
Given that most shrinkage estimators correspond to the mode of a
Bayesian posterior, it is natural to ask whether we can use the whole
posterior distribution to provide a probabilistic measure of
uncertainty. The hope is to be able to choose a default shrinkage prior that leads to optimality properties similar to those shown for penalization and other approaches, but with the added advantage of the entire posterior distribution concentrating at the
optimal rate, instead of just focusing on the point estimate. Furthermore, taking a Bayesian perspective has distinct advantages in terms of tuning parameter choice, allowing key penalty parameters to be marginalized over the posterior distribution instead of relying on cross-validation. Of course, for Bayesian methods to be useful for uncertainty quantification, there should exist fast Markov chain Monte Carlo (MCMC) algorithms to explore these gigantic model spaces. As life would have it, there are priors that are theoretically efficient but for which MCMC is slow in high dimensions, and conversely, there are priors that admit fast computational algorithms but are not theoretically efficient. The point is further detailed in a concrete example below.
Point mass priors vs. shrinkage priors
For a high-dimensional vector $\theta \in \mathbb{R}^n$, a natural, time-tested way to incorporate sparsity in a Bayesian framework is to use point mass mixture priors,
$$\theta_i \sim (1 - \pi)\, \delta_0 + \pi\, g, \quad i = 1, \ldots, n, \qquad (1)$$
where $\pi \in (0, 1)$ is the prior guess on the model
size (sparsity level), $\delta_0$ is a point mass at zero, and $g$ is an absolutely continuous
density on $\mathbb{R}$. It is common to place a beta prior on $\pi$,
leading to a beta-Bernoulli prior on the model size, which conveys an
automatic multiplicity adjustment (Scott and Berger, 2010). Let us also
take the model to be the popular normal means model (though the
discussion here holds in more generality):
$$y_i = \theta_i + \epsilon_i, \quad \epsilon_i \stackrel{\text{i.i.d.}}{\sim} \mathrm{N}(0, 1), \quad i = 1, \ldots, n.$$
For the model above, Castillo and van der Vaart (2012) established that the posterior corresponding to the prior (1), with an appropriate beta prior on $\pi$, contracts at the frequentist minimax rate. Thus, at least theoretically, the prior given in (1) is “hard to beat”. Unfortunately, simulating posterior draws corresponding to such priors is not an easy task, as the MCMC algorithm has to explore a model space containing $2^n$ possible models. This is not feasible even for moderate $n$.
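To see where the combinatorial difficulty comes from, here is a minimal sketch (my own illustration, not from any of the papers cited): if we fix $\pi$ and take the slab $g$ to be Gaussian, the coordinates decouple and the posterior inclusion probabilities are available in closed form; it is the beta prior on $\pi$ (and, more generally, regression structure) that couples the coordinates and forces MCMC over the $2^n$ models. The fixed values of $\pi$ and the slab variance below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from the normal means model with a sparse theta.
n, s = 1000, 20                       # dimension and true sparsity level
theta = np.zeros(n)
theta[:s] = rng.normal(0, 3, size=s)  # a few moderately large signals
y = theta + rng.normal(size=n)        # y_i = theta_i + N(0, 1) noise

# Prior (1) with pi FIXED and a Gaussian slab g = N(0, psi2):
# marginally y_i ~ (1 - pi) N(0, 1) + pi N(0, 1 + psi2), independently.
pi_fixed, psi2 = s / n, 9.0

def normal_pdf(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

spike = (1 - pi_fixed) * normal_pdf(y, 1.0)
slab = pi_fixed * normal_pdf(y, 1.0 + psi2)
incl_prob = slab / (spike + slab)     # P(theta_i != 0 | y_i), closed form

print("mean inclusion prob., true signals:", incl_prob[:s].mean().round(2))
print("mean inclusion prob., true nulls:  ", incl_prob[s:].mean().round(2))
```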
Shrinkage priors (there is an amazing variety of them), often constructed as continuous scale mixtures of Gaussians, are appealing alternatives as they can potentially lead to dramatically more efficient posterior computation. They also allow separate control of the level of sparsity and the size of the signal coefficients. Thus, one formulation of the million-dollar question is: What family of shrinkage priors achieves the theoretical efficiency of the point mass priors?
In the recent preprint (Bhattacharya et al., 2012), we give a reasonably satisfactory answer to the above question. One of the main results is that a broad class of Gaussian scale mixture priors, including the Bayesian Lasso and other commonly used choices such as ridge regression, is sub-optimal for the normal means model, i.e., these priors do not achieve the same theoretical efficiency as the point mass priors.
The key insight obtained in Bhattacharya et al. (2012) is that the sub-optimal performance of shrinkage priors is due to the lack of borrowing of information across coordinates a priori. Indeed, as Polson and Scott (2010) noted, essentially all such shrinkage priors can be represented as global-local (GL) mixtures of Gaussians,
$$\theta_j \mid \psi_j, \tau \sim \mathrm{N}(0, \psi_j \tau), \qquad \psi_j \sim f, \qquad \tau \sim g,$$
where $\tau$ controls global shrinkage towards the origin while the local scales $\psi_j$ allow coordinate-specific deviations in the degree of shrinkage. Thus the local scales $\psi_j$ are i.i.d. from $f$; this lack of borrowing from the other coordinates hurts the performance.
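To make the GL representation concrete, here is a sketch of prior draws from two familiar members of the family (the specific choices of $f$ and the fixed $\tau$ are my own illustrative assumptions): the Bayesian lasso takes $f$ to be exponential, while the horseshoe takes $\psi_j = \lambda_j^2$ with $\lambda_j$ half-Cauchy.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 1000, 1.0  # tau fixed for illustration; in practice tau ~ g

# Bayesian lasso as a GL mixture: psi_j ~ Exp(rate 1/2) yields
# standard Laplace marginals for theta_j.
psi_lasso = rng.exponential(scale=2.0, size=n)
theta_lasso = rng.normal(0.0, np.sqrt(psi_lasso * tau))

# Horseshoe as a GL mixture: psi_j = lambda_j^2, lambda_j ~ half-Cauchy.
lam = np.abs(rng.standard_cauchy(size=n))
theta_hs = rng.normal(0.0, np.sqrt(lam**2 * tau))

# In both cases the psi_j are drawn independently: apart from the single
# global scale tau, nothing ties the coordinates together.
print("lasso     99.9% quantile of |theta|:", np.quantile(np.abs(theta_lasso), 0.999))
print("horseshoe 99.9% quantile of |theta|:", np.quantile(np.abs(theta_hs), 0.999))
```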
This is a well documented phenomenon in a related context. Indeed, the famous James-Stein estimator is an example of a superior estimator constructed by borrowing information across coordinates. Recall that this estimator can be considered as the posterior mean of a Bayesian model. Similarly, a high dimensional shrinkage prior which does not impose sufficient dependence across coordinates is not expected to perform well. There is ample evidence of this in Bhattacharya et al. (2012). Note that, because of the global scale $\tau$, the coordinates are not marginally independent, but it turns out that the dependence induced by $\tau$ is often weak.
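As a concrete reminder of the borrowing phenomenon, here is a minimal sketch of the James-Stein estimator (the simulation settings are mine); the shrinkage factor is learned from all coordinates jointly, which is exactly the kind of information sharing at issue:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
theta = rng.normal(0, 0.5, size=n)  # true means, modest in size
y = theta + rng.normal(size=n)      # observe y ~ N(theta, I_n)

# James-Stein: shrink every coordinate toward 0 by a common factor
# estimated from ALL coordinates at once via ||y||^2.
theta_js = (1 - (n - 2) / np.sum(y**2)) * y

print("squared error of the MLE (y itself):", np.sum((y - theta) ** 2).round(1))
print("squared error of James-Stein:       ", np.sum((theta_js - theta) ** 2).round(1))
```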
Inspired by the idea of inducing dependence, a new class of priors, called Dirichlet-Laplace (DL) priors, was proposed in Bhattacharya et al. (2012). These priors seem to achieve both optimal theoretical efficiency and efficient posterior simulation. Initial results for DL priors relative to a variety of competitors show promise, though much remains to be done. The DL prior can be written as
$$\theta_j \mid \phi_j, \tau \sim \mathrm{DE}(\phi_j \tau), \qquad (\phi_1, \ldots, \phi_n) \sim \mathrm{Dir}(a, \ldots, a), \qquad \tau \sim g,$$
where $\mathrm{DE}(b)$ denotes the double-exponential (Laplace) distribution with scale $b$ and $g$ is a suitable density for the global scale $\tau$ (say the half-Cauchy; see Gelman (2006)). Clearly, the DL prior induces a much stronger dependence at the local scale via the Dirichlet distribution.
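Here is a sketch of how one might draw from the DL prior; the tiny Dirichlet concentration parameter and the half-Cauchy choice for $g$ below are illustrative assumptions rather than a definitive recipe.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
a = 1.0 / n  # tiny Dirichlet concentration (illustrative choice)

# phi ~ Dir(a, ..., a): most of the unit mass lands on a few coordinates.
phi = rng.dirichlet(np.full(n, a))

# tau ~ half-Cauchy, one suitable choice for g (Gelman, 2006).
tau = np.abs(rng.standard_cauchy())

# theta_j | phi, tau ~ DE(phi_j * tau): Laplace with its own scale.
theta = rng.laplace(0.0, phi * tau)

# The constraint sum(phi_j) = 1 ties the local scales together a priori:
# a large scale on one coordinate forces the others to shrink.
print("fraction of coordinates with |theta_j| > 0.01:",
      np.mean(np.abs(theta) > 0.01))
```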
There is also an efficient Gibbs sampler which leads to fast posterior simulation. Let me also add that, among the existing shrinkage priors, the horseshoe prior (Carvalho et al., 2010) stands out and is very similar in performance to our DL prior. But the horseshoe seems much harder to analyze theoretically.
Have we answered the $1,000,000 Question?
Not yet! What we have done amounts to constructing a shrinkage prior which seems to achieve the same theoretical efficiency as that of the point-mass priors. However, the second piece of the puzzle, computational efficiency, has not been rigorously studied in Bayesian shrinkage prior circles. The key reason for this, in my opinion, is the lack of easy-to-use criteria for measuring computational efficiency. For instance, statistical efficiency can be measured in terms of the minimax rate, posterior convergence rates, risk under various loss functions, etc. On the other hand, for measuring the computational efficiency of MCMC algorithms, the only prevalent notion is that of the effective sample size (ESS). For most MCMC algorithms, obtaining the exact ESS theoretically is almost impossible in real problems; even obtaining reasonable bounds is out of reach using state-of-the-art methods.
My point here is that to rigorously answer the efficiency trade-off, some quantitative notion of computational efficiency should be part of the equation. For instance, within a fixed computational budget, how do the effective sample sizes corresponding to two different shrinkage priors with the same statistical efficiency compare? Fortunately, this can be done by simulation and is a low-hanging fruit. Again, this question is not often seriously pursued because the notion of a fixed computational budget is artificial for moderate sized problems. But for really gigantic problems, which seem to be the flavor of the next decade, I believe the notion of analyzing algorithms with a fixed computational budget, and incorporating the results into the balance sheet of efficiency for Bayesian procedures, will be an important one.
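As a sketch of the kind of budgeted comparison I have in mind (the ESS estimator below is the standard autocorrelation-sum estimate, truncated at the first non-positive lag, and the AR(1) chains are stand-ins for samplers under two priors with the same stationary target):

```python
import numpy as np

def ess(chain):
    """Estimate effective sample size of one MCMC trace from its
    autocorrelations, truncated at the first non-positive lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    m = len(x)
    acf = np.correlate(x, x, mode="full")[m - 1:] / (np.arange(m, 0, -1) * x.var())
    s = 0.0
    for rho in acf[1:]:
        if rho <= 0:
            break
        s += rho
    return m / (1 + 2 * s)

# Stand-ins for two samplers targeting the same distribution but with
# different mixing: AR(1) chains with low vs. high autocorrelation.
rng = np.random.default_rng(4)
def ar1(rho, m=5000):
    z = rng.normal(size=m)
    x = np.empty(m); x[0] = z[0]
    for t in range(1, m):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * z[t]
    return x

# Same computational budget (m draws each); compare effective sample sizes.
print("ESS, fast-mixing sampler:", round(ess(ar1(0.3))))
print("ESS, slow-mixing sampler:", round(ess(ar1(0.95))))
```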
References

A. Bhattacharya, D. Pati, N.S. Pillai, and D.B. Dunson. Bayesian shrinkage. arXiv preprint arXiv:1212.6088, 2012.

C.M. Carvalho, N.G. Polson, and J.G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.

I. Castillo and A. van der Vaart. Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069–2101, 2012.

A. Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3):515–534, 2006.

N.G. Polson and J.G. Scott. Shrink globally, act locally: Sparse Bayesian regularization and prediction. In Bayesian Statistics 9 (J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith and M. West, eds.), pages 501–538. Oxford University Press, New York, 2010.

J.G. Scott and J.O. Berger. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5):2587–2619, 2010.