
On the Fundamental Limits of Adaptive Sensing

Ery Arias-Castro, Emmanuel J. Candès, and Mark A. Davenport



Abstract

Suppose we can sequentially acquire arbitrary linear measurements of an n-dimensional vector x, resulting in the linear model y = Ax + z, where z represents measurement noise. If the signal is known to be sparse, one would expect the following folk theorem to be true: choosing an adaptive strategy, which cleverly selects the next row of A based on what has been previously observed, should do far better than a nonadaptive strategy, which sets the rows of A ahead of time and thus does not try to learn anything about the signal in between observations. This paper shows that the folk theorem is false. We prove that the advantages offered by clever adaptive strategies and sophisticated estimation procedures (no matter how intractable) over classical compressed acquisition/recovery schemes are, in general, minimal.

Keywords: sparse signal estimation, adaptive sensing, compressed sensing, support recovery, information bounds, hypothesis tests.
1 Introduction
This paper is concerned with the fundamental question of how well one can estimate a sparse vector from noisy linear measurements in the general situation where one has the flexibility to design those measurements at will (in the language of statistics, one would say that there is nearly complete freedom in designing the experiment). This question is of importance in a variety of sparse signal estimation or sparse regression scenarios, but perhaps arises most naturally in the context of compressive sensing (CS) [3, 4, 9]. In a nutshell, CS asserts that it is possible to reliably acquire sparse signals from just a few linear measurements selected a priori. More specifically, suppose we wish to acquire a sparse signal x ∈ ℝ^n. A possible CS acquisition protocol would proceed as follows. (i) Pick an m × n random projection matrix A (the first m rows of a random unitary matrix) in advance, and collect data of the form

    y = Ax + z,    (1.1)

where z is a vector of errors modeling the fact that any real-world measurement is subject to at least a small amount of noise. (ii) Recover the signal by solving an ℓ_1-minimization problem such as the Dantzig selector [5] or the LASSO [23]. As is now well known, theoretical results guarantee that such convex programs yield accurate solutions. In particular, when z = 0, the recovery is exact, and the error degrades gracefully as the noise level increases.
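As an illustration, the following sketch (ours, not the paper's experimental setup) implements protocol steps (i) and (ii): the sensing matrix consists of the first m rows of a random unitary matrix, and the ℓ_1 recovery step uses plain iterative soft-thresholding (ISTA) for the LASSO in place of the Dantzig selector; all parameter values are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 32, 2

# (i) Sensing matrix: first m rows of a random unitary matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q[:m]

# A k-sparse signal and noisy measurements y = A x + z.
x = np.zeros(n)
support = [7, 40]
x[support] = 5.0
y = A @ x + 0.1 * rng.standard_normal(m)

# (ii) l1 recovery via iterative soft-thresholding for the LASSO;
# step size 1 is valid because the rows of A are orthonormal.
lam = 0.05
xhat = np.zeros(n)
for _ in range(500):
    g = xhat + A.T @ (y - A @ xhat)                       # gradient step
    xhat = np.sign(g) * np.maximum(np.abs(g) - lam, 0.0)  # soft threshold

print(sorted(np.argsort(np.abs(xhat))[-k:]))  # largest entries of the estimate
```

With this much signal power, the two largest entries of the estimate sit on the true support.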
A remarkable feature of the CS acquisition protocol is that the sensing is completely nonadaptive; that is to say, no effort whatsoever is made to understand the signal. One simply selects a collection {a_i} of sensing vectors a priori (the rows of the matrix A), and measures correlations between the signal and these vectors. One then uses numerical optimization, e.g., linear programming [5], to tease out the sparse

Department of Mathematics, University of California, San Diego {eariasca@ucsd.edu}
Departments of Mathematics and Statistics, Stanford University {candes@stanford.edu}
Department of Statistics, Stanford University {markad@stanford.edu}
signal x from the data vector y. While this may make sense when there is no noise, this protocol might draw some severe skepticism in a noisy environment. To see why, note that in the scenario above, most of the power is actually spent measuring the signal at locations where there is no information content, i.e., where the signal vanishes. Specifically, let a be a row of the matrix A which, in the scheme discussed above, has uniform distribution on the unit sphere. The dot product is

    ⟨a, x⟩ = Σ_j a_j x_j,

and since most of the coordinates x_j are zero, one might think that most of the power is wasted. Another way to express all of this is that, by design, the sensing vectors are approximately orthogonal to the signal, yielding measurements with low signal power, or a poor signal-to-noise ratio (SNR).
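A quick Monte Carlo check (an illustration of ours, not from the paper) confirms the point: for a uniformly distributed on the unit sphere, E⟨a, x⟩² = ‖x‖₂²/n, so a k-sparse signal with entries of size μ yields an expected measurement power of only kμ²/n.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, mu = 100, 5, 1.0

x = np.zeros(n)
x[:k] = mu  # k-sparse signal with amplitude mu

# Draw many sensing vectors uniformly on the unit sphere.
g = rng.standard_normal((200_000, n))
a = g / np.linalg.norm(g, axis=1, keepdims=True)

power = np.mean((a @ x) ** 2)
print(power, k * mu**2 / n)  # empirical power vs. the prediction k*mu^2/n
```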
The idea behind adaptive sensing is that one should localize the sensing vectors around locations where the signal is nonzero in order to increase the SNR, or equivalently, not waste sensing power. In other words, one should try to learn as much as possible about the signal while acquiring it in order to design more effective subsequent measurements. Roughly speaking, one would (i) detect those entries which are nonzero or significant, (ii) progressively localize the sensing vectors on those entries, and (iii) estimate the signal from such localized linear functionals. This is akin to the game of 20 questions, in which the search is narrowed by formulating the next question in a way that depends upon the answers to the previous ones. Note that in some applications, such as in the acquisition of wideband radio frequency signals, aggressive adaptive sensing mechanisms may not be practical because they would require near-instantaneous feedback. However, there do exist applications where adaptive sensing is practical and where the potential benefits of adaptivity are too tantalizing to ignore.
The formidable possibilities offered by adaptive sensing give rise to the following natural folk theorem.

Folk Theorem. The estimation error one can get by using a clever adaptive sensing scheme is far better than what is achievable by a nonadaptive scheme.

In other words, learning about the signal along the way and adapting the questions (the next sensing vectors) to what has been learned to date is bound to help. In stark contrast, the main result of this paper is this:

Surprise. The folk theorem is wrong in general. No matter how clever the adaptive sensing mechanism, no matter how intractable the estimation procedure, in general it is not possible to achieve a fundamentally better mean-squared error (MSE) of estimation than that offered by a naive random projection followed by ℓ_1 minimization.

The rest of this article is mostly devoted to making this claim precise. In doing so, we shall also show that adaptivity does not help in obtaining a fundamentally better estimate of the signal support, which is of independent interest.
1.1 Main result
To formalize matters, we assume that the error vector z in (1.1) has i.i.d. N(0, σ²) entries. Then if A is a random projection with unit-norm rows as discussed above, [5] shows that the Dantzig selector estimate x̂_DS (obtained by solving a simple linear program) achieves an MSE obeying

    (1/n) E‖x̂_DS − x‖₂² ≤ C log(n) · (k σ²/m),    (1.2)

where C is some numerical constant. The bound holds universally over all k-sparse signals¹ provided that the number of measurements m is sufficiently large (on the order of at least k log(n/k)). The fundamental question is thus: how much lower can the mean-squared error be when (i) we are allowed to sense the signal adaptively and (ii) we can use any estimation algorithm we like to recover x?
The distinction between adaptive and nonadaptive sensing can be expressed in the following manner. Begin by rewriting the statistical model (1.1) as

    y_i = ⟨a_i, x⟩ + z_i,    i = 1, . . . , m,    (1.3)

in which a power constraint imposes that each a_i is of norm at most 1, i.e., ‖a_i‖₂ ≤ 1. In a nonadaptive sensing scheme the vectors a_1, . . . , a_m are chosen in advance and do not depend on x or z, whereas in an adaptive setting, the measurement vectors may be chosen depending on the history of the sensing process; i.e., a_i is a (possibly random) function of (a_1, y_1, . . . , a_{i−1}, y_{i−1}).
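The two regimes can be phrased as a simple programming interface (an illustrative sketch of ours; the function names and the toy adaptive rule are not from the paper): a nonadaptive scheme fixes all rows up front, while an adaptive scheme computes each a_i from the history, subject to the power constraint ‖a_i‖₂ ≤ 1.

```python
import numpy as np

def sense(x, next_vector, m, sigma=1.0, rng=None):
    """Run m measurements y_i = <a_i, x> + z_i, where the callback
    next_vector(history) returns the next sensing vector a_i."""
    rng = np.random.default_rng(rng)
    history = []  # list of (a_i, y_i) pairs
    for _ in range(m):
        a = next_vector(history)
        y = a @ x + sigma * rng.standard_normal()
        history.append((a, y))
    return history

n = 8
x = np.zeros(n)
x[3] = 10.0

# Nonadaptive: the a_i are fixed in advance (here, cycling through the basis).
nonadaptive = lambda h: np.eye(n)[len(h) % n]

# Adaptive (toy rule): scan once, then re-measure the most promising coordinate.
def adaptive(h):
    if len(h) < n:
        return np.eye(n)[len(h)]            # first pass: probe every coordinate
    best = max(h[:n], key=lambda ay: ay[1])  # largest observation so far
    return best[0]

hist = sense(x, adaptive, m=2 * n, rng=0)
print(len(hist))
```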
If we follow the principle that you cannot get something for nothing, one might argue that giving up the freedom to adaptively select the sensing vectors would result in a far worse MSE. Our main contribution is to show that this is not the case. We prove that the minimax rate of adaptive sensing strategies is within a logarithmic factor of that of nonadaptive schemes.
Theorem 1. For k, n sufficiently large, with k < n/2, and any m, the following minimax lower bound holds:

    inf_{x̂} sup_{x: k-sparse} (1/n) E‖x̂ − x‖₂² ≥ (4/7) · (k σ²/m).

For all k, n, Section 2 develops a lower bound of the form c_0 kσ²/m. For instance, for any k, n such that k ≤ n/4 with k ≥ 15, n ≥ 100, the lower bound holds if 4/7 is replaced by (1 − 2k/n)/14.

This bound can also be established for the full range of k and n at the cost of a somewhat worse constant. See Theorem 4 for details. In short, Theorem 1 says that if one ignores a logarithmic factor, then adaptive measurement schemes cannot (substantially) outperform nonadaptive strategies. While seemingly counterintuitive, we find that precisely the same sparse vectors which determine the minimax rate in the nonadaptive setting are essentially so difficult to estimate that by the time we have identified the support, we will have already exhausted our measurement budget.

Theorem 1 should not be interpreted as an exotic minimax result. For example, the second statement is a direct consequence of a lower bound on the Bayes risk under the prior that selects k coordinates uniformly at random and sets each of these coordinates to the same positive amplitude μ; see Corollary 1 in Section 2.2. In fact, this paper develops other results of this type with other signal priors.
1.2 Connections with testing problems
The arguments we develop to reach our conclusions are quite intuitive and simple, and yet they seem different from the classical Fano-type arguments for obtaining information-theoretic lower bounds (see Section 1.3 for a discussion of the latter methods). Our approach involves proving a lower bound for the Bayes risk under a suitable prior. The arguments are simpler if we consider the prior that chooses the support by selecting each coordinate with probability k/n. To obtain such a lower bound, we make a detour through testing (multiple testing, to be exact). Supposing we take m possibly adaptive measurements of the form (1.3), we show that estimating the support of the signal is hard, which in turn yields a lower bound on the MSE.

¹A signal is said to be k-sparse if it has at most k nonzero components.
Support recovery in Hamming distance. We consider the multiple testing problem of deciding which components of the signal are zero and which are not. We show that no matter which adaptive strategy and tests are used, the Hamming distance between the estimated and true supports is large. Put differently, the multiple testing problem is shown to be difficult. In passing, this establishes that adaptive schemes are not substantially better than nonadaptive schemes for support recovery.

Estimation with mean-squared loss. Any estimator with a low MSE can be converted into an effective support estimator simply by selecting the largest coordinates or those above a certain threshold. Hence, a lower bound on the Hamming distance immediately gives a lower bound on the MSE.

The crux of our argument is thus to show that it is not possible to choose sensing vectors adaptively in such a way that the support of the signal may be estimated accurately. Although our method of proof is nonstandard, it is still based on well-known tools from information and decision theory.
1.3 Differential entropies and Fano-type arguments
Our approach is significantly different from classical methods for getting lower bounds in decision and information theory. Such methods rely one way or the other on Fano's inequality [8], and are all intimately related to methods in statistical decision theory (see [24, 25]). Before continuing, we would like to point out that Fano-type arguments have been used successfully to obtain (often sharp) lower bounds for some adaptive methods. For example, the work [7] uses results from [24] to establish a bound on the minimax rate for binary classification (see the references therein for additional literature on active learning). Other examples include the recent paper [22], which derives lower bounds for bandit problems, and [20], which develops an information-theoretic approach suitable for stochastic optimization, a form of online learning, and gives bounds on the convergence rate at which iterative convex optimization schemes approach a solution.

Following the standard approaches in our setting leads to major obstacles that we would like to briefly describe. Our hope is that this will help the reader to better appreciate our easy itinerary. As usual, we start by choosing a prior for x, which we take to have zero mean. Coming from information theory, one would want to bound the mutual information between x (what we want to learn about) and y (the information we have), for any measurement scheme a_1, . . . , a_m. Assuming a deterministic measurement scheme, by the chain rule, we have

    I(x, y) = h(y) − h(y | x) = Σ_{i=1}^m [h(y_i | y^[i−1]) − h(y_i | y^[i−1], x)],    (1.4)
where y^[i] := (y_1, . . . , y_i). Since the history up to time i − 1 determines a_i, the conditional distribution of y_i given y^[i−1] and x is then normal with mean ⟨a_i, x⟩ and variance σ². Hence, h(y_i | y^[i−1], x) = (1/2) log(2πeσ²), which is a constant since we assume σ² to be fixed. This is the easy term to handle; the challenging term is h(y_i | y^[i−1]), and it is not clear how one should go about finding a good upper bound. To see this, observe that

    Var(y_i | y^[i−1]) = Var(⟨a_i, x⟩ | y^[i−1]) + σ².
A standard approach to bound h(y_i | y^[i−1]) is to write

    h(y_i | y^[i−1]) ≤ (1/2) E log(2πe Var(⟨a_i, x⟩ | y^[i−1]) + 2πeσ²),

using the fact that the Gaussian distribution maximizes the entropy among distributions with a given variance. If we simplify the problem by applying Jensen's inequality, we get the following bound on the mutual information:

    I(x, y) ≤ Σ_{i=1}^m (1/2) log(E⟨a_i, x⟩²/σ² + 1).    (1.5)
The right-hand side needs to be bounded uniformly over all choices of measurement schemes, which is a daunting task given that, for i ≥ 2, a_i is a function of y^[i−1].

Note that we have chosen to present the problem in this form because information theorists will find an analogy with the problem of understanding the role of feedback in a Gaussian channel [8]. Specifically, we can view the inner products ⟨a_i, x⟩ as inputs to a Gaussian channel where we get to observe the output of the channel via feedback. It is well known that feedback does not substantially increase the capacity of a Gaussian channel, so one might expect this argument to be relevant to our problem as well. Crucially, however, in the case of a Gaussian channel the user has full control over the channel input, whereas in the absence of a priori knowledge of x, in our problem we are much more restricted in our control over the channel input ⟨a_i, x⟩.
More amenable to computations is the approach described in [24, Th. 2.5], which is closely related to the variant of Fano's inequality presented in [25]. This approach calls for bounding the average Kullback–Leibler divergence with respect to a reference distribution, which in our case leads to bounding

    Σ_{i=1}^m E⟨a_i, x⟩²,

again over all measurement schemes. We do not see any easy way to do this. We note, however, that the latter approach is fruitful in the nonadaptive setting; see [2], and also [1, 21] for other asymptotic results in this direction.
1.4 Can adaptivity sometimes help?
On the one hand, our main result states that one cannot universally improve on bounds achievable via nonadaptive sensing strategies. Indeed, there are classes of k-sparse signals for which using an adaptive or a nonadaptive strategy does not really matter. As mentioned earlier, these classes are constructed in such a way that even after applying the most clever sensing scheme and the most subtle testing procedure, one would still not be sure about where the signal lies. This remains true even after having used up the entirety of our measurement budget. On the other hand, our result does not say that adaptive sensing never helps. In fact, there are many instances in which it will. For example, when some or most of the nonzero entries in x are sufficiently large, they may be detected sufficiently early that one can ultimately get a far better MSE than what would be obtained via a nonadaptive scheme; see Section 3 for simple experiments in this direction and Section 4 for further discussion.
1.5 Connections with other works
A number of papers have studied the advantages (or sometimes the lack thereof) offered by adaptive sensing in the setting where one has noiseless data; see for example [9, 13, 19] and references therein. Of course, it is well known that one can uniquely determine a k-sparse vector from 2k linear nonadaptive noise-free measurements and, therefore, there is not much to dwell on. The aforementioned works of course do not study such a trivial problem. Rather, the point of view is that the signal is not exactly sparse, only approximately sparse, and the question is thus whether one can get a lower approximation error by employing an adaptive scheme. Whereas we study a statistical problem, this is a question in approximation theory. Consequently, the techniques and results of this line of research have no bearing on our problem.

There is much research suggesting intelligent adaptive sensing strategies in the presence of noise, and we mention a few of these works. In a setting closely related to ours (that of detecting the locations of the nonzeros of a sparse signal from possibly more measurements than the signal dimension n), [12] shows that by adaptively allocating sensing resources one can significantly improve upon the best nonadaptive schemes [10]. Closer to home, [11] considers a CS sampling algorithm (with m < n) which performs sequential subset selection via the random projections typical of CS, but which focuses in on promising areas of the signal. When the signal (i) is very sparse, (ii) has sufficiently large entries, and (iii) has constant dynamic range, the method is able to remove a logarithmic factor from the MSE achieved by the Dantzig selector with (nonadaptive) i.i.d. Gaussian measurements. In a different direction, [6, 15] suggest Bayesian approaches where the measurement vectors are sequentially chosen so as to maximize the conditional differential entropy of y_i given y^[i−1]. Finally, another approach in [14] suggests a bisection method based on repeated measurements for the detection of 1-sparse vectors, subsequently extended to k-sparse vectors via hashing. None of these works, however, establishes a lower bound on the MSE of the recovered signal.
1.6 Content
We prove all of our results in Section 2, trying to give as much insight as possible as to why adaptive methods are not much more powerful than nonadaptive ones for detecting the support of a sparse signal. We also attempt to describe the regime in which adaptivity might be helpful via simple numerical simulations in Section 3. These simulations show that adaptive algorithms are subject to a fundamental phase transition phenomenon. Finally, we offer some comments on open problems and future directions of research in Section 4.
2 Limits of Adaptive Sensing Strategies
This section establishes nonasymptotic lower bounds for the estimation of a sparse vector from adaptively selected noisy linear measurements. To begin with, we remind ourselves that we collect possibly adaptive measurements of the form (1.3) of an n-dimensional signal x; from now on, we assume for simplicity and without loss of generality that σ = 1. Remember that y^[i] = (y_1, . . . , y_i), which gathers all the information available after taking i measurements, and let P_x be the distribution of these measurements when the target vector is x. Without loss of generality, we consider a deterministic measurement scheme, meaning that a_1 is a deterministic vector and, for i ≥ 2, a_i is a deterministic function of y^[i−1]. In this case, using the fact that y_i is conditionally independent of y^[i−1] given a_i, the likelihood factorizes as

    P_x(y^[m]) = Π_{i=1}^m P_x(y_i | a_i).    (2.1)
We denote the total-variation metric between any two probability distributions P and Q by ‖P − Q‖_TV, and their KL divergence by K(P, Q) [18]. Our arguments will make use of Pinsker's inequality, which relates these two quantities via

    ‖P − Q‖_TV ≤ √(K(Q, P)/2).    (2.2)
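As a sanity check on (2.2) (an illustration of ours, not part of the paper's argument), one can compare both sides for a pair of unit-variance Gaussians, where both quantities have closed forms: ‖N(0,1) − N(μ,1)‖_TV = 2Φ(μ/2) − 1 and K(N(μ,1), N(0,1)) = μ²/2.

```python
from math import erf, sqrt

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for mu in [0.1, 0.5, 1.0, 2.0, 5.0]:
    tv = 2.0 * Phi(mu / 2.0) - 1.0  # total variation between N(0,1) and N(mu,1)
    kl = mu**2 / 2.0                # KL divergence between the two Gaussians
    print(mu, tv, sqrt(kl / 2.0))   # Pinsker: tv <= sqrt(kl/2) = mu/2
```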
We shall also use the convexity of the KL divergence, which states that for λ_i ≥ 0 with Σ_i λ_i = 1, we have

    K(Σ_i λ_i P_i, Σ_i λ_i Q_i) ≤ Σ_i λ_i K(P_i, Q_i),    (2.3)

in which {P_i} and {Q_i} are families of probability distributions.
2.1 The Bernoulli prior
For pedagogical reasons, it is best to first study a simpler model in which our argument is most transparent. The ideas needed to prove our main theorem, namely Theorem 1, are essentially the same; the only real differences are secondary technicalities.

In this model, we suppose that x ∈ ℝ^n is sampled from a product prior π: for each j ∈ {1, . . . , n},

    x_j = 0 with probability 1 − k/n,    x_j = μ with probability k/n,

and the x_j's are independent. In this model, x has on average k nonzero entries, all with known positive amplitudes equal to μ. This model is easier to study than the related model in which one selects k coordinates uniformly at random and sets those to μ. The reason is that in this Bernoulli model, the independence between the coordinates of x brings welcome simplifications, as we shall see.

Our goal here is to prove a version of Theorem 1 when x is drawn from this prior. We do this in two steps. First, we look at recovering the support of x, which is done via a reduction to multiple testing. Second, we show that a lower bound on the error for support recovery implies a lower bound on the MSE, leading to Theorem 3.
2.1.1 Support recovery in Hamming distance
We would like to understand how well we can estimate the support S = {j : x_j ≠ 0} of x from the data (1.3), and shall measure performance by means of the expected Hamming distance. Here, the error of a procedure Ŝ for estimating the support S is defined as

    E_x(|Ŝ △ S|) = Σ_{j=1}^n P_x(Ŝ_j ≠ S_j).

Here, △ denotes the symmetric difference and, for a subset S, S_j = 1 if j ∈ S and S_j = 0 otherwise.
Theorem 2. Set π_0 = P(x_j = 0) = 1 − k/n. Then for k ≤ n/2,

    E_π(|Ŝ △ S|) ≥ k (1 − √(μ² m/(16 π_0 n))).    (2.4)

Hence, if the amplitude μ of the signal is below √(n/m), we are essentially guessing at random. For instance, if μ = √(π_0 n/m), then E_π(|Ŝ △ S|) ≥ 3k/4!
Proof. With π_1 = 1 − π_0, we have

    Σ_j P(Ŝ_j ≠ S_j) = Σ_j [P(Ŝ_j = 1, S_j = 0) + P(Ŝ_j = 0, S_j = 1)]
                    = Σ_j [π_0 P(Ŝ_j = 1 | S_j = 0) + π_1 P(Ŝ_j = 0 | S_j = 1)],    (2.5)

so everything reduces to testing the n hypotheses H_{0,j} : x_j = 0. Fix j, and set P_{0,j} = P(· | x_j = 0) and P_{1,j} = P(· | x_j ≠ 0). The test with minimum risk is of course the Bayes test, rejecting H_{0,j} if and only if

    π_1 P_{1,j}(y^[m]) / (π_0 P_{0,j}(y^[m])) > 1,

that is, if the adjusted likelihood ratio exceeds one; see [17, Pbm. 3.10]. It is easy to see that the risk B_j of this test obeys

    B_j ≥ min(π_0, π_1) (1 − (1/2) ‖P_{1,j} − P_{0,j}‖_TV).    (2.6)

We now use Pinsker's inequality and obtain

    B_j ≥ min(π_0, π_1) (1 − √(K(P_{0,j}, P_{1,j})/8)).    (2.7)

It remains to find an upper bound on the KL divergence between P_{0,j} and P_{1,j}.

Set j = 1 for concreteness and write P_0 = P_{0,1} for short, and likewise for P_1. Then

    P_0(y^[m]) = Σ_{x_2,...,x_n} P(x_2, . . . , x_n) P(y^[m] | x_1 = 0, x_2, . . . , x_n) := Σ P(x*) P_{0,x*},

where x* = (x_2, . . . , x_n), and similarly for P_1(y^[m]). The convexity of the KL divergence (2.3) gives

    K(P_0, P_1) ≤ Σ P(x*) K(P_{0,x*}, P_{1,x*}).    (2.8)

We now calculate this divergence. To do so, observe that we have y_i = ⟨a_i, x⟩ + z_i = c_i + z_i under P_{0,x*}, while y_i = μ a_{i,1} + c_i + z_i under P_{1,x*}. This yields

    K(P_{0,x*}, P_{1,x*}) = E_{0,x*} log(P_{0,x*}/P_{1,x*})
        = Σ_{i=1}^m E_{0,x*} [(1/2)(y_i − μ a_{i,1} − c_i)² − (1/2)(y_i − c_i)²]
        = Σ_{i=1}^m E_{0,x*} [−z_i μ a_{i,1} + (μ a_{i,1})²/2]
        = (μ²/2) Σ_{i=1}^m E_{0,x*}(a_{i,1}²).

The first equality holds by definition, the second follows from (2.1), the third from y_i = c_i + z_i under P_{0,x*}, and the last holds since z_i is independent of a_{i,1} and has zero mean. Using (2.8), we obtain

    K(P_0, P_1) ≤ (μ²/2) Σ_{i=1}^m E[a_{i,1}² | x_1 = 0].

We shall upper bound the right-hand side via

    E[a_{i,1}² | x_1 = 0] = [P(x_1 = 0)]^{−1} E[a_{i,1}² 1_{x_1=0}] ≤ (1/π_0) E[a_{i,1}²].

Hence, generalizing this to any j, we have shown that

    K(P_{0,j}, P_{1,j}) ≤ (μ²/(2π_0)) Σ_{i=1}^m E(a_{i,j}²).

Returning to (2.5), via (2.7) and the bound above, we obtain

    Σ_j P(Ŝ_j ≠ S_j) ≥ min(π_0, π_1) Σ_j (1 − (μ/4) √((1/π_0) Σ_i E a_{i,j}²)).

Finally, Cauchy–Schwarz gives

    Σ_j √(Σ_i E a_{i,j}²) ≤ √(n Σ_{i,j} E a_{i,j}²) ≤ √(nm),

where the last step uses the power constraint Σ_j a_{i,j}² = ‖a_i‖₂² ≤ 1. When k ≤ n/2, min(π_0, π_1) = π_1 = k/n, and we conclude that

    Σ_j P(Ŝ_j ≠ S_j) ≥ k (1 − √(μ² m/(16 π_0 n))),

which establishes the theorem.
2.1.2 Estimation in mean-squared error
It is now straightforward to obtain a lower bound on the MSE from Theorem 2.

Theorem 3. Consider the Bernoulli model with k ≤ n/2 and μ² = (64 π_0/9)(n/m). Then any estimate x̂ obeys

    E_π E_x (1/n) ‖x̂ − x‖₂² ≥ (16 π_0/27) (k/m).

In other words, for k ≪ n, the MSE is at least about 0.59 k/m.
Proof. Let S be the support of x and set Ŝ := {j : |x̂_j| ≥ μ/2}. We have

    ‖x̂ − x‖₂² = Σ_{j∈S} (x̂_j − x_j)² + Σ_{j∉S} x̂_j²
             ≥ (μ²/4) |S \ Ŝ| + (μ²/4) |Ŝ \ S| = (μ²/4) |Ŝ △ S|

and, therefore,

    E_π E_x ‖x̂ − x‖₂² ≥ (μ²/4) E_π E_x |Ŝ △ S| ≥ k (μ²/4) (1 − √(μ² m/(16 π_0 n))),

where the last inequality is from Theorem 2. We then plug in the value of μ² to conclude.
2.2 The uniform prior
We hope to have made clear how our argument above is considerably simpler than a Fano-type argument, at least for the Bernoulli prior. We now wish to prove a similar result for exactly k-sparse signals in order to establish Theorem 1. The natural prior in this context is the uniform prior over exactly k-sparse vectors with nonzero entries all equal. The main obstacle in following the previous approach is the lack of independence between the coordinates of the sampled signal x. We present an approach that bypasses this difficulty and directly relates the risk under the uniform prior to that under the Bernoulli prior.

The uniform prior selects a support S ⊂ {1, . . . , n} of size |S| = k uniformly at random and sets all the entries indexed by S to μ, and the others to zero; that is, x_j = μ if j ∈ S, and x_j = 0 otherwise. Let U(k, n) denote the Hamming risk for support recovery under the uniform prior (we mean the risk achievable with the best adaptive sensing and testing strategies), and let B(k, n) be the corresponding quantity for the Bernoulli prior with π_1 = k/n, leaving μ and m implicit. Since the Bernoulli prior is a binomial mixture of uniform priors, we have

    B(k, n) = E U(L, n),    L ~ Bin(n, k/n).    (2.9)

To take advantage of (2.9), we establish some order relationships between the U(k, n). We speculate that U(k, n) is monotone increasing in k as long as k ≤ n/2, but do not prove this. The following is sufficient for our purpose.
Lemma 1. For all k and n, (i): U(k, n) ≤ U(k + 1, n + 1), and (ii): U(k, n) ≤ U(k, n + 1).

Proof. We prove (i) first. Given a method consisting of a sampling scheme and a support estimator for the (k + 1, n + 1) problem from the uniform prior, denoted (ã, S̃), we obtain a method for the (k, n) problem as follows. Presented with x ∈ ℝ^n from the latter, insert a coordinate equal to μ at random, yielding x̃ ∈ ℝ^{n+1}. Note that x̃ is effectively drawn from the uniform prior for the (k + 1, n + 1) problem. We then apply the method (ã, S̃). It might seem that we need to know x̃ to do this, but that is not the case. Indeed, assuming we insert the coordinate in the jth position, we have ⟨ã, x̃⟩ = ⟨a, x⟩ + μ ã_j, where a is ã with the jth coordinate removed; hence, this procedure may be carried out by taking inner products of x with a, which we can do, and adding μ ã_j. The resulting method (a, S) for the (k, n) problem has Hamming risk no greater than that of (ã, S̃), itself bounded by U(k + 1, n + 1); at the same time, (a, S) has Hamming risk at least as large as U(k, n), since ‖a‖₂ ≤ ‖ã‖₂ ≤ 1. This yields (i). The proof of (ii) follows the same lines and is actually simpler.
We also need the following elementary result, whose proof is omitted. Here and below, E[Y; A] denotes E[Y 1_A].

Lemma 2. For p ∈ (0, 1) and n ∈ ℕ, let Y_{n,p} denote a random variable with distribution Bin(n, p). Then for any k,

    E[Y_{n,p}; Y_{n,p} > k] = np P(Y_{n−1,p} > k − 1),    E[Y_{n,p}; Y_{n,p} < k] ≤ k P(Y_{n−1,p} < k − 1),

and

    P(Y_{n,p} ≤ k) = (1 − p) P(Y_{n−1,p} = k) + P(Y_{n−1,p} ≤ k − 1).
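The first and third identities of Lemma 2 are easy to verify numerically (a quick check of ours, not part of the proof):

```python
from math import comb

def pmf(n, p, j):
    # Bin(n, p) probability mass at j
    return comb(n, j) * p**j * (1 - p) ** (n - j)

n, p, k = 20, 0.3, 7

# E[Y; Y > k] = np * P(Y_{n-1,p} > k-1) for Y ~ Bin(n, p)
lhs = sum(j * pmf(n, p, j) for j in range(k + 1, n + 1))
rhs = n * p * sum(pmf(n - 1, p, j) for j in range(k, n))
print(abs(lhs - rhs))

# P(Y_{n,p} <= k) = (1-p) P(Y_{n-1,p} = k) + P(Y_{n-1,p} <= k-1)
lhs2 = sum(pmf(n, p, j) for j in range(0, k + 1))
rhs2 = (1 - p) * pmf(n - 1, p, k) + sum(pmf(n - 1, p, j) for j in range(0, k))
print(abs(lhs2 - rhs2))
```

Both differences are zero up to floating-point error.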
Combining these two lemmas with (2.9) and the fact that U(ℓ, n) ≤ ℓ for all ℓ, we obtain

    B(k, n) ≤ P(Y_{n,k/n} ≤ k) U(k, n + k) + E[Y_{n,k/n}; Y_{n,k/n} > k]
           = P(Y_{n,k/n} ≤ k) U(k, n + k) + k P(Y_{n−1,k/n} > k − 1).

Since

    P(Y_{n−1,k/n} > k − 1) = 1 − P(Y_{n,k/n} ≤ k) + (1 − k/n) P(Y_{n−1,k/n} = k),

we have the lower bound

    U(k, n + k) ≥ k + [B(k, n) − k − k P(Y_{n−1,k/n} = k)] / P(Y_{n,k/n} ≤ k).

Theorem 2 states that B(k, n) ≥ k(1 − ε), where ε := √(μ² m/(16(n − k))). Since P(Y_{n,k/n} ≤ k) ≥ 1/2 for all k, n [16], we have:
Theorem 4. Set ε_n = √(μ² m/(16(n − k))). Then under the uniform prior over k-sparse signals in dimension n with k < n/2,

    E_π(|Ŝ △ S|) ≥ k (1 − ρ_{k,n−k} − 2 ε_{n−k}),    ρ_{k,n} := P(Y_{n−1,k/n} = k) / P(Y_{n,k/n} ≤ k).    (2.10)
Setting μ² = 16(n − 2k)(1 − ρ_{k,n−k})²/(9m) and following the same argument as in Theorem 3 gives the bound

    E_π E_x (1/n) ‖x̂ − x‖₂² ≥ (4 (1 − ρ_{k,n−k})³ (1 − 2k/n)/27) (k/m).

We note that ρ_{k,n−k} → 0 as k, n → ∞ and that ρ_{k,n−k} < 1/5, and thus (1 − ρ_{k,n−k})³ > (4/5)³ > 1/2, for all k ≥ 15, n ≥ 100 provided that k ≤ n/4. This leads to a lower bound of the form in Theorem 1 with a constant of (2/27)(1 − 2k/n) > (1 − 2k/n)/14.
When k, n are large, this approach leads to a constant worse than that obtained in the Bernoulli case by a factor of 8. For sufficiently large k, n, the term (1 − ρ_{k,n−k}) → 1, but there remains a factor of 4 which arises because of the manner in which we have truncated the summation in our bound on B(k, n) above (which results in a factor of 2 multiplied by ε_{n−k} in (2.4)). This can be eliminated by truncating on both sides as follows:

    B(k, n) ≤ E[Y_{n,k/n}; Y_{n,k/n} ∉ [k − t, k + t]] + P(k − t ≤ Y_{n,k/n} ≤ k + t) U(k + t, n + 2t),

and proceeding in a similar manner as before. Here, one uses

    E[Y_{n,k/n}; Y_{n,k/n} < k − t] ≤ k P(Y_{n−1,k/n} < k − t − 1)

and

    E[Y_{n,k/n}; Y_{n,k/n} > k + t] ≤ k P(Y_{n−1,k/n} > k + t − 1).

Choosing t = t_k = 5√(k log(k)) and using well-known deviation bounds, we then obtain that

    B(k, n) ≤ (1 + o(1)) U(k + t_k, n + 2t_k) + o(1),    as k, n → ∞.
Reversing this inequality, using Theorem 2 and following the proof of Theorem 3, we arrive at the following.

Corollary 1. Set π_0 = 1 − k/n. For the uniform prior and as k, n → ∞, we have

    E_π(|Ŝ △ S|) ≥ k (1 − η_{n,k}) (1 − √(μ² m/(16 π_0 n))),    E_π E_x (1/n) ‖x̂ − x‖₂² ≥ (16 π_0/27)(1 − η_{n,k}) (k/m),

for some sequence η_{n,k} → 0.
We see that this finishes the proof of Theorem 1, because the minimax risk over a class of alternatives always upper-bounds an average risk over that same class.
3 Numerical Experiments
In order to briey illustrate the implications of the lower bounds in Section 2 and the potential limitations
and benets of adaptivity in general, we include a few simple numerical experiments. To simplify our
discussion, we limit ourselves to existing adaptive procedures that aim at consistent support recovery: the
adaptive procedure from [6] and the recursive bisection algorithm of [14].
We emphasize that in the case of a generic k-sparse signal, there are many possibilities for adaptively
estimating the support of the signal. For example, the approach in [11] iteratively rules out indices and
could, in principle, proceed until only k candidate indices remain. In contrast, the approaches in [6] and [14]
are built upon algorithms for estimating the support of 1-sparse signals. An algorithm for a 1-sparse signal
could then be run k times to estimate a k-sparse signal as in [6], or used in conjunction with a hashing
scheme as in [14]. Since our goal is not to provide a thorough evaluation of the merits of all the dierent
possibilities, but merely to illustrate the general limits of adaptivity, we simplify our discussion and focus
exclusively on the simple case of one-sparse signals, i.e., where k = 1.
Specifically, in our experiments we will consider the uniform prior on the set of vectors with a single
nonzero entry equal to $\mu > 0$ as in Section 2. Since we are focusing only on the case of k = 1, the algorithms
in [6] and [14] are extremely simple and are shown in Algorithm 1 and Algorithm 2 respectively. Note
that in Algorithm 1 the step of updating the posterior distribution p consists of an iterative update rule
given in [6] and does not require any a priori knowledge of the signal x or $\mu$. In Algorithm 2, we simplify
the recursive bisection algorithm of [14] using the knowledge that $\mu > 0$, which allows us to eliminate the
second stage of the algorithm aimed at detecting negative coefficients. Note that this algorithm proceeds
through $s_{\max} = \log_2 n$ stages and we must allocate a certain number of measurements to each stage. In
our experiments we set $m_s = \lfloor \alpha 2^{-s} \rfloor$, where $\alpha$ is selected to ensure that $\sum_{s=1}^{\log_2 n} m_s \le m$.
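As a concrete illustration, the per-stage allocation just described can be computed with a short search for the largest admissible scaling constant. This is only a sketch under our reading of the schedule $m_s = \lfloor \alpha 2^{-s} \rfloor$; the function and variable names are our own.

```python
import math

def allocate_measurements(m, n):
    """Split a budget of m measurements across s_max = log2(n) bisection
    stages as m_s = floor(alpha * 2**(-s)), using the largest alpha for
    which the total stays within budget (coarse integer search).
    Sketch only; the exact schedule used in [14] may differ."""
    s_max = int(math.log2(n))

    def total(alpha):
        return sum(math.floor(alpha * 2.0 ** (-s)) for s in range(1, s_max + 1))

    alpha = 1.0
    while total(alpha + 1.0) <= m:   # grow alpha until the budget is exhausted
        alpha += 1.0
    return [math.floor(alpha * 2.0 ** (-s)) for s in range(1, s_max + 1)]

ms = allocate_measurements(m=128, n=512)   # one entry per stage, sum(ms) <= 128
```

Note that this geometric schedule front-loads measurements into the early stages, where the candidate sets are largest.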
3.1 Evolution of the posterior
We begin by showing the results of a simple simulation that illustrates the behavior of the posterior
distribution of x as a function of $\mu$ for both adaptive schemes. Specifically, we assume that m is fixed and
collect m measurements using each approach. Given the measurements y, we then compute the posterior
distribution p using the true prior used to generate the signal, which can be computed using the fact that
$$ p_j \propto \exp\left( -\frac{1}{2\sigma^2} \| y - \mu A e_j \|_2^2 \right), \qquad (3.1) $$
Algorithm 1 Adaptive algorithm from [6]
input: $m \times n$ random matrix $B$ with i.i.d. Rademacher ($\pm 1$ with equal probability) entries.
initialize: $p = \frac{1}{n}(1, \ldots, 1)^T$.
for $i = 1$ to $i = m$ do
    Compute $a_i = (b_{i,1}\sqrt{p_1}, \ldots, b_{i,n}\sqrt{p_n})^T$.
    Observe $y_i = \langle a_i, x \rangle + z_i$.
    Update posterior distribution $p$ of $x$ given $(a_1, y_1), \ldots, (a_i, y_i)$ using the rule in [6].
end for
output: Estimate for supp(x) is the index where $p$ attains its maximum value.
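For concreteness, a minimal Python sketch of Algorithm 1 follows. The posterior update rule of [6] is not reproduced here; instead we substitute a plain Bayes update that assumes $\mu$ and $\sigma$ are known, so this sketch differs from the actual algorithm on that step, and all names are our own.

```python
import numpy as np

def adaptive_1sparse(x, m, mu, sigma, rng):
    """Sketch of Algorithm 1 for a 1-sparse x with nonzero value mu > 0.
    The posterior update below is a plain Bayes rule that assumes mu and
    sigma are known; it stands in for the update rule of [6], which does
    not require this knowledge."""
    n = len(x)
    p = np.full(n, 1.0 / n)                    # uniform prior on the support
    B = rng.choice([-1.0, 1.0], size=(m, n))   # i.i.d. Rademacher entries
    for i in range(m):
        a = B[i] * np.sqrt(p)                  # focus sensing energy via p
        y = a @ x + sigma * rng.standard_normal()
        # likelihood of y if the spike sat at index j (for every j at once)
        loglik = -(y - mu * a) ** 2 / (2.0 * sigma ** 2)
        p = p * np.exp(loglik - loglik.max())  # Bayes update, stabilized
        p = p / p.sum()
    return int(np.argmax(p))                   # estimated support index

rng = np.random.default_rng(0)
x = np.zeros(512); x[123] = 20.0
j_hat = adaptive_1sparse(x, m=64, mu=20.0, sigma=1.0, rng=rng)
```

As the posterior concentrates, $\sqrt{p_j}$ steers almost all sensing energy toward the surviving candidate indices, which is the mechanism behind the SNR gain discussed in Section 3.1.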
Algorithm 2 Recursive bisection algorithm of [14]
input: $m_1, \ldots, m_{s_{\max}}$.
initialize: $J_1^{(1)} = \{1, \ldots, n/2\}$, $J_2^{(1)} = \{n/2 + 1, \ldots, n\}$.
for $s = 1$ to $s = s_{\max}$ do
    Construct the $m_s \times n$ matrix $A^{(s)}$ with rows $|J_1^{(s)}|^{-1/2} \mathbf{1}_{J_1^{(s)}} - |J_2^{(s)}|^{-1/2} \mathbf{1}_{J_2^{(s)}}$.
    Observe $y^{(s)} = A^{(s)} x + z^{(s)}$.
    Compute $w^{(s)} = \sum_{i=1}^{m_s} y_i^{(s)}$.
    Subdivide: Update $J_1^{(s+1)}$ and $J_2^{(s+1)}$ by partitioning $J_1^{(s)}$ if $w^{(s)} \ge 0$ or $J_2^{(s)}$ if $w^{(s)} < 0$.
end for
output: Estimate for supp(x) is $J_1^{(s_{\max})}$ if $w^{(s_{\max})} \ge 0$, $J_2^{(s_{\max})}$ if $w^{(s_{\max})} < 0$.
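A minimal Python sketch of Algorithm 2 is given below, assuming the simplified setting described above (1-sparse signal with a positive nonzero). The function signature and per-stage bookkeeping are our own.

```python
import numpy as np

def bisection_1sparse(x, ms, sigma, rng):
    """Sketch of Algorithm 2 (the recursive bisection of [14], simplified
    for a 1-sparse x with a positive nonzero).  ms lists the number of
    measurements per stage; all names here are our own."""
    n = len(x)
    J1 = np.arange(n // 2)              # candidate left half
    J2 = np.arange(n // 2, n)           # candidate right half
    for m_s in ms:
        a = np.zeros(n)
        a[J1] = len(J1) ** -0.5         # row value +|J1|^(-1/2) on J1
        a[J2] = -(len(J2) ** -0.5)      # row value -|J2|^(-1/2) on J2
        y = a @ x + sigma * rng.standard_normal(m_s)  # m_s identical rows
        w = y.sum()
        J = J1 if w >= 0 else J2        # half believed to hold the spike
        if len(J) == 1:                 # final stage: support estimate
            return int(J[0])
        J1, J2 = J[:len(J) // 2], J[len(J) // 2:]
    return int(J1[0])                   # fallback if ms is too short

rng = np.random.default_rng(0)
x = np.zeros(512); x[321] = 10.0
j_hat = bisection_1sparse(x, [8] * 9, sigma=1.0, rng=rng)  # log2(512) = 9 stages
```

Because the rows are normalized by $|J|^{-1/2}$, the per-measurement signal contribution grows as the candidate sets shrink, which is why later stages can afford fewer measurements.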
where $\sigma^2$ is the noise variance and $e_j$ denotes the $j$th element of the standard basis. What we expect is that
once $\mu$ exceeds a certain threshold (which depends on $m$), the posterior will become highly concentrated
on the true support of x. To quantify this, we consider the case where $j^*$ denotes the true location of the
nonzero element of x and define
$$ \rho = \frac{p_{j^*}}{\max_{j \neq j^*} p_j}. $$
Note that when $\rho \approx 1$, we cannot reliably detect the nonzero, but when $\rho \gg 1$ we can.
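In code, the posterior (3.1) and the ratio just defined can be computed as follows; the names are our own, and the ratio is formed in log space for numerical stability.

```python
import numpy as np

def posterior_ratio(y, A, mu, sigma, j_star):
    """Posterior (3.1), p_j proportional to exp(-||y - mu A e_j||^2 / (2 sigma^2)),
    and the ratio rho = p_{j*} / max_{j != j*} p_j, computed in log space
    for numerical stability."""
    resid = y[:, None] - mu * A                      # column j holds y - mu A e_j
    logp = -np.sum(resid ** 2, axis=0) / (2.0 * sigma ** 2)
    rho = float(np.exp(logp[j_star] - np.max(np.delete(logp, j_star))))
    p = np.exp(logp - logp.max())                    # normalize the posterior
    return p / p.sum(), rho

rng = np.random.default_rng(0)
n, m, mu, sigma, j_star = 512, 64, 5.0, 1.0, 42
A = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(n)   # normalized Rademacher
x = np.zeros(n); x[j_star] = mu
y = A @ x + sigma * rng.standard_normal(m)
p, rho = posterior_ratio(y, A, mu, sigma, j_star)
```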
In Figure 1 we show the results for a few representative values of m (a) when using nonadaptive
measurements, i.e., a (normalized) i.i.d. Rademacher random matrix A, compared to the results of (b)
Algorithm 1, and (c) Algorithm 2. For each value of m and for each value of $\mu$, we acquire m measurements
using each approach and compute the posterior p according to (3.1). We then compute the value of $\rho$. We
repeat this for 10,000 iterations and plot the median value of $\rho$ for each value of $\mu$ for all three approaches.
In our experiments we set n = 512 and $\sigma^2 = 1$. We truncate the vertical axis at $10^4$ to ensure that all
curves are comparable. We observe that in each case, once $\mu$ exceeds a certain threshold proportional to
$\sqrt{n/m}$, the ratio of $p_{j^*}$ to the second largest posterior probability grows exponentially fast. As expected,
this occurs for both the nonadaptive and adaptive strategies, with no substantial difference in terms of
how large $\mu$ must be before support recovery is assured (although Algorithm 2 seems to improve upon the
nonadaptive strategy by a small constant).
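The nonadaptive experiment just described can be reproduced in miniature as follows (far fewer trials and a smaller problem than in the paper; the helper below is self-contained and its names are our own).

```python
import numpy as np

def median_rho(n, m, mu, sigma, trials, rng):
    """Median over independent trials of rho = p_{j*} / max_{j != j*} p_j
    for nonadaptive (normalized Rademacher) measurements, mirroring the
    Figure 1(a) experiment on a much smaller scale."""
    log_rhos = []
    for _ in range(trials):
        j_star = int(rng.integers(n))
        A = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(n)
        y = mu * A[:, j_star] + sigma * rng.standard_normal(m)
        logp = -np.sum((y[:, None] - mu * A) ** 2, axis=0) / (2.0 * sigma ** 2)
        log_rhos.append(logp[j_star] - np.max(np.delete(logp, j_star)))
    return float(np.exp(np.median(log_rhos)))   # take the median in log space

rng = np.random.default_rng(0)
below = median_rho(n=128, m=32, mu=1.0, sigma=1.0, trials=50, rng=rng)   # weak signal
above = median_rho(n=128, m=32, mu=20.0, sigma=1.0, trials=50, rng=rng)  # strong signal
```

With an amplitude well below the $\sqrt{n/m}$ threshold the median ratio hovers near (or below) one, while well above the threshold it blows up, matching the qualitative behavior in Figure 1.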
3.2 MSE performance
We have just observed that for a given number of measurements m, there is a critical value of $\mu$ below
which we cannot reliably detect the support. In this section we examine the impact of this phenomenon
on the resulting MSE of a two-stage procedure that first uses $m_d = pm$ adaptive measurements to detect
the location of the nonzero with either Algorithm 1 or Algorithm 2 and then reserves $m_e = (1-p)m$
measurements to directly estimate the value of the identified coefficient. It is not hard to show that if we
[Figure 1 here: three panels plotting $\rho$ against $\mu$ for m = 16, m = 32, and m = 64; (a) nonadaptive measurements, (b) Algorithm 1, (c) Algorithm 2.]
Figure 1: Behavior of the posterior distribution as a function of $\mu$ for several values of m. (a) shows the results for
nonadaptive measurements. (b) shows the results for Algorithm 1. (c) shows the results for Algorithm 2. We see
that Algorithm 2 is able to detect somewhat weaker signals than Algorithm 1. However, for both cases we observe
that once $\mu$ exceeds a certain threshold proportional to $\sqrt{n/m}$, the ratio of $p_{j^*}$ to the second largest posterior
probability grows exponentially fast, but that this does not differ substantially from the behavior observed in (a)
when using nonadaptive measurements.
correctly identify the location of the nonzero, then this will result in an MSE of $(m_e n)^{-1} = ((1-p)mn)^{-1}$.
As a point of comparison, if an oracle provided us with the location of the nonzero a priori, we could
devote all m measurements to estimating its value, with the best possible MSE being $\frac{1}{mn}$. Thus, if we can
correctly detect the nonzero, this procedure will perform within a constant factor of the oracle.
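The two-stage procedure can be sketched as follows. For simplicity the detection stage here is a nonadaptive correlation pick rather than Algorithm 1 or 2, the estimation stage measures the chosen coordinate with unit-norm rows (so its MSE is $\sigma^2/m_e$; the extra $1/n$ factors in the text reflect the paper's budget normalization, which we do not reproduce), and all names are our own.

```python
import numpy as np

def two_stage_estimate(x, m, p_frac, sigma, rng):
    """Two-stage sketch: m_d = p*m measurements to detect the support
    (a simple nonadaptive correlation pick stands in for Algorithm 1 or 2),
    then m_e = (1-p)*m unit-norm direct measurements of the chosen
    coordinate.  With this normalization the estimation-stage MSE is
    sigma^2 / m_e."""
    n = len(x)
    m_d = int(p_frac * m)
    m_e = m - m_d
    A = rng.choice([-1.0, 1.0], size=(m_d, n)) / np.sqrt(n)
    y = A @ x + sigma * rng.standard_normal(m_d)
    j_hat = int(np.argmax(np.abs(A.T @ y)))            # detection stage
    y_e = x[j_hat] + sigma * rng.standard_normal(m_e)  # m_e rows equal to e_{j_hat}
    x_hat = np.zeros(n)
    x_hat[j_hat] = y_e.mean()                          # estimation stage
    return x_hat

rng = np.random.default_rng(0)
x = np.zeros(512); x[10] = 20.0
x_hat = two_stage_estimate(x, m=128, p_frac=0.5, sigma=1.0, rng=rng)
```

The split p_frac controls the trade-off discussed in the text: once detection succeeds, shrinking the detection share drives the MSE toward the oracle rate.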
We illustrate the performance of Algorithm 1 and Algorithm 2 in terms of the resulting MSE as a
function of the amplitude $\mu$ of the nonzero in Figure 2. In this experiment we set n = 512 and m = 128
with $p = \frac{1}{2}$ so that $m_d = 64$ and $m_e = 64$. We then compute the average MSE over 100,000 iterations
for each value of $\mu$ and for both algorithms. We compare this to a nonadaptive procedure which uses a
(normalized) i.i.d. Rademacher matrix followed by orthogonal matching pursuit (OMP). Note that in the
worst case the MSE of the adaptive algorithms is comparable to the MSE obtained by the nonadaptive
algorithm and exceeds the lower bound in Theorem 1 by only a small constant factor. However, when $\mu$
begins to exceed a critical threshold, the MSE rapidly decays and approaches the optimal value of $\frac{1}{m_e n}$.
Note that by reserving a larger fraction of measurements for estimation when $\mu$ is large we could actually
get arbitrarily close to $\frac{1}{mn}$ in the asymptotic regime.
4 Discussion
The contribution of this paper is to show that even if one has the freedom to choose any adaptive sensing
strategy and any estimation procedure, no matter how complicated or computationally intractable, one
would not be able to universally improve over a simple nonadaptive strategy that projects the
signal onto a lower dimensional space and performs recovery via $\ell_1$ minimization. This negative result
should not conceal the fact that adaptivity may help tremendously if the SNR is sufficiently large, as
illustrated in Section 3. Hence, we regard the design and analysis of effective adaptive schemes as a subject
of important future research. At the methodological level, it seems important to develop adaptive strategies
and algorithms for support estimation that are as accurate and as robust as possible. Further, a transition
towards practical applications would need to involve engineering hardware that can effectively implement
this sort of feedback, an issue which poses all kinds of very concrete challenges. Finally, at the theoretical
level, it would be of interest to analyze the phase transition phenomenon we expect to occur in simple
Bayesian signal models. For instance, a central question would be how many measurements are required
to transition from a nearly flat posterior to one mostly concentrated on the true support.
[Figure 2 here: MSE as a function of $\mu$ for the nonadaptive approach, Algorithm 1, and Algorithm 2, with reference levels $\frac{1}{mn}$ and $\frac{1}{m_e n}$.]
Figure 2: The performance of Algorithm 1 and Algorithm 2 in the context of a two-stage procedure that first uses
$m_d = \frac{m}{2}$ adaptive measurements to detect the location of the nonzero and then uses $m_e = \frac{m}{2}$ measurements to
directly estimate the value of the identified coefficient. We show the resulting MSE as a function of the amplitude $\mu$
of the nonzero entry, and compare this to a nonadaptive procedure which uses a (normalized) i.i.d. Rademacher
matrix followed by OMP. In the worst case, the MSE of the adaptive algorithms is comparable to the MSE obtained
by the nonadaptive algorithm and exceeds the lower bound in Theorem 1 by only a small constant factor. When $\mu$
begins to exceed this critical threshold, the MSE of the adaptive algorithms rapidly decays below that of the
nonadaptive algorithm and approaches $\frac{1}{m_e n}$, which is the MSE one would obtain given $m_e$ measurements and a
priori knowledge of the support.
Acknowledgements
E. A-C. is partially supported by ONR grant N00014-09-1-0258. E. C. is partially supported by NSF via grant CCF-0963835
and the 2006 Waterman Award, by AFOSR under grant FA9550-09-1-0643, and by ONR under
grant N00014-09-1-0258. M. D. is supported by NSF grant DMS-1004718.
References
[1] S. Aeron, V. Saligrama, and M. Zhao. Information theoretic bounds for compressed sensing. IEEE Trans. Inform. Theory, 56(10):5111–5130, 2010.
[2] E. Candès and M. Davenport. How well can we estimate a sparse vector? Arxiv preprint arXiv:1104.5246, 2011.
[3] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[4] E. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.
[5] E. Candès and T. Tao. The Dantzig Selector: Statistical estimation when p is much larger than n. Ann. Stat., 35(6):2313–2351, 2007.
[6] R. Castro, J. Haupt, R. Nowak, and G. Raz. Finding needles in noisy haystacks. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), Las Vegas, NV, Apr. 2008.
[7] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Trans. Inform. Theory, 54(5):2339–2353, 2008.
[8] T. Cover and J. Thomas. Elements of information theory. Wiley-Interscience, Hoboken, NJ, 2006.
[9] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[10] D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat., 32(3):962–994, 2004.
[11] J. Haupt, R. Baraniuk, R. Castro, and R. Nowak. Compressive distilled sensing: Sparse recovery using adaptivity in compressive measurements. In Proc. Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, Oct. 2009.
[12] J. Haupt, R. Castro, and R. Nowak. Distilled sensing: Selective sampling for sparse signal recovery. In Proc. Int. Conf. Art. Intell. Stat. (AISTATS), Clearwater Beach, FL, Apr. 2009.
[13] P. Indyk, E. Price, and D. Woodruff. On the power of adaptivity in sparse recovery. In Proc. IEEE Symp. Found. Comp. Science (FOCS), Palm Springs, CA, Oct. 2011.
[14] M. Iwen and A. Tewfik. Adaptive group testing strategies for target detection and localization in noisy environments. IMA Preprint Series, 2010.
[15] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE Trans. Signal Processing, 56(6):2346–2356, 2008.
[16] R. Kaas and J. Buhrman. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13–18, 1980.
[17] E. Lehmann and J. Romano. Testing statistical hypotheses. Springer Texts in Statistics. Springer, New York, 2005.
[18] P. Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007.
[19] E. Novak. On the power of adaptation. J. Complexity, 12(3):199–237, 1996.
[20] M. Raginsky and A. Rakhlin. Information complexity of black-box convex optimization: A new look via feedback information theory. In Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Oct. 2009.
[21] G. Raskutti, M. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. Arxiv preprint arXiv:0910.2042, 2009.
[22] P. Rigollet and A. Zeevi. Nonparametric bandits with covariates. Arxiv preprint arXiv:1003.1630, 2010.
[23] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
[24] A. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009.
[25] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, New York, 1997.