
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Review for # 271: “Least Informative Dimensions”
This paper addresses the problem of finding the low dimensional
stimulus subspaces (of a high dimensional stimulus space) that most
influence the spiking of a neuron or potentially other process. In
contrast to Maximally informative dimensions, which directly looks for
informative subspaces, this method tries to decompose the stimulus space
into two orthogonal subspaces by making one of them as uninformative about
the spiking as possible. The methodology is highly advanced, involving
replacing the relative entropy by an approximate “integral probability
metric” and representing this metric in terms of a kernel function. After
making some simplifying assumptions regarding the form of this kernel, the
authors use gradient ascent. They give detailed formulas for the
required derivatives in the supplement. They then apply their method to
several simulated data sets and to a neuron recorded in the electric fish.
Using a shuffling procedure to calculate confidence bounds on the
integral probability metric, they show that for their examples, this
procedure finds a subspace of the correct dimension which contains all the
relevant information in the stimulus.
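The shuffling procedure summarized above can be sketched as a permutation test. This is only an illustration under stated assumptions: a biased HSIC estimate with Gaussian kernels stands in for the paper's integral probability metric, and the function names, bandwidth, sample size, and toy data are all hypothetical rather than taken from the paper.

```python
import math
import random


def gaussian_gram(values, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in values]
            for a in values]


def center(gram):
    """Double-center a symmetric Gram matrix: H K H with H = I - 11'/n."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def hsic(centered_k, gram_l, order):
    """Biased HSIC estimate trace(HKH L) / n^2, with the second sample reordered."""
    n = len(order)
    total = 0.0
    for i in range(n):
        row_k = centered_k[i]
        row_l = gram_l[order[i]]
        for j in range(n):
            total += row_k[j] * row_l[order[j]]
    return total / (n * n)


def permutation_test(x, y, n_perm=200, seed=0):
    """Compare the observed statistic with its distribution under shuffled pairings."""
    rng = random.Random(seed)
    centered_k = center(gaussian_gram(x))
    gram_l = gaussian_gram(y)
    identity = list(range(len(x)))
    observed = hsic(centered_k, gram_l, identity)
    null = []
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)  # shuffling breaks any dependence between x and y
        null.append(hsic(centered_k, gram_l, order))
    exceed = sum(1 for value in null if value >= observed)
    p_value = (1 + exceed) / (1 + n_perm)
    return observed, null, p_value


# Toy data (hypothetical): y depends on x only through x^2, so x and y are
# uncorrelated but strongly dependent.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(80)]
y = [xi * xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]
observed, null, p_value = permutation_test(x, y)
print("observed statistic:", observed, "permutation p-value:", p_value)
```

If the observed statistic falls inside the bulk of the shuffled values, the candidate subspace carries no detectable information about the response at that sample size.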
This paper was somewhat
dense, but I enjoyed it once I figured out what was going on. The problem
of finding informative subspaces is an important one, and new methods are
welcome. The math is rigorous and the examples in the results section,
while somewhat simple, are appropriate for a NIPS paper given the space
constraints.
One thing that is missing is any mention of
computation times and a comparison with more standard methods such as
Sharpee’s maximally informative dimensions. Given that the authors are
motivating their work by the computational intensity of directly
estimating the mutual information, I think that they should compare the
two methods both in their computation time, and also their accuracy.
A second question I had regarded the firing rates used in their
simulated examples. They state on line 257 that “35% nonzero spike counts
were obtained” for the LNP neuron. This seems rather high: about 350 Hz at
1 ms resolution or 35 Hz at 10 ms resolution. Can the method detect the
relevant dimension (particularly in the complex cell example) at lower
resolution?
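The rate estimate in the question above can be checked with a few lines of arithmetic. This is a rough sketch: the first conversion assumes at most one spike per bin, the second assumes Poisson spike counts per bin (an assumption, not something the review states), and the function names are illustrative.

```python
import math


def rate_if_at_most_one_spike(fraction_nonzero, bin_width_s):
    """Rate in Hz if every nonzero bin holds exactly one spike."""
    return fraction_nonzero / bin_width_s


def rate_if_poisson_counts(fraction_nonzero, bin_width_s):
    """Rate in Hz if counts are Poisson: P(count > 0) = 1 - exp(-rate * dt)."""
    return -math.log(1.0 - fraction_nonzero) / bin_width_s


for bin_width_s in (0.001, 0.010):
    simple = rate_if_at_most_one_spike(0.35, bin_width_s)
    poisson = rate_if_poisson_counts(0.35, bin_width_s)
    print(f"bin {bin_width_s * 1000:.0f} ms: {simple:.1f} Hz (one spike per bin), "
          f"{poisson:.1f} Hz (Poisson counts)")
```

Under the Poisson assumption the implied rate is higher still, since some nonzero bins contain more than one spike.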
In summary I think this is a nice paper suitable for
NIPS if the above concerns are addressed.
Q2: Please
summarize your review in 1-2 sentences
A new method for finding stimulus subspaces which are
informative about neural spiking (or other random processes). Very
mathematically rigorous and quite interesting. However, no direct
comparison with other methods (such as maximally informative dimensions)
is provided and, given the claim of efficiency (or at least the motivation
of efficiency), details such as computation times should be
included.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is devoted to the problem of finding
relevant stimulus features from the neural responses. The main idea is to
minimize the information between a certain set of features and the neural
responses combined with the remaining stimulus features in order to
separate the relevant from irrelevant features. The method builds on the
prior work by Fukumizu and colleagues, but unfortunately makes an
additional restriction that greatly diminishes the potential use of this
method. Specifically, the irrelevant and relevant dimensions need to be
statistically independent from each other within the stimulus ensemble.
While the authors argue that this is not a crucial assumption “because
many distributions are Gaussian for which the dependencies between U
[informative features] and V[the uninformative features] can be removed by
prewhitening the stimulus data before training”, this assumption
essentially eliminates the novel contribution from this method. This is
because with correlated Gaussian stimuli one can use spike-triggered
covariance (STC) to find all features that are informative about the
neural response in one simple computational step that avoids the
non-convexity and complexity of the proposed optimization. Furthermore,
the STC method has recently been shown to apply to elliptically symmetric
stimuli (Samengo & Gollisch, J Computational Neuroscience 2013).
Finally, the proposed method has to iteratively increase the number of
relevant features, whereas this is not needed in the STC method.
In terms of practical implementations, the method was only
demonstrated in this paper on white noise signals. Even in this simple
case, the performance exhibited substantial biases. For example, in Figure
1, right panel, a substantial component along the irrelevant subspace
remained after optimization. Note that it is not correct to claim
(lines 286888, and also lines 315-316) that “LID correctly identifies the
subspace (blue dashed) in which the two true filters (solid black) reside
since projections of the filters on the subspace closely resemble the
original filters.” The point of this method is to *eliminate* irrelevant
dimensions, and this was not achieved.
The superior convergence
properties of the proposed technique relative to spike-triggered
average/covariance were not demonstrated. In fact, the method did not even
remove biases in the STA, which would be possible using the linear
approximation (Figure 3). Note that the Figure 3 description was
contradictory: stimuli were described as white noise, yet the footnote
stated that the data were not whitened.
The use of information
minimization for recovering multiple features and correlated stimuli was
demonstrated in Fitzgerald et al. PLoS Computational Biology 2011,
“Second-order dimensionality reduction using minimum and maximum mutual
information models.”
Lines 234-235: it is not correct that MID
(also STC) “requires data that contains several spike responses to the
same stimulus” – one can optimize the relative entropy based on a sequence
of stimuli presented only once in either MID/STC methods.
Lines
236-237: MID/STC can be used with spike patterns by defining events across
time bins or different neurons.
The abstract was written in a
somewhat misleading manner, because it is not the information that is
being minimized, but a related quantity in combination with crucial
assumptions that are not spelled out in the abstract, for example, that
kernels need to factorize. The last statement in the abstract is also not
clear: “… if we can make the uninformative features independent of the
rest.” Here, it is not clear that this implies that inputs are essentially
required to be uncorrelated if any of the input features can be
informative for one of the neurons in a dataset.
Q2: Please summarize your review in 1-2
sentences
This manuscript describes a method for finding
multiple features in neuroscience dimensions. The paper is written fairly
clearly. Unfortunately, the implementation of this method leaves much to be
desired compared to the existing methods (e.g. STC) and does not offer
any new capabilities because it requires uncorrelated inputs, with which STC
can be used.
Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The gist of this paper is essentially a nonparametric
method that can be used to estimate Maximally Informative Dimensions
(MID; due to Sharpee et al). MID is one of the only spike-triggered
characterization techniques that is capable of extracting multiple
dimensions that are informative with respect to higher-order
statistics (unlike classic methods such as STA/STC which rely on
changes in mean and variance, respectively). Extending MID to multiple
dimensions is also difficult due to repeated estimation of relative
entropy which quickly becomes intractable due to the amount of data
required as dimensionality increases. Here, the authors use an
'indirect' estimation approach, whereby they can estimate maximally
informative features without actually having to estimate mutual
information.
This paper is largely excellent. The methodology
itself is similar in flavour to the work of Fukumizu et al, but the
similarities and conceptual differences are discussed in detail by the
authors. My main criticism is that the paper is very methods-heavy,
with slightly less focus on experiments. This really does seem like
a great estimation technique, and I do not feel that the experiments
section does it justice. Ideally, I would like to have seen a
detailed comparison between MID and LID in real data (perhaps visual
recordings, similar to the data commonly presented in papers by the
Sharpee group). I appreciate that the authors did present some real
data application in the form of P-unit recordings from weakly electric
fish, although one might argue that this data is non-standard in the
sense that the vast majority of spike-triggered characterization
techniques are applied to either the visual or auditory domains. The
most disappointing aspect of the results, however, is that the authors
do not seem to address the issue of taking MID to high dimensions,
which is one of the key areas in which current MID estimation
algorithms struggle. Some examples of this can be found in the recent
papers by Atencio, Sharpee, and Schreiner, where the MID approach is
applied to the auditory system. An approach like this could benefit
these kinds of studies greatly, giving them the ability to investigate
higher-dimensional feature spaces. A final comment is one of
computational efficiency: given that such an approach is likely to be
applied to vast amounts of neural data, it would be nice to get a feel
for how long the algorithm takes to run for data sets of different
sizes.
The paper itself is very well written and particularly
clear. It does not seem to suffer from any obvious grammatical or
mathematical errors (at least, to my eyes).
I very much
enjoyed this paper. It is a great example of what a good NIPS paper
should be: it uses modern machine learning methods to study problems
in neuroscience. Spike-triggered characterization methods have been
used for decades, and yet the seemingly simple problem of dealing with
high-dimensional feature spaces has been plagued with difficulties.
This paper provides steps towards solving this problem, which will
have long lasting implications in both neuroscience and machine
learning communities. The only negative aspect of this paper, as I
mentioned earlier, is that the results section is somewhat lacking,
and does not fully reflect how good the estimation technique seems to
be.
Q2: Please summarize your review in 1-2
sentences
An excellent methodology paper, only slightly lacking
in an adequate display of results and relevant comparisons to related
methods.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
I think this paper has some significant strengths; I
certainly rank it higher than a 2. There are some nice ideas; it's a
well-written paper, and I enjoyed reading it. But there are some
significant weaknesses here that need to be addressed.
The basic
question that is not addressed sufficiently directly here: is the proposed
method actually better than MID or the method of Fukumizu et al, either in
terms of computational efficiency or statistical accuracy? (Presumably not
the latter, since MID is asymptotically efficient.) Direct comparisons
need to be provided. In addition, the numerical examples provided here are
rather small. How well do the methods scale? The authors really need
to do a better job making the point that the proposed methods offer
something that improves on the state of the art.
In addition, a
basic weakness of the proposed method should be addressed more clearly.
Specifically, on L77, "if Q can be chosen such that I [Y ;U : V ] = 0":
as noted by reviewer 2, this typically won't happen in the case of
naturalistic stimuli X (unlike the case of Gaussian X, where we can easily
prewhiten), where these information-theoretic methods are of most
interest. So this makes the proposed methods seem significantly less
attractive. This is a key point here, and needs to be addressed more
effectively.
Minor comments:
L81: "another
dependency measure which ... shares its minimum with mutual information,
that is, it is zero if and only if the mutual information is zero."
These two statements don't appear to be equivalent in general. If
there is no Q for which the corresponding MI is zero, then it's not clear
that these two dependency measures will in fact share the same (arg)
min. This point should be clarified here.
Maybe worth
writing out a brief derivation for eq (2), at least in the supplement.
It would be nice to have a clearer description of how the
incomplete Cholesky decomposition helps; at least mention what the
computational complexity is. The kernel ICA paper by Bach and Jordan
does a good job of laying out these issues. Note that we still have to
recompute the incomplete Cholesky for each new Q (if I understand
correctly).
The relative drawbacks of MID seem overstated.
While Sharpee's implementation of MID bins the projected data, there's no
conceptual need to do this: a kernel (unbinned) estimator for the
required densities could be used instead (and would probably make the
gradients easier to compute), or other estimators for the information
could be used that don't require density estimates at all.
Similarly, as one of the other reviewers notes, the MID method could
easily be applied to multi-spike patterns. I also agree with the
second reviewer that L235 is incorrect and needs revision.
Q2: Please summarize your review in 1-2
sentences
I think this paper has some significant strengths.
There are some nice ideas; it's a well-written paper, and I enjoyed
reading it. But there are some significant weaknesses here that need to be
addressed.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank all reviewers for their constructive
feedback. We will make an effort to address all your minor comments in a
revised version of the paper. Here, we hope to clarify the major points.
First of all, we would like to stress that our primary motivation
was not to improve the runtime of MID but to offer a generalization that
extends to populations and spike patterns (we explain below why we think
this is difficult with MID). Given the page constraint, we felt that it is
more important to demonstrate our method on controlled examples of
increasing complexity than to compare it to other algorithms. Furthermore,
we are unsure how meaningful a runtime comparison would be, since LID
solves a more general problem than MID. We agree that a more extensive
experimental exploration is the next step after demonstrating the
effectiveness of the method on controlled examples. We will do that in the
near future. In the final version of the paper, we will adapt our
discussion of MID to avoid giving the impression our goal was to improve
the runtime.
Concerning the generality of our method, reviewer
#6(2) claimed that it would not yield any novel contribution beyond STC
for Gaussian stimuli, since STC is already sufficient to extract all informative
features for correlated Gaussian inputs. This is not the case for
two reasons:
1. STC and STA characterize the spike-triggered
ensemble, which essentially captures the informative features for a
single bin. In contrast, our method captures informative stimulus features
about more complex responses, including patterns of spikes or population
responses. Therefore, our method solves a more general problem than STC
and STA, independently of the stimulus distribution.
2. STC does not
necessarily identify all informative features, even for Gaussian stimuli.
This is because the spike-triggered distribution need not be
Gaussian, as this simple example illustrates: Let x in R^2 with
x ~ N(0,I), and let y=1 if a<=|x_1|<=b and 0 otherwise, where a and b
are chosen such that var(x_1|y=1)=1 (this can easily be done using the cdf and
ppf of the Gaussian). The spike-triggered distribution p(x|y=1) is still
white and has mean zero. Therefore, neither STA nor STC can detect the
dependencies between stimulus and response. Our algorithm, on the other
hand, detects the correct subspace [1,0]'. A similar example can be
constructed for correlated Gaussian stimuli. Therefore, our algorithm is
more general than STA and STC, even for Gaussian stimuli.
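The counter-example in point 2 can be reproduced numerically. The sketch below assumes the spiking condition is the symmetric band a <= |x_1| <= b, which is what makes the spike-triggered mean vanish. It uses the closed form E[x_1^2 | a <= |x_1| <= b] = 1 - (b*phi(b) - a*phi(a)) / (Phi(b) - Phi(a)), so the conditional variance equals 1 exactly when b*phi(b) = a*phi(a). The choice a = 0.5, the sample size, and the function names are illustrative, and bisection replaces the cdf/ppf route mentioned in the text.

```python
import math
import random


def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def matching_upper_bound(a, hi=10.0, tol=1e-12):
    """For 0 < a < 1, find b > 1 with b*phi(b) == a*phi(a) by bisection."""
    target = a * phi(a)
    lo = 1.0  # t*phi(t) peaks at t = 1 and decreases afterwards
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * phi(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def spike_triggered_moments(a, b, n=200_000, seed=0):
    """Sample x ~ N(0, I) in R^2 and return mean and covariance of x given y = 1."""
    rng = random.Random(seed)
    spikes = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if a <= abs(x1) <= b:  # y = 1 depends on x_1 only
            spikes.append((x1, x2))
    m = len(spikes)
    mean = [sum(s[k] for s in spikes) / m for k in range(2)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in spikes) / m
            for j in range(2)] for i in range(2)]
    return mean, cov, m


a = 0.5
b = matching_upper_bound(a)
sta, stc, n_spikes = spike_triggered_moments(a, b)
print("b =", b, "spikes =", n_spikes)
print("spike-triggered mean:", sta)
print("spike-triggered covariance:", stc)
```

Up to sampling noise, the spike-triggered mean is zero and the covariance is the identity, so neither STA nor STC separates x_1 from x_2 even though y is a deterministic function of x_1.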
Reviewer
#6(2) also raised the concern that a substantial component of the
irrelevant dimensions remains after optimization (Fig 1 right panel).
Here, we would like to stress that LID identifies the informative
_subspace_. Thus, the resulting basis vectors can be flipped or linear
combinations of the original filters. If the original filters are not
orthogonal, the subspace basis will look different from the filters (since
Q is in SO(n)). If the subspace is determined correctly, however, it
should contain the original filters. This is indeed the case for Fig.
1/right because substantial parts of the projected filters would be
missing otherwise. We will make this point clearer in a final version.
Our reasoning why an extension of MID to patterns and populations
is not straightforward is the following: MID maximizes the information
between a _single_ spike and the projection of the stimuli onto a
particular subspace (which is not equal to I[Y:v'*X], see related work
section in the paper). Assuming that p(v'x|no spike) is very similar
to the prior distribution p(v'x), the information of a single spike is
approximately the information carried by a single bin. As the single bins
can be correlated, it may be desirable to extend MID to several bins. When
using spike patterns or population responses, however, there are several
pattern-triggered ensembles p(v'x|pattern 1), p(v'x|pattern 2), ... which
are unequal to the prior distribution p(v'x) and, therefore, carry
information (see the equation in line 232 in the related work section). In
that case, I[Y:v'*X] has a term for each of those patterns and MID needs
to estimate the information for each of them in order to account for the
full stimulus response information. Depending on the number of patterns,
this can become substantially more involved.
Concerning the
independence between U and V, and the choice of Q such that I[Y,U:V]=0, we
would like to emphasize that we obtain several advantages by phrasing the
objective as I[Y,U:V]= I[Y:X] + I[U:V] − I[Y:U] while maintaining
applicability to most common stimulus distributions because I[U:V]=const
for most of them. This means that minimizing I[Y,U:V] becomes equivalent
to maximizing I[Y:U]. Clearly, for white and correlated Gaussians,
I[U:V]=0 after prewhitening. For elliptically symmetric distributions,
prewhitening yields a spherically symmetric distribution which means that
I[U:V]=const, since U and V correspond to coordinates in an orthogonal
basis. Since natural signals like images and sound are well described by
elliptically symmetric distributions (see work by Lyu/Simoncelli and
Hosseini/Sinz/Bethge) I[U:V]=const as well. This also means that even if Q
cannot be chosen such that I[Y,U:V]=0, the objective function is still
sensible. The only reason why we need I[Y,U:V]=0 is to tie the minimum of
the integral probability metric (IPM) to the minimum of the Shannon
information. However, even if that minimum cannot be attained, the IPM
objective will still yield reasonable features.
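The decomposition of the objective used in this paragraph follows from the chain rule for mutual information. The short derivation below is a sketch, assuming that (U, V) is an invertible (orthogonal) transformation of X, so that I[Y:X] = I[Y:U,V].

```latex
\begin{align*}
I[Y,U:V] &= I[U:V] + I[Y:V \mid U]
  && \text{(chain rule, splitting off } U\text{)} \\
I[Y:X] = I[Y:U,V] &= I[Y:U] + I[Y:V \mid U]
  && \text{(chain rule, } X \leftrightarrow (U,V) \text{ invertible)} \\
\Rightarrow\quad I[Y,U:V] &= I[Y:X] + I[U:V] - I[Y:U]
  && \text{(eliminating } I[Y:V \mid U]\text{)}
\end{align*}
```

Since I[Y:X] does not depend on Q, minimizing I[Y,U:V] over Q is the same as maximizing I[Y:U] whenever I[U:V] is constant in Q.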
In return for the
additional restriction and the use of the IPM we get an algorithm that (i)
naturally extends to patterns and populations while maintaining
computational feasibility, (ii) can assess the null distribution via
permutation analysis (as opposed to Fukumizu et al.), and (iii) does not
have to use factorizing joint kernels (as opposed to Fukumizu et al.; we
only chose factorizing kernels for convenience).
 