Paper ID: 1417
Title: Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream
Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Paper 1417 "Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream"
This paper describes a novel procedure for finding a set of heterogeneous neural response filters in a hierarchical model of the visual ventral stream. The main motivation for this work is that previous models fail to achieve the specific categorical structures present in neural responses in the top end of the stream, IT.

Quality. The results are strong and convincing; the model demonstrates both a high similarity to neural IT responses and outstanding performance in 8-way classification tasks.

Clarity. The paper is reasonably well written, though highly condensed. Some elaboration on the HMO procedure would be welcome, as it required multiple readings to understand (and there is space left in the paper).

Significance. The paper represents a substantial and convincing step in computational models of deeper brain structures.

It would have been interesting to see how the results scale with features derived from big data, like deep learning approaches.

Q2: Please summarize your review in 1-2 sentences
The paper describes a novel and successful method for deriving the heterogeneity of feature detectors needed for building a deep model of the ventral stream, in particular for modeling object recognition area IT.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:

This paper presents an algorithm for category-level object recognition that searches a large space of heterogeneous hierarchical neural network architectures using a boosting mechanism. The resulting network found via this algorithm is used to predict multiunit neural activity in monkey IT, as well as the similarity structure in monkey neural IT and human fMRI IT representations. This is the first reasonably successful prediction of neural activity in IT in response to complex naturalistic images of objects.

Pros:

Predicts neural activity in IT in response to complex naturalistic images of objects for the first time and with reasonable success. This is a significant achievement, and will be exciting to many at NIPS.

Introduces a new algorithm for category-level object recognition, which optimizes a very complex, heterogeneous neural network architecture using a boosting methodology. This will also be of interest to non-neuroscientists interested in deep learning and object classification.

The comparisons to neural data are done to a high standard, with results replicated on three independently collected datasets (two monkey multiunit, one human fMRI).

The paper is clearly written in general.

Cons:

The Krizhevsky et al. object recognition network should be included in the evaluation, since this model was shown to be promising in closely related prior work (e.g., [2]). This is a very natural baseline model to compare to. The Krizhevsky network was shown to achieve classification performance superior to that of Monkey IT, on the exact Neural Representation Benchmark dataset used in this paper.

The paper describes running the proposed HMO algorithm to obtain a network N_{HMO} used in the comparisons to experimental data. It would be very interesting to learn more about the properties of this optimal network. What is its topology? Does the resulting network actually exploit the heterogeneity permitted in the search space, or does the procedure end up selecting networks of some standard depth? Since intuition into what makes a good architecture is generally hard to come by, it would be useful to see this example. It would also be useful to describe other properties of the specific HMO implementation used in this paper, namely, how many networks were evaluated in the optimization of single-stack networks, etc.

As it stands, it is not possible to determine whether the proposed algorithm is necessary to model IT well, or whether other state-of-the-art object recognition methods would perform comparably. This has bearing on the view advanced at the end of the discussion that the paper's results argue for one relatively fixed, unmodifiable "core recognition" network which feeds into more plastic "downstream specialists" responsible for learning different tasks. The Krizhevsky network, by contrast, would be plastic throughout all levels, with no obvious "two-step" arrangement. If it performed comparably in predicting IT responses, it would weaken the paper's argument in support of the dual-system view. On the other hand, if the Krizhevsky network underperformed the proposed method, this would be very informative. As it stands, arguing in support of the two-step arrangement seems premature given the other tenable but unexamined models in the literature.

Minor:

The description of the HMO algorithm is hard to follow. The score equation in line 143, Sum N(F(s)), makes sense to me as E(N(s)). Also, it appears that F is never defined. The reweighted score function in line 156 again makes use of the undefined F, F_1, etc., which presumably again denote N.
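For concreteness, the following sketch captures how I currently read the intended scoring and reweighting (all names are my own, and the AdaBoost-style update is my guess at the procedure, not necessarily the authors' exact algorithm):

    import numpy as np

    # correct[k, s] is True if candidate network k classifies stimulus s
    # correctly -- roughly N_k(s) in the paper's notation as I read it.
    def select_complementary_networks(correct, n_rounds=3):
        n_networks, n_stimuli = correct.shape
        w = np.ones(n_stimuli) / n_stimuli       # uniform initial weights
        chosen = []
        for _ in range(n_rounds):
            scores = correct.astype(float) @ w   # weighted accuracy, E_w[N_k(s)]
            best = int(np.argmax(scores))
            chosen.append(best)
            # Upweight stimuli the selected network misses, so that later
            # rounds favor architectures complementary to those chosen.
            err = max(1.0 - scores[best], 1e-12)
            w = np.where(correct[best], w, w * scores[best] / err)
            w /= w.sum()
        return chosen

Spelling out the procedure at roughly this level of detail in the text would resolve the notational ambiguity.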

229: "Matricx" -> "Matrix"

Fig. 3A: include Y axis label
Q2: Please summarize your review in 1-2 sentences
Predicts neural activity in IT in response to complex naturalistic images of objects for the first time and with reasonable success. Does not include an important baseline model.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This is an interesting paper proposing a hierarchical modular network model of the ventral visual pathway which is learnt by boosting. The paper is written very clearly, and the comparison between model behavior and experimental data is also well done. Although I am not a specialist in the field, I feel the contribution of this paper is important.
Q2: Please summarize your review in 1-2 sentences
I would recommend the acceptance of the paper.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a new computational model of visual processing and shows that the representation learned by this model matches the response characteristics of IT cortex in the brain more closely than previous models do. The quality of the work is high, and I have no doubts about the validity of the results or the underlying methodology of experimentation.

The paper is original but not completely unprecedented. The HMO model itself is an interesting idea for constructing an ensemble model out of many heterogeneous variants of a previously published visual model. The comparison of the hidden representation constructed by the model to real brain responses follows on some recent ideas in the literature, but the new model the authors suggest does seem to be an improvement over other published models evaluated on this metric.

The clarity of the paper is somewhat mixed. Some sections were exceptionally well written (e.g. the Introduction), while others were somewhat harder to follow than they might have been (e.g. High-Throughput Screening via Hierarchical Modular Optimization).

The work is significant principally because it is part of what seems like an important trend in the literature: to develop quantitative models of visual computation and make meaningful comparisons between these models and the brain. In essence, the aim is to treat computational models as well-formed hypotheses for visual brain function (at least at a coarse scale) and to measure how well they account for real data.

Minor comments:
The section title "Intro" should be expanded to read "Introduction".
The word "nurture" is misspelled in the Discussion section.
Q2: Please summarize your review in 1-2 sentences
This paper presents a new computational model of visual processing and shows that the representation learned by this model matches the response characteristics of IT cortex in the brain more closely than previous models do.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thanks for the positive comments; those are always nice to get. We also basically agree with the substantive concerns raised. We don't view this as a rebuttal so much as a response laying out how we plan to address these concerns.

Here are our responses to what we perceive as the main issues brought up in the reviews:

1) More sophisticated filterbank learning mechanisms: So far, we've chosen to use (uniform) random filters, and to control only the variances and means of these filters. From some parameter analysis that we've done, it seems that those gross filterbank statistics actually matter a lot, especially having multiple such values heterogeneously composed in the modular components of the model. This suggests that the reviewer's suggestion could be quite important. Perhaps we could materially improve performance (and also neural fitting) by using more sophisticated mechanisms for choosing filterbank values, in addition to the architectural parameter optimization we have already worked on -- for example, the contrast filters described in the Ghebreab et al. NIPS paper, or other approaches, like back-propagation and deep learning. We're addressing this now for future work, but probably won't have anything material to add in a revision. However, it is a good suggestion.
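To illustrate what we mean by controlled filterbank statistics (a minimal sketch with our own variable names, not the production code), the controlled mean and variance amount to choosing the bounds of the uniform distribution from which each filterbank is drawn:

    import numpy as np

    def random_filterbank(n_filters, size, mean, var, rng=None):
        # Uniform(a, b) has mean (a + b) / 2 and variance (b - a)^2 / 12,
        # so pick a, b to hit the requested filter statistics.
        rng = np.random.default_rng() if rng is None else rng
        half_width = np.sqrt(12.0 * var) / 2.0
        a, b = mean - half_width, mean + half_width
        return rng.uniform(a, b, size=(n_filters, size, size))

    # A heterogeneous module might then compose several such banks
    # (the particular mean/variance values here are illustrative):
    banks = [random_filterbank(64, 5, mean=m, var=v)
             for m, v in [(0.0, 0.01), (0.1, 0.05)]]

More sophisticated learning mechanisms would replace this sampling step while keeping the architectural optimization intact.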

2) Comparison to some recent deep learning feature approaches, especially Krizhevsky et al. This is a really important issue. One of the reviewers references our reference [2], a recent effort led by a collaborator in our group. There were some technical reasons we didn't include all the feature sets compared in reference [2]. First of all, we didn't actually have access to feature vectors from the algorithms evaluated in [2] for a significant subset of the images. That is, the "neural representation benchmark" (NRB) set that we used in this current work contains a significant number of images that were held out of the test in [2], including additional object categories. More importantly, we didn't have features extracted for the algorithms in reference [2] for the Man vs. Monkey datasets (this especially goes to reviewer #2's comments about the "two-step" model). As a result, it would have been hard to fit those algorithms in for a direct feature comparison, at least at the time of writing the paper. What we will try to do in preparing a final version (if accepted) is to have the groups that participated in [2] extract features on the remaining images from the NRB dataset, as well as the Man vs. Monkey datasets. We'll also want to do this with features from the algorithm described in recent work by Zeiler & Fergus.

What we have done already: we have the features from the algorithms compared in [2] for a subset of the NRB images, and we have looked at the comparisons on that subset. The answer, as far as we can tell so far, is that the Krizhevsky features do pretty well at fitting the neural data -- somewhat less well overall than the HMO features we present here, but still significantly better than the other control models. We were happy about this, because it provides further evidence for our main point: by optimizing for performance on an object recognition task, one produces models that get better at predicting neural data. We suspect that once we get all the images [especially on the Man vs. Monkey dataset] -- either in a revision of this paper, or perhaps (more likely) in a longer, more detailed journal submission -- we'll have a really strong case for this, integrating these other new feature sets as data points.

Something interesting happens even with the data we do have now: in the course of the neurophysiology experiment, we measured data from neurons in two animals. The HMO features seem to do better at predicting one monkey, and the Krizhevsky features better at the other. [It wasn't clear how to include this in the existing paper, but maybe we should try to think of a way to do so in a revised version for final submission?] This made us wonder how well the neural features from one monkey predict the neurons of the other monkey -- and the answer is, basically no better than either the HMO or the Krizhevsky features. This has further begun to make us wonder: to what extent is the view of "IT" as a unified area in the brain really right? During the experiment, we placed arrays in slightly different parts of the two monkeys' brains, both within anatomical IT, but still not exactly in the "same place" according to one or another coordinate system (the utility of which is also suspect). This result led us to wonder: is there substructure in IT that we can study and compare to models like HMO and SuperVision? This kind of result would be of real interest to neurophysiologists because it would suggest that we can make more detailed predictions about visual cortex structure by studying differences between models. In any case, this is a key next step that we are working on actively.
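For reference, the cross-prediction analysis we describe has the following form (a sketch under our own naming; ridge regression stands in here for whichever regularized linear mapping is actually used, and the exact cross-validation scheme may differ):

    import numpy as np
    from sklearn.linear_model import RidgeCV

    # features: (n_images, n_features) -- HMO features, Krizhevsky features,
    #           or recorded sites from one monkey.
    # targets:  (n_images, n_sites)    -- IT multiunit responses from the
    #           other monkey.
    def predictivity(features, targets, train_idx, test_idx):
        model = RidgeCV(alphas=np.logspace(-3, 3, 13))
        r = []
        for j in range(targets.shape[1]):
            model.fit(features[train_idx], targets[train_idx, j])
            pred = model.predict(features[test_idx])
            r.append(np.corrcoef(pred, targets[test_idx, j])[0, 1])
        return np.median(r)  # median correlation across recorded sites

The observation above is that with monkey-A sites as the features and monkey-B sites as the targets, this score is no higher than when the HMO or Krizhevsky features are used instead.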

3) Reviewer #2 is right to suggest that we did not record the neurophysiology data for the purpose of this paper. We should make this clearer, and will do so. However, we also want to be clear that the neurophysiology data reported here is NOT "taken" from our reference [2], as suggested. This neurophysiology data has actually not yet been reported in any journal paper! (Our lab submitted a paper last week with the main report on that data.) So both this current paper and [2] are derivative modeling efforts from core neurophysiology data developed and recorded by Ha Hong (one of the two first authors of this current paper). We need to make that clear.

4) Typos in the description of the model will be fixed. Moreover, we'll try to make the text clearer; we've been working on text that is cleaner and simpler and that we think will make the procedure more straightforward to understand.