Paper ID: 1298
Title: Saliency, Scale and Information: Towards a Unifying Theory
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a simple, yet apparently powerful, saliency metric. For a given scale, it can be interpreted as the likelihood of the pixel intensity (or feature) under a distribution computed using kernel density estimation in a weighted neighborhood around the pixel. When using a Gaussian window for the neighborhood, the paper casts the expression for the metric as a scale-specific instantiation of a filtering operation. The final saliency of the pixel is then computed as the maximum of this per-scale metric across a range of scales. The paper provides an evaluation of this metric on a standard benchmark for predicting object saliency, both by itself and in combination with the MCG method, where it is found to perform competitively despite its simplicity. To showcase the possible application of the metric beyond images, a formulation for saliency on 3-D meshes is also introduced and discussed.
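The computation summarized above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes that the per-scale metric is the self-information (negative log-likelihood) of the pixel value, that both the spatial window and the intensity kernel are unnormalized Gaussians, and that the window spans the whole image. The function names, the `sigmas` set and the `bandwidth` value are all hypothetical.

```python
import math


def gaussian(squared_distance, sigma):
    """Unnormalized Gaussian weight for a squared distance."""
    return math.exp(-squared_distance / (2.0 * sigma * sigma))


def self_information(image, row, col, sigma, bandwidth):
    """Negative log of a spatially weighted kernel density estimate.

    The likelihood of the centre value is estimated from all pixels,
    each weighted by a Gaussian spatial window of scale `sigma`.
    """
    center = image[row][col]
    weighted_sum = 0.0
    weight_total = 0.0
    for r, line in enumerate(image):
        for c, value in enumerate(line):
            w = gaussian((r - row) ** 2 + (c - col) ** 2, sigma)  # spatial weight
            k = gaussian((value - center) ** 2, bandwidth)        # intensity kernel
            weighted_sum += w * k
            weight_total += w
    return -math.log(weighted_sum / weight_total)


def max_over_scales_saliency(image, sigmas=(1.0, 2.0, 4.0), bandwidth=0.1):
    """Per-pixel maximum of the per-scale self-information (assumed scales)."""
    return [
        [max(self_information(image, r, c, s, bandwidth) for s in sigmas)
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]


if __name__ == "__main__":
    # A flat 5x5 image with a single outlier in the middle.
    img = [[0.0] * 5 for _ in range(5)]
    img[2][2] = 1.0
    for line in max_over_scales_saliency(img):
        print(" ".join(f"{v:.3f}" for v in line))
```

In this sketch the centre pixel always contributes to its own estimate, so the likelihood is never zero and the logarithm is always defined. The brute-force loops are quadratic in the number of pixels; the filtering interpretation mentioned above is what would make a practical implementation efficient.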

Overall, the paper is very well-written and does a good job of drawing parallels to a diverse set of methods in the literature that deal with saliency, segmentation, and the human visual system. The experimental results are compelling as well, and demonstrate that the metric has the potential to be practically useful. I am inclined to recommend acceptance.

I wonder if the authors could comment on possible extensions that take the formulation beyond circularly symmetric windows at each scale. In particular, would there be any benefit to generalizing the proposed metric to arbitrary 2-D Gaussian spatial windows, with the maximum now computed over general 2-D covariance matrices instead of scalar scale values? Such a discussion might naturally fit into the section on viewpoint invariance.
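One way to write down the generalization asked about here is given below. The notation is assumed purely for illustration and is not taken from the paper: f(x) denotes the feature at pixel x, K a kernel on feature differences, G a spatial window, and saliency is taken to be the self-information maximized over the window parameter.

```latex
% Isotropic case (assumed form): one scalar scale per window.
p_{\sigma}(x) = \frac{\sum_{y} G_{\sigma}(x - y)\, K\big(f(x) - f(y)\big)}
                     {\sum_{y} G_{\sigma}(x - y)},
\qquad
S(x) = \max_{\sigma} \big[ -\log p_{\sigma}(x) \big]

% Anisotropic variant: the scalar scale is replaced by a covariance matrix.
G_{\Sigma}(d) = \exp\!\Big( -\tfrac{1}{2}\, d^{\top} \Sigma^{-1} d \Big),
\qquad
S(x) = \max_{\Sigma \succ 0} \big[ -\log p_{\Sigma}(x) \big]
```

Setting \Sigma = \sigma^{2} I recovers the circularly symmetric window, so under these assumptions the anisotropic maximum is never smaller than the isotropic one, at the cost of a search over three parameters instead of one.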
Q2: Please summarize your review in 1-2 sentences
The paper proposes a new saliency metric for images, based on the maximal uniqueness of pixel intensity (or any per-pixel feature vector) as measured in neighborhoods at different scales.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
A description of how the MISS value is computed algorithmically, and a running-time comparison between different saliency computations, would be appreciated.
Q2: Please summarize your review in 1-2 sentences
The work proposes a new saliency measure for images and other data. While the central idea is not completely new, it is interesting in that it unifies several previous works on saliency computation. The work would be acceptable at most conferences on image analysis or computer vision research, but may be borderline for NIPS.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a new definition for visual saliency as maxima in information-scale space (MISS). Evaluation and comparisons are done on the tasks of fixation prediction and salient object prediction. Robustness of MISS is characterized with respect to varying image conditions including viewpoint, lighting and blur. MISS is also explained in terms of deep-learning-based saliency models.

Strengths: 1. Relates definition of saliency to scale-space information theory. 2. This new definition is naturally related to the problem of scale selection of local regions in both 2D images and meshes of 3D shapes. In computing saliency, scale can also be selected based on maximizing the information theoretic quantity of self-information.

Weaknesses:

1. Evaluation on the tasks of fixation prediction and salient object detection does not show overall better performance relative to the state of the art.

For example, see Table 1: MISS-related approaches do not perform best on all datasets. In Table 2, MISS performs best only on the ft dataset. In Table 3, MISS performs best only on the ft and PASCAL datasets.

2. Robustness to image transformations seems poor in Figure 3.

Q2: Please summarize your review in 1-2 sentences
The paper presents a new definition for visual saliency as maxima in information-scale space (MISS), and relates it to a variety of previous works on scale-space theory, interest point detection, bilateral filtering and existing visual saliency models. However, somewhat weak evaluation results compared to other existing approaches put this work below the bar.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:

The paper proposes a definition of visual saliency which is based on Maxima in Information Scale-Space (MISS). Connections to scale-space theory and information theory are explained, as well as to some other works. Several experiments are reported, including on fixation and salient objects in 2D, and 3D salient features of 3D meshes.

Quality: The paper is well written, and it seems technically correct.

Connections between information theory and saliency are interesting (and to some extent have been explored in the past by others, although in somewhat different formulations), and combining these with scale-space theory makes a lot of sense (and, again, to some extent, has also been explored before; e.g., see below).

One problem with the quality of the paper is that the technical/mathematical contribution is fairly small, and pretty much boils down to a definition of a saliency measure.

Clarity:

The paper is well written.

Originality:

It is great that the authors point out the connections to the classic works on scale-space theory, and tie together scale-space theory and information theory in the context of visual saliency. There is a paper, however, by Toews and Wells ("A mutual-information scale-space for image feature detection and feature-based classification of volumetric brain images", CVPRW, 2010) which seems to diminish *some* of the novelty of the present work. May the authors please clarify the relation to (and differences from) Toews and Wells?

Significance: This paper may have some positive impact on a subset of the NIPS community.

Remark: it is always a pleasure (for me) to see a deep net fail (Figure 5, where the results obtained by the proposed method are clearly better than those obtained by the deep-net approach). However, perhaps the authors should add at least one counterexample? Surely there was at least one image where the SALICON output was better than the proposed method's output...

Q2: Please summarize your review in 1-2 sentences
The paper proposes a new measure of visual saliency that is based on an interesting connection between classical scale-space theory and information theory. Connections to a few other works are nicely explained and the experimental section is sufficiently thorough. A main caveat about the paper is that the technical/mathematical contribution is rather small.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes an information theoretic definition of visual saliency. The proposed approach is conveyed by building upon the scale space theory and nicely explained by drawing many connections to the existing literature.

An information-theoretic view of scale space already exists (for instance, see the paper below); however, applying it to the visual saliency problem is an interesting contribution.

Kristian Lindgren, "An Information-Theoretic Perspective on Coarse-Graining, Including the Transition from Micro to Macro", Entropy 2015

As a theoretical paper it is valuable. It gives a nice information theoretic insight for the saliency problem.

However, in terms of results it is below the state of the art. Nevertheless, the theoretical insights and the contribution deserve to be published.
Q2: Please summarize your review in 1-2 sentences
The paper has some valuable theoretical contributions which are nicely explained in the context of existing literature. The experimental validation is not too strong, however the theoretical contribution puts the paper slightly above the acceptance threshold.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We wish to thank the reviewers for their valuable feedback, and enthusiasm for this paper. There appears to be a broad appreciation for the theoretical contributions of this work, with a few questions concerning relationship to related work and interpretation of quality of benchmark results. We have addressed these concerns directly in the individual responses that follow:

R1: It is true that one might consider arbitrary 2-D Gaussian spatial windows based on a covariance matrix, in the same vein as affine invariant interest point selection. Both the underlying feature space and shape of support are factors that might be exercised to quantify saliency in a manner that is most advantageous for a particular application. It is also the case that even for the circular support, directional information may be derived from considering the kernel weights corresponding to the adaptive portion of a local bilateral kernel (that corresponds to the pixel-wise contribution to the local likelihood estimate). This directionality might be useful for efficiently converging on the optimal covariance-defined window, but also more generally provides adjunct information for scene understanding, border ownership etc. beyond a scalar quantity defining saliency. We appreciate this suggestion, and plan to include discussion of these possibilities in the revised draft.

R2: The definition presented in Toews and Wells is very similar to that employed by Kadir and Brady. Entropy across scale is considered as in Kadir and Brady, but a penalty is incurred according to localizability (framed as mutual information in subtracting entropy across adjacent scales). This is in the same spirit as the approach of Kadir and Brady, which weights the entropy according to the derivative of the entropy-scale curve to select salient regions. While entropy and scale are involved, this formulation does not carry the same generality as the proposal presented in the paper (or relationships to other research discussed). Nevertheless, this is an important reference that will be discussed in revisions. It is also worth mentioning that there are a number of advantages to the MISS proposal including the quality of output as a measure of saliency, the distinct cross-scale information profiles for different categories of image pixel selections, the generalization to mesh (and other) data, and directionality discussed in the response to R1. We will also include some examples that show a more favorable side of SALICON as suggested.

R3: The Lindgren paper relates information to scale, but is grounded in thermodynamics, state configurations and chemical systems. Any overlap therefore is mostly at a very coarse grained (or even superficial) level of abstraction, and does not diminish the value of any of the observations presented in our paper. Some of the value of our submission lies in relating a variety of distinct and important fields of inquiry, with many of the results competitive with or better than the current state-of-the-art in considering benchmarks. These are characteristics that are exclusive to our submission.

R5: There is some concern expressed that the approach does not perform best on metrics for *every* dataset. It remains the case that it is the best for many datasets, and we don't feel that the value of the contribution rests on being best on every benchmark. The theoretical underpinnings of this work are of critical importance, and there are many cases where the approach is competitive with, or better than, the state of the art despite its simplicity. Moreover, the generality of what is proposed presents considerable opportunity for exploring model variants (e.g. alternate features) that do offer advantages for specific tasks. Regarding Figure 3, this analysis is atypical for work involving saliency. The intent is not to outperform a long history of work in keypoint matching, but to reveal a striking degree of stability for a saliency model when using the proposed definition.

R6: We appreciate the encouraging remarks. We agree that this work presents an important contribution that is complementary to the current heavy emphasis on deep learning, and that also carries value in its simplicity and perspective on a large body of classic research.

R7: We intend to share the implementation for these methods, which will help to communicate the mapping from the mathematics to algorithmic form. The running times in general are implementation dependent; however, given that the proposed method is highly parallelizable, we have written a very efficient GPU implementation. A more detailed analysis of running time (and complexity) is ongoing and will also be shared on the webpage dedicated to this work.