    
      
       Submitted by 
      Assigned_Reviewer_6
  Q1: Comments to author(s). 
      First provide a summary of the paper, and then address the following 
      criteria: Quality, clarity, originality and significance. (For detailed 
      reviewing guidelines, see 
      http://nips.cc/PaperInformation/ReviewerInstructions) 
The authors claim to present a simplified perspective of EDML that: (1) casts it as a general approach to continuous optimization; (2) makes immediate some results that were previously obtained for EDML through more effort; and (3) facilitates the design of new EDML algorithms for new classes of models.
First, EDML in Bayesian networks is reviewed as a method for optimizing the likelihood of the observed data, where each update step for the next parameters requires inference on the Bayesian network using the current parameters. Then, the same analysis is applied to undirected models (MRFs). As a last step, EDML is claimed to converge in one iteration for MRFs under complete data. This last claim is evaluated experimentally against conjugate gradient (CG) and L-BFGS on benchmark problems, including handwritten digits modeled with 2D-grid MRFs and 13 other networks amenable to jointree inference. EDML outperforms both CG and L-BFGS in running time until convergence to a given accuracy, even though the number of iterations may at times be larger.
Comments:
- The title is very similar to that of the UAI10 paper and does not emphasize the main point claimed for this particular work, namely "observing EDML as a general approach for continuous optimization under complete data."
- Why would the overview of EDML in the UAI paper not fall under the definition of "continuous optimization"? I am not sure I see the difference in "perspective."
- To my understanding, the focus in previous work was application to Bayesian networks and comparison to EM, while here the focus is undirected MRFs and comparison to CG and L-BFGS. The experiments indeed show the benefit of using this method over the other optimization techniques.
      techniques. Q2: Please summarize your review in 1-2 
      sentences 
      The main contribution is providing a useful technique 
      for parameter training in MRFs with complete data, by "transferring" EDML 
      from Bayesian (directed) to undirected networks.  It wasn't clear to me 
      why this is claimed to be a different view of the 
      method.  Submitted by 
      Assigned_Reviewer_7
  Q1: Comments to author(s). 
      First provide a summary of the paper, and then address the following 
      criteria: Quality, clarity, originality and significance. (For detailed 
      reviewing guidelines, see 
      http://nips.cc/PaperInformation/ReviewerInstructions) 
The paper provides a simplified perspective on EDML and illustrates the benefits of such a simplified view. First, existing results about EDML become much easier to prove. Second, and most important, this new formulation provides a systematic procedure for deriving new instances of EDML for other models. This is illustrated for Markov networks. The derived EDML is shown to be competitive with existing learning approaches.
Parameter estimation of graphical models is important. The paper, which is extremely well written, presents a very elegant view of this task. I also like the simplified view of EDML. It really helps in understanding its pros and cons. It also shows that EDML essentially implements a coordinate descent for parameter estimation. Indeed, such approaches have been proposed for graphical models --- for instance, iterative proportional fitting can be viewed as such a method --- however, I expect it to be too slow to scale well. Still, there are some recent advances for MaxEnt models:
Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin: Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models. Journal of Machine Learning Research 11: 815-848 (2010)
which also nicely compares iterative scaling and coordinate descent. Moreover, it shows that there is not just one IPS approach in general but several; it just depends on the auxiliary function you are using. This should really be clarified.
In my opinion, the big pro of the present paper is that it shows that parameter estimation within graphical models requires iteratively solving rather simple convex optimization sub-problems. In many cases, each sub-problem is of simple form, too, and involves inferences made from the data at hand. I greatly value this clear view of what parameter estimation is.
However, there are also other gradient optimization techniques that avoid the line-search and have an interesting connection to EM, such as:

E. Bauer, D. Koller, Y. Singer: Update Rules for Parameter Estimation in Bayesian Networks. UAI 1997: 3-13

L.E. Ortiz, L. Pack Kaelbling: Accelerating EM: An Empirical Study. UAI 1999: 512-521

J. Fischer, K. Kersting: Scaled CGEM: A Fast Accelerated EM. ECML 2003: 133-144

M. Jamshidian, R.I. Jennrich: Acceleration of the EM Algorithm by Using Quasi-Newton Methods. Journal of the Royal Statistical Society B, 59(3): 569-587 (1997)
I would expect these approaches to work almost as well but to run considerably faster than CG and L-BFGS.
Still, the transfer to the Markov network learning setting is interesting.
 
  Q2: Please summarize 
      your review in 1-2 sentences 
To summarize,

+ Novel perspective on EDML that provides additional insight into it.
+ Derives an EDML approach for parameter learning of Markov networks.

- The connection to CD and IPS approaches should be improved and also illustrated empirically.
- Many accelerated parameter estimation methods are not discussed. For instance, there are accelerated EM approaches that also make use of advanced gradient methods, such as scaled conjugate gradient, avoiding the line-search via scaling an (approximate) Hessian.
Although I still think that the novel view is highly interesting, the discussion also revealed that some issues (see the other reviews) should be addressed. Still, since I think the simplified view is interesting, I lean towards accept. Submitted by
      Assigned_Reviewer_8
  Q1: Comments to author(s). 
      First provide a summary of the paper, and then address the following 
      criteria: Quality, clarity, originality and significance. (For detailed 
      reviewing guidelines, see 
      http://nips.cc/PaperInformation/ReviewerInstructions) 
Summary: Overall, I think the paper is potentially useful and a contribution to the community, since it puts forth a simpler framework for EDML (the previous works on EDML are quite dense). However, I was disappointed that it was presented in a confusing manner and that it didn't address a number of key issues such as convergence, relationship to EM, etc. Further, the contribution to parameter estimation for undirected models is quite incremental.
Contribution: This paper portrays EDML as a fixed-point iteration whose fixed points correspond to stationary points of the log likelihood. Each iteration requires performing inference and has a notion of soft evidence, like EM. Its application to complete-data learning for Markov networks reduces to a Jacobi-style/coordinate descent-type method for the convex log likelihood and can be used as a black-box optimizer for Markov networks.
  Novelty: 
       The connection between EDML and continuous optimization is novel. 
      However, the literature on alternatives to EM is vast, and I'm not sure 
      how novel it is as a learning method under missing data. For complete-data 
      Markov networks, the method is similar to basic, known methods for convex 
      optimization. 
  Pros:  Attractive method for learning in the 
      presence of missing data that points to fertile ground for future work. 
      Unified treatment of directed and undirected graphical models is also 
      interesting and potentially impactful. Many practitioners would benefit 
      from reading the paper in order to think about alternative perspectives on 
      parameter learning in the presence of unobserved data. 
Cons:
1) The paper casts EDML as an optimization technique, but lacks the necessary details for readers to understand this interpretation.
2) Exposition, particularly notation, is sometimes confusing.
3) Insufficient conceptual or experimental comparison to EM.
4) Experiments only address an overly simple case of parameter learning problems (complete data).
1) I found section 2 very confusing. It first appears as though you're introducing some new family of optimization algorithms, but then it becomes apparent that you're discussing the general family of techniques described in the first paragraph of the related work section. I would reshape this section to first state that such a family exists and discuss its convergence properties. Then, I'd state that for the particular class of problems you're dealing with, these coordinate-wise minimizations are tractable. Claim 1 is boring, and more space should be made for previewing the sort of content discussed in the first paragraph of the related work section.
The paper also lacks a formal discussion of the method's convergence behavior. How do we know that the update rule at the top of page 5 will converge? I agree that its fixed points are fixed points of the marginal likelihood, but how do we know it will ever converge to one of these? This issue would be remedied if you were much more straightforward in section 2 about what is needed for such coordinate minimization techniques to converge. Fixed-point iteration is not guaranteed to converge; you need some sort of contraction. See, for example, proofs about the convergence of loopy belief propagation.
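To make the contraction point concrete with a generic sketch (this is an illustration only, not the paper's EDML updates): parallel coordinate minimization of a quadratic is exactly Jacobi iteration, whose fixed points solve the stationarity conditions, and which converges when the update map is a contraction, e.g. when the matrix is strictly diagonally dominant.

```python
# Generic illustration: minimizing f(x) = 0.5 x^T A x - b^T x by solving each
# coordinate's subproblem in parallel, using the *current* values of the other
# coordinates (Jacobi-style update):
#   x_i <- (b_i - sum_{j != i} A_ij x_j) / A_ii
# Fixed points of this map solve A x = b. The map is a contraction (and the
# iteration converges) when A is strictly diagonally dominant; without some
# such condition, fixed-point iteration may diverge.

def jacobi_minimize(A, b, x0, iters=200):
    """Parallel per-coordinate minimization; all coordinates updated at once."""
    n = len(b)
    x = list(x0)
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

# Strictly diagonally dominant A: the iteration contracts and converges to the
# unique minimizer, i.e. the solution of A x = b (here x = [1/11, 7/11]).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = jacobi_minimize(A, b, [0.0, 0.0])
```

Dropping the diagonal dominance (e.g. swapping the diagonal and off-diagonal entries of A) makes the same update map expansive, and the iterates blow up, which is precisely the gap the review asks the paper to address.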
      
Why does section 3.4 have the title "Finding the Unique Optimum"? The subfunctions are convex, so each of them has a unique minimizer. However, the overall problem doesn't have a unique stationary point, and the algorithm you present in this section is for finding such a point. I understand that the updates at the top of page 5 are obtained by analytically minimizing simultaneously in every coordinate (where you find a unique minimum), but the title is very confusing.
  2) The 
      notation in section 3.2 (soft evidence) needs to be defined much more 
      clearly. In equation (3) is there implicitly a sum over possible values 
      that X_i can take on? I don't understand how you can get a marginal 
      likelihood without summing over the possible values for each X_i. What's 
      the difference between P(\eta_i|x) and P(X_i|x)? The corresponding 
      appendix does not clear things up either. Furthermore, I didn't find 3.2 
      particularly illuminating about the method without some comparison to 
      'soft evidence' in EM. 
  Also, it's strange that you present an 
      algorithm at the top of page 5 based on an earlier work and leave it to 
      the reader to deduce that this can also be obtained by applying the 
      procedure from page 2 to the log likelihood. You should, at the very 
      least, say that the updates at the top of page 5 correspond to exact 
      simultaneous minimization of the subfunctions. You should also discuss 
      under what conditions such fixed point iteration will converge. 
  3) 
      In general, I'd like to see much more discussion about the relationship 
      between EDML and EM. EM has some desirable properties (monotonicity of the 
      updates) and some downsides (it's maximizing a concave lower bound on the 
      marginal likelihood rather than the actual likelihood). It would be good 
      to juxtapose these with the properties of EDML. Of course it would have 
      been nice to see experiments comparing the two as well. 
  4) The 
      experiments section only considers complete-data learning, in which the 
      log likelihood is convex. In this setting EDML is just another convex 
      optimization method, so a much more rigorous discussion of its convergence 
      behavior, memory requirements, per-iteration complexity, robustness to 
      errors in approximate inference, etc. is necessary. For future work, I 
      would make sure to do experiments with missing data and compare to EM. The 
      two methods are not guaranteed to converge to the same value of the 
      marginal likelihood, so comparing solution quality, and not just time, is 
      necessary. 
Minor points:
- What exactly are the semantics of your feasibility function g(y)? Are you always assuming it's a projection onto the feasible set? If not, g() isn't guaranteed to give you what you want. You first make it sound like a projection function onto a feasible set, but then you use it as an indicator function for the set.
- You should define somewhere early on what EDML stands for.
- Is this technique stable if exact inference in the Bayes net is intractable? Discussion or experimental evaluation in this setting would be useful.
- The update equation at the top of page 5 should have a number.
  ----- Post Author Response 
      -----  In their response, the authors address many of the key omissions 
      in their paper, such as the lack of a proof of convergence, by referring 
      us to their other recent papers on EDML. This isn't quite sufficient to 
      address the confusion that will arise for future readers of the paper. 
      Some things, such as a comparison to EM can be addressed by providing a 
      pointer to the earlier papers. However, the discussion of EDML as an 
      optimization technique needs to be able to stand on its own. It is not 
      sufficient to present an optimization algorithm and not prove that it 
      converges or characterize what sorts of points it will converge to. 
      
Overall, I am familiar with the modern literature on inference and learning in graphical models, and this paper is not of sufficient quality to be presented at NIPS. My concerns with this paper are mostly presentation related. For subsequent versions of the paper, I hope the authors will address the concerns that were raised. This is quite an important contribution, and I hope the authors will resubmit it (if it is rejected at NIPS). Q2: Please summarize your review in 1-2 sentences
I think the paper is potentially useful and a contribution to the community, since it puts forth a simpler framework for EDML (the previous works on EDML are quite dense). However, I was disappointed that it was presented in a confusing manner and that it didn't address a number of key issues such as convergence, relationship to EM, etc. Further, the contribution to parameter estimation for undirected models is quite incremental.
  
       Q1:Author 
      rebuttal: Please respond to any concerns raised in the reviews. There are 
      no constraints on how you want to argue your case, except for the fact 
      that your text should be limited to a maximum of 6000 characters. Note 
      however that reviewers and area chairs are very busy and may not read long 
      vague rebuttals. It is in your own interest to be concise and to the 
      point. 
      One goal of this work is to indeed provide a different 
      perspective on parameter estimation in probabilistic graphical models 
      (directed and undirected alike). Namely, we seek an understanding of 
      parameter estimation beyond what was provided by classical and 
      well-studied frameworks such as EM. Reviewer 7 clearly appreciates this 
      view and the insights it brings. Reviewer 8 recognizes the fertile ground 
      that this perspective may lay for future work. Reviewer 6 recognizes the 
      utility of this view in deriving new EDML algorithms for undirected 
      models. 
  Reviewer 8 asked for a comparison between EDML and EM. 
      Please note that previous work [5, 16] has already compared EDML to EM, 
      theoretically and empirically. Given the proposed pure optimization 
      perspective, we thought it was more appropriate to use the limited space 
      available to relate EDML to existing optimization methods (such as CG and 
      L-BFGS), under complete data. Our experiments also imply practical 
      benefits for the incomplete data case, when EDML is used in lieu of 
      gradient methods for the M-step of EM (comparisons with EM were also given 
      for directed models in [5, 16]). On the other hand, [5, 16] include 
      answers to several questions that Reviewer 8 raised, including discussions 
      about convergence. We will next clarify the main points raised by each 
      reviewer. 
  Reviewer 6  ********** 
  #1 (The UAI paper): 
      
  1a) EDML in the UAI paper is a "specific" optimization algorithm 
      for learning Bayesian networks (BNs) that invokes a lot of BN concepts. 
      
1b) Viewing EDML in the simple form given in Section 2 does not invoke any BN concepts, which allows EDML to be interpreted as a "general" optimization algorithm.
  #2 (Other comments): 
  2a) For MRFs, 
      EDML can take more than one iteration to converge under complete data 
      (Lines 301 and 302). 
  2b) The title emphasizes that EDML can be now 
      used for both directed and undirected models. We will think about a more 
      informative title. 
  Reviewer 7  ********** 
  #1 (Parameter 
      estimation in graphical models): 
  1a) The reviewer's note about the 
      big pro of the paper is insightful: showing that parameter estimation in 
      graphical models requires iteratively solving rather simple convex 
      optimization sub-problems. 
  1b) We emphasize too that the creation 
      of all these sub-problems requires performing inference "only once", at 
      every EDML iteration, which we think is interesting. 
#2 (EDML as a coordinate descent (CD) method):

2a) EDML as a CD method makes two choices: 1] optimization in different directions is done in parallel, i.e., the current parameters are used to generate all sub-problems; 2] every parameter set is considered a direction of optimization.

2b) Huang et al. (2010) proposed a unified framework. They discuss CD methods that use sequential and parallel updates; EDML falls in the latter class. We will clarify the connection in the paper.
  #3 (Other comments): 
      
  3a) The reviewer points to interesting related work that we will 
      definitely consider. We tried simple gradient descent (with and "without" 
      line search), but did not find it competitive on our benchmarks. There are 
      still many variations that are worth trying, as suggested by the reviewer. 
      
  Reviewer 8  ********** 
  #1 (EDML vs EM): 
  1a) A 
      theoretical and empirical comparison between EDML and EM was already done 
      in previous work [5, 16], and showed advantages for EDML. The goal of this 
      paper is not to re-compare EDML with EM, but to simplify and generalize 
      EDML. 
  #2 (Sec 3.4): 
  2a) This section does not find a 
      stationary point for the overall learning problem, but only finds the 
      unique optimum of each convex sub-problem. This is only a single step of 
      the EDML algorithm. At the beginning of Section 3.4, we mention explicitly 
      that we are solving Equation 3 (the sub-problem). 
  #3 (Convergence 
      and the convex sub-problems): 
  3a) The convergence analysis of the 
      update equation solving the small convex sub-problems was already given in 
      [16]: the update equation at the top of Page 5 is a simpler form of 
      (Equation 4 in [16]), which was proved to converge by (Theorem 2 in [16]). 
      
  3b) While (Equation 4 in [16]) is guaranteed to converge, the EDML 
      iterative procedure is not guaranteed to converge in general cases. Please 
      refer to (Section 6 in [5]) for cases where EDML is guaranteed to 
      converge. 
3c) Equation 2 is a simplified form of the convex sub-problems obtained in previous work [16], and therefore it is possible to use (Equation 4 in [16]) to optimize it, as proposed in Sec 3.4.
      
  We will clarify these explicitly in the paper. 
  #4 (Markov 
      Networks and Experiments): 
  4a) Beyond allowing us to derive EDML 
      for Markov networks, our new formulation further lends itself to the 
      derivation of EDML for new families of graphical models. 
  4b) In 
      our experiments, the partition function is still non-trivial to compute. 
      Thus, even with complete data, minimizing the number of evaluations of the 
      partition function or its gradient is quite important. 
  #5 (Section 
      2 and the Related Work): 
5a) Section 2 sheds light on a family of algorithms, emphasizing that the current feasible parameters are used to generate all the sub-problems. IPF does not fall directly into this family. We will clarify this difference in Section 2.
  #6 (Soft Evidence): 
      
  6a) There is a specific technical definition for the term soft 
      evidence we use (See Footnote 4). P(\eta_i|x) and P(X_i|x) are different; 
      details are given in [4,16] but we will clarify in this paper as well. 
      
  6b) sum_x is a summation over all the possible values of X. Note 
      that Equation 3 is for one sub-problem. Sub-problems become independent. 
      
  #7 (Other comments): 
  7a) In Section 2, the feasibility 
      function (g) takes an arbitrary point and returns a feasible point. 
      Theorem 5 in the appendix adds a constraint on g to get some favorable 
      behavior. We do not use g as an indicator function. 
  7b) We make no 
      claims yet about the algorithm's behavior with approximate 
inference. 