NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 6489
Title: Pareto Multi-Task Learning

Reviewer 1

- This paper investigates a better strategy for producing a wide spread of trade-offs among multiple tasks in the multi-task learning setup (as shown in Figure 2) by introducing preference vectors. Based on multi-objective optimization with these vectors, the proposed method can achieve distributed Pareto solutions. It appears to be efficient compared to existing MTL competitors.
- However, I see the contributions of this paper as incremental and limited, as the goal is not that significant and the approach is mostly based on existing strategies such as [12, 24, 30]. In addition, the approach appears to be based on a single shared architecture assuming the tasks are correlated; if the tasks are less relevant to each other, I wonder how they should be handled when generating Pareto solutions.
- The paper is well written and easy to follow, but a few sentences are unclear. For example, in lines 87-88, why can existing work not efficiently incorporate preference information? And in lines 173-175, why is the approach in [30] inefficient, and why can the sequential gradient-based method overcome this inefficiency?
- In Algorithm 1, how are the vectors u_i evenly distributed? Is this done manually or by some rule? (See the sketch after this review.)
- Even though the proposed approach aims at generating widely distributed Pareto solutions (rather than at accuracy itself), the experimental results show that the method also performs better than the other competitors. Is there any analysis of this?
- The first experiment is based on LeNet, which is quite an old network and so may not show the potential of the compared approaches for this problem. Reporting results using more advanced networks would be better.

---- The authors addressed most of my concerns, even though they did not provide actual results for my last concern. These should be given in the revised manuscript. In light of the other reviewers' comments, I will stick to my current rating.
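For reference, one rule-based way to distribute the vectors u_i evenly for two tasks is to space unit vectors at equal angles across the positive quadrant of the objective space. A minimal sketch in Python, assuming this angular construction (the paper's exact rule may differ):

    import numpy as np

    def evenly_spaced_preference_vectors(K):
        # K unit vectors at equal angles across the positive quadrant
        # of a two-task objective space; a rule-based construction,
        # assumed here for illustration rather than taken from the paper.
        angles = np.linspace(0.0, np.pi / 2.0, K)
        return np.stack([np.cos(angles), np.sin(angles)], axis=1)

    # Five preference vectors from (1, 0) to (0, 1), evenly spread.
    print(evenly_spaced_preference_vectors(5))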

Reviewer 2

Edit: I have read the author response and the other reviews. My score remains the same.

The paper proposes to frame multi-task learning as multi-objective optimization in the line of Sener and Koltun (NIPS 2018). Importantly, the proposed approach does not only find a single solution on the Pareto frontier, but a diverse set of solutions with different trade-offs. This is achieved by decomposing the MTL problem into K subproblems, each of which is defined by a unit preference vector and constrained to lie in a subregion of the objective space (a sketch of this subregion assignment follows this review).

Overall, I found the approach well motivated and the paper well written. I am not aware of prior work that enables the user to trade off the performance of different tasks in multi-task learning, and I believe that this may be of practical impact. It is also interesting to see that Pareto MTL is able to find a good solution if only strong performance on one of the tasks is desired. I also appreciated the comparison to adaptive weight loss approaches, which should enable different perspectives on multi-objective approaches to MTL. I particularly enjoyed the extensive supplementary material, including the analysis of the importance of finding an initial solution, the analysis of the adaptive weight vectors, and the extension to many tasks.

There are a few typos in the paper: "neural" -> "natural" (line 27); "depended" -> "dependent" (line 111); "significant" -> "significantly" (line 213); "MultiFashion-MINST" -> "MultiFashionMNIST" (line 245).
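For concreteness, the subregion constraint described above can be read as: a loss vector belongs to the subregion of the preference vector with which it has the largest inner product. A minimal sketch, assuming exactly this assignment rule:

    import numpy as np

    def subregion_index(losses, prefs):
        # Index k of the subregion Omega_k = {v : u_k . v >= u_j . v for all j}
        # containing the loss vector, i.e. the preference vector with the
        # largest inner product with the current losses.
        return int(np.argmax(prefs @ losses))

    prefs = np.array([[1.0, 0.0], [0.7071, 0.7071], [0.0, 1.0]])
    print(subregion_index(np.array([0.2, 0.8]), prefs))  # -> 2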

Reviewer 3

Originality: medium.
This paper mainly combines another MOO algorithm with MOO-MTL and improves on the results of last year's NIPS paper on multi-objective MTL. The technical contribution to MOO and MTL is limited, since the paper borrows the MOO optimization method directly from references [24] and [29]. Nevertheless, I think this paper has a potential impact in the MTL community, since I am not aware of any previous paper that achieves a similar effect, namely guiding people to different high-quality MTL solutions without random trials.

Quality: below average.
The quality of the paper is below the bar. Some important parts are missing, and the analysis lacks depth.
1. The motivation is not clear. The authors fail to explain why multi-objective MTL can only find concentrated solutions. The only explanation of this is Fig. 2, which is very empirical. For Fig. 2, the authors may consider explaining it more in the paper instead of in the supplementary materials (what are the x-axis and y-axis?). Also, for linear scalarization, why are the solutions with different weights also concentrated? The authors only cite Boyd's optimization book; they should consider adding the chapter and pages, or give the explanation directly.
2. How should the preference vectors u be chosen? The authors mention this briefly in the algorithm and the supplementary materials, but not in enough detail. My concern is the sensitivity of the preference vectors: what is the relationship between the vectors u and the final distribution of the MTL solutions? Also, if we have many tasks instead of two or three, how should the vectors u be chosen?
3. The experiments are performed on only two and three tasks. What is the algorithm's complexity with respect to the number of tasks? The set of evenly distributed vectors u grows large in the many-task scenario (see the sketch after this review), and there is no empirical or theoretical support that the method will work in the many-task case.

Clarity: below average.
The writing has a lot of room for improvement.
1. For equation 3, the authors only give a reference, but this is a very important part of the paper and should be introduced in more detail. The authors should at least add one or two sentences explaining the intuition, to make the logic flow clearer.
2. For equations 5-7, the authors should make the optimized variable explicit. For every objective, the subscript should be added below the max and min notation. Also, Eq. 16 does not have the subscript for max.
3. Check the subscripts more carefully. For example, in Eq. 15, I guess the LHS should not have the subscript i? A similar situation occurs in Eq. 18.

Significance: above average.
The results and the improvement over MOO-MTL are significant and may have an important impact, especially in the MTL community. The main reason I rate the significance only above average is that the writing is not clear and the logic flow is not rigorous. The supplementary materials do not have theoretical support, and the empirical support is not strong enough to convince me that the method is robust and applicable in more complicated cases.
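To make the scalability concern in point 3 of the Quality section concrete: if the preference vectors are taken from a simplex lattice with H divisions per objective (the Das-Dennis construction common in decomposition-based multi-objective optimization; assumed here for illustration, since the paper's exact rule may differ), their number grows combinatorially with the number of tasks m. A minimal sketch:

    from math import comb

    def num_simplex_lattice_vectors(m, H):
        # Number of evenly distributed weight vectors for m tasks with
        # H divisions per objective under the simplex-lattice rule.
        return comb(H + m - 1, m - 1)

    for m in (2, 3, 5, 10):
        print(m, num_simplex_lattice_vectors(m, 5))
    # 2 -> 6, 3 -> 21, 5 -> 126, 10 -> 2002: the grid size explodes
    # as the number of tasks grows.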