------------------------------------------------------
THE REVIEWS
------------------------------------------------------

Reviewer A:

CLARITY: For the reasonably well-prepared reader, is it clear what was done and why? Is the paper well-written and well-structured?: 4. Understandable by most readers.

INNOVATIVENESS: How original is the approach? Does this paper break new ground in topic, methodology, or content? How exciting and innovative is the research it describes? Note that a paper can score high for innovativeness even if its impact will be limited.: 3. Respectable: A nice research contribution that represents a notable extension of prior approaches or methodologies.

SOUNDNESS/CORRECTNESS: First, is the technical approach sound and well-chosen? Second, can one trust the claims of the paper -- are they supported by proper experiments and are the results of the experiments correctly interpreted?: 4. Generally solid work, although there are some aspects of the approach or evaluation I am not sure about.

RELATED WORK: Does the submission make clear where the presented system sits with respect to existing literature? Are the references adequate? Note that the existing literature includes preprints, but in the case of preprints: authors should be informed of but not penalized for missing very recent and/or not widely known work; if a refereed version exists, authors should cite it in addition to or instead of the preprint.: 5. Precise and complete comparison with related work. Benefits and limitations are fully described and supported.

SUBSTANCE: Does this paper have enough substance (in terms of the amount of work), or would it benefit from more ideas or analysis? Note that papers or preprints appearing less than three months before a paper is submitted to TACL are considered contemporaneous with the submission. This relieves authors from the obligation to make detailed comparisons that require additional experiments and/or in-depth analysis, although authors should still cite and discuss contemporaneous work to the degree feasible.: 4. Represents an appropriate amount of work for a publication in this journal. (most submissions)

IMPACT OF IDEAS OR RESULTS: How significant is the work described? If the ideas are novel, will they also be useful or inspirational? If the results are sound, are they also important? Does the paper bring new insights into the nature of the problem?: 3. Interesting but not too influential. The work will be cited, but mainly for comparison or as a source of minor contributions.

REPLICABILITY: Will members of the ACL community be able to reproduce or verify the results in this paper?: 4. They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.

IMPACT OF PROMISED SOFTWARE: If the authors state (in anonymous fashion) that their software will be available, what is the expected impact of the software package?: 1. No usable software released.

IMPACT OF PROMISED DATASET(S): If the authors state (in anonymous fashion) that datasets will be released, how valuable will they be to others?: 3. Potentially useful: Someone might find the new datasets useful for their work.

TACL-WORTHY AS IS? In answering, think over all your scores above. If a paper has some weaknesses, but you really got a lot out of it, feel free to recommend it. If a paper is solid but you could live without it, let us know that you're ambivalent. (Reviewers: after you save this review form, you'll have to make a confidential recommendation to the editors, via pull-down menu, as to what degree of revision would be needed to make the submission eventually TACL-worthy.): 4. Worthy: A good paper that is worthy of being published in TACL.

Detailed Comments for the Authors:

The paper presents an unsupervised system for extractive opinion summarization, inspired by vector-quantized variational autoencoders (VQ-VAEs). The proposed system combines a Transformer with the discretization bottleneck of VQ-VAE and is trained via sentence reconstruction.
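For readers less familiar with VQ-VAEs, the core of the discretization bottleneck is roughly the following (my own schematic NumPy illustration, not the authors' implementation; the codebook size and dimensionality are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    K, d = 8, 4                         # codebook size and embedding dim (arbitrary)
    codebook = rng.normal(size=(K, d))  # learned jointly with the model in practice

    def quantize(z):
        """Map a continuous sentence encoding z to its nearest codebook entry."""
        dists = np.linalg.norm(codebook - z, axis=1)  # distance to each code
        k = int(np.argmin(dists))                     # index of the nearest code
        return k, codebook[k]                         # discrete code and quantized vector

    z = rng.normal(size=d)   # stand-in for a Transformer sentence encoding
    k, z_q = quantize(z)     # z_q, not z, is what the decoder reconstructs from

In actual VQ-VAE training, gradients are passed through this non-differentiable step with a straight-through estimator; the sketch only shows the forward mapping.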
Overall, the paper is very well presented. The descriptions are generally clear, well illustrated, and easy to follow. The experiments are well thought out and the results seem promising. Aside from the proposed method, the paper also promises to release a new dataset containing 1,050 human-written general and aspect-specific summaries of 50 hotels, for the evaluation of opinion summarization systems.

The proposed model showed promising results in several respects (Table 4), but it remains unclear to me how the QT model inherently deals with redundancy. The sentence sampling process of Eq. (8) seems to have a significant impact on summarization performance (Table 2), yet its description is somewhat brief (Lines 394-417). It would be great if the authors could elaborate on it, providing more intuition and justification for the robustness of the sampling process.

Given the "clustering" nature of the proposed work, I would also encourage the authors to compare the proposed method with a clustering approach (or with graphical models) for extractive summarization.
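To make this suggestion concrete, the kind of baseline I have in mind is a simple centroid-based extractor along the following lines (a sketch with placeholder inputs; the sentence embedding method and the number of clusters would need to be chosen by the authors):

    import numpy as np
    from sklearn.cluster import KMeans

    def centroid_extract(sentence_vecs, sentences, n_clusters=5):
        """Cluster review sentences; extract the sentence nearest each centroid."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(sentence_vecs)
        summary = []
        for center in km.cluster_centers_:
            idx = int(np.argmin(np.linalg.norm(sentence_vecs - center, axis=1)))
            summary.append(sentences[idx])
        return summary

Even such a simple comparison would help isolate how much of QT's performance comes from the learned quantization rather than from clustering sentence representations per se.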
REVIEWER CONFIDENCE: 4. Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.

------------------------------------------------------

Reviewer B:

CLARITY: 4. Understandable by most readers.

INNOVATIVENESS: 3. Respectable: A nice research contribution that represents a notable extension of prior approaches or methodologies.

SOUNDNESS/CORRECTNESS: 3. Fairly reasonable work. The approach is not bad, and at least the main claims are probably correct, but I am not entirely ready to accept them (based on the material in the paper).

RELATED WORK: 4. Mostly solid bibliography and comparison, but there are a few additional references that should be included. Discussion of benefits and limitations is acceptable but not enlightening.

SUBSTANCE: 3. Leaves open one or two natural questions that should have been pursued within the paper.

IMPACT OF IDEAS OR RESULTS: 3. Interesting but not too influential. The work will be cited, but mainly for comparison or as a source of minor contributions.

REPLICABILITY: 4. They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.

IMPACT OF PROMISED SOFTWARE: 3. Potentially useful: Someone might find the new software useful for their work.

IMPACT OF PROMISED DATASET(S): 3. Potentially useful: Someone might find the new datasets useful for their work.

TACL-WORTHY AS IS?: 2. Leaning against: I'd rather not see it appear in TACL.

Detailed Comments for the Authors:

This paper produced a large opinion summarization dataset and also proposed a new method for generating an extractive opinion summary for each entity, and for selected aspects, from a set of reviews. However, I have some major concerns about the paper.

Regarding the manually created dataset for opinion summarization, I have the following questions:

1. In the first step of sentence selection, what were the annotators asked to focus on when selecting sentences: a summary of the opinions about the entity as a whole, or the main aspects? Were there any annotation instructions or guidelines?

2. The sentence selection must be highly subjective. What level of agreement did you obtain? If only those sentences with 4 votes are selected, it is possible that no sentences, or very few, are selected. The resulting sentences may then fail to represent a good summary of the reviews. This is problematic. Am I missing something? (See the back-of-envelope calculation after this list.)

3. Given this problem, how do you know that you are producing a good general summary of a set of reviews?

4. For the aspect summaries, again due to the above problem, many aspect-related sentences may be lost, making it impossible to produce a good aspect-specific summary.
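To illustrate why I worry about the 4-vote threshold (see question 2 above), here is a back-of-envelope calculation with assumed numbers -- say 5 independent annotators, each selecting a given sentence with probability 0.3:

    from math import comb

    n_annotators, threshold, p = 5, 4, 0.3  # assumed values, for illustration only
    # Probability that a sentence receives at least `threshold` votes when
    # annotators select sentences independently with probability p each.
    p_kept = sum(comb(n_annotators, k) * p**k * (1 - p)**(n_annotators - k)
                 for k in range(threshold, n_annotators + 1))
    print(f"P(>= {threshold} votes) = {p_kept:.3f}")  # ~0.031

Under these (admittedly crude) independence assumptions, only about 3% of candidate sentences survive, so unless annotator agreement is far above chance, very few sentences will clear the bar. Reporting the actual agreement statistics would settle this.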
Therefore, it is unclear whether this manually created dataset for opinion summarization is of good quality, which in turn affects the validity of the experimental results. If the sentences are not chosen properly, you may get wrong summaries. For example, for an aspect that most people are positive about, the selection/voting issue may make it appear that more people are negative about it. Aspect accuracy and the proportions of positive and negative polarity are important for opinion summarization; see Bing Liu's 2012 book, "Sentiment Analysis and Opinion Mining", on this issue.

I also have some concerns about the evaluation measures used in the experiments. I believe that for opinion summary evaluation it is not enough to simply use the measures for general text summarization. I understand that using ROUGE is common practice in text summarization evaluation, but for opinion summarization, getting the polarities right is of critical importance. Hence, in addition to the ROUGE measures, an evaluation of opinion polarity accuracy should be provided by comparing, for each aspect, the polarities in the gold data with those in the extracted sentences. It does not matter how good the ROUGE scores are; if the sentiment polarities are incorrect, the results are not acceptable.

The same goes for the user study. It is clearly possible to produce a set of accuracy results about the correctness of aspects and of their associated opinion polarities, e.g., the proportion of aspect matches and the proportion of polarity matches given the correct aspect. The criteria used in the paper are too subjective and should only be used when more objective measures are not feasible; for opinions, aspects and polarities are easily verified manually. For aspect-specific summarization, the accuracy of the polarity should also be provided.
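Concretely, the measures I am asking for could be computed along the following lines (a sketch; it assumes aspect and polarity labels are available for both gold and system summary sentences, and that the gold summary assigns one polarity per aspect):

    def opinion_accuracy(gold, system):
        """gold, system: lists of (aspect, polarity) pairs, one per summary sentence.

        Returns the proportion of system opinions whose aspect appears in the
        gold summary, and the polarity match rate among those aspect matches.
        """
        gold_polarity = dict(gold)  # assumes one polarity per aspect in the gold data
        aspect_hits = [(a, p) for a, p in system if a in gold_polarity]
        aspect_acc = len(aspect_hits) / len(system) if system else 0.0
        polarity_acc = (sum(1 for a, p in aspect_hits if gold_polarity[a] == p)
                        / len(aspect_hits)) if aspect_hits else 0.0
        return aspect_acc, polarity_acc

    # opinion_accuracy([("room", "+"), ("staff", "+")],
    #                  [("room", "+"), ("room", "-"), ("pool", "+")])
    # -> (0.667, 0.5): two of three system opinions hit gold aspects,
    #    and one of those two has the right polarity.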
You stated, "We present ROUGE-1/2/L F-scores on the general summarization portion of SPACE in Table 2." I do not see where the F-scores are in that table.

Regarding the implementation of the system, you did not mention any validation set. How did you select the parameters? You mentioned the development set, but what about the training set?

REVIEWER CONFIDENCE: 4. Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.

------------------------------------------------------

Reviewer C:

CLARITY: 4. Understandable by most readers.

INNOVATIVENESS: 4. Creative: An intriguing problem, technique, or approach that is substantially different from previous research.

SOUNDNESS/CORRECTNESS: 4. Generally solid work, although there are some aspects of the approach or evaluation I am not sure about.

RELATED WORK: 4. Mostly solid bibliography and comparison, but there are a few additional references that should be included. Discussion of benefits and limitations is acceptable but not enlightening.

SUBSTANCE: 3. Leaves open one or two natural questions that should have been pursued within the paper.

IMPACT OF IDEAS OR RESULTS: 4. Some of the ideas or results will substantially help other people's ongoing research.

REPLICABILITY: 3. They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.

IMPACT OF PROMISED SOFTWARE: 3. Potentially useful: Someone might find the new software useful for their work.

IMPACT OF PROMISED DATASET(S): 3. Potentially useful: Someone might find the new datasets useful for their work.

TACL-WORTHY AS IS?: 4. Worthy: A good paper that is worthy of being published in TACL.

Detailed Comments for the Authors:

This paper presented a novel opinion summarization method that uses the Quantized Transformer for aspect-based summarization.
In addition, the authors also provided SPACE, a new corpus, for their experiments. I think the contribution is sufficient for TACL.

In general, this is a very well-written paper with a clear and coherent presentation structure. The experimental results are solid, showing a comparison with strong baselines and related work. In my opinion, the method presented in the paper would work for aspect opinion extraction. The key difference in comparison with other work on aspect-based summarization is the use of the VQ-VAE method.

I have a small question regarding how the proposed method differs from the work described in [Roy and Grangier 2019]. The point is that extractive aspect-opinion summarization comes down to selecting important sentences, which can be framed as binary classification over sentences. Could you explain the main difference between your method and the "paraphrase identification" presented in [Roy and Grangier 2019]?

Aurko Roy and David Grangier. 2019. Unsupervised paraphrasing without translation. ACL 2019.

Second, you claimed that your proposed method can easily accommodate a large number of input reviews. Do you have any evidence for this, or could you at least add a discussion of this point?

Third, it is good that the authors intend to share their code and data so that their work can be verified. However, I could not currently open the link indicated in the paper.

Overall, this paper can be accepted to TACL if the authors address my question regarding the novelty and the minor points regarding their claims. In addition, if the paper is accepted, the authors should share their code and data.

REVIEWER CONFIDENCE: 4. Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.