MetaReview Comments:

The paper is clearly written and easy to follow. The overall method looks interesting. The experiments are extensive and sound. The reviewers consistently gave the paper a high rating.

Review #1

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?

This paper investigates how and where user and product knowledge should be incorporated into a neural model to improve sentiment classification performance. The authors argue and show that the widely accepted method of injecting such knowledge is not effective. Instead of incorporating this information as a bias term, they propose to incorporate it through the weight matrices, and they investigate its usefulness at different injection locations. To reduce the parameter size, they introduce a chunk-wise importance-matrix representation, which makes the model easier to train.

Strengths:
- This is the first paper to investigate how to better incorporate attribute information. The questions the authors raise are interesting and meaningful, and the investigation is comprehensive.
- The paper is well written and does a good job of explaining and justifying the authors' motivations under different settings.
- The experimental results are generally convincing.

Weaknesses:
- It would be more convincing to also show results for the bias-based attribute representation combined with other components. For example, if bias-based attributes are injected into the word embeddings, the hidden representations generated by the BiLSTM should still be able to capture the attribute-biased sentiment of words according to their neighbours.
- It would be better to add more case studies for each type of injection.

Reasons to accept
- Interesting question and comprehensive investigation.
- Clearly written paper.
- Thorough evaluation and good results.

Reasons to reject
I think there is no reason to reject, other than lack of space for better papers.

Overall Recommendation: 4

Questions for the Author(s)
Did you try the different injection methods on other tasks that also incorporate attribute information, and how were the results?
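For concreteness, the chunk-wise weight gating that the reviews discuss can be sketched roughly as follows: an attribute embedding is projected to a small chunk-level importance matrix, tiled up to the shape of a weight matrix, and multiplied into that matrix before the usual transform. The names, shapes, and the tanh/tiling choices below are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

# Sketch: chunk-wise gating of a weight matrix W by an attribute embedding a.
d_in, d_out = 300, 256          # shape of the gated weight matrix W (assumed)
c_in, c_out = 15, 16            # chunk sizes, assumed to divide d_in and d_out
d_attr = 64                     # user/product embedding size (assumed)

rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))   # the original transformation weights
a = rng.normal(size=(d_attr,))       # attribute embedding
x = rng.normal(size=(d_in,))         # input representation

# Project the attribute to a small chunk-level importance matrix C of shape
# (d_in / c_in, d_out / c_out) -- far fewer values than d_in * d_out.
P = rng.normal(size=(d_attr, (d_in // c_in) * (d_out // c_out)))
C = np.tanh(a @ P).reshape(d_in // c_in, d_out // c_out)

# Tile each chunk-level value over a c_in x c_out block so it matches W,
# then gate W elementwise before the usual transform.
G = np.repeat(np.repeat(C, c_in, axis=0), c_out, axis=1)
h = x @ (W * G)                      # instead of x @ W + attribute_bias
```

Under these assumptions, each attribute projection needs only (d_in/c_in) x (d_out/c_out) gate values rather than d_in x d_out, which is presumably the parameter saving the reviews refer to.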
Review #2

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?

This paper investigates ways of incorporating attribute information (users and products) into text-document models to solve tasks such as sentiment classification. It is a good paper: well written, and it addresses general issues.

Strong points:
- extensive observation and comparison of attribute injection methods: how to mix the attribute information with the text representation, and where to inject the encoded vectors
- the interesting suggestion that attention is not the best place to inject attribute information, despite the fact that the majority of existing methods do so
- relatively good results for the proposed chunk-wise matrix
- experiments on other tasks: category classification and headline generation

Weak points:
- lack of qualitative discussion
- lack of examples in the additional tasks: we cannot see how meaningful the results are

Reasons to accept
This paper extensively examines attribute injection methods, comparing three ways of encoding the attributes (bias-based, matrix-based, and the CHIM-based method the authors propose) and four points at which to inject the representation: the attention mechanism, the word embeddings, the sentence encoder, and the final classifier. It is quite interesting to learn that attention is not the best place to inject attribute information, despite the fact that the majority of existing methods do so. The proposed CHIM-based method outperforms the approaches in prior work, and its advantage is also shown on other tasks (category classification and headline generation).

Reasons to reject
The issues to be solved are clearly laid out in Section 2, but qualitative discussion is lacking. Does the proposed method actually address those issues -- such as cake/sweet, tasty/delicious, sweet cake/drink, and user-based positive-negative bias? The attribute-transfer experiments are interesting and the proposed method performs well, but only numbers are shown, so we cannot see how meaningful the outputs are. It would be better to see a comparison with other methods on some output examples.

Overall Recommendation: 4

Review #3

What is this paper about, what contributions does it make, and what are the main strengths and weaknesses?

This paper presents a new method for incorporating attributes into modern neural architectures. It argues that biasing the attention weights is not a good idea and proposes "matrix-based" and "chunk-based" methods. Attribute embeddings appear to be projected to learn "gating" functions that are applied to other transformations in the network. Overall, I think the approach is quite interesting, albeit funky. It is quite unclear why the authors choose to apply gating to the weight matrices instead of the vector representations, considering that this requires some rather strange reshaping to work. Nevertheless, the empirical evidence is solid. In particular, showing that biasing the attention is suboptimal, together with some evidence contradicting prior work, may also be a valuable insight that this paper presents.

Strengths
- The paper is easy to read and follow.
- The overall method is interesting, although weakly motivated.
- The empirical evidence seems clear.
- A comprehensive evaluation of "attribute injection" methods.

Weaknesses
- The injection method feels a little strange, and not much explanation is given for "gating" the weight matrix rather than the actual output representation. There is no evaluation of representation gating versus parameter gating.

Reasons to accept
I think this is an interesting contribution. It presents clear empirical results on different ways of injecting attributes, and it should be useful to many practitioners and researchers.

Reasons to reject
The method is a little ad hoc. There is really no good explanation or motivation for why chunking helps. Also, there is no empirical comparison against other, possibly simpler, variants of attribute injection. (Although the evaluation is already relatively comprehensive.)

Overall Recommendation: 4

Questions for the Author(s)
It would be great to have better insight into why chunking helps. Are there hyperparameters for chunking (e.g., the chunk size), and what is their impact on performance? Why gate the parameters rather than the representations?
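To make the distinction raised in Review #3 concrete, here is a minimal sketch contrasting representation gating with parameter gating. Everything in it (names, shapes, the sigmoid gate) is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d, d_attr = 8, 4
W = rng.normal(size=(d, d))        # some transformation in the network
x = rng.normal(size=(d,))          # input representation
a = rng.normal(size=(d_attr,))     # attribute (user/product) embedding

# Representation gating: the attribute scales the *output vector*,
# so it can only re-weight whole output dimensions (d gate values).
P_r = rng.normal(size=(d_attr, d))
h_repr = sigmoid(a @ P_r) * (x @ W)

# Parameter gating (the reviewed paper's choice): the attribute scales
# the *weight matrix itself*, so it can modulate every input-output
# interaction (d * d gate values). The full projection P_w is large,
# which is presumably what motivates the chunk-wise approximation.
P_w = rng.normal(size=(d_attr, d * d))
G = sigmoid(a @ P_w).reshape(d, d)
h_param = x @ (W * G)
```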