What the ACL-2014 review scores mean

I’ve had several people ask me what the numbers in ACL reviews mean, and I can’t find anywhere online where they’re described. (If they are documented somewhere, can anyone point me to it?)

So here’s the review form, below. Nearly all of the scores go from 1 to 5, with 5 the best (the best-paper recommendation at the end goes from 1 to 3). I think the review emails to authors only include a subset of the fields below. For example, I think “Overall Recommendation” is not included?

The CFP said that they have different types of review forms for different types of papers. I think this one is for a standard full paper. I guess what people really want to know is what scores tend to correspond to acceptances. I really have no idea and I get the impression this can change year to year. I have no involvement with the ACL conference besides being one of many, many reviewers.

APPROPRIATENESS (1-5)
Does the paper fit in ACL 2014? (Please answer this question in light of the desire to broaden the scope of the research areas represented at ACL.)

5: Certainly.
4: Probably.
3: Unsure.
2: Probably not.
1: Certainly not.

CLARITY (1-5)
For the reasonably well-prepared reader, is it clear what was done and why? Is the paper well-written and well-structured?

5 = Very clear.
4 = Understandable by most readers.
3 = Mostly understandable to me with some effort.
2 = Important questions were hard to resolve even with effort.
1 = Much of the paper is confusing.

ORIGINALITY (1-5)
Is there novelty in the developed application or tool? Does it address a new problem or one that has received little attention? Alternatively, does it present a system that has significant benefits over other systems, either in terms of its usability, coverage, or success?

5 = Surprising: Significant new problem, or a major advance over other applications or tools that attack this problem.
4 = Noteworthy: An interesting new problem, with clear benefits over other applications or tools that attack this problem.
3 = Respectable: A nice research contribution that represents a notable extension of prior approaches.
2 = Marginal: Minor improvements on existing applications or tools in this area.
1 = The system does not represent any advance in the area of natural language processing.

IMPLEMENTATION AND SOUNDNESS (1-5)
Has the application or tool been fully implemented or do certain parts of the system remain to be implemented? Does it achieve its claims? Is enough detail provided that one might be able to replicate the application or tool with some effort? Are working examples provided and do they adequately illustrate the claims made?

5 = The application or tool is fully implemented, and the claims are convincingly supported. Other researchers should be able to replicate the work.
4 = Generally solid work, although there are some aspects of the application or tool that still need work, and/or some claims that should be better illustrated and supported.
3 = Fairly reasonable work. The main claims are illustrated to some extent with examples, but I am not entirely ready to accept that the application or tool can do everything that it should (based on the material in the paper).
2 = Troublesome. There are some aspects that might be good, but the application or tool has several deficiencies and/or limitations that make it premature.
1 = Fatally flawed.

SUBSTANCE (1-5)
Does this paper have enough substance, or would it benefit from more ideas or results?
Note that this question mainly concerns the amount of work; its quality is evaluated in other categories.

5 = Contains more ideas or results than most publications in this conference; goes the extra mile.
4 = Represents an appropriate amount of work for a publication in this conference. (most submissions)
3 = Leaves open one or two natural questions that should have been pursued within the paper.
2 = Work in progress. There are enough good ideas, but perhaps not enough in terms of outcome.
1 = Seems thin. Not enough ideas here for a full-length paper.

EVALUATION (1-5)
To what extent has the application or tool been tested and evaluated? Have there been any user studies?

5 = The application or tool has been thoroughly tested. Rigorous evaluation on a large corpus or via formal user studies support the claims made for the system. Critical analysis of the results yields many insights into the limitations (if any).
4 = The application or tool has been tested and evaluated on a reasonable corpus or with a small set of users. The results support the claims made. Critical analysis of the results yields some insights into the limitations (if any).
3 = The application or tool has been tested and evaluated to a limited extent. The results have been critically analyzed to gain insight into the system's performance.
2 = A few test cases have been run on the application or tool but no significant evaluation or user study has been performed.
1 = The application or tool has not been tested or evaluated.

MEANINGFUL COMPARISON (1-5)
Do the authors make clear where the presented system sits with respect to existing literature? Are the references adequate? Are the benefits of the system/application well-supported and are the limitations identified?

5 = Precise and complete comparison with related work. Benefits and limitations are fully described and supported.
4 = Mostly solid bibliography and comparison, but there are a few additional references that should be included. Discussion of benefits and limitations is acceptable but not enlightening.
3 = Bibliography and comparison are somewhat helpful, but it could be hard for a reader to determine exactly how this work relates to previous work or what its benefits and limitations are.
2 = Only partial awareness and understanding of related work, or a flawed comparison or deficient comparison with other work.
1 = Little awareness of related work, or insufficient justification of benefits and discussion of limitations.

IMPACT OF IDEAS OR RESULTS (1-5)
How significant is the work described? Will novel aspects of the system result in other researchers adopting the approach in their own work? Does the system represent a significant and important advance in implemented and tested human language technology?

5 = A major advance in the state-of-the-art in human language technology that will have a major impact on the field.
4 = Some important advances over previous systems, and likely to impact development work of other research groups.
3 = Interesting but not too influential. The work will be cited, but mainly for comparison or as a source of minor contributions.
2 = Marginally interesting. May or may not be cited.
1 = Will have no impact on the field.

IMPACT OF ACCOMPANYING SOFTWARE (1-5)
If software was submitted or released along with the paper, what is the expected impact of the software package? Will this software be valuable to others? Does it fill an unmet need? Is it at least sufficient to replicate or better understand the research in the paper?

5 = Enabling: The newly released software should affect other people's choice of research or development projects to undertake.
4 = Useful: I would recommend the new software to other researchers or developers for their ongoing work.
3 = Potentially useful: Someone might find the new software useful for their work.
2 = Documentary: The new software is useful to study or replicate the reported research, although for other purposes it may have limited interest or limited usability. (Still a positive rating)
1 = No usable software released.

IMPACT OF ACCOMPANYING DATASET (1-5)
If a dataset was submitted or released along with the paper, what is the expected impact of the dataset? Will this dataset be valuable to others in the form in which it is released? Does it fill an unmet need?

5 = Enabling: The newly released datasets should affect other people's choice of research or development projects to undertake.
4 = Useful: I would recommend the new datasets to other researchers or developers for their ongoing work.
3 = Potentially useful: Someone might find the new datasets useful for their work.
2 = Documentary: The new datasets are useful to study or replicate the reported research, although for other purposes they may have limited interest or limited usability. (Still a positive rating)
1 = No usable datasets submitted.

RECOMMENDATION (1-5)
There are many good submissions competing for slots at ACL 2014; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented?

In deciding on your ultimate recommendation, please think over all your scores above. But remember that no paper is perfect, and remember that we want a conference full of interesting, diverse, and timely work. If a paper has some weaknesses, but you really got a lot out of it, feel free to fight for it. If a paper is solid but you could live without it, let us know that you're ambivalent. Remember also that the authors have a few weeks to address reviewer comments before the camera-ready deadline.

Should the paper be accepted or rejected?

5 = This paper changed my thinking on this topic and I'd fight to get it accepted.
4 = I learned a lot from this paper and would like to see it accepted.
3 = Borderline: I'm ambivalent about this one.
2 = Leaning against: I'd rather not see it in the conference.
1 = Poor: I'd fight to have it rejected.

REVIEWER CONFIDENCE (1-5)
5 = Positive that my evaluation is correct. I read the paper very carefully and am familiar with related work.
4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math, experimental design, or novelty.
2 = Willing to defend my evaluation, but it is fairly likely that I missed some details, didn't understand some central points, or can't be sure about the novelty of the work.
1 = Not my area, or paper is very hard to understand. My evaluation is just an educated guess.

PRESENTATION FORMAT
Papers at ACL 2014 can be presented either as poster or as oral presentations. If this paper were accepted, which form of presentation would you find more appropriate?
Note that the decisions as to which papers will be presented orally and which as poster presentations will be based on the nature rather than on the quality of the work. There will be no distinction in the proceedings between papers presented orally and those presented as poster presentations.

RECOMMENDATION FOR BEST LONG PAPER AWARD (1-3)
3 = Definitely.
2 = Maybe.
1 = Definitely not.

2 Responses to What the ACL-2014 review scores mean

Crosner says:

March 21, 2014 at 2:28 am

Thanks for your sharing.
This is quite useful to have a better understanding of ACL reviewer’s score.

an area chair says:

April 28, 2014 at 12:38 am

“I guess what people really want to know is what scores tend to correspond to acceptances.”

In deciding which papers to recommend for acceptance, I looked at the following things: recommendation score, confidence score, text of the review, text of the author response, discussion between reviewers, who the reviewer is and what I think they understood about the paper, and the paper itself — I read or at least skimmed the paper in all of the borderline cases.

Things I did not look at: any of the other numerical scores. But having reviewers fill out these scores might still help the reviewing process, by forcing reviewers to think about the criteria that the program chairs feel are important.

I don’t know how the program chairs made the final decisions, but area chairs were given a lot of leeway in making recommendations. One piece of advice that we got, which I agree with, is to favor papers that elicited strong opinions (as long as at least one opinion was positive) over papers that all reviewers were tepid about. The logic is that if 20% of the ACL audience is excited about a paper and 40% hate it, that’s better than a paper that 100% of the audience is apathetic about.
