ChatGPT is transforming peer review – how can we use it responsibly?

Since the artificial intelligence (AI) chatbot ChatGPT was launched in late 2022, computer scientists have noticed a worrying trend: chatbots are increasingly being used to review research papers that end up in the proceedings of major conferences.

There are several telltale signs. Reviews written by AI tools stand out because of their formal tone and verbosity, traits commonly associated with the writing style of large language models (LLMs). For example, words such as 'commendable' and 'meticulous' are now ten times more common in peer reviews than they were before 2022. AI-generated reviews also tend to be superficial and generic, often failing to mention specific sections of the submitted work or to cite references.
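
As a toy illustration of the kind of word-frequency signal described above, the Python sketch below compares how often such flagged adjectives appear in two sets of reviews. It is not the analysis pipeline from our study; the word list and the example reviews are purely illustrative.

    # Sketch: compare how often 'flagged' adjectives appear in reviews
    # written before and after ChatGPT's release. The word list is illustrative.
    import re

    FLAGGED_WORDS = {"commendable", "meticulous"}  # words highlighted in the text above

    def flagged_rate(reviews):
        """Occurrences of flagged words per 1,000 tokens across a corpus of reviews."""
        flagged = total = 0
        for text in reviews:
            tokens = re.findall(r"[a-z]+", text.lower())
            total += len(tokens)
            flagged += sum(1 for t in tokens if t in FLAGGED_WORDS)
        return 1000 * flagged / max(total, 1)

    # Hypothetical corpora; in practice these would be thousands of reviews.
    pre_2022 = ["The evaluation is limited and the baselines are weak."]
    post_2022 = ["This commendable paper offers a meticulous analysis."]

    print(f"pre-2022:  {flagged_rate(pre_2022):.2f} flagged words per 1,000 tokens")
    print(f"post-2022: {flagged_rate(post_2022):.2f} flagged words per 1,000 tokens")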

That’s what my colleagues at Stanford University in California and I found when we examined about 50,000 peer reviews of computer-science articles published in conference proceedings in 2023 and 2024. We estimate that 7–17% of the sentences in the reviews were written by LLMs, on the basis of the writing style and the frequency with which certain words appear (W. Liang et al. Proc. 41st Int. Conf. Mach. Learn. 235, 29575–29620; 2024).
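
As a rough, hypothetical illustration of how such a fraction might be estimated from writing-style signals (this is not our published estimator), one can model each sentence as drawn from an AI-written distribution with probability alpha, or from a human-written one otherwise, and choose the alpha that best explains the corpus:

    # Sketch: maximum-likelihood estimate of the mixing fraction alpha in a
    # two-component mixture. The per-sentence likelihoods are toy numbers;
    # in practice they would come from models of AI and human writing style.
    import numpy as np

    def estimate_alpha(p_ai, p_human):
        grid = np.linspace(0.0, 1.0, 1001)
        log_lik = [np.sum(np.log(a * p_ai + (1 - a) * p_human + 1e-12)) for a in grid]
        return float(grid[int(np.argmax(log_lik))])

    p_ai = np.array([0.9, 0.1, 0.2, 0.8, 0.1, 0.15])     # likelihood under 'AI-written'
    p_human = np.array([0.2, 0.8, 0.9, 0.3, 0.7, 0.85])  # likelihood under 'human-written'
    print(f"Estimated fraction of AI-generated sentences: {estimate_alpha(p_ai, p_human):.2f}")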

Lack of time could be one reason for using LLMs to write peer reviews: we found that the proportion of LLM-generated text is higher in reviews submitted close to the deadline. This trend is likely to intensify, because publishers are already struggling to secure timely reviews and reviewers are overwhelmed with requests.

Fortunately, AI systems can help to solve the problem that they have created, provided that the use of LLMs is limited to specific tasks: correcting language and grammar, answering simple questions about the manuscript and identifying relevant information, for example. If used irresponsibly, however, LLMs risk undermining the integrity of the scientific process. It is therefore both crucial and urgent that the scientific community establishes norms for how to use these models responsibly in academic peer review.

First, it is essential to recognize that the current generation of LLMs cannot replace expert human reviewers. Despite their capabilities, LLMs cannot demonstrate in-depth scientific reasoning, and they sometimes generate nonsensical responses, known as hallucinations. A common complaint from researchers who received LLM-written reviews of their manuscripts was that the feedback lacked technical depth, particularly in terms of methodological critique (W. Liang et al. NEJM AI 1, AIoa2400196; 2024). LLMs can also easily overlook mistakes in a research paper.

With these caveats in mind, careful design and guardrails are needed when deploying LLMs. For reviewers, an AI chatbot assistant could give feedback on how to make vague suggestions more actionable for authors before the review is submitted. It could also highlight sections of the paper, potentially missed by the reviewer, that already address the questions raised in the review.

To assist editors, LLMs could retrieve and summarize related papers to help them put the submission in context, and could check compliance with submission checklists (for example, to ensure that statistics are reported correctly). These are relatively low-risk applications that could save reviewers and editors time, if implemented well.
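
As a sketch of what one such low-risk application might look like, the snippet below drafts a checklist-checking prompt for an editor. The helper call_llm is a placeholder for whichever model API an editorial system uses; the checklist items and prompt wording are illustrative rather than an established standard, and the editor would still need to verify every answer.

    # Sketch: a checklist-compliance assistant for editors. 'call_llm' is a
    # placeholder for an LLM API; checklist items and wording are illustrative.
    CHECKLIST = [
        "Are statistical tests named, with exact P values or confidence intervals?",
        "Are sample sizes stated for every experiment?",
        "Is code or data availability addressed?",
    ]

    def build_checklist_prompt(manuscript_text):
        items = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(CHECKLIST))
        return (
            "You are assisting a journal editor. For each checklist item, answer "
            "'yes', 'no' or 'unclear', quoting the relevant passage if there is one.\n\n"
            f"Checklist:\n{items}\n\nManuscript:\n{manuscript_text}"
        )

    def check_submission(manuscript_text, call_llm):
        # The output is a starting point for the editor, not a final verdict.
        return call_llm(build_checklist_prompt(manuscript_text))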

LLMs can make mistakes, however, even when performing low-risk information-retrieval and summarization tasks. Their outputs should therefore be treated as a starting point, not as the final answer, and users should always check the results.

Journals and conferences might be tempted to use AI algorithms to detect LLM use in peer reviews and papers, but the effectiveness of such tools is limited. Although these detectors can pick up obvious cases of AI-generated text, they are prone to producing false positives, for example by flagging text written by scientists whose first language is not English as AI-generated. Users can also evade detection by prompting the LLM strategically. And detectors often struggle to distinguish reasonable uses of an LLM, such as polishing rough text, from inappropriate ones, such as using a chatbot to write an entire report.

Ultimately, the best way to prevent AI from dominating peer review may be to promote more human interaction during the process. Platforms like OpenReview encourage reviewers and authors to have anonymized interactions, resolving questions through multiple rounds of discussion. OpenReview is now used by several major computer science conferences and journals.

The high tide of LLM use in academic writing and peer review cannot be stopped. To navigate this transformation, journals and conference venues should establish clear guidelines and implement systems to enforce them. At a minimum, journals should require reviewers to transparently disclose whether and how they use LLMs during the review process. We also need innovative, interactive, AI-age peer review platforms that can automatically constrain the use of LLMs to a limited set of tasks. In parallel, we need much more research on how AI can responsibly help with specific peer review tasks. Establishing community norms and resources will help ensure that LLMs benefit reviewers, editors, and authors without compromising the integrity of the scientific process.

Competing interests

The author declares no competing interests.