AI in Peer Review—What Works and What Doesn’t? An Interview with Dr Marie Soulière
Artificial intelligence (AI) tools like GPT-3 have already revolutionized content creation, impacting how researchers create scientific content. Will AI transform peer review too? In the video below on the use of AI for manuscript evaluation, Shivanee Shah, Head of Content, Impact Science, chats with Dr Marie Soulière, Senior Publishing Manager, Frontiers. Dr Soulière leads strategic publishing projects in open-access publishing, with a specific focus on research integrity and quality peer review, balanced with operational efficiency and automation. She was heavily involved in developing Frontiers’ artificial intelligence review assistant (AIRA). Today, she and Shivanee talk about the successes, limitations, and ethical considerations surrounding the use of AI-based tools in peer review. Gain valuable insights into the practical applications, challenges, and future prospects of AI technology in shaping the peer review landscape.
Watch the video
Hi, I’m Shivanee Shah, head of Content at Impact Science.
With the increasing number of papers being published comes a challenge for publishers: scaling up their peer review processes while maintaining quality and integrity. With AI now at the centre of so many different processes and fields, one way to make this process a little more efficient would be to introduce AI.
Today, as part of Peer Review Week, we have Dr Marie Soulière to talk about the latest in AI and peer review. Marie has been heavily involved in developing Frontiers’ Artificial Intelligence Review Assistant, AIRA, and we are excited to hear her views on the intersection of AI and peer review.
All right, so thank you so much for being with us today, Marie. We’re really happy to have you here. Perhaps the first thing we should talk about is your motivation for exploring the intersection of AI and peer review.
What challenges did you see in the traditional peer review process that led you to consider AI-based solutions?
Thanks, Shivanee. It’s really a pleasure to be here. I think the first time we properly started exploring AI for peer review was back in 2017.
I remember I was sitting in a room with one of the founders of Frontiers and the now Chief Technology Officer before another meeting, and we were talking about how we as publishers, and our editors and reviewers, really needed help with peer review. Frontiers has always been an innovation company. At that time, with the exponential rise of research and research papers, as you mentioned, reviewers were more and more in demand, and some were performing really sparse reviews, so we wanted to make sure peer review would improve manuscripts to the best that they could be.
That’s always been one of the big aims of Frontiers with our unique collaborative peer review process. So, we needed to ensure peer review was strong and also that we provided reviewers in the community with support. We felt like we could help them by telling them maybe that the citations weren’t the best in the paper, that the images had issues, or that the statistical power wasn’t correct, for example.
And we were also getting feedback from editors that they really didn’t want to look at papers where they felt the language level wasn’t up to par. And, so, we wanted to help with all of these aspects.
We already knew, from several years of using the iThenticate software to look for language similarity, that with scale comes a different level of challenge that can only, or mostly, be handled by workflows and algorithms. From there, my teams working on peer review and research integrity at the time and I created a pretty long wish list of quality and peer review aspects, considered with the editors as well, that needed to be supported by automation and artificial intelligence. We called it Reviewer X. At the time, it was pretty funny.
We thought there would be a specific tab in the forum called Reviewer X instead of Reviewer 1, 2, or 3. And eventually, it became our Artificial Intelligence Review Assistant, AIRA. And I still have folders on my laptop that are called Reviewer X. I still cherish that name a little.
And, so initially, all of this really stemmed from the dual considerations of wanting to maintain high quality as the scale became more and more difficult for reviewers and ourselves to handle with the same quality standards for publication, and of seeing that there were some particular challenges that could be solved only by AI.
So, we needed AI to really augment decision-making powers for us.
It’s so good to see AI helping with scaling up and making workflows more efficient. But AI is sometimes viewed with a lot of distrust, especially amongst the scientific community. How were you and your team able to get buy-in from internal stakeholders about using AI in peer review?
I think it’s healthy that this is the default setting. I don’t think we should blindly trust AI without knowing how it’s used, who created it, and what it is aimed at performing. So, I actually support this view and I think it’s up to the developers and the people providing these tools as a service to build the trust and the results for those who are using it.
As you mentioned in the introduction, I’m an elected council member for the Committee on Publication Ethics (COPE), where I’ve been somewhat of their AI expert for a few years. And I hosted a webinar specifically called ‘Trustworthy AI for the Future of Publishing’, a couple of years ago. In there, I was presenting the discussion document that we wrote with COPE that people can find online on their website. It’s on AI decision-making and publishing. And in there, we say that AI should not be making decisions on acceptance and rejection at this stage.
I also had in this webinar Nischay Shah, the CTO of Cactus Labs, who was presenting on leveraging AI to improve the quality of publications, and Ibo van de Poel, a Professor of Ethics and Technology at TU Delft, who talked about the realization of trust-ready AI and the need for fairness, accountability, and explainability.
And these points are basically what we did internally to get buy-in from our teams. We designed the tools with the IT teams in a way that was explainable, and we took time to explain really what the AI did and where it took data from to do that as well.
We then made sure that no decisions were made without a human in the loop. The AI would flag anomalies and a team member would have a look at them; that’s still the case. We also designed the system so that we could tell the tool whether it had made a mistake, and in what way, and give that feedback directly to the tool.
And that’s an important part of building the trust with users when you can also provide the feedback to improve the algorithm. So, in a way, you see it as AI augments your human decisions, but human decisions augment the AI tool back.
And as I mentioned earlier, we had, and still have, a pretty long wish list of quality checks we wanted AI to help with. Currently, we have roughly 30 different checks with automation, and 13 of those actually use artificial intelligence technology, not just automation. For each check, when we released it in the platform, it had a specific label literally called ‘untrusted check’: we could look at its results to help train the AI, but we would not base a decision on them yet, because they were deemed not trustworthy enough. Some of them remained untrusted for a year, some only a few weeks.
It always depended on feedback from the teams on how accurate and useful the tools turned out to be, and then they would become ‘trusted checks’, with the specific label. And last year, we trusted several of these checks enough to release them to the external editors and reviewers evaluating manuscripts in our platform. So, we show them the results from the AI and let them provide their own feedback as well.
We also added little explanatory videos on how to use the AI and, for each of the checks, details on what the AI detected. That’s really the needed transparency that is part of developing trustworthy AI. So overall, it’s a step-by-step process, driven by knowledge sharing and accuracy, that can lead us to trust.
So, you mentioned using AI for quality checks. Can you also use AI at other stages in a journal’s workflow? And again, how can a journal decide when to use AI and when not to in order to maximize efficiency and productivity?
I think that’s the beauty of artificial intelligence tools. They can really be used at any and every stage and we can tailor and design tools to assist in the workflows that we need.
Obviously, you need to find the tools that you need, or you can develop them yourself if you have the technology and the developers. Different journals or publishers will need AI tools at different stages depending on their way of working. If you work for a journal that has a 90% desk-rejection rate, for efficiency they should focus on initial manuscript quality tools, so they can assess and reject effectively early on.
For journals that have higher publication rates, that have more papers going to reviewers, or that have open peer review, they might want to use more minimal validation tools to ensure a paper has the right structure and the right language. But then they might want to focus AI tools and support tools during peer review, or after reviewer recommendations, to do a final verification before publication, for example.
So, I would say a journal deciding where to use AI, and what tools they should bring in, should base that on where they see themselves having to do a large amount of manual work to detect issues or anomalies in a small number of papers.
But that is something you really need to do because it’s a critical or major risk if you don’t catch it. I’ll give you an example. At Frontiers, we used to manually look at every single figure submitted with the manuscript.
Currently, we get about 20,000 figures per week, which would require an enormous amount of time from team members to look at individually. And the reasons for checking the images were not about the science, because we mostly entrust that to our editors and reviewers. We were verifying for potential image manipulation, which is fraud, and a high risk if papers with issues are published. Another very high-risk situation we were looking for was identifiable human images in the figures, for which we absolutely need a consent form from the person; usually that would be for patients, and the paper would reveal medical information, so it’s very important. These are problematic cases, and for the 2 to 3% of manuscripts that might have a gel image, a microscopy image, or a human image, we were manually looking at them all. So, what we looked into doing was designing our own AI tool, with a machine learning segmentation model and point matching and clustering algorithms, the whole thing, to detect image integrity issues.
And we created a separate ML model to detect humans. From there we trained both models, we tested them, we built the trust, and we are now able to rely on this for image integrity flags. It does produce some false positives, with images flagged that don’t have issues, but we trained it to err on the side of caution to make sure we had the maximum recall.
That means making sure we don’t miss any true positives, any problematic cases. So, we now look at maybe 1,000 images a week, 5% of the total, and the AI tool gives us specific details and highlights on the figures showing what to look at. Overall, this resulted in a massive increase in efficiency: a 100-fold decrease in the time spent on this particular task.
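The ‘err on the side of caution’ tuning Dr Soulière describes can be sketched as choosing a flagging threshold that preserves perfect recall on a validation set. This is a hypothetical illustration with made-up scores, not Frontiers’ actual model or data:

```python
# Hypothetical sketch: pick the highest flagging threshold that still
# catches every known problematic image, trading precision for recall.

def max_recall_threshold(scores, labels):
    """Return the highest threshold that still flags every true positive.

    scores: model suspicion scores in [0, 1]
    labels: 1 if the image truly has an issue, 0 otherwise
    """
    positive_scores = [s for s, y in zip(scores, labels) if y == 1]
    # Flag everything at or above the lowest-scoring true positive,
    # guaranteeing 100% recall at the cost of some false positives.
    return min(positive_scores)

# Made-up validation data for illustration only.
scores = [0.95, 0.70, 0.88, 0.15, 0.62, 0.05]
labels = [1,    0,    1,    0,    1,    0]

t = max_recall_threshold(scores, labels)  # 0.62
flagged = [s >= t for s in scores]        # these go to a human reviewer
```

Everything scored at or above the threshold goes to a human for review, so the extra false positives (here, the 0.70 score) cost some reviewer time, but no problematic image slips through.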
And this was a clear example for us of a choice we made to develop an AI tool for a particular task that we thought we were spending a lot of time on in our journal workflow and that required improved efficiency. I think as a plus, to conclude this little part, was also that we found out this AI tool sometimes does catch manipulations that we would not have caught ourselves.
So, that’s something else special about these tools. It’s this ability to again augment human decision-making in addition to improving efficiency.
So, this is really interesting, especially when we know peer review can be so complex and subjective, and there’s so much more that AI can do. I’m sure there’s a long way we can go with using and developing AI to make things even easier and more effective.
How can AI address these challenges while maintaining the quality and integrity of the review process? Could you share any specific features or mechanisms of peer review tools that would ensure unbiased evaluation?
Well, I think first we have to get out of our heads the concept that AI is unbiased, because it has biases. They will probably be different ones than we humans have, maybe less cultural, less prejudiced. But there are biases in AI as well, based on what data the AI is trained on and what the query or the lines of code used to develop it were. For some aspects, yes, AI can be more blind, I would say, than we or editors are: less prejudiced with regard to author names and their country of origin, and it won’t have a personal relationship with the authors that would bias its assessment. So that’s a plus, and in theory it has the potential to be less biased in evaluations. But the challenge we face is that AI is trained on existing data, with decisions that were made by humans previously, and that does have some biases, which are really hard to abstract away from the training models.
So, because historically more papers have been published from, for example, researchers in the UK, in Germany, in the US, the trained AI is likely to have some form of bias in favour or against papers with similarity to these, even if we don’t intend it as such. And sometimes we can develop biases we did not intend for as well, just due to the way it was trained.
I actually have a fun story I can share regarding this as well, about a bias I detected in one of our AI tools. It was some years ago. We developed a reviewer recommender that we still use today, though we’ve improved on it. It was built with machine learning technology, and as part of our efforts to build internal trust in the tool, I was cross-checking the results for accuracy, and I started noticing a weird pattern, where the top five reviewer recommendations were often based in the same country as the authors of the manuscript. So initially, I thought that must have been due to either big countries with a lot of researchers, or the opposite, very niche research fields.
For example, if you work on yellow fever or a specific type of fish in the North Sea, there are likely several experts in your specific country working on this as well. And I knew the AI was also doing semantic matching between phrases in the manuscript abstract and the abstracts of other papers, so I thought this might be what was happening. But digging further, my conclusions didn’t hold, as the topics were not niche, and I kept finding more and more cases: papers with Chinese authors where the top four or five recommended reviewers were from China, two researchers from Scotland recommended to edit a paper from a Scottish group, three reviewers from Norway for a paper with authors based there, and a lot of Italian authors with recommendations of Italian reviewers as well.
And I was very puzzled, and we studied this for a while, and in the end, we found out that it was based on specific terms or phrases used by researchers when writing in English. Italians writing research in English tend to phrase certain things differently than Americans, who do it differently than Scots or Chinese authors. And so, the semantic matching went beyond the matching of the content of the phrases, and it matched the language structure as well.
And this generated an unexpected bias that we had to address. And this was a very revealing example of the AI having learned something we didn’t expect and that acquired a bias we didn’t anticipate as well.
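The language-structure matching she describes can be illustrated with a toy computation. Assuming nothing about the real recommender, even a plain cosine similarity over word counts shows how shared phrasing can outweigh shared topic:

```python
# Toy illustration (not Frontiers' recommender): cosine similarity over
# raw word counts can score two abstracts with the same writing style
# higher than two abstracts on the same topic.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm

# Same stylistic scaffolding ("in the present work we have investigated..."),
# different topics:
a = "in the present work we have investigated the effect of zinc on enzyme activity"
b = "in the present work we have investigated the effect of light on plant growth"
# Same topic as `a`, different phrasing:
c = "this study examines how zinc influences enzyme activity"

style_match = cosine(a, b)  # different topic, shared phrasing
topic_match = cosine(a, c)  # shared topic, different phrasing
# The style match scores higher: style_match > topic_match
```

Real systems use far richer representations, but the same effect appears whenever stylistic scaffolding dominates the vector; common mitigations include removing stop words, down-weighting frequent phrases (e.g. TF-IDF), or explicitly training the model to ignore style.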
So, this story tells us that developers of AI tools for publishing really need to work with researchers and publishers to see how best to address potential biases that exist in the databases and in the code we write to develop the AI, or that might be generated by the AI without any intention to do so.
And you can only find it out by testing. I think the biggest strength for peer review right now is the use of a combination of AI tools that might have some biases with human oversight and decision-making, where we will have other biases of our own, and the combination of both might even out to fairer and less subjective peer review overall.
Right, that’s a really interesting bias that you mentioned and it’s cool that you picked it out and there are these opportunities then to go back and fix and improve the tool experience for whoever’s using it as well.
And I’m sure there’s a lot more that AI can do. How do you envision the future of peer review with the continued integration of AI? What changes or improvements might we see in the coming years?
There is a lot going on with AI and in terms of publishing itself, we’re literally in a race between publishers and companies selling fake papers or manipulated research to be published. I think in the near future we’ll find out if they manage to completely outmanoeuvre us or if we manage to develop AI tools sophisticated enough to detect fraud at that level.
I think we’re doing good. We do have tools to detect manipulated images, weird citation patterns, some undisclosed conflicts of interest, some level of misconduct, statistical analysis issues, and some forms of peer review manipulation. But fully faked data and conclusions, there’s no tool that I know of right now that can detect that.
So, the overall issue is going to have to be tackled in another way, by removing the need for fake papers to be generated in the first place. And this is a big task undertaken by a lot of organizations coming together, including COPE, STM, and others, with institutions and governmental agencies around the world. And so, to answer your question: those types of misconduct have really been the focus of our AI developments in peer review for years now, to try to prevent fraud. If they finally start to diminish, we’ll have the time, energy, and effort to focus on some of the promising improvements for peer review, with a lot of standard quality checks that can be performed on format, style, language, and citations, and all the little flags for specific points that reviewers can assess without having to look for all the details.
So, in parallel, I also foresee in the short term the use of generative AI as a very useful tool from the author’s side, with the ability to allow non-native English speakers or researchers with other language difficulties and disabilities to take advantage of writing tools. In the mid- to long-term, I personally believe that writing scholarly articles will be fully outsourced to generative AI, with researchers creating the studies, performing the research, yes, but the writing would be generated by AI and researchers would only validate it as accurate and take responsibility for content before it would be submitted to be validated by peers, and maybe Reviewer X, right.
So, I think there can be major advancements in how we disseminate research output and prepare it to be shared with the world, and doing the research will remain key and having it validated by peers as well, I think, as I don’t think we will trust AI to do either of these things for a long while. But the steps in between, of writing papers, of checking format, language citations, editing to fit a certain style, I think are all likely to be more and more just left to the new artificial intelligence solutions, so that researchers from everywhere can focus on what they do best, which is the scientific research itself for the benefit of mankind.
Yeah. Wow, this is really cool. As a researcher a long time ago, writing was one of my weak points.
I can imagine AI taking over all of those parts and making everybody’s lives so much easier. Researchers could be on par irrespective of their ability to write and edit. They do excellent research; it’s just a matter of communicating it. So, that’s great.
I mean, we have some things to look forward to. Thank you so much, Marie, for this informative session. I mean, we all learned where AI is and where it’s going to be next.