News Release

New study reveals high rates of fabricated and inaccurate citations in LLM-generated mental health research

Peer-Reviewed Publication

JMIR Publications


(Toronto, November 17, 2025) A new study published in the peer-reviewed journal JMIR Mental Health by JMIR Publications highlights a critical risk in researchers' growing use of large language models (LLMs) such as GPT-4o: the frequent fabrication and inaccuracy of bibliographic citations. The findings underscore an urgent need for rigorous human verification and institutional safeguards to protect research integrity, particularly for specialized and less publicly familiar topics within mental health.

Nearly 1 in 5 Citations Fabricated by GPT-4o in Literature Reviews

The article, titled "Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study," found that 19.9% of all citations generated by GPT-4o across six simulated literature reviews were entirely fabricated, meaning they could not be traced to any real publication. Furthermore, among the seemingly real citations, 45.4% contained bibliographic errors, most commonly incorrect or invalid Digital Object Identifiers (DOIs).

The research is timely: academic journals have already encountered apparently AI-hallucinated references in recent submissions. These bibliographic hallucinations and errors are not mere formatting problems; they break the chain of verifiability, mislead readers, and fundamentally compromise the integrity and trustworthiness of scientific results and the cumulative knowledge base. Careful scrutiny and verification are therefore paramount to safeguarding academic rigor.

Reliability Varies by Topic Familiarity and Specificity

The research, conducted by Jake Linardon, PhD, of Deakin University, and colleagues, systematically tested the reliability of GPT-4o's citations across mental health topics with varying levels of public awareness and scientific maturity: major depressive disorder (high familiarity), binge eating disorder (moderate), and body dysmorphic disorder (low). The team also compared general review prompts with specialized ones (e.g., focusing on digital interventions).

  • Fabrication Risk is Highest for Less Familiar Topics: Fabrication rates were significantly higher for topics with lower public familiarity and research coverage, such as binge eating disorder (28%) and body dysmorphic disorder (29%), compared to major depressive disorder (6%).

  • Specialized Topics Pose a Higher Risk: Although the effect was not universal, stratified analysis showed that fabrication rates were significantly higher for specialized reviews (e.g., evidence for digital interventions) than for general overviews of certain disorders, such as binge eating disorder.

  • Overall Inaccuracy is Pervasive: In total, nearly two-thirds of all citations generated by GPT-4o were either fabricated or contained errors, indicating a major reliability issue.

Urgent Call for Human Oversight and New Safeguards

The study’s conclusions issue a strong warning to the academic community: citation fabrication and errors remain common in GPT-4o outputs. The authors stress that the reliability of LLM-generated citations is not fixed but contingent on the topic and on how the prompt is designed.

Key Implications Highlighted in the Study:

  • Rigorous Verification is Mandatory: Researchers and students must subject all LLM-generated references to careful human verification to validate their accuracy and authenticity.

  • Journal and Institutional Role: Journal editors and publishers must implement stronger safeguards, potentially using detection software that flags citations that do not match existing sources, signaling a potential hallucination.

  • Policy and Training: Academic institutions must develop clear policies and training to equip users with the skills to critically assess LLM outputs and to design strategic prompts, especially when exploring less visible or highly specialized research topics.
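One lightweight check that journals and researchers can automate, in the spirit of the verification steps urged above, is screening reference lists for malformed DOIs before attempting to resolve them. The sketch below is an illustration, not the study's method: it applies a commonly used syntactic pattern for modern DOIs. A well-formed DOI must still be resolved (e.g., via doi.org) and its returned metadata compared against the claimed title and authors before a citation counts as verified, since hallucinated citations can carry plausible-looking DOIs.

```python
import re

# Heuristic pattern for the shape of modern DOIs (prefix "10." plus a
# 4-9 digit registrant code, a slash, and a suffix). Matching this pattern
# does NOT guarantee the DOI resolves or belongs to the cited work.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:a-z0-9]+$", re.IGNORECASE)

def doi_is_well_formed(doi: str) -> bool:
    """Return True if the string matches the common DOI shape."""
    return bool(DOI_PATTERN.match(doi.strip()))

# The DOI of the article described in this release passes the syntax check;
# a string with a "doi:" prefix or random text does not.
print(doi_is_well_formed("10.2196/80371"))      # True
print(doi_is_well_formed("doi:10.2196/80371"))  # False (prefix not stripped)
print(doi_is_well_formed("not-a-doi"))          # False
```

A full pipeline would follow this syntax screen with an HTTP lookup of each surviving DOI and a fuzzy match of the retrieved metadata against the reference as written, flagging any mismatch for human review.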

 

Original article:

Linardon J, Jarman H, McClure Z, Anderson C, Liu C, Messer M. Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study. JMIR Ment Health 2025;12:e80371

URL: https://mental.jmir.org/2025/1/e80371

DOI: 10.2196/80371

 

About JMIR Publications

JMIR Publications is a leading open access publisher of digital health research and a champion of open science. With a focus on author advocacy and research amplification, JMIR Publications partners with researchers to advance their careers and maximize the impact of their work. As a technology organization with publishing at its core, we provide innovative tools and resources that go beyond traditional publishing, supporting researchers at every step of the dissemination process. Our portfolio features a range of peer-reviewed journals, including the renowned Journal of Medical Internet Research.

To learn more about JMIR Publications, please visit jmirpublications.com or connect with us via X, LinkedIn, YouTube, Facebook, and Instagram.

Head office: 130 Queens Quay East, Unit 1100, Toronto, ON, M5A 0P6 Canada

Media Contact:

Dennis O’Brien, Vice President, Communications & Partnerships

JMIR Publications

communications@jmir.org

The content of this communication is licensed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, published by JMIR Publications, is properly cited.
