Mental health researchers who rely on AI tools like ChatGPT to expedite their work should be cautious: a recent study from Australian researchers found that the chatbot frequently produces incorrect or entirely fabricated citations, with errors in more than half of the references it generated.
Researchers at Deakin University tasked GPT-4o with generating six literature reviews on mental health topics. They found that 19.9% of the 176 citations the AI produced were completely fabricated. Among the 141 real citations, 45.4% contained errors such as incorrect publication dates, page numbers, or invalid digital object identifiers (DOIs).
Overall, only 77 of the 176 citations (43.8%) were both real and accurate, meaning a troubling 56.2% were either fabricated or contained errors. The study, published in JMIR Mental Health, also identifies a pattern in when and why these errors occur, a warning for researchers who face pressure to publish and increasingly turn to AI tools for assistance.
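As a quick sanity check on how those figures fit together (a back-of-the-envelope calculation, not part of the study itself), the reported counts and percentages are mutually consistent:

```python
# Back-of-the-envelope check that the study's reported figures fit together.
total = 176                                  # citations generated across six reviews
fabricated = 35                              # completely fabricated citations (19.9%)
real = total - fabricated                    # 141 real citations
real_with_errors = round(real * 0.454)       # 45.4% of real citations had errors -> 64
real_and_accurate = real - real_with_errors  # -> 77

print(f"fabricated: {fabricated / total:.1%}")                                 # 19.9%
print(f"real and accurate: {real_and_accurate / total:.1%}")                   # 43.8%
print(f"fabricated or erroneous: {(total - real_and_accurate) / total:.1%}")   # 56.2%
```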
The Phantom Paper Problem: AI’s Fabricated Sources
The fabricated citations ChatGPT produced weren’t obviously fake. Of the 35 fabricated sources, 33 included a DOI, and 64% of those DOIs linked to actual published papers on completely unrelated topics, making the fabrications hard to spot without thorough verification.
The remaining 36% of fake DOIs were invalid or nonfunctional. In either case, the citations failed to support the claims made in the AI-generated text.
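Both failure modes, DOIs that resolve to an unrelated paper and DOIs that resolve to nothing, can be surfaced by a simple automated lookup before any human reads the citation. The sketch below is a minimal illustration assuming the public Crossref REST API, not a tool used in the study; it retrieves the title registered for a DOI so it can be compared against the cited paper:

```python
import requests


def lookup_doi(doi: str) -> str | None:
    """Return the title registered for a DOI via the public Crossref API,
    or None if the DOI does not resolve to any record."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        return None  # DOI is not registered: likely invalid or fabricated
    resp.raise_for_status()
    titles = resp.json()["message"].get("title", [])
    return titles[0] if titles else ""


# Example: compare the registered title against the title the AI cited.
# (Hypothetical DOI shown for illustration only.)
registered = lookup_doi("10.1000/example-doi")
if registered is None:
    print("DOI does not resolve -> treat the citation as suspect")
else:
    print(f"DOI resolves to: {registered!r} -> check it matches the cited paper")
```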
Lead author Jake Linardon and his colleagues at Deakin University explored whether the AI’s performance varied based on the familiarity of the topic and the specificity of the request. They selected three psychiatric conditions for their experiment: major depressive disorder, binge eating disorder, and body dysmorphic disorder, each differing in public recognition and research volume.
Lesser-Known Topics Trigger More AI Hallucinations
GPT-4o’s citation accuracy varied significantly depending on the disorder it was tasked with writing about. For major depressive disorder, only 6% of citations were fabricated. However, for binge eating disorder and body dysmorphic disorder, fabrication rates rose to 28% and 29%, respectively.
Among real citations, major depressive disorder achieved 64% accuracy, binge eating disorder 60%, and body dysmorphic disorder only 29%. This pattern suggests that ChatGPT may perform better on well-established topics with abundant training data, although the study notes that this relationship wasn’t directly tested.
The study also examined whether accuracy depended on the type of review requested: a broad summary of each disorder, covering symptoms and treatments, versus a highly specific review focused on digital interventions for that condition. For binge eating disorder, fabrication rates jumped to 46% in the specialized reviews compared with 17% in the general overviews, but the pattern didn’t hold consistently across all three disorders.
Rising AI Adoption in Research Raises the Stakes
These findings emerge as AI adoption accelerates in research settings. A recent survey revealed that nearly 70% of mental health scientists report using ChatGPT for research tasks, including writing, data analysis, and literature reviews. While most users say the tools improve efficiency, many express concerns about inaccuracies and misleading content.
Researchers face growing pressure to publish frequently while juggling teaching, supervision, and administrative duties. Tools that promise to streamline literature reviews and speed up writing offer appealing solutions to productivity demands. However, accepting AI output without verification poses serious risks.
Fabricated references mislead readers, distort scientific understanding, and erode the foundation of scholarly communication. Citations guide readers to source evidence and build cumulative knowledge. When those citations point nowhere or to the wrong papers, the entire system breaks down.
Among the real citations, error rates varied by component. DOIs were the most error-prone element, wrong in 36.2% of cases, while author lists were the least affected at 14.9%; publication years, journal names, volume numbers, and page ranges fell between these extremes.
What Researchers and Institutions Must Do Now
Linardon’s team emphasizes that all AI-generated content requires rigorous human verification: every citation must be checked against its original source, confirmed to exist, and shown to actually support the statements attributed to it.
The authors also call for journals to implement stronger safeguards. One suggestion involves running plagiarism detection software in reverse: citations that don’t trigger matches in existing databases may signal fabricated sources worth investigating more closely.
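One way to approximate that reverse check is to search a bibliographic index for each cited title and flag citations with no close match. The sketch below is a hypothetical illustration of the idea, assuming Crossref’s bibliographic search endpoint and an arbitrary similarity threshold; it is not the safeguard the authors describe:

```python
import difflib

import requests


def title_has_close_match(cited_title: str, threshold: float = 0.85) -> bool:
    """Search Crossref for the cited title and report whether any indexed
    record is a close textual match. No match is a signal (not proof)
    that the citation may be fabricated."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": cited_title, "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for indexed_title in item.get("title", []):
            similarity = difflib.SequenceMatcher(
                None, cited_title.lower(), indexed_title.lower()
            ).ratio()
            if similarity >= threshold:
                return True
    return False


# Citations returning False would be queued for manual investigation.
```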
Academic institutions should develop clear policies around AI use in scholarly writing, including training on how to identify hallucinated citations and properly disclose when generative AI contributed to a manuscript.
The study found no clear evidence that newer AI versions have solved the hallucination problem, though direct comparisons with earlier models are limited by differences in how studies are designed. Despite expectations that GPT-4o would show improvements over earlier iterations, citation fabrication remained common across all test conditions.
Researchers can reduce risks by using AI preferentially for well-established subjects while implementing verification protocols for specialized areas where training data may be sparse. Topic characteristics matter: citation reliability isn’t random but depends on public familiarity, research maturity, and prompt specificity.
For now, ChatGPT’s output works best as a starting point that demands extensive human oversight, not a reliable shortcut researchers can fully trust. The tool can help generate initial drafts or organize ideas, but the verification burden remains squarely on human shoulders.
The findings also raise questions about how AI systems should be designed and marketed for academic use. If citation fabrication is predictable based on topic characteristics, developers might build in stronger warnings or verification prompts when users request information on specialized subjects.
Journals and funding bodies increasingly require authors to disclose AI use in research. This study provides evidence for why such transparency matters and why editorial review processes must adapt to catch AI-generated errors that traditional peer review might miss.
The scope of the problem extends beyond individual researchers. When fabricated citations enter the published literature, they can propagate through citation networks, mislead future researchers, and waste resources as scientists chase phantom sources or build on false premises. Institutional and systemic responses are needed, not just individual vigilance.