24 November, 2025
AI Citation Errors: Over Half of ChatGPT References Are Inaccurate

Mental health researchers who rely on AI tools like ChatGPT should take note of a warning from Australian researchers. A recent study reveals that more than half of the citations generated by the AI chatbot are either fabricated or contain errors, posing a substantial risk to academic integrity.

Researchers at Deakin University conducted an experiment with GPT-4o, tasking it with writing six literature reviews on various mental health topics. The study, published in JMIR Mental Health, found that 19.9% of the 176 citations generated were entirely fabricated. Among the 141 real citations, 45.4% contained errors such as incorrect publication dates, page numbers, or invalid digital object identifiers (DOIs).

Overall, only 43.8% of the citations were both real and accurate, leaving 56.2% either fabricated or erroneous. This finding is particularly concerning for researchers under pressure to publish, who increasingly turn to AI for assistance in speeding up their work.
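Those headline figures are internally consistent. As a quick sanity check (a minimal sketch that assumes only the percentages quoted above, not data taken from the paper itself), the overall accuracy rate follows directly from the fabrication and error rates:

```python
# Arithmetic check of the reported citation figures (uses only the
# percentages quoted in this article; not data from the paper itself).
total_citations = 176
fabricated = round(total_citations * 0.199)   # 19.9% fabricated -> ~35
real = total_citations - fabricated           # 141 real citations
erroneous_real = round(real * 0.454)          # 45.4% of real ones contain errors -> ~64
accurate = real - erroneous_real              # real and accurate -> ~77

print(f"Fabricated: {fabricated}")
print(f"Real but erroneous: {erroneous_real}")
print(f"Real and accurate: {accurate} ({accurate / total_citations:.1%})")  # ~43.8%
```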

The Phantom Paper Problem: Fabricated Sources

The study uncovered that the fabricated citations were not easily identifiable as fake. Notably, 64% of fabricated DOIs linked to real but unrelated papers, making it difficult to spot the errors without careful verification. The remaining 36% of fake DOIs were completely invalid or nonfunctional, rendering them useless in supporting the AI’s claims.

Lead author Jake Linardon and his team tested the AI’s performance across three psychiatric conditions: major depressive disorder, binge eating disorder, and body dysmorphic disorder. These conditions vary in public recognition and research volume, providing a spectrum for analysis.

Lesser-Known Topics Trigger More AI Hallucinations

The accuracy of GPT-4o’s citations varied significantly based on the topic. For major depressive disorder, only 6% of citations were fabricated. In contrast, for binge eating disorder and body dysmorphic disorder, the fabrication rates soared to 28% and 29%, respectively. This suggests that ChatGPT may perform better on well-established topics with abundant training data, although this relationship was not directly tested in the study.

The study also explored whether the specificity of the request affected accuracy. For binge eating disorder, specialized reviews saw fabrication rates jump to 46% compared to 17% for general overviews. However, this pattern did not consistently apply across all disorders.

Rising AI Adoption in Research Raises the Stakes

As AI adoption accelerates in research settings, the implications of these findings are profound. A recent survey indicates that nearly 70% of mental health scientists use ChatGPT for tasks such as writing, data analysis, and literature reviews. While most users report improved efficiency, many express concerns about inaccuracies and misleading content.

The pressure on researchers to publish frequently, alongside teaching and administrative duties, makes AI tools appealing for streamlining literature reviews and writing. However, accepting AI output without verification poses serious risks. Fabricated references can mislead readers, distort scientific understanding, and undermine scholarly communication.

As noted above, fabricated DOIs were particularly deceptive because most resolved to real but unrelated papers. Among genuine citations, the DOI was the most error-prone element, incorrect in 36.2% of cases, while author lists had the lowest error rate at 14.9%; publication years, journal names, volume numbers, and page ranges fell in between.

What Researchers and Institutions Must Do Now

Linardon’s team emphasizes the need for rigorous human verification of all AI-generated content. Every citation must be checked against original sources, and claims need validation. The authors also suggest that journals implement stronger safeguards, such as using plagiarism detection software in reverse to identify potentially fabricated sources.
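In practice, much of this checking can be semi-automated. As a rough illustration (a minimal sketch, not the authors' method; the example DOI is hypothetical and the similarity threshold is an arbitrary assumption), a script could resolve each cited DOI through the public Crossref API and flag citations whose registered title does not resemble the title being cited:

```python
import difflib
import requests

def check_doi(doi: str, cited_title: str, threshold: float = 0.6) -> str:
    """Look up a DOI on Crossref and compare its registered title to the
    title given in the citation. The 0.6 threshold is an assumption."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return "DOI does not resolve -- possibly fabricated"
    titles = resp.json()["message"].get("title", [])
    registered = titles[0] if titles else ""
    similarity = difflib.SequenceMatcher(
        None, registered.lower(), cited_title.lower()
    ).ratio()
    if similarity < threshold:
        return f"DOI resolves to a different paper: '{registered}'"
    return "DOI and title appear to match (still verify the full record)"

# Hypothetical example -- replace with the citations to be checked.
print(check_doi("10.1000/xyz123", "A made-up paper title"))
```

A non-resolving DOI or a mismatched title flagged this way would still need manual checking rather than being treated as proof of fabrication.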

Academic institutions should develop clear policies around AI use in scholarly writing, including training on identifying hallucinated citations and properly disclosing AI contributions to manuscripts. Despite expectations that newer AI versions like GPT-4o would improve over earlier iterations, citation fabrication remained common across all test conditions.

Researchers can mitigate risks by using AI preferentially for well-established subjects while implementing verification protocols for specialized areas where training data may be sparse. The reliability of citations is not random but depends on public familiarity, research maturity, and prompt specificity.

For now, ChatGPT-generated citations should be treated as a starting point that demands extensive human oversight, not as a reliable shortcut. While the tool can help generate initial drafts or organize ideas, the burden of verification remains firmly on human shoulders.

The findings also prompt questions about how AI systems should be designed and marketed for academic use. If citation fabrication is predictable based on topic characteristics, developers might incorporate stronger warnings or verification prompts when users request information on specialized subjects.

Journals and funding bodies increasingly require authors to disclose AI use in research. This study underscores the importance of such transparency and the need for editorial review processes to adapt to catch AI-generated errors that traditional peer review might overlook.

The issue extends beyond individual researchers. When fabricated citations enter the published literature, they can propagate through citation networks, mislead future researchers, and waste resources as scientists chase phantom sources or build on false premises. Institutional and systemic responses are needed, not just individual vigilance.