Performance of large language models ChatGPT and Gemini in child and adolescent psychiatry knowledge assessment

Bibliographic Details
Main Authors: Neubauer, Johanna Charlotte (Author), Kaiser, Anna (Author), Lettermann, Leon (Author), Volkert, Tobias (Author), Häge, Alexander (Author)
Format: Article (Journal)
Language:English
Published: September 19, 2025
In: PLOS ONE
Year: 2025, Volume: 20, Issue: 9, Pages: 1-9
ISSN:1932-6203
DOI:10.1371/journal.pone.0332917
Online Access: Publisher, free of charge, full text: https://doi.org/10.1371/journal.pone.0332917
Publisher, free of charge, full text: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0332917
Author Notes:Johanna Charlotte Neubauer, Anna Kaiser, Leon Lettermann, Tobias Volkert, Alexander Häge
Description
Summary:
Objective: This study evaluates the performance of four large language models—ChatGPT 4o, ChatGPT o1-mini, Gemini 2.0 Flash, and Gemini 1.5 Flash—in answering multiple-choice questions in child and adolescent psychiatry, to assess their level of factual knowledge in the field.

Methods: A total of 150 standardized multiple-choice questions from a specialty board review study guide were selected, ensuring a representative distribution across topics. Each question had five possible answers, with only one correct option. To account for the stochastic nature of large language models, each question was asked 10 times with randomized answer orders to minimize known biases. Accuracy for each question was assessed as the percentage of correct answers across the 10 requests. We calculated the mean accuracy for each model and performed statistical comparisons using paired t-tests to evaluate differences between Gemini 2.0 Flash and Gemini 1.5 Flash, as well as between Gemini 2.0 Flash and both ChatGPT 4o and ChatGPT o1-mini. As a post-hoc exploration, we identified questions with an accuracy below 10% across all models to highlight areas of particularly low performance.

Results: The accuracy of the tested models ranged from 68.3% to 78.9%. Both ChatGPT and Gemini demonstrated generally solid performance in the assessment of child and adolescent psychiatry knowledge, with variations between models and topics. The superior performance of Gemini 2.0 Flash compared with its predecessor, Gemini 1.5 Flash, may reflect advances in artificial intelligence capabilities. Certain topics, such as psychopharmacology, posed greater challenges than disorders with well-defined diagnostic criteria, such as schizophrenia or eating disorders.

Conclusion: While the results indicate that large language models can support knowledge acquisition in child and adolescent psychiatry, limitations remain. Variability in accuracy across topics, potential biases, and risks of misinterpretation must be carefully considered before implementing these models in clinical decision-making.
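The evaluation protocol described in the Methods (each question asked 10 times with shuffled answer order, per-question accuracy, paired t-tests between models) can be sketched as follows. This is a minimal illustration, not the authors' actual code: the `oracle` and `first` answer functions are hypothetical stand-ins for real model API calls, and the paired t statistic is computed by hand from the standard formula.

```python
import math
import random
import statistics

def evaluate_model(answer_fn, questions, n_repeats=10, rng=None):
    """Ask each multiple-choice question n_repeats times with a
    shuffled option order; return per-question accuracy (0.0-1.0)."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    accuracies = []
    for stem, options, correct in questions:
        hits = 0
        for _ in range(n_repeats):
            shuffled = options[:]
            rng.shuffle(shuffled)  # randomized answer order, as in the study
            if answer_fn(stem, shuffled) == correct:
                hits += 1
        accuracies.append(hits / n_repeats)
    return accuracies

def paired_t(a, b):
    """Paired t statistic over per-question accuracies of two models:
    t = mean(d) / (sd(d) / sqrt(n)), with d the per-question differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Toy stand-ins for model calls (purely illustrative, not real APIs):
# 'oracle' always answers correctly; 'first' always picks the first option.
questions = [(f"Q{i}", ["A", "B", "C", "D", "E"], "A") for i in range(150)]
oracle = lambda stem, opts: "A"
first = lambda stem, opts: opts[0]

acc_oracle = evaluate_model(oracle, questions)
acc_first = evaluate_model(first, questions)
print(statistics.mean(acc_oracle))            # 1.0
print(round(statistics.mean(acc_first), 2))   # close to the 0.2 chance level
print(paired_t(acc_oracle, acc_first))        # positive: oracle outperforms
```

With five options per question, the `first` baseline lands near the 20% chance level, which is why the repeated shuffled queries matter: they average out order-position biases that a single query would conflate with knowledge.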
Item Description: Viewed on October 23, 2025
Physical Description:Online Resource