Artificial intelligence-generated responses to frequently asked questions on coccydynia: Evaluating the accuracy and consistency of GPT-4o's performance
Aslinur Keles, Ozge Gulsum Illeez, Berkay Erbagci, Esra Giray
Department of Physical Medicine and Rehabilitation, Health Science University, Fatih Sultan Mehmet Training and Research Hospital, İstanbul, Türkiye
Keywords: Artificial intelligence, ChatGPT-4o, coccydynia, coccyx pain, large language models.
Abstract
Objectives: This study aimed to assess whether GPT-4o's responses to patient-centered frequently asked questions about coccydynia are accurate and consistent when asked at different times and from different accounts.
Materials and methods: Questions were collected from medical websites, forums, and patient support groups and posed to GPT-4o. Two physiatrists evaluated the responses for accuracy and consistency. Each response was assigned to one of four categories: (i) correct and comprehensive, (ii) correct but incomplete, (iii) partially correct and partially incorrect, or (iv) completely incorrect. Scoring disagreements were resolved by an additional reviewer as needed. Statistical analysis, including Cohen's kappa for interreviewer reliability, was performed.
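For readers unfamiliar with the reliability statistic named above, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that calculation, not the authors' analysis code. The function name `cohens_kappa` and the ten example ratings are hypothetical and serve only as an illustration; they are not data from this study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items with categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same category by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over all categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined,
        # so this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings on the four-point scale (1 = correct and comprehensive,
# ..., 4 = completely incorrect). Illustrative values only, not study data.
reviewer_1 = [1, 1, 2, 2, 3, 1, 2, 1, 2, 2]
reviewer_2 = [1, 2, 2, 2, 3, 1, 1, 1, 2, 2]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the reviewers agree on 8 of 10 items (p_o = 0.80) and the chance agreement is p_e = 0.42, so kappa = 0.38 / 0.58. Values between 0.61 and 0.80 are conventionally described as substantial agreement.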
Results: Of the 81 responses, 45.7% were rated as correct and comprehensive, while 49.4% were correct but incomplete. Only 4.9% of the responses contained partially incorrect information, and no responses were completely incorrect. The interreviewer agreement was substantial (kappa=0.67), but 75% of the responses differed between the two rounds. Notably, 34.9% of initially incomplete answers improved in the second round.
Conclusion: GPT-4o shows promise in providing accurate and generally reliable information about coccydynia. However, the variability observed in response consistency across repeated queries suggests that while the model is useful for patient education and general inquiries, it may not be suitable for providing specialized clinical knowledge without human oversight.
Citation: Keles A, Illeez OG, Erbagci B, Giray E. Artificial intelligence-generated responses to frequently asked questions on coccydynia: Evaluating the accuracy and consistency of GPT-4o's performance. Arch Rheumatol 2025;40(1):63-71. doi: 10.46497/ArchRheumatol.2025.10966.
All authors contributed equally to this article.
The authors declared no conflicts of interest with respect to the authorship and/or publication of this article.
The authors received no financial support for the research and/or authorship of this article.
The data that support the findings of this study are available from the corresponding author upon reasonable request.