Evaluation of the Diagnostic Performance of ChatGPT in Radiographic Staging of Sacroiliitis According to the Modified New York Criteria

Uğur Güngör Demir; Ali Nail Demir; Alper Uysal

doi:10.5152/ArchRheumatol.2026.25254

Original Article

Vol. 41 No. 1 (2026): Archives of Rheumatology

Published: Jan 16, 2026

Keywords:

Ankylosing spondylitis, artificial intelligence, ChatGPT, diagnostic accuracy, modified New York criteria, sacroiliitis

How to Cite

Güngör Demir, U., Demir, A. N., & Uysal, A. (2026). Evaluation of the Diagnostic Performance of ChatGPT in Radiographic Staging of Sacroiliitis According to the Modified New York Criteria. Archives of Rheumatology, 41(1), 57–63. https://doi.org/10.5152/ArchRheumatol.2026.25254

Uğur Güngör Demir

Department of Physical Medicine and Rehabilitation, Mersin City Training and Research Hospital, Mersin, Türkiye

https://orcid.org/0000-0002-5084-7280

Ali Nail Demir

Department of Rheumatology, Mersin City Training and Research Hospital, Mersin, Türkiye

https://orcid.org/0000-0001-5713-4120

Alper Uysal

Department of Physical Medicine and Rehabilitation, Mersin City Training and Research Hospital, Mersin, Türkiye

https://orcid.org/0000-0002-4114-1649

Abstract

Background/Aims: This study aimed to evaluate the diagnostic performance of ChatGPT in grading sacroiliitis on pelvic radiographs according to the modified New York criteria.

Materials and Methods: This retrospective study included 266 individuals classified according to the modified New York criteria: 231 with ankylosing spondylitis and 35 without radiographic evidence of sacroiliitis. Two experts independently graded all radiographs using the modified New York criteria, and disagreements were resolved by a third reviewer. ChatGPT-5o (OpenAI, 2025) was prompted to classify each radiograph using a standardized English-language instruction. ChatGPT’s grades were then compared with the expert consensus.

Results: A statistically significant association was found between ChatGPT and expert gradings, but agreement was only slight (κ = 0.136). Multi-class performance was limited (overall accuracy = 30%), whereas binary analysis showed a higher apparent accuracy (78%) that was attributable to a strong positive bias. Sensitivity was 0.796, specificity was 0.696, positive predictive value was 0.946, and negative predictive value was 0.338. Per-grade area under the curve (AUC) values ranged from 0.52 to 0.75, with the highest value for Grade 0.
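
For readers less familiar with the metrics reported above, the following is a minimal Python sketch of how sensitivity, specificity, predictive values, accuracy, and Cohen's κ are conventionally computed from a confusion matrix. The counts are hypothetical and are not the study's data; the function names are illustrative and do not reflect the authors' analysis software.

```python
from typing import Sequence


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard diagnostic metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


def cohen_kappa(matrix: Sequence[Sequence[int]]) -> float:
    """Unweighted Cohen's kappa for a square agreement matrix.

    matrix[i][j] is the number of cases graded i by the reference
    reader and j by the rater under evaluation.
    """
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical counts (NOT taken from the study), chosen so that
# positive cases far outnumber negative ones.
tp, fp, fn, tn = 80, 5, 20, 10
metrics = binary_metrics(tp, fp, fn, tn)
kappa = cohen_kappa([[tn, fp], [fn, tp]])

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"kappa: {kappa:.3f}")
```

With a sample dominated by positive cases, as in this made-up example, accuracy and positive predictive value can look high while κ and negative predictive value stay low, which is the general pattern the Results describe as a positive bias.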

Conclusion: ChatGPT demonstrated only limited agreement with expert assessments and showed poor ability to distinguish between sacroiliitis stages, performing adequately only for normal joints. These findings suggest that large language models like ChatGPT are unsuitable for direct radiographic interpretation without integration into specialized, vision-based diagnostic frameworks.

Cite this article as: Güngör Demir U, Demir AN, Uysal A. Evaluation of the diagnostic performance of ChatGPT in radiographic staging of sacroiliitis according to the modified New York criteria. Arch Rheumatol. 2026;41(1):57-63.

This work is licensed under a Creative Commons Attribution 4.0 International License.
