student.uva.nl
Colloquium credits

Presentation Master's thesis - Jazmine Pajar - Psychological Methods

Last modified on 23-06-2026 11:43
Large Language Models vs. Human Experts as Authors and Judges of Educational Questions: A Judge Response Theory and Explanatory Item Response Modelling Approach
Start date
29-06-2026 13:00
End date
29-06-2026 14:00
Location

Large Language Models (LLMs) are increasingly used to generate educational questions and evaluate their quality. While this may reduce the time required for question creation and quality control, it also raises concerns about whether LLMs can produce high-quality questions and judge them against human expert standards.

To investigate these issues, this study examines LLMs as both authors and judges of educational multiple-choice questions, in collaboration with Futurewhiz, a company that operates digital learning platforms for primary and secondary education. A balanced dataset was created, comprising human-expert-authored questions and questions generated by the latest models from OpenAI, Anthropic, Google Gemini, Meta Llama, and Mistral AI available at the time of data collection. All questions were rated by educational experts and LLM judges using a standardised four-dimensional quality rubric, and objective quality metrics were calculated for each question.

A methodological limitation in many studies comparing educational question quality is that the observed ratings may reflect both question quality and judge-specific differences, such as leniency, which makes it difficult to estimate quality accurately. Similarly, comparative studies of LLM and human judges often rely solely on rating agreement; while this shows whether both judge types assign similar scores, it does not reveal which features of question quality drive those ratings. To address these limitations, this study uses Judge Response Theory (JRT) to compare question quality while accounting for judge-specific differences, and Explanatory Item Response Modelling (EIRM) to examine which features drive human and LLM quality judgements.
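
The confounding of question quality with judge leniency described above can be illustrated with a small sketch. It is not the JRT model used in the thesis: it assumes a much simpler, noise-free linear model in which each score equals question quality plus judge leniency. The questions, judges, scores, and function names are all hypothetical. Three questions of identical quality are rated by a lenient and a severe judge, with only one question rated by both.

```python
# Toy illustration (NOT the JRT model): a linear additive stand-in,
# score = question quality + judge leniency, with no noise.
# All names and numbers below are hypothetical.

# (question, judge, score). True quality is 3.0 for every question;
# the "lenient" judge adds 0.5 and the "severe" judge subtracts 0.5.
RATINGS = [
    ("q1", "lenient", 3.5),
    ("q2", "lenient", 3.5),
    ("q2", "severe", 2.5),
    ("q3", "severe", 2.5),
]


def raw_means(ratings):
    """Average the observed scores per question, ignoring who rated it."""
    scores = {}
    for question, _, score in ratings:
        scores.setdefault(question, []).append(score)
    return {q: sum(s) / len(s) for q, s in scores.items()}


def fit_additive(ratings, n_iter=200):
    """Estimate quality and leniency by alternating averages (backfitting).

    Leniency is centred to mean zero across judges so that the two sets
    of effects are identified.
    """
    questions = sorted({q for q, _, _ in ratings})
    judges = sorted({j for _, j, _ in ratings})
    quality = {q: 0.0 for q in questions}
    leniency = {j: 0.0 for j in judges}
    for _ in range(n_iter):
        # Question quality: mean score after removing judge leniency.
        for q in questions:
            resid = [s - leniency[j] for qq, j, s in ratings if qq == q]
            quality[q] = sum(resid) / len(resid)
        # Judge leniency: mean score after removing question quality.
        for j in judges:
            resid = [s - quality[q] for q, jj, s in ratings if jj == j]
            leniency[j] = sum(resid) / len(resid)
        # Centre leniency and move the shift into quality.
        shift = sum(leniency.values()) / len(judges)
        for j in judges:
            leniency[j] -= shift
        for q in questions:
            quality[q] += shift
    return quality, leniency


RAW = raw_means(RATINGS)
QUALITY, LENIENCY = fit_additive(RATINGS)

print("raw means:        ", RAW)
print("adjusted quality: ", QUALITY)
print("judge leniency:   ", LENIENCY)
```

In this toy design the raw means rank q1 above q3 only because q1 happened to be rated by the lenient judge, whereas the adjusted estimates place the three questions at the same level. JRT addresses the same problem with an item response model for ordinal ratings rather than this linear shortcut.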