Validating LLM-as-a-Judge Systems under Rating Indeterminacy

We develop a framework for judge-system meta-evaluation under rating indeterminacy (Figure 1). Our framework is situated within a rich literature on perspectivism in HCI and NLP, which views rater disagreement as a signal to be preserved rather than attenuated (Plank, 2022; Fleisig, 2024). While perspectivist approaches to evaluation have traditionally focused on capturing inter-rater disagreement, in which multiple human raters can disagree due to sociocultural differences, our framework also captures intra-rater disagreement, in which the same rater can identify multiple “correct” ratings for a single item.