Exploring Rating Quality in the Context of High-Stakes Rater-Mediated Educational Assessments

University of Alabama Libraries

Constructed response (CR) items are widely used in large-scale testing programs, including the National Assessment of Educational Progress (NAEP) and many district- and state-level assessments in the United States. One distinctive feature of CR items is that they depend on human raters to assess the quality of examinees’ work. Human rating is a relatively subjective process because it rests on raters’ own understanding of the assessment context, interpretations of rubrics, expectations of performance, and professional experiences. As a result, human rating may introduce random error or bias, which may unfairly affect the ratings assigned. The main purpose of this dissertation is to provide insight into methodological issues that arise from the role of rater judgments in performance assessments. This dissertation includes three independent but related studies. The first study systematically explores how ignoring rater effects, when they are present, affects estimates of student ability. Results suggest that in simulation conditions that reflect many large-scale mixed-format assessments, directly modeling rater effects yields more accurate student achievement estimates than estimation procedures that do not incorporate raters. The second study proposes an iterative parametric bootstrap procedure to help researchers and practitioners evaluate rater fit more accurately. The results indicate that the proposed iterative procedure performs best: it has well-controlled false positive rates and high true positive and overall accuracy rates compared with the traditional parametric bootstrap procedure and rule-of-thumb critical values. The third study examines the quality of ratings in the Georgia Middle Grades Writing Assessment using both the Partial Credit model formulation of the Many Facets Rasch model (PC-MFR) and a Hierarchical Rater Model based on a signal detection model (HRM-SDT). Major findings suggest that rating quality varies across four writing domains, that rating quality varies across categories within each domain, that raters use the rating scale categories in a psychometrically sound way, and that there is some correspondence between rating quality indices based on the PC-MFR model and the HRM-SDT.

Electronic Thesis or Dissertation