
Native and non-native raters of L2 speaking performance: accent familiarity and cognitive processes

Bogorevich, Valeriia (2018) Native and non-native raters of L2 speaking performance: accent familiarity and cognitive processes. Doctoral thesis, Northern Arizona University.

Bogorevich_V_2018_Native_and_non-native_raters_of_L2_speaking_performance.pdf - Published Version (PDF, 3MB)


Rater variation in performance assessment can impact test-takers’ scores and compromise an assessment’s fairness and validity (Crooks, Kane, & Cohen, 1996); it is therefore important to investigate raters’ scoring patterns in order to inform rater training. Substantial work has analyzed rater cognition in writing assessment (e.g., Cumming, 1990; Eckes, 2008); however, few studies have tried to classify factors that could contribute to rater variation in speaking assessment (e.g., May, 2006). The present study used a mixed-methods approach (Tashakkori & Teddlie, 1998; Greene, Caracelli, & Graham, 1989) to investigate potential differences between native English-speaking and non-native English-speaking raters in how they assess L2 students’ speaking performance. Kane’s (2006) argument-based approach to validity served as the theoretical framework. The study challenged the plausibility of the assumptions underlying the evaluation inference, which links the observed performance to the observed score and depends on the assumption that raters apply the scoring rubric accurately and consistently. The study analyzed raters’ scoring patterns when using a TOEFL iBT speaking rubric analytically: raters provided scores for each rubric criterion (i.e., Overall, Delivery, Language Use, and Topic Development). Each rater received individual training, practice, and calibration experience. All raters completed a background questionnaire about their teaching experience, language learning history, the backgrounds of students in their classrooms, and their exposure to and familiarity with the non-native accents used in the study. For the quantitative analysis, two groups of raters, 23 native (North American) and 23 non-native (Russian), graded and left comments on speech samples from speakers of Arabic (n = 25), Chinese (n = 25), and Russian (n = 25) L1 backgrounds.
Students’ samples were responses to two independent speaking tasks, and the responses ranged from low to high proficiency levels. For the qualitative part, 16 raters (7 native and 9 non-native) shared their scoring behavior through think-aloud protocols and interviews. The speech samples graded during the think-aloud sessions included Arabic (n = 4), Chinese (n = 4), and Russian (n = 4) speakers. Raters’ scores were examined with many-facet Rasch measurement (MFRM) using the FACETS software (Linacre, 2014) to test group differences between native and non-native raters, as well as between raters familiar and unfamiliar with the accents of the students in the study. In addition, raters’ comments were coded and used to explore rater group differences. The qualitative analyses involved thematic coding of transcribed think-aloud and interview sessions using content analysis (Strauss & Corbin, 1998) to investigate raters’ cognitive processes and their perceptions of their rating processes. The coding included themes such as decision-making and re-listening patterns, perceived severity, criteria importance, and non-rubric criteria (e.g., accent familiarity, L1 match). Afterward, the quantitative and qualitative results were analyzed together to describe potential sources of rater variability, employing a side-by-side comparison of qualitative and quantitative data (Onwuegbuzie & Teddlie, 2003). The results revealed no radical differences between native and non-native raters; however, some distinct patterns were observed. Non-native raters showed more lenient grading toward students whose L1 matched their own. In addition, all raters, regardless of group, demonstrated several rating patterns depending on their focus while listening to examinees’ performances and their interpretations of the rating criteria during decision-making.
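For reference, the many-facet Rasch model estimated by FACETS is conventionally written as follows (a standard rating-scale formulation of the model, not reproduced from the thesis; the symbol names are illustrative):

```latex
% Log-odds that examinee n receives category k rather than k-1
% from rater j on criterion i (rating-scale form of the MFRM):
\log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
% B_n : ability of examinee n
% D_i : difficulty of criterion i (e.g., Delivery, Language Use)
% C_j : severity of rater j
% F_k : difficulty of the step from category k-1 to category k
```

Under this model, systematic differences in the estimated rater-severity parameters between the native and non-native groups (or between accent-familiar and accent-unfamiliar raters) are what the group comparisons above test.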
The findings can motivate professionals who oversee and train raters at testing companies and intensive English programs to study their raters’ scoring behaviors and to individualize training, helping to make exam ratings fair and raters interchangeable.

Item Type: Thesis (Doctoral)
Publisher’s Statement: © Copyright is held by the author. Digital access to this material is made possible by the Cline Library, Northern Arizona University. Further transmission, reproduction or presentation of protected items is prohibited except with permission of the author.
Keywords: Accent familiarity; L2 speaking performance assessment; MFRM; Mixed methods; Native and Non-native raters; Rater cognition; Language, literature and linguistics
Subjects: P Language and Literature > P Philology. Linguistics
P Language and Literature > PE English
NAU Depositing Author Academic Status: Student
Department/Unit: Graduate College > Theses and Dissertations
College of Arts and Letters > English
Date Deposited: 31 Oct 2018 01:01
URI: http://openknowledge.nau.edu/id/eprint/5406
