Utilizing large language models for EFL essay grading: an examination of reliability and validity in rubric-based assessments

dc.authorid: 0000-0003-2645-2710
dc.contributor.author: Yavuz, Fatih
dc.contributor.author: Çelik, Özgür
dc.contributor.author: Yavaş Çelik, Gamze
dc.contributor.authorid: 131069
dc.date.accessioned: 2024-06-27T08:37:02Z
dc.date.available: 2024-06-27T08:37:02Z
dc.date.issued: 2024-05
dc.department: Units Affiliated with the Rectorate, Common Courses Coordination Office
dc.description.abstract: This study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student essays of varying quality. The grading scale comprised five domains: grammar, content, organization, style & expression, and mechanics. The results revealed that the fine-tuned ChatGPT model demonstrated a very high level of reliability with an intraclass correlation coefficient (ICC) of 0.972, the default ChatGPT model exhibited an ICC of 0.947, and Bard showed a substantial level of reliability with an ICC of 0.919. Additionally, a significant overlap was observed in certain domains between the grades assigned by the LLMs and those assigned by human raters. In conclusion, the findings suggest that while the LLMs demonstrated notable consistency and potential grading competency, further fine-tuning and adjustment are needed for a more nuanced understanding of non-objective essay criteria. The study not only offers insights into the potential use of LLMs in grading student essays but also highlights the need for continued development and research.
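The reliability figures above are intraclass correlation coefficients computed across raters. As a minimal illustration (not the authors' analysis code), the following Python sketch shows how an ICC of this kind can be computed from long-format essay scores with the pingouin library; the rater labels and scores are hypothetical.

```python
# Minimal sketch, assuming long-format data: one row per (essay, rater) pair.
# All rater labels and scores below are hypothetical, for illustration only.
import pandas as pd
import pingouin as pg

scores = pd.DataFrame({
    "essay": ["E1", "E2", "E3"] * 4,
    "rater": (["human_1"] * 3 + ["human_2"] * 3
              + ["chatgpt"] * 3 + ["bard"] * 3),
    "score": [82, 64, 47,  85, 61, 50,   # hypothetical human totals
              80, 66, 45,  78, 69, 52],  # hypothetical LLM totals
})

# pingouin reports all six ICC variants (ICC1..ICC3k); which one applies
# depends on the rating design (e.g. two-way random effects when a fixed
# set of raters grades every essay).
icc = pg.intraclass_corr(data=scores, targets="essay",
                         raters="rater", ratings="score")
print(icc[["Type", "Description", "ICC", "CI95%"]])
```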
dc.description.sponsorship: TÜBİTAK
dc.identifier.citation: Yavuz, F., Çelik, Ö., & Yavaş Çelik, G. (2024). Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric-based assessments. British Journal of Educational Technology, 00, 1–17. https://doi.org/10.1111/bjet.13494
dc.identifier.doi: 10.1111/bjet.13494
dc.identifier.endpage: 17
dc.identifier.issn: 0007-1013
dc.identifier.startpage: 1
dc.identifier.uri: https://dspace.mudanya.edu.tr/handle/20.500.14362/202
dc.identifier.wos: WOS:001237843800001
dc.identifier.wosquality: Q1
dc.institutionauthor: Yavuz, Fatih
dc.language.iso: en
dc.publisher: Wiley
dc.relation.journal: British Journal of Educational Technology
dc.relation.publicationcategory: Article - International - Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: AI-based grading
dc.subject: automated essay scoring
dc.subject: generative AI
dc.subject: large language models
dc.subject: reliability
dc.subject: rubric-based grading
dc.subject: validity
dc.title: Utilizing large language models for EFL essay grading: an examination of reliability and validity in rubric-based assessments
dc.type: Article
Files
Original bundle
Name: Fatih Yavuz - Utilizing large language models for EFL essay grading An examination of.pdf
Size: 1.76 MB
Format: Adobe Portable Document Format