Deep Learning in Automated Essay Scoring for Islamic Education: A Systematic Review

Authors

  • Rokhmatul Khoiro Amin Putri, Universitas Islam Negeri (UIN) Sunan Ampel Surabaya
  • Kusaeri, Universitas Islam Negeri (UIN) Sunan Ampel Surabaya
  • Suparto, Universitas Islam Negeri (UIN) Sunan Ampel Surabaya

DOI:

https://doi.org/10.58524/oler.v5i2.753

Keywords:

Assessment, Automated Essay Scoring, Deep Learning, Islamic Education, Systematic Review

Abstract

Automated Essay Scoring (AES) is a computer-based scoring system that combines artificial intelligence and natural language processing (NLP) to assess student essays and generate feedback automatically, offering evaluators greater convenience and efficiency. This study analyzes which algorithmic models are most effective in terms of the accuracy and reliability of AES systems, particularly in the context of Islamic religious education assessment, and examines their advantages and disadvantages in supporting objective and efficient learning evaluation. The study uses the Systematic Literature Review (SLR) approach, following the PRISMA protocol. A total of 31 relevant articles published between 2020 and 2025, drawn from the Scopus and Springer databases, were analyzed to evaluate the use and effectiveness of algorithms in the development of AES systems. The results show that transformer-based models, specifically BERT, are the most effective algorithms in current AES implementations. BERT excels because of its ability to capture bidirectional context and semantic depth in text; BERT-based models produce accurate scores and can generate automated feedback that approaches the quality of human judgment. Although BERT requires large amounts of training data and substantial computing resources, its application in Islamic education highlights the potential of AES to support more objective, consistent, and scalable assessment of students’ essays.
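
To make the central finding concrete, the sketch below shows one common way a BERT model can be fine-tuned as an essay-score regressor using the Hugging Face transformers library. This is a minimal illustration under stated assumptions, not the implementation of any study in the review: the checkpoint name (bert-base-uncased), the sample essays, and the human scores are placeholders.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Placeholder data: two hypothetical essays with human-assigned scores.
    essays = ["First sample essay text ...", "Second sample essay text ..."]
    scores = torch.tensor([3.0, 4.5])

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # num_labels=1 with problem_type="regression" yields a single-output head
    # trained with mean-squared-error loss against the human scores.
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=1, problem_type="regression"
    )

    batch = tokenizer(essays, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    outputs = model(**batch, labels=scores)
    outputs.loss.backward()                 # one gradient step of fine-tuning
    predicted = outputs.logits.squeeze(-1)  # the model's score estimates

In evaluations of this kind, agreement with human raters is typically quantified with quadratic weighted kappa between rounded model scores and rater scores (for example, via sklearn.metrics.cohen_kappa_score with weights="quadratic").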



Published

2025-12-15

How to Cite

Deep Learning in Automated Essay Scoring for Islamic Education: A Systematic Review. (2025). Online Learning In Educational Research (OLER), 5(2), 319–338. https://doi.org/10.58524/oler.v5i2.753