Comparison of the Psychometric Properties of Essay and Multiple-Choice Questions in Math and Science for Sixth-Grade Students Based on Classical Test Theory and Item Response Theory

Afzali, Afshin; Yaghoobi, Abolghasem; Pilehvarpour, Mohamad Aref; Azizi, Kazhal; Qanei, Darya

doi:10.22034/trj.2025.142154.2067

Authors

Department of Psychology, Faculty of Economic and Social Sciences, Bu-Ali Sina University

Document Type: Research Paper

Abstract

Introduction: This study analyzes the psychometric properties of essay and multiple-choice questions in math and science for sixth-grade students using Classical Test Theory (CTT) and Item Response Theory (IRT), the two prominent theories for analyzing test questions. In CTT the unit of analysis is the entire test, whereas in IRT it is the individual item. CTT has been a foundational measurement theory for several decades and has long served as a model for assessing the reliability and validity of measurement tools. It is defined as a simple linear model in which the observed score on a test is the sum of the true score and measurement error, so the model has three components: the observed (test) score, the true score, and the error score. This relationship among the three components gives CTT its ability to explain the factors that affect test scores. CTT rests on three assumptions: first, error scores are uncorrelated with true scores; second, errors have a mean of zero; and third, error scores on parallel tests are uncorrelated with each other. Item analysis in CTT focuses on two main aspects: item difficulty and item discrimination. Item difficulty refers to the proportion of individuals who answer the question correctly, and the primary measure of it is the difficulty index; the more difficult the question, the lower the percentage of individuals answering correctly. Item discrimination refers to the ability of an item to differentiate between "high-performing" and "low-performing" individuals. IRT, in contrast, assumes that performance on an item can be predicted from one or more latent abilities of the participant, denoted by θ (theta).
In IRT, three important parameters are defined for each item: the discrimination parameter (a), the difficulty parameter (b), and the guessing parameter (c).
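The two CTT item statistics described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: it assumes dichotomously scored (0/1) items, takes the difficulty index as the proportion correct, and uses the upper-lower group discrimination index with the conventional 27% tails. The function names and the response data are hypothetical.

```python
from statistics import mean


def item_difficulty(item_scores):
    """Difficulty index: proportion of examinees who answered the item correctly."""
    return mean(item_scores)


def upper_lower_discrimination(item_scores, total_scores, fraction=0.27):
    """Discrimination index D = p_upper - p_lower.

    Examinees are ranked by total test score; the item's proportion correct
    in the bottom `fraction` is subtracted from that in the top `fraction`.
    The 27% default is a common convention and an assumption here.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])  # ascending by total
    lower, upper = order[:k], order[-k:]
    p_upper = mean(item_scores[i] for i in upper)
    p_lower = mean(item_scores[i] for i in lower)
    return p_upper - p_lower


# Hypothetical data: one item's 0/1 scores and total scores for ten examinees.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
totals = [18, 17, 15, 14, 12, 10, 9, 7, 5, 3]

difficulty = item_difficulty(item)
discrimination = upper_lower_discrimination(item, totals)
```

With these made-up data, half of the examinees answer correctly, and the item separates the top three scorers from the bottom three perfectly, so it would count as moderately difficult and highly discriminating.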
Method: Given the objective of evaluating the performance of multiple-choice and essay questions in science and math, a survey approach was used with methods based on CTT and IRT. The study population included all sixth-grade students in Hamadan during the 2023–2024 academic year. A sample of 388 students, with the sample size determined using the Morgan table, was selected by random cluster sampling from six schools (three girls' schools and three boys' schools). Data were collected with two teacher-made tests, one for science and one for math, each containing both multiple-choice and essay questions. To assess validity, each test was reviewed by four teachers (each with at least six years of teaching experience) and then piloted. After the experts' feedback was incorporated, the tests were finalized and used for data collection. Essay questions were scored with a graded (partial-credit) method (Seif, 2016). The e-IRT software was used for the analysis: parameters for multiple-choice questions were estimated with the three-parameter model, while essay questions were analyzed with the Graded Response Model.
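The two IRT models named in the Method can be written down compactly. The sketch below assumes the standard logistic forms: the three-parameter model with discrimination a, difficulty b, and guessing c, and Samejima's Graded Response Model with one discrimination and ordered step thresholds per item. It omits the optional 1.7 scaling constant, and the parameterization used by the e-IRT software may differ. All function names and parameter values are hypothetical.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def p_3pl(theta, a, b, c):
    """Three-parameter model: probability of a correct answer at ability theta."""
    return c + (1.0 - c) * logistic(a * (theta - b))


def grm_category_probs(theta, a, thresholds):
    """Graded Response Model: probability of each score category.

    P(score >= k) = logistic(a * (theta - b_k)) for each threshold b_k;
    category probabilities are differences of adjacent cumulative curves.
    `thresholds` is assumed to be sorted in increasing order.
    """
    cumulative = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical multiple-choice item: a = 1.0, b = 0.0, c = 0.25.
p_correct = p_3pl(theta=0.0, a=1.0, b=0.0, c=0.25)  # 0.625

# Hypothetical essay item scored 0-3 (three thresholds, four categories).
probs = grm_category_probs(theta=0.0, a=1.4, thresholds=[-1.0, 0.0, 1.0])
```

In the Graded Response Model the thresholds must increase from one step to the next for every category probability to be non-negative.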
Results: Essay questions performed better than multiple-choice questions in both science and math. For essay questions, the average discrimination index was 0.208 in science and 0.55 in math, and the average difficulty index was 2.591 in science and 2.342 in math, reflecting better discrimination and difficulty for essay questions. The IRT analysis of essay questions showed that all four math questions had a discrimination parameter above 1.35, while the discrimination values for the five science questions were 0.087, 1.090, 0.844, 1.419, and 0.533, respectively. Furthermore, the threshold parameter showed positive changes at each step in both science and math, indicating better discrimination and threshold functioning of essay questions.
The better performance of essay questions compared with multiple-choice questions can be attributed to several factors. Essay questions allow students to explore topics in depth and demonstrate critical and analytical thinking. They are particularly suitable for assessing higher-order skills such as analysis, evaluation, and synthesis of information (Anderson & Krathwohl, 2001). Unlike multiple-choice questions, which restrict students to selecting one option, essay questions let them express their ideas in detail and creatively (Biggs & Tang, 2011). They also allow assessment of complex or multifaceted topics, because students can tailor their responses to their prior knowledge and experience (Moon, 2007).
Despite these advantages, essay questions are used less frequently, for several reasons. Scoring them is time-consuming and subject to human error, which leads to variability between scorers (Brown et al., 2013). They may also show lower reliability because responses can be influenced by writing ability, fatigue, or time constraints (Gipps, 1994). In addition, essay questions usually cover fewer content areas and may not provide a comprehensive assessment of the entire curriculum (Race, 2014).
This study has limitations that should be considered when interpreting the findings. The types of questions used may not have fully reflected all aspects of students' abilities. Although CTT and IRT provided valuable psychometric information, they may not have captured every aspect of item characteristics. The study focused only on math and science, so generalization to other subjects should be made cautiously. Moreover, factors such as testing conditions and student stress were not fully controlled and may have influenced the results.

Main Subjects

Education and teaching

References

Adegoke, B.A. (2013). Comparison of Item Statistics of Physics Achievement Test using Classical Test and Item Response Theory Frameworks. Journal of Education and Practice, 4, 87-96.

Bichi, A. A., & Talib, R. (2018). Item Response Theory: An Introduction to Latent Trait Models to Test and Item Development. International Journal of Evaluation and Research in Education (IJERE), 7(2), 142-151.

   Adu-Mensah, J., & Adom, D. (2020). Test, measurement, and evaluation: Understanding and use of the concepts in education. International Journal of Evaluation and Research in Education.

Ahmadi, H., Shirbagi, N., & Shirbagi, S. (2023). Teachers’ Understanding and Use of Authentic Assessment in the Teaching-Learning Process. Journal of Research in Teaching, 11(4), 170-197. [In Persian]

Alimirzaei, M., Moghadamzadeh, A., Minaei, A., Eizanlou, B., & Salehi, K. (2019). Sources of the Differential Item Functioning and its Application in Education. Journal of Research in Teaching, 7(1), 133-153. [In Persian]

   Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives.

Ayanwale, M. A., Adeleke, J. O., & Mamadelo, T. I. (2019). Invariance Person Estimate of Basic Education Certificate Examination: Classical Test Theory and Item Response Theory Scoring Perspective. Journal of the International Society for Teacher Education, 23(1), 18–26.

Ayenew Takele Alemu, Hiwot Tesfa, Addisu Mulugeta, Enyew Tale Fenta, & Mahider Awoke Belay. (2024). Quality of multiple choice question items: Item analysis. International Journal of Scientific Reports, 10(6), 195-199.

Oladele, B. K., & Adegoke, B. A. (2020). Using Test Theories Models to Assess Senior Secondary Students Ability in Constructed-Response Mathematics Tests. Journal of Education and Practice, 11(7).

Baker, F. (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD.

   Bichi, A. A. (2016). Classical test theory: An introduction to linear modelling approach to test and item analysis. International Journal for Social Studies, 2(9), 27-33

Biggs, J., & Tang, C. (2011). Teaching for quality learning at university.

Brown, G., Bull, J., & Pendlebury, M. (2013). Assessing student learning in higher education. Routledge.

Butakor, P. K. (2022). Using Classical Test and Item Response Theories to Evaluate Psychometric Quality of Teacher-Made Test in Ghana. European Scientific Journal, ESJ, 18(1), 139.

Alonso-Fernandez, C., Martinez-Ortiz, I., Caballero, R., Freire, M., & Fernandez-Manjon, B. (2020). Predicting students’ knowledge after playing a serious game based on learning analytics data: A case study. Journal of Computer Assisted Learning, 36(3), 350–358.

Cai, L., Choi, K., Hansen, M., & Harrell, L. (2016). Item Response Theory. In Annual Review of Statistics and Its Application (Vol. 3, pp. 297–321). Annual Reviews Inc.

Chukwu Ohiri, S. (2023). Psychometric Analysis at Item Level of the WAEC May/June Mathematics Multiple Choice Questions Using the Classical Test Theory. International Journal of Research Publication and Reviews, 4(11), 132-138.

   Clarke, M. (2011). Framework for building an effective student assessment system: READ/SABER Working Paper. World Bank.

   Cobbinah, A. & Ntumi, S. (2022). Difficulty, discrimination and pseudo-guessing indices of the West African Examinations Council core mathematics multiple choice items: Practical implications of using item response theory. Journal Research in Education Sciences, 13(5), 51-60

   Crocker, L., & Algina, J. (2006). Introduction to Classical and Modern Test Theory. Cengage Learning.

Ebrahimi Manesh, M. R., Daneshpoor, A., Hasani Panah, T., & Haji Ramazani, E. (2024). Examination of student evaluation methods throughout the academic year. Journal of Psychology and Educational Sciences, 5(53), 627-636. [In Persian]

   Embretson, S. E., & Reise, S. P. (2013). Item Response Theory for Psychologists. Lawrence Erlbaum Associates, Inc., Mahwah. 1–371

Embretson, S. E., & Reise, S. P. (2013). Item response theory. Psychology Press.

Ganglmair, A., & Lawson, R. (2010). Advantages of Rasch modelling for the development of a scale to measure affective response to consumption. In European Advances in Consumer Research, 6, 162–167.

Gipps, C. V. (1994). Beyond testing: Towards a theory of educational assessment.

Habibi, M., Khodaei, E., & Ezzanlou, B. (2012). Old and new measurement theories in behavioral and medical sciences: A review of methodology, advantages, and challenges. Quarterly Journal of Behavioral Sciences Research, 10(4), 302-315. [In Persian]

   Hambleton, R.K. and Swaminathan, H. (1985). Item response theory: principles and applications. p.332.

Hambleton, R.K., Swaminathan, H. and Rogers, H.J. (1991). Fundamentals of Item Response Theory. Sage Publications, Newbury Park, CA.

Azevedo, J. M., Oliveira, E. P., & Beites, P. D. (2019). Using Learning Analytics to evaluate the quality of multiple-choice questions: A perspective with Classical Test Theory and Item Response Theory. The International Journal of Information and Learning Technology, 36(4), 322-341.

   Kusumawati, M., & Hadi, S. (2018). An analysis of multiple choice questions (MCQs): Item and test statistics from mathematics assessments in senior high school. Research and Evaluation in Education.

   Lang, J. W. B., & Tay, L. (2021). The Science and Practice of Item Response Theory in Organizations. In Annual Review of Organizational Psychology and Organizational Behavior, 8, 311–338.

   Maba, W., Perdata, I. B. K., & Astawa, I. N. (2017). Constructing assessment instrument models for teacher’s performance, welfare and education quality. International Journal of Social Sciences and Humanities, 1(3), 88–96.

   McAlpine, M. (2002), “Design requirements of a databank”, The CAA Centre TLTP Project, Leicestershire

Mehta, G., & Mokhasi, V. (2014). Item analysis of multiple choice questions: An assessment of the assessment tool. Int J Health Sci Res, 4(7), 197-202.

Erguven, M. (2013). Two approaches to psychometric process: Classical test theory and item response theory. Journal of Education, 2(2).

Moon, J. (2007). Critical Thinking: An Exploration of Theory and Practice (1st ed.). Routledge.

   Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2020). TIMSS 2019 International Results in Mathematics and Science. TIMSS & PIRLS International Study Center, Boston College.

Ayanwale, M. A., Chere-Masopha, J., & Morena, M. C. (2022). The Classical Test or Item Response Measurement Theory: The Status of the Framework at the Examination Council of Lesotho. International Journal of Learning, Teaching and Educational Research, 21(8), 384-406.

Quansah, F., Amoako, I., & Ankomah, F. (2019). Teachers’ test construction skills in Senior High Schools in Ghana: Document Analysis. International Journal of Assessment Tools in Education, 6(1), 1-8.

Race, P. (2014). The lecturer's toolkit: A practical guide to assessment, learning, and teaching

    Ravela, P., Arregui, P., Valverde, G., Wolfe, R., Ferrer, G., Rizo, F. M., Aylwin, M., & Wolff, L. (2009). The Educational Assessments that Latin America Needs. Washington, DC: PREAL.

Rioborue Alexander Oghenerume, & Friday Egberha. (2024). Comparative Analysis of Item Statistics of WASSCE and NECO SSCE 2023 Data Processing Multiple Choice Tests Using Item Response Theory. International Journal of Educational Researchers, 15(1), 58-67.

Samadieh, H., Tanhayi Roshwanlou, F., Saeedi Rezvani, T., & Talebzadeh Shushtari, L. (2019). Psychometric properties of the Unidimensional Relationship Closeness Scale based on classical test theory and item response theory. Educational Measurement Quarterly. [In Persian]

    Santoso, A., Pardede, T., Djidu, H., Apino, E., Rafi, I., Rosyada, M. N., & Abd Hamid, H. S. (2022). The effect of scoring correction and model fit on the estimation of ability parameter and person fit on polytomous item response theory. Research and Evaluation in Education, 8(2), 140–151.

Seif, A. A. (2016). Educational Measurement, Assessment, and Evaluation (7th ed., 8th print). Doran Publishing. [In Persian]

Sharifi, H. P., & Sharifi, N. (2013). Principles of psychometrics and psychological testing. Tehran: Roshd Publications. [In Persian]

Volume 13, Issue 4 - Serial Number 43
December 2025
Pages 68-93