
- Title: Evaluating the Quality of Higher Education Instructor-Constructed Multiple-Choice Tests: Impact on Student Grades
- Authors: Gavin T. L. Brown, Hasan H. A. Abdulnabi
- Access the original paper here
Paper summary
This 2017 research article examines the quality of instructor-created multiple-choice questions (MCQs) in higher education. The authors assessed 100 MCQs from a midterm and final exam using Classical Test Theory (CTT) and Item Response Theory (IRT) models, finding significant issues with item quality. The study’s analysis revealed that using IRT, specifically the two-parameter logistic model, provided a more accurate assessment of student ability and impacted grading decisions, particularly pass/fail outcomes. The researchers recommend using statistical item analysis to improve MCQ quality and enhance the validity of assessment scores. Finally, the authors propose the development of automated systems to assist instructors with this process.
What are the key implications for teachers in the classroom?
The sources suggest several key implications for teachers in the classroom regarding the use of multiple-choice questions (MCQs) in assessments:
- MCQ quality matters: The quality of instructor-written MCQs can be problematic and can lead to misleading conclusions about student achievement. Poorly constructed items, rather than poor teaching or learning, can result in unsound decisions about student ability and inaccurate feedback to both students and instructors.
- Statistical analysis is crucial: Statistical item analysis should be used to evaluate the characteristics of MCQ items. This involves determining item difficulty, discrimination, and the effectiveness of distractors, using methods from classical test theory (CTT) and item response theory (IRT); a minimal CTT sketch appears after this list.
- Item analysis helps identify and remove flawed items: Items that are too easy or too difficult, easy to guess, or that do not positively discriminate between high- and low-performing students should be removed. This is especially important because statistical analysis can surface problems that are not apparent from simply reading the item.
- IRT models can provide more accurate scoring: IRT models, especially the two-parameter logistic (2PL) model, can adjust scores based on the relative difficulty of items, potentially giving a more accurate measure of student ability than raw scores. In this study, the 2PL model fit the data better than either the one-parameter logistic (1PL, or Rasch) model or the three-parameter logistic (3PL) model; a small 2PL scoring sketch appears after the summary below.
- The Rasch model may not be appropriate for MCQs: The strict assumptions of the Rasch model may not be realistic for MCQs because it requires all items to have equal discrimination and zero guessing.
- Test reliability and validity should be considered: Teachers should be aware of the reliability of their tests, and the standard error of measurement, which indicates how far an observed score is likely to fall from a student's true score because of measurement error, should be considered when interpreting scores. The validity of the inferences made from the test results must also be considered.
- Automated analysis systems can be useful: Given that most higher education teachers lack training in psychometrics, automated item analysis systems would be beneficial, and such systems should allow the teacher to make decisions about which items to remove. Such a system could also help establish grade boundaries.
- Item quality affects student grades: The inclusion of poor quality items can have an impact on students’ overall course grades and on pass/fail decisions. Using statistical models to remove bad items can result in a different grade distribution, potentially benefitting some students and altering the number of students who pass or fail.
- Item writing guidelines should be followed: Teachers should follow established best-practice guidelines for writing MCQs. These include guidelines for content, style, format, writing the stem, and writing options. Training in item writing can improve the quality of MCQs.
- MCQs may not assess higher order thinking: There is a concern that MCQs focus too much on recall of knowledge rather than higher-order thinking. Poor item writing can contribute to this.
- Feedback to students and instructors can be improved: Grades derived from item difficulty have the potential to provide more informative feedback to both students and instructors. Item analysis can also help instructors improve their teaching by identifying areas where students are struggling.
- Transparency in test creation is important: However, transparency practices often make it impossible to recycle items across administrations, which limits the opportunities to collect response data on the same items over time.
- Small sample sizes are a challenge: IRT models can be difficult to estimate reliably in classes with fewer than about 500 students, a situation that is common in higher education. This needs to be taken into account when interpreting results.
- A balance between content coverage and item quality is necessary: While using fewer items might lead to more credible scores, it may sacrifice content coverage. On the other hand, retaining poor quality items can lead to misleading information.
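To make the item-analysis step concrete, here is a minimal sketch in Python of the CTT statistics discussed above: item difficulty as the proportion correct, corrected point-biserial discrimination, Cronbach's alpha, and the standard error of measurement. The function name, flagging thresholds, and toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ctt_item_analysis(responses, easy=0.90, hard=0.20, min_disc=0.20):
    """Classical test theory statistics for a 0/1 scored response matrix
    (rows = students, columns = items). Thresholds are illustrative only."""
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    total = responses.sum(axis=1)

    # Item difficulty: the proportion of students answering each item correctly.
    difficulty = responses.mean(axis=0)

    # Corrected point-biserial discrimination: correlate each item with the
    # total score excluding that item, so the item cannot inflate itself.
    discrimination = np.array([
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
        for j in range(n_items)
    ])

    # Cronbach's alpha (internal consistency) and the standard error of
    # measurement (SEM): the expected spread of observed scores around true scores.
    item_var = responses.var(axis=0, ddof=1)
    total_var = total.var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1.0 - item_var.sum() / total_var)
    sem = np.sqrt(total_var) * np.sqrt(1.0 - alpha)

    flagged = [
        j for j in range(n_items)
        if difficulty[j] > easy or difficulty[j] < hard or discrimination[j] < min_disc
    ]
    return {"difficulty": difficulty, "discrimination": discrimination,
            "alpha": alpha, "sem": sem, "flagged_items": flagged}

# Toy data: six students by four items, scored 1 (correct) or 0 (incorrect).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
])
print(ctt_item_analysis(scores))
```

In practice the thresholds should be set by the instructor, and flagged items reviewed rather than deleted automatically, in line with the recommendation that the teacher retain the final decision about which items to remove.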
In short, teachers should recognize that simply writing MCQs is not enough; they must engage in a process of quality assurance to ensure that the assessments they use are valid, reliable, and fair to students.
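For readers unfamiliar with how a 2PL model turns a response pattern into an ability estimate, the following is a minimal sketch assuming the item parameters have already been estimated; the parameter values below are invented for illustration and are not taken from the study.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response: P = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_ability(responses, a, b, grid=np.linspace(-4, 4, 161)):
    """Expected a posteriori (EAP) ability estimate under a standard normal prior.
    `responses` is a 0/1 vector; `a` and `b` are item discrimination and difficulty."""
    prior = np.exp(-0.5 * grid ** 2)          # N(0, 1) density up to a constant
    likelihood = np.ones_like(grid)
    for x, aj, bj in zip(responses, a, b):
        pj = p_correct(grid, aj, bj)
        likelihood *= pj if x == 1 else (1.0 - pj)
    posterior = likelihood * prior
    posterior /= posterior.sum()
    return float((grid * posterior).sum())

# Invented 2PL item parameters for five items (a = discrimination, b = difficulty).
a = np.array([1.5, 0.8, 2.0, 0.5, 1.2])
b = np.array([-1.0, 0.0, 0.5, 1.5, -0.5])

# Two response patterns with the same raw score (3/5) but different items correct.
print(eap_ability([1, 1, 1, 0, 0], a, b))
print(eap_ability([0, 0, 1, 1, 1], a, b))
```

Because the 2PL weights each response by the item's discrimination and difficulty, two students with the same raw score can receive different ability estimates, which is why rescoring with the 2PL can shift grade boundaries and pass/fail decisions.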
Quote
The overwhelming conclusion is that item statistical analysis is a necessary adjunct to judgment-based evaluation of item quality in MCQ testing in higher education.