The Science Behind GELPS: Adaptive Testing

August 31, 2024 · GELPS Blog

Computer-adaptive testing (CAT) represents one of the most significant methodological advances in educational measurement over the past half century. By tailoring the selection of items to each test-taker’s estimated ability level, CAT achieves greater measurement precision and efficiency than traditional fixed-form tests. This post provides a detailed technical overview of the psychometric foundations and algorithmic implementation of CAT in the GELPS assessment system. Careful attention to these measurement principles ensures that the assessment yields scores that are both reliable and valid for their intended interpretive purposes, supporting appropriate score-based decisions for all test-takers regardless of their background characteristics. The adaptive design reflects our commitment to evidence-centered design principles, ensuring that every assessment component is grounded in a clear chain of reasoning linking observable behaviors to underlying constructs of interest. It also represents a significant methodological investment in measurement quality and reflects our dedication to serving the global language assessment community with scientifically defensible tools and transparent reporting practices.

Item Response Theory: The Mathematical Foundation

CAT is built on Item Response Theory (IRT), a family of statistical models that describe the relationship between a test-taker’s latent ability and their probability of responding correctly to a test item. The three-parameter logistic (3PL) model, which is the most commonly used model for selected-response items, specifies the probability of a correct response as a function of three item parameters: difficulty, discrimination, and a lower asymptote that accounts for guessing. Ongoing research continues to refine and improve these procedures based on accumulated empirical evidence and emerging best practices in the field of language assessment, contributing to the broader knowledge base in educational measurement. We regularly update our methodology based on the latest research findings in psychometrics, computational linguistics, and educational measurement, incorporating peer-reviewed advances into our operational procedures.

The item difficulty parameter (b) indicates the ability level at which a test-taker has a 50% probability of answering correctly when guessing is not a factor. The item discrimination parameter (a) reflects how sharply the probability of correct response changes as ability increases, with higher values indicating greater sensitivity to ability differences. The pseudo-guessing parameter (c) represents the probability that a low-ability test-taker will answer correctly by chance, typically approaching the reciprocal of the number of response options.
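For readers who prefer to see the model written out, the standard textbook form of the 3PL is given below. Here θ denotes the test-taker's latent ability and the subscript i indexes the item; both are notation we introduce for illustration. Some parameterizations multiply the exponent by a scaling constant D (about 1.7); that constant is omitted here, and this post does not state which convention GELPS uses operationally.

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

At θ = b_i the expression evaluates to (1 + c_i)/2, which reduces to the 50% figure only when c_i = 0.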

Ability Estimation in CAT

In CAT, ability is estimated using maximum likelihood or Bayesian methods that incorporate information from the response pattern across all administered items. The most common Bayesian estimator is the Expected A Posteriori (EAP) estimate, which combines the likelihood function from the observed responses with a prior distribution representing the population ability distribution. The standard error of the ability estimate decreases as more information is accumulated, providing a direct measure of measurement precision. Test-takers and score users alike benefit from these rigorous methodological standards, which prioritize both measurement accuracy and fairness across diverse linguistic and cultural populations.
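To make the EAP idea concrete, here is a minimal sketch that approximates the posterior on an evenly spaced grid. It assumes a standard normal prior, a grid from -4 to 4, and the textbook 3PL response function; the function names and defaults are ours for illustration and are not taken from the GELPS implementation.

```python
import math


def p_3pl(theta, a, b, c):
    """Textbook 3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def eap_estimate(items, responses, n_points=81, lo=-4.0, hi=4.0):
    """Return (EAP estimate, posterior SD) under an assumed N(0, 1) prior.

    items: list of (a, b, c) tuples; responses: list of 0/1 scores.
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for theta in grid:
        # Unnormalized prior density at this grid point.
        w = math.exp(-0.5 * theta * theta)
        # Multiply in the likelihood of each observed response.
        for (a, b, c), u in zip(items, responses):
            p = p_3pl(theta, a, b, c)
            w *= p if u == 1 else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)
```

The posterior standard deviation returned alongside the estimate plays the role of the standard error: it shrinks as responses accumulate, which is what a precision-based stopping rule monitors.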

The Adaptive Item Selection Algorithm

The item selection algorithm is the core of any CAT system, determining which item to administer next based on the current ability estimate. The most common selection criterion is maximum information, which selects the item with the highest information function value at the current ability estimate. The item information function quantifies how much an item contributes to reducing uncertainty about the ability estimate. Under the 3PL model, an item's information peaks near its difficulty parameter (slightly above b when the guessing parameter c is greater than zero), and the peak is higher for items with larger discrimination.
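A maximum-information selector can be sketched in a few lines. The information formula below is the standard one for the 3PL without a scaling constant; the pool layout (a dictionary mapping a hypothetical item id to its a, b, c parameters) and the function names are assumptions made for this example.

```python
import math


def p_3pl(theta, a, b, c):
    """Textbook 3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_max_info(theta_hat, pool, administered):
    """Return the id of the unused item with the most information at theta_hat.

    pool: dict mapping item id -> (a, b, c); administered: set of used ids.
    """
    candidates = [item_id for item_id in pool if item_id not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))
```

With three otherwise identical items at difficulties -2, 0, and 2, a test-taker currently estimated at 0 would receive the middle item first, since the other two are either too easy or too hard to be informative at that ability level.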

GELPS’s adaptive algorithm incorporates additional constraints beyond maximization of psychometric information to ensure adequate content coverage across the language domains assessed. Content balancing constraints specify the proportion of items that must be drawn from each content area, and exposure control mechanisms limit the frequency with which any single item appears in operational tests to enhance test security and item pool longevity.
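This post does not specify which content-balancing or exposure-control methods GELPS uses, so the sketch below swaps in two simple, widely described techniques purely for illustration: picking the content area that lags its target proportion the most, and a "randomesque" rule that draws at random from the k most informative eligible items. The pool layout, area names, and the value of k are all hypothetical.

```python
import math
import random


def item_information(theta, a, b, c):
    """Fisher information of a textbook 3PL item at ability theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def next_content_area(targets, counts):
    """Return the area whose administered share lags its target the most."""
    total = sum(counts.values())

    def deficit(area):
        share = counts.get(area, 0) / total if total else 0.0
        return targets[area] - share

    return max(targets, key=deficit)


def select_constrained(theta_hat, pool, administered, targets, counts,
                       k=3, rng=random):
    """Pick the next item id subject to content and exposure constraints.

    pool: dict mapping item id -> (area, a, b, c).
    targets: dict mapping area -> target proportion of the test.
    counts: dict mapping area -> number of items administered so far.
    """
    area = next_content_area(targets, counts)
    candidates = [i for i, (item_area, *_) in pool.items()
                  if item_area == area and i not in administered]
    if not candidates:
        # Area exhausted: fall back to any unused item in the pool.
        candidates = [i for i in pool if i not in administered]
    # Rank by information, then draw at random from the top k so that the
    # single most informative item is not administered to every test-taker.
    candidates.sort(key=lambda i: item_information(theta_hat, *pool[i][1:]),
                    reverse=True)
    return rng.choice(candidates[:k])
```

Setting k to 1 recovers plain maximum-information selection within the chosen content area; larger values trade a little precision per item for lower exposure of the best items.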

Termination Criteria and Score Reporting

The CAT algorithm continues administering items until predefined termination criteria are met. GELPS uses a combination of fixed test length and precision-based termination, ending the test either when a specified number of items have been administered or when the standard error of the ability estimate falls below a predetermined threshold. The final ability estimate is transformed to the GELPS score scale, which ranges from 10 to 100, using a linear transformation that preserves the rank ordering of ability estimates.
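The stopping rule and the score transformation can be sketched as follows. The maximum test length, the standard-error threshold, and the ability range mapped onto the 10 to 100 scale are placeholder values chosen for illustration; the operational GELPS constants are not given in this post. Clipping at the scale endpoints is likewise an assumption, added so the output always stays within the reported range.

```python
def should_stop(n_administered, standard_error,
                max_items=30, se_threshold=0.30):
    """Stop at the item limit or once the standard error is small enough.

    max_items and se_threshold are hypothetical values.
    """
    return n_administered >= max_items or standard_error < se_threshold


def to_scale_score(theta, theta_min=-4.0, theta_max=4.0,
                   score_min=10.0, score_max=100.0):
    """Map an ability estimate linearly onto the reporting scale.

    The theta range is an assumed anchoring; estimates outside it are
    clipped to the scale endpoints.
    """
    slope = (score_max - score_min) / (theta_max - theta_min)
    score = score_min + slope * (theta - theta_min)
    return min(score_max, max(score_min, score))
```

Because the slope is positive, a higher ability estimate never yields a lower reported score, which is the rank-preserving property described above.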