How GELPS Helps Universities Diversify Enrollment

August 18, 2024 · GELPS Blog

Fairness in assessment has multiple dimensions, including equitable access to testing opportunities and equivalent measurement properties across population groups. This post examines the psychometric literature on differential item functioning (DIF) and fairness analysis, which provides the methodological framework for ensuring that GELPS supports valid score interpretations for all test-takers regardless of demographic, linguistic, or cultural background. This work represents a significant investment in measurement quality and in transparent reporting for the global language assessment community. The procedures described here evolve over time as validity evidence accumulates and the broader measurement community provides feedback, and they combine established psychometric theory with newer technological approaches.

Differential Item Functioning: Theory and Detection Methods

Differential item functioning occurs when test-takers from different groups who have the same underlying ability have different probabilities of answering an item correctly. DIF threatens test fairness because it introduces systematic error that can advantage or disadvantage particular groups. Detecting DIF is a central part of test validation and relies on statistical methods that control for overall ability differences between groups. Rigorous psychometric analysis and continuing validation efforts help items maintain their measurement properties across diverse populations. Test-takers and score users alike benefit from these methodological standards, which prioritize both measurement accuracy and fairness across diverse linguistic and cultural populations.
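Stated formally, and using notation that is assumed here rather than taken from GELPS documentation (X_i is the scored response to item i, θ is ability, G is group membership), an item is free of DIF when the following holds:

```latex
P(X_i = 1 \mid \theta, G = \text{reference}) = P(X_i = 1 \mid \theta, G = \text{focal}) \quad \text{for all } \theta
```

Item i shows DIF if the two probabilities differ at some ability level θ.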

Several approaches to DIF detection are well established in the psychometric literature and are employed in GELPS’s fairness analyses. The Mantel-Haenszel procedure compares item-level performance between reference and focal groups while matching on total score. Logistic regression approaches model the item response as a function of ability, group membership, and their interaction, providing a flexible framework for detecting both uniform and nonuniform DIF. IRT-based methods compare item characteristic curves across groups using likelihood ratio tests or Wald statistics. Using several complementary methods reflects our commitment to evidence-centered design, in which every assessment component is grounded in a clear chain of reasoning linking observable behaviors to the underlying constructs of interest.
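As a rough illustration of the Mantel-Haenszel idea, the sketch below computes the common odds ratio across matching-score strata and converts it to the delta metric often reported as MH D-DIF. It is a minimal sketch, not GELPS's operational code: the function name, the input layout, and the "ref"/"focal" labels are assumptions made for this example, and a real analysis would also compute a significance test.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(responses):
    """Return (common odds ratio, MH D-DIF) for one studied item.

    responses: iterable of (matching_score, group, correct), where group is
    "ref" or "focal" and correct is 0 or 1. All names are illustrative.
    """
    # One 2x2 table per matching-score stratum:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for score, group, correct in responses:
        column = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[score][column] += 1

    numerator = denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if numerator == 0 or denominator == 0:
        raise ValueError("odds ratio undefined: too few mixed-outcome strata")

    alpha = numerator / denominator
    # Delta metric: negative values mean the item is harder for the focal
    # group than for matched reference-group test-takers.
    return alpha, -2.35 * math.log(alpha)
```

An odds ratio near 1 (delta near 0) suggests that matched test-takers in the two groups perform similarly on the item.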

DIF Analysis in GELPS Operations

GELPS conducts systematic DIF analyses for all operational items, comparing performance across groups defined by native language, gender, geographic region, and test delivery mode. Items identified as exhibiting moderate or large DIF are flagged for review by content experts, who evaluate whether the DIF reflects construct-relevant differences or construct-irrelevant bias. Items with large DIF that cannot be justified on substantive grounds are removed from the operational item pool. Careful attention to these measurement principles helps the assessment yield scores that are both reliable and valid for their intended interpretive purposes, supporting appropriate score-based decisions for all test-takers regardless of their background characteristics.
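The flag-and-review workflow described above could be sketched as follows. The post does not state which cut-offs GELPS uses, so this example borrows the A/B/C categories widely described for the MH D-DIF statistic (negligible, moderate, large); the thresholds, function names, and action strings are assumptions for illustration only.

```python
def classify_dif(mh_d_dif, differs_from_zero, exceeds_one):
    """Assign an A/B/C category to one item (illustrative rule).

    mh_d_dif: MH D-DIF value on the delta scale.
    differs_from_zero: True if |MH D-DIF| is significantly greater than 0.
    exceeds_one: True if |MH D-DIF| is significantly greater than 1.0.
    """
    size = abs(mh_d_dif)
    if size < 1.0 or not differs_from_zero:
        return "A"  # negligible DIF
    if size >= 1.5 and exceeds_one:
        return "C"  # large DIF
    return "B"      # moderate DIF


def review_action(category):
    """Map a category to the handling described in the text above."""
    actions = {
        "A": "retain",
        "B": "send to content review",
        "C": "send to content review; remove if not justified",
    }
    return actions[category]
```

Keeping the statistical flag separate from the review decision mirrors the text: a flag only triggers expert review, and removal depends on whether the DIF can be justified on substantive grounds.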

Fairness in Automated Scoring Systems

Automated scoring of speaking and writing responses introduces additional fairness considerations. Research has shown that automated scoring models can inadvertently learn to associate particular linguistic features with higher scores in ways that disadvantage certain groups of test-takers. GELPS conducts ongoing fairness audits of its automated scoring models, examining score distributions, differential prediction, and model performance across demographic groups. We refine these procedures as empirical evidence accumulates, incorporating peer-reviewed advances in psychometrics, computational linguistics, and educational measurement into our operational procedures.
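One simple check on score distributions is the standardized mean difference between machine and human scores within each demographic group. The sketch below is an assumption-laden illustration, not the GELPS audit: the record layout, function names, and the 0.10 flagging threshold are all chosen for this example.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def machine_human_smd(records):
    """Standardized mean difference (machine minus human) per group.

    records: iterable of (group, human_score, machine_score); the layout
    is hypothetical.
    """
    groups = defaultdict(lambda: ([], []))
    for group, human, machine in records:
        groups[group][0].append(human)
        groups[group][1].append(machine)

    result = {}
    for group, (human, machine) in groups.items():
        if len(human) < 2:
            continue  # stdev needs at least two responses
        pooled_sd = math.sqrt((stdev(human) ** 2 + stdev(machine) ** 2) / 2)
        if pooled_sd > 0:
            result[group] = (mean(machine) - mean(human)) / pooled_sd
    return result


def flag_groups(smd, threshold=0.10):
    """Groups whose absolute SMD exceeds an illustrative threshold."""
    return sorted(g for g, value in smd.items() if abs(value) > threshold)
```

A positive value means the machine scores a group higher on average than human raters do; a flagged group is a prompt for closer investigation, not proof of bias.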

Population Invariance of Score Interpretations

Beyond DIF analysis at the item level, fairness also requires that score interpretations are population-invariant, meaning that a given score has the same meaning regardless of the test-taker’s background characteristics. Research on population invariance examines whether the relationships between test scores and external criteria such as academic performance are consistent across groups. GELPS’s validation research includes studies of predictive invariance across demographic groups.

The Broader Context of Fairness in Assessment

Fairness in assessment is not limited to statistical properties of items and scores but also encompasses broader questions of access, opportunity, and consequential validity. The fairness framework articulated in the Standards for Educational and Psychological Testing emphasizes that fairness requires attention to equitable treatment of all test-takers, absence of measurement bias, and equitable access to the opportunity to demonstrate proficiency. GELPS’s approach to fairness is grounded in this comprehensive framework.