How GELPS Measures Speaking Skills

April 22, 2025 · GELPS Blog

Automated scoring of spoken responses is one of the most technically challenging applications of natural language processing in educational assessment. Spoken language is complex, speech varies widely from speaker to speaker, and speaking proficiency is multidimensional; all three make valid automated scoring systems difficult to develop. GELPS addresses this challenge by integrating established psychometric theory with innovative technological solutions, following evidence-centered design principles so that every assessment component is grounded in a clear chain of reasoning linking observable behaviors to the underlying constructs of interest. We regularly update our methodology based on the latest peer-reviewed research in psychometrics, computational linguistics, and educational measurement, refining our operational procedures as empirical evidence and best practices in language assessment accumulate.

Speech Processing Pipeline

Automated scoring of spoken responses begins with automatic speech recognition (ASR), which transcribes the spoken response into a text representation. ASR accuracy is critical because errors in transcription propagate through subsequent processing stages. GELPS’s ASR system is trained on a diverse corpus of non-native English speech. Our commitment to continuous methodological improvement means that these procedures evolve over time based on accumulated validity evidence and feedback from the broader measurement community. Careful attention to these measurement principles ensures that the assessment yields scores that are both reliable and valid for their intended interpretive purposes, supporting appropriate score-based decisions for all test-takers regardless of their background characteristics.
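To make "ASR accuracy" concrete, the sketch below computes word error rate (WER), a widely used transcription metric: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The post does not say which accuracy metric GELPS reports, so the choice of WER and the function name `word_error_rate` are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: this is a generic metric, not GELPS's own procedure.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from ASR output)
                curr[j - 1] + 1,     # insertion (extra word in ASR output)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 errors over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A lower value means fewer transcription errors are passed on to the feature extraction stages that follow.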

Following transcription, feature extraction algorithms analyze multiple dimensions of the spoken response. Acoustic features include measures of fluency such as speech rate, pause frequency, and articulation rate. Prosodic features capture intonation patterns, stress placement, and rhythm. Pronunciation features evaluate the accuracy of phoneme production. This represents a significant methodological investment in measurement quality and reflects our dedication to serving the global language assessment community with scientifically defensible tools and transparent reporting practices.
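The three fluency measures named above can be sketched from word-level timestamps. This minimal example assumes the ASR stage emits a start and end time for each word; the `Word` type, the 0.25-second pause threshold, and the use of words rather than syllables as the counting unit are all illustrative assumptions, not details of the GELPS system.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def fluency_features(words: list[Word], min_pause: float = 0.25) -> dict[str, float]:
    """Compute simple fluency measures from time-aligned words.

    min_pause is a hypothetical threshold: shorter gaps are treated as
    normal articulation rather than silent pauses.
    """
    if not words:
        raise ValueError("at least one word is required")
    total_time = words[-1].end - words[0].start
    if total_time <= 0:
        raise ValueError("word timings must span a positive duration")

    # Silent gaps between consecutive words that are long enough to count as pauses.
    gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    speaking_time = total_time - sum(pauses)

    return {
        # Words per second over the whole response, pauses included.
        "speech_rate": len(words) / total_time,
        # Words per second of actual speaking time, pauses excluded.
        "articulation_rate": len(words) / speaking_time,
        # Number of silent pauses, normalized to one minute of response time.
        "pauses_per_minute": len(pauses) * 60 / total_time,
    }


sample = [Word("I", 0.0, 0.25), Word("like", 0.375, 1.0), Word("tea", 1.5, 2.0)]
features = fluency_features(sample)
```

Speech rate and articulation rate differ only in whether pause time is counted, which is why a hesitant speaker can have a low speech rate but a typical articulation rate.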

Linguistic Feature Extraction

Beyond acoustic features, linguistic analysis of the transcribed response extracts features related to vocabulary, grammar, and discourse. Vocabulary features include measures of lexical diversity and sophistication. Grammatical features evaluate syntactic accuracy and complexity. Discourse features examine coherence and response relevance. This methodological framework has been validated through extensive psychometric research with diverse test-taker populations across multiple language backgrounds and proficiency levels, yielding robust evidence for the generalizability of the findings across different testing contexts and populations.
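As a rough illustration of the vocabulary features, the sketch below computes a type-token ratio as a lexical diversity measure and the share of tokens outside a high-frequency word list as a crude sophistication proxy. Both measures, the tokenizer, and the tiny `common_words` set are assumptions chosen for brevity; the post does not specify which indices GELPS uses.

```python
import re


def lexical_features(transcript: str, common_words: set[str]) -> dict[str, float]:
    """Return simple vocabulary measures for a transcribed response.

    common_words stands in for a real high-frequency word list, which would
    be far larger than the toy set used in the example below.
    """
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        raise ValueError("transcript contains no words")
    rare_tokens = [t for t in tokens if t not in common_words]
    return {
        "tokens": len(tokens),
        # Distinct words divided by total words; higher means more varied vocabulary.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Share of words not on the high-frequency list.
        "rare_word_share": len(rare_tokens) / len(tokens),
    }


common = {"the", "is", "and", "nice", "city", "weather"}
features = lexical_features("The weather is nice and the city is vibrant", common)
```

The plain type-token ratio falls as responses get longer, so a production system would more plausibly use a length-corrected variant; the simple ratio is shown here only because it is easy to verify by hand.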

Model Training and Validation

The automated scoring model is trained on a large corpus of spoken responses rated by trained human raters. The training process identifies the optimal combination of features for predicting human-assigned scores. Cross-validation procedures assess generalizability to new responses and new test-takers.
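Assessing generalizability to new test-takers implies that no speaker should appear in both the training and evaluation portions of a cross-validation fold. The sketch below shows one way to build such folds by grouping responses by test-taker. The function name `grouped_folds` and the round-robin assignment are illustrative assumptions; the post does not describe how GELPS partitions its data.

```python
from collections import defaultdict
from typing import Iterator


def grouped_folds(
    test_taker_ids: list[str], n_folds: int = 5
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) pairs for grouped cross-validation.

    All responses from a given test-taker land in the same held-out fold,
    so each fold evaluates the model on speakers it has never seen.
    """
    by_taker: dict[str, list[int]] = defaultdict(list)
    for index, taker in enumerate(test_taker_ids):
        by_taker[taker].append(index)

    takers = sorted(by_taker)
    if n_folds < 2 or n_folds > len(takers):
        raise ValueError("n_folds must be between 2 and the number of test-takers")

    for k in range(n_folds):
        # Round-robin assignment of test-takers to folds.
        held_out = takers[k::n_folds]
        held_out_set = set(held_out)
        test_idx = [i for taker in held_out for i in by_taker[taker]]
        train_idx = [
            i for i, taker in enumerate(test_taker_ids) if taker not in held_out_set
        ]
        yield train_idx, test_idx


# Six responses from four test-takers, split into two folds.
ids = ["a", "a", "b", "c", "c", "d"]
folds = list(grouped_folds(ids, n_folds=2))
```

Splitting by response instead of by test-taker would let the model see other responses from the same speaker during training, which tends to overstate how well it handles unfamiliar voices.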

Reliability and Agreement with Human Ratings

Validation studies examine both the reliability of automated scores and their agreement with human ratings. GELPS’s automated speaking scores achieve exact agreement rates above 60% and adjacent agreement rates above 95% with trained human raters, consistent with levels observed between pairs of human raters. Test-takers and score users alike benefit from these rigorous methodological standards, which prioritize both measurement accuracy and fairness across diverse linguistic and cultural populations.
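The two agreement statistics can be computed directly from paired scores. This sketch assumes integer scores on a shared scale and takes "adjacent agreement" to mean the automated score is within one point of the human score, exact matches included. That reading is an assumption, though it is consistent with the adjacent rate in this post being higher than the exact rate.

```python
def agreement_rates(machine: list[int], human: list[int]) -> tuple[float, float]:
    """Return (exact_agreement, adjacent_agreement) as proportions.

    Adjacent agreement here counts pairs that differ by at most one
    score point, including exact matches.
    """
    if len(machine) != len(human) or not machine:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [abs(m - h) for m, h in zip(machine, human)]
    exact = sum(d == 0 for d in diffs) / len(diffs)
    adjacent = sum(d <= 1 for d in diffs) / len(diffs)
    return exact, adjacent


# Made-up scores for five responses: three exact matches, one off by one,
# one off by two.
exact, adjacent = agreement_rates([3, 4, 2, 5, 3], [3, 3, 2, 5, 5])
print(exact, adjacent)  # 0.6 0.8
```

The same function applied to two human raters' scores gives the human-human baseline against which automated agreement is compared.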

Ongoing Monitoring and Improvement

Automated scoring models require ongoing monitoring to ensure stable performance over time and across different test-taker populations. GELPS monitors model performance continuously, examining convergence with human ratings and evidence of differential model performance across subgroups.
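One simple way to look for differential model performance is to compare automated and human scores within each subgroup and flag groups where the average gap is large. The sketch below does this with a mean signed difference. The record layout, the subgroup labels, and the 0.25-point flag threshold are hypothetical; the post does not state which statistics or thresholds GELPS uses.

```python
from collections import defaultdict
from statistics import mean


def subgroup_mean_differences(
    records: list[tuple[str, float, float]], flag_threshold: float = 0.25
) -> dict[str, dict[str, object]]:
    """Summarize machine-minus-human score differences per subgroup.

    records holds (subgroup, machine_score, human_score) tuples.
    flag_threshold is an illustrative cutoff, not an operational standard.
    """
    diffs: dict[str, list[float]] = defaultdict(list)
    for group, machine, human in records:
        diffs[group].append(machine - human)

    report = {}
    for group, values in diffs.items():
        avg = mean(values)
        report[group] = {
            "n": len(values),
            # Positive: the model scores this group higher than human raters do.
            "mean_diff": avg,
            "flagged": abs(avg) > flag_threshold,
        }
    return report


# Made-up data: the model matches human raters for group_a
# but scores group_b one point lower on average.
records = [
    ("group_a", 3, 3),
    ("group_a", 4, 4),
    ("group_b", 3, 4),
    ("group_b", 2, 3),
]
report = subgroup_mean_differences(records)
```

A flagged subgroup is a prompt for closer review, not proof of bias: small samples can produce large average gaps by chance.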