Deep Dives

High-level, nuanced investigations analyzing complex, confusing, and sometimes controversial issues in literacy. Each column delves into the theoretical and research bases of the issue, and identifies where and why gaps in the research exist.

Understanding Psychometric Properties

The Issue

Recently, some educators have begun reconsidering the use of a popular universal screener due to newly surfaced questions about its rigor, despite the fact that this information has always been available to educators and administrators. This reaction reflects a broader challenge in education: distinguishing between tools that are widely used and those that are well-supported by evidence. The history of reading instruction has shown that popularity and effectiveness do not always align. The same is true for assessment tools. Unfortunately, a look behind the curtain of technical manuals reveals that the aforementioned screener is not the only popular measure with less-than-ideal evidence that it actually identifies students at risk for reading difficulties.

While many educators are accustomed to reviewing test administration procedures, fewer have had training in how to interpret a test’s technical manual. This article is intended to support educators and school leaders in evaluating whether a test is appropriate for its intended purposes, as well as appropriate for their students. While the properties included below are relevant to a wide range of tests, the explanations and examples will center on tests intended to evaluate students’ literacy and/or language.


Test Purposes

Because psychometric properties must be interpreted relative to a test’s intended purpose, it is important to first distinguish between screeners, diagnostics, and progress monitoring measures. There should always be a clearly intended purpose for administering any test. Educators may use a criterion-referenced test to determine how well students have acquired the skills and knowledge from a period of instruction or a curricular unit. High schoolers often take standardized, norm-referenced tests to compare their performance on various components to that of their peers across the nation so that colleges and universities have an equivalent data point of student aptitude. Although these are not all-inclusive, three such purposes referred to in this article include:

Screener

A brief assessment used to identify students who may be at risk for difficulties and need further evaluation and/or support. The screener may broadly aim to identify a risk of general academic difficulties, or it may identify a more specific risk, such as reading or language difficulties, or something narrower still, such as dyslexia. It is important to note that constructs (skills or domains included in a test, such as phonemic awareness, sound-letter correspondences, and word reading) on a screening measure may not be knowledge or skills that are taught in a classroom. For example, rapid automatized naming (RAN) is a powerful predictor of later reading abilities for young students (Araújo et al., 2015), but is resistant to direct instruction (Wolfsperger & Mayer, 2024). Thus, while this construct is important to include in literacy screening measures, it would not usually be directly taught through instruction or intervention.

Progress Monitoring

A series of tests used repeatedly over time to measure student growth and evaluate whether instruction/intervention is effective. Constructs on these tests closely resemble instructional targets so that the tests are sensitive to small gains demonstrated by students.

Diagnostic

A detailed test used to identify a student’s specific strengths, weaknesses, and underlying skill deficits related to reading or language. These may be used as part of a broader assessment to determine student goals, educational diagnoses, and/or eligibility for additional services.

For the purposes of this article, we will use the word “tests” to refer collectively to screeners, progress monitoring tools, and/or diagnostics unless further specified.

Psychometric Properties

Psychometric properties are the characteristics used to evaluate the quality and usefulness of a test. These values, which are typically reported in the test’s technical manual, help determine how accurately and consistently a test performs its intended purpose. Not every test will include every key psychometric property in its manual. Sometimes this is because the property is not relevant to that particular test. For example, if a test does not have multiple forms, there is no way to measure the reliability of its alternate forms, because they do not exist. Other times, psychometric properties are not reported even when they are vital to determining the rigor and quality of the assessment (Spaulding et al., 2006). For example, a test marketed as a universal screener that doesn’t report how well it actually identifies at-risk students should be a deal-breaker.

In the sections that follow, we will examine key psychometric properties and how educators can use this information to evaluate whether a test is appropriate for its intended purpose. Importantly, all of these psychometric values depend heavily on the characteristics of the sample population used to generate them (discussed in further detail below) and, like all statistical estimates, contain some degree of uncertainty and measurement error. The following is not an exhaustive list, but rather a selection of the psychometric properties most relevant to evaluating the rigor of reading assessments:

Validity

Screening measures and diagnostic tests have the specific purposes of predicting and identifying areas of impairment, such as reading or language difficulties. Validity evidence helps establish whether the test actually measures what it purports to measure and whether it supports interpretations of test scores for particular uses. In other words, while it might be clear at a glance that a diagnostic test is focused on reading, it is not clear at a glance that it is good at identifying whether or not students have reading disorders, so its validity needs to be analyzed prior to selecting and using the test for identification, eligibility, and/or diagnostic purposes.

Content Validity

Purpose: Content validity can essentially be considered a well-founded expert opinion. It outlines the research, theoretical framework(s), and/or test development process used to determine why the constructs and format of the test are considered appropriate for the targeted students and intended purpose. For example, a screening measure might describe that it includes several segmenting and blending tasks because they are considered strong predictors of later reading ability for the ages targeted. A full explanation of a test’s content validity would more systematically explain all of the constructs included in the test and how it fully covers the domain it aims to assess. A test’s content validity may also explain its format as well. For example, it may provide a rationale for why the test is teacher-administered or computer-based, why students are expected to produce a response as opposed to only selecting answer choices, etc.

How it might be reported: Unlike the other psychometric properties described below, content validity is not a statistical calculation derived from testing performed with sample populations. If provided, content validity is reported in more of a narrative format.

Important Considerations: Even the strongest content validity is not proof that the test is well constructed, and it is not a sufficient rationale on its own. Consider, for instance, the same example of segmenting and blending. While the importance of these phonemic awareness tasks may be established via the test’s content validity, that does not ensure that the items in the screener are neither too easy nor too difficult to properly differentiate between students who are struggling with this skill and those who are not.

Construct Validity

Purpose: This property evaluates whether the test captures each intended construct, rather than unrelated abilities that may confound student performance. Students’ attention abilities, test-taking skills, language, and numerous other factors all may influence their performance on reading tests in ways that are not related to the constructs being measured. Consider the potential ramifications of a vocabulary test that provides the targeted words only in print form. While this might be an expected test format for high schoolers, using this format with elementary school students would likely reflect the students’ word-reading ability as much as, or more than, the vocabulary skills the test is intended to measure.

How it might be reported: Construct validity is not established through a single study or statistic, but rather through a body of evidence, often including descriptions of statistical analyses and patterns in how the test relates to other measures. These relationships are reported using correlation coefficients (r), with a value of 1.0 representing a perfect, positive relationship between two measures. Lower (r) values indicate weaker relationships. For example, a phonemic awareness measure should correlate strongly with other measures of phonological processing and early reading skills (convergent validity), while showing weaker relationships with less directly related constructs such as vocabulary (divergent validity).
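
To make the correlation coefficient concrete, the following is a minimal sketch that computes Pearson's r by hand. The scores, the number of students, and the measure names are all hypothetical, invented purely for illustration; they do not come from any published test or technical manual.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical raw scores for eight students (illustrative only).
phonemic_awareness = [4, 7, 9, 12, 13, 15, 18, 20]
phonological_other = [5, 6, 10, 11, 14, 14, 17, 21]  # closely related construct
vocabulary = [12, 5, 14, 9, 7, 16, 10, 13]           # less directly related construct

convergent_r = pearson_r(phonemic_awareness, phonological_other)
divergent_r = pearson_r(phonemic_awareness, vocabulary)
print(f"convergent r = {convergent_r:.2f}")
print(f"divergent r  = {divergent_r:.2f}")
```

In this invented data set, the two phonological measures rise and fall together, so their r is close to 1.0, while the vocabulary scores track the phonemic awareness scores only loosely, so that r is much lower.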

Important Considerations: In general, educators should expect stronger correlations between a test and measures of the same or closely related constructs, and weaker correlations with unrelated constructs. It is important to note, however, that acceptable magnitudes for correlation values vary by construct, age, purpose, and measurement method.

As with any validity evidence that relies on comparisons to other tests, the strength of the correlation is only meaningful if the comparison measure is itself psychometrically strong.

Criterion Validity

Criterion validity measures how well a test, or specific components within a test, align with important external measures or outcomes, such as another test that is considered a “gold standard” in the same domain. There are two types of criterion validity:

  • Concurrent Validity

    • Purpose: Concurrent validity evaluates whether a test produces results similar to those of an already established measure or evaluation when both assessments are administered within the same timeframe. For example, do the results of a dyslexia screener align with the outcomes of a full dyslexia evaluation if performed in close succession?
    • How it might be reported: This evidence is typically reported as a correlation coefficient (r). As a general guideline, correlations around .30–.50 are considered moderate, while .50 and above are considered strong.
    • Important Considerations: Strong concurrent validity is most likely when the comparison measure evaluates the same construct in a similar manner.
  • Predictive Validity

    • Purpose: As the name implies, this measure indicates how well a test predicts future student performance. For example, do the results of a dyslexia screener align with the outcomes of a full dyslexia evaluation if performed two years later?
    • How it might be reported: Like concurrent validity, predictive validity is typically reported as a correlation coefficient (r). While general conventions for interpreting correlation strength apply, predictive validity coefficients are typically lower than concurrent validity coefficients due to the added variability of predicting future outcomes.
    • Important Considerations: Predictive validity may not be reported if the test or screener is not intended to predict future outcomes.

Classification Accuracy (sometimes known as discriminant accuracy)

Classification accuracy determines how well a test correctly places students into specific groups, such as identifying whether a student has a vocabulary impairment or not. While this may sometimes be reported as an overall accuracy percentage, one value alone does not provide a complete picture of this validity property. For example, if a screener targeted a disorder that affects 1% of the population, and always identified test-takers as unaffected, its classification accuracy could still be 99%. For this reason, classification accuracy is often broken down into additional metrics, particularly sensitivity and specificity.

  • Sensitivity

    • Purpose: Sensitivity refers to how accurately a test identifies students who actually have the targeted difficulty or disorder. In other words, a highly sensitive test minimizes “missed” students. Students who need support are identified as such.
    • How it might be reported: This is reported as a percentage, with 100% meaning all affected students are identified by the test. Conventions for judging the acceptability of sensitivity rates vary, with some researchers adopting those suggested by Plante and Vance (1994): 90%–100% is good, 80%–89% is fair, and under 80% is considered unacceptable, although this is not a universally accepted standard (Greenslade et al., 2009).
    • Important Considerations: Missed students can be a concern for assessments in any format, but sensitivity may be especially important to consider with computer adaptive tests (CATs), which sometimes fail to identify subtle or inconsistent difficulties (Shen et al., 2026).
  • Specificity

    • Purpose: Specificity refers to how accurately a test identifies students who do not have the targeted difficulty or disorder as typical. A highly specific test minimizes false positives. Typical students are correctly identified as not needing support. Strong specificity is particularly important in screening and diagnostic contexts to avoid overidentifying students for additional evaluation or intervention.
    • How it might be reported: This is reported as a percentage, with 100% meaning that no students are misidentified as having the condition when they are actually not affected. As with sensitivity, conventions for judging the acceptability of specificity rates vary, with some researchers adopting those suggested by Plante and Vance (1994): 90%–100% is good, 80%–89% is fair, and under 80% is considered unacceptable, although this is not a universally accepted standard (Greenslade et al., 2009).
    • Important Considerations: Specificity is especially important when considering culturally and linguistically diverse students, as tests with poor specificity may incorrectly identify differences in language, dialect, background knowledge, or educational experience as evidence of a disability.
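
The relationship between overall accuracy, sensitivity, and specificity can be sketched with a few lines of arithmetic. The example below reuses the scenario described earlier in this section (a disorder affecting 1% of students and a screener that labels everyone as unaffected); the sample size of 1,000 and the function name are assumptions made for illustration.

```python
def classification_metrics(true_pos, false_neg, true_neg, false_pos):
    """Return (accuracy, sensitivity, specificity) from the four outcome counts."""
    total = true_pos + false_neg + true_neg + false_pos
    accuracy = (true_pos + true_neg) / total
    sensitivity = true_pos / (true_pos + false_neg)  # share of affected students flagged
    specificity = true_neg / (true_neg + false_pos)  # share of typical students cleared
    return accuracy, sensitivity, specificity

# Hypothetical sample: 1,000 students, 10 of whom (1%) are affected.
# The screener labels every student as unaffected.
accuracy, sensitivity, specificity = classification_metrics(
    true_pos=0, false_neg=10, true_neg=990, false_pos=0
)
print(f"accuracy={accuracy:.0%}, sensitivity={sensitivity:.0%}, specificity={specificity:.0%}")
# prints: accuracy=99%, sensitivity=0%, specificity=100%
```

The overall accuracy looks excellent, yet the screener misses every student it was meant to find, which is why sensitivity and specificity need to be read side by side.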

These are vital measurements of validity that should be reported on for screeners and diagnostic assessments. Ideally, these measurements are provided not only for the entire sample used to assess these psychometric properties, but also broken down by population subgroups (e.g., by race/ethnicity or other relevant subgroups) as well, to help determine whether the test functions equitably across diverse groups and to identify potential sources of bias.

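It is important to remember that these values are also affected by where the test’s cutoff or benchmark scores are set. Moving a cutoff so that more students are flagged may increase sensitivity by identifying more students who are at risk, which may in turn reduce specificity by incorrectly identifying more typical students as being at risk. Moving the cutoff in the other direction may have the opposite effect. Whether flagging more students means raising or lowering the numeric cutoff depends on how the test is scored: on measures where students scoring below the benchmark are considered at risk, raising the benchmark flags more students. Because of this tradeoff, benchmark decisions should be interpreted carefully in relation to the intended purpose of the test.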
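
The cutoff tradeoff can be illustrated with a small sketch. It assumes a hypothetical risk index on which higher scores indicate greater risk and students scoring at or above the cutoff are flagged, so that lowering the cutoff flags more students; many reading screeners are scored in the opposite direction. All scores below are invented.

```python
# Hypothetical risk-index scores (higher = more risk); illustrative only.
at_risk = [35, 42, 48, 55, 61, 66, 72, 80]
typical = [10, 14, 19, 23, 27, 31, 36, 40, 45, 52]

def rates_at_cutoff(cutoff):
    """Flag students scoring at or above the cutoff; return (sensitivity, specificity)."""
    sensitivity = sum(score >= cutoff for score in at_risk) / len(at_risk)
    specificity = sum(score < cutoff for score in typical) / len(typical)
    return sensitivity, specificity

# Step the cutoff downward and watch the two rates move in opposite directions.
results = {cutoff: rates_at_cutoff(cutoff) for cutoff in (60, 50, 40, 30)}
for cutoff, (sens, spec) in results.items():
    print(f"cutoff {cutoff}: sensitivity={sens:.0%}, specificity={spec:.0%}")
```

With these made-up scores, each step that flags more students catches more of the at-risk group while also misidentifying more typical students, so no single cutoff maximizes both rates.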


Reliability

While validity refers to whether a test measures what it claims to measure, reliability refers to how consistently the test measures it across different situations and conditions. If a student receives substantially different scores across repeated administrations of the same test despite no meaningful change in ability, confidence in the usefulness and interpretability of the test becomes limited.

Measurements of reliability help evaluate how stable test performance is across factors such as time, testing items, raters, and administration conditions.

Interrater Reliability

Purpose: This property evaluates whether scores remain consistent when different educators administer or score the test. If a test relies heavily on interpretation (e.g., “Yes, I think that answer was good enough”) or teacher knowledge, or lacks clear scoring or administration protocols (e.g., how to handle pronunciation differences), it may not have good interrater reliability, because one teacher may score a student differently than another teacher would. Scores can be inconsistent not only because of ambiguity or subjectivity, but also because of examiner error. For example, even relatively structured measures such as oral reading fluency (ORF) assessments can demonstrate scoring inconsistencies due to teacher errors (Cummings et al., 2014).

How it might be reported: Interrater reliability is typically reported as a measure of agreement between scorers, expressed as a value between 0 and 1 (or 0%–100%), where 1.0 indicates perfect agreement. In general, values above .80 indicate that different raters are scoring student responses consistently.
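
As a rough sketch of what an agreement value represents, the example below computes simple percent agreement for two raters. It also computes Cohen's kappa, a statistic not discussed in this article that is included here only as one commonly used way to adjust agreement for chance. The two teachers and their scores are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of responses that two raters scored identically."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement adjusted for the agreement expected by chance alone."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical scores (1 = correct, 0 = incorrect) from two teachers on ten responses.
teacher_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
teacher_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

agreement = percent_agreement(teacher_1, teacher_2)
kappa = cohens_kappa(teacher_1, teacher_2)
print(f"percent agreement = {agreement:.0%}, kappa = {kappa:.2f}")
```

Because different agreement statistics can yield quite different numbers for the same scoring data, it is worth checking which statistic a technical manual actually reports.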

Important Considerations: Tests that do not rely on a person to administer and/or score (such as those performed entirely on a computer) would not typically report this value. An exception might be a test that uses artificial intelligence (AI) and/or machine learning (ML) to judge student responses, particularly oral responses. In this case, interrater reliability might be reported to compare the computer model's scoring to that of human examiners, even if humans would not be administering or scoring students in real-world applications of the test.

Internal Consistency (sometimes known as inter-item correlation)

Purpose: This measure evaluates how consistent individual test items (a specific task or question on a test) are within the overall test (or section of the test, if applicable). It answers the question of ‘Do students who do well on this question tend to do well overall on the construct being tested?’

How it might be reported: Internal consistency is most commonly reported using a coefficient such as Cronbach’s alpha (α), which ranges from 0 to 1. Values of .70–.80 are typically considered acceptable, while .80 and above are considered strong. Some tests may also report other related statistics (e.g., split-half reliability), but Cronbach’s alpha is the most common.
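
For readers curious about how Cronbach's alpha is derived, here is a minimal sketch using the standard formula, which compares the summed variance of the individual items with the variance of students' total scores. The response matrix is hypothetical and far smaller than any real norming sample.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of students, each a list of item scores."""
    num_items = len(item_scores[0])
    items = list(zip(*item_scores))  # transpose: one tuple of scores per item
    item_variance_sum = sum(pvariance(item) for item in items)
    total_variance = pvariance([sum(student) for student in item_scores])
    return (num_items / (num_items - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical responses: 6 students x 4 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

When students who answer one item correctly also tend to answer the others correctly, total scores vary much more than the individual items do, and alpha rises toward 1.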

Important Considerations: This measure tends to increase as the number of test items increases. For time-based tests such as curriculum-based measures (CBMs), the number of items administered is not consistent, so calculating and reporting this measure may be less appropriate and/or interpretable. While this measure may be useful as a check of item quality, high internal consistency alone does not establish validity (McCrae et al., 2011).
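
The tendency for internal consistency to rise with test length can be sketched using the standardized form of alpha, which depends only on the number of items and the average correlation between items. The average inter-item correlation of .25 used below is an assumed value chosen for illustration, not a figure from any test.

```python
def standardized_alpha(num_items, mean_inter_item_r):
    """Standardized alpha from the item count and the average inter-item correlation."""
    return (num_items * mean_inter_item_r) / (1 + (num_items - 1) * mean_inter_item_r)

# Hold item quality constant (assumed average r = .25) and vary only the test length.
alphas = {num_items: standardized_alpha(num_items, 0.25) for num_items in (5, 10, 20, 40)}
for num_items, value in alphas.items():
    print(f"{num_items} items: alpha = {value:.2f}")
```

Under this assumption, a longer test reports a higher alpha even though no individual item has improved, which is one reason a high value should not be read as evidence of item quality or validity by itself.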

Alternate Form Reliability

Purpose: Some tests publish multiple versions of the measure (such as Form A and Form B), which are sometimes known as parallel or alternate forms. While there are several reasons to use parallel forms (such as to prevent students from copying a neighbor’s answers), for the test purposes discussed here they are primarily designed to mitigate the practice effect of taking the same test repeatedly. For example, if a student took Form A of a test with an oral reading fluency (ORF) component, they could take Form B the next time, with a new, unfamiliar passage to read. This would help ensure that any improvement in the student’s ORF rate is due to improvement in their reading fluency, not familiarity with the text.

Alternate form reliability measures how similarly a student could be expected to perform if they were to take all of the available forms of a given test. For example, pretend that there was a test of reading comprehension that had two alternate forms. Each form provides one text for the student to read, followed by comprehension questions. The passage in Form A is about baseball, and the passage in Form B is about cellular respiration. It is not hard to imagine that students may be unlikely to perform equally well on both forms. This test, therefore, may not have good alternate form reliability. Like any of these psychometric properties, however, this would be based on statistical analyses conducted with the testing sample, not based on subjective thinking.

How it might be reported: Alternate form reliability is typically reported as a correlation coefficient between scores on the different forms of the same test. In general, higher correlations indicate that the alternate forms produce more consistent results and can be used interchangeably with greater confidence.

Important Considerations: This value would not be reported for tests that do not provide alternate forms.

Test-Retest Reliability

Purpose: This property measures how stable the test is if a student were to take the same measure again on another day, without any meaningful change in the student’s skill or ability. Consider, for instance, a student taking a 4-question, multiple-choice exam about a topic they had no knowledge of. It is possible the student could have made lucky guesses and scored perfectly. If the same student were to retake that test weeks later, however, making random guesses again might now earn a score of 25% or even 0%. Having such a small number of multiple-choice questions would likely result in poor test-retest reliability.
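
The guessing scenario above can be worked out exactly. The sketch below assumes each of the four questions has four answer choices (the article does not specify the number of choices) and that the student guesses at random on every question.

```python
from fractions import Fraction
from math import comb

def guess_score_distribution(num_questions, num_options):
    """Probability of each possible number correct when every answer is a random guess."""
    p = Fraction(1, num_options)
    return {
        correct: comb(num_questions, correct) * p**correct * (1 - p) ** (num_questions - correct)
        for correct in range(num_questions + 1)
    }

# Assumed: 4 questions with 4 answer choices each.
distribution = guess_score_distribution(num_questions=4, num_options=4)
for correct, probability in distribution.items():
    print(f"{correct}/4 correct: {float(probability):.3f}")
```

Under these assumptions a perfect score by guessing has a probability of 1 in 256, while scores of 0% or 25% together account for roughly three quarters of outcomes, so two attempts by the same student can easily look very different.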

How it might be reported: Test-retest reliability is typically reported as a correlation coefficient between scores from repeated administrations of the same test. In general, higher correlations indicate that the test produces more stable and consistent results over time when no meaningful change in the student’s skills or abilities is expected.

Important Considerations: The length of time between administrations of the test when calculating this value might vary. The greater the length of time between the tests, the more likely it is that students may have learned and/or matured between attempts, meaning that an identical performance would not necessarily be expected. On the other hand, retaking a test on consecutive days may artificially inflate student scores due to practice effects. These factors related to growth expectations and the time between administrations should be taken into account when interpreting this value.


Sample Population and Bias

Most of the above psychometric properties rely on statistical calculations based on the performance of a group of students (of adequate size) who took the test prior to its publication. This group of students is known as the sample population, or simply the sample. It is important that administrators and educators understand that the sample from which the validity and reliability data are collected should accurately represent the students to whom they intend to administer the test. While test materials usually explicitly label the grades and/or ages that the test is intended for, other characteristics of the sample should be factored in as well. These are important considerations for test bias, because factors such as students’ culture, community type, location, and socioeconomic status (SES) can affect their familiarity with concepts and vocabulary (Peña, 2007). For instance, students living in rural communities might be better equipped to understand a reading passage about farming than one about public transportation. And while unfamiliarity with the word cardinal might be surprising for a child in Georgia or St. Louis, given the prominence of the bird and sports mascot in those regions, it would be relatively common for children in California, where the bird is not native.

Sample populations should resemble the demographic characteristics of the intended test-takers in terms of language backgrounds, dialects spoken, and SES. However, this expectation of inclusion might not hold true across all possible student characteristics. Because many tests are designed to measure the presence (or risk) of literacy and/or language difficulties rather than their severity, inclusion of students with these disorders in the normative sample may artificially lower normative expectations. Therefore, some researchers argue that students with the disorder being assessed should be omitted from normative samples (Peña et al., 2006).


In Sum

When considering the psychometric properties of a test, it’s vital that the big picture be considered, rather than focusing on a few statistical tests in isolation. Validity, classification accuracy, reliability, and sample population characteristics must all be considered together in relation to the test’s intended purpose and the students it was designed to assess.

Educators must consider the broader picture: the intended purpose of the assessment, the population it was designed for, and whether there is psychometric evidence available to support the decisions being made from the test results. A measure that is excellent for universal screening may be inappropriate for progress monitoring, and a test with strong reliability statistics may still fail to accurately identify students with reading difficulties. Understanding these concepts helps educators move beyond marketing claims and popularity and toward selecting assessment tools that are both scientifically sound and instructionally meaningful.

References

Araújo, S., Reis, A., Petersson, K. M., & Faísca, L. (2015). Rapid automatized naming and reading performance: A meta-analysis. Journal of Educational Psychology, 107(3), 868–883. https://doi.org/10.1037/edu0000006

Cummings, K. D., Biancarosa, G., Schaper, A., & Reed, D. K. (2014). Examiner error in curriculum-based measurement of oral reading. Journal of School Psychology, 52(4), 361–375. https://doi.org/10.1016/j.jsp.2014.05.007

Greenslade, K. J., Plante, E., & Vance, R. (2009). The diagnostic accuracy and construct validity of the Structured Photographic Expressive Language Test–Preschool: Second Edition. Language, Speech, and Hearing Services in Schools, 40(2), 150–160. https://doi.org/10.1044/0161-1461(2008/07-0049)

Indu, P. V., Vidhukumar, K., Chacko, D., Menon, V., Grover, S., & Gupta, S. (2025). Criterion validity, construct validity, and factor analysis: An introductory overview. Indian Journal of Psychiatry, 67(9), 916–921. https://doi.org/10.4103/indianjpsychiatry_911_25

McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. (2011). Internal consistency, retest reliability, and their implications for personality scale validity. Personality and Social Psychology Review: An Official Journal of the Society for Personality and Social Psychology, Inc, 15(1), 28–50. https://doi.org/10.1177/1088868310366253

Peña, E. D., Spaulding, T. J., & Plante, E. (2006). The composition of normative groups and diagnostic decision making: shooting ourselves in the foot. American Journal of Speech-Language Pathology, 15(3), 247–254. https://doi.org/10.1044/1058-0360(2006/023)

Peña, E. D. (2007). Lost in translation: methodological considerations in cross-cultural research. Child Development, 78(4), 1255–1264. https://doi.org/10.1111/j.1467-8624.2007.01064.x

Shen, L., Huang, Y., Ma, Z., Chen, M., Ma, C., & Clemens, N. H. (2026). How accurate are computer-adaptive reading screeners? A systematic review and meta-analysis of validity and classification accuracy. School Psychology Review, 1–21. https://doi.org/10.1080/2372966X.2026.2633597

Spaulding, T. J., Plante, E., & Farinella, K. A. (2006). Eligibility criteria for language impairment: Is the low end of normal always appropriate? Language, Speech, and Hearing Services in Schools, 37(1), 61–72. https://doi.org/10.1044/0161-1461(2006/007)

Wolfsperger, J., & Mayer, A. (2024). A brief research report on the efficacy of a RAN training in elementary school age children. Frontiers in Education, 9. https://doi.org/10.3389/feduc.2024.1376434
