ATS American Thoracic Society


The "validity" of a measurement instrument does not refer to the instrument itself but to whether particular interpretations of its scores are well justified. It is inappropriate to speak of a measurement instrument as inherently valid or invalid. It is only meaningful to consider the validity of a specified purpose or interpretation of the resulting scores. Since multiple types of inferences may be entertained for scores from a given instrument, depending upon the situation in which it is to be used, the validity of each inference must be established. This distinction is a familiar one in pulmonary medicine. The FEV1, for example, is not considered to be valid in and of itself but in terms of specific purposes for which it might be used. To the extent that the user of a measurement instrument proposes to use it for purposes different from those intended by the developer and supported by existing evidence, the user becomes responsible for producing evidence regarding the validity of his/her proposed interpretations.

An overall assessment of a measurement instrument's utility and limitations, then, is not established once and for all but is gradually built up from cumulative evidence regarding the interrelationships between the content of the instrument and definitions of the construct to be measured, interrelationships between scores on the instrument and the results of other relevant measurements, and differences between scores on the instrument for different groups of individuals or for the same individuals at different points in time or under different conditions. Typically the latter involves measurements made before and after some intervention (e.g., a particular medical therapy or an educational/behavioral intervention), or before and after two different interventions that are being compared under controlled conditions. It is possible to characterize different types of inferences or uses (e.g., assessment of change within an individual over time, assessment of group differences), and for each of these there are conventional experimental designs and statistics by which the legitimacy of these inferences can be tested.
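As one illustration of the conventional statistics mentioned above, the following Python sketch computes a paired t statistic for change within the same individuals measured before and after an intervention. It is a minimal sketch under stated assumptions, not a prescribed analysis: the scores are hypothetical, the function name `paired_t_statistic` is invented for this example, and higher scores are assumed to indicate better quality of life.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """Paired t statistic for within-individual change (after minus before).

    Assumes the two lists hold scores for the same individuals in the same
    order, and that the differences are not all identical.
    """
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error

# Hypothetical instrument scores for six individuals (illustration only).
scores_before = [52, 60, 47, 55, 63, 49]
scores_after = [58, 66, 50, 61, 64, 57]

t_value = paired_t_statistic(scores_before, scores_after)
print(f"Paired t statistic (n - 1 = {len(scores_before) - 1} degrees of freedom): {t_value:.2f}")
```

The statistic only summarizes the size of the average change relative to its variability. Whether the change can be attributed to the intervention still depends on the experimental design, for example the presence of a control condition.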

The following brief summaries describe several sources of evidence commonly gathered to investigate the validity of inferences based on an instrument or assessment tool. Historically, a terminology has been used in which different sources of evidence are treated as though they established different "types" of validity (e.g., "content validity"). Some potential developers and users of HRQL instruments may be accustomed to this terminology. However, this terminology has led to considerable confusion and misunderstanding and is not used here.

Evidence based on instrument content

The extent to which the content of the instrument, that is, its questions or items, is linked by a plausible (and preferably explicit) rationale to some particular conception of the construct being measured (e.g., quality of life) is one source of evidence supporting the validity of interpreting scores as measures of the intended construct. The developer of the instrument is in the best position to provide such a rationale. However, judgments about the plausibility of the rationale can and should be made by other experts, measurement specialists, and potential users (researchers or clinicians).

The likelihood that there will be a clear and compelling relationship between instrument content and the construct to be assessed is greatest when the construct (e.g., quality of life) is first defined as clearly as possible in written form, and this definition is then used as the foundation for developing or selecting the items that make up the instrument. When evaluating the available evidence supporting the validity of some inference based on scores from an HRQL measurement instrument, it is useful to consider such factors as:

1) the process by which the instrument was developed,
2) whether there exists a clear definition of what is being measured,
3) how that definition was derived,
4) whether it omits important elements or includes irrelevant ones,
5) how well-accepted it is, and
6) the plausibility of the explicit or implicit rationale linking the content of the instrument to the definition of what is to be measured.

Evidence based on internal structure

Evidence based on the internal statistical structure of an instrument is another source of information relevant to interpretation of its scores. Such evidence typically is obtained by techniques such as factor analysis, cluster analysis, or other methods of determining whether the items are measuring a single, unidimensional construct or a multidimensional one, and whether their statistical structure is consistent with the definition of the construct to be measured and the intended logical structure of the instrument.
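As a rough illustration of examining internal structure, the following Python sketch builds the inter-item correlation matrix and reports the share of total variance captured by its largest eigenvalue. This is a simplified stand-in for a full factor analysis, not the procedure of any particular instrument developer. The item scores and all function names are hypothetical. For k items, a share near 1 is consistent with a single dominant dimension, whereas a share near 1/k suggests the items have little in common.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(items):
    """items: one list of scores per item, all in the same respondent order."""
    k = len(items)
    return [[pearson_r(items[i], items[j]) for j in range(k)] for i in range(k)]

def largest_eigenvalue(matrix, iterations=1000):
    """Largest eigenvalue of a symmetric matrix, found by power iteration."""
    k = len(matrix)
    v = [1.0 + 0.1 * i for i in range(k)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient of the converged unit vector.
    av = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
    return sum(a * b for a, b in zip(av, v))

def first_component_share(items):
    """Fraction of total variance captured by the first principal component."""
    # The trace of a correlation matrix equals the number of items.
    return largest_eigenvalue(correlation_matrix(items)) / len(items)

# Hypothetical responses: four items, six respondents (illustration only).
item_scores = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 5, 5, 4],
    [1, 3, 3, 4, 4, 3],
    [2, 1, 4, 4, 5, 2],
]
print(f"First-component share of variance: {first_component_share(item_scores):.2f}")
```

A single summary number of this kind cannot replace inspection of the full factor or cluster solution against the intended logical structure of the instrument. It only indicates whether a unidimensional interpretation is plausible at first glance.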

Evidence based on relationships between scores and other variables

Evidence of the validity of particular interpretations of scores on an instrument can also be obtained by considering how scores on the instrument are related to other variables. Supportive evidence for particular interpretations may come from findings of either similar results (convergence) or dissimilar results (divergence), depending upon the particular variables involved.

The process of gathering evidence of relationships that support or refute particular interpretations of scores on an HRQL instrument is an ongoing one, since no single relationship is normally sufficient to conclusively establish the validity of a given interpretation. Relevant intercorrelations and associations should be provided by the developer and with each reported use of an instrument, so that succeeding users can judge the validity of the instrument in relationship to the purpose for which they may wish to use it.

For HRQL instruments, the assumption is frequently made that more severe disease will, on average, be associated with diminished quality of life and that treatment already known to reduce the severity of a disease or its symptoms will be associated with improved quality of life in a given population. Hence, one might examine correlations between scores on an HRQL instrument and variables representing other clinical indicators of disease severity, such as supplementary oxygen requirement, FEV1, or exercise tolerance. Or one might examine differences in HRQL scores between treated and untreated subjects. The expectation would be that, on average, scores on the HRQL instrument should correlate with, or show mean differences or changes in the expected direction in concert with, clinical markers of disease severity and other health status measures. However, some caution is in order. Health-related quality of life, as commonly viewed, is not a function of disease state or physiologic status alone: persons of the same physiologic status or impairment may have radically different functional status and HRQL according to their own perceptions. Where other characteristics and circumstances affect HRQL, the failure to find a close correlation between physiological and HRQL measures does not necessarily undercut the validity of either the measure of HRQL or of the physiologic parameter.
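The two checks described above, a correlation with a clinical marker and a comparison of treated and untreated subjects, can be sketched in Python as follows. All data are invented for illustration, the function names are hypothetical, and the sketch assumes an instrument on which higher scores indicate better quality of life. Real instruments differ in scoring direction, so the expected sign of the correlation must be worked out for each one.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_difference(group_a, group_b):
    """Difference in mean score between two independent groups."""
    return mean(group_a) - mean(group_b)

# Hypothetical HRQL scores and FEV1 (% predicted) for the same eight subjects.
hrql_scores = [72, 65, 58, 80, 45, 50, 68, 77]
fev1_percent_predicted = [85, 70, 60, 90, 40, 55, 65, 88]

# Hypothetical HRQL scores for treated and untreated subjects.
treated = [70, 75, 68, 80, 73]
untreated = [60, 66, 58, 71, 62]

print(f"Correlation of HRQL with FEV1: {pearson_r(hrql_scores, fev1_percent_predicted):.2f}")
print(f"Mean HRQL difference (treated - untreated): {mean_difference(treated, untreated):.1f}")
```

In keeping with the caution above, a modest correlation in such an analysis would not by itself count against the instrument, since HRQL is expected to depend on more than physiologic status.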

Evidence based on relationships between scores and a criterion

Among the variables whose relationship to instrument scores is of particular interest are those that might be considered criterion variables. Selection of a single criterion variable as a "gold standard" is difficult, if not impossible, in the case of quality of life instruments because no accepted "gold standard" exists. It is difficult even to imagine what would constitute such a gold standard, since the concept of HRQL is typically considered to be, at least in part, subjective, reflecting the priorities and needs of the individual whose quality of life is being considered and how well those needs are being met.

Relationships between scores and external criteria can be divided into two types: concurrent and predictive. Concurrent relationships are examined by correlating scores on a measurement instrument with other variables obtained concurrently on the same individuals or by examining the differences in scores, obtained concurrently, for populations known to be similar or different in some relevant respect. Predictive relationships are examined by administering an instrument to a sample of individuals and waiting for some subsequent "outcome" to occur (e.g., disease remission, death, improvement in some physiologic parameter). The ability of the scores on the instrument to predict the outcome argues for its utility for that purpose. By itself, such a predictive relationship is not evidence that the instrument is measuring the construct of interest. However, failure to find an expected predictive relationship may suggest either that the intended construct is being poorly measured or that the hypothesis concerning the relationship of the construct to the outcome is questionable. It is the entire web of types of evidence discussed above that sheds light on such issues.

Supposed evidence from casual inspection of the instrument: the issue of "face validity"

It is at times suggested that the validity of an interpretation or use of a measurement instrument can be judged by those who might use it or those on whom it might be used, on the basis of whether the content of the instrument appears to be relevant to the construct to be measured. However, such "face validation" may be quite unreliable and misleading. While it is understandable that those examining an instrument would form an impression of whether the items are appropriate to what they perceive to be the purpose of the measurement, it is impossible to draw legitimate conclusions regarding an instrument's reliability or validity until other information has been gathered and considered. An instrument that appears to casual observation to have good "face validity" could, in fact, be relatively useless for many purposes. An instrument that omits content some individuals consider important may be quite appropriate for its intended uses and quite relevant and appropriate to the developer's definition of the construct to be measured. In any event, such impressions are not likely to involve an adequate analysis of the rationale behind, properties of, and available evidence regarding the intended uses of the instrument. Casual judgments do not provide sound evidence regarding validity.

However, it is important to recognize that, if persons to whom an instrument is administered perceive it to be inappropriate or ambiguous, they may resist the measurement and may even actively or passively refuse to continue to participate in the study. The net effect of such resistance and refusals ranges from poor quality data on such persons to increased attrition rates. Such perceptions thereby effectively undermine the validity of the instrument in that setting.

Copyright © 2007 American Thoracic Society