|
Validity
The "validity" of a measurement instrument does not
refer to the instrument itself but to whether particular
interpretations of its scores are well justified. It is
inappropriate to speak of a measurement instrument as inherently
valid or invalid. It is only meaningful to consider the validity of
a specified purpose or interpretation of the resulting scores.
Since multiple types of inferences may be entertained for scores
from a given instrument, depending upon the situation in which it
is to be used, the validity of each inference must be established.
This distinction is a familiar one in pulmonary medicine. The FEV1,
for example, is not considered to be valid in and of itself but in
terms of specific purposes for which it might be used. To the
extent that the user of a measurement instrument proposes to use it
for purposes different from those intended by the developer and
supported by existing evidence, the user becomes responsible for
producing evidence regarding the validity of their proposed
interpretations.
An overall assessment of a measurement instrument's utility and
limitations, then, is not established once and for all but is
gradually built up from cumulative evidence regarding the
interrelationships between the content of the instrument and
definitions of the construct to be measured, interrelationships
between scores on the instrument and the results of other relevant
measurements, and differences between scores on the instrument for
different groups of individuals or for the same individuals at
different points in time or under different conditions. Typically
the latter involves measurements made before and after some
intervention (e.g., a particular medical therapy or an
educational/behavioral intervention), or before and after two
different interventions that are being compared under controlled
conditions. It is possible to characterize different types of
inferences or uses (e.g., assessment of change within an individual
over time, assessment of group differences), and for each of these
there are conventional experimental designs and statistics by which
the legitimacy of these inferences can be tested.
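As one illustration of the within-individual change design mentioned above, the following minimal sketch summarizes paired before/after scores. The function name `paired_change_summary`, the choice of the standardized response mean and the paired t statistic, and the scores themselves are assumptions made for illustration; they are not taken from any particular instrument or study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_change_summary(before, after):
    """Summarize within-person change for paired before/after scores."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    # Change for each individual (after minus before).
    diffs = [a - b for b, a in zip(before, after)]
    mean_change = mean(diffs)
    sd_change = stdev(diffs)  # sample SD of the individual changes
    if sd_change == 0:
        raise ValueError("all changes are identical; ratios are undefined")
    return {
        "mean_change": mean_change,
        # Standardized response mean: mean change in units of SD of change.
        "srm": mean_change / sd_change,
        # Paired t statistic, with len(diffs) - 1 degrees of freedom.
        "t": mean_change / (sd_change / sqrt(len(diffs))),
    }


# Hypothetical scores (higher = better), invented for illustration only.
before = [40, 52, 47, 60, 55, 43]
after = [48, 55, 53, 61, 62, 50]
summary = paired_change_summary(before, after)
print(summary)
```

A summary of this kind bears only on the interpretation "scores change after the intervention"; whether the change reflects the intended construct depends on the other sources of evidence.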
The following brief summaries describe several sources of evidence
commonly gathered to investigate the validity of inferences based
on an instrument or assessment tool. Historically, a terminology
has been used in which different sources of evidence are treated as
though they established different "types" of validity (e.g.,
"content validity"). Some potential developers and users of HRQL
instruments may be accustomed to this terminology. However, this
terminology has led to considerable confusion and misunderstanding
and is not used here.
Evidence based on instrument content
The extent to which the content of the instrument (its questions
or items) is linked by a plausible (and preferably explicit)
rationale to some particular conception of the construct being
measured (e.g., quality of life) is one source of evidence
supporting the validity of interpreting scores as measures of the
intended construct. The developer of the instrument is in the best
position to provide such a rationale. However, judgments about the
plausibility of the rationale can and should be made by (other)
experts, measurement specialists, and potential users (researchers
or clinicians).
The likelihood that there will be a clear and compelling
relationship between instrument content and the construct to be
assessed is greatest when the construct (e.g., quality of life) is
first defined as clearly as possible in written form, and this
definition is then used as the foundation for developing or selecting the
items that make up the instrument. When evaluating the available
evidence supporting the validity of some inference based on scores
from an HRQL measurement instrument, it is useful to consider such
factors as:
1) the process by which the instrument was developed,
2) whether there exists a clear definition of what is being
measured,
3) how that definition was derived,
4) whether it omits important elements or includes irrelevant
ones,
5) how well-accepted it is, and
6) the plausibility of the explicit or implicit rationale linking
the content of the instrument to the definition of what is to be
measured.
Evidence based on internal structure
Evidence based on the internal statistical structure of an
instrument is another source of information relevant to
interpretation of its scores. Such evidence typically is obtained
by techniques such as factor analysis, cluster analysis, or other
methods of determining whether the items are measuring a single,
unidimensional construct or a multidimensional one, and whether
their statistical structure is consistent with the definition of
the construct to be measured and the intended logical structure of
the instrument.
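A full factor or cluster analysis is beyond a short example, but a common first look at internal structure is the matrix of inter-item correlations. The sketch below is only that preliminary step, not a factor analysis. The item names and responses are hypothetical and were invented so that two groups of items are visible.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def inter_item_correlations(items):
    """Return {(item_a, item_b): r} for every distinct pair of items."""
    names = list(items)
    return {
        (a, b): pearson(items[a], items[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical responses from six respondents (invented for illustration).
items = {
    "dyspnea_1": [1, 2, 3, 4, 5, 3],
    "dyspnea_2": [1, 2, 4, 4, 5, 3],
    "mood_1": [4, 1, 3, 5, 2, 3],
    "mood_2": [5, 1, 3, 4, 2, 3],
}
r = inter_item_correlations(items)
for pair, value in r.items():
    print(pair, round(value, 2))
```

In this invented data the two dyspnea items correlate strongly with each other, as do the two mood items, while correlations across the two groups are weak. That pattern would be consistent with a two-dimensional structure and would need to be compared with the intended logical structure of the instrument.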
Evidence based on relationships between scores and other
variables
Evidence of the validity of particular interpretations of scores
on an instrument can also be obtained by considering how scores on
the instrument are related to other variables. Supportive evidence
for particular interpretations may come both from findings of
similar results (convergence) and from findings of dissimilar
results (divergence), depending upon the particular variables involved.
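The idea of checking observed relationships against expected convergence or divergence can be sketched as follows. The function `check_expected_relationships`, the 0.5 cut-off, the variable names, and all data values are assumptions chosen for illustration; in practice the expected pattern and any threshold would have to be justified in advance.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def check_expected_relationships(scores, others, expectations, threshold=0.5):
    """Compare each observed correlation with the hypothesized pattern.

    expectations maps a variable name to "convergent" (|r| expected to be
    at least the threshold) or "divergent" (|r| expected to be below it).
    The threshold is an arbitrary illustrative value.
    """
    results = {}
    for name, expected in expectations.items():
        r = pearson(scores, others[name])
        strong = abs(r) >= threshold
        consistent = strong if expected == "convergent" else not strong
        results[name] = (r, consistent)
    return results


# Hypothetical data for six individuals (invented for illustration).
hrql = [30, 45, 50, 62, 70, 85]
others = {
    "other_hrql_measure": [28, 50, 47, 65, 72, 80],
    "years_of_education": [12, 16, 10, 14, 12, 14],
}
expectations = {
    "other_hrql_measure": "convergent",
    "years_of_education": "divergent",
}
results = check_expected_relationships(hrql, others, expectations)
for name, (r, ok) in results.items():
    print(name, round(r, 2), "consistent" if ok else "inconsistent")
```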
The process of gathering evidence of relationships that support or
refute particular interpretations of scores on an HRQL instrument
is an ongoing one, since no single relationship is normally
sufficient to conclusively establish the validity of a given
interpretation. Relevant intercorrelations and associations should
be provided by the developer and with each reported use of an
instrument, so that subsequent users can judge the validity of the
instrument in relation to the purpose for which they may wish
to use it.
For HRQL instruments, the assumption is frequently made that more
severe disease will, on average, be associated with diminished
quality of life and that treatment already known to reduce the
severity of a disease or its symptoms will be associated with
improved quality of life in a given population. Hence, one might
examine correlations between scores on an HRQL instrument and
variables representing other clinical indicators of disease
severity, such as supplementary oxygen requirement, FEV1, or
exercise tolerance. Or one might examine differences in HRQL scores
between treated and untreated subjects. The expectation would be
that, on average, scores on the HRQL instrument should correlate
with or show mean differences or changes in the expected direction
in concert with clinical markers of disease severity and with other
health status measures. However, some caution is in order.
Health-related quality of life, as commonly viewed, is not a function of
disease state or physiologic status alone: persons of the same
physiologic status or impairment may have radically different
functional status and HRQL according to their own perceptions.
Where other characteristics and circumstances affect HRQL, the
failure to find a close correlation between physiological and HRQL
measures does not necessarily undercut the validity of either the
measure of HRQL or of the physiologic parameter.
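The comparison of treated and untreated subjects described above can be sketched as a standardized mean difference. Cohen's d with a pooled standard deviation is used here as one conventional choice, not as a method prescribed by the text, and the scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b), pooled SD."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least 2 observations")
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical HRQL scores (higher = better), invented for illustration.
treated = [62, 70, 58, 75, 66]
untreated = [55, 60, 49, 64, 57]
d = cohens_d(treated, untreated)
print(round(d, 2))
```

A difference in the expected direction supports the interpretation only modestly. In keeping with the caution above, a small or absent difference would not by itself show that either the HRQL measure or the clinical comparison is at fault.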
Evidence based on relationships between scores and a
criterion
Among the variables whose relationship to instrument scores is
of particular interest are those that might be considered criterion
variables. Selection of a single criterion variable as a "gold
standard" is difficult, if not impossible, in the case of quality
of life instruments because no accepted "gold standard" exists. It
is difficult even to imagine what would constitute such a gold
standard, since the concept of HRQL is typically considered to be,
at least in part, subjective, reflecting the priorities and needs
of the individual whose quality of life is being considered and how
well those needs are being met.
Relationships between scores and external criteria can be divided
into two types: concurrent and predictive. Concurrent relationships
are examined
by correlating scores on a measurement instrument with other
variables obtained concurrently on the same individuals or by
examining the differences in scores, obtained concurrently, for
populations known to be similar or different in some relevant
respect. Predictive relationships are examined by administering an
instrument to a sample of individuals and waiting for some
subsequent "outcome" to occur (e.g., disease remission, death,
improvement in some physiologic parameter). The ability of the
scores on the instrument to predict the outcome argues for its
utility for that purpose. By itself, such a predictive relationship
is not evidence that the instrument is measuring the construct of
interest. However, failure to find an expected predictive
relationship may suggest either that the intended construct is
being poorly measured or that the hypothesis concerning the
relationship of the construct to the outcome is questionable. It is
the entire web of types of evidence discussed above that sheds
light on such issues.
Supposed evidence from casual inspection of the instrument: the
issue of "face validity"
It is at times suggested that the validity of an interpretation or
use of a measurement instrument can be judged by those who might
use it or those on whom it might be used, on the basis of whether
the content of the instrument appears to be relevant to the
construct to be measured. However, such "face validation" may be
quite unreliable and misleading. While it is understandable that
those examining an instrument would form an impression of whether
the items are appropriate to what they perceive to be the
purpose of the measurement, it is impossible to draw
legitimate conclusions regarding an instrument's reliability or
validity until other information has been gathered and considered.
An instrument that appears to casual observation to have good "face
validity" could, in fact, be relatively useless for many purposes.
An instrument that omits content some individuals consider
important may be quite appropriate for its intended uses and quite
relevant and appropriate to the developer's definition of the
construct to be measured. In any event, such impressions are not
likely to involve an adequate analysis of the rationale behind,
properties of, and available evidence regarding the intended uses
of the instrument. Casual judgements do not provide sound evidence
regarding validity. However, it is important to recognize that, if
persons to whom an instrument is administered perceive it to be
inappropriate or ambiguous, they may resist the measurement and may
even actively or passively refuse to continue to participate in the
study. The net effect of such resistance and refusals ranges from
poor-quality data on such persons to increased attrition rates.
Such perceptions thereby effectively undermine the validity of
interpretations of the instrument's scores in that setting.
|