Reliability
Measurement instruments are subject to measurement error, which
is generally viewed as random and thereby distinguished from
systematic errors of measurement. Scores on a measurement
instrument are considered to provide estimates of an underlying
value, traditionally termed the individual's "true score" (other
terms have also been used). While measurement error typically is
ignored for all practical purposes in making physical measurements
such as height and weight, it cannot be ignored in psychological
and behavioral measurement.
The reliability of a measurement instrument is the extent
to which it yields consistent, reproducible estimates of what is
assumed to be an underlying true score. A current approach
to estimating reliability, referred to as "generalizability
theory," allows for estimating the various sources of variation in
scores, including variation due to errors of measurement.
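For illustration only, the following is a minimal sketch of a one-facet
(persons by items) generalizability analysis, assuming a complete data
matrix with no missing responses; the function name, data layout, and
toy usage are illustrative and not drawn from any particular
instrument's documentation. Variance components for persons, items, and
residual error are estimated from the mean squares of a two-way
analysis of variance, and the generalizability coefficient for the
k-item composite follows directly.

    import numpy as np

    def g_study(scores):
        """One-facet G study for a complete persons-by-items design.

        scores : (n_persons, n_items) array of item responses.
        Returns the person, item, and residual variance components
        and the generalizability coefficient for the k-item composite.
        """
        n, k = scores.shape
        grand = scores.mean()
        ss_p = k * ((scores.mean(axis=1) - grand) ** 2).sum()
        ss_i = n * ((scores.mean(axis=0) - grand) ** 2).sum()
        ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_i

        ms_p = ss_p / (n - 1)
        ms_i = ss_i / (k - 1)
        ms_res = ss_res / ((n - 1) * (k - 1))

        var_res = ms_res                       # person-by-item (error) component
        var_p = max((ms_p - ms_res) / k, 0.0)  # "true score" (person) component
        var_i = max((ms_i - ms_res) / n, 0.0)  # item-difficulty component

        # Generalizability coefficient for the mean of k items: the share
        # of observed-score variance attributable to person differences.
        g = var_p / (var_p + var_res / k)
        return var_p, var_i, var_res, g

For this particular design the resulting coefficient coincides with the
internal consistency coefficient discussed later in this section; the
advantage of the generalizability framework is that additional facets
(occasions, raters) can be added and their error contributions
estimated separately.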
Reliability can be assessed using several experimental designs.
Each of these designs is suited to estimating the magnitude of
particular sources of measurement error. Three of the commonly used
designs are particularly relevant to determining the reliability of
HRQOL instruments.
Test-retest reliability refers to the extent to
which an instrument, such as an HRQOL measurement tool,
consistently yields the same scores on two successive occasions if
the quality of life of the individual to whom it is applied has not
changed in the intervening period, or, by inference, the extent to
which the measurement tool would give the same results when
administered to two different individuals if their HRQOL were the
same. A failure to give the same results on different occasions,
when that would be expected, suggests that some source of error
variance is affecting scores.
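As a concrete illustration, the sketch below computes a classical
test-retest coefficient as the Pearson correlation between total scores
obtained on two occasions; the scores are invented toy data, not
results from any actual instrument.

    import numpy as np

    # Total scores for six hypothetical respondents on two occasions
    # (toy data for illustration only).
    time1 = np.array([62.0, 48.0, 75.0, 55.0, 80.0, 67.0])
    time2 = np.array([60.0, 51.0, 73.0, 58.0, 78.0, 64.0])

    # The test-retest reliability coefficient is the Pearson
    # correlation between the two administrations.
    r = np.corrcoef(time1, time2)[0, 1]
    print(f"test-retest r = {r:.2f}")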
The test-retest design yields relevant information on instrument
reliability when there is reasonable assurance that the person
being assessed has not changed. It is not an appropriate
experimental design by which to evaluate instrument reliability
when the possibility exists that the process of making the first
measurement would have affected the second measurement
significantly (e.g., a recall effect) or when the characteristic
being measured may have been subjected to some intervening
influence or temporal fluctuation, or when there are subjective
elements in the scoring process (e.g., observer ratings). In these
cases, consideration of the intercorrelation of the items making up
the instrument is more appropriate.
Internal consistency reliability refers to the
degree of homogeneity of items in an instrument or scale—the extent
to which responses to the various components of the instrument
(i.e., its individual items or its subsections) correlate with one
another or with a score on the instrument as a whole (including or
excluding the item(s) in question). Substantial intercorrelation of
these elements is interpreted to mean the items or components are
measuring the same or closely related constructs in a reliable
manner. Low levels of correlation among items suggest that the
construct is not being measured reliably—that there are sources of
unexplained error in the measurement.
A high level of internal consistency is anticipated when all the
items making up an instrument are intended to measure a single,
unidimensional construct—when they can be viewed as a sample of
items drawn from the domain of all possible items assessing that
construct. If, however, an instrument has been designed to tap
multiple dimensions of a more complex construct, subsets of items
related to one dimension may be expected to correlate more highly
with each other than with items related to other dimensions. This
pattern of intercorrelations (often termed the "internal
statistical structure" of the instrument) is frequently of interest
in its own right as an indication that the instrument is measuring
the intended elements and as a means of better understanding their
interrelationships. If only a single score is calculated from an
instrument, the average intercorrelation of all the items needs to
be at an acceptable level (the definition of acceptable varies with
the intended use of the scores). If, however, scores on one or more
subscales are to be used, it is the average intercorrelation of the
items making up each subscale that must reach an acceptable
level.
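The text above does not name a particular coefficient; in practice,
internal consistency is most often summarized with Cronbach's alpha. A
minimal sketch follows, assuming a complete persons-by-items response
matrix of the kind described above; when subscale scores are to be
used, the same computation is applied to the columns of each subscale
separately.

    import numpy as np

    def cronbach_alpha(scores):
        """scores: (n_persons, n_items) array of item responses."""
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)      # variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of total score
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    # For a multidimensional instrument, compute alpha per subscale,
    # e.g. (physical_item_columns is a hypothetical index list):
    # alpha_physical = cronbach_alpha(scores[:, physical_item_columns])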
Split-halves reliability is estimated with another experimental design
commonly used to evaluate the reliability of a measurement instrument.
The strategy in this case is to assign the items randomly to two
"split halves" and to calculate the
intercorrelation of scores derived from each half. A high level of
correlation is taken as evidence that the items are consistently
measuring the same underlying construct.
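A minimal sketch of this design appears below, again assuming a
complete persons-by-items matrix. Because each half contains only half
the items, the half-length correlation is conventionally stepped up
with the Spearman-Brown formula to estimate the reliability of the
full-length instrument.

    import numpy as np

    def split_half_reliability(scores, seed=0):
        """scores: (n_persons, n_items) array of item responses."""
        n_items = scores.shape[1]
        order = np.random.default_rng(seed).permutation(n_items)
        half_a = scores[:, order[: n_items // 2]].sum(axis=1)
        half_b = scores[:, order[n_items // 2 :]].sum(axis=1)
        r_half = np.corrcoef(half_a, half_b)[0, 1]
        # Spearman-Brown step-up: reliability of the full-length test
        # from the correlation between its two halves.
        return 2.0 * r_half / (1.0 + r_half)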
Some measurement instruments require the person making the
measurement to render a judgement, such as concerning an
individual's functional ability in some area. When any aspect of
the measurement process or scoring of an instrument involves human
judgement and, hence, a degree of subjectivity, a third source of
measurement error must be considered, inter-rater or inter-observer
variability. This error source is not relevant for standardized
measurement instruments, which are not only administered under
precisely controlled circumstances but also scored objectively.
Unreliability of the measurements due to variation between
observers or raters is evaluated by considering the average
association, across subjects, between scores obtained from
different persons rating the same subject. If this is not done
(i.e., if reliability is determined using a design in which each
subject is assessed only by one observer or rater), an overly
optimistic estimate of measurement reliability will be obtained,
and, in most instances, the degree of overestimation is
substantial.
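The sketch below follows the description above literally, averaging
the pairwise correlations between raters across subjects; in practice
an intraclass correlation coefficient is often preferred for this
purpose, but the simple average conveys the idea. The ratings matrix
is toy data.

    import numpy as np
    from itertools import combinations

    # Rows are subjects, columns are raters (toy data for illustration).
    ratings = np.array([
        [4, 5, 4],
        [2, 2, 3],
        [5, 5, 4],
        [3, 2, 2],
        [4, 4, 5],
    ], dtype=float)

    # Average, across all rater pairs, of the correlation between the
    # two raters' scores over the same subjects.
    pairwise = [np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]
                for i, j in combinations(range(ratings.shape[1]), 2)]
    print(f"mean inter-rater r = {np.mean(pairwise):.2f}")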
Comparisons of the average scores of groups are also subject to another
source of random error, namely, the error associated with the sampling
of the persons who are tested. When an HRQOL instrument is used to make
comparisons between groups, estimates of reliability and
determination of the statistical significance of group differences
must take this additional source of error into account.
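For illustration, a minimal sketch of a group comparison that accounts
for this sampling error: a two-sample (Welch) t test on invented
scores for two hypothetical groups.

    import numpy as np
    from scipy import stats

    group_a = np.array([60.0, 72.0, 55.0, 68.0, 63.0, 70.0])
    group_b = np.array([58.0, 52.0, 61.0, 49.0, 57.0, 54.0])

    # Welch's t test; the standard error of the mean difference
    # reflects the sampling of persons into each group.
    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")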
The various sources of measurement error discussed above are
traditionally evaluated in terms of one or more types of reliability
coefficient. These coefficients are computed using data gathered by
means of various experimental designs and are referred to as, for
example, internal consistency, test-retest, and split-half
reliability coefficients. Such classical coefficients, or
appropriate coefficients derived from generalizability statistics,
should be provided by instrument developers. Certain minimal levels
of reliability are typically considered requisite to specific types
of uses—the more critical the decision, the higher the requirement
for reliability and precision of the measurement. The validity of a
particular interpretation of scores on a measurement instrument is
undermined by reliability that is inadequate for that
interpretation. On the other hand, an extremely high level of
internal consistency reliability may compromise the usefulness of a
behavioral or psychological measurement instrument for predictive
and other purposes, since it suggests that the instrument measures
only a single, very narrow construct and may fail to assess all of
the relevant, predictive dimensions.
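One standard way to translate a reliability coefficient into the
precision of an individual score is the classical standard error of
measurement, SEM = SD * sqrt(1 - reliability). The sketch below uses
invented numbers for the score distribution and the reliability
coefficient.

    import math

    sd_observed = 10.0   # standard deviation of observed scores (toy value)
    reliability = 0.85   # a reported reliability coefficient (toy value)

    # Classical standard error of measurement and an approximate 95%
    # band around one individual's observed score.
    sem = sd_observed * math.sqrt(1.0 - reliability)
    score = 62.0
    low, high = score - 1.96 * sem, score + 1.96 * sem
    print(f"SEM = {sem:.2f}; 95% band: {low:.1f} to {high:.1f}")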
The reliability of an instrument is not necessarily re-examined in
each study that uses the instrument, but instrument users have a
responsibility to review the available information to determine
whether adequate reliability has already been established for the
instrument in relation to their intended purpose. If it has not,
they bear the responsibility of obtaining and reporting relevant
information supporting their interpretation of scores in the new
situation.