Reliability

Measurement instruments are subject to measurement error, which is generally viewed as random and thereby distinguished from systematic errors of measurement. Scores on a measurement instrument are considered to provide estimates of an underlying value, traditionally termed the individual's "true score" (other terms have also been used). While measurement error typically is ignored for all practical purposes in making physical measurements such as height and weight, it cannot be ignored in psychological and behavioral measurement.
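
In classical test theory terms (a standard formalization consistent with, though not spelled out in, this discussion), an observed score is modeled as the sum of a true score and a random error component:

```latex
X = T + E, \qquad \mathbb{E}[E] = 0, \qquad \operatorname{Cov}(T, E) = 0
```

The assumptions that errors average to zero and are uncorrelated with true scores are what distinguish random measurement error from systematic error (bias), which shifts scores in a consistent direction.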

The reliability of a measurement instrument is the extent to which it yields consistent, reproducible estimates of what is assumed to be an underlying true score. A current approach to estimating reliability, referred to as "generalizability theory," allows for estimating the various sources of variation in scores, including variation due to errors of measurement. Reliability can be assessed using several experimental designs. Each of these designs is suited to estimating the magnitude of particular sources of measurement error. Three of the commonly used designs are particularly relevant to determining the reliability of HRQOL instruments.
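
Under this model, reliability has a standard algebraic expression (implicit in, though not given by, the text): the proportion of observed-score variance attributable to true-score variance rather than to error:

```latex
\rho_{XX'} \;=\; \frac{\sigma_T^2}{\sigma_X^2} \;=\; \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

Generalizability theory extends this idea by partitioning the error variance into separate components (e.g., occasions, items, raters), each of which can be estimated from an appropriate experimental design.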

Test-retest reliability refers to the extent to which an instrument, such as an HRQOL measurement tool, consistently yields the same scores on two successive occasions when the quality of life of the individual to whom it is applied has not changed in the intervening period. By inference, it also reflects the extent to which the measurement tool would give the same results when administered to two different individuals whose HRQOL is the same. A failure to give the same results on different occasions, when the same results would be expected, suggests that some source of error variance is affecting scores.
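
As a minimal sketch of how a test-retest coefficient might be computed, the following Python fragment correlates scores from two administrations. The data are hypothetical, and Pearson's r is only one common choice; an intraclass correlation is often preferred because it also penalizes systematic shifts between occasions.

```python
import numpy as np
from scipy import stats

# Hypothetical HRQOL scores for the same ten respondents on two occasions
# (values are illustrative only).
time1 = np.array([62, 75, 58, 80, 71, 66, 90, 55, 73, 68])
time2 = np.array([60, 78, 57, 82, 69, 65, 88, 58, 75, 70])

# Test-retest reliability summarized as the correlation between occasions.
r, p = stats.pearsonr(time1, time2)
print(f"test-retest r = {r:.2f}")
```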

The test-retest design yields relevant information on instrument reliability when there is reasonable assurance that the person being assessed has not changed. It is not an appropriate experimental design for evaluating instrument reliability when the process of making the first measurement may have significantly affected the second (e.g., a recall effect), when the characteristic being measured may have been subject to some intervening influence or temporal fluctuation, or when there are subjective elements in the scoring process (e.g., observer ratings). In these cases, consideration of the intercorrelation of the items making up the instrument is more appropriate.

Internal consistency reliability refers to the degree of homogeneity of items in an instrument or scale—the extent to which responses to the various components of the instrument (i.e., its individual items or its subsections) correlate with one another or with a score on the instrument as a whole (including or excluding the item(s) in question). Substantial intercorrelation of these elements is interpreted to mean the items or components are measuring the same or closely related constructs in a reliable manner. Low levels of correlation among items suggest that the construct is not being measured reliably—that there are sources of unexplained error in the measurement.
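
In practice, internal consistency is most often summarized with a coefficient such as Cronbach's alpha (the text does not name a specific coefficient). A minimal sketch, using hypothetical item responses:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an item matrix (rows = respondents, columns = items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-item scale answered by six respondents (illustrative values).
responses = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [3, 3, 3, 4, 3],
    [5, 5, 4, 5, 5],
    [1, 2, 1, 1, 2],
    [4, 4, 5, 4, 4],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```

Highly intercorrelated items drive the variance of the summed scores well above the sum of the individual item variances, pushing alpha toward 1.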

A high level of internal consistency is anticipated when all the items making up an instrument are intended to measure a single, unidimensional construct—when they can be viewed as a sample of items drawn from the domain of all possible items assessing that construct. If, however, an instrument has been designed to tap multiple dimensions of a more complex construct, subsets of items related to one dimension may be expected to correlate more highly with each other than with items related to other dimensions. This pattern of intercorrelations (often termed the "internal statistical structure" of the instrument) is frequently of interest in its own right as an indication that the instrument is measuring the intended elements and as a means of better understanding their interrelationships. If only a single score is calculated from an instrument, the average intercorrelation of all the items needs to be at an acceptable level (the definition of acceptable varies with the intended use of the scores). If, however, scores on one or more subscales are to be used, it is the average intercorrelation of the items making up each subscale that must reach an acceptable level.
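
The pattern described above can be checked directly by comparing average inter-item correlations within each subscale against correlations across subscales. A sketch with simulated two-dimensional data (the dimensions, loadings, and subscale labels are all hypothetical):

```python
import numpy as np

def mean_interitem_r(items: np.ndarray) -> float:
    """Mean off-diagonal entry of the item correlation matrix."""
    corr = np.corrcoef(items, rowvar=False)
    return corr[~np.eye(corr.shape[0], dtype=bool)].mean()

rng = np.random.default_rng(0)
n = 200
factor_a = rng.normal(size=(n, 1))  # latent dimension A
factor_b = rng.normal(size=(n, 1))  # latent dimension B
# Three noisy items loading on each hypothetical dimension.
data = np.hstack([factor_a + 0.5 * rng.normal(size=(n, 3)),
                  factor_b + 0.5 * rng.normal(size=(n, 3))])

for name, cols in (("A", slice(0, 3)), ("B", slice(3, 6))):
    print(f"subscale {name}: mean within-subscale r = "
          f"{mean_interitem_r(data[:, cols]):.2f}")
# Items from different dimensions should correlate much less.
cross = np.corrcoef(data, rowvar=False)[:3, 3:]
print(f"mean cross-subscale r = {cross.mean():.2f}")
```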

Split-half reliability refers to another experimental design commonly used to evaluate the reliability of a measurement instrument. The strategy in this case is to assign the items randomly to two "split halves" and calculate the intercorrelation of scores derived from each half. A high level of correlation is taken as evidence that the items are consistently measuring the same underlying construct.
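
A minimal sketch of the split-half design, with simulated item data; the Spearman-Brown adjustment shown at the end is a standard additional step (not described above) that corrects for each half being only half the length of the full instrument:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated instrument: ten items that all tap one latent construct.
latent = rng.normal(size=(30, 1))
items = latent + rng.normal(scale=0.8, size=(30, 10))

# Randomly assign items to two halves and score each half.
order = rng.permutation(items.shape[1])
half_a = items[:, order[:5]].sum(axis=1)
half_b = items[:, order[5:]].sum(axis=1)

r = np.corrcoef(half_a, half_b)[0, 1]
adjusted = 2 * r / (1 + r)  # Spearman-Brown estimate for the full-length test
print(f"half-to-half r = {r:.2f}, full-length estimate = {adjusted:.2f}")
```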

Some measurement instruments require the person making the measurement to render a judgment, such as one concerning an individual's functional ability in some area. When any aspect of the measurement process or scoring of an instrument involves human judgment and, hence, a degree of subjectivity, a third source of measurement error must be considered: inter-rater or inter-observer variability. This error source is not relevant for standardized measurement instruments, which not only are administered under precisely controlled circumstances but are scored objectively. Unreliability of the measurements due to variation between observers or raters is evaluated by considering the average association, across subjects, between scores obtained from different persons rating the same subject. If this is not done (i.e., if reliability is determined using a design in which each subject is assessed by only one observer or rater), an overly optimistic estimate of measurement reliability will be obtained, and, in most instances, the degree of overestimation is substantial.
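
One simple way to compute the "average association" across raters described above is to average the pairwise correlations between raters' scores over the same subjects; an intraclass correlation coefficient is a more complete summary, but this hypothetical sketch conveys the design:

```python
import numpy as np
from itertools import combinations

# Hypothetical ratings: rows = subjects, columns = three raters,
# each rating every subject (values illustrative only).
ratings = np.array([
    [3, 4, 3],
    [5, 5, 4],
    [2, 2, 2],
    [4, 3, 4],
    [1, 2, 1],
    [5, 4, 5],
])

pair_correlations = [
    np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]
    for i, j in combinations(range(ratings.shape[1]), 2)
]
print(f"mean inter-rater r = {np.mean(pair_correlations):.2f}")
```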

Comparison of average scores of groups is also subject to another source of random error, namely, error associated with the sampling of the persons who are tested. When an HRQOL instrument is used to make comparisons between groups, estimates of reliability and determination of the statistical significance of group differences must take this additional source of error into account.
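
For example, a comparison of group means typically uses a test whose standard error reflects person sampling; a minimal sketch with hypothetical scores and an independent-samples t-test (one common choice, not prescribed by the text):

```python
import numpy as np
from scipy import stats

# Hypothetical HRQOL scores for two independent groups (illustrative values).
group_a = np.array([64, 72, 58, 69, 75, 61, 70, 66])
group_b = np.array([59, 65, 55, 62, 68, 57, 63, 60])

# The t-test's standard error captures the sampling error of both group means.
t, p = stats.ttest_ind(group_a, group_b)
print(f"mean difference = {group_a.mean() - group_b.mean():.1f}, "
      f"t = {t:.2f}, p = {p:.3f}")
```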

The various sources of measurement error discussed above are traditionally evaluated in terms of one or more types of reliability coefficients. These coefficients are computed using data gathered by means of various experimental designs and are referred to as, for example, internal consistency, test-retest, and split-half reliability coefficients. Such classical coefficients, or appropriate coefficients derived from generalizability statistics, should be provided by instrument developers. Certain minimal levels of reliability are typically considered requisite for specific types of uses—the more critical the decision, the higher the requirement for the reliability and precision of the measurement. The validity of a particular interpretation of scores on a measurement instrument is undermined by reliability that is inadequate for that interpretation. On the other hand, an extremely high level of internal consistency reliability may compromise the usefulness of a behavioral or psychological measurement instrument for predictive and other purposes, since it suggests that the instrument measures only a single, very narrow construct and may fail to assess all of the relevant, predictive dimensions.
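
A textbook relation (not stated above) links a reliability coefficient to the precision of individual scores through the standard error of measurement:

```latex
\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}
```

where σ_X is the standard deviation of observed scores and ρ_XX' is the reliability coefficient; the higher the reliability, the smaller the SEM, and the tighter the confidence band that can be placed around an individual's score.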

The reliability of an instrument is not necessarily re-examined in each study using the instrument, but instrument users have a responsibility for reviewing the available information to determine whether adequate reliability has already been established for the instrument in relation to their intended purpose. If it has not, they bear the responsibility of obtaining and reporting relevant information supporting their interpretation of scores in the new situation.

