Overview
1 Screening and Test Evaluation Program, School of Public Health, University of Sydney, Sydney, New South Wales, Australia.
2 Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands.
a Address correspondence to this author at: Screening and Test Evaluation Program, School of Public Health, Building A27, University of Sydney, Sydney, New South Wales 2006, Australia. Fax 61-2-93515049; e-mail lesi{at}health.usyd.edu.au.
Abstract
Before a new test is introduced in clinical practice, its accuracy should be assessed. In the past decade, researchers have put an increased emphasis on exploring differences in test sensitivity and specificity between patient subgroups. If the reference standard is imperfect and the prevalence of the target condition differs among subgroups, apparent differences in test sensitivity and specificity between subgroups may be caused by reference standard misclassification. We provide guidance on how to determine whether observed differences may be explained by reference standard misclassification. Such misclassification may be ascertained by examining how the apparent sensitivity and specificity change with the prevalence of the target condition in the subgroups.
In a diagnostic accuracy study, the test under evaluation (index test) is applied to a series of individuals suspected to have the disease of interest (target condition). Test accuracy is assessed by comparing the results of the index test with the results of the reference standard in the same individuals. The reference standard is considered the best available method of assessing the target condition. Among the measures of diagnostic test accuracy are sensitivity, the probability of a positive test result in the presence of the target condition, and specificity, the probability of a negative test result in the absence of the target condition.
Estimates of test accuracy may differ between patient subgroups (1)(2)(3)(4). For example, the carcinoembryonic antigen test has a higher sensitivity (83%) for the diagnosis of colorectal cancer in patients with advanced metastatic disease (Dukes D) than in patients with localized disease (36% for Dukes A and B) (5). In this example, the differences are likely to be real, arising from differences in disease spectrum across Dukes stages. However, in other instances in which the reference standard is more prone to misclassification, there may be uncertainty about whether the observed differences are real. For example, in a study of the D-dimer test to detect deep vein thrombosis (DVT), differences in accuracy were observed between cancer patients (sensitivity, 98%; specificity, 48%) and patients without cancer (sensitivity, 93%; specificity, 64%) (6). Because of the lower specificity of the D-dimer test in cancer patients, additional testing may be required to determine the presence or absence of DVT in this subgroup. Thus, observed differences in test accuracy, if accepted as real, may have implications for the diagnostic management of the different subgroups.
Therefore, before accepting that test accuracy actually differs between subgroups, researchers should rule out the possibility that the observed differences are artefactual. In this report, we demonstrate how reference standard misclassification may cause apparent differences in accuracy between subgroups and provide guidance on how to determine whether observed differences might be explained by reference standard misclassification.
Reference standard misclassification
For diagnostic research purposes, disease diagnosis in patients suspected of a certain target condition must be performed according to a carefully chosen reference standard. Reference standards that perfectly differentiate between patients with and without the target condition are rare, however, and thus misclassification of some patients is often inevitable (7)(8). When the accuracy of an index test is determined by comparison with an imperfect reference standard, some target condition misclassification will be introduced (9)(10)(11)(12).
Misclassification of the target condition by the reference standard will tend to result in underestimation of the accuracy of the index test. Underestimation of the sensitivity of the index test is most likely when the prevalence of the target condition is low, and the estimated sensitivity will be closer to the true sensitivity with increasing prevalence. Underestimation of the specificity will occur most when the prevalence of the target condition is high, and the estimated specificity will be closer to the true specificity when the prevalence of the target condition is low (Fig. 1).
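The pattern can be written out explicitly under one simplifying assumption: that the index test and the reference standard err independently of each other given the true disease status. The notation below is introduced here for illustration and is not taken from the original report: p is the true prevalence, Se and Sp are the true sensitivity and specificity of the index test, and Se_R and Sp_R are those of the reference standard.

```latex
\begin{align}
\mathrm{Se}_{\mathrm{obs}}(p) &=
  \frac{p\,\mathrm{Se}\,\mathrm{Se}_R + (1-p)(1-\mathrm{Sp})(1-\mathrm{Sp}_R)}
       {p\,\mathrm{Se}_R + (1-p)(1-\mathrm{Sp}_R)} \\[4pt]
\mathrm{Sp}_{\mathrm{obs}}(p) &=
  \frac{(1-p)\,\mathrm{Sp}\,\mathrm{Sp}_R + p\,(1-\mathrm{Se})(1-\mathrm{Se}_R)}
       {(1-p)\,\mathrm{Sp}_R + p\,(1-\mathrm{Se}_R)}
\end{align}
```

Under this assumption the observed sensitivity approaches Se as p approaches 1, and the observed specificity approaches Sp as p approaches 0, which matches the direction of the effects described above.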
The pattern of sensitivity and specificity changing with prevalence of the target condition can be illustrated with theoretical data on the accuracy of an index test among patient subgroups with low, middle, and high prevalence of a target condition (Table 1). In this hypothetical example, the sensitivity and specificity of the index test are 80% and 70%, as assessed by comparison with a perfect reference standard. The sensitivity and specificity of the imperfect reference standard are both 90%. When the prevalence of the target condition is 10%, the observed sensitivity and specificity are 55% and 69%. When the prevalence of the target condition is 50%, the sensitivity increases to 75% and the specificity decreases to 65%. When the prevalence of the target condition is 80%, the sensitivity increases further to 79% and the specificity decreases to 55%. The example in Table 1 also shows that reference standard imperfections attenuate the real difference in prevalence between subgroups, making it more difficult to detect statistically significant differences between subgroups or settings. However, we are still able to identify the subgroup with the higher or lower prevalence of the target condition.
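The figures in this hypothetical example can be recomputed with a few lines of Python. This is a minimal sketch that assumes the index test and the reference standard err independently given true disease status; the function name `observed_accuracy` and its parameters are ours, chosen for illustration.

```python
def observed_accuracy(prevalence, se_index, sp_index, se_ref, sp_ref):
    """Apparent accuracy of an index test judged against an imperfect reference.

    Assumes index test and reference standard errors are independent given
    true disease status. Returns (observed sensitivity, observed specificity,
    observed prevalence), all as proportions.
    """
    p = prevalence
    # Proportion classified as diseased by the reference standard.
    ref_pos = p * se_ref + (1 - p) * (1 - sp_ref)
    ref_neg = 1 - ref_pos
    # Proportion positive on both the index test and the reference standard.
    both_pos = p * se_index * se_ref + (1 - p) * (1 - sp_index) * (1 - sp_ref)
    # Proportion negative on both the index test and the reference standard.
    both_neg = p * (1 - se_index) * (1 - se_ref) + (1 - p) * sp_index * sp_ref
    return both_pos / ref_pos, both_neg / ref_neg, ref_pos


# Index test 80%/70%, reference standard 90%/90%, as in the example above.
for p in (0.10, 0.50, 0.80):
    se_obs, sp_obs, prev_obs = observed_accuracy(p, 0.80, 0.70, 0.90, 0.90)
    print(f"true prevalence {p:.0%}: observed sensitivity {se_obs:.0%}, "
          f"observed specificity {sp_obs:.0%}, observed prevalence {prev_obs:.0%}")
```

With these inputs the observed prevalence comes out at 18%, 50%, and 74% for true prevalences of 10%, 50%, and 80%, so the true range is compressed while the ordering of the subgroups is preserved.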
Therefore, before accepting differences in sensitivity and specificity between subgroups as real, one should explore whether these differences are compatible with the pattern of sensitivity and specificity changing with the prevalence of the target condition in the presence of reference standard misclassification. We will elaborate on 2 situations that may be encountered, one in which observed differences in test accuracy between subgroups can be explained by reference standard misclassification, and another in which observed differences in test accuracy between subgroups cannot be explained by reference standard misclassification.
Clinical examples of subgroup variability
To illustrate a situation in which observed differences in test accuracy between subgroups are attributable to reference standard misclassification, we use the previously mentioned example in which differences in test accuracy between subgroups were observed in the evaluation of the D-dimer test to diagnose DVT (6). The D-dimer test in patients with cancer had a sensitivity and specificity of 98% and 48%, whereas in patients without cancer the sensitivity and specificity were 93% and 64%. The prevalence of DVT, as determined with repeated compression ultrasound and follow-up, was higher in cancer patients (38%) than in noncancer patients (21%) (Table 2). The sensitivity was higher and specificity lower in the group with the higher prevalence (cancer patients). This difference is in the direction expected from the effect of reference standard misclassification, as illustrated in Fig. 1. Hence, the observed differences in test accuracy can, at least in part, be explained by reference standard misclassification. Another example is the use of a dipstick test to diagnose urinary tract infection (UTI) (13). On the basis of clinical signs and symptoms, patients suspected of UTI were divided in subgroups with high and low pretest probability of UTI. The dipstick test in patients with a high pretest probability had a sensitivity of 92% and a specificity of 42%, whereas in the low pretest probability group the sensitivity and specificity were 56% and 78%. The prevalence of UTI, as determined with urine culture, was 50% in the high pretest probability group and 7% in the low pretest probability group (Table 2).
The alternative situation, in which observed differences in test accuracy between subgroups are not compatible with reference standard misclassification, can be illustrated with data derived from another study to assess the diagnostic accuracy of the D-dimer test to diagnose DVT (14). The sensitivity and specificity of the D-dimer test were 96% and 45% for patients who had a previous thromboembolism and 99% and 30% for patients who had not. The prevalence of DVT was 42% for patients who had a previous thromboembolism and 35% for the reference group. In this case, the sensitivity is higher and specificity is lower in the group with the lower prevalence (no previous thromboembolism). Therefore, the observed differences in test accuracy between the two subgroups cannot be explained by reference standard misclassification alone.
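The direction check applied to these clinical examples can be sketched in Python. The `Subgroup` class and the function name are hypothetical helpers introduced for illustration; the check only compares directions and says nothing about the size or statistical significance of the differences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subgroup:
    name: str
    prevalence: float   # observed prevalence by the reference standard, in %
    sensitivity: float  # observed sensitivity, in %
    specificity: float  # observed specificity, in %


def compatible_with_misclassification(a: Subgroup, b: Subgroup) -> bool:
    """True if the higher-prevalence subgroup has the higher (or equal)
    sensitivity and the lower (or equal) specificity, i.e. the direction
    expected from reference standard misclassification."""
    if a.prevalence == b.prevalence:
        raise ValueError("subgroups must differ in prevalence")
    high, low = (a, b) if a.prevalence > b.prevalence else (b, a)
    return (high.sensitivity >= low.sensitivity
            and high.specificity <= low.specificity)


# Values quoted in the text for the three examples.
examples = [
    (Subgroup("D-dimer, cancer", 38, 98, 48),
     Subgroup("D-dimer, no cancer", 21, 93, 64)),
    (Subgroup("dipstick, high pretest probability", 50, 92, 42),
     Subgroup("dipstick, low pretest probability", 7, 56, 78)),
    (Subgroup("D-dimer, previous thromboembolism", 42, 96, 45),
     Subgroup("D-dimer, no previous thromboembolism", 35, 99, 30)),
]
for first, second in examples:
    verdict = compatible_with_misclassification(first, second)
    print(f"{first.name} vs {second.name}: compatible = {verdict}")
```

A result of `True` means only that misclassification could explain the differences, at least in part; it does not rule out a real difference in accuracy as well.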
Apparent similarities in sensitivity and specificity between subgroups may also occur when the prevalence of the target condition differs between these subgroups. In this case, reference standard misclassification may be hiding real differences in sensitivity and specificity between subgroups.
The values of the sensitivity and specificity of the index test compared to a perfect reference standard can also be estimated. To do so, additional information is needed about the accuracy of the applied reference standard (see Table 1 in the Data Supplement that accompanies the online version of this review at www.clinchem.org/content/vol53/issue10) (15). For a valid calculation, the index test and reference standard are assumed to be uncorrelated, i.e., they do not tend to err on the same patients. If the index test and reference standard tend to err on the same patients, the correction formula does not apply, but the previously described approach to judge whether observed differences can be explained by reference standard misclassification still applies. The observed test accuracy will be underestimated to a lesser extent because the curves of sensitivity and specificity plotted against prevalence of the target condition (Fig. 1) may move beyond their true values, but potential differences in test accuracy between subgroups can still be observed.
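One way to carry out such a correction is to invert the relationship between true and observed accuracy algebraically. The sketch below is an illustration only: it assumes conditional independence of index test and reference standard errors and known reference standard accuracy, and it is not necessarily the formula given in the Data Supplement. The function name and argument names are ours.

```python
def corrected_accuracy(a, b, c, d, se_ref, sp_ref):
    """Estimate true index test accuracy from an observed 2x2 table.

    a: index+/reference+   b: index+/reference-
    c: index-/reference+   d: index-/reference-   (counts or proportions)
    se_ref, sp_ref: assumed-known accuracy of the reference standard.
    Assumes index test and reference standard err independently given true
    disease status. Returns (sensitivity, specificity, prevalence).
    """
    n = a + b + c + d
    a, b, c, d = a / n, b / n, c / n, d / n
    youden = se_ref + sp_ref - 1
    if youden <= 0:
        raise ValueError("reference standard must be better than chance")
    # True prevalence recovered from the proportion of reference positives.
    prevalence = ((a + c) - (1 - sp_ref)) / youden
    if not 0 < prevalence < 1:
        raise ValueError("inputs imply a prevalence outside (0, 1)")
    # Proportion truly diseased AND index positive, divided by prevalence.
    sensitivity = (a * sp_ref - b * (1 - sp_ref)) / youden / prevalence
    # Proportion truly nondiseased AND index negative, divided by 1 - prevalence.
    specificity = (d * se_ref - c * (1 - se_ref)) / youden / (1 - prevalence)
    return sensitivity, specificity, prevalence


# Cell proportions implied by the hypothetical 10%-prevalence subgroup
# (index test 80%/70%, reference standard 90%/90%).
se, sp, prev = corrected_accuracy(0.099, 0.251, 0.081, 0.569, 0.90, 0.90)
print(f"corrected sensitivity {se:.2f}, specificity {sp:.2f}, prevalence {prev:.2f}")
```

With real data, sampling error can push the corrected estimates outside the range 0 to 1, and the result is only as good as the assumed accuracy of the reference standard.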
Recommendation
Evaluation of the accuracy of a diagnostic test across patient subgroups should include exploration of whether the sensitivity and specificity change with prevalence of the target condition in the direction that is compatible with the typical pattern of these measures in the presence of reference standard misclassification. Addressing the issues outlined in Fig. 2 will help to assess whether observed differences in test accuracy between subgroups are the result of reference standard misclassification.
Acknowledgments
Grant/funding support: This work was funded under a program grant (number 402764) from the National Health and Medical Research Council of Australia.
Financial disclosures: None declared.
Acknowledgments: We thank Drs. Petra Macaskill and Mike Jones for their comments on drafts of this manuscript.
Footnotes
1 Corné Biesheuvel is currently employed at the Children's Hospital at Westmead.
References