Clinical Chemistry 53: 1725-1729, 2007; 10.1373/clinchem.2007.087403
© 2007 American Association for Clinical Chemistry, Inc.


Overview

Observed Differences in Diagnostic Test Accuracy between Patient Subgroups: Is It Real or Due to Reference Standard Misclassification?

Corné Biesheuvel1,1, Les Irwig1,a and Patrick Bossuyt2

1 Screening and Test Evaluation Program, School of Public Health, University of Sydney, Sydney, New South Wales, Australia.
2 Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands.

a Address correspondence to this author at: Screening and Test Evaluation Program, School of Public Health, Building A27, University of Sydney, Sydney, New South Wales 2006, Australia. Fax 61-2-93515049; e-mail lesi@health.usyd.edu.au.


Abstract

Before a new test is introduced in clinical practice, its accuracy should be assessed. In the past decade, researchers have put an increased emphasis on exploring differences in test sensitivity and specificity between patient subgroups. If the reference standard is imperfect and the prevalence of the target condition differs among subgroups, apparent differences in test sensitivity and specificity between subgroups may be caused by reference standard misclassification. We provide guidance on how to determine whether observed differences may be explained by reference standard misclassification. Such misclassification may be ascertained by examining how the apparent sensitivity and specificity change with the prevalence of the target condition in the subgroups.

In a diagnostic accuracy study, the test under evaluation (index test) is applied to a series of individuals suspected to have the disease of interest (target condition). Test accuracy is assessed by comparing the results of the index test with the results of the reference standard in the same individuals. The reference standard is considered the best available method of assessing the target condition. Among the measures of diagnostic test accuracy are sensitivity, the probability of a positive test result in the presence of the target condition, and specificity, the probability of a negative test result in the absence of the target condition.

Estimates of test accuracy may differ between patient subgroups (1)(2)(3)(4). For example, the carcinoembryonic antigen test has a higher sensitivity (83%) for the diagnosis of colorectal cancer in patients with advanced metastatic disease (Dukes D) than in patients with localized disease (36% for Dukes A and B) (5). In this example, the differences are likely to be real, arising from the differing spectrum of Dukes classification. However, in other instances in which the reference standard is prone to more misclassification, there may be uncertainty about whether the observed differences are real. For example, in a study of the D-dimer test to detect deep vein thrombosis (DVT), differences in accuracy were observed between cancer patients (sensitivity, 98%; specificity, 48%) and patients without cancer (sensitivity, 93%; specificity, 64%) (6). Because of the lower specificity of the D-dimer test in cancer patients, additional testing may be required to determine the presence or absence of DVT in this subgroup. Thus, observed differences in test accuracy, if accepted as real, may have implications for diagnostic management of the different subgroups.

Therefore, before accepting that test accuracy actually differs between subgroups, researchers should rule out the possibility that the observed differences are artefactual. In this report, we demonstrate how reference standard misclassification may cause apparent differences in accuracy between subgroups and provide guidance on how to determine whether observed differences might be explained by reference standard misclassification.


Reference Standard Misclassification

For diagnostic research purposes, disease diagnosis in patients suspected of a certain target condition must be performed according to a carefully chosen reference standard. Reference standards that perfectly differentiate between patients with and without the target condition are rare, however, and thus misclassification of some patients is often inevitable (7)(8). When the accuracy of an index test is determined by comparison with an imperfect reference standard, some target condition misclassification will be introduced (9)(10)(11)(12).

Misclassification of the target condition by the reference standard will tend to result in underestimation of the accuracy of the index test. Underestimation of the sensitivity of the index test is most likely when the prevalence of the target condition is low, and the estimated sensitivity will be closer to the true sensitivity with increasing prevalence. Underestimation of the specificity will occur most when the prevalence of the target condition is high, and the estimated specificity will be closer to the true specificity when the prevalence of the target condition is low (Fig. 1).


Figure 1. Sensitivity will be underestimated most when the prevalence of the target condition is low, and specificity will be underestimated most when the prevalence of the target condition is high.
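
The pattern in Fig. 1 can be written out explicitly. The expressions below are a sketch, not taken from the article: they assume that the index test and the reference standard err independently given the true disease status, and the symbols are introduced here for illustration ($p$ is the true prevalence, $Se_T$ and $Sp_T$ are the true sensitivity and specificity of the index test, $Se_R$ and $Sp_R$ are those of the reference standard).

```latex
% Apparent accuracy of the index test against an imperfect reference standard,
% assuming conditionally independent errors (an assumption of this sketch).
\[
Se_{\mathrm{obs}} =
\frac{p\,Se_R\,Se_T + (1-p)(1-Sp_R)(1-Sp_T)}
     {p\,Se_R + (1-p)(1-Sp_R)},
\qquad
Sp_{\mathrm{obs}} =
\frac{(1-p)\,Sp_R\,Sp_T + p\,(1-Se_R)(1-Se_T)}
     {(1-p)\,Sp_R + p\,(1-Se_R)} .
\]
% Limits: as p -> 1, Se_obs -> Se_T; as p -> 0, Sp_obs -> Sp_T.
```

Under these assumptions, the apparent sensitivity approaches the true sensitivity as $p$ approaches 1, and the apparent specificity approaches the true specificity as $p$ approaches 0, which matches the direction of the curves described in the caption.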

The pattern of sensitivity and specificity changing with prevalence of the target condition can be illustrated with theoretical data on the accuracy of an index test among patient subgroups with low, middle, and high prevalence of a target condition (Table 1). In this hypothetical example, the sensitivity and specificity of the index test are 80% and 70%, as assessed by comparison with a perfect reference standard. The sensitivity and specificity of the imperfect reference standard are both 90%. When the prevalence of the target condition is 10%, the observed sensitivity and specificity are 55% and 69%. When the prevalence of the target condition is 50%, the sensitivity increases to 75% and the specificity decreases to 65%. When the prevalence of the target condition is 80%, the sensitivity increases further to 79% and the specificity decreases to 55%. The example in Table 1 also shows that reference standard imperfections attenuate the real difference in prevalence between subgroups, making it more difficult to detect statistically significant differences between subgroups or settings. However, we are still able to identify the subgroup with the higher or lower prevalence of the target condition.


Table 1. Theoretical example of differences in observed sensitivity and specificity of the index test between patient subgroups with different prevalence of the target condition (10%, 50%, and 80%). The sensitivity and specificity of the imperfect reference standard are both 90%.
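
The figures quoted for this hypothetical example can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the index test and the reference standard err independently given the true disease status; the function name `apparent_accuracy` and its parameter names are illustrative and do not come from the article.

```python
def apparent_accuracy(prevalence, se_index, sp_index, se_ref, sp_ref):
    """Return (observed prevalence, apparent sensitivity, apparent specificity)
    of an index test judged against an imperfect reference standard.

    Assumption of this sketch: index test and reference standard errors are
    independent given the true disease status.
    """
    p = prevalence
    # Probability that the reference standard is positive (observed prevalence)
    ref_pos = p * se_ref + (1 - p) * (1 - sp_ref)
    # Probability that reference and index test are both positive
    both_pos = p * se_ref * se_index + (1 - p) * (1 - sp_ref) * (1 - sp_index)
    # Probability that reference and index test are both negative
    both_neg = p * (1 - se_ref) * (1 - se_index) + (1 - p) * sp_ref * sp_index
    return ref_pos, both_pos / ref_pos, both_neg / (1 - ref_pos)


# Index test 80%/70%, reference standard 90%/90%, as in the hypothetical example
for prevalence in (0.10, 0.50, 0.80):
    obs_prev, obs_se, obs_sp = apparent_accuracy(prevalence, 0.80, 0.70, 0.90, 0.90)
    print(f"true prevalence {prevalence:.0%}: observed prevalence {obs_prev:.0%}, "
          f"sensitivity {obs_se:.0%}, specificity {obs_sp:.0%}")
```

Under these assumptions the sketch gives apparent sensitivities of 55%, 75%, and 79% and apparent specificities of 69%, 65%, and 55%, in line with the values quoted in the text. The observed prevalences come out at 18%, 50%, and 74%, a narrower range than the true 10% to 80%, which illustrates the attenuation of prevalence differences.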

Therefore, before accepting differences in sensitivity and specificity between subgroups as real, one should explore whether these differences are compatible with the pattern of sensitivity and specificity changing with the prevalence of the target condition in the presence of reference standard misclassification. We will elaborate on 2 situations that may be encountered, one in which observed differences in test accuracy between subgroups can be explained by reference standard misclassification, and another in which observed differences in test accuracy between subgroups cannot be explained by reference standard misclassification.


Clinical Examples of Subgroup Variability

To illustrate a situation in which observed differences in test accuracy between subgroups are attributable to reference standard misclassification, we use the previously mentioned example in which differences in test accuracy between subgroups were observed in the evaluation of the D-dimer test to diagnose DVT (6). The D-dimer test in patients with cancer had a sensitivity and specificity of 98% and 48%, whereas in patients without cancer the sensitivity and specificity were 93% and 64%. The prevalence of DVT, as determined with repeated compression ultrasound and follow-up, was higher in cancer patients (38%) than in noncancer patients (21%) (Table 2). The sensitivity was higher and specificity lower in the group with the higher prevalence (cancer patients). This difference is in the direction expected from the effect of reference standard misclassification, as illustrated in Fig. 1. Hence, the observed differences in test accuracy can, at least in part, be explained by reference standard misclassification.

Another example is the use of a dipstick test to diagnose urinary tract infection (UTI) (13). On the basis of clinical signs and symptoms, patients suspected of UTI were divided into subgroups with high and low pretest probability of UTI. The dipstick test in patients with a high pretest probability had a sensitivity of 92% and a specificity of 42%, whereas in the low pretest probability group the sensitivity and specificity were 56% and 78%. The prevalence of UTI, as determined with urine culture, was 50% in the high pretest probability group and 7% in the low pretest probability group (Table 2). Here, too, the sensitivity was higher and the specificity lower in the subgroup with the higher prevalence, the direction expected under reference standard misclassification.


Table 2. Examples of studies with observed differences in test accuracy in subgroups that may be explained by reference standard misclassification.

The alternative situation, in which observed differences in test accuracy between subgroups are not compatible with reference standard misclassification, can be illustrated with data derived from another study to assess the diagnostic accuracy of the D-dimer test to diagnose DVT (14). The sensitivity and specificity of the D-dimer test were 96% and 45% for patients who had a previous thromboembolism and 99% and 30% for patients who had not. The prevalence of DVT was 42% for patients who had a previous thromboembolism and 35% for the reference group. In this case, the sensitivity is higher and specificity is lower in the group with the lower prevalence (no previous thromboembolism). Therefore, the observed differences in test accuracy between the 2 subgroups cannot be explained by reference standard misclassification alone.
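
The direction check applied in the clinical examples can be expressed as a small helper. This is an illustrative sketch only: the names `Subgroup` and `direction_check` are hypothetical, the check looks at the direction of the differences and not at their size or statistical significance, and the inputs are the percentages quoted in the text.

```python
from typing import NamedTuple


class Subgroup(NamedTuple):
    name: str
    prevalence: float   # prevalence of the target condition by the reference standard
    sensitivity: float  # observed sensitivity of the index test
    specificity: float  # observed specificity of the index test


def direction_check(a: Subgroup, b: Subgroup) -> dict:
    """Report, per measure, whether the difference between two subgroups runs in
    the direction expected under reference standard misclassification: higher
    sensitivity and lower specificity in the subgroup with higher prevalence."""
    if a.prevalence == b.prevalence:
        raise ValueError("the check needs subgroups with different prevalence")
    high, low = (a, b) if a.prevalence > b.prevalence else (b, a)
    return {
        "sensitivity": high.sensitivity >= low.sensitivity,
        "specificity": high.specificity <= low.specificity,
    }


# D-dimer for DVT, cancer vs no cancer (6)
print(direction_check(Subgroup("cancer", 0.38, 0.98, 0.48),
                      Subgroup("no cancer", 0.21, 0.93, 0.64)))
# Dipstick for UTI, high vs low pretest probability (13)
print(direction_check(Subgroup("high pretest", 0.50, 0.92, 0.42),
                      Subgroup("low pretest", 0.07, 0.56, 0.78)))
# D-dimer for DVT, previous thromboembolism vs none (14)
print(direction_check(Subgroup("previous thromboembolism", 0.42, 0.96, 0.45),
                      Subgroup("no previous thromboembolism", 0.35, 0.99, 0.30)))
```

For the first two comparisons both measures move in the expected direction. For the third, neither does, consistent with the conclusion that misclassification alone cannot explain the difference.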

Apparent similarities in sensitivity and specificity between subgroups may also occur when the prevalence of the target condition differs between these subgroups. In this case, reference standard misclassification may be hiding real differences in sensitivity and specificity between subgroups.

The values of the sensitivity and specificity of the index test compared with a perfect reference standard can also be estimated. To do so, additional information is needed about the accuracy of the applied reference standard (see Table 1 in the Data Supplement that accompanies the online version of this review at www.clinchem.org/content/vol53/issue10) (15). For a valid calculation, the index test and reference standard are assumed to be uncorrelated, i.e., they do not tend to err on the same patients. If the index test and reference standard tend to err on the same patients, the correction formula does not apply, but the previously described approach to judge whether observed differences can be explained by reference standard misclassification still applies. The observed test accuracy will be underestimated to a lesser extent because the curves of sensitivity and specificity plotted against prevalence of the target condition (Fig. 1) may move beyond their true values, but potential differences in test accuracy between subgroups can still be observed.
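
One way such a correction can be carried out is sketched below. It is not necessarily the formula given in the Data Supplement, which is not reproduced here. It is an algebraic inversion of the conditional-independence model, and it assumes that the sensitivity and specificity of the reference standard are known and that the 2 tests do not tend to err on the same patients. The function name `corrected_accuracy` and its parameters are illustrative.

```python
def corrected_accuracy(obs_prev, obs_se, obs_sp, se_ref, sp_ref):
    """Estimate (true prevalence, index sensitivity, index specificity) from
    values observed against an imperfect reference standard.

    Assumptions of this sketch: se_ref and sp_ref are known, and index test and
    reference standard errors are independent given the true disease status.
    """
    youden_ref = se_ref + sp_ref - 1
    if youden_ref <= 0:
        raise ValueError("reference standard must be better than chance")
    # True prevalence recovered from the observed (reference-positive) prevalence
    prevalence = (obs_prev + sp_ref - 1) / youden_ref
    if not 0 < prevalence < 1:
        raise ValueError("inputs are not compatible with the assumed model")
    # Observed probabilities: both tests positive, and index test positive
    both_pos = obs_prev * obs_se
    index_pos = both_pos + (1 - obs_prev) * (1 - obs_sp)
    # Probability of being truly diseased and index-positive
    diseased_index_pos = (both_pos - (1 - sp_ref) * index_pos) / youden_ref
    # Probability of being truly nondiseased and index-positive
    healthy_index_pos = index_pos - diseased_index_pos
    sensitivity = diseased_index_pos / prevalence
    specificity = 1 - healthy_index_pos / (1 - prevalence)
    return prevalence, sensitivity, specificity


# Low-prevalence subgroup of the hypothetical example: observed sensitivity 55%
# and specificity 69%. The observed prevalence of 18% is not quoted in the text;
# it follows from a true prevalence of 10% and a 90%/90% reference standard.
prev, se, sp = corrected_accuracy(0.18, 0.55, 0.69, 0.90, 0.90)
print(f"prevalence {prev:.2f}, sensitivity {se:.2f}, specificity {sp:.2f}")
```

With these rounded inputs the estimates come out close to the true values of the hypothetical example (prevalence 10%, sensitivity 80%, specificity 70%). In practice the result is only as good as the assumed accuracy of the reference standard.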


Recommendation

Evaluation of the accuracy of a diagnostic test across patient subgroups should include exploration of whether the sensitivity and specificity change with prevalence of the target condition in the direction that is compatible with the typical pattern of these measures in the presence of reference standard misclassification. Addressing the issues listed in Fig. 2 will be helpful to assess whether observed differences in test accuracy between subgroups are the result of reference standard misclassification.


Figure 2. Issues one should address to determine whether observed differences in test accuracy between subgroups may be explained by reference standard misclassification (RMS).


Acknowledgments

Grant/funding support: This work was funded under a program grant (number 402764) from the National Health and Medical Research Council of Australia.

Financial disclosures: None declared.

Acknowledgments: We thank Drs. Petra Macaskill and Mike Jones for their comments on drafts of this manuscript.


Footnotes

1 Corné Biesheuvel is currently employed at the Children’s Hospital at Westmead.


References

  1. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ 2003;326:41-44.
  2. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem 2003;49:7-18.
  3. Irwig LM, Bossuyt PM, Glasziou PP, Gatsonis CA, Lijmer J. Designing studies to ensure that estimates of test accuracy are transferable. BMJ 2002;324:669-671.
  4. Mower WR. Evaluating bias and variability in diagnostic test reports. Ann Emerg Med 1999;33:85-91.
  5. Fletcher RH. Carcinoembryonic antigen. Ann Intern Med 1986;104:66-73.
  6. ten Wolde M, Kraaijenhagen RA, Prins MH, Buller HR. The clinical usefulness of D-dimer testing in cancer patients with suspected deep venous thrombosis. Arch Intern Med 2002;162:1880-1884.
  7. Begg CB. Biases in the assessment of diagnostic tests. Stat Med 1987;6:411-423.
  8. Knottnerus JA, van Weel C, Muris JW. Evaluation of diagnostic procedures. BMJ 2002;324:477-480.
  9. Buck AA, Gart JJ. Comparison of a screening test and a reference test in epidemiologic studies. I. Indices of agreement and their relation to prevalence. Am J Epidemiol 1966;83:586-592.
  10. Gart JJ, Buck AA. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. Am J Epidemiol 1966;83:593-602.
  11. Boyko EJ, Alderman BW, Baron AE. Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J Gen Intern Med 1988;3:476-481.
  12. Valenstein PN. Evaluating diagnostic tests with imperfect standards. Am J Clin Pathol 1990;93:252-258.
  13. Lachs MS, Nachamkin I, Edelstein PH, Goldman J, Feinstein AR, Schwartz JS. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med 1992;117:135-140.
  14. Schutgens RE, Ackermark P, Haas FJ, Nieuwenhuis HK, Peltenburg HG, Pijlman AH, et al. Combination of a normal D-dimer concentration and a non-high pretest clinical probability score is a safe strategy to exclude deep venous thrombosis. Circulation 2003;107:593-597.
  15. Kelsey JL, Whittemore AS, Evans AS, Thompson WD. Methods in Observational Epidemiology. New York: Oxford University Press; 1986.