a Author for correspondence. Fax 518-473-2900; e-mail jenny{at}wadsworth.org
| Abstract |
|---|
Methods: We used participant data from the New York State Department of Health PT program to characterize the quality of testing in the toxicology specialty. Outcomes from laboratory investigations into causes of unsatisfactory performance (UNSAT) and information on quality control practices collected from all program participants were used to identify the root causes of error.
Results: Two classes of error were encountered: spurious test results caused by lapses in standard operating procedures and instrument malfunctions (300 per million assays) and common-cause analytic error (7000 per million assays or 0.7% rate of UNSAT). Causes of spurious results included inaccurate mathematical correction for specimen dilution, misinterpretation of instrument codes, and instrument sampling errors. Calibration drift was most frequently cited as the common-cause analytic error. Approximately one-half of the laboratories used an allowable error for the quality control of analytical systems that exceeded the threshold error specified by manufacturers for stable instrument performance.
Conclusions: The causes of spurious results suggest the need for ongoing competency testing of analysts where analyst intervention is required in an otherwise automated process, and for continued diligence in mistake-proofing instrument design. The intrinsic quality of laboratory testing is unlikely to improve until the allowable error in quality control is consistent with manufacturer specifications for stable system performance.
| Introduction |
|---|
The interlaboratory perspective of the PT provider affords opportunities to identify root causes of error that may be systemic among laboratories that use similar analytical systems or processes. Outcomes of investigations into reasons for PT failures can be used by the laboratory, by device manufacturers, and by the PT program itself in the continuous improvement of their respective products (2)(3)(4).
We used participant data from the New York State Department of Health (NYSDOH) PT program to characterize the quality of testing in the toxicology specialty. A frequent observation in PT is spurious results, not unlike those observed by Witte et al. (5) and Plebani and Carraro (6) in their reviews of clinical data, that are suggestive of laboratory mistakes rather than the product of common-cause analytic variation. Another observation is the constant rate of unsatisfactory performance across test events, which is suggestive of intrinsic analytic errors beyond those allowed by program performance specifications. We describe the root causes of unsatisfactory performance in the toxicology PT program.
| Materials and Methods |
|---|
Analyte target values are established from either the weighed-in amount of drug or the robust estimate of the mean of participant data. Data from methods that are judged to require peer evaluation because of specimen matrix effects are removed before the determination of the participant mean. The participant mean is used as a target value if the mean differs by >3% from the gravimetrically assigned value.
Laboratory performance is judged by both CLIA 88 (8) and NYSDOH evaluation criteria. We have proposed that the CLIA 88 25% allowable error for most analytes in the toxicology specialty is inconsistent with the capabilities of current analytical systems and with the clinical requirements for optimal patient care (7). NYSDOH criteria (15% allowable error) are used to judge whether the laboratory needs to evaluate analytical performance for possible sources of error.
data review and analysis
We apply two levels of review to proficiency test reports. The
first review occurs as reports are received to detect results that are
so discrepant from target values, e.g., a toxic drug concentration
reported as a subtherapeutic concentration, that the laboratory must
immediately investigate the error and the possibility that similar
errors occurred in testing patient specimens. We classify these
aberrant results as spurious values, and the PT program coordinators
and the laboratory typically complete the investigation within 24
h of notification. Findings from the investigation of spurious values
were collected over the 10 test events conducted from January 1996 to
January 1999 (Table 1).
The second review is an evaluation of performance against NYSDOH and
CLIA 88 criteria. Unsatisfactory performance is defined as two or
more results, among the five challenges for an analyte, that exceed
allowable error limits (analyte score ≤60%). Our investigation into
unsatisfactory performance is initiated by the mailing of an inquiry
report to the laboratory. The investigation is limited to those cases
in which performance is atypical of peer laboratories using the same
analytical system, thereby substantiating the idea that the testing
error(s) are laboratory based and not an artifact of the PT challenge
(specimen matrix effects). The inquiry report restates laboratory
performance for the analyte, quantifies the magnitude of the error, and
provides PT program assessment of error as either
systematic or random or a result of nonlinearity near the limits of the
assay's purported reportable range. The inquiry report is used to
capture information on the design of the laboratory's internal quality
control (QC) program (source of QC materials, allowable imprecision,
and the rules used for interpretation of QC data), the analytical CV at
each level of QC, and the mean number of patient specimens
analyzed each month. The laboratory is instructed to return
the inquiry report with the internal documents that were generated in
the process of its investigation into unsatisfactory performance. PT
program staff use the documents to categorize the source of testing
errors and to maintain a database of internal QC practices and assay
performance characteristics.
categorization of test errors
Sources of test errors are categorized as follows: (a)
calibration drift: performance in PT suggests significant systematic
error, and recalibration of the analytical system resolves the error;
(b) method bias: performance in PT suggests significant
systematic error, and we conclude that the inherent method bias
contributed to the laboratory's unsatisfactory performance;
(c) reportable range errors: performance in PT suggests
significant analytical bias near the limits of the reportable range for
the method; (d) instability: performance in PT suggests
random error, and the laboratory concludes that a component of the
analytical system (e.g., sample probes, reaction cells, reagents) is
not performing optimally; and (e) random event: the errors
cannot be replicated, and the investigation does not identify possible
sources of error.
evaluation of internal qc practices as possible root cause of unsatisfactory performance
To evaluate whether internal QC practices are predictive of
unsatisfactory performance in PT, we documented, from a survey of all
program participants, the limits of acceptable results and the
analytical CV for each QC material used by the laboratory, the source
of QC materials, the rules used to interpret QC data, and the mean
number of patient specimens analyzed each month. The allowable error
used for the QC of an assay was determined from the ratio of the
difference between QC limits and the QC range midpoint concentration
(target) to the midpoint concentration, expressed as a percentage. The
allowable errors used in the internal QC programs were compared to
allowable error in PT and to manufacturer performance claims for the
analytical system. Within intervals of allowable error, we also
determined the incidence of unsatisfactory performance in PT
attributable to analytic systematic error.
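The allowable-error calculation described above can be sketched in a few lines. This is a minimal illustration, not the program's own software: the function name and the example QC limits (8 and 12 mg/L) are hypothetical, and the sketch assumes the QC limits are symmetric around the range midpoint.

```python
def allowable_error_percent(lower_limit: float, upper_limit: float) -> float:
    """Allowable error used in QC, as a percentage of the QC range midpoint.

    Assumes symmetric QC limits, so the distance from the midpoint (target)
    to either limit is the same.
    """
    if upper_limit <= lower_limit:
        raise ValueError("upper_limit must be greater than lower_limit")
    midpoint = (lower_limit + upper_limit) / 2.0
    if midpoint <= 0:
        raise ValueError("QC range midpoint must be positive")
    half_width = upper_limit - midpoint  # difference between QC limit and target
    return 100.0 * half_width / midpoint


# Hypothetical control material with QC limits of 8 and 12 mg/L:
# the midpoint is 10 mg/L and each limit lies 2 mg/L away from it.
example = allowable_error_percent(8.0, 12.0)
```

In this hypothetical case the allowable error works out to 20% of the target, which would exceed the 15% PT performance specification.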
We characterized unsatisfactory analytical performance as systematic or random error through interpretation of two statistics, x-bar and range (1)(9), that are determined from the normalized bias of the five results that are reported for each analyte. The bias of the test result from the peer method mean is normalized to the PT program allowable error. For example, a laboratory reports a serum theophylline concentration of 13 mg/L, and the peer method mean is 10 mg/L. The allowable error around the target concentration is 15%. The normalized bias is determined as (13 mg/L - 10 mg/L)/(0.15 × 10 mg/L), or 2.0; that is, the reported result deviated from the target by two times the allowable error ascribed by the PT program. The x-bar statistic is determined as the mean of the normalized biases across the five challenge specimens for the analyte. The range statistic is an index of random error and is determined as the difference between the largest and smallest of the series of normalized results for an analyte, divided by 2. For example, the range statistic for the series of normalized results 0.2, 0.5, -0.8, 0.6, and 0.1 is determined as [0.6 - (-0.8)]/2 = 1.4/2 = 0.7. A range statistic equal to 0.7 indicates that the normalized results were distributed over a range equivalent to 70% of the full range allowed by the PT performance specification of ± 15%. A range statistic >0.7 suggests either significant random error or significant systematic error near the limit(s) of the purported reportable range.
The x-bar and range statistics were tabulated for each analyte for each of five test events conducted from June 1997 to September 1998. The largest of the series of five values of the respective statistics was likewise tabulated for each analyte. Cases of unsatisfactory performance in PT attributable to systematic error were identified as an analyte x-bar (maximum) >1.0 and range (maximum) <0.7.
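The two statistics and the classification rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the 15% allowable error is passed as a fraction, and the classification applies the thresholds exactly as written above (x-bar maximum >1.0 and range maximum <0.7) without deciding how a negative bias should be treated.

```python
from statistics import mean


def normalized_bias(result: float, peer_mean: float,
                    allowable_fraction: float = 0.15) -> float:
    """Bias from the peer method mean, in units of the PT allowable error."""
    return (result - peer_mean) / (allowable_fraction * peer_mean)


def x_bar(normalized_biases: list[float]) -> float:
    """Mean normalized bias across the challenge specimens for an analyte."""
    return mean(normalized_biases)


def range_statistic(normalized_biases: list[float]) -> float:
    """Half the spread between the largest and smallest normalized bias."""
    return (max(normalized_biases) - min(normalized_biases)) / 2.0


def is_systematic_unsat(x_bar_max: float, range_max: float) -> bool:
    """Thresholds from the text: x-bar (maximum) >1.0 and range (maximum) <0.7."""
    return x_bar_max > 1.0 and range_max < 0.7


# Worked examples from the text.
theophylline_bias = normalized_bias(13.0, 10.0)   # about 2.0
series = [0.2, 0.5, -0.8, 0.6, 0.1]
series_range = range_statistic(series)            # about 0.7
series_mean = x_bar(series)                       # about 0.12
```

The maxima passed to `is_systematic_unsat` would be the largest x-bar and range values observed for the analyte across the test events.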
| Results |
|---|
We list in Table 1 the laboratory mistakes and analytical system
malfunctions that produced the spurious values and the frequency of
occurrence in each test event. Laboratory mistakes occurred when valid
analytical results were mishandled. The four assigned causes of
mistakes were found to recur among laboratories and are identified as
misinterpretation of instrument codes, inaccurate factoring for
specimen dilution, mishandling of data provided on instrument
printouts, and misidentification or misplacement of specimens within
batch sequences.
Two of the recurring laboratory mistakes, misinterpretation of
instrument codes and mishandling of data provided on instrument
printouts, were instrument specific. The error rate among laboratories
using the Beckman Synchron (Beckman Coulter Instruments) in the
analysis of PT specimens for gentamicin and tobramycin at
concentrations exceeding 12 mg/L (the method reportable range
limit) was 33 333 per million assays (Table 1). The analyzer correctly
identified specimens with analyte concentrations outside its reportable
range. However, the out of instrument range (OIR) annotation used to
flag the specimen notifies the analyst that additional testing is
required to determine whether the analyte concentration is low or high.
In 9 of the 270 cases where the OIR code was presumably generated, the
analysts interpreted the code to mean that the gentamicin and
tobramycin concentrations were less than the lower limit of the
reportable range (Table 2). The error rate across test events was proportional to
the number of specimens that challenged the upper limit of the Synchron
reportable range. Analysts in three laboratories misread the Abbott
X-systems printout of assay data and reported net polarization units
for drug concentration (Table 2). The columnar design of the instrument
printout is used to list the net polarization, blank intensity, and
analyte concentration for each specimen assayed. Analysts stated that
the positioning of columnar data is not consistent between TDx, FLx,
and AxSYM reports, which contributed to the misreading of instrument
reports.
Mishandling of specimen dilutions was the most common laboratory
mistake and was responsible for 20% of the spurious values reported to
the PT program (error rate was 510 per million assays; Table 1). In
each case, the analytical result obtained for the diluted specimen was
accurate, but the reported results were grossly inaccurate (Table 2).
Laboratories committing this error have described four scenarios: the
analyst diluted and assayed the specimen but failed to correct the
result for dilution; the analyst diluted and assayed the specimen and
corrected the result for dilution, but did not communicate that
factoring for dilution was performed, and data entry staff repeated the
correction for dilution before releasing the test result;
the analyst was unaware that the instrument report listed the result
that had been corrected for dilution; or an incorrect dilution factor
was used in the determination of the specimen analyte concentration.
Instrument malfunction was cited in 15% of the investigations of
spurious values (Table 1), and most laboratories suspected the
specimen-sampling module as the probable source of error. Typically,
four of the five test results for an analyte challenge were well within
PT program ranges of acceptable results, but the recovery of analyte
from the fifth challenge specimen was markedly low. In most instances,
transposition of specimens within the sequence batch was ruled out
because the batch contained only the PT specimens. Repeat analysis of
the test specimen upon request from the PT program invariably produced
acceptable recovery of analyte. The error in sampling was most
frequently detected by the PT program among users of the Abbott AxSYM
(Table 2). Laboratories suspected air bubbles in the sample, a hole in
the reaction vessel, or a failure to aspirate the specimen. We estimate
that the AxSYM sampling error rate is 0.016% (10 incidents per 60 575
assays, or 165 per million assays).
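The per-million error rates quoted in this section follow from a simple ratio of incidents to assays. The sketch below reproduces the two cases for which the text gives both counts; the function name is hypothetical.

```python
def rate_per_million(incidents: int, assays: int) -> float:
    """Error rate expressed as incidents per million assays."""
    if assays <= 0:
        raise ValueError("assays must be positive")
    return 1_000_000 * incidents / assays


# Counts taken from the text.
synchron_oir = rate_per_million(9, 270)       # OIR code misread in 9 of 270 cases
axsym_sampling = rate_per_million(10, 60575)  # 10 sampling incidents in 60 575 assays

print(round(synchron_oir))    # 33333
print(round(axsym_sampling))  # 165
```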
investigation of unsatisfactory assay performance
The outcome of investigations into unsatisfactory analytical
performance is summarized in Table 3. A total of 20 830 analyte challenges (5 specimens per
analyte challenge) were evaluated from the five test events conducted
from May 1997 to October 1998. The rate of unsatisfactory performance
among all analyte challenges was 0.7%. Seventy-five percent of the
investigations concluded that the unsatisfactory performance was
attributable to systematic error (calibration drift or bias near
purported limits of reportable range) or random error (instability of
analytical system). Conclusions from 25% of the investigations were
indeterminate. However, program staff judged that in many cases in
which an indeterminate conclusion was reached by the laboratory, an
inherent method bias, not calibration drift, was a major component of
the total error in test results. We categorized the sources of error
from indeterminate laboratory conclusions as method bias (14%) or as a
random, indeterminate event (11%).
evaluation of internal qc practices
Calibration drift (error) was most frequently mentioned as the
cause of unsatisfactory performance in PT (Table 3). Because we expect
that laboratories release results only from analytical runs that they
judge to be in-control, we investigated the allowable errors used by
laboratories in their internal QC programs and the analytical CVs. A
scatter plot of allowable error vs method imprecision for theophylline,
carbamazepine, and phenobarbital is shown in Fig. 1A (data are for the control material with the drug concentration
within its therapeutic range). The PT program performance specification
for these analytes is ± 15% around the assigned target value;
however, 35% of the laboratories reported that the allowable error
used in their QC programs exceeded 15%. Only 10% of the laboratories
reported that the 95% confidence interval of method imprecision (2 CV)
exceeded 15%. These findings suggest that in many laboratories, the
allowable error used to monitor analytical stability is decoupled from
the imprecision performance characteristic of the analytical method.
This inference is supported by Fig. 1A, where it appears that many
laboratories use fixed criteria of 10%, 15%, and 20% for QC limits,
of which 20% is most prominent. The use of method standard deviation by
many laboratories to set QC limits is also evident from the
line-of-identity in Fig. 1A. We noted that ~4% of laboratories
reported a method CV <2%, and 3% of the laboratories reported an
allowable error of <4%. This observation raises questions concerning
the statistical validity of imprecision estimates by this subset of
laboratories. The low volume of patient testing may contribute to the
questionable validity of imprecision estimates because 16% of the
laboratories perform <10 analyses on patient specimens per month for
theophylline, carbamazepine, and phenobarbital.
We evaluated the correlation of unsatisfactory performance in PT
attributable to systematic error to the allowable error used in QC
programs by plotting the laboratory x-bar (maximum), a
measure of laboratory systematic error determined by the PT program,
against the laboratory allowable error (Fig. 1B). The performance on 14
analyte challenges was evaluated as unsatisfactory [x-bar
(maximum) >1.0; range <0.7] with analytical bias exceeding 15%. The
internal QC program allowable error exceeded 15% for 11 of the 14
cases of unsatisfactory performance. We further quantified the
correlation by determining the rate of unsatisfactory performance
within the intervals of allowable error (Table 4) and found that the rate of unsatisfactory performance exceeded
12% when the allowable error exceeded 22.5%.
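The tabulation of unsatisfactory performance within intervals of allowable error can be sketched as a simple binning step. This is an illustrative sketch only: the function name, the interval edges other than 22.5%, and the records in the example are hypothetical and are not the data behind Table 4.

```python
from bisect import bisect_left


def unsat_rate_by_interval(records, edges):
    """Rate of unsatisfactory performance within intervals of allowable error.

    records: iterable of (allowable_error_percent, is_unsat) pairs.
    edges:   sorted interval boundaries; a value equal to an edge falls in
             the lower interval, so the last interval holds values that
             exceed the final edge.
    Returns one rate per interval, or None for an empty interval.
    """
    counts = [[0, 0] for _ in range(len(edges) + 1)]  # [unsat, total]
    for allowable_error, is_unsat in records:
        idx = bisect_left(edges, allowable_error)
        counts[idx][1] += 1
        if is_unsat:
            counts[idx][0] += 1
    return [unsat / total if total else None for unsat, total in counts]


# Hypothetical records: (allowable error in %, unsatisfactory in PT?)
records = [(10, False), (20, False), (22.5, False), (25, True), (25, False)]
edges = [7.5, 12.5, 17.5, 22.5]  # hypothetical boundaries, except 22.5
rates = unsat_rate_by_interval(records, edges)
```

With these made-up records, only the interval above 22.5% contains an unsatisfactory result, so it is the only interval with a nonzero rate.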
We noted a fourfold range in allowable error used by laboratories in
the QC of phenobarbital, carbamazepine, and theophylline assays (5th
and 95th percentiles of allowable error are 5% and 20%,
respectively). The range in allowable errors suggests a lack of
consensus on QC requirements and a lack of guidance from manufacturers
on QC program parameters that are consistent with stable analyzer
performance. We investigated QC practices and PT performance among
laboratories that use an analytical system with QC program guidance
that is provided by the manufacturer. Abbott Diagnostics (Abbott
Laboratories) makes available QC materials and rules for interpretation
of QC data to users of its X-systems. The QC limits for the
phenobarbital, theophylline, and carbamazepine therapeutic control
material are ± 10% for the TDx analyzer, and ± 12%,
±10%, and ± 15%, respectively, for the AxSYM analyzer around
the assigned target value. Laboratories are instructed to initiate
investigation of analytical performance when control assay values
exceed the limits. We found that the performance of <6% of the Abbott
analyzers is monitored by use of Abbott QC materials. Among those using
the Abbott control materials, 70% of the assays are monitored
with ± 10% or lower QC limits, and 97% of the QC limits are
15% or lower (Fig. 2B). None of the performances on the PT analyte challenges among
this group was judged unsatisfactory. When laboratories opted not to
use Abbott QC materials, we noted that 49% of these laboratories used
allowable errors in QC that are larger than the Abbott recommended
fixed criterion to detect possible unstable system performance (Fig. 2A). The incidence of unsatisfactory performance in PT increased with
increases in QC program allowable error (Fig. 2C and Table 4),
supporting probability models developed by Ehrmeyer et al.
(10) that predict the effects of method CV on PT outcomes.
| Discussion |
|---|
Episodic process failures are caused by lapses in laboratory standard operating procedures (mistakes), and by performance anomalies in an otherwise stable analytical system (instrument malfunction). Clearly, diligence in the training and competency testing of staff to minimize mistakes is a vital and evolving activity within the laboratory. Laboratory efforts to reduce mistakes have been greatly assisted by advances in technology and design of analytical systems. The 1996 Clinical Chemistry Forum (11) focused on clinical laboratory mistakes and on concepts to design quality into analytical systems to reduce or eliminate mistakes. Manufacturers must anticipate the mistakes that are most likely to occur, and design mistake-proofing systems (12). Continual improvement can then be accomplished by complaint tracking after product launch (13). Our findings suggest opportunities for continual improvement in system design.
Each of the laboratory mistakes listed in Table 1 has recurred across
laboratories from the time the study was initiated. The most prevalent
mistake occurs during the processing of specimens with analyte
concentrations that exceed the upper limit of the analytical system's
reportable range. We design PT specimens to encompass the clinically
relevant range of analyte concentrations. The analyte concentration in
one of the five test specimens in each test event may exceed the upper
limit of the reportable range of some methods. When we estimated the
error rate for failure to perform mathematical calculation of results
for specimen dilution, we assumed that all laboratories needed to
dilute one specimen in each test event for each analyte. The estimate
of 510 results not corrected for dilution per million dilutions
performed may grossly underestimate the error rate, perhaps by as much
as 10-fold if only 10% of the challenges required specimen dilution.
Specimen dilution typically requires special handling and reporting,
and typically, a breakdown in communication is the reason for the
testing error. Some instruments allow the entry of a dilution factor,
which is used by the instrument to calculate the reportable result
obtained for a specimen that had been diluted off-line. The
inconsistent use of this feature among analysts within a laboratory has
produced confusion as to when the result requires manual calculation
for dilution. The confusion is compounded when protocols for handling
specimen dilutions vary among several different analytical systems
within the laboratory. Automated specimen dilution and (or) instrument
reports that describe the dilution protocol and the results would
likely reduce this type of error considerably.
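The sensitivity of the dilution error estimate to the assumed number of dilutions can be made explicit. In the sketch below, the 510 per million figure is the estimate given above under the assumption that every laboratory diluted one specimen per analyte per test event; the function name and the idea of rescaling by a single fraction are illustrative assumptions.

```python
def adjusted_dilution_error_rate(observed_per_million: float,
                                 fraction_diluted: float) -> float:
    """Rescale a per-million error rate when only a fraction of the
    challenges counted in the denominator actually required dilution."""
    if not 0.0 < fraction_diluted <= 1.0:
        raise ValueError("fraction_diluted must be in (0, 1]")
    return observed_per_million / fraction_diluted


# If only 10% of the challenges required dilution, the estimate rises
# about 10-fold, from 510 to roughly 5100 per million dilutions.
adjusted = adjusted_dilution_error_rate(510.0, 0.10)
```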
Rates of nonconformity are highly correlated with process complexity
(12). Simplification of analytical devices for use in any
setting has merit in the reduction of mistakes, and testing at the
point of care has benefited greatly from manufacturer efforts to
mistake-proof analytical devices. Although the error rate in
transcription of results from Abbott X-systems reports was low in PT
(Table 1), the mistakes suggest an opportunity to improve the test
process through simplification of instrument reports. The X-system
reports contain an array of analytical data (fluorescence polarization
units and blank intensity) that are associated with test specimen
results, and the arrangement of the data varies among the different
X-system configurations. Complex instrument reports only confound an
inherently error-prone process of results transcription. Likewise,
errors in the interpretation of instrument codes, as occurred among
laboratories using the Beckman Synchron, could be eliminated with the
consistent use of descriptive specimen flags on instrument reports.
The reason for spurious test results is difficult to identify when mistakes are eliminated as the cause. We were unable to assign a cause to 10% of the spurious proficiency test results, although the investigations were performed proximate to the episode. However, as data were collected over 3 years, a pattern emerged among laboratories using the Abbott AxSYM, where the recovery of analyte from one of the PT specimens was markedly low, whereas recovery from the remaining four PT specimens in the batch was well within the acceptable range of concentration. Laboratories suggested that the only plausible explanations were failure of an instrument to acquire the full volume of sample for analysis, air bubbles introduced into the specimen by mixing, or a hole in the reaction vessel used for the specimen. Clearly, given the unpredictable and low rate of occurrence (165 errors per million assays), it is difficult to isolate and identify definitively the cause of the spurious results. However, the plausibility of the explanations offered by laboratories suggests instrument features designed to prevent short sampling of specimens and to monitor the integrity of reaction vessels warrant consideration. Results from the Abbott AxSYM accounted for 29% of the entire PT database, which increased the likelihood of detecting such sampling errors for that analytical system.
One of us (R. Jenny) has conceptualized the use of an analytical system by many laboratories as a distributed production process (1). The process is sampled periodically by challenges with PT materials, and using principles of statistical process control, we can judge whether a laboratory is performing within specification. The NYSDOH performance specification for most toxicology analytes is the recovery of analyte from test specimens to within 15% of the target value. The specification is based on capabilities of modern analytical systems, and is consistent with standards of laboratory practice guidelines (14)(15) and with proposed analytic goals (16)(17).
The mean rate of nonconformance (unsatisfactory analyte performance)
was 0.7% over five test events. Because laboratories release test
results from analytical runs that are judged in-control, we
investigated internal QC schemes used to monitor assay performance. We
found wide disparity in the allowable errors used by laboratories in
the QC of their analytical methods. We list in Table 6 the manufacturer estimates for imprecision for analytical
systems most commonly used among participants in our PT program, the
threshold imprecision that the manufacturer uses as a guideline to
judge the performance of the analyzer, and the distribution of
allowable errors among users of those systems. The manufacturer
estimates for imprecision are those published in the "Performance
Characteristics" section of the product assay sheets. The threshold
imprecision used by a manufacturer's technical services
representatives to judge analyzer performance was obtained either from
the product assay sheets or by consultation with the manufacturer. In
many instances, the allowable errors are not consistent with
manufacturer guidelines for stable performance of the analytical
systems used. The allowable error used by ~50% of the laboratories
using the Abbott X-systems exceeds the allowable error that is
recommended by Abbott for the analysis of theophylline and
phenobarbital.
The scatter plot of allowable error against analytical imprecision
(Fig. 2A) reveals patterns that suggest the origin of error limits used
for internal QC. The line-of-identity indicates a statistical
derivation, the 20% fixed-limit is coincident with "expected"
ranges provided by suppliers of assayed control materials (Bio-Rad and
Dade are the major vendors of QC materials among laboratories we
studied; Dade Liquid Immunoassay Controls and Bio-Rad Liquichek
Immunoassay Plus Control materials are most frequently used), and the
10% fixed-limit is recommended by Abbott to laboratories that use
Abbott QC material. Laboratories that opt to use commercially assayed
control ranges as QC limits frequently designate the limits of the
supplied control ranges as 2 SD limits, and use "Westgard rules"
(which presuppose valid estimates of method performance
characteristics) to monitor method performance.
Clearly, when analytical systems are deployed for use, the quality of conformance to system design specifications varies considerably among laboratories that use those instruments, as expressed by the allowable errors they use to judge system stability. Because manufacturers continually refine technology to provide analytical performance that is consistent with clinical needs for optimal patient care, the rewards to patient care may be diminished by QC practices that are desensitized to unstable system performance. Steindel and Tetrault (18) and Howanitz et al. (19) conducted a Q-Probes study of QC practices in hospital laboratories and concluded that laboratorians have difficulty in following QC rules because they are complex and tedious to follow, and that QC practices should be simplified. We believe simplification should encompass the standardization of allowable errors among laboratories using an analytical system. The quality of patient testing is dictated by the analytical system design specifications: laboratories should not expect better performance than what is claimed but should not accept less. We suggest that the objective of the laboratory's QC program should be to maintain system performance within verified manufacturer performance claims and that the manufacturer's precision claims may be viable allowable error limits in QC programs. Until QC practices are made consistent and relevant to system design specifications, the intrinsic quality of laboratory services is unlikely to improve.
In conclusion, PT providers can substantially augment the utility
of their services through the characterization of participant
performance and active participation in investigations of causes of
unsatisfactory performance. Guidelines have been developed for
laboratories to review their PT results and to identify the source(s)
of test error(s) (20)(21). The PT provider
has the capability to assimilate causes of errors and to identify
common causes among participants. The sharing of this information with
laboratories and manufacturers should contribute to the continuous
improvement of instrument design and laboratory services. Internal
quality specifications, expressed as allowable error in QC, are an
important link to providing reliable laboratory services that meet
needs for good patient care. We observed an increased rate of
unsatisfactory performance in PT as the allowable error in QC increased
(Fig. 2). Our finding that many laboratories use allowable errors for
internal QC that differ greatly (ranges that are too broad or too narrow) from those
recommended by manufacturers for monitoring system stability supports
the contention that QC programs must be retooled. Laboratories make
purchasing decisions based on system capabilities and should strive to
maintain those capabilities (manufacturer claims) in routine use. To
strive, through rigorous QC, for greater accuracy and precision than
the analytical system was designed to provide is imprudent. To allow
analytical systems to perform at a level that may be characterized by
the manufacturer as unstable performance is simply unacceptable. QC
limits based on these extremes produce high false-positive and
false-negative rejection of analytical runs. The common ground is to
base QC evaluation criteria on the expected performance of the
analytical system. The manufacturer is in the best position to provide
guidance for selection of those evaluation criteria.
| Footnotes |
|---|
1 Nonstandard abbreviations: PT, proficiency testing; NYSDOH, New York State Department of Health; QC, quality control; and OIR, out of instrument range.