Demystifying EQA statistics and reports

Reports act as an important feedback tool in External Quality Assessment (EQA). Their main role is to score laboratories for their performance in an EQA round. The most common scores that apply to quantitative data are Q- and Z-scores. To calculate these scores, EQA providers need an assigned value and a standard deviation for the sample. Both can be derived chemically or statistically. When they are derived statistically, different deviations of the data from the normal distribution have to be handled, and various procedures for evaluating laboratories are able to handle these anomalies. Formal tests and graphical representation techniques are discussed, and suggestions are given to help choose between the different evaluation techniques. Reliable estimates for calculating performance scores require a sufficient number of data, but there is no general agreement about the minimum number needed. A solution for very small numbers is proposed by changing the limits of evaluation. Apart from analyte- and sample-specific laboratory evaluation, supplementary information can be obtained by combining results for different analytes and samples. Various techniques are reviewed. It is shown that combining results leads to supplementary information, not only for quantitative, but also for qualitative and semi-quantitative analytes.


Introduction
Reports created by External Quality Assessment (EQA) providers serve as a major feedback tool towards the participating laboratories. They support the pedagogic role of EQA and are often used by auditors to follow up laboratory quality, particularly in the light of possible accreditation (1-4). Different EQA providers summarize the statistical evaluation and their findings in various types of reports.
In the first instance, participating laboratories should receive, as soon as possible after the closing of an EQA round, a confidential individual report detailing their own performance. The report should be as clear and comprehensive as possible and contain the assigned values for each of the parameters that were included, the limits of acceptability and an evaluation of each of the laboratory's results. Ideally, it would contain additional information to support the evaluation, like the number of laboratories involved in the evaluation and details about the distribution of the data reported by all the participants. As such, the report allows the participating laboratory to compare its results for each analyte with those of other participants (1,5-9). In addition to individual reports for each participant, summary reports containing general and anonymized information on method performance, variability and bias for different analytes could be included at the end of each round. Periodic reports can be made as well to highlight the most striking findings across different EQA rounds (7). This manuscript focuses on the feedback reports of individual laboratories and gives an overview of various relevant statistical evaluation techniques of reported data, without aiming at describing the entire range of performance assessment systems.
Coucke W, Soumali MR. Background for EQA reports
Because of large differences in EQA scheme design, evaluation procedures vary widely and depend on, among others, choices made for determining the assigned value, commutability of control samples or the way in which laboratories report their results in routine practice. Commonly, EQA in the clinical field asks laboratories to analyse the samples as if they were routine samples and hence to produce mostly one value for a certain analyte without reporting measurement uncertainty (10). For many analytes determined in the clinical laboratory, reference method-based assigned value setting is not possible. Due to a complex matrix like whole blood or serum, which is pooled for large-scale distribution and subject to procedures to enhance sample stability, samples are altered. Consequently, samples are often not commutable, i.e. the differences between methods that they demonstrate do not reflect the differences that are observed for routine samples (10). Commutable samples enable EQA providers to derive more information from an EQA round than non-commutable samples, like harmonization between methods (4,11). If commutability cannot be assessed, the only way to evaluate laboratories is with respect to their own peer groups. Peer groups consist of laboratories whose measurement procedures are equal or so similar that they are expected to have the same result and matrix-related bias compared to other methods. Peer group evaluation provides valuable information to assess quality, verifying that a laboratory is using a measurement procedure in accordance with the manufacturer's specifications and with other laboratories using the same technology, but cannot assess laboratory or method accuracy (4,11). Commutable samples, on the other hand, give insights into the bias and accuracy that reflect analytical performance for routine samples.
To help interpret an EQA result that is out of consensus, EQA providers are encouraged to write advice for poor performers in the report (8). Laboratories should always follow up any unacceptable EQA result by a root cause analysis and document corrective actions (12). In addition, when interpreting EQA results, laboratories should not forget that results within the acceptance range may still be linked to a problem in the laboratory, for example when they are close to the acceptance limits or when successive Z- or Q-scores are all positive or all negative (11).

Building performance statistics
Laboratories are flagged for an out-of-consensus result if they report a value that is too far from the assigned value. Hence, prior to any interpretation, the EQA provider must determine the assigned value and a range of acceptable values around it (1,8,11,13). Criteria for defining the ranges of acceptability are extremely important. Ranges that are too wide will not allow detecting laboratories with poor performance, while a satisfactory performance will be wrongly flagged if the ranges are too strict (7). It is also very important that acceptability criteria are reliable, or laboratories may lose confidence in the scheme.
The comparison with acceptability ranges is often condensed into two different scores: Z-scores and Q-scores.
A simple evaluation technique consists of calculating Q-scores. A Q-score is the relative difference between the value reported by the laboratory and the assigned value:

$$\text{Q-score} = \frac{\text{reported value} - \text{assigned value}}{\text{assigned value}}$$

The Q-score is often presented as a percentage and compared with a maximal allowable deviation (6,8,13,14). The limit of acceptability is often considered as the 'fitness for purpose', meaning that a result within the limits of acceptability is 'fit for purpose', or better: 'fit for intended use'. It is important to specify such purpose, which should be derived from external requirements (5,15). External quality assessment providers for clinical laboratories usually adopt the approach of analytical performance specifications (16). The approach includes requirements derived from specific studies or general studies like biological variability and, in the second instance, state-of-the-art performance criteria as well.
Another type of score is the Z-score. It is the difference between the value reported by the laboratory and the assigned value, corrected for the variability:

$$\text{Z-score} = \frac{\text{reported value} - \text{assigned value}}{\text{standard deviation}}$$

If the distribution of the data reported by well performing laboratories approaches a normal distribution, Z-scores follow a standard normal distribution and the percentage of Z-scores that are beyond extreme values can be calculated exactly: 4.6% and 0.27% of the Z-scores will have an absolute value greater than 2 and 3, respectively. Hence, a very small minority of well performing laboratories have Z-scores with an absolute value larger than 2 and even fewer have Z-scores with an absolute value greater than 3. That is why a Z-score with an absolute value lower than 2 is often considered as acceptable, between 2 and 3 as questionable and larger than 3 as unsatisfactory (3). Because Z-scores are standardized scores, they can be compared between all analytes (8).
As can be seen from the formulas to calculate Q- and Z-scores, they both include an estimate of the assigned value, and Z-scores also need an estimate of the variability of the data, expressed as a standard deviation.
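The two scores can be illustrated with a minimal Python sketch. The function names and the numerical values are hypothetical and chosen for illustration only; the classification limits of 2 and 3 are those quoted in the text, and the tail probabilities are computed from the standard normal distribution.

```python
from statistics import NormalDist


def q_score(reported, assigned):
    """Relative deviation from the assigned value, expressed as a percentage."""
    return 100.0 * (reported - assigned) / assigned


def z_score(reported, assigned, sd):
    """Deviation from the assigned value in units of the standard deviation."""
    return (reported - assigned) / sd


def classify_z(z):
    """Apply the conventional |Z| limits of 2 and 3."""
    if abs(z) < 2:
        return "acceptable"
    if abs(z) <= 3:
        return "questionable"
    return "unsatisfactory"


def two_sided_tail(limit):
    """Expected fraction of |Z| > limit under a standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(limit))


# Hypothetical values, for illustration only
reported, assigned, sd = 5.4, 5.0, 0.25
q = q_score(reported, assigned)
z = z_score(reported, assigned, sd)
print(f"Q-score = {q:.1f}%, Z-score = {z:.2f} ({classify_z(z)})")
print(f"P(|Z| > 2) = {two_sided_tail(2):.4f}, P(|Z| > 3) = {two_sided_tail(3):.4f}")
```

In practice the Q-score would be compared with the maximal allowable deviation chosen by the EQA provider, which is not modelled here.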

Calculating performance scores for quantitative tests: one sample, one parameter
The evaluation of a laboratory in an EQA round is basically an assessment of how well an analyte has been measured in a certain sample. Before calculating any score, EQA providers should examine the reported data and screen them for anomalies that jeopardize a correct evaluation. Ideally, the reported data would be normally distributed. In practice however, EQA providers cannot guarantee this assumption and have to check the data for anomalies, of which different types may occur. The most common are bimodality, skewness and outliers.
Bimodality occurs when the data consist of a collection of small groups with different central values. Skewness occurs when the data are not centrally located around their mean, i.e. there is an increased proportion of extremely large or extremely small data. Outliers are probably the most common anomaly. Mostly, outliers are data that are far from the bulk of the data, i.e. the process that produced them is not like the process that produced the other data. The process may be out of range, showing for example a systematic deviation or an increased variability, or the outlier could be caused by an extra-analytical mistake, like a clerical error or a sample identification mistake. Skewness can be detected by graphical exploration of the data and handled by a data transformation, such as a log or square root transformation, which in most cases makes the data more symmetrical. In case of bimodality, several statistical tools are available to detect the different subgroups. They rely on kernel density estimation, which is a nonparametric technique to estimate the probability density function from the data and serves excellently for identifying modes. Some use solely kernel density estimation for identifying modes, others extend this technique by a method called bootstrapping (17). It is a method that is based on resampling and aims at estimating the behaviour of the distribution's parameters in order to find the largest mode (18-20). The statistical procedures for handling bimodality and skewness should be applied by the EQA organizer between the deadline for reporting results and the creation of feedback reports. Once the EQA provider has validated these procedures, they preferably remain unchanged over time.
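As a rough illustration of mode detection by kernel density estimation, the sketch below evaluates a Gaussian kernel density estimate on a grid and reports its local maxima. It is a simplified example, not one of the published procedures cited above: the rule-of-thumb bandwidth, the grid size and the minimum relative height are all assumptions, and the bootstrap extension is not included.

```python
import math
import statistics


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth (an assumption; other selectors exist)."""
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = min(sd, (q3 - q1) / 1.34) or sd  # fall back to sd if the IQR is 0
    return 0.9 * spread * len(values) ** (-0.2)


def kde(values, x, bandwidth):
    """Gaussian kernel density estimate at point x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm


def find_modes(values, bandwidth=None, grid_size=512, min_rel_height=0.05):
    """Return grid positions of the local maxima of the estimated density."""
    h = bandwidth if bandwidth is not None else silverman_bandwidth(values)
    lo, hi = min(values) - 3 * h, max(values) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]
    dens = [kde(values, x, h) for x in xs]
    peak = max(dens)
    return [
        xs[i]
        for i in range(1, grid_size - 1)
        if dens[i] > dens[i - 1]
        and dens[i] >= dens[i + 1]
        and dens[i] >= min_rel_height * peak  # ignore negligible bumps
    ]


# Hypothetical results from two method groups with different central values
two_groups = [4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 7.8, 7.9, 8.0, 8.0, 8.1, 8.2]
one_group = [4.6, 4.7, 4.8, 4.9, 4.9, 5.0, 5.0, 5.0, 5.1, 5.1, 5.2, 5.3, 5.4]
print(len(find_modes(two_groups, bandwidth=0.5)), "mode(s) in the mixed data")
print(len(find_modes(one_group, bandwidth=0.5)), "mode(s) in the homogeneous data")
```

The number of modes found depends strongly on the bandwidth, which is one reason why a validated and unchanged procedure is preferable to ad hoc choices.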
In the following sections, it is assumed that bimodality and skewness have been dealt with either by using homogeneous, unimodal data, or by transformation and that the statistical techniques only have to deal with outliers.

Outlier removal
Unfortunately, a rule that identifies outliers with 100% certainty does not exist. Moreover, the detection of outliers has various flaws, like masking and swamping. Masking means that an outlier is not detected because of the presence of another outlier; swamping means that a non-outlying observation is falsely indicated as an outlier (3,21,22). Three tests are commonly used for outlier detection in EQA data: the Hampel outlier test, the Grubbs test and the Dixon test. The Hampel and Grubbs tests compare the difference between an extreme value and the centre of the data with the variability of the data and identify the extreme value as an outlier if the ratio is too large. The Dixon test looks at the difference between the two most extreme values and an estimator of scale to identify outliers. The three tests can work with a specified alpha, i.e. the probability that a value is wrongly marked as an outlier, which should be kept low, for example 0.05. For relatively small data series (N < 15), a higher value of alpha could be adopted. Recently, the Hampel and Grubbs tests have been proposed as preferable to the Dixon test (23-25), with the Grubbs test also able to handle small data series, from six data points on (25).
It should be noted that indicating outliers and marking them as "out of consensus results" does not go as far as calculating performance scores, like Z- or Q-scores. Q- and Z-scores can be calculated by identification and removal of outliers prior to the calculation of the assigned (target) value and descriptive statistics, followed by the calculation of individual Q- and Z-scores for all participants, whether outliers or not. Outlier participants should still receive scores even though their results are excluded from the calculation of the target value.
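The sequence described above (flag outliers, compute the statistics without them, then score every participant) might be sketched as follows. The outlier rule shown is a simple Hampel-type identifier based on the median and the median absolute deviation (MAD); the cutoff of 3 is an illustrative assumption rather than a tabulated critical value for a chosen alpha, and the data are hypothetical.

```python
import statistics


def hampel_flags(values, cutoff=3.0):
    """Flag values whose distance from the median exceeds `cutoff` robust SDs.

    The cutoff is an illustrative assumption, not a critical value
    derived from a specified alpha.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # MAD rescaled to estimate the SD of normal data
    if scale == 0:
        return [v != med for v in values]
    return [abs(v - med) / scale > cutoff for v in values]


def z_scores_after_outlier_removal(values):
    """Compute mean and SD without outliers, then score all participants."""
    flags = hampel_flags(values)
    retained = [v for v, is_outlier in zip(values, flags) if not is_outlier]
    assigned = statistics.fmean(retained)
    sd = statistics.stdev(retained)
    # Every participant is scored, including those excluded from the statistics
    return flags, [(v - assigned) / sd for v in values]


results = [5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 5.1, 9.7]  # hypothetical EQA round
flags, z = z_scores_after_outlier_removal(results)
for value, is_outlier, score in zip(results, flags, z):
    print(f"{value:5.1f}  outlier={is_outlier}  Z={score:6.2f}")
```

A single pass is shown for brevity; with several outliers, masking may require a more careful procedure.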

Determining the assigned value
Several ways exist to set or determine the assigned value. A first group of assigned value setting possibilities are rather chemical: adding amounts of pure analyte to a sample matrix containing none, certified reference materials with assigned values determined by formulation or analysis with definitive methods, or reference values determined by analysis that are traceable to reference standards. In this case, commutability should be assured as well (2,6,8,11,13,14). Other methods rely on statistics: consensus values from reference laboratories that use the best available methods, or from participants (6,8,13,14). It has been reported that over 90% of the programmes rely on consensus values (2). There are numerous methods to assess the assigned value based on reported results and all of them attempt to accommodate the most common anomaly that may endanger a correct estimation of the assigned value: outliers.
The influence of outliers on the estimation of the central value may be significant even when groups are unimodal and symmetrical. When the classical average is used, outlier detection tests, as described in the previous section, should be applied to identify and exclude outliers before the average is calculated. Another possibility is to use techniques that attempt to find a correct estimate of the assigned value in the presence of outliers. Estimators obtained by these techniques are called robust estimators, since they are not, or almost not, influenced by outliers. Two criteria play a role in the evaluation of these robust estimators: breakdown point and efficiency. The breakdown point can be seen as the proportion of the data that could be made infinite without making the estimate infinite. Hence, the higher the breakdown point, the more outliers may be present in the data before a clear effect on the estimated assigned value is visible. Efficiency reflects the uncertainty of the estimator: highly efficient estimators are very certain. In general, high breakdown point and high efficiency are antagonistic criteria, i.e. a high breakdown point is associated with a low efficiency. For example, the classical average has a high efficiency, but a very low breakdown point. The kernel density-based estimation of the mode, on the other hand, has a very high breakdown point, but a low efficiency.
One of the most widely used estimators of the assigned value is the median (7). It is simply the middle value when the reported values are sorted from smallest to largest. Medians have a very high breakdown point, but exhibit a low efficiency. Other estimators exist that have an acceptable breakdown point and a better efficiency than the median, like the estimator from Algorithm A of ISO 13528 (13). Originally described by Huber as the H1.5 algorithm (26), this algorithm starts with an estimation of the central location, and subsequently reduces the influence of outlying results by winsorization, i.e. replacing values outside an interval by the outer values of the interval (27).
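The iterative winsorization can be sketched in a few lines of Python. The constants used below (1.483 for the initial scale, 1.5 for the winsorization interval and 1.134 for the rescaled standard deviation) are those commonly quoted for Algorithm A; the convergence tolerance, the iteration cap and the data are assumptions, and the standard itself should be consulted for the authoritative description.

```python
import statistics


def algorithm_a(values, tol=1e-6, max_iter=100):
    """Sketch of a Huber-type robust mean and SD by iterative winsorization."""
    # Initial estimates: median and scaled median absolute deviation.
    # If more than half of the values are identical, s is 0 and the
    # procedure degenerates; that case is not handled in this sketch.
    x = statistics.median(values)
    s = 1.483 * statistics.median([abs(v - x) for v in values])
    for _ in range(max_iter):
        delta = 1.5 * s
        # Winsorize: pull values outside [x - delta, x + delta] to the limits
        w = [min(max(v, x - delta), x + delta) for v in values]
        new_x = statistics.fmean(w)
        new_s = 1.134 * statistics.stdev(w)
        converged = abs(new_x - x) < tol and abs(new_s - s) < tol
        x, s = new_x, new_s
        if converged:
            break
    return x, s


results = [5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 5.1, 9.7]  # hypothetical EQA round
robust_mean, robust_sd = algorithm_a(results)
print(f"robust mean = {robust_mean:.3f}, robust SD = {robust_sd:.3f}")
print(f"classical mean = {statistics.fmean(results):.3f}, "
      f"classical SD = {statistics.stdev(results):.3f}")
```

With the single gross outlier in this example, the robust estimates stay close to the bulk of the data, whereas the classical mean and standard deviation are pulled towards the outlier.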
In addition to the well-established estimators, some less known estimators merit mentioning as well. In fact, there is a family of central location estimators that offer solutions for the following problem: the parameter θ is the estimator of location for which

$$\sum_{i=1}^{n} \left| x_i - \theta \right|^{p}$$

is minimal, where the x_i are the n data points and p is a predefined value (28). For a certain value of p, there is only one value of θ that minimizes this sum for a given data series. This value is called the least power (Lp) estimate. It is interesting to know that the classical average is obtained by setting p to 2, and the median is obtained by setting p to 1. Because the classical average is strongly biased towards outliers but has a very high efficiency, while the median has a low efficiency, it may be interesting to think of an intermediate estimator. This estimator is found by setting p to 1.5, and is called the L1.5-estimator. It is more efficient than the median and is less influenced by outliers than the average.
Another estimator is the MM-estimator, which should have a very low bias towards outliers and is more efficient than the other estimators that are presented here (29,30). Its calculation is relatively complicated though.
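A least power estimate can be obtained numerically, for instance with the following sketch. It relies on the fact that the sum of |x_i - θ|^p is convex in θ for p ≥ 1, so that a simple ternary search between the smallest and largest value finds a minimizer. The search method, the tolerance and the data are assumptions made for illustration.

```python
def lp_estimate(values, p=1.5, tol=1e-9):
    """Location estimate minimizing sum(|x_i - theta| ** p), for p >= 1."""
    lo, hi = min(values), max(values)

    def cost(theta):
        return sum(abs(v - theta) ** p for v in values)

    # Ternary search: valid because the cost function is convex for p >= 1
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


results = [4.8, 4.9, 5.0, 5.1, 9.7]  # hypothetical data with one outlier
print(f"p = 1   (median):  {lp_estimate(results, p=1):.3f}")
print(f"p = 1.5 (L1.5):    {lp_estimate(results, p=1.5):.3f}")
print(f"p = 2   (average): {lp_estimate(results, p=2):.3f}")
```

For p = 1 and an even number of data points, any value between the two middle observations minimizes the sum, so the search returns one of several equally valid solutions.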

Determining the standard deviation
Similar to the case of the assigned value, different ways exist to determine the standard deviation and the EQA provider adopts its own procedure for its determination (6). They belong to two distinct classes. The first class contains the parameters that are fixed beforehand. They may be a value derived from a perception of how laboratories should perform, legislative documents, a small-scale trial or a model of precision, like the Horwitz curve (1,7,8,13,31). The latter however is rarely applied in EQA schemes for clinical laboratories. If historic data are available, the standard deviation could be derived from the assigned value, for example by means of the characteristic function (32,33), which is a mathematical relation to estimate the standard deviation based on the assigned value:

$$SD = \sqrt{\alpha^{2} + \beta^{2} \times (\text{assigned value})^{2}}$$

where α and β are to be estimated from the historical data by means of non-linear regression. The coefficients α and β have a different meaning in explaining the standard deviation. The parameter α principally explains the standard deviation at low concentrations, while the parameter β affects the standard deviation at higher concentrations and approaches the coefficient of variation (CV) when α is low or the concentration is high.
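The behaviour of the characteristic function at low and high concentrations can be checked numerically. In the sketch below the values of α and β are hypothetical; in practice they would be estimated from historical data, which is not shown here.

```python
import math


def characteristic_sd(assigned, alpha, beta):
    """Standard deviation predicted by the characteristic function."""
    return math.sqrt(alpha ** 2 + (beta * assigned) ** 2)


alpha, beta = 0.1, 0.05  # hypothetical coefficients, for illustration only
for assigned in (0.5, 2.0, 10.0, 100.0):
    sd = characteristic_sd(assigned, alpha, beta)
    print(f"assigned = {assigned:6.1f}  SD = {sd:7.4f}  CV = {sd / assigned:.4f}")
```

At low assigned values the standard deviation stays close to α, while at high assigned values the CV approaches β.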
The second class contains the estimates of standard deviation that are based on the reported results.
Since reported EQA data may contain outliers, the classical estimate of the standard deviation should only be used after elimination of outliers, as identified by the Dixon or, preferably, the Hampel or Grubbs test, since the presence of only a few outliers inflates it and makes it unreliable.
EQA providers could also rely on robust estimators for the standard deviation. The ISO 13528 standard proposes Huber's M-estimator H1.5 (called Algorithm A), also for the estimate of variability (13). Other methods propose the robust Qn estimator, which is expected to be more efficient, but loses reliability in case the same value occurs more than once in the data set (34,35).
Another estimator that is easy to calculate is based on the interquartile range (IQR), in which the standard deviation is estimated by dividing the IQR by 1.349 (7,36,37).
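A minimal sketch of the IQR-based estimate is given below. The divisor 1.349 is the one given in the text; the quartile definition (the "inclusive" method of Python's statistics module) is an assumption, and other percentile conventions give slightly different results for small data sets.

```python
import statistics


def sd_from_iqr(values):
    """Robust SD estimate: interquartile range divided by 1.349."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 1.349


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]           # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # same data, one gross outlier
print(f"IQR-based SD:  {sd_from_iqr(clean):.3f} vs {sd_from_iqr(with_outlier):.3f}")
print(f"classical SD:  {statistics.stdev(clean):.3f} vs "
      f"{statistics.stdev(with_outlier):.3f}")
```

In this example the IQR-based estimate is unchanged by the outlier, whereas the classical standard deviation increases roughly tenfold.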

Qualitative and semi-quantitative data
Many clinical EQA schemes also evaluate the results of analytes that are not reported on a continuous scale. These may include, for example, the absence or presence of a particular pathogen species or (drug) substance, for which only two answers are possible: pathogen/substance present or absent. An answer that can only have two values is called dichotomous, or binary. The results of other parameters may be expressed by a semi-quantitative measure, such as integer values on which arithmetic operations should be handled with caution. Traditional measures of laboratory performance, like Z- or Q-scores, cannot be applied here, and laboratory performance for one parameter, one sample is often limited to reporting whether the laboratory has given the consensus or expected answer or not. Although it is, for the patient's safety, extremely important to follow up individual answers for qualitative parameters that are out of consensus, like for example blood groups, combining results and counting the frequency of correct and false results for multiple samples and/or laboratories may yield additional information to evaluate analytical methods or laboratories.
For evaluating positive samples, sensitivity and positive predictive value can be used. Sensitivity is the probability of finding a positive answer for a positive sample; positive predictive value is the probability that a sample is positive when the answer is positive. Specificity is the probability of finding a negative answer for a negative sample; negative predictive value is the probability that a sample is negative if the answer is negative. Specificity and sensitivity are usually used to describe method performance, while positive and negative predictive values are more important from a clinical point of view. A combined score is the reliability, which reflects the percentage of correct results, taking into account a set of positive and negative samples. Standard errors and confidence intervals for these parameters can be calculated using standard formulas that are derived from the binomial distribution (38-40).
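These proportions can be computed from the counts of true and false answers, as in the sketch below. The counts are hypothetical, and the confidence interval shown is the simple normal approximation to the binomial distribution, which is only one of several standard formulas and becomes unreliable for small numbers or proportions close to 0 or 1.

```python
import math


def proportion_ci(successes, total, z=1.96):
    """Proportion with a normal-approximation (Wald) 95% confidence interval."""
    p = successes / total
    se = math.sqrt(p * (1 - p) / total)  # binomial standard error
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


def qualitative_summary(tp, fn, tn, fp):
    """Performance measures from true/false positive and negative counts."""
    return {
        "sensitivity": proportion_ci(tp, tp + fn),
        "specificity": proportion_ci(tn, tn + fp),
        "positive predictive value": proportion_ci(tp, tp + fp),
        "negative predictive value": proportion_ci(tn, tn + fn),
        "reliability": proportion_ci(tp + tn, tp + fn + tn + fp),
    }


# Hypothetical counts: 50 positive and 40 negative samples
summary = qualitative_summary(tp=45, fn=5, tn=38, fp=2)
for name, (p, low, high) in summary.items():
    print(f"{name:27s} {100 * p:5.1f}%  (95% CI {100 * low:5.1f} to {100 * high:5.1f})")
```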
Similar to the usual measures of repeatability and reproducibility, new measures have been introduced (38): accordance for within-laboratory agreement and concordance for between-laboratory agreement. As the equivalent of repeatability, accordance reflects the probability that two identical test materials assessed by the same laboratory under standard repeatability conditions give the same result. As the equivalent of reproducibility, concordance reflects the probability that two identical test materials analysed under different conditions will give the same result. Accordance and concordance can be compared with each other to estimate the proportion of between-laboratory variation: if the concordance is smaller than the accordance, between-laboratory variation is important. Because the magnitude of concordance and accordance depends on the sensitivity, the concordance odds ratio (COR) has been introduced:

$$COR = \frac{\text{accordance} \times (100 - \text{concordance})}{\text{concordance} \times (100 - \text{accordance})}$$

where accordance and concordance are expressed as percentages (38).
Where dichotomous answers are given for a parameter that has an underlying continuous character, for example simple tests that reflect whether a substance is below or above a certain threshold, like human chorionic gonadotropin (hCG) in urine, specific EQAs can be set up with sample concentrations around the decision limit. Models have been developed to obtain estimators of central location and variability to evaluate different measurement methods (41-43). When titers are involved, the result may be dichotomized, for example by evaluating whether the reported titer would or would not lead to an incorrect conclusion (9).
Other systems to deal with qualitative tests are credit-scoring systems. Depending on the answers and their clinical impact, credit points are given or subtracted in order to obtain a final mark for the laboratory (9).
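The concordance odds ratio is straightforward to compute once accordance and concordance are available as percentages. The values used below are hypothetical.

```python
def concordance_odds_ratio(accordance, concordance):
    """COR from accordance and concordance, both expressed as percentages.

    Undefined when either percentage equals 100 or concordance equals 0.
    """
    return (accordance * (100 - concordance)) / (concordance * (100 - accordance))


# Hypothetical percentages, for illustration only
print(concordance_odds_ratio(accordance=90, concordance=80))  # between-laboratory variation
print(concordance_odds_ratio(accordance=85, concordance=85))  # no between-laboratory variation
```

A COR of 1 corresponds to equal accordance and concordance; values above 1 indicate that results agree less well between laboratories than within laboratories.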

Graphical presentation for one parameter, one sample
The evaluation of laboratories and methods is greatly supported by a graphical representation of the data, which is also required by international standards (8,13). To give an informative and concise summary, graphical representations should be informative with as few lines, shapes or colours as possible. Specifically for EQA, it is important to note that the graphs should not be influenced by a small fraction of heavily deviating results. There are two different types of graphs that enable laboratories to evaluate themselves with respect to their peer group or to all the participants: box plots and histograms.
Box plots are based on three different percentiles: the 25th (P25), the 50th (which is equivalent to the median) and the 75th (P75). A rectangle is drawn from P25 to P75 and lines extend the rectangle as far as values are not outliers. The outlier exclusion rule is simple: all values lower than P25 - 1.5 × (P75 - P25) or higher than P75 + 1.5 × (P75 - P25) are considered as outliers (Figure 1). Optionally, outliers can be added as separate dots on the graph. Box plots inform about the location, scale and symmetry of the different groups and, for each group individually, show the presence or absence of outliers (44). Box plots adapted for EQA could be created by showing a box plot of all the data next to a box plot of the method group, with an indication of the individual laboratory result. Coloured or shaded rectangles can be used to indicate the area of acceptance according to different scoring systems. Box plots have the advantage of keeping their visual power even when they are reduced to a small size and hence, they are ideal candidates for reports containing results for multiple parameters.
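The quantities needed to draw such a box plot can be derived as in the following sketch. The percentile convention (the "inclusive" method of Python's statistics module) and the data are assumptions; plotting itself is left out.

```python
import statistics


def boxplot_summary(values):
    """Percentiles, whisker ends and outliers according to the 1.5 x IQR rule."""
    p25, p50, p75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = p75 - p25
    lower_fence = p25 - 1.5 * iqr
    upper_fence = p75 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "p25": p25,
        "median": p50,
        "p75": p75,
        # Whiskers extend to the most extreme values that are not outliers
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])  # hypothetical data
print(summary)
```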
A histogram is a classical nonparametric estimator of the distribution of the data and is still today an important statistical tool for displaying and summarizing data. Its creation is straightforward: (a) divide the interval of the data into subintervals of equal width; (b) count the number of data in each subinterval; (c) display the counts in a bar graph of which the bar heights for each subinterval correspond to the number of data in the corresponding subinterval. Histograms inform about the centre of the distribution, the possible existence of modes and the symmetry of the distribution.
The width, and consequently the number, of intervals is however arbitrary. Many small subintervals lead to an irregularly shaped histogram, while few and large subintervals lead to a very rough estimation of the data. Algorithms that calculate optimal subinterval widths should be applied (45).
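Steps (a) to (c), combined with an automatic choice of the subinterval width, could look like the sketch below. The Freedman-Diaconis rule (width = 2 × IQR × n^(-1/3)) is used here purely as an example of a data-driven rule; it is an assumption and not necessarily the algorithm referred to in the text.

```python
import math
import statistics


def freedman_diaconis_histogram(values):
    """Equal-width histogram with a bin width from the Freedman-Diaconis rule."""
    n = len(values)
    lo, hi = min(values), max(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    width = 2 * (q3 - q1) * n ** (-1 / 3)
    if hi == lo or width <= 0:
        return lo, hi - lo, [n]  # degenerate case: a single bin
    n_bins = math.ceil((hi - lo) / width)
    width = (hi - lo) / n_bins  # equal-width bins that exactly span the data
    counts = [0] * n_bins
    for v in values:
        # The largest value is assigned to the last bin
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return lo, width, counts


results = [4.6, 4.7, 4.8, 4.9, 4.9, 5.0, 5.0, 5.0, 5.1, 5.1, 5.2, 5.3, 5.4]
start, width, counts = freedman_diaconis_histogram(results)  # hypothetical data
for i, count in enumerate(counts):
    left = start + i * width
    print(f"{left:5.2f} to {left + width:5.2f}  {'#' * count}")
```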
A histogram can easily be adapted to show important information related to EQA, as illustrated in Figure 2. In case of peer group evaluation, two histograms could be superimposed: the histogram of all the data, and a histogram of the peer group of the laboratory.
Evaluation intervals can be drawn by means of rectangles that are put on the background of the histogram. In this way, it is easy to estimate the fraction of data that are outside of the limits, how the own method performs with respect to the whole group and, importantly, how the individual laboratory result is situated with respect to the own method group, to all the data and to the decision limits.

Graphical presentation for one parameter, multiple samples
Combining information from multiple samples can easily be done by means of a scatter plot in which the results of the laboratory are plotted against the assigned values. A robust linear regression line drawn through the points on the scatter plot not only gives a visual appraisal of the laboratory's bias but may also help the interpretation of the analytical variability or even help to identify gross outliers of which the cause may be outside the analytical phase (46).
A scatter plot that combines the results of two samples, in which the results reported by all the laboratories for one sample are plotted against those for another, similar sample, is called a Youden plot (Figure 3). Youden plots can be made of the original values or of rescaled values, such as Z-scores (13,47). An important recent development is the addition of a robust confidence ellipse for each method (48,49). The position of the robust confidence ellipses with respect to each other reveals inter-method biases, of which the interpretation is relevant for commutable samples. The position of the points reflecting the values reported by individual laboratories informs about laboratory-specific bias or variability.

Combining information from different parameters and/or samples
Several authors advised that reports could go beyond the evaluation of a certain parameter for a given sample. Combining information of multiple parameters, or multiple samples, informs about a global quality level of the laboratory and, in case samples were analysed at di erent time points, informs about the evolution of the quality level of the laboratory.
Results can be combined in different ways. In the first instance, laboratories might be asked to analyse the sample multiple times, in order to assess the repeatability (11). It should be noted however that two observations lead to a very uncertain measure of repeatability and, moreover, multiple analyses should always be handled with caution except when the laboratories analysed vials that have the same content but different labels (6).
In the second instance, some parameters should be considered together because the result of one parameter depends on the result of another parameter - in statistical terms: the parameters are dependent on each other. Examples are profile data, like a serum electrophoresis profile or a leukocyte differential count. The sum of the different parameters within these profiles is a fixed value, for example 100% in the case that the parameters represent fractions of different types that are expressed as a percentage. In this case, fractions have to be viewed as a whole and a multivariate statistical approach is more appropriate to analyse and interpret the data. Individual laboratory evaluation is based on the multivariate distance of the laboratory results for several parameters from the centre that is made up by the assigned values of each of the parameters. This distance, the so-called Mahalanobis distance, is obtained by robust estimates of multivariate centre and variability (50). Performance characterisation of analytical methods for profile data is also possible by means of a multivariate CV, which encompasses the variability estimates of the different parameters that the profile is made of (51).
In the third instance, Z-scores can be combined in various ways. Because of their standardization with respect to the standard deviation, Z-scores are better candidates to be combined for different parameters than the original reported values or Q-scores (6). A simple way to combine Z-scores is to sum them over different analytes determined for the same sample (6). Sums can be taken of (i) the Z-scores themselves (SZ); (ii) the Z-scores themselves, rescaled by dividing SZ by the square root of the number of data involved (RSZ); (iii) their absolute values (SAZ) or (iv) their squared values (SSZ). Although the sum of the absolute values and the sum of the squared values lead to similar conclusions, the sum of the squared values is preferred because it has better statistical properties. It should be noted that, for a judicious interpretation of these sums, heavily deviating Z-scores often find their cause outside of the analytical process and, for this reason, they should be identified by means of an outlier test and be omitted from the calculation of the sums. If outliers are omitted, an extreme RSZ value is an indicator of bias and an extreme SAZ value is an indicator of high imprecision. Extreme values can be identified by comparing RSZ values with the standard normal distribution and SSZ values with a chi-square distribution.
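The four sums can be computed as in the sketch below. The Z-scores are hypothetical and were chosen so that each individual score is acceptable while the rescaled sum reveals a consistent positive deviation. The comparison of RSZ with the standard normal distribution assumes independent Z-scores; the comparison of SSZ with a chi-square distribution (with as many degrees of freedom as there are Z-scores) is not shown, since it requires a chi-square routine outside the Python standard library.

```python
import math
from statistics import NormalDist


def combine_z_scores(z_scores):
    """Return the combined scores SZ, RSZ, SAZ and SSZ."""
    n = len(z_scores)
    sz = sum(z_scores)
    return {
        "SZ": sz,
        "RSZ": sz / math.sqrt(n),                # rescaled sum
        "SAZ": sum(abs(z) for z in z_scores),    # sum of absolute values
        "SSZ": sum(z * z for z in z_scores),     # sum of squares
    }


z_scores = [0.5, 1.0, 1.5, 1.0]  # hypothetical Z-scores for four analytes
combined = combine_z_scores(z_scores)
# Two-sided probability of an RSZ at least this extreme under N(0, 1)
p_rsz = 2 * (1 - NormalDist().cdf(abs(combined["RSZ"])))
print(combined)
print(f"two-sided probability for RSZ: {p_rsz:.4f}")
```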
Z-scores for di erent samples analysed over a certain period can be combined as well, some authors speak in this case of running scores (8). It is noteworthy stating that a problem from a speci c round may have a 'memory' e ect for future running scores. In this case, running scores can be smoothed by taking weighted sums of Z-scores, in a way that the in uence of Z-scores on the running statistic is bigger for recent than for older Z-scores (6).
Whenever the normal distribution of the data around the assigned value cannot be assured, not even after a transformation or after omitting outliers, combining Z-scores becomes cumbersome, and a nonparametric approach that involves the reported values for multiple samples can help to evaluate laboratories. When the difference between an individual value and the assigned value of a certain parameter for a certain sample is considered, laboratories can be ranked according to the absolute value of this difference. Each reported value is allocated its own percentile value, i.e. the percentage of laboratories performing equal or worse. Subsequently, the median of the percentile values obtained by a certain laboratory for different samples is taken, which yields a score on a scale from 0 to 100. Lower values indicate good performance, higher values point to weak performance (52).
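The ranking procedure can be sketched in Python as follows. The function names and the data are invented. As an assumption made for this sketch, the percentile of a result is taken as the percentage of laboratories whose absolute difference from the assigned value is equal or smaller, so that lower scores correspond to good performance, in line with the interpretation given above.

```python
from statistics import median

def percentile_ranks(abs_diffs):
    """Percentage of laboratories with an equal or smaller absolute difference."""
    n = len(abs_diffs)
    return [100.0 * sum(1 for other in abs_diffs if other <= d) / n
            for d in abs_diffs]

def laboratory_scores(reported, assigned):
    """Median percentile per laboratory over all samples.

    reported: dict mapping laboratory -> list of reported values, one per sample
    assigned: list of assigned values, in the same sample order
    """
    labs = sorted(reported)
    per_lab = {lab: [] for lab in labs}
    for j, target in enumerate(assigned):
        diffs = [abs(reported[lab][j] - target) for lab in labs]
        for lab, pct in zip(labs, percentile_ranks(diffs)):
            per_lab[lab].append(pct)
    return {lab: median(values) for lab, values in per_lab.items()}

# Invented results of four laboratories for three samples.
assigned_values = [10.0, 20.0, 30.0]
reported_values = {
    "A": [10.1, 20.2, 29.9],
    "B": [10.5, 19.0, 31.0],
    "C": [12.0, 23.0, 27.0],
    "D": [10.2, 20.5, 30.5],
}
scores = laboratory_scores(reported_values, assigned_values)
```

In this invented example, laboratory A is closest to the assigned value for every sample and obtains the lowest score (25), while laboratory C is furthest away each time and obtains the highest score (100).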
Finally, results obtained by the same laboratory for the same parameter on samples with different assigned values can be combined by means of a linear regression model in which the independent variable is the assigned value and the dependent variable is the value found by the laboratory. Several statistics can be derived from this approach, such as the long-term coefficient of variation (LCVa) (53). It is equivalent to the variability of the points around the regression line divided by the assigned value or the long-term bias. Another statistic is the long-term bias (LTB), which is determined by the difference between the regression line and the 45-degree line reflecting equality between the assigned value and reported values. Combining the long-term coefficient of variation and the long-term bias leads to an estimate of the uncertainty of measurement (MU) (54). It should be noted that these parameters depend largely on the assumptions of the regression model and can only be interpreted in the absence of outliers and in the presence of a strict linear relationship between the assigned value and reported values. In addition, the MU assumes that bias and variability are independent (54).
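The regression idea can be illustrated with the Python sketch below. It is not the exact definition of LCVa or LTB from the cited references: as assumptions made for illustration, the variability is expressed as the residual standard deviation relative to the mean assigned value, and the bias is evaluated as the distance between the regression line and the 45-degree line at the mean assigned value. All function names are hypothetical and the data are invented.

```python
import math

def fit_line(assigned, reported):
    """Ordinary least-squares fit of reported values on assigned values."""
    n = len(assigned)
    if n < 3:
        raise ValueError("at least three samples are required")
    mean_x = sum(assigned) / n
    mean_y = sum(reported) / n
    sxx = sum((x - mean_x) ** 2 for x in assigned)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(assigned, reported))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(assigned, reported)]
    residual_sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, residual_sd

def long_term_summary(assigned, reported):
    """Variability and bias in % of the mean assigned value (illustrative)."""
    slope, intercept, residual_sd = fit_line(assigned, reported)
    mean_x = sum(assigned) / len(assigned)
    # Scatter around the regression line, relative to the mean assigned value.
    cv_percent = 100.0 * residual_sd / mean_x
    # Regression line minus 45-degree line, evaluated at the mean assigned value.
    bias_percent = 100.0 * (intercept + slope * mean_x - mean_x) / mean_x
    return {"slope": slope, "intercept": intercept,
            "cv_percent": cv_percent, "bias_percent": bias_percent}

# Invented laboratory that reports every sample 5% too high.
example = long_term_summary([2.0, 4.0, 6.0, 8.0], [2.1, 4.2, 6.3, 8.4])
```

For this invented laboratory, the sketch returns a bias of about 5% and practically no scatter around the regression line, which corresponds to a purely proportional deviation.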
Another approach to the linear regression problem is first to exclude outliers from the regression model, then to consider the variability of the regression model as a measure of long-term analytical variability, and subsequently to consider the bias of the regression line, after omitting regression lines with high variability (46).

Discussion
Evaluation methods applied to data gathered in EQA rounds vary widely, not only for continuous data, but also for semi-quantitative and qualitative data. For qualitative and semi-quantitative data, it is of greater interest to combine results of different samples or surveys to estimate laboratory or method performance.
For quantitative parameters, several methods have been proposed to find a consensus value or to estimate the variability. Unfortunately, there is no best method to find an assigned value or standard deviation that works well in all conditions. Although several authors have attempted to compare different methods, the sets of methods that were compared and the data on which they were compared varied too much to draw unequivocal conclusions. Each EQA provider can compare candidate methods by retrospective analysis of its own dataset, using statistical techniques that are able to estimate the uncertainty of statistical parameters with unknown distribution, such as nonparametric bootstrapping (55). An alternative is Monte Carlo simulation, a name given to any approach that uses the generation of random numbers to find answers to specific questions. It is based on the principle that any process can be split into a series of simpler events, each represented by a probability distribution (2). The method has been applied in various studies to evaluate techniques for determining the assigned value (2,25,56) or for scoring laboratories (25,57). Irrespective of the performance of each statistical method, it should not be forgotten that EQA providers have to be able to explain their statistical methods to non-statisticians in the participating laboratories. For this reason, EQA providers may prefer a statistical technique that performs less well but is easy to explain and still able to handle outlying values.
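A minimal Python sketch of nonparametric bootstrapping is given below. It estimates the standard error of a location estimator by resampling the reported values with replacement. The function name `bootstrap_standard_error`, the number of resamples, the fixed seed and the data are assumptions made for illustration only.

```python
import random
from statistics import mean, median

def bootstrap_standard_error(values, estimator=median, n_resamples=2000, seed=0):
    """Bootstrap standard error of `estimator` applied to `values`."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Draw n values with replacement from the observed results.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    centre = sum(estimates) / len(estimates)
    variance = sum((e - centre) ** 2 for e in estimates) / (len(estimates) - 1)
    return variance ** 0.5

# Invented results for one sample, including one grossly deviating value.
results = [9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.3, 10.4, 25.0]
se_median = bootstrap_standard_error(results, estimator=median)
se_mean = bootstrap_standard_error(results, estimator=mean)
```

On these invented data, the bootstrap standard error of the median is expected to be clearly smaller than that of the mean, because a single deviating value affects the mean of almost every resample in which it appears. The same function could be applied to any candidate estimator of the assigned value on a provider's own retrospective data.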
Although combining results for different analytes or samples may reveal novel information from the reported results, it should be noted that non-experts might misinterpret scores based on summed Z-scores. Their general use should therefore be handled with caution (6,8).
An important question that has not often been assessed is the minimum number of data needed to obtain reliable statistics. It has been mentioned that a minimum of 20 values is necessary to obtain reliable robust estimates (31), although some estimators still estimate Z-scores correctly even for groups as small as 6 (25). Other authors suggest modifying the limits for the evaluation of Z-scores depending on the peer group size (50).
In conclusion, there should be no doubt that feedback reports from EQA providers to participating laboratories serve as a major tool to support the pedagogic role of EQA. Although there are mistakes that can only be detected by EQA, it should be realised that EQA is only one aspect of the entire quality management system in laboratories. Any action undertaken on the basis of EQA reports may already be too late: results that were subject to the same mistake as the faulty EQA result may have been produced and reported before the mistake could be detected by means of the EQA report. For this reason, laboratories need to ensure and implement all possible quality standards in the total testing process, since EQA reports can only serve as a follow-up of such performance (3).