Resampling methods in Microsoft Excel® for estimating reference intervals

Computer- intensive resampling/bootstrap methods are feasible when calculating reference intervals from non-Gaussian or small reference samples. Microsoft Excel® in version 2010 or later includes natural functions, which lend themselves well to this purpose including recommended interpolation procedures for estimating 2.5 and 97.5 percentiles. The purpose of this paper is to introduce the reader to resampling estimation techniques in general and in using Microsoft Excel® 2010 for the purpose of estimating reference intervals in particular. Parametric methods are preferable to resampling methods when the distributions of observations in the reference samples is Gaussian or can transformed to that distribution even when the number of reference samples is less than 120. Resampling methods are appropriate when the distribution of data from the reference samples is non-Gaussian and in case the number of reference individuals and corresponding samples are in the order of 40. At least 500-1000 random samples with replacement should be taken from the results of measurement of the reference samples.

Reference intervals (1,2) are amongst the essential tools for interpreting laboratory results. Excellent theoretical and practical studies have been published since the 1960ies for establishing reference intervals as a concept in contrast to "normal intervals" (3), establishing principles for selecting reference persons (4) and performing the appropriate statistical data analysis (5)(6)(7)(8). Reference intervals continue to be an active area of research (2,9) and at the core of practical work in clinical laboratories (6,10).
The reference interval includes 95% the results/ values measured in a representative sample of reference subjects and is bounded by lower and upper reference limits marking the 2.5 and 97.5 percentiles respectively (Figure 1). The reference interval including the upper and lower reference limits are naturally determined in a ranked set of data values if the number of observations is large, e.g. > 1000. When the number of values is small, e.g. in the order of 40-120 and the data are non-Gaussian, resampling methods are useful to estimate the reference interval. The reference population is commonly a population of apparently healthy individuals, but may also consist of any other well-defined population of interest for diagnostic/comparative purposes.

Different methods for calculating reference intervals
Estimating reference intervals means dealing with uncertainties and probabilities. All probabilistic methods are based on assumptions about a theo- Theodorsson E. Microsoft Excel® for estimating reference intervals retical distribution which fundamentally determines the conclusion that can be drawn from the data. The most commonly used probabilistic methods are the parametric methods that assume that observations in the population are distributed according to the Gaussian/Normal distribution. These are the methods of choice if the data are Gaussian or can be transformed to that distribution since the data themselves with the added knowledge of the distribution of the data in the sample and population enables the user of the data to draw firmer conclusions than if only the data are known.
When the data are non-Gaussian and cannot be transformed to the Gaussian distribution the analyst is left with non-parametric or resampling methods for determining reference intervals. Data from 120 reference intervals are needed for reliably determining non-parametric reference intervals (7,11). Resampling/bootstrap methods (12) take numerous repeated sub-samples with replacement of the available data in order to estimate the distribution of the data in the population, including the reference intervals.
Resampling methods with replacement are commonly called bootstrap methods by many believed to refer to the fairytale of Baron Munchausen pulling himself and his horse out of a swamp by his hair. Bootstraps are in fact the tab, loop or handle at the top of boots allowing the use of fingers pull the boots on. The concept of bootstrap is therefore a metaphor for clever, self-induced salvaging efforts. There is a wealth of current intro-ductory literature on resampling methods in general (13) and resampling methods for Microsoft Excel ® in particular (14).
Resampling with replacement means that a large number of samples with replacement are then taken from the original sample and the statistic of interest is calculated from this pseudo-population as estimate of the corresponding parameter of the population ( Figure 2).     (5)(6)(7)15) and Clinical and Laboratory Standards Institute (CLSI) (16) recommend the following formulas: lower limit has the rank number 0.025 x (n+1) and the upper limit the rank number 0.975 x (n+1). This method is practicable mainly when the number of reference samples is 120 or more.
While there is agreement on how to calculate the median, agreement on how to calculate quantiles including percentiles is lacking due to the fact that interpolation is frequently needed, especially when the number of observations is small. Unfortunately there is no general agreement on the best way of performing interpolations. In 1996 Hyndman and Fan published an influential paper on calculating quantiles (17). It evaluated the methods used by popular statistics packages with the intention to find a consensus and basis for standardization. Of the nine formulas commonly used, four satisfied five of the six properties desirable for a sample quantile. Hyndman and Fan felt that the "Linear interpolation of the approximate medians for order statistics" (used by the Excel =PERCENTILE.EXC function) was best due to the approximately median-unbiased estimates of the quantiles, regardless of the distribution (17).
Unfortunately only a limited movement towards standardization has taken place since 1996 (18). Most statistical programs stick to their traditions when calculating quantiles probably to maintain backwards computability with older versions.
CLSI has published an authoritative and practical standard "Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratory: Approved Guideline EP28-A3c" (11) and the whole subject is authoritatively summarized by Harris and Boyd (7).
Several excellent software packages are available for the statistical analysis of reference intervals (5,19-23) but they are not as commonly used as they should e.g. due to cost constraints. Recently Geffré et al. published "Reference Value Advisor", a set of comprehensive freeware macros for Microsoft Excel ® to calculate reference intervals (24). Ittogether with the original paper -can be downloaded at http://www.biostat.envt.fr/spip/spip. php?article63 which implements the procedures recommended e.g. by EP28-A3c. Increased availability of low-cost computing machinery and resampling methods has made distribution independent computer intensive methods accessible in routine work (25).
The fundamental purpose of statistics in general and statistics in reference interval estimation is to extrapolate from a limited number of observations/data to the whole population of observations. All statistical methods are based on assumptions which fundamentally decide the conclusion that can be made from the data. Parametric statistical methods which are the most widely used, assume that the population values are distributed according to a mathematically well-defined distribution -the Gaussian/Normal distribution. Parametric methods were initially described in the first half of the twentieth century when methods for performing extensive calculations were primarily manual and semi-automated at best. Using distribution functions to describe natural phenomena including research data, substantially simplified the calculations required for statistical analysis. Without these known statistical distribution functions it would have been impossible to use advanced statistical methods in practice.
The wide availability of low-cost computing power in the 1980ies substantially changed the options for working with the data as they turn out irrespective of theoretical distributions. A seminal work on resampling methods by Jones in 1956 (26) was followed by the very influential works by Efron et al. (12,25) which brought these methods in the mainstream of data analysis.
The purposes of the present paper are to provide detailed instructions on the use of distribution-independent computer intensive resampling/bootstrap methods using standard functions of Microsoft Excel ® 2010 or later for estimating reference intervals and to put resampling methods into a theoretical and practical perspective. A simple method based on ranked results was initially used (27,28). The results were sorted using the sorting function in Microsoft Excel ® and the rank numbers of the reference interval computed using the following formulas: lower limit has the rank number 0.025 x (n+1) and the upper limit the rank number 0.975 x (n+1) where n is the total number of reference samples.
The INDEX function in Microsoft Excel ® is used to locate a single data value amongst data in a table by specifying the row and column numbers corresponding to the data point to be retrieved. The format is = INDEX(range_to_search_in, row, column).
The $ sign before the row and column numbers is used to mark the fixed data column and row to be used ( Figure 2).
The COUNT function in Microsoft Excel ® totals the number of cells in a selected range of cells that contain a specific type of data, e.g. cells containing data in a range of cells that can contain empty cells. In this case it counts the number of observations entered between row 1 and 10 000 -COUNT($A$1:$A$10000). It is used here in order to eliminate the need for the user to keep track of the number of reference values in the reference sample.  DBETWEEN(1,83)). The increased speed is caused by the fact that Excel then does not need to repeatedly determine the number of reference values you have entered.

Subsequent randomly drawn reference values
Drag the small quadrant in the lower right corner of cell C1 to column AG83 thereby creating 30 samples, each of 83 random observations from the original observations. Extending the columns to the right you can increase the number of samples e.g. to 1000 random samples.   Step Instruction Input the data from the reference persons In column A. As many data values as needed up to 10 000 data points, each in its own cell Selecting the random samples from the data from the reference persons • In column C, row 1 enter =INDEX($A$1:$A$10000;RANDBETWEEN(1;COUNT($A$1:$A$10000))) • Select the solid black quadratic box in the lower right corner of column c, row 1 and pull it downwards the number of rows corresponding to the number of observations you want in each sample from the reference data This will copy the cells downwards, as many as you wish, making e.g. random picks of 40 from the samples from the reference persons. It is crucial for this application that you make at least 39 random picks. Step Instruction

Calculate the 2.5 percentile
Calculate the 97.5 percentile Create several random samples Copy the column C you have created to the right to make as many random samples from the reference sample as you wish: If you wish to create 500 random samples, draw/copy 500 columns to the right and if you wish to create 1000 random samples, 1000 columns to the right

Theodorsson E. Microsoft Excel® for estimating reference intervals
The computer time needed for re-calculating 1000 samples of 83 observations each from a sample of 83 observations was less than 1 second.

Discussion
The resampling model built by standard functions of Microsoft Excel ® 2013 and illustrated here resulted in the practically identical reference interval calculated by Microsoft Visual Basic for Applications and described by Geffré et al. (24).
Resampling methods are free from the assumption that the observations are distributed according to a certain theoretical distribution, but importantly assume that the underlying population distribution is practically the same as that in a particular sample. This means that a sufficient number of observations is needed in the sample to make sure that it represents the population. Tu and Zhang have found that applying resampling methods to a random sample of 50 from true Gaussian distribution only resulted in coverage of the true confidence interval estimate in 88% of the cases, even though the improved bias-corrected-and-accelerated bootstrap (29) is used. Good (13) recommends in the order of 100 observations in the sample which is close to the 120 individuals/observations recommended by IFCC (8,30)  Bjerner et al. (32) have shown that parametric methods may be preferable to resampling methods even when the number of reference samples is less than 120 when the distributions of observations in the reference samples is Gaussian or can transformed to that distribution.

Microsoft Excel ®
Microsoft Excel ® is currently the most widely used and powerful spreadsheet program. Its widespread availability and versatility constitutes its major advantages enabling the users to perform powerful descriptive, inferential graphical and statistical analysis using even methods not yet programmed in statistical packages. The program has, especially in its earlier versions been criticized for its mathematical (33), statistical (34) and graphical properties. Several of the limitations have been overcome in the most recent versions of Excel (35), since 2010.

Is Microsoft Excel ® appropriate for resampling estimations?
Amongst the advantages of using Microsoft Excel ® for calculations in general and for resampling in particular is that the software is widely available and used for administrative purposes in the laboratory and facilitates easy data handling and visualization. Using the natural functions of Microsoft Excel ® 2010 or later the resampling procedures are made highly visual, providing the user with hands on visual contact with all the random samples taken and the calculations made. This is, however, also the disadvantage of the use of the natural functions gallery of Microsoft Excel ® for this purpose since there is a risk of errors when performing calculation on large tables of data -e.g. the 1000 columns needed for the present purpose.

Conclusion
Resampling methods are recommended by IFCC and are widely available in software for handling reference values (5,(19)(20)(21)(22)(23)(24) but not as yet as yet a part of the CLSI guideline (16). Resampling methods are appropriate when the distribution of data from the reference samples is non-Gaussian and in case the number of reference samples are in order of at least 40. At least 500-1000 random samples with replacement should be taken from the reference samples. Resampling methods are conveniently implemented in versions of Microsoft Excel® 2010 or later as shown here.