## Introduction

Serologic tests are commonly used in seroepidemiologic and prevalence studies (*1*). Such studies are typically conducted to understand the current status of a condition of interest, say a disease. For example, over the past two years, soon after the announcement of the coronavirus disease pandemic, many serologic tests have been developed for the diagnosis of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, and numerous seroepidemiologic studies have been conducted to determine the prevalence of the disease in various parts of the world. One population-based study, for instance, revealed a SARS-CoV-2 seroprevalence of 9.7% in the Principality of Andorra (*2*). The results of seroepidemiologic studies are generally used by health care researchers to understand where we stand, by estimating the health burden and the economic impact of a disease, and by policy-makers to better identify priorities and plan accordingly (*3*). But are the values obtained from these studies valid?

At the heart of the design is the method by which we identify the condition of interest. We usually use a diagnostic test to detect the condition (*e.g.*, a disease). However, a diagnostic test is usually not perfect; it may give false-positive and false-negative results: not all people with positive tests are diseased, and not all with negative tests are disease-free (*4*). This is why the prevalence derived from these studies, the so-called “apparent prevalence” (*pr*), is not necessarily an unbiased estimate of the true prevalence (*π*), the true proportion of diseased people in the population or the study sample. Herein, we discuss how an unbiased estimate of *π* can be derived from the obtained *pr* and the test sensitivity (*Se*) and specificity (*Sp*). We also used a computer simulation to better investigate the situation.

## Prevalence

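The *pr* (the apparent prevalence) is defined as the proportion of tested people with a positive test (*T*^{+}) (*5*). Therefore:

$$pr = P(T^{+}) = \pi \times TPR + (1 - \pi) \times FPR \qquad \text{(Eq. 1)}$$

where *TPR* and *FPR* are the true-positive and false-positive rates, respectively. Substituting *TPR* = *Se* and *FPR* = 1 − *Sp*, we have (*4*):

$$pr = Se \, \pi + (1 - Sp)(1 - \pi) \qquad \text{(Eq. 2)}$$

Solving the above equation for *π* (the true prevalence) yields:

$$\pi = \frac{pr + Sp - 1}{Se + Sp - 1} \qquad \text{(Eq. 3)}$$

This shows that the true prevalence (*π*) and the apparent prevalence (*pr*) are linearly related (Figure 1). If we take into account the uncertainty existing in the measured estimates of *pr*, *Se*, and *Sp*, Eq. 3 becomes:

$$\hat{\pi} = \frac{\widehat{pr} + \widehat{Sp} - 1}{\widehat{Se} + \widehat{Sp} - 1} \qquad \text{(Eq. 4)}$$

where $\hat{x}$ (any variable with a hat, *e.g.*, $\hat{\pi}$ or $\widehat{pr}$) represents an estimate of *x* (*e.g.*, *π* or *pr*). Assuming that *pr* and the test *Se* and *Sp* are independent, employing basic calculus and using a first-order Taylor series expansion, we have (*6*, *7*):

$$\sigma^{2}_{\hat{\pi}} \approx \frac{\sigma^{2}_{\widehat{pr}} + \hat{\pi}^{2} \, \sigma^{2}_{\widehat{Se}} + (1 - \hat{\pi})^{2} \, \sigma^{2}_{\widehat{Sp}}}{(\widehat{Se} + \widehat{Sp} - 1)^{2}} \qquad \text{(Eq. 5)}$$

where *σ*^{2}_{x} represents the variance of *x*. Based on the results, we can calculate the 95% confidence interval (CI) of the true prevalence (*8*-*10*). To portray the effect of variations in the estimates of *pr*, *Se*, and *Sp* on *π*, we conducted a Monte Carlo simulation.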
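The first-order Taylor (delta-method) step mentioned above can be written out explicitly. The sketch below assumes the usual corrected-prevalence form, *π* = (*pr* + *Sp* − 1)/(*Se* + *Sp* − 1), and independence of the three estimates; the symbol *D* is shorthand introduced here for *Se* + *Sp* − 1 and is not part of the original notation.

```latex
\begin{align*}
D &= Se + Sp - 1, \qquad \pi = \frac{pr + Sp - 1}{D} \\[4pt]
\frac{\partial \pi}{\partial pr} &= \frac{1}{D}, \qquad
\frac{\partial \pi}{\partial Se} = -\frac{pr + Sp - 1}{D^{2}} = -\frac{\pi}{D}, \qquad
\frac{\partial \pi}{\partial Sp} = \frac{Se - pr}{D^{2}} = \frac{1 - \pi}{D} \\[4pt]
\sigma^{2}_{\hat{\pi}} &\approx
\left(\frac{\partial \pi}{\partial pr}\right)^{2} \sigma^{2}_{\widehat{pr}}
+ \left(\frac{\partial \pi}{\partial Se}\right)^{2} \sigma^{2}_{\widehat{Se}}
+ \left(\frac{\partial \pi}{\partial Sp}\right)^{2} \sigma^{2}_{\widehat{Sp}}
= \frac{\sigma^{2}_{\widehat{pr}} + \pi^{2} \sigma^{2}_{\widehat{Se}} + (1 - \pi)^{2} \sigma^{2}_{\widehat{Sp}}}{D^{2}}
\end{align*}
```

The identity *Se* − *pr* = (1 − *π*)*D* follows from writing *pr* as *Se* *π* + (1 − *Sp*)(1 − *π*). The approximation deteriorates as *Se* + *Sp* approaches 1, where the denominator vanishes and the test carries no diagnostic information.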

## Computer simulation

We assumed that the *Se* and *Sp* of a diagnostic test were measured in a hypothetical validity study on 225 individuals: 75 in whom the disease was confirmed and 150 without the disease (Table 1). The results gave a *Se* of 93% (95% CI 88% to 99%, *σ*^{2}_{Se} = 8.3 × 10^{-4}) and a *Sp* of 90% (95% CI 85% to 95%, *σ*^{2}_{Sp} = 6.0 × 10^{-4}).
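The variances quoted above are consistent with simple binomial variances, *p*(1 − *p*)/*n*. The following minimal sketch illustrates this; the counts of 70 true positives among 75 diseased and 135 true negatives among 150 disease-free individuals are assumptions inferred from the reported percentages and group sizes, and the function name is hypothetical.

```python
from math import sqrt


def binomial_summary(successes: int, n: int, z: float = 1.96):
    """Return (proportion, variance, lower, upper) using the Wald approximation."""
    p = successes / n
    var = p * (1 - p) / n          # binomial variance of a proportion
    half_width = z * sqrt(var)     # half-width of the Wald confidence interval
    return p, var, p - half_width, p + half_width


# Assumed counts (not stated in the text): 70/75 and 135/150
se, var_se, se_lo, se_hi = binomial_summary(70, 75)
sp, var_sp, sp_lo, sp_hi = binomial_summary(135, 150)

print(f"Se = {se:.3f}, variance = {var_se:.2e}, 95% CI {se_lo:.2f} to {se_hi:.2f}")
print(f"Sp = {sp:.3f}, variance = {var_sp:.2e}, 95% CI {sp_lo:.2f} to {sp_hi:.2f}")
```

Under these assumed counts, the variances and Wald intervals agree with the values given in the text after rounding.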

##### Table 1

To further investigate the situation, we used a Monte Carlo simulation (Table 2, Supplementary material). We assumed an arbitrarily chosen population size of 1,000,000 people, 200,000 of whom had the disease, *i.e.*, a population true prevalence of 0.20. We randomly selected a sample of 300 individuals from the population. Each person in the study sample was then tested with a diagnostic test with *Se* and *Sp* values randomly selected from the above-mentioned distributions (assumed to be Gaussian with a mean of 93% and variance of 8.3 × 10^{-4} for the *Se*, and a mean of 90% and variance of 6.0 × 10^{-4} for the *Sp*) (Table 2). The *π*_{s}, the proportion of individuals in the sample with the disease (true prevalence of the disease in the sample); the *pr*_{s}, the proportion of people in the sample with a positive test (apparent prevalence of the disease in the sample); and the calculated true prevalence (in the sample), *π*_{c}, derived from Eq. 4 and 5, were then estimated for each sample. The above steps were repeated for an arbitrarily chosen 200,000 samples. The frequency distributions of *π*_{s}, *pr*_{s}, and *π*_{c} were then plotted and compared. Linear regression analysis (no-intercept model) was used to determine the relationship between *π*_{s} and *π*_{c}.

##### Table 2
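The simulation steps described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' program: it uses only the Python standard library, runs far fewer repetitions than the 200,000 used in the study to keep the run time short, clips the Gaussian draws of *Se* and *Sp* to the interval 0 to 1, and assumes that the correction step uses the point estimates (0.93 and 0.90) rather than the values drawn for each sample. All function and variable names are hypothetical.

```python
import random
import statistics


def simulate(n_reps=2000, n_sample=300, pop_size=1_000_000, n_diseased=200_000,
             se_mean=0.93, se_var=8.3e-4, sp_mean=0.90, sp_var=6.0e-4, seed=1):
    """Return lists of pi_s, pr_s and pi_c, one value per simulated sample."""
    rng = random.Random(seed)
    pi_s, pr_s, pi_c = [], [], []
    for _ in range(n_reps):
        # Sample without replacement; indices below n_diseased denote diseased people
        ids = rng.sample(range(pop_size), n_sample)
        diseased = sum(i < n_diseased for i in ids)
        healthy = n_sample - diseased
        # Test characteristics for this sample, drawn from Gaussians and clipped to [0, 1]
        se = min(1.0, max(0.0, rng.gauss(se_mean, se_var ** 0.5)))
        sp = min(1.0, max(0.0, rng.gauss(sp_mean, sp_var ** 0.5)))
        true_pos = sum(rng.random() < se for _ in range(diseased))
        false_pos = sum(rng.random() < 1 - sp for _ in range(healthy))
        pr = (true_pos + false_pos) / n_sample
        pi_s.append(diseased / n_sample)   # true prevalence in the sample
        pr_s.append(pr)                    # apparent prevalence in the sample
        # Corrected prevalence, using the point estimates of Se and Sp (assumption)
        pi_c.append((pr + sp_mean - 1) / (se_mean + sp_mean - 1))
    return pi_s, pr_s, pi_c


def slope_no_intercept(x, y):
    """Least-squares slope of y on x for a regression line through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


pi_s, pr_s, pi_c = simulate()
print("mean pi_s:", round(statistics.mean(pi_s), 3))
print("mean pr_s:", round(statistics.mean(pr_s), 3))
print("mean pi_c:", round(statistics.mean(pi_c), 3))
print("slope of pi_c on pi_s:", round(slope_no_intercept(pi_s, pi_c), 3))
```

With these settings, the mean apparent prevalence should lie near 0.93 × 0.20 + 0.10 × 0.80 = 0.266, whereas the means of the sample true prevalence and the corrected prevalence should both lie near 0.20.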

## Simulation results and discussion

The mean true prevalence (*π*_{s}) was 0.20 (95% CI 0.16 to 0.25), equal, as expected, to the population true prevalence (*π*) of 0.20. The mean apparent prevalence (*pr*_{s}) was 0.27 (0.20 to 0.33), a biased estimate of the true prevalence (*π*_{s}) (Figure 1). The mean calculated true prevalence (*π*_{c}), 0.20 (0.14 to 0.26), however, was an unbiased estimate of the true prevalence (*π*_{s}) (Figure 2). The slope of the regression line was almost 1; the model could explain almost all of the variance observed in the *π*_{c} (Figure 3). The observed variance of the *π*_{s} distribution was less than that of the *pr*_{s} (Figure 2). The former was attributed to the sampling variation; the latter, to the sampling variation and the variability in the test *Se* and *Sp* used for each sample. The variance of the *π*_{c} distribution (similar to that of the *pr*_{s}) was also due to the variations in estimating the *pr*_{s} and the test *Se* and *Sp* (Eq. 4).

It is important to note that the term “test” in this context should be construed in a general way as any means for classifying individuals, be it a laboratory test for checking a biomarker, an imaging procedure, or a physical examination to check for the presence or absence of a sign (*11*, *12*). To elaborate on the topic presented, let us examine the following example.

### Example

In the first round of a population-based SARS-CoV-2 serological screening study conducted in the Principality of Andorra, the researchers found that 6,816 of 70,389 tested people were seropositive, translating into a seroprevalence, *pr*_{s}, of 9.7% (95% CI 9.5% to 9.9%) (*2*). The *Se* and *Sp* of the diagnostic test they used (Livzon rapid test, Zhuhai Livzon Diagnostics Inc, Guangdong, China) were 92% (84% to 96%) and 100% (95% to 100%), respectively. The values were derived from a validation study conducted on 48 diseased and 48 disease-free individuals (*2*). Here, the *pr*_{s} of 9.7% does not reflect the correct proportion of the population with previous exposure to SARS-CoV-2: several people might have false-positive test results due to cross-reacting antibodies, technical issues, *etc.*, while others might have false-negative results (*13*). The seroprevalence (the apparent prevalence, *pr*_{s}) would be an unbiased estimate of the true prevalence only if the *Se* and *Sp* of the test used were both equal to 100%, as for a gold-standard test.

Based on the provided data, it is possible to calculate the variances of the seroprevalence and of the test *Se* and *Sp*, which are 1.2 × 10^{-6}, 8.9 × 10^{-4}, and 1.5 × 10^{-4}, respectively. Substituting the values in Eq. 4 and 5, the estimated true prevalence (*π*_{s}) is 10.5% (95% CI 8.2% to 12.9%), a corrected estimate of the proportion of the population with previous exposure to SARS-CoV-2. Had merely the binomial distribution been used for the calculation of the 95% confidence interval (ignoring the uncertainty in the estimated *Se* and *Sp*) instead of Eq. 5, we would have arrived at a 95% confidence interval of 10.3% to 10.8%, a much narrower interval.
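The calculation in this example can be reproduced approximately with a few lines of code. The sketch below assumes that the correction has the form (*pr* + *Sp* − 1)/(*Se* + *Sp* − 1) and that the variance follows the first-order (delta-method) approximation with independent estimates; the function name and its parameters are hypothetical, and the inputs are the rounded values quoted in the text.

```python
from math import sqrt


def corrected_prevalence(pr, se, sp, var_pr=0.0, var_se=0.0, var_sp=0.0, z=1.96):
    """Return (estimate, lower, upper) for the corrected prevalence.

    Assumes independent estimates and a first-order Taylor approximation
    of the variance; a variance left at zero ignores that source of uncertainty.
    """
    d = se + sp - 1.0
    if d <= 0:
        raise ValueError("Se + Sp must exceed 1 for the correction to be meaningful")
    pi_hat = (pr + sp - 1.0) / d
    var_pi = (var_pr + pi_hat ** 2 * var_se + (1 - pi_hat) ** 2 * var_sp) / d ** 2
    half_width = z * sqrt(var_pi)
    return pi_hat, pi_hat - half_width, pi_hat + half_width


pr = 6816 / 70389  # apparent prevalence from the Andorra study

# Full propagation of the uncertainty in pr, Se and Sp
full = corrected_prevalence(pr, 0.92, 1.00, var_pr=1.2e-6, var_se=8.9e-4, var_sp=1.5e-4)
# Sampling variation of pr only, treating Se and Sp as exactly known
naive = corrected_prevalence(pr, 0.92, 1.00, var_pr=1.2e-6)

print("full:  {:.3f} ({:.3f} to {:.3f})".format(*full))
print("naive: {:.3f} ({:.3f} to {:.3f})".format(*naive))
```

With the rounded inputs, the full interval comes out at roughly 8% to 13% and the naive one at roughly 10.3% to 10.8%, in line with the intervals quoted above; small differences in the last digit are to be expected from rounding of the inputs.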

## Conclusion

Depending on the *Se* and *Sp* of the diagnostic test used in a given prevalence study, the results obtained are generally biased estimates of the true prevalence of the condition of interest (*e.g.*, a disease). The derived apparent prevalence values should therefore be corrected. Given the apparent prevalence and the test *Se* and *Sp*, along with their variances, it is possible to calculate an unbiased estimate of the true prevalence and its confidence interval.