**Introduction**

It is no longer considered acceptable to report a result merely as "significant" or "not significant": journal editors today demand that researchers quote actual *P* values and let readers make their own interpretation (for example, see the Instructions to Authors on the *Biochemia Medica* web site). A result should not be judged simply by whether the *P* value is less than 0.05 or any other arbitrary cut-off value. So it is equally important for both researchers and readers of scientific or expert journals to understand statistical hypothesis testing procedures and how to use them when presenting or evaluating research results in a published article. There is one somewhat disregarded issue concerning statistical tests: many published papers today quote a rather large number of *P* values, which may be difficult to interpret (1). The purpose of this paper is to give a brief overview of the basic steps in the general procedure for statistical hypothesis testing, and to point out some common pitfalls and misconceptions.

**Statistical hypothesis testing**

The general procedure for a statistical hypothesis test can be summarized in four basic steps: first, state the null hypothesis (usually, that there is no effect or no difference) and the alternative hypothesis; second, choose the significance level; third, calculate the appropriate test statistic and the corresponding *P* value; and fourth, compare the *P* value with the chosen significance level and decide whether or not to reject the null hypothesis. Two kinds of wrong decisions are possible in the final step.
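As a concrete illustration of these four steps (not part of the original article), the sketch below runs a two-sample z-test on made-up cholesterol values; the normal approximation and all data values are assumptions chosen for the example.

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function (math.erf is in the stdlib).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sample_z_test(a, b, alpha=0.05):
    # Step 1: H0 -> the two population means are equal; H1 -> they differ.
    # Step 2: the significance level alpha was chosen in advance.
    # Step 3: compute the test statistic and its two-sided P value
    # (normal approximation, adequate for reasonably large samples).
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    z = (mean_a - mean_b) / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    # Step 4: compare the P value with alpha and decide.
    return p_value, p_value < alpha

# Made-up serum cholesterol values (mmol/L) for two groups:
group_a = [4.8, 5.0, 5.2, 5.1, 4.9, 5.0, 5.3, 4.7, 5.1, 5.0]
group_b = [6.8, 7.0, 7.2, 7.1, 6.9, 7.0, 7.3, 6.7, 7.1, 7.0]
p, reject = two_sample_z_test(group_a, group_b)  # here p is far below 0.05
```

With a real dataset one would normally use a t-test rather than the z approximation; the z version is used here only to keep the four steps visible in a few lines.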

**Type I error** occurs when we "see" an effect when actually there is none. The probability of making a Type I error is usually called alpha (α), and that value is determined in advance for any hypothesis test. Alpha is what we call the "significance level", and its value is most commonly set at 0.05 or 0.01. When the *P* value obtained in the third step of the general hypothesis test procedure is below the value of α, the result is called "statistically significant at the α level".

**Type II error** occurs when we fail to see a difference when it is actually present. The probability of making a Type II error is called beta (β), and its value depends greatly upon the size of the effect we are interested in, the sample size and the chosen significance level. Beta is associated with the power of the test to detect an effect of a specified size. More about power analysis in research can be found in one of the previous articles in the *Lessons in Biostatistics* series (3).
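To make the dependence of β on sample size tangible, here is an illustrative simulation (not from the article). It estimates β for a two-sample z-test with a known population SD; the effect size, SD, sample sizes and number of trials are arbitrary choices for the sketch.

```python
import math
import random
from statistics import NormalDist

def simulate_beta(effect=0.5, sd=1.0, n=30, alpha=0.05, trials=4000, seed=7):
    """Estimate beta (the Type II error rate) for a two-sample z-test by simulation."""
    rng = random.Random(seed)
    se = sd * math.sqrt(2.0 / n)                    # SE of the difference in means
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2)  # two-sided critical z value
    misses = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, sd) for _ in range(n)]     # population mean 0
        b = [rng.gauss(effect, sd) for _ in range(n)]  # a real effect is present
        z = (sum(b) / n - sum(a) / n) / se
        if abs(z) < z_crit:          # the test failed to detect the real effect
            misses += 1
    return misses / trials           # estimated beta; power = 1 - beta

beta_30 = simulate_beta(n=30)    # roughly 0.5: badly underpowered
beta_100 = simulate_beta(n=100)  # much smaller: larger samples shrink beta
```

The two calls at the end show the point of the paragraph: with everything else fixed, increasing the sample size is what drives β down and power up.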

**What is the *P* value?**

The *P* value is often misinterpreted as the probability that the null hypothesis is true. The null hypothesis is not random and has no probability: it is either true or not. The actual meaning of the *P* value is the probability of having observed our data (or more extreme data) when the null hypothesis **is** true. For example, when we observe a difference in the means of serum cholesterol levels measured in two samples, we want to know how likely it is to get such a difference, or a more extreme one, when there is no actual difference between the underlying populations. This is what the *P* value tells us, and if we find that the *P* value is low, say 0.003, we consider the observed difference quite unlikely under the terms of the null hypothesis.
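This interpretation can be made concrete with a small permutation test (a sketch, not from the article): shuffling the group labels generates the differences expected when the null hypothesis is true, and the *P* value is simply the fraction of shuffles giving a difference at least as extreme as the observed one. The cholesterol values below are invented for the example.

```python
import random

def permutation_p_value(a, b, n_perm=10_000, seed=42):
    # Observed absolute difference in means between the two groups.
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    rng = random.Random(seed)
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                 # relabel the values: the null is true here
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            count += 1
    # Fraction of "null worlds" with a difference at least as extreme as observed.
    return count / n_perm

# Made-up serum cholesterol values (mmol/L):
sample_1 = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
sample_2 = [5.9, 6.1, 6.0, 6.2, 5.8, 6.0, 6.1, 5.9]
p = permutation_p_value(sample_1, sample_2)  # very small: such a difference
                                             # almost never arises by chance
```

Note that nothing in the computation refers to the probability of the null hypothesis itself; the randomness lives entirely in the data, which is exactly the point of the paragraph above.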

**Which significance level should we choose?**

Having obtained a *P* value, we need some guidance about reaching a decision from the observed *P* value. The choice of the significance level should reflect the consequences of the wrong conclusions we might draw. Suppose, for example, that we are comparing a new treatment B with a standard treatment A. Two wrong conclusions are possible:

**Wrong conclusion #1**: Treatment B is better, when actually it is the same as treatment A.

**Consequence #1**: We adopt the new treatment exposing patients to the adverse effects of Treatment B.

**Wrong conclusion #2**: Both treatments are the same, when actually treatment B is better than treatment A.

**Consequence #2**: We do not adopt the new treatment in practice, but continue to search for a better solution.

**Can we “prove” the null hypothesis?**

**No**. As the title of a well-known editorial in the *British Medical Journal* puts it: "Absence of evidence is not evidence of absence" (4). In terms of the null hypothesis, we should say that we "have not rejected" or "have failed to reject" the null hypothesis (5). A statistical hypothesis test does not "prove" anything.

**Multiple hypothesis testing**

When we perform a single test at significance level α, (1 − α) is the probability of **not** rejecting the null hypothesis when the null hypothesis is actually true. If we perform two independent tests on the same dataset, the probability of making at least one Type I error is:

α₂ = 1 − [(1 − α) × (1 − α)] = 1 − (1 − α)².

For α = 0.05 this gives α₂ = 1 − 0.95² ≈ 1 − 0.90 = 0.10, and for three independent tests α₃ = 1 − 0.95³ ≈ 1 − 0.86 = 0.14. In general, for k independent tests, the probability of obtaining at least one significant result purely by chance when **all** null hypotheses are actually true is:

αₖ = 1 − (1 − α)ᵏ.

So, for example, with 20 independent tests at the 0.05 level, the probability of obtaining at least one significant result purely by chance when **all** null hypotheses are actually true equals 1 − 0.95²⁰ ≈ 0.64. From Figure 1, we can see that it takes about 60 tests to reach a probability of 0.95 of getting a significant result about some effect purely by chance, when no effect actually exists.
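The figures quoted above are easy to verify; a minimal sketch:

```python
def family_error(k, alpha=0.05):
    # Probability of at least one Type I error in k independent tests
    # when all k null hypotheses are actually true.
    return 1.0 - (1.0 - alpha) ** k

assert round(family_error(2), 4) == 0.0975   # ~0.10, as in the text
assert round(family_error(3), 2) == 0.14
assert round(family_error(20), 2) == 0.64
# Smallest number of tests at which the probability reaches 0.95:
k_95 = next(k for k in range(1, 200) if family_error(k) >= 0.95)  # 59, i.e. about 60
```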

*Figure 1. Probability of making at least one Type I error as a function of the number of independent hypothesis tests performed on the same dataset, when all null hypotheses are true and the significance level α is set to 0.05.*

Consider, for example, a published study in which *P* values were presented for each of the six subgroups, plus two more for all causes in both age groups. Only one of those eight reported *P* values was less than 0.05 (0.02), while the others ranged from 0.10 to 0.99. It was also pointed out that no association between back pain and any vascular disease was found in women, which leads to the notion that the author performed the same number of tests in the women subgroup. That would make a total of at least 16 tests, among which only one was found to be "significant": just about as many as we would expect to occur purely by chance.

To deal with this problem, methods have been developed that adjust either the significance level or the *P* values obtained from a series of independent tests, in order to preserve the overall significance level. If we adjust the minimum accepted significance level, we compare the "original" *P* values with the adjusted significance level. If we adjust the *P* values, then we compare the adjusted *P* values with the originally stated significance level.

The simplest way to adjust "original" *P* values (sometimes called "nominal" *P* values) for multiple testing is to use the Bonferroni method (1). By this method, the adjustment is made by multiplying the nominal *P* values by the number of tests performed. So, if we made three independent tests which resulted in *P* values of 0.020, 0.030 and 0.040, the Bonferroni-adjusted *P* values would be 0.060, 0.090 and 0.120, respectively. While the "original" results of all three tests would be considered significant at the 0.05 level, after adjustment none of them remains significant.
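The Bonferroni adjustment amounts to a one-line computation; a minimal sketch (the cap at 1.0 is the usual convention, since an adjusted value above 1 would not be a probability):

```python
def bonferroni(p_values):
    # Multiply each nominal P value by the number of tests performed,
    # capping at 1.0 because a probability cannot exceed 1.
    k = len(p_values)
    return [min(p * k, 1.0) for p in p_values]

# The example from the text: three nominal P values from three independent tests.
adjusted = bonferroni([0.020, 0.030, 0.040])
print([round(p, 3) for p in adjusted])  # [0.06, 0.09, 0.12] -- none below 0.05
```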

Another adjustment, built on the same reasoning as the formula for αₖ above, transforms each nominal *P* value from a series of k independent tests as:

*P*ₖ = 1 − (1 − *P*)ᵏ.
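Applied to the same three nominal *P* values as in the Bonferroni example, this formula (a sketch; it is often associated with the Šidák method) gives slightly smaller adjusted values, but the conclusion is the same:

```python
def adjust_p(p_values):
    # Adjusted P = 1 - (1 - P)^k, with k the number of tests performed.
    k = len(p_values)
    return [1.0 - (1.0 - p) ** k for p in p_values]

# Same three nominal P values as in the Bonferroni example in the text:
adjusted = adjust_p([0.020, 0.030, 0.040])
# Each adjusted value is slightly below its Bonferroni counterpart,
# and again none remains significant at the 0.05 level.
```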