Survival analysis based on the Kaplan-Meier method is a cornerstone of phase III oncology clinical trials (1). The log-rank test is the statistical test of choice for comparing survival, or time to an event of interest, between two or more groups of patients. However, a single name (the log-rank test) is used for three related but different mathematical procedures. Two of them are widely implemented in different commercial statistical programs. The “behind the scenes” mathematics are not the same, and thus different results can be obtained. In the case of borderline statistical significance, this can mean the difference between evidence (a significant P value) and merely an observation. In other words, two persons analysing the same data set with two different statistical programs can unknowingly reach different conclusions. Since all three methods can be reported under the same name, there is room for data manipulation.
The log-rank test variants
A mathematical overview of all three methods is shown in Table 1 and the supplementary file. The first method, proposed by Mantel in 1966, is an extension of the Mantel-Haenszel procedure for comparing 2 × 2 tables (2). Most commercial statistical programs provide it under the name of the log-rank test (e.g. STATA, StataCorp v. 14 and MedCalc, MedCalc Software v 16). The second method, developed by Peto and Peto in 1972, uses an alternative computational approach that produces the same test statistic but a different variance (3). It is computationally simpler and therefore easier to calculate by hand or with a desk calculator. Although developed later, it was the one originally named the log-rank test by its authors, and the name was thereafter generalized to both procedures. This method is provided by e.g. Statistica StatSoft v. 13 under the name of the log-rank test. It should be noted that this program also provides the first method proposed by Mantel, but under a different name (the Cox-Mantel test). The third method is based on the simple χ² (chi-squared) principle of comparing the observed and expected numbers of events. This method is rarely used by commercial statistical programs but deserves mention because it is widely accepted as an explanation of the logic behind the test (4). Throughout our manuscript and the supplementary file we refer to these methods as the Cox-Mantel test, the Peto log-rank test and the simple χ² log-rank test, respectively. All three methods produce a χ² statistic with one degree of freedom that is used to obtain the corresponding P value. These tests should not be confused with weighted two-sample tests for survival data (the Gehan generalization of the Wilcoxon test, the Peto and Peto generalization of the Wilcoxon test, the Tarone-Ware test, the Fleming-Harrington test, etc.) (5).
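To make the distinction concrete, the following is a minimal Python sketch of the three variants for two groups. It is not a reproduction of Table 1 or of the supplementary spreadsheet: the formulas are the commonly cited textbook forms (hypergeometric variance for the Cox-Mantel test; a permutation variance of Nelson-Aalen based scores for the Peto log-rank test; the sum of (O − E)²/E over both groups for the simple χ² test), and the function name `logrank_variants` and the six-patient data set are hypothetical, chosen only for illustration. Status and group are coded as in our example (1 for death and 0 for censored; groups 1 and 2).

```python
from math import erfc, sqrt


def logrank_variants(times, status, group):
    """Return {variant: (chi2, p)}; status 1 = death, 0 = censored; group is 1 or 2."""
    n = len(times)
    n1 = sum(1 for g in group if g == 1)
    n2 = n - n1
    o1 = sum(s for s, g in zip(status, group) if g == 1)  # observed deaths, group 1
    o2 = sum(status) - o1                                  # observed deaths, group 2

    e1 = 0.0           # expected deaths in group 1
    var_mantel = 0.0   # hypergeometric variance summed over death times
    cum_hazard = 0.0   # running Nelson-Aalen estimate
    hazard_steps = []  # (death time, cumulative hazard at that time)
    for t in sorted({t for t, s in zip(times, status) if s == 1}):
        r = sum(1 for x in times if x >= t)                             # at risk, overall
        r1 = sum(1 for x, g in zip(times, group) if x >= t and g == 1)  # at risk, group 1
        d = sum(1 for x, s in zip(times, status) if x == t and s == 1)  # deaths at t
        e1 += d * r1 / r
        if r > 1:  # with a single patient at risk the variance term is zero
            var_mantel += d * (r1 / r) * (1 - r1 / r) * (r - d) / (r - 1)
        cum_hazard += d / r
        hazard_steps.append((t, cum_hazard))
    e2 = (o1 + o2) - e1
    u = o1 - e1  # numerator shared by all three variants

    def hazard_at(t):
        """Nelson-Aalen estimate at time t (0 before the first death)."""
        value = 0.0
        for step_time, h in hazard_steps:
            if step_time <= t:
                value = h
        return value

    # Peto scores: 1 - hazard for deaths, -hazard for censored observations.
    scores = [s - hazard_at(t) for t, s in zip(times, status)]
    var_peto = n1 * n2 / (n * (n - 1)) * sum(w * w for w in scores)

    chi2 = {
        "cox_mantel": u * u / var_mantel,
        "peto": u * u / var_peto,
        "simple_chi2": u * u / e1 + (o2 - e2) ** 2 / e2,
    }
    # Upper-tail probability of a chi-squared variable with one degree of freedom.
    return {name: (x, erfc(sqrt(x / 2))) for name, x in chi2.items()}


# Hypothetical toy data: three patients per group, follow-up in months.
times = [1, 3, 4, 2, 5, 6]
status = [1, 0, 1, 1, 1, 0]
group = [1, 1, 1, 2, 2, 2]
results = logrank_variants(times, status, group)
for name, (x, p) in results.items():
    print(f"{name}: chi2 = {x:.3f}, P = {p:.3f}")
```

On this toy data set all three variants share the same numerator (observed minus expected deaths in group 1) but divide it by different quantities, so the three χ² statistics and P values differ. The sketch assumes two groups, both of which contain patients at risk at the first death.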
It is hard to recommend one method over the others. The variance of the Peto log-rank test is calculated under the assumption of equal censoring, and the other log-rank tests might perform better if censoring does not occur at random with respect to group membership (e.g. if withdrawals due to side-effects occur mainly in one treatment group) (3). However, it is unclear how important unequal censoring is in practice. On the other hand, it has been suggested that the Cox-Mantel test tends to underestimate the true variance (and therefore to produce an unrealistically low P value in comparison with the Peto log-rank test) when the test statistic is large in absolute value (6). As we have observed in multiple data sets, these two tests alternate in providing the more significant P value in different clinical situations. It should be noted that if the assumption of proportional hazards is violated (e.g. the survival curves cross), none of the log-rank test methods should be used. Alternative statistical methods have been developed for such situations (7).
All three variants are considered to be the log-rank test and are named as such on different occasions. Medical researchers are mostly unaware of which method was used, and most of the published medical literature currently does not distinguish between the log-rank test variants. Some statistical programs do not clearly report their method of choice either, and sometimes it is almost impossible to know how a P value was obtained unless the data are recalculated in a known manner. Therefore, an interactive MS Excel spreadsheet that implements all three methods has been prepared as a supplementary file accompanying this article. Users are encouraged to experiment with the provided data set or to test their own, and to become better acquainted with the problem. The spreadsheet can analyse up to 200 entries, which can be copy-pasted into the corresponding columns, and it can serve as a standalone statistical program. It should be noted that it is unethical to “fish” for a significant P value and to report only the most significant result. Such “P value hacking” is strongly discouraged by the authors.
Application of three methods to example data set
Primary myelofibrosis (PMF) is a Philadelphia chromosome negative chronic myeloproliferative neoplasm (Ph- MPN) originating from a transformed hematopoietic stem cell (8). Secondary myelofibrosis (SMF) can develop from Ph- MPNs that are biologically related to PMF, and it clinically resembles PMF. A typical feature of these diseases is scarring of the bone marrow (i.e. myelofibrosis), which can be graded according to the current European consensus (9).
In our example, we evaluated the impact of highly advanced (grade 3) bone marrow fibrosis present at the time of diagnosis on overall survival in a cohort of 67 patients with PMF and SMF. Data were acquired retrospectively and represent a single-centre experience. Looking at the Kaplan-Meier curves (Figure 1), one might have the feeling that there is a real effect. But is there statistical evidence to support it? In other words, can inferences about the population be made from these results, which are based on a sample? We performed the necessary calculations for all three log-rank test variants. Calculations for the first ten observations are shown in Table 2. The step-by-step procedure for each approach is shown in the supplementary file. After obtaining the corresponding P values, we encounter a controversial situation. When P values are reported to three decimal places, the Cox-Mantel test suggests that the result is significant (P = 0.047), the Peto log-rank test suggests that the result is not significant (P = 0.052), and the simple χ² test suggests that the result is of borderline statistical significance (P = 0.050). None of these methods is currently considered the gold standard, and all three P values can be reported as results of the log-rank test. In our interpretation, the result is truly of borderline statistical significance, as suggested by the inconsistency of the obtained P values. A significant association of a higher grade of bone marrow fibrosis with inferior overall survival was previously reported in multiple cohorts of myelofibrosis patients (10-12), although this finding was not universal (13). Therefore, we conclude that our data are in line with most previously published results and support the adverse prognostic significance of highly advanced bone marrow fibrosis in these patients.
However, we cannot consider our result alone to represent a high level of evidence, owing to its borderline statistical significance and the retrospective study design.
The first ten observations are shown. For specific calculations please see Table 1 and the supplementary file. N: observation number. Time: duration of follow-up, recorded in months in our example. Status: censoring variable, 1 for death and 0 for alive or lost to follow-up. Group: 1 for grade < 3 myelofibrosis and 2 for grade 3 myelofibrosis patients. I: interval number. O: observed number of deaths per interval. R: overall number at risk per interval. RGr1 and RGr2: number at risk in a specific group per interval. EGr1 and EGr2: expected number of deaths in a specific group per interval. * Variance of the log-rank test calculated by the Mantel method. † Λ: the Nelson-Aalen estimator. ‡ W: score for the Peto log-rank test.
Are P values that important?
A recent statement by the American Statistical Association argued that no single index should substitute for scientific reasoning, and that proper inference requires full reporting and transparency, e.g. regarding patient selection, contextual factors, the number of hypotheses explored, measures of effect size, etc. (14). Although there are many valid arguments against the blind use of specific threshold P values to determine statistical significance (and we agree they should not be used in that way), P values remain an important landmark in scientific decision-making. The medical literature is replete with borderline significant results regarding the survival benefit of a new drug or a new procedure. Our example adds a new dimension to the issue of their appropriate interpretation. The statement “the log-rank test was used” is not as unequivocal as it seems at first, and this should be of particular concern in a drug regulatory context. The situation becomes especially questionable if different statistical programs are used for the survival analyses and for the rest of the data analysis.
As we previously stated, it would be unethical to “fish” for a significant P value and to report only the most significant result. We would like to point out that this danger will exist in borderline significant situations until scientific and professional authorities establish a consensus on the log-rank test method of choice. Until then, there is probably no need to insist on the least significant result in the analysis of retrospective data sets, and researchers should adhere to their standard practice or statistical program of choice. This is because retrospective studies are biased by numerous factors; their results do not provide a high strength of evidence and usually do not directly affect clinical practice. However, in a drug regulatory context, one must insist on clear evidence of improved survival, because randomized clinical trials are taken as a very high level of evidence with implications for both clinical practice and costs. This is perhaps the situation in which all three log-rank test variants should be applied in order to evaluate drug efficacy properly. In our opinion, a firm result should be significant irrespective of the method used. If the result of a randomized clinical trial is of borderline statistical significance (not consistent among the three variants of the log-rank test), then it should not be taken as clear evidence of the benefit of a drug or procedure. Regulatory and clinical reasoning should be based on the least significant result as the relevant one. A definitive conclusion would require replicating the results in new independent samples.
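The reasoning above can be sketched as a simple decision rule. The following Python fragment is only an illustration of our proposal, not an established regulatory procedure: the function name `least_significant` and the conventional threshold of 0.05 are assumptions, and the input values are the three P values reported for our example data set.

```python
def least_significant(p_values, alpha=0.05):
    """Return (variant, P value, consistent) for the least significant variant.

    `consistent` is True only if all variants fall on the same side of `alpha`.
    """
    variant = max(p_values, key=p_values.get)          # largest P value
    n_significant = sum(p < alpha for p in p_values.values())
    consistent = n_significant in (0, len(p_values))   # all or none significant
    return variant, p_values[variant], consistent


# P values from our example, reported to three decimal places.
reported = {"Cox-Mantel": 0.047, "Peto log-rank": 0.052, "simple chi-squared": 0.050}
variant, p, consistent = least_significant(reported)
print(variant, p, consistent)
```

Applied to our example, the rule selects the Peto log-rank test (P = 0.052) as the relevant result and flags the three variants as inconsistent, i.e. the result is of borderline statistical significance.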