Survival analysis, more than meets the eye

The log-rank test is a cornerstone of phase III oncology clinical trials. However, at least three different mathematical procedures can be named the log-rank test, and two of them are widely used by commercial statistical programs. Consequently, different P values can be obtained from the same data. In the case of borderline statistical significance, this can mean the difference between evidence (a significant P value) and merely an observation. Since all three methods can be reported under the same name, room for data manipulation arises. This should be of particular concern in a drug regulatory context. Randomized clinical trials with borderline significant results should perhaps be required to report P values calculated by all three methods, in order to properly evaluate drug efficacy. An interactive MS Excel spreadsheet that uses all three log-rank test variants is provided as a supplementary file accompanying this article. The association of a high grade of bone marrow fibrosis with poor outcome in patients with myelofibrosis is used as an example.


Introduction
Survival analysis based on the method of Kaplan and Meier is a cornerstone of phase III oncology clinical trials (1). The log-rank test is the statistical test of choice for comparing survival or time to an event of interest between two or more groups of patients. However, one name (the log-rank test) can be used for three related but different mathematical procedures. Two of them are widely employed by different commercial statistical programs. The "behind the scenes" mathematics are not the same, and thus different results can be obtained. In the case of borderline statistical significance, this can mean the difference between evidence (a significant P value) and merely an observation. In other words, two persons analysing the same data set with two different statistical programs can "unknowingly" reach different conclusions. Since all three methods can be reported under the same name, room for data manipulation arises.

The log-rank test variants
A mathematical overview of all three methods is shown in Table 1 and the supplementary file. The first method, proposed by Mantel in 1966, is an extension of the Mantel-Haenszel procedure for comparing 2 x 2 tables (2). Most commercial statistical programs provide it under the name of the log-rank test (e.g. Stata, StataCorp v. 14 and MedCalc, MedCalc Software v. 16). The second method, developed by Peto and Peto in 1972, uses an alternative computational approach that produces the same test statistic but a different variance (3). It is computationally simpler and therefore easier to calculate by hand or with a pocket calculator. Although developed later, it was the one originally named the log-rank test by its authors, and the name was thereafter generalized to both procedures. This method is provided by e.g. Statistica, StatSoft v. 13 under the name of the log-rank test. It should be noted that this program also provides the first method proposed by Mantel

Table 1. Key steps in data analysis
Methods compared: the Cox-Mantel test, the simple χ² log-rank test and the Peto log-rank test.
Step 1 (Data sorting)*: Data are sorted in time-ascending order. At the time of each death, a new interval is created.
Step 2 (Preliminary calculations)†
Step 3 (Calculation of χ² value)
Step 4 (P value): P values that correspond to the calculated χ² values are found using the χ² distribution table with one degree of freedom.
* Step 1 (Data sorting) is the same for all three methods. Intervals are necessary to obtain correct calculations when more than one death occurs at the same time (tied observations). All concurrent deaths are considered to happen in the same interval. Central calculations for all three methods are interval specific (i.e. occur at death times).
† Step 2 (Preliminary calculations) is necessary for later calculation of the χ² value. The Cox-Mantel test and the simple χ² test share the calculation of observed and expected numbers of deaths.
‡ O, E, R and I represent the observed number of deaths, expected number of deaths, number of patients at risk and number of intervals, respectively. O_j, E_j and R_j (j = 1, ..., I) represent the aforementioned parameters at the time of the j-th ordered interval.
§ T and V represent the test statistic and variance for a particular test variant, respectively. The Cox-Mantel test and the Peto log-rank test produce the same test statistic. Calculations are done in one of the groups only. Variance for both methods is calculated on the whole data set. N, Λ_I and W represent the number of observations, the Nelson-Aalen estimator for a particular interval and a specific Peto log-rank test score, respectively. N_i, Λ_Ii and W_i (i = 1, ..., N) represent the aforementioned parameters at the time of the i-th ordered observation.
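Steps 1 and 2 of the table can be illustrated with a short Python sketch. It is only a minimal illustration, not the spreadsheet's implementation: the function name interval_table, the dictionary keys and the eight-patient toy data set are hypothetical, and the expected number of deaths is taken as E_j = O_j x R_1j / R_j, the usual definition under the null hypothesis.

```python
def interval_table(times, events, groups):
    """Build one row per distinct death time (one 'interval' per row).

    times  : follow-up time of each patient
    events : 1 = death observed, 0 = censored
    groups : 0 or 1 (group label of each patient)
    """
    # Step 1: sort the data in time-ascending order.
    data = sorted(zip(times, events, groups))
    # A new interval starts at every distinct death time; tied deaths share one.
    death_times = sorted({t for t, e, _ in data if e == 1})

    rows = []
    for t in death_times:
        at_risk = [(tt, e, g) for tt, e, g in data if tt >= t]
        r = len(at_risk)                                # R_j, both groups
        r1 = sum(1 for _, _, g in at_risk if g == 1)    # at risk in group 1
        d = sum(1 for tt, e, _ in at_risk if tt == t and e == 1)   # O_j
        d1 = sum(1 for tt, e, g in at_risk
                 if tt == t and e == 1 and g == 1)      # deaths in group 1
        # Step 2: expected deaths in group 1 if both groups share one hazard.
        rows.append({"time": t, "R": r, "R1": r1,
                     "O": d, "O1": d1, "E1": d * r1 / r})
    return rows


# Hypothetical toy data (not the myelofibrosis cohort).
toy_times = [2, 3, 3, 5, 6, 8, 9, 12]
toy_events = [1, 1, 1, 0, 1, 1, 0, 1]
toy_groups = [1, 1, 0, 1, 0, 1, 0, 0]

for row in interval_table(toy_times, toy_events, toy_groups):
    print(row)
```

The two deaths at time 3 fall into a single interval, which is how tied observations are handled in the footnote to Step 1.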
The third method is based on the simple χ² (chi-squared) principle of comparing observed and expected numbers of events. This method is rarely used by commercial statistical programs but deserves mention because it is widely accepted as an explanation of the logic behind the test (4). Throughout our manuscript and supplementary file we refer to the three methods as the Cox-Mantel test, the Peto log-rank test and the simple χ² log-rank test, respectively. All three methods produce a χ² statistic with one degree of freedom that is used to obtain the corresponding P value. (7).
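To make the difference between the three variants concrete, the following Python sketch computes all three χ² statistics and their P values for a two-group data set. It is a sketch under stated assumptions rather than the formulas of Table 1 or of any statistical package: the Cox-Mantel variance is taken as the hypergeometric variance summed over death times, the Peto scores as W_i = (death indicator) minus the Nelson-Aalen cumulative hazard with variance n0 x n1 / (N(N - 1)) x ΣW_i², and the simple test as Σ(O - E)²/E over both groups. The function name logrank_variants and the toy data are hypothetical.

```python
import math


def logrank_variants(times, events, groups):
    """Return {variant: (chi-squared, P value)} for two groups coded 0 and 1.

    times: follow-up times; events: 1 = death, 0 = censored.
    """
    n = len(times)
    n1 = sum(groups)
    n0 = n - n1
    death_times = sorted({t for t, e in zip(times, events) if e == 1})

    o1 = e1 = v_mantel = 0.0
    o_total = 0
    hazard_steps = []  # (death time, O_j / R_j) for the Nelson-Aalen estimator
    for t in death_times:
        r = sum(1 for tt in times if tt >= t)
        r1 = sum(1 for tt, g in zip(times, groups) if tt >= t and g == 1)
        d = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        d1 = sum(1 for tt, e, g in zip(times, events, groups)
                 if tt == t and e == 1 and g == 1)
        o1 += d1
        e1 += d * r1 / r
        o_total += d
        if r > 1:  # hypergeometric variance (assumed Cox-Mantel variance)
            v_mantel += d * (r1 / r) * (1 - r1 / r) * (r - d) / (r - 1)
        hazard_steps.append((t, d / r))

    t_stat = o1 - e1  # shared by the Cox-Mantel and Peto variants

    # Peto scores: 1 - cumulative hazard for deaths, -cumulative hazard
    # for censored observations (assumed permutation variance).
    sum_w2 = 0.0
    for tt, e in zip(times, events):
        cum_hazard = sum(h for t, h in hazard_steps if t <= tt)
        sum_w2 += (e - cum_hazard) ** 2
    v_peto = n0 * n1 / (n * (n - 1)) * sum_w2

    o0, e0 = o_total - o1, o_total - e1
    chi2 = {
        "cox_mantel": t_stat ** 2 / v_mantel,
        "peto": t_stat ** 2 / v_peto,
        "simple": (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0,
    }
    # Upper tail of the chi-squared distribution with one degree of freedom.
    return {k: (x, math.erfc(math.sqrt(x / 2))) for k, x in chi2.items()}


# Hypothetical toy data (not the myelofibrosis cohort).
result = logrank_variants([2, 3, 3, 5, 6, 8, 9, 12],
                          [1, 1, 1, 0, 1, 1, 0, 1],
                          [1, 1, 0, 1, 0, 1, 0, 0])
for name, (chi2_value, p_value) in result.items():
    print(f"{name}: chi2 = {chi2_value:.4f}, P = {p_value:.4f}")
```

Even on this small toy data set the three χ² values differ, although the Cox-Mantel and Peto variants share the same numerator; only the denominators (variances) differ.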
All three log-rank test variants are considered to be the log-rank test and are named as such on different occasions. In fact, medical researchers are mostly unaware of the method used, and most of the published medical literature currently does not distinguish between the log-rank test variants. Some statistical programs do not clearly report their method of choice either, and sometimes it is almost impossible to know how the P value was obtained unless the data are recalculated in a known manner. Therefore, an interactive MS Excel spreadsheet that uses all three methods is provided as a supplementary file accompanying this article. Users are encouraged to experiment with the provided data set or test their own, and to become better acquainted with the problem. The spreadsheet can analyse up to 200 entries, which can be copy-pasted into the corresponding columns, and can serve as a standalone statistical program. It should be noted that it is unethical to "fish" for a significant P value and to report only the most significant result. Such "P value hacking" is strongly discouraged by the authors.

Application of three methods to example data set
Primary myelofibrosis (PMF) is a Philadelphia chromosome negative chronic myeloproliferative neoplasm (Ph-MPN) originating from a transformed hematopoietic stem cell (8). Secondary myelofibrosis (SMF) can develop from Ph-MPNs that are biologically related to PMF, and it clinically resembles PMF. A typical feature of these diseases is scarring of the bone marrow (i.e. myelofibrosis), which can be graded according to the current European consensus (9).
In our example, we evaluated the impact of highly advanced (grade 3) bone marrow fibrosis present at the time of diagnosis on overall survival in a cohort of 67 patients with PMF and SMF. Data were acquired retrospectively and represent a single centre experience. Observing the Kaplan-Meier curves (Figure 1), one might have the feeling that there is a real effect in place. But is there statistical evidence to support it? In other words, can inferences about the population be made from these results (based on a sample)? We performed the necessary calculations for all three log-rank test variants. Calculations for the first ten observations are shown in Table 2.
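For readers who want to see how curves such as those in Figure 1 are obtained, a minimal Python sketch of the Kaplan-Meier (product-limit) estimator follows. The function name kaplan_meier and the toy data are hypothetical and do not reproduce the study cohort; the estimator would be applied to each group separately to draw one curve per group.

```python
def kaplan_meier(times, events):
    """Return [(death time, estimated survival probability)] for one group.

    times: follow-up times; events: 1 = death, 0 = censored.
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for tt in times if tt >= t)
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        # Product-limit step: multiply by the fraction surviving this death time.
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data (not the myelofibrosis cohort).
for time, probability in kaplan_meier([2, 3, 3, 5, 6, 8, 9, 12],
                                      [1, 1, 1, 0, 1, 1, 0, 1]):
    print(time, round(probability, 4))
```

Censored patients lower the number at risk at later death times without producing a step in the curve, which is why visual separation of two curves does not by itself establish statistical evidence.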
A step-by-step procedure for each approach is shown in the supplementary file. After obtaining the corresponding P values, we encounter a controversial situation. None of the methods used is currently considered the gold standard, and all three P values can be reported as results of the log-rank test. According to our interpretation, our result seems to be truly of borderline statistical significance, as suggested by the inhomogeneity of the obtained P values. A significant association of a higher grade of bone marrow fibrosis with inferior overall survival was previously reported in multiple cohorts of myelofibrosis patients (10)(11)(12), although this finding was not universal (13). Therefore, we conclude that our data are in line with most previously published results and support the adverse prognostic significance of highly advanced bone marrow fibrosis in these patients. However, we cannot consider our result alone to represent a high level of evidence, owing to its borderline statistical significance and the retrospective study design.

Are P values that important?
A recent statement by the American Statistical Association argued that no single index should substitute for scientific reasoning, and that proper inference requires full reporting and transparency, e.g. of patient selection, contextual factors, the number of hypotheses explored, measures of effect size, etc. (14). Although there are many valid arguments against the blind use of specific threshold P values to determine statistical significance (and we agree they should not be used in that way), P values remain an important landmark in scientific decision-making. The medical literature is overloaded with borderline significant results regarding the survival benefit of a new drug or a new procedure. Our example adds a new dimension to the issue of their appropriate interpretation. The statement "the log-rank test was used" is not as unequivocal as it seems at first, and this should be of particular concern in a drug regulatory context. Things get especially suspicious if different statistical programs are used for survival analyses and for data analysis in general.

Potential conflict of interest
None declared.
As we previously stated, it would be unethical to "fish for a significant P value" and to report only the most significant result. We would like to point out that this danger will exist in borderline significant situations until scientific and professional authorities establish a consensus on the log-rank test method of choice. Until then, there is probably no need to insist on the least significant result in the analysis of retrospective data sets, and researchers should adhere to their standard practice and statistical program of choice. This is because retrospective studies are biased by numerous factors. Their results do not provide high strength of evidence and usually do not have direct effects on clinical practice. However, in a drug regulatory context, one must insist on clear evidence of improved survival, because randomized clinical trials are taken as a very high level of evidence that bears clinical-practice-related and financial implications. This would perhaps be the situation in