## Introduction

*Lessons in biostatistics* presented the calculation, usage and interpretation of the odds ratio statistic and clearly demonstrated the simplicity of the odds ratio in clinical practice (1). The example used then was from a fictional study in which the effects of two drug treatments for *Staphylococcus aureus* (SA) endocarditis were compared. The original data are reproduced in Table 1.

*Table 1. Results from fictional endocarditis treatment study by McHugh (1).*

*Table 2. Results from fictional endocarditis treatment study by McHugh looking at age (1).*

*Table 3. Effect of treatment on endocarditis stratified by age.*

$n_i$ is the sample size of age class $i$, and $a$, $b$, $c$ and $d$ are the table cells, as presented by McHugh (1).

## Definition
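
In its usual textbook form (assumed here to match the notation used in this section), the logistic regression model with $m$ explanatory variables $x_1, \dots, x_m$ can be written as:

$$
\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m
$$

The left-hand side is the log-odds (logit) of the event, so each coefficient acts on the log-odds scale and is turned into an odds or an odds ratio by taking its exponential.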

$\pi$ indicates the probability of an event (e.g., death in the previous example), and $\beta_i$ are the regression coefficients associated with the reference group and the $x_i$ explanatory variables. At this point, an important concept must be highlighted. The reference group, represented by $\beta_0$, is constituted by those individuals presenting the reference level of each and every variable $x_{1 \dots m}$. To illustrate, considering our previous example, these are the older individuals who received standard treatment. Later, we will discuss how to set the reference level.

## Logistic regression step-by-step

*Table 4. Results from multivariate logistic regression model containing all explanatory variables (full model).*

Taking the exponential of $\beta_0$, we have the mean odds of death for individuals in the reference category. So, $\exp(\beta_0) = \exp(-2.121) = 0.12$ is the odds of death among those individuals who are older and received the new treatment. A small difference in interpretation appears when we move to the next coefficients. Individuals who also received the new treatment but are younger have mean odds of death $\exp(\beta_1) = \exp(0.454) = 1.58$ times the odds of reference individuals. Similarly, older individuals who received the standard treatment have mean odds of death $\exp(\beta_2) = \exp(1.333) = 3.79$ times those of reference individuals. But what if individuals are younger and received the standard treatment? Then we have to calculate $\exp(\beta_1 + \beta_2) = \exp(1.787) = 5.97$ times the mean odds of reference individuals.
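
The arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three coefficients are the rounded values quoted in the text, and the function names (`log_odds`, `odds_ratio_vs_reference`) are hypothetical helpers, not part of any statistical package. Because the coefficients are rounded to three decimals, a recomputed value may differ in its last digit from the one printed in the text.

```python
import math

# Coefficients as quoted in the text (log-odds scale, rounded to 3 decimals).
BETA_0 = -2.121  # intercept: reference group (older, new treatment)
BETA_1 = 0.454   # younger individuals (vs. older)
BETA_2 = 1.333   # standard treatment (vs. new treatment)


def log_odds(younger: bool, standard_treatment: bool) -> float:
    """Linear predictor: beta_0 + beta_1 * x_1 + beta_2 * x_2."""
    return BETA_0 + BETA_1 * int(younger) + BETA_2 * int(standard_treatment)


def odds_ratio_vs_reference(younger: bool, standard_treatment: bool) -> float:
    """Odds of the given group divided by the odds of the reference group."""
    return math.exp(log_odds(younger, standard_treatment) - BETA_0)


# Odds of death in the reference group: exp(beta_0).
reference_odds = math.exp(BETA_0)

if __name__ == "__main__":
    print(f"reference odds: {reference_odds:.2f}")
    for younger in (False, True):
        for standard in (False, True):
            ratio = odds_ratio_vs_reference(younger, standard)
            print(f"younger={younger}, standard={standard}: OR = {ratio:.2f}")
```

Working on the log-odds scale and exponentiating only at the end mirrors the text: coefficients are added first, so the combined effect of being younger and receiving the standard treatment is the exponential of the sum, not the sum of the exponentials.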

## Logistic regression pitfalls

### Odds and probabilities

### Continuous explanatory variables or variables with more than two levels

For a variable with three levels, two dummy variables ($x_1$ and $x_2$) are included in the model. The individuals at the reference level, let’s say “Low”, will present zeros in both dummy variables (Equation 4a), while individuals with “Medium” satisfaction will have a one in $x_1$ and a zero in $x_2$ (Equation 4b). The opposite will occur for individuals with “High” satisfaction (Equation 4c). Usually, statistical software does this automatically and the reader does not have to worry about it.
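
The dummy coding described above can be sketched as follows. This is a minimal illustration, assuming a three-level satisfaction variable with “Low” as the reference level; the function `dummy_code` and the names `x1` and `x2` are hypothetical and only mimic what statistical software does internally.

```python
def dummy_code(level: str, levels: list[str], reference: str) -> dict[str, int]:
    """Return the dummy variables for one observation.

    Every non-reference level gets its own 0/1 indicator (x1, x2, ...),
    numbered in the order the levels are listed. The reference level is
    the case where all indicators are zero.
    """
    if level not in levels or reference not in levels:
        raise ValueError("level and reference must both be in levels")
    non_reference = [lv for lv in levels if lv != reference]
    return {f"x{i}": int(level == lv) for i, lv in enumerate(non_reference, start=1)}


LEVELS = ["Low", "Medium", "High"]

if __name__ == "__main__":
    for level in LEVELS:
        print(level, dummy_code(level, LEVELS, reference="Low"))
```

Changing the `reference` argument changes which group has zeros in every dummy variable, and therefore which group all other coefficients are compared against.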

*Table 5. Results from multivariate logistic regression model containing all explanatory variables (full model), using AGE as a continuous variable.*

### Variables inclusion and selection

*per* variable, we can try to include all explanatory variables in the full model. However, if the sample size is limited relative to the number of candidate variables, a pre-selection should be performed instead. One way to do that is to test each variable beforehand, using models with just one explanatory variable at a time (univariate models), and afterwards include in the multivariate model all variables that met a relaxed P-value criterion (for instance, P ≤ 0.25). There is no reason to worry about a rigorous P-value criterion at this stage, because this is just a pre-selection strategy and no inference will derive from this step. This relaxed P-value criterion reduces the initial number of variables in the model while limiting the risk of missing important variables (4,5).
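
The pre-selection rule can be sketched as a simple filter over univariate P-values. This is only an illustration: the variable names and P-values below are made up for the example and do not come from the study, and `preselect` is a hypothetical helper. In practice, each P-value would come from fitting a logistic model with that single explanatory variable.

```python
def preselect(univariate_p_values: dict[str, float], threshold: float = 0.25) -> list[str]:
    """Keep the variables whose univariate P-value is at or below the threshold.

    The relaxed threshold (0.25 by default) is a screening criterion only;
    no inference is drawn from this step.
    """
    return [name for name, p in univariate_p_values.items() if p <= threshold]


# Hypothetical univariate results, invented for illustration only.
example_p_values = {
    "age": 0.03,
    "treatment": 0.001,
    "sex": 0.25,
    "smoking": 0.62,
}

selected = preselect(example_p_values)

if __name__ == "__main__":
    print("candidates for the multivariate model:", selected)
```

Variables that pass this screen are then entered together in the multivariate model, where the actual inference takes place.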

### Reference group setup

*Table 6. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). North/Northeast region used as reference level.*

*Table 7. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). Middle-West region used as reference level.*

*Table 8. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). Southeast region used as reference level.*