1 Introduction

A statistical method to model:
- outcome: time to event (e.g. death, recurrence etc).
- predictors/independent variables: numerical, categorical variables.
In contrast to KM approach, it is concerned with hazard of event.
Basically, the (interval) hazard at time \(t\) is as follows, \[Hazard,\ h_t = {Deaths,\ e_t \over Survivors,\ n_t \times Interval,\ u_t}\] where interval \(u_t\) is the time interval from present time \(t\) until the next event time \(t+1\).

Nelson-Aalen’s cumulative hazard estimate is given sum of \(e_i/n_1\) until time \(t\), \[Nelson-Aalen\ cumulative\ hazard,\ NA = \sum_{i\leq t} {e_i \over n_i}\]

In R, cumulative hazard function, \(H(t)\) is calculated from the estimated cumulative survival function, \(S(t)\) as follows, \[H(t)=-log_eS(t)\]

The formula for Cox proportional hazards (PH) model, \[\begin{aligned} log_e\bigg({hazard\ at\ time,\ t \over {baseline\ hazard\ at\ time,\ t}}\bigg) = \\ log_e(hazard\ ratio,\ HR) = &\ coefficients \times numerical\ predictors \\ & + coefficients \times categorical\ predictors \end{aligned}\] or in notational form, \[\begin{aligned} log_e\bigg({h(t) \over h_0(t)}\bigg) = \\ log_e{HR} = &\ \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k \end{aligned}\] where we have k predictors. Notice there is something missing in the equations above, which is the intercept (\(\beta_0\)). It is because the intercept = 0 at time = 0, i.e. nobody experiences the event at the start of the followup period, everyone is still alive!

Whenever the predictor is a categorical variable with more than two levels, remember to consider dummy (binary) variable(s).

Hazard ratio (HR) is the ratio of hazards of two levels. HR for a predictor is easily calculated from a Cox PH model,

\[HR_i = e^{\beta_i}\]

2 Analysis

Load the required packages. Make sure you have all these in your computer.

# library
library(survival)
library(survminer)
library(epiDisplay)
library(coin)
library(TH.data)
library(car)

2.1 A detour: obtaining the cumulative survival and hazard function

Before we start with Cox PH model, we want to obtain the cumulative survival and hazard until time \(t\) using coxph() function on one group. Following our previous one-group survival analysis sur_aml on aml dataset,

acute = aml
cox_aml = coxph(Surv(time, status) ~ 1, data = acute)
sur_cox_aml = survfit(cox_aml)
summary(sur_cox_aml)

#> Call: survfit(formula = cox_aml)
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     5     23       2    0.915  0.0575       0.8088        1.000
#>     8     21       2    0.830  0.0775       0.6910        0.997
#>     9     19       1    0.787  0.0844       0.6381        0.971
#>    12     18       1    0.745  0.0899       0.5878        0.944
#>    13     17       1    0.702  0.0943       0.5397        0.914
#>    18     14       1    0.654  0.0995       0.4852        0.881
#>    23     13       2    0.557  0.1057       0.3840        0.808
#>    27     11       1    0.509  0.1070       0.3367        0.768
#>    30      9       1    0.455  0.1083       0.2855        0.725
#>    31      8       1    0.402  0.1079       0.2372        0.680
#>    33      7       1    0.348  0.1060       0.1917        0.632
#>    34      6       1    0.295  0.1023       0.1493        0.582
#>    43      5       1    0.241  0.0966       0.1101        0.529
#>    45      4       1    0.188  0.0887       0.0745        0.474
#>    48      2       1    0.114  0.0784       0.0296        0.439

basehaz(cox_aml)  # also try `-log(sur_cox_aml$surv)` to obtain H_t

#>        hazard time
#> 1  0.08893281    5
#> 2  0.18655185    8
#> 3  0.23918343    9
#> 4  0.29473899   12
#> 5  0.35356252   13
#> 6  0.35356252   16
#> 7  0.42499109   18
#> 8  0.58524750   23
#> 9  0.67615659   27
#> 10 0.67615659   28
#> 11 0.78726770   30
#> 12 0.91226770   31
#> 13 1.05512484   33
#> 14 1.22179151   34
#> 15 1.42179151   43
#> 16 1.67179151   45
#> 17 2.17179151   48
#> 18 2.17179151  161

2.2 Data

Now, back to our main business, we are going to use a built-in dataset in survival package, namely lung. We assign it to lca data object.

# data
?lung  # about the dataset
lca = na.omit(lung)  # omit subjects with missing data
str(lca)

#> 'data.frame':    167 obs. of  10 variables:
#>  $ inst     : num  3 5 12 7 11 1 7 6 12 22 ...
#>  $ time     : num  455 210 1022 310 361 ...
#>  $ status   : num  2 2 1 2 2 2 2 2 2 2 ...
#>  $ age      : num  68 57 74 68 71 53 61 57 57 70 ...
#>  $ sex      : num  1 1 1 2 2 1 1 1 1 1 ...
#>  $ ph.ecog  : num  0 1 1 2 2 1 2 1 1 1 ...
#>  $ ph.karno : num  90 90 50 70 60 70 70 80 80 90 ...
#>  $ pat.karno: num  90 60 80 60 80 80 70 80 70 100 ...
#>  $ meal.cal : num  1225 1150 513 384 538 ...
#>  $ wt.loss  : num  15 11 0 10 1 16 34 27 60 -5 ...
#>  - attr(*, "na.action")= 'omit' Named int [1:61] 1 3 5 12 13 14 16 20 23 25 ...
#>   ..- attr(*, "names")= chr [1:61] "1" "3" "5" "12" ...

The dataset needs some modifications and preparations for the purpose of our analysis, specifically status, sex and ph.ecog. Again, use ?lung to look at the description of the dataset.

Give proper coding for status variable as 0/1 for censored/dead

table(lca$status)  # status: 1=censored, 2=dead

#> 
#>   1   2 
#>  47 120

lca$status = lca$status - 1  # turn to our usual 0=censored, 1=dead

Factor sex variable properly as male/female,

lca$sex = factor(lca$sex, levels = 1:2, labels = c("male", "female"))
str(lca$sex)

#>  Factor w/ 2 levels "male","female": 1 1 1 2 2 1 1 1 1 1 ...

Although ph.ecog variable is supposed to be a numerical variable, it has narrow scale (only 0 to 3 in our data). Thus we turn it into a categorical variable,

table(lca$ph.ecog)  # only one obs. = 3, set to 2

#> 
#>  0  1  2  3 
#> 47 81 38  1

lca[lca$ph.ecog == 3, ]$ph.ecog = 2
lca$ph.ecog = factor(lca$ph.ecog)
str(lca$ph.ecog)

#>  Factor w/ 3 levels "0","1","2": 1 2 2 3 3 2 3 2 2 2 ...

2.3 Data exploration

# explore
codebook(lca)

#> 
#>  
#>  
#> inst      :    
#>  obs. mean   median  s.d.   min.   max.  
#>  167  10.707 11      8.168  1      32    
#> 
#>  ================== 
#> time      :    
#>  obs. mean    median  s.d.    min.   max.  
#>  167  309.934 268     209.436 5      1022  
#> 
#>  ================== 
#> status    :    
#>  obs. mean   median  s.d.   min.   max.  
#>  167  0.719  1       0.451  0      1     
#> 
#>  ================== 
#> age   :    
#>  obs. mean   median  s.d.   min.   max.  
#>  167  62.569 64      9.211  39     82    
#> 
#>  ================== 
#> sex   :    
#>        Frequency Percent
#> male         103    61.7
#> female        64    38.3
#> 
#>  ================== 
#> ph.ecog   :    
#>   Frequency Percent
#> 0        47    28.1
#> 1        81    48.5
#> 2        39    23.4
#> 
#>  ================== 
#> ph.karno      :    
#>  obs. mean   median  s.d.   min.   max.  
#>  167  82.036 80      12.779 50     100   
#> 
#>  ================== 
#> pat.karno     :    
#>  obs. mean   median  s.d.   min.   max.  
#>  167  79.581 80      15.104 30     100   
#> 
#>  ================== 
#> meal.cal      :    
#>  obs. mean    median  s.d.   min.   max.  
#>  167  929.126 975     413.49 96     2600  
#> 
#>  ================== 
#> wt.loss   :    
#>  obs. mean   median  s.d.   min.   max.  
#>  167  9.719  7       13.379 -24    68    
#> 
#>  ==================

table(lca$status)  # number of events

#> 
#>   0   1 
#>  47 120

2.4 Univariable

Out task now is to decide on the predictors to include in the multivariable model. As usual, to use add1() function, we start with an empty model,

cox_lca0 = coxph(Surv(time, status) ~ 1, data = lca)  # empty model
summary(cox_lca0)  # remember, no intercept in Cox PH, so there's nothing here

#> Call:  coxph(formula = Surv(time, status) ~ 1, data = lca)
#> 
#> Null model
#>   log likelihood= -508.1168 
#>   n= 167

We list the variable names in the dataset, and apply some R trick to include ” + ” in between the variable names,

names(lca)

#>  [1] "inst"      "time"      "status"    "age"       "sex"       "ph.ecog"  
#>  [7] "ph.karno"  "pat.karno" "meal.cal"  "wt.loss"

cat(names(lca), sep = " + ")

#> inst + time + status + age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss

This makes our life easier to copy all the variable names for regression analysis.

For the current analysis, we skip ph.karno and pat.carno, we only include age + sex + ph.ecog + meal.cal + wt.loss in the predictor list,

add1(cox_lca0, scope = ~ age + sex + ph.ecog + meal.cal + wt.loss, 
     test = "Chisq")  # LR test, note the argument's value is different from glm

#> Single term additions
#> 
#> Model:
#> Surv(time, status) ~ 1
#>          Df    AIC     LRT Pr(>Chi)   
#> <none>      1016.2                    
#> age       1 1014.7  3.5236 0.060503 . 
#> sex       1 1012.0  6.2468 0.012442 * 
#> ph.ecog   2 1007.6 12.5861 0.001849 **
#> meal.cal  1 1018.0  0.2486 0.618081   
#> wt.loss   1 1018.2  0.0005 0.981746   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      # which uses test = "LRT"

Since meal.cal and wt.loss’s P-values < 0.25, we exclude these variables from multivariable model. We proceed with three variables, age, sex and ph.ecog.

2.5 Multivariable model

We include the selected variables into our multivariable model

cox_lca = coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lca)
cox_lca  # basic results

#> Call:
#> coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lca)
#> 
#>                coef exp(coef)  se(coef)      z       p
#> age        0.007223  1.007249  0.011229  0.643 0.52006
#> sexfemale -0.501674  0.605516  0.197374 -2.542 0.01103
#> ph.ecog1   0.313994  1.368882  0.233325  1.346 0.17839
#> ph.ecog2   0.889753  2.434527  0.269210  3.305 0.00095
#> 
#> Likelihood ratio test=20  on 4 df, p=0.0005001
#> n= 167, number of events= 120

Focus on:

Coefficients, \(\beta\)s.
HRs, exp(coef), which are given here.
P-values.

It also gives the likelihood ratio test, i.e. LR test of the present model vs empty model.

Using summary() gives more details,

summary(cox_lca)

#> Call:
#> coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lca)
#> 
#>   n= 167, number of events= 120 
#> 
#>                coef exp(coef)  se(coef)      z Pr(>|z|)    
#> age        0.007223  1.007249  0.011229  0.643  0.52006    
#> sexfemale -0.501674  0.605516  0.197374 -2.542  0.01103 *  
#> ph.ecog1   0.313994  1.368882  0.233325  1.346  0.17839    
#> ph.ecog2   0.889753  2.434527  0.269210  3.305  0.00095 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>           exp(coef) exp(-coef) lower .95 upper .95
#> age          1.0072     0.9928    0.9853    1.0297
#> sexfemale    0.6055     1.6515    0.4113    0.8915
#> ph.ecog1     1.3689     0.7305    0.8665    2.1626
#> ph.ecog2     2.4345     0.4108    1.4364    4.1264
#> 
#> Concordance= 0.642  (se = 0.031 )
#> Likelihood ratio test= 20  on 4 df,   p=5e-04
#> Wald test            = 20.32  on 4 df,   p=4e-04
#> Score (logrank) test = 21.16  on 4 df,   p=3e-04

which now includes 95% CI of the HRs.

2.6 Stepwise

# both
cox_lca_stepboth = step(cox_lca, direction = "both")

#> Start:  AIC=1004.24
#> Surv(time, status) ~ age + sex + ph.ecog
#> 
#>           Df    AIC
#> - age      1 1002.6
#> <none>       1004.2
#> - sex      1 1009.0
#> - ph.ecog  2 1011.4
#> 
#> Step:  AIC=1002.65
#> Surv(time, status) ~ sex + ph.ecog
#> 
#>           Df    AIC
#> <none>       1002.6
#> + age      1 1004.2
#> - sex      1 1007.6
#> - ph.ecog  2 1012.0

cox_lca_stepboth

#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#> 
#>              coef exp(coef) se(coef)      z        p
#> sexfemale -0.5076    0.6019   0.1970 -2.576 0.009983
#> ph.ecog1   0.3204    1.3776   0.2331  1.374 0.169289
#> ph.ecog2   0.9368    2.5518   0.2592  3.614 0.000301
#> 
#> Likelihood ratio test=19.58  on 3 df, p=0.0002074
#> n= 167, number of events= 120

# forward
cox_lca_stepforward = step(cox_lca0, scope = ~ age + sex + ph.ecog, direction = "forward")

#> Start:  AIC=1016.23
#> Surv(time, status) ~ 1
#> 
#>           Df    AIC
#> + ph.ecog  2 1007.6
#> + sex      1 1012.0
#> + age      1 1014.7
#> <none>       1016.2
#> 
#> Step:  AIC=1007.65
#> Surv(time, status) ~ ph.ecog
#> 
#>        Df    AIC
#> + sex   1 1002.6
#> <none>    1007.6
#> + age   1 1009.0
#> 
#> Step:  AIC=1002.65
#> Surv(time, status) ~ ph.ecog + sex
#> 
#>        Df    AIC
#> <none>    1002.6
#> + age   1 1004.2

cox_lca_stepforward

#> Call:
#> coxph(formula = Surv(time, status) ~ ph.ecog + sex, data = lca)
#> 
#>              coef exp(coef) se(coef)      z        p
#> ph.ecog1   0.3204    1.3776   0.2331  1.374 0.169289
#> ph.ecog2   0.9368    2.5518   0.2592  3.614 0.000301
#> sexfemale -0.5076    0.6019   0.1970 -2.576 0.009983
#> 
#> Likelihood ratio test=19.58  on 3 df, p=0.0002074
#> n= 167, number of events= 120

# backward
cox_lca_stepback = step(cox_lca, direction = "backward")

#> Start:  AIC=1004.24
#> Surv(time, status) ~ age + sex + ph.ecog
#> 
#>           Df    AIC
#> - age      1 1002.6
#> <none>       1004.2
#> - sex      1 1009.0
#> - ph.ecog  2 1011.4
#> 
#> Step:  AIC=1002.65
#> Surv(time, status) ~ sex + ph.ecog
#> 
#>           Df    AIC
#> <none>       1002.6
#> - sex      1 1007.6
#> - ph.ecog  2 1012.0

cox_lca_stepback

#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#> 
#>              coef exp(coef) se(coef)      z        p
#> sexfemale -0.5076    0.6019   0.1970 -2.576 0.009983
#> ph.ecog1   0.3204    1.3776   0.2331  1.374 0.169289
#> ph.ecog2   0.9368    2.5518   0.2592  3.614 0.000301
#> 
#> Likelihood ratio test=19.58  on 3 df, p=0.0002074
#> n= 167, number of events= 120

All stepwise methods give the same set of variables, which is sex + ph.ecog. We name it as cox_lca1,

cox_lca1 = cox_lca_stepboth
summary(cox_lca1)

#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#> 
#>   n= 167, number of events= 120 
#> 
#>              coef exp(coef) se(coef)      z Pr(>|z|)    
#> sexfemale -0.5076    0.6019   0.1970 -2.576 0.009983 ** 
#> ph.ecog1   0.3204    1.3776   0.2331  1.374 0.169289    
#> ph.ecog2   0.9368    2.5518   0.2592  3.614 0.000301 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>           exp(coef) exp(-coef) lower .95 upper .95
#> sexfemale    0.6019     1.6614    0.4091    0.8856
#> ph.ecog1     1.3776     0.7259    0.8724    2.1753
#> ph.ecog2     2.5518     0.3919    1.5354    4.2410
#> 
#> Concordance= 0.646  (se = 0.03 )
#> Likelihood ratio test= 19.58  on 3 df,   p=2e-04
#> Wald test            = 20.21  on 3 df,   p=2e-04
#> Score (logrank) test = 20.97  on 3 df,   p=1e-04

2.7 Confounder

We skip confounder checking this time, you may do this step as an exercise.

2.8 Model comparison

Compare cox_lca1 model with the no-variable model and all-variable model by LR test and AIC comparison.

# LR test
anova(cox_lca0, cox_lca1, cox_lca, test = "Chisq")

#> Analysis of Deviance Table
#>  Cox model: response is  Surv(time, status)
#>  Model 1: ~ 1
#>  Model 2: ~ sex + ph.ecog
#>  Model 3: ~ age + sex + ph.ecog
#>    loglik   Chisq Df Pr(>|Chi|)    
#> 1 -508.12                          
#> 2 -498.33 19.5796  3  0.0002074 ***
#> 3 -498.12  0.4173  1  0.5183043    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is no difference of 2-variable model (cox_lca1) to 3-variable full model.

# AIC
AIC(cox_lca0, cox_lca1, cox_lca)

#>          df      AIC
#> cox_lca0  0 1016.234
#> cox_lca1  3 1002.654
#> cox_lca   4 1004.237

cox_lca1 has lower AIC than 3-variable full model. Note that there is no AIC for empty model in Cox PH because there’s no intercept.

2.9 Multicollinearity, MC

We check for the variables are redundant (multicollinear) by looking at the magnitude of SE to its coefficient (same approach to that of logistic regression),

cox_lca1  # small SEs < coefficients

#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#> 
#>              coef exp(coef) se(coef)      z        p
#> sexfemale -0.5076    0.6019   0.1970 -2.576 0.009983
#> ph.ecog1   0.3204    1.3776   0.2331  1.374 0.169289
#> ph.ecog2   0.9368    2.5518   0.2592  3.614 0.000301
#> 
#> Likelihood ratio test=19.58  on 3 df, p=0.0002074
#> n= 167, number of events= 120

There are no large SEs for these two variables.

2.10 Interaction, *

Now we add \(sex \times ph.ecog\) interaction term to the model,

add1(cox_lca1, scope = ~ . + sex * ph.ecog, test = "Chisq")  # insig. *

#> Single term additions
#> 
#> Model:
#> Surv(time, status) ~ sex + ph.ecog
#>             Df    AIC    LRT Pr(>Chi)
#> <none>         1002.6                
#> sex:ph.ecog  2 1005.2 1.4089   0.4944

There was no significant interaction to be included in out model.

2.11 Proportional hazards assumption

This is a very important assumption in Cox PH model. As the name of the model itself suggests, proportional hazards asssumption is central to the model.

2.11.1 Statistical test

We start with a formal statistical test of the PH assumption by predictors and overall (global test) using cox.zph() function. It tests for constant coefficients over the time.

cox.zph(cox_lca1)

#>         chisq df     p
#> sex      1.54  1 0.214
#> ph.ecog  5.49  2 0.064
#> GLOBAL   6.52  3 0.089

We notice that the P-value < 0.05 for pg.ecog2 hazards i.e. for pg.ecog = 2 vs pg.ecog = 0 levels, which indicates the the hazards are not proportionate. But as we turn to the global test, the P-value is > 0.05. We may conclude it is proportionate if we consider the model as a whole.

2.11.2 Plots

2.11.2.1 Plots of the scaled Schoenfeld residuals over time

ggcoxzph(cox.zph(cox_lca1))

The points scattered fairly equally above and below the estimated coefficient lines over time. The points jumping above and below maybe because of categorical predictors, may look better if we have numerical predictors.

2.11.2.2 Cumulative hazard plots

Cumulative hazard plot for male and female,

sur_lca_sex = survfit(Surv(time, status) ~ sex, data = lca)
sur_lca_sex

#> Call: survfit(formula = Surv(time, status) ~ sex, data = lca)
#> 
#>              n events median 0.95LCL 0.95UCL
#> sex=male   103     82    284     223     353
#> sex=female  64     38    426     345     641

ggsurvplot(sur_lca_sex, fun = "cumhaz")

The lines are clearly parallel to each other, indicating the hazards are proportionate over time.

Cumulative hazard plot for ph.ecog of 0, 1 and 2,

sur_lca_ph.ecog = survfit(Surv(time, status) ~ ph.ecog, data = lca)
sur_lca_ph.ecog

#> Call: survfit(formula = Surv(time, status) ~ ph.ecog, data = lca)
#> 
#>            n events median 0.95LCL 0.95UCL
#> ph.ecog=0 47     27    371     337     643
#> ph.ecog=1 81     59    345     269     460
#> ph.ecog=2 39     34    199     156     291

ggsurvplot(sur_lca_ph.ecog, fun = "cumhaz")

ph.ecog = 2 looks less parallel to the other two curves. It could be a violation of the PH assumption.

2.11.2.3 Log minus log (LML) survival function plot

LML is also called log cumulative hazard plot. Remember that we can obtain cumulative hazard function \(H(t)\) from cumulative survival function \(S(t)\), because \(H_t = -log_eS(t)\). LML is just \(log_e(H_t)\).

LML plot for male and female,

ggsurvplot(sur_lca_sex, fun = "cloglog")

The lines are parallel, especially prominent for time period of > 50 days.

LML plot for ph.ecog of 0, 1 and 2,

ggsurvplot(sur_lca_ph.ecog, fun = "cloglog")

At longer time period (> 100 days), the lines look more parallel.

Judging from cumulative hazard and LML plots, there is probably a violation of PH assumption. But we should keep in mind that cumulative hazard and LML plots show us the hazards by group from basic survival data (univariable), thus just serve as rough PH assumption check.

2.11.3 Interpretation

At this point we have decided on the final model,

cox_lca_final = cox_lca1
summary(cox_lca_final)

#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#> 
#>   n= 167, number of events= 120 
#> 
#>              coef exp(coef) se(coef)      z Pr(>|z|)    
#> sexfemale -0.5076    0.6019   0.1970 -2.576 0.009983 ** 
#> ph.ecog1   0.3204    1.3776   0.2331  1.374 0.169289    
#> ph.ecog2   0.9368    2.5518   0.2592  3.614 0.000301 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>           exp(coef) exp(-coef) lower .95 upper .95
#> sexfemale    0.6019     1.6614    0.4091    0.8856
#> ph.ecog1     1.3776     0.7259    0.8724    2.1753
#> ph.ecog2     2.5518     0.3919    1.5354    4.2410
#> 
#> Concordance= 0.646  (se = 0.03 )
#> Likelihood ratio test= 19.58  on 3 df,   p=2e-04
#> Wald test            = 20.21  on 3 df,   p=2e-04
#> Score (logrank) test = 20.97  on 3 df,   p=1e-04

Female has lower hazard with HR = 0.60 (40% lower) than male, controlling for other predictors.
ECOG score 1 has higher hazard with HR = 1.38 (38% higher) than ECOG score 0, controlling for other predictors.
ECOG score 2 has higher hazard with HR = 2.55 (155% higher) than ECOG score 0, controlling for other predictors.

2.12 Model equations

Log\(_e\) hazard ratio is given by, \[log_e\bigg({h(t) \over h_0(t)}\bigg) = log_eHR = -0.51 * sex\ (female) + 0.32 \times ph.ecog\ (1) + 0.94 \times ph.ecog\ (2)\]

Hazard ratio is easily obtained by, \[{h(t) \over h_0(t)} = HR = e^{-0.51 * sex\ (female) + 0.32 \times ph.ecog\ (1) + 0.94 \times ph.ecog\ (2)}\]

Lastly, hazard (or risk in R) can be obtained from this equation, \[h(t) = h_0(t) \times e^{-0.51 * sex\ (female) + 0.32 \times ph.ecog\ (1) + 0.94 \times ph.ecog\ (2)}\] Hazard/risk is given by predict(..., type = "risk") in R. \(h_0(t)\) is found by setting all values of the predictors to baseline values. In our case, by setting sex = 0 (“male”) and ph.ecog = 0 (i.e. dummy variables ph.ecog1 = 0, ph.ecog2 = 0).

2.13 Prediction

2.13.1 HR and hazard

We start by adding predicted hazard to our sample,

lca$hazard = predict(cox_lca_final, type = "risk")

To add HR, we need to find \(h_0(t)\), the baseline hazard by predicting for, sex = 0 (“male”) and ph.ecog = 0 (i.e. dummy variables ph.ecog1 = 0, ph.ecog2 = 0),

h0_t = predict(cox_lca_final, list(sex = "male", ph.ecog = "0"), type = "risk")
h0_t  # h0(t)

#> 1 
#> 1

followed by \(HR = h(t)/h_0(t)\),

# HR = h(t)/h0(t)
lca$hr = lca$hazard/h0_t

To see whether we did it right, view the first 20 observations in the sample,

head(lca[c("sex", "ph.ecog", "time", "status", "hazard", "hr")], 20)

#>       sex ph.ecog time status    hazard        hr
#> 2    male       0  455      1 1.0000000 1.0000000
#> 4    male       1  210      1 1.3776214 1.3776214
#> 6    male       1 1022      0 1.3776214 1.3776214
#> 7  female       2  310      1 1.5359575 1.5359575
#> 8  female       2  361      1 1.5359575 1.5359575
#> 9    male       1  218      1 1.3776214 1.3776214
#> 10   male       2  166      1 2.5517798 2.5517798
#> 11   male       1  170      1 1.3776214 1.3776214
#> 15   male       1  567      1 1.3776214 1.3776214
#> 17   male       1  613      1 1.3776214 1.3776214
#> 18   male       2  707      1 2.5517798 2.5517798
#> 19 female       2   61      1 1.5359575 1.5359575
#> 21   male       1  301      1 1.3776214 1.3776214
#> 22 female       0   81      1 0.6019161 0.6019161
#> 24   male       0  371      1 1.0000000 1.0000000
#> 26 female       1  520      1 0.8292126 0.8292126
#> 27   male       0  574      1 1.0000000 1.0000000
#> 28   male       2  118      1 2.5517798 2.5517798
#> 29   male       1  390      1 1.3776214 1.3776214
#> 30   male       2   12      1 2.5517798 2.5517798

Now, we proceed to obtain the hazard and HR for a subject (sex = "female", ph.ecog = "2"). Hazard,

predict(cox_lca_final, list(sex = "female", ph.ecog = "2"), type = "risk")  # hazard

#>        1 
#> 1.535957

There are two ways to obtain HR. First, our equation for HR above,

exp(coef(cox_lca_final)[1]*1 + coef(cox_lca_final)[2]*0 + coef(cox_lca_final)[3]*1)  # HR

#> sexfemale 
#>  1.535957

We can also obtain HR by dividing hazard by the baseline hazard h0_t that we had before,

predict(cox_lca_final, list(sex = "female", ph.ecog = "2"), type = "risk")/h0_t  # HR

#>        1 
#> 1.535957

# we utilize the baseline hazard h0(t)

Then we obtain the hazard and HR for a data frame,

new_data = data.frame(sex = c("male", "male", "male", "female", "female", "female"),
                      ph.ecog = c("0", "1", "2", "0", "1", "2"))
new_data

#>      sex ph.ecog
#> 1   male       0
#> 2   male       1
#> 3   male       2
#> 4 female       0
#> 5 female       1
#> 6 female       2

new_hazard = hazard = predict(cox_lca_final, new_data, type = "risk")
new_hr = new_hazard/h0_t
data.frame(new_data, hazard = round(new_hazard, 3), hr = round(new_hr, 3))

#>      sex ph.ecog hazard    hr
#> 1   male       0  1.000 1.000
#> 2   male       1  1.378 1.378
#> 3   male       2  2.552 2.552
#> 4 female       0  0.602 0.602
#> 5 female       1  0.829 0.829
#> 6 female       2  1.536 1.536

2.13.2 Median survival times and survival probabilities

To obtain survival probabilities, Rizopoulos (2017) gives a good guide here:

http://www.drizopoulos.com/courses/emc/ep03_%20survival%20analysis%20in%20r%20companion

Here we obtain the median survival time and probabilities for a subject,

# simple, sex = "female", ph.ecog = "2"
new_cox1 = survfit(cox_lca_final, newdata = list(sex = "female", ph.ecog = "2"))
new_cox1  # median survival time

#> Call: survfit(formula = cox_lca_final, newdata = list(sex = "female", 
#>     ph.ecog = "2"))
#> 
#>        n events median 0.95LCL 0.95UCL
#> [1,] 167    120    285     218     429

summary(new_cox1, times = 100)  # survival at 100 days

#> Call: survfit(formula = cox_lca_final, newdata = list(sex = "female", 
#>     ph.ecog = "2"))
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>   100    143      24     0.83  0.0434        0.749        0.919

summary(new_cox1, times = c(100, 200, 300))  # survival at 100, 200 and 300 days

#> Call: survfit(formula = cox_lca_final, newdata = list(sex = "female", 
#>     ph.ecog = "2"))
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>   100    143      24    0.830  0.0434        0.749        0.919
#>   200    111      24    0.656  0.0672        0.536        0.802
#>   300     67      25    0.449  0.0827        0.313        0.644

Now for a data frame,

new_data = data.frame(sex = c("male", "male", "male", "female", "female", "female"),
                      ph.ecog = c("0", "1", "2", "0", "1", "2"))
new_data

#>      sex ph.ecog
#> 1   male       0
#> 2   male       1
#> 3   male       2
#> 4 female       0
#> 5 female       1
#> 6 female       2

new_cox2 = survfit(cox_lca_final, newdata = new_data)
new_cox2  # median survival times

#> Call: survfit(formula = cox_lca_final, newdata = new_data)
#> 
#>     n events median 0.95LCL 0.95UCL
#> 1 167    120    361     288     558
#> 2 167    120    291     245     371
#> 3 167    120    199     163     284
#> 4 167    120    550     426      NA
#> 5 167    120    429     337     641
#> 6 167    120    285     218     429

summary(new_cox2, times = 100)  # survival at 100 days

#> Call: survfit(formula = cox_lca_final, newdata = new_data)
#> 
#>  time n.risk n.event survival1 survival2 survival3 survival4 survival5
#>   100    143      24     0.886     0.846     0.733     0.929     0.904
#>  survival6
#>       0.83

summary(new_cox2, times = c(100, 200, 300))  # survival at 100, 200 and 300 days

#> Call: survfit(formula = cox_lca_final, newdata = new_data)
#> 
#>  time n.risk n.event survival1 survival2 survival3 survival4 survival5
#>   100    143      24     0.886     0.846     0.733     0.929     0.904
#>   200    111      24     0.760     0.685     0.496     0.848     0.796
#>   300     67      25     0.594     0.488     0.264     0.731     0.649
#>  survival6
#>      0.830
#>      0.656
#>      0.449

3 Exercises

Present the results in a table (follow Arifin et al. (2016)). You may follow the way of presentation for logistic regression.
Now, include ph.kano and pat.karno in the multivariable analysis. What do you get? *Hint: Interaction?
Perform Cox PH on builtin GBSG2 dataset (in PH.data package).
Use dataset from “Practice Datasets” folder.

References

Arifin, W. N., Sarimah, A., Norsa’adah, B., Majdi, Y. N., Siti-Azrin, A. H., Imran, M. K., … Naing, L. (2016). Reporting statistical results in medical journals. The Malaysian Journal of Medical Sciences: MJMS, 23(5), 1.

Chongsuvivatwong, V. (2022). epiDisplay: Epidemiological data display package. Retrieved from https://CRAN.R-project.org/package=epiDisplay

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (Third). Thousand Oaks CA: Sage. Retrieved from https://www.john-fox.ca/Companion/

Fox, J., Weisberg, S., & Price, B. (2024). Car: Companion to applied regression. Retrieved from https://r-forge.r-project.org/projects/car/

Grana, C., Chinol, M., Robertson, C., Mazzetta, C., Bartolomei, M., De Cicco, C., … Paganelli, G. (2002). Pretargeted adjuvant radioimmunotherapy with yttrium-90-biotin in malignant glioma patients: A pilot study. British Journal of Cancer, 86(2), 207.

Hothorn, T. (2025). TH.data: TH’s data archive. Retrieved from https://CRAN.R-project.org/package=TH.data

Hothorn, T., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2006). A Lego system for conditional inference. The American Statistician, 60(3), 257–263. https://doi.org/10.1198/000313006X118430

Hothorn, T., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2008). Implementing a class of permutation tests: The coin package. Journal of Statistical Software, 28(8), 1–23. https://doi.org/10.18637/jss.v028.i08

Hothorn, T., Winell, H., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2023). Coin: Conditional inference procedures in a permutation test framework. Retrieved from http://coin.r-forge.r-project.org

Kassambara, A., Kosinski, M., & Biecek, P. (2025). Survminer: Drawing survival curves using ggplot2. Retrieved from https://rpkgs.datanovia.com/survminer/index.html

Miller, R. G. (1997). Survival analysis. John Wiley & Sons.

Musa, K. I., Mansor, W. N. A. W., & Hanis, T. M. (2023). Data analysis in medicine and health using r. New York: Chapman; Hall/CRC. Retrieved from https://www.taylorfrancis.com/books/mono/10.1201/9781003296775/data-analysis-medicine-health-using-kamarul-imran-musa-wan-arifin-wan-mansor-tengku-muhammad-hanis

Rizopoulos, D. (2017). EP03: Survival analysis in R companion. Retrieved September 12, 2018, from http://www.drizopoulos.com/courses/emc/ep03_%20survival%20analysis%20in%20r%20companion#survival-probabilities-from-cox-models

Terry M. Therneau, & Patricia M. Grambsch. (2000). Modeling survival data: Extending the Cox model. New York: Springer.

Therneau, T. M. (2024). Survival: Survival analysis. Retrieved from https://github.com/therneau/survival

Woodward, M. (2013). Epidemiology: Study design and data analysis. Boca Raton, FL, USA: CRC Press.

Survival analysis using R: Cox Proportional Hazards Model

Dr. Wan Nor Arifin

Updated: 15 December 2025