A statistical method to model:
In contrast to KM approach, it is concerned with hazard of event.
Basically, the (interval) hazard at time \(t\) is as follows, \[Hazard,\ h_t = {Deaths,\ e_t \over Survivors,\ n_t \times Interval,\ u_t}\] where interval \(u_t\) is the time interval from present time \(t\) until the next event time \(t+1\).
Nelson-Aalen’s cumulative hazard estimate is given sum of \(e_i/n_1\) until time \(t\), \[Nelson-Aalen\ cumulative\ hazard,\ NA = \sum_{i\leq t} {e_i \over n_i}\]
In R, cumulative hazard function, \(H(t)\) is calculated from the estimated cumulative survival function, \(S(t)\) as follows, \[H(t)=-log_eS(t)\]
Whenever the predictor is a categorical variable with more than two levels, remember to consider dummy (binary) variable(s).
\[HR_i = e^{\beta_i}\]
Load the required packages. Make sure you have all these in your computer.
# library
library(survival)
library(survminer)
library(epiDisplay)
library(coin)
library(TH.data)
library(car)
Before we start with Cox PH model, we want to obtain the cumulative
survival and hazard until time \(t\)
using coxph() function on one group. Following our previous
one-group survival analysis sur_aml on aml
dataset,
acute = aml
cox_aml = coxph(Surv(time, status) ~ 1, data = acute)
sur_cox_aml = survfit(cox_aml)
summary(sur_cox_aml)
#> Call: survfit(formula = cox_aml)
#>
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 5 23 2 0.915 0.0575 0.8088 1.000
#> 8 21 2 0.830 0.0775 0.6910 0.997
#> 9 19 1 0.787 0.0844 0.6381 0.971
#> 12 18 1 0.745 0.0899 0.5878 0.944
#> 13 17 1 0.702 0.0943 0.5397 0.914
#> 18 14 1 0.654 0.0995 0.4852 0.881
#> 23 13 2 0.557 0.1057 0.3840 0.808
#> 27 11 1 0.509 0.1070 0.3367 0.768
#> 30 9 1 0.455 0.1083 0.2855 0.725
#> 31 8 1 0.402 0.1079 0.2372 0.680
#> 33 7 1 0.348 0.1060 0.1917 0.632
#> 34 6 1 0.295 0.1023 0.1493 0.582
#> 43 5 1 0.241 0.0966 0.1101 0.529
#> 45 4 1 0.188 0.0887 0.0745 0.474
#> 48 2 1 0.114 0.0784 0.0296 0.439
basehaz(cox_aml) # also try `-log(sur_cox_aml$surv)` to obtain H_t
#> hazard time
#> 1 0.08893281 5
#> 2 0.18655185 8
#> 3 0.23918343 9
#> 4 0.29473899 12
#> 5 0.35356252 13
#> 6 0.35356252 16
#> 7 0.42499109 18
#> 8 0.58524750 23
#> 9 0.67615659 27
#> 10 0.67615659 28
#> 11 0.78726770 30
#> 12 0.91226770 31
#> 13 1.05512484 33
#> 14 1.22179151 34
#> 15 1.42179151 43
#> 16 1.67179151 45
#> 17 2.17179151 48
#> 18 2.17179151 161
Now, back to our main business, we are going to use a built-in
dataset in survival package, namely lung. We
assign it to lca data object.
# data
?lung # about the dataset
lca = na.omit(lung) # omit subjects with missing data
str(lca)
#> 'data.frame': 167 obs. of 10 variables:
#> $ inst : num 3 5 12 7 11 1 7 6 12 22 ...
#> $ time : num 455 210 1022 310 361 ...
#> $ status : num 2 2 1 2 2 2 2 2 2 2 ...
#> $ age : num 68 57 74 68 71 53 61 57 57 70 ...
#> $ sex : num 1 1 1 2 2 1 1 1 1 1 ...
#> $ ph.ecog : num 0 1 1 2 2 1 2 1 1 1 ...
#> $ ph.karno : num 90 90 50 70 60 70 70 80 80 90 ...
#> $ pat.karno: num 90 60 80 60 80 80 70 80 70 100 ...
#> $ meal.cal : num 1225 1150 513 384 538 ...
#> $ wt.loss : num 15 11 0 10 1 16 34 27 60 -5 ...
#> - attr(*, "na.action")= 'omit' Named int [1:61] 1 3 5 12 13 14 16 20 23 25 ...
#> ..- attr(*, "names")= chr [1:61] "1" "3" "5" "12" ...
The dataset needs some modifications and preparations for the purpose
of our analysis, specifically status, sex and
ph.ecog. Again, use ?lung to look at the
description of the dataset.
Give proper coding for status variable as
0/1 for censored/dead
table(lca$status) # status: 1=censored, 2=dead
#>
#> 1 2
#> 47 120
lca$status = lca$status - 1 # turn to our usual 0=censored, 1=dead
Factor sex variable properly as male/female,
lca$sex = factor(lca$sex, levels = 1:2, labels = c("male", "female"))
str(lca$sex)
#> Factor w/ 2 levels "male","female": 1 1 1 2 2 1 1 1 1 1 ...
Although ph.ecog variable is supposed to be a numerical
variable, it has narrow scale (only 0 to 3 in our data). Thus we turn it
into a categorical variable,
table(lca$ph.ecog) # only one obs. = 3, set to 2
#>
#> 0 1 2 3
#> 47 81 38 1
lca[lca$ph.ecog == 3, ]$ph.ecog = 2
lca$ph.ecog = factor(lca$ph.ecog)
str(lca$ph.ecog)
#> Factor w/ 3 levels "0","1","2": 1 2 2 3 3 2 3 2 2 2 ...
# explore
codebook(lca)
#>
#>
#>
#> inst :
#> obs. mean median s.d. min. max.
#> 167 10.707 11 8.168 1 32
#>
#> ==================
#> time :
#> obs. mean median s.d. min. max.
#> 167 309.934 268 209.436 5 1022
#>
#> ==================
#> status :
#> obs. mean median s.d. min. max.
#> 167 0.719 1 0.451 0 1
#>
#> ==================
#> age :
#> obs. mean median s.d. min. max.
#> 167 62.569 64 9.211 39 82
#>
#> ==================
#> sex :
#> Frequency Percent
#> male 103 61.7
#> female 64 38.3
#>
#> ==================
#> ph.ecog :
#> Frequency Percent
#> 0 47 28.1
#> 1 81 48.5
#> 2 39 23.4
#>
#> ==================
#> ph.karno :
#> obs. mean median s.d. min. max.
#> 167 82.036 80 12.779 50 100
#>
#> ==================
#> pat.karno :
#> obs. mean median s.d. min. max.
#> 167 79.581 80 15.104 30 100
#>
#> ==================
#> meal.cal :
#> obs. mean median s.d. min. max.
#> 167 929.126 975 413.49 96 2600
#>
#> ==================
#> wt.loss :
#> obs. mean median s.d. min. max.
#> 167 9.719 7 13.379 -24 68
#>
#> ==================
table(lca$status) # number of events
#>
#> 0 1
#> 47 120
Out task now is to decide on the predictors to include in the
multivariable model. As usual, to use add1() function, we
start with an empty model,
cox_lca0 = coxph(Surv(time, status) ~ 1, data = lca) # empty model
summary(cox_lca0) # remember, no intercept in Cox PH, so there's nothing here
#> Call: coxph(formula = Surv(time, status) ~ 1, data = lca)
#>
#> Null model
#> log likelihood= -508.1168
#> n= 167
We list the variable names in the dataset, and apply some R trick to include ” + ” in between the variable names,
names(lca)
#> [1] "inst" "time" "status" "age" "sex" "ph.ecog"
#> [7] "ph.karno" "pat.karno" "meal.cal" "wt.loss"
cat(names(lca), sep = " + ")
#> inst + time + status + age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss
This makes our life easier to copy all the variable names for regression analysis.
For the current analysis, we skip ph.karno and
pat.carno, we only include
age + sex + ph.ecog + meal.cal + wt.loss in the predictor
list,
add1(cox_lca0, scope = ~ age + sex + ph.ecog + meal.cal + wt.loss,
test = "Chisq") # LR test, note the argument's value is different from glm
#> Single term additions
#>
#> Model:
#> Surv(time, status) ~ 1
#> Df AIC LRT Pr(>Chi)
#> <none> 1016.2
#> age 1 1014.7 3.5236 0.060503 .
#> sex 1 1012.0 6.2468 0.012442 *
#> ph.ecog 2 1007.6 12.5861 0.001849 **
#> meal.cal 1 1018.0 0.2486 0.618081
#> wt.loss 1 1018.2 0.0005 0.981746
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# which uses test = "LRT"
Since meal.cal and wt.loss’s P-values <
0.25, we exclude these variables from multivariable model. We proceed
with three variables, age, sex and
ph.ecog.
We include the selected variables into our multivariable model
cox_lca = coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lca)
cox_lca # basic results
#> Call:
#> coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lca)
#>
#> coef exp(coef) se(coef) z p
#> age 0.007223 1.007249 0.011229 0.643 0.52006
#> sexfemale -0.501674 0.605516 0.197374 -2.542 0.01103
#> ph.ecog1 0.313994 1.368882 0.233325 1.346 0.17839
#> ph.ecog2 0.889753 2.434527 0.269210 3.305 0.00095
#>
#> Likelihood ratio test=20 on 4 df, p=0.0005001
#> n= 167, number of events= 120
Focus on:
exp(coef), which are given here.It also gives the likelihood ratio test, i.e. LR test of
the present model vs empty model.
Using summary() gives more details,
summary(cox_lca)
#> Call:
#> coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lca)
#>
#> n= 167, number of events= 120
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> age 0.007223 1.007249 0.011229 0.643 0.52006
#> sexfemale -0.501674 0.605516 0.197374 -2.542 0.01103 *
#> ph.ecog1 0.313994 1.368882 0.233325 1.346 0.17839
#> ph.ecog2 0.889753 2.434527 0.269210 3.305 0.00095 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> age 1.0072 0.9928 0.9853 1.0297
#> sexfemale 0.6055 1.6515 0.4113 0.8915
#> ph.ecog1 1.3689 0.7305 0.8665 2.1626
#> ph.ecog2 2.4345 0.4108 1.4364 4.1264
#>
#> Concordance= 0.642 (se = 0.031 )
#> Likelihood ratio test= 20 on 4 df, p=5e-04
#> Wald test = 20.32 on 4 df, p=4e-04
#> Score (logrank) test = 21.16 on 4 df, p=3e-04
which now includes 95% CI of the HRs.
# both
cox_lca_stepboth = step(cox_lca, direction = "both")
#> Start: AIC=1004.24
#> Surv(time, status) ~ age + sex + ph.ecog
#>
#> Df AIC
#> - age 1 1002.6
#> <none> 1004.2
#> - sex 1 1009.0
#> - ph.ecog 2 1011.4
#>
#> Step: AIC=1002.65
#> Surv(time, status) ~ sex + ph.ecog
#>
#> Df AIC
#> <none> 1002.6
#> + age 1 1004.2
#> - sex 1 1007.6
#> - ph.ecog 2 1012.0
cox_lca_stepboth
#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#>
#> coef exp(coef) se(coef) z p
#> sexfemale -0.5076 0.6019 0.1970 -2.576 0.009983
#> ph.ecog1 0.3204 1.3776 0.2331 1.374 0.169289
#> ph.ecog2 0.9368 2.5518 0.2592 3.614 0.000301
#>
#> Likelihood ratio test=19.58 on 3 df, p=0.0002074
#> n= 167, number of events= 120
# forward
cox_lca_stepforward = step(cox_lca0, scope = ~ age + sex + ph.ecog, direction = "forward")
#> Start: AIC=1016.23
#> Surv(time, status) ~ 1
#>
#> Df AIC
#> + ph.ecog 2 1007.6
#> + sex 1 1012.0
#> + age 1 1014.7
#> <none> 1016.2
#>
#> Step: AIC=1007.65
#> Surv(time, status) ~ ph.ecog
#>
#> Df AIC
#> + sex 1 1002.6
#> <none> 1007.6
#> + age 1 1009.0
#>
#> Step: AIC=1002.65
#> Surv(time, status) ~ ph.ecog + sex
#>
#> Df AIC
#> <none> 1002.6
#> + age 1 1004.2
cox_lca_stepforward
#> Call:
#> coxph(formula = Surv(time, status) ~ ph.ecog + sex, data = lca)
#>
#> coef exp(coef) se(coef) z p
#> ph.ecog1 0.3204 1.3776 0.2331 1.374 0.169289
#> ph.ecog2 0.9368 2.5518 0.2592 3.614 0.000301
#> sexfemale -0.5076 0.6019 0.1970 -2.576 0.009983
#>
#> Likelihood ratio test=19.58 on 3 df, p=0.0002074
#> n= 167, number of events= 120
# backward
cox_lca_stepback = step(cox_lca, direction = "backward")
#> Start: AIC=1004.24
#> Surv(time, status) ~ age + sex + ph.ecog
#>
#> Df AIC
#> - age 1 1002.6
#> <none> 1004.2
#> - sex 1 1009.0
#> - ph.ecog 2 1011.4
#>
#> Step: AIC=1002.65
#> Surv(time, status) ~ sex + ph.ecog
#>
#> Df AIC
#> <none> 1002.6
#> - sex 1 1007.6
#> - ph.ecog 2 1012.0
cox_lca_stepback
#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#>
#> coef exp(coef) se(coef) z p
#> sexfemale -0.5076 0.6019 0.1970 -2.576 0.009983
#> ph.ecog1 0.3204 1.3776 0.2331 1.374 0.169289
#> ph.ecog2 0.9368 2.5518 0.2592 3.614 0.000301
#>
#> Likelihood ratio test=19.58 on 3 df, p=0.0002074
#> n= 167, number of events= 120
All stepwise methods give the same set of variables, which is
sex + ph.ecog. We name it as cox_lca1,
cox_lca1 = cox_lca_stepboth
summary(cox_lca1)
#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#>
#> n= 167, number of events= 120
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> sexfemale -0.5076 0.6019 0.1970 -2.576 0.009983 **
#> ph.ecog1 0.3204 1.3776 0.2331 1.374 0.169289
#> ph.ecog2 0.9368 2.5518 0.2592 3.614 0.000301 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> sexfemale 0.6019 1.6614 0.4091 0.8856
#> ph.ecog1 1.3776 0.7259 0.8724 2.1753
#> ph.ecog2 2.5518 0.3919 1.5354 4.2410
#>
#> Concordance= 0.646 (se = 0.03 )
#> Likelihood ratio test= 19.58 on 3 df, p=2e-04
#> Wald test = 20.21 on 3 df, p=2e-04
#> Score (logrank) test = 20.97 on 3 df, p=1e-04
We skip confounder checking this time, you may do this step as an exercise.
Compare cox_lca1 model with the no-variable model and
all-variable model by LR test and AIC comparison.
# LR test
anova(cox_lca0, cox_lca1, cox_lca, test = "Chisq")
#> Analysis of Deviance Table
#> Cox model: response is Surv(time, status)
#> Model 1: ~ 1
#> Model 2: ~ sex + ph.ecog
#> Model 3: ~ age + sex + ph.ecog
#> loglik Chisq Df Pr(>|Chi|)
#> 1 -508.12
#> 2 -498.33 19.5796 3 0.0002074 ***
#> 3 -498.12 0.4173 1 0.5183043
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is no difference of 2-variable model (cox_lca1) to
3-variable full model.
# AIC
AIC(cox_lca0, cox_lca1, cox_lca)
#> df AIC
#> cox_lca0 0 1016.234
#> cox_lca1 3 1002.654
#> cox_lca 4 1004.237
cox_lca1 has lower AIC than 3-variable full model. Note
that there is no AIC for empty model in Cox PH because there’s no
intercept.
We check for the variables are redundant (multicollinear) by looking at the magnitude of SE to its coefficient (same approach to that of logistic regression),
cox_lca1 # small SEs < coefficients
#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#>
#> coef exp(coef) se(coef) z p
#> sexfemale -0.5076 0.6019 0.1970 -2.576 0.009983
#> ph.ecog1 0.3204 1.3776 0.2331 1.374 0.169289
#> ph.ecog2 0.9368 2.5518 0.2592 3.614 0.000301
#>
#> Likelihood ratio test=19.58 on 3 df, p=0.0002074
#> n= 167, number of events= 120
There are no large SEs for these two variables.
Now we add \(sex \times ph.ecog\) interaction term to the model,
add1(cox_lca1, scope = ~ . + sex * ph.ecog, test = "Chisq") # insig. *
#> Single term additions
#>
#> Model:
#> Surv(time, status) ~ sex + ph.ecog
#> Df AIC LRT Pr(>Chi)
#> <none> 1002.6
#> sex:ph.ecog 2 1005.2 1.4089 0.4944
There was no significant interaction to be included in out model.
This is a very important assumption in Cox PH model. As the name of the model itself suggests, proportional hazards asssumption is central to the model.
We start with a formal statistical test of the PH assumption by
predictors and overall (global test) using cox.zph()
function. It tests for constant coefficients over the time.
cox.zph(cox_lca1)
#> chisq df p
#> sex 1.54 1 0.214
#> ph.ecog 5.49 2 0.064
#> GLOBAL 6.52 3 0.089
We notice that the P-value < 0.05 for
pg.ecog2 hazards i.e. for pg.ecog = 2 vs pg.ecog = 0
levels, which indicates the the hazards are not proportionate. But as we
turn to the global test, the P-value is > 0.05. We may
conclude it is proportionate if we consider the model as a whole.
ggcoxzph(cox.zph(cox_lca1))
The points scattered fairly equally above and below the estimated
coefficient lines over time. The points jumping above and below maybe
because of categorical predictors, may look better if we have numerical
predictors.
Cumulative hazard plot for male and female,
sur_lca_sex = survfit(Surv(time, status) ~ sex, data = lca)
sur_lca_sex
#> Call: survfit(formula = Surv(time, status) ~ sex, data = lca)
#>
#> n events median 0.95LCL 0.95UCL
#> sex=male 103 82 284 223 353
#> sex=female 64 38 426 345 641
ggsurvplot(sur_lca_sex, fun = "cumhaz")
The lines are clearly parallel to each other, indicating the hazards are
proportionate over time.
Cumulative hazard plot for ph.ecog of 0, 1 and 2,
sur_lca_ph.ecog = survfit(Surv(time, status) ~ ph.ecog, data = lca)
sur_lca_ph.ecog
#> Call: survfit(formula = Surv(time, status) ~ ph.ecog, data = lca)
#>
#> n events median 0.95LCL 0.95UCL
#> ph.ecog=0 47 27 371 337 643
#> ph.ecog=1 81 59 345 269 460
#> ph.ecog=2 39 34 199 156 291
ggsurvplot(sur_lca_ph.ecog, fun = "cumhaz")
ph.ecog = 2 looks less parallel to the other two curves. It
could be a violation of the PH assumption.
LML is also called log cumulative hazard plot. Remember that we can obtain cumulative hazard function \(H(t)\) from cumulative survival function \(S(t)\), because \(H_t = -log_eS(t)\). LML is just \(log_e(H_t)\).
LML plot for male and female,
ggsurvplot(sur_lca_sex, fun = "cloglog")
The lines are parallel, especially prominent for time period of > 50
days.
LML plot for ph.ecog of 0, 1 and 2,
ggsurvplot(sur_lca_ph.ecog, fun = "cloglog")
At longer time period (> 100 days), the lines look more parallel.
Judging from cumulative hazard and LML plots, there is probably a violation of PH assumption. But we should keep in mind that cumulative hazard and LML plots show us the hazards by group from basic survival data (univariable), thus just serve as rough PH assumption check.
At this point we have decided on the final model,
cox_lca_final = cox_lca1
summary(cox_lca_final)
#> Call:
#> coxph(formula = Surv(time, status) ~ sex + ph.ecog, data = lca)
#>
#> n= 167, number of events= 120
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> sexfemale -0.5076 0.6019 0.1970 -2.576 0.009983 **
#> ph.ecog1 0.3204 1.3776 0.2331 1.374 0.169289
#> ph.ecog2 0.9368 2.5518 0.2592 3.614 0.000301 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> sexfemale 0.6019 1.6614 0.4091 0.8856
#> ph.ecog1 1.3776 0.7259 0.8724 2.1753
#> ph.ecog2 2.5518 0.3919 1.5354 4.2410
#>
#> Concordance= 0.646 (se = 0.03 )
#> Likelihood ratio test= 19.58 on 3 df, p=2e-04
#> Wald test = 20.21 on 3 df, p=2e-04
#> Score (logrank) test = 20.97 on 3 df, p=1e-04
Log\(_e\) hazard ratio is given by, \[log_e\bigg({h(t) \over h_0(t)}\bigg) = log_eHR = -0.51 * sex\ (female) + 0.32 \times ph.ecog\ (1) + 0.94 \times ph.ecog\ (2)\]
Hazard ratio is easily obtained by, \[{h(t) \over h_0(t)} = HR = e^{-0.51 * sex\ (female) + 0.32 \times ph.ecog\ (1) + 0.94 \times ph.ecog\ (2)}\]
Lastly, hazard (or risk in R) can be obtained from this equation,
\[h(t) = h_0(t) \times e^{-0.51 * sex\
(female) + 0.32 \times ph.ecog\ (1) + 0.94 \times ph.ecog\ (2)}\]
Hazard/risk is given by predict(..., type = "risk") in R.
\(h_0(t)\) is found by setting all
values of the predictors to baseline values. In our case, by setting
sex = 0 (“male”) and ph.ecog = 0 (i.e. dummy
variables ph.ecog1 = 0, ph.ecog2 = 0).
We start by adding predicted hazard to our sample,
lca$hazard = predict(cox_lca_final, type = "risk")
To add HR, we need to find \(h_0(t)\), the baseline hazard by predicting
for, sex = 0 (“male”) and ph.ecog = 0
(i.e. dummy variables ph.ecog1 = 0, ph.ecog2 = 0),
h0_t = predict(cox_lca_final, list(sex = "male", ph.ecog = "0"), type = "risk")
h0_t # h0(t)
#> 1
#> 1
followed by \(HR = h(t)/h_0(t)\),
# HR = h(t)/h0(t)
lca$hr = lca$hazard/h0_t
To see whether we did it right, view the first 20 observations in the sample,
head(lca[c("sex", "ph.ecog", "time", "status", "hazard", "hr")], 20)
#> sex ph.ecog time status hazard hr
#> 2 male 0 455 1 1.0000000 1.0000000
#> 4 male 1 210 1 1.3776214 1.3776214
#> 6 male 1 1022 0 1.3776214 1.3776214
#> 7 female 2 310 1 1.5359575 1.5359575
#> 8 female 2 361 1 1.5359575 1.5359575
#> 9 male 1 218 1 1.3776214 1.3776214
#> 10 male 2 166 1 2.5517798 2.5517798
#> 11 male 1 170 1 1.3776214 1.3776214
#> 15 male 1 567 1 1.3776214 1.3776214
#> 17 male 1 613 1 1.3776214 1.3776214
#> 18 male 2 707 1 2.5517798 2.5517798
#> 19 female 2 61 1 1.5359575 1.5359575
#> 21 male 1 301 1 1.3776214 1.3776214
#> 22 female 0 81 1 0.6019161 0.6019161
#> 24 male 0 371 1 1.0000000 1.0000000
#> 26 female 1 520 1 0.8292126 0.8292126
#> 27 male 0 574 1 1.0000000 1.0000000
#> 28 male 2 118 1 2.5517798 2.5517798
#> 29 male 1 390 1 1.3776214 1.3776214
#> 30 male 2 12 1 2.5517798 2.5517798
Now, we proceed to obtain the hazard and HR for a subject
(sex = "female", ph.ecog = "2"). Hazard,
predict(cox_lca_final, list(sex = "female", ph.ecog = "2"), type = "risk") # hazard
#> 1
#> 1.535957
There are two ways to obtain HR. First, our equation for HR above,
exp(coef(cox_lca_final)[1]*1 + coef(cox_lca_final)[2]*0 + coef(cox_lca_final)[3]*1) # HR
#> sexfemale
#> 1.535957
We can also obtain HR by dividing hazard by the baseline hazard
h0_t that we had before,
predict(cox_lca_final, list(sex = "female", ph.ecog = "2"), type = "risk")/h0_t # HR
#> 1
#> 1.535957
# we utilize the baseline hazard h0(t)
Then we obtain the hazard and HR for a data frame,
new_data = data.frame(sex = c("male", "male", "male", "female", "female", "female"),
ph.ecog = c("0", "1", "2", "0", "1", "2"))
new_data
#> sex ph.ecog
#> 1 male 0
#> 2 male 1
#> 3 male 2
#> 4 female 0
#> 5 female 1
#> 6 female 2
new_hazard = hazard = predict(cox_lca_final, new_data, type = "risk")
new_hr = new_hazard/h0_t
data.frame(new_data, hazard = round(new_hazard, 3), hr = round(new_hr, 3))
#> sex ph.ecog hazard hr
#> 1 male 0 1.000 1.000
#> 2 male 1 1.378 1.378
#> 3 male 2 2.552 2.552
#> 4 female 0 0.602 0.602
#> 5 female 1 0.829 0.829
#> 6 female 2 1.536 1.536
To obtain survival probabilities, Rizopoulos (2017) gives a good guide here:
http://www.drizopoulos.com/courses/emc/ep03_%20survival%20analysis%20in%20r%20companion
Here we obtain the median survival time and probabilities for a subject,
# simple, sex = "female", ph.ecog = "2"
new_cox1 = survfit(cox_lca_final, newdata = list(sex = "female", ph.ecog = "2"))
new_cox1 # median survival time
#> Call: survfit(formula = cox_lca_final, newdata = list(sex = "female",
#> ph.ecog = "2"))
#>
#> n events median 0.95LCL 0.95UCL
#> [1,] 167 120 285 218 429
summary(new_cox1, times = 100) # survival at 100 days
#> Call: survfit(formula = cox_lca_final, newdata = list(sex = "female",
#> ph.ecog = "2"))
#>
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 100 143 24 0.83 0.0434 0.749 0.919
summary(new_cox1, times = c(100, 200, 300)) # survival at 100, 200 and 300 days
#> Call: survfit(formula = cox_lca_final, newdata = list(sex = "female",
#> ph.ecog = "2"))
#>
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 100 143 24 0.830 0.0434 0.749 0.919
#> 200 111 24 0.656 0.0672 0.536 0.802
#> 300 67 25 0.449 0.0827 0.313 0.644
Now for a data frame,
new_data = data.frame(sex = c("male", "male", "male", "female", "female", "female"),
ph.ecog = c("0", "1", "2", "0", "1", "2"))
new_data
#> sex ph.ecog
#> 1 male 0
#> 2 male 1
#> 3 male 2
#> 4 female 0
#> 5 female 1
#> 6 female 2
new_cox2 = survfit(cox_lca_final, newdata = new_data)
new_cox2 # median survival times
#> Call: survfit(formula = cox_lca_final, newdata = new_data)
#>
#> n events median 0.95LCL 0.95UCL
#> 1 167 120 361 288 558
#> 2 167 120 291 245 371
#> 3 167 120 199 163 284
#> 4 167 120 550 426 NA
#> 5 167 120 429 337 641
#> 6 167 120 285 218 429
summary(new_cox2, times = 100) # survival at 100 days
#> Call: survfit(formula = cox_lca_final, newdata = new_data)
#>
#> time n.risk n.event survival1 survival2 survival3 survival4 survival5
#> 100 143 24 0.886 0.846 0.733 0.929 0.904
#> survival6
#> 0.83
summary(new_cox2, times = c(100, 200, 300)) # survival at 100, 200 and 300 days
#> Call: survfit(formula = cox_lca_final, newdata = new_data)
#>
#> time n.risk n.event survival1 survival2 survival3 survival4 survival5
#> 100 143 24 0.886 0.846 0.733 0.929 0.904
#> 200 111 24 0.760 0.685 0.496 0.848 0.796
#> 300 67 25 0.594 0.488 0.264 0.731 0.649
#> survival6
#> 0.830
#> 0.656
#> 0.449
ph.kano and pat.karno in the
multivariable analysis. What do you get? *Hint: Interaction?GBSG2 dataset (in
PH.data package).