A statistical method to analyze:
It is concerned with survival probability at specific time points over a time interval (follow-up period), e.g. five year survival etc.
Basically, the interval survival at time \(t\) is as follows, \[Interval\ survival,p_t = {Survivors,\ n_t - Deaths,\ e_t \over Survivor,\ n_t}\] The survival is usually represented by cumulative survival until time \(t\), \[Cumulative\ survival, s_t = p_0p_1p_2...p_{t-1}\]
In follow-up study over a period of time, not all subjects will experience event (e.g. not everyone die in 5 years). The subjects are called censored observations. In survival analysis, this censored observations are taken into account.
| time | status | x |
|---|---|---|
| 9 | 1 | Maintained |
| 13 | 1 | Maintained |
| 13 | 0 | Maintained |
| 18 | 1 | Maintained |
| 23 | 1 | Maintained |
| 28 | 0 | Maintained |
| 31 | 1 | Maintained |
| 34 | 1 | Maintained |
| 45 | 0 | Maintained |
| 48 | 1 | Maintained |
| 161 | 0 | Maintained |
| 5 | 1 | Nonmaintained |
| 5 | 1 | Nonmaintained |
| 8 | 1 | Nonmaintained |
| 8 | 1 | Nonmaintained |
| 12 | 1 | Nonmaintained |
| 16 | 0 | Nonmaintained |
| 23 | 1 | Nonmaintained |
| 27 | 1 | Nonmaintained |
| 30 | 1 | Nonmaintained |
| 33 | 1 | Nonmaintained |
| 43 | 1 | Nonmaintained |
| 45 | 1 | Nonmaintained |
Load the required packages. Make sure you have all these in your computer.
# library
library(survival)
library(survminer)
library(epiDisplay)
library(coin)
We are going to use a built-in dataset in survival
package, namely aml. We assign it to acute
data object to avoid confusion with the built-in dataset name.
# data
?aml # about the dataset
acute = aml
acute
#> time status x
#> 1 9 1 Maintained
#> 2 13 1 Maintained
#> 3 13 0 Maintained
#> 4 18 1 Maintained
#> 5 23 1 Maintained
#> 6 28 0 Maintained
#> 7 31 1 Maintained
#> 8 34 1 Maintained
#> 9 45 0 Maintained
#> 10 48 1 Maintained
#> 11 161 0 Maintained
#> 12 5 1 Nonmaintained
#> 13 5 1 Nonmaintained
#> 14 8 1 Nonmaintained
#> 15 8 1 Nonmaintained
#> 16 12 1 Nonmaintained
#> 17 16 0 Nonmaintained
#> 18 23 1 Nonmaintained
#> 19 27 1 Nonmaintained
#> 20 30 1 Nonmaintained
#> 21 33 1 Nonmaintained
#> 22 43 1 Nonmaintained
#> 23 45 1 Nonmaintained
Explore the data,
codebook(acute)
#>
#>
#>
#> time :
#> obs. mean median s.d. min. max.
#> 23 29.478 23 31.72 5 161
#>
#> ==================
#> status :
#> obs. mean median s.d. min. max.
#> 23 0.783 1 0.422 0 1
#>
#> ==================
#> x :
#> Frequency Percent
#> Maintained 11 47.8
#> Nonmaintained 12 52.2
#>
#> ==================
table(acute$status) # number of events
#>
#> 0 1
#> 5 18
Generate survival curve data for plotting by
survfit().
sur_aml = survfit(Surv(time, status) ~ 1, data = acute)
View median survival time and its 95% confidence interval from
sur_aml,
sur_aml # median survival time = 27 (95%CI: 18, 45)
#> Call: survfit(formula = Surv(time, status) ~ 1, data = acute)
#>
#> n events median 0.95LCL 0.95UCL
#> [1,] 23 18 27 18 45
and the details of the survival curve data,
summary(sur_aml)
#> Call: survfit(formula = Surv(time, status) ~ 1, data = acute)
#>
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 5 23 2 0.9130 0.0588 0.8049 1.000
#> 8 21 2 0.8261 0.0790 0.6848 0.996
#> 9 19 1 0.7826 0.0860 0.6310 0.971
#> 12 18 1 0.7391 0.0916 0.5798 0.942
#> 13 17 1 0.6957 0.0959 0.5309 0.912
#> 18 14 1 0.6460 0.1011 0.4753 0.878
#> 23 13 2 0.5466 0.1073 0.3721 0.803
#> 27 11 1 0.4969 0.1084 0.3240 0.762
#> 30 9 1 0.4417 0.1095 0.2717 0.718
#> 31 8 1 0.3865 0.1089 0.2225 0.671
#> 33 7 1 0.3313 0.1064 0.1765 0.622
#> 34 6 1 0.2761 0.1020 0.1338 0.569
#> 43 5 1 0.2208 0.0954 0.0947 0.515
#> 45 4 1 0.1656 0.0860 0.0598 0.458
#> 48 2 1 0.0828 0.0727 0.0148 0.462
Note the survival column for \(s_t\).
Now plot Kaplan-Meier survival curve,
ggsurvplot(sur_aml, surv.median.line = "hv")
| time | event | group | |
|---|---|---|---|
| 12 | 43 | FALSE | RIT |
| 13 | 20 | TRUE | RIT |
| 14 | 14 | TRUE | RIT |
| 15 | 36 | FALSE | RIT |
| 16 | 59 | FALSE | RIT |
| 17 | 31 | TRUE | RIT |
| 18 | 14 | TRUE | RIT |
| 19 | 36 | TRUE | RIT |
| 26 | 8 | TRUE | Control |
| 27 | 8 | TRUE | Control |
| 28 | 11 | TRUE | Control |
| 29 | 12 | TRUE | Control |
| 30 | 15 | TRUE | Control |
| 31 | 5 | TRUE | Control |
| 32 | 8 | TRUE | Control |
| 33 | 8 | TRUE | Control |
| 34 | 6 | TRUE | Control |
| 35 | 14 | TRUE | Control |
| 36 | 13 | TRUE | Control |
| 37 | 25 | TRUE | Control |
Now we use the built-in data in coin package, namely
glioma. We assign it to gli data object to
avoid confusion with the built-in dataset name.
# data
?glioma # about the dataset
gli = glioma
str(gli)
#> 'data.frame': 37 obs. of 7 variables:
#> $ no. : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ age : int 41 45 48 54 40 31 53 49 36 52 ...
#> $ sex : Factor w/ 2 levels "Female","Male": 1 1 2 2 1 2 2 2 2 2 ...
#> $ histology: Factor w/ 2 levels "GBM","Grade3": 2 2 2 2 2 2 2 2 2 2 ...
#> $ group : Factor w/ 2 levels "Control","RIT": 2 2 2 2 2 2 2 2 2 2 ...
#> $ event : logi TRUE FALSE FALSE FALSE FALSE TRUE ...
#> $ time : int 53 28 69 58 54 25 51 61 57 57 ...
For this analysis, we only need the data for GBM
subgroup under histology,
gli4 = subset(gli, histology == "GBM") # grade 4 glioma
str(gli4)
#> 'data.frame': 20 obs. of 7 variables:
#> $ no. : int 12 13 14 15 16 17 18 19 7 8 ...
#> $ age : int 55 70 39 40 47 58 40 36 32 70 ...
#> $ sex : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 2 1 2 1 2 ...
#> $ histology: Factor w/ 2 levels "GBM","Grade3": 1 1 1 1 1 1 1 1 1 1 ...
#> $ group : Factor w/ 2 levels "Control","RIT": 2 2 2 2 2 2 2 2 1 1 ...
#> $ event : logi FALSE TRUE TRUE FALSE FALSE TRUE ...
#> $ time : int 43 20 14 36 59 31 14 36 8 8 ...
Expore the data,
codebook(gli4)
#>
#>
#>
#> no. :
#> obs. mean median s.d. min. max.
#> 20 13.7 14 3.466 7 19
#>
#> ==================
#> age :
#> obs. mean median s.d. min. max.
#> 20 53.9 52.5 14.436 32 83
#>
#> ==================
#> sex :
#> Frequency Percent
#> Female 10 50
#> Male 10 50
#>
#> ==================
#> histology :
#> Frequency Percent
#> GBM 20 100
#> Grade3 0 0
#>
#> ==================
#> group :
#> Frequency Percent
#> Control 12 60
#> RIT 8 40
#>
#> ==================
#> event :
#> Frequency Percent
#> FALSE 3 15
#> TRUE 17 85
#>
#> ==================
#> time :
#> obs. mean median s.d. min. max.
#> 20 19.3 14 14.55 5 59
#>
#> ==================
Now, we generate survival curve data,
sur_gli4 = survfit(Surv(time, event) ~ group, data = gli4)
then view the median (95% CI),
sur_gli4 # median survival times/group, 0.95UCL cannot be estimated
#> Call: survfit(formula = Surv(time, event) ~ group, data = gli4)
#>
#> n events median 0.95LCL 0.95UCL
#> group=Control 12 12 9.5 8 NA
#> group=RIT 8 5 33.5 20 NA
and the details of the survival curve data,
summary(sur_gli4)
#> Call: survfit(formula = Surv(time, event) ~ group, data = gli4)
#>
#> group=Control
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 5 12 1 0.9167 0.0798 0.7729 1.000
#> 6 11 1 0.8333 0.1076 0.6470 1.000
#> 8 10 4 0.5000 0.1443 0.2840 0.880
#> 11 6 1 0.4167 0.1423 0.2133 0.814
#> 12 5 1 0.3333 0.1361 0.1498 0.742
#> 13 4 1 0.2500 0.1250 0.0938 0.666
#> 14 3 1 0.1667 0.1076 0.0470 0.591
#> 15 2 1 0.0833 0.0798 0.0128 0.544
#> 25 1 1 0.0000 NaN NA NA
#>
#> group=RIT
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 14 8 2 0.750 0.153 0.503 1.000
#> 20 6 1 0.625 0.171 0.365 1.000
#> 31 5 1 0.500 0.177 0.250 1.000
#> 36 4 1 0.375 0.171 0.153 0.917
Now we plot the sur_gli4 data,
ggsurvplot(sur_gli4, conf.int = T, surv.median.line = "hv")
Now, we compare statistically the survival curves of the groups by log-rank test,
survdiff(Surv(time, event) ~ group, data = gli4)
#> Call:
#> survdiff(formula = Surv(time, event) ~ group, data = gli4)
#>
#> N Observed Expected (O-E)^2/E (O-E)^2/V
#> group=Control 12 12 5.93 6.23 12.6
#> group=RIT 8 5 11.07 3.33 12.6
#>
#> Chisq= 12.6 on 1 degrees of freedom, p= 4e-04
gli3 = subset(gli, histology == "Grade3")
aml data from
survival package. This time compare the groups
(x variable in the dataset).