1 Introduction

  1. A statistical method to analyze:

    • outcome: time to event (e.g. death, recurrence etc).
    • (comparison) predictors/independent variables: categorical variables.
  2. It is concerned with survival probability at specific time points over a time interval (follow-up period), e.g. five year survival etc.

  3. Basically, the interval survival at time \(t\) is as follows, \[Interval\ survival,p_t = {Survivors,\ n_t - Deaths,\ e_t \over Survivor,\ n_t}\] The survival is usually represented by cumulative survival until time \(t\), \[Cumulative\ survival, s_t = p_0p_1p_2...p_{t-1}\]

  4. In follow-up study over a period of time, not all subjects will experience event (e.g. not everyone die in 5 years). The subjects are called censored observations. In survival analysis, this censored observations are taken into account.

2 Kaplan-Meier survival curve in a group

Acute myeloid leukemia data (Miller, 1997).
time status x
9 1 Maintained
13 1 Maintained
13 0 Maintained
18 1 Maintained
23 1 Maintained
28 0 Maintained
31 1 Maintained
34 1 Maintained
45 0 Maintained
48 1 Maintained
161 0 Maintained
5 1 Nonmaintained
5 1 Nonmaintained
8 1 Nonmaintained
8 1 Nonmaintained
12 1 Nonmaintained
16 0 Nonmaintained
23 1 Nonmaintained
27 1 Nonmaintained
30 1 Nonmaintained
33 1 Nonmaintained
43 1 Nonmaintained
45 1 Nonmaintained

Load the required packages. Make sure you have all these in your computer.

# library
library(survival)
library(survminer)
library(epiDisplay)
library(coin)

We are going to use a built-in dataset in survival package, namely aml. We assign it to acute data object to avoid confusion with the built-in dataset name.

# data
?aml  # about the dataset
acute = aml
acute
#>    time status             x
#> 1     9      1    Maintained
#> 2    13      1    Maintained
#> 3    13      0    Maintained
#> 4    18      1    Maintained
#> 5    23      1    Maintained
#> 6    28      0    Maintained
#> 7    31      1    Maintained
#> 8    34      1    Maintained
#> 9    45      0    Maintained
#> 10   48      1    Maintained
#> 11  161      0    Maintained
#> 12    5      1 Nonmaintained
#> 13    5      1 Nonmaintained
#> 14    8      1 Nonmaintained
#> 15    8      1 Nonmaintained
#> 16   12      1 Nonmaintained
#> 17   16      0 Nonmaintained
#> 18   23      1 Nonmaintained
#> 19   27      1 Nonmaintained
#> 20   30      1 Nonmaintained
#> 21   33      1 Nonmaintained
#> 22   43      1 Nonmaintained
#> 23   45      1 Nonmaintained

Explore the data,

codebook(acute)
#> 
#>  
#>  
#> time      :    
#>  obs. mean   median  s.d.   min.   max.  
#>  23   29.478 23      31.72  5      161   
#> 
#>  ================== 
#> status    :    
#>  obs. mean   median  s.d.   min.   max.  
#>  23   0.783  1       0.422  0      1     
#> 
#>  ================== 
#> x     :    
#>               Frequency Percent
#> Maintained           11    47.8
#> Nonmaintained        12    52.2
#> 
#>  ==================
table(acute$status)  # number of events
#> 
#>  0  1 
#>  5 18

Generate survival curve data for plotting by survfit().

sur_aml = survfit(Surv(time, status) ~ 1, data = acute)

View median survival time and its 95% confidence interval from sur_aml,

sur_aml  # median survival time = 27 (95%CI: 18, 45)
#> Call: survfit(formula = Surv(time, status) ~ 1, data = acute)
#> 
#>       n events median 0.95LCL 0.95UCL
#> [1,] 23     18     27      18      45

and the details of the survival curve data,

summary(sur_aml)
#> Call: survfit(formula = Surv(time, status) ~ 1, data = acute)
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     5     23       2   0.9130  0.0588       0.8049        1.000
#>     8     21       2   0.8261  0.0790       0.6848        0.996
#>     9     19       1   0.7826  0.0860       0.6310        0.971
#>    12     18       1   0.7391  0.0916       0.5798        0.942
#>    13     17       1   0.6957  0.0959       0.5309        0.912
#>    18     14       1   0.6460  0.1011       0.4753        0.878
#>    23     13       2   0.5466  0.1073       0.3721        0.803
#>    27     11       1   0.4969  0.1084       0.3240        0.762
#>    30      9       1   0.4417  0.1095       0.2717        0.718
#>    31      8       1   0.3865  0.1089       0.2225        0.671
#>    33      7       1   0.3313  0.1064       0.1765        0.622
#>    34      6       1   0.2761  0.1020       0.1338        0.569
#>    43      5       1   0.2208  0.0954       0.0947        0.515
#>    45      4       1   0.1656  0.0860       0.0598        0.458
#>    48      2       1   0.0828  0.0727       0.0148        0.462

Note the survival column for \(s_t\).

Now plot Kaplan-Meier survival curve,

ggsurvplot(sur_aml, surv.median.line = "hv")

3 Comparing Kaplan-Meier curves between groups

Glioma data for histology grade 4 patients (Grana et al., 2002).
time event group
12 43 FALSE RIT
13 20 TRUE RIT
14 14 TRUE RIT
15 36 FALSE RIT
16 59 FALSE RIT
17 31 TRUE RIT
18 14 TRUE RIT
19 36 TRUE RIT
26 8 TRUE Control
27 8 TRUE Control
28 11 TRUE Control
29 12 TRUE Control
30 15 TRUE Control
31 5 TRUE Control
32 8 TRUE Control
33 8 TRUE Control
34 6 TRUE Control
35 14 TRUE Control
36 13 TRUE Control
37 25 TRUE Control

3.1 KM for two groups

Now we use the built-in data in coin package, namely glioma. We assign it to gli data object to avoid confusion with the built-in dataset name.

# data
?glioma  # about the dataset
gli = glioma
str(gli)
#> 'data.frame':    37 obs. of  7 variables:
#>  $ no.      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ age      : int  41 45 48 54 40 31 53 49 36 52 ...
#>  $ sex      : Factor w/ 2 levels "Female","Male": 1 1 2 2 1 2 2 2 2 2 ...
#>  $ histology: Factor w/ 2 levels "GBM","Grade3": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ group    : Factor w/ 2 levels "Control","RIT": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ event    : logi  TRUE FALSE FALSE FALSE FALSE TRUE ...
#>  $ time     : int  53 28 69 58 54 25 51 61 57 57 ...

For this analysis, we only need the data for GBM subgroup under histology,

gli4 = subset(gli, histology == "GBM")  # grade 4 glioma
str(gli4)
#> 'data.frame':    20 obs. of  7 variables:
#>  $ no.      : int  12 13 14 15 16 17 18 19 7 8 ...
#>  $ age      : int  55 70 39 40 47 58 40 36 32 70 ...
#>  $ sex      : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 2 1 2 1 2 ...
#>  $ histology: Factor w/ 2 levels "GBM","Grade3": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ group    : Factor w/ 2 levels "Control","RIT": 2 2 2 2 2 2 2 2 1 1 ...
#>  $ event    : logi  FALSE TRUE TRUE FALSE FALSE TRUE ...
#>  $ time     : int  43 20 14 36 59 31 14 36 8 8 ...

Expore the data,

codebook(gli4)
#> 
#>  
#>  
#> no.   :    
#>  obs. mean   median  s.d.   min.   max.  
#>  20   13.7   14      3.466  7      19    
#> 
#>  ================== 
#> age   :    
#>  obs. mean   median  s.d.   min.   max.  
#>  20   53.9   52.5    14.436 32     83    
#> 
#>  ================== 
#> sex   :    
#>        Frequency Percent
#> Female        10      50
#> Male          10      50
#> 
#>  ================== 
#> histology     :    
#>        Frequency Percent
#> GBM           20     100
#> Grade3         0       0
#> 
#>  ================== 
#> group     :    
#>         Frequency Percent
#> Control        12      60
#> RIT             8      40
#> 
#>  ================== 
#> event     :    
#>       Frequency Percent
#> FALSE         3      15
#> TRUE         17      85
#> 
#>  ================== 
#> time      :    
#>  obs. mean   median  s.d.   min.   max.  
#>  20   19.3   14      14.55  5      59    
#> 
#>  ==================

Now, we generate survival curve data,

sur_gli4 = survfit(Surv(time, event) ~ group, data = gli4)

then view the median (95% CI),

sur_gli4  # median survival times/group, 0.95UCL cannot be estimated
#> Call: survfit(formula = Surv(time, event) ~ group, data = gli4)
#> 
#>                n events median 0.95LCL 0.95UCL
#> group=Control 12     12    9.5       8      NA
#> group=RIT      8      5   33.5      20      NA

and the details of the survival curve data,

summary(sur_gli4)
#> Call: survfit(formula = Surv(time, event) ~ group, data = gli4)
#> 
#>                 group=Control 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     5     12       1   0.9167  0.0798       0.7729        1.000
#>     6     11       1   0.8333  0.1076       0.6470        1.000
#>     8     10       4   0.5000  0.1443       0.2840        0.880
#>    11      6       1   0.4167  0.1423       0.2133        0.814
#>    12      5       1   0.3333  0.1361       0.1498        0.742
#>    13      4       1   0.2500  0.1250       0.0938        0.666
#>    14      3       1   0.1667  0.1076       0.0470        0.591
#>    15      2       1   0.0833  0.0798       0.0128        0.544
#>    25      1       1   0.0000     NaN           NA           NA
#> 
#>                 group=RIT 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>    14      8       2    0.750   0.153        0.503        1.000
#>    20      6       1    0.625   0.171        0.365        1.000
#>    31      5       1    0.500   0.177        0.250        1.000
#>    36      4       1    0.375   0.171        0.153        0.917

Now we plot the sur_gli4 data,

ggsurvplot(sur_gli4, conf.int = T, surv.median.line = "hv")

3.2 Log-rank test comparing two groups

Now, we compare statistically the survival curves of the groups by log-rank test,

survdiff(Surv(time, event) ~ group, data = gli4)
#> Call:
#> survdiff(formula = Surv(time, event) ~ group, data = gli4)
#> 
#>                N Observed Expected (O-E)^2/E (O-E)^2/V
#> group=Control 12       12     5.93      6.23      12.6
#> group=RIT      8        5    11.07      3.33      12.6
#> 
#>  Chisq= 12.6  on 1 degrees of freedom, p= 4e-04

4 Exercises

  1. Analyze gli data for histology == “Grade3”.
gli3 = subset(gli, histology == "Grade3")
  1. Again, analyze using aml data from survival package. This time compare the groups (x variable in the dataset).
  2. Use dataset from “Practice Datasets” folder.

References

Arifin, W. N., Sarimah, A., Norsa’adah, B., Majdi, Y. N., Siti-Azrin, A. H., Imran, M. K., … Naing, L. (2016). Reporting statistical results in medical journals. The Malaysian Journal of Medical Sciences: MJMS, 23(5), 1.
Chongsuvivatwong, V. (2022). epiDisplay: Epidemiological data display package. Retrieved from https://CRAN.R-project.org/package=epiDisplay
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (Third). Thousand Oaks CA: Sage. Retrieved from https://www.john-fox.ca/Companion/
Fox, J., Weisberg, S., & Price, B. (2024). Car: Companion to applied regression. Retrieved from https://r-forge.r-project.org/projects/car/
Grana, C., Chinol, M., Robertson, C., Mazzetta, C., Bartolomei, M., De Cicco, C., … Paganelli, G. (2002). Pretargeted adjuvant radioimmunotherapy with yttrium-90-biotin in malignant glioma patients: A pilot study. British Journal of Cancer, 86(2), 207.
Hothorn, T. (2025). TH.data: TH’s data archive. Retrieved from https://CRAN.R-project.org/package=TH.data
Hothorn, T., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2006). A Lego system for conditional inference. The American Statistician, 60(3), 257–263. https://doi.org/10.1198/000313006X118430
Hothorn, T., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2008). Implementing a class of permutation tests: The coin package. Journal of Statistical Software, 28(8), 1–23. https://doi.org/10.18637/jss.v028.i08
Hothorn, T., Winell, H., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2023). Coin: Conditional inference procedures in a permutation test framework. Retrieved from http://coin.r-forge.r-project.org
Kassambara, A., Kosinski, M., & Biecek, P. (2025). Survminer: Drawing survival curves using ggplot2. Retrieved from https://rpkgs.datanovia.com/survminer/index.html
Miller, R. G. (1997). Survival analysis. John Wiley & Sons.
Musa, K. I., Mansor, W. N. A. W., & Hanis, T. M. (2023). Data analysis in medicine and health using r. New York: Chapman; Hall/CRC. Retrieved from https://www.taylorfrancis.com/books/mono/10.1201/9781003296775/data-analysis-medicine-health-using-kamarul-imran-musa-wan-arifin-wan-mansor-tengku-muhammad-hanis
Rizopoulos, D. (2017). EP03: Survival analysis in R companion. Retrieved September 12, 2018, from http://www.drizopoulos.com/courses/emc/ep03_%20survival%20analysis%20in%20r%20companion#survival-probabilities-from-cox-models
Terry M. Therneau, & Patricia M. Grambsch. (2000). Modeling survival data: Extending the Cox model. New York: Springer.
Therneau, T. M. (2024). Survival: Survival analysis. Retrieved from https://github.com/therneau/survival
Woodward, M. (2013). Epidemiology: Study design and data analysis. Boca Raton, FL, USA: CRC Press.