1 Introduction

A statistical method to analyze:
- outcome: time to event (e.g. death, recurrence etc).
- (comparison) predictors/independent variables: categorical variables.
It is concerned with survival probability at specific time points over a time interval (follow-up period), e.g. five year survival etc.
Basically, the interval survival at time \(t\) is as follows, \[Interval\ survival,p_t = {Survivors,\ n_t - Deaths,\ e_t \over Survivor,\ n_t}\] The survival is usually represented by cumulative survival until time \(t\), \[Cumulative\ survival, s_t = p_0p_1p_2...p_{t-1}\]
In follow-up study over a period of time, not all subjects will experience event (e.g. not everyone die in 5 years). The subjects are called censored observations. In survival analysis, this censored observations are taken into account.

2 Kaplan-Meier survival curve in a group

Acute myeloid leukemia data (Miller, 1997).
time	status	x
9	1	Maintained
13	1	Maintained
13	0	Maintained
18	1	Maintained
23	1	Maintained
28	0	Maintained
31	1	Maintained
34	1	Maintained
45	0	Maintained
48	1	Maintained
161	0	Maintained
5	1	Nonmaintained
5	1	Nonmaintained
8	1	Nonmaintained
8	1	Nonmaintained
12	1	Nonmaintained
16	0	Nonmaintained
23	1	Nonmaintained
27	1	Nonmaintained
30	1	Nonmaintained
33	1	Nonmaintained
43	1	Nonmaintained
45	1	Nonmaintained

Load the required packages. Make sure you have all these in your computer.

# library
library(survival)
library(survminer)
library(epiDisplay)
library(coin)

We are going to use a built-in dataset in survival package, namely aml. We assign it to acute data object to avoid confusion with the built-in dataset name.

# data
?aml  # about the dataset
acute = aml
acute

#>    time status             x
#> 1     9      1    Maintained
#> 2    13      1    Maintained
#> 3    13      0    Maintained
#> 4    18      1    Maintained
#> 5    23      1    Maintained
#> 6    28      0    Maintained
#> 7    31      1    Maintained
#> 8    34      1    Maintained
#> 9    45      0    Maintained
#> 10   48      1    Maintained
#> 11  161      0    Maintained
#> 12    5      1 Nonmaintained
#> 13    5      1 Nonmaintained
#> 14    8      1 Nonmaintained
#> 15    8      1 Nonmaintained
#> 16   12      1 Nonmaintained
#> 17   16      0 Nonmaintained
#> 18   23      1 Nonmaintained
#> 19   27      1 Nonmaintained
#> 20   30      1 Nonmaintained
#> 21   33      1 Nonmaintained
#> 22   43      1 Nonmaintained
#> 23   45      1 Nonmaintained

Explore the data,

codebook(acute)

#> 
#>  
#>  
#> time      :    
#>  obs. mean   median  s.d.   min.   max.  
#>  23   29.478 23      31.72  5      161   
#> 
#>  ================== 
#> status    :    
#>  obs. mean   median  s.d.   min.   max.  
#>  23   0.783  1       0.422  0      1     
#> 
#>  ================== 
#> x     :    
#>               Frequency Percent
#> Maintained           11    47.8
#> Nonmaintained        12    52.2
#> 
#>  ==================

table(acute$status)  # number of events

#> 
#>  0  1 
#>  5 18

Generate survival curve data for plotting by survfit().

sur_aml = survfit(Surv(time, status) ~ 1, data = acute)

View median survival time and its 95% confidence interval from sur_aml,

sur_aml  # median survival time = 27 (95%CI: 18, 45)

#> Call: survfit(formula = Surv(time, status) ~ 1, data = acute)
#> 
#>       n events median 0.95LCL 0.95UCL
#> [1,] 23     18     27      18      45

and the details of the survival curve data,

summary(sur_aml)

#> Call: survfit(formula = Surv(time, status) ~ 1, data = acute)
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     5     23       2   0.9130  0.0588       0.8049        1.000
#>     8     21       2   0.8261  0.0790       0.6848        0.996
#>     9     19       1   0.7826  0.0860       0.6310        0.971
#>    12     18       1   0.7391  0.0916       0.5798        0.942
#>    13     17       1   0.6957  0.0959       0.5309        0.912
#>    18     14       1   0.6460  0.1011       0.4753        0.878
#>    23     13       2   0.5466  0.1073       0.3721        0.803
#>    27     11       1   0.4969  0.1084       0.3240        0.762
#>    30      9       1   0.4417  0.1095       0.2717        0.718
#>    31      8       1   0.3865  0.1089       0.2225        0.671
#>    33      7       1   0.3313  0.1064       0.1765        0.622
#>    34      6       1   0.2761  0.1020       0.1338        0.569
#>    43      5       1   0.2208  0.0954       0.0947        0.515
#>    45      4       1   0.1656  0.0860       0.0598        0.458
#>    48      2       1   0.0828  0.0727       0.0148        0.462

Note the survival column for \(s_t\).

Now plot Kaplan-Meier survival curve,

ggsurvplot(sur_aml, surv.median.line = "hv")

3 Comparing Kaplan-Meier curves between groups

Glioma data for histology grade 4 patients (Grana et al., 2002).
	time	event	group
12	43	FALSE	RIT
13	20	TRUE	RIT
14	14	TRUE	RIT
15	36	FALSE	RIT
16	59	FALSE	RIT
17	31	TRUE	RIT
18	14	TRUE	RIT
19	36	TRUE	RIT
26	8	TRUE	Control
27	8	TRUE	Control
28	11	TRUE	Control
29	12	TRUE	Control
30	15	TRUE	Control
31	5	TRUE	Control
32	8	TRUE	Control
33	8	TRUE	Control
34	6	TRUE	Control
35	14	TRUE	Control
36	13	TRUE	Control
37	25	TRUE	Control

3.1 KM for two groups

Now we use the built-in data in coin package, namely glioma. We assign it to gli data object to avoid confusion with the built-in dataset name.

# data
?glioma  # about the dataset
gli = glioma
str(gli)

#> 'data.frame':    37 obs. of  7 variables:
#>  $ no.      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ age      : int  41 45 48 54 40 31 53 49 36 52 ...
#>  $ sex      : Factor w/ 2 levels "Female","Male": 1 1 2 2 1 2 2 2 2 2 ...
#>  $ histology: Factor w/ 2 levels "GBM","Grade3": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ group    : Factor w/ 2 levels "Control","RIT": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ event    : logi  TRUE FALSE FALSE FALSE FALSE TRUE ...
#>  $ time     : int  53 28 69 58 54 25 51 61 57 57 ...

For this analysis, we only need the data for GBM subgroup under histology,

gli4 = subset(gli, histology == "GBM")  # grade 4 glioma
str(gli4)

#> 'data.frame':    20 obs. of  7 variables:
#>  $ no.      : int  12 13 14 15 16 17 18 19 7 8 ...
#>  $ age      : int  55 70 39 40 47 58 40 36 32 70 ...
#>  $ sex      : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 2 1 2 1 2 ...
#>  $ histology: Factor w/ 2 levels "GBM","Grade3": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ group    : Factor w/ 2 levels "Control","RIT": 2 2 2 2 2 2 2 2 1 1 ...
#>  $ event    : logi  FALSE TRUE TRUE FALSE FALSE TRUE ...
#>  $ time     : int  43 20 14 36 59 31 14 36 8 8 ...

Expore the data,

codebook(gli4)

#> 
#>  
#>  
#> no.   :    
#>  obs. mean   median  s.d.   min.   max.  
#>  20   13.7   14      3.466  7      19    
#> 
#>  ================== 
#> age   :    
#>  obs. mean   median  s.d.   min.   max.  
#>  20   53.9   52.5    14.436 32     83    
#> 
#>  ================== 
#> sex   :    
#>        Frequency Percent
#> Female        10      50
#> Male          10      50
#> 
#>  ================== 
#> histology     :    
#>        Frequency Percent
#> GBM           20     100
#> Grade3         0       0
#> 
#>  ================== 
#> group     :    
#>         Frequency Percent
#> Control        12      60
#> RIT             8      40
#> 
#>  ================== 
#> event     :    
#>       Frequency Percent
#> FALSE         3      15
#> TRUE         17      85
#> 
#>  ================== 
#> time      :    
#>  obs. mean   median  s.d.   min.   max.  
#>  20   19.3   14      14.55  5      59    
#> 
#>  ==================

Now, we generate survival curve data,

sur_gli4 = survfit(Surv(time, event) ~ group, data = gli4)

then view the median (95% CI),

sur_gli4  # median survival times/group, 0.95UCL cannot be estimated

#> Call: survfit(formula = Surv(time, event) ~ group, data = gli4)
#> 
#>                n events median 0.95LCL 0.95UCL
#> group=Control 12     12    9.5       8      NA
#> group=RIT      8      5   33.5      20      NA

and the details of the survival curve data,

summary(sur_gli4)

#> Call: survfit(formula = Surv(time, event) ~ group, data = gli4)
#> 
#>                 group=Control 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     5     12       1   0.9167  0.0798       0.7729        1.000
#>     6     11       1   0.8333  0.1076       0.6470        1.000
#>     8     10       4   0.5000  0.1443       0.2840        0.880
#>    11      6       1   0.4167  0.1423       0.2133        0.814
#>    12      5       1   0.3333  0.1361       0.1498        0.742
#>    13      4       1   0.2500  0.1250       0.0938        0.666
#>    14      3       1   0.1667  0.1076       0.0470        0.591
#>    15      2       1   0.0833  0.0798       0.0128        0.544
#>    25      1       1   0.0000     NaN           NA           NA
#> 
#>                 group=RIT 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>    14      8       2    0.750   0.153        0.503        1.000
#>    20      6       1    0.625   0.171        0.365        1.000
#>    31      5       1    0.500   0.177        0.250        1.000
#>    36      4       1    0.375   0.171        0.153        0.917

Now we plot the sur_gli4 data,

ggsurvplot(sur_gli4, conf.int = T, surv.median.line = "hv")

3.2 Log-rank test comparing two groups

Now, we compare statistically the survival curves of the groups by log-rank test,

survdiff(Surv(time, event) ~ group, data = gli4)

#> Call:
#> survdiff(formula = Surv(time, event) ~ group, data = gli4)
#> 
#>                N Observed Expected (O-E)^2/E (O-E)^2/V
#> group=Control 12       12     5.93      6.23      12.6
#> group=RIT      8        5    11.07      3.33      12.6
#> 
#>  Chisq= 12.6  on 1 degrees of freedom, p= 4e-04

4 Exercises

Analyze gli data for histology == “Grade3”.

gli3 = subset(gli, histology == "Grade3")

Again, analyze using aml data from survival package. This time compare the groups (x variable in the dataset).
Use dataset from “Practice Datasets” folder.

References

Arifin, W. N., Sarimah, A., Norsa’adah, B., Majdi, Y. N., Siti-Azrin, A. H., Imran, M. K., … Naing, L. (2016). Reporting statistical results in medical journals. The Malaysian Journal of Medical Sciences: MJMS, 23(5), 1.

Chongsuvivatwong, V. (2022). epiDisplay: Epidemiological data display package. Retrieved from https://CRAN.R-project.org/package=epiDisplay

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (Third). Thousand Oaks CA: Sage. Retrieved from https://www.john-fox.ca/Companion/

Fox, J., Weisberg, S., & Price, B. (2024). Car: Companion to applied regression. Retrieved from https://r-forge.r-project.org/projects/car/

Grana, C., Chinol, M., Robertson, C., Mazzetta, C., Bartolomei, M., De Cicco, C., … Paganelli, G. (2002). Pretargeted adjuvant radioimmunotherapy with yttrium-90-biotin in malignant glioma patients: A pilot study. British Journal of Cancer, 86(2), 207.

Hothorn, T. (2025). TH.data: TH’s data archive. Retrieved from https://CRAN.R-project.org/package=TH.data

Hothorn, T., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2006). A Lego system for conditional inference. The American Statistician, 60(3), 257–263. https://doi.org/10.1198/000313006X118430

Hothorn, T., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2008). Implementing a class of permutation tests: The coin package. Journal of Statistical Software, 28(8), 1–23. https://doi.org/10.18637/jss.v028.i08

Hothorn, T., Winell, H., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2023). Coin: Conditional inference procedures in a permutation test framework. Retrieved from http://coin.r-forge.r-project.org

Kassambara, A., Kosinski, M., & Biecek, P. (2025). Survminer: Drawing survival curves using ggplot2. Retrieved from https://rpkgs.datanovia.com/survminer/index.html

Miller, R. G. (1997). Survival analysis. John Wiley & Sons.

Musa, K. I., Mansor, W. N. A. W., & Hanis, T. M. (2023). Data analysis in medicine and health using r. New York: Chapman; Hall/CRC. Retrieved from https://www.taylorfrancis.com/books/mono/10.1201/9781003296775/data-analysis-medicine-health-using-kamarul-imran-musa-wan-arifin-wan-mansor-tengku-muhammad-hanis

Rizopoulos, D. (2017). EP03: Survival analysis in R companion. Retrieved September 12, 2018, from http://www.drizopoulos.com/courses/emc/ep03_%20survival%20analysis%20in%20r%20companion#survival-probabilities-from-cox-models

Terry M. Therneau, & Patricia M. Grambsch. (2000). Modeling survival data: Extending the Cox model. New York: Springer.

Therneau, T. M. (2024). Survival: Survival analysis. Retrieved from https://github.com/therneau/survival

Woodward, M. (2013). Epidemiology: Study design and data analysis. Boca Raton, FL, USA: CRC Press.

Survival analysis using R: Kaplan-Meier Survival Curve

Dr. Wan Nor Arifin

Updated: 15 December 2025