1 Two independent samples - Chi-squared test for association

1.1 About the test:

Non-parametric test.
Association between TWO categorical variables.
Cross-tabulation of the variables, usually 2 x 2, but each variable can have any number of levels.
The association between the variables is assessed by comparing the observed cell counts with the cell counts expected if the variables were not associated with each other.
Requirement: fewer than 25% of the cells have an expected count below 5.
\(\chi^2\) statistics.

1.2 Analysis:

The data.

             Cancer
Smoking      lung cancer no lung cancer
  smoking             20             12
  no smoking          55            113

Now, load lung.csv,

lung = read.csv("lung.csv")
str(lung)

#> 'data.frame':    200 obs. of  2 variables:
#>  $ Smoking: chr  "smoking" "smoking" "smoking" "smoking" ...
#>  $ Cancer : chr  "cancer" "cancer" "cancer" "cancer" ...

head(lung)

#>   Smoking Cancer
#> 1 smoking cancer
#> 2 smoking cancer
#> 3 smoking cancer
#> 4 smoking cancer
#> 5 smoking cancer
#> 6 smoking cancer
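
If lung.csv is not at hand, a data frame with the same structure can be rebuilt from the four cell counts shown at the start of this section. This is only a sketch: the name lung_demo is hypothetical, and the row order is an assumption that merely mimics the head() output above.

```r
# Hypothetical stand-in for lung.csv: one row per subject, rebuilt from the cell counts
lung_demo = data.frame(
  Smoking = rep(c("smoking", "smoking", "no smoking", "no smoking"),
                times = c(20, 12, 55, 113)),
  Cancer  = rep(c("cancer", "no cancer", "cancer", "no cancer"),
                times = c(20, 12, 55, 113))
)
str(lung_demo)
# Cross-tabulate to confirm that the counts were entered correctly
table(Smoking = lung_demo$Smoking, Cancer = lung_demo$Cancer)
```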

Now, we create a cross-tabulation of the categorical variables,

tab_lung = table(Smoking = lung$Smoking, Cancer = lung$Cancer)
str(tab_lung)

#>  'table' int [1:2, 1:2] 55 20 113 12
#>  - attr(*, "dimnames")=List of 2
#>   ..$ Smoking: chr [1:2] "no smoking" "smoking"
#>   ..$ Cancer : chr [1:2] "cancer" "no cancer"

and view the table,

tab_lung

#>             Cancer
#> Smoking      cancer no cancer
#>   no smoking     55       113
#>   smoking        20        12

addmargins(tab_lung)

#>             Cancer
#> Smoking      cancer no cancer Sum
#>   no smoking     55       113 168
#>   smoking        20        12  32
#>   Sum            75       125 200
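
table() sorts character values alphabetically, which is why "no smoking" is printed before "smoking" here, unlike the layout shown at the start of this section. If the other order is preferred, one option is to convert the variables to factors with explicit levels before tabulating. A minimal sketch on a data frame rebuilt from the cell counts (lung_demo and tab_demo are hypothetical names, not part of the original analysis):

```r
# Hypothetical stand-in for lung.csv, rebuilt from the cell counts
lung_demo = data.frame(
  Smoking = rep(c("smoking", "smoking", "no smoking", "no smoking"),
                times = c(20, 12, 55, 113)),
  Cancer  = rep(c("cancer", "no cancer", "cancer", "no cancer"),
                times = c(20, 12, 55, 113))
)
# Explicit levels control the row and column order of the table
lung_demo$Smoking = factor(lung_demo$Smoking, levels = c("smoking", "no smoking"))
lung_demo$Cancer  = factor(lung_demo$Cancer, levels = c("cancer", "no cancer"))
tab_demo = table(Smoking = lung_demo$Smoking, Cancer = lung_demo$Cancer)
tab_demo
```

The chi-squared test result does not depend on the order of the rows or columns; only the presentation changes.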

Perform the chi-squared test for association. There are two ways to do this: by using the table,

chisq.test(tab_lung)

#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  tab_lung
#> X-squared = 8.9286, df = 1, p-value = 0.002807

or by using the variables directly,

chisq.test(lung$Smoking, lung$Cancer)

#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  lung$Smoking and lung$Cancer
#> X-squared = 8.9286, df = 1, p-value = 0.002807
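
Both outputs mention Yates' continuity correction, which chisq.test() applies by default to 2 x 2 tables. The sketch below compares the corrected and uncorrected statistics using the correct argument. The table is re-entered from its counts under the hypothetical name tab_demo so that the block runs on its own.

```r
# Same counts and layout as tab_lung (tab_demo is a hypothetical name)
tab_demo = matrix(c(55, 20, 113, 12), nrow = 2,
                  dimnames = list(Smoking = c("no smoking", "smoking"),
                                  Cancer = c("cancer", "no cancer")))
res_yates = chisq.test(tab_demo)                   # default: correct = TRUE
res_plain = chisq.test(tab_demo, correct = FALSE)  # no continuity correction
round(c(yates = unname(res_yates$statistic),
        uncorrected = unname(res_plain$statistic)), 4)
```

The corrected statistic is the smaller of the two, so the default is the more conservative choice.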

But remember, do not rely on the P-value alone: for the chi-squared test, you must review the table to get an idea of the direction and strength of the association.
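
One way to review the table is to look at row proportions, i.e. the proportion with and without cancer within each smoking group. A minimal sketch using prop.table(), with the counts re-entered under the hypothetical name tab_demo:

```r
# Same counts and layout as tab_lung (tab_demo is a hypothetical name)
tab_demo = matrix(c(55, 20, 113, 12), nrow = 2,
                  dimnames = list(Smoking = c("no smoking", "smoking"),
                                  Cancer = c("cancer", "no cancer")))
# margin = 1 divides each cell by its row total, so each row sums to 1
prop_demo = prop.table(tab_demo, margin = 1)
round(prop_demo, 3)
```

Here 20 of 32 smokers (62.5%) had lung cancer, compared with 55 of 168 non-smokers (about 32.7%).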

Check the assumption: fewer than 25% of the cells should have an expected count below 5.

The expected cell counts,

chisq.test(tab_lung)$expected

#>             Cancer
#> Smoking      cancer no cancer
#>   no smoking     63       105
#>   smoking        12        20

No expected count is below 5, thus we can rely on the chi-squared test.
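
The expected counts and the test statistic can also be reproduced by hand, which makes the comparison of observed and expected counts explicit. Each expected count is the row total times the column total divided by the grand total. The sketch below uses hypothetical names (tab_demo, exp_demo, x2_demo, p_demo) and includes the 0.5 continuity correction so that the result is comparable to the chisq.test() output.

```r
# Same counts and layout as tab_lung (all *_demo names are hypothetical)
tab_demo = matrix(c(55, 20, 113, 12), nrow = 2,
                  dimnames = list(Smoking = c("no smoking", "smoking"),
                                  Cancer = c("cancer", "no cancer")))
# Expected counts under no association: row total x column total / grand total
exp_demo = outer(rowSums(tab_demo), colSums(tab_demo)) / sum(tab_demo)
exp_demo
# Proportion of cells with an expected count below 5 (should be under 0.25)
mean(exp_demo < 5)
# Yates-corrected statistic: subtract 0.5 from |observed - expected| before squaring
x2_demo = sum((abs(tab_demo - exp_demo) - 0.5)^2 / exp_demo)
# Degrees of freedom = (rows - 1) x (columns - 1) = 1 for a 2 x 2 table
p_demo = pchisq(x2_demo, df = 1, lower.tail = FALSE)
round(c(X2 = x2_demo, p = p_demo), 4)
```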

1.3 Presentation

Guide: Reporting Statistical Results in Medical Journals

2 Two independent samples - Fisher’s exact test

2.1 About the test:

Alternative to the chi-squared test.
Usually used for small cell counts, i.e. when the chi-squared test requirement is not fulfilled.
Gives an exact P-value, calculated from the hypergeometric distribution rather than from a large-sample \(\chi^2\) approximation.
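
To illustrate where the exact P-value comes from, the sketch below reproduces it for the smoking and lung cancer counts from Section 1. With all margins fixed, the top-left cell follows a hypergeometric distribution, and the two-sided P-value is the total probability of all tables that are no more likely than the observed one. All names are hypothetical, and the small relative tolerance is an assumption added here to guard against floating-point ties.

```r
# Smoking by lung cancer counts from Section 1 (tab_demo is a hypothetical name)
tab_demo = matrix(c(55, 20, 113, 12), nrow = 2)
m = sum(tab_demo[1, ])   # first row total
n = sum(tab_demo[2, ])   # second row total
k = sum(tab_demo[, 1])   # first column total
x_obs = tab_demo[1, 1]   # observed count in the top-left cell
# With the margins fixed, the top-left cell can only take these values
support = max(0, k - n):min(k, m)
prob = dhyper(support, m, n, k)
# Sum the probabilities of all tables no more likely than the observed one
p_obs = dhyper(x_obs, m, n, k)
p_exact = sum(prob[prob <= p_obs * (1 + 1e-7)])
p_exact
# For comparison, the P-value reported by fisher.test()
fisher.test(tab_demo)$p.value
```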

2.2 Analysis:

Perform Fisher’s exact test,

fisher.test(tab_lung)

#> 
#>  Fisher's Exact Test for Count Data
#> 
#> data:  tab_lung
#> p-value = 0.002414
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.1215695 0.6836086
#> sample estimates:
#> odds ratio 
#>  0.2940024
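
The odds ratio reported by fisher.test() is a conditional maximum likelihood estimate, so it differs slightly from the familiar cross-product ratio. Because the rows are ordered alphabetically, it compares the odds of cancer among non-smokers with the odds among smokers, and a value below 1 means smokers have the higher odds. The sketch below, using hypothetical names, computes the cross-product ratio and then swaps the rows to obtain the estimate for smokers versus non-smokers.

```r
# Same counts and layout as tab_lung (all *_demo names are hypothetical)
tab_demo = matrix(c(55, 20, 113, 12), nrow = 2,
                  dimnames = list(Smoking = c("no smoking", "smoking"),
                                  Cancer = c("cancer", "no cancer")))
# Sample (cross-product) odds ratio: no smoking versus smoking
or_demo = (tab_demo[1, 1] * tab_demo[2, 2]) / (tab_demo[1, 2] * tab_demo[2, 1])
or_demo
# Swap the rows to compare smoking against no smoking instead
est_flip = fisher.test(tab_demo[c("smoking", "no smoking"), ])$estimate
est_flip
```

Swapping the rows inverts the odds ratio and its confidence limits; the P-value is unchanged.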

2.3 Presentation

Guide: Reporting Statistical Results in Medical Journals

3 Two dependent samples - McNemar’s test

3.1 About the test:

Non-parametric test.
Association between TWO repeated categorical outcomes.
Cross-tabulation is limited to 2 x 2 only.
The concern is whether the subjects still have the same outcomes (concordant) or different outcomes (discordant) upon repetition (pre-post).
The association is determined by looking at the discordant cells.
\(\chi^2\) statistics.

3.2 Analysis:

The data.

             Second
First        approve disapprove
  approve        794        150
  disapprove      86        570

*Data from @agresti2003, Table 10.1, Rating of Performance of Prime Minister.

Now, we are going to enter the data in the form of counts directly. This is done as follows,

tab_pm = read.table(header = FALSE, text = "
794 150
86  570
")
tab_pm

#>    V1  V2
#> 1 794 150
#> 2  86 570

str(tab_pm)

#> 'data.frame':    2 obs. of  2 variables:
#>  $ V1: int  794 86
#>  $ V2: int  150 570

The result is a data frame. To convert it into a proper table, proceed in two steps,

tab_pm = as.matrix(tab_pm)  # first convert to a matrix
tab_pm = as.table(tab_pm)  # then convert to a table
str(tab_pm)

#>  'table' int [1:2, 1:2] 794 86 150 570
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:2] "A" "B"
#>   ..$ : chr [1:2] "V1" "V2"

str() now confirms that it is a proper table. However, the table still lacks proper headers, so we give its dimensions proper names,

dimnames(tab_pm) = list(First = c("approve", "disapprove"), Second = c("approve", "disapprove"))
str(tab_pm)

#>  'table' int [1:2, 1:2] 794 86 150 570
#>  - attr(*, "dimnames")=List of 2
#>   ..$ First : chr [1:2] "approve" "disapprove"
#>   ..$ Second: chr [1:2] "approve" "disapprove"
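
As an alternative to the steps above, the same table can be built in one call by passing the counts and the dimension names to matrix() and wrapping the result in as.table(). This is a sketch of one possible shortcut, with tab_demo as a hypothetical name. Note that matrix() fills column by column, so the counts are listed in a different order from the rows typed into read.table().

```r
tab_demo = as.table(matrix(
  c(794, 86, 150, 570), nrow = 2,   # filled column by column
  dimnames = list(First = c("approve", "disapprove"),
                  Second = c("approve", "disapprove"))
))
str(tab_demo)
```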

Now we view the table,

tab_pm

#>             Second
#> First        approve disapprove
#>   approve        794        150
#>   disapprove      86        570

addmargins(tab_pm)

#>             Second
#> First        approve disapprove  Sum
#>   approve        794        150  944
#>   disapprove      86        570  656
#>   Sum            880        720 1600

Perform McNemar’s test,

mcnemar.test(tab_pm)

#> 
#>  McNemar's Chi-squared test with continuity correction
#> 
#> data:  tab_pm
#> McNemar's chi-squared = 16.818, df = 1, p-value = 4.115e-05
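
Since the test depends only on the discordant cells, the statistic can be reproduced by hand from the two off-diagonal counts. The sketch below uses hypothetical names and includes the continuity correction (subtracting 1 from the absolute difference) so that it is comparable to the output above.

```r
# Same counts as tab_pm (all names here are hypothetical)
tab_demo = matrix(c(794, 86, 150, 570), nrow = 2)
n12 = tab_demo[1, 2]   # approve first, disapprove second
n21 = tab_demo[2, 1]   # disapprove first, approve second
# Continuity-corrected McNemar statistic, based on the discordant pairs only
x2_demo = (abs(n12 - n21) - 1)^2 / (n12 + n21)
p_demo = pchisq(x2_demo, df = 1, lower.tail = FALSE)
c(X2 = x2_demo, p = p_demo)
```

The concordant counts (794 and 570) do not enter the calculation at all.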

3.3 Presentation

Guide: Reporting Statistical Results in Medical Journals

4 Exercise

Use a dataset from the “Practice Datasets” folder.

Categorical data analysis using R

Dr. Wan Nor Arifin

Updated: 15 December 2025
