1 Two independent samples - Chi-squared test for association

1.1 About the test:

  • Non-parametric test.
  • Association between TWO categorical variables.
  • Cross-tabulation between the variables, usually 2 x 2, but each variable can have any number of levels.
  • The association between the variables is assessed by comparing the observed cell counts with the cell counts expected if the variables were not associated with each other.
  • Requirement: fewer than 25% of the cells have expected counts < 5.
  • Test statistic: \(\chi^2\).
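
As a sketch of how the observed and expected counts are compared, the usual Pearson formulation can be written as follows. The notation is an assumption introduced here, not taken from the text above: \(O_{ij}\) and \(E_{ij}\) are the observed and expected counts in row \(i\) and column \(j\), \(R_i\) and \(C_j\) are the row and column totals, and \(n\) is the grand total.

\[
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
df = (r - 1)(c - 1)
\]

Here \(r\) and \(c\) are the numbers of rows and columns, so a 2 x 2 table has \(df = 1\).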

1.2 Analysis:

  1. The data.
             Cancer
Smoking      lung cancer no lung cancer
  smoking             20             12
  no smoking          55            113

Now, load lung.csv,

lung = read.csv("lung.csv")
str(lung)
#> 'data.frame':    200 obs. of  2 variables:
#>  $ Smoking: chr  "smoking" "smoking" "smoking" "smoking" ...
#>  $ Cancer : chr  "cancer" "cancer" "cancer" "cancer" ...
head(lung)
#>   Smoking Cancer
#> 1 smoking cancer
#> 2 smoking cancer
#> 3 smoking cancer
#> 4 smoking cancer
#> 5 smoking cancer
#> 6 smoking cancer

Now, we create a cross-tabulation of the two categorical variables,

tab_lung = table(Smoking = lung$Smoking, Cancer = lung$Cancer)
str(tab_lung)
#>  'table' int [1:2, 1:2] 55 20 113 12
#>  - attr(*, "dimnames")=List of 2
#>   ..$ Smoking: chr [1:2] "no smoking" "smoking"
#>   ..$ Cancer : chr [1:2] "cancer" "no cancer"

and view the table,

tab_lung
#>             Cancer
#> Smoking      cancer no cancer
#>   no smoking     55       113
#>   smoking        20        12
addmargins(tab_lung)
#>             Cancer
#> Smoking      cancer no cancer Sum
#>   no smoking     55       113 168
#>   smoking        20        12  32
#>   Sum            75       125 200
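
table() orders character values alphabetically, which is why "no smoking" is listed before "smoking" here, while the data table in step 1 lists smokers first. A minimal sketch of controlling the order with explicit factor levels is given below. Because lung.csv is not reproduced here, lung_demo is a hypothetical stand-in rebuilt from the four cell counts, and tab_demo is likewise an illustrative name.

```r
# Hypothetical stand-in for lung.csv, rebuilt from the four cell counts
lung_demo = data.frame(
  Smoking = rep(c("smoking", "smoking", "no smoking", "no smoking"),
                times = c(20, 12, 55, 113)),
  Cancer = rep(c("cancer", "no cancer", "cancer", "no cancer"),
               times = c(20, 12, 55, 113))
)

# Explicit factor levels fix the row and column order of the table
lung_demo$Smoking = factor(lung_demo$Smoking, levels = c("smoking", "no smoking"))
lung_demo$Cancer = factor(lung_demo$Cancer, levels = c("cancer", "no cancer"))

tab_demo = table(Smoking = lung_demo$Smoking, Cancer = lung_demo$Cancer)
tab_demo
```

Reordering the rows or columns only changes the layout of the table; the chi-squared statistic is unaffected.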

  2. Perform the chi-squared test for association. There are two ways to do this. First, by using the table,

chisq.test(tab_lung)
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  tab_lung
#> X-squared = 8.9286, df = 1, p-value = 0.002807

or by using the variables directly,

chisq.test(lung$Smoking, lung$Cancer)
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  lung$Smoking and lung$Cancer
#> X-squared = 8.9286, df = 1, p-value = 0.002807

But remember, the chi-squared test only indicates whether an association exists. You must still review the cross-tabulation to get an idea of the direction and size of the association.
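
One way to review the table is to look at row proportions. The sketch below assumes the counts from the cross-tabulation above, entered directly so that it runs without lung.csv; tab_demo and row_prop are illustrative names.

```r
# Counts from the cross-tabulation above, entered directly
tab_demo = as.table(matrix(c(55, 20, 113, 12), nrow = 2,
  dimnames = list(Smoking = c("no smoking", "smoking"),
                  Cancer = c("cancer", "no cancer"))))

# Row proportions: share of each smoking group with and without cancer
row_prop = prop.table(tab_demo, margin = 1)
round(row_prop, 3)
```

With these counts, 20 of 32 smokers (62.5%) have cancer, compared with 55 of 168 non-smokers (about 32.7%).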

  3. Check the assumption: fewer than 25% of the cells have expected counts < 5.

The expected cell counts,

chisq.test(tab_lung)$expected
#>             Cancer
#> Smoking      cancer no cancer
#>   no smoking     63       105
#>   smoking        12        20

No expected count is < 5, thus we can rely on the chi-squared test.
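
To see where these numbers come from, the expected counts and the test statistic can be reproduced by hand. This is a sketch under the assumption that the counts are entered directly as a matrix; obs, expected, x2_plain, x2_yates and prop_small are illustrative names.

```r
# Observed counts, same layout as tab_lung
obs = matrix(c(55, 20, 113, 12), nrow = 2,
  dimnames = list(Smoking = c("no smoking", "smoking"),
                  Cancer = c("cancer", "no cancer")))

# Expected count = row total * column total / grand total
expected = outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic without and with Yates' continuity correction
x2_plain = sum((obs - expected)^2 / expected)
x2_yates = sum((abs(obs - expected) - 0.5)^2 / expected)

# Share of cells with expected count below 5 (requirement: under 25%)
prop_small = mean(expected < 5)

c(plain = x2_plain, yates = x2_yates, prop_small = prop_small)
```

The corrected value agrees with the X-squared reported by chisq.test() above. The uncorrected statistic can be requested with chisq.test(tab_lung, correct = FALSE).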

2 Two independent samples - Fisher’s exact test

2.1 About the test:

  • Alternative to the chi-squared test.
  • Usually used when cell counts are small, i.e. the chi-squared test requirement is not fulfilled.
  • Gives an exact P-value, computed from the hypergeometric distribution rather than from a large-sample \(\chi^2\) approximation.

2.2 Analysis:

  1. Perform Fisher’s exact test,

fisher.test(tab_lung)
#> 
#>  Fisher's Exact Test for Count Data
#> 
#> data:  tab_lung
#> p-value = 0.002414
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.1215695 0.6836086
#> sample estimates:
#> odds ratio 
#>  0.2940024
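
For a 2 x 2 table with fixed margins, the count in the first cell follows a hypergeometric distribution under the null hypothesis, and the two-sided P-value sums the probabilities of all tables that are no more likely than the observed one. The sketch below illustrates this idea with dhyper(); the counts are entered directly, the variable names are illustrative, and the small tolerance factor is an assumption made to guard against floating-point ties.

```r
# Same counts and layout as tab_lung
tab_demo = matrix(c(55, 20, 113, 12), nrow = 2,
  dimnames = list(Smoking = c("no smoking", "smoking"),
                  Cancer = c("cancer", "no cancer")))

m = sum(tab_demo[, 1])  # first column total
n = sum(tab_demo[, 2])  # second column total
k = sum(tab_demo[1, ])  # first row total

# All values the first cell can take with the margins held fixed
support = max(0, k - n):min(k, m)
probs = dhyper(support, m, n, k)

# Probability of the observed table
p_obs = dhyper(tab_demo[1, 1], m, n, k)

# Two-sided P-value: total probability of tables no more likely than observed
p_exact = sum(probs[probs <= p_obs * (1 + 1e-7)])
p_exact
```

Note that the odds ratio reported by fisher.test() is a conditional maximum likelihood estimate, so it can differ slightly from the simple cross-product ratio of the table.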

3 Two dependent samples - McNemar’s test

3.1 About the test:

  • Non-parametric test.
  • Association between TWO repeated categorical outcomes.
  • Cross-tabulation is limited to 2 x 2 only.
  • The concern is whether the subjects still have the same outcomes (concordant) or different outcomes (discordant) upon repetition (pre-post).
  • The association is determined by looking at the discordant cells.
  • Test statistic: \(\chi^2\).
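
As a sketch of how the discordant cells enter the test, the McNemar statistic is commonly written as below. The notation is an assumption introduced here: \(n_{12}\) and \(n_{21}\) are the two discordant cell counts (outcome changed between the first and second measurement).

\[
\chi^2 = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}, \qquad
\chi^2_{\text{corrected}} = \frac{(|n_{12} - n_{21}| - 1)^2}{n_{12} + n_{21}}, \qquad
df = 1
\]

The concordant cells do not appear in either formula.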

3.2 Analysis:

  1. The data.
             Second
First        approve disapprove
  approve        794        150
  disapprove      86        570

*Data from @agresti2003, Table 10.1, Rating of Performance of Prime Minister.

Now, we are going to enter the data in the form of counts directly. This is done as follows,

tab_pm = read.table(header = FALSE, text = "
794 150
86  570
")
tab_pm
#>    V1  V2
#> 1 794 150
#> 2  86 570
str(tab_pm)
#> 'data.frame':    2 obs. of  2 variables:
#>  $ V1: int  794 86
#>  $ V2: int  150 570

which is a data frame. To properly format the data into a table, do as follows in two steps,

tab_pm = as.matrix(tab_pm)  # first convert to a matrix
tab_pm = as.table(tab_pm)  # then convert to a table
str(tab_pm)
#>  'table' int [1:2, 1:2] 794 86 150 570
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:2] "A" "B"
#>   ..$ : chr [1:2] "V1" "V2"

str() now shows that it is a proper table. The table still needs proper headers, so we give its dimensions proper names,

dimnames(tab_pm) = list(First = c("approve", "disapprove"), Second = c("approve", "disapprove"))
str(tab_pm)
#>  'table' int [1:2, 1:2] 794 86 150 570
#>  - attr(*, "dimnames")=List of 2
#>   ..$ First : chr [1:2] "approve" "disapprove"
#>   ..$ Second: chr [1:2] "approve" "disapprove"
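
The same table can also be built in a single step. This is an alternative sketch, not the approach used above; tab_pm2 is an illustrative name, and byrow = TRUE is used so that the counts can be typed row by row as in the data table.

```r
# Build the 2 x 2 table directly, typing the counts row by row
tab_pm2 = as.table(matrix(c(794, 150, 86, 570), nrow = 2, byrow = TRUE,
  dimnames = list(First = c("approve", "disapprove"),
                  Second = c("approve", "disapprove"))))
tab_pm2
```

Because the counts are typed as plain numbers, they are stored as doubles rather than integers, which makes no difference to the test.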

Now we view the table,

tab_pm
#>             Second
#> First        approve disapprove
#>   approve        794        150
#>   disapprove      86        570
addmargins(tab_pm)
#>             Second
#> First        approve disapprove  Sum
#>   approve        794        150  944
#>   disapprove      86        570  656
#>   Sum            880        720 1600

  2. Perform McNemar’s test,

mcnemar.test(tab_pm)
#> 
#>  McNemar's Chi-squared test with continuity correction
#> 
#> data:  tab_pm
#> McNemar's chi-squared = 16.818, df = 1, p-value = 4.115e-05
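
The reported statistic depends only on the two discordant cells, which can be checked by hand. The sketch below assumes the counts are entered directly; tab_demo, n12, n21, stat and p_value are illustrative names.

```r
# Same counts as tab_pm
tab_demo = matrix(c(794, 150, 86, 570), nrow = 2, byrow = TRUE,
  dimnames = list(First = c("approve", "disapprove"),
                  Second = c("approve", "disapprove")))

# Discordant cells: subjects whose rating changed between the two surveys
n12 = tab_demo["approve", "disapprove"]
n21 = tab_demo["disapprove", "approve"]

# McNemar statistic with continuity correction, 1 degree of freedom
stat = (abs(n12 - n21) - 1)^2 / (n12 + n21)
p_value = pchisq(stat, df = 1, lower.tail = FALSE)

c(statistic = stat, p_value = p_value)
```

The concordant counts (794 and 570) play no part in the calculation.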

4 Exercise

  • Use a dataset from the “Practice Datasets” folder.