                 Cancer
Smoking          lung cancer   no lung cancer
smoking               20             12
no smoking            55            113
Now, load lung.csv,
lung = read.csv("lung.csv")
str(lung)
#> 'data.frame': 200 obs. of 2 variables:
#> $ Smoking: chr "smoking" "smoking" "smoking" "smoking" ...
#> $ Cancer : chr "cancer" "cancer" "cancer" "cancer" ...
head(lung)
#> Smoking Cancer
#> 1 smoking cancer
#> 2 smoking cancer
#> 3 smoking cancer
#> 4 smoking cancer
#> 5 smoking cancer
#> 6 smoking cancer
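If lung.csv is not at hand, a data frame with the same cell counts can be rebuilt from the table above. This is only a sketch: the helper name lung_counts and the row order are assumptions made for illustration, and only the four counts come from the table.

```r
# Cell counts taken from the smoking-by-cancer table.
# Hypothetical reconstruction: the row order of the real lung.csv may differ.
lung_counts = data.frame(
  Smoking = c("smoking", "smoking", "no smoking", "no smoking"),
  Cancer  = c("cancer", "no cancer", "cancer", "no cancer"),
  n       = c(20, 12, 55, 113)
)

# Repeat each row n times so that there is one row per subject.
lung = lung_counts[rep(seq_len(nrow(lung_counts)), times = lung_counts$n),
                   c("Smoking", "Cancer")]
rownames(lung) = NULL

str(lung)
table(Smoking = lung$Smoking, Cancer = lung$Cancer)
```

The cross-tabulation of this rebuilt data frame should reproduce the counts 20, 12, 55 and 113.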
Now, we create a cross-tabulation of the two categorical variables,
tab_lung = table(Smoking = lung$Smoking, Cancer = lung$Cancer)
str(tab_lung)
#> 'table' int [1:2, 1:2] 55 20 113 12
#> - attr(*, "dimnames")=List of 2
#> ..$ Smoking: chr [1:2] "no smoking" "smoking"
#> ..$ Cancer : chr [1:2] "cancer" "no cancer"
and view the table,
tab_lung
#>             Cancer
#> Smoking      cancer no cancer
#>   no smoking     55       113
#>   smoking        20        12
addmargins(tab_lung)
#>             Cancer
#> Smoking      cancer no cancer Sum
#>   no smoking     55       113 168
#>   smoking        20        12  32
#>   Sum            75       125 200
Next, we perform the chi-squared test of independence on the table,
chisq.test(tab_lung)
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: tab_lung
#> X-squared = 8.9286, df = 1, p-value = 0.002807
or by using the variables directly,
chisq.test(lung$Smoking, lung$Cancer)
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: lung$Smoking and lung$Cancer
#> X-squared = 8.9286, df = 1, p-value = 0.002807
But remember, with the chi-squared test, you must still review the table to get an idea of the association.
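One way to review the table is to look at row proportions. The sketch below rebuilds the table from the counts shown earlier so that it runs on its own; the name row_prop is chosen here for illustration.

```r
# Rebuild the 2x2 table from the counts shown above.
tab_lung = as.table(matrix(
  c(55, 113,
    20,  12),
  nrow = 2, byrow = TRUE,
  dimnames = list(Smoking = c("no smoking", "smoking"),
                  Cancer  = c("cancer", "no cancer"))
))

# Row proportions: share of cancer / no cancer within each smoking group.
row_prop = prop.table(tab_lung, margin = 1)
round(row_prop, 3)
```

In these data, 20 of 32 smokers (62.5%) have cancer, against 55 of 168 non-smokers (about 32.7%).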
The expected cell counts,
chisq.test(tab_lung)$expected
#>             Cancer
#> Smoking      cancer no cancer
#>   no smoking     63       105
#>   smoking        12        20
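These expected counts follow from the margins alone: each one is row total times column total divided by the grand total. As a rough check, the sketch below recomputes them and the Yates-corrected statistic by hand. The names n, expected, x2 and p are illustrative, and the table is rebuilt from the counts so the block is self-contained.

```r
# Rebuild the 2x2 table from the counts shown above.
tab_lung = as.table(matrix(
  c(55, 113,
    20,  12),
  nrow = 2, byrow = TRUE,
  dimnames = list(Smoking = c("no smoking", "smoking"),
                  Cancer  = c("cancer", "no cancer"))
))

# Expected count = (row total * column total) / grand total.
n = sum(tab_lung)
expected = outer(rowSums(tab_lung), colSums(tab_lung)) / n
expected

# Yates' continuity correction subtracts 0.5 from each |observed - expected|.
# (Every difference here is larger than 0.5, so the plain subtraction is enough.)
x2 = sum((abs(tab_lung - expected) - 0.5)^2 / expected)

# Upper-tail probability of a chi-squared distribution with 1 df.
p = pchisq(x2, df = 1, lower.tail = FALSE)
c(X_squared = x2, p_value = p)
```

The result should agree with the X-squared and p-value reported by chisq.test(tab_lung).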
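No expected count is below 5, so we can rely on the chi-squared test.
If any expected count were below 5, Fisher's exact test would be the usual alternative; here it is shown for comparison,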
fisher.test(tab_lung)
#>
#> Fisher's Exact Test for Count Data
#>
#> data: tab_lung
#> p-value = 0.002414
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.1215695 0.6836086
#> sample estimates:
#> odds ratio
#> 0.2940024
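The odds ratio reported by fisher.test() is the conditional maximum likelihood estimate, so it differs slightly from the sample odds ratio computed directly from the cell counts. A short sketch of the comparison follows; the names or_sample and or_fisher are illustrative.

```r
# Rebuild the 2x2 table from the counts shown above.
tab_lung = as.table(matrix(
  c(55, 113,
    20,  12),
  nrow = 2, byrow = TRUE,
  dimnames = list(Smoking = c("no smoking", "smoking"),
                  Cancer  = c("cancer", "no cancer"))
))

# Sample (cross-product) odds ratio: (55 * 12) / (113 * 20).
or_sample = (tab_lung[1, 1] * tab_lung[2, 2]) / (tab_lung[1, 2] * tab_lung[2, 1])

# Conditional MLE as returned by fisher.test().
or_fisher = unname(fisher.test(tab_lung)$estimate)

c(sample = or_sample, fisher = or_fisher)
```

Because "no smoking" is the first row, an odds ratio below 1 means the odds of cancer are lower among non-smokers. Taking the reciprocal of the sample odds ratio (about 3.4) expresses the same association as smokers versus non-smokers.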
                 Second
First            approve   disapprove
approve            794        150
disapprove          86        570
Data from @agresti2003, Table 10.1: Rating of Performance of Prime Minister.
Now, we enter the data directly as counts. This is done as follows,
tab_pm = read.table(header = FALSE, text = "
794 150
86 570
")
tab_pm
#> V1 V2
#> 1 794 150
#> 2 86 570
str(tab_pm)
#> 'data.frame': 2 obs. of 2 variables:
#> $ V1: int 794 86
#> $ V2: int 150 570
The result is a data frame. To turn it into a proper table, convert it in two steps,
tab_pm = as.matrix(tab_pm) # first convert to a matrix
tab_pm = as.table(tab_pm) # then convert to a table
str(tab_pm)
#> 'table' int [1:2, 1:2] 794 86 150 570
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "A" "B"
#> ..$ : chr [1:2] "V1" "V2"
str() now shows a proper table, but its dimension names are not informative. We give the rows and columns proper names,
dimnames(tab_pm) = list(First = c("approve", "disapprove"), Second = c("approve", "disapprove"))
str(tab_pm)
#> 'table' int [1:2, 1:2] 794 86 150 570
#> - attr(*, "dimnames")=List of 2
#> ..$ First : chr [1:2] "approve" "disapprove"
#> ..$ Second: chr [1:2] "approve" "disapprove"
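As an alternative to the steps above, the same table can be built in one call with matrix() and its dimnames argument. This is a sketch of an equivalent route, not the method used in the text. Note that the counts are typed as plain numbers here, so str() is expected to report num rather than int.

```r
# Build the 2x2 table of counts in one step.
# byrow = TRUE lets the counts be typed in the same layout as the printed table.
tab_pm = as.table(matrix(
  c(794, 150,
     86, 570),
  nrow = 2, byrow = TRUE,
  dimnames = list(First  = c("approve", "disapprove"),
                  Second = c("approve", "disapprove"))
))

str(tab_pm)
tab_pm
```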
Now we view the table,
tab_pm
#>             Second
#> First        approve disapprove
#>   approve        794        150
#>   disapprove      86        570
addmargins(tab_pm)
#>             Second
#> First        approve disapprove  Sum
#>   approve        794        150  944
#>   disapprove      86        570  656
#>   Sum            880        720 1600
The two ratings are paired, so we compare them with McNemar's test,
mcnemar.test(tab_pm)
#>
#> McNemar's Chi-squared test with continuity correction
#>
#> data: tab_pm
#> McNemar's chi-squared = 16.818, df = 1, p-value = 4.115e-05
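McNemar's statistic depends only on the two discordant cells (approve then disapprove, and disapprove then approve). The sketch below recomputes the continuity-corrected statistic by hand; the names n12, n21, stat and p are illustrative, and the table is rebuilt from the counts so the block runs on its own.

```r
# Rebuild the paired 2x2 table from the counts shown above.
tab_pm = as.table(matrix(
  c(794, 150,
     86, 570),
  nrow = 2, byrow = TRUE,
  dimnames = list(First  = c("approve", "disapprove"),
                  Second = c("approve", "disapprove"))
))

# Discordant cells: respondents who changed their rating between surveys.
n12 = tab_pm["approve", "disapprove"]   # approve first, disapprove second
n21 = tab_pm["disapprove", "approve"]   # disapprove first, approve second

# Continuity-corrected McNemar statistic: (|n12 - n21| - 1)^2 / (n12 + n21).
stat = (abs(n12 - n21) - 1)^2 / (n12 + n21)

# Upper-tail probability of a chi-squared distribution with 1 df.
p = pchisq(stat, df = 1, lower.tail = FALSE)
c(statistic = stat, p_value = p)
```

With 150 and 86 discordant pairs this gives 63^2 / 236, which should match the value reported by mcnemar.test(tab_pm).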