How do you determine if there is a significant relationship between two categorical variables in R?

Two Categorical Variables

Checking whether two categorical variables are independent can be done with the Chi-Squared test of independence.

This is a typical Chi-Squared test: if we assume that the two variables are independent, then the values in the contingency table should follow the distribution implied by the marginal totals - each expected cell count is the product of its row total and column total divided by the grand total. We then check how far the actual values are from these expected values.
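As a concrete sketch of the expected-count computation (in Python for illustration, though the post itself uses R; the counts are the ones from the example below):

```python
# Minimal sketch of the expected counts behind the chi-squared test of
# independence. Under independence:
#   expected[i][j] = row_total[i] * col_total[j] / grand_total
observed = [[55, 45],   # city B: male, female
            [20, 30]]   # city T: male, female

row_totals = [sum(row) for row in observed]        # [100, 50]
col_totals = [sum(col) for col in zip(*observed)]  # [75, 75]
n = sum(row_totals)                                # 150

expected = [[r * c / n for c in col_totals] for r in row_totals]

# The chi-squared statistic measures how far observed is from expected
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(expected, chi2)  # [[50.0, 50.0], [25.0, 25.0]] 3.0
```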

There is also Cramér's V, a measure of association that follows from this test.

Example

Suppose we have two variables

  • gender: male and female
  • city: Blois and Tours

We observed the following data:

    Gender
City  M  F
   B 55 45
   T 20 30

Are gender and city independent? Let's perform a Chi-Squared test. Null hypothesis: they are independent; alternative hypothesis: they are associated in some way.

Under the null hypothesis, the expected counts are determined by the marginal totals (row total times column total, divided by the grand total). So our expected values are the following:

    Gender
City  M  F
   B 50 50
   T 25 25

So we run the chi-squared test and the resulting p-value here can be seen as a measure of correlation between these two variables.

To compute Cramér's V we first find the normalizing factor chi-squared-max - for a 2x2 table this is simply the sample size - divide the chi-squared statistic by it, and take the square root.
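In formula form (my notation: r and c are the numbers of rows and columns, n the sample size):

```latex
% Cramér's V for an r-by-c contingency table with n observations
V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,\; c-1)}}
% For a 2x2 table, min(r-1, c-1) = 1, so the normalizing factor is just n.
```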

R

tbl = matrix(data=c(55, 45, 20, 30), nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)

Here the p-value is 0.08 - quite small, but still not enough to reject the hypothesis of independence at the usual 0.05 level. So we can say that the "correlation" here is 0.08.

We also compute V:

sqrt(chi2$statistic / sum(tbl))

And get 0.14 (the smaller V, the weaker the association).
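The same numbers can be reproduced outside R; here is a cross-check sketch in Python using scipy (an assumption for illustration - the original analysis is in R):

```python
import math
from scipy.stats import chi2_contingency

tbl = [[55, 45],
       [20, 30]]

# correction=False disables the Yates continuity correction,
# matching correct=F in the R chisq.test call
chi2, p, dof, expected = chi2_contingency(tbl, correction=False)

# Cramér's V for a 2x2 table: sqrt(chi2 / n)
n = sum(map(sum, tbl))
v = math.sqrt(chi2 / n)

print(round(chi2, 2), round(p, 2), round(v, 2))  # 3.0 0.08 0.14
```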

Consider another dataset

    Gender
City  M  F
   B 51 49
   T 24 26

For this, it would give the following

tbl = matrix(data=c(51, 49, 24, 26), nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)

sqrt(chi2$statistic / sum(tbl))

The p-value is 0.72, which is far closer to 1, and V is 0.03 - very close to 0.
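This second table can be checked the same way (again a Python/scipy sketch mirroring the R code above; the text rounds the p-value to 0.72):

```python
import math
from scipy.stats import chi2_contingency

tbl = [[51, 49],
       [24, 26]]

# correction=False matches correct=F in the R call
chi2, p, dof, expected = chi2_contingency(tbl, correction=False)
v = math.sqrt(chi2 / sum(map(sum, tbl)))

print(round(chi2, 2), round(p, 3), round(v, 2))  # 0.12 0.729 0.03
```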

Categorical vs Numerical Variables

For this type we typically perform a one-way ANOVA test: we calculate the between-group variance and the within-group variance and then compare them.

Example

We want to study the relationship between the amount of fat absorbed by donuts and the type of fat used to produce them (the example is taken from here).

Is there any dependence between the variables? We conduct the ANOVA test and see that the p-value is just 0.007 - small enough to reject the null hypothesis, so there is a relationship between these variables.

R

t1 = c(164, 172, 168, 177, 156, 195)
t2 = c(178, 191, 197, 182, 185, 177)
t3 = c(175, 193, 178, 171, 163, 176)
t4 = c(155, 166, 149, 164, 170, 168)

val = c(t1, t2, t3, t4)
fac = gl(n=4, k=6, labels=c('type1', 'type2', 'type3', 'type4'))

aov1 = aov(val ~ fac)
summary(aov1)

Output is

            Df Sum Sq Mean Sq F value  Pr(>F)   
fac          3   1636   545.5   5.406 0.00688 **
Residuals   20   2018   100.9                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So we can take the p-value as a measure of association here as well.
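As a cross-check (a Python/scipy sketch, not part of the original R analysis), scipy.stats.f_oneway reproduces the same F statistic and p-value:

```python
from scipy.stats import f_oneway

t1 = [164, 172, 168, 177, 156, 195]
t2 = [178, 191, 197, 182, 185, 177]
t3 = [175, 193, 178, 171, 163, 176]
t4 = [155, 166, 149, 164, 170, 168]

# One-way ANOVA: compares between-group to within-group variance
f, p = f_oneway(t1, t2, t3, t4)
print(round(f, 3), round(p, 5))  # should match the R summary (F about 5.406)
```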

References

  • https://en.wikipedia.org/wiki/Chi-square_test
  • https://mlwiki.org/index.php/Chi-square_Test_of_Independence
  • https://courses.statistics.com/software/R/R1way.htm
  • https://mlwiki.org/index.php/One-Way_ANOVA_F-Test
  • https://mlwiki.org/index.php/Cramer%27s_Coefficient

