How do you find the correlation between independent and dependent variables in Python?

Question

Let's say we have 10 independent variable x1,x2,x3,...xn which all are categorical with same levels 0,1,2 (eg., 0 = no color , 1 = Red, 2 = Green) and you have two dependent(response) variables (eg., y1 = pant length in m and y2 = waist size in m). How do we determine which independent variables (x1,x2,x3,...xn) drives the dependent variables (y1 and y2)?

Nội dung chính Show

Introduction
Relationship between Numerical Variables
Correlation Matrix
Correlation Plot
Correlation Test
Relationship Between Categorical Variables
Frequency Table
Chi-square Test of Independence
How do you find the correlation between variables in Python?
Can dependent and independent variables be correlated?
How do you find the correlation between two variables in Pandas?

Example of the data is as follows:

| x1 | x2 | x3 | x4 | x5 | x6 | x7  | x8 | x9 | x10 | size(y1) | length(y2) |

|----|----|----|----|----|----|-----|----|----|-----|----------|------------|

|  0 |  1 |  2 |  1 |  0 |  0 |   2 |  1 |  0 |   2 |     0.36 |       0.84 |
|  0 |  1 |  1 |  0 |  2 |  1 |   0 |  2 |  0 |   1 |     0.84 |       1.23 |
|  1 |  2 |  0 |  1 |  0 |  1 |   0 |  1 |  0 |   2 |     1.92 |       3.86 |

I tried PLS regression in python and here is my code

import pandas as pd
import numpy as np
df = pd.read_csv('data.csv', header = 0)

X =  pd.DataFrame.as_matrix(df[[x for x in df.columns if x not in ['waist_size', 'pant_length']]])
Y =  pd.DataFrame.as_matrix(df[[''waist_size', 'pant_length'']])

from sklearn.cross_decomposition import PLSRegression
pls = PLSRegression(n_components = 8)
pls.fit(X,Y)
coef = pls.coef_
sorted_index = np.argsort(np.abs(pls.coef_))

Actual result from this approach is as follows: I am getting a numpy array for all the rows in the dataset and is as follows

[1, 0],
[1, 0],
[1, 0],
[1, 0],
[1, 0],
[0, 1],
[1, 0]
.....

How to interpret this?

And, is there is a way to calculate direct correlations and feature importance in this kind of problems?

Introduction

Building high-performing machine learning algorithms depends on identifying relationships between variables. This helps in feature engineering as well as deciding on the machine learning algorithm. In this guide, you will learn techniques for finding relationships in data with Python.

Data

In this guide, we will use a fictitious dataset of loan applicants containing 200 observations and ten variables, as described below:

Marital_status: Whether the applicant is married ("Yes") or not ("No")

`Is_graduate’: Whether the applicant is a graduate ("Yes") or not ("No")

Income: Annual income of the applicant (in USD)

Loan_amount: Loan amount (in USD) for which the application was submitted

Credit_score: Whether the applicant's credit score was good ("Good") or not ("Bad").

Approval_status: Whether the loan application was approved ("Yes") or not ("No").

Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant

Gender: Whether the applicant is "Female" or "Male"

Age: The applicant’s age in years

Work_exp: The applicant's work experience in years

Let’s start by loading the required libraries and the data.

1import pandas as pd
2import numpy as np
3import matplotlib.pyplot as plt
4import seaborn as sns
5%matplotlib inline
6import warnings
7warnings.filterwarnings('ignore')
8import statsmodels.api as sm
9
10# Load data
11dat = pd.read_csv("data_test.csv")
12print(dat.shape) 
13dat.head(5)

python

1  (200, 10)
2
3
4|   	| Marital_status 	| Is_graduate 	| Income 	| Loan_amount 	| Credit_score 	| approval_status 	| Investment 	| gender 	| age 	| work_exp 	|
5|---	|----------------	|-------------	|--------	|-------------	|--------------	|-----------------	|------------	|--------	|-----	|----------	|
6| 0 	| Yes            	| No          	| 72000  	| 70500       	| Bad          	| Yes             	| 117340     	| Female 	| 34  	| 8.10     	|
7| 1 	| Yes            	| No          	| 64000  	| 70000       	| Bad          	| Yes             	| 85340      	| Female 	| 34  	| 7.20     	|
8| 2 	| Yes            	| No          	| 80000  	| 275000      	| Bad          	| Yes             	| 147100     	| Female 	| 33  	| 9.00     	|
9| 3 	| Yes            	| No          	| 76000  	| 100500      	| Bad          	| Yes             	| 65440      	| Female 	| 34  	| 8.55     	|
10| 4 	| Yes            	| No          	| 72000  	| 51500       	| Bad          	| Yes             	| 48000      	| Female 	| 33  	| 8.10     	|

Relationship between Numerical Variables

Many machine learning algorithms require that the continuous variables not be correlated with each other, a phenomenon called multicollinearity. Establishing relationships between the numerical variables is a common step to detect and treat multicollinearity.

Correlation Matrix

Creating a correlation matrix is a technique to identify multicollinearity among numerical variables. In Python, this can be created using the corr() function, as in the line of code below.

1|             	| Income    	| Loan_amount 	| Investment 	| age       	| work_exp  	|
2|-------------	|-----------	|-------------	|------------	|-----------	|-----------	|
3| Income      	| 1.000000  	| 0.020236    	| 0.061687   	| -0.200591 	| 0.878455  	|
4| Loan_amount 	| 0.020236  	| 1.000000    	| 0.780407   	| -0.033409 	| 0.031837  	|
5| Investment  	| 0.061687  	| 0.780407    	| 1.000000   	| -0.022761 	| 0.076532  	|
6| age         	| -0.200591 	| -0.033409   	| -0.022761  	| 1.000000  	| -0.133685 	|
7| work_exp    	| 0.878455  	| 0.031837    	| 0.076532   	| -0.133685 	| 1.000000  	|

The output above shows presence of strong linear correlation between the variables Income and Work_exp and between Investment and Loan_amount.

Correlation Plot

The correlation can also be visualized using a correlation plot, which is implemented using the pairplot function in the seaborn package.

The first line of code below creates a new dataset, df, that contains only the numeric variables. The second line creates the plot, where the argument kind="scatter" creates the plot without the regression line. The third line plots the chart.

1df = dat[['Income','Loan_amount','Investment','age','work_exp']]
2
3sns.pairplot(df, kind="scatter")
4plt.show()

python

Correlation Test

A correlation test is another method to determine the presence and extent of a linear relationship between two quantitative variables. In our case, we would like to statistically test whether there is a correlation between the applicant’s investment and their work experience. The first step is to visualize the relationship with a scatter plot, which is done using the line of code below.

1plt.scatter(dat['work_exp'], dat['Investment'])
2plt.show()

python

The above plot suggests the absence of a linear relationship between the two variables. We can quantify this inference by calculating the correlation coefficient using the line of code below.

1np.corrcoef(dat['work_exp'], dat['Investment'])

python

1    array([[1.        , 0.07653245],
2           [0.07653245, 1.        ]])

The value of 0.07 shows a positive but weak linear relationship between the two variables. Let’s confirm this with the linear regression correlation test, which is done in Python with the linregress() function in the scipy.stats module.

1from scipy.stats import linregress
2linregress(dat['work_exp'], dat['Investment'])

python

1 LinregressResult(slope=15309.333089382928, intercept=57191.00212603336, rvalue=0.0765324479448039, pvalue=0.28142275240186065,    stderr=14174.32722882554)

Since the p-value of 0.2814 is greater than 0.05, we fail to reject the null hypothesis that the relationship between the applicant’s investment and their work experience is not significant.

Let us consider another example of correlation between Income and Work_exp using the line of code below.

1linregress(dat['work_exp'], dat['Income'])

python

1  LinregressResult(slope=6998.2868438531395, intercept=11322.214342089712, rvalue=0.8784545623577412, pvalue=2.0141691110555243e-65, stderr=270.52631667365495)

In this case, the p-value is smaller than 0.05, so we reject the null hypothesis that the relationship between the applicants’ income and their work experience is not significant.

Relationship Between Categorical Variables

In the previous sections, we covered techniques of finding relationships between numerical variables. It is equally important to understand and estimate the relationship between categorical variables.

Frequency Table

A frequency table is a simple but effective way of finding distribution between two categorical variables. The crosstab() function can be used to create the two-way table between two variables. In the line of code below, we create a two-way table between the variables marital_status and loan_approval.

1pd.crosstab(dat.Marital_status, dat.approval_status)

python

1| approval_status 	| No 	| Yes 	|
2|-----------------	|----	|-----	|
3| Marital_status  	|    	|     	|
4| Divorced        	| 31 	| 29  	|
5| No              	| 66 	| 10  	|
6| Yes             	| 52 	| 12  	|

The output above shows that divorced applicants have a higher probability of getting loan approvals (at 56.8 percent) compared to married applicants (at 19.6 percent). To test whether this insight is statistically significant or not, we use the chi-square test of independence.

Chi-square Test of Independence

The chi-square test of independence is used to determine whether there is an association between two or more categorical variables. In our case, we would like to test whether the marital status of the applicants has any association with their approval status.

This can be easily done in Python using the chi2_contingency() function from the scipy.stats module. The lines of code below perform the test.

1from scipy.stats import chi2_contingency
2chi2_contingency(pd.crosstab(dat.Marital_status, dat.approval_status))

python

1    (24.09504482353403, 5.859053936061414e-06, 2, array([[44.7 , 15.3 ],
2            [56.62, 19.38],
3            [47.68, 16.32]]))

The second value of the above output — 5.859053936061414e-06 -— represents the p-value of the test. As evident, the p-value is less than 0.05, hence we reject the null hypothesis that the marital status of the applicants is not associated with the approval status. In order to understand the parameters and the result of the chi2_contingency function, you can use the help(chi2_contingency) command, which will give brief documentation of this function.

Conclusion

In this guide, you have learned techniques of finding relationships in data for both numerical and categorical variables. You also learned about how to interpret the results of the tests by statistically validating the relationship between the variables.

To learn more about data science using Python, please refer to the following guides:

How do you find the correlation between variables in Python?

To calculate the correlation between two variables in Python, we can use the Numpy corrcoef() function.

Can dependent and independent variables be correlated?

Correlation can be used to quantify the linear dependency of two variables. It cannot capture non-linear relationship between variables. Independent variables has NIL correlation, r=0. If r=0, indicates NIL correlation but not a non dependency (Independency), they can be dependent.

How do you find the correlation between two variables in Pandas?

By using corr() function we can get the correlation between two columns in the dataframe.

Correlation coefficient pandas Strong positive correlation Numpy correlation Stats pearsonr Coefficient of correlation Correlation matrix Python

How do you find the correlation between independent and dependent variables in Python?

Introduction

Data

Relationship between Numerical Variables

Correlation Matrix

Correlation Plot

Correlation Test

Relationship Between Categorical Variables

Frequency Table

Chi-square Test of Independence

Conclusion

How do you find the correlation between variables in Python?

Can dependent and independent variables be correlated?

How do you find the correlation between two variables in Pandas?

Bài Viết Liên Quan

Quảng Cáo

Có thể bạn quan tâm

Toplist được quan tâm

Quảng cáo

Xem Nhiều

Quảng cáo

Chúng tôi

Điều khoản

Trợ giúp

Mạng xã hội