
# What is Pearson's r Correlation Coefficient in Machine Learning? - Data Dependency

#### Data Dependency

###### Pearson's r Correlation Coefficient

Import Libraries

import numpy as np
import csv
import matplotlib.pyplot as plt
import scipy.stats
import pandas as pd

%matplotlib inline

!pip install wget

import wget



data = pd.read_csv('Auto.csv')

Preview Records

data.head()

Describe Dataset

data.describe()

Selecting Two Features

miles = data['miles']
weights = data['Weight']
print(miles[:10])
print(weights[:10])

pearson_r = np.cov(miles, weights)[0, 1] / (miles.std() * weights.std())
print(pearson_r)
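Since `Auto.csv` may not be at hand, the covariance-over-standard-deviations formula can be sanity-checked on synthetic data (a minimal sketch; the arrays and seed below are made up for illustration). Note that `np.cov` defaults to the sample (ddof=1) convention, and pandas' `.std()` does too, which is why the ratio above matches `scipy.stats.pearsonr`:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # roughly linear relation plus noise

# manual Pearson's r: sample covariance over the product of sample std devs
# (np.cov defaults to ddof=1, so std must also use ddof=1 to match)
r_manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

r_scipy, _ = scipy.stats.pearsonr(x, y)
print(r_manual, r_scipy)  # the two values agree
```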

Finding the correlation coefficient for each pair of features

np.corrcoef(miles, weights)
horse = data['Horse power']
np.corrcoef(weights, horse)
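`np.corrcoef` returns a full correlation matrix rather than a single number: the diagonal is always 1 (each variable correlated with itself) and the off-diagonal entry is Pearson's r. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = -a + 0.1 * rng.normal(size=100)  # nearly exact inverse relation

m = np.corrcoef(a, b)
# m is 2x2 and symmetric: m[0, 0] == m[1, 1] == 1.0,
# and m[0, 1] == m[1, 0] is Pearson's r between a and b
print(m)
```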

Plot

# plotting
fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
ax.scatter(weights, miles, alpha=0.6, edgecolor='none', s=100)
ax.set_xlabel('Car Weight (tons)')
ax.set_ylabel('Miles Per Gallon')

line_coef = np.polyfit(weights, miles, 1)
xx = np.arange(1, 5, 0.1)
yy = line_coef[0]*xx + line_coef[1]

ax.plot(xx, yy, 'r', lw=2)


Output:
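The fitted line above relies on `np.polyfit(..., 1)` returning the coefficients with the highest degree first, i.e. `[slope, intercept]`, which is why `line_coef[0]` multiplies `xx`. A tiny self-contained check (the data here are made up):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + 1.0  # exact line: slope 3, intercept 1

coef = np.polyfit(x, y, 1)  # degree-1 fit returns [slope, intercept]
print(coef)
```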

Practice Exercise

1. Find the Pearson's-r coefficient for two linearly dependent variables. Add some noise and see the effect of varying the noise.

2. Simulate and visualize some data with positive linear correlation.

3. Simulate and visualize some data with negative linear correlation.

xx = np.arange(-5, 5, 0.1)
pp = 1.5  # level of noise
yy = xx + np.random.normal(0, pp, size=len(xx))

# visualize the data
fig, ax = plt.subplots()
ax.scatter(xx, yy, c='r', edgecolor='none')
ax.set_xlabel('X data')
ax.set_ylabel('Y data')

line_coef = np.polyfit(xx, yy, 1)
line_xx = np.arange(-5, 5, 0.1)
line_yy = line_coef[0]*line_xx + line_coef[1]

ax.plot(line_xx, line_yy, 'b', lw=2)

print(scipy.stats.pearsonr(xx, yy))

Output
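Exercise 3 (negative linear correlation) can be solved the same way by flipping the sign of the slope; this sketch reuses the noise level from the code above:

```python
import numpy as np
import scipy.stats

xx = np.arange(-5, 5, 0.1)
pp = 1.5  # same noise level as the positive-correlation example
yy = -xx + np.random.normal(0, pp, size=len(xx))  # negative slope

r, p_value = scipy.stats.pearsonr(xx, yy)
print(r)  # strongly negative
```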

Pearson's r is limited to measuring the linear correlation between two variables; it cannot capture non-linear dependency. Investigate Pearson's r between two variables that are correlated non-linearly.

# generate some data, first for X
xx = np.arange(-5, 5, 0.1)

# assume Y = X^2 + some perturbation
pp = 1.1  # level of noise
yy = xx**2 + np.random.normal(0, pp, size=len(xx))

# visualize the data
fig, ax = plt.subplots()
ax.scatter(xx, yy, c='r', edgecolor='b')
ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_title(r'$Y = X^2+\epsilon$', size=16)

Output:


The Pearson's r correlation is near zero, which means there is no linear correlation. But what about the non-linear correlation? After all, y = x^2 is an exact (non-linear) dependency.

np.corrcoef(xx, yy)

Output:

array([[ 1.        , -0.04489687],
[-0.04489687,  1.        ]])