Correlation

Correlation coefficient

Correlation is

"the degree of association between two variables"
"a measure of the strength of association among different variables"

Correlation does NOT

"imply causation"

A correlation between two variables does not imply that one causes the other. For example, sales of personal computers and athletic shoes have both risen strongly in the last several years, and there is a high correlation between them, but you cannot assume that buying computers causes people to buy athletic shoes (or vice versa).

A correlation coefficient shows the degree of linear dependence between x and y. In other words, the coefficient shows how closely the points (x, y) lie along a straight line.

If the coefficient is equal to 1 or -1, all the points lie exactly on a line. If the correlation coefficient is equal to zero, there is no linear relation between x and y. However, this does not necessarily mean that there is no relation at all between the two variables. There could, for example, be a non-linear relation.
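The difference between "no linear relation" and "no relation" can be illustrated with a small sketch. It assumes NumPy is available and uses made-up data: x is chosen symmetric around zero so that the quadratic relation y = x^2 has a correlation coefficient of essentially zero, even though y is completely determined by x.

```python
import numpy as np

# Hypothetical data: 101 evenly spaced x values, symmetric around zero.
x = np.linspace(-5.0, 5.0, 101)

y_linear = 2.0 * x + 1.0   # exact linear relation
y_quadratic = x ** 2       # exact relation, but not a linear one

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between x and y.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    r = {r_linear:.3f}")     # very close to 1
print(f"quadratic: r = {r_quadratic:.3f}")  # very close to 0
```

The near-zero coefficient in the second case depends on x being symmetric around zero; for other ranges of x the same quadratic relation can give a clearly non-zero coefficient.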

A positive relationship means that the two variables move in the same direction. A higher value of x corresponds to a higher value of y, and vice versa.

A negative relationship means that the two variables move in opposite directions. A lower value of x corresponds to a higher value of y, and vice versa.
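As a rough illustration of the sign of the coefficient, the following sketch uses two small, invented data sets (the numbers are assumptions chosen for the example, not measurements): one where y tends to rise with x and one where y tends to fall as x rises.

```python
import numpy as np

# Hypothetical data, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # tends to rise with x
y_down = np.array([9.7, 8.2, 5.9, 4.1, 2.0])  # tends to fall as x rises

r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(f"rising data:  r = {r_up:.3f}")    # positive, close to +1
print(f"falling data: r = {r_down:.3f}")  # negative, close to -1
```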

The most commonly used correlation coefficient is Pearson's correlation coefficient. The following video explains the concept of Pearson's correlation coefficient.
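For readers who prefer code, Pearson's coefficient can also be computed directly from its standard definition: the sum of the products of the deviations of x and y from their means, divided by the square root of the product of the two sums of squared deviations. The sketch below assumes NumPy; the function name pearson_r and the sample values are made up for the example, and the result is cross-checked against np.corrcoef.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r from its definition (illustrative helper, not a library function)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Hypothetical sample values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"from definition: {r_manual:.4f}")  # 6 / sqrt(60), about 0.7746
print(f"np.corrcoef:     {r_numpy:.4f}")
```

This helper does not guard against constant input; if either variable has zero variance, the denominator is zero and the coefficient is undefined.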

Importance of the correlation coefficient

Correlation is an important tool for feature engineering when building machine learning models. Training a model on features that have little or no correlation with the target variable tends to produce a poorly performing model. This matters even more with high-dimensional datasets, where filtering out uncorrelated features reduces the number of inputs the model has to handle.
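One possible way to filter features by their correlation with the target is sketched below. It assumes pandas and NumPy; the column names, the synthetic data, and the cutoff of 0.3 are all assumptions made for the example, not recommended values.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature related to the target, one that is pure noise.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
df = pd.DataFrame({
    "informative": signal + 0.1 * rng.normal(size=n),
    "noise": rng.normal(size=n),
    "target": 3.0 * signal + 0.1 * rng.normal(size=n),
})

threshold = 0.3  # arbitrary cutoff, chosen only for this illustration

# Absolute Pearson correlation of every feature column with the target.
corr_with_target = df.drop(columns="target").corrwith(df["target"]).abs()
selected = corr_with_target[corr_with_target >= threshold].index.tolist()

print(corr_with_target)
print("selected features:", selected)
```

Because Pearson's coefficient only measures linear dependence, a filter like this can discard a feature that is related to the target in a non-linear way, so it is better treated as a first screening step than as a final decision.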

As another example, if two predictors are strongly correlated with each other, we only need to use one of them (in predicting salary, there is no need to use both age in years and age in months). The resulting model will be simpler, faster because it has fewer features, and easier to interpret.
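The age example can be sketched in code as follows. This is a minimal illustration assuming pandas and NumPy; the data values, the column experience_years, and the 0.95 threshold are invented for the example. For each pair of features whose absolute correlation exceeds the threshold, the column that appears later in the table is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data: age_months is just age_years * 12, so the two are redundant.
age_years = np.array([25, 32, 41, 38, 29, 55, 47], dtype=float)
df = pd.DataFrame({
    "age_years": age_years,
    "age_months": age_years * 12,
    "experience_years": [3, 10, 8, 15, 1, 20, 12],
})

threshold = 0.95  # arbitrary cutoff, chosen only for this illustration

# Absolute pairwise correlations between all features.
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is too strongly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:   ", list(reduced.columns))
```

Which column of a correlated pair to keep is a design choice; this sketch simply keeps the one that comes first, whereas in practice the choice might depend on interpretability or data quality.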