Monthly Archives: November 2013

Measures of Correlation, information content, and the 21st Century Correlation Coefficient.

Here is a quick survey of various correlation coefficients before we come to the concepts and references behind what is touted as the 21st century correlation coefficient.

The most widely known measure of association between two continuous variables is the Pearson correlation.

See: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php for a nice introduction to “r”, the Pearson correlation coefficient.

However, “r” is sensitive to outliers.

So we use the Spearman rank correlation, which applies the same formula to the ranks of the paired observations rather than to the raw values.

In the example data linked below, you can change one of the values (the 10th width value from 500 to 10000, for example) to make it an outlier. As long as the ranks stay the same, the Spearman correlation does not change, but the Pearson correlation is seriously affected, depending on how extreme the outlier is.

See: https://drive.google.com/file/d/0ByamQuDO9y9ST0dPYlQ3Tno2R3c/edit?usp=sharing
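
The outlier experiment described above can be reproduced with a minimal Python sketch. The numbers are made up for illustration and are not the values from the linked spreadsheet, and the helper names `pearson`, `ranks`, and `spearman` are hypothetical. Spearman is computed here as Pearson's formula applied to ranks, with tied values given their average rank.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the current one.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson's formula applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Illustrative data only (an assumption, not the spreadsheet's values).
length = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
width = [60, 45, 160, 190, 240, 310, 330, 420, 400, 500]

# Turn the 10th width value into an outlier: 500 -> 10000.
width_outlier = width[:-1] + [10000]

print("Pearson  before/after:", pearson(length, width), pearson(length, width_outlier))
print("Spearman before/after:", spearman(length, width), spearman(length, width_outlier))
```

Because 500 was already the largest width, replacing it with 10000 leaves every rank unchanged, so the two Spearman values are identical while the Pearson value drops noticeably.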

However, to find correlations in ordinal data, we need the concepts of concordance and discordance to define the co-relation (the joint relationship of association).

For a nice introduction to concordance and discordance, see: http://stats.stackexchange.com/questions/51604/ordinal-trends-and-finding-concordant-discordant-pairs

Using the ideas of concordance and discordance, one can define the correlation (association) through Kendall’s tau.

Here is an example:

http://www.statsdirect.com/help/default.htm#nonparametric_methods/kendall_correlation.htm 
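
To make the idea of concordant and discordant pairs concrete, here is a minimal Python sketch of the simplest variant, often called tau-a: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The function name `kendall_tau_a` and the ratings are hypothetical. Tied pairs count as neither concordant nor discordant here; statistical packages typically report tau-b, which also adjusts the denominator for ties.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables order the pair the same way
        elif sign < 0:
            discordant += 1   # the variables order the pair in opposite ways
        # sign == 0 means a tie in x or y; tau-a leaves it out of the numerator
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs

# Illustrative ordinal ratings from two hypothetical judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [1, 3, 2, 5, 4]

print(kendall_tau_a(judge_a, judge_b))
```

In this example, 8 of the 10 pairs are concordant and 2 are discordant, so tau-a is (8 - 2) / 10.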

You can also calculate these measures in SAS. The documentation example below outputs the three measures of correlation above along with Hoeffding’s D.

http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_corr_sect028.htm

The above document shows all four main measures of correlation, the last one being Hoeffding’s D. Hoeffding’s D measures general departures from independence, not just linear or monotonic association.

PROC LOGISTIC also reports several measures of association, which are useful for comparing goodness of fit from a modeling perspective.

http://www.medicine.mcgill.ca/epidemiology/joseph/courses/epib-621/logfit.pdf

For a recent document that compares and summarizes these four measures, see:

http://www3.nd.edu/~mclark19/learn/CorrelationComparison.pdf

This document also takes the direction of using the information content shared between two variables as a measure of association, through the concepts of mutual information and the maximal information coefficient.
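
As a rough illustration of the information-content idea, the following Python sketch estimates mutual information (in bits) after cutting each variable into equal-width bins. It is a sketch under stated assumptions, not the method of the document above: the function names, the choice of 4 bins, and the data are all hypothetical. It is also not MIC itself, which searches over many grids and normalizes the result; here a single fixed grid is used.

```python
from collections import Counter
from math import log2

def binned(values, bins):
    """Assign each value to one of `bins` equal-width bins (0 .. bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # constant input: put everything in bin 0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=4):
    """Plug-in estimate of mutual information (bits) on a fixed grid."""
    bx, by = binned(x, bins), binned(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))      # counts per grid cell
    px, py = Counter(bx), Counter(by) # marginal counts
    return sum(
        (c / n) * log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )

# A symmetric parabola: y depends entirely on x, yet the relationship is
# neither linear nor monotonic, so the Pearson correlation is zero.
x = [i - 7.5 for i in range(16)]
y = [v * v for v in x]

print(mutual_information(x, y))
```

The estimate depends on the number of bins and on the sample size, which is one reason grid-searching measures such as MIC were proposed.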

Interestingly, the maximal information coefficient (MIC) is also touted as the correlation coefficient of the 21st century: http://www.slideshare.net/daniel_bilar/speed-2011-mic-a-correlation-forthe21stcentury

This is a nice, simple collection of R code, presented in an expository style.

Now, why would you agree or disagree with the claim that this is the 21st century correlation coefficient? I would like to hear your points of view.

From Data Monster & Insight Monster

Map Analytics – Some Great References

The following article presents a ggplot2-based implementation of a powerful mapping process in R.
 
http://www.journal.r-project.org/archive/2013-1/kahle-wickham.pdf

The purpose, as stated by the article, is: “This article details some new methods for the visualization of spatial data in R using the layered grammar of graphics implementation of ggplot2 in conjunction with the contextual information of static maps from Google Maps, OpenStreetMap, Stamen Maps or CloudMade Maps. The result is an easy to use R package named ggmap.”