Let us simplify and simplify … until it is simple. Do we really need a pseudo-R-square for discrimination problems?
The following example picture shows the prediction of “income” from the predictor “age”.
If this is a binary target problem of predicting high income vs. low income, one can model the target on age using logistic regression and then read the true positive rate off the ROC curve for a specific tolerable level of “false positives”.
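As a rough sketch of that decision rule, the snippet below builds ROC points from predicted probabilities and picks the operating point with the highest true positive rate inside a tolerated false positive rate. The outcomes and probabilities are made up for illustration and are only assumed to come from a logistic regression on age; the helper names `roc_points` and `best_tpr_at` are hypothetical.

```python
# Hypothetical outcomes (1 = high income) and predicted probabilities,
# assumed to come from a logistic regression of income class on age.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]


def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) for each distinct score, highest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Classify as "high income" when the score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, y_true) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_tpr_at(points, max_fpr):
    """Pick the point with the highest TPR whose FPR is within the tolerance."""
    allowed = [pt for pt in points if pt[1] <= max_fpr]
    if not allowed:
        return None
    return max(allowed, key=lambda pt: pt[2])


points = roc_points(y_true, scores)
best = best_tpr_at(points, max_fpr=0.20)
print(best)
```

With these made-up numbers, a 20% tolerance for false positives selects the threshold 0.60, where 3 of the 4 high incomes are caught and 1 of the 6 low incomes is flagged by mistake.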
Sometimes people use R-square to judge goodness of fit, though in reality the R-square measure (variance explained by the model vs. total variance) carries far less useful information for the decision rules of a classification analysis than the ROC curve does. It is hard to comprehend, but people are so attached to the R-square concept that they want something similar in classification as well.
Still, it is a useful exercise to check whether a linear regression and the classification analysis it implies are consistent in the sense of correlation, meaning that they logically say the same thing.
So in the example, I have 12 observations, each of which is a pair (age, income); both are continuous variables. Let us also introduce recodes for these variables: age_high, age_low, income_high, and income_low. So there are four quadrants.
Let us assume that the fitted linear regression is the upward-sloping line, which also has an intercept.
Now you can see that the predicted value for each observation is the yellow dot on the line. It implies that income increases as age increases, in this example. This is not a realistic model, but it is good enough to bring out some points. In human resource analysis, income is a non-linear function of age that typically increases, plateaus, and then decreases.
Note that the references (1), (2), and (3) below each refer to a specific observation’s contribution to those terms.
So, if I use this structure, the red points are the ones not predicted correctly: either high incomes are predicted as low, or low incomes are predicted as high. There are 7 such observations. So the true positive rate is 43% and the false positive rate is 50% (one of the points narrowly escaped!). It is funny! This does not happen most of the time, and remember that the false positive % is not 100 minus the true positive %.
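The last remark can be checked with a small quadrant count. The sketch below uses 12 made-up (age, income) pairs and made-up cutoffs, not the points in the picture, and it assumes the rule "predict high income when age is high", which is what an upward-sloping line amounts to once both variables are recoded.

```python
# Hypothetical cutoffs and (age, income in thousands) pairs for illustration only.
AGE_CUTOFF = 40
INCOME_CUTOFF = 50
observations = [
    (25, 30), (28, 60), (30, 35), (33, 40), (35, 70), (38, 55),
    (42, 40), (45, 80), (48, 55), (52, 45), (55, 90), (60, 65),
]


def quadrant_rates(rows, age_cutoff, income_cutoff):
    """Count the four quadrants and return the counts with TPR and FPR."""
    tp = fp = fn = tn = 0
    for age, income in rows:
        predicted_high = age >= age_cutoff      # assumed rule: high age -> high income
        actual_high = income >= income_cutoff
        if predicted_high and actual_high:
            tp += 1
        elif predicted_high:
            fp += 1
        elif actual_high:
            fn += 1
        else:
            tn += 1
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "tpr": tp / (tp + fn),   # share of actual high incomes predicted high
        "fpr": fp / (fp + tn),   # share of actual low incomes predicted high
    }


rates = quadrant_rates(observations, AGE_CUTOFF, INCOME_CUTOFF)
print(rates)
```

The true positive rate is divided by the number of actual high incomes and the false positive rate by the number of actual low incomes, so the two percentages have different denominators and need not add up to 100.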
The picture also brings out what is meant by the total sum of squares, the explained sum of squares, and the unexplained sum of squares.
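For readers who prefer numbers to pictures, here is a minimal sketch of those three quantities for a least-squares line with an intercept. The (age, income) values are made up, and the function names `fit_line` and `sums_of_squares` are hypothetical.

```python
# Hypothetical ages and incomes (in thousands), for illustration only.
ages = [25, 28, 30, 33, 35, 38, 42, 45, 48, 52, 55, 60]
incomes = [30, 60, 35, 40, 70, 55, 40, 80, 55, 45, 90, 65]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sums_of_squares(x, y):
    """Return (total, explained, unexplained) sums of squares."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    fitted = [intercept + slope * xi for xi in x]
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_explained = sum((fi - mean_y) ** 2 for fi in fitted)
    ss_unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return ss_total, ss_explained, ss_unexplained


ss_total, ss_explained, ss_unexplained = sums_of_squares(ages, incomes)
r_square = ss_explained / ss_total
print(ss_total, ss_explained, ss_unexplained, r_square)
```

For a least-squares line with an intercept, the explained and unexplained parts add up to the total, which is why R-square can be read as the share of variance explained.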
The key point is that as the observations come closer to the line, the R-square increases and the ROC-type measure of prediction also improves, which is a very important consistency to understand.
Again, going back to the start of the discussion, this consistency may give people the confidence to use R-square instead of the ROC curve. But that is not a good idea, because we do not do any variance or deviance calculations when we turn probabilities into binary classes.
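One way to see this consistency numerically is to pull every observation toward the fitted line by a common factor and recompute both measures. The sketch below makes several assumptions that are not in the picture: the data are made up, the age cutoff is 40, and the income cutoff is taken to be the line's value at the age cutoff.

```python
# Hypothetical ages and incomes (in thousands), for illustration only.
ages = [25, 28, 30, 33, 35, 38, 42, 45, 48, 52, 55, 60]
incomes = [30, 60, 35, 40, 70, 55, 40, 80, 55, 45, 90, 65]
AGE_CUTOFF = 40  # assumed split between age_low and age_high


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_square_and_accuracy(x, y, age_cutoff, shrink):
    """Scale residuals by `shrink`, then return (R-square, quadrant accuracy)."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    # shrink = 1 keeps the data as is; shrink = 0 puts every point on the line.
    adjusted = [f + shrink * (yi - f) for yi, f in zip(y, fitted)]
    mean_y = sum(adjusted) / len(adjusted)
    ss_total = sum((yi - mean_y) ** 2 for yi in adjusted)
    ss_unexplained = sum((yi - f) ** 2 for yi, f in zip(adjusted, fitted))
    r_square = 1 - ss_unexplained / ss_total
    # Assumed income cutoff: the height of the line at the age cutoff.
    income_cutoff = intercept + slope * age_cutoff
    correct = sum(
        (xi >= age_cutoff) == (yi >= income_cutoff)
        for xi, yi in zip(x, adjusted)
    )
    return r_square, correct / len(x)


results = [r_square_and_accuracy(ages, incomes, AGE_CUTOFF, s) for s in (1.0, 0.5, 0.0)]
for r_square, accuracy in results:
    print(round(r_square, 3), round(accuracy, 3))
```

Under these assumptions, R-square rises as the residuals shrink and the share of observations falling in the correct quadrant never goes down, reaching 100% when every point sits on the line.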
Econometricians especially are fascinated by the idea of "variance explained by the model" and have introduced all kinds of pseudo-R-square concepts, which imply % explained type information using the log-likelihood values of the model vs. no model. For a whole collection of references on the various pseudo-R-squares, including McFadden’s pseudo-R-square, see the nice, logical article by Paul Allison, the legendary logistic regression seminar leader: http://www.statisticalhorizons.com/r2logistic
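To make the "model vs. no model" comparison concrete, here is a minimal sketch of McFadden's pseudo-R-square, computed as 1 minus the ratio of the model log-likelihood to the intercept-only log-likelihood. The outcomes and probabilities are made up and are only assumed to be the fitted values of a logistic regression; the function names are hypothetical.

```python
import math

# Hypothetical outcomes (1 = high income) and predicted probabilities,
# assumed to be fitted values from a logistic regression.
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]


def log_likelihood(outcomes, probabilities):
    """Bernoulli log-likelihood of the outcomes under the given probabilities."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1 - pi)
        for yi, pi in zip(outcomes, probabilities)
    )


def mcfadden_r2(outcomes, probabilities):
    """1 - lnL(model) / lnL(no model), where 'no model' predicts the base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    ll_model = log_likelihood(outcomes, probabilities)
    ll_null = log_likelihood(outcomes, [base_rate] * len(outcomes))
    return 1 - ll_model / ll_null


print(mcfadden_r2(y, p))
```

Nothing in this calculation refers to a threshold or to a tolerated false positive rate, so it describes the fit of the probabilities rather than the quality of any particular decision rule.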
So much so that a software package is considered incomplete if it does not report a pseudo-R-square for classification analysis. And then it becomes a competition over who has all the pseudo-R-squares for classification. Graduate students take pride in quoting the maximum number of pseudo-R-squares, and so on…
Interestingly, the current contender is Tjur’s “coefficient of discrimination”, an attempt to improve on pseudo-R-square, which is described simply:
“The definition is very simple. For each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event. Then, take the difference between those two means. That’s it!”, to quote from the article recommended above.
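The quoted definition translates almost word for word into code. The outcomes and predicted probabilities below are made up for illustration, and the function name `tjur_coefficient` is hypothetical.

```python
# Hypothetical outcomes (1 = event) and predicted probabilities of an event.
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]


def tjur_coefficient(outcomes, probabilities):
    """Mean predicted probability among events minus the mean among non-events."""
    events = [pi for yi, pi in zip(outcomes, probabilities) if yi == 1]
    non_events = [pi for yi, pi in zip(outcomes, probabilities) if yi == 0]
    return sum(events) / len(events) - sum(non_events) / len(non_events)


d = tjur_coefficient(y, p)
print(d)
```

Here the events average 0.7 and the non-events average 0.3, so the coefficient is about 0.4. Like the other pseudo-R-squares, it summarizes the probabilities without choosing a cutoff.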
In the end, what we are really interested in is the percentage predicted accurately vs. the percentage falsely identified when a decision rule is applied to the output of a modeling procedure. There is one exception to this. If we want to use the predicted probabilities in a subsequent calculation, say the expected economic value of a segment that consists of individuals, then a goodness of fit type measure is fine: we can stop there and postpone the decision making to that subsequent measure. But if we are making a decision on each and every record, then we need the ROC curve.
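The exception can be sketched in a few lines: when the probabilities only feed an aggregate, no record is ever assigned to a class. The probabilities and the per-person values below are made up, and `expected_segment_value` is a hypothetical helper.

```python
# Hypothetical segment: predicted probability of responding and the
# economic value realized if that individual does respond.
probabilities = [0.2, 0.5, 0.9]
values = [100.0, 200.0, 50.0]


def expected_segment_value(probs, vals):
    """Sum of probability-weighted values; no threshold is applied to any record."""
    return sum(pi * vi for pi, vi in zip(probs, vals))


segment_value = expected_segment_value(probabilities, values)
print(segment_value)
```

Because every record contributes its probability rather than a yes/no decision, there are no true or false positives to trade off here, which is why a fit measure on the probabilities can be enough for this use.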
If the coefficient of discrimination, in some form, extended pseudo-R-square into a function of how many false positives you are willing to tolerate when deciding on each and every record from its probability, then it could be used for classification algorithms.
I will not use a pseudo-R-square as a goodness of fit measure for decision rules until I see a more convincing argument. But this raises a question for the wider community of statisticians and econometricians.
Is any pseudo-R-square that purports to help us understand predictive power in a classification problem an incomplete or unwanted invention?
Alternatively, how can a pseudo-R-square handle the balancing act of true positives vs. false positives?