So we saw the 'Fundamental Theorem of Predictive Modeling'. It should already hint at the answer to what the evil twin of predictive modeling is.
People build great models; they mention how nicely the R-square came out, or the great KS, show all the paraphernalia of the model to convince the managers or clients, and close the model with a scoring function. In the end the pitch is mostly supported by the explained part, or by the high KS and a nicely decreasing trend in the KS chart.
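For readers who have not computed it themselves, here is a minimal sketch (my own illustration, not part of the original post) of the KS statistic for a binary scoring model: the maximum gap between the cumulative score distributions of the two outcome classes. The data below are made up purely for demonstration.

```python
import numpy as np

def ks_statistic(scores, labels):
    """KS for a binary scoring model: the maximum gap between the
    cumulative score distributions of positives and negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)                    # sort records by model score
    labels = labels[order]
    cum_pos = np.cumsum(labels) / labels.sum()            # CDF of positives
    cum_neg = np.cumsum(1 - labels) / (1 - labels).sum()  # CDF of negatives
    return np.max(np.abs(cum_pos - cum_neg))

# Toy example with hypothetical scores that separate the classes moderately well
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
s = 0.3 * y + rng.normal(0, 0.3, size=1000)
print("KS =", round(ks_statistic(s, y), 3))
```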
What is not discussed in the delivered results is
– the part called "not explained by the model": the first term on the right-hand side of the variance decomposition formula (total variation = unexplained variation + explained variation). This needs more explaining, but before that, take a look at the picture that artfully shows the explained and unexplained parts.
I am borrowing this one from Dr. Chuanhua Yu (as credited at the bottom of the picture); I liked it for its focus and artistry, and in it you can see all the little details for normal regression. See how the unexplained deviation is calculated, both algebraically and in its representation in the graph.
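Since I cannot reproduce the picture here, a minimal numeric sketch (my own toy example, not Dr. Yu's) of the same decomposition for simple linear regression: the total variation splits exactly into the part explained by the fitted line and the unexplained, residual part.

```python
import numpy as np

# Toy data for a simple linear regression fit by least squares
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 200)

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope
b0 = y.mean() - b1 * x.mean()                    # intercept
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained by the model
sse = np.sum((y - y_hat) ** 2)          # unexplained (residual) part

print(f"SST = {sst:.1f},  SSR + SSE = {ssr + sse:.1f}")   # the decomposition
print(f"R-square = {ssr / sst:.3f}")
```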
The unexplained part is where the expert's commitment comes in: explaining why the predictive model is likely to fail, how often, and how badly it could turn out in the field when we apply it.
I know it is not easy to say that one's own work does not measure up to the celebrity status it may receive based on a naive reading of the model's goodness of fit. But if it is not done, when the results come back from the campaign conducted in the field, the model will indeed show its ugly side, the evil-twin side of predictive modeling.
I can create a beautiful regression model with an R-square of 0.99, and yet it may only be applicable over a range of predictor values that is useless for the specific application at hand. The difficulty arises either because we show only the sunny side and not the dark side, or because readers with a self-interest pick up only the sunny side of the story and communicate it downstream.
On the technical side, this happens because the structure of the relationship can change completely, for any number of practical reasons, outside the range where the model was developed and validated.
Sometimes out-of-time validation helps, and yet even that does not capture the full story.
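To make the range issue concrete, here is a small sketch (a made-up example of mine, not from the original post) of a model with a near-perfect R-square on its development range that overshoots badly once it is applied outside that range, because the true relationship changes shape where we never observed it.

```python
import numpy as np

rng = np.random.default_rng(2)

# The true relationship is linear below x = 10 and saturates beyond it
def truth(x):
    return np.where(x <= 10, 1.0 * x, 10.0 + 0.05 * (x - 10))

# Development sample only covers x in [0, 10]
x_dev = rng.uniform(0, 10, 300)
y_dev = truth(x_dev) + rng.normal(0, 0.2, 300)

b1 = np.cov(x_dev, y_dev, bias=True)[0, 1] / np.var(x_dev)
b0 = y_dev.mean() - b1 * x_dev.mean()
resid = y_dev - (b0 + b1 * x_dev)
r2 = 1 - np.sum(resid ** 2) / np.sum((y_dev - y_dev.mean()) ** 2)
print(f"R-square on the development range: {r2:.3f}")   # roughly 0.99

# Apply the same scoring function outside the development range
x_new = np.linspace(15, 30, 5)
print("predicted:", np.round(b0 + b1 * x_new, 1))
print("actual   :", np.round(truth(x_new), 1))          # predictions badly overshoot
```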
That brings us to the second part of the explanation: bias in predictive modeling. This is the most insidious and hidden part of most modeling work, and it is rarely discussed explicitly.
Bias is not represented in the graph above, nor is it usually discussed in textbooks.
I evaluate hundreds of models a year, generally in the low seasons, and as an adviser to various organizations I see many, many vendors' work.
As a general practice, I never see analysts mention, even in a few bullet points, what hidden assumptions sit in the unexplained part, nor sign off stating that certain assumptions do not apply, that the results are safeguarded, and hence that we are safe in applying the predictive model.
Though we call this the unexplained part or unexplained variation, for practical purposes it covers everything that is ugliest about model building, including the following:
– specification error
– correlated errors
– selection bias
– correlated predictors
– simultaneity, lagged, and networked variables
– bias in the original sample survey
– consumer and market dynamics (I have addressed this under 'Why predictive models fail?').
There are more sub-topics here – the whole science of modeling – and I am just trying to give some shades of the evil twin.
Some of the topics above are simpler and not so ugly; others could mislead you into confidently spending resources and time.
A few words of caution on sample bias:
Bias in the sample is the one that happens most often, especially if you are building models from surveys or have only partial access to your consumer data, and it is the one I will address today.
One way to address sampling bias is to weight the sample by its population representativeness, using the weights that survey researchers provide. Sometimes they may not provide them, for any number of reasons, including calling them proprietary material. There are ways to circumvent that challenge.
Sampling weights are useful only if there are reasonably representative samples in the survey or consumer transaction data for all the consumer segments you are interested in. Sometimes there may be no representative sample at all, or even if there is one, it may be so spotty that the weighted calculation introduces too much variance, or artificially too little, giving you false confidence without your knowing it.
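Here is a minimal sketch of the weighted calculation and of how a spotty segment inflates the uncertainty. It uses the Kish effective-sample-size approximation, which shrinks sharply when a few respondents carry very large weights; the numbers are illustrative, not from any particular survey.

```python
import numpy as np

def weighted_mean(y, w):
    """Survey-weighted mean of a response."""
    return np.sum(w * y) / np.sum(w)

def kish_effective_n(w):
    """Kish approximation: how many equal-weight respondents
    the weighted sample is effectively worth."""
    return np.sum(w) ** 2 / np.sum(w ** 2)

rng = np.random.default_rng(3)
y = rng.normal(100, 15, 500)          # some survey response, e.g. spend

# Well-represented segment: weights close to uniform
w_even = rng.uniform(0.8, 1.2, 500)

# Spotty segment: a handful of respondents carry huge weights
w_spotty = np.ones(500)
w_spotty[:5] = 80.0

for name, w in [("even", w_even), ("spotty", w_spotty)]:
    print(f"{name:6s}  weighted mean = {weighted_mean(y, w):6.1f}  "
          f"effective n = {kish_effective_n(w):6.1f}")
```

With the spotty weights, the effective sample size collapses from 500 to a few dozen, which is exactly the false confidence the weighted point estimate alone would hide.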
So when you receive a predictive model, the following questions can help protect you when you apply it. A great R-square or KS, or any other variation of goodness of fit, is only half the story.
"How do we know the prediction will work in the field? What built-in fail-safe mechanisms are addressed in the modeling?"
These are not easy questions, but they help you stay wary of the evil twin.
Have fun and have a wonderful weekend.
Next week, I will bring out an amazing social media analysis tool which is absolutely free. How is it free? Well, Uncle Sam in all his benevolence funds these companies, like he did for the internet, and one of the promises of such funding is that the result is available to the public.
See also: Bias in Sample Surveys