Monthly Archives: August 2010

Increasing the performance of a predictive model – A Checklist for an Executive – Meta Data Snooping or Meta Predictive Intelligence

Predictive modeling offers many tricks that can make a model show high predictive performance at the time of development.

This is meant to be fun reading for marketing managers, so that they know what questions to ask.

Some of the tricks are outright simple, and yet the day-to-day pressures of life can lead one to overlook these simple checks; the result is a model that does not perform well in application.

The title says it all. Provided the modeler is careful not to hype the model, these ideas are actually useful for explaining, supporting, and defending the right way of building predictive models. And there is a lot of pressure to show performance, from the vendor's or the modeler's point of view.

There are some innocent ones, some smart ones that can actually be great to pursue for implementation, and some tricky ones that will unnecessarily hype the model; finding out which is which is the fun part of the job.

I started with 10 tricks, but the list has gone beyond that and I keep improving it. Some are fake and some are creative; have fun finding them.

– Build lots of models and, given a specific sample, select the one with the highest top-decile index – decile bias

– Build an ensemble of models and report only the highest-indexing model – model bias

– Build models on re-samples and select the highest-indexing model – sample bias

– Create intelligent constraints on the definition of the target variable so that the target incidence becomes smaller, which increases the indices of the top deciles when the model is interpreted – illusion of low incidence rate models

– Provide a lift chart without discussing any measurements that explain the “true positives” and “false positives” in the lift

– Build decision trees where each cell has only 3 observations, and accumulate all the cells where the lift is more than 50%, because only such cells are supposed to contribute to the top deciles

– Fit a high-degree polynomial without verifying its stability

– Skip any discussion of Type I and Type II errors in the name of testing the prediction with pure set-aside samples, without specifying how the set-aside sample was drawn or spelling out the distributional equivalence of the set-aside sample and the training/testing samples – sampling bias and alternative bias

– Build k models, test only one of them with real data, and claim predictive superiority for all of them – model selection bias

– Do not check whether the model that indexes reasonably well in lift is really a target prediction model or a reference prediction model – the lift curve is not a “beluga” but an “anteater”

– Apply the full weights to the reference group, which reduces the incidence rate and so increases the indices of the top deciles

– Use the full sample as the modeling sample and a subset of it as the testing sample, with no appropriate weights
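Several of the tricks on the list work by shrinking the incidence rate. The arithmetic behind that illusion can be sketched in a few lines. This is only an illustration under stated assumptions: an index of 100 means "average", a perfect model packs the top 10% with targets until it runs out of them, and the function name `max_top_decile_index` is made up for this example.

```python
def max_top_decile_index(incidence_rate: float) -> float:
    """Best possible top-decile index (100 = average) for a perfect model.

    Assumption: the top decile holds 10% of the records, and a perfect
    model fills it with targets until it runs out of targets.
    """
    if not 0 < incidence_rate <= 1:
        raise ValueError("incidence_rate must be in (0, 1]")
    # Share of the top decile that can be targets, capped at 100%.
    top_decile_rate = min(1.0, incidence_rate / 0.10)
    return 100.0 * top_decile_rate / incidence_rate


for rate in (0.20, 0.10, 0.05, 0.01):
    ceiling = max_top_decile_index(rate)
    print(f"incidence {rate:.0%}: ceiling on top-decile index = {ceiling:.0f}")
```

Under these assumptions the ceiling is 500 at a 20% incidence rate and 1000 at 10% or below, so redefining the target or re-weighting the reference group can raise the reported index without the model knowing anything more about the customers.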

However, there is no substitute for, and nothing as powerful as, getting the model tested with actual real-world samples. All the above whining about the analyst’s hard work does not matter if the model performs well on real-world samples; and if it does so consistently across different real-world samples, you will be called a magic modeler.
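The first trick on the list and the fresh-sample check can be tried together on simulated data. The sketch below is a toy, not a recipe: the features are pure noise, the target ignores them, the "models" are random weight vectors, and every name and number (`make_sample`, 200 models, 1000 records, 10% incidence) is an assumption chosen for illustration.

```python
import random

N_FEATURES = 5  # hypothetical number of noise features per record


def top_decile_index(scores, targets):
    """Index of the top-scored 10% against the overall rate (100 = average)."""
    if sum(targets) == 0:
        raise ValueError("sample contains no targets")
    n_top = max(1, len(scores) // 10)
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(target for _, target in ranked[:n_top]) / n_top
    overall_rate = sum(targets) / len(targets)
    return 100.0 * top_rate / overall_rate


def make_sample(rng, n, incidence):
    """Records whose features are noise and whose target ignores them."""
    features = [[rng.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(n)]
    targets = [1 if rng.random() < incidence else 0 for _ in range(n)]
    return features, targets


def score(weights, features):
    """Linear score: a stand-in for any fitted model."""
    return [sum(w * x for w, x in zip(weights, row)) for row in features]


rng = random.Random(2010)
dev_x, dev_y = make_sample(rng, 1000, 0.10)

# "Do lots of models": 200 random weight vectors, none with any real signal.
models = [[rng.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(200)]
dev_indices = [top_decile_index(score(w, dev_x), dev_y) for w in models]

# Select the highest top-decile index on the development sample.
best = max(range(len(models)), key=dev_indices.__getitem__)
best_dev_index = dev_indices[best]

# The real-world check: score the chosen model on several fresh samples.
fresh_indices = []
for _ in range(5):
    fresh_x, fresh_y = make_sample(rng, 1000, 0.10)
    fresh_indices.append(top_decile_index(score(models[best], fresh_x), fresh_y))

print(f"best development index: {best_dev_index:.0f}")
print("fresh-sample indices:", [round(value) for value in fresh_indices])
```

Since nothing here carries signal, any index well above 100 on the development sample is purely the reward for picking the winner after the fact, and the fresh-sample indices should hover around 100. A model that holds its index across several fresh samples is the one that earns the "magic modeler" title.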

Alternatively, the executive could ask: OK, smarty, tell me the remedy too. That is for another week.