The Importance of Knowing the Best Modeling Method – Comparing Logistic Regression, Random Forest, and Gradient Boosting (the gbm package in R)

We are all going to have some bias, because in the end there are factors other than pure accuracy (bias) and precision (variance) in selecting a method, which is what a technical geek would argue about.

In the next 15 years, the pure analytics market is estimated to be to the tune of $250B. It is impossible to sit and do logistic regressions, multinomial classifications, and the other sophisticated regression methods that statisticians, econometricians, and other scientific disciplines have been bringing out, and to apply them to each and every one of the opportunities that will come up over that period. Thus, there is an urgent need to create methods broadly under the umbrella of machine learning, and these applications are getting the attention of analytics managers, computer scientists, statisticians, mathematicians, econometricians, behavioral scientists, and more.

I encourage you to add GBM (gradient boosting methods, a general family of techniques; boosting can also be combined with random forests, and there are many boosting variants) to the mix of your inquiry. We can leverage what others have already painstakingly written. Also, people build models of models, and the resulting models become non-implementable if accuracy is the only criterion. The Netflix Prize was an eye-opener in that direction.
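For readers who want to follow that suggestion, here is a minimal sketch of fitting a boosted classifier with R's gbm package. The toy data and the tuning values (tree count, depth, shrinkage) are illustrative assumptions on my part, not a prescription:

```r
library(gbm)

set.seed(42)
# Toy data purely for illustration: a binary response that depends
# nonlinearly on a few numeric predictors.
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- rbinom(n, 1, plogis(0.8 * df$x1 - 1.2 * df$x2 * df$x3))

fit <- gbm(
  y ~ .,
  data = df,
  distribution = "bernoulli",  # logistic-style loss for a 0/1 outcome
  n.trees = 2000,              # boosting iterations
  interaction.depth = 3,       # depth of each tree (interaction order)
  shrinkage = 0.01,            # learning rate
  cv.folds = 5                 # cross-validation to choose n.trees
)

# Stop at the iteration that minimizes cross-validated deviance
best_iter <- gbm.perf(fit, method = "cv")
pred <- predict(fit, newdata = df, n.trees = best_iter, type = "response")
```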

See https://www.nescent.org/wg/cart/images/9/91/Chapter6_March20.pdf on tree-based methods, where boosted random forests are included and a comparison to gbm is also made. That is in no way meant as final. Here is another, from the famous book The Elements of Statistical Learning: http://www.stanford.edu/~hastie/ElemStatLearnII/figures15.pdf

The authors maintain academic honesty in publishing these papers.

It looks to me that, this being a statistical conclusion, we are likely to make some level of Type I and Type II errors, no pun intended. Accordingly, I tend to interpret it as "not all models are wrong" and "some are marginally better than a few others at some times." Something as simple as having a group that is great in R is good enough reason to use RF and GBM methods more frequently.

While this discussion has been very gracious, I do not want to discount the hundreds of papers published on random forests and GBM. One common point one may hear is that these methods are not interpretable, because we will not know which variable matters and how much it matters. I disagree: though it is a somewhat roundabout way, interpretation is still possible, and I follow this practice for the gbm models I use.
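To make that roundabout interpretation concrete, here is a hedged sketch of the standard gbm tools, continuing with the illustrative fit and data from the earlier sketch: summary() ranks the variables by relative influence, plot() draws partial dependence, and interact.gbm() estimates interaction strength.

```r
# Which variables matter, and how much: relative influence
# (normalized reduction in loss attributable to each predictor).
summary(fit, n.trees = best_iter)

# How a variable matters: partial dependence of the response
# on x1, averaging out the other predictors.
plot(fit, i.var = "x1", n.trees = best_iter)

# Pairwise interaction strength between two predictors (Friedman's H).
interact.gbm(fit, data = df, i.var = c("x2", "x3"), n.trees = best_iter)
```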

I have built hundreds of models with logistic-versus-gbm comparisons in the last few years, because of the automation we established. GBM was as good as logistic regression, and for good reason machine learning experts are going to love GBM and random forests and continue to use them. Quite likely there are proprietary methods that can compete with, or possibly beat, these methods as well.
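A side-by-side comparison of that kind can be as simple as the following sketch, which scores a logistic regression against a gbm fit on a held-out split of the illustrative data from above. The 70/30 split and AUC via the pROC package are my choices for the example, not a description of the automation mentioned here:

```r
library(pROC)  # for AUC; an illustrative choice of metric

set.seed(7)
idx   <- sample(nrow(df), 0.7 * nrow(df))   # 70/30 train/test split
train <- df[idx, ]
test  <- df[-idx, ]

# Logistic regression baseline
glm_fit  <- glm(y ~ ., data = train, family = binomial)
glm_pred <- predict(glm_fit, newdata = test, type = "response")

# GBM challenger, same tuning as the earlier sketch
gbm_fit <- gbm(y ~ ., data = train, distribution = "bernoulli",
               n.trees = 2000, interaction.depth = 3,
               shrinkage = 0.01, cv.folds = 5)
best     <- gbm.perf(gbm_fit, method = "cv")
gbm_pred <- predict(gbm_fit, newdata = test, n.trees = best, type = "response")

# Compare out-of-sample AUC
auc(roc(test$y, glm_pred))
auc(roc(test$y, gbm_pred))
```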

But I am also skeptical about proprietary methods.

I remember, when I was working at a blue-chip company, a vendor came and presented significantly higher predictive accuracy for a given situation. What the vendor was doing was hilarious: he took averages of binned ranges of the predictors and predicted those averages, buried deep in the outputs of the algorithm. People do all kinds of things, and they can always succeed to some level, but if someone claims prediction beyond the peer-reviewed methods, the call is always "show me the results" from the field, consistently.
