My thoughts:

DaveG brought out some important considerations that are fairly unique from a statistical perspective.
There is nothing wrong with borrowing from any science to apply conceptual and experiential thinking that consistently explains a phenomenon. The statistical, mathematical, computational, and econometric sciences all have a role in the business of prediction and its resulting applications.
That is why Predictive Analytics is a science in its own right.
Not all statistical models are regression models, just as not all predictive analytics is statistical. The moment one understands the concept of random error, even without fully appreciating it, one is in the statistical sciences.
Statistical or not, we need proof to believe that a process will work consistently. Working with predictive models, I see how pure computational scientists have difficulty establishing the superiority of one algorithm over another, precisely because they do not use Type I and Type II error rates when the data has a random component. We end up ranking one algorithm over another even though the difference between them could be due to that random component alone; in the end we cannot really say what matters and come away empty-handed. Talk about uncertainty. Statistics is the science of making certainty statements about uncertain phenomena (C. R. Rao, National Medal of Science laureate), albeit with the need for additional language that wraps uncertain conclusions in certainty-looking statements.

It is eye-opening to see how people end up arguing about random phenomena because their Type I and Type II errors are so wide at the tail end of the decision tree, and/or because different people use different bounds from node to node without knowing the differences. In fact, in many applications, the eventual implications of these Type I and Type II errors, and where they occur, come out only later in the game, when applying the model reveals its inapplicability for any number of practical reasons. This scientific culture led to famous decisions in the first million-dollar data-mining competition: Netflix decided the first- and second-place winners based on a business rule about who had submitted first, the difference between the two was at the fifth decimal, and eventually none of the algorithms was used in practice, for the business reason of the cost of implementing the solution.
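To make the point concrete, here is a minimal simulation sketch (my own illustration, not from the Netflix competition): two hypothetical "algorithms" with identical true accuracy are scored on the same size of test set, and the observed gap between them is compared against an approximate confidence interval for the difference of two proportions. When the interval contains zero, ranking one over the other is exactly the kind of certainty statement the data cannot support.

```python
import random
import math

random.seed(42)

n = 1000       # hypothetical test-set size
p_true = 0.70  # both "algorithms" have the SAME true accuracy

# Simulate per-example correctness for two equally skilled models.
correct_a = [random.random() < p_true for _ in range(n)]
correct_b = [random.random() < p_true for _ in range(n)]

acc_a = sum(correct_a) / n
acc_b = sum(correct_b) / n
diff = acc_a - acc_b

# Approximate standard error of the difference of two proportions.
se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"model A accuracy: {acc_a:.3f}")
print(f"model B accuracy: {acc_b:.3f}")
print(f"difference: {diff:+.3f}  approx. 95% CI: [{lo:+.3f}, {hi:+.3f}]")
# If the interval contains 0, declaring a "winner" is just a Type I error
# waiting to happen.
```

On a single run one model will almost always look better than the other; without the interval, that random gap gets reported as a ranking.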
For a full account of how the Netflix challenge was won, see http://www.research.att.com/articles/featured_stories/2010_01/2010_02_netflix_article.html?fbid=gy-J6K6DxJh.
The winning team photo: http://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/?ref=technology&_r=0
While these cases will be discussed and interpreted in graduate schools, the minute vein that caused the aneurysm is perhaps hard to find, or, in a practical world, does not matter. The best we can do is be maniacally focused on a consumer-centered vision whose key factors are customer satisfaction, price, product quality, and availability of post-sales service, but that is a different discussion.
Advances in data mining and computing have made possible what was statistically impossible before. But if you innocently use a million observations, without proper sampling, to build a decision tree, almost every variable in your predictive equation will become significant. There have been many false-positive explanations of marketing models over the years, only because we do not know where to draw the line on the significance or importance of a variable, blindly fitting 100- to 200-variable models. The difficult part is that we will not even know we are biasing the results with a lot of random error. Even with machine learning methods, it is a good idea to add a pre-processing step that reduces the variables in a statistical way before feeding them into the learning algorithms. An analyst also needs an important and underutilized kind of model: what I call "insight models", as opposed to predictive models.
Over the years I have come to appreciate the importance of balancing these two types of models in any given predictive situation, especially in a world where the proliferation of data length (millions of rows) and data elements (thousands of columns) has become common, and there is either too much correlation among many of the variables or too little structure in many of the elements.
Now here is the kicker to kindle some thought: for every predictive model there is an insight model that performs as well as the predictive model. After all, I cannot easily get out of the mode of statistical thinking. And perhaps, in a practical world, especially in application areas like Netflix (aka big data), that suggests what is coming in the future.