Monthly Archives: July 2011

Insight Model vs. Predictive Model

The moment we say predictive model, all analysts immediately think of regressions, CHAID, random forests, and GBMs. The methods are listed here roughly in increasing order of the number of variables they use: as we move down the list, it generally becomes harder to extract insights, but the predictive power increases.

The major challenge with these models is that, even for interpretable variables (as opposed to latent variables), not all analysts will agree on the same interpretation of why the sign or the magnitude of a predictor is right.

While hard-core predictive analysts can set that worry aside easily, because prediction is the most important thing in their operations, people in marketing, psychology, sociology, education, investment, and financial services can very easily be made to think hard about these models, and possibly reject them.

One way to overcome these challenges is to balance two types of models when expressing the phenomena.

Insight models are those that group variables together and explicitly drive analysts to dig for the latent variables that explain the predicand.

The major point here is that these latent models are typically developed up to only a few dimensions, so that we can understand and explain the top latent variables: no more than 5 or 10 (even 10 is an anathema), even though they may explain less than half of the variation in the data. It is also generally accepted that if we can explain the insights with the top three variables, in a way that supports a consistent, credible interpretation of business operations, the executives are very happy, even ecstatic: they can explain the challenges of the operations to others in turn.
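The idea of extracting a handful of interpretable latent dimensions can be sketched in a few lines. The post does not name a specific method, so this is a minimal illustration using principal component analysis (via SVD) on simulated data; the variable counts and noise level are assumptions, not anything from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 consumers x 8 observed variables, driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 8))
X = latent @ loadings + 0.5 * rng.normal(size=(200, 8))

# Standardize, then extract latent dimensions with PCA (via SVD).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Keep only the top few dimensions: the "insight model" view, where each
# component is interpreted through the observed variable that dominates it.
for k in range(3):
    top_var = int(np.argmax(np.abs(Vt[k])))
    print(f"PC{k+1}: {explained[k]:.0%} of variance, dominated by variable {top_var}")
```

Even when the top components explain well under half of the total variation, the dominant loadings give the analyst something concrete to name and interpret.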

If done right, the analysts, strategists, and executives could all be partying together!

Welcome to the new world of Insight Models. These are fascinating and directly enable companies to create tons of insights.

One of the results about insight models I have found is that for every predictive model there is an insight model whose predictive power is as good as that of the best selected predictive model. It may be difficult to come up with more and more latent dimensions, but I will say that after the top 5 latent dimensions you can assign each remaining variable the interpretation of the latent dimension that dominates it.

Do you know why I say that every predictive model has an equivalent insight model that is as predictive as the traditionally known predictive model? Moreover, what method would you follow, and why is it a brave new world in the light of big data? Do you really need to balance an insight model with a predictive model, or will a clever modification and the right communication sell this concept?

There are lots of other insightful questions that follow this.

Note that this has nothing to do with the strategic approach to model building. It is an essential part of modeling, so that we develop and utilize the best model in the most practical way.

From Data Monster & Insight Monster

Why Sampling Matters? – Is it important to work with samples anymore, now that we have computers all around, …?

I bring together various scenarios for why sampling matters; one can add more from various fields. However, I think direct marketing clients, who generally work with a large acquisition population or have acquired a large customer base, are still trying to get comfortable working with samples, because with full data there are no standard error calculations and associated probability distributions to reason about.

Sampling is useful to gain insight and prediction and to extrapolate them to the whole population, with the understanding that there is sampling error in the estimates of the parameters and equations. In the end, when we apply the sample results, we are indeed applying them to the population under study.
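The trade-off here, an estimate with a quantified sampling error instead of an expensive full-population computation, is easy to demonstrate. A minimal sketch, assuming a simulated "population" of customer spend values (the lognormal shape, sizes, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# "Population": one million customer spend values (assumed lognormal shape).
population = rng.lognormal(mean=3.0, sigma=0.8, size=1_000_000)
pop_mean = population.mean()

# A 1% simple random sample recovers the mean within a known tolerance.
sample = rng.choice(population, size=10_000, replace=False)
est = sample.mean()
se = sample.std(ddof=1) / np.sqrt(sample.size)  # standard error of the mean

print(f"population mean {pop_mean:.2f}, sample estimate {est:.2f} +/- {1.96 * se:.2f}")
```

The standard error is exactly the machinery the post says gets lost when analysts insist on full-base computations: it tells us how far the sample estimate can plausibly be from the population value.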


– It is too expensive in resources and time to learn about the whole population, using whole-population data, when in fact, within a reasonable tolerance, we can get the same insight and the same estimates of the population equations and parameters using a fraction of the population.
– It is more important to track the dynamics of consumer intelligence than to get perfect consumer intelligence at a particular point in time.
– Also, just collecting the right data applicable to a business problem is a huge process, and we are constrained to concentrate on small-volume (sample), high-quality data rather than other alternatives, leveraging the tremendous amount of work done by econometricians and statisticians on efficiently collected sample data.

The population dynamics (in consumer marketing studies, the consumer dynamics) change so fast, and oftentimes so much, that by the time we get actionable consumer intelligence it is too late to apply it. Timeliness is more important because things always change.

On a production line, we systematically test a few samples to understand whether quality is in control. We cannot afford to test every unit and still release it to the consumer, especially if the test goes to a deeper level.
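The quality-control scenario is classically formalized as acceptance sampling. A minimal sketch with a hypothetical plan (the lot size rule, n=50 and c=1, is my assumption, not anything from the post): inspect n units and accept the lot if at most c defectives are found.

```python
from math import comb

def accept_prob(p_defect, n=50, c=1):
    """Probability the lot is accepted when the true defect rate is p_defect:
    P(at most c defectives in a sample of n), a binomial tail sum."""
    return sum(comb(n, k) * p_defect**k * (1 - p_defect)**(n - k)
               for k in range(c + 1))

# The plan discriminates between good and bad lots without testing every unit.
for p in (0.01, 0.05, 0.10):
    print(f"defect rate {p:.0%}: accept with probability {accept_prob(p):.3f}")
```

The curve of acceptance probability against defect rate (the operating characteristic curve) is precisely what lets a factory control quality from a small sample rather than exhaustive, possibly destructive, testing.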

In genetic testing, where a human being has close to 30,000 genes and the world has so many sub-populations, how do we determine the genetic map for a disease? With the current technology and information structures of genes, we can only test clusters of signals, not specific genes, for most diseases; and even a known gene contributing as a single signal provider is only valid for a fraction of the population, usually smaller than 5%, if we are lucky.

In clinical trials, it is very expensive, and indeed unethical, to start Phase I or Phase II when we do not yet know the efficacy of the medication at some level. We need to work with samples, and with appropriate consent protocols for the sampled patients; such studies should be designed to lead to unbiased summary estimates of efficacy and side effects for specific sub-populations.

To understand fish sub-populations, think about what we can do with sampling that we cannot do with a full-population study.

The Internet provides quite a few readings for the key phrase “why sampling matters”. Read the simple one I pulled out. It leads to many important statistical concepts.

Common Question:
Why do I need samples when we have so much computing power and so much data automatically available through the systems with which consumers interact?

My recent reading about targeting on Facebook using Facebook's marketing intelligence (900MM-plus users) is the following. The consumer intelligence drawn from Facebook's full data is so dynamic and so broad (despite its advanced targeting methods) that some industry leaders argue it is not worth it (though Facebook provides segment-wise advertising) unless it is accompanied by a well-understood segmentation of one's own franchise, with the consumer dynamics, trends, and patterns that explain the segments' behavior for the next best campaign. That approach is better than a perfect population-wide intelligence built on all the population data, where the standard error is zero but you end up with too many segments.

So concentrate on the business problem; gain insights using better segmentation methods, predictive models, and analysis that help you understand the bias issues and the MSE (mean squared error); and see how these segments move dynamically based on market trends, brand trends, product trends, and communication- and shopping-channel trends, so that consumers can be targeted with an understanding of where they will be at the time of the next best campaign. (I know it is a lot… but this is the way to solve the consumer marketing problem.)
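Understanding "the bias issues and the MSE" comes down to one identity: MSE = bias² + variance. A minimal Monte Carlo sketch of that decomposition, with illustrative numbers of my own choosing (a plain sample mean versus a shrunken mean as two hypothetical estimators of a segment's mean response):

```python
import numpy as np

rng = np.random.default_rng(2)

true_mean = 10.0
n, reps = 50, 20_000

# Repeatedly estimate a segment's mean response from samples of size n.
samples = rng.normal(true_mean, 5.0, size=(reps, n))
plain = samples.mean(axis=1)   # unbiased, higher variance
shrunk = 0.9 * plain           # shrinking toward 0 adds bias, cuts variance

for name, est in [("plain", plain), ("shrunk", shrunk)]:
    bias = est.mean() - true_mean
    var = est.var()
    mse = np.mean((est - true_mean) ** 2)
    print(f"{name}: bias^2 + var = {bias**2 + var:.3f}, MSE = {mse:.3f}")
```

Segmentation and modeling choices trade these two terms against each other; tracking MSE rather than bias alone is what keeps the comparison honest.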

I can see how one can achieve much in a short time using panels of size 100,000, compared to population-based studies. Do you?

From Data Monster & Insight Monster

The Architectural Components of Consumer Analytics

– Data (Consumer Geo-demographic and Market Research Panels, and Transactional)

– Integrated Insights

– Market Trends vs. Consumer Trends vs. Customer Trends

– Competitive Intelligence vs. Product Intelligence

– Addressability, Scalability of Insights for Marketing 1-1

– Bias Reduction in Insights

– Balancing the sample (bias is completely eliminated if samples can be balanced)

– Possibility sampling or unavoidable sampling; apply the applicable_algorithm approach

– Bias introduced by Missing values

– Insights vs. Prediction

– Customer lifetime value calculations

– In-market timing models

– Affinity and Cross-affinities (Brand)

– Propensities and Cross propensities (Product)

– Communication channel propensity models

– Shopper channel preference models

From Data Monster & Insight Monster

How Clinical Trials Have Been Manipulated To Show Better Discrimination for Efficacy

Presented here are the untold, indirect insights of this NY Times article: a businessman's perspective, and the pros and cons of the article, on how bias in sample selection for targets vs. reference can be manipulated for better predictive discrimination.

Here is one survey that summarizes how the totality of clinical trial strategies pans out in the end: a collection of anti-depressant medications fizzles out in the meta-analysis.

– Scientists will add and subtract trial participants specifically to bring out statistically significant differences between control and test; nothing wrong with that, and they will be honest about it in the submission by specifying the sub-populations in the trial; of course, they will try to see whether the medication is applicable to the whole population.
– If it does not work on the whole population, it will work on the most severely affected sub-population compared with the whole population, or on some other subset of the population: those with stroke, or with diabetes, or with hypertension, or with …. That is also not a problem. The ROI measures may not be great, but it puts a leg forward in the continued research for improvements, helps the company continue to be a (or the) therapeutic category leader, or at least gets a stock market lift so the company can cash in by selling company stock.
– Interestingly, if you compare the most severely affected population with the overall reference, you are also more likely to get statistically significant differences.

2011 June – IBM GLOBAL CIO survey key points

– The number one priority of CEOs/CIOs, dominantly, is to get help with overall company strategy, with analytics as the most important area of program development and management. So this is a great conversation starter with clients: confirm whether this is indeed true in their organization also; if it is, ask what they are doing, what data they are using, and how successful they have been, and spice it up with some of your own data, market insights, and innovative solutions that can complement their work. As a side conversation, verify whether CMOs share this understanding.

– The second point that struck me is that the top two tools do not include social media; they are BI/analytics (expected, I guess, given the first point) and mobile solutions.

– The third point worth noting is that the transformation mandate is coming in areas that are more volatile, either because of disruptive changes in technology or because of the volatility of the economy.

These changes are happening so fast and on such a scale that this is not surprising. The selling point, I think, is that this is the reason they need a stronger program in analytics or insight support; they need more of it to justify their decisions on investment and direction.

– The fourth point I see is that the downside of financial services exerts a lot more pressure than technology disruptiveness. Interestingly, I also feel this comes through more because this is a global survey, not just a US survey (unless financial services includes big home/auto-ownership financial decisions). This survey seems to be weighted toward CIOs contributing from outside the US.

I think, when defining one's own priorities and development areas, one will benefit from thinking about the following: what kinds of problems should one be actively working on, as thought-leadership work, that would attract CEOs/CMOs? This requires some brainstorming with a marketing agency or analytics group.

One caveat to accommodate is that CIOs will likely change their opinions a year from now, based on the pressures that develop over the next 12 months; so whatever one does has to be done sooner, and any solution we keep bringing to clients will have to be more immediate, so that we are surfing while the waves are cresting.

Sometimes one does not have to create anything new, but simply tune the solutions one already has to make sure they fit these understandings.

The other finer point is that we need to be in a position to predict what will come next year. This also points to the question, “what are the couple of scenarios for the next 3 years and 5 years?” It is up to us to make use of it.

From Data Monster & Insight Monster

A million dollar misinterpretation? Netflix Prize Winning Solution

I really wish well for the winning team that got the million dollar prize in the Netflix data mining challenge.

Just take a look at the ranking of the various teams frozen at the time of winning announcement.

Is it really true that the first team, “BellKor's Pragmatic Chaos,” has the best solution?

Statistically speaking, it is just chance that BellKor's Pragmatic Chaos's slightly better prediction accuracy on a few extra records (both teams' RMSEs are the same to the second decimal place) made it the winning team; from a practical application point of view, it may in fact be no better than the second-place solution.

Oftentimes, I see purely computational data miners struggle through interpretation, because the concept of white noise is not part of their conceptual process.

That is part of the reason the purely computational data miner never talks about the two types of error. For that practitioner, there is only one type of error that needs to be controlled for better prediction.

Any undergraduate-trained statistician will say that they do not even have to run any tests to reach the above conclusion.
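The statistician's instinct here can be made concrete: compare the two models' per-record squared errors with a paired test before declaring one the winner. A minimal sketch on simulated numbers (these are not the actual Netflix leaderboard errors; the sample size and error scale are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-record errors of two leaderboard models whose overall
# RMSEs agree to the second decimal place.
n = 5_000
err_a = rng.normal(0, 0.86, n) ** 2  # squared errors, model A
err_b = rng.normal(0, 0.86, n) ** 2  # squared errors, model B

# Paired t-statistic on the per-record squared-error differences:
# if |t| is small, the RMSE gap is indistinguishable from noise.
diff = err_a - err_b
t = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
print(f"RMSE A={np.sqrt(err_a.mean()):.4f}, RMSE B={np.sqrt(err_b.mean()):.4f}, paired t={t:.2f}")
```

When the t-statistic falls well inside the noise band, the leaderboard ranking between the two teams is, statistically speaking, a coin flip.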

On top of that, these models are hard to interpret and hard to implement.

I am sure this will draw some fire from different corners. One expected line is the following: statisticians could better incorporate heuristics to improve their science!

Let us face it; a spade is a spade. Hopefully airing these points of view will keep rekindling the need for a better balance of methods.

Fair Balance Statement: I am not connected in any way to the second team; if it seems like I am defending the second-ranked team, this is simply what we understand from a statistical perspective.

From Data Monster & Insight Monster