Monthly Archives: January 2015

The business side of the inner workings of an engaging, influential dashboard design

What determines analytics priorities?

What kind of measurement supports each one?

How do we have engaging conversations with the right owners of each of these dimensions?

Shall we say it is time for a dashboard? Converting all the above questions into measurement, graphical representation, and engaging communication with senior management is the tricky part of designing an engaging and influential dashboard.

Market Basket Analysis – A Recommender System with Many Business Possibilities

One way to do recommendation analysis is “market basket analysis”.

It is a non-parametric method, similar to cluster analysis, where the clusters are behaviorally induced.

Market basket analysis supports many kinds of analytical decision making:

– shelf arrangement

– companion discount

– lead product identification for additional sales

– better pricing

– developing sales brochures/advertisements

– creating package offers

– inventory management
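As a rough illustration of the quantities behind these decisions, here is a minimal Python sketch that computes support, confidence, and lift for item pairs. The transactions are made-up toy data and the function name `pair_rules` is hypothetical. It counts pairs by brute force instead of running the full apriori algorithm, so treat it as a sketch of the idea only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets (not taken from any real data set).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def pair_rules(transactions, min_support=0.4):
    """Return rules lhs -> rhs for item pairs, sorted by lift (highest first)."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]       # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # P(rhs | lhs) / P(rhs)
            rules.append(
                {"lhs": lhs, "rhs": rhs, "support": support,
                 "confidence": confidence, "lift": lift}
            )
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


rules = pair_rules(transactions)
for r in rules:
    print(f'{r["lhs"]} -> {r["rhs"]}: support={r["support"]:.2f}, '
          f'confidence={r["confidence"]:.2f}, lift={r["lift"]:.2f}')
```

A lift above 1 suggests the two items appear together more often than they would if bought independently, which is the kind of signal behind companion discounts and package offers.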

Here is a simple presentation on the topic

For a simple application example in R using the apriori algorithm, written as tutoring notes, you may see

Here is the link for the full document of the data-miners presentation, which explains many interesting data mining applications, including “market basket analysis” (in section 10.5, while discussing association rules and their applications). Here is the link for their book. An older version is available as a PDF download here at

Here is a look at how to implement these ideas in a Hadoop setup for real-time applications.



The Problem of Left Out Variables and the Importance of Case-Control Studies – Why everyday lingua is not good enough to explain deep scientific endeavors

Today’s greatest news is about “luck” being the independent variable in a regression model!

There are many reasons why this kind of language happens: the different language constructs and terminologies used by scientists and media people; the media’s need to reach a mass audience in a simplified way and communicate the conclusions; and scientists not giving enough importance to the fact that the “unknown is too much”, with no proper caveats mentioned for the “unknown”.

As the number of stem cell divisions in a tissue rises, so does the chance of cancer striking that site.

This one graph explains the main conclusion. The more cell divisions there are, the more likely you are to get cancer; but cell division happens to all people equally, in general. So who gets cancer and who does not is just “luck”.

In the picture above, the authors note that observational studies show a lower likelihood of cancer in the “small intestine” than in “colorectal” tissue.

So, combining all the information above with some additional computations on environmental and genetic factors, cell division explains the occurrence of cancer a lot better than environmental and genetic factors do, and even after controlling for those factors, the correlation is still significant.

For more details and complete coverage, please see the original article.

From a teaching point of view, this is a great example for a discussion on “left out variables”. What are the left out variables? Why is a left out variable always a mystery, and why can many problems perhaps be traced to a left out variable? What do diagnostic plots look like with and without the left out variable? And so on.

Left out variables are well studied in econometrics, and here is chapter 6 of the book, which explains them beautifully with many examples and many types of specification errors.

The important conclusion about left out variables is: “If a variable that is correlated with one of the variables in the model is left out, then we will get a biased relationship.” That is my hypothesis for the above conclusion, all else being acceptable.
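To make the left out variable point concrete, here is a small simulation sketch in Python. Everything in it is an assumption chosen for illustration: the variable names `x` (included) and `z` (left out), the coefficients 1.0 and 2.0, and the strength of the correlation between `x` and `z`. It has nothing to do with the cancer data. Note that the bias appears because `z` is both correlated with `x` and an influence on `y`.

```python
import random


def simulate(n=20000, seed=0):
    """Draw (x, z, y) where z is correlated with x and also drives y."""
    rng = random.Random(seed)
    xs, zs, ys = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # variable that will be left out
        x = 0.8 * z + rng.gauss(0, 0.6)          # included variable, correlated with z
        y = 1.0 * x + 2.0 * z + rng.gauss(0, 1)  # true effect of x on y is 1.0
        xs.append(x)
        zs.append(z)
        ys.append(y)
    return xs, zs, ys


def cov(a, b):
    """Population covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)


def short_slope(xs, ys):
    """OLS slope of y on x alone, with z left out."""
    return cov(xs, ys) / cov(xs, xs)


def full_slopes(xs, zs, ys):
    """OLS slopes of y on x and z together (two-regressor normal equations)."""
    sxx, szz, sxz = cov(xs, xs), cov(zs, zs), cov(xs, zs)
    sxy, szy = cov(xs, ys), cov(zs, ys)
    det = sxx * szz - sxz ** 2
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det


xs, zs, ys = simulate()
biased = short_slope(xs, ys)
b_x, b_z = full_slopes(xs, zs, ys)
print(f"y ~ x     : slope on x = {biased:.2f}")
print(f"y ~ x + z : slope on x = {b_x:.2f}, slope on z = {b_z:.2f}")
```

Under these assumed parameters the short regression's slope is expected to be about 1.0 + 2.0 × cov(x, z) / var(x) = 2.6, far from the true 1.0, while the regression that includes `z` recovers roughly 1.0 and 2.0.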

Two explorations:

The reason I am commenting on these successful scientists’ publication is the following. The conclusions from this article are circulating around the world, and they are important. So it is important for scientists to understand and explain them.

What if something else is going on during cell division, such as the cells’ absorption of extraneous material (organic and non-organic) in a processed form, through which carcinogenic items or processes are picked up? Then the more cell division happens, the more likely one is to get cancer. Is this eliminated in the study? Well, that is not the purpose of the study, because this is a macro-level analysis of aggregate data.

That brings up the next point, which is a common problem in the use of macro-level aggregate data analysis.

People use aggregate analysis and interpret it as if it were a case-control study; this may not be intended, but it is at least how people are likely to take it.
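A tiny, deliberately artificial Python sketch shows why aggregate results cannot simply be read as individual-level results. The three groups and their numbers are invented for illustration, and `exposure` and `outcome` are hypothetical labels. The group averages line up perfectly, while inside every group the relationship runs the other way.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: (exposure values, outcome values) for individuals in each group.
groups = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

# Correlation computed on group averages only (the "macro" view).
mean_exposure = [sum(x) / len(x) for x, _ in groups.values()]
mean_outcome = [sum(y) / len(y) for _, y in groups.values()]
aggregate_r = pearson(mean_exposure, mean_outcome)

# Correlation computed among individuals within each group.
within_r = {name: pearson(x, y) for name, (x, y) in groups.items()}

# Correlation computed on all individuals pooled together.
pooled_r = pearson(
    [v for x, _ in groups.values() for v in x],
    [v for _, y in groups.values() for v in y],
)

print("aggregate (group means):", round(aggregate_r, 2))
print("within each group:", {k: round(v, 2) for k, v in within_r.items()})
print("pooled individuals:", round(pooled_r, 2))
```

Here the correlation across group means is +1, the correlation inside each group is -1, and the pooled correlation is 0.8. Which number answers the question depends on whether the question is about groups or about individuals.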

Some historical cases of the caveats of lab/data analysis, published in The Economist, are here,

Retractions of papers published in prestigious journals do happen. Here is a quote from The Economist.

Similar problems undid a 2010 study published in Science, a prestigious American journal (and reported in this newspaper). The paper seemed to uncover genetic variants strongly associated with longevity. Other geneticists immediately noticed that the samples taken from centenarians on which the results rested had been treated in different ways from those from a younger control group. The paper was retracted a year later, after its authors admitted to “technical errors” and “an inadequate quality-control protocol”.

The number of retractions has grown tenfold over the past decade. But they still make up no more than 0.2% of the 1.4m papers published annually in scholarly journals. Papers with fundamental flaws often live on. Some may develop a bad reputation among those in the know, who will warn colleagues. But to outsiders they will appear part of the scientific canon

Note that the authors are not saying that environment and genetics do not play a role. They are saying that the role of chance is overwhelming.

This is not uncommon. We know that only around 5% of the variation in breast cancer predictability is due to genetics. My doctor says that, in some therapeutic classes, the efficacy of medications is hardly 5% to 10%. We use such medications because they still cure that percentage of cases. The fact that 90% is not explainable by genetics or environment is not interpreted as “luck” being the reason.

Somewhere along the way, the media want to say it in simpler terms so that common people may understand the interpretation, but they lose the scientific lingua, which is precise. It is not being geeky to be scientific! It can be life and death for all of us.

Here is the biggest question. Will big data change the above conclusion? My favored hypothesis is “YES”. Only the right data and the right analysis will give us a definitive answer either way.