Monthly Archives: October 2013

The Importance of Knowing the Best Modeling Method – Competing Methods: Logistic Regression, Random Forest, and Gradient Boosting (the gbm package in R)

We are all going to have some bias, because in the end there are factors other than pure accuracy (bias) and precision (variance) in selecting a method, whatever a technical geek might argue.

In the next 15 years, the market for pure analytics is estimated to be on the order of $250B.  It is impossible to sit and hand-craft logistic regressions, multinomial classifications, and the other sophisticated regression methods that statisticians, econometricians, and other scientific disciplines have developed for each and every opportunity that will come up over those 15 years. Thus there is an urgent need for methods broadly under the umbrella of machine learning, which is getting the attention of analytics managers, computer scientists, statisticians, mathematicians, econometricians, behavioral scientists, …

I encourage you to add GBM (gradient boosting, a general method that can also be combined with RF; there are many boosting variants) to the mix of your inquiry. We can leverage what others have already written painstakingly. Also, people build models of models, and the models become non-implementable if accuracy is the only criterion; the Netflix Prize was an eye-opener in that direction. Here is a paper on tree-based methods, in which boosted random forests are included and a comparison to gbm is also made; that is in no way meant as final. Here is another one, from the famous book The Elements of Statistical Learning.

The authors maintain academic honesty in publishing these papers.

Since any such conclusion is a statistical one, we are likely to make some level of Type I and Type II errors, no pun intended.  Accordingly, I tend to interpret it as “not all models are wrong” and “some are marginally better than others some of the time”.  Something as simple as a team being great in R is good enough reason to use RF and GBM methods more frequently.

This discussion has been very gracious.  However, I do not want to discount the hundreds of papers published on random forests and GBM. One common objection is that these methods are not interpretable, because we will not know which variables matter and how much. I disagree: though it is a somewhat roundabout way, interpretation is still possible.  I follow this for the gbm method I use.

I have built hundreds of models comparing logistic regression and gbm in the last few years, thanks to the automation we established. GBM was at least as good as logistic regression, and for good reason machine learning experts are going to love GBM and random forests and continue to use them. Quite likely there are proprietary methods that can compete with, or possibly beat, these methods as well.
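As a sketch of the kind of automated comparison described above, here is a minimal example on synthetic data. The post works with R's gbm package and logistic regression; scikit-learn's GradientBoostingClassifier and LogisticRegression stand in here, and every number is made up:

```python
# Compare logistic regression and a gradient-boosted model on one dataset,
# then read off variable importances from the boosted model (the "which
# variable matters and how much" question), analogous to summary.gbm() in R.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
boost = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

auc_logit = roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1])
auc_boost = roc_auc_score(y_te, boost.predict_proba(X_te)[:, 1])
print(f"logistic AUC: {auc_logit:.3f}   boosted AUC: {auc_boost:.3f}")

# Relative influence of the top predictors in the boosted model
ranked = sorted(enumerate(boost.feature_importances_), key=lambda t: -t[1])
for i, imp in ranked[:3]:
    print(f"x{i}: {imp:.3f}")
```

The importance ranking is what makes the interpretability objection weaker than it sounds: the boosted model still tells you which inputs carry the signal.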

But I am also skeptical about proprietary methods.

I remember when I was working at a blue-chip company, a vendor came and presented significantly higher predictive accuracy for a given situation.  What the vendor was doing was hilarious: he took averages of binned ranges of the predictors and predicted those averages, buried deep in the outputs of the algorithm.   People do all kinds of things, and they can always succeed to some level, but if someone claims prediction beyond peer-reviewed methods, the call is always “show me the results” from the field, consistently.

Strategic Analytics: Why KPIs Are Not Enough and KLIs Matter – The Case of Customer Service Performance Management

Strategy, Strategy, Strategy…In today’s news, a marketing manager talks about social strategy, meaning how he is going to go about using social media.

In other news, a campaign manager talks about a campaign strategy that would increase the ROI of his efforts.

Analyze the situations where both are fully aware of the corporate strategy vs. not fully aware of it.  The second situation is just a way of saying that they came out of the town hall meeting with senior management giving out parrot talk about KPIs, on and on and on…
A strategy is a plan, and analytics helps you figure out the right metric for it, while tactics are a series of one or more actions that achieve sub-parts of the strategy, leading to its fulfillment.

So how do you come up with a strategic metric?  Often, strategy is a convincing plan that lives in the hearts of the generals (General Managers or CEOs) in the battle; we do not measure and follow it on a daily basis, and we are measured by the outputs (KPIs) of the strategy.  KLIs (Key Leading Indicators) are the ones that help you predict the performance of your strategic metric and hence the performance of your KPIs.

Here I want to follow an approach that is apparent, tractable, and can be communicated convincingly in a dashboard covering your strategic metric, KLIs, and KPIs.

RFM can be a marketing strategic measure if the overall marketing plan’s strategy is to reward and increase sales among people who are recent, frequent buyers who spend more.

But that cannot be the organization’s strategic metric.

The organizational strategic metric is one that distinguishes your vision in the marketplace, specific to your unique products and/or services and your value segments. Budget is the other side of the coin of your strategy.

Also, the organizational strategic metric is not fixed, but it cannot change often.  Since vision is more fundamental, the strategic metric can change with the flow of the river of market dynamics; still, vision drives everything.

Here is an example of a customer service case study using a strategic metric, KLIs, and KPIs. Often I find departments stating their goal as being “best in class” in service.

To be “best in class” is an aspiration and is not in itself a strategy; it requires a strategy to achieve.  Also required is some understanding and a baseline definition of “best in class” to compare against.  That is your strategic metric.  Even if we do not know the baseline measure, we can still define the strategic metric.

For example, consider the direct selling division of a huge retailer.  For customer service in this case, you may find that as long as product quality is at a certain level, 90% of Level I calls are resolved on the first call within 10 minutes, 90% of the remaining calls are resolved at Level II, and 1% of the time product returns are satisfactorily accepted, then you may envision that as a customer service success story.  The way you achieve this is through strategy, say, selling the product only to customers who have a certain credit score and a friend of a certain type in their Facebook friends list.
KLIs are metrics that help you close the gap between the “best in class” goal metric you expect to achieve and your organization’s strategic metric; alternatively, KLIs are metrics that optimize your organization’s strategic metric.  Some example KLIs are: (1) the percent of transactions of certain types that put the credit score close to the cut-off score on a quarterly basis, (2) quarterly money spent on jewelry products of a certain type, (3) holding certain job types, (4) traveling a certain mileage.

As you can see, all of this requires a certain maturity level in analytics as well as in data collection and data usage. Also, the strategic metric is a well-defined concept, and the KLIs are predictive of the success of your strategy; hence you are able to keep the KPIs under control, which are the amount of sales, customer satisfaction, and …
Every organization will have its own collection of strategic metrics and KLIs: note that all of the above discussion changes depending on your unique products and/or services, your value segments, your competitors’ level of service, and your budget.
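As a sketch of how the service targets above could feed a dashboard, here is a minimal calculation over made-up call records. Only the 90% targets come from the example; the records and field names are invented:

```python
# Toy check of the hypothetical service targets: the share of Level I calls
# resolved on the first call within 10 minutes, and the share of escalated
# calls resolved at Level II. All call records here are invented.
calls = [
    # (level, resolved, minutes)
    ("I", True, 6), ("I", True, 9), ("I", False, 12), ("I", True, 4),
    ("II", True, 25), ("II", True, 40), ("II", False, 55),
]

level1 = [c for c in calls if c[0] == "I"]
level1_ok = sum(1 for _, ok, mins in level1 if ok and mins <= 10) / len(level1)

level2 = [c for c in calls if c[0] == "II"]
level2_ok = sum(1 for _, ok, _mins in level2 if ok) / len(level2)

print(f"Level I resolved first-call within 10 min: {level1_ok:.0%} (target 90%)")
print(f"Level II resolution of escalated calls:    {level2_ok:.0%} (target 90%)")
```

Tracking these shares over time, alongside the KLIs, is what turns the aspiration into something a dashboard can actually report on.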

From Data Monster & Insight Monster

HADOOP Cluster System For Less Than $999 – An 8 GFLOP Computer?


All fun…

Compare this with NVIDIA –

Maybe the best thing is to go for the well-structured and beautifully engineered 1 TFLOP machines using NVIDIA GPUs.  The power consumption in that case is a whopping 1,200+ watts. For the Cubieboard, the power consumption is less than 100 watts for this small configuration, as a simple linear calculation points out. It can be built flexibly, especially for companies that are just starting out with open-source systems.  This is a minimalist configuration, but I see future possibilities. At the least it can be a great test system.

More hands on details available at:

More on and its assembly:  

For GFLOPS see:
Accordingly, it looks like 8 GFLOPS is a serious underestimate.
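The simple linear calculation mentioned above can be written out explicitly. The per-node figure below is an assumed nominal value for illustration, not a benchmark from the linked pages:

```python
# Back-of-envelope estimate: cluster throughput scales roughly linearly with
# the number of nodes. The per-node GFLOPS figure is an assumed placeholder.
def cluster_gflops(nodes, gflops_per_node):
    return nodes * gflops_per_node

print(cluster_gflops(8, 1.0))   # 8 nodes at a nominal 1 GFLOP each
print(cluster_gflops(8, 2.0))   # doubles if each board sustains 2 GFLOPS
```

This is why the headline 8 GFLOPS figure moves so easily: it is just the node count times whatever each board actually sustains.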

One important point: you can build a Hadoop cluster, but to get value out of your big data you need to hire a $100K analyst, at the least!

Anybody interested in working with me to build and support HADOOP Clusters for our clients?

From Data Monster & Insight Monster

Advanced Analytics for Categorical Data – Pre-Test Question Paper

Available for members only, on request.  Everyone will see one or two questions every now and then.

This collection of around 25 questions was developed for a course in Health Services Research.  However, the questions are generic and are basically the kind found on a statistics prelim exam for a quantitative Ph.D. program.

Q1. When conducting multiple linear regression, the effect of multiplying a continuous independent variable by a constant:
a.      will not affect the estimated coefficient of the independent variable
b.      will not affect the estimate of the SSR (Sum of Squares due to Regression)
c.       Both
d.      Neither
e.       only a
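The question can be settled with a quick numerical check. The least-squares sketch below is generic and not taken from the question paper; it fits the same simulated response against a predictor and against that predictor multiplied by a constant:

```python
# Rescaling a continuous predictor by a constant c changes its coefficient
# (beta -> beta / c) but leaves the fitted values, and hence the SSR
# (Sum of Squares due to Regression), unchanged.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

def fit(xcol):
    X = np.column_stack([np.ones_like(xcol), xcol])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    ssr = np.sum((yhat - y.mean()) ** 2)   # SS due to Regression
    return beta, ssr

b1, ssr1 = fit(x)
b2, ssr2 = fit(10.0 * x)           # multiply the predictor by c = 10

print(b1[1], b2[1])                # slope shrinks by a factor of 10
print(np.isclose(ssr1, ssr2))      # SSR is unchanged
```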
From Data Monster & Insight Monster

Some Useful Notes on Clustering With R

The following gives some quick ideas and code for k-means clustering and hierarchical clustering methods using R.

University of California, Berkeley notes on clustering and R, especially on Hierarchical and agglomerative methods.

There is also a chapter on clustering methods in this data mining and case studies book.  The chapter on clustering includes k-means, k-medoids, hierarchical, and density-based clustering methods.

Start here and we will take it from here with your input.
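The linked notes work with R's kmeans() and hclust(); as a stand-in for quick orientation, here is a minimal Python/scikit-learn equivalent on made-up data:

```python
# Minimal sketch of the two methods the notes cover: k-means and
# agglomerative (hierarchical) clustering, on two well-separated blobs.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),    # blob around (0, 0)
               rng.normal(3, 0.3, (50, 2))])   # blob around (3, 3)

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
hc = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)

print(np.bincount(km.labels_))   # cluster sizes from k-means
print(np.bincount(hc.labels_))   # cluster sizes from hierarchical clustering
```

On separated blobs like these, both methods recover the same two groups; the interesting comparisons start when the clusters overlap or are non-spherical.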

From Data Monster & Insight Monster

The Changing Face of CRM and a World of New Opportunities Developed over the Last 21 Years

Exactly 21 years ago, Peppers and Rogers coined the term 1to1® in relation to marketing, and it was popularized in CRM contexts by Peppers and Rogers (1993).

As late as 2007, the CRM paradigm meant influencing consumers by treating them differentially based on the latent value they hold for the organization, bringing together the tools, methods, and technology for reaching the ‘right customer, with the right offer at the right price, at the right time’; people started adding ‘right channel’ when a multitude of channels began popping up.

Factors That Are Changing The Face of CRM:

Proliferation of Channels:
In a survey that BIGdata Research releases monthly, there are more than 20 channels for reaching retail customers, a transformation from the 1990s, when the challenge was integrating telephone, internet, and direct mail.

Mobile is Expanding:
“… over one billion of the worlds 4+ billion mobiles phones are now smartphones, and 3 billion are SMS enabled (weirdly, 950 million mobile phones still don’t have SMS capabilities). In 2014, mobile internet usage will overtake desktop internet usage and already in 2011, more than 50% of all “local” searches are done from a mobile device…”

Read more:
Location Marketing Goes Hand in Hand with Real Time Marketing:
For real-time marketing to be of value, the marketer has to know the current location of the consumer.  For an interesting collection of infographic-style information sharing, see

Real-time marketing:
The value of personal time has risen manyfold in this new world, precisely because of mobile technology and location-identification technology; we always feel as if we are being monitored and followed and expected to be responsive to requests.  The upside is that marketers who leverage real-time marketing offers can become disruptive organizations.  In the 90s we were hesitant to say ‘real time’, then slowly started saying ‘near real time’, and now organizations are bold in using the phrase ‘real-time offers’.

All of the above is putting extraordinary pressure on collecting and using huge amounts of data that need to be organized and used intelligently, in a way that is mindful of consumers’ privacy, and yet lets you be the first to engage consumers in a way so meaningful and relevant that they will not only become acquisitions but also grow more loyal.

So the changing face of CRM is all about marketers responding to the loud signals from consumers:
– Right consumer
– Right location
– Right offer
– Right Time
– Right Channel (in retail, there are 7 channels of delivery)
– Right communication media

and BIG data is the key to this paradigm in terms of technology and tools.

So which sector is catching up faster?

Here it is from McKinsey analysis reports; though somewhat old, these priorities almost hold good even today.

From Data Monster & Insight Monster

Weigh In Transaction Data With Consumer Survey Panels

Sometimes, the value of consumer survey panels outweighs that of transaction data.

Let us quickly see some powerful popular panels.

NIH creates many health-related panels.  Perhaps they have a lot of added potential under the new ACA (by now everyone should know that it is Obamacare); I will bring those out in a separate article.

As a sample for discussion, here is GfK MRI’s Survey of the American Consumer; what a treasure trove of data!

Take a look at the survey questionnaire that is provided here.

Do you want to know more about the mechanics of the survey?  Here it is.

Often, market researchers as well as highly tactical one-to-one direct marketers do not do enough justice to this data.  Typically, most of them stop after getting strategic insights from panel surveys.

The unknown secret is the power of using these panels for one-to-one targeting in direct marketing, online or offline.  Here is a way to extend the addressability of intelligence from any small scientific survey to the national population of 240 MM adults.
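One simple mechanism behind projecting a small survey to the national adult population is post-stratification weighting: each respondent stands for the number of adults in their demographic cell divided by the number sampled from that cell. The cell definitions and counts below are invented purely for illustration:

```python
# Toy post-stratification: weight each respondent by
# population_count / sample_count in their cell, so the weighted sample
# projects back to the full adult population. All counts are invented.
population = {"18-34": 60_000_000, "35-54": 100_000_000, "55+": 80_000_000}
sample =     {"18-34": 300,        "35-54": 400,         "55+": 300}

weights = {cell: population[cell] / sample[cell] for cell in population}

# A respondent in the 35-54 cell "stands for" this many adults:
print(round(weights["35-54"]))

# The weighted sample sums back to the national adult population:
total = sum(weights[c] * sample[c] for c in sample)
print(f"{round(total):,}")
```

Real panel projection adds refinements (raking across several dimensions, non-response adjustment), but this is the core of how a 1,000-person survey speaks for 240 MM adults.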

From Data Monster & Insight Monster