Monthly Archives: June 2010

Unusual Multivariate Correlation Matrix in a Class of Interest Variables

It is unusual to get a correlation matrix where all the off-diagonal elements are negative. Here is an actual example, with the numbers changed without losing the flavor of the example. In fact, the original is a 16 x 16 matrix and all the inter-correlations are negative.

What does this mean? What would be its impact on factor analysis, a key application of correlation matrices?

In this small collection, I un-highlighted the numbers that fall outside the range -0.03 to +0.03 (considering them to be high). Since there are close to 100,000 observations, even correlations of 0.03 matter.

So V3 is negatively correlated with each of the other variables; that is, more of V3 goes with less of V1, V2, and V4. If this is happening in a collection of interest categories, it means that those who are interested in the V3 category are less interested in the rest of the categories (in this small 4-class example). How should we interpret this? Perhaps there is only one dimension, namely V3, at an appropriate level of correlation.
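Two quick checks help put these numbers in context; the following is a minimal sketch, not the analysis behind the original matrix. The first function uses the standard t statistic for a correlation to show why 0.03 is far from noise at roughly 100,000 observations. The second uses the fact that a valid correlation matrix is positive semidefinite, so the average off-diagonal correlation among k variables can never be lower than -1/(k-1); with 16 variables that floor is about -0.067, which is why an all-negative matrix necessarily contains small correlations. The function names are my own, chosen for illustration.

```python
import math


def correlation_t_stat(r: float, n: int) -> float:
    """Test statistic for H0: true correlation is zero.

    t = r * sqrt(n - 2) / sqrt(1 - r^2); for large n it behaves
    like a standard normal z-score.
    """
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def mean_correlation_floor(k: int) -> float:
    """Lowest possible average off-diagonal correlation among k variables.

    The sum of all entries of a positive semidefinite correlation matrix
    is non-negative: k + k * (k - 1) * mean_r >= 0, so mean_r >= -1/(k-1).
    """
    return -1.0 / (k - 1)


n_obs = 100_000  # "close to 100,000" observations, as stated in the post
t_stat = correlation_t_stat(0.03, n_obs)
print(f"t statistic for r = 0.03 at n = {n_obs}: {t_stat:.2f}")
print(f"floor on mean correlation, 16 variables: {mean_correlation_floor(16):.4f}")
print(f"floor on mean correlation, 4 variables:  {mean_correlation_floor(4):.4f}")
```

A t statistic above 9 is well beyond any conventional significance threshold, which supports treating correlations of 0.03 as meaningful at this sample size, even though they remain small in practical terms.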

Basics of HADOOP, MapReduce Solution Architecture for Mega Databases

See the basics with these videos:

Top companies using Hadoop for deep analysis of their complex, unstructured, humongous databases are:

Facebook, Yahoo, Google, Twitter, LinkedIn, …

“…using Hadoop, they are able to do deep analysis of data as they capture to digest it and summarize it and load it into older generation ‘analytics structures’ for visualization … ‘and other sophisticated analysis’ …”

with all kinds of human data (personal/systems data) for analysis such as

audience understanding, behavioral data understanding and targeting, forecasting and optimization, fraud detection and abuse detection, population surveillance, social network maps and social graph understanding of individuals, click stream analytics of terabytes or petabytes of data, fleets of trucks and their streaming data, …

The five-part lecture series presented by Google engineers addresses:

- Distributed computing is still critical, yet how do we make the computations massively parallel in a coordinated way?
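To make the MapReduce idea concrete, here is a minimal single-process sketch of the three steps (map, shuffle, reduce) applied to a word count. It is plain Python, not the Hadoop API; the function names and the two sample documents are made up for illustration. In a real cluster the map and reduce calls would run in parallel on many machines, and the framework would handle the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    # Each document could be mapped on a different machine.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    # Each key could be reduced on a different machine.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["big data big insight", "data monster"]  # hypothetical input
counts = word_count(documents)
print(counts)
```

The coordination the lecture bullet asks about lives in the shuffle step: map and reduce tasks never talk to each other directly, so they can be scaled out independently.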

From Data Monster & Insight Monster

What is, What will be, What if, What alternatives (dynamics), What Action … – Different Phases of Solving Problems

What is happening, What will be happening as it is, What alternatives are available and how to pick the right paths, What if analysis of different possibilities for analysis of dynamics, What could be done to move to a better equilibrium…

Descriptive – Passive “What is happening” (D)
Diagnostic – Probing “What is happening and why” (Di)
Predictive – What will be happening as it is (P)
Simulation – What-if analysis of different possibilities for analysis of dynamics – how the different states of P change when we bring in outside-the-structure thinking. This is very strategic as well as computationally heavy. P(s1,s2,…,sk): different combinations of s1,s2,…,sk provide different predictive solution paths
Testing – What alternatives are available and how to pick the right paths; which combination of s1,s2,…,sk to pick for implementation
Prescriptive – Action – combine different s1,s2,…,sk levers along with different exogenous factors to come up with the prescriptive plan
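The simulation, testing, and prescriptive phases can be sketched in a few lines. This is only an illustration under stated assumptions: the three levers, their levels, the `demand_shock` exogenous factor, and the profit formula inside `predict_profit` are all invented stand-ins for a real predictive model P(s1,s2,…,sk).

```python
from itertools import product

# Hypothetical levers s1..s3 and the levels we are willing to try.
levers = {
    "price_cut_pct": [0, 5, 10],
    "ad_spend_k": [10, 20],
    "channel": ["email", "web"],
}


def predict_profit(price_cut_pct, ad_spend_k, channel, demand_shock=1.0):
    """Stand-in for the predictive model P(s1, ..., sk).

    The formula is made up; demand_shock plays the role of an
    exogenous factor outside our control.
    """
    units = 10_000 * demand_shock
    units *= 1 + 0.04 * price_cut_pct   # price cuts raise volume
    units *= 1 + 0.01 * ad_spend_k      # advertising raises volume
    if channel == "web":
        units *= 1.1
    margin_per_unit = 10.0 * (1 - price_cut_pct / 100)
    return units * margin_per_unit - ad_spend_k * 1000


# Simulation: evaluate P for every combination of lever settings.
scenarios = []
for combo in product(*levers.values()):
    settings = dict(zip(levers.keys(), combo))
    scenarios.append({**settings, "profit": predict_profit(**settings)})

# Testing / prescriptive: pick the combination to implement.
best = max(scenarios, key=lambda s: s["profit"])
print(f"{len(scenarios)} scenarios simulated; best plan: {best}")
```

The number of scenarios grows multiplicatively with the number of levers and levels, which is why the simulation phase becomes computationally heavy so quickly.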

Using Prediction Market for Product Innovation

Prediction markets are currently popular for managing corporate-level parameters and processes. Some examples follow, though there are some exceptions.

Read the Wikipedia introduction on “Prediction Markets”

– HP uses an internal, employee-based “prediction market” to predict RAM prices.
– …

However, we in the “consumer-centric” camp can also use these markets for “product innovation”.

We will use a filtering mechanism to let the huge consumer base we may tap into (an email list) self-filter by signing up for a product innovation market. We then bring it down with a sampling scheme (maybe stratified by demographics or other expected success parameters) to a sample of, say, a few thousand (for all practical considerations, a few hundred should be good enough, as is typical of conjoint analysis).
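The stratified sampling step above can be sketched as follows, assuming proportional allocation across strata. The `age_band` field, the 9,000 opted-in consumers, and the target of 300 are hypothetical; a real scheme might stratify on several demographics or oversample small strata.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(consumers, stratum_key, total_sample_size, seed=0):
    """Draw a proportionally allocated stratified random sample.

    Each stratum contributes in proportion to its share of the
    population, with at least one member per stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for consumer in consumers:
        strata[consumer[stratum_key]].append(consumer)

    population = len(consumers)
    sample = []
    for members in strata.values():
        quota = max(1, round(total_sample_size * len(members) / population))
        quota = min(quota, len(members))  # cannot draw more than exist
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical opted-in consumers, evenly spread over three age bands.
age_bands = ["18-34", "35-54", "55+"]
consumers = [{"id": i, "age_band": age_bands[i % 3]} for i in range(9000)]

sample = stratified_sample(consumers, "age_band", total_sample_size=300)
print(Counter(c["age_band"] for c in sample))
```

Because of rounding, the final sample size can differ slightly from the target when strata are uneven; for a market of a few hundred participants that is usually acceptable.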

Create a mechanistic game to optimize the information content of the product innovation engagement game.

CPG Analytics Opportunities – Layers of relationship with a consumer – A Funnel Segment Methodology

In this article, we define the market opportunities, trends, data collection methods, and approaches to developing and profiting from consumer insights (analytics or intelligence).

CPG, a two-trillion-dollar market in the USA, is a fascinating, fast-changing industry within the USA and a non-standardized industry outside the USA, especially in Asia, where half of humanity lives. Almost 80% of product innovations fail.

The top CPG companies are:

Procter & Gamble $82 B Cincinnati, OH
UNILEVER N V N Y [UN] $52.4 B Rotterdam, Netherlands
PEPSICO INC [PEP] $35.1 B Purchase, NY
SARA LEE CP [SLE] $16.3 B Chicago, IL
GEN MILLS INC [GIS] $12.0 B Minneapolis, MN
KELLOGG CO [K] $10.9 B Battle Creek, MI
DEL MONTE FOODS CO [DLM] $3.2 B San Francisco, CA
MCCORMICK & CO [MKC] $2.7 B Sparks, MD
CORN PRODUCTS INTL [CPO] $2.6 B Westchester, IL

How are they going to address the rising new trends in CPG, and what are the new trends? Here is a fascinating site (with a well-collected set of advertisement pictures) that discusses 8 consumer trends published in January 2008.

The PDF version of this very well-written trends report is available for download here.

This site is impressive.

Looking at this site, it is clear that the web is the medium of communication for sharing information to begin a consumer relationship, incrementally tease the relationship along, start selling, create loyalty, expand the consumer base, and innovate. Web channels are easy to expand, less expensive, and reach global consumers in no time.

These things point to seven layers of consumer relationship, with accompanying data gathering and data intelligence. The consumers are in effect segmented by a funnel relationship; it is all about knowing your consumers and creating and thriving on the relationship with them, all the way up to creating a cult of your consumers. Apple is a great example. There are many variations of this, and you do not have to go to the cult level. The basic spirit of the funnel methodology is as follows.

– know who is out there, and who enters and who does not enter your click vs. brick stores – those coming onto your radar; this is just a cookie-level relationship. Know (using simple visitor stats and an understanding of what attracts persistent visitors) and attract first-layer consumers and testers; the layers that follow involve mechanistic designs of consumer rewards and interactions.

– know who could be converted to providing an email address for more involved engagement (second layer of relationship)

– know who would like to commit to the brand (RSS/BLOG/two line questions) – third layer of relationship

– know who is willing to buy (fourth layer of relationship)

– who is loyal (coming back second time – fifth layer)

– who is willing to recommend (sixth layer)

– who is willing to influence product innovation (seventh layer relationship)
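The seven layers above can be turned into a simple segmentation rule: assign each consumer to the deepest layer they have reached. The sketch below assumes a handful of hypothetical tracking fields (`visited`, `gave_email`, and so on) and hypothetical thresholds, such as two purchases for loyalty; real definitions would come from the data actually collected.

```python
from dataclasses import dataclass
from enum import IntEnum


class FunnelLayer(IntEnum):
    NONE = 0
    VISITOR = 1             # cookie-level relationship
    EMAIL_SUBSCRIBER = 2    # provided an email address
    BRAND_COMMITTED = 3     # RSS / blog / short surveys
    BUYER = 4               # willing to buy
    LOYAL = 5               # came back a second time
    RECOMMENDER = 6         # willing to recommend
    INNOVATION_PARTNER = 7  # influences product innovation


@dataclass
class Consumer:
    # Hypothetical tracking fields, named for illustration only.
    visited: bool = False
    gave_email: bool = False
    follows_brand: bool = False
    purchases: int = 0
    referrals: int = 0
    innovation_inputs: int = 0


def deepest_layer(c: Consumer) -> FunnelLayer:
    """Return the deepest funnel layer whose condition the consumer meets."""
    checks = [
        (FunnelLayer.INNOVATION_PARTNER, c.innovation_inputs > 0),
        (FunnelLayer.RECOMMENDER, c.referrals > 0),
        (FunnelLayer.LOYAL, c.purchases >= 2),
        (FunnelLayer.BUYER, c.purchases >= 1),
        (FunnelLayer.BRAND_COMMITTED, c.follows_brand),
        (FunnelLayer.EMAIL_SUBSCRIBER, c.gave_email),
        (FunnelLayer.VISITOR, c.visited),
    ]
    for layer, reached in checks:
        if reached:
            return layer
    return FunnelLayer.NONE


shopper = Consumer(visited=True, gave_email=True, purchases=2)
print(deepest_layer(shopper).name)
```

Counting consumers per layer then gives the funnel shape directly, and the drop-off between adjacent layers shows where the reward and interaction designs need work.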

We will look at some basic data collection and analysis, and at how to utilize the powerful household (HH) level data from companies that provide such data assets, a well-developed industry in the USA. Such data may not be available, or not so pervasively available, in other countries.


Be a Leader and Have a Richer Life

This is all about bringing desirable change to all four parts of one’s life: Work, Home, Community, and Self. First of all, this is not so much about balancing these four parts of one’s life as about integrating them.

The way to achieve this is

(1) Being real – understanding what matters most and being able to leave a legacy for life,

(2) Being whole – understanding the performance indicators that matter most to the people who matter in one’s life and the four parts that define those performance indicators, and

(3) Being innovative/creative – making the change that needs to happen – experimenting and making those changes that address (1) and (2) as part of daily life. See the audio interview with the author, Stew Friedman in

Everything About the Predictive Modeling Lift Curve

This week I want to share the following useful questions, which are actual questions I came across in my analytics work over the last 15 years. Some of the questions are difficult, so if you want to talk to me, send me a note.

· Though the KS is 40.4, the response model is not good, the analyst claims (possible/not possible)
· The top index is 220 and nicely decreasing and KS=15. The model is not good, the analyst claims (agree or disagree)
· Cumulative “good” vs. cumulative “bad”, which is captured in KS, provides a better measure of the predictive performance of a model (agree/disagree)
· Traditionally, the marketing analytics measure of model goodness is the true positive rate (True Accuracy) captured in the deciles. The total inaccuracy percentage is 100-(True Accuracy). Agree/Disagree
· In all marketing problems, the misclassification cost of a false positive is significantly more important or higher than the cost of a false negative (Agree or Disagree)
· In order to achieve better lift indices in the top deciles, you can increase the weights of target vs. reference; you will still get the same best model and the same ranking of consumers, but the lift index in the top deciles will increase – Agree/Disagree
· In every marketing problem, if you have high incidence rates then you can redefine the target variable creatively so that it becomes low incidence and hence the lift index will be more palatable to the client (Agree/Disagree)
· Area under the receiver operating characteristic curve (AROC), which is commonly used in other sciences, is a nice measure that can be used in marketing applications (Agree/Disagree)
· The penetration indices (decile-wise % accurate) are a better way of establishing the goodness or optimality of the predictive power of a model (agree/disagree)
· If your set-aside validation lift index is higher than the training/testing index, the model must be very good – agree/disagree
· In variable transformations for continuous explanatory variables, there is a higher-order polynomial which will always fit the data in the most desirable way – agree/disagree
· In the end what you do does not matter [it can even be a black box], as long as the set aside sample is behaving in the same way as the training and testing – as a modeling philosophy. Agree/Disagree
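For readers who want to experiment with these questions, here is a minimal sketch of how the decile lift index and KS are commonly computed from model scores. It assumes a lift index where 100 means the average response rate, and a KS expressed in percentage points and measured at decile boundaries (the exact KS over every score cut-off can be somewhat larger). The scores and responses are synthetic, generated only to exercise the code.

```python
import random


def decile_table(scores, outcomes, n_bins=10):
    """Rank by score (highest first) and summarize each bin.

    Returns rows of (bin number, lift index, KS gap), where
    lift index = 100 * bin response rate / overall response rate and
    KS gap = cumulative % of goods minus cumulative % of bads.
    """
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    total_good = sum(outcomes)
    total_bad = n - total_good
    overall_rate = total_good / n

    rows, cum_good, cum_bad = [], 0, 0
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        goods = sum(y for _, y in chunk)
        cum_good += goods
        cum_bad += len(chunk) - goods
        lift_index = 100 * (goods / len(chunk)) / overall_rate
        ks_gap = 100 * (cum_good / total_good - cum_bad / total_bad)
        rows.append((b + 1, lift_index, ks_gap))
    return rows


def ks_statistic(rows):
    """KS is the largest gap between the cumulative good and bad curves."""
    return max(gap for _, _, gap in rows)


# Synthetic data: response probability rises with the score.
rng = random.Random(42)
scores = [rng.random() for _ in range(5000)]
outcomes = [1 if rng.random() < 0.02 + 0.16 * s else 0 for s in scores]

rows = decile_table(scores, outcomes)
for decile, lift_index, ks_gap in rows:
    print(f"decile {decile:2d}  lift index {lift_index:6.1f}  KS gap {ks_gap:5.1f}")
print(f"KS = {ks_statistic(rows):.1f}")
```

Changing the incidence rate or the strength of the score in the synthetic data is a quick way to see how the top-decile index and KS move together, or fail to, which is what several of the questions above turn on.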
