Monthly Archives: December 2012

Visualization Has Handicaps – Why Algebra and Equations Are Easier In a Sense-Case of Seven Dimensional Segmentation Scheme

Are we missing an important lesson about visualization?  Visualization is almost becoming a kind of "porn" in the dictionary sense (television shows, articles, photographs, etc., thought to create or satisfy an excessive desire for something, especially something luxurious). There is a good reason for its recent popularity: the need to summarize big data, where data dross can be a problem.

The following discussion is useful as a generic argument for why students should learn algebra and equations rather than rely only on geometrical explanations. That said, I loved geometry as a kid, and visualization definitely has a role in reducing complex communication to simple, easy-to-understand conclusions.

So the topic is not whether visualization has benefits.  It is about using it in meaningfully useful ways.

I explain this using the following 7 dimensional visual representation.

A segmentation scheme typically uses many variables to condense the whole population into, say, anywhere from 4 to 10 homogeneous groups, and even 10 is a lot if we have to create distinctive marketing programs.  I know of segmentation schemes with anywhere from 40 to 80 groups, because more groups can explain more variation; however, applying those large segmentation schemes directly requires further grouping based on some additional data.

So in the end this results in anywhere from 4 to 10 segments for developing and applying marketing plans.

So let us say we have a seven class segmentation scheme, based on probably 15 generic (foundational) variables that are commonly available.

Or a much simpler 4-dimension Venn diagram.

Phew! – I went around so many times in the picture, trying to figure out whether it has all the intersecting sets or not (in case I want to use those pictures for explaining some of the ideas).


The second picture shows clearly how the intersecting sets are represented, and it is simpler to work with than the seven-dimension Venn diagram.  For example, the intersection of set A with set D is the bottom-most region.

The purpose of a Venn diagram is to show geometrically the number of units that belong to intersections of sets or to non-intersecting sets (which is a union of pure intersecting sets). Note that the full set that gives rise to these subsets and their intersections is called the ‘Omega’ set. In the second example it is set 16 plus all the numbers in the mutually exclusive sets created by the various intersections of the four primary sets. The primary sets are A, B, C, D in the above 4-dimension (second picture) Venn diagram.  An intersecting region means the elements in that region are members of all the intersecting sets.

Can you see how quickly the conversational meaning of the intersecting and non-intersecting regions becomes complicated as the number of dimensions increases from 4 to 7?  However, when you draw the picture, it is not easy to see the various intersecting regions. 4 dimensions give 2^4 = 16 possible distinct sets.  For 5 dimensions it is 32, for 6 dimensions it is 64, and for 7 dimensions it is 128 distinct intersecting sets.  These 128 distinct sets are supposed to be captured by the top picture.  The bottom picture shows 15 bounded regions plus the one outside the union of all those bounded sets.

Imagine the 128-set representation for 7 dimensions.

The only accomplishment of the above tabulation is to show you how to achieve the systematic symbolism of getting all possible intersections and what sets contribute to what representative intersecting sets on the left.   A better representation could be actual counts in those places where there are ones. The important point is that in this representation, you won’t go round and round to figure out which representative intersecting cell has how many counts.  Note that in practical life, the conversational sets (a combination of these intersecting sets which could be rewritten as and/or/not/nor conditions) could be much higher than this.
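As a rough illustration of the systematic symbolism described here, the following Python sketch lists every region of an n-set Venn diagram as a row of 0/1 membership flags and prints an algebraic label for each. The function names (`venn_regions`, `label`) and the `~` notation for "not in the set" are my own assumptions for illustration, not the layout of the original tabulation.

```python
from itertools import product

def venn_regions(set_names):
    """One row per region: a tuple of 0/1 flags, one flag per primary set."""
    return list(product((0, 1), repeat=len(set_names)))

def label(row, set_names):
    """Write a region algebraically, e.g. 'A & ~B & ~C & D' (~ means 'not in')."""
    return " & ".join(name if bit else "~" + name
                      for name, bit in zip(set_names, row))

names = ["A", "B", "C", "D"]          # the four primary sets
regions = venn_regions(names)

# The all-zero row is the region outside every primary set.
for row in regions:
    print(row, label(row, names))

# Number of distinct regions for 4 to 7 primary sets.
region_counts = {n: 2 ** n for n in (4, 5, 6, 7)}
print(region_counts)
```

Replacing the 0/1 flags with actual counts per row would give the "better representation" mentioned above, and the same loop works unchanged for seven sets, where drawing the picture becomes impractical.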

Where am I going with this?
Now think about how to work with, say, a 7-dimensional marketing segmentation (non-overlapping sets) that is characterized by, say, 15 generic variables, such as age, income, home ownership, residential blocks, … The total number of disjoint sets is 2^15-1, that is, 32,767 disjoint sets that contribute to these 7 non-overlapping marketing segments.
So if you want to express these seven segments in terms of these 32,767 foundational characterizing sets, you can imagine we will go crazy trying to visualize them, but it is an easier task to list them algebraically and combine those sets to explain them in conversational English.
This also has implications in design of experiments where we want to explain how to identify equivalent folds of various fractional factorial designs.
From Data Monster & Insight Monster

Segmenting the population, an interesting way

Why:   The Creative Thinker. Creativity does not mean just visuals; this is a philosopher too (philosophy: the science of logic). If you know the science of a subject and can explain it with all the tools of science, you become a scientist; if you know the inner connections of its various components and want to express them in ways that are new and heretofore unknown, breaking out of the limitations of communication, you become an artist.

When:  A magician and circus master with a project management band (magic stone, wand, or whip, depending on the situation and the animal on the spot). With time as the touchstone, this person enjoys showcasing how to create fire in various ways to keep us all warm and lively, and will bring together resources and budget in more creative ways than one can possibly imagine or argue.

What: The Curious Mind, still a child, soaking in all the wonders around and trying to make connections; wobbling, willing to make that mistake, but always growing in capacity and in the joy of creating a new world.

How: The Engineering Mind, a hard-core, hands-on analyst. The world will steal from you for the professional interpretations, the woodwork you make, and the joys you can create in the child, politicians, business people, and your own reflections in others.  You are the creator, and your creations with “Why” can become world famous in the natural world, but you have to work with “When” if you want the rewards to be in dollars and cents.

Where:  The Strategist, often confused and finding it difficult to distinguish between doer and thought leader, with only a very thin veil separating the two.

Whom:  The parent, who needs to control the politics of communications and the interconnections of the human soul, and who sometimes forgets to count oneself in, like the photographer who cannot be part of the group photo; it is all about what you have got: the times and what you have in hand.  This is apt to be the CEO, and truly the selfless CEO who brings out these goodies in others.

A New Definition of CRM System

CRM has come through a few generations of development, starting with Peppers and Rogers, who began by saying: reach and influence customers with the right message, at the right time, with the right offer, for the right price. The paradigm was called customer relationship management.

It helped us evolve from sales centric to product centric to business centric, with tinges of customer centric.  Is it fully consumer centric?  I will definitely say NO.

The four strategic platforms that come together, from the point of view of strategic management, for managing customers are

– contact system – CRM contact management
– content system – CRM content management
– Intelligence system – CRM analytics
– business system – CRM business management

However, let us turn this into an integrated system of levers and gears from the point of view of consumers (current customers – customer base – and yet to become customers – prospect base)

From the consumer's point of view, it is about this: delight me with the best offer, one that absolutely, positively takes away the pain of decision, so that I spend less time searching, deciding, and buying something I will not regret having spent my money on.  If that is indeed the issue, it is partially masked by brand consideration and brand value, which is a complex topic, but we will come to it once we clear up the remaining parts.

So no wonder people painfully and yet practically stopped with the wonderful practical definition of Peppers and Rogers.

What is it that stops us from making the next move in the CRM revolution?

A bridge that will integrate the brands and consumer demand world.

Modeling Customer Behavior Path

Finally, the marketing industry is catching up with the right problem, namely seeing and understanding that a single-event customer behavior is one node in a network of behaviors that consumers go through in their decision making.  In other words, it is a Bayesian Network problem.  The network is strongly stable, usually, as the capture of the network happens more and more maturity of the

Heterogeneity adjustment for Prediction

Consider a situation where there are 12 groups of observations spread across various age groups (less than 24, 26-35, 36-55, above 55), each resulting in a different probability, or odds, of becoming an entrepreneur.

Thus, the probabilities are P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, and P12, where P1, P4, P7, and P10 are at the mean of the age classes.

However, based on the estimated predictive equation, which is shown as the red curve, the estimated probabilities of these 12 observations will be P1 (same as P2 and P3), P4 (same as P5 and P6), P7 (same as P8 and P9), and P10 (same as P11 and P12).

So if we rank them using the predicted equation, the ranking of the probabilities will be P4, P7, P1, P10.

Graphically, this represents the following 12 identified observations on the two-dimensional plane.  There is no other extra information to improve prediction.  Now consider the situation where the conditional distributions of the probabilities, given the age classes, have varying variances, not because of the mean probabilities but inherently, for other reasons including unobservable elements, errors in observation, or specification error.

However, if I could make an adjustment bringing in their heterogeneous distribution variances for each of the classes, I am more likely to get the original ranking which is

P6 > P9 > P4 > P7 > P3 > P1 > P8 > P12 > P2 > P5 > P10 > P11

The heterogeneity unadjusted predictive equation becomes

Bringing in the heterogeneity in to the equation:

We have left hand side+variance(p|Age Class)

Since the variance of the conditional distributions is determined by the mean probability, more specifically by p(1-p), the one-sigma distance will be at its maximum when p is close to 0.5.  This will improve prediction for a certain percentage of the observations, and that percentage becomes smaller as the number of records per age class grows.

The problem with this ranking is that all the observations within a given class will have only three different predictions.
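To make the p(1-p) statement concrete, here is a small Python sketch that evaluates the one-sigma distance on a grid of probabilities and shows how the standard error of a class mean shrinks with the number of records. The function names and the grid are illustrative assumptions, and the binomial form of the variance is taken directly from the p(1-p) expression in the text.

```python
def bernoulli_sd(p):
    """One-sigma distance for a 0/1 outcome with success probability p."""
    return (p * (1.0 - p)) ** 0.5

def standard_error(p, n):
    """Standard error of the observed proportion in a class with n records."""
    return bernoulli_sd(p) / n ** 0.5

# Evaluate sigma on a grid 0.05, 0.10, ..., 0.95 and find where it peaks.
grid = [i / 20 for i in range(1, 20)]
sd_by_p = {p: bernoulli_sd(p) for p in grid}
p_at_max = max(sd_by_p, key=sd_by_p.get)
print("sigma is largest at p =", p_at_max)

# The same p gives a tighter band as the class gets more records.
for n in (10, 100, 1000):
    print(n, round(standard_error(0.5, n), 4))
```

The shrinking standard error is one way to read the remark that the adjustment helps a smaller share of observations when each age class has many records.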

Now let us look at the following adjustment to the predictive equation.

Use the ranking index differential at the individual level for the conditioning variable.   In the full implementation you will incorporate the ranking differential from each of the independent variables, not only to increase predictive accuracy but also to decrease predictive error (false positives).  Using the ranking index differentials from all the independent variables as additional information to stabilize the results and decrease the predictive error is important because in practice there is more than one independent variable in the model.  Also, we can locate the optimal number of ranking differentials to use to increase the predictive accuracy.

The important point is that we can improve predictive accuracy (true positive) and decrease predictive error (false positive) by additionally incorporating the heterogeneous distribution of probabilities defined by the conditioning variables.

Minimum Best Practices Model Performance Reports – Part 1 – Logistic Regression

For the simple classification problem, the following steps and model performance reports should be covered as minimum best practices. 

  • The variable importance analysis – One can use WOE (weight of evidence), the KS measure, information gain, the univariate area under the false positive versus true positive trade-off curve, correlation, and the density (volume) of targets to evaluate 1000s of variables and rank them in terms of univariate importance.  Most of the time these metrics rank the variables almost the same. We create multiple metrics to suit different analytics cultures.
  • Weights adjusted for the prediction – If weights that adjust for the representativeness of the data are not used, or are not used properly, the process can actually bias the results
  • Co-linearity analysis – Co-linearity among independent variables affects not only the signs of the variables, making them difficult to interpret, but also the stability of variables in the model
  • Variable stability analysis in training – While pushing the limits on the number of variables to maximize predictive accuracy, some of the variables at the tail end of the importance list are likely to move in and out of the retained list of variables in the model just because of sampling variation in the training data. This gets acute if co-linearity is also extensive
  • Sampling errors in sample generation among Train/Test/Validation results – A different variation of the problem mentioned in (4) is the one that affects the variation in the performance of train/test/validation metrics.  So we look for the stability of the KS measure across the Train/Test/Validation samples.  The sample that contributes the most variation among the Train/Test/Validation results is given the least importance.   The samples that contribute the least variation are checked for repeated stability and chosen as the best results
  • Validation metrics such as AUROC, ROC graph, concordance/discordance, KS, Lift chart are provided.  Ultimately what matters is that the performance metrics are performing and they do not differ from the training/testing results significantly
  • Partial decile reports for each of the independent variables.  This gives stability, smoothness, and variable importance in one shot as a partial ROC representation of results
  • Frequency tabulation of each of the variables surviving in the final model against the dependent (target) variable.   This gives the volume (density) of targets spread between the presence and absence of the slice of the independent variable
  • Partial variable importance metrics for variables that survived in the model.
  •  The above minimal model performance reports are defined for the binary classification (binomial) models.
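Two of the validation metrics named in the list, AUROC and the KS measure, can be sketched in a few lines of plain Python. This is a minimal illustration on made-up scores and labels (both hypothetical), using the pairwise-comparison definition of AUROC and the maximum gap between the cumulative score distributions of targets and non-targets for KS; a production report would normally rely on a statistics package instead.

```python
def split_scores(scores, labels):
    """Separate model scores into targets (label 1) and non-targets (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg

def auroc(scores, labels):
    """Share of (target, non-target) pairs ranked correctly; ties count half."""
    pos, neg = split_scores(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos, neg = split_scores(scores, labels)
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Hypothetical validation sample: predicted probabilities and actual outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print("AUROC:", auroc(scores, labels))
print("KS:", ks_statistic(scores, labels))
```

Running the same two functions on the train, test, and validation samples and comparing the results is one simple way to check the stability discussed in the list.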

    When we build models for continuous dependent variables, how do we build and interpret the above model performance reports?  For example, GLM is a collection of models where the dependent variable follows an exponential family distribution.  GLM contains logistic regression as one special case.  However, it is not straightforward to produce similar model performance reports.

    We will continue this exploration in Minimum Best Practices Model Performance Reports – Part 2 – GLM Models


    Purchase Path Models

    It is common understanding in retail, CPG, or any industry for that matter that consumers go through ‘search’, ‘compare’, ‘be prepared – if it comes to getting a loan, for example’, and ‘purchase’, a process consumers follow as a purchase path decision moves along.  The simplified version of this is what is called a purchase funnel.  One version of such a purchase funnel, for digital marketing for example, is the following.

    Funnel Model
    However, a recent article by John Coleman describes ‘The Purchase Path‘ as four stages of consumer relationship management, while advising ‘You need to engage with prospects and customers every step of the way, and do it in a real-time, highly customized manner, if you hope to make it to the end of a much narrower and more cluttered path to purchase‘.

    In a variation of funnel type CRM purchase behavior model, Coleman suggests
    – wonder zone – facing a confused state and time dealing with multitude of channels of information to resolve their choices of channel preferences and organizing their thoughts, preferences, search, and information, as one swims through the preference likelihoods given the market conditions
    – evaluation zone – becoming more specific with needs/priorities/actions past the bewilderment; more specific retail/online/brick store decisions are being made.
    – select zone – making best choice of brand/product/channel and price
    – happy zone – post purchase experience zone
    In the following I present the purchase path as a Network Model, as the consumer travels between search and transact, and marketers painfully move from the total ignorance zone (for new consumers) to cultivating consumers into loyal consumers who become brand advocates.

    A sample of actual focus group intelligence is explained with the above model using archetypical consumer types.  Note that the typical funnel model happens with (2), (3), (4), (5), (6), (7) in a linear fashion.

    – Ex1: The consumer is heavily reliant on word of mouth recommendation from highly trusted friends and only will buy brand products.  However, the person has the capacity to wait for right market time. In a sense, the consumer wants to plan for the whole year and wait for the right sale.  Usually, he is also a brand ambassador.  The consumer is highly price sensitive and brand sensitive.  He is a self employed individual with two children and works in technology and marketing.
               For more expensive and luxury products, the consumer will also use web to search for product information on options, and product recommendation, and price variations.  The individual does not use the stage numbered as 4.  The individual mentioned this as archetypical behavior path.

    – Ex2: The consumer was a homemaker with children, talking about pharmaceutical lifestyle products.  She usually gets information from the doctor, health channel information sources, and health care experts.  She also actively collects information from SN sites. She does not use other web/telephone channels.  She does not look or wait for price optimization.  She acts as an active brand ambassador for products across many verticals.  The individual is a very brand conscious person.  The consumer manages

    – Ex3: The consumer is a very busy business man; does not act as brand ambassador except for casual recommendation of products, and a highly brand sensitive person. The consumer is not price sensitive and does not wait for special sales and offers for any products in general.   But very selective in product.  He goes by word of mouth recommendations and does web search, does not use SN or telephone to talk to the product marketing telephone support

    – Ex4: The consumer is a winter bird and market interactions are family activities; he was more interested in food, travel, entertainment, and health.  Heavily relies on SN sites and word of mouth.  Not brand sensitive but very price sensitive.  Works often as brand/product advocate to the population above age 60.

    -Ex5: The consumer is a young and ambitious salaried person, married, with plans to have children and make a home in the suburbs.  The individual is not brand sensitive but is price sensitive, still lives with basic needs, and does not care about luxury products, though he has a big dream of building a big American house. He uses the web heavily but not SN sites, besides word of mouth recommendations. He does not act as an ambassador for any products except for casual recommendations at the office

    -Ex6: The consumer is young, single and very brand and price conscious; spends a lot of money on sports and adventures; buys expensive brand watches, cars, and travels a lot besides taking care of his business.  Very health conscious and loves fish and beef.


    The Missing Value Phenomenon – Imputation is More Powerful – But Do Not Do It Naively

    The missing value problem occurs often, is frequently over-simplified, and is handled without knowing that one may be making the effect of missing values worse than before.

    It is all a question of what matters to you: bias due to not using the complete set of observations, or bias due to using some method without knowing its implications.

    By just imputing the average or the median you can, in all likelihood, be making it worse.  There is a reason a value is missing, and it is not because it is the average.  That reason is exhibited by the way missingness occurs across the various slices of the multidimensional observation space, and using those slices is the best way to impute missing values; the higher the number of dimensions used in slicing the data, the more robust the representation for capturing the missing values.
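The difference between the naive fill and a slice-based fill can be shown with a small Python sketch. It is only an illustration under stated assumptions: the records, the field names (`age_band`, `owner`, `income`), and the choice of a within-slice mean are all hypothetical, and this is a much simpler stand-in for the multiple-imputation procedures mentioned later in the post.

```python
from statistics import mean

def impute_overall_mean(rows, target):
    """Naive approach: fill every missing value with the overall mean."""
    fill = mean(r[target] for r in rows if r[target] is not None)
    return [{**r, target: fill} if r[target] is None else r for r in rows]

def impute_by_slice(rows, target, keys):
    """Fill a missing value with the mean of its own slice (same key values).

    Falls back to the overall mean when a slice has no observed values.
    """
    observed = {}
    for r in rows:
        if r[target] is not None:
            observed.setdefault(tuple(r[k] for k in keys), []).append(r[target])
    overall = mean(v for vals in observed.values() for v in vals)
    result = []
    for r in rows:
        if r[target] is None:
            vals = observed.get(tuple(r[k] for k in keys))
            r = {**r, target: mean(vals) if vals else overall}
        result.append(r)
    return result

# Hypothetical records; income is missing once in each slice.
rows = [
    {"age_band": "young", "owner": False, "income": 30.0},
    {"age_band": "young", "owner": False, "income": 34.0},
    {"age_band": "young", "owner": False, "income": None},
    {"age_band": "older", "owner": True, "income": 90.0},
    {"age_band": "older", "owner": True, "income": 110.0},
    {"age_band": "older", "owner": True, "income": None},
]

naive = impute_overall_mean(rows, "income")
sliced = impute_by_slice(rows, "income", ["age_band", "owner"])
print([r["income"] for r in naive])
print([r["income"] for r in sliced])
```

In this toy data the overall mean puts the same middle value into both slices, while the slice-based fill stays close to the values actually observed in each slice; adding more keys corresponds to slicing along more dimensions.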

    Here is one study to help in that direction, especially in the direction using a method and seeing how it affects the standard error.

    Maximum Likelihood Parameter Estimation with Incomplete Data

    Besides the authors' macros, two key procedures in SAS that help you do that are

    PROC MI  and

    See the following video for a quick understanding.

    The key point I want to bring to you is that a method with imputation, provided it is not a naive one, is more powerful in detecting signals and in prediction than one without missing value imputation.