Monthly Archives: March 2013

All things Merging – PROC SQL, The Boss

Some not-so-commonly-known gems. I will add a few more in the next revision; these are the ones I am reviewing today.

1. There have been many references on PROC SQL over time, and I think the following one is a great introduction to PROC SQL, especially for people who have some exposure to the Data Step MERGE statement.

An interesting segment of this document discusses the importance of the Cartesian product in SQL merges. It presents every type of merge as a logical consequence of the Cartesian product. It also stresses watching the merge output when there are one-to-many or many-to-many merges, a key reason people misjudge whether the different ways of merging (Data Step vs. SQL) are equivalent.
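To make the Cartesian product idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for PROC SQL. The tables `a` and `b` and their contents are made up for illustration. It shows that an inner join returns the same rows as the Cartesian product filtered on the key, and that a many-to-many key multiplies rows. The Data Step MERGE, by contrast, pairs rows sequentially within a BY group, so the two approaches should not be assumed to give the same output in this situation.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER, x TEXT)")
cur.execute("CREATE TABLE b (id INTEGER, y TEXT)")
# id = 1 appears twice in BOTH tables: a many-to-many situation.
cur.executemany("INSERT INTO a VALUES (?, ?)", [(1, "a1"), (1, "a2"), (2, "a3")])
cur.executemany("INSERT INTO b VALUES (?, ?)", [(1, "b1"), (1, "b2"), (3, "b3")])

# Full Cartesian product: every row of a paired with every row of b.
cartesian = cur.execute(
    "SELECT a.id, a.x, b.id, b.y FROM a CROSS JOIN b"
).fetchall()

# Cartesian product restricted to matching keys.
filtered = cur.execute(
    "SELECT a.id, a.x, b.y FROM a, b WHERE a.id = b.id ORDER BY a.x, b.y"
).fetchall()

# Explicit inner join: the same rows as the filtered product.
inner = cur.execute(
    "SELECT a.id, a.x, b.y FROM a INNER JOIN b ON a.id = b.id ORDER BY a.x, b.y"
).fetchall()

print(len(cartesian))     # 3 rows x 3 rows = 9 pairs
print(inner == filtered)  # True
print(len(inner))         # 2 x 2 = 4 rows for id = 1
conn.close()
```

The four rows for id = 1 are the point to watch: two rows on each side produce every combination, which is easy to miss when comparing counts against a Data Step merge.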

2. The second reference brings out interesting ways to merge and to create a weighted score for the matching variables.

This is a very powerful yet simple tool if you want to select observations for a matched-sample study.
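As a rough illustration of the weighted-score idea, here is a minimal Python sketch. It is not the method from the reference. The variable names, the weights, and the rule of keeping the single best control per case (with replacement) are all my assumptions. Each case is paired with every control, the pair is scored by the sum of the weights of the matching variables that agree, and the highest-scoring control is kept.

```python
from itertools import product

# Hypothetical weights: how much agreement on each variable is worth.
WEIGHTS = {"gender": 3, "region": 2, "age_band": 1}

# Hypothetical cases and candidate controls.
cases = [
    {"id": "c1", "gender": "F", "region": "N", "age_band": "30-39"},
    {"id": "c2", "gender": "M", "region": "S", "age_band": "40-49"},
]
controls = [
    {"id": "k1", "gender": "F", "region": "N", "age_band": "40-49"},
    {"id": "k2", "gender": "F", "region": "S", "age_band": "30-39"},
    {"id": "k3", "gender": "M", "region": "S", "age_band": "40-49"},
]


def match_score(case, control):
    """Sum the weights of the matching variables on which the pair agrees."""
    return sum(w for var, w in WEIGHTS.items() if case[var] == control[var])


# Score every case/control pair (a Cartesian product of the two files).
scored = [
    (case["id"], control["id"], match_score(case, control))
    for case, control in product(cases, controls)
]

# Keep the best-scoring control for each case (assumption: with replacement).
best = {}
for case_id, control_id, score in scored:
    if case_id not in best or score > best[case_id][1]:
        best[case_id] = (control_id, score)

print(best)  # {'c1': ('k1', 5), 'c2': ('k3', 6)}
```

In a real matched-sample study you would usually also set a minimum acceptable score and decide whether a control may be reused.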

3. The third reference is useful for seeing which file contributes what information. The indicators are similar to the (in=ind) option of the MERGE statement within a Data Step.
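Here is a minimal sketch of such contribution indicators, again using Python's sqlite3 as a stand-in for PROC SQL. The table names, the columns, and the flags `in_a` and `in_b` are hypothetical. A full merge is built as a left join plus the unmatched rows of the second table, and each output row carries two 0/1 flags playing the role of the IN= variables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE b (id INTEGER, spend REAL)")
cur.executemany("INSERT INTO a VALUES (?, ?)", [(101, "Ann"), (102, "Bob"), (103, "Cy")])
cur.executemany("INSERT INTO b VALUES (?, ?)", [(102, 250.0), (103, 80.0), (104, 40.0)])

merged = cur.execute(
    """
    -- Every row of a, flagged by whether b also has the key.
    SELECT a.id, a.name, b.spend,
           1 AS in_a,
           CASE WHEN b.id IS NULL THEN 0 ELSE 1 END AS in_b
    FROM a LEFT JOIN b ON a.id = b.id
    UNION ALL
    -- Rows that exist only in b.
    SELECT b.id, NULL, b.spend, 0, 1
    FROM b LEFT JOIN a ON a.id = b.id
    WHERE a.id IS NULL
    ORDER BY 1
    """
).fetchall()

for row in merged:
    print(row)
conn.close()
```

Filtering on the flags then gives the usual subsets: both flags set for the matched rows, or exactly one flag set for rows contributed by a single file.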

From Data Monster & Insight Monster

Segmentation Methods Best Practices – Part 2

Here are some guideposts that can be useful in developing and justifying a segmentation solution, even if one does not want to invest time in the mathematics of it.

I came to understand these only after lots of sweat and tears; this is where I felt people were uncomfortable with statistics. I wish our graduate programs taught these in a focused way, covering all the points and questions mentioned below carefully and insightfully. Because of various practical challenges, these pointers and questions get fast-forwarded. Since marketing uses segmentation so commonly, I feel a marketing analytics course should include the list of 10 below; or, if there are two marketing analytics courses, this should be one of them, with the other fully loaded with predictive analytics methods.

The words ‘Segmentation’ and ‘Clustering’ are used interchangeably in the following list.

1. How do you figure out which PROC to use: FASTCLUS, ACECLUS, or CLUSTER? Does it matter to discuss the equivalents in SPSS and R? Yes. However, that will be a separate topic.
2. What are the dynamics of the following concepts: missing values, winsorizing, and non-parametric methods in clustering? Every clustering project should navigate through these without getting entangled in the methodology!
3. How does imputation affect clustering? Oh, this one lands you in unexpected places!
4. What is the best way to winsorize? Should you do it? What can potentially avoid this adjustment?
5. Is there something called differential weights for the cluster factors? How do I estimate them? I never see people discussing them. What is the importance of this concept and how do I use it?
6. How do you develop a segmentation with a sample and score the whole population? This is a million-dollar opportunity!
7. There is a natural tendency to reduce the number of clusters, but more clusters provide more insights. This has nothing to do with CCC or anything else. It is purely segmentation intelligence. This is where the science appreciates the art! How do you play this to your advantage?
8. How are the following connected: clustering of observations and clustering of variables?
9. The most obscure but important understanding you need – only for those who join this site…
10. How do you converse with non-technical people about segmentation without using any kind of nerdy phrases? Believe me, this is a trap. You should take this challenge, and yet you should not overuse this power.
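For question 6, here is a minimal Python sketch of one common approach: cluster a sample, keep the cluster centres, and assign every member of the population to its nearest centre. This is only an illustration under my own assumptions. The data are simulated, the two variables are hypothetical, the clustering is a plain k-means written out by hand, and no standardization, winsorizing, or imputation is done.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return dists.index(min(dists))


def kmeans(points, centroids, n_iter=20):
    """Plain k-means (Lloyd) iterations starting from the given seed centroids."""
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Recompute each centre as the mean of its members; keep empty ones as-is.
        centroids = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group else old
            for group, old in zip(groups, centroids)
        ]
    return centroids


random.seed(0)
# Hypothetical population: two customer groups in (spend, visits) space.
population = (
    [(random.gauss(20, 3), random.gauss(2, 0.5)) for _ in range(500)]
    + [(random.gauss(80, 3), random.gauss(10, 0.5)) for _ in range(500)]
)

# Step 1: develop the segmentation on a sample only.
sample = random.sample(population, 100)
seeds = [min(sample), max(sample)]  # crude seeds: lowest and highest spender
centroids = kmeans(sample, seeds)

# Step 2: score the whole population against the sample's cluster centres.
segments = [nearest(point, centroids) for point in population]
print(centroids)
print(segments.count(0), segments.count(1))
```

The design choice worth noting is that scoring never re-estimates the centres. Whatever preprocessing was applied to the sample has to be applied to the population in exactly the same way before scoring.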

Imagine the sweat and tears I mentioned above. They were real – trying to understand the above beast and control it to play your favourite music.

It is really interesting how these algorithms play out, bringing out the various types of clustering structures:

– Disjoint groups
– Hierarchically disjoint groups – so many different structures are possible here (two big clusters, each splitting into a varying number of smaller clusters; two big clusters, one of them splitting into multiple smaller clusters; …)

Invariably, hierarchical clusters exist.

Studying the interrelationships of the clustering variables, which variable over-represents where and in what volume, and the inter-correlations of the dominant cluster variables within each cluster makes a great story to share and discuss. This is where you get to discuss the …mmm,…ahah… pointers.
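One simple way to see which variable over-represents where is a profile index: the cluster mean of a variable divided by its overall mean, times 100, shown next to the cluster size. The Python sketch below uses made-up rows and variable names, and the index itself is my choice of summary rather than anything prescribed above. A value above 100 means the cluster over-represents that variable.

```python
from collections import defaultdict

# Hypothetical scored data: (cluster label, spend, visits).
rows = [
    ("A", 10.0, 1.0),
    ("A", 30.0, 3.0),
    ("B", 70.0, 2.0),
    ("B", 90.0, 2.0),
]
variables = ["spend", "visits"]


def mean(values):
    return sum(values) / len(values)


# Overall mean of each variable across all clusters.
overall = [mean([row[i + 1] for row in rows]) for i in range(len(variables))]

# Group the variable values by cluster label.
by_cluster = defaultdict(list)
for row in rows:
    by_cluster[row[0]].append(row[1:])

# Index = 100 * cluster mean / overall mean, per cluster and variable.
profile = {
    cluster: {
        var: round(100 * mean([m[i] for m in members]) / overall[i])
        for i, var in enumerate(variables)
    }
    for cluster, members in by_cluster.items()
}
sizes = {cluster: len(members) for cluster, members in by_cluster.items()}

print(profile)  # {'A': {'spend': 40, 'visits': 100}, 'B': {'spend': 160, 'visits': 100}}
print(sizes)    # {'A': 2, 'B': 2}
```

Here spend separates the two clusters while visits does not, which is exactly the kind of contrast that makes the story easy to tell.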

Alternatively, this write-up could be titled ‘Ten Things You Should Ask Your Consultant About Your Segmentation Project’, for when you get the segmentation report. Note that there is no perfect, uniformly best segmentation among the whole class of segmentation solutions for a given data set. One can tolerate some level of Type I and Type II errors, and hence there is a huge collection of solutions for a given data set that are all statistically acceptable.

I started this piece as follows.


Leveraging Meta Parameters To Create Impressive Visualization

The following is a very informative note from Tableau, referred to in 

I want to take a different angle on this: what I call the meta parameters, which can help you think about how to create the next great visualization in your job and, if you can go to the next step, be the next Hans Rosling or Charles Minard, ….

Immediately below, I state the meta parameters that are prominently used. The ordering of the visualizations is by the author of those notes. My interest is to understand the latent factors behind these famous graphics.

#5 – John Snow’s Cholera Map

Geographical location axes, neighborhood correlation, causative factors

#4 – Hans Rosling and Gapminder

Time axis, KPIs, trend movements over time; similar to #3 below, but with an effective use of technology to show how the map changes as time moves. Great visualization that captures the imagination of

#3 – Charles Minard’s chart of the Russian Campaign

Time axis, KPIs, trend movements over time, and a great story bringing out important correlated/causative factors

#2 – Florence Nightingale’s Area Charts

KPIs, two-level causative factors, time trends on a circular axis!, attention to the relative importance of the strategic measure – it is like bending multiple contingency tables (2 x 2 tables) of curable vs. fatal diseases over time into one graph – what ingenuity!

#1 – Joseph Priestley’s Chart of Biography

I call this a grid chart, with the horizontal axis representing the time period and the vertical axis representing nominal cells. I am not aware of what the men of learning placed on the same line of the vertical axis have in common in the example graphic. (There must be some commonality!)

Time slots vs. nominal groupings, latent variables, trends.

To summarize further, there are basically 4 parameters you have to push into your graphics; in terms of visual storytelling, I put these forward as the important factors:

– moving time trends (bubbles and width to represent volume changes)
– correlation and causative factors highlighted on the way
– layers of explanation (multiple contingency tables compressed into layers – this does not have to be a circular axis), even with latent variables.

