Monthly Archives: April 2013

Latent Analysis – Successful Social Science Methods Not So Popular Yet in BI or Predictive Methods

Regression methods, whether logistic or based on the normal distribution, are the meat and potatoes of analytical work.

Everyone is taught them, and software packages are awash with procedures implementing them, sometimes even extending them with additional functionality.  In the aisles of SAS and SPSS, developers and users have heard the story of how the whole company was built on these methods in its early stages.

However, statisticians, economists, social and behavioral scientists, and recently computer scientists have introduced a host of other methods that are not yet pervasive among analysts, even though a significant number of solution methods are available.

One such family of methods is latent variable models.

There is one abstraction of successful analytics practice that points to latent variable models as the ultimate creator of knowledge nuggets in the social sciences, and I would add that the same is true in business, especially at an advanced level of CRM, where the intelligence in consumer inputs arrives in so many signal variations that it can only be extracted systematically through latent variable analysis.  This is a common opportunity, in general, in sample surveys on the attitudinal and behavioral aspects of consumers.

A simple factor analysis is the entry point to this area.
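As a minimal sketch of that entry point, the following recovers two latent factors from six observed indicators using scikit-learn's `FactorAnalysis`. The data here are synthetic stand-ins; in practice the observed variables would be survey responses.

```python
# Minimal factor-analysis sketch: recover 2 latent factors from 6 observed
# survey-style variables. Data are synthetic; two latent factors drive six
# observed indicators plus noise.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=(n, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                     [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
observed = latent @ loadings.T + 0.3 * rng.normal(size=(n, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(observed)   # per-respondent factor scores
print(fa.components_.shape)           # (2, 6): estimated loadings
print(scores.shape)                   # (500, 2)
```

The estimated loadings matrix shows which observed variables cluster on which latent dimension, which is exactly the kind of knowledge nugget attitudinal survey analysis is after.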


From Data Monster & Insight Monster


Students get this free.

Join my site for the actual data to do the following; it is available for $99.  If you come through a Twitter follow, you will get it for $149, and if you come through any of my LinkedIn sites, it is $179.  For non-members it is $199.

For buyers, these data are available as yearly updates, following the ACS's yearly releases, for three consecutive years.

It takes a long time to figure these out.  You will be getting more than 125 census variables at the ZIP code level, and the lookup table that provides the various links can be used to reconfigure the data at any required level:

– identifying the MSA code given a ZIP code
– identifying the MSA name given a ZIP code
– identifying whether a ZIP code is rural, urban, or suburban
– identifying the county given a ZIP code

Some additional relationships:
– identifying the state given a ZIP code
– identifying the state FIPS code given a ZIP code

These relationships open up powerful analysis at the MSA, ZIP code, and county levels.
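The lookup relationships above amount to a join followed by an aggregation. Here is a hedged pandas sketch; the file contents, ZIP codes, and column names are hypothetical illustrations, not the actual data set sold on the site.

```python
# Hedged sketch: joining ZIP-level census data to a ZIP -> MSA/county
# lookup table, then rolling ZIP-level figures up to the MSA level.
# All values and column names here are hypothetical.
import pandas as pd

census = pd.DataFrame({
    "zip": ["10001", "60601"],
    "population": [21102, 2407],
})
lookup = pd.DataFrame({
    "zip": ["10001", "60601"],
    "msa_code": ["35620", "16980"],
    "msa_name": ["New York-Newark-Jersey City", "Chicago-Naperville-Elgin"],
    "county": ["New York County", "Cook County"],
    "state_fips": ["36", "17"],
})

merged = census.merge(lookup, on="zip", how="left")
# Aggregate ZIP-level figures up to the MSA level.
by_msa = merged.groupby("msa_name")["population"].sum()
print(by_msa)
```

The same `groupby` works for county or state once the lookup columns are attached, which is what makes the lookup table the key to multi-level analysis.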

Have fun.

From Data Monster & Insight Monster

Free ZIP Code Level Census Data – Also, Reasons Why Your ZIP Code Level Census Data Are Incorrectly Sold, Interpreted, or Used

Join my site using the top right button “JOIN MY SITE” to receive login and password to download the ZIP level census data freely.

In the recent publication of the 5-year American Community Survey data, the US government published data for nearly 125 variables, with four types of element for each variable, for each ZIP code.

There are 32,989 ZIP codes.  That is the number of ZIP codes for which this site publishes data.  The reasons why different sources state different numbers of ZIP codes are given in


The metadata and the actual data are available.

If you believe the margins of error do not matter, you can use the specific column of data you are interested in; otherwise, how do you adjust for the sampling variation in the data?

As you can see, the ZIP data have four elements for each variable:

– Estimate (based on a sample survey, no doubt the largest well-designed survey ever done, with the possible exceptions of India and China)

– Margin of error in the estimate (tells you how much uncertainty is built in)

– Percent of the estimate

– Percent margin of error
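The estimate and margin-of-error columns combine into a confidence interval directly. The ACS publishes 90 percent margins of error, so dividing by 1.645 recovers the standard error; the sketch below uses made-up numbers for illustration.

```python
# Hedged sketch: turning an ACS estimate and its published 90% margin of
# error into a confidence interval and a standard error. The ACS publishes
# 90% MOEs; dividing by 1.645 recovers the standard error.
def acs_interval(estimate, moe_90):
    se = moe_90 / 1.645
    return estimate - moe_90, estimate + moe_90, se

low, high, se = acs_interval(estimate=1200, moe_90=150)
print(low, high, round(se, 1))   # 1050 1350 91.2
```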

Here are the notes from the Census Bureau, verbatim:

“Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, it is the Census Bureau’s Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns and estimates of housing units for states and counties.

Supporting documentation on code lists, subject definitions, data accuracy, and statistical testing can be found on the American Community Survey website in the Data and Documentation section.

Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.

Source:  U.S. Census Bureau, 2007-2011 American Community Survey

Explanation of Symbols: An ‘**’ entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
A ‘-‘ entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
A ‘-‘ following a median estimate means the median falls in the lowest interval of an open-ended distribution.
A ‘+’ following a median estimate means the median falls in the upper interval of an open-ended distribution.
A ‘***’ entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.
A ‘*****’ entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.
A ‘N’ entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.
A ‘(X)’ means that the estimate is not applicable or not available.

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data).  The effect of nonsampling error is not represented in these tables.

While the 2007-2011 American Community Survey (ACS) data generally reflect the December 2009 Office of Management and Budget (OMB) definitions of metropolitan and micropolitan statistical areas; in certain instances the names, codes, and boundaries of the principal cities shown in ACS tables may differ from the OMB definitions due to differences in the effective dates of the geographic entities.

Estimates of urban and rural population, housing units, and characteristics reflect boundaries of urban areas defined based on Census 2000 data. Boundaries for urban areas have not been updated since Census 2000. As a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.”

To address sampling variability, the Census Bureau provides these margins of error in a serious and sincere way, and it is fascinating how much detail they provide to make sure people interpret these numbers correctly.
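One practical consequence of those margins of error: when you aggregate ZIP-level estimates (say, up to a county), the standard ACS guidance is to combine the MOEs as the square root of the sum of their squares. A hedged sketch, with illustrative numbers:

```python
# Hedged sketch of the standard ACS guidance for aggregating estimates:
# when summing ZIP-level estimates, the combined margin of error is the
# square root of the sum of the squared MOEs. Numbers are illustrative.
import math

def aggregate(estimates, moes):
    total = sum(estimates)
    moe = math.sqrt(sum(m * m for m in moes))
    return total, moe

total, moe = aggregate([500, 800, 300], [60, 80, 50])
print(total, round(moe, 1))   # 1600 111.8
```

Note the combined MOE (111.8) is well below the simple sum of the MOEs (190): ignoring this is one way ZIP-level data get misinterpreted.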

However, vendors neither provide these margins of error nor help interpret the data correctly.

Which is better: giving you data that are not correct but letting you do a quick-and-dirty analysis (very important for keeping up with time), or giving you the right data and taking the time to do the right job?

From Data Monster & Insight Monster

From Big Data To Small Data In Real Time – What Does This One-Per-Million-Part Scorecard Mean to You?

There are many takeaways. 

Seven powerful metrics that distill a specific type of big data opportunity (security) into small data.

A one-per-million-part identification system bringing together different data types in real time: a big data opportunity.

Behavioral interpretation and social analytics are still key to making sense of data, to quickly turning big data into interpretable and usable small data.

I am working on a weighting system; that is what makes this a very effective identification system, better than Six Sigma.  This is my intellectual property.

The following is difficult for me to discuss, as it points to an unsettling prospect of

  • how resourceful organizations are going to be watching you and me with big data going forward, 
  • how privacy is going to be a challenge to maintain, 
  • how we are going to lose our moral superiority.

In difficult times like these, after the Boston Marathon, it is important that we have a tool like this, though.

This only underscores why discussions about Type I and Type II errors are becoming more and more important and, beyond that, how bias (preconceived notions) can undermine real intelligence.

The following, based on my initial estimates, is a one-per-million-part identification system, better than Six Sigma, that uses big data and a weighting system to help identify that one-in-a-million dangerous person who is lost in our day-to-day hurly-burly of innocence, dreams, and celebrations of love and accomplishment.


– Sudden changes in behavior or performance or relationships

The person's usual level of performance disappears.  A student, for example, will receive grades far outside his or her normal range, or there will be poor performance evaluations or angry exchanges.

They lose commonly known best friends or romantic relationships.

– Sudden changes under the watchful eyes of organizations or people: the person travels to uncommon places and acquires new relationships who are themselves under the watchful eyes of security organizations.  At the least, the chatter inside the security organization gets elevated, or elevated chatter comes and goes without sticking to a well-defined resolution of not dropping the ball.  Resolving clearly does not mean putting people in jail; it means having an officer report on the latest activities and log the details so that they flow to the right people for the right action, making sure the watchful eyes are not sleeping. 

– Unprecedented access changes happening around one's neighborhood, social relationships, and lifestyle interests that would give the dark side opportunities to rear its head

– Firearms, crude bombs, or illegal activities blip on the intelligence radar, or correlated words such as "bomb," "firearms," or "mass danger" materials pop up on the radar

– Physical changes (becoming a post-teenager, a time when things get hardened) or emotional changes at home (death or separation) or with closely related people (friends lost)

– New buying/shopping activity involving apparently unrelated items; this will almost always be a second or third blip if the items look innocent but are used as aids to complete the intended action

– Sleep patterns, telephone call patterns, internet access patterns, even visits to one's own home change erratically, but consistently from the time one's change started

I say these are seven metrics of highly dangerous people who need help badly.  What caught my attention is that our security agencies were so close to the …. and yet the tragedy happened.

Of course, it takes a lot of fine intelligence to be careful about Type I and Type II errors in a way that is also respectful of citizens' privacy.
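Why Type I errors dominate this discussion is a base-rate question: at a one-in-a-million prevalence, even a very accurate screen flags far more innocent people than true cases. The numbers below are purely illustrative, not estimates of any real system.

```python
# Hedged illustration of Type I / Type II errors at a one-in-a-million
# base rate: even a highly accurate screen produces far more false alarms
# than true hits. All rates here are purely illustrative.
def flagged_breakdown(population, base_rate, tpr, fpr):
    true_cases = population * base_rate
    true_pos = true_cases * tpr                  # correctly flagged
    false_pos = (population - true_cases) * fpr  # Type I errors
    return true_pos, false_pos

tp, fp = flagged_breakdown(population=300_000_000, base_rate=1e-6,
                           tpr=0.99, fpr=0.001)
print(round(tp), round(fp))   # 297 true hits vs 300000 false alarms
```

This is why a weighting system that drives the false positive rate down matters far more than raw accuracy, and why preconceived notions (bias) in the weights can undermine real intelligence.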

My prayers are with the innocent victims of all ages: an innocent child who cared for kindness, a couple starting their dream life, and an accomplished elderly man who did not give up running marathons at age 78.

From Data Monster & Insight Monster

SAS, Hadoop, and The New World – An Opportunity Especially For Statisticians To Take Front Seat

Distributed, high-end big data analytics opportunities using the SAS environment.

A logistic regression on a billion records in 32 seconds, using a 48-blade distributed cluster.
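The post does not show SAS code, but the core idea behind such distributed fits can be sketched in a few lines: partition the data across nodes, compute each shard's gradient locally, and sum the shard gradients for one global update. This is a language-neutral toy illustration, not SAS's actual in-memory implementation.

```python
# Hedged, language-neutral sketch of data-parallel logistic regression:
# each "blade" computes the gradient on its own data shard, and a
# coordinator sums the shard gradients for one global gradient step.
import numpy as np

def shard_gradient(X, y, w):
    p = 1.0 / (1.0 + np.exp(-X @ w))    # per-shard predictions
    return X.T @ (p - y)                # per-shard gradient

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = (rng.random(10_000) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

shards = np.array_split(np.arange(10_000), 4)   # pretend: 4 "blades"
w = np.zeros(3)
for _ in range(200):
    grad = sum(shard_gradient(X[i], y[i], w) for i in shards)
    w -= 0.0005 * grad                  # global gradient-descent step
print(np.round(w, 1))                   # recovers weights near true_w
```

Because the gradient of the logistic log-likelihood is a sum over observations, the per-shard gradients add up exactly, which is what makes the billion-record, 48-blade timing plausible.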

SAS at O’Reilly Conference:

How it all started in SAS:

SAS approach to Hadoop infra:

The demo here:

SAS Data Integration using DI Studio:  Hadoop Integration.

SAS’s  answer to Hadoop:

From Data Monster & Insight Monster