Monthly Archives: December 2013

Ten Reasons Why Models May Fail… Previously published in CRMportals.com

TEN REASONS WHY MODELS MAY FAIL

by

Kent Leahy and Nethra Sambamoorthi, Ph.D.

American Express

DM NEWS – 1997

Addendum (by Nethra Sambamoorthi – 2003):

“Statistical scoring models are used to rank consumers/customers for marketing purposes.  They are used very commonly in B2B analysis, B2C analysis, CRM analysis, enterprise analysis [to rank initiatives, partners (affiliates), and consumers], and customer interaction analysis.  In broad terms, analysis powers Customer Knowledge Management, Customer Interaction Management, B2B marketing (B2B Analytics), and B2C marketing (B2C Analytics).  Without CRM analytics, we will never be able to optimize our resources, effort, and time.  Furthermore, analytics is the key component of “Marketing Analytics”, “Marketing Technology Automation”, and “Enterprise Analytics”.  Fast forward to better CRM processes, increased CRM productivity, and higher ROI using CRM analytics. Some studies have indicated that analytics contributes as much as 50% of ROI by improving your business processes, keeping managers accountable and responsive, and improving productivity.

While there may be more than ten reasons why models may fail, we have pointed out the ten most important reasons why models hinder, or even create losses, instead of bringing forth their full benefits in direct marketing, database marketing, or CRM marketing.  For reasons that are easy to understand, some companies will fail more often for one particular reason than for another.”


There are many reasons why direct response predictive segmentation models may do less well than expected, or perhaps even fail, or fail miserably. Some of these reasons are listed here, but by no means does the following exhaust the list. However, they are felt to be among the most common the authors have confronted in the industry, so hopefully, by bringing them to the readers’ attention, many of these errors may be avoided in the future. Please note that the list is not ordered by either “severity” or “frequency of occurrence”.


(1) The person who will actually be building the model is not included in the initial discussions or design of the model.
This problem is one of the most regularly occurring in the industry. Quite often the modelling methodology is decided independently of the statistician’s input, which can be disastrous at the back end. Well-trained statisticians are indispensable in spotting potential model trouble-spots before they become actualized. Research methodology and design are inherently statistical issues for which statisticians have been trained. By limiting his or her input at the earliest stage, one is merely asking for trouble.


(2) The model has been “overfit” to the sample at hand and, consequently, does not generalize well to the actual mailing population, or is otherwise unreliable
Typically the mailing results are quite disappointing when this happens. Remember, the mark of an effective modeller is one who knows when to “stop”, who is not obsessed with obtaining impressive “pre-rollout” gains tables at the expense of real back-end results, and who refuses to engage in adventurous “data-mining” or otherwise “torturing the data until it confesses”.
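A quick way to see the cost of overfitting is to hold out part of the sample and compare the “pre-rollout” and holdout results. Below is a minimal Python (scikit-learn) sketch on simulated data; the model type, settings, and variable counts are illustrative assumptions, not anything from the original article.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(8000, 20))                                   # 20 candidate predictors
y = (rng.random(8000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)   # only the first is truly predictive

X_build, X_holdout, y_build, y_holdout = train_test_split(X, y, test_size=0.3, random_state=1)

# An unconstrained tree "tortures the data until it confesses"...
overfit = DecisionTreeClassifier(random_state=1).fit(X_build, y_build)
# ...while a constrained one knows when to "stop".
restrained = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_build, y_build)

for name, model in [("overfit", overfit), ("restrained", restrained)]:
    build_auc = roc_auc_score(y_build, model.predict_proba(X_build)[:, 1])
    holdout_auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
    print(f"{name:10s}  build AUC {build_auc:.3f}   holdout AUC {holdout_auc:.3f}")

The impressive numbers the unconstrained model posts on the build sample evaporate on the holdout, which is the same gap the back-end results of a rollout would expose.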


(3) The circumstances surrounding the actual mailing change, or the mailing environment turns out to be substantially different from the one on which the model was built.
Economic changes, seasonal variations not captured in the model, and lessened demand for the product are just a few of the many “extra-sampling” reasons why a model may fail, and fail badly. It is not so much the changes per se that can be problematic, although they can obviously have a negative impact on a mailing, but the fact that the effect of such changes may not be constant across differing levels of the model predictors. Under such circumstances, the model could end up selecting people considerably less likely to respond than originally anticipated. This is the reason why models require periodic updating.
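One practical consequence is a periodic validation routine: score a fresh sample, then compare actual response by model decile with the rates expected at build time. The Python sketch below uses invented numbers purely for illustration.

import numpy as np

# expected response rates by decile from the build-time gains table (hypothetical)
expected = np.array([0.040, 0.030, 0.024, 0.019, 0.015, 0.012, 0.010, 0.008, 0.006, 0.004])
# actual response rates by decile observed in the changed environment (hypothetical)
actual = np.array([0.024, 0.022, 0.021, 0.020, 0.018, 0.015, 0.013, 0.011, 0.009, 0.007])

lift_expected = expected / expected.mean()
lift_actual = actual / actual.mean()

for d, (le, la) in enumerate(zip(lift_expected, lift_actual), start=1):
    flag = "  <- degradation" if la < 0.8 * le else ""
    print(f"decile {d:2d}: expected lift {le:.2f}, actual lift {la:.2f}{flag}")

When the top deciles no longer separate from the bottom ones, the environment has shifted relative to the build sample and the model is due for an update.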


(4) The model is used as though it were ‘generic’ or ‘universally applicable’
For example, the model might be developed using a particular product mix and then used in a promotion where a different mix is offered. Other changes foisted on the model that may have a negative impact on the mailing include using a different “creative”, a different “package”, or even using the model on a different population. One particular (actual) example comes to mind from a while ago, when the results of a mailing for a term life insurance policy with face-value choices of $10,000, $15,000, or $25,000 were used to develop a model that was later used on a mailing whose policy offerings were considerably more attractive to a more “upscale” audience (i.e., $15,000, $35,000, or $75,000). As might have been expected, the model selected out the opposite prospects instead of the best! The lesson to take away is that a model is not impervious to changes in the conditions under which it was built.


(5) Changes in the mailing environment in conjunction with the use of an ‘overfitted’ model.
Some of the reasons a model may fail are only (or primarily) problematic when they occur in the presence of another (or other) problems. For example, changes in the mailing environment of the roll-out could very well be innocuous were it not for the use of an “over-fitted” model, which does not allow for even minor deviations from the model-building environment. Thus the deleterious effect of the simultaneous occurrence of the two conditions is greater than the sum of their individual effects. The “overfitting” problem, for example, invariably exacerbates many other problems that might be present, in addition to being a prime reason itself why a mailing might fail.


(6) The model contains “post-event” variable(s), or those that occurred after the event you are trying to predict
For example, suppose you are trying to predict who is likely to purchase a new car next year based upon this year’s behavior. You build a model and find that “age of car” is an excellent predictor; in fact, it appears to be too good a predictor. Unfortunately, unbeknownst to you, auto records on your database are updated every six months, and individuals who bought cars this year have a car-age value representing the car they just recently bought. Obviously such data is going to be an excellent predictor of this year’s auto purchases, but it won’t do very well next year, except for those who buy a car every year.
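A simple safeguard, sketched below in Python (pandas) with a made-up file layout, is to build predictors only from values recorded on or before a snapshot date that precedes the outcome window, so that updates posted after the event cannot leak into the model.

import pandas as pd

snapshot_date = pd.Timestamp("2013-01-01")     # predictors must be known by this date
updates = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "car_age":     [9, 0, 6, 6],               # customer 1's mid-2013 update reflects the car just bought
    "updated_on":  pd.to_datetime(["2012-07-01", "2013-07-01",
                                   "2012-07-01", "2013-07-01"]),
})

# Keep only the latest value known BEFORE the snapshot; the post-event update
# (car_age = 0 for customer 1) is excluded, so it cannot masquerade as a predictor.
as_of = (updates[updates["updated_on"] <= snapshot_date]
         .sort_values("updated_on")
         .groupby("customer_id")
         .tail(1))
print(as_of)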


(7) Not ‘test-scoring’ the model, or making an error when implementing the model
Nothing is more disastrous, nor easier to do, than making an error when scoring the prospective mailing file with a scoring algorithm. One can build the best and most reliable model in the world and it can still self-destruct if it is not implemented properly. From making an error in re-coding a variable to inserting the wrong model weights, implementing the model can be a veritable minefield. One way to greatly reduce the possibility of making such errors is to have the algorithm tested on the file that was used to build the model. If both the model and the algorithm produce the same ‘gains table’ counts, then one can be reasonably assured that the mailing file will be scored correctly. One note of caution, however: sometimes a new or second file, different from the one the model was built on, is used to test-score the algorithm. The danger is that if the same transcribing error is made when formulating both the original model and the scoring algorithm, the implementation could be in error despite the fact that it tests out correctly. This is why it is preferable to use the original model-building data file for the test. A general rule for test-scoring is to minimize human transfer of instructions, which can be accomplished by using “cut and paste” operations whenever possible.
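Here is a minimal illustration of that test-scoring step, as a Python (scikit-learn) sketch on synthetic data; the variables and model are assumptions made up for the example, not the authors’ method.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))                 # e.g. recency, frequency, monetary (synthetic)
true_prob = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
y = (rng.random(5000) < true_prob).astype(int)

model = LogisticRegression().fit(X, y)

def scoring_algorithm(X, intercept, weights):
    """The 'production' scoring code, written out separately from the model object."""
    logit = intercept + X @ weights
    return 1 / (1 + np.exp(-logit))

scores = scoring_algorithm(X, model.intercept_[0], model.coef_[0])

# Test-score on the build file: the hand-coded algorithm must reproduce the
# model's own predicted probabilities (and hence the same gains-table counts).
assert np.allclose(scores, model.predict_proba(X)[:, 1]), "implementation error!"
print("scoring algorithm reproduces the model on the build file")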


(8) Failing to run an audit of the file as the first step in the model-building process.
More often than not the model-builder is not presented with a “clean” file with which to build the model. Such “messy data” is typically riddled with everything from observations that have missing values, to records with values that exceed the maximum possible values for a variable, and everything in between. The first step in building a “workable” model, therefore, is always to check the file for such problems. Doing so at the start of the model-building process can save a lot of heartache at the end.
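As a rough illustration, an initial audit can be as simple as the following pandas sketch; the file name, fields, and valid ranges are hypothetical placeholders.

import pandas as pd

df = pd.read_csv("mailing_file.csv")            # hypothetical input file

print(df.shape)                                 # record and field counts
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per field

# Range checks against the minimum/maximum possible values for each field.
valid_ranges = {"age": (18, 110), "income_k": (0, 2000), "response": (0, 1)}
for col, (lo, hi) in valid_ranges.items():
    bad = df[(df[col] < lo) | (df[col] > hi)]
    print(f"{col}: {len(bad)} out-of-range records")

print(df.describe(include="all").T)             # distributions, cardinality, extreme values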


(9) A consensus on just exactly what the model is expected to predict (and for which audience) is not reached and/or well understood.  
This may sound elementary, but many times models are built that end up being “shelved” because they predict outcome measures that were not intended, or predict the correct outcome measures for the wrong population. The major reason for this is a lack of communication between the interested parties. One way to help prevent this is to develop a “model specification form”, in which all pertinent information, including the audience, the outcome measure, and so on, is explicitly stated. In this way, the likelihood of inappropriate models being built can be substantially reduced.
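One possible shape for such a form, written here as a small Python structure purely for illustration (the fields and values are examples, not a standard):

model_spec = {
    "business_owner":     "acquisition marketing",
    "model_builder":      "statistics group",
    "audience":           "prospects on the rented compiled file, ages 25-64",
    "outcome_measure":    "paid response within 8 weeks of the mail drop",
    "exclusions":         ["current customers", "prior 12-month non-responders"],
    "offer_and_creative": "control package, $25 introductory offer",
    "intended_use":       "select top 4 deciles for the spring rollout",
    "refresh_schedule":   "re-validate before each campaign; rebuild annually",
}

Whatever the format, the point is that the audience, the outcome measure, and the intended use are written down and agreed on before any modeling starts.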


(10) The model performs well but the mailing itself is not a financial ‘success’
This is ordinarily the result of a lack of financial planning, or of insufficient attention being paid to the financial or “economic” aspects of the mailing, including such things as the marginal cost per piece mailed, the marginal revenue needed to reach certain financial objectives, and/or the depth of file that should be mailed to be maximally or optimally profitable. Although these considerations are in reality outside the domain of the model itself, many a model has been said “not to have worked” despite the fact that it did all that it could be expected to do. The use of a viable segmentation model in and of itself is no guarantee of a financially successful mailing; only careful financial planning in a conducive economic setting or environment can do that.
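To make the financial side concrete, here is a small Python sketch that walks a gains table against the marginal economics to find how deep to mail; every number in it is invented for illustration.

cost_per_piece = 0.55          # marginal cost per piece mailed ($), hypothetical
margin_per_response = 40.00    # marginal revenue (margin) per response ($), hypothetical

# response rate by model decile, best to worst (hypothetical gains table)
decile_response = [0.042, 0.031, 0.024, 0.019, 0.015, 0.012, 0.010, 0.008, 0.006, 0.004]
pieces_per_decile = 10_000

cumulative_profit, best_depth, best_profit = 0.0, 0, float("-inf")
for depth, rate in enumerate(decile_response, start=1):
    decile_profit = pieces_per_decile * (rate * margin_per_response - cost_per_piece)
    cumulative_profit += decile_profit
    if cumulative_profit > best_profit:
        best_profit, best_depth = cumulative_profit, depth
    print(f"decile {depth}: marginal profit {decile_profit:+,.0f}, cumulative {cumulative_profit:+,.0f}")

print(f"mail through decile {best_depth} for maximum profit (about ${best_profit:,.0f})")

With these made-up numbers the mailing stops paying for itself after the fifth decile, no matter how well the model ranks the file below that point.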


 

Mathematics of Strategic Metrics, KLIs, and KPIs – The Foundations of Predictive Dashboard

Here are the statistical relationships among Strategic Metrics, KLIs (Key Leading Indicators), and KPIs (Key Performance Indicators).

The interesting thing in this discussion is that strategies built around developing a set of KLIs will have a definitive and huge impact on the organizational strategic metric, and the Moneyball statistical relationship carries through even into the predictive time periods.  Remember that KLIs are highly predictive measures.

  • The KLIs and the organizational strategic metric are correlated, but an organization’s KLIs will be uncorrelated with the KLIs of competing companies, and hence with those companies’ strategic metrics. Think about the reasons.  After all, one’s strategic metric should be independent of the competition’s strategic metric because of the uniqueness of products and services and the uniqueness of value segments. The depth of correlation defines the types of strategies you will develop.
  • The strategic metric is a function of vision, an organization’s unique products/services, the value segments the organization is serving, and the budget.
    • KLIs of an enterprise are highly collinear among themselves (or at least defined so as to be collinear for better interpretation, since one can always construct a complementary measure that has a negative correlation but whose impact on the organization’s strategic metric is still favorable) and highly correlated with the strategic metric.  KLIs are influential metrics that have a favorable impact on the organization-wide strategic metric.
    • KPIs are highly correlated with the strategic metric and have a favorable impact on it, though they are not influential metrics.
  • Also, KLIs and KPIs are highly correlated

Example: Oakland A’s

The strategic metric of the Oakland A’s is OBP (on-base percentage).

Their value segment is their fans, who are mostly local (this is not so simple in the case of other organizations with customers spread out around the country).  There is a good analytical opportunity in figuring out value segments from this context.

We all know what their budget situation was.  Had the budget been of a different order of magnitude, there would be no interesting story here.  The Oakland A’s had one third of the budget of the best-funded team in the league, the very team they were daring to dream of beating.

All the metrics associated with hiring, training, fielding, and firing were KLIs; one has to see the movie to appreciate the details here.

Ticket sales, for example, are not a KLI. On their own, they neither happen by themselves nor influence winning the division title. But as KLIs start pushing the organization’s strategic metric, any number of KPIs one may define will bloom as after-effects of succeeding in strategic value creation, of more and more winning.

Why wouldn’t you buy this formulation: for the Oakland A’s, Winning = f(OBP) and OBP = f(KLIs)?

A key point I cannot stress enough is that the strategic metric has a lagged relationship with KPIs, and those relationships are weak.  Managers should instead be thinking about KLIs, Key Leading Indicators, to help them accomplish favorable impacts on strategic metrics.
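A minimal way to check these claimed relationships empirically is to compute contemporaneous and lagged correlations among candidate KLIs, KPIs, and the strategic metric over time. The Python sketch below uses simulated monthly series, so the specific series and coefficients are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
months = 60
kli = rng.normal(size=months)                                        # e.g. a team's on-base percentage
strategic = 0.8 * kli + 0.2 * rng.normal(size=months)                # e.g. winning, driven by the KLI
kpi = 0.7 * np.roll(strategic, 1) + 0.3 * rng.normal(size=months)    # e.g. ticket sales, lagging the wins

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

print("KLI vs strategic metric (same month):   ", round(corr(kli, strategic), 2))
print("KPI vs strategic metric (same month):   ", round(corr(kpi, strategic), 2))
print("KPI vs strategic metric (one-month lag):", round(corr(kpi[1:], strategic[:-1]), 2))

In this toy setup the KLI moves with (and ahead of) the strategic metric, while the KPI only lines up once you lag it, which is exactly why managing to KPIs alone leaves you reacting after the fact.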

Go KLIs.

How to be a great data scientist – Let Us Follow the Paradigm of Stephen Covey… For Details Read The Original Book

Today, I read Stephen Covey’s Seven Habits of Highly Effective People one more time and realized that the golden rules of CRM are exactly what he has articulated for personal life.  My thoughts kept identifying more and more of the relevance of his book for CRM and CRM analytics.  Here …

CRM and Stephen Covey’s Seven Habits of Highly Effective People

On a philosophical note, it should not be surprising that CRM is actually applicable to everyday life and the management of our own lives; after all, most of us serve somebody else, and we do this day in and day out. We want to be courteous, connected, complete in our exchange of information or service, and correct in the quality of interaction.

Conversely, Stephen Covey’s Seven Habits of Highly Effective People should be directly applicable to CRM operations. Let us see what I mean by that. Mr. Covey says that the following habits are the common thread of highly effective people.

1. Be Proactive – Principles of Personal Vision

Of course, any product/service offering starts with a vision of the current world and how it could be made simpler and more tolerable, from the point of view of a business model. We do market research on the dynamics of demographics, lifestyles, and psychographics to understand how the demand for certain products is going to fulfill such a vision. This fundamentally drives new innovations.

So, I will say, you have to know the corporate vision, corporate strategies, … If you do not know them, ask for them, internalize them, and create the information strategies from there.  Listen to this.

2. Begin with an End in Mind – Principles of Personal Leadership

Customer satisfaction in deriving benefits from the products/services, the simplicity and usability of the product, and the value-to-cost ratio of such products and services are fundamental, and that should be the end that business managers and entrepreneurs keep in mind.

The end in mind is always the fulfillment of your strategic metric and building strong momentum in the right direction using KLIs and KPIs.  Read more here:
3. Put First Things First – Principles of Personal Management

On a daily tactical management level, managers have to know their priorities, when the right time is to reach the customer, and what the most important things are for customers to get out of their interaction with us.

Communicate the CRM strategies and bring together the right team with the right tools to execute on the projects. Instill analytical leadership across the organization.

4. Think Win-Win – Principles of interpersonal leadership

Without this there is no way we can sell even the first unit of the product; this is a no-brainer. However, it is surprising how many of us lose sight of it; it is not uncommon to hear in board meetings, “we are not running a charity; where is the profit?” The simple rule is that a firm will be profitable once the customers see value in our products and services. Before the board sees profits, in fact, customers feel your profits; that is why they will also be buying your shares on the open market.

Win-win means you have the right pricing both for the organization and for the consumers.

5. Seek First to Understand, Then to be Understood – Principles of Empathic Communication

This is not anything new in any relationship, whether you want to be the leader or you want to serve the other person; in any case, this is how you make others happy: first listen, and then be listened to. This is the only way you get to hear about the opportunity areas for product improvement, differentiation vis-à-vis your competition, or even differential pricing for different segments of your product/service users.

6. Synergize – Principles of Creative Cooperation

If your company is really listening, this is how you can blunt or crush your competition. A great CRM is one where customer relationship management leads to new innovations in product development and new pricing models, without sacrificing customer satisfaction.

7. Sharpen the Saw – Principles of Balanced Self-Renewal

Again, we are saying that corporations cannot sit in one place but must grow and diversify through a well-established CRM platform for managing interactions with their customers. Self-renewal will happen based on the sixth principle, if it is used properly and intelligently.

OK, folks, CRM is nothing but Covey’s principles applied in a business context; well, this is for those who understand. It seems like more than 20 million people understand this book, or at least his principles, and we are bringing those lights to bear on the foundations of building a great CRM platform.

So now let us get into the next book that Stephen Covey has written, “The 8th Habit”, on how to move from effectiveness to greatness.

 

Introduction to Predictive Analytics and Data Collection (Strategic Analytics) – DL 402 WINTER 2014: Books and the Moneyball Movie

A student account with Amazon gets you the student price for Amazon Prime if you register with your university student email account, which upgrades you to free two-day shipping. To take advantage of the student rate for Amazon Prime and get your books faster, CLICK ON AMAZON OFFER.

4/16/14 update – Amazon changed its price structure and it is no longer entirely free.

Amazon Student is nearly identical to the standard Prime membership and it costs $49 a year, after a 6-month free trial.  If you register on or before March 20, you get it for $39.

The books for DL 402 – Introduction to Predictive Analytics and Data Collection are:

       

Taming BIG DATA by Bill Franks and PREDICTIVE ANALYTICS by Eric Siegel may be accessed using your library resources.

Note that you do not need to buy the movie just for the course.  You can watch it from the /Library Resources/Course Reserves section of the navigation area.  I bought a copy because I can share it with others to illustrate what is meant by the “Moneyball phenomenon”.  I also like to give it as a gift to my clients and/or management.

There is no better way to educate people on what is meant by “strategic metric”, “key leading indicators”, and the “organizational dynamics of introducing analytics” than watching and learning from this movie and discussing the critical learnings.

 

Correct Way of Implementing a Predictive Scoring Function.

In the following, my purpose is to show how to write out the predictive equation and implement it as a scoring function correctly.  As an example, I use the following SPSS output from

“http://www.ats.ucla.edu/stat/spss/output/reg_spss.htm”.

This note is very important for applications and implementations of predictive models.  As usual in explanatory notes, I will explain this with respect to a linear regression model.  The ideas are equally applicable to other types of models.  Since I have seen this mistake more than a few times, I think this may be useful to more people.

As a side recommendation, the site where I picked up this show-and-tell example is a wonderful resource for all practitioners; it discusses examples and code for all the common software packages, and I am thankful to be able to use it as a teaching aid, since it is usable by anyone in the world.  This is one of my favorite sites when I search for code.

The site, http://www.ats.ucla.edu, also provides similar output for the same data using SAS and Stata.  One can create similar output using R as well.

The helpful site provides all the important details, including the model to be used, drawn from the output of a linear regression model.

My purpose here is to help clarify what the right scoring function is when you pull the model out of the output.  Practitioners (seen in practice even in really sophisticated organizations that are in turn flagship guideposts for others) sometimes add additional rules and write it out incorrectly. That is the reason I want to bring this to practitioners’ attention.

"get file "c:\hsb2.sav".

regression
 /statistics coeff outs r anova ci
 /dependent science
 /method = enter math female socst read.





[The regression output tables, including the “Variables in the Model” coefficient table, are omitted here; see the linked page for the full output.]

– all of the above from the site, as example data and metadata.

There are two answers I have seen when analysts write out the scoring equation.

(1) The model to score in this case is: ScienceScore = 12.325 + 0.389*MathScore - 2.010*Female + 0.050*SocialStudiesScore + 0.335*ReadingScore

(2) The model to score in this case is: ScienceScore = 12.325 + 0.389*MathScore + 0.335*ReadingScore

Note that SocialStudiesScore, and technically Female as well, are not significant at alpha = 0.05.

Do not write the scoring function as in (2) once you have finalized the model whose output is shown above.

Once a model is finalized, for whatever reasons, you cannot drop a variable from the scoring function even though it does not satisfy your standard p-value conditions, because the rebuilt scoring model will be different if you drop the variable; the effect is all the more damaging depending on the level of collinearity of the dropped variable.

Remember that collinearity is a continuum, not a discrete state like a p-value cutoff; that is a separate topic.

So, if you want to drop a variable and write out the model, you have to rebuild the model without that variable and use that new, rebuilt model to write out the scoring function.
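To make the difference concrete, here is a small Python sketch using the published coefficients above; the sample record is hypothetical, and the point is only to show why coefficients cannot be cherry-picked out of a finalized model.

# Coefficients exactly as estimated with ALL four predictors in the model.
INTERCEPT = 12.325
COEFS = {"math": 0.389, "female": -2.010, "socst": 0.050, "read": 0.335}

def score(record, coefs=COEFS, intercept=INTERCEPT):
    """Correct scoring: apply every coefficient from the finalized model."""
    return intercept + sum(coefs[name] * record[name] for name in coefs)

record = {"math": 55, "female": 1, "socst": 50, "read": 52}   # hypothetical student

correct = score(record)

# WRONG: dropping the "non-significant" terms but keeping the other coefficients
# as-is. The remaining weights were estimated in the presence of the dropped
# variables, so this is not the model that was actually fitted.
wrong = INTERCEPT + 0.389 * record["math"] + 0.335 * record["read"]

print(f"score from the full model        : {correct:.2f}")
print(f"score after ad-hoc variable drop : {wrong:.2f}")

If a two-variable model is really what you want, refit the regression with only math and read and score with the new intercept and coefficients from that refit.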

Comparing R-Square and Adj-R-Square – Getting a Clear Statement For an Age-Old Problem

Since I have not seen a practitioner’s note on how not to interpret R-square and adjusted R-square, I make note of that here.

In the linear regression case, analysts have to use goodness-of-fit measures, and a function of the sum of squares explained by the model is very useful.  The relationship among the total sum of squares, the sum of squares due to regression, and R-square is stated here by Dr. Yu.
There is a simplicity and beauty in interpreting R-square: it is the proportion of variance explained by the model.

One illusion of R-square is that it will keep increasing as one increases the number of parameters in the model, even though the average incremental sum of squares explained by each additional variable may be negligible and useless as a measure of those variables’ incremental goodness of fit. Theil (1961) proposed adjusted R-square to adjust for the number of parameters used.  It is written as R-square with a bar on top:

\bar{R}^{2} = 1 - (1 - R^{2})\frac{n-1}{n-p-1} = R^{2} - (1 - R^{2})\frac{p}{n-p-1}
 (wikipedia.org)
n = number of observations
p = number of parameters, not including the intercept term.
One drawback of adjusted R-square is that it can become negative.  Though many analysts interpret adjusted R-square as variance explained, a proportion of variance cannot be negative, so that is not a tenable interpretation. Don’t use it that way.
One important use of adjusted R-square is figuring out the importance of incremental variables’ contributions to the model, and hence it is very helpful for judging the parsimony of a finalized model.
So if you want a measure to interpret as “variance explained”, the only such measure is R-square.  If you want a measure for parsimony testing of models, especially nested models, use adjusted R-square.  Loosely speaking, you can also use it for model selection among non-nested models, although for non-nested models the AIC (Akaike Information Criterion) won the argument, strictly speaking, especially when goodness of fit is defined on the basis of the lowest final prediction error (FPE).  That led to a host of information-based goodness-of-fit measures, the most prominent being the BIC (Bayesian Information Criterion).
There are other measures of goodness of fit that are useful for exploring different concepts.  That is for a different note.
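As a small demonstration of the contrast, the Python sketch below fits nested models on simulated data; the data-generating process and seed are arbitrary, so the exact numbers will vary.

import numpy as np

def r2_and_adj(y, X):
    Xd = np.column_stack([np.ones(len(y)), X])           # add the intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    ss_res, ss_tot = resid @ resid, ((y - y.mean()) ** 2).sum()
    n, p = len(y), X.shape[1]                            # p excludes the intercept
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)           # Theil's adjustment
    return r2, adj

rng = np.random.default_rng(42)
n = 50
x_real = rng.normal(size=(n, 1))
y = 2.0 * x_real[:, 0] + rng.normal(size=n)
junk = rng.normal(size=(n, 10))                          # pure noise predictors

print("real predictor only :", [round(v, 3) for v in r2_and_adj(y, x_real)])
print("plus 10 junk vars   :", [round(v, 3) for v in r2_and_adj(y, np.hstack([x_real, junk]))])

R-square creeps upward as the junk variables are added, while adjusted R-square penalizes them (and can even go negative for a model of pure noise), which is the parsimony use described above.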

==== Updates with input from others ====

  • Wayne Fischer

    Statistician at University of Texas Medical Branch
    R-squared increasing with each added variable to a model is not an illusion, it’s a mathematical fact. And although adj-R-squared should be used, along with R-squared, it by no means indicates “the importance of incremental variables’ contribution(s) to (the) model.” For “importance” use each estimated parameter’s t-statistic along with its p-value.
  •  
    Top analysts available for onsite work-SAS/R/SPSS-Analytics, Health, Insura., Fin.& Invest.Services-Offshore Development
    Good points on how different angles can be brought out on the word “importance”. Thanks for pointing out the ones I did not intend. I will explain what I meant. The illusion I referred to concerns the importance of the additional variables for the overall prediction, which is the implication of adjusted R-square. Again, regarding your second reference to the word “importance”, I meant the total predictive power of the model. Generally people want common experiences and terminologies to explain new experiences. I am probably guilty of doing such things more among technical people and less among business managers, or vice versa, depending on the group I interact with. You see such discussions all the time.

    By the way, the word “importance”, from an information-theoretic point of view, means something very different from our discussion here, a point I made in my notes.

    Nonetheless, the main point is that if data scientists (I like the word analysts) describe adjusted R-square as variance explained, or leave it hanging without a clear position on how to interpret the difference between R-square and adjusted R-square when choosing which of them to interpret as “variance explained”, with an unintended take-away message, then the purpose of my note is lost.

    Best always,

  • Nigel Goodwin

    at Essence Products and Services
    I’m not sure what audience you are addressing. From your profile, you are clearly familiar with the subject area. Adjusted R^2 fell out of favour some years ago, and as you say various flavours of information criteria are now recommended as alternatives. Burnham and Anderson, Model Selection and Multimodel Inference, and

    Claeskens and Hjort, Model Selection and Model Averaging

    are two authoritative text books.

    It also depends very much on the nature of the model and data, some tests are only really suitable for very small models (< 5 terms).

    My concern is that those new to the subject may get the impression that adj r^2 is suitable for model comparison.

    There have been other recent discussions on this topic in this group, so I won’t repeat myself.

  • Nigel Goodwin

    at Essence Products and Services

    ps, if you are discussing variance, the variance from a single model will always underestimate variance – because it has ignored all the other possible models.

    See Breiman.


  • Matt Healy

    Senior Research Investigator at Bristol-Myers Squibb

    Don’t forget to make plots of residuals — if there is a visible pattern in the residuals that can hint at what sort of model might work better.

  • Nethra Sambamoorthi, PhD – Sr. Stat/Econ

    Top analysts available for onsite work-SAS/R/SPSS-Analytics, Health, Insura., Fin.& Invest.Services-Offshore Development

    Hi Nigel and Matt, thanks.

    Matt, definitely, plots of residuals and other diagnostics are important and will provide intelligence on how to finalize the “best” model. Adding this statement would certainly avoid concerns from certain unstated angles, though that was not the reason I wrote the note.

    Nigel, this is a very narrow effort at interpreting the average sum of squares due to error vs. the model, as captured in R-square vs. adjusted R-square, in the case of nested models. The context here is an introduction to regression models.

    If R-square is usable for model selection, then adjusted R-square is usable for model selection, following Theil’s note, which addresses the problem of ever-increasing R-square in the context of the simplest normal regression model. Considering the ever-increasing complexity of models, I understand why one would ask why anyone would still use adjusted R-square.

    Thanks for those references.

The Big Problem of Big Data Scientists – The Danger of Croaking In The Well – A Fun Writeup

Fun part:


Here is a collection of 10 points on why we as big data scientists need to go outside of our well and see the world.  The intensity of one’s croaking and fighting within the well can easily be measured by how many positives/negatives you score on this 10-point test.  The debate will be settled on LinkedIn by 2,752 comments; well, sorry… 2,751 comments, to eliminate a tie.

To make it interesting, I mix negative and positive ways of saying these statements, possibly with double negatives.

You just have to score whether you agree or do not agree. In no way does one become the most intelligent croaker just because one croaks these 10 test items first.   I know what you are thinking.  Smarty, Big Bad A, it should be 100 items!

What can I say! This is only a test.  You can absolutely change this list to your liking.

This is dedicated to the loudest, most contrarian croaker.  Caution:  this is only a test to determine how balanced a critical thinker can be. It is a highly sophisticated, balanced design of experiment to develop the intelligence we all need, based on 299 survey interviewers.

It is meant to be provocative; enjoy…

  1. Big data does not mean you do not need survey data!
  2. Leadership for big data may come from small data people
  3. You do not need critical thinking, you just need to do correlation, association, and regression in big data analytics
  4. Big data is big dross
  5. Big data opportunities cannot be solved by computational methods
  6. Statisticians are not equipped to be leaders; they are just great geeks and epsilon tinkerers
  7. The farther away you are from statistical thinking, the more successful you will become in handling big data
  8. Big data solutions result in much bigger presentation decks and many, many visuals
  9. Statisticians lost the game of big data
  10. Statisticians will never understand big data because the whole science of statistics is about normal distributions. 

======================================================================
Serious part:

LinkedIn post: from “Why statistical community is disconnected from Big Data and how to fix it”, a genuine thought-provoking question.

Top analysts available for onsite work-SAS/R/SPSS-Analytics, Health, Insura., Fin.& Invest.Services-Offshore Development
Statistical science is a fascinating science that anyone can enjoy easily, as long as you have an interest in creating intelligence from partially created/available data.

That changed with the data dross of big data; data are still the raw material, not the solution, though in a parody of those talks I might even say data is the solution, just to get attention.

History is replete with examples of how scientists from other sciences lived as statisticians, and statistics grows with the newcomers, but one will benefit by contributing, not bashing.

But now, in the early stages of new developments, data has taken center stage; in a matured stage, data are going to be mined in the real sense, and the gold nuggets from such mining will still be the treasured output, not the data.

It takes long preparation and time to become a gold miner. If a data scientist thinks he is better than the statistically trained, glory to him once he proves unequivocally that all his solutions are top notch, or even that most of his solutions are top notch, and beats the odds; then I will be a great admirer.

So far, I have seen only association, clustering, and sophisticated decision trees, and statisticians found the missing bridge to those innovations. Example: GBM.

It sounds like zealotry when one says the new opportunities are top-down or bottom-up, or that to be a new breed of scientist one has to be an insider or an outsider. In fact, some statisticians contribute to that kind of division because they do not understand the importance of data mining and how it also involves probabilities and statistical errors. On the other hand, I see how the purely computational data scientists struggle to control the two types of errors and invent artificial new terminologies, only to learn later that statisticians have already solved such problems. That is why I am saying “come out of the well”.

It is a common observation that the new converts, the new pioneers, and those who are on the grab-and-run for space, food, and mates are the croakiest of all and fight till the last drop of blood. They will find there is an advanced society already rooted in the new world: statisticians.

What can statistics departments do to create a new breed of data scientists while strengthening their already strong programs?

Add leadership (communication/project management) and computational courses, and statistics will again resurrect itself as a sexy field.

It has been interesting to read how, with a few years of experience in data science, anyone can claim anything, especially ignorance claimed as the right to bash another profession.

Top analysts available for onsite work-SAS/R/SPSS-Analytics, Health, Insura., Fin.& Invest.Services-Offshore Development
Friends, we can elevate the overall quality of these types of discussions.

I just register my point of view, as a vote, when unprofessional statements are expressed in a community – this group – that I am part of!

I do not think our egoistic, incomplete, ignorant views matter; what matters is what our clients think. Do they care? Only if you put your title as data scientist in your resume. Please do. Because possibly they have committed to their superiors to hire a “data scientist”, not a “statistician”, and we, as executing leaders, have to support and coach them in the end.

Some people call themselves Data Scientist/Statistician to stress their contribution on the intelligence/insight side; that is another way to look at what a statistician should call oneself. In the end, the contributions you make will define who you are. A strong set of computer science courses and communication ability helps a lot here.

From a different angle, however one may defend one’s enthusiastic, zealous statements and their fairness, it does not help to have a denigrating or condescending attitude toward another profession, which is tantamount to looking down on other groups.

In this case, one has to realize that statisticians are a breed with 150-plus years of glorious contributions. It is a fascinating science. Enjoy it!

The community will overcome these challenges and harness the unprecedented opportunity ahead in the next 15 years, which I put at $250 billion – a back-of-the-envelope estimate – if you include all the core and peripheral fields where data scientists can move around fairly freely. This is based on McKinsey estimates in their core publication on the Internet of Things. Just 5 years back, nobody believed analytics/data science/data intelligence had this much market.

Mark, everyone has an opportunity to learn every day, I tell my children. The true judges of one’s capabilities are our clients, who pay us and risk their reputation and livelihood by trusting us (all data scientists) with their problems, which is an opportunity for people like us. In the practical world, our managers are our clients too, and so it is our responsibility to coach them to appreciate the hard work that goes into delivering solutions. Having worked in business and consulting, I can understand you, Michael, from this perspective.

The main message I want to take from this is that there is a huge opportunity out there to educate people on data intelligence, and I think it would be fun to work with thought leaders and industry stalwarts to identify areas where we can build bridges that help everyone.

It has been fun to watch the cockfight, and painful to see the condescending attitudes and emotions.

I’ve got to run! I heard there is possibly a bet developing on which cock is going to win: R or Python. If I have time, I might put my vote there too; I am not interested in betting, though!

If I do not respond to your complaints and justifications any further, that is because this ends my participation in this thread, but no disrespect!

Thanks everyone and have a great rest of the weekend.

How to Develop and Deploy an Information Strategy? – One of a 10-Part Video Series on Big Data Strategic Analytics

This is a sample from a collection meant for lecturing. You will find it more professorial!

Information strategy is becoming more and more important as organizations realize that data are flowing through their processes and systems, and that it is better to build the right aqueduct system for those data flows and use that water of intelligence to become analytically competitive.

Remember that information strategy is different from data strategy, and you need analysts to help you master both – one more reason why you need a CAO (spelled KO!), a Chief Analytics Officer.

The video presentation mentioned below explains what an information strategy is, how it is related to the organization’s strategy, how it is executed, and how it can be part of both re-engineering and continuous development of marketing or organizational analytical competitiveness.

How to develop and deploy an information strategy.mp4

Those who have joined as community members of this blog get to hear more of this, on a permission basis: the full list of 10 videos. This is a sample from a collection meant for lecturing. Use it for what it is worth.

I will continue to refine the voice, tone, and phrasing… Best always…

From Data Monster & Insight Monster