Monthly Archives: September 2011

Predictive Analytics in Auditing



It is fascinating how a simple method like Benford’s law could have helped identify that something unusual was going on, and hence could have helped anticipate Europe’s current economic condition.

Fact and Fiction in EU-Governmental Economic Data

Bernhard Rauch, Max Göttsche, Gernot Brähler, and Stefan Engel

in the German Economic Review.

I was fascinated by the article’s abstract, which reads:

“To detect manipulations or fraud in accounting data, auditors have successfully used Benford’s law as part of their fraud detection processes. Benford’s law proposes a distribution for first digits of numbers in naturally occurring data. Government accounting and statistics are similar in nature to financial accounting. In the European Union (EU), there is pressure to comply with the Stability and Growth Pact criteria. Therefore, like firms, governments might try to make their economic situation seem better. In this paper, we use a Benford test to investigate the quality of macroeconomic data relevant to the deficit criteria reported to Eurostat by the EU member states. We find that the data reported by Greece shows the greatest deviation from Benford’s law among all euro states.”

Benford’s law says that the first digits of truly naturally occurring measurements are likely to follow a particular probability distribution.
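For concreteness, here is a minimal sketch (mine, not from the paper) of how such a check can be run: the law says the leading digit d should appear with probability log10(1 + 1/d), and the function below compares that against the observed first-digit frequencies of any list of numbers. The sample values are made up purely to show the mechanics.

```python
# Benford's law: the leading digit d occurs with probability log10(1 + 1/d).
import math
from collections import Counter

def benford_check(values):
    """Compare observed first-digit frequencies against Benford's law."""
    first_digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    n = len(first_digits)
    observed = Counter(first_digits)
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        print(f"digit {d}: observed {observed.get(d, 0) / n:.2f}, expected {expected:.2f}")

# Made-up numbers, purely for illustration:
benford_check([123.4, 98.1, 187.0, 20.5, 345.9, 11.2, 76.3, 1500.8])
```

In practice one would run this on a large set of reported figures and look at how far the observed shares drift from the expected ones, which is essentially the comparison the paper makes across EU member states.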

I think this is applicable to IRS auditing and fraud prediction, and also, in a positive way, to identifying and measuring whether somebody is systematically working towards a goal. The key word here is ‘systematically’. Fascinating! So the question is what type of regression method one should use to bring in covariates.

(Picture and equation reference: Wikipedia)

From Data Monster & Insight Monster

Top 10 Data Mining Algorithms – IEEE Knowledge and Information Systems 2008

Top 10 data mining algorithms – Knowledge and Information Systems (2008) publication.

This not only captures the experience and thought processes of the 145 data mining experts who voted on these algorithms, but also serves as a great review paper for those in the field of data mining. Kudos to the organizers of this panel and the team that worked on it. A great contribution to the science.

Though the list looks heavily influenced by current trends in the field, I tend to think that ease of use, interpretability, and amenability to automated scoring will keep this selection of ten relevant for many years to come. For example, even with new trends in web data, big data, and unstructured data, the top algorithms will continue to dominate in terms of their applications, interpretations, and quick usability.

From Data Monster & Insight Monster

Data Quality Issues – Incorporate them at the beginning, when a project starts – Don’t catch the tiger by the tail!



“Quality is designed in the process; it is not checked or verified in the end (after the process is streamlined)” – Deming – Quality Guru.

Some simple measurements will help the whole team stay on the same page regarding data quality. The whole team has to sign off on measurements such as:

– Total number of records (possibly at some finer levels)
– Range of the values of each field
– The metadata describing the layout of a record and of the whole data set
– Simple expected relationships among some of the key variables

An important point here is that it is not enough to look at a few sample records; you have to run proc univariate, proc means, and proc freq type analyses and, as part of your commitment to the work, submit a report based on those outputs that points out, variable by variable, any anomalies or the acceptability of that variable, so that the client can give you input immediately.
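As an illustration only, here is a rough Python/pandas equivalent of that proc means / proc freq style summary; the file name is hypothetical and the checks are a minimal subset of what a real sign-off report would contain.

```python
# A quick variable-by-variable data-quality report in the spirit of
# proc means / proc freq; "client_extract.csv" is a hypothetical file name.
import pandas as pd

def data_quality_report(path):
    df = pd.read_csv(path)
    print(f"Total number of records: {len(df)}")
    print("Record layout (metadata):")
    print(df.dtypes)
    for col in df.columns:
        s = df[col]
        missing = s.isna().mean()
        if pd.api.types.is_numeric_dtype(s):
            # Range of values for numeric fields
            print(f"{col}: min={s.min()}, max={s.max()}, mean={s.mean():.2f}, missing={missing:.1%}")
        else:
            # Frequency-style summary for categorical fields
            print(f"{col}: {s.nunique()} distinct values, missing={missing:.1%}")

# data_quality_report("client_extract.csv")
```

The output of something like this, variable by variable, is what the whole team signs off on.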

This has to be a 24- or 48-hour turnaround for the client’s benefit. Line up your resources beforehand so that “continuous rowing of the boat” is happening, so to speak.

If these are not agreed on in the project’s first week, you have really started handling the tiger by the tail or the elephant by its tusk, unless you are playing with an innocent baby tiger or an innocent baby elephant.

Then everybody is worried about saving themselves from the tiger, the focus on time management as part of total project management will be seriously challenged, and somebody on the team will be hurt!

Watch out: there will be a lot of frustration, and the team will be dragged down in its boat race to keep up with the time commitments of project management. Nothing can save somebody on the team from getting blamed. It is the responsibility of the team leader to enforce this discipline.

From Data Monster & Insight Monster

Today’s Important Msg: Lifetime value of an e-mail blast: much longer than you think – Analyticbridge.com

This is a fascinating follow-up analysis of an email campaign studied by Analyticbridge.com.

This provides some basis for why we should not assume that the half-life of an email is much shorter than that of snail mail (typically there is a general understanding that, in customer service for example, an email is supposed to be answered within 24 hours).

Perhaps, while customers expect the company to respond within a day or two, they may take a lot more time to respond to an offer! An asymmetric behavioral expectation. So just going by your own experience or gut feeling may not be the right thing unless you put yourself in the shoes of the consumer.

From Data Monster & Insight Monster

A $13 Billion Analytics Netflix challenge – A Missed Opportunity?

What would have happened if Netflix, in the spirit of its analytics competition, had run a near-term analytics challenge asking for a solution to its price and market dynamics, instead of relying on a botched decision-making process to change prices, betting easily $10 billion worth of the company’s market value, which it lost in three months’ time?

Recall that it instituted the well-known Netflix Prize competition when it was a $2 billion market-value company. Yet it seems it took it easy when developing, analyzing, and instituting a not-so-well-understood price elasticity solution.

What would the lifetime value of the company have been for investors, had it gotten the price changes right at the beginning of the summer?

Here is the story:

– Netflix believes in analytical competition as a company-wide culture; it was the first company to boldly run a worldwide competition with a million-dollar prize to develop a key analytical solution supporting its distinctive needs. Mr. Hastings started Netflix as a competitor to the Blockbuster model and wiped Blockbuster out of the competition.

– Consumer monthly price of digital streaming (a recent development): $8

– DVDs by direct mail (the delivery channel the company started with): +$2 (call Netflix to confirm these figures)

– 25MM subscribers as of the beginning of summer 2011

– 12MM of them use the DVD option; 3MM use both; 10MM use streaming only

– Expected to lose 200K subscribers because of the new pricing option (+$8 for the incremental DVD option), but it lost 800K subscribers attributed purely to this price change.

– A huge prediction error in price elasticity, from the point of view of the markets:
The company lost close to 50% (almost $8B in market value) from its peak market value of $16B at the beginning of the summer to a current value of $8B (latest news). An industry leader losing ground in a growing market is unheard of. All because of the prediction of price elasticity and its implications for the market.
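A back-of-envelope sketch, using only the figures quoted above (not any actual Netflix analysis), shows how far the realized subscriber response was from the assumed one:

```python
# Back-of-envelope arithmetic from the numbers quoted in this post (not Netflix data):
# combined plan roughly $8 + $2 = $10 before, $8 + $8 = $16 after.
old_price, new_price = 10.0, 16.0
subscribers = 25_000_000
expected_loss, actual_loss = 200_000, 800_000

pct_price_change = (new_price - old_price) / old_price            # +60%
assumed_elasticity = -(expected_loss / subscribers) / pct_price_change
observed_elasticity = -(actual_loss / subscribers) / pct_price_change

print(f"price change: {pct_price_change:+.0%}")
print(f"implied elasticity assumed:  {assumed_elasticity:.3f}")   # about -0.013
print(f"implied elasticity observed: {observed_elasticity:.3f}")  # about -0.053, four times the assumption
```

However one chooses to define the base, the realized response was roughly four times what was planned for, which is the prediction error the markets punished.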

It is anyone’s guess what the CEO’s decision would have been had he been given a well-understood and better-estimated price elasticity, the market dynamics, and all the resulting effects for each and every segment.

Listen to the video discussion on CNBC.

Netflix and Quickster Mea Culpa

Reference

As of now, September 19th, 2011, 9:30 AM, the CEO, Mr. Hastings, thinks it is about the way the price changes were communicated. Reference

Update, October 2012: the company has lost half of its market value since October 2011.

In his seminal book, ‘The Innovator’s Dilemma,’ Clay Christensen talks about why industry leaders almost always fail to act when ‘disruptive change’ enters their business. He defines this as new products that are dramatically cheaper, lower quality, and lower margin, but with larger markets. … To win the future (Hastings) needs to attack his core assets by building new ones. Very few companies ever do this. Reference: The Innovator’s Dilemma

From Data Monster & Insight Monster

Is a Predictive Model a Statistical Model? Can an Insight Model Be a Predictive Model?

My thoughts on

http://jtonedm.com/2011/04/11/predictive-models-are-not-statistical-models/comment-page-1/#comment-24313

DaveG brought out some important considerations that are pretty unique from a statistical perspective.

There is nothing wrong with borrowing from any science to apply conceptual and experiential thinking that consistently explains a phenomenon. Statistical, mathematical, computational, and econometric sciences all have a role in the business of prediction and the resulting applications.

That is why predictive analytics is a science in its own right.

Not all statistical models are regression models, just as not all predictive analytics is statistical. But the moment one understands the concept of random error, even without fully appreciating it, one is in the statistical sciences.

Statistical or not, we need proof before we believe that a process will always work consistently. Working with predictive models, I see how pure computational scientists have difficulty establishing the superiority of one algorithm over another, simply because they do not use Type I and Type II errors when the data has a random component. We end up ranking one algorithm over another even though the difference between them could be due to the random component; in the end we are not really able to say what matters and come away empty-handed; talk about uncertainty. Statistics is a science of making certainty statements in a world of uncertain phenomena (C. R. Rao, National Medal of Science laureate), albeit with the need for extra language that states uncertain conclusions in certainty-looking sentences.

It is eye-opening to see how people end up arguing about random phenomena because their Type I and Type II errors are so wide at the tail end of the decision tree, and/or different people have different bounds for them from node to node, without knowing the differences. In fact, in many applications the eventual implications of these Type I and Type II errors, and where they occur, come out only later in the game, because applying the model reveals the real consequences of their inapplicability under any number of practical considerations.

This scientific culture led to famous decisions in the first million-dollar data mining competition: Netflix identified the first and second winners based on a business rule of who submitted first, the difference between the first- and second-place entries was at the fifth decimal, and eventually none of the algorithms was used in practice, for the business reason of the cost of implementing the solution. For a full account of how the Netflix challenge was won, see http://www.research.att.com/articles/featured_stories/2010_01/2010_02_netflix_article.html?fbid=gy-J6K6DxJh.
 The winning team photo: http://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/?ref=technology&_r=0 

While these cases will be discussed and interpreted in graduate schools, the minute vein that caused the aneurysm is perhaps hard to find, or in the practical world it does not matter. The best we can do is stay maniacally focused on a consumer-centered vision built around the key factors: customer satisfaction, price, product quality, and availability of post-sales service – but that is a different discussion.

The advances in data mining and computing made possible what was not statistically possible before. But if you innocently use a million observations without proper sampling to build a decision tree, almost every variable in your predictive equation will become significant – over the years there have been many more false-positive explanations of marketing models simply because we do not know where to draw the line regarding the significance or importance of a variable, blindly using 100- to 200-variable models. The difficult part is that we will not even know we are biasing the results with a lot of random error. Even with machine learning methods, it is a good idea to add a pre-processing step where you reduce the variables in a statistical way and then feed them into the machine learning algorithms (a rough sketch of this idea follows below). In a way, an analyst needs an important and underutilized class of models, what I call “insight models”, as opposed to predictive models.
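Here is a minimal sketch of that pre-processing idea, using scikit-learn on synthetic data (my choice of tooling and data, not anything prescribed in the post): a univariate statistical screen keeps only the strongest variables before the learner is fit.

```python
# Statistical pre-processing before a machine learning step: a univariate
# screen (F-test) keeps a handful of variables out of 200 before the tree is fit.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic wide data: 200 columns, only 10 of them truly informative.
X, y = make_classification(n_samples=10_000, n_features=200, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep the 20 variables with the strongest univariate F-statistics ...
screen = SelectKBest(f_classif, k=20).fit(X_train, y_train)

# ... and only then fit the learner on the reduced set.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(screen.transform(X_train), y_train)
print("holdout accuracy:", tree.score(screen.transform(X_test), y_test))
```

The point is not the particular screen; it is that some statistically motivated reduction happens before the algorithm sees hundreds of candidate variables.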

Over the years I have come to appreciate the importance of balancing these two types of models for any given predictive situation, especially in a world where the proliferation of data length (millions) and number of data elements (thousands) – rows versus columns – has become common, and there is either too much correlation among many of the variables or too little structure in many of the elements.

Now here is the kicker to kindle some thought: for every predictive model there is an insight model that performs as well as the predictive model. After all, I cannot easily get out of the mode of statistical thinking. And perhaps in the practical world, especially in application areas like Netflix – a.k.a. big data – it suggests what is coming in the future.

From Data Monster & Insight Monster

PREDICT 402DL – Introduction to Predictive Analytics and Data Collection

I teach this course at Northwestern University. If you are a student, you may also refer to courses.northwestern.edu for complete details. These notes are posted for other analysts so that they may become interested in learning more about the MSPA (Master of Science in Predictive Analytics) program at Northwestern.

The first three chapters provide a working definition of analytics, state the importance of analytics across the organization for competing in business, provide tools to assess common attributes of analytically competitive businesses, and help rank the stages of analytic competition.

In chapters 4 and 5, the book helps (1) identify analytic techniques used to analyze internal business processes, (2) select the appropriate analytic applications for a given internal business process, (3) identify analytic techniques used to analyze external business processes, and (4) select the appropriate analytic applications for a given external business process.

In the third part of the book, chapters 6-9 provide tools to assess the analytic capabilities of an organization, methods to determine what stage of analytic competition an organization is in, a walkthrough of how organizations move through the stages of becoming an analytic competitor, a comparison of the roles of analytic executives, analytic professionals, and analytic amateurs, an explanation of the six elements of BI architecture, and finally the relationships among those six elements.

In the first two chapters, the authors walk us through how to:

• Organize the components of the business analytics model.
• Assess the role of data in the business analytics model.
• Classify the different types of links between business analytics and strategy.
• Recognize the types of analytic information available to inform the three disciplines outlined.

In Chapter 3, the discussions are around how to:

• Compare and contrast lag and lead information.
• Distinguish how lead versus lag information can be used in the development and management of a new business process.
• Distinguish how lead versus lag information can be used to optimize existing processes.
• Assess each of the business processes listed on the three disciplines.
• Classify key performance indicators into their suggested business functions.

In chapter 4, the topics covered are how to:

• Apply a strategy mapping process to match analytic techniques to information requirements.
• Explain the difference between data, information, and knowledge.
• Evaluate the importance of each of the analyst competencies.
• Evaluate the advantages and disadvantages of different types of analytic reports.
• Formulate business examples of when the use of data-driven versus data mining versus explorative analytic methods would be appropriate.
• Compose effective business requirement documents.

In chapters 5 and 6, the concepts covered are:

• Explain the relationship between components in a data warehouse.
• Identify business systems that may generate data.
• Organize the steps in the extraction transformation loading (ETL) process.
• Propose potential sources of poor quality data.
• Evaluate the effects of poor quality data.
• Identify potential sources of data in an organization.
• Assess the relationship between the usability and the availability of data.

Interestingly, to have a better-organized approach, we cover:

• Evaluate the benefits and limitations of data collection methodologies.
• Evaluate the benefits and limitations of data collection modalities.
• Apply the fundamentals of survey design to develop an effective survey.

from Chapters 5, 7, and 8,

and

• Explain the importance of sampling in analytics.
• Compare and contrast different sampling techniques.
• Assess the impact of missing data on the analytic process.
• Appraise the benefits and limitations of different data imputation techniques.

are covered from Chapters 3, 4, 6, and 10. (A small illustrative sketch of sampling and imputation follows below.)
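As a small illustrative sketch (mine, not from the course materials) of two of the topics named above, the snippet below draws a simple random sample with pandas and fills missing values with a basic mean imputation; the data frame is made up.

```python
# Simple random sampling and basic mean imputation with pandas (toy data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52_000, np.nan, 61_500, 48_200, np.nan, 75_000],
    "region": ["N", "S", "S", "W", "N", "W"],
})

# Simple random sample of half the records (fixed seed for reproducibility).
sample = df.sample(frac=0.5, random_state=42)

# Mean imputation: the simplest possible fill; note that it shrinks variance.
df["income_imputed"] = df["income"].fillna(df["income"].mean())

print(sample)
print(df)
```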

From Summer 2012, this additional book has been added to support the topics on visualization (Chapters 1-3, 5, and 6):

Now You See It: Simple Visualization Techniques for Quantitative Analysis
