Monthly Archives: June 2013

How IBM Sees The Trends In Business Intelligence

These are listed in no particular order of priority.

– Real-time fraud detection

– Real-time marketing campaigns

– Call center optimization to meet service levels

– Business analytics using unstructured data (weblog/blog/wikis/product review and recommendations)

– Social media and social interactions

– Intelligent traffic management to meet service levels at lower cost

– Smart power grids, with intelligent meters in homes sending data to the power companies

– Sustainability

– Bioinformatics and health analytics

Key challenges in doing the above work:

– big data (volume/variety/velocity)

– Smart analytics – Advanced and Predictive analytics

– Faster decisions and faster time to value (real-time analytics, agile BI, and self-service BI)

From Data Monster & Insight Monster

Instructors/Professors are invited to apply for Institute of Analytics, Chennai

Institute of Analytics, Chennai is looking for:

1. Instructors to conduct Hadoop/HIVE/PIG training using AWS and Cloudera architecture

This is an in-person, three-day course in Chennai; applicants based in India are encouraged to apply.

2. Instructors to conduct a six-day course, “Introduction to SAS and Predictive Modeling”

3. Instructors to conduct a six-day course, “Introduction to R and Machine Learning Methods”

Please send your resume and a cover letter expressing your interest to hr@instituteofanalytics.com.  Attractive compensation and travel costs are offered.

MSPA 410 BOOKS – SUMMER 2013 – Predictive Modeling I

The following books are used in the summer 2013 “Introduction to Predictive Modeling – I” course that I will be teaching.

If you register with your university student email account, an Amazon student account gets you the student price for Amazon Prime, which upgrades you to free two-day shipping. Use the Amazon offer to take advantage of the student rate and get your books faster.

The Books are:

Seven Key Trends in Predictive Analytics

  • Data enrichment with third party and census data – if your organization is starving for data

If data is the oil, as the Economist magazine characterized it, you need a mechanism to keep your engine running properly. Powerful third-party data, as well as extensive census data, are available for enriching your data vaults. This will get a lot easier as ‘open data’ becomes more pervasive.

  • Text analytics – extensive discussion is provided in Eric Siegel’s book on IBM Watson

Eighty percent of data is non-standardized. Sentiment analysis is a subset of this.

  • Ensemble methods – extensive discussion is provided in Eric Siegel’s book
  • Addressability and scalability of survey intelligence: segmentation based on sample surveys, and how to extend survey-based customer intelligence to the whole consumer/customer base
  • Integrating email marketing with web surfing data
  • Combining brick (physical store visit) data and click (web store) data, and optimizing offers
  • Engaging consumers through a curriculum of messaging activities similar to President Obama’s 2012 campaign

Greatest Opportunity of our Generation – a $33 Trillion Wealth Creation In a Decade

McKinsey: The $33 Trillion Technology Payoff

I am speechless. You have plenty to catch up on: read the 150-plus pages of material.

A key slide from McKinsey

If the mix of big data and analytics accounts for at least 5% of the low-end estimate for knowledge work, which is $5T, then analytics alone has a lower-bound estimate of $250B (5% of $5T).

Just five years ago, I remember a marketing data company executive arguing that analytics could not even be a $5 billion industry! The dynamics of data collection are so deep and pervasive that the field is changing fast.

Enjoy folks – life is fun… and so much to offer…

Disruptive technologies: Advances that will transform life, business, and the global economy

Missing Value Strategies


There is no better solution than to implement a data capture mechanism that encourages, coaxes, or incentivizes the application's end users to provide the data, so that missing values do not arise in the first place.
One of the slides or tabulations in your study should explicitly report the percentage missing for every variable used in the study.
Explain why missing values occur for each of the study variables.
Sometimes missing values are confused with values that naturally do not exist because of how the variable is constructed or defined. Differentiate the two when explaining the percentage missing in the study variables. For example, in surveys that use hierarchical, rule-based questions, some values will be missing simply because of the structure of the questions.
For best results in applied settings, use multiple imputation methods.
Both the “delete strategy” and the “mean strategy” can be very biased methods, the former more so than the latter, depending on how much data is missing.
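As a rough illustration of the tabulation described above, the following Python sketch reports, for each variable, the percentage that is truly missing separately from the percentage that does not apply because of the question structure. The records, the variable names (owns_car, car_age, income) and the skip rule are all hypothetical, made up for this example only.

```python
# Hypothetical survey records; None means no value was recorded.
records = [
    {"owns_car": "yes", "car_age": 3,    "income": 52000},
    {"owns_car": "no",  "car_age": None, "income": 41000},
    {"owns_car": "yes", "car_age": None, "income": None},
    {"owns_car": "no",  "car_age": None, "income": 38000},
    {"owns_car": "yes", "car_age": 7,    "income": 61000},
]


def is_applicable(record, variable):
    """Assumed skip rule: car_age is only asked when owns_car == 'yes'."""
    if variable == "car_age":
        return record["owns_car"] == "yes"
    return True


def missing_report(records, variables):
    """Return {variable: (pct_missing, pct_not_applicable)} over all records."""
    n = len(records)
    report = {}
    for var in variables:
        # Values that cannot exist because the question was never asked.
        not_applicable = sum(1 for r in records if not is_applicable(r, var))
        # Values that should exist but were not captured.
        missing = sum(
            1 for r in records if is_applicable(r, var) and r[var] is None
        )
        report[var] = (100.0 * missing / n, 100.0 * not_applicable / n)
    return report


report = missing_report(records, ["owns_car", "car_age", "income"])
for var, (pct_missing, pct_na) in report.items():
    print(f"{var:10s} missing={pct_missing:5.1f}%  not applicable={pct_na:5.1f}%")
```

Keeping the two percentages in separate columns makes it harder to mistake a structural skip for a data capture problem when the table is shown in a study.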
References:
http://www.ke.tu-darmstadt.de/publications/reports/tud-ke-2009-03.pdf  This report leans towards the machine learning perspective. In the authors’ experiments with different strategies, a large number of methods are almost equally effective when the incidence of missing values is small; when too much is missing, they diverge in effectiveness. Not surprisingly, the “delete strategy” is the worst of all. Combining multiple strategies is also likely to yield better results.
http://people.oregonstate.edu/~acock/growth-curves/working%20with%20missing%20values.pdf   This document provides an excellent summary of statistical software approaches. The authors provide a list of software and, for some packages, example code showing how to use them. The following is taken from the authors’ publication.
http://maartenbuis.nl/presentations/missing_cifor.pdf is an interesting presentation that walks systematically through the problem and a desired approach to solving missing values.
http://www.ats.ucla.edu/stat/sas/library/multipleimputation.pdf provides notes on how to use PROC MI in SAS for multiple imputation of missing values.

Also, see the following video for recent advances in missing value imputations.

http://www.youtube.com/watch?v=xnQ17bbSeEk

Statistician’s Ten Steps for Data Quality Management

Every time you receive data, identify and agree on the metadata implemented in the system versus the metadata that supports the business logic. Always ask for the data dictionary, which is managed by the IT department. Also ask for the first and last 10 records of the data being delivered.

  1. Ask for the data to be delivered in a format you are comfortable handling (CSV, TXT with a special separator character, Excel, or other database forms such as SAS, SPSS, DB2, …). Over long experience, I have found it easiest when the data is delivered as fixed-format text. It is easier still if an automated process creates what is called a ‘Data Audit Report’, so that analysts can take a quick look at the delivered data and communicate with the data delivery team about its quality.
  2. Make sure you can read the data and output the top 10 and bottom 10 records. Visually inspect the sample data for each variable and make sure it matches what the IT department promised to deliver.
  3. Check whether the total number of observations sent by the provider and the total number of observations received are the same.
  4. How are the numeric elements coded? As numeric or as character?
  5. If a field is numeric, find out (1) whether it is an integer or not, (2) the minimum, (3) the maximum, and (4) the number of missing values. For alpha (character) variables, check the equivalent: the full list of values along with the number of missing values.
  6. Run the consistency checks that exist among variables in the data. For example, if there is a total revenue and also revenue by product group, first confirm with the business/IT managers that such a consistency rule applies, then make sure the sum of the product group revenues equals the total revenue. This is the tricky part, because there are so many ways to define consistency checks. Identify the major ones that are quick to run and check them.
  7. The Data Audit Report should also show the distribution of each variable. If a variable is numeric, use quintiles or deciles to see the distribution. If a variable is a character variable, use the number of occurrences of each value.
  8. Make sure weights are provided if the data comes from a sample survey or if a sample is taken from a population. If weights are not provided, create a weighting system using an auxiliary variable that is available for the full population.
  9. If the data is provided for a predictive model, make sure you select the right reference population when modeling the target population. The reference population is not the entire US population, whether the application is B2B or B2C.
  10. Missing value distributions (missing or not) should also be covered in any communication with the IT department, so that processes can be re-oriented to capture data better.
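Several of the steps above (record counts, top and bottom records, numeric versus character coding, min/max, missing counts, distributions and a consistency check) can be sketched as a small 'Data Audit Report' in Python. This is only a minimal sketch under stated assumptions: the inline CSV, the column names, the expected record count and the revenue rule are hypothetical stand-ins for a real delivery and a real data dictionary.

```python
import csv
import io
from collections import Counter
from statistics import quantiles

# Hypothetical delivery: a small inline CSV standing in for the provider's file.
SAMPLE = """customer_id,region,revenue_a,revenue_b,revenue_total
1,East,100,50,150
2,West,200,,200
3,East,80,20,100
4,,120,30,155
"""


def is_number(text):
    """True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def audit(rows, expected_count):
    """Build a small audit report for a list of dict rows (values are strings)."""
    report = {"count_matches": len(rows) == expected_count, "columns": {}}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        present = [v for v in values if v != ""]  # empty string = missing
        info = {"n_missing": len(values) - len(present)}
        if present and all(is_number(v) for v in present):
            nums = [float(v) for v in present]
            info.update(
                kind="numeric",
                min=min(nums),
                max=max(nums),
                all_integer=all(n.is_integer() for n in nums),
                deciles=quantiles(nums, n=10) if len(nums) > 1 else [],
            )
        else:
            info.update(kind="character", frequencies=dict(Counter(present)))
        report["columns"][col] = info
    return report


def revenue_mismatches(rows):
    """Assumed rule: revenue_a + revenue_b == revenue_total.

    Missing parts are treated as 0 here, which is itself an assumption
    to confirm with the business/IT managers.
    """
    bad = []
    for r in rows:
        parts = [float(r[c] or 0) for c in ("revenue_a", "revenue_b")]
        if sum(parts) != float(r["revenue_total"]):
            bad.append(r["customer_id"])
    return bad


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print("top 10:", rows[:10])
print("bottom 10:", rows[-10:])
report = audit(rows, expected_count=4)  # 4 = count stated by the provider (assumed)
print(report)
print("revenue mismatches:", revenue_mismatches(rows))
```

In practice the expected count and the consistency rules would come from the provider and the data dictionary rather than being hard-coded, and the report would be sent back to the data delivery team along with the missing counts.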