A Statistician’s Ten Steps for Data Quality Management
Every time you receive data, identify and agree on which metadata is implemented in the system and which metadata supports the business logic. Always ask for the data dictionary, which is managed by the IT department. Also ask for the first and last 10 records of the data being delivered.
- Ask for the data to be delivered in a particular format (CSV, TXT with a special separator character, Excel, or other forms such as SAS, SPSS, DB2, …) that you are comfortable handling. In my experience, data is easiest to work with when it is delivered as fixed-format text. It is easier still if an automated process creates what is called a ‘Data Audit Report’, so that analysts can take a quick look at the delivered data and communicate with the data delivery team about its quality.
- Make sure you can read the data and output the top 10 and bottom 10 records. Visually inspect the sample values for each variable and make sure they match what the IT department promised to deliver.
- Check that the total number of observations sent by the provider equals the total number of observations received.
- Check how the numeric elements are coded: as numeric values or as character strings?
- If a field is numeric, find out (1) whether it is an integer, (2) its minimum, (3) its maximum, and (4) its number of missing values. For alpha (character) variables, the equivalent is the full list of distinct values along with the number of missing values.
- Run the consistency checks that should hold among variables. For example, if the data contains total revenue and also revenue by product group, make sure the product group revenues sum to the total revenue, after confirming with the business/IT managers that this relationship is supposed to hold. This is the tricky part, because there are so many possible consistency checks. Identify the major ones that are quick to run and check those.
- The Data Audit Report should also show the distribution of each variable. For a numeric variable, use quintiles or deciles to see the distribution. For a character variable, use the frequency of each distinct value.
- Make sure weights are provided if the data comes from a sample survey or is a sample drawn from a population. If weights are not provided, create a weighting system using an auxiliary variable that is available for the full population.
- If the data is provided for a predictive model, make sure you select the right reference population when modeling the target population. Whether the application is B2B or B2C, the reference population is not the whole US population list.
- Cover missing value distributions (missing vs. not missing) in any communication with the IT department, so that the processes can be re-oriented to capture the data better.
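
As a rough illustration of several of the checks listed above (top and bottom records, row counts, numeric summaries, value frequencies, missing counts, and one sum-to-total consistency check), here is a minimal Python sketch of a Data Audit Report that uses only the standard library. The function names (`read_rows`, `audit_column`, `audit`, `sum_mismatches`), the missing-value codes, the tolerance, and the sample columns are all assumptions made for this example, not part of any particular tool.

```python
import csv
import io
import math
from collections import Counter
from statistics import quantiles

MISSING = {"", "NA", "NULL"}  # assumed missing-value codes; adjust per data dictionary

SAMPLE = """id,region,total_rev,rev_a,rev_b
1,East,100,60,40
2,West,250,100,150
3,East,80,50,20
4,,NA,10,5
5,West,300,120,180
"""  # hypothetical delivery used only for illustration


def is_missing(value):
    return value is None or value.strip() in MISSING


def to_number(value):
    """Return a finite float, or None if the value does not parse as one."""
    try:
        x = float(value)
    except (TypeError, ValueError):
        return None
    return x if math.isfinite(x) else None


def read_rows(text):
    """Parse delimited text into (field names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames or [], rows


def audit_column(values):
    """Summarise one column as numeric or character."""
    present = [v.strip() for v in values if not is_missing(v)]
    n_missing = len(values) - len(present)
    numbers = [to_number(v) for v in present]
    if present and all(x is not None for x in numbers):
        report = {
            "type": "numeric",
            "n_missing": n_missing,
            "is_integer": all(x.is_integer() for x in numbers),
            "min": min(numbers),
            "max": max(numbers),
        }
        if len(numbers) >= 2:  # quantiles() needs at least two points
            report["quintiles"] = quantiles(numbers, n=5)
        return report
    return {
        "type": "character",
        "n_missing": n_missing,
        "frequencies": dict(Counter(present)),
    }


def audit(fields, rows, expected_rows=None):
    """Build the audit report: counts, head/tail, per-column summaries."""
    return {
        "n_rows": len(rows),
        "row_count_ok": expected_rows is None or len(rows) == expected_rows,
        "head": rows[:10],
        "tail": rows[-10:],
        "columns": {f: audit_column([r[f] for r in rows]) for f in fields},
    }


def sum_mismatches(rows, total_col, part_cols, tol=0.01):
    """Return 1-based row numbers where the parts do not sum to the total.

    Rows with a missing or non-numeric value are skipped here, since the
    per-column missing counts already report them.
    """
    bad = []
    for i, row in enumerate(rows, start=1):
        total = to_number(row.get(total_col))
        parts = [to_number(row.get(c)) for c in part_cols]
        if total is None or None in parts:
            continue
        if abs(total - sum(parts)) > tol:
            bad.append(i)
    return bad


if __name__ == "__main__":
    fields, rows = read_rows(SAMPLE)
    report = audit(fields, rows, expected_rows=5)
    for name, summary in report["columns"].items():
        print(name, summary)
    print("sum mismatches:", sum_mismatches(rows, "total_rev", ["rev_a", "rev_b"]))
```

In this sketch a column counts as numeric only if every non-missing value parses as a finite number, which also answers the question of whether a nominally numeric field was delivered with character codes mixed in. The head, tail, and per-column summaries are the pieces one would typically send back to the data delivery team.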