How confident are you in your data quality? Many organizations would reluctantly admit that their data quality is not at the level they would like. Data quality is an issue that plagues IT departments across the world, and managers and executives can be more confident in their decisions when there is an initiative to improve it. QualityStage is the data quality application in the InfoSphere Information Server platform. One basic concept in QualityStage is data investigation, and this post will show you how to use the Investigate stage.
The Investigate Stage
The Investigate stage allows you to assess what some of the current data quality issues are. Investigating the data can fall under one of two broad categories:
- Single Domain Investigation: Used to investigate individual single-domain columns. Some examples of single-domain columns: First Name, Zip Code, SSN
- Free Form Investigation: Used to investigate free-form columns, that is, columns which contain more than one domain. For example: 1444 N. Farnsworth Ave. Aurora IL
Single domain investigations produce reports that let you view frequency distributions of values as well as formats of the data. Free-form investigations additionally require a standardization rule to be applied to the data. Once the data has been standardized, the investigation produces two reports: a token report and a pattern report.
Types of Investigations
The Investigate stage can be found under the Data Quality category in the palette of the DataStage Designer. The stage is capable of running three types of investigations:
- Character Discrete: Useful for investigating single-domain columns. Requires choosing a mask of C, T, or X for each investigated column.
- Character Concatenate: Useful for investigating relationships between single-domain columns. This investigation concatenates the values in one or more columns before running the investigation. Requires choosing a mask of C, T, or X for each investigated column.
- Word: Useful for investigating free form fields. Requires a standardization rule be applied to the data.
More on Masks
Masks influence how the stage investigates the data. A mask has to be specified for every character in the column, and you can use any combination of masks for each column which will be investigated. There are three types of masks: C, T, and X.
| Mask | Description |
|------|-------------|
| C | View the actual character values of the data |
| T | View the pattern of the data |
| X | Ignore the character of the data |
Here is an example of using masks in the Investigate stage. When using a T mask, characters are represented with an a and numbers are represented with an n.

| Value | Mask | Result |
|-------|------|--------|
| (123) 456-7890 | CCCCCCCCCCCCCC | (123) 456-7890 |
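The mask behavior can be sketched in a few lines of Python. This is a simplified illustration of what the masks do, not QualityStage's actual implementation; in particular, I am assuming here that a T mask passes punctuation through unchanged, which the examples above do not explicitly state.

```python
def apply_mask(value, masks):
    """Apply a per-character mask string (C, T, or X) to a value.

    C keeps the actual character, T shows its type (a for letters,
    n for numbers), and X ignores the character entirely.
    """
    out = []
    for ch, mask in zip(value, masks):
        if mask == "C":
            out.append(ch)              # view the actual character
        elif mask == "T":
            if ch.isalpha():
                out.append("a")         # characters become a
            elif ch.isdigit():
                out.append("n")         # numbers become n
            else:
                out.append(ch)          # assumption: punctuation as-is
        elif mask == "X":
            pass                        # character is ignored

    return "".join(out)

print(apply_mask("(123) 456-7890", "C" * 14))   # (123) 456-7890
print(apply_mask("(123) 456-7890", "T" * 14))   # (nnn) nnn-nnnn
```

With a C mask the value passes through untouched, which is why a C-mask report shows the actual data values; the T mask is what produces patterns like the ones we will see later.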
We’ll take a look at two different examples of using the Investigate stage: we will run a Character Discrete investigation on two columns from our source data. Let’s create a very simple job using a source Sequential File stage, the Investigate stage, and a target Sequential File stage.
The source data is customer data with some quality issues that need to be investigated. Here is a sample of the data:
Character Discrete with a C Mask
First we will run a Character Discrete investigation on the SourceSystem column using a C mask. Let’s open up the Investigate stage. The stage defaults to a Character Discrete investigation. You can choose which type of investigation to run by clicking on the associated button at the top of the window. You should see all of the available input columns.
We’ll select the SourceSystem column and click on the Add To Selected Columns button. You’ll be presented with a new window which lets you choose which mask you want to use. We’ll select the C mask and then click on the OK button.
You will see that the SourceSystem column was moved from the Available Data Columns section to the Scheduled Process section. You can also see which mask was chosen for the column. You can add additional columns to investigate as well if you would like.
Click on the Stage Properties button in order to open a new window representing the stage properties. Map the columns to the output in the Output > Mapping tab. Now we will close out of the Investigate stage. The target Sequential File stage has already been configured so we can save, compile, and run the job.
Once the job completes you can view the investigation report in the target stage. Since we were using a C mask the qsInvPattern column will always have the same value as the qsInvSample column. The last two columns represent a frequency distribution of the 3 different values which the stage found for the SourceSystem column. We can see that about 53% of our data came from the A source system, 13% from the H source system, and 34% from the L source system.
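Conceptually, the frequency distribution in this report is just a count and a percentage per distinct value. The sketch below mimics that with the collections module; the sample values are invented to mirror the distribution described above, not taken from the actual source file.

```python
from collections import Counter

# Invented SourceSystem values, roughly mirroring the distribution
# described above: mostly A, some H, and some L records.
values = ["A"] * 53 + ["H"] * 13 + ["L"] * 34

counts = Counter(values)
total = len(values)

# One report row per distinct value, like qsInvCount / qsInvPercent.
for value, count in sorted(counts.items()):
    print(f"{value}: qsInvCount={count}, qsInvPercent={100 * count / total:.2f}")
```

Each distinct value becomes one row in the report, with its raw count and its share of the total record count.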
Character Discrete Investigate with a T Mask
Now let’s run a different investigation. We’ll open up the Investigate stage, select the SourceSystem column, and click on the Delete button. We will now run a Character Discrete investigation for the PolicyNumber column with a T mask. We can select the PolicyNumber column, click on the Add To Selected Columns button, and use a T mask for every character.
Click on the OK button and notice that the PolicyNumber column has moved from the Available Data Columns section to the Scheduled Process section. It is defined with the T mask.
Let’s close the Investigate stage and run our job. Once the job completes you can view the investigation report in the target stage. You can see that the stage found 5 different formats for the PolicyNumber column, represented by the qsInvPattern column. The qsInvSample column took a sample value from the incoming data so that you can easily view it in the investigation report. The qsInvCount and qsInvPercent columns represent our frequency distribution of the 5 different formats. You can see that about 3% of the data was in the aanaannnnnn format, 17% was in the aanannnnnnn format, 63% was in the aannannnnnn format, less than 1% was in the nnnnnnnn format, and 16% was in the nnnnnnnnn format. Remember that an a in the pattern represents a character and an n represents a number.
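A pattern report of this kind amounts to grouping records by their generated pattern and keeping one sample value per group. Here is a simplified sketch of that idea; the policy numbers are invented for illustration and this is not QualityStage code.

```python
from collections import defaultdict

def to_pattern(value):
    """Generate a T-mask style pattern: letters -> a, numbers -> n."""
    return "".join("a" if c.isalpha() else "n" if c.isdigit() else c
                   for c in value)

# Invented policy numbers illustrating two of the formats above.
policy_numbers = ["PL12B345678", "PL98B765432", "12345678"]

groups = defaultdict(lambda: {"count": 0, "sample": None})
for pn in policy_numbers:
    group = groups[to_pattern(pn)]
    group["count"] += 1
    if group["sample"] is None:
        group["sample"] = pn    # like qsInvSample: keep one example value

# One report row per pattern: qsInvPattern, qsInvSample, count, percent.
total = len(policy_numbers)
for pattern, group in sorted(groups.items()):
    print(pattern, group["sample"], group["count"],
          f"{100 * group['count'] / total:.1f}%")
```

Grouping by pattern rather than by raw value is what makes the Word and T-mask reports compact: thousands of distinct policy numbers collapse into a handful of formats.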
The Investigate stage is useful for assessing the frequency distributions of values and formats in your data. Once you have a better understanding of your data, you can more efficiently cleanse it during a downstream process. The stage can also be used to see how well a standardization rule handles incoming data. We’ll take a look at that in another post.