Our previous blog post gave our readers an introduction to the Investigate stage in the QualityStage application. It showed you how to use the Investigate stage to run two types of investigation reports: character discrete and character concatenate. This post will continue on the topic of the Investigate stage, but will focus on the third type of investigation which can be run in the stage – word investigation.
Word investigation can be used for investigating free-form fields, columns which typically have more than one data element in them. With word investigation a standardization rule set has to be specified. This rule set will be applied to the data prior to producing the report results. Some of the out of the box standardization rule sets include:
• USNAME – Rule set for standardizing US names – Mr. Artur S Kopec
• USADDR – Rule set for standardizing US addresses – 1444 N. Farnsworth Avenue
• USAREA – Rule set for standardizing US areas – Aurora IL. 60505
A word investigation is able to create both a token report and a pattern report. A token report outputs the columns of a frequency word report. A pattern report outputs the columns of a frequency pattern report to a separate output link.
More on Token and Pattern Reports
A token report will produce 3 columns. qsInvCount is a count of the number of times a certain word/token appeared. qsInvWord is the word/token which appeared in the data. qsInvClassCode is the class which has been specified for a particular word/token. The class code value is unique to a specific standardization rule set. Here are some sample class codes, belonging to the USNAME rule set, and what they represent:
|W||Organization Name Words|
|P||Individual Name Prefixes|
A pattern report will produce 5 columns. The first column is simply the column(s) which were used for investigation. The qsInvPattern column is a representation of the pattern which the source data is appearing in. The qsInvSample column shows you a sample record which produced the pattern appearing in the second column. The last 2 columns represent the frequency distribution of the various patterns found in the source data. Notice that you also see commas, periods, and question marks in the resulting pattern. A question mark represents an unclassified alpha token.
We’ll take a look a sample QualityStage job which uses the Investigate stage and runs a word investigate report on our source data. This job is very similar to the one from our previous blog, however now we can have two output links from the Investigate stage. One output link will contain the token report and the other output link will have the pattern report.
The source data is the same data which we were dealing with before. We have some customer data with some data quality issues which have to be investigated.
We will be running a word investigate report on the AddressLine1 and AddressLine2 columns. These columns contain values that you might typically see on an online order form. AddressLine1 contains the building address and AddressLine2 contains values such as the apartment number. The standardization rule set which we will use with this column is the USADDR rule set. This rule set is used for standardizing US address data.
We’ll open up the Investigate stage and click on the Word Investigate button at the top of the stage to specify that we want to run a word investigation.
First we will select the columns which we will Investigate. We can select the AddressLine1 column and click on the > button in order to specify that we want to investigate this column. We can do the same for AddressLine2. You can also specify if you would like the stage to produce a Pattern Report, a Token Report, or both by checking the associated check-box towards the lower left hand side of the stage. We’ll check both of the checkboxes.
Now we have to specify which Rule Set we will be using for this investigation. Let’s click the … button and browse the repository for the USADDR Rule Set. Once you select the Rule Set, it will then be defined in the Investigate stage.
Just like with any other DataStage/QualityStage stage you have to map the columns to the output. Before doing this, verify that the correct report is going to the correct output link. You can open the stage properties and go to the Stage > Link Ordering tab to verify that the correct report will go out the correct output link.
Once the link ordering has been verified, you can go ahead and map the columns to the output. You can do this through the Output > Mapping tab in the Stage Properties or by right-clicking the stage in the canvas and automatically mapping the columns across. The target Sequential File stages have already been configured so we can save, compile, and run the job.
Interpreting the Results
Once the job completes you can view the investigation reports in the target stages. For the token report, the qsInvClassCode column represents what class was assigned to the associated word. You can see that a class of T (Street Type) was assigned to the word DR (Drive). This token was present 565 times in the source data. As for the pattern report, we can see that 41% of our data is in the ^?T pattern. Which is representative of a numeric token, followed by an unclassified alpha token, followed by a Street Type token. For the qsInvSample column you can see the 235 is the ^ token, Alpha is the ? token, and AVE is the T token.
Investigation Token Report
Investigation Pattern Report
The investigate stage is useful for profiling and analyzing your data. You now have the knowledge to begin developing QualityStage jobs with the Investigate stage to begin assessing some of the data quality issues you might be facing.