Check out the video associated with this blog!
In version 9.1 of DataStage, you can read Excel sheet data (.XLS and .XLSX) directly into DataStage using a new stage called the Unstructured Data stage. Let’s take a closer look at how to do this. First, we’ll create a simple parallel job which connects three stages: an Unstructured Data stage, a Transformer stage, and a Data Set stage.
This job will read the data from the Excel file located on the file system of the DataStage server. Before we modify our job further, let’s open the Excel file on our client machine so that we can view the contents of it. Below are the contents of our specific Excel file:
- Tabular data will be extracted from the cell range: B5:K9
- The first row in this cell range consists of column headers
- Row 10 consists of summary data for the data records in rows 6 to 9 which we will not read into DataStage
- There are comments attached to the K column cells which we have to be read into DataStage
Back in the Datastage Designer, you can open up the Unstructured Data stage and click on the Configure button to open the configuration window for this stage. In the configuration window, you have to specify the name and the location of the Excel file on the DataStage server file system. In the Range expression property you specify the cell range defining the location of the data in the Excel file. You also have the option of picking up column headers from the first row in the defined cell range.
Once you click the Load button, the stage will get a template for the defined cell range and it will be displayed in the top right panel under the Excel Column tab. By default, all of the columns in the defined range are selected.
On the Property tab, you can select additional Excel metadata which can be read into DataStage. For example, some of the properties you can choose from include: Row number, Sheet name, File name, File path, Title, Subject, Authors, and Company. You can import these values as part of the data. In order to do this, select the desired properties and then click the Import button.
Once you have selected the required columns/properties and clicked the Import button, you then have to establish a mapping between the DataStage columns and the Excel columns. The Map panel will show the default mappings and allow you optionally override them as well.
The DataStage Column column contains the name of the DataStage column as it will be defined in the stage. Default names are automatically assigned based on header data extracted from the first row of the data range in the Excel file. The Excel Item column contains the columns defined in the Excel file and properties for any additional metadata columns. The Import Option column specifies what will actually be extracted from the cell. For Excel columns the default option is Value. The additional metadata columns do not have any selectable options. On the right hand side you have buttons which allow you to move columns up or down as well as insert and delete columns.
We can modify the DataStage column name as required by editing the DataStage Column field. For example, here we have renamed the Sales/Units columns to more descriptive names.
How do we extract the comments which are associated with the K column of the Excel file? In order to do this, you can insert a column into the Map panel and for Excel Item column you can select the column which contains the comments. Modify the Import Option column and specify the value as Comment.
Once we are done with the mapping setup in the configuration window, click the OK button in order to apply your settings to the stage.
While still in the Unstructured Data Stage, navigate to the Output –> Columns tab and modify the data types for the output columns. By default, almost all columns will automatically be defined as VarChar.
Once the data has been read into DataStage, you can apply required business transformations in the Transformer stage and then finally write the data to a data set for consumption with another DataStage job.