DataStage XML Processing Introduction

What is XML?

XML is pervasive across industries because of its versatility and neutrality for exchanging information between diverse devices, applications, and systems from various vendors.

These qualities, combined with its easy-to-understand nature, its ability to handle structured, semi-structured, and unstructured data, and its support for Unicode, make XML a universal standard for data interchange. XML has been readily accepted by the technical world because of its simplicity.

DataStage XML Processing Architecture

The XML pack is a DataStage connector that is designed to handle general XML data processing. As of the version 9.1 release, the XML pack is included out of the box with Information Server Enterprise Edition. The XML pack is installed on the Client, Services, Metadata Repository, and Engine tiers as part of the DataStage installation.

The XML pack has 4 logical tiers:

  1. The client tier has 2 major components:
  • XML metadata importer user interface
  • Assembly editor, which is used to create an XML processing job
  2. The services tier provides a “Representational State Transfer” service to wrap around the XML operator
  3. The metadata repository tier stores schema library and assembly information
  4. The engine tier hosts the XML run time operator to run the XML processing job

XML Stage Features

The most recent approach to XML processing in DataStage offers a single XML stage, which can be used as a source, target, or transformation stage – thus consolidating the functionality from 3 different stages used in older approaches.

  • Adds internal schema library management functionality that offers superior support for complex schemas, along with advanced capabilities that simplify parsing and composing more complex XML documents.
  • Uses the “Assembly Editor” GUI, which simplifies the complex task of defining hierarchical relationships and transformations.
  • Integrates tightly with XSD schemas. After an XSD schema is imported into the XML schema library manager, the metadata from the XSD is saved locally. This eliminates the requirement for schema files to be available at run time and eases the work of developers by making the schema metadata directly available to transformation and mapping functions.

Schema Library Manager

The schema library manager is used to import XML schemas (metadata) into the metadata repository. The metadata is then automatically available to any client in the Information Server suite, such as DataStage, QualityStage, and FastTrack. An XML schema is analogous to the column definitions of a sequential file.
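To make the column analogy concrete, here is a minimal Python sketch that reads a small XSD and lists each typed element as a name/type pair, much like a column definition. The schema text and the helper name `leaf_columns` are invented for illustration; this is not how the schema library manager itself works.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema vocabulary, in ElementTree's {uri}tag notation.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema, invented for illustration only.
BOOK_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="price" type="xs:float"/>
        <xs:element name="published" type="xs:date"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def leaf_columns(xsd_text):
    """Return (name, type) pairs for every element that declares a type."""
    root = ET.fromstring(xsd_text)
    return [
        (el.get("name"), el.get("type"))
        for el in root.iter(XS + "element")
        if el.get("type") is not None
    ]

columns = leaf_columns(BOOK_XSD)
for name, xsd_type in columns:
    print(f"{name:<10} {xsd_type}")
```

The `book` element itself is skipped because it declares a nested complex type rather than a `type` attribute; only the leaf elements behave like columns.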

From the DataStage Designer menu bar, you can access the schema library manager by clicking Import -> Schema Library Manager.

DB2 pureXML Integration – Did You Know?

DB2 v9 introduced pureXML, a feature that stores XML data, with its hierarchical structure, natively in a database table. Previously, managing XML involved one of the following indirect approaches:

  • Saving XML documents in a separate file system
  • Shredding XML data into multiple relational columns and tables
  • Isolating data into XML-only database systems
  • Stuffing XML data into large object (LOB) data types in relational databases
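As an illustration of the "shredding" approach in the list above, the following Python sketch splits one small XML document into rows for two relational tables. The document, the table layout, and the function name `shred` are all hypothetical; the sketch only shows why the hierarchy has to be flattened and re-linked by key when XML is not stored natively.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document, invented for illustration only.
ORDER_XML = """\
<order id="1001">
  <customer>Acme</customer>
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="1"/>
</order>
"""

def shred(xml_text):
    """Split one order document into a parent row and its child rows."""
    root = ET.fromstring(xml_text)
    order_id = int(root.get("id"))
    # One row for a hypothetical ORDERS table: (order_id, customer).
    order_row = (order_id, root.findtext("customer"))
    # One row per <item> for a hypothetical ORDER_ITEMS table,
    # linked back to the parent by order_id.
    item_rows = [
        (order_id, item.get("sku"), int(item.get("qty")))
        for item in root.findall("item")
    ]
    return order_row, item_rows

order_row, item_rows = shred(ORDER_XML)
print(order_row)
print(item_rows)
```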

Working With The Schema Library

Step 1: Creating a library

In the DataStage Designer menu bar, navigate to Import -> Schema Library Manager.

[Figure: Import XML schema file metadata, DataStage Designer]

Create a new library.

[Figure: Create new library, Schema Library Manager]

To import an existing XSD into the schema library, select an existing library and choose “Import New Resource” in the right-hand pane.

[Figure: Import XML schema file metadata, DataStage Designer]

Step 2: Viewing XML schema structure

Open a library by double-clicking its name. This opens another window that shows the XML structures from the schemas that have already been imported into the Schema Library Manager.

The screen below shows a sample XML schema structure for books. In the right pane, notice the various icons that represent different data types such as string, float, and date.

[Figure: Viewing an imported XML schema file, library manager]

Step 3: Importing related schema files

Schema files can reference each other through an INCLUDE or IMPORT statement. An INCLUDE statement references the other schema file by its schema location, and an IMPORT statement references it by its namespace. These references must be resolved within a single schema library, which means that all files referenced by an imported file must also be imported into that library.

INCLUDE Statement:

Unlike the IMPORT statement, which uses the namespace as the ID of the schema file, the INCLUDE statement uses the physical location of the file. When schema files are imported into Information Server, the location attribute of each file defaults to its file name. However, files are commonly referenced by more than just their file names. For example, Schema A can reference Schema B through a relative directory path, or even through a URL to a web-hosted file.

Example 1: <xs:include schemaLocation="../common/basic.xsd"/>
Example 2: <xs:include schemaLocation="http://www.example.com/schemas/address.xsd"/>
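Before importing a schema, it can help to list everything it refers to, so that all the referenced files can be brought into the same library. The Python sketch below is one way to do that outside of DataStage: it collects the schemaLocation of each include and the namespace of each import. The schema text, the import namespace, and the function name `schema_references` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema vocabulary, in ElementTree's {uri}tag notation.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical "Schema A"; the two include locations come from the examples
# above, the import namespace is invented for illustration.
SCHEMA_A = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="../common/basic.xsd"/>
  <xs:include schemaLocation="http://www.example.com/schemas/address.xsd"/>
  <xs:import namespace="http://www.example.com/types"/>
</xs:schema>
"""

def schema_references(xsd_text):
    """Return (include locations, import namespaces) declared by a schema."""
    root = ET.fromstring(xsd_text)
    # xs:include identifies the other file by its location.
    includes = [el.get("schemaLocation") for el in root.findall(XS + "include")]
    # xs:import identifies the other schema by its namespace.
    imports = [el.get("namespace") for el in root.findall(XS + "import")]
    return includes, imports

includes, imports = schema_references(SCHEMA_A)
print("include locations:", includes)
print("import namespaces:", imports)
```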


Hands-On Component

To run this exercise, download the following 2 XSD files:

https://dl.dropboxusercontent.com/u/16840390/definition.xsd
https://dl.dropboxusercontent.com/u/16840390/department.xsd

  • In the DataStage Designer menu bar, click Import -> Schema Library Manager.
  • Choose an existing library or create a new library.
  • Click ‘Import New Resource’ and select the first schema that you downloaded, department.xsd:

[Figure: XML schema file import error, library manager]

  • Notice that after the import is complete, an error message is displayed on screen. Click the Validate button to read more information about the error.

[Figure: XML schema file import error, library manager]

  • The error indicates that the schema department.xsd has a dependency on another schema, definition.xsd. The schema department.xsd contains an element named dept_id that has the type dept_id1. Because the type dept_id1 is defined in the schema definition.xsd, you need to import it, too. Go ahead and import the second XSD file, definition.xsd.

[Figure: XSD schema file upload complete]

  • You will notice that the errors persist even after importing definition.xsd. The errors now point to a missing file. Open department.xsd in a text editor such as Notepad or WordPad to find the schema location.
  • Copy the schema location, http://ibm.com/definitions/definition.xsd, from the include statement.
  • In the Resources View, select definition.xsd and paste the schema location into the File Location field, as shown in the following figure.

[Figure: Schema file include location]

  • This resolves the remaining validation errors in the schema library.
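The reason the error persists until the File Location is changed can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Information Server's actual validation logic: the schema text is a hypothetical stand-in for department.xsd (only the include location and the dept_id element are taken from the exercise), and the library is modeled as a plain dictionary from resource name to File Location.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema vocabulary, in ElementTree's {uri}tag notation.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical stand-in for department.xsd, reduced to the parts the
# exercise describes.
DEPARTMENT_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="http://ibm.com/definitions/definition.xsd"/>
  <xs:element name="dept_id" type="dept_id1"/>
</xs:schema>
"""

def unresolved_includes(xsd_text, file_locations):
    """Return include locations that match no File Location in the library."""
    root = ET.fromstring(xsd_text)
    wanted = [el.get("schemaLocation") for el in root.findall(XS + "include")]
    return [loc for loc in wanted if loc not in file_locations]

# After import, each resource's File Location defaults to its file name.
library = {"department.xsd": "department.xsd", "definition.xsd": "definition.xsd"}
before = unresolved_includes(DEPARTMENT_XSD, set(library.values()))

# Equivalent of pasting the schema location into the File Location field.
library["definition.xsd"] = "http://ibm.com/definitions/definition.xsd"
after = unresolved_includes(DEPARTMENT_XSD, set(library.values()))

print("unresolved before:", before)
print("unresolved after: ", after)
```

In this model the include stays unresolved while definition.xsd is known only by its file name, and resolves once its File Location matches the schemaLocation string exactly.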

End of Hands-On Component


Step 4: Accessing Schema Library Manager from the XML Stage

Drag the XML Stage from the palette onto the canvas and open it. Notice the ‘Edit Assembly’ button on the right side of the screen below:

[Figure: XML stage in a DataStage parallel job]

An assembly contains a series of steps that parse, compose, and transform hierarchical data. By default, an assembly contains an Overview, an Input step, and an Output step. You can add more steps to the assembly, based on the type of transformations that you want to perform.

[Figure: Assembly editor, XML stage in a DataStage parallel job]

Click the Libraries tab and select the required XML schemas by expanding the category name. Multiple .xsd files can be imported into the library at one time by selecting them together in the browse window that opens when you click Import New Resource.

Stay tuned for a future post that will show you how to use the XML stage to read data from XML files, create XML documents, and transform XML data.