Introduction to DataStage Development Best Practices
DataStage standards can vary greatly from one business to another: the standards one DataStage developer adheres to simply might not meet the requirements of another organization. Even so, DataStage developers should at least be aware of some of the industry-standard best practices for developing their parallel jobs.
Theoretically, the ideal DataStage job would process all of your ETL requirements in one go, but realistically this is simply unachievable. Your DataStage jobs are subject to limitations such as the following:
- Hardware resources on the DataStage server – CPU, memory, I/O
- Operating system resources defined in Unix/Linux – nofile, nproc
- External interruptions such as excessive network latency, power loss, other processes accessing your data sources/targets
- Business requirements that force developers to run their ETL processes as a specific series of events
This blog will take a look at some of the best practices that result in well-performing jobs, which all developers should at least consider, if not implement, in their projects.
Overall Job Best Practices
- The same naming convention should be used for the various design components in DataStage such as projects, project folders, jobs, stages, and links across all teams and development projects within the organization. By using the same naming convention across projects, the quality and visibility of the entire application improves. Future maintenance of the developed jobs will also be easier.
- Create and maintain consistent standards for server directories which will host the installed application as well as any additional directories which will contain supported documentation/scripts/executables/etc.
- Documentation should be created for all of your jobs. There are a few ways of implementing documentation in your DataStage jobs:
- Short/Full Job Description
- Description Annotation Stage
- Annotation Stage
- Renaming Stages/Links
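As a sketch of the server directory standards mentioned above, a small setup script can enforce one consistent layout per application. All paths and directory names below are hypothetical examples, not a prescribed standard:

```shell
#!/bin/sh
# Sketch of a setup script enforcing a consistent directory standard.
# The root path and directory names are hypothetical -- adapt them to
# your organization's own conventions.
APP_ROOT="${APP_ROOT:-/tmp/etl_app}"   # installation root (hypothetical)

# One predictable location for each class of supporting artifact
for dir in scripts sql datasets logs archive docs; do
    mkdir -p "$APP_ROOT/$dir"
done

ls "$APP_ROOT"
```

Jobs and scripts can then refer to these locations through environment variables rather than hard-coded paths, which keeps deployments between environments consistent.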
Short and Full Job Descriptions
Annotations and Renamed Stages/Links
Job Reusability Best Practices
- DataStage jobs should be developed as individual “modules” which can then be called separately and run as a sequence of events.
- Job parameters allow a single job design to be reused for similar logic instead of creating a separate job for every individual ETL process. There is a property called Allow Multiple Instance in the job properties which allows multiple instances of a single job to run at the same time.
- A common set of environment variables should be defined and used across DataStage jobs for source and target stage properties:
- Database Connector Stages: DSN, User Name, and Password
- File Stages: File Name and Directory Name
- Parameter sets are containers for parameters which simplify the use of common parameters across jobs by giving developers a centralized location for defining and modifying lists of parameters. Parameter sets also allow easy reuse of common parameter values with a collection of values files.
- Parallel shared containers allow developers to create custom reusable design components – they allow reuse of common stage logic across jobs and projects.
- Runtime column propagation (RCP) enables a single job design to process data sets whose column metadata varies, by propagating columns that are not explicitly defined in the job.
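To illustrate how parameterized, multiple-instance jobs are typically launched, the sketch below assembles a `dsjob` command line. The project, job, and parameter names are hypothetical; the `dsjob` syntax itself (`-run`, `-param`, and the `job.invocationid` form for multiple-instance jobs) is the standard DataStage CLI form:

```shell
#!/bin/sh
# Sketch of a wrapper that assembles a dsjob command for a
# multiple-instance job. PROJECT, JOB, and SourceFile are hypothetical.
PROJECT="DEV_PROJECT"                 # hypothetical project name
JOB="LoadCustomers"                   # hypothetical multi-instance job
INVOCATION_ID="${1:-US}"              # distinguishes concurrent instances
SRC_FILE="${2:-customers_us.csv}"     # hypothetical job parameter value

CMD="dsjob -run -param SourceFile=$SRC_FILE $PROJECT $JOB.$INVOCATION_ID"

# Print rather than execute, since dsjob only exists on a DataStage server.
echo "$CMD"
```

The same job design can then be launched once per region or file, with each instance running concurrently because Allow Multiple Instance is enabled.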
Using a Combination of Environment Variables and Parameters
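As a sketch of how parameter sets store their values, a values file is a plain text file of name=value pairs kept in the project directory (under the parameter set's folder). The set name, values-file name, and all values below are hypothetical:

```
DB_DSN=DEV_DSN
DB_USER=etl_dev
DB_PASSWORD=<encrypted placeholder>
```

Promoting a job from development to production then means selecting a different values file at run time rather than editing every job that uses the connection.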
Parallel Job Stage Best Practices
Sequential File Stage
- Using reject links on the Sequential File stage allows you to capture records which fail to either get read into the DataStage environment or to get written out to the target file.
- If the size of the source file is very large, increase the performance of the read process by running the stage in parallel by adding the Number of Readers per Node option.
- Do not link the Sequential File stage with another stage using Same partitioning! The downstream stage will also execute in Sequential mode or with the wrong degree of parallelism.
- When the Sequential File stage is used as a target, it writes data to memory buffers by default, and the buffers are only flushed to disk once the job completes.
- The environment variable $APT_EXPORT_FLUSH_COUNT allows you to modify this behavior.
Lookup Stage
- The reference input data on a Lookup stage has to be small enough to fit into available memory – all of the reference data will be loaded into memory prior to processing.
- If the reference data is very large, use either the Join or Merge stages.
- On SMP systems, use Entire partitioning on the reference input links.
- On MPP/Clustered/GRID systems, it may be more appropriate to use a different partitioning strategy such as Hash or Modulus.
- If your reference data does not change often or if you use it in multiple downstream jobs, consider using the Lookup File Set stage to create fast-access, reusable reference data.
Aggregator Stage
- The partitioning strategy used on the input link should match the defined grouping keys.
- Use the Hash grouping method when processing a limited number of groups.
- Use the Sort grouping method when processing a large number of groups.
- You can use the Aggregator stage to count the number of records that are being processed by your job.
Sort Stage
- Only sort on the required key columns; choosing unnecessary additional sort key columns degrades performance with no additional benefit.
- You can specify the amount of memory allocated for sorting your data in the Sort stage with the Restrict Memory Usage property. By default, you have 20 MB per partition for sorting your data. By increasing this value you can potentially sort all of your data in-memory instead of having records temporarily spill over to the disk.
- A Stable Sort maintains the input order of the non-sort key columns. Only specify a stable sort if it is required by the business.
Transformer Stage
- The Transformer stage offers multiple functions for handling NULL values. What happens when an unhandled NULL is used in processing in the Transformer stage depends on the Legacy Null Handling property, which is enabled/disabled in the Stage Properties.
- Do not use an unnecessary number of Transformers in your job – try to consolidate the logic spread across multiple Transformers into a single stage.
- Do not use the Basic Transformer in parallel jobs. The Basic Transformer executes in sequential mode.
- Minimize Transformer function calls. If possible, evaluate required functions once in stage variables and then reference the stage variable values in your output columns.
Database Connector Stages
- Use the native Database Connector stages for maximum functionality and performance. The ODBC Connector stage should be used only if a native parallel stage is not available for the given source or target database – for example, when accessing SQL Server from a Unix/Linux DataStage server.
- SQL used in the Connector stages should be finely tuned for maximum performance.
- Use a WHERE clause to limit the number of records sent across the network to a DataStage job.
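As a sketch of the last two points, the user-defined SELECT in a Connector stage can push filtering and projection down to the database, so only the needed rows and columns cross the network. Table and column names below are hypothetical, and date arithmetic syntax varies by database:

```sql
-- Filter and project in the database, not in the DataStage job:
-- only yesterday's rows and only the needed columns are sent over.
SELECT customer_id,
       order_date,
       order_total
FROM   orders
WHERE  order_date >= CURRENT_DATE - 1
```

Selecting every column with `SELECT *` and filtering later in the job moves the same work onto the network and the DataStage server, where it is far more expensive.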
Other Parallel Job Development Best Practices
- Use the Data Set stage for intermediate storage in between your DataStage parallel jobs. Don’t use the Sequential File stage (flat files) for storing data on the disk for downstream consumption by other DataStage processes. Only land data to the disk with the Sequential File stage if the data has to be shared with other teams/applications/auditing/etc.
- The Copy stage is efficient for the following tasks:
- Renaming columns
- Dropping columns
- Default type conversions
- Splitting data for different types of downstream processing