Performance Tuning in IBM InfoSphere DataStage

Performance is a key factor in the success of any data warehousing project. Care for optimization and performance should be taken into account from the inception of the design and development process. Ideally, a DataStage job should process large volumes of data within a short period of time. For maximum throughput and performance, a well performing infrastructure is required, or else the tuning of DataStage jobs will not make much of a difference.

Determining The Bottleneck

One of the primary steps of performance tuning is to examine the end-to-end process flow within a DataStage job and understand which steps in the job are consuming the most time and resources. This can be done in the several ways:

1. The job score shows the generated processes, operator combinations, data sets, frame-work inserted sorts, buffers and partitions in the job. Score can be generated by setting the APT_DUMP_SCORE environment variable to TRUE before running the job. It also provides information about the node-operator combination. A score dump can help detect redundant operators, which can be used in modification of the job design to remove them.

2. The job monitor can be accessed through IBM Infosphere DataStage Director. It provides a snapshot of job’s performance (data distribution/skew across partitions, CPU utilization) at runtime. APT_MONITOR_TIME and APT_MONITOR_SIZE are the two environment variables that control the operation of the job monitor, which takes a snapshot every five seconds by default. This can be changed by changing the value of APT_MONITOR_TIME.

3. Performance Analysis, a new capability beginning in DataStage 8.x, can be used to collect information, generate reports and view detailed charts about job timeline, record throughput, CPU utilization, job memory utilization and physical machine utilization (shows processes other than the DataStage activity running on the machine). This is very useful in identifying the bottlenecks during a job’s execution. Performance Analysis can be enabled through a job property on the execution tab, which collects data at runtime. (Note: By default, this option is disabled)

4. Resource Estimation, a toolbar option available in DataStage 8.x, can be used to determine the system requirements needed to execute a particular job based on varying source data volumes and/or to analyze whether the current infrastructure can support the jobs that have been created.

Key Consideration Points During Job Development

There are several key aspects that could affect the job performance and these should be taken into consideration during the job design:
• Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set dynamically at runtime. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources.
• A DataStage job should not be overloaded with stages. Each additional stage in a job reduces the resources available for the other stages in that job, which affects the job performance.
• Columns that are not needed should not be propagated through the stages and jobs. Unused columns make each row transfer from one stage to the next more expensive. Removing these columns minimizes memory usage and optimizes buffering.
• Runtime column propagation (RCP) should be disabled in jobs to avoid unnecessary column propagation.
• By setting the $OSH_PRINT_SCHEMAS environment variable, we can verify that runtime schemas match the job column definitions. Avoid unnecessary data type conversions.
• Proper partitioning significantly improves overall job performance. Record counts per partition can be displayed by setting the environment variable, $APT_RECORD_COUNTS. Ideally, these counts should be approximately equal. Partitioning should be set in such a way so as to ensure an even data flow across all partitions, and data skew should be minimized. If business rules dictate otherwise, then repartitioning should be done as early as possible to have a more balanced distribution which will lead to improved performance of downstream stages.
• DataStage attempts to combine stages (operators) into a single process, and operator combination is intended to improve overall performance and reduce resource usage. Avoid repartitioning and use ‘Same’ partitioning for operator combination to occur. However, in some circumstances, operator combination may negatively impact performance and in such cases, all the combinations should be disabled by setting $APT_DISABLE_COMBINATION=TRUE.
• Do not sort the data unless necessary. Sorts done on a database (using ORDER BY clause) are usually much faster than those done in DataStage. Hence, sort the data when reading from the database if possible instead of using the Sort Stage or sorting on the input link.
• Sort order and partitioning are preserved in parallel datasets. If data has already been partitioned and sorted on a set of key columns, check the ″Don’t sort, previously sorted″ option for the key columns in the Sort Stage. When reading from these data sets, partitioning and sorting can be maintained by using the ‘Same’ partitioning method.
• Datasets store data in native internal format (no conversion overhead) and preserve partitioning and sort order. They are parallelized unlike sequential files and hence, are much faster. Datasets must therefore be used to land intermediate results in a set of linked jobs.
• Use Join Stage as opposed to Lookup for handling huge volumes of data. Lookup is most appropriate when the reference data is small enough to fit into available physical memory. Sparse lookup is appropriate if the driver to the Lookup is significantly smaller than the reference input (1:100).
• Avoid using multiple Transformer Stages when the functionality could be incorporated into a single stage. Use Copy, Filter, or Modify stages instead of Transformer for simple transformation functions like renaming or dropping columns, type conversions, filtering out based on certain constraints, mapping a single input link to multiple output links, etc.
• As much as possible, minimize the number of stage variables in a Transformer Stage as that affects performance, and also avoid unnecessary function calls.
• If existing Transformer-based jobs do not meet performance requirements and a complex reusable logic needs to be incorporated in the job, consider building your own custom stage.
• Data should not be read from Sequential files using ‘Same’ partitioning.
• Sequential files can be read in parallel by using the ‘Multiple readers per node’ option. Setting $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE environment variables may also improve performance of Sequential files on heavily loaded file servers.
• Use Hash method in Aggregators only when the number of distinct key column
values is small. A Sort method Aggregator should be used when the number
of distinct key values is large or unknown.
• SQL statements in Database stages can be tuned for performance. Appropriate indexes on tables guarantee a better performance of DataStage queries.
• ‘Array Size’ and ‘Record Count’ numerical values in Database stages can be tuned for faster inserts and updates. Default values are usually very low and may not be optimal.
• Best choice of database stages is to use Connector stages for maximum parallel performance and functionality.

Concluding Thoughts

Performance issues can be avoided by following the above best practices and performance guidelines. ‘Performance Analysis’ and ‘Resource Estimation’ functionalities can be used to gather detailed performance related data, and assist with more complicated scenarios.

References Used

Link 1 Link 2