ETL Fundamentals

As a company, chances are you have valuable data scattered throughout your system that needs to be gathered into a central location and accessed for business analysis. The problem is that currently the data exists in different systems, and in different formats.

This is where ETL comes into play. ETL is an acronym for Extract, Transform and Load, and is used to move data from one database to another, to form data marts and data warehouses, and also to convert databases from one format to another. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target data warehouse.

Companies both large and small can use data warehouses to understand their data in a more analytical way and to extract the information contained within to increase their sales and revenues. While most companies already have databases to hold their transactional data (individual sales, for example) this data cannot easily be used for analysis. Whereas a database contains raw data, a data warehouse is a common repository which holds information which can be manipulated in such a way as to be used to answer complex business questions. “Which brand of hand soap did we sell the most of last month?” or “Which branch has the most customer traffic?” are the types of questions that the data warehouse can answer.

ETL is the process by which the data from everyday transactional databases gets moved or copied to the data warehouse. For example, a medical institution might have information on a patient in several departments and each department might list that customer’s information differently. The admissions department might list the patient by name, whereas the billing department might list the customer by account number or other ID. ETL can bundle all this data and consolidate it into a uniform presentation, enabling it to be stored in a database or data warehouse.

As we said, ETL stands for Extract, Transform, and Load. These three functions are combined into one tool to pull data out of one database and place it into another. On a high level, Extract is the process of reading data from a database, Transform is the process of converting data from one form into another, and Load is the process of writing data into the target database. ETL is software that enables businesses to combine and move their data. Again, the data can come from any source and can be in different forms or formats. Thus a company may use ETL to move data from one application to another, or to backup information, especially if transitioning to a new software application.

The first step in the ETL process is to map the data between source systems and target databases (data warehouses or data marts). Raw data can be written directly to disk, usually with only minimal restructuring. Structured source system data can be written to a relational table or flat file in this first step as well. This enables the extract to be quick and simple, and also allows the extract to be restarted in case of an interruption. After extraction, the data is transformed (modified) depending on specific business logic, and sent to the target repository. The data can then also be read multiple times if needed to support subsequent steps.

The second step is the cleansing of the source data in the staging area. Cleansing is an important function of ETL, as it eliminates duplicate or fragmented data, and data which is not required in the final target. ETL can be customized to fit your company’s particular needs.

The third step is transforming the cleansed source data and loading it into the target system. Transformation occurs via lookup tables, rules, and combining data. It is the process of converting the existing data into the format consistent with the data warehouse. The ETL software examines the data and, based on the rules it’s been given, updates it to the format required by the target repository. For example, a patient’s gender may be represented by “M/F” in one system, “0/1” in a second system, and “male/female” in yet another system. ETL is powerful enough to handle such dissimilarities by recognizing these different representations as the same information and convert them to the chosen format. Additionally, ETL can perform such functions as verifying phone numbers, standardizing fields, or expanding records with additional fields.

Transformation is perhaps the most powerful of the ETL steps. It can not only transform data from different departments but also data from different sources altogether. For example, data in an email program could be transformed right along with data from a manufacturing application, with the ultimate result being data of a common thread.

Loading is the process of storing the newly transformed data. The data is transported and loaded into the data warehouse via a variety of methods. The data can be normalized (e.g. by Snowflake Schema) or denormalized (e.g. by Star Schema). ETL allows you the flexibility to determine the method and outcomes ideally suited to your business needs.

Initially, the ETL process was performed by programmers using SQL code, which had its share of negatives. It could take long hours, utilize many resources, require complex coding, not to mention the challenge of maintaining the code. It was an unwieldy and tiresome task. Today, thankfully, there are more than a few ETL tools available on the market, which have eliminated these difficulties. The tools are extremely powerful and offer countless advantages in all stages of the ETL process (extraction, data cleansing, data profiling, transformation, debuggging and loading) when compared to the old method. They reduce costs, along with reducing coding efforts. The tools do the job quite well, and since they provide a graphical interface, require less expertise in database programming.

ETL tools process the data specifically to your business needs. They range from open source free tools to high price commercial tools. The amount of your data, what answers you will request from your data warehouse, and how often you require those answers, all need to be taken into account when choosing the right ETL tool for your business.

At PR3 Systems, we specialize in IBM’s DataStage ETL tool. As an IBM business partner, we provide the highest quality consulting and training services. Whether you are just beginning with DataStage, or upgrading to a later version, our IBM certified consultants will work with you to analyze your business needs and provide the development and/or training necessary to ensure you are getting the most out of your software. After all, especially in today’s economy, the bottom line is what counts. You’ve spent the time and resources to understand the positive impact DataStage can have on your business, so don’t stop there. Maximize your ROI by realizing the full potential DataStage offers. Contact us at 630-839-9258 or 630-364-1469 for more information.

PR3 Systems ~ Empowering you to make the right decisions at the right time.