Introduction To Hadoop
As we enter the era of "Big Data," more and more organizations are exploring Hadoop and the value it can potentially provide. The data that organizations have to manage is growing in complexity and volume at a rate that existing systems cannot handle. A new technology is required to manage the volume, velocity, and variety of big data. This is where Hadoop enters the picture.
At its core, Hadoop provides the Hadoop Distributed File System (HDFS), a file system in which data files are distributed across multiple computer systems (nodes). A Hadoop cluster is a set of computer systems which together function as the file system. A single file in HDFS can be spread over any number of nodes in the cluster. In theory, there is no limit to the amount of data the file system can store, since it is always possible to add more nodes.
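Although DataStage will handle HDFS access for us later, it helps to see what working with files in HDFS looks like from the command line. A minimal sketch using the standard `hadoop fs` client (the directory and file names are only examples):

```shell
# Copy a local file into HDFS; HDFS splits it into blocks and
# replicates those blocks across the cluster's data nodes.
hadoop fs -mkdir /user/dsadm/demo
hadoop fs -put customers.txt /user/dsadm/demo/

# List the directory and print the file's contents back out.
hadoop fs -ls /user/dsadm/demo
hadoop fs -cat /user/dsadm/demo/customers.txt
```

From the user's point of view the file looks like a single file at one path, even though its blocks may live on many different nodes.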
Although Hadoop is open source, IBM offers a distribution of Hadoop called InfoSphere BigInsights which gives organizations the ability to set up a Hadoop cluster quickly and easily. In addition to streamlining the cluster setup process, BigInsights offers a variety of options for running advanced data analytics and application development on the data stored in HDFS.
Hadoop won't be replacing current data integration and business intelligence processes. Rather, it should be looked at as a complement to the systems many organizations use today. Let's see how to integrate this new technology (BigInsights) with another piece of software, IBM's ETL solution DataStage.
Version 9.1 of DataStage offers a new stage called the Big Data File stage, which allows DataStage to read from and write to Hadoop. Before we can use this stage in a DataStage job, we have to configure the environment correctly. The following prerequisites have to be met:
- Verify that the Hadoop (BigInsights) cluster is up and running correctly. The status of BigInsights can be checked either from the BigInsights console or from the command line.
- Add the BigInsights library path to the dsenv file.
- Find out the required connection details to the BigInsights cluster.
- BDFS Cluster Host
- BDFS Cluster Port Number
- BDFS User: User name to access files
- BDFS Group: Group name for permissions – Multiple groups can be listed.
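As a sketch, the first two prerequisites might look like the following when run on the DataStage engine host. The script name, install paths, and library directory are assumptions that vary by BigInsights version, so check your own installation before copying them:

```shell
# 1. Check that the BigInsights services are running (the status
#    script's name and location depend on the BigInsights release).
$BIGINSIGHTS_HOME/bin/status.sh

# 2. Append the BigInsights client library directory to the engine's
#    dsenv file. BIGINSIGHTS_LIB below is a placeholder for the actual
#    library path on your system; $DSHOME is typically
#    /opt/IBM/InformationServer/Server/DSEngine.
BIGINSIGHTS_LIB=/opt/ibm/biginsights/IHC/lib
echo "LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:$BIGINSIGHTS_LIB; export LD_LIBRARY_PATH" >> $DSHOME/dsenv

# 3. Restart the DataStage engine so the updated dsenv takes effect.
cd $DSHOME && . ./dsenv
bin/uv -admin -stop
bin/uv -admin -start
```

The restart matters: dsenv is only sourced when the engine starts, so a library path added while jobs are running will not be picked up until then.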
The Big Data File stage functions similarly to the Sequential File stage. It can be used as either a source or a target in a job. Other than the required connection properties for the HDFS, the stage has exactly the same properties as the Sequential File stage (e.g., First line is column names, Reject mode, Write mode).
Writing To Hadoop With The Big Data File Stage
Let’s create a simple parallel job which reads data with a Sequential File stage and then sends the data to the Big Data File stage.
The source Sequential File stage is set up just as it would be in any other scenario where data is read from a flat file.
Let's take a look at the data that we'll be writing into Hadoop.
Now let's see how to set up the Big Data File stage. We can see below that the Big Data File stage has many of the same properties as the Sequential File stage. In addition, however, it has required properties that are used to connect to the BigInsights cluster: we have to specify the BDFS Cluster Host, BDFS Cluster Port Number, BDFS User, and BDFS Group.
Once we have configured the Big Data File stage, we can compile and run the job. The data is read from a flat file residing on the DataStage server and written to a file in HDFS, where we can view it either by clicking the View data… button in the target Big Data File stage or through the BigInsights console.
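Besides the View data… button and the BigInsights console, the newly written file can also be inspected from any machine with the Hadoop client installed. The target path below is only an example; use whatever path you configured in the stage:

```shell
# Print the first few records of the file the job wrote to HDFS.
hadoop fs -cat /user/dsadm/output/customers.txt | head -10

# A long listing shows the file's size and replication factor.
hadoop fs -ls /user/dsadm/output
```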
Reading From Hadoop With The Big Data File Stage
Now let's take a look at an example in which we read data from a file residing in HDFS. Let's create a simple parallel job which connects the Big Data File stage to a Sequential File stage.
In the Big Data File stage we can specify the properties exactly as we would in the Sequential File stage. Just as we had to specify the connection properties when the stage was used as a target, the same has to be done when the stage is the source.
Now we can compile and run the job. Once the job completes, we can look at the data in the Sequential File stage and verify that it matches the source file in HDFS.
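One quick way to verify the round trip outside of DataStage is to compare the HDFS source file with the flat file the job produced. The paths are illustrative:

```shell
# Stream the HDFS source file and compare it to the Sequential File
# stage's output; diff printing nothing means the contents match.
hadoop fs -cat /user/dsadm/demo/customers.txt | diff - /data/output/customers.txt
```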
Rejecting Data While Reading With The Big Data File Stage
The Sequential File stage supports an optional reject link which allows us to capture records that fail to be read into the DataStage environment. The same can be done with the Big Data File stage. A second link coming out of the Big Data File stage is automatically treated as a reject link, and it is drawn with dashed lines to indicate this.
The Reject mode property in the Big Data File stage must be set to Reject for the job to compile successfully. In the stage's Output –> Columns tab we can see that the reject column is of data type VarBinary, which allows it to capture records with any type of metadata.