Change Data Capture (also known as Data Replication or Mirroring) is a data transfer technology that uses an incremental data loading technique to transfer data from source to target databases. While several data integration technologies and tools such as ETL, ELT, EAI and MQ are already available in the market, Change Data Capture (CDC) is different in that it minimizes the latency between the entry of a record in the primary database (e.g. OLTP) and its transfer to the secondary database (e.g. OLAP or a backup database). Unlike traditional data loading techniques, which are typically bulk and batch loads, CDC delivers near real-time data to the data warehouse through incremental data loads. IBM InfoSphere Change Data Delivery (ICDD) is an industry-leading CDC tool built for handling complex change data capture requirements by augmenting CDC with ETL connectivity through either file-based integration or InfoSphere DataStage Direct Connect. The InfoSphere Change Data Delivery suite contains various change data capture programs, each optimized for specific databases, operating systems and platforms.
Note: The terms ‘Change Data Capture’ and ‘Change Data Delivery’ refer to the same technology and are used interchangeably in this blog post. The CDC technology originates from IBM’s purchase of the company DataMirror in 2007. In 2012, IBM made improvements to CDC, enabling integration with InfoSphere DataStage for additional processing. The new product was called ‘InfoSphere Change Data Delivery’ while the original product was renamed ‘InfoSphere Data Replication’.
In this blog post, I will discuss some key features of IBM Change Data Delivery and list its common use cases.
How Change Data Delivery works
Typically, in data integration processes, the incremental extraction logic for source data is coded inside the ETL. Change Data Delivery (CDD) implements incremental extraction differently: it monitors the transaction log files of the source databases and detects changes recorded in them. Any change in the database tables produces entries in the database log files. The CDD engine scrapes the log files (at specified intervals) looking for these changes. Once a change is detected in the logs, the changed records are identified, pushed from the source CDD engine to the target CDD engine (database, ETL, MQ etc.) over TCP/IP, and subsequently applied to the target database through SQL. This helps achieve near real-time data warehousing.
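To make the capture-and-apply flow above more concrete, here is a minimal Python sketch. It is not how ICDD is implemented: the in-memory list `SOURCE_LOG`, the `customers` table, and the helper names `scrape_log`, `apply_changes` and `replicate` are all hypothetical, and SQLite simply stands in for the target database. The sketch only illustrates the general idea of reading log entries past a checkpoint and applying them to the target through SQL.

```python
import sqlite3

# Hypothetical stand-in for a source transaction log. Each entry carries a
# monotonically increasing sequence number, an operation and the row data.
SOURCE_LOG = [
    {"seq": 1, "op": "INSERT", "id": 1, "name": "Alice"},
    {"seq": 2, "op": "INSERT", "id": 2, "name": "Bob"},
    {"seq": 3, "op": "UPDATE", "id": 1, "name": "Alicia"},
    {"seq": 4, "op": "DELETE", "id": 2, "name": None},
]


def scrape_log(log, last_seq):
    """Return the log entries recorded after the last processed sequence number."""
    return [entry for entry in log if entry["seq"] > last_seq]


def apply_changes(conn, changes):
    """Apply captured changes to the target through SQL and return the new checkpoint."""
    last_seq = None
    for change in changes:
        if change["op"] == "INSERT":
            conn.execute(
                "INSERT INTO customers (id, name) VALUES (?, ?)",
                (change["id"], change["name"]),
            )
        elif change["op"] == "UPDATE":
            conn.execute(
                "UPDATE customers SET name = ? WHERE id = ?",
                (change["name"], change["id"]),
            )
        elif change["op"] == "DELETE":
            conn.execute("DELETE FROM customers WHERE id = ?", (change["id"],))
        last_seq = change["seq"]
    conn.commit()
    return last_seq


def replicate(conn, log, checkpoint):
    """Run one scrape-and-apply pass; the checkpoint is unchanged if nothing is new."""
    changes = scrape_log(log, checkpoint)
    if not changes:
        return checkpoint
    return apply_changes(conn, changes)


# Usage: two passes over a growing log, each picking up only the new entries.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
checkpoint = replicate(target, SOURCE_LOG[:2], 0)       # first two inserts
checkpoint = replicate(target, SOURCE_LOG, checkpoint)  # update and delete only
rows = target.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

The checkpoint is what makes the load incremental: each pass touches only the entries written since the previous pass, rather than re-reading the source tables.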
Replication and mirroring are alternate terms for Change Data Capture. The term ‘replication’ is used when both the primary and secondary databases are available for active querying, balancing the load between the two databases and thereby improving performance of the primary database. The term ‘mirroring’ is used when the secondary database is used strictly as a backup database and is not available for other activities. While replication can transfer only data changes, mirroring can replicate both data and schema changes from the primary to the secondary database.
There are three types of replication. The type selected for a use case depends on several factors such as latency requirements, data volumes, load patterns, structure of the data warehouse, etc.
1) Transactional – Near real-time transfer of records from primary to secondary databases.
2) Snapshot – Scraping of the primary database log files takes place at scheduled times and the changes recorded in the logs since the last push are transferred to the secondary databases.
3) Merge – This is a bi-directional replication technique used to keep two or more systems in sync and up to date.
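The three replication types above can be contrasted with a toy Python sketch. Everything here is an illustrative assumption rather than ICDD behavior: stores are plain dictionaries, a change is a hypothetical `(key, value, timestamp)` tuple, and conflicts in the merge case are resolved by keeping the newer timestamp, which is just one possible policy.

```python
def apply_change(store, change):
    """Apply a (key, value, timestamp) change, keeping the newer version on conflict."""
    key, value, ts = change
    current = store.get(key)
    if current is None or ts > current[1]:
        store[key] = (value, ts)


def transactional(changes, secondary):
    """Push every change to the secondary as soon as it is captured."""
    for change in changes:
        apply_change(secondary, change)


def snapshot(pending, secondary):
    """Push all changes accumulated since the last scheduled run in one batch."""
    batch = list(pending)
    pending.clear()
    for change in batch:
        apply_change(secondary, change)
    return len(batch)


def merge(changes_a, changes_b, store_a, store_b):
    """Exchange changes in both directions so the two stores converge."""
    for change in changes_a:
        apply_change(store_b, change)
    for change in changes_b:
        apply_change(store_a, change)


# Usage: two systems that each accepted local writes are brought back in sync.
changes_a = [("p1", "red", 1)]
changes_b = [("p1", "blue", 2), ("p2", "green", 1)]
store_a = {"p1": ("red", 1)}
store_b = {"p1": ("blue", 2), "p2": ("green", 1)}
merge(changes_a, changes_b, store_a, store_b)
```

In this sketch the transactional and snapshot functions differ only in when they run: the first is called as each change arrives, the second on a schedule with whatever has accumulated.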
Listed below are several key benefits that a business can realize by enabling real-time data integration using ICDD.
1) High availability: Change Data Capture enables continued operations in the face of system failures, data corruption, outages, and other unanticipated disruptions to the primary database. By transferring up-to-the-second data to an active secondary database, CDD helps maintain a substitutable secondary database and thus ensures high availability.
2) Disaster recovery: Many organizations build a dedicated secondary system, geographically separated from the primary system, to be used in case of a disaster at the primary location. Data Replication allows data updates to flow from the primary to the geographically separate secondary system in near real-time.
3) Real-time operational reports: Traditionally, data warehouses are batch loaded and do not contain the latest data. However, in today’s business environment, it is critical for managers to know what is happening now to determine what should happen next. CDD provides the data warehouse with a continuous, up-to-the-second stream of data from transactional systems. This data is consumed by the BI systems and gives managers reports on the latest operational information.
4) Cross-selling: CDD-enabled active data warehouses help sales personnel receive real-time insights about customers and their purchases. This helps them engage customers with personalized real-time offers and recommendations, which in turn increases cross-selling.
5) Integrated view of data: Replication helps synchronize data about various business entities (customers, products, locations etc.) across heterogeneous systems. This helps the business get a complete view of each entity from disparate systems and thereby derive deeper insights.
6) Fraud detection: Banks and other financial institutions analyze credit card usage patterns to flag suspicious activity. CDD enables banks to access transaction information from their analytical databases in real time, so that they can detect and block potentially harmful activity.
7) Time-sensitive analytics: Time-sensitive tasks such as sales forecasting, price optimization, and risk calculations drive the need to keep information as close to real time as possible. CDD helps maintain a real-time data warehouse.
8) Operating cost reduction: CDD helps improve the efficiency of the data integration process by reducing the batch window and latency, thereby reducing the associated processing and people costs.
9) Migration downtime: CDD prevents downtime by providing the ability to synchronize data between systems with potentially different technologies during system migrations. This is especially common during mergers and acquisitions when different production systems need to be maintained and transactions captured by each system need to be updated in the other system on a regular basis.
10) Distributed data centers: Many organizations need to maintain multiple copies of their data at different data centers for load balancing and improved response time. These data centers need to be refreshed on a regular basis to keep them in sync with the master database. Data replication addresses the need to keep these data marts and systems in sync with each other and with the primary database.
11) Backups and archives: Traditionally, historic data is archived on tape. However, restoring data from tape is a tedious process, and many organizations are now using secondary databases as an archival system for historic data. CDD helps transfer the data from primary to secondary databases, reducing the cost of archiving and retrieval since the secondary database can be easily queried.
Change Data Delivery can be used for many types of complex use cases and enables the functioning of a real-time data warehouse. Real-time data warehouses provide enormous benefits to businesses and, in today’s competitive business environment, companies armed with CDD will have a big edge in the market.