Change Data Capture Using Databricks

Introduction to Change Data Capture (CDC)

In the intricate world of database management, Change Data Capture (CDC) shines as a reliable tool for tracking and logging changes in datasets. Whether changes are captured in real time or near real time, CDC monitors data modifications so applications can react promptly.

Essentially, CDC ensures that data is always up-to-date, which is key for making smart decisions. But it's not just about watching changes; CDC also ensures that data remains consistent across different systems. This means that no matter where your data travels, it stays reliable and accurate.

For tasks like combining data from various sources or analyzing it efficiently, CDC is a lifesaver. It simplifies the process, making it easier for organizations to make the most of their data. In a nutshell, CDC is like a trusty sidekick in the world of data management, always there to ensure smooth operations and informed decision-making.

Explanation of How CDC Works

Change Data Capture (CDC) is a fundamental process in database management, facilitating the transmission of data alterations from an Online Transaction Processing (OLTP) database to a multitude of destination systems such as cache indexes, data lakes, warehouses, or other relational databases. These alterations encompass insertions, updates, or deletions within the source database, and CDC ensures the seamless replication of these changes across the destination systems in real-time.

While the primary source systems for CDC are typically relational databases, it's important to note that this principle is adaptable to any source system, including NoSQL databases and flat file storage. This adaptability underscores the versatility and applicability of CDC in diverse data environments.

A typical CDC workflow comprises a source database, a CDC framework, and one or more destination systems. This structured workflow ensures the efficient and accurate transmission of data changes across the data ecosystem.

The architecture of CDC relies on leveraging the inherent properties of the source database or employing an active data extraction method to facilitate uninterrupted data flow between the source and destination systems. This architecture ensures the reliability and integrity of data replication processes.

Implementing CDC often necessitates decoupling the source and destination systems to accommodate the varying rates of change in the source database. This decoupling mechanism enables flexible and scalable data replication, ensuring that the destination systems can effectively keep pace with the dynamic nature of the source data.

In essence, CDC serves as a linchpin in modern data management strategies, enabling organizations to efficiently synchronize data across disparate systems and leverage real-time insights for informed decision-making.

Importance of CDC

1. Real-time Data Tracking

CDC allows businesses to monitor all database changes in real time, which is essential for those handling large volumes of data. This instant access to data makes quick decision-making easier.

2. Compliance and Audit Trails

By keeping thorough records of all data modifications, CDC helps companies comply with regulatory standards. This ensures accountability and transparency in data management procedures.

3. Data Integration and Warehousing

CDC is essential for data integration, analytics, and warehousing workloads. Without it, businesses would struggle to handle and analyze the steady stream of data changes efficiently.

4. Data Quality Improvement

By recording all database modifications, CDC helps teams identify and fix data quality problems quickly. This proactive strategy improves the overall reliability and accuracy of the data.

5. Improved Decision-Making

By lowering data latency, CDC gives decision-makers access to the most recent information. This agility lets businesses react swiftly to market changes and make well-informed decisions.

6. Enhanced Productivity

By streamlining data management procedures, CDC implementation reduces manual effort and boosts productivity. It ensures that companies can concentrate on work that provides value instead of on collecting and managing data.

Different Methods for Implementing CDC in a Data Pipeline

The following are the most common methods for implementing Change Data Capture (CDC) in a data pipeline.

1. Log-based CDC

  • How it Works: Log-based CDC reads the source database's transaction logs (also known as redo logs or change logs), which record all data changes, such as inserts, updates, and deletes (see the sketch after this list).
  • Advantages: Offers a thorough, low-impact method for capturing changes at a granular level with minimal effect on the source database's workload.
  • Considerations: Requires access to transaction logs, whose format and accessibility vary by database system. May also necessitate careful oversight of log retention and storage.
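To make this concrete, here is a minimal sketch of tailing a MySQL binary log with the open-source python-mysql-replication package. It assumes a MySQL source with row-based binary logging enabled; the host, credentials, and the print calls standing in for a downstream sink are placeholders.

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
)

# Placeholder connection settings for the MySQL source.
MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,  # must be unique among replicas connecting to this source
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,       # keep tailing the log as new events arrive
    resume_stream=True,  # continue from the current log position
)

# Each event carries the affected rows; forward them to your CDC sink here.
for event in stream:
    for row in event.rows:
        if isinstance(event, WriteRowsEvent):
            print("INSERT", event.table, row["values"])
        elif isinstance(event, UpdateRowsEvent):
            print("UPDATE", event.table, row["before_values"], "->", row["after_values"])
        elif isinstance(event, DeleteRowsEvent):
            print("DELETE", event.table, row["values"])

stream.close()
```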

2. Trigger-based CDC

  • How it Works: Trigger-based CDC entails creating triggers on source database tables to track changes. When an insert, update, or delete occurs on a monitored table, the trigger fires and records the change for the CDC system, typically in an audit or changelog table (see the sketch after this list).
  • Advantages: Provides a simple technique for recording changes without requiring direct access to transaction logs. Works across a broad variety of database systems.
  • Considerations: Triggers add overhead to database operations and can degrade performance, particularly in high-volume transactional systems. Trigger design and management must be considered carefully to minimize the performance impact.
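As an illustration, the following sketch installs an audit-style trigger on a hypothetical PostgreSQL orders table using psycopg2. The connection string, table, and function names are placeholders; other databases use different trigger syntax.

```python
import psycopg2

# Placeholder connection details for the source database.
conn = psycopg2.connect("dbname=shop user=etl password=secret host=db.example.com")
cur = conn.cursor()

cur.execute("""
-- Changelog table that accumulates one row per captured change.
CREATE TABLE IF NOT EXISTS orders_changelog (
    change_id  BIGSERIAL PRIMARY KEY,
    operation  TEXT        NOT NULL,               -- 'INSERT' | 'UPDATE' | 'DELETE'
    changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    row_data   JSONB       NOT NULL
);

-- Trigger function: serialize the affected row and record the operation.
CREATE OR REPLACE FUNCTION capture_order_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO orders_changelog (operation, row_data)
        VALUES (TG_OP, to_jsonb(OLD));
    ELSE
        INSERT INTO orders_changelog (operation, row_data)
        VALUES (TG_OP, to_jsonb(NEW));
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS orders_cdc ON orders;
CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_order_change();
""")
conn.commit()
cur.close()
conn.close()
```

A downstream CDC job can then poll or stream orders_changelog, which is exactly the overhead trade-off noted above: every write to orders now performs a second insert.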

3. Database replication

  • How it Works: Database replication duplicates data from a source database to a target database in near real time. Changes made to the source are captured and propagated to the target by the replication process (a sketch using PostgreSQL's built-in logical replication follows this list).
  • Advantages: Offers a reliable and scalable means of capturing changes, particularly for large-scale data pipelines. Can be implemented using a database system's built-in replication features or specialized replication tools.
  • Considerations: Replication settings must be configured and managed, including replication lag and conflict resolution. May add complexity and overhead, particularly for cross-database or cross-platform replication.
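Here is a minimal sketch of wiring up PostgreSQL's built-in logical replication from Python with psycopg2. It assumes two PostgreSQL 10+ instances with wal_level = logical on the source and matching table definitions already present on the target; hosts, credentials, and table names are placeholders.

```python
import psycopg2

# On the source: publish changes for the tables of interest.
src = psycopg2.connect("dbname=shop user=repl password=secret host=src.example.com")
src.autocommit = True
src.cursor().execute("CREATE PUBLICATION shop_pub FOR TABLE orders, customers;")

# On the target: subscribe to the publication; PostgreSQL then streams
# changes in near real time. CREATE SUBSCRIPTION cannot run inside a
# transaction block, hence autocommit mode.
tgt = psycopg2.connect("dbname=shop user=repl password=secret host=tgt.example.com")
tgt.autocommit = True
tgt.cursor().execute("""
    CREATE SUBSCRIPTION shop_sub
    CONNECTION 'host=src.example.com dbname=shop user=repl password=secret'
    PUBLICATION shop_pub;
""")
```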

Simplifying Change Data Capture with Databricks Delta

Databricks commonly encounters customers seeking to perform Change Data Capture (CDC) from various sources into Delta tables. These sources range from on-premises to cloud-based, including transactional stores and data warehouses.

The unifying factor is the generation of change sets, facilitated through ETL tools like Oracle GoldenGate or Informatica PowerExchange, vendor-supplied change tables (e.g., Oracle Change Data Capture), or user-maintained tables with triggers. To address this, Databricks offers a reference architecture leveraging Delta features for efficient CDC implementation, drawing from extensive experience across public and private sector deployments.

CDC before Databricks Delta

Before the advent of Databricks Delta, customers commonly established Change Data Capture (CDC) pipelines using tools such as Informatica, with Oracle acting as an intermediary before the data reached Databricks.

In this setup, Informatica was responsible for consolidating change sets from various sources into an Oracle data warehouse. Databricks jobs would then fetch these change sets from Oracle, usually once a day via JDBC, and proceed to update tables within Databricks. However, this method posed several challenges.

Primarily, it imposed strain on the Oracle instance due to the additional workload, leading to constraints on the scheduling of Extract, Transform, Load (ETL) jobs. Moreover, the refresh rates were restricted to nightly updates owing to concurrency issues associated with traditional Parquet tables.

CDC with Databricks Delta

With Databricks Delta, the CDC pipeline is more efficient, enabling more frequent updates. Change sets from sources like Informatica are directly written to S3. Databricks jobs can then run more frequently to process these changes and update Delta tables. This process can also be adapted to read CDC records from Kafka.

The pipeline involves maintaining a staging table to accumulate updates and a final table for the current snapshot. Each refresh period, a Spark job performs two main tasks: inserting new changes into the staging table and overwriting the final table with the latest data from the staging table.

This approach supports different update handling strategies, resembling SCD Type 2 for the staging table and SCD Type 1 for the final table. This streamlined process was successfully implemented for a customer's ETL pipeline, resulting in efficient updates to Delta tables.
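Below is a simplified PySpark sketch of that two-step refresh, assuming a Databricks environment where spark is predefined. The S3 path, table names, key column (customer_id), sequencing column (change_ts), and operation flag (op) are all hypothetical.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Step 1: append the newly arrived change set to the staging table.
# Staging keeps every version of each row, resembling SCD Type 2.
changes = spark.read.format("json").load("s3://my-bucket/cdc/incoming/")
changes.write.format("delta").mode("append").saveAsTable("customers_staging")

# Step 2: rebuild the final snapshot from staging.
# Final keeps only the latest version of each key, resembling SCD Type 1.
latest = Window.partitionBy("customer_id").orderBy(F.col("change_ts").desc())
snapshot = (
    spark.table("customers_staging")
    .withColumn("rn", F.row_number().over(latest))
    .where("rn = 1")
    .where("op <> 'DELETE'")  # drop keys whose latest change is a delete
    .drop("rn")
)
snapshot.write.format("delta").mode("overwrite").saveAsTable("customers_current")
```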

CDC Simplified with the Apply Changes API

The introduction of the Apply Changes API in Delta Live Tables marks a significant advancement in simplifying Change Data Capture (CDC) processes. This API streamlines CDC by automatically managing out-of-sequence records, obviating the need for intricate hand-written logic, and it supplants the commonly used MERGE INTO statement for CDC processing on Databricks with a more intuitive and efficient approach.

With its robust functionality, the Apply Changes API ensures accurate processing of CDC records while supporting target tables with both SCD Type 1 and SCD Type 2 semantics. SCD Type 1 updates directly overwrite existing records without retaining historical data, whereas SCD Type 2 preserves historical records, either for all updates or selectively for specified columns, giving users greater flexibility and precision in data management.

How to implement CDC with Delta Live Tables

To implement Change Data Capture (CDC) with Delta Live Tables, follow these steps for efficient and accurate data tracking and processing.

  1. Specify a column in the source data for sequencing records.
  2. Ensure the sequencing column is monotonically increasing, so later changes carry larger values.
  3. Delta Live Tables automatically handles out-of-order data based on that column.
  4. For SCD Type 2 changes, Delta Live Tables propagates the sequencing values to the __START_AT and __END_AT columns of the target table.
  5. Each key must have at most one distinct update per sequencing value.
  6. NULL sequencing values are not supported.

To perform CDC processing, do the following:

  1. Create a target streaming table (CREATE OR REFRESH STREAMING TABLE in SQL, or dlt.create_streaming_table() in Python).
  2. Use the APPLY CHANGES INTO statement in SQL, or the dlt.apply_changes() function in Python, to define the source, keys, and sequencing column.
  3. Choose how changes are stored in the target (SCD Type 1 or Type 2), as shown in the sketch below.
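Here is a minimal Python sketch of such a pipeline using the Delta Live Tables apply_changes() API. The source path, table names, and the customer_id, sequence_ts, and op columns are hypothetical; the op flag marks delete records in the feed.

```python
import dlt
from pyspark.sql.functions import col, expr

# Source view over the raw CDC feed (hypothetical path, read via Auto Loader).
@dlt.view
def customer_updates():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/cdc/customers/")
    )

# Target streaming table that APPLY CHANGES maintains.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customer_updates",
    keys=["customer_id"],
    sequence_by=col("sequence_ts"),          # ordering column; NULLs not allowed
    apply_as_deletes=expr("op = 'DELETE'"),  # treat these rows as deletes
    except_column_list=["op"],               # don't copy the op flag into the target
    stored_as_scd_type=1,                    # 1 = overwrite; 2 = keep history
)
```

Setting stored_as_scd_type=2 instead would retain history, with Delta Live Tables maintaining the __START_AT and __END_AT columns described above.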

In conclusion, Change Data Capture (CDC) is a critical process in modern data engineering, enabling organizations to track and record data modifications in real-time. With the evolution of technologies like Databricks Delta and Delta Live Tables, implementing CDC has become more streamlined and efficient, empowering businesses to maintain data consistency across systems and make informed decisions based on up-to-date information. By adhering to best practices and leveraging advanced features such as the Apply Changes API, organizations can simplify CDC implementation and enhance their data management practices. As the volume and complexity of data continue to grow, CDC remains an essential tool for organizations seeking to harness the full potential of their data assets and drive business success.
