Following the article “SeaTunnel CDC Under the Hood: Snapshots, Backfills, and Why Your Checkpoints Time Out”, which detailed the implementation mechanisms and principles of the Apache SeaTunnel CDC Source, this article will continue to explore the underlying technical logic of Apache SeaTunnel CDC by explaining the relationship between Debezium and Apache SeaTunnel.
To summarize their relationship in one sentence: Debezium is the core underlying engine of SeaTunnel CDC, while SeaTunnel CDC encapsulates, enhances, and extends Debezium’s functionalities.
Below is a detailed explanation of their relationship:
“Debezium can be regarded as the pioneer of CDC.” Within the SeaTunnel CDC ecosystem, Debezium plays an irreplaceable “foundation” role.
Debezium emits each captured change as a standardized record (SourceRecord), containing the before and after states, the operation type (Envelope.Operation: CREATE/READ/UPDATE/DELETE), and other information, providing a standardized input for upper-layer processing.
This is the most critical point for understanding their relationship.
SeaTunnel depends on Debezium's libraries (debezium-api and debezium-embedded) to run the Debezium engine as a library directly within SeaTunnel’s process. This completely removes the mandatory dependency on a Kafka cluster.
SeaTunnel builds a sophisticated “orchestration layer” on top of the Debezium engine to manage and schedule Debezium’s operations.
SeaTunnel sits at the top layer, handling read logic, deserialization, streaming fetch, and connection management; Debezium sits at the bottom layer, driving the database’s CDC mechanism and generating standardized data records.
SeaTunnel’s utilization of Debezium’s core functionalities is summarized in the table below:

| Function | Provided by Debezium (Core Capability) | Used by SeaTunnel (Encapsulation/Invocation) |
|----|----|----|
| Full Snapshot Read | Snapshot reading | SnapshotChangeEventSource executes SELECT reads |
| Incremental Read | Incremental reading | StreamingChangeEventSource reads Binlog/WAL, etc. |
| Data Structure | Data record (SourceRecord) | Extracts raw before/after information |
| Operation Type | Envelope.Operation | Identifies CREATE/UPDATE/DELETE operations |
| State Management | Offset & Schema management | Tracks read positions and DDL changes |

The two are connected in the data processing pipeline. Debezium produces the “raw material,” and SeaTunnel “processes” it into a standardized internal format.
Debezium Output: Produces a SourceRecord containing raw change information.
SeaTunnel Translation: Uses DebeziumDeserializeSchema to deserialize SourceRecord, extract key information, and convert it into SeaTunnel’s internal row format SeaTunnelRow, while tagging the row type (RowKind, e.g., INSERT/UPDATE_AFTER).
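The translation step can be illustrated with a small, self-contained sketch. This is not SeaTunnel's actual DebeziumDeserializeSchema: the `ChangeEventTranslator` class, its `Row` type, and the `translate` method are hypothetical stand-ins for the SourceRecord envelope and SeaTunnelRow. It assumes that snapshot reads are treated as inserts and that an update is emitted as an UPDATE_BEFORE/UPDATE_AFTER pair.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the translation step; not SeaTunnel's real classes. */
final class ChangeEventTranslator {

    /** Row kinds as named in the article. */
    enum RowKind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    /** Simplified stand-in for SeaTunnelRow: a kind plus column values. */
    static final class Row {
        final RowKind kind;
        final Map<String, Object> fields;

        Row(RowKind kind, Map<String, Object> fields) {
            this.kind = kind;
            this.fields = fields;
        }
    }

    /**
     * Translates one change event into one or two rows.
     * The op argument uses Debezium's envelope codes:
     * "c" create, "r" snapshot read, "u" update, "d" delete.
     */
    static List<Row> translate(String op, Map<String, Object> before, Map<String, Object> after) {
        List<Row> out = new ArrayList<>();
        switch (op) {
            case "c":
            case "r":
                // Assumption: snapshot reads are surfaced as plain inserts.
                out.add(new Row(RowKind.INSERT, after));
                break;
            case "u":
                // Assumption: an update becomes a before/after pair.
                out.add(new Row(RowKind.UPDATE_BEFORE, before));
                out.add(new Row(RowKind.UPDATE_AFTER, after));
                break;
            case "d":
                out.add(new Row(RowKind.DELETE, before));
                break;
            default:
                throw new IllegalArgumentException("Unsupported op: " + op);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> before = Collections.<String, Object>singletonMap("name", "old");
        Map<String, Object> after = Collections.<String, Object>singletonMap("name", "new");
        for (Row row : translate("u", before, after)) {
            System.out.println(row.kind + " " + row.fields);
        }
    }
}
```

Splitting an update into two rows lets a downstream sink retract the old value and apply the new one, which is why the row kind travels with every row.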
By embedding and encapsulating Debezium, SeaTunnel CDC achieves significant enhancements over a native Debezium deployment.
Key Enhancements Provided by SeaTunnel:
Kafka Decoupling: This is the biggest difference. SeaTunnel CDC can write data directly to any supported Sink (e.g., data lake or warehouse) without passing through Kafka.
Parallel Reading Capability: SeaTunnel introduces parallel slicing to concurrently read full historical data, greatly improving efficiency.
Native Engine Integration: Deep integration with the checkpoint mechanism of the SeaTunnel engine (and of Flink/Spark), which is what allows the pipeline to provide exactly-once semantics.
Schema Evolution Support: Better handling of source-side DDL changes to adapt to table structure evolution.
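The parallel slicing idea from the list above can be illustrated with a minimal sketch. It assumes a table with a numeric primary key and divides the key range into fixed-size chunks that separate readers could scan concurrently with bounded SELECT statements. The names `ChunkSplitter`, `Chunk`, `split`, and `toPredicate`, as well as the half-open range convention, are assumptions made for illustration; a production splitter has to handle more cases, such as non-numeric or unevenly distributed keys.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of range-based snapshot splitting; not SeaTunnel's real splitter. */
final class ChunkSplitter {

    /** Half-open key range [start, end). */
    static final class Chunk {
        final long start;
        final long end;

        Chunk(long start, long end) {
            this.start = start;
            this.end = end;
        }

        /** Renders the WHERE predicate a reader would use for this chunk. */
        String toPredicate(String keyColumn) {
            return keyColumn + " >= " + start + " AND " + keyColumn + " < " + end;
        }
    }

    /**
     * Splits the inclusive key range [minKey, maxKey] into chunks of at most chunkSize keys.
     * Assumes keys are far enough below Long.MAX_VALUE that the arithmetic cannot overflow.
     */
    static List<Chunk> split(long minKey, long maxKey, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        if (minKey > maxKey) {
            throw new IllegalArgumentException("minKey must not exceed maxKey");
        }
        List<Chunk> chunks = new ArrayList<>();
        long endExclusive = maxKey + 1;
        for (long start = minKey; start < endExclusive; start += chunkSize) {
            chunks.add(new Chunk(start, Math.min(start + chunkSize, endExclusive)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each predicate could be handed to a different reader and scanned in parallel.
        for (Chunk chunk : split(1, 10, 4)) {
            System.out.println("SELECT * FROM t WHERE " + chunk.toPredicate("id"));
        }
    }
}
```

Because the chunks do not overlap and together cover the whole key range, each row is read by exactly one reader during the snapshot phase.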

