Change Data Capture (CDC)
1. What is CDC?
Change Data Capture (CDC) is a way to track and stream all changes in your database tables (inserts, updates, deletes) as they happen, and deliver them to downstream systems in real time.
In our implementation, we use Debezium as the CDC engine.
2. How It Works (High-Level Architecture)
Here is a simplified flow of how our CDC system works:
Source Database: Our production MySQL Server is configured with binary logging (binlog).
Debezium Server: We run Debezium Server, which connects to the MySQL binlog and reads all row-level changes (inserts, updates, deletes). This is the core CDC component.
Publishing to Pub/Sub: Debezium publishes the change events directly to Google Cloud Pub/Sub topics.
Cloud Run Function: A Cloud Run function is triggered by the Pub/Sub messages. It transforms and wraps each event into the format we need (JSON) and writes it to a GCP Storage Bucket.
Partner's Infrastructure: The JSON files land in a GCP bucket inside your infrastructure; we write to that bucket using a service account that we provide.
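The Cloud Run step above can be sketched as a small handler: decode the Pub/Sub push envelope, parse the Debezium change event, and derive an object name for the upload. This is an illustrative, stdlib-only sketch; the function name and object-naming scheme are assumptions, not our exact implementation.

```python
import base64
import json

def transform_event(envelope: dict) -> tuple[str, str]:
    """Decode a Pub/Sub push envelope carrying a Debezium change event
    and return (object_name, json_payload) for upload to the bucket.

    The envelope layout follows the standard Pub/Sub push format; the
    payload layout is the usual Debezium structure (before/after/source/op).
    The "cdc/<table>/..." naming scheme here is illustrative only.
    """
    # Pub/Sub delivers the message body base64-encoded.
    data = base64.b64decode(envelope["message"]["data"])
    event = json.loads(data)

    table = event["source"]["table"]   # e.g. "account"
    op = event["op"]                   # "c", "u", or "d"
    ts = event["source"]["ts_ms"]      # source commit timestamp (ms)

    # One JSON file per change event, partitioned by table.
    object_name = f"cdc/{table}/{ts}-{op}.json"
    return object_name, json.dumps(event)
```

In the real function, the returned payload is then uploaded with the google-cloud-storage client (roughly `bucket.blob(object_name).upload_from_string(payload)`), authenticated as the service account described in the Security section.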
3. What Entities Are Streamed
We currently capture CDC for the following business entities:
Account
Investment
Transaction
These represent the core objects in your system where changes matter most.
4. Change Types We Capture
For each of those entities, we stream all three types of changes:
Create (new records)
Update (changes to existing records)
Delete (records being removed)
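On the wire, Debezium marks each event with a single-letter `op` code rather than these names. A minimal mapping (note that `"r"` additionally appears for rows read during the initial snapshot, before live streaming begins):

```python
# Debezium "op" codes mapped to the change types listed above.
OP_TYPES = {
    "c": "create",
    "u": "update",
    "d": "delete",
    "r": "snapshot read",  # initial snapshot only, not a live change
}

def change_type(event: dict) -> str:
    """Return the human-readable change type for a Debezium event."""
    return OP_TYPES[event["op"]]
```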
5. Data Format / Examples
| Entity | Example Payload |
|---|---|
| Account | |
| Investment | |
| Transaction | |
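To give a sense of the shape of these payloads: a Debezium row-change event generally carries `before`/`after` row images plus `source` metadata and an `op` code. The example below is illustrative only; the exact field set and values depend on our connector configuration, and the Account columns shown are placeholders.

```python
import json

# Illustrative Debezium-style update event for an Account row.
# Field names inside "before"/"after" are hypothetical examples.
account_update = {
    "before": {"id": 42, "status": "PENDING"},
    "after":  {"id": 42, "status": "ACTIVE"},
    "source": {"connector": "mysql", "table": "account", "ts_ms": 1700000000000},
    "op": "u",                 # "c" = create, "u" = update, "d" = delete
    "ts_ms": 1700000000123,    # time Debezium processed the event
}

print(json.dumps(account_update, indent=2))
```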
6. Security & Access Control
Security is a top priority. Here’s how it’s handled:
We will provide you with a Google Cloud service account.
You need to grant write access to your GCP bucket for that service account, so our Cloud Run function can upload the CDC JSON files into your bucket.
Permission is limited: the service account needs only object-create rights on the bucket (e.g. the roles/storage.objectCreator role), not read, delete, or admin access.
7. What You (the Partner) Need to Do
To make this work on your side, you will need to:
Create a GCP bucket in your own Google Cloud project (or identify an existing one).
Allow our service account to have write access to that bucket.
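The second step boils down to attaching a single IAM binding to your bucket; for example, `gsutil iam ch serviceAccount:<SA_EMAIL>:roles/storage.objectCreator gs://<YOUR_BUCKET>` does it in one command. A minimal sketch of the binding itself (the service-account address is a placeholder):

```python
def object_creator_binding(service_account_email: str) -> dict:
    """IAM policy binding that lets the given service account create
    objects in a bucket -- and nothing more (no read, no delete,
    no bucket administration)."""
    return {
        "role": "roles/storage.objectCreator",
        "members": [f"serviceAccount:{service_account_email}"],
    }
```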
8. Benefits for Your Business
Near real-time data: You get up-to-date views of Accounts, Investments, and Transactions as they change.
Loose coupling: You don’t need to poll our database — we push changes to you.
Scalable: Built on GCP’s serverless and managed infrastructure (Pub/Sub, Cloud Run, Storage).
Resilient: Debezium tracks its position in the binlog (offsets) in its own storage, so after a restart it resumes where it left off without losing committed changes.
Secure: Access is limited and auditable via GCP IAM.
9. Limitations / Considerations
Schema changes: If we change the schema of our tables (e.g., add or remove columns), that could affect the downstream JSON structure.
Latency: While it's near real-time, there may be a small delay (depending on Pub/Sub and Cloud Run).
Costs: You bear the cost for the GCP bucket storage, access, and any downstream compute you run on the JSON data.
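Regarding schema changes: downstream consumers can soften the impact by reading the JSON defensively, ignoring columns they don't know and tolerating columns that disappear. A sketch of this idea (the field names are hypothetical):

```python
def extract_account_fields(after: dict) -> dict:
    """Pull only the fields a consumer cares about from the "after"
    row image. Unknown columns we add later are ignored; columns we
    might drop fall back to None instead of raising, so a schema
    change does not break the pipeline outright."""
    wanted = ("id", "status", "balance")
    return {field: after.get(field) for field in wanted}
```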
10. Diagram
Here is a minimal architecture diagram:
MySQL (binlog)
│
▼
Debezium Server
│
▼
Google Cloud Pub/Sub (topic)
│
▼
Cloud Run Function (subscriber)
│
▼
GCP Storage Bucket (in your infra)