Postgres to Kafka Source Connector

Damindu Lakmal
Feb 25, 2024


postgres to Kafka

PostgreSQL is a powerful and versatile database management system suitable for a wide range of applications, from small-scale projects to large-scale enterprise deployments. Its features, performance, and extensibility make it a popular choice among developers and organizations worldwide.

What is Write-Ahead Logging (WAL)

In PostgreSQL, the Write-Ahead Logging (WAL) mechanism is a key component of its transactional integrity and crash recovery features.

  1. Logging Changes: Whenever a transaction modifies data in the database, PostgreSQL writes the changes to a WAL file before making the modifications to the actual data files. This ensures that the changes are durably stored on disk even if the transaction hasn’t been committed yet.
  2. Sequential Write: WAL writes are sequential, meaning they are written in a continuous stream rather than randomly scattered across the disk. This sequential write pattern is more efficient for disk I/O, especially on magnetic hard drives, and helps in improving overall performance.
  3. Separate from Data Files: The WAL files are separate from the main data files of the database. This separation helps in isolating the transaction log from the actual data, making it easier to manage and protect.
  4. Redo and Undo Information: The WAL contains both redo and undo information. Redo information is used during crash recovery to replay committed transactions from the log, ensuring that the database is brought back to a consistent state after a crash. Undo information allows PostgreSQL to rollback uncommitted transactions during crash recovery.
  5. Checkpointing: Periodically, PostgreSQL performs a checkpoint operation where it flushes the modified data pages from memory to disk and writes a special record in the WAL to indicate that the changes up to that point have been durably stored. This helps in limiting the amount of WAL data that needs to be replayed during crash recovery, reducing the recovery time.
  6. Archiving and Streaming: PostgreSQL provides options to archive WAL files for backup and disaster recovery purposes. Additionally, it supports streaming replication, where WAL changes are continuously streamed to standby servers for real-time replication.
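As a quick sanity check, you can inspect these WAL-related settings from psql (a sketch; the setting names and functions are standard PostgreSQL):

```sql
-- Show the WAL detail level (minimal, replica, or logical)
SHOW wal_level;
-- Show whether completed WAL segments are archived
SHOW archive_mode;
-- LSNs of the latest checkpoint and its REDO record
SELECT checkpoint_lsn, redo_lsn FROM pg_control_checkpoint();
```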

What is Log Sequence Number (LSN)

An LSN is a unique identifier used to track and order changes recorded in the Write-Ahead Log (WAL). An LSN has the following attributes:

  • Unique Identifier
  • Monotonicity
  • Pointer to WAL Location
  • Checkpointing
  • Replication
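The monotonic, pointer-like nature of LSNs can be observed directly in psql (assuming a running server):

```sql
-- Current end of WAL as an LSN (formatted as two hex numbers, e.g. 0/16B3748)
SELECT pg_current_wal_lsn();
-- Byte distance between two LSNs; because LSNs are monotonic, this only grows
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') AS bytes_written;
```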

How Postgres Writes Data to Kafka

Before starting the Postgres server, you need to set wal_level to logical. At the logical level, the same information is logged as with replica, plus the information needed to extract logical change sets from the WAL. Using a level of logical will increase the WAL volume, particularly if many tables are configured with REPLICA IDENTITY FULL and many UPDATE and DELETE statements are executed.
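A minimal postgresql.conf fragment for this setup might look like the following (the slot and sender counts are illustrative values, not requirements):

```
wal_level = logical           # log extra info needed for logical decoding
max_replication_slots = 4     # at least one slot per connector
max_wal_senders = 4           # walsender processes available for streaming
```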

Furthermore, there is no guarantee that a change has been transactionally committed at the point it appears: data is pushed to the stream as WAL files are archived, so Kafka may receive duplicate messages during Postgres disaster recovery.

Start Source Connector

A source connector is a component that facilitates the ingestion of data from external systems into Kafka topics. Source connectors are responsible for continuously pulling data from the source systems and producing corresponding Kafka messages to specific Kafka topics. Go through this document for more information.
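As a sketch, a Debezium PostgreSQL source connector can be registered through the Kafka Connect REST API with a configuration like the one below (the connector name, hostnames, credentials, and database name are placeholders):

```json
{
  "name": "postgres-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "mydb",
    "topic.prefix": "pg",
    "plugin.name": "pgoutput"
  }
}
```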

Replication Slot

After starting your connector, you can check the replication slot status with the following SQL command:

```sql
SELECT slot_name,
       plugin,
       slot_type,
       active,
       active_pid,
       confirmed_flush_lsn,
       restart_lsn,
       wal_status
FROM pg_replication_slots
WHERE active;
```

Once the Debezium connector is set up and running, it continuously monitors the PostgreSQL database’s transaction log (WAL) to capture changes. Whenever a transaction commits, Debezium captures the changes made by that transaction, including inserts, updates, and deletes.
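Outside of Kafka Connect, you can observe the same change-capture mechanism directly with a logical replication slot and the built-in test_decoding output plugin (a sketch; the slot name is arbitrary):

```sql
-- Create a logical slot using the built-in test_decoding output plugin
SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');
-- Peek at decoded changes without consuming them from the slot
SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);
-- Drop the slot when done so WAL is not retained indefinitely
SELECT pg_drop_replication_slot('demo_slot');
```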

Summary

By streaming changes from PostgreSQL into Kafka, you can build real-time data pipelines and integrate your database with a wide range of downstream applications and systems, enabling real-time analytics, data synchronization, and event-driven architectures.
