The replication slot created by a Kafka Connect connector is filling up.

I have a Postgres RDS database on AWS. I set the following parameter group option on it (only showing the diff from the default):

    rds.logical_replication: 1

I have Kafka Connect running with a Debezium Postgres connector. This is the config (with certain values redacted, of course):

    "database.dbname"        = "mydb"
    "database.hostname"      = "myhostname"
    "database.password"      = "mypass"
    "database.port"          = "myport"
    "database.server.name"   = "postgres"
    "database.user"          = "myuser"
    "database.whitelist"     = "my_database"
    "include.schema.changes" = "false"
    "plugin.name"            = "wal2json_streaming"
    "slot.name"              = "my_slotname"
    "snapshot.mode"          = "never"
    "table.whitelist"        = "public.mytable"
    "tombstones.on.delete"   = "false"
    "transforms"             = "key"
    "transforms.key.field"   = "id"
    "transforms.key.type"    = "org.apache.kafka.connect.transforms.ExtractField$Key"

If I get the status of this connector, it appears to be fine:

    curl -s http://my.kafkaconnect.url:kc_port/connectors/my-connector/status | jq
    {
      "name": "my-connector",
      "connector": {
        "state": "RUNNING",
        "worker_id": "some_ip"
      },
      "tasks": [
        {
          "id": 0,
          "state": "RUNNING",
          "worker_id": "some_ip"
        }
      ],
      "type": "source"
    }

However, the replication slot in Postgres keeps getting larger and larger:

    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replicationSlotLag,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmedLag,
           active
    FROM pg_replication_slots;

     slot_name   | replicationslotlag | confirmedlag | active
    -------------+--------------------+--------------+--------
     my_slotname | 20 GB              | 20 GB        | t

Why does the replication slot keep growing? As I understand it, the Kafka Connect connector task that is running should be reading from this replication slot, publishing the changes to the topic postgres.public.mytable, and then the replication slot should shrink. Am I missing something in this chain of actions?

Please take a look at WAL Diskspace Consumption. The most common reason the PostgreSQL WAL backs up is that the connector is monitoring a database, or a subset of its tables, that changes much less frequently than the other tables or databases in your environment, so the connector isn't acknowledging LSNs often enough to avoid the WAL backlog. For Debezium 1.0.x and earlier, enable heartbeat.interval.ms. For Debezium 1.1.0 and later, also consider enabling heartbeat.action.query.
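A minimal sketch of what that looks like in the connector config, in the same redacted style as the question. The interval value and the heartbeat table name (public.debezium_heartbeat) are illustrative assumptions, not values from the original post:

    "heartbeat.interval.ms"  = "10000"
    "heartbeat.action.query" = "INSERT INTO public.debezium_heartbeat (id, ts) VALUES (1, now()) ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts"

The heartbeat.interval.ms setting alone makes the connector emit periodic heartbeat records and acknowledge offsets; heartbeat.action.query additionally writes to a table so that the heartbeat itself generates WAL on the monitored database.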

Found this Google Groups discussion where Gunnar mentions it: "The core heartbeat functionality periodically sends messages to a heartbeat topic, allowing processed WAL offsets to be acknowledged in case only events from filtered tables occur (which is what you're observing). The heartbeat action query (which requires a table and its inclusion in the publication) is useful for the case of multiple databases, where the connector receives changes from one database while another otherwise has no/low traffic, again allowing offsets to be acknowledged in that situation." - Gunnar. In the group discussion he mentions that we have to add this heartbeat table to the publication for the heartbeat query to work. That should help.
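A sketch of the database-side setup that comment describes. The table name and the publication name (dbz_publication) are illustrative assumptions; adjust them to whatever your connector actually uses:

    -- Hypothetical table targeted by heartbeat.action.query
    CREATE TABLE public.debezium_heartbeat (
        id integer PRIMARY KEY,
        ts timestamptz NOT NULL
    );
    INSERT INTO public.debezium_heartbeat (id, ts) VALUES (1, now());

    -- The comment's key point: the heartbeat table must also be part of the
    -- publication the connector streams from.
    ALTER PUBLICATION dbz_publication ADD TABLE public.debezium_heartbeat;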

I'll mark this as the answer, because it generally is the answer. On Debezium 1.0 the heartbeat doesn't actually commit the LSN, so the replication lag keeps growing. On 1.1 the query would be nice, but we use the ExtractField SMT, which isn't compatible with any heartbeat (so it doesn't work on 1.0 either). Tracked at issues.redhat.com/browse/DBZ-1909
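Not from the original thread, but for completeness: one way to keep an SMT like ExtractField from touching heartbeat records, assuming a Kafka Connect version with predicate support (2.6+), is to gate the transform on the topic name. The predicate name and topic pattern below are illustrative assumptions; the default Debezium heartbeat topics start with __debezium-heartbeat:

    "predicates"                     = "isHeartbeat"
    "predicates.isHeartbeat.type"    = "org.apache.kafka.connect.transforms.predicates.TopicNameMatches"
    "predicates.isHeartbeat.pattern" = "__debezium-heartbeat.*"
    "transforms.key.predicate"       = "isHeartbeat"
    "transforms.key.negate"          = "true"

With negate set to true, the ExtractField$Key transform is applied only to records whose topic does not match the heartbeat pattern, so heartbeat records (which have no "id" field in their key) pass through untouched.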

Important information from the linked documentation: "For users on AWS RDS with Postgres, a similar situation to the third cause may occur on an idle environment, since AWS RDS makes writes to its own system tables not visible to the users on a frequent basis (5 minutes). Again regularly emitting events will solve the problem."
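A minimal sketch of "regularly emitting events" on an otherwise-idle database, reusing the hypothetical heartbeat table from above; this could be run from a scheduled job, or delegated to heartbeat.action.query as shown earlier:

    -- Touch the heartbeat row so WAL advances and the connector has a change
    -- in a captured table to acknowledge.
    UPDATE public.debezium_heartbeat SET ts = now() WHERE id = 1;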

What do you think about setting max_slot_wal_keep_size as a precaution? In extreme cases it could mean data loss for Debezium, but it also guarantees that Debezium can't take down the entire database!
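For reference (not from the thread): max_slot_wal_keep_size exists from PostgreSQL 13 onward, and on RDS it would be set in the parameter group, the same way as rds.logical_replication above. The 51200 MB (50 GB) cap is an arbitrary example:

    max_slot_wal_keep_size: 51200

PostgreSQL 13+ also reports how close each slot is to being invalidated:

    SELECT slot_name, wal_status, safe_wal_size
    FROM pg_replication_slots;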

Edit/update: added the heartbeat table to the publication. It made no difference. This question is still open.