-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
After PITR recovery, Postgres won't start and report missing WAL file #881
Comments
I've got the same problem using PostgreSQL 16.1 and Barman 3.9.0. Recovery using |
The issue still persists in Barman 3.10. Here are logs and more details. Started the recovery using this command:
Started the PostgreSQL server, but it failed to start. Log file of barman:
Log file of PostgreSQL:
|
Hi, Could you please provide the output of the Also, the logs like you did before. Thanks a lot! |
Here are the requested information. I don't have a full test environment available, so I just created a blank database, backup it several times and tried to restore it using "--target-time", just to show the error.
https://medium.com/@kiwiv/recover-your-database-with-barman-be5ee4e11939
I picked the end date of the backup.
Log of PostgreSQL:
barman diagnose
Logfile of barman:
|
Hello @PaganMuffin @JP95Git , While taking a look at the issues reported by both of you, we guess we have 2 different problems here, one caused by an operational issue, and another by a couple of apparent bugs in Barman. We will follow-up with a couple of messages sharing our thoughts. Expect an update by tomorrow. |
From reading the logs, and I guess @barthisrael 's "operational issue" mentioned above is this that I will explain, is that you are picking a point in time for recovery that is outside the window of the backup and WALs that have been archived, so Postgres can never recover to that point in time as it doesn't exist in the timeline from the end of the backup until the last record in the 2 WALs available. We'll follow-up with more later |
In the case reported by @JP95Git, this initially seemed an operational issue, but this actually refers to a limitation in Barman. You seem to be requesting a Just for context: when you pass What Postgres does when in recovery with Now, with the proper context given, we will demonstrate that issue with the following example. We have a Barman backup here with this information: $ barman show-backup pg17 20240620T165110
Backup 20240620T165110:
Server Name : pg17
System Id : 7382227563826631291
Status : DONE
PostgreSQL Version : 170000
PGDATA directory : /var/lib/pgsql/17/data
Base backup information:
Disk usage : 22.3 MiB (54.3 MiB with WALs)
Incremental size : 22.3 MiB (-0.00%)
Timeline : 1
Begin WAL : 00000001000000000000003F
End WAL : 000000010000000000000040
WAL number : 2
Begin time : 2024-06-20 16:51:10.088025+00:00
End time : 2024-06-20 16:51:11.498244+00:00
Copy time : 1 second
Estimated throughput : 16.1 MiB/s
Begin Offset : 216
End Offset : 0
Begin LSN : 0/3F0000D8
End LSN : 0/40000000
WAL information:
No of files : 5
Disk usage : 80.0 MiB
WAL rate : 163.55/hour
Last available : 000000010000000000000045
Catalog information:
Retention Policy : not enforced
Previous Backup : - (this is the oldest base backup)
Next Backup : 20240620T165335 The important details here are:
We tried doing something similar to what you did, i.e. set $ barman recover pg17 20240620T165110 --target-time "2024-06-20 16:51:11.498244+00:00" --remote-ssh-command "ssh postgres@localhost" /var/lib/pgsql/17/restore
Starting remote restore for server pg17 using backup 20240620T165110
Destination directory: /var/lib/pgsql/17/restore
Remote command: ssh postgres@localhost
Doing PITR. Recovery target time: '2024-06-20 16:51:11.498244+00:00'
Copying the base backup.
Copying required WAL segments.
Generating recovery configuration
Identify dangerous settings in destination directory.
Recovery completed (start time: 2024-06-20 16:56:42.835003+00:00, elapsed time: 5 seconds)
Your PostgreSQL server has been successfully prepared for recovery! But then Postgres fails to start up: $ tail -10 restore/log/postgresql-Thu.log
2024-06-20 16:57:15.429 UTC [64304] LOG: restored log file "000000010000000000000041" from archive
cp: cannot stat 'barman_wal/000000010000000000000042': No such file or directory
2024-06-20 16:57:15.670 UTC [64304] LOG: restored log file "000000010000000000000041" from archive
cp: cannot stat 'barman_wal/000000010000000000000042': No such file or directory
2024-06-20 16:57:15.679 UTC [64304] LOG: redo done at 0/41FFFF90 system usage: CPU: user: 0.02 s, system: 0.36 s, elapsed: 0.48 s
2024-06-20 16:57:15.679 UTC [64304] FATAL: recovery ended before configured recovery target was reached
2024-06-20 16:57:15.682 UTC [64300] LOG: startup process (PID 64304) exited with exit code 1
2024-06-20 16:57:15.682 UTC [64300] LOG: terminating any other active server processes
2024-06-20 16:57:15.683 UTC [64300] LOG: shutting down due to startup process failure
2024-06-20 16:57:15.683 UTC [64300] LOG: database system is shut down As we can see, the issue is If we inspect the WALs which were copied by Barman, we will see there are no $ ls
00000001000000000000003F 000000010000000000000040 000000010000000000000041 archive_status
$ /usr/pgsql-17/bin/pg_waldump 00000001000000000000003F 000000010000000000000041 | egrep "COMMIT|ABORT"
$ That's why the recovery failed, Postgres was not able to reach a point where it was sure the requested recovery target was fully honored. If we inspect the Barman WAL archive, we find these $ ls -lh
total 149M
-rw-------. 1 vagrant vagrant 16M Jun 20 16:51 00000001000000000000003F
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000040
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000041
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000042
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000043
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000044
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000045
$ /usr/pgsql-17/bin/pg_waldump 00000001000000000000003F 000000010000000000000045 | egrep "COMMIT|ABORT"
rmgr: Transaction len (rec/tot): 405/ 405, tx: 789, lsn: 0/43D4DB58, prev 0/43D4DB18, desc: COMMIT 2024-06-20 16:53:25.807910 UTC; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16423
rmgr: Transaction len (rec/tot): 66/ 66, tx: 790, lsn: 0/43D96F08, prev 0/43D96A70, desc: COMMIT 2024-06-20 16:53:31.856379 UTC; inval msgs: catcache 63 These Now, giving context on what Barman does when you run
However, as we can see in the above example, that was not enough. My backup finished at WAL In order to make that process reliable we would need to inspect the WAL dumps and look for We intend to submit changes to Barman in such a way that We also intend to add a note to the Barman documentation, suggesting to use For now, a workaround would be using |
The problem initially reported through this GitHub issue seems related with a problem in Barman. When the user runs However, that implicit time zone is only taken into consideration when printing messages in the console, not when configuring Postgres for recovery. We will illustrate the issue with an example. Assume we have a Barman host whose time zone is UTC+2. When we attempt to run $ barman recover pg17 20240621T170006 --target-time "2024-06-21 17:00:09.348083" --remote-ssh-command "ssh postgres@localhost" /var/lib/pgsql/17/restore
Starting remote restore for server pg17 using backup 20240621T170006
Destination directory: /var/lib/pgsql/17/restore
Remote command: ssh postgres@localhost
Doing PITR. Recovery target time: '2024-06-21 17:00:09.348083+02:00'
Copying the base backup.
Copying required WAL segments.
Generating recovery configuration
Identify dangerous settings in destination directory.
Recovery completed (start time: 2024-06-21 17:03:49.748856+02:00, elapsed time: 8 seconds)
Your PostgreSQL server has been successfully prepared for recovery! As we can see, Barman claims to have configured However, in practice that is not true as we can see from $ cat /var/lib/pgsql/17/restore/postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
summarize_wal = 'on'
restore_command = 'cp barman_wal/%f %p'
recovery_end_command = 'rm -fr barman_wal'
recovery_target_time = '2024-06-21 17:00:09.348083' In my case I’m using the default $ tail -15 /var/lib/pgsql/17/restore/log/postgresql-Fri.log
2024-06-21 15:14:36.673 UTC [32083] LOG: starting point-in-time recovery to 2024-06-21 17:00:09.348083+00
2024-06-21 15:14:36.678 UTC [32083] LOG: redo starts at 1/7000028
2024-06-21 15:14:36.681 UTC [32083] LOG: restored log file "000000010000000100000008" from archive
2024-06-21 15:14:36.689 UTC [32083] LOG: restored log file "000000010000000100000009" from archive
cp: cannot stat 'barman_wal/00000001000000010000000A': No such file or directory
2024-06-21 15:14:36.698 UTC [32083] LOG: completed backup recovery with redo LSN 1/7000028 and end LSN 1/7000120
2024-06-21 15:14:36.698 UTC [32083] LOG: consistent recovery state reached at 1/7000120
2024-06-21 15:14:36.698 UTC [32079] LOG: database system is ready to accept read-only connections
cp: cannot stat 'barman_wal/00000001000000010000000A': No such file or directory
2024-06-21 15:14:36.704 UTC [32083] LOG: redo done at 1/9000238 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s
2024-06-21 15:14:36.704 UTC [32083] FATAL: recovery ended before configured recovery target was reached
2024-06-21 15:14:36.712 UTC [32079] LOG: startup process (PID 32083) exited with exit code 1
2024-06-21 15:14:36.712 UTC [32079] LOG: terminating any other active server processes
2024-06-21 15:14:36.713 UTC [32079] LOG: shutting down due to startup process failure
2024-06-21 15:14:36.713 UTC [32079] LOG: database system is shut down It is looking for a We are working on a fix for this issue. You can follow this PR for more details. |
Previous to this commit there could be cases where Postgres would be misconfigured by `barman recover`. For example, assume you had an environment like this: * Barman host configured with time zone UTC+2, e.g. `Europe/Berlin`;` * Postgres cluster with default value for timezone GUC, i.e. GMT value. If we attempted a `barman recover` execution with `--target-time` set, but without an explicit time zone, recovery could fail: * `barman recover` output would say that recovery target time was set with `+02:00`; * However `postgresql.auto.conf` would not contain `+02:00`. In practice either the recovery would fail because of an invalid recovery target, or the cluster would be recovered to a point-in-time different from the user needs. Neither of these cases are welcome. This commit fixes that issue by reusing the `target_datetime` key from the `recovery_info` variable. That entry contains the target time with a time zone explicitily set, even if the user specified no time zone through the command-line option. Note: a warning is issued if the user specified no time zone, and Barman assumed its host time zone. References: BAR-185 #881. Signed-off-by: Israel Barth Rubio <[email protected]>
Previous to this commit there could be cases where Postgres would be misconfigured by `barman recover`. For example, assume you had an environment like this: * Barman host configured with time zone UTC+2, e.g. `Europe/Berlin`;` * Postgres cluster with default value for timezone GUC, i.e. GMT value. If we attempted a `barman recover` execution with `--target-time` set, but without an explicit time zone, recovery could fail: * `barman recover` output would say that recovery target time was set with `+02:00`; * However `postgresql.auto.conf` would not contain `+02:00`. In practice either the recovery would fail because of an invalid recovery target, or the cluster would be recovered to a point-in-time different from the user needs. Neither of these cases are welcome. This commit fixes that issue by reusing the `target_datetime` key from the `recovery_info` variable. That entry contains the target time with a time zone explicitily set, even if the user specified no time zone through the command-line option. Note: a warning is issued if the user specified no time zone, and Barman assumed its host time zone. References: BAR-185 #881. Signed-off-by: Israel Barth Rubio <[email protected]>
… files Previous to this commit Barman would attempt to guess the required WAL files using the filesystem creation timestamp of them. However, that is not a reliable approach. For example, if there is streaming replication lag, the WAL files will be created in the Barman host later when compared to Postgres. That can be even worse in the case of `archive_command`, because it waits for the WAL files to be filled up before making them available for archiving. In those cases the recovery could end up failing because of missing `COMMIT` or `ABORT` records in the WAL files that were copied by Barman, i.e. Postgres would fail to perform recovery because it wouldn't know if it satistified or not the requested `recovery_target_time`. From now on, if the user requests a recovery with `--no-get-wal` and `--target-time`, Barman will simply copy all WAL files up to the timeline being recovered, guaranteeing that way that Postgres will be able to find `COMMIT` or `ABORT` records, if they exist in the archived WAL files, making it possible to complete the recovery. Note: we evaluated other implementations to avoid possibly copying a lot of unused WAL files. For example, with `pg_waldump` we would be able to look up for `COMMIT` and `ABORT` records in a way similar to what Postgres does. However, that could put a lot of overhead wherever that would be processed (during WAL archiving or backup recovery), so that option was discarded. References: BAR-189 #881. Signed-off-by: Israel Barth Rubio <[email protected]>
… files Previous to this commit Barman would attempt to guess the required WAL files using the filesystem creation timestamp of them. However, that is not a reliable approach. For example, if there is streaming replication lag, the WAL files will be created in the Barman host later when compared to Postgres. That can be even worse in the case of `archive_command`, because it waits for the WAL files to be filled up before making them available for archiving. In those cases the recovery could end up failing because of missing `COMMIT` or `ABORT` records in the WAL files that were copied by Barman, i.e. Postgres would fail to perform recovery because it wouldn't know if it satistified or not the requested `recovery_target_time`. From now on, if the user requests a recovery with `--no-get-wal` and `--target-time`, Barman will simply copy all WAL files up to the timeline being recovered, guaranteeing that way that Postgres will be able to find `COMMIT` or `ABORT` records, if they exist in the archived WAL files, making it possible to complete the recovery. Note: we evaluated other implementations to avoid possibly copying a lot of unused WAL files. For example, with `pg_waldump` we would be able to look up for `COMMIT` and `ABORT` records in a way similar to what Postgres does. However, that could put a lot of overhead wherever that would be processed (during WAL archiving or backup recovery), so that option was discarded. References: BAR-189 #881. Signed-off-by: Israel Barth Rubio <[email protected]>
Good news: The PITR recovery is now working.
It works :-) Step 3 was ugly, but apparently necessary, since I run PostgreSQL and barman under a normal user. I also had to compile both tools myself and install them in /opt. For security reasons I don't have root privileges and I'm only allowed to start/stop selected services via sudo. Would be great if we could override the generated restore_command from barman using some setting in barman such that step 3 is no longer needed. |
Hello @JP95Git ,
Thanks for letting us now! As both PRs related with this GitHub issue were merged into Once a new Barman version is released, which is expected to occur somewhen between August and September, those fixes/enhancements will become available.
As we are closing this issue, we opened a separate GitHub issue #956 to track this enhancement request. Although this sounds unusual, and in most cases what is automatically inferred by Barman should be enough, we understand that there might exist some cases where overrides are necessary. |
This restore command could be replaced with |
After doing PITR recovery with specific time Postgres won't start.
Recovery command is reporting that postgres is successfully prepared for recovery, but after starting postgres instance it fail and log that WAL file, which is after target time according to timeline, is missing.
I can do a restore to the latest version of DB without a problem.
Command used to recovery
Command log
Postgres Logs
Timeline UTC+1
(Barman server, Postgres server have same timezone)
TIME | ACTION | WAL NUMBER | ADD NFO
Barman log
The text was updated successfully, but these errors were encountered: