Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After PITR recovery, Postgres won't start and report missing WAL file #881

Closed
PaganMuffin opened this issue Dec 18, 2023 · 11 comments
Closed
Assignees
Labels

Comments

@PaganMuffin
Copy link

After doing PITR recovery with specific time Postgres won't start.
Recovery command is reporting that postgres is successfully prepared for recovery, but after starting postgres instance it fail and log that WAL file, which is after target time according to timeline, is missing.
I can do a restore to the latest version of DB without a problem.

Command used to recovery

barman recover --target-time 20231215T162400 --remote-ssh-command 'ssh SERVER' remote 20231215T162003  /root/dev/pg-wal/pg-data/pgdata

Command log

Starting remote restore for server remote using backup 20231215T162003
Destination directory: /root/dev/pg-wal/pg-data/pgdata
Remote command: ssh SERVER
Doing PITR. Recovery target time: '2023-12-15 16:24:00+01:00'
Copying the base backup.
Copying required WAL segments.
Generating recovery configuration
Identify dangerous settings in destination directory.

Recovery completed (start time: 2023-12-18 08:39:26.886501+01:00, elapsed time: 33 seconds)
Your PostgreSQL server has been successfully prepared for recovery!

Postgres Logs

PostgreSQL Database directory appears to contain a database; Skipping initialization

2023-12-18 07:41:06.770 UTC [1] LOG:  starting PostgreSQL 14.10 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, 64-bit
2023-12-18 07:41:06.771 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-12-18 07:41:06.771 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-12-18 07:41:06.786 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-12-18 07:41:06.807 UTC [21] LOG:  database system was interrupted while in recovery at log time 2023-12-15 15:20:03 UTC
2023-12-18 07:41:06.807 UTC [21] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
cp: can't stat 'barman_wal/00000002.history': No such file or directory
2023-12-18 07:41:06.887 UTC [21] LOG:  starting point-in-time recovery to 2023-12-15 16:24:00+00
2023-12-18 07:41:06.910 UTC [21] LOG:  restored log file "000000010000000000000003" from archive
2023-12-18 07:41:07.059 UTC [21] LOG:  redo starts at 0/3000028
2023-12-18 07:41:07.059 UTC [21] LOG:  consistent recovery state reached at 0/3000100
2023-12-18 07:41:07.059 UTC [1] LOG:  database system is ready to accept read-only connections
2023-12-18 07:41:07.083 UTC [21] LOG:  restored log file "000000010000000000000004" from archive
cp: can't stat 'barman_wal/000000010000000000000005': No such file or directory
2023-12-18 07:41:07.755 UTC [21] LOG:  redo done at 0/4FFFF28 system usage: CPU: user: 0.08 s, system: 0.02 s, elapsed: 0.69 s
2023-12-18 07:41:07.755 UTC [21] LOG:  last completed transaction was at log time 2023-12-15 15:25:13.895411+00
2023-12-18 07:41:07.755 UTC [21] FATAL:  recovery ended before configured recovery target was reached
2023-12-18 07:41:07.758 UTC [1] LOG:  startup process (PID 21) exited with exit code 1
2023-12-18 07:41:07.758 UTC [1] LOG:  terminating any other active server processes
2023-12-18 07:41:07.761 UTC [1] LOG:  shutting down due to startup process failure
2023-12-18 07:41:07.765 UTC [1] LOG:  database system is shut down

Timeline UTC+1
(Barman server, Postgres server have same timezone)
TIME | ACTION | WAL NUMBER | ADD NFO

16:18 CREATE TABLE      WAL 1
16:19 INSERT 0 1001     WAL 1
16:20 WAL SWITCH        WAL 2
16:20 BASE BACKUP       WAL 3 20231215T162003
16:22 INSERT 0 1001     WAL 4
16:25 INSERT 0 50001    WAL 4
16:25 INSERT 0 100001   WAL 5
16:26 BASE BACKUP       WAL 6 EXCEPT 152004 cars 
16:29 BASE BACKUP WAIT  WAL 7 EXCEPT 152004 cars  
16:35 INSERT 0 100001   WAL 8 EXCEPT 252005 cars 
16:35 INSERT 0 100001   WAL 9 EXCEPT 352006 cars

Barman log

2023-12-18 08:39:25,578 [6438] barman.wal_archiver INFO: No xlog segments found from streaming for remote.
2023-12-18 08:39:26,889 [6438] barman.recovery_executor INFO: Starting remote restore for server remote using backup 20231215T162003
2023-12-18 08:39:26,890 [6438] barman.recovery_executor INFO: Destination directory: /root/dev/pg-wal/pg-data/pgdata
2023-12-18 08:39:26,890 [6438] barman.recovery_executor INFO: Remote command: ssh SERVER
2023-12-18 08:39:26,891 [6438] barman.recovery_executor INFO: Doing PITR. Recovery target time: '2023-12-15 16:24:00+01:00'
2023-12-18 08:39:27,487 [6438] barman.recovery_executor WARNING: Unable to retrieve safe horizon time for smart rsync copy: The /root/dev/pg-wal/pg-data/pgdata/.barman-recover.info file does not exist
2023-12-18 08:39:28,705 [6438] barman.recovery_executor INFO: Copying the base backup.
2023-12-18 08:39:28,716 [6438] barman.copy_controller INFO: Copy started (safe before None)
2023-12-18 08:39:28,717 [6438] barman.copy_controller INFO: Copy step 1 of 4: [global] analyze PGDATA directory: /Users/USERNAME/dev/pg_wal_test/barman/data/remote/base/20231215T162003/data/
2023-12-18 08:39:29,566 [6438] barman.copy_controller INFO: Copy step 2 of 4: [global] create destination directories and delete unknown files for PGDATA directory: /Users/USERNAME/dev/pg_wal_test/barman/data/remote/base/20231215T162003/data/
2023-12-18 08:39:45,065 [6438] barman.copy_controller INFO: Copy finished (safe before None)
2023-12-18 08:39:47,053 [6438] barman.recovery_executor INFO: Copying required WAL segments.
2023-12-18 08:39:47,054 [6438] barman.recovery_executor INFO: Starting copy of 2 WAL files 2/2 from WalFileInfo(compression='gzip', name='000000010000000000000003', size=16491, time=1702653605.3655617) to WalFileInfo(compression='gzip', name='000000010000000000000004', size=7354841, time=1702653930.5049307)
2023-12-18 08:39:56,860 [6438] barman.recovery_executor INFO: Finished copying 2 WAL files.
2023-12-18 08:39:56,862 [6438] barman.recovery_executor INFO: Generating recovery configuration
2023-12-18 08:39:59,018 [6438] barman.recovery_executor INFO: Identify dangerous settings in destination directory.
@JP95Git
Copy link

JP95Git commented Mar 15, 2024

I've got the same problem using PostgreSQL 16.1 and Barman 3.9.0.

Recovery using
barman --config /path/to/barman.conf backup --wait localhost
works fine and restores PostgreSQL to the latest state, but adding --target-time="2024-03-15 13:50:56.844481+00:00" lead to a missing WAL file.

@JP95Git
Copy link

JP95Git commented Apr 10, 2024

The issue still persists in Barman 3.10. Here are logs and more details.

Started the recovery using this command:
/path/to/barman --config /path/to/barman.conf recover localhost 20240410T165613 /path/to/database --target-time "2024-04-10 14:56:15.915033+00:00"

Starting local restore for server localhost using backup 20240410T165613
Destination directory: /path/to/database
Doing PITR. Recovery target time: '2024-04-10 14:56:15.915033+00:00'
Using safe horizon time for smart rsync copy: 2024-04-10 14:56:13.756299+00:00
Copying the base backup.
Copying required WAL segments.
Generating recovery configuration
Identify dangerous settings in destination directory.

IMPORTANT
These settings have been modified to prevent data losses

postgresql.conf line 262: archive_command = false
postgresql.auto.conf line 7: recovery_target_time = None

Recovery completed (start time: 2024-04-10 17:20:21.768695+02:00, elapsed time: less than one second)
Your PostgreSQL server has been successfully prepared for recovery!

Started the PostgreSQL server, but it failed to start.

Log file of barman:

2024-04-10 17:20:21,736 [2504447] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
2024-04-10 17:20:21,764 [2504447] barman.wal_archiver INFO: No xlog segments found from streaming for localhost.
2024-04-10 17:20:21,770 [2504447] barman.recovery_executor INFO: Starting local restore for server localhost using backup 20240410T165613
2024-04-10 17:20:21,771 [2504447] barman.recovery_executor INFO: Destination directory: /path/to/database
2024-04-10 17:20:21,772 [2504447] barman.recovery_executor INFO: Doing PITR. Recovery target time: '2024-04-10 14:56:15.915033+00:00'
2024-04-10 17:20:21,787 [2504447] barman.recovery_executor INFO: Using safe horizon time for smart rsync copy: 2024-04-10 14:56:13.756299+00:00
2024-04-10 17:20:21,796 [2504447] barman.recovery_executor INFO: Copying the base backup.
2024-04-10 17:20:21,799 [2504447] barman.copy_controller INFO: Copy started (safe before datetime.datetime(2024, 4, 10, 14, 56, 13, 756299, tzinfo=tzutc()))
2024-04-10 17:20:21,799 [2504447] barman.copy_controller INFO: Copy step 1 of 4: [global] analyze PGDATA directory: /path/to/database-backup/localhost/base/20240410T165613/data/
2024-04-10 17:20:21,991 [2504447] barman.copy_controller INFO: Copy step 2 of 4: [global] create destination directories and delete unknown files for PGDATA directory: /path/to/database-backup/localhost/base/20240410T165613/data/
2024-04-10 17:20:22,108 [2504462] barman.copy_controller INFO: Copy step 3 of 4: [bucket 0] starting copy safe files from PGDATA directory: /path/to/database-backup/localhost/base/20240410T165613/data/
2024-04-10 17:20:22,207 [2504462] barman.copy_controller INFO: Copy step 3 of 4: [bucket 0] finished (duration: less than one second) copy safe files from PGDATA directory: /path/to/database-backup/localhost/base/20240410T165613/data/
2024-04-10 17:20:22,213 [2504462] barman.copy_controller INFO: Copy step 4 of 4: [bucket 0] starting copy files with checksum from PGDATA directory: /path/to/database-backup/localhost/base/20240410T165613/data/
2024-04-10 17:20:22,260 [2504462] barman.copy_controller INFO: Copy step 4 of 4: [bucket 0] finished (duration: less than one second) copy files with checksum from PGDATA directory: /path/to/database-backup/localhost/base/20240410T165613/data/
2024-04-10 17:20:22,261 [2504447] barman.copy_controller INFO: Copy finished (safe before 2024-04-10 14:56:13.756299+00:00)
2024-04-10 17:20:22,273 [2504447] barman.recovery_executor INFO: Copying required WAL segments.
2024-04-10 17:20:22,274 [2504447] barman.recovery_executor INFO: Starting copy of 2 WAL files 2/2 from WalFileInfo(compression='gzip', name='00000001000000000000000D', size=16465, time=1712760976.317646) to WalFileInfo(compression='gzip', name='00000001000000000000000E', size=16396, time=1712760977.4366577)
2024-04-10 17:20:22,481 [2504447] barman.recovery_executor INFO: Finished copying 2 WAL files.
2024-04-10 17:20:22,482 [2504447] barman.recovery_executor INFO: Generating recovery configuration
2024-04-10 17:20:22,492 [2504447] barman.recovery_executor INFO: Identify dangerous settings in destination directory.

Log file of PostgreSQL:

2024-04-10 15:20:44.521 UTC [2504481] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), 64-bit
2024-04-10 15:20:44.521 UTC [2504481] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-04-10 15:20:44.521 UTC [2504481] LOG:  listening on IPv6 address "::", port 5432
2024-04-10 15:20:44.542 UTC [2504481] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-04-10 15:20:44.548 UTC [2504485] LOG:  database system was interrupted; last known up at 2024-04-10 14:56:13 UTC
2024-04-10 15:20:44.548 UTC [2504485] LOG:  creating missing WAL directory "pg_wal/archive_status"
2024-04-10 15:20:44.699 UTC [2504486] FATAL:  the database system is starting up
2024-04-10 15:20:44.703 UTC [2504487] FATAL:  the database system is starting up
2024-04-10 15:20:44.706 UTC [2504488] FATAL:  the database system is starting up
cp: der Aufruf von stat für 'barman_wal/00000002.history' ist nicht möglich: Datei oder Verzeichnis nicht gefunden
2024-04-10 15:20:44.817 UTC [2504485] LOG:  starting point-in-time recovery to 2024-04-10 14:56:15.915033+00
2024-04-10 15:20:44.834 UTC [2504485] LOG:  restored log file "00000001000000000000000D" from archive
2024-04-10 15:20:44.965 UTC [2504485] LOG:  redo starts at 0/D000028
2024-04-10 15:20:44.983 UTC [2504485] LOG:  restored log file "00000001000000000000000E" from archive
cp: der Aufruf von stat für 'barman_wal/00000001000000000000000F' ist nicht möglich: Datei oder Verzeichnis nicht gefunden
2024-04-10 15:20:45.089 UTC [2504485] LOG:  consistent recovery state reached at 0/D000100
2024-04-10 15:20:45.089 UTC [2504481] LOG:  database system is ready to accept read-only connections
cp: der Aufruf von stat für 'barman_wal/00000001000000000000000F' ist nicht möglich: Datei oder Verzeichnis nicht gefunden
2024-04-10 15:20:45.095 UTC [2504485] LOG:  redo done at 0/E000060 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.13 s
2024-04-10 15:20:45.095 UTC [2504485] FATAL:  recovery ended before configured recovery target was reached
2024-04-10 15:20:45.096 UTC [2504481] LOG:  startup process (PID 2504485) exited with exit code 1
2024-04-10 15:20:45.097 UTC [2504481] LOG:  terminating any other active server processes
2024-04-10 15:20:45.097 UTC [2504481] LOG:  shutting down due to startup process failure
2024-04-10 15:20:45.099 UTC [2504481] LOG:  database system is shut down

@gcalacoci
Copy link
Contributor

Hi, Could you please provide the output of the barman diagnose immediately before executing a P.I.T.R. command that fails like the above?

Also, the logs like you did before.

Thanks a lot!

@JP95Git
Copy link

JP95Git commented Apr 18, 2024

Here are the requested information. I don't have a full test environment available, so I just created a blank database, backup it several times and tried to restore it using "--target-time", just to show the error.

[username@xusername-PG1 database-backup]$ /opt/username/programs/barman/bin/barman --config /opt/username/programs/barman/etc/barman.conf list-backups localhost
localhost 20240418T185600 - Thu Apr 18 16:56:02 2024 - Size: 29.4 MiB - WAL Size: 16.2 KiB
localhost 20240418T185551 - Thu Apr 18 16:55:55 2024 - Size: 29.4 MiB - WAL Size: 32.1 KiB
localhost 20240418T000005 - Wed Apr 17 22:00:26 2024 - Size: 29.4 MiB - WAL Size: 32.3 KiB
[username@xusername-PG1 database-backup]$ barman --config /opt/username/programs/barman/etc/barman.conf show-backup localhost 20240418T185600
Backup 20240418T185600:
  Server Name            : localhost
  System Id              : 7356240957958843836
  Status                 : DONE
  PostgreSQL Version     : 160001
  PGDATA directory       : /opt/username/database

  Base backup information:
    Disk usage           : 29.4 MiB (29.4 MiB with WALs)
    Incremental size     : 8.9 KiB (-99.97%)
    Timeline             : 1
    Begin WAL            : 000000010000000000000010
    End WAL              : 000000010000000000000010
    WAL number           : 1
    WAL compression ratio: 99.90%
    Begin time           : 2024-04-18 16:56:00.224345+00:00
    End time             : 2024-04-18 16:56:02.161839+00:00
    Copy time            : less than one second
    Estimated throughput : 9.2 KiB/s
    Begin Offset         : 40
    End Offset           : 256
    Begin LSN            : 0/10000028
    End LSN              : 0/10000100

  WAL information:
    No of files          : 1
    Disk usage           : 16.2 KiB
    WAL rate             : 22.00/hour
    Compression ratio    : 99.90%
    Last available       : 000000010000000000000011

  Catalog information:
    Retention Policy     : VALID
    Previous Backup      : 20240418T185551
    Next Backup          : - (this is the latest base backup)

https://medium.com/@kiwiv/recover-your-database-with-barman-be5ee4e11939

If we want to use this backup, note the begin and end time and you can recover from the any time in between. For this example will pick up the end time.

I picked the end date of the backup.

[username@xusername-PG1 database-backup]$ barman --config /opt/username/programs/barman/etc/barman.conf recover localhost 20240418T185600 /opt/username/database --target-time "2024-04-18 16:56:02.161839+00:00"
Starting local restore for server localhost using backup 20240418T185600
Destination directory: /opt/username/database
Doing PITR. Recovery target time: '2024-04-18 16:56:02.161839+00:00'
Copying the base backup.
Copying required WAL segments.
Generating recovery configuration
Identify dangerous settings in destination directory.

IMPORTANT
These settings have been modified to prevent data losses

postgresql.conf line 262: archive_command = false

Recovery completed (start time: 2024-04-18 19:12:55.447251+02:00, elapsed time: 2 seconds)
Your PostgreSQL server has been successfully prepared for recovery!

Log of PostgreSQL:

2024-04-18 17:13:59.443 UTC [796275] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), 64-bit
2024-04-18 17:13:59.444 UTC [796275] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-04-18 17:13:59.444 UTC [796275] LOG:  listening on IPv6 address "::", port 5432
2024-04-18 17:13:59.449 UTC [796275] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-04-18 17:13:59.454 UTC [796279] LOG:  database system was interrupted; last known up at 2024-04-18 16:56:00 UTC
2024-04-18 17:13:59.454 UTC [796279] LOG:  creating missing WAL directory "pg_wal/archive_status"
2024-04-18 17:13:59.897 UTC [796281] FATAL:  the database system is starting up
2024-04-18 17:13:59.899 UTC [796282] FATAL:  the database system is starting up
2024-04-18 17:13:59.901 UTC [796283] FATAL:  the database system is starting up
cp: der Aufruf von stat für 'barman_wal/00000002.history' ist nicht möglich: Datei oder Verzeichnis nicht gefunden
2024-04-18 17:14:01.363 UTC [796279] LOG:  starting point-in-time recovery to 2024-04-18 16:56:02.161839+00
2024-04-18 17:14:01.381 UTC [796279] LOG:  restored log file "000000010000000000000010" from archive
2024-04-18 17:14:01.904 UTC [796300] FATAL:  the database system is not yet accepting connections
2024-04-18 17:14:01.904 UTC [796300] DETAIL:  Consistent recovery state has not been yet reached.
2024-04-18 17:14:01.907 UTC [796301] FATAL:  the database system is not yet accepting connections
2024-04-18 17:14:01.907 UTC [796301] DETAIL:  Consistent recovery state has not been yet reached.
2024-04-18 17:14:01.910 UTC [796302] FATAL:  the database system is not yet accepting connections
2024-04-18 17:14:01.910 UTC [796302] DETAIL:  Consistent recovery state has not been yet reached.
2024-04-18 17:14:02.233 UTC [796305] FATAL:  the database system is not yet accepting connections
2024-04-18 17:14:02.233 UTC [796305] DETAIL:  Consistent recovery state has not been yet reached.
2024-04-18 17:14:02.235 UTC [796306] FATAL:  the database system is not yet accepting connections
2024-04-18 17:14:02.235 UTC [796306] DETAIL:  Consistent recovery state has not been yet reached.
2024-04-18 17:14:02.241 UTC [796308] FATAL:  the database system is not yet accepting connections
2024-04-18 17:14:02.241 UTC [796308] DETAIL:  Consistent recovery state has not been yet reached.
2024-04-18 17:14:02.243 UTC [796309] FATAL:  the database system is not yet accepting connections
2024-04-18 17:14:02.243 UTC [796309] DETAIL:  Consistent recovery state has not been yet reached.
2024-04-18 17:14:02.661 UTC [796279] LOG:  redo starts at 0/10000028
2024-04-18 17:14:02.680 UTC [796279] LOG:  restored log file "000000010000000000000011" from archive
cp: der Aufruf von stat für 'barman_wal/000000010000000000000012' ist nicht möglich: Datei oder Verzeichnis nicht gefunden
2024-04-18 17:14:02.810 UTC [796279] LOG:  consistent recovery state reached at 0/10000100
2024-04-18 17:14:02.810 UTC [796275] LOG:  database system is ready to accept read-only connections
cp: der Aufruf von stat für 'barman_wal/000000010000000000000012' ist nicht möglich: Datei oder Verzeichnis nicht gefunden
2024-04-18 17:14:02.814 UTC [796279] LOG:  redo done at 0/110001B0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 1.30 s
2024-04-18 17:14:02.814 UTC [796279] FATAL:  recovery ended before configured recovery target was reached
2024-04-18 17:14:02.815 UTC [796275] LOG:  startup process (PID 796279) exited with exit code 1
2024-04-18 17:14:02.815 UTC [796275] LOG:  terminating any other active server processes
2024-04-18 17:14:02.816 UTC [796275] LOG:  shutting down due to startup process failure
2024-04-18 17:14:02.818 UTC [796275] LOG:  database system is shut down

barman diagnose

{
    "global": {
        "config": {
            "backup_method": "rsync",
            "backup_options": "concurrent_backup",
            "barman_home": "/opt/username/database-backup",
            "barman_user": "username",
            "compression": "gzip",
            "configuration_files_directory": "/opt/username/programs/barman/etc",
            "errors_list": [],
            "log_file": "/opt/username/programs/barman/logs/barman.log",
            "path_prefix": "/opt/username/programs/postgres/bin",
            "reuse_backup": "link",
            "slot_name": "barman",
            "streaming_archiver": "on"
        },
        "system_info": {
            "barman_ver": "3.10.0",
            "kernel_ver": "Linux xusername-PG1 5.14.0-362.18.1.el9_3.0.1.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Feb 11 13:49:23 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux",
            "python_ver": "Python 3.9.18",
            "release": "RedHat Linux Rocky Linux release 9.3 (Blue Onyx)",
            "rsync_ver": "rsync  version 3.2.3  protocol version 31",
            "ssh_ver": "",
            "timestamp": "2024-04-18T18:59:10.292506+02:00"
        }
    },
    "models": {},
    "servers": {
        "localhost": {
            "active_model": null,
            "backups": {
                "20240418T000005": {
                    "backup_id": "20240418T000005",
                    "backup_label": "'START WAL LOCATION: 0/C000060 (file 00000001000000000000000C)\\nCHECKPOINT LOCATION: 0/C000098\\nBACKUP METHOD: streamed\\nBACKUP FROM: primary\\nSTART TIME: 2024-04-17 22:00:11 UTC\\nLABEL: Barman backup localhost 20240418T000005\\nSTART TIMELINE: 1\\n'",
                    "begin_offset": 96,
                    "begin_time": "2024-04-18T00:00:05.707294+02:00",
                    "begin_wal": "00000001000000000000000C",
                    "begin_xlog": "0/C000060",
                    "compression": null,
                    "config_file": "/opt/username/database/postgresql.conf",
                    "copy_stats": {
                        "analysis_time": 0.965946,
                        "analysis_time_per_item": {
                            "pgdata": 0.965946
                        },
                        "copy_time": 12.807903,
                        "copy_time_per_item": {
                            "pg_control": 0.295303,
                            "pgdata": 12.509192
                        },
                        "number_of_workers": 1,
                        "serialized_copy_time": 12.799614,
                        "serialized_copy_time_per_item": {
                            "pg_control": 0.295303,
                            "pgdata": 12.504311
                        },
                        "total_time": 14.197323
                    },
                    "deduplicated_size": 30794643,
                    "end_offset": 368,
                    "end_time": "2024-04-18T00:00:26.479322+02:00",
                    "end_wal": "00000001000000000000000C",
                    "end_xlog": "0/C000170",
                    "error": null,
                    "hba_file": "/opt/username/database/pg_hba.conf",
                    "ident_file": "/opt/username/database/pg_ident.conf",
                    "included_files": [
                        "/opt/username/database/postgresql.auto.conf"
                    ],
                    "mode": "rsync-concurrent",
                    "pgdata": "/opt/username/database",
                    "server_name": "localhost",
                    "size": 30794643,
                    "status": "DONE",
                    "systemid": "7356240957958843836",
                    "tablespaces": null,
                    "timeline": 1,
                    "version": 160001,
                    "xlog_segment_size": 16777216
                },
                "20240418T185551": {
                    "backup_id": "20240418T185551",
                    "backup_label": "'START WAL LOCATION: 0/E000028 (file 00000001000000000000000E)\\nCHECKPOINT LOCATION: 0/E000098\\nBACKUP METHOD: streamed\\nBACKUP FROM: primary\\nSTART TIME: 2024-04-18 16:55:53 UTC\\nLABEL: Barman backup localhost 20240418T185551\\nSTART TIMELINE: 1\\n'",
                    "begin_offset": 40,
                    "begin_time": "2024-04-18T18:55:51.775903+02:00",
                    "begin_wal": "00000001000000000000000E",
                    "begin_xlog": "0/E000028",
                    "compression": null,
                    "config_file": "/opt/username/database/postgresql.conf",
                    "copy_stats": {
                        "analysis_time": 0.792977,
                        "analysis_time_per_item": {
                            "pgdata": 0.792977
                        },
                        "copy_time": 0.973467,
                        "copy_time_per_item": {
                            "pg_control": 0.295436,
                            "pgdata": 0.674826
                        },
                        "number_of_workers": 1,
                        "serialized_copy_time": 0.963715,
                        "serialized_copy_time_per_item": {
                            "pg_control": 0.295436,
                            "pgdata": 0.668279
                        },
                        "total_time": 1.785295
                    },
                    "deduplicated_size": 9140,
                    "end_offset": 312,
                    "end_time": "2024-04-18T18:55:55.285195+02:00",
                    "end_wal": "00000001000000000000000E",
                    "end_xlog": "0/E000138",
                    "error": null,
                    "hba_file": "/opt/username/database/pg_hba.conf",
                    "ident_file": "/opt/username/database/pg_ident.conf",
                    "included_files": [
                        "/opt/username/database/postgresql.auto.conf"
                    ],
                    "mode": "rsync-concurrent",
                    "pgdata": "/opt/username/database",
                    "server_name": "localhost",
                    "size": 30794643,
                    "status": "DONE",
                    "systemid": "7356240957958843836",
                    "tablespaces": null,
                    "timeline": 1,
                    "version": 160001,
                    "xlog_segment_size": 16777216
                },
                "20240418T185600": {
                    "backup_id": "20240418T185600",
                    "backup_label": "'START WAL LOCATION: 0/10000028 (file 000000010000000000000010)\\nCHECKPOINT LOCATION: 0/10000060\\nBACKUP METHOD: streamed\\nBACKUP FROM: primary\\nSTART TIME: 2024-04-18 16:56:00 UTC\\nLABEL: Barman backup localhost 20240418T185600\\nSTART TIMELINE: 1\\n'",
                    "begin_offset": 40,
                    "begin_time": "2024-04-18T18:56:00.224345+02:00",
                    "begin_wal": "000000010000000000000010",
                    "begin_xlog": "0/10000028",
                    "compression": null,
                    "config_file": "/opt/username/database/postgresql.conf",
                    "copy_stats": {
                        "analysis_time": 0.764417,
                        "analysis_time_per_item": {
                            "pgdata": 0.764417
                        },
                        "copy_time": 0.970696,
                        "copy_time_per_item": {
                            "pg_control": 0.292661,
                            "pgdata": 0.676094
                        },
                        "number_of_workers": 1,
                        "serialized_copy_time": 0.961986,
                        "serialized_copy_time_per_item": {
                            "pg_control": 0.292661,
                            "pgdata": 0.669325
                        },
                        "total_time": 1.756391
                    },
                    "deduplicated_size": 9099,
                    "end_offset": 256,
                    "end_time": "2024-04-18T18:56:02.161839+02:00",
                    "end_wal": "000000010000000000000010",
                    "end_xlog": "0/10000100",
                    "error": null,
                    "hba_file": "/opt/username/database/pg_hba.conf",
                    "ident_file": "/opt/username/database/pg_ident.conf",
                    "included_files": [
                        "/opt/username/database/postgresql.auto.conf"
                    ],
                    "mode": "rsync-concurrent",
                    "pgdata": "/opt/username/database",
                    "server_name": "localhost",
                    "size": 30794646,
                    "status": "DONE",
                    "systemid": "7356240957958843836",
                    "tablespaces": null,
                    "timeline": 1,
                    "version": 160001,
                    "xlog_segment_size": 16777216
                }
            },
            "config": {
                "active": true,
                "archiver": false,
                "archiver_batch_size": 0,
                "autogenerate_manifest": false,
                "aws_profile": null,
                "aws_region": null,
                "azure_credential": null,
                "azure_resource_group": null,
                "azure_subscription_id": null,
                "backup_compression": null,
                "backup_compression_format": null,
                "backup_compression_level": null,
                "backup_compression_location": null,
                "backup_compression_workers": null,
                "backup_directory": "/opt/username/database-backup/localhost",
                "backup_method": "rsync",
                "backup_options": "concurrent_backup",
                "bandwidth_limit": null,
                "barman_home": "/opt/username/database-backup",
                "barman_lock_directory": "/opt/username/database-backup",
                "basebackup_retry_sleep": 30,
                "basebackup_retry_times": 0,
                "basebackups_directory": "/opt/username/database-backup/localhost/base",
                "check_timeout": 30,
                "cluster": "localhost",
                "compression": "gzip",
                "config_changes_queue": "/opt/username/database-backup/cfg_changes.queue",
                "conninfo": "host=localhost port=5432 dbname=postgres user=barman",
                "create_slot": "manual",
                "custom_compression_filter": null,
                "custom_compression_magic": null,
                "custom_decompression_filter": null,
                "description": null,
                "disabled": false,
                "errors_directory": "/opt/username/database-backup/localhost/errors",
                "forward_config_path": false,
                "gcp_project": null,
                "gcp_zone": null,
                "immediate_checkpoint": false,
                "incoming_wals_directory": "/opt/username/database-backup/localhost/incoming",
                "last_backup_maximum_age": null,
                "last_backup_minimum_size": null,
                "last_wal_maximum_age": null,
                "lock_directory_cleanup": true,
                "max_incoming_wals_queue": null,
                "minimum_redundancy": 0,
                "msg_list": [],
                "name": "localhost",
                "network_compression": false,
                "parallel_jobs": 1,
                "parallel_jobs_start_batch_period": 1,
                "parallel_jobs_start_batch_size": 10,
                "path_prefix": "/opt/username/programs/postgres/bin",
                "post_archive_retry_script": null,
                "post_archive_script": null,
                "post_backup_retry_script": null,
                "post_backup_script": null,
                "post_delete_retry_script": null,
                "post_delete_script": null,
                "post_recovery_retry_script": null,
                "post_recovery_script": null,
                "post_wal_delete_retry_script": null,
                "post_wal_delete_script": null,
                "pre_archive_retry_script": null,
                "pre_archive_script": null,
                "pre_backup_retry_script": null,
                "pre_backup_script": null,
                "pre_delete_retry_script": null,
                "pre_delete_script": null,
                "pre_recovery_retry_script": null,
                "pre_recovery_script": null,
                "pre_wal_delete_retry_script": null,
                "pre_wal_delete_script": null,
                "primary_checkpoint_timeout": 0,
                "primary_conninfo": null,
                "primary_ssh_command": null,
                "recovery_options": "",
                "recovery_staging_path": null,
                "retention_policy": "window 7 d",
                "retention_policy_mode": "auto",
                "reuse_backup": "link",
                "slot_name": "barman",
                "snapshot_disks": null,
                "snapshot_gcp_project": null,
                "snapshot_instance": null,
                "snapshot_provider": null,
                "snapshot_zone": null,
                "ssh_command": "ssh username@localhost",
                "streaming_archiver": true,
                "streaming_archiver_batch_size": 0,
                "streaming_archiver_name": "barman_receive_wal",
                "streaming_backup_name": "barman_streaming_backup",
                "streaming_conninfo": "host=localhost port=5432 dbname=postgres user=barman",
                "streaming_wals_directory": "/opt/username/database-backup/localhost/streaming",
                "tablespace_bandwidth_limit": null,
                "wal_conninfo": null,
                "wal_retention_policy": "simple-wal 7 d",
                "wal_streaming_conninfo": null,
                "wals_directory": "/opt/username/database-backup/localhost/wals"
            },
            "status": {
                "archive_timeout": 0,
                "checkpoint_timeout": 300,
                "config_file": "/opt/username/database/postgresql.conf",
                "connection_error": null,
                "current_lsn": "0/110000C8",
                "current_size": 31254984.0,
                "current_xlog": "000000010000000000000011",
                "data_checksums": "off",
                "data_directory": "/opt/username/database",
                "has_backup_privileges": true,
                "has_monitoring_privileges": true,
                "hba_file": "/opt/username/database/pg_hba.conf",
                "hot_standby": "on",
                "ident_file": "/opt/username/database/pg_ident.conf",
                "included_files": [
                    "/opt/username/database/postgresql.auto.conf"
                ],
                "is_in_recovery": false,
                "is_superuser": true,
                "max_replication_slots": "10",
                "max_wal_senders": "10",
                "pg_receivexlog_compatible": true,
                "pg_receivexlog_installed": true,
                "pg_receivexlog_path": "/opt/username/programs/postgres/bin/pg_receivewal",
                "pg_receivexlog_supports_slots": true,
                "pg_receivexlog_synchronous": false,
                "pg_receivexlog_version": "16.1",
                "postgres_systemid": "7356240957958843836",
                "replication_slot": [
                    "barman",
                    true,
                    "0/11000000"
                ],
                "replication_slot_support": true,
                "server_txt_version": "16.1",
                "streaming": true,
                "streaming_supported": true,
                "streaming_systemid": "7356240957958843836",
                "synchronous_standby_names": [
                    ""
                ],
                "timeline": 1,
                "version_supported": true,
                "wal_compression": "off",
                "wal_keep_size": "0",
                "wal_level": "logical",
                "xlog_segment_size": 16777216,
                "xlogpos": "0/110000C8"
            },
            "system_info": {
                "kernel_ver": "Linux xusername-PG1 5.14.0-362.18.1.el9_3.0.1.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Feb 11 13:49:23 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux",
                "python_ver": "Python 3.9.18",
                "release": "RedHat Linux Rocky Linux release 9.3 (Blue Onyx)",
                "rsync_ver": "rsync  version 3.2.3  protocol version 31",
                "ssh_ver": ""
            },
            "wals": {
                "last_archived_wal_per_timeline": {
                    "00000001": {
                        "compression": "gzip",
                        "name": "000000010000000000000010",
                        "size": 16460,
                        "time": 1713459362.313908
                    }
                }
            }
        }
    }
}

Logfile of barman:

2024-04-18 19:12:02,287 [796162] barman.postgres WARNING: Error retrieving PostgreSQL status: connection to server at "localhost" (::1), port 5432 failed: Connection refused
2024-04-18 19:12:02,287 [796162] barman.postgres WARNING: Error retrieving PostgreSQL status: connection to server at "localhost" (::1), port 5432 failed: Connection refused
2024-04-18 19:12:02,293 [796162] barman.backup_executor WARNING: Error retrieving PostgreSQL status: connection to server at "localhost" (::1), port 5432 failed: Connection refused
2024-04-18 19:12:02,293 [796162] barman.server ERROR: Check 'no access to monitoring functions' failed for server 'localhost'
2024-04-18 19:12:02,293 [796162] barman.server ERROR: Check 'PostgreSQL streaming (WAL streaming)' failed for server 'localhost'
2024-04-18 19:12:02,307 [796162] barman.server ERROR: Impossible to start WAL streaming. Check the log for more details, or run 'barman check localhost'
2024-04-18 19:12:55,423 [796166] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
2024-04-18 19:12:55,444 [796166] barman.wal_archiver INFO: No xlog segments found from streaming for localhost.
2024-04-18 19:12:55,469 [796166] barman.recovery_executor INFO: Starting local restore for server localhost using backup 20240418T185600
2024-04-18 19:12:55,470 [796166] barman.recovery_executor INFO: Destination directory: /opt/username/database
2024-04-18 19:12:55,471 [796166] barman.recovery_executor INFO: Doing PITR. Recovery target time: '2024-04-18 16:56:02.161839+00:00'
2024-04-18 19:12:55,476 [796166] barman.recovery_executor WARNING: Unable to retrieve safe horizon time for smart rsync copy: The /opt/username/database/.barman-recover.info file does not exist
2024-04-18 19:12:55,483 [796166] barman.recovery_executor INFO: Copying the base backup.
2024-04-18 19:12:55,485 [796166] barman.copy_controller INFO: Copy started (safe before None)
2024-04-18 19:12:55,485 [796166] barman.copy_controller INFO: Copy step 1 of 4: [global] analyze PGDATA directory: /opt/username/database-backup/localhost/base/20240418T185600/data/
2024-04-18 19:12:55,650 [796166] barman.copy_controller INFO: Copy step 2 of 4: [global] create destination directories and delete unknown files for PGDATA directory: /opt/username/database-backup/localhost/base/20240418T185600/data/
2024-04-18 19:12:57,882 [796194] barman.copy_controller INFO: Copy step 3 of 4: [bucket 0] starting copy safe files from PGDATA directory: /opt/username/database-backup/localhost/base/20240418T185600/data/
2024-04-18 19:12:57,931 [796194] barman.copy_controller INFO: Copy step 3 of 4: [bucket 0] finished (duration: less than one second) copy safe files from PGDATA directory: /opt/username/database-backup/localhost/base/20240418T185600/data/
2024-04-18 19:12:57,937 [796194] barman.copy_controller INFO: Copy step 4 of 4: [bucket 0] starting copy files with checksum from PGDATA directory: /opt/username/database-backup/localhost/base/20240418T185600/data/
2024-04-18 19:12:58,191 [796194] barman.copy_controller INFO: Copy step 4 of 4: [bucket 0] finished (duration: less than one second) copy files with checksum from PGDATA directory: /opt/username/database-backup/localhost/base/20240418T185600/data/
2024-04-18 19:12:58,197 [796166] barman.copy_controller INFO: Copy finished (safe before None)
2024-04-18 19:12:58,211 [796166] barman.recovery_executor INFO: Copying required WAL segments.
2024-04-18 19:12:58,212 [796166] barman.recovery_executor INFO: Starting copy of 2 WAL files 2/2 from WalFileInfo(compression='gzip', name='000000010000000000000010', size=16460, time=1713459362.313908) to WalFileInfo(compression='gzip', name='000000010000000000000011', size=16557, time=1713459689.5802295)
2024-04-18 19:12:58,413 [796166] barman.recovery_executor INFO: Finished copying 2 WAL files.
2024-04-18 19:12:58,414 [796166] barman.recovery_executor INFO: Generating recovery configuration
2024-04-18 19:12:58,426 [796166] barman.recovery_executor INFO: Identify dangerous settings in destination directory.
2024-04-18 19:13:01,549 [796225] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
2024-04-18 19:13:01,571 [796225] barman.utils INFO: Cleaning up lockfiles directory.
2024-04-18 19:13:01,700 [796226] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
2024-04-18 19:13:01,719 [796226] barman.wal_archiver INFO: No xlog segments found from streaming for localhost.
2024-04-18 19:13:01,729 [796227] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
2024-04-18 19:13:01,737 [796227] barman.postgres WARNING: Error retrieving PostgreSQL status: connection to server at "localhost" (::1), port 5432 failed: Connection refused

@barthisrael
Copy link
Contributor

barthisrael commented Jun 20, 2024

Hello @PaganMuffin @JP95Git ,

While taking a look at the issues reported by both of you, we guess we have 2 different problems here, one caused by an operational issue, and another by a couple of apparent bugs in Barman.

We will follow-up with a couple of messages sharing our thoughts.

Expect an update by tomorrow.

@barthisrael barthisrael self-assigned this Jun 20, 2024
@martinmarques martinmarques self-assigned this Jun 21, 2024
@martinmarques
Copy link
Contributor

From reading the logs, and I guess @barthisrael 's "operational issue" mentioned above is this that I will explain, is that you are picking a point in time for recovery that is outside the window of the backup and WALs that have been archived, so Postgres can never recover to that point in time as it doesn't exist in the timeline from the end of the backup until the last record in the 2 WALs available.

We'll follow-up with more later

@barthisrael
Copy link
Contributor

In the case reported by @JP95Git, this initially seemed an operational issue, but this actually refers to a limitation in Barman.

You seem to be requesting a --target-time which Postgres is not able to honor because of a likely lack of COMMIT or ABORT records in the WAL files.

Just for context: when you pass --target-time to Barman, what it does is prepare Postgres for recovery with recovery_target_time set with the value of that Barman command-line option.

What Postgres does when in recovery with recovery_target_time set, is to keep replaying WALs until it finds a COMMIT or ABORT record which does not honor the configured target time. It replays any WALs that are contained within the target time, and as soon as the COMMIT or ABORT record contains a timestamp which is after the configured target time, it stops recovery. You can check that logic in the Postgres source code: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogrecovery.c#L1646-L1902.

Now, with the proper context given, we will demonstrate that issue with the following example.

We have a Barman backup here with this information:

$ barman show-backup pg17 20240620T165110
Backup 20240620T165110:
  Server Name            : pg17
  System Id              : 7382227563826631291
  Status                 : DONE
  PostgreSQL Version     : 170000
  PGDATA directory       : /var/lib/pgsql/17/data

  Base backup information:
    Disk usage           : 22.3 MiB (54.3 MiB with WALs)
    Incremental size     : 22.3 MiB (-0.00%)
    Timeline             : 1
    Begin WAL            : 00000001000000000000003F
    End WAL              : 000000010000000000000040
    WAL number           : 2
    Begin time           : 2024-06-20 16:51:10.088025+00:00
    End time             : 2024-06-20 16:51:11.498244+00:00
    Copy time            : 1 second
    Estimated throughput : 16.1 MiB/s
    Begin Offset         : 216
    End Offset           : 0
    Begin LSN            : 0/3F0000D8
    End LSN              : 0/40000000

  WAL information:
    No of files          : 5
    Disk usage           : 80.0 MiB
    WAL rate             : 163.55/hour
    Last available       : 000000010000000000000045

  Catalog information:
    Retention Policy     : not enforced
    Previous Backup      : - (this is the oldest base backup)
    Next Backup          : 20240620T165335

The important details here are:

  • Backup started at WAL 00000001000000000000003F, at timestamp 2024-06-20 16:51:10.088025+00:00;
  • Backup finished at WAL 000000010000000000000040, at timestamp 2024-06-20 16:51:11.498244+00:00.

We tried doing something similar to what you did, i.e. set --target-time to the end of the backup with a command like this:

$ barman recover pg17 20240620T165110 --target-time "2024-06-20 16:51:11.498244+00:00" --remote-ssh-command "ssh postgres@localhost" /var/lib/pgsql/17/restore
Starting remote restore for server pg17 using backup 20240620T165110
Destination directory: /var/lib/pgsql/17/restore
Remote command: ssh postgres@localhost
Doing PITR. Recovery target time: '2024-06-20 16:51:11.498244+00:00'
Copying the base backup.
Copying required WAL segments.
Generating recovery configuration
Identify dangerous settings in destination directory.

Recovery completed (start time: 2024-06-20 16:56:42.835003+00:00, elapsed time: 5 seconds)
Your PostgreSQL server has been successfully prepared for recovery!

But then Postgres fails to start up:

$ tail -10 restore/log/postgresql-Thu.log
2024-06-20 16:57:15.429 UTC [64304] LOG:  restored log file "000000010000000000000041" from archive
cp: cannot stat 'barman_wal/000000010000000000000042': No such file or directory
2024-06-20 16:57:15.670 UTC [64304] LOG:  restored log file "000000010000000000000041" from archive
cp: cannot stat 'barman_wal/000000010000000000000042': No such file or directory
2024-06-20 16:57:15.679 UTC [64304] LOG:  redo done at 0/41FFFF90 system usage: CPU: user: 0.02 s, system: 0.36 s, elapsed: 0.48 s
2024-06-20 16:57:15.679 UTC [64304] FATAL:  recovery ended before configured recovery target was reached
2024-06-20 16:57:15.682 UTC [64300] LOG:  startup process (PID 64304) exited with exit code 1
2024-06-20 16:57:15.682 UTC [64300] LOG:  terminating any other active server processes
2024-06-20 16:57:15.683 UTC [64300] LOG:  shutting down due to startup process failure
2024-06-20 16:57:15.683 UTC [64300] LOG:  database system is shut down

As we can see, the issue is FATAL: recovery ended before configured recovery target was reached.

If we inspect the WALs which were copied by Barman, we will see there are no COMMIT or ABORT records among them in our case:

$ ls
00000001000000000000003F  000000010000000000000040  000000010000000000000041  archive_status
$ /usr/pgsql-17/bin/pg_waldump 00000001000000000000003F 000000010000000000000041 | egrep "COMMIT|ABORT"
$

That's why the recovery failed, Postgres was not able to reach a point where it was sure the requested recovery target was fully honored.

If we inspect the Barman WAL archive, we find these COMMIT records:

$ ls -lh
total 149M
-rw-------. 1 vagrant vagrant 16M Jun 20 16:51 00000001000000000000003F
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000040
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000041
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000042
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000043
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000044
-rw-------. 1 vagrant vagrant 16M Jun 20 16:53 000000010000000000000045
$ /usr/pgsql-17/bin/pg_waldump 00000001000000000000003F 000000010000000000000045 | egrep "COMMIT|ABORT"
rmgr: Transaction len (rec/tot):    405/   405, tx:        789, lsn: 0/43D4DB58, prev 0/43D4DB18, desc: COMMIT 2024-06-20 16:53:25.807910 UTC; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16423
rmgr: Transaction len (rec/tot):     66/    66, tx:        790, lsn: 0/43D96F08, prev 0/43D96A70, desc: COMMIT 2024-06-20 16:53:31.856379 UTC; inval msgs: catcache 63

These COMMIT records are contained within the range between 000000010000000000000042 and 000000010000000000000045.

Now, giving context on what Barman does when you run barman recover with --no-get-wal (or with get-wal absent in your barman.conf):

  1. It copies the base backup
  2. It copies the WALs it thinks are required for the configured target. In the case of --target-time, the logic is here: https://github.com/EnterpriseDB/barman/blob/master/barman/server.py#L1825-L1828. It gets all WALs at least until the backup’s "End WAL" is reached plus all the WAL files that have a timestamp which is older or equal to the requested --target-time.

However, as we can see in the above example, that was not enough. My backup finished at WAL 000000010000000000000040, and the WAL 000000010000000000000041 is from timestamp 16:53, which is after my --target-time (2024-06-20 16:51:11.498244+00:00). That is why Barman didn't copy WALs newer than 000000010000000000000041.

In order to make that process reliable we would need to inspect the WAL dumps and look for COMMIT and ABORT records. That would put a lot of overhead in Barman archiving or Barman recovery, though, making that option not practical/well performant.

We intend to submit changes to Barman in such a way that barman recover with --no-get-wal and --target-time will behave similarly to when using --target-xid or --target-name, i.e. do not try to guess the required WAL files, but rather copy all of the WALs from Barman to Postgres host.

We also intend to add a note to the Barman documentation, suggesting to use --get-wal when performing PITR with either of those options of barman recover.

For now, a workaround would be using --get-wal in barman recover (or configure get-wal in your barman.conf), so instead of copying WALs as part of barman recover, Barman would configure Postgres with barman-wal-restore in the restore_command, allowing Postgres to keep asking for WALs from Barman for how long Postgres thinks it still needs more WALs, and Barman would be able to provide the required WAL files, if available in Barman.

@barthisrael
Copy link
Contributor

The problem initially reported through this GitHub issue seems related with a problem in Barman.

When the user runs barman recover with --target-time, and the value of --target-time contains no explicit time zone set, Barman assumes the time zone of the Barman host.

However, that implicit time zone is only taken into consideration when printing messages in the console, not when configuring Postgres for recovery.

We will illustrate the issue with an example. Assume we have a Barman host whose time zone is UTC+2. When we attempt to run barman recover with --target-time with no time zone, we see an output similar to this:

$ barman recover pg17 20240621T170006 --target-time "2024-06-21 17:00:09.348083" --remote-ssh-command "ssh postgres@localhost" /var/lib/pgsql/17/restore
Starting remote restore for server pg17 using backup 20240621T170006
Destination directory: /var/lib/pgsql/17/restore
Remote command: ssh postgres@localhost
Doing PITR. Recovery target time: '2024-06-21 17:00:09.348083+02:00'
Copying the base backup.
Copying required WAL segments.
Generating recovery configuration
Identify dangerous settings in destination directory.

Recovery completed (start time: 2024-06-21 17:03:49.748856+02:00, elapsed time: 8 seconds)
Your PostgreSQL server has been successfully prepared for recovery!

As we can see, Barman claims to have configured recovery_target_time as 2024-06-21 17:00:09.348083+02:00.

However, in practice that is not true as we can see from postgresql.auto.conf:

$ cat /var/lib/pgsql/17/restore/postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
summarize_wal = 'on'

restore_command = 'cp barman_wal/%f %p'
recovery_end_command = 'rm -fr barman_wal'
recovery_target_time = '2024-06-21 17:00:09.348083'

In my case I’m using the default timezone (UTC) in my Postgres instance, so the recovery_target_time was set to something different from what was informed by Barman messages. As a consequence, my recovery attempt fails:

$ tail -15 /var/lib/pgsql/17/restore/log/postgresql-Fri.log
2024-06-21 15:14:36.673 UTC [32083] LOG:  starting point-in-time recovery to 2024-06-21 17:00:09.348083+00
2024-06-21 15:14:36.678 UTC [32083] LOG:  redo starts at 1/7000028
2024-06-21 15:14:36.681 UTC [32083] LOG:  restored log file "000000010000000100000008" from archive
2024-06-21 15:14:36.689 UTC [32083] LOG:  restored log file "000000010000000100000009" from archive
cp: cannot stat 'barman_wal/00000001000000010000000A': No such file or directory
2024-06-21 15:14:36.698 UTC [32083] LOG:  completed backup recovery with redo LSN 1/7000028 and end LSN 1/7000120
2024-06-21 15:14:36.698 UTC [32083] LOG:  consistent recovery state reached at 1/7000120
2024-06-21 15:14:36.698 UTC [32079] LOG:  database system is ready to accept read-only connections
cp: cannot stat 'barman_wal/00000001000000010000000A': No such file or directory
2024-06-21 15:14:36.704 UTC [32083] LOG:  redo done at 1/9000238 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s
2024-06-21 15:14:36.704 UTC [32083] FATAL:  recovery ended before configured recovery target was reached
2024-06-21 15:14:36.712 UTC [32079] LOG:  startup process (PID 32083) exited with exit code 1
2024-06-21 15:14:36.712 UTC [32079] LOG:  terminating any other active server processes
2024-06-21 15:14:36.713 UTC [32079] LOG:  shutting down due to startup process failure
2024-06-21 15:14:36.713 UTC [32079] LOG:  database system is shut down

It is looking for a recovery_target_time which is in the past, thus will never be reachable.

We are working on a fix for this issue. You can follow this PR for more details.

barthisrael added a commit that referenced this issue Jun 21, 2024
Previous to this commit there could be cases where Postgres would be
misconfigured by `barman recover`.

For example, assume you had an environment like this:

* Barman host configured with time zone UTC+2, e.g. `Europe/Berlin`;`
* Postgres cluster with default value for timezone GUC, i.e. GMT value.

If we attempted a `barman recover` execution with `--target-time` set,
but without an explicit time zone, recovery could fail:

* `barman recover` output would say that recovery target time was set
  with `+02:00`;
* However `postgresql.auto.conf` would not contain `+02:00`.

In practice either the recovery would fail because of an invalid
recovery target, or the cluster would be recovered to a point-in-time
different from the user needs. Neither of these cases are welcome.

This commit fixes that issue by reusing the `target_datetime` key from
the `recovery_info` variable. That entry contains the target time with
a time zone explicitily set, even if the user specified no time zone
through the command-line option.

Note: a warning is issued if the user specified no time zone, and Barman
assumed its host time zone.

References: BAR-185 #881.

Signed-off-by: Israel Barth Rubio <[email protected]>
barthisrael added a commit that referenced this issue Jun 25, 2024
Previous to this commit there could be cases where Postgres would be
misconfigured by `barman recover`.

For example, assume you had an environment like this:

* Barman host configured with time zone UTC+2, e.g. `Europe/Berlin`;`
* Postgres cluster with default value for timezone GUC, i.e. GMT value.

If we attempted a `barman recover` execution with `--target-time` set,
but without an explicit time zone, recovery could fail:

* `barman recover` output would say that recovery target time was set
  with `+02:00`;
* However `postgresql.auto.conf` would not contain `+02:00`.

In practice either the recovery would fail because of an invalid
recovery target, or the cluster would be recovered to a point-in-time
different from the user needs. Neither of these cases are welcome.

This commit fixes that issue by reusing the `target_datetime` key from
the `recovery_info` variable. That entry contains the target time with
a time zone explicitily set, even if the user specified no time zone
through the command-line option.

Note: a warning is issued if the user specified no time zone, and Barman
assumed its host time zone.

References: BAR-185 #881.

Signed-off-by: Israel Barth Rubio <[email protected]>
barthisrael added a commit that referenced this issue Jun 25, 2024
… files

Previous to this commit Barman would attempt to guess the required WAL
files using the filesystem creation timestamp of them.

However, that is not a reliable approach. For example, if there is
streaming replication lag, the WAL files will be created in the Barman
host later when compared to Postgres.

That can be even worse in the case of `archive_command`, because it
waits for the WAL files to be filled up before making them available
for archiving.

In those cases the recovery could end up failing because of missing
`COMMIT` or `ABORT` records in the WAL files that were copied by Barman,
i.e. Postgres would fail to perform recovery because it wouldn't know if
it satistified or not the requested `recovery_target_time`.

From now on, if the user requests a recovery with `--no-get-wal` and
`--target-time`, Barman will simply copy all WAL files up to the
timeline being recovered, guaranteeing that way that Postgres will be
able to find `COMMIT` or `ABORT` records, if they exist in the archived
WAL files, making it possible to complete the recovery.

Note: we evaluated other implementations to avoid possibly copying a lot
of unused WAL files. For example, with `pg_waldump` we would be able to
look up for `COMMIT` and `ABORT` records in a way similar to what
Postgres does. However, that could put a lot of overhead wherever that
would be processed (during WAL archiving or backup recovery), so that
option was discarded.

References: BAR-189 #881.

Signed-off-by: Israel Barth Rubio <[email protected]>
barthisrael added a commit that referenced this issue Jun 26, 2024
… files

Previous to this commit Barman would attempt to guess the required WAL
files using the filesystem creation timestamp of them.

However, that is not a reliable approach. For example, if there is
streaming replication lag, the WAL files will be created in the Barman
host later when compared to Postgres.

That can be even worse in the case of `archive_command`, because it
waits for the WAL files to be filled up before making them available
for archiving.

In those cases the recovery could end up failing because of missing
`COMMIT` or `ABORT` records in the WAL files that were copied by Barman,
i.e. Postgres would fail to perform recovery because it wouldn't know if
it satistified or not the requested `recovery_target_time`.

From now on, if the user requests a recovery with `--no-get-wal` and
`--target-time`, Barman will simply copy all WAL files up to the
timeline being recovered, guaranteeing that way that Postgres will be
able to find `COMMIT` or `ABORT` records, if they exist in the archived
WAL files, making it possible to complete the recovery.

Note: we evaluated other implementations to avoid possibly copying a lot
of unused WAL files. For example, with `pg_waldump` we would be able to
look up for `COMMIT` and `ABORT` records in a way similar to what
Postgres does. However, that could put a lot of overhead wherever that
would be processed (during WAL archiving or backup recovery), so that
option was discarded.

References: BAR-189 #881.

Signed-off-by: Israel Barth Rubio <[email protected]>
@JP95Git
Copy link

JP95Git commented Jun 27, 2024

Good news: The PITR recovery is now working.

  1. Add
    recovery_options = 'get-wal'
    to barman.conf
  2. Stop PostgreSQL
  3. /path/to/barman/bin/barman --config /path/to/config/barman.conf recover localhost 20240627T213726 /path/to/database --target-immediate
  4. Edit /path/to/database/postgresql.auto.conf
    Replace the restore_command with
    restore_command = '/path/to/barman/bin/barman --config /path/to/config/barman.conf get-wal -P localhost %f > %p'
  5. Start PostgreSQL

It works :-)

Step 3 was ugly, but apparently necessary, since I run PostgreSQL and barman under a normal user. I also had to compile both tools myself and install them in /opt. For security reasons I don't have root privileges and I'm only allowed to start/stop selected services via sudo.

Would be great if we could override the generated restore_command from barman using some setting in barman such that step 3 is no longer needed.

@barthisrael
Copy link
Contributor

Hello @JP95Git ,

Good news: The PITR recovery is now working.

  1. Add
    recovery_options = 'get-wal'
    to barman.conf
  2. Stop PostgreSQL
  3. /path/to/barman/bin/barman --config /path/to/config/barman.conf recover localhost 20240627T213726 /path/to/database --target-immediate
  4. Edit /path/to/database/postgresql.auto.conf
    Replace the restore_command with
    restore_command = '/path/to/barman/bin/barman --config /path/to/config/barman.conf get-wal -P localhost %f > %p'
  5. Start PostgreSQL

It works :-)

Thanks for letting us now!

As both PRs related with this GitHub issue were merged into master and are only pending a release, we are now closing this issue.

Once a new Barman version is released, which is expected to occur somewhen between August and September, those fixes/enhancements will become available.

Step 3 was ugly, but apparently necessary, since I run PostgreSQL and barman under a normal user. I also had to compile both tools myself and install them in /opt. For security reasons I don't have root privileges and I'm only allowed to start/stop selected services via sudo.

Would be great if we could override the generated restore_command from barman using some setting in barman such that step 3 is no longer needed.

As we are closing this issue, we opened a separate GitHub issue #956 to track this enhancement request.

Although this sounds unusual, and in most cases what is automatically inferred by Barman should be enough, we understand that there might exist some cases where overrides are necessary.

@martinmarques
Copy link
Contributor

3. Edit /path/to/database/postgresql.auto.conf
   Replace the restore_command with
   `restore_command = '/path/to/barman/bin/barman --config /path/to/config/barman.conf get-wal -P localhost %f > %p'`

This restore command could be replaced with barman-wal-restore if you also compiled it. It would make the command slimmer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants