Cannot migrate from etcd HA to raft HA when using external S3 storage #29259

Open
licenseplated opened this issue Dec 24, 2024 · 0 comments

Nodes migrated from etcd to raft for HA all remain in standby

I have several clusters that use S3 for storage, KMS for auto-unseal, and etcd for HA coordination. I've been trying to work out a process for migrating from etcd to raft for HA while retaining S3 and KMS for storage and unsealing. But regardless of what I try, nodes configured to use raft for HA always unseal and then enter standby mode, leaving me without an active node.

To Reproduce

  1. I start with a cluster with this config:
    listener "tcp" {
        address = "[::]:8200"
        cluster_address = "[::]:8201"
        tls_disable = true
    }
    service_registration "kubernetes" {}
    storage "s3" {
        bucket = "my-unique-aws-id-vault"
        region = "us-west-2"
    }
    seal "awskms" {
        region = "us-west-2"
        kms_key_id = "alias/my-unique-aws-id-vault"
    }
    ha_storage "etcd" {
        address = "http://vault-etcd.vault.svc.cluster.local:2379"
        ha_enabled = "true"
        etcd_api = "v3"
    }
  2. Then I run vault operator migrate -config=migrate.hcl with the following migrate.hcl file:
storage_source "etcd" {
  address = "http://vault-etcd.vault.svc.cluster.local:2379"
  ha_enabled = "true"
  etcd_api = "v3"
}
storage_destination "raft" {
  path = "/vault/data/raft"
}
api_addr = "https://vault.myfqdn.com"
cluster_addr = "https://vault-0.vault-internal:8201" # obviously each node has a unique name here

and I get output roughly like:

2024-12-24T17:41:59.897Z [INFO]  creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:5000000000, ElectionTimeout:5000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:2500000000, LocalID:\"36d71a81-0ae5-a1f9-d816-973cbfdfe6dc\", NotifyCh:(chan<- bool)(0x4002dc41c0), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.intLogger)(0x4003850cc0), NoSnapshotRestoreOnStart:true, PreVoteDisabled:false, skipStartup:false}"
2024-12-24T17:41:59.898Z [INFO]  initial configuration: index=1 servers="[{Suffrage:Voter ID:36d71a81-0ae5-a1f9-d816-973cbfdfe6dc Address:vault-1.vault-internal:8201}]"
2024-12-24T17:41:59.898Z [INFO]  entering follower state: follower="Node at 36d71a81-0ae5-a1f9-d816-973cbfdfe6dc [Follower]" leader-address= leader-id=
2024-12-24T17:42:08.519Z [WARN]  heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-12-24T17:42:08.519Z [INFO]  entering candidate state: node="Node at 36d71a81-0ae5-a1f9-d816-973cbfdfe6dc [Candidate]" term=2
2024-12-24T17:42:08.519Z [INFO]  pre-vote successful, starting election: term=2 tally=1 refused=0 votesNeeded=1
2024-12-24T17:42:08.522Z [INFO]  election won: term=2 tally=1
2024-12-24T17:42:08.522Z [INFO]  entering leader state: leader="Node at 36d71a81-0ae5-a1f9-d816-973cbfdfe6dc [Leader]"
2024-12-24T17:42:08.550Z [INFO]  copied key: path=core/lock/71d493f6d0bc2b0b
2024-12-24T17:42:08.554Z [INFO]  copied key: path=core/lock/95a93f6d0ce7608
2024-12-24T17:42:08.554Z [INFO]  copied key: path=core/lock/644e93f6d0cdbc65
Success! All of the keys have been migrated.
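
As a rough sanity check (file names assumed from Vault's integrated-storage layout, not printed by the migrate output), the raft path should now be populated on each node:

    # Run inside each pod; raft.db and a snapshots/ directory are what
    # Vault's raft backend normally creates under the configured path.
    ls -la /vault/data/raft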

So far so good; however, when I attempt to restart the nodes with the raft config for HA, none of the nodes becomes active:

    listener "tcp" {
        address = "[::]:8200"
        cluster_address = "[::]:8201"
        tls_disable = true
    }
    service_registration "kubernetes" {}
    storage "s3" {
        bucket = "my-unique-aws-id-vault"
        region = "us-west-2"
    }
    seal "awskms" {
        region = "us-west-2"
        kms_key_id = "alias/my-unique-aws-id-vault"
    }
    ha_storage "raft" {
      path = "/vault/data/raft"
    }
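
For reference, that is the minimal raft HA stanza; a fuller version would normally pin a node_id and add retry_join blocks so standbys can find the leader. A sketch only, with the node_id and API addresses assumed from my cluster_addr values above:

    ha_storage "raft" {
        path    = "/vault/data/raft"
        node_id = "vault-0"    # assumed; unique per pod
        retry_join {
            leader_api_addr = "http://vault-0.vault-internal:8200"
        }
        retry_join {
            leader_api_addr = "http://vault-1.vault-internal:8200"
        }
    }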

These are the last few log lines I see on all nodes with the new config:

2024-12-24T17:51:49.383Z [INFO]  core: vault is unsealed
2024-12-24T17:51:49.383Z [INFO]  core: entering standby mode
2024-12-24T17:51:49.399Z [INFO]  core: unsealed with stored key

and vault status shows:

Key                      Value
---                      -----
Seal Type                awskms
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.18.1
Build Date               2024-10-29T14:21:31Z
Storage Type             s3
Cluster Name             vault-cluster-e7d44718
Cluster ID               b6e5b308-df72-5a40-346e-72fceb366bb2
HA Enabled               true
HA Cluster               n/a
HA Mode                  standby
Active Node Address      <none>

Expected behavior
One of the nodes is elected leader and is selected for the vault-active service.
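
With service_registration "kubernetes" in play, the elected leader should also pick up the active label that the registration maintains. A hedged way to watch for that (label name is the one the Kubernetes service registration sets; namespace assumed):

    # Exactly one pod should carry this label once a leader exists;
    # in the failing state this presumably returns no pods.
    kubectl -n vault get pods -l vault-active=true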

Environment:

  • Vault Server Version (retrieve with vault status): 1.18.1

  • Vault CLI Version (retrieve with vault version): Vault v1.18.1 (f479e5c), built 2024-10-29T14:21:31Z

  • Server Operating System/Architecture: k8s 1.28

Vault server configuration file(s): See above

Additional context
