Cannot migrate from etcd HA to raft HA when using external S3 storage #29259

Open
licenseplated opened this issue Dec 24, 2024 · 0 comments

Nodes migrated from etcd to raft for HA all remain in standby

I have several clusters that use S3 for storage, KMS for auto-unseal, and etcd for HA coordination. I've been trying to work out a process for migrating from etcd to raft for HA while retaining S3 and KMS for storage and unsealing. But regardless of what I try, nodes configured to use raft for HA always unseal and then enter standby mode, leaving me without an active node.

To Reproduce

  1. I start with a cluster with this config:
    listener "tcp" {
        address = "[::]:8200"
        cluster_address = "[::]:8201"
        tls_disable = true
    }
    service_registration "kubernetes" {}
    storage "s3" {
        bucket = "my-unique-aws-id-vault"
        region = "us-west-2"
    }
    seal "awskms" {
        region = "us-west-2"
        kms_key_id = "alias/my-unique-aws-id-vault"
    }
    ha_storage "etcd" {
        address = "http://vault-etcd.vault.svc.cluster.local:2379"
        ha_enabled = "true"
        etcd_api = "v3"
    }
  2. Then I run vault operator migrate -config=migrate.hcl with the following migrate.hcl file:
storage_source "etcd" {
  address = "http://vault-etcd.vault.svc.cluster.local:2379"
  ha_enabled = "true"
  etcd_api = "v3"
}
storage_destination "raft" {
  path = "/vault/data/raft"
}
api_addr = "https://vault.myfqdn.com"
cluster_addr = "https://vault-0.vault-internal:8201" # obviously each node has a unique name here

and I get output roughly like:

2024-12-24T17:41:59.897Z [INFO]  creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:5000000000, ElectionTimeout:5000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:2500000000, LocalID:\"36d71a81-0ae5-a1f9-d816-973cbfdfe6dc\", NotifyCh:(chan<- bool)(0x4002dc41c0), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.intLogger)(0x4003850cc0), NoSnapshotRestoreOnStart:true, PreVoteDisabled:false, skipStartup:false}"
2024-12-24T17:41:59.898Z [INFO]  initial configuration: index=1 servers="[{Suffrage:Voter ID:36d71a81-0ae5-a1f9-d816-973cbfdfe6dc Address:vault-1.vault-internal:8201}]"
2024-12-24T17:41:59.898Z [INFO]  entering follower state: follower="Node at 36d71a81-0ae5-a1f9-d816-973cbfdfe6dc [Follower]" leader-address= leader-id=
2024-12-24T17:42:08.519Z [WARN]  heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-12-24T17:42:08.519Z [INFO]  entering candidate state: node="Node at 36d71a81-0ae5-a1f9-d816-973cbfdfe6dc [Candidate]" term=2
2024-12-24T17:42:08.519Z [INFO]  pre-vote successful, starting election: term=2 tally=1 refused=0 votesNeeded=1
2024-12-24T17:42:08.522Z [INFO]  election won: term=2 tally=1
2024-12-24T17:42:08.522Z [INFO]  entering leader state: leader="Node at 36d71a81-0ae5-a1f9-d816-973cbfdfe6dc [Leader]"
2024-12-24T17:42:08.550Z [INFO]  copied key: path=core/lock/71d493f6d0bc2b0b
2024-12-24T17:42:08.554Z [INFO]  copied key: path=core/lock/95a93f6d0ce7608
2024-12-24T17:42:08.554Z [INFO]  copied key: path=core/lock/644e93f6d0cdbc65
Success! All of the keys have been migrated.
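
As a rough sanity check (file names assumed from Vault's integrated-storage layout, not printed by the migrate output), the raft path should now be populated on each node:

    # Run inside each pod; raft.db and a snapshots/ directory are what
    # Vault's raft backend normally creates under the configured path.
    ls -la /vault/data/raft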

So far so good; however, when I attempt to restart the nodes with the raft config for HA, none of the nodes becomes active:

    listener "tcp" {
        address = "[::]:8200"
        cluster_address = "[::]:8201"
        tls_disable = true
    }
    service_registration "kubernetes" {}
    storage "s3" {
        bucket = "my-unique-aws-id-vault"
        region = "us-west-2"
    }
    seal "awskms" {
        region = "us-west-2"
        kms_key_id = "alias/my-unique-aws-id-vault"
    }
    ha_storage "raft" {
      path = "/vault/data/raft"
    }
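
For reference, that is the minimal raft HA stanza; a fuller version would normally pin a node_id and add retry_join blocks so standbys can find the leader. A sketch only, with the node_id and API addresses assumed from my cluster_addr values above:

    ha_storage "raft" {
        path    = "/vault/data/raft"
        node_id = "vault-0"    # assumed; unique per pod
        retry_join {
            leader_api_addr = "http://vault-0.vault-internal:8200"
        }
        retry_join {
            leader_api_addr = "http://vault-1.vault-internal:8200"
        }
    }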

These are the last few log lines I see on all nodes with the new config:

2024-12-24T17:51:49.383Z [INFO]  core: vault is unsealed
2024-12-24T17:51:49.383Z [INFO]  core: entering standby mode
2024-12-24T17:51:49.399Z [INFO]  core: unsealed with stored key

and vault status shows:

Key                      Value
---                      -----
Seal Type                awskms
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.18.1
Build Date               2024-10-29T14:21:31Z
Storage Type             s3
Cluster Name             vault-cluster-e7d44718
Cluster ID               b6e5b308-df72-5a40-346e-72fceb366bb2
HA Enabled               true
HA Cluster               n/a
HA Mode                  standby
Active Node Address      <none>

Expected behavior
One of the nodes is elected leader and is selected for the vault-active service.
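
With service_registration "kubernetes" in play, the elected leader should also pick up the active label that the registration maintains. A hedged way to watch for that (label name is the one the Kubernetes service registration sets; namespace assumed):

    # Exactly one pod should carry this label once a leader exists;
    # in the failing state this presumably returns no pods.
    kubectl -n vault get pods -l vault-active=true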

Environment:

  • Vault Server Version (retrieve with vault status): 1.18.1

  • Vault CLI Version (retrieve with vault version): Vault v1.18.1 (f479e5c), built 2024-10-29T14:21:31Z

  • Server Operating System/Architecture: k8s 1.28

Vault server configuration file(s): See above

Additional context
