Create a KB for incorrect replica expansion
Signed-off-by: Eric Weber <[email protected]>
ejweber committed Jul 26, 2023
1 parent f664444 commit f052bad
Showing 2 changed files with 201 additions and 1 deletion.
@@ -0,0 +1,200 @@
---
title: "Troubleshooting: Unexpected expansion leads to degradation or attach failure"
author: Eric Weber
draft: false
date: 2023-07-26
categories:
- "expansion"
- "replica"
- "instance-manager"
---

## Applicable versions

Confirmed in:

- Longhorn v1.3.2 - v1.3.3
- Longhorn v1.4.0 - v1.4.2
- Longhorn v1.5.0

Potentially mitigated in:

- Longhorn v1.4.3
- Longhorn v1.5.1

Complete fix planned in:

- Longhorn v1.4.x
- Longhorn v1.5.x
- Longhorn v1.6.0

## Symptoms

While the root cause is always the same, symptoms can vary depending on other factors (e.g. whether there are multiple
healthy replicas, which specific version of Longhorn is in use, etc.).

Generic symptoms that are not in and of themselves evidence of this issue include the following (a quick way to check them with `kubectl` is sketched after the list):

- A volume is degraded with multiple failed rebuild attempts.
- A volume fails to attach and/or appears to be in an attach/detach loop.
- A volume experiencing either of the above has fewer replicas than expected.
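
A minimal sketch of such a check, assuming Longhorn is installed in the default `longhorn-system` namespace and that your Longhorn version labels instance objects with `longhornvolume` (verify both against your installation):

```bash
# List Longhorn volumes with their attach state, robustness, and nominal size.
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness,SIZE:.spec.size

# Count the replicas that currently exist for a suspect volume.
# Replace <volume> with the volume name.
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=<volume>
```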

More specific symptoms include the following. Not all symptoms are present in all cases.

### Expansion error in the UI

A volume shows as expanding in the UI with a red info symbol indicating a problem. Hovering over the red info symbol
yields a message like:

```
Expansion Error: the expected size <small_size> of engine <engine> should not be smaller than the current size <large_size>. You can cancel the expansion to avoid volume crash.
```

An expansion is not actually ongoing and cannot be cancelled. Attempting to do so yields an error like:

```
unable to cancel expansion for volume <volume>: volume expansion is not started
```
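
The same mismatch can be confirmed outside the UI by comparing the size recorded by the Longhorn control plane with the size reported by the engine. This is only a sketch: it assumes the default `longhorn-system` namespace, and the `spec.volumeSize` and `status.currentSize` field names may differ between Longhorn versions.

```bash
# Nominal volume size tracked by the Longhorn control plane.
kubectl -n longhorn-system get volumes.longhorn.io <volume> -o jsonpath='{.spec.size}{"\n"}'

# Size currently reported by the volume's engine.
kubectl -n longhorn-system get engines.longhorn.io -l longhornvolume=<volume> \
  -o custom-columns=NAME:.metadata.name,SPEC_SIZE:.spec.volumeSize,CURRENT_SIZE:.status.currentSize
```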

### Instance-manager logs

Instance-manager pods responsible for rebuilding new or pre-existing replicas log repeated failure to do so because of a
size mismatch:

```
<time> time="<time>" level=error msg="failed to prune <snapshot>.img based on <snapshot>.img: file sizes are not equal and the parent file is larger than the child file"
```
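
One hedged way to search for this message across all instance-manager pods, assuming the default `longhorn-system` namespace and the `longhorn.io/component=instance-manager` pod label used by recent Longhorn versions:

```bash
# Search every instance-manager pod for the pruning failure quoted above.
for pod in $(kubectl -n longhorn-system get pods \
    -l longhorn.io/component=instance-manager -o name); do
  echo "--- ${pod} ---"
  kubectl -n longhorn-system logs "${pod}" | grep "file sizes are not equal" || true
done
```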

It is sometimes possible to catch this issue at its origination. The instance-manager pod for an engine logs that it
will expand a replica and then fails to add it. Note that these log lines are normal during a legitimate expansion and
are not by themselves an indication of a problem. However, they are a red flag if no expansion has been requested:

```
<time> [longhorn-instance-manager] time="<time>" level=debug msg="Adding replica <replica_address>" currentSize=<size> restore=false serviceURL="<engine_address>" size=<size>
<time> [longhorn-instance-manager] time="<time>" level=info msg="Prepare to expand new replica to size <size>"
<time> [longhorn-instance-manager] time="<time>" level=info msg="Adding replica <replica_address> in WO mode"
```

Similarly, the instance-manager pod for a replica logs that it is expanding:

```
<time> [<replica>] time="<time>" level=info msg="Replica server starts to expand to size <large_size>"
```
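
The same log-search approach can be used to look for these origination messages; a sketch under the same assumptions as above:

```bash
# Look for replicas being prepared for expansion, or expanding, when no
# expansion was requested.
for pod in $(kubectl -n longhorn-system get pods \
    -l longhorn.io/component=instance-manager -o name); do
  echo "--- ${pod} ---"
  kubectl -n longhorn-system logs "${pod}" \
    | grep -E "Prepare to expand new replica|Replica server starts to expand" || true
done
```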

### Longhorn-manager logs

Longhorn-manager pods responsible for monitoring a volume's engine log a bug related to size:

```
E<date> <time> 1 engine_controller.go:731] failed to update status for engine <engine>: BUG: The expected size <small_size> of engine <engine> should not be smaller than the current size <large_size>
```

It is sometimes possible to catch this issue at its origination. The longhorn-manager for an engine logs that it fails
to add a replica because the replica is not in the right state. Note that, while this indicates a likely problem, it is
not by itself an indication that the issue described in this KB has occurred.

```
<time> time="<time>" level=error msg="Failed rebuilding of replica <replica_address>" controller=longhorn-engine engine=<engine> error="proxyServer=<instance_manager_address> destination=<engine_address>: failed to add replica <replica_address> for volume: rpc error: code = Unknown desc = failed to create replica <replica_address> for volume <engine_address>: rpc error: code = Unknown desc = replica must be closed, Can not add in state: dirty" node=<node> volume=<volume>
```
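
A sketch for finding both messages in the longhorn-manager pods, assuming the default namespace and the `app=longhorn-manager` pod label:

```bash
# Search all longhorn-manager pods for the size-mismatch and rebuild-failure messages.
# On clusters with more than five nodes, also pass --max-log-requests=<node count>.
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 --prefix \
  | grep -E "should not be smaller than the current size|Failed rebuilding of replica"
```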

### Snapshot chain on disk

Each Longhorn replica maintains a chain of snapshots on disk. Each snapshot is a sparse file with the nominal size of
the volume at the time it was taken. In an affected replica, every snapshot after a particular point in the chain has a
larger size, even though the volume size was never altered:

```
-rw-r--r--. 1 root root 10737418240 Jun 8 04:42 volume-snap-snapshot-ab1a619f-196d-4f58-9a35-2c705a05cacb.img
-rw-r--r--. 1 root root 10737418240 Jun 6 12:11 volume-snap-snapshot-65bfafe1-9581-496a-81bf-78a3151c658d.img
-rw-r--r--. 1 root root 42949672960 Jun 6 12:11 volume-snap-snapshot-488c080c-0b4f-442f-aeec-667cd36f58cb.img
-rw-r--r--. 1 root root 42949672960 Jun 6 12:43 volume-snap-snapshot-fadec910-b472-45c0-bd0c-d11f0f5b0234.img
-rw-r--r--. 1 root root 42949672960 Jun 6 15:12 volume-snap-snapshot-d7b5d42f-0111-44a0-b9b7-6bc080a5a809.img
-rw-r--r--. 1 root root 42949672960 Jun 7 09:06 volume-snap-snapshot-ffb8c77b-8968-443d-b9e4-d858b9fa5261.img
-rw-r--r--. 1 root root 42949672960 Jun 7 12:03 volume-snap-snapshot-0236df7a-8b33-4569-8014-e33d735a4e01.img
-rw-r--r--. 1 root root 42949672960 Jun 7 15:08 volume-snap-snapshot-60621c68-3dc8-445d-bc08-f0f3c5587416.img
-rw-r--r--. 1 root root 42949672960 Jun 8 04:40 volume-snap-snapshot-71db93c1-d06f-4689-9365-5892a4bfc642.img
-rw-r--r--. 1 root root 42949672960 Jun 8 04:39 volume-snap-dailybac-d0c4f62a-8f7a-4522-854e-c754e1dadeb9.img
-rw-r--r--. 1 root root 42949672960 Jun 8 04:42 volume-snap-snapshot-23cbf46b-e1f8-41c7-8d21-edbdacdc38a0.img
-rw-r--r--. 1 root root 42949672960 Jun 8 07:50 volume-head-007.img
```
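
To inspect a replica's snapshot chain directly, log in to the node hosting the replica and look under the Longhorn data path. A sketch, assuming the default data path of `/var/lib/longhorn` (the replica directory name is the volume name plus a random suffix):

```bash
# Replace the directory name with the actual replica directory on the node.
REPLICA_DIR=/var/lib/longhorn/replicas/<volume>-<suffix>

# Nominal (sparse) sizes, as shown in the listing above.
ls -la "${REPLICA_DIR}"

# Compare the nominal size of each snapshot with the blocks actually allocated on disk.
du -h --apparent-size "${REPLICA_DIR}"/*.img
du -h "${REPLICA_DIR}"/*.img
```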

## Root cause

This issue occurs when the engine of a larger volume incorrectly attempts to add the running replica of a smaller
volume. While the larger engine fails to add the smaller replica (because the smaller replica is actively being used),
it successfully expands the smaller replica on disk. Once expanded, the smaller replica can continue to be used as
normal. Its engine can continue writing to and reading from the expected offsets and there may be no immediately
observable symptoms. The Longhorn control plane continues to assume the replica has the correct size.

Symptoms may start to appear when the expanded replica is used as the source for a rebuild (e.g. when another replica is
restarted in normal operation and must sync its files from a healthy one). The rebuild fails in the pruning process
because the volume head for the new replica has the correct size and the snapshot copied from the expanded replica has
a larger size.

Symptoms may also appear if the engine restarts with only the expanded replica. Because there is only one
replica, the engine successfully starts with that replica's size. This conflicts with the size expected by
Longhorn-manager, leading to errors. In practice, this situation can occur relatively easily. Rebuilds using the
expanded replica as a source fail, eventually causing the expanded replica to be the only one remaining.

## Known triggers

In general, this issue seems to be triggered by instance-manager pods being shut down / restarted or entire Longhorn
nodes being shut down / restarted while running engine and replica processes. The Longhorn control plane tracks a
replica by an address/port combination assigned by an instance-manager. During periods of high churn, the address/port
combination referring to one replica (and being tracked by the Kubernetes object for one engine) may be assumed by
another replica. At this moment, actions taken using the outdated Kubernetes object may cause its engine to communicate
with the wrong replica.

Two specific races that lead to this situation have been identified and fixed, but it is possible that another exists:

- https://github.com/longhorn/longhorn-manager/pull/1868 (Longhorn v1.3.x, v1.4.2+, v1.5.0+)
- https://github.com/longhorn/longhorn-manager/pull/2042 (Longhorn v1.3.x, v1.4.3+, v1.5.1+)

## Workaround

### Avoid the issue

Whenever possible, follow the [node maintenance guide](../../docs/1.5.1/volumes-and-nodes/maintenance) when
shutting down or restarting nodes. This eliminates the churn described above and ensures Longhorn safely moves engine
and replica processes between nodes. Never intentionally shut down instance-manager pods or nodes running
instance-manager pods while Longhorn processes are running in them.
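
As a rough sketch, a typical maintenance flow looks like the following. The exact drain flags depend on the Kubernetes version and the workloads involved, so treat the linked maintenance guide as authoritative:

```bash
# Stop new pods (and new Longhorn replicas) from being scheduled to the node.
kubectl cordon <node>

# Evict workloads so Longhorn can detach volumes and stop engine and replica
# processes cleanly. Flag names vary by kubectl version (older releases use
# --delete-local-data instead of --delete-emptydir-data).
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# ... perform the shutdown, reboot, or upgrade ...

# Allow scheduling again once the node is back.
kubectl uncordon <node>
```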

### Correct the issue

If a replica has been expanded due to this issue but the volume is not yet degraded, the issue can be resolved with
minimal impact. Unfortunately, it is unlikely to be discovered before symptoms are present.

1. Confirm in the instance-manager logs that a replica has been unexpectedly expanded.
1. Verify that there are other, healthy replicas.
1. Delete the expanded replica (a `kubectl` sketch follows this list).
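
Deleting the replica from the Longhorn UI is the documented path; for completeness, a `kubectl` sketch of the same steps follows, again assuming the default `longhorn-system` namespace and the `longhornvolume` label:

```bash
# List the replicas of the affected volume and identify the expanded one
# (for example, by matching the replica named in the instance-manager logs).
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=<volume>

# Delete only the expanded replica. With other healthy replicas present,
# Longhorn rebuilds a replacement automatically.
kubectl -n longhorn-system delete replicas.longhorn.io <expanded-replica>
```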

If symptoms are observed and there is an acceptable backup,
[restore from backup](../../docs/1.5.1/snapshots-and-backups/backup-and-restore/restore-from-a-backup).

If symptoms are observed and there is not an acceptable backup, expand the volume to the size of the expanded replica.

1. Identify the size of the expanded replica. This is the larger size shown in the UI error and the instance-manager
logs. It is also the larger size of the snapshot files on disk.
1. Scale down the workload. It is likely already not running due to the issue.
1. [Expand the volume to the larger size](../../docs/1.5.1/volumes-and-nodes/expansion) (a sketch of one way to do this follows this list).
1. Scale up the workload.
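
For the expansion step, the linked documentation is authoritative. As a sketch, a PVC-backed volume can be expanded by patching the PVC; this assumes the StorageClass has `allowVolumeExpansion: true`, and volumes created directly in Longhorn can instead be expanded from the UI:

```bash
# Request the larger size on the PVC that backs the volume.
# Replace the namespace, PVC name, and size with your own values.
kubectl -n <workload-namespace> patch pvc <pvc-name> \
  -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'
```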

In some situations, the above volume expansion may be unacceptable (e.g. if a 2 GiB volume was expanded by a 2 TiB
engine). If desired, after expansion:

1. Create a new volume (with the correct size).
1. Manually attach the old and new volume to the same node.
1. Copy all data from the old volume to the new volume using `cp` or `rsync` at the filesystem level (see the sketch after this list).
1. Detach both volumes.
1. Use the new volume for the workload.
1. Delete the old volume.
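
A sketch of the copy step, assuming both volumes are attached to the same node (attached Longhorn volumes appear as block devices under `/dev/longhorn/<volume-name>`) and that the workload uses an ext4 filesystem; substitute the appropriate `mkfs` and mount options for other filesystems:

```bash
mkdir -p /mnt/old /mnt/new

# The new, empty volume needs a filesystem before first use.
mkfs.ext4 /dev/longhorn/<new-volume>

mount /dev/longhorn/<old-volume> /mnt/old
mount /dev/longhorn/<new-volume> /mnt/new

# Preserve ownership, permissions, extended attributes, and sparse files.
rsync -aHAXS /mnt/old/ /mnt/new/

umount /mnt/old /mnt/new
```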

## Long term fix

A complete fix for this issue is under active development. The goal is to make it impossible for any Longhorn component
(instance-manager, engine, etc.) to communicate with the wrong process by sending volume name and instance name metadata
in each request. If a process receives the wrong metadata, it will return an error and take no action. This fix should
be available in v1.6.0, v1.5.x, and v1.4.x. See the [GitHub issue](https://github.com/longhorn/longhorn/issues/5845) for
more information.

## Related information

- https://github.com/longhorn/longhorn/issues/5709:
One of the original GitHub issues.
- https://github.com/longhorn/longhorn/issues/6078:
GitHub issue reporting that a first fix was insufficient.
- https://github.com/longhorn/longhorn/issues/6217:
GitHub issue with reproduction steps and a second fix that eliminates the issue.
- https://github.com/longhorn/longhorn/issues/5845:
GitHub issue tracking the long term fix.
@@ -98,7 +98,7 @@ If it is accepted, additional capabilities will no longer be required as downstr
https://github.com/longhorn/longhorn/issues/5627#issuecomment-1577498183
- Original `open-iscsi` fix:
https://github.com/open-iscsi/open-iscsi/pull/244/commits/6df400925cfa9e723375c6f61524473703054220
- Testing for the workaround DaemonSet
- Testing for the workaround DaemonSet:
https://github.com/longhorn/longhorn/pull/6082#issuecomment-1581142425
- Fedora Project PR:
https://src.fedoraproject.org/rpms/iscsi-initiator-utils/pull-request/13
