[#704] Add script to automatically fix node corruption #766

PruStephan · 2023-12-09T20:56:24Z

Description

Add script that performs the following:

Checks whether the node logs contain a record of corruption
Removes the node dir
Loads the relevant snapshot into a temporary dir
Imports it

Related issue(s)

Resolves #704

Related changes (conditional)

I checked whether I should update the README
I checked whether native packaging works, i.e. native binary packages
can be successfully built.

Stylistic guide (mandatory)

My commits comply with the policy used in Serokell.

DMozhevitin

The restoration logic looks OK overall, so you seem to be moving in the right direction, although I have some comments.

DMozhevitin · 2023-12-20T19:24:48Z

baking/src/tezos_baking/restore_from_corruption.py

+        print("Could not delete node data dir. Manual restoration is required")
+
+    snapshot_array = None
+    config = {"network": os.environ["NETWORK"], "history_mode": history_mode}


I'd suggest changing extract_relevant_snapshot to accept network and history_mode as separate parameters rather than a config dict: the config is an attribute of the wizard object, and we don't have that object here.
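A minimal sketch of what that signature could look like. The metadata keys "chain_name" and "history_mode" are assumptions about the provider JSON, and the real function may apply further filters.

```python
def extract_relevant_snapshot(snapshot_array, network, history_mode):
    """Return the first entry matching network and history mode, or None."""
    return next(
        (
            snapshot
            for snapshot in snapshot_array
            # "chain_name" and "history_mode" are assumed metadata keys.
            if snapshot.get("chain_name") == network
            and snapshot.get("history_mode") == history_mode
        ),
        None,
    )
```

With this shape the script can pass os.environ["NETWORK"] and history_mode directly, without building a dict that only imitates the wizard's config.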

DMozhevitin · 2023-12-20T19:29:04Z

baking/src/tezos_baking/util.py

@@ -148,3 +158,222 @@ def url_is_reachable(url):
        return True
    except (urllib.error.URLError, ValueError):
        return False
+
+
+def fetch_snapshot(url, sha256=None):


I think it's better to extract all snapshot-related stuff to tezos_baking/snapshot.py rather than keeping it in util

DMozhevitin · 2023-12-20T19:45:58Z

baking/src/tezos_baking/restore_from_corruption.py

+    for json_url in default_providers.values():
+        with urllib.request.urlopen(json_url) as url:
+            snapshot_array = json.load(url)["data"]
+        if snapshot_array is not None:


I'd suggest picking not the first suitable provider, but the one that has a snapshot with the highest block_height.

Unlike tezos-setup, where we ask the user to choose a provider to download from (and fall back to the other providers if the chosen one has no suitable snapshot), here we are able to choose the snapshot with the most recent block ourselves.
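Roughly what I have in mind, as a sketch only. It assumes the per-provider snapshot lists have already been fetched into a dict, and that each entry carries "chain_name", "history_mode" and "block_height" keys. The function name is hypothetical.

```python
def pick_freshest_snapshot(snapshots_by_provider, network, history_mode):
    """Return (provider, snapshot) with the highest block_height, or None.

    snapshots_by_provider maps a provider name to its list of snapshot
    metadata dicts. The key names used below are assumptions.
    """
    candidates = [
        (provider, snapshot)
        for provider, snapshots in snapshots_by_provider.items()
        for snapshot in snapshots
        if snapshot.get("chain_name") == network
        and snapshot.get("history_mode") == history_mode
    ]
    if not candidates:
        return None
    # Entries without a block_height sort last instead of raising KeyError.
    return max(candidates, key=lambda pair: pair[1].get("block_height", 0))
```

A provider that fails to respond can simply be left out of the dict, so the remaining providers still act as a fallback.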

DMozhevitin · 2023-12-20T19:49:21Z

baking/src/tezos_baking/restore_from_corruption.py

+    os.remove(snapshot_path)
+
+    if not reinstallation_result.returncode:
+        print("Recovery from corruption was successfull")


I'm not sure whether the output of these prints will end up in the systemd service logs, since we're going to use this script as part of the service.

Did you check it?
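If it turns out to be a problem, one option is to go through the logging module instead of print. The sketch below assumes the unit keeps systemd's default of forwarding the service's stdout and stderr to the journal. The logger name and the make_logger helper are hypothetical.

```python
import logging
import sys


def make_logger(stream=None):
    """Build a logger that writes one flushed line per message.

    Writes to stderr by default. The stream parameter exists only so the
    output can be captured in tests.
    """
    logger = logging.getLogger("restore_from_corruption")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    # Replace any existing handlers so repeated calls do not duplicate output.
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

StreamHandler flushes after every record, so messages are not held back by the block buffering Python applies to stdout when it is not attached to a terminal.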

DMozhevitin · 2023-12-20T19:49:53Z

baking/src/tezos_baking/restore_from_corruption.py

+    if not reinstallation_result.returncode:
+        print("Recovery from corruption was successfull")
+    else:
+        print("Recovery from corruption failed. Manual restoration is required")


I suppose the script should exit with a non-zero exit code at this point (and in the other places where it fails to restore the node storage automatically).
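Something along these lines, as a sketch. The finish helper is hypothetical; reinstallation_result is the object from the diff above.

```python
import sys


def finish(returncode):
    """Report the outcome and return the exit code the script should use."""
    if returncode == 0:
        print("Recovery from corruption was successful")
        return 0
    print(
        "Recovery from corruption failed. Manual restoration is required",
        file=sys.stderr,
    )
    return 1


# At the end of main(), assuming reinstallation_result comes from the diff:
#     sys.exit(finish(reinstallation_result.returncode))
```

A non-zero exit status lets systemd mark the unit as failed, which is what makes the failure visible beyond the log text.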

DMozhevitin · 2023-12-20T20:10:04Z

baking/src/tezos_baking/restore_from_corruption.py

+def main():
+    is_corrupted = check_node_corruption()
+    is_baking_installed = (
+        b"tezos-baking" in get_proc_output("which octez-baking").stdout


There is no octez-baking package, only tezos-baking. The octez-* names refer to binaries from tezos/tezos that we build and provide via tezos-* packages (i.e. installing the tezos-node package via sudo apt-get install tezos-node installs the octez-node binary). Apart from that, tezos-baking is a special case: there is no octez-baking binary; it is rather a set of systemd services and other things that are used together with the octez binaries.

The which command searches for the path of an executable. Given the above, it can't be used to check whether the tezos-baking package is present on the system, since this package isn't an executable.

So we need another approach to check whether tezos-baking is installed. The first thing that comes to mind is apt list --installed | grep tezos-baking, but I'm pretty sure there are better alternatives.
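One possible alternative, sketched for Debian-based systems only: ask dpkg-query for the package status instead of grepping the apt output. The helper name is hypothetical, and RPM-based systems would need a different check.

```python
import subprocess


def is_package_installed(package):
    """Return True if dpkg reports the package as installed.

    Assumes a Debian-based system. Returns False when dpkg-query is not
    available or when the package is unknown to dpkg.
    """
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}", package],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return False
    # Status has three fields, e.g. "install ok installed"; the last one
    # is the actual package state.
    fields = result.stdout.split()
    return result.returncode == 0 and fields[-1:] == ["installed"]
```

Usage would then be is_package_installed("tezos-baking") in place of the which call.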

[#704] Add env variable

a05c86d

PruStephan self-assigned this Dec 9, 2023

PruStephan force-pushed the PruStephan/#704-fix-node-corruption branch 6 times, most recently from 22fe622 to 9499993 Compare December 14, 2023 20:57

PruStephan force-pushed the PruStephan/#704-fix-node-corruption branch from 9499993 to 10ae4fd Compare December 18, 2023 21:46

fixup! [#704] Add env variable

d74780d

PruStephan force-pushed the PruStephan/#704-fix-node-corruption branch from 10ae4fd to d74780d Compare December 18, 2023 21:46

DMozhevitin requested changes Dec 20, 2023

DMozhevitin assigned krendelhoff2 and unassigned PruStephan Jan 8, 2024

krendelhoff2 closed this Feb 15, 2024
