Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#704] Add script to automatically fix node corruption #766

Closed
wants to merge 2 commits into from

Conversation

PruStephan
Copy link
Contributor

Description

Add script that performs the following:

  • Cheks if node logs contain record of corruption
  • Removes the node dir
  • Loads relevant snapshot to temporary dir
  • imports it

Related issue(s)

Resolves #704

Related changes (conditional)

  • I checked whether I should update the README

  • I checked whether native packaging works, i.e. native binary packages
    can be successfully built.

Stylistic guide (mandatory)

@PruStephan PruStephan self-assigned this Dec 9, 2023
@PruStephan PruStephan force-pushed the PruStephan/#704-fix-node-corruption branch 6 times, most recently from 22fe622 to 9499993 Compare December 14, 2023 20:57
@PruStephan PruStephan force-pushed the PruStephan/#704-fix-node-corruption branch from 9499993 to 10ae4fd Compare December 18, 2023 21:46
@PruStephan PruStephan force-pushed the PruStephan/#704-fix-node-corruption branch from 10ae4fd to d74780d Compare December 18, 2023 21:46
Copy link
Collaborator

@DMozhevitin DMozhevitin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restoration logic looks ok overall, so it seems that you're moving in the right direction, although I have some comments

print("Could not delete node data dir. Manual restoration is required")

snapshot_array = None
config = {"network": os.environ["NETWORK"], "history_mode": history_mode}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest modifying extract_relevant_snapshot to accept network and history_mode parameters separately rather than config dict, since the config is an attribute of wizard object, but we don't have it there

@@ -148,3 +158,222 @@ def url_is_reachable(url):
return True
except (urllib.error.URLError, ValueError):
return False


def fetch_snapshot(url, sha256=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to extract all snapshot-related stuff to tezos_baking/snapshot.py rather than keeping it in util

for json_url in default_providers.values():
with urllib.request.urlopen(json_url) as url:
snapshot_array = json.load(url)["data"]
if snapshot_array is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to pick not the first suitable provider, but the one which have a snapshot with higher block_height

Unlike tezos-setup where we ask user to chose provider to download from (if there is no such a snapshot in provider, we use other ones as fallback), here we capable to chose the snapshot with most recent block by ourselves

os.remove(snapshot_path)

if not reinstallation_result.returncode:
print("Recovery from corruption was successfull")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if output of these prints will be redirected to the systemd service logs, since we're going to use this script as a part of the service

Did you check it?

if not reinstallation_result.returncode:
print("Recovery from corruption was successfull")
else:
print("Recovery from corruption failed. Manual restoration is required")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that the script should fail with non-zero exit code at this point (and in other places when the script failed to automatically restore node storage)

def main():
is_corrupted = check_node_corruption()
is_baking_installed = (
b"tezos-baking" in get_proc_output("which octez-baking").stdout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. There is no octez-baking package, but tezos-baking.octez-* is about binaries from tezos/tezos that we build and provide via tezos-* packages (i.e. installing tezos-node package via sudo apt-get install tezos-node installs octez-node binary). But apart from that, tezos-baking is a special case, since there is no such binary as octez-baking, it's rather set of systemd services and other things that are used together with octez binaries.

  2. which command searches the path of executable. Considering words above, it can't be used for checking the presence of tezos-baking package in the system, since this package isn't an executable.

So we need to use another approach to checking whether tezos-baking is installed. The first thing that comes to mind is apt list --installed | grep tezos-baking, but I'm pretty sure that there are better alternatives

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix node storage corruption automatically
3 participants