bug (regression): updating a register can fail on local network with stable-2024.08.2.3 but still works using stable.2024-07-25 #2077

Open
happybeing opened this issue Aug 29, 2024 · 24 comments


@happybeing
Contributor

happybeing commented Aug 29, 2024

I have a bunch of scripts which I use to test my application (awe) on a local network. In brief, these create a local network and then upload a series of websites, some with just a single version, and two with a series of about four versions each.

When uploading multiple versions the following sequence is repeated to load and update the register:

  • get the register using Client::get_register()
  • sync using ClientRegister::sync()
  • write a new value using ClientRegister::write_merging_branches_online()
  • sync using ClientRegister::sync()

(The above all happens in awe_website_versions.rs).
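To make the four-step update sequence concrete, here is a minimal in-memory sketch. Replica, sync and update_versions are hypothetical stand-ins, not the sn_client API, and the sync semantics used here (push local entries to the network copy, then pull the network copy back, as a set union) are an assumption made only for illustration.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a register replica (NOT sn_client's ClientRegister).
/// It only models "a set of entries that replicas merge when they sync".
#[derive(Clone, Debug, Default)]
struct Replica {
    entries: BTreeSet<String>,
}

impl Replica {
    /// Assumed sync semantics: push local entries to the network copy,
    /// then pull everything the network copy holds.
    fn sync(&mut self, network: &mut Replica) {
        network.entries.extend(self.entries.iter().cloned());
        self.entries.extend(network.entries.iter().cloned());
    }

    /// Write a new value to the local copy only.
    fn write(&mut self, value: &str) {
        self.entries.insert(value.to_string());
    }
}

/// The update sequence from the issue: get, sync, write, sync.
/// Returns how many entries the network copy holds afterwards.
fn update_versions(network: &mut Replica, new_value: &str) -> usize {
    let mut local = network.clone(); // 1. get the register
    local.sync(network); // 2. sync before writing
    local.write(new_value); // 3. write the new value locally
    local.sync(network); // 4. sync again to publish the write
    network.entries.len()
}

fn main() {
    // Initial upload: the register is created with two entries.
    let mut network = Replica::default();
    network.write("v1");
    network.write("v2");

    let size = update_versions(&mut network, "v3");
    println!("network copy now holds {} entries", size);
}
```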

The scripts and code to do the above have been run tens if not hundreds of times without ever showing the following error, which happens when I try to update the second multi-version website, but not the first!

When the error occurs (and it occurs at the same point in repeated runs of these scripts with a new local network every time), the write_merging_branches_online() function fails with:

Failed to add XorName to register: Network(GetRecordError(RecordDoesNotMatch(c7ec9c(754072be1f575e7b94b97f21556067218383dd627d570314e1357d910f9592e9))))

If I use stable-2024.08.2.3 the above happens every time. If I use stable.2024-07-25 this has never happened.

Below are the safe_network crate versions I'm building against in each case:

# Generated using: awe-dependencies --branch stable-2024.08.2.3
sn_cli = { version = "0.94.1" }
sn_client = { version = "0.109.1" }
sn_peers_acquisition = { version = "0.4.2" }
sn_registers = { version = "0.3.17" }
sn_transfers = { version = "0.18.10" }
sn_protocol = { version = "0.17.7" }

# Generated using: awe-dependencies --branch stable.2024-07-25
# sn_cli = { version = "0.94.0" }
# sn_client = { version = "0.109.0" }
# sn_peers_acquisition = { version = "0.4.1" }
# sn_registers = { version = "0.3.16" }
# sn_transfers = { version = "0.18.9" }
# sn_protocol = { version = "0.17.6" }

What is strange to me is:

  • it is so far 100% repeatable and
  • always the same website that updates ok, and the same website that doesn't

For all five Registers (the two multi-version ones and the three others), I create the register and immediately write two values. This always succeeds. The error only occurs for one of the two Registers to which I subsequently try to write a third value, and it is always the same one.

@maqi
Member

maqi commented Aug 30, 2024

2024.08.2.3 contains a breaking change in how the client gets a Register from the network, which is not supported by the current PROD-01.
You will have to wait until all nodes have been updated for it to be supported.

Closing this issue for now as it is not relevant.

@maqi maqi closed this as completed Aug 30, 2024
@happybeing
Copy link
Contributor Author

Thanks Qi, I understand but it is still an issue until it's fixed. Keeping it open allows others to find it.

Right now I understand very few people will be hitting this, but I think the point is important and it's not a good idea to close an issue unless there's another place where someone else having the same problem will be able to find out why.

It's also relevant in that it highlights another problem: a few weeks before launch, releases are going out with issues of this kind.

I also expect you are under pressure to close issues as soon as possible because of the desire to see the end according to the plan. If so that's a mistake imo.

Thanks for your work @maqi, it is reassuring that you are involved in these very tricky areas. 🙏

@loziniak
Contributor

@happybeing, do you have this problem when the client and local nodes are built from the same version? As I understand it, @maqi's answer suggests that the issue comes from incompatible versions.

@happybeing
Contributor Author

It's not incompatible versions, it is a breaking change in that the register crate is ahead of the node crate since the recent update.

@maqi
Member

maqi commented Aug 30, 2024

Hi, @happybeing

First, I was in a rush this morning, so I judged the issue and commented with only a partial understanding of your original question.

I have now re-opened the issue, as it might indeed show a new problem.

Meanwhile, it would help to confirm the issue if you could launch a new local testnet with all nodes upgraded to 2024.08.2.3 and run the awe app built against the same 2024.08.2.3, to see whether the problem is reproducible.
Thank you very much.

@maqi maqi reopened this Aug 30, 2024
@maqi
Member

maqi commented Aug 30, 2024

Also, @happybeing,

is it possible, when you hit the error Failed to add XorName to register: Network(GetRecordError, for you to collect:
1. the update history of the local register you have
2. the update history of the register returned by Client::get_register()

The update history is the tree-structured diagram that you showed in #2030 (comment)

Thank you very much

@happybeing
Contributor Author

I'm confused Qi.

If this is just a matter of waiting until the node catches up with the register crate, what's the purpose of your requests?

@maqi
Member

maqi commented Aug 31, 2024

Sorry, @happybeing,

my original judgement of the issue was incorrect, which might have given you the wrong impression that this is an issue of mismatched versions between nodes and clients.

As you mentioned, you used a local testnet, where, first, the client and nodes should always use the same version, and second, 2024.08.2.3 should not contain any breaking change even if the client uses this version while the nodes remain on an older one.

Hence I suggested that you restart your local testnet to make sure the client and nodes are using the same version, 2024.08.2.3.

@happybeing
Contributor Author

Thanks Qi.

I don't need to redo anything because I know the client and testnet were both built to the same version.

I don't know if I will be able to assist you further as I'm stepping back for a while at least.

It was great to work with you briefly.

@maqi
Member

maqi commented Sep 3, 2024

Hi, @happybeing,

Thanks for the clarification.
If the client and testnet are always built with the same version, then I will check the other possibilities.

I really appreciate your contributions, and it has been great working with you as well. :)

@happybeing
Contributor Author

If the client and testnet are always built with the same version, then I will check the other possibilities.

I generate the crate versions from the safe_network crate using a script that takes the relevant tag, checks it out and generates the deps for my app's Cargo.toml from the safe_network Cargo.lock. The output of that is included in the OP.
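A rough sketch of that dependency-generation step, assuming the standard Cargo.lock layout of [[package]] tables with quoted name and version keys. deps_from_lock is a hypothetical helper written for illustration, not the actual awe-dependencies script, and it does no real TOML parsing.

```rust
/// Hypothetical helper: given the text of a Cargo.lock, emit
/// `name = { version = "x.y.z" }` lines for the wanted crates.
/// Assumes each [[package]] table lists `name` before `version`.
/// If a crate appears in several versions, the first one found is used.
fn deps_from_lock(lock: &str, wanted: &[&str]) -> Vec<String> {
    let mut found: Vec<(String, String)> = Vec::new();
    let mut name: Option<String> = None;

    for line in lock.lines().map(str::trim) {
        if line == "[[package]]" {
            name = None; // start of a new package table
        } else if let Some(rest) = line.strip_prefix("name = ") {
            name = Some(rest.trim_matches('"').to_string());
        } else if let Some(rest) = line.strip_prefix("version = ") {
            // The top-level `version = 3` has no preceding name, so it is skipped.
            if let Some(n) = name.take() {
                found.push((n, rest.trim_matches('"').to_string()));
            }
        }
    }

    // Keep the caller's ordering so the output matches the wanted list.
    wanted
        .iter()
        .filter_map(|w| {
            found
                .iter()
                .find(|(n, _)| n == w)
                .map(|(n, v)| format!("{} = {{ version = \"{}\" }}", n, v))
        })
        .collect()
}

fn main() {
    // Sample lockfile fragment using two of the versions listed in the OP.
    let lock = r#"
version = 3

[[package]]
name = "sn_cli"
version = "0.94.1"

[[package]]
name = "sn_client"
version = "0.109.1"
"#;
    for line in deps_from_lock(lock, &["sn_cli", "sn_client"]) {
        println!("{}", line);
    }
}
```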

Good luck.

@happybeing
Contributor Author

Here's a note to confirm that I have been running my tests successfully against stable.2024-07-25 many times since filing this issue.

I have just tried today's new release stable-2024.09.1.3 and can confirm that the issue described in the OP remains, which confirms that there has been a regression since stable.2024-07-25.

Below is the extract from my testnet-full script log. testnet-full starts a local testnet and then uses awe to upload and subsequently update several websites. Uploading creates a register and writes two entries to it. Updating a website retrieves the register and attempts to write another entry to it, and it is at this point that the error occurs - but not for every attempt to update a website.

As described in the OP, the error does not occur every time a website is updated, but appears to happen at the same point in the test script every time (which I find surprising and may be a useful clue).

Updating versions register 07a0da3efbd66d05582c98e20e2ba092c051cb88305e4ed6b5623c30d67a4f80aff1389d71cddeae8af667f08ebaa4ba91ab123b2d7c7d6647a03c9213df6850346adbf387e477fcdcbe382695c4af11
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.
Failed to update website version: Failed to add XorName to register: Network(GetRecordError(RecordDoesNotMatch(2c1f97(8883a1665e21c08d50611c7629fb8acc4ae35a8b29bf599d9c9da5b1d8cb1cf1))))

Location:
    src/awe_website_versions.rs:410:28

@maqi
Member

maqi commented Sep 10, 2024

Hi, @happybeing,

Thanks for the info supplied, it's really helpful.

I think this is why you are hitting this error:

  • When publishing a register, the client calls the inner get_record function and passes down a verify flag, so the get process compares the local target record with the fetched record by their content hash.
  • However, a register holds its entire update history, and the get query now merges all the different paths into one.
  • This results in a high chance that, after a couple of ops, the local update history will differ from the fetched one, even when their root values are the same.

This explains why the RecordDoesNotMatch error does not occur every time but appears to happen at the same point in the test script: a mismatch becomes more likely the more ops are undertaken.
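The mismatch described above can be illustrated with a toy model. History, content_hash and root_values below are hypothetical and much simpler than the MerkleReg used by sn_registers; the sketch only shows that two histories can end in the same root value while recording different parent links, so a verify step that compares content hashes reports a mismatch that a root-value comparison would not.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Toy history: node id -> (value, parent ids). Hypothetical, not sn_registers.
type History = BTreeMap<u32, (String, Vec<u32>)>;

/// Hash over the whole history, standing in for the record content hash
/// that a verify step would compare.
fn content_hash(h: &History) -> u64 {
    let mut hasher = DefaultHasher::new();
    h.hash(&mut hasher);
    hasher.finish()
}

/// Root values: nodes that no other node lists as a parent.
fn root_values(h: &History) -> Vec<&str> {
    h.iter()
        .filter(|(id, _)| !h.values().any(|(_, parents)| parents.contains(*id)))
        .map(|(_, (value, _))| value.as_str())
        .collect()
}

fn main() {
    // Local copy: v1 -> v2 -> v3 written in sequence.
    let mut local = History::new();
    local.insert(1, ("v1".to_string(), vec![]));
    local.insert(2, ("v2".to_string(), vec![1]));
    local.insert(3, ("v3".to_string(), vec![2]));

    // Fetched copy: same final value, but v3 records different parent links.
    let mut fetched = local.clone();
    fetched.insert(3, ("v3".to_string(), vec![1, 2]));

    println!("roots match:          {}", root_values(&local) == root_values(&fetched));
    println!("content hashes match: {}", content_hash(&local) == content_hash(&fetched));
}
```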

It would be very helpful, and much appreciated, if you could tweak the line at https://github.com/maidsafe/safe_network/blob/main/sn_client/src/register.rs#L845 to be:

        let verification_cfg = GetRecordCfg {
            get_quorum: Quorum::One,
            retry_strategy: Some(RetryStrategy::Quick),
            // None: don't compare the fetched record against a local target record
            target_record: None,
            expected_holders,
        };

You only need to rebuild awe with this tweaked code; the local testnet can remain untouched.
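A simplified sketch of what setting target_record to None changes, assuming (as described above) that verification compares the fetched record with a local target record. VerifyCfg, GetError and check_fetched are hypothetical names; sn_client's real GetRecordCfg has more fields, and its comparison is on records rather than raw bytes.

```rust
/// Hypothetical, cut-down mirror of a get-record config: only the field
/// the suggested tweak changes. NOT sn_client's GetRecordCfg.
struct VerifyCfg<'a> {
    target_record: Option<&'a [u8]>,
}

#[derive(Debug, PartialEq)]
enum GetError {
    RecordDoesNotMatch,
}

/// With a target, the fetched bytes must match it exactly;
/// with None, whatever comes back is accepted.
fn check_fetched(cfg: &VerifyCfg<'_>, fetched: &[u8]) -> Result<(), GetError> {
    match cfg.target_record {
        Some(expected) if expected != fetched => Err(GetError::RecordDoesNotMatch),
        _ => Ok(()),
    }
}

fn main() {
    let local: &[u8] = b"history-A";
    let fetched: &[u8] = b"history-B";

    let strict = VerifyCfg { target_record: Some(local) };
    let relaxed = VerifyCfg { target_record: None };

    println!("strict:  {:?}", check_fetched(&strict, fetched));
    println!("relaxed: {:?}", check_fetched(&relaxed, fetched));
}
```

Note that the relaxed configuration also accepts a fetched record that genuinely differs, so a tweak like this can hide real failures as well as spurious ones.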

@happybeing
Contributor Author

Thanks @maqi. That solved the issue using a local testnet, so I'm building a new release of awe with this change in the local safe_network/sn_client. Thank you! 👏

@maqi
Member

maqi commented Sep 10, 2024

I should thank you for helping us verify and pin down this issue.
Really appreciated.

@maqi
Member

maqi commented Sep 11, 2024

Here is PR #2103 trying to address this as a formal fix.

@happybeing
Contributor Author

@maqi I'm still seeing two problems related to updating registers, similar to the issue in the OP (problem 1), and another previous issue where I see different versions of a register at different times (problem 2). The behaviour has changed though in both cases.

I am still using the 'manual' patch which you suggested above to build my client. That patch appeared to fix this issue on a local network, but I am now testing against the public network with my client built using stable-2024.09.1.3 (with the patch mentioned).

Problem 1. Registers still not updating. The first issue is that registers are still failing to reflect changes, although I am not getting the error described in the OP. I've seen this happen twice with different registers, each time created to store the awe-some-sites website. What happens is that the register is created, and two entries are written and merged immediately, leaving a total of two entries. Not long after, I wrote a third entry, but it was not reflected when I accessed the register, which continued to show only two entries for at least ten minutes. I left this and came back hours later to find that the register was now showing 3 entries, and this remained the case each time I accessed it.

I wasn't sure how long it took to reflect the change so I set up a query command to check the register size every five minutes and wrote a fourth entry. After 24 hours the register is still showing only 3 entries. I tried adding another entry later the same day - again without error - and today it is still showing only 3 entries.

The API indicates that the entries are written successfully every time, unlike the situation in the OP where an error is reported and the update does not happen.

All the above operations involved running my client on a VPS (creation, writing entries and then querying to see the number of entries every 5 minutes).
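The five-minute size check can be sketched as a simple polling loop. watch_size is hypothetical, and fetch_size stands in for whatever actually reads the register size (here an awe command); no real sn_client call is implied. In real use the interval would be Duration::from_secs(300).

```rust
use std::thread;
use std::time::Duration;

/// Poll a size-reading closure `polls` times, `interval` apart, and report
/// each change. `fetch_size` is a hypothetical stand-in, not a real API.
fn watch_size<F>(mut fetch_size: F, interval: Duration, polls: usize) -> Vec<usize>
where
    F: FnMut() -> usize,
{
    let mut seen = Vec::with_capacity(polls);
    for i in 0..polls {
        let size = fetch_size();
        if seen.last() != Some(&size) {
            println!("poll {}: size is now {}", i, size);
        }
        seen.push(size);
        if i + 1 < polls {
            thread::sleep(interval); // don't sleep after the final poll
        }
    }
    seen
}

fn main() {
    // Simulated network where a third entry appears but a fourth never does.
    let mut calls = 0;
    let sizes = watch_size(
        || {
            calls += 1;
            if calls < 2 { 2 } else { 3 }
        },
        Duration::from_millis(10),
        4,
    );
    println!("{:?}", sizes);
}
```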

Problem 2. Register returns different numbers of entries. I've only seen this happen once so it is much less frequent than previously. While testing problem 1, I occasionally tried accessing the same register from my laptop (over mobile broadband) and once it returned the register but with only two entries.

You can see the status of the register live on the network yourself using awe inspect-register. The output below includes the 'audit' option which displays the register structure and shows that it only contains 3 nodes, one for each of the three entries:

$ awe inspect-register -ramd --include-files -e 1: a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
Autonomi client initialising...
Connecting to the network using 25 peers
register    : a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
owned by    : PublicKey(1759..b88c)
permissions : Writers({PublicKey(1759..b88c)})
app reg type: 5ebbbc..
size        : 3
audit       :
   current state is merged, 1 value:
   5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0
entries 1 to 2:
entry 1 - fetching metadata at d05b3f5e8c085c8f3046d5859b627fb7de5c773c26cffcfcbc75364971121a90
DEBUG get_website_metadata_from_network() at d05b3f5e8c085c8f3046d5859b627fb7de5c773c26cffcfcbc75364971121a90
DEBUG autonomi_get_file()
DEBUG calling files_download.download_from()
DEBUG Ok() return
Retrieved 141 bytes
published  : 2024-09-16 11:35:16.862655237 UTC
directories: 1
files      : 1
total bytes: 2175
1510a27adde292bf39953e1d181fdf0253b238cf09f94013b7e0c4ada8c0d50d 2024-09-16 11:33:35 "/index.html" 2175 bytes
entry 2 - fetching metadata at 5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0
DEBUG get_website_metadata_from_network() at 5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0
DEBUG autonomi_get_file()
DEBUG calling files_download.download_from()
DEBUG Ok() return
Retrieved 143 bytes
published  : 2024-09-17 11:55:57.466132193 UTC
directories: 1
files      : 1
total bytes: 2319
e8ffe587101cfdbccbfa9736952aae933a5a33c726598ade34e609399eae7aa7 2024-09-17 11:55:29 "/index.html" 2319 bytes
======================
Root (Latest) Node(s):
[ 0] Node("0"..) Entry(5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0)
======================
Register Structure:
(In general, earlier nodes are more indented)
[ 0] Node("0"..) Entry(5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0)
  [ 1] Node("1"..) Entry(d05b3f5e8c085c8f3046d5859b627fb7de5c773c26cffcfcbc75364971121a90)
    [ 2] Node("2"..) Entry(5ebbbc4f061702c875b6cacb76e537eb482713c458b9d83c2f1e86ea9e0d0d0f)
======================

Here is the output of awe when it successfully updates that register, writing a new value of e823ee142c6dbb216c58e6d3c66847fead81a6381e6cc732d26e1bdf41e91047, which is not present in the 'audit' output immediately above.

You can see that it is the correct register (a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba9900) queried above, and that the update is successful.

That update was done at 5pm on Tuesday, but the register output above shows it is still not reflected by the network at 12:40 on Wednesday.

Reading /home/safe/src/safe-browser/awe-sites/awe-some-sites-src/sites-community.txt
set LIST_NAME=Community Pioneer Websites
Reading /home/safe/src/safe-browser/awe-sites/awe-some-sites-src/sites-test.txt
set LIST_NAME=Test Websites
Inserting links and saving to /home/safe/src/safe-browser/awe-sites/awe-some-sites/content/index.html

=======================================================================================
upload_site(/home/safe/src/safe-browser/awe-sites/awe-some-sites, content)
---------------------------------------------------------------------------------------
Found register : a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
in register file: /home/safe/src/safe-browser/aweb-addresses/public-network/awe-some-sites/register-address.txt

Updating on Autonomi public from: /home/safe/src/safe-browser/awe-sites/awe-some-sites/content
Autonomi client initialising...
Connecting to the network using 25 peers
Uploading website from: "/home/safe/src/safe-browser/awe-sites/awe-some-sites/content"
Files upload attempted previously, verifying 4 chunks
4 chunks were uploaded in the past but failed to verify. Will attempt to upload them again...
"/home/safe/src/safe-browser/awe-sites/awe-some-sites/content" will be made public and linkable
Splitting and uploading "/home/safe/src/safe-browser/awe-sites/awe-some-sites/content" into 4 chunks
**************************************
*          Uploaded Files            *
**************************************
Uploaded "index.html" to address a17af7a1f1004f9996dc3b5c78fc5607e428f7a5d61a21b3510870595a586245
Among 4 chunks, found 0 already existed in network, uploaded the leftover 4 chunks in 1 minutes 22 seconds
**************************************
*          Payment Details           *
**************************************
Made payment of NanoTokens(4) for 4 chunks
Made payment of NanoTokens(4) for royalties fees
New wallet balance: 0.000000058
web publish completed files: [("/home/safe/src/safe-browser/awe-sites/awe-some-sites/content/index.html", "index.html", ChunkAddress(a17af7))
]
WEBSITE CONTENT UPLOADED:
a17af7a1f1004f9996dc3b5c78fc5607e428f7a5d61a21b3510870595a586245 "/home/safe/src/safe-browser/awe-sites/awe-some-sites/content/index.html"
DEBUG publish_website_metadata() website_root '/home/safe/src/safe-browser/awe-sites/awe-some-sites/content'
Adding '/home/safe/src/safe-browser/awe-sites/awe-some-sites/content/index.html' as '/index.html'
wallet_dir: "/home/safe/.local/share/safe/client"
Paid 0.000000001+0.000000001 to store Website metadata, now uploading...
WEBSITE METADATA UPLOADED:
awm://e823ee142c6dbb216c58e6d3c66847fead81a6381e6cc732d26e1bdf41e91047
Updating versions register a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.
website_metadata added to register: e823ee142c6dbb216c58e6d3c66847fead81a6381e6cc732d26e1bdf41e91047
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.

WEBSITE UPDATED (version 2). All versions available at XOR-URL:
awv://a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc

NOTE:
- To update this website, use 'awe update' as follows:

   awe update --update-xor a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc --website-root /home/safe/src/safe-browser/awe-sites/awe-some-sites/content

- To browse the website use 'awe awv://<XOR-ADDRESS>' as follows:

   awe awv://a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc

- For help use 'awe --help'

Metadata address added to:
  /home/safe/src/safe-browser/aweb-addresses/public-network/awe-some-sites/site-addresses.txt
Files addresses added to:
  /home/safe/src/safe-browser/aweb-addresses/public-network/awe-some-sites/file-addresses.txt
awe-update-builtins
url: 5ebbbc4f061702c875b6cacb76e537eb482713c458b9d83c2f1e86ea9e0d0d0f
Generating back-end /home/safe/src/safe-browser/awe/src-tauri/src/generated_rs/builtins_public.rs:
Clearing back-end /home/safe/src/safe-browser/awe/src-tauri/src/generated_rs/builtins_local.rs:
awe-update-builtins: SKIPPING GIT OPERATIONS - please copy /home/safe/src/safe-browser/awe/src-tauri/src/generated_rs/builtins_public.rs using vps-sync to-laptop and commit manually
Generating front-end /home/safe/src/safe-browser/awe/src/generated/builtins-public.js:
url: awv://a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
Updated builtins at:
/home/safe/src/safe-browser/awe/src-tauri/src/generated_rs/builtins_public.rs
/home/safe/src/safe-browser/awe/src/generated/builtins-public.js

@happybeing
Contributor Author

^^ @loziniak

Have you used registers on the latest public network?

@loziniak
Contributor

No, I'm struggling with local still... :-P

@maqi
Member

maqi commented Sep 19, 2024

Hi, @happybeing ,

Thanks for the further update and the detailed info provided.

I am still using the 'manual' patch which you suggested above to build my client

I'd suggest you use the above-mentioned PR #2103 to replace the manual patch.
The previously suggested patch might mute some errors that should be raised to your awe app
(it was just for a quick diagnosis to pin down the issue, not meant to be used in the long run :) ).
Because of that, it might give you the wrong impression that your update succeeded when it actually failed somewhere along the flow.

Register returns different numbers of entries

That number of entries refers to the update history, right?
MerkleReg (used internally by Register) uses a DAG structure, which can generate different nodes (entries) and branches depending on the update path.
As I understand it, as long as the root value (i.e. the one you get from the .read() function) matches the final expected value, you don't need to care about the update history (i.e. the number of entries), as it is expected to vary?
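A toy sketch of why the number of entries can depend on the update path, assuming a DAG register in which each write lists the current roots as its parents. DagReg and its methods are hypothetical, not the MerkleReg API: two replicas writing concurrently produce two root values, and the next write merges them, leaving a single root but more history entries than a purely linear sequence of writes would.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical toy DAG register: entry value -> values of its parent entries.
#[derive(Clone, Debug, Default)]
struct DagReg {
    nodes: BTreeMap<String, BTreeSet<String>>,
}

impl DagReg {
    /// Current root values: entries that no other entry lists as a parent.
    fn read(&self) -> BTreeSet<String> {
        let parents: BTreeSet<&String> = self.nodes.values().flatten().collect();
        self.nodes.keys().filter(|k| !parents.contains(k)).cloned().collect()
    }

    /// Write on top of every current root, which also merges any branches.
    fn write(&mut self, value: &str) {
        let parents = self.read();
        self.nodes.insert(value.to_string(), parents);
    }

    /// Combine another replica's history into this one (union of nodes).
    fn merge(&mut self, other: &DagReg) {
        for (k, p) in &other.nodes {
            self.nodes.entry(k.clone()).or_default().extend(p.iter().cloned());
        }
    }
}

fn main() {
    let mut a = DagReg::default();
    a.write("v1");
    let mut b = a.clone();

    // Two replicas write concurrently, creating two branches...
    a.write("v2a");
    b.write("v2b");
    a.merge(&b);
    println!("roots after merge: {:?}", a.read());

    // ...and the next write merges them back into a single root.
    a.write("v3");
    println!("roots: {:?}, entries: {}", a.read(), a.nodes.len());
}
```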

@happybeing
Contributor Author

As I understand, as long as the root value (i.e. the one you got from .read() function) matches the final expected value, you don't need to care about the update history (i.e. number of entries) ? as it is supposed to be vary ?

That may be the case for some uses but not in all cases. If you want a version history (e.g. versioned web, versioned file-system) then you need to access the history, not just the final merged value.
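For version-history use cases the whole DAG has to be walked, not just the root value read. A minimal sketch, assuming the history is available as a map from each entry to its parent entries: versions_oldest_first is a hypothetical helper that lists every entry reachable from the given roots, parents before children. The example keys are the truncated entry prefixes from the audit output above.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical history map: entry -> parent entries.
type History = BTreeMap<String, Vec<String>>;

/// List every entry reachable from `roots`, emitting parents before children
/// (a depth-first post-order walk). Entries shared by branches appear once.
fn versions_oldest_first(history: &History, roots: &[&str]) -> Vec<String> {
    fn visit(node: &str, history: &History, seen: &mut HashSet<String>, out: &mut Vec<String>) {
        if !seen.insert(node.to_string()) {
            return; // already emitted via another branch
        }
        if let Some(parents) = history.get(node) {
            for p in parents {
                visit(p, history, seen, out);
            }
        }
        out.push(node.to_string());
    }

    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for r in roots {
        visit(r, history, &mut seen, &mut out);
    }
    out
}

fn main() {
    // Shape of the audited register: 5ebbbc.. <- d05b3f.. <- 5fb227.. (root).
    let mut history = History::new();
    history.insert("5ebbbc".into(), vec![]);
    history.insert("d05b3f".into(), vec!["5ebbbc".into()]);
    history.insert("5fb227".into(), vec!["d05b3f".into()]);

    for (i, v) in versions_oldest_first(&history, &["5fb227"]).iter().enumerate() {
        println!("entry {}: {}", i, v);
    }
}
```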

MerkleReg (used by Register internally) uses a DAG structure internally, which could generate different nodes (entries) / branches due to different update path.

I don't think that is the explanation in my use case but can't be sure. However, perhaps the issue is related to your first point - my code still using the patch. I was awaiting the merged PR before doing any more so we can see if this error goes away once that is in stable and the network is once again reset/updated.

The use cases for Registers, their API and the network implementation are still undefined, and 'launch' is supposedly one month away. 🤷‍♂️

@happybeing
Contributor Author

On a local testnet built against safe_network stable-2024.10.1.2 I am still getting the following error for some attempts to update a Register:

Failed to update website version: Failed to add XorName to register: Network(GetRecordError(RecordDoesNotMatch(0cb113(9f0430935c26b71d82f284a8bccdd9e28b6155bec7ff3c5c6d07b10fb38f4561))))

I attach the full log of my local test output which includes setting up the local testnet and then running awe to upload several websites. This includes both creating and in some cases updating the website.
Thu 3 Oct 16:10:04 BST 2024-awelog-upload-local.gz

@RolandSherwin
Member

Hey @happybeing, does this PR #2270 fix the issue for you?

@happybeing
Contributor Author

Thanks for the heads-up.

I don't have much time for testing, and I'm not sure whether my app has been broken by recent EVM/API changes, so it may take a while, but I hope to test it once it is in stable.


4 participants