Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(NODE-1474): refine config tool and add config versioning #2299

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

andrewbattat
Copy link
Member

@andrewbattat andrewbattat commented Oct 28, 2024

Changes:

  • Rename and change type to bool: nns_public_key_exists, node_operator_private_key_path, use_ssh_authorized_keys, inject_ic_crypto, inject_ic_state, inject_ic_registry_local_store,
  • Add GuestOSDevSettings fields: hostname (will be used to manually set hostname in testing), generate_ic_boundary_tls_cert (used for creating a testnet with API BNs)
  • Add config commands: GenerateGuestosConfig, GenerateTestnetConfig
  • Add config_version field, config version unit tests, and defined protocol for updating config
  • Uncouple domain_name from ipv4_config

@andrewbattat andrewbattat self-assigned this Oct 28, 2024
@github-actions github-actions bot added the feat label Oct 28, 2024
@andrewbattat andrewbattat marked this pull request as ready for review October 28, 2024 20:06
@andrewbattat andrewbattat requested a review from a team as a code owner October 28, 2024 20:06
@github-actions github-actions bot added the @node label Oct 28, 2024
rs/ic_os/config/src/generate_testnet_config.rs Outdated Show resolved Hide resolved
};

// Assemble GuestOSConfig
let guestos_config = GuestOSConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider creating a function which returns Result, testing would be easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generate_testnet_config does return a Result. Are you saying to actually return the file path or the contents written to the file rather than just ()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I missed the whole point in my comment, what I meant was a function that returns Result. The tests have to initialize a NamedTempFile (and we can't even nicely test the contents without parsing the output file). If you split the implementation into two, it'd be easier to test the contents of the created GuestOSConfig.

rs/ic_os/config/src/main.rs Show resolved Hide resolved
Comment on lines 140 to 264
"elasticsearch_tags": "tag1 tag2"
},
"nns_public_key_exists": true,
"nns_urls": [
"http://localhost"
],
"node_operator_private_key_exists": true,
"use_ssh_authorized_keys": false,
"icos_dev_settings": {}
},
"hostos_settings": {
"vm_memory": 490,
"vm_cpu": "kvm",
"verbose": false
},
"guestos_settings": {
"inject_ic_crypto": false,
"inject_ic_state": false,
"inject_ic_registry_local_store": false,
"guestos_dev_settings": {
"backup_spool": {
"backup_retention_time_seconds": 3600,
"backup_purging_interval_seconds": 600
},
"malicious_behavior": null,
"query_stats_epoch_length": 1000,
"bitcoind_addr": "127.0.0.1:8333",
"jaeger_addr": "127.0.0.1:6831",
"socks_proxy": "127.0.0.1:1080",
"hostname": "my-node"
}
}
}
"#;

const GUESTOS_CONFIG_JSON_V1_0_0: &str = r#"
{
"network_settings": {
"ipv6_config": {
"Fixed": {
"address": "2a00:fb01:400:200::2/64",
"gateway": "2a00:fb01:400:200::1"
}
},
"ipv4_config": null
},
"icos_settings": {
"config_version": "1.0.0",
"mgmt_mac": "ec:2a:72:31:a2:0c",
"deployment_environment": "Mainnet",
"logging": {
"elasticsearch_hosts": "elasticsearch-node-0.mercury.dfinity.systems:443",
"elasticsearch_tags": "tag1 tag2"
},
"nns_public_key_exists": true,
"nns_urls": [
"http://localhost"
],
"node_operator_private_key_exists": true,
"use_ssh_authorized_keys": false,
"icos_dev_settings": {}
},
"guestos_settings": {
"inject_ic_crypto": true,
"inject_ic_state": true,
"inject_ic_registry_local_store": true,
"guestos_dev_settings": {
"backup_spool": {
"backup_retention_time_seconds": 7200,
"backup_purging_interval_seconds": 1200
},
"malicious_behavior": null,
"query_stats_epoch_length": 2000,
"bitcoind_addr": "127.0.0.1:8333",
"jaeger_addr": "127.0.0.1:6831",
"socks_proxy": "127.0.0.1:1080",
"hostname": "guest-node"
}
}
}
"#;

#[test]
fn test_deserialize_hostos_config_v1_0_0() -> Result<(), Box<dyn std::error::Error>> {
let config: HostOSConfig = serde_json::from_str(HOSTOS_CONFIG_JSON_V1_0_0)?;
assert_eq!(config.icos_settings.config_version, "1.0.0");
assert_eq!(config.hostos_settings.vm_cpu, "kvm");
Ok(())
}

#[test]
fn test_deserialize_guestos_config_v1_0_0() -> Result<(), Box<dyn std::error::Error>> {
let config: GuestOSConfig = serde_json::from_str(GUESTOS_CONFIG_JSON_V1_0_0)?;
assert_eq!(config.icos_settings.config_version, "1.0.0");
assert_eq!(
config.icos_settings.mgmt_mac.to_string(),
"ec:2a:72:31:a2:0c"
);
Ok(())
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simplest solution is to just maintain config backwards compatibility via unit tests (and system tests once we have those going).

When someone updates config, they will increment the CONFIG_VERSION field and add deserialization unit tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it a little more and I have a few concerns with this approach:

  • When would the version be bumped? After every change (addition, removal)?
  • What happens if a type changes which is defined outside types.rs?
  • How would the unit tests ensure that an older version of the code can read a newer version of the json?
  • Is the plan to add a json constant every time a new version is introduced?

Copy link
Member Author

@andrewbattat andrewbattat Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would the version be bumped? After every change (addition, removal)?

That's what I was thinking, yes:
//! - **Updating CONFIG_VERSION**: Always update the CONFIG_VERSION constant (increment the minor version) whenever you modify the configuration.

What happens if a type changes which is defined outside types.rs?
That's a great point. The types defined outside of types.rs:

  • malicous_behavior (this is only used in testing, so I don't think it's an issue. It should always be null in production)
  • FormattedMacAddress (FormattedMacAddress is just a String wrapper, but even if it were to change, the unit tests would catch any deserialization issues that arise with the old config)
    pub struct FormattedMacAddress(String);

How would the unit tests ensure that an older version of the code can read a newer version of the json?

The deserialization code is updated on HostOS and GuestOS updates, so this shouldn't be an issue. Therefore, each version of deserialize_config only needs backwards compatibility, not forwards compatibility, in case we did want to change deserialize_config.

Is the plan to add a json constant every time a new version is introduced?

That was what I was thinking, yes. If you're introducing a new version, add a GUESTOS_CONFIG_JSON_V1_1_0 and HOSTOS_CONFIG_JSON_V1_1_0 constant and a unit test for each.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Therefore, each version of deserialize_config only needs backwards compatibility, not forwards compatibility

Are you sure? What if we deploy a newer HostOS and then retrograde the GuestOS? Then an older GuestOS would need to understand a config "from the future".

For example, say we have an ip_addr field which we rename to ip_address at some point. Even if we implement reading from both ip_addr and ip_address in the newer version to be backwards compatible, we'd still have problems if serialize the config using the newer field name (ip_address) which the older GuestOS would not understand. Or am I missing something?

add a GUESTOS_CONFIG_JSON_V1_1_0 and HOSTOS_CONFIG_JSON_V1_1_0 constant and a unit test for each.

What does the unit test check? Just that deserialization is successful or also that all fields have been preserved? In the latter case, the unit tests would be huge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? What if we deploy a newer HostOS and then retrograde the GuestOS? Then an older GuestOS would need to understand a config "from the future".

Ah, yes, you're right. I'll add a note capturing this and emphasize that deserialize_config should only be updated when necessary (and it's very short so I doubt it will be) and that if it is updated, to not do it in tandem with config updates. Are we comfortable with this risk?

What does the unit test check? Just that deserialization is successful or also that all fields have been preserved?

Just that deserialization is successful. Does this seem sufficient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we comfortable with this risk?

I'm concerned about the following situation (lmk if I'm wrong): we have a logging field in ICOSSettings. One day someone comes and says "hmm we don't need to configure logging anymore" and they remove the logging field while bumping the version and adding a unittest. The change roles out to a newly installed HostOS and the machine is added to a subnet which runs a GuestOS version without the change. This older version will error out / crash while parsing the config because of the missing logging field and we may not even be able to easily reproduce the problem.

Just that deserialization is successful. Does this seem sufficient?

I see, so it protects us from adding a new required field without a default value. Makes sense. We may want to consider marking every field as optional or defaulted (which is what protocol buffers do). That would address my concern above.

Copy link
Member Author

@andrewbattat andrewbattat Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about the following situation...

Arff. You're absolutely right. So our options for handling this:

  • Set policy: you cannot remove a non-optional/non-default field unless you are absolutely sure it's no longer in use anywhere on mainnet (and sure that we won't rollback to a version that uses it). ANALYSIS: I actually think this is manageable. I suspect we won't be removing a lot of fields. And the impact of hitting issue is not catastrophic. As you just outlined, it's a downgrade issue impacting freshly-deployed nodes. Removing fields should never introduce an upgrade compatibility risk.
  • Make every field optional or have a default value. ANALYSIS: I don't love this. What do we do for ipv6_config? This is not optional, and we don't want default values for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless you are absolutely sure it's no longer in use anywhere on mainnet

When removing a field, we'd need to remove the field from the guestos deserialization (while keeping it in the hostos), wait until all guests have been updated and only then remove the field.

This is not optional, and we don't want default values for it.

Fair, if we want to refactor it in the future, we'll have to keep populating the old field as well.

I believe the correct policy is to say that required fields must not be removed directly. In a first step, they have to be made optional and code that reads the value must be removed/handle missing values. In a second step, after the first step has rolled out to all guestos-s, the field can be removed. We need to keep a list of removed field names to ensure they're never reused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, it'd also be a good policy that we require a thorough review process on these files to ensure that changes are long-term and correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I've updated the Configuration Update Protocol:

//! - Removing Fields: To prevent backwards-compatibility deserialization errors, required fields must not be removed directly:
//! In a first step, they have to be made optional and code that reads the value must be removed/handle missing values.
//! In a second step, after the first step has rolled out to all OSes and there is no risk of a rollback, the field can be removed.

Regarding this:

We need to keep a list of removed field names to ensure they're never reused.

I actually believe the unit tests should handle this. Consider field X is a String in unit tests. Then it's removed from config, and later, a new field named X is added that's an integer. When this new field X is added to the config, the old unit test will fail as it's unable to deserialize the String as an int.

In addition, it'd also be a good policy that we require a thorough review process on these files to ensure that changes are long-term and correct.

I agree. Is there any way to formalize this? I added to the config readme:

Any changes to the configuration should undergo a thorough review process to ensure they follow the guidlines.

@andrewbattat andrewbattat changed the title feat(NODE-1474): refine config tool feat(NODE-1474): refine config tool and add config versioning Oct 30, 2024
..Default::default()
};

let result = generate_testnet_config(args, PathBuf::from("/tmp/guestos_config.json"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not 100% optimal to do I/O during a unit test (unless it's the I/O that we're testing) but it's also not too bad. Optimally, we'd have a function which returns Result instead of Result<()> which we's used for unit testing. But I'm fine with the current approach as well because it's simpler and we can always refactor the code in the future.

rs/ic_os/config/src/generate_testnet_config.rs Outdated Show resolved Hide resolved
rs/ic_os/config/src/generate_testnet_config.rs Outdated Show resolved Hide resolved
rs/ic_os/config/src/generate_testnet_config.rs Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants