-
Notifications
You must be signed in to change notification settings - Fork 25
Move instance timeseries definitions to TOML #724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Use central TOML definitions for the vCPU usage and PVPANIC timeseries. Note that this change is _breaking_ because we pull in the backwards-incompatible definitions that include the sled identifiers. - Add the sled-identifiers into the `InstanceMetadata` passed to Proplis via its instance-ensure API request. The sled-agent can provide those from its populated baseboard when provisioning instances. - Support letting the oximeter producer server use internal DNS to look up Nexus for metric registration. This helps avoid failures to register if the Nexus we use originally disappears during reconfiguration of those services. This doesn't happen yet, but the work is ongoing in Omicron. - Add more ergonomic argument to control metric registration via `propolis-cli`, ensuring backwards compatibility. Also add arguments for the sled-identifiers.
I expect this to fail the PHD migration tests in the same way that #659 did, because I've added some new metadata to the instance-ensure request that the code running on Once we're happy with this, I'll include a PR in Omicron that updates the pinned Propolis here, and also deletes the past incompatible timeseries definitions. |
pub sled_id: Uuid, | ||
pub sled_serial: String, | ||
pub sled_revision: u32, | ||
pub sled_model: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a (phd?) test for checking that these values are updated when an instance is migrated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure that seems good. I've not written any PHD tests, so might take me a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll probably want to start by looking at the existing PHD tests for live migration. Hopefully, the test-support library should be pretty straightforward, but do let me know if you have questions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test in 4d5453a. It feels a bit contrived, since we're mostly testing the internal TestVm
's handling of instance metadata, so I'm on the fence about it. Happy to keep or remove, depending on what others think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the caveat that I haven't looked at the non-PHD parts of the PR very carefully yet: I think the test would feel less contrived if the metadata were part of struct VmSpec
instead of being randomly generated in start_local_vm
. I'd arrange that like this:
- add an
InstanceMetadata
tostruct VmSpec
- have
VmConfig::vm_spec
generate random IDs (right now this is instart_local_vm
) - add a
VmSpec
function that generates new sled IDs/revisions (in your change this is happening ininstance_ensure_internal
); call this function fromFramework::spawn_successor_vm
(just like the call torefresh_crucible_backends
)
Then most of the changes to struct TestVm
can go away, and your new test would look something like this:
async fn migration_ensures_instance_metadata(ctx: &Framework) {
// Create a source instance, and fetch the instance metadata its metrics are
// generated with.
let mut source = ctx
.spawn_default_vm("migration_ensures_instance_metadata_source")
.await?;
let mut target = ctx
.spawn_successor_vm(
"migration_ensures_instance_metadata_target",
&source,
None,
)
.await?;
source.launch().await?;
source.wait_to_boot().await?;
let source_expected = source.vm_spec().metadata;
let source_metadata = source.get_spec().await?.properties.metadata;
assert_eq!(source_metadata, source_expected);
// Migrate the instance to a new server, and refetch the metadata.
target
.migrate_from(&source, Uuid::new_v4(), MigrationTimeout::default())
.await?;
let target_expected = target.vm_spec().metadata;
let target_metadata = target.get_spec().await?.properties.metadata;
assert_eq!(target_metadata, target_expected);
// gjc: I might not keep the "project and silo IDs shouldn't change"
// requirement; the control plane has to uphold that, not Propolis. You
// could have a unit test for `VmSpec::refresh_sled_info` that makes sure
// this invariant is upheld, though.
}
WDYT? (Does this suggestion make any sense in view of what's going on in the rest of the PR? It very well may not, in which case you should feel free to ignore it :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion @gjcolombo. That does seem slightly less contrived, in that at least the manipulation of the source and target metadata is contained and mostly transparent to the tests. I've made those updates in 7981b96, with a few minor visibility modifications to glue it all together. Thanks!
- Group imports - Pass instance metadata to run method, rather than disaggregated arguments - Remove unnecessary `MetricsEndpointConfig::new()`, use direct initialization
- Add PHD test verifying instance timeseries metadata across a migration - Add note pointing to the Omicron files containing TOML timeseries definitions.
Improve the PHD test by making the relationship between the source / target instance metadata more transparent to the actual test framework and each propolis server. They just "get" whatever the spec tells them, similar to the sled-agent providing the metadata to their ensure API calls, and dutifully apply the data to the right instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a couple of little comments about the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks good to me! I commented on a couple pretty minor nitpicks, but they're not terribly important.
/// Logging level for the server | ||
#[clap(long, default_value_t = slog::Level::Info, value_parser = parse_log_level)] | ||
log_level: slog::Level, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this feels like a good idea, but it also seems pretty much completely unrelated to this change --- what's the motivation for doing it as part of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to see the oximeter_producer::Server
log messages to confirm that it was appropriately using internal DNS. That's all logged at debug
or trace
, so I needed a way to increase verbosity at the top level.
- Ignore case in metric registration parameter - Default values - Log at info
Thanks for the input everyone! The only issue here is the expected failure to migrate between this branch and the merge base, since I changed the API. Merging shortly! |
- Pulls in oxidecomputer/propolis#724, which added sled-identifiers to the virtual machine timeseries. One part of #5267. - Updates requried Crucible dependency. - Expunge previous schema and data for the virtual machine timeseries, as the new schema is incompatible.
- Pulls in oxidecomputer/propolis#724, which added sled-identifiers to the virtual machine timeseries. One part of #5267. - Updates requried Crucible dependency. - Expunge previous schema and data for the virtual machine timeseries, as the new schema is incompatible.
- Pulls in oxidecomputer/propolis#724, which added sled-identifiers to the virtual machine timeseries. One part of #5267. - Updates requried Crucible dependency. - Expunge previous schema and data for the virtual machine timeseries, as the new schema is incompatible.
InstanceMetadata
passed to Proplis via its instance-ensure API request. The sled-agent can provide those from its populated baseboard when provisioning instances.propolis-cli
, ensuring backwards compatibility. Also add arguments for the sled-identifiers.