Move instance timeseries definitions to TOML #724

bnaecker · 2024-07-22T19:23:08Z

- Use central TOML definitions for the vCPU usage and PVPANIC timeseries. Note that this change is _breaking_ because we pull in the backwards-incompatible definitions that include the sled identifiers.
- Add the sled identifiers to the `InstanceMetadata` passed to Propolis via its instance-ensure API request. The sled-agent can provide those from its populated baseboard when provisioning instances.
- Support letting the oximeter producer server use internal DNS to look up Nexus for metric registration. This helps avoid failures to register if the Nexus we originally used disappears during reconfiguration of those services. This doesn't happen yet, but the work is ongoing in Omicron.
- Add a more ergonomic argument to control metric registration via `propolis-cli`, ensuring backwards compatibility. Also add arguments for the sled identifiers.
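
As a rough illustration of the registration options described above (disabled, resolved through internal DNS, or pinned to an explicit address), here is a minimal sketch. The `MetricRegistration` name, its variants, and the accepted keywords are assumptions made for this example and are not taken from the PR's actual code.

```rust
use std::net::SocketAddr;
use std::str::FromStr;

/// Hypothetical stand-in for how the producer server locates Nexus.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MetricRegistration {
    /// Do not register for metric collection at all.
    Disable,
    /// Resolve Nexus through internal DNS at registration time.
    Dns,
    /// Register with Nexus at an explicit socket address.
    Address(SocketAddr),
}

impl FromStr for MetricRegistration {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Keywords are matched case-insensitively (an assumption here);
        // anything else must parse as a socket address.
        match s.to_ascii_lowercase().as_str() {
            "disable" => Ok(Self::Disable),
            "dns" => Ok(Self::Dns),
            other => other.parse::<SocketAddr>().map(Self::Address).map_err(|e| {
                format!("expected 'disable', 'dns', or a socket address: {e}")
            }),
        }
    }
}
```

Keeping the explicit-address variant is one way to stay backwards compatible with callers that already pass a Nexus address directly.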

bnaecker · 2024-07-22T19:25:05Z

I expect this to fail the PHD migration tests in the same way that #659 did, because I've added some new metadata to the instance-ensure request that the code running on main will neither provide nor understand. We're still expecting to stop all instances during an upgrade, so this should be acceptable, but I'd love a gut check on that too.

Once we're happy with this, I'll include a PR in Omicron that updates the pinned Propolis here, and also deletes the past incompatible timeseries definitions.

bin/propolis-server/src/lib/server.rs

bin/propolis-server/src/lib/stats/pvpanic.rs

bin/propolis-server/src/lib/stats/virtual_machine.rs

openapi/propolis-server.json

bin/propolis-cli/src/main.rs

pfmooney · 2024-07-22T19:37:12Z

crates/propolis-api-types/src/lib.rs

+    pub sled_id: Uuid,
+    pub sled_serial: String,
+    pub sled_revision: u32,
+    pub sled_model: String,


Should we add a (phd?) test for checking that these values are updated when an instance is migrated?

Sure that seems good. I've not written any PHD tests, so might take me a bit.

You'll probably want to start by looking at the existing PHD tests for live migration. Hopefully, the test-support library should be pretty straightforward, but do let me know if you have questions!

I added a test in 4d5453a. It feels a bit contrived, since we're mostly testing the internal TestVm's handling of instance metadata, so I'm on the fence about it. Happy to keep or remove, depending on what others think.

With the caveat that I haven't looked at the non-PHD parts of the PR very carefully yet: I think the test would feel less contrived if the metadata were part of `struct VmSpec` instead of being randomly generated in `start_local_vm`. I'd arrange that like this:

- add an `InstanceMetadata` to `struct VmSpec`
- have `VmConfig::vm_spec` generate random IDs (right now this is in `start_local_vm`)
- add a `VmSpec` function that generates new sled IDs/revisions (in your change this is happening in `instance_ensure_internal`); call this function from `Framework::spawn_successor_vm` (just like the call to `refresh_crucible_backends`)

Then most of the changes to `struct TestVm` can go away, and your new test would look something like this:

    async fn migration_ensures_instance_metadata(ctx: &Framework) {
        // Create a source instance, and fetch the instance metadata its metrics are
        // generated with.
        let mut source = ctx
            .spawn_default_vm("migration_ensures_instance_metadata_source")
            .await?;
        let mut target = ctx
            .spawn_successor_vm(
                "migration_ensures_instance_metadata_target",
                &source,
                None,
            )
            .await?;

        source.launch().await?;
        source.wait_to_boot().await?;
        let source_expected = source.vm_spec().metadata;
        let source_metadata = source.get_spec().await?.properties.metadata;
        assert_eq!(source_metadata, source_expected);

        // Migrate the instance to a new server, and refetch the metadata.
        target
            .migrate_from(&source, Uuid::new_v4(), MigrationTimeout::default())
            .await?;
        let target_expected = target.vm_spec().metadata;
        let target_metadata = target.get_spec().await?.properties.metadata;
        assert_eq!(target_metadata, target_expected);

        // gjc: I might not keep the "project and silo IDs shouldn't change"
        // requirement; the control plane has to uphold that, not Propolis. You
        // could have a unit test for `VmSpec::refresh_sled_info` that makes sure
        // this invariant is upheld, though.
    }

WDYT? (Does this suggestion make any sense in view of what's going on in the rest of the PR? It very well may not, in which case you should feel free to ignore it :))

Thanks for the suggestion @gjcolombo. That does seem slightly less contrived, in that at least the manipulation of the source and target metadata is contained and mostly transparent to the tests. I've made those updates in 7981b96, with a few minor visibility modifications to glue it all together. Thanks!
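
For readers following the thread, the invariant discussed above (a migration refreshes the sled identifiers while the silo and project IDs stay fixed) can be sketched in isolation. This is only an illustration: the struct layout, the `SledIdentifiers` helper, and the use of `u128` in place of `Uuid` are assumptions chosen to keep the example dependency-free, and `refresh_sled_info` is written here as a method on the metadata rather than on `VmSpec`.

```rust
/// Hypothetical mirror of the instance metadata. IDs are plain `u128`
/// values so the sketch needs no external crates; the real fields are UUIDs.
#[derive(Debug, Clone, PartialEq, Eq)]
struct InstanceMetadata {
    silo_id: u128,
    project_id: u128,
    sled_id: u128,
    sled_serial: String,
    sled_revision: u32,
    sled_model: String,
}

/// Hypothetical description of the sled an instance is being placed on.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SledIdentifiers {
    id: u128,
    serial: String,
    revision: u32,
    model: String,
}

impl InstanceMetadata {
    /// Overwrite only the sled-specific fields, leaving the silo and
    /// project IDs untouched.
    fn refresh_sled_info(&mut self, sled: &SledIdentifiers) {
        self.sled_id = sled.id;
        self.sled_serial = sled.serial.clone();
        self.sled_revision = sled.revision;
        self.sled_model = sled.model.clone();
    }
}
```

A unit test along these lines would refresh a copy of the metadata with a different sled and assert that only the four `sled_*` fields changed.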

Improve the PHD test by making the relationship between the source / target instance metadata more transparent to the actual test framework and each propolis server. They just "get" whatever the spec tells them, similar to the sled-agent providing the metadata to their ensure API calls, and dutifully apply the data to the right instance.

gjcolombo

LGTM with a couple of little comments about the comments.

bin/propolis-server/src/lib/stats.rs

phd-tests/tests/src/migrate.rs

hawkw

Overall, this looks good to me! I commented on a couple pretty minor nitpicks, but they're not terribly important.

bin/propolis-cli/src/main.rs

bin/propolis-server/src/lib/stats/pvpanic.rs

bin/propolis-server/src/main.rs

hawkw · 2024-07-23T22:21:33Z

bin/propolis-server/src/main.rs

+        /// Logging level for the server
+        #[clap(long, default_value_t = slog::Level::Info, value_parser = parse_log_level)]
+        log_level: slog::Level,


Adding this feels like a good idea, but it also seems pretty much completely unrelated to this change --- what's the motivation for doing it as part of this PR?

I wanted to see the oximeter_producer::Server log messages to confirm that it was appropriately using internal DNS. That's all logged at debug or trace, so I needed a way to increase verbosity at the top level.
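
The diff above references a `parse_log_level` value parser without showing its body. A minimal sketch of what such a parser might look like follows; the local `Level` enum is a stand-in for `slog::Level` so the example has no external dependencies, and the accepted spellings and case-insensitive matching are assumptions rather than the PR's actual behavior.

```rust
/// Stand-in for `slog::Level`, so the sketch needs no external crates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Critical,
    Error,
    Warning,
    Info,
    Debug,
    Trace,
}

/// Parse a log level from a command-line string, ignoring ASCII case.
/// The signature (`&str` in, `Result` out) is the shape a clap
/// `value_parser` function is expected to have.
fn parse_log_level(s: &str) -> Result<Level, String> {
    match s.to_ascii_lowercase().as_str() {
        "critical" => Ok(Level::Critical),
        "error" => Ok(Level::Error),
        "warning" | "warn" => Ok(Level::Warning),
        "info" => Ok(Level::Info),
        "debug" => Ok(Level::Debug),
        "trace" => Ok(Level::Trace),
        other => Err(format!("invalid log level: '{other}'")),
    }
}
```

With something like this in place, running the server at `debug` or `trace` would surface the producer server's registration messages mentioned above.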

bin/propolis-server/src/main.rs

bnaecker · 2024-07-24T17:26:33Z

Thanks for the input everyone! The only issue here is the expected failure to migrate between this branch and the merge base, since I changed the API. Merging shortly!

- Pulls in oxidecomputer/propolis#724, which added sled identifiers to the virtual machine timeseries. One part of #5267.
- Updates the required Crucible dependency.
- Expunges the previous schema and data for the virtual machine timeseries, as the new schema is incompatible.

bnaecker requested review from pfmooney and hawkw July 22, 2024 19:23

pfmooney reviewed Jul 22, 2024

bnaecker added 4 commits July 22, 2024 19:39

Update falcon OpenAPI spec

ec187cd

Review feedback

8a71e3d

- Group imports
- Pass instance metadata to run method, rather than disaggregated arguments
- Remove unnecessary `MetricsEndpointConfig::new()`, use direct initialization

Review feedback

4d5453a

- Add PHD test verifying instance timeseries metadata across a migration
- Add note pointing to the Omicron files containing TOML timeseries definitions.

bnaecker requested review from gjcolombo and pfmooney July 22, 2024 23:59

gjcolombo approved these changes Jul 23, 2024

bin/propolis-server/src/lib/stats.rs (outdated)

phd-tests/tests/src/migrate.rs (outdated)

Review feedback: remove old TODO, move comments

4e6ebde

hawkw approved these changes Jul 23, 2024

bnaecker added 2 commits July 24, 2024 16:36

Review feedback

56792ba

- Ignore case in metric registration parameter
- Default values
- Log at info

Merge branch 'master' into move-instance-timeseries-to-toml

1a23d76

bnaecker merged commit 923c384 into master Jul 24, 2024
9 of 10 checks passed

bnaecker deleted the move-instance-timeseries-to-toml branch July 24, 2024 17:27

bnaecker mentioned this pull request Jul 26, 2024

Fix typo in server metric CLI argument #728

Merged
