Skip to content

Conversation

bnaecker
Copy link
Contributor

  • Use central TOML definitions for the vCPU usage and PVPANIC timeseries. Note that this change is breaking because we pull in the backwards-incompatible definitions that include the sled identifiers.
  • Add the sled-identifiers into the InstanceMetadata passed to Proplis via its instance-ensure API request. The sled-agent can provide those from its populated baseboard when provisioning instances.
  • Support letting the oximeter producer server use internal DNS to look up Nexus for metric registration. This helps avoid failures to register if the Nexus we use originally disappears during reconfiguration of those services. This doesn't happen yet, but the work is ongoing in Omicron.
  • Add more ergonomic argument to control metric registration via propolis-cli, ensuring backwards compatibility. Also add arguments for the sled-identifiers.

- Use central TOML definitions for the vCPU usage and PVPANIC
  timeseries. Note that this change is _breaking_ because we pull in the
  backwards-incompatible definitions that include the sled identifiers.
- Add the sled-identifiers into the `InstanceMetadata` passed to Proplis
  via its instance-ensure API request. The sled-agent can provide those
  from its populated baseboard when provisioning instances.
- Support letting the oximeter producer server use internal DNS to look
  up Nexus for metric registration. This helps avoid failures to
  register if the Nexus we use originally disappears during
  reconfiguration of those services. This doesn't happen yet, but the
  work is ongoing in Omicron.
- Add more ergonomic argument to control metric registration via
  `propolis-cli`, ensuring backwards compatibility. Also add arguments
  for the sled-identifiers.
@bnaecker bnaecker requested review from pfmooney and hawkw July 22, 2024 19:23
@bnaecker
Copy link
Contributor Author

bnaecker commented Jul 22, 2024

I expect this to fail the PHD migration tests in the same way that #659 did, because I've added some new metadata to the instance-ensure request that the code running on main will neither provide nor understand. We're still expecting to stop all instances during an upgrade, so this should be acceptable, but I'd love a gut check on that too.

Once we're happy with this, I'll include a PR in Omicron that updates the pinned Propolis here, and also deletes the past incompatible timeseries definitions.

Comment on lines +39 to +42
pub sled_id: Uuid,
pub sled_serial: String,
pub sled_revision: u32,
pub sled_model: String,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a (phd?) test for checking that these values are updated when an instance is migrated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that seems good. I've not written any PHD tests, so might take me a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably want to start by looking at the existing PHD tests for live migration. Hopefully, the test-support library should be pretty straightforward, but do let me know if you have questions!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test in 4d5453a. It feels a bit contrived, since we're mostly testing the internal TestVm's handling of instance metadata, so I'm on the fence about it. Happy to keep or remove, depending on what others think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the caveat that I haven't looked at the non-PHD parts of the PR very carefully yet: I think the test would feel less contrived if the metadata were part of struct VmSpec instead of being randomly generated in start_local_vm. I'd arrange that like this:

  • add an InstanceMetadata to struct VmSpec
  • have VmConfig::vm_spec generate random IDs (right now this is in start_local_vm)
  • add a VmSpec function that generates new sled IDs/revisions (in your change this is happening in instance_ensure_internal); call this function from Framework::spawn_successor_vm (just like the call to refresh_crucible_backends)

Then most of the changes to struct TestVm can go away, and your new test would look something like this:

async fn migration_ensures_instance_metadata(ctx: &Framework) {
    // Create a source instance, and fetch the instance metadata its metrics are
    // generated with.
    let mut source = ctx
        .spawn_default_vm("migration_ensures_instance_metadata_source")
        .await?;
    let mut target = ctx
        .spawn_successor_vm(
            "migration_ensures_instance_metadata_target",
            &source,
            None,
        )
        .await?;

    source.launch().await?;
    source.wait_to_boot().await?;
    let source_expected = source.vm_spec().metadata;
    let source_metadata = source.get_spec().await?.properties.metadata;
    assert_eq!(source_metadata, source_expected);

    // Migrate the instance to a new server, and refetch the metadata.
    target
        .migrate_from(&source, Uuid::new_v4(), MigrationTimeout::default())
        .await?;
    let target_expected = target.vm_spec().metadata;
    let target_metadata = target.get_spec().await?.properties.metadata;
    assert_eq!(target_metadata, target_expected);

    // gjc: I might not keep the "project and silo IDs shouldn't change"
    // requirement; the control plane has to uphold that, not Propolis. You
    // could have a unit test for `VmSpec::refresh_sled_info` that makes sure
    // this invariant is upheld, though.
}

WDYT? (Does this suggestion make any sense in view of what's going on in the rest of the PR? It very well may not, in which case you should feel free to ignore it :))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion @gjcolombo. That does seem slightly less contrived, in that at least the manipulation of the source and target metadata is contained and mostly transparent to the tests. I've made those updates in 7981b96, with a few minor visibility modifications to glue it all together. Thanks!

bnaecker added 4 commits July 22, 2024 19:39
- Group imports
- Pass instance metadata to run method, rather than disaggregated
  arguments
- Remove unnecessary `MetricsEndpointConfig::new()`, use direct
  initialization
- Add PHD test verifying instance timeseries metadata across a migration
- Add note pointing to the Omicron files containing TOML timeseries
  definitions.
Improve the PHD test by making the relationship between the source /
target instance metadata more transparent to the actual test framework
and each propolis server. They just "get" whatever the spec tells them,
similar to the sled-agent providing the metadata to their ensure API
calls, and dutifully apply the data to the right instance.
@bnaecker bnaecker requested review from gjcolombo and pfmooney July 22, 2024 23:59
Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a couple of little comments about the comments.

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good to me! I commented on a couple pretty minor nitpicks, but they're not terribly important.

Comment on lines +104 to +106
/// Logging level for the server
#[clap(long, default_value_t = slog::Level::Info, value_parser = parse_log_level)]
log_level: slog::Level,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this feels like a good idea, but it also seems pretty much completely unrelated to this change --- what's the motivation for doing it as part of this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to see the oximeter_producer::Server log messages to confirm that it was appropriately using internal DNS. That's all logged at debug or trace, so I needed a way to increase verbosity at the top level.

bnaecker added 2 commits July 24, 2024 16:36
- Ignore case in metric registration parameter
- Default values
- Log at info
@bnaecker
Copy link
Contributor Author

Thanks for the input everyone! The only issue here is the expected failure to migrate between this branch and the merge base, since I changed the API. Merging shortly!

@bnaecker bnaecker merged commit 923c384 into master Jul 24, 2024
9 of 10 checks passed
@bnaecker bnaecker deleted the move-instance-timeseries-to-toml branch July 24, 2024 17:27
bnaecker added a commit to oxidecomputer/omicron that referenced this pull request Jul 27, 2024
- Pulls in oxidecomputer/propolis#724, which
  added sled-identifiers to the virtual machine timeseries. One part of
  #5267.
- Updates requried Crucible dependency.
- Expunge previous schema and data for the virtual machine timeseries,
  as the new schema is incompatible.
bnaecker added a commit to oxidecomputer/omicron that referenced this pull request Jul 27, 2024
- Pulls in oxidecomputer/propolis#724, which
  added sled-identifiers to the virtual machine timeseries. One part of
  #5267.
- Updates requried Crucible dependency.
- Expunge previous schema and data for the virtual machine timeseries,
  as the new schema is incompatible.
bnaecker added a commit to oxidecomputer/omicron that referenced this pull request Jul 27, 2024
- Pulls in oxidecomputer/propolis#724, which
  added sled-identifiers to the virtual machine timeseries. One part of
  #5267.
- Updates requried Crucible dependency.
- Expunge previous schema and data for the virtual machine timeseries,
  as the new schema is incompatible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants