@david-crespo (Contributor) commented Jan 14, 2025

Initial implementation of RFD 523.

High-level design

  • Logging an operation has two steps, corresponding to two app layer methods called directly in the request handler:
    • audit_log_entry_init: called before anything else; if it fails, we bail, which guarantees that nothing can happen without being logged
    • audit_log_entry_complete: called after the operation succeeds or fails, filling in the row with the success or failure result. Currently we only log the HTTP status code and possibly error message, but we will fill this in further with, e.g., the ID of the created resource (if applicable), and maybe the entire success response.
  • This log is stored in CockroachDB rather than somewhere else (like ClickHouse) because we need an immediate guarantee at write time that the audit log initialization happened before we proceed with the API operation.
  • The audit log can only be retrieved by fleet viewers at /v1/system/audit-log
  • The audit log list is powered by a SQL view that filters for only completed entries
  • The audit log list is ordered by time_completed, not time_started. This turns out to be very important — see the doc comment on audit_log_list in nexus/db-queries/src/db/datastore/audit_log.rs.
  • Audit log entries have unique IDs in order to let clients deduplicate them if they fetch overlapping ranges
    • Timestamps could not be used as the primary key because (a) timestamp collisions are possible, and (b) we are ordering by time_completed, but not all entries in the audit log table have non-null time_completed
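
The two-step pattern above can be sketched in miniature. This is a hypothetical in-memory model (types and field names are illustrative, not from the PR); the real implementation writes rows to CockroachDB via the datastore:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct AuditLogEntry {
    id: u64,                     // unique ID so clients can deduplicate
    operation: String,
    time_started: u64,
    time_completed: Option<u64>, // NULL until the operation finishes
    http_status: Option<u16>,
}

#[derive(Default)]
struct AuditLog {
    next_id: u64,
    entries: HashMap<u64, AuditLogEntry>,
}

impl AuditLog {
    // Step 1: called before the operation. If this fails, the request
    // handler bails, so nothing can happen without being logged.
    fn entry_init(&mut self, operation: &str, now: u64) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.entries.insert(id, AuditLogEntry {
            id,
            operation: operation.to_string(),
            time_started: now,
            time_completed: None,
            http_status: None,
        });
        id
    }

    // Step 2: called after the operation succeeds or fails.
    fn entry_complete(&mut self, id: u64, status: u16, now: u64) {
        if let Some(e) = self.entries.get_mut(&id) {
            e.http_status = Some(status);
            e.time_completed = Some(now);
        }
    }

    // The list view shows only completed entries, ordered by
    // time_completed rather than time_started.
    fn list(&self) -> Vec<AuditLogEntry> {
        let mut done: Vec<_> = self
            .entries
            .values()
            .filter(|e| e.time_completed.is_some())
            .cloned()
            .collect();
        done.sort_by_key(|e| e.time_completed);
        done
    }
}
```

Note how an entry that started earlier but completed later sorts after one that completed first, which is why time_completed must be the list order.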

Operations logged

See nexus/src/external_api/http_entrypoints.rs. My goal was to start by logging the operations that create sessions and tokens. Eventually I think we want to log pretty much everything that's not a GET.

  • login_saml: last step of SAML login, creates web session
  • login_local: username/password login, creates web session
  • device_auth_confirm: last step of token create
  • project_create and project_delete
  • instance_create and instance_delete
  • disk_create and disk_delete

Next steps

Things that are not in this PR, but which we will want to do soon, possibly as soon as this release. I put the highest priority items first.

Log ID of created resource

For actions that create a resource, like disk or instance create, we need to at least log the ID of the resource created. Even for token and session creation, we can probably log the ID of the created token or session. We may also want to log names if we have them.

Log display name of user and silo

We only have UUIDs for the user and silo, and they are not very pleasant to work with. It's a lot easier to see what's going on at a glance if we have display names. On top of that, after a user or silo is deleted, there isn't a way to look it up in the API by ID to get that info.

Auto-complete uncompleted entries

Unlike initialization (where we bail on failure), we have no guarantee that audit log completion runs successfully, because we don't want to turn every loggable operation into a saga just to enable rollbacks. To deal with this, we will likely need a background job that completes any rows left uncompleted for longer than N minutes or hours. Because these rows will have no success or error info about the logged operation, we will probably need an explicit third kind of completed entry, i.e., success/error/timeout.
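
A minimal sketch of what such a sweeper might look like, assuming a three-way completion kind (names here are illustrative, not from the PR):

```rust
// Hypothetical third "timeout" completion kind alongside success and
// error, assigned by a background task that sweeps rows left
// uncompleted for longer than some threshold.
#[derive(Debug, Clone, PartialEq)]
enum CompletionKind {
    Success { http_status: u16 },
    Error { http_status: u16, message: String },
    // No result info is available for entries completed by the sweeper.
    Timeout,
}

struct PendingEntry {
    time_started: u64,
    completion: Option<CompletionKind>,
}

// Complete any entries pending longer than `max_age`.
// Returns how many entries were timed out.
fn sweep_uncompleted(entries: &mut [PendingEntry], now: u64, max_age: u64) -> usize {
    let mut count = 0;
    for e in entries.iter_mut() {
        if e.completion.is_none() && now - e.time_started > max_age {
            e.completion = Some(CompletionKind::Timeout);
            count += 1;
        }
    }
    count
}
```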

Log ID of token or session used to authenticate operation

We have these IDs as of #8137, so we might as well use them.

Versioned log format

We may want to indicate breaking changes to the log format so that customers know to update whatever system consumes and stores the log.
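
One possible shape for this, sketched with hypothetical names (EventVersion and the major/minor split are my assumptions, not from the PR): a consumer's parser accepts entries whose major version matches the one it was written against, and treats minor bumps as additive.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct EventVersion {
    major: u32, // bump on breaking changes: parsers must be updated
    minor: u32, // bump on additive changes (e.g. new fields): still parseable
}

// A parser written against one version can read any entry with the
// same major version, regardless of minor version.
fn parser_can_read(parser_written_for: EventVersion, entry: EventVersion) -> bool {
    entry.major == parser_written_for.major
}
```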

Silo-level audit log endpoint

In this PR, the audit log can only be retrieved by fleet viewers at a system-level endpoint. We will probably want to allow silo admins to retrieve an audit log scoped to their silo. That will require:

  • A silo-scoped /v1/audit-log endpoint accessible only to silo admins that does more or less what the system-level one does, plus a where silo_id = <silo_id> filter
  • A SiloAuditLog authz resource alongside AuditLog that is tied to a specific silo
  • More robust logging of the silo an operation takes place in, probably related to the above point about better actor logging on login actions. The external authenticator actor is not in a silo, so currently we are not writing down what silo a login attempt is happening in.
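
The silo-scoped listing is essentially the fleet-level query plus one filter. A hypothetical in-memory sketch (field names are illustrative; the real version would be a SQL view or query in the datastore):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    id: u32,
    silo_id: Option<u32>, // None for actors outside any silo
    time_completed: u64,
}

// Same ordering as the fleet-level list, restricted to one silo:
// the extra condition plays the role of `where silo_id = <silo_id>`.
fn silo_audit_log_list(entries: &[Entry], silo_id: u32) -> Vec<Entry> {
    let mut out: Vec<_> = entries
        .iter()
        .filter(|e| e.silo_id == Some(silo_id))
        .cloned()
        .collect();
    out.sort_by_key(|e| e.time_completed);
    out
}
```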

Log putative user for login operations

For failed login attempts we want to know who the caller was trying to log in as. For SAML login this may not be meaningful, since we only get the request from the IdP after login succeeded on the IdP side, but for password login we could log the username.

Log full JSON response

We may want to go as far as logging the entire JSON response. One minor difficulty I ran into is that Dropshot handles serializing the response struct to JSON, so we don't have access to the serialized body in the request handlers. It feels like a shame to serialize it twice, but we might have to if we want to write down the response.
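
One way around the double serialization would be to serialize once and reuse the bytes for both the audit entry and the HTTP body. A hypothetical sketch (the hand-rolled to_json stands in for serde_json, and all names here are mine, not the PR's):

```rust
struct Project {
    id: u32,
    name: String,
}

impl Project {
    // Stand-in for serde_json serialization.
    fn to_json(&self) -> String {
        format!("{{\"id\":{},\"name\":\"{}\"}}", self.id, self.name)
    }
}

struct AuditRecord {
    success_response: Option<String>,
}

// Serialize once; write the same bytes to both destinations.
fn respond(project: &Project, audit: &mut AuditRecord) -> String {
    let body = project.to_json();
    audit.success_response = Some(body.clone());
    body // hand the pre-serialized body back to the HTTP layer
}
```

This only helps if the framework accepts a pre-serialized body, which is exactly the Dropshot limitation noted above.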

Clean up old entries

Background task to delete entries older than N days, as determined by our as-yet-undetermined retention policy. We need to keep an eye on how fast the table will grow, but we already have some tables that are quite large compared to this one and we don't clean them up yet, so I'm not too worried about it. We expect customers will want to fetch the log frequently and save it off-rack, so the retention period probably doesn't need to be very long.
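
The cleanup itself is simple. A hypothetical in-memory sketch (the real version would be a DELETE against CockroachDB; names are illustrative):

```rust
struct LogRow {
    time_completed: u64,
}

// Drop rows older than the retention window; returns how many were pruned.
fn prune_old_entries(rows: &mut Vec<LogRow>, now: u64, retention: u64) -> usize {
    let before = rows.len();
    rows.retain(|r| now - r.time_completed <= retention);
    before - rows.len()
}
```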

Log a bunch more events

Right now the audit log calls are a bit verbose. Dropshot deliberately does not support middleware, which would let us do this kind of thing automatically outside of the handlers. Finding a more ergonomic and less noisy way of doing the audit logging and latency logging might require a declarative macro.
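
For flavor, a declarative macro could wrap a handler body in the init/complete calls so each endpoint doesn't repeat them by hand. This is a toy sketch under that assumption; the `audited!` name and the string-log stand-in are mine, not from the PR:

```rust
// Wrap a handler body in init/complete bookkeeping. Here a Vec<String>
// stands in for the audit log datastore, and Result<u16, u16> for the
// handler's success/error HTTP status.
macro_rules! audited {
    ($log:expr, $op:expr, $body:expr) => {{
        $log.push(format!("init:{}", $op));
        let result: Result<u16, u16> = $body;
        let status = match &result {
            Ok(s) => *s,
            Err(s) => *s,
        };
        $log.push(format!("complete:{}:{}", $op, status));
        result
    }};
}

fn project_create(log: &mut Vec<String>) -> Result<u16, u16> {
    audited!(log, "project_create", {
        // ... actual handler work would go here ...
        Ok(201)
    })
}
```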

@david-crespo force-pushed the crespo/audit-log branch 5 times, most recently from 9258b89 to 1c4e5bf on January 15, 2025 at 15:57
let project =
nexus.project_create(&opctx, &new_project.into_inner()).await?;

let _ = nexus.audit_log_entry_complete(&opctx).await?;
david-crespo (Contributor, Author):

I started with project create because it's easy to work with in tests, but I know it's not in the short list of things we want to start with. We might end up simply logging every endpoint.

Contributor:

Yeah, we'll want (at least eventually) to include all (at least all authenticated) API methods. I think if we want to just have a subset of the methods available then we should prioritize those that make changes (vs GET operations), but with the intention of getting coverage of the API.

On a related note, while not a requirement for this initial version, I spoke to @sunshowers about strategies for how we might be able to enforce that new methods implement the audit log. It's a place I think we'd like to get to.

@david-crespo (Contributor, Author) commented Jan 15, 2025

It's related to dropshot lacking middleware — notice we manually call this instrument_dropshot_handler thing in every endpoint. I wonder if we could build that in elsewhere, make it automatic, and add the audit log call to it.

@inickles (Contributor) left a comment:

Some initial thoughts on the fields in AuditLogEntry.

Comment on lines 19 to 20
// TODO: this isn't in the RFD but it seems nice to have
pub request_uri: String,
Contributor:

Yeah, this looks like it might be the closest thing we'd have to something like a rack and/or fleet ID, which is something I think we'd want: something for customers to be able to filter which audit logs came from which rack/fleet.

This may suffice for now, but maybe just until we get multi-rack implemented?

Comment on lines 55 to 57
// Fields that are optional because they get filled in after the action completes
/// Time in milliseconds between receiving request and responding
pub duration: Option<TimeDelta>,
Contributor:

While fine to include, I don't think this is required, in case that makes it easier. I'm not following how the earlier note about this relates to including the response in the audit log entry.

david-crespo (Contributor, Author):

I just meant the response and the duration are both things we only know at the end of the operation.

Comment on lines 62 to 86
// TODO: including a real response complicates things
// Response data on success (if applicable)
// pub success_response: Option<Value>,
Contributor:

While this indeed complicates things, it is critical IMO. For example, if someone were to create a new instance, this audit log entry should say what the new instance's ID is.


#[derive(Queryable, Insertable, Selectable, Clone, Debug)]
#[diesel(table_name = audit_log)]
pub struct AuditLogEntry {
Contributor:

I'm thinking it might make more sense to put operation-specific things like resource_type, resource_id, and maybe action into something like a request_elements: Value, where the operation can decide what makes sense to include.


#[derive(Queryable, Insertable, Selectable, Clone, Debug)]
#[diesel(table_name = audit_log)]
pub struct AuditLogEntry {
Contributor:

I'd like for us to include a versioned format, where we stick to major/minor semver, and include an event_version in this struct. I'm not sure how we'd want to manage that, and for all I know it might be a little more difficult for fields with Value type (the request and response bits), but I think it's important for us not to silently break user parsers.

david-crespo (Contributor, Author):

I was thinking we could use the release version, but I see you mean the abstract shape of the log entry, and we'd want the version to stay the same across releases when applicable to indicate that log parsing logic does not have to change. So we should probably include both a log format version and the release version. Semver might be overkill — maybe we can get away with integers and not worry about distinguishing between breaking, semi-breaking, and non-breaking changes.

Contributor:

> Semver might be overkill — maybe we can get away with integers and not worry about distinguishing between breaking, semi-breaking, and non-breaking changes.

The patch number of SemVer might be overkill, but I think following similar rules for major and minor versions to differentiate between changes that'd break parsers vs. those that shouldn't (e.g. new fields added) could still fit into SemVer rules and be a natural means of indicating when parser logic has to change.

david-crespo added a commit that referenced this pull request Jan 22, 2025
Pulling these refactors out of #7339 because they're mechanical and just
add noise. The point is to make it a cleaner diff when we add the
function calls or wrapper code that creates audit log entries, as well
as to clean up the `device_auth` (eliminated) and `console_api`
(shrunken substantially) files, which have always been a little out of
place.

### Refactors

With the change to a trait-based Dropshot API, the already weird
`console_api` and `device_auth` modules became even weirder, because the
actual endpoint definitions were moved out of those files and into
`http_entrypoints.rs`, but they still called functions that lived in the
other files. These functions were redundant and had signatures more or
less identical to the endpoint handlers. That's the main reason we lose
90 lines here.

Before we had

```
http_entrypoints.rs -> console_api/device_auth -> nexus/src/app functions
```

Now we (mostly) cut out the middleman:

```
http_entrypoints.rs -> nexus/src/app functions
```

Some of what was in the middle moved up into the endpoint handlers, some
moved "down" into the nexus "service layer" functions.

### One (1) functional change

The one functional change is that the console endpoints are all
instrumented now.
@david-crespo force-pushed the crespo/audit-log branch 4 times, most recently from 8dae6b3 to 9d70d86 on January 30, 2025 at 21:56
@david-crespo added this to the 13 milestone on Jan 31, 2025
@david-crespo self-assigned this on Jan 31, 2025
@morlandi7 modified the milestones: 13, 14 on Feb 11, 2025
@david-crespo force-pushed the crespo/audit-log branch 5 times, most recently from f9d36c0 to f95cb8a on March 6, 2025 at 19:48