[data design] Observations #14696

miketheman · 2023-10-04T22:01:49Z

Today we use Event/HasEvents to mix in to a model to produce distinct *_events tables for the associated model. It's cool, it works. 🎉 The pattern is use for Events is called table_per_related.

We often will query for events like someproject.events to get all events related to the Project, somerelease.events for all events related to a Release, etc.
Since the only ForeignKey is to the parent (mixed in) class, there's rarely ever a need to express "find all Events for all Releases of a given Project".

In thinking about Observations as a similar pattern, I'm considering adding a FK to the User that provided the Observation.
If we wanted to ask "Show me all kinds of Observations made by this User", we would need to collect the observations from any mixed-in models, a la (user.project_observations + user.release_observations + ...)

If we wanted to get all observations from a related User in one-shot, a la user.observations (using relationship on a polymorphic identity) we have to model things a little differently.

We have two options here:

Joined Table Inheritance (JTI) - one table per mixed-in thing, and a JOIN table for all mixed-in-things
Single Table Inheritance (STI) - one wider table that adds a column for the mixed-in-thing
Read more in https://docs.sqlalchemy.org/en/20/orm/inheritance.html

We currently use JTI with OIDC publishers resulting in 3 discrete tables and +1 for any new publisher added in the future. Looks a little like this:

erDiagram
    OIDC_PUBLISHERS }o--|| GITHUB_OIDC_PUBLISHERS : is
    OIDC_PUBLISHERS }o--|| GOOGLE_OIDC_PUBLISHERS : is
    OIDC_PUBLISHERS {
        string discriminator
        uuid id PK
    }
    GITHUB_OIDC_PUBLISHERS {
        string repository_name
        string repository_owner
        string repository_owner_id
        string workflow_filename
        string environment
        uuid id PK, FK
    }
    GOOGLE_OIDC_PUBLISHERS {
        string email
        string sub
        uuid id PK, FK
    }

JTI works well for OIDC since the distinct tables differ in their structure, whereas Events (and the future Observations) conform to a more uniform structure.

Using STI to show a proposed table as an example like this:

erDiagram
    OBSERVATIONS }o--|| PROJECTS : have
    OBSERVATIONS }o--|| RELEASES : have
    OBSERVATIONS }o--|| USERS : have
    PROJECTS {
        etc etc
        uuid id FK
    }
    RELEASES {
        etc etc
        uuid id FK
    }
    USERS {
        etc etc
        uuid id FK
    }
    OBSERVATIONS {
        datetime created
        string kind
        string summary
        jsonb payload
        uuid id PK
        string type "discriminator column project, release, et al"
        uuid related_id FK "ID of the related record"
        uuid observer_id FK "user ID"
    }

Questions in my head are:

Is adding a required FK for user.id a good or bad idea?
I think good, but maybe not?
Do we foresee a use case for user.observations ?
I can imagine wanting to look at what Observations a User had provided over time, probably other use cases.
Is there a preference/lean towards JTI or STI?
STI seems simpler to me as there's fewer tables to content with, and fewer JOINs in the resulting query.

(These questions are similar to ones I had about Roles, TeamRoles, and OrganizationRoles, so it's possible that the learnings here will materialize in changes over there as well)

The text was updated successfully, but these errors were encountered:

ewdurbin · 2023-10-11T10:20:07Z

Questions in my head are:

Is adding a required FK for user.id a good or bad idea?
I think good, but maybe not?

I think it might make sense to have related_type + related_id and observer_type + observer_id. Theoretically we may have system observations, or observations by an abstract "Observer" service. Alternatively a required observer_id that is FK to a new Observer model that encapsulates the "what the heck is this thing? a user? a system? a job in PyPI? etc"

Do we foresee a use case for user.observations ?
I can imagine wanting to look at what Observations a User had provided over time, probably other use cases.

I could totally forsee an observer.observations that would be accessible via a related user, org, or something.

Is there a preference/lean towards JTI or STI?
STI seems simpler to me as there's fewer tables to content with, and fewer JOINs in the resulting query.

JTI seems to be nice because it would allow our observations for niche or interesting use-cases add on additional structured data. Cribbing on JSONB for the payload to start makes good sense as we learn, but I think there is an argument to be made that in the future we may want more columnar data.

miketheman · 2023-10-11T20:10:14Z

Great comments! An interesting part here is that the concept of the Observer seems like it would be STI where it is the one that determines the Observer based on the discriminator observer_type and then can find the associated User (or system, or whatever) object based on the observer_type+related_observer_id.

Then use JTI to have distinct tables for ProjectObservations, ReleaseObservations, UserObservations, et al?

I'll continue tinkering with this model to figure out how these actually relate to each other and put something here for to look at.

miketheman · 2023-10-17T22:12:59Z

After further tinkering with this data model even more, I'm coming to a simpler approach for the Observers side.

A model & mixin that looks a little like:

class Observer(Base):
    __tablename__ = "observer"
    id: Mapped[UUID] = mapped_column(primary_key=True)
    ...


class RefObserverMixin:
    observer_id: Mapped[UUID | None] = mapped_column(
        ForeignKey("observer.id"), comment="Observer ID. Null if not set"
    )

    @declared_attr.directive
    def observer(cls) -> Mapped[Observer]:
        return relationship(
            Observer, cascade="all, delete-orphan", single_parent=True
        )

should allow us to add a(nother) mixin to User, which would add a single nullable column on the users table for observer_id, while allowing us to add the mixin to any future model we might want.

One drawback is that an Observer has no clue where it came from. 🤔
I'm still mulling over whether that's important or not.
I'm thinking that it is important, since we'd want to know from a given Observation who the Observer was. I'll continue to poke at different patterns from SQLAlchemy docs.

miketheman · 2023-10-18T14:27:12Z

sqlalchemy-utils has a Generic Relationship that looks like it might be closer to the thing we want: https://sqlalchemy-utils.readthedocs.io/en/latest/generic_relationship.html but there might be an issue with sqlalchemy 2.0.22. I'll try it out.

miketheman · 2023-10-19T18:57:26Z

Considering there's still an unresolved issue with sqlalchemy-utils vs sqlalchemy, I kept reading the docs and tinkering with different implementations until I found one that worked well enough for now.
Refs: https://docs.sqlalchemy.org/en/20/_modules/examples/generic_associations/discriminator_on_association.html

In this case, we get an ObserverAssociation table that stores an ID, a discriminator type, and the related ID of the thing we're discriminating with (User today, anything mixing in the relation tomorrow).
We also get an Observer table that stores the new Observer ID, associated

This adds a column to the mixed-in model (User) to store the related observer ID so we can "step" through the Observer association tables to get Observations (which have yet to exist).

I'll proceed with adding an Observer model and mixin.

One outstanding question I have is what do we expect to happen if the original Observer parent is removed? Do we expect the associated Observer to also go away? And what of any of the Observations that Observer made? Should they go away too? Or do we accept that if we remove an Observer's parent, we get an orphaned Observer and preserve their Observations.

I'm going with "Don't delete the Observer, since they own Observations." We might do something where we re-link to deleted-user (like we do with Journals), but I'll leave that until a bit later.

miketheman · 2023-10-25T16:53:59Z

After working through a few approaches, I have been able to wrangle our code and sqlalchemy to produce this (abbreviated) data model so far (Thanks @ewdurbin and DBML for making the representation relatively easy!)

In this example, I've mixed in an Observer to User, so we get the abstract association so we can have non-User Observers.
I've also mixed in HasObservations into Project and Release models, represented by two new tables to represent the Observations that are particular to those models, similar to our current HasEvents mixin.

The next step here is to determine how to model the relationships so I can ask any mixed-in Observer for all of the *Observations they made.

miketheman · 2023-10-31T22:08:58Z

See #14834 for what I've come up with so far.

There's something I'm missing on the creation patterns through proxies, so have done some in-place hackery that needs to be resolved, as well as being able to reference all *Observations from a given Observer.parent.

ewdurbin · 2023-12-06T18:44:05Z

@miketheman did #14834 fully implement the concerns you had remaining with proxies?

miketheman · 2023-12-11T21:05:59Z

@ewdurbin I think it did - it's still not as "ActiveRecord-y" as I want it to be, but it is good enough for how we use it. Thanks for following up!

miketheman added needs discussion a product management/policy issue maintainers and users should discuss data quality labels Oct 4, 2023

This comment was marked as outdated.

Sign in to view

miketheman closed this as completed Oct 10, 2023

miketheman reopened this Oct 11, 2023

miketheman closed this as completed Dec 11, 2023

sethmlarson mentioned this issue Dec 14, 2023

Secret reporting endpoint #14961

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[data design] Observations #14696

[data design] Observations #14696

miketheman commented Oct 4, 2023 •

edited

Loading

This comment was marked as outdated.

ewdurbin commented Oct 11, 2023

miketheman commented Oct 11, 2023

miketheman commented Oct 17, 2023

miketheman commented Oct 18, 2023

miketheman commented Oct 19, 2023 •

edited

Loading

miketheman commented Oct 25, 2023

miketheman commented Oct 31, 2023

ewdurbin commented Dec 6, 2023

miketheman commented Dec 11, 2023

[data design] Observations #14696

[data design] Observations #14696

Comments

miketheman commented Oct 4, 2023 • edited Loading

This comment was marked as outdated.

ewdurbin commented Oct 11, 2023

miketheman commented Oct 11, 2023

miketheman commented Oct 17, 2023

miketheman commented Oct 18, 2023

miketheman commented Oct 19, 2023 • edited Loading

miketheman commented Oct 25, 2023

miketheman commented Oct 31, 2023

ewdurbin commented Dec 6, 2023

miketheman commented Dec 11, 2023

miketheman commented Oct 4, 2023 •

edited

Loading

miketheman commented Oct 19, 2023 •

edited

Loading