Skip to content

[data design] Observations #14696

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
miketheman opened this issue Oct 4, 2023 · 10 comments
Closed

[data design] Observations #14696

miketheman opened this issue Oct 4, 2023 · 10 comments
Labels
data quality needs discussion a product management/policy issue maintainers and users should discuss

Comments

@miketheman
Copy link
Member

miketheman commented Oct 4, 2023

Today we use Event/HasEvents to mix in to a model to produce distinct *_events tables for the associated model. It's cool, it works. 🎉 The pattern is use for Events is called table_per_related.

We often will query for events like someproject.events to get all events related to the Project, somerelease.events for all events related to a Release, etc.
Since the only ForeignKey is to the parent (mixed in) class, there's rarely ever a need to express "find all Events for all Releases of a given Project".

In thinking about Observations as a similar pattern, I'm considering adding a FK to the User that provided the Observation.
If we wanted to ask "Show me all kinds of Observations made by this User", we would need to collect the observations from any mixed-in models, a la (user.project_observations + user.release_observations + ...)

If we wanted to get all observations from a related User in one-shot, a la user.observations (using relationship on a polymorphic identity) we have to model things a little differently.

We have two options here:

We currently use JTI with OIDC publishers resulting in 3 discrete tables and +1 for any new publisher added in the future. Looks a little like this:

erDiagram
    OIDC_PUBLISHERS }o--|| GITHUB_OIDC_PUBLISHERS : is
    OIDC_PUBLISHERS }o--|| GOOGLE_OIDC_PUBLISHERS : is
    OIDC_PUBLISHERS {
        string discriminator
        uuid id PK
    }
    GITHUB_OIDC_PUBLISHERS {
        string repository_name
        string repository_owner
        string repository_owner_id
        string workflow_filename
        string environment
        uuid id PK, FK
    }
    GOOGLE_OIDC_PUBLISHERS {
        string email
        string sub
        uuid id PK, FK
    }
Loading

JTI works well for OIDC since the distinct tables differ in their structure, whereas Events (and the future Observations) conform to a more uniform structure.

Using STI to show a proposed table as an example like this:

erDiagram
    OBSERVATIONS }o--|| PROJECTS : have
    OBSERVATIONS }o--|| RELEASES : have
    OBSERVATIONS }o--|| USERS : have
    PROJECTS {
        etc etc
        uuid id FK
    }
    RELEASES {
        etc etc
        uuid id FK
    }
    USERS {
        etc etc
        uuid id FK
    }
    OBSERVATIONS {
        datetime created
        string kind
        string summary
        jsonb payload
        uuid id PK
        string type "discriminator column project, release, et al"
        uuid related_id FK "ID of the related record"
        uuid observer_id FK "user ID"
    }
Loading

Questions in my head are:

  1. Is adding a required FK for user.id a good or bad idea?
    I think good, but maybe not?
  2. Do we foresee a use case for user.observations ?
    I can imagine wanting to look at what Observations a User had provided over time, probably other use cases.
  3. Is there a preference/lean towards JTI or STI?
    STI seems simpler to me as there's fewer tables to content with, and fewer JOINs in the resulting query.

(These questions are similar to ones I had about Roles, TeamRoles, and OrganizationRoles, so it's possible that the learnings here will materialize in changes over there as well)

@miketheman miketheman added needs discussion a product management/policy issue maintainers and users should discuss data quality labels Oct 4, 2023
@miketheman

This comment was marked as outdated.

@ewdurbin
Copy link
Member

Questions in my head are:

  1. Is adding a required FK for user.id a good or bad idea?
    I think good, but maybe not?

I think it might make sense to have related_type + related_id and observer_type + observer_id. Theoretically we may have system observations, or observations by an abstract "Observer" service. Alternatively a required observer_id that is FK to a new Observer model that encapsulates the "what the heck is this thing? a user? a system? a job in PyPI? etc"

  1. Do we foresee a use case for user.observations ?
    I can imagine wanting to look at what Observations a User had provided over time, probably other use cases.

I could totally forsee an observer.observations that would be accessible via a related user, org, or something.

  1. Is there a preference/lean towards JTI or STI?
    STI seems simpler to me as there's fewer tables to content with, and fewer JOINs in the resulting query.

JTI seems to be nice because it would allow our observations for niche or interesting use-cases add on additional structured data. Cribbing on JSONB for the payload to start makes good sense as we learn, but I think there is an argument to be made that in the future we may want more columnar data.

@miketheman miketheman reopened this Oct 11, 2023
@miketheman
Copy link
Member Author

Great comments! An interesting part here is that the concept of the Observer seems like it would be STI where it is the one that determines the Observer based on the discriminator observer_type and then can find the associated User (or system, or whatever) object based on the observer_type+related_observer_id.

Then use JTI to have distinct tables for ProjectObservations, ReleaseObservations, UserObservations, et al?

I'll continue tinkering with this model to figure out how these actually relate to each other and put something here for to look at.

@miketheman
Copy link
Member Author

After further tinkering with this data model even more, I'm coming to a simpler approach for the Observers side.

A model & mixin that looks a little like:

class Observer(Base):
    __tablename__ = "observer"
    id: Mapped[UUID] = mapped_column(primary_key=True)
    ...


class RefObserverMixin:
    observer_id: Mapped[UUID | None] = mapped_column(
        ForeignKey("observer.id"), comment="Observer ID. Null if not set"
    )

    @declared_attr.directive
    def observer(cls) -> Mapped[Observer]:
        return relationship(
            Observer, cascade="all, delete-orphan", single_parent=True
        )

should allow us to add a(nother) mixin to User, which would add a single nullable column on the users table for observer_id, while allowing us to add the mixin to any future model we might want.

One drawback is that an Observer has no clue where it came from. 🤔
I'm still mulling over whether that's important or not.
I'm thinking that it is important, since we'd want to know from a given Observation who the Observer was. I'll continue to poke at different patterns from SQLAlchemy docs.

@miketheman
Copy link
Member Author

sqlalchemy-utils has a Generic Relationship that looks like it might be closer to the thing we want: https://sqlalchemy-utils.readthedocs.io/en/latest/generic_relationship.html but there might be an issue with sqlalchemy 2.0.22. I'll try it out.

@miketheman
Copy link
Member Author

miketheman commented Oct 19, 2023

Considering there's still an unresolved issue with sqlalchemy-utils vs sqlalchemy, I kept reading the docs and tinkering with different implementations until I found one that worked well enough for now.
Refs: https://docs.sqlalchemy.org/en/20/_modules/examples/generic_associations/discriminator_on_association.html

In this case, we get an ObserverAssociation table that stores an ID, a discriminator type, and the related ID of the thing we're discriminating with (User today, anything mixing in the relation tomorrow).
We also get an Observer table that stores the new Observer ID, associated

This adds a column to the mixed-in model (User) to store the related observer ID so we can "step" through the Observer association tables to get Observations (which have yet to exist).

I'll proceed with adding an Observer model and mixin.

One outstanding question I have is what do we expect to happen if the original Observer parent is removed? Do we expect the associated Observer to also go away? And what of any of the Observations that Observer made? Should they go away too? Or do we accept that if we remove an Observer's parent, we get an orphaned Observer and preserve their Observations.

I'm going with "Don't delete the Observer, since they own Observations." We might do something where we re-link to deleted-user (like we do with Journals), but I'll leave that until a bit later.

@miketheman
Copy link
Member Author

After working through a few approaches, I have been able to wrangle our code and sqlalchemy to produce this (abbreviated) data model so far (Thanks @ewdurbin and DBML for making the representation relatively easy!)

Screenshot 2023-10-25 at 12 49 07

In this example, I've mixed in an Observer to User, so we get the abstract association so we can have non-User Observers.
I've also mixed in HasObservations into Project and Release models, represented by two new tables to represent the Observations that are particular to those models, similar to our current HasEvents mixin.

The next step here is to determine how to model the relationships so I can ask any mixed-in Observer for all of the *Observations they made.

@miketheman
Copy link
Member Author

See #14834 for what I've come up with so far.

There's something I'm missing on the creation patterns through proxies, so have done some in-place hackery that needs to be resolved, as well as being able to reference all *Observations from a given Observer.parent.

@ewdurbin
Copy link
Member

ewdurbin commented Dec 6, 2023

@miketheman did #14834 fully implement the concerns you had remaining with proxies?

@miketheman
Copy link
Member Author

@ewdurbin I think it did - it's still not as "ActiveRecord-y" as I want it to be, but it is good enough for how we use it. Thanks for following up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data quality needs discussion a product management/policy issue maintainers and users should discuss
Projects
None yet
Development

No branches or pull requests

2 participants