Skip to content

feat(discover2) Add more aggregate field support #14321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Aug 13, 2019

Conversation

markstory
Copy link
Member

@markstory markstory commented Aug 8, 2019

This is an alternate approach to modifying the discover endpoints to fit the requirements of discover2/events search. By extending the existing events endpoint with rollup and additional functions/aliases we can get what we need.

The rollup parameter isn't used right now but will be used when we update events-stats which will be one of the next tasks.

This approach will require some shim code in the UI to convert saved discover queries into a compatible field list and query string, but that work now seems simpler than building out the remaining ancillary endpoints in discover.

Refs SEN-810

This is an alternate approach to modifying the discover endpoints to fit
the requirements of discover2/events search. By extending the existing
events endpoint with rollup and additional functions/aliases we can get
what we need. This approach will require some shim code in the UI to
convert saved discover queries into a compatible field list and query
string, but that work now seems simpler than building out the remaining
ancillary endpoints in discover.

Refs SEN-811
@@ -147,9 +110,6 @@ def get_snuba_query_args_legacy(self, request, organization):
except InvalidSearchQuery as exc:
raise OrganizationEventsError(exc.message)

# Filter out special aggregates.
self._filter_unspecified_special_fields_in_conditions(snuba_args, set())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was removed as we won't be supporting filtering on aggregate columns in the short term, and with fully dynamic aggregate columns we would need expand the query parser to accomodate aggregate filtering.

Copy link
Member

@lynnagara lynnagara Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we go ahead and remove the function completely? sorry, just saw you did this

if any(field for field in groupby if field not in ALLOWED_GROUPINGS):
message = ('Invalid groupby value requested. Allowed values are ' +
', '.join(ALLOWED_GROUPINGS))
return Response({'detail': message}, status=400)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was removed because the endpoint no longer allows the end user to define groupby. Instead groupings are implicit based on the fields requested.

@@ -215,9 +215,9 @@ class SearchVisitor(NodeVisitor):
'device.battery_level', 'device.charging', 'device.online',
'device.simulator', 'error.handled', 'issue.id', 'stack.colno',
'stack.in_app', 'stack.lineno', 'stack.stack_level',
# OrganizationEvents aggregations
'event_count', 'user_count',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing these as they aren't a thing anymore.

@markstory markstory requested review from lynnagara, a team and dashed August 8, 2019 18:31
# OrganizationEvents aggregations
'event_count', 'user_count',

'duration',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to break regular event/issue search if someone is using these field names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removed fields were only really available on events-v2, which no customers have access to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is not just used in events-v2 though, it's used for event search generally. Try typing duration:something into issue or event search and you will see what I mean

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt anyone knows about these fields since they were never documented

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately people do use duration as a tag though. And 500ing on a query with this term in it doesn't seem great

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove duration for now. Given that some customers are using this as a tag would we want to expose the real duration as event.duration?

u'{}_{}'.format(match.group('function'), match.group('column'))
])

rollup = snuba_args.get('rollup')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will rollup be used for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rollup gets used by timeseries endpoints, which I'm planning to update soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@dashed dashed Aug 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

# Ensure fields we require to build a functioning interface
# are present. We don't add fields when using a rollup as the additional fields
# would be aggregated away.
if 'project.id' not in columns:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to id on the line below, won't the aggregations check be needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always need a project id in order to generate a link to the event in the modal. This complicates cross project aggregation. Another possibility is to use argMax(project_id, timestamp) to get the project for the latest event.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the logic in this block that is related to building a functioning interface and modal links lived in the user interface instead instead of here? I think the API could be solely responsible for making sure queries are valid and then running them, not modifying queries so that the UI gets the field it needs. Otherwise this behavior would be super confusing if someone is using the API directly and getting back random fields that they didn't request just because our UI needs it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with that is we'll be storing field lists in the URL and in the database. Storing all the decomposed fields makes URLs more fragile as removing an expanded field could cause incorrect rendering. It also makes future changes harder. If we want to modify how a field is expanded we can't easily do that because the expansion is saved incorrectly in the database.

sortField: 'event_count',
// TODO generalize this.
'count(id)': {
sortField: 'count_id',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you only sort by one column?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The links only add/change a single sort condition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. For events we use -timestamp, -event_id or timestamp, event_id so we have a stable order, which we don't have here. Not sure if it matters though.

columns.append(field)
continue

validate_aggregate(field, match)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be nice if all the validations were done in one place up front, but i'm guessing this is not as easy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function (resolve_field_list) is called pretty early in the organization_events endpoint processing. I would like to use a DRF serializer for this work but other request validation steps get_filter_params() are embedded in endpoint logic.

if aggregations and 'latest_event' not in fields:
columns.extend(deepcopy(FIELD_ALIASES['latest_event']['fields']))

basic_fields = [col for col in columns if isinstance(col, six.string_types)]
Copy link
Member

@lynnagara lynnagara Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is a column name not a string? Wouldn't that be a validation error if that ever happened?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column name can be a non-string value when a field alias is used. For example the latest_event field gets expanded into a function using argMax.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added validation to disallow end users from providing non-string field names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. IMO it's easier to read if you put the special expanded fields (which are actually aggregations) in a separate list since they have a different type and need to be handled differently

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the field aliases that expand into aggregates now update aggregations. I've had to hack in a string function call for latest_event as snuba doesn't support multiple columns in aggregate functions just yet. I've opened a pull to fix that though.


basic_fields = [col for col in columns if isinstance(col, six.string_types)]
if rollup and basic_fields:
raise InvalidSearchQuery('You cannot combine rollup with non-aggregate fields')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that ok since they get added to groupby clause?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, my understanding of rollup is that it cuts the data up into timeslices. Getting a non-aggregate column in that kind of result set is not helpful as it breaks the rollup slices.

Copy link
Member

@lynnagara lynnagara Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why won't it be useful? I could see something like Errors grouped by version rolled up by week being pretty useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I'll take this out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, this scenario is an error if there are no aggregates being included. However, basic fields + aggregates should be valid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

VALID_AGGREGATES = {
'count_unique': {
'snuba_name': 'uniq',
'fields': '*',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can imagine also wanting to do all fields excluding some small list

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lynnagara im wondering, is there an aggregate function where we would want to exclude certain columns?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would prefer to use safe-lists rather than ban-lists for the acceptable aggregate columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation where I was thinking we might need a ban list is where we want to support the aggregate on custom user-defined tag columns (can't use a safe list) but might need to exclude some internal stuff

Generally not a useful scenario
Don't inject a project.id field when aggregating across non-project
fields. Instead use argMax() to get a project.name from the latest
event.
},
# This doesn't work yet, but is an illustration of how it could work
'p75': {
'snuba_name': 'quantileTiming(0.75)',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we wanted to stop passing function calls like this explicitly, and try something like
[quantileTiming, [0.75]]. Not sure if this is something that would be directly passed to snuba though?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with an array form as well. The quantile functions are a bit funny as they need to end up in clickhouse like quantileTiming(0.75)(duration) I'm not sure we've done that kind of function call in snuba before.

Copy link
Member

@wedamija wedamija Aug 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need snuba to translate that for us, not sure. I feel like we had something similar to this before, can't remember exactly what it was though. Could potentially do [quantileTiming, [0.75, 'duration']] and have snuba sort it out. I think other patterns we've done were to extract information from the name, but that's probably not a great practice (something like [quantileTiming_75, ['duration']])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the implementation of the quantile functions is still imcomplete, but we can figure that out once storage for transactions is sorted.

# OrganizationEvents aggregations
'event_count', 'user_count',

'duration',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt anyone knows about these fields since they were never documented

* Remove non-deterministic results.
* Remove condition test for fields that are no longer supported.
@markstory markstory requested review from lynnagara and wedamija August 9, 2019 16:12
Copy link
Member

@dashed dashed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

I'll defer to @wedamija and @lynnagara for approval.

We need to use string function names in the short term as snuba's json
schema validation prevents aggregates on more than one column. I've also
changed the logic to allow rollup with basic fields as long as at least
one aggregate is also specified.
markstory added a commit to getsentry/snuba that referenced this pull request Aug 12, 2019
We have a use-case in sentry where we want to use `argMax()` which
requires multiple columns to work. While this function is allowed in
`selected_columns` it is actually an aggregate value and should probably
be in `aggregations`. This expands the schema and query builder to allow
lists of columns in aggregations functions.

Refs getsentry/sentry#14321
markstory added a commit to getsentry/snuba that referenced this pull request Aug 12, 2019
We have a use-case in sentry where we want to use `argMax()` which
requires multiple columns to work. While this function is allowed in
`selected_columns` it is actually an aggregate value and should probably
be in `aggregations`. This expands the schema and query builder to allow
lists of columns in aggregations functions.

Refs getsentry/sentry#14321
Duration is currently in use as a tag by a few customers.
Copy link
Member

@wedamija wedamija left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, probably worth lyn having final signoff

Copy link
Member

@lynnagara lynnagara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@markstory markstory merged commit 0605f0e into master Aug 13, 2019
markstory added a commit to getsentry/snuba that referenced this pull request Aug 22, 2019
We have a use-case in sentry where we want to use `argMax()` which
requires multiple columns to work. While this function is allowed in
`selected_columns` it is actually an aggregate value and should probably
be in `aggregations`. This expands the schema and query builder to allow
lists of columns in aggregations functions.

Refs getsentry/sentry#14321
@evanpurkhiser evanpurkhiser deleted the sen-811-server-field-aliases branch August 22, 2019 22:52
markstory added a commit to getsentry/snuba that referenced this pull request Aug 23, 2019
We have a use-case in sentry where we want to use `argMax()` which
requires multiple columns to work. While this function is allowed in
`selected_columns` it is actually an aggregate value and should probably
be in `aggregations`. This expands the schema and query builder to allow
lists of columns in aggregations functions.

Refs getsentry/sentry#14321
@github-actions github-actions bot locked and limited conversation to collaborators Dec 20, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants