# Reducing executor #26
## Mapping vs reducing executor

Discussed this with @dschafer in IRC a bit; he asked to provide some concrete examples. As I understood, the current executor was made with the assumption that it's cheap to resolve each field independently. In our system, we assumed that extra round trips are bad. Let's imagine a schema:
## Mapping executor

We assume we have fields with resolvers like this:

```js
resolve(parent, {id}, {db, conn}) {
  return db.table('User').get(id).run(conn); // Promise
}
```

For relations (like a user's posts):

```js
resolve(parent, args, {db, conn}) {
  return db.table('Posts').get(parent.author).run(conn);
}
```

## Reducing executor, version 1

This is approximately how we do it. Here we would need some abstraction which can later be converted to a database query:

```js
resolve(parent, {id}, {db, conn}) {
  let baseQuery = new Query({
    selector: new IDSelector({id: id, tableName: tableName}),
  });
  // this calls resolve on each child and returns the results (queries)
  let childrenResults = this.resolveChildren(baseQuery);
  let finalQuery = combineQueries(baseQuery, childrenResults); // combines the queries together
  // We actually just return the query and run it at later stages
  return finalQuery.run(db, conn);
}
```

Relation resolves would look like this:

```js
resolve(baseQuery) {
  let newQuery = new Query({
    selector: new RelatedSelector({
      tableName: tableName,
      relatedField: fieldName
    }),
  });
  let childrenResults = this.resolveChildren(newQuery);
  let finalQuery = combineQueries(newQuery, childrenResults);
  return baseQuery.addJoin(fieldName, finalQuery);
}
```

One can also apply some optimizations by having custom resolves for scalars:

```js
resolve(baseQuery) {
  return baseQuery.addToSelect(fieldName);
}
```

## Reducing executor, version 2

What I propose is to add a `reduce` step:

```js
reduce(ownResult, childrenResultMap) {
  for (let key of childrenResultMap.keys()) {
    ownResult[key] = childrenResultMap.get(key);
  }
  return ownResult;
}
```

This would convert what we had in version 1 to:

```js
// query field
resolve(parent, {id}) {
  return new Query({
    selector: new IDSelector({id: id, tableName: tableName}),
  });
}

reduce(parent, ownResult, childrenResultMap, {conn}) {
  return combineQueries(ownResult, childrenResultMap).run(conn);
}

// join
resolve(parent, {id}) {
  return new Query({
    selector: new RelatedSelector({
      tableName: tableName,
      relatedField: fieldName
    }),
  });
}

reduce(parent, ownResult, childrenResultMap) {
  return parent.addJoin(
    combineQueries(ownResult, childrenResultMap)
  );
}
```
The idea of a reducing executor is appealing, but I think any reducing executor will have to make assumptions about the domain-specific structure of the application. An alternative I am using is query collapsing. It kind of achieves the same goal, but without making the spec too opinionated. The query merger is configured to reduce round trips to the db: it waits a specified amount of time to receive more queries to merge before sending them to the database.
For example, all the mapping executors you provided assume that a relationship actually lives in a different location and thus requires a query to be resolved, whereas if I am using a document-based datastore I might not need that query.
I don't think it makes any assumptions; I was just showing how it solves the problem in my domain. In your alternative solution, the reducer could collect individual queries in the reduce step and execute them all together at the end. Similarly, the executor doesn't dictate where you add …
I agree with this approach, and it seems general enough to be included in the spec. It's still useful with document stores for limiting the requested fields. Using a debounced merger seems like a very specific solution for multithreaded or evented architectures, and it has implications for latency. In addition, the query flow becomes much more difficult to reason about. I'm interested in how FB approaches this; you probably have something figured out already.
There's a little bit of discussion about this going on in #19 too; it might help to merge the discussion somewhere.
@pushrax The debounced merger I suggested is definitely not a solution I am proposing for the spec; I just noted how I am solving this particular issue in the context of my implementation. The problem I see with the reducing solution above is that it assumes we could always find a way to reduce a field selection into smaller queries. I am just saying that the reduce part will always be domain-specific and might not be a good fit for the spec. My vote would be to give the resolver functions enough metadata about the context of the selection so that this kind of implementation is possible.
I fear a similar outcome, but I'm interested in what could become a more general utility to provide. Right now we offer the field AST as one of the arguments to the resolve function. Any reducer needs to support this case:
Once reduced, the query executor is still going to want to use a top-down descent into the result so that it can ensure the returned result adheres to the type system contract: that non-nullables are in fact not null, that scalars are of the correct types, and that no additional fields have leaked through.
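That top-down completion pass can be sketched roughly like this. This is a simplified, hypothetical stand-in for what graphql-js does internally, not its actual implementation; the toy type descriptions below are made up for illustration:

```js
// Walk the (reduced) result top-down against a toy type description:
// enforce non-null, coerce scalars, and drop any fields not in the type.
function completeValue(type, value) {
  if (type.kind === 'NonNull') {
    if (value == null) throw new Error('Cannot return null for non-nullable field');
    return completeValue(type.ofType, value);
  }
  if (value == null) return null;
  if (type.kind === 'Scalar') {
    return type.coerce(value); // e.g. Number or String
  }
  if (type.kind === 'Object') {
    const result = {};
    // Only the declared fields survive; anything extra a reduced query
    // pulled back from the database cannot leak into the response.
    for (const [name, fieldType] of Object.entries(type.fields)) {
      result[name] = completeValue(fieldType, value[name]);
    }
    return result;
  }
  throw new Error('Unknown type kind: ' + type.kind);
}

const UserType = {
  kind: 'Object',
  fields: {
    id: { kind: 'NonNull', ofType: { kind: 'Scalar', coerce: Number } },
    name: { kind: 'Scalar', coerce: String },
  },
};

// An extra `password` column leaked in from a joined query is dropped,
// and the scalar values are coerced to their declared types.
const completed = completeValue(UserType, { id: '1', name: 42, password: 'x' });
// → { id: 1, name: '42' }
```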
Chatted with @leebyron just now. It seems pretty clear that we're missing something here to make this case easier. There seem to be a couple of approaches that might work:
My instincts personally run towards the collect step, but I think the way we'll get to the ideal solution here is through experimentation and trying things out. So I'd suggest you take whichever option makes the most sense to you and build it as a temporary extension to the core right now; for example, if you wanted to build out the "collect" pass, it could be implemented as a temporary step prior to execution that uses the …

By building out a solution as a temporary extension to the core, we'll be able to figure out which of the options ends up being easiest and cleanest, and we'll see if different implementations end up requiring similar techniques. Once we know that, we'll hopefully discover the general piece that we're missing and figure out the right way to add it. Super excited to see how this development proceeds; I think seeing what the temporary extension looks like will make it really clear what the right changes to make to the core are!
@dschafer I don't quite understand your idea about …

@leebyron I don't see a problem with the reducer and fragments; fragments seem to be resolved to a flat field map on …

And yes, I'll try to hack up an implementation to illustrate what I mean; I think it'd be easier :)
Ah, I think I get what @dschafer means. That would work too.
Riiight, I was wrong about …
Hey all, we're opening up a community Slack to have real-time discussion about GraphQL. This is probably an interesting topic to continue there. https://graphql-slack.herokuapp.com
Has anybody made any progress on this? Is there any other place with a discussion related to this? Say you have the following types: client, project, task, with the obvious relations and resolvers in place.
From what I've seen so far, given the way the resolvers are written, a query like the one below
would generate the following queries to the backend (DB):
The major problem is the fact that "resolver C" is called multiple times, once for each "project" item that was returned by "resolver B". All the queries above can be reduced to the one below (PostgreSQL-specific).
@ruslantalpa If you use smart batching with something like dataloader, what you'll actually see is this:
The merging you speak of only makes sense if all of these resolvers actually fire a query to the SQL database, and the execution planner has no way of knowing this at the moment. What I described above might be a bit faster, but it might also not be; you can't really tell until you try it out with different parameters. If you're thinking about doing that comparison, I'd be very interested in seeing the results. I've been wanting to do something like that myself, but I have yet to find the time for it. If someone else has the time and interest to do such a comparison, just let me know and I can tell you about the experimental setup I had in mind.
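For readers unfamiliar with how smart batching collapses per-row lookups, here is a minimal per-tick batcher in the spirit of DataLoader. The function names and query strings are illustrative; the real dataloader library has a richer API (caching, error handling, key deduplication):

```js
// Coalesce all loads requested in the same tick into one batch: this is what
// turns N per-row lookups into a single IN (...) query against the database.
function makeLoader(batchFn) {
  let queue = [];
  return function load(key) {
    return new Promise((resolve) => {
      queue.push({ key, resolve });
      if (queue.length === 1) {
        // Flush once the current tick's resolvers have all enqueued keys.
        process.nextTick(async () => {
          const batch = queue;
          queue = [];
          const results = await batchFn(batch.map((item) => item.key));
          batch.forEach((item, i) => item.resolve(results[i]));
        });
      }
    });
  };
}

// Pretend database layer that records the queries it receives.
const issuedQueries = [];
const loadProjects = makeLoader(async (clientIds) => {
  issuedQueries.push(
    `SELECT * FROM projects WHERE client_id IN (${clientIds.join(', ')})`
  );
  return clientIds.map((id) => ({ clientId: id }));
});

// Three resolvers fire in the same tick, yielding a single batched query.
Promise.all([loadProjects(1), loadProjects(2), loadProjects(3)]).then(() => {
  console.log(issuedQueries.length); // 1
});
```

The batching is transparent to the resolvers themselves, which is why the executor needs no up-front planning for it to work.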
@ruslantalpa Thanks for the example. I'm actively working on this; at my current rate it will take me several weeks to reformulate #304 to address the excellent feedback received.

The resolving algorithm is quite elegant, and I don't feel constrained by it. I DO think that the resolve interface is focused on the wrong abstraction. The parameters passed belong to the world of the query. I believe the parameters passed to resolve should belong to the world of the schema. The difference is subtle.

Think of it this way: what if one wanted to have a completely different query language, but use the current GraphQL schema and resolve process? The current resolve implementation is tightly coupled to the syntax of the query. This is an obstacle to a vibrant ecosystem. The coupling needs to be attacked. After the parameter impedance mismatch is eliminated, I believe it will take confusion off the table and unlock the path to new solutions for this problem.

So ... what would the parameters to resolve look like that would allow you to easily generate that PostgreSQL query?
@helfer Are you saying that your 3 queries are (or might be) faster than the one I provided? I guarantee they are not, simply because your 3rd query has to wait for the 2nd to complete (and this is a trivial example, with only 2 sequential steps; most of the time there will be more). Something like dataloader is not going to beat the PG query planner. @JeffRMoore I am not sure how the parameters to resolvers will help solve this problem, but I haven't given enough time to what you said (nor to your PR), so you might be right.
to
Sure, there are other places that need changes, but this is the core of the idea: there needs to be a way to create "resolvers" on the fly and attach them to the AST. This alone might allow other functions to dynamically reduce the number of resolvers that need execution, in places that permit it (all going to the same backend).
In the general case, for that sort of to-many relationship, most ORMs (even with full knowledge of the query) use a subquery load strategy, which does follow @helfer's approach of issuing multiple queries to the database and doing the join in-memory after receiving the database rows, to avoid the cartesian explosion issues from actually running this query in the straightforward way with a join. This isn't to say that the idea of the reducing executor isn't useful – it's just that it's less useful than you might think, even in the context of e.g. talking directly to a SQL database, because the general approach for handling to-many relationships is to use an extra query anyway, which is exactly what you get out of DataLoader.
tl;dr: waterfalled loads with DataLoader closely approximate subquery load strategies for relationships in SQL ORMs, which most ORMs use by default for to-many relationships anyway, so the bar for a reducing executor being useful in at least that context is higher than you'd initially think.
@ruslantalpa You're probably right that three queries won't be faster than the joined one in most cases, but I'm not as convinced as you are that it's impossible. There are definitely some cases that I'd want to test for.

That's not really the main point though. The main point is that any difference might not matter at all. As @taion said, ORMs work in a very similar way, and for many (if not most) use cases they are completely appropriate. The reason people use ORMs is the same reason I would suggest sticking to making multiple queries in the first implementation: simplicity. Even if it is slower, I wouldn't want to make my GraphQL server more complicated by optimizing unless I really had to. Anything else seems like premature optimization to me.

You can definitely pass the data down from the top-level resolvers. The first argument to the resolve function of a field is simply what the parent's resolve function returned. If your root resolve function returns an object that has the exact shape of the data queried for, then you don't even need to define your other resolvers. Does that make sense?
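A tiny sketch of that "pass the data down" approach: the root resolver prefetches the whole shape in one backend round trip, and child fields fall back to simple property lookup, which mirrors what graphql-js's default field resolver does. The names and data below are made up for illustration:

```js
// The default resolver is essentially a property lookup on the parent value.
function defaultResolve(parent, fieldName) {
  return parent[fieldName];
}

// Root resolver prefetches everything in one backend round trip (stubbed
// here with a literal; in practice this would be one joined/reduced query).
function resolveClient(id) {
  return {
    id,
    name: 'ACME',
    projects: [
      { id: 10, name: 'Website', tasks: [{ id: 100, name: 'Design' }] },
    ],
  };
}

const client = resolveClient(1);

// Child fields now resolve with property access, with no further queries:
const clientName = defaultResolve(client, 'name'); // 'ACME'
const tasks = defaultResolve(client.projects[0], 'tasks'); // one task row
```

If the root result already matches the queried shape, no custom resolvers are needed below it at all.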
I don't think the goal is necessarily to reduce the number of resolve calls; it's to reduce the number of backend queries generated by calling the resolve functions.
The subquery load is actually better than a straight join, because otherwise you end up fetching a bunch of duplicate rows. Without doing dumb JSON tricks, the query above becomes:

```sql
SELECT
  clients.id AS client_id,
  clients.name AS client_name,
  projects.id AS project_id,
  projects.name AS project_name,
  tasks.id AS task_id,
  tasks.name AS task_name
FROM
  clients
LEFT OUTER JOIN
  projects
ON
  projects.client_id = clients.id
LEFT OUTER JOIN
  tasks
ON
  tasks.project_id = projects.id
```

This is rather less nice than the subquery load with the join done by the SQL client.
SQL ORMs don't do subquery loads over joined loads for to-many relations because it's easier – they do subquery loads over joined loads because it avoids making the database return a lot of extra data to you.
@taion Yeah, that's a good point. I've thought about that before, but I wasn't sure if database engines do some optimization by compressing the duplicate data. I would have guessed that they must be doing that if it's a performance concern.
I'm going off of http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html#what-kind-of-loading-to-use, which suggests that (at least for Python) this sort of thing isn't optimized away most of the time.
@helfer You are right about passing data down (looking into it). Using that, one could create "smarter" resolvers that can prefetch data for their children (and with that eliminate the need to reduce the number of resolvers). I don't think the coupling is a problem, since it's dictated by the problem being modeled (the client/project relation is not going to change).

@JeffRMoore Reducing the number of resolvers would have the same effect, but you are right: strictly speaking, the goal is to reduce the number of backend trips.

@taion I don't want to start a flame war about ORMs :) but just because they are doing it that way does not mean they are doing it the right way. If you look more closely at my query you'll see there is no "cartesian explosion"; the shape of the response is EXACTLY the shape GQL will return. The fact that it uses functions like array_to_json has nothing to do with the executor or the generality this implementation needs to maintain; that is handled by my resolvers anyway, so I am taking advantage of the features the backend (PostgreSQL) offers me. The reason the ORMs do it that (suboptimal) way is that they target all databases, so they use only features available in all of them (standard SQL); but since I am using only one backend (as does virtually everybody using an ORM), I am going to take advantage of the features it offers (and throw away the broken abstraction).

In conclusion, I would first of all like to apologize for derailing the discussion a bit.
Even in the PostgreSQL context, your example only works if you're okay with all of your data being coerced into JSON. It's not generally appropriate, and you're also making the database do more work with type coercion. If any of those are e.g. …

That doesn't mean that I think this sort of up-front planning is a bad idea, necessarily – the bar just has to be higher than "it forces me to do the equivalent of a waterfalled subquery load".
@taion It works in all cases, since GQL has to return JSON, so the data needs to be serializable, and coercion in the DB is always faster than in any scripting language. Let's move this discussion to Slack if you are interested.
That's only in the case where you don't need to do anything with that value in the database client application. Again, I'm not saying that what you have doesn't work for your use case – but it's not a generic solution. I don't object to the idea of having support for this sort of thing; just that the scope is necessarily limited, and the value-add over DataLoader for general use cases has limited scope – not zero scope, but not full scope either.
I like the idea of a reducing executor. I got a way to make it work with …
I don't believe I followed the entire reducer discussion above. I'm going to try to work through some of the examples mentioned in the links above and see how far I get, but can anybody report back on how the explorations mentioned above worked out? Or does anybody have anything more fleshed out, updated for 2021? Possibly related: #3303
Hello!
In our system we took the GraphQL query and translated it to ReQL (the RethinkDB query language) for execution. This allowed us to optimize the query, e.g. use built-in joins in ReQL instead of querying the database two times. I can see how a similar system could be used for, e.g., SQL queries. We also did some further optimizations, like returning only the fields that are requested by GraphQL, reducing the response payload size for tables with many keys.
I see a way to do it in graphql-js by manually going through selection sets in resolve, but that feels very clumsy and hacky, as the library does this walking already. Some ways to do it would be to allow children's resolve to modify the root, or to split the collect/resolve/complete steps into independent parts that can be used separately.

Do you think deferred execution and query translation is a good idea? Should we stick to the current model and do multiple requests to the database? Is there some solution to this that I don't see?
Thanks!
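As a rough illustration of the "manually going through selection sets" workaround mentioned above: a resolver can read the requested field names from the field AST and restrict the database projection accordingly. The AST literal below mimics the shape of a graphql-js field node, but the walker itself is a sketch (it ignores fragments, inline fragments, and aliases):

```js
// Extract the directly requested field names from a field node so the
// database query can select only those columns.
function requestedFields(fieldNode) {
  const selections = fieldNode.selectionSet
    ? fieldNode.selectionSet.selections
    : [];
  return selections.map((sel) => sel.name.value);
}

// A toy field node, shaped like info.fieldNodes[0] would be for the query
// `user { id name posts { title } }`:
const fieldNode = {
  name: { value: 'user' },
  selectionSet: {
    selections: [
      { name: { value: 'id' } },
      { name: { value: 'name' } },
      {
        name: { value: 'posts' },
        selectionSet: { selections: [{ name: { value: 'title' } }] },
      },
    ],
  },
};

const fields = requestedFields(fieldNode); // ['id', 'name', 'posts']
// ...which could then drive a projection, e.g. pluck(...) in ReQL terms,
// so the database returns only the columns GraphQL actually asked for.
```

A real implementation would also need to resolve fragment spreads and honor aliases, which is part of why doing this by hand in every resolver feels clumsy.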