AI.DAGRUN support, code refactoring and improvement #322
Conversation
Codecov Report
@@ Coverage Diff @@
## master #322 +/- ##
==========================================
+ Coverage 56.75% 58.12% +1.37%
==========================================
Files 27 32 +5
Lines 5050 5540 +490
==========================================
+ Hits 2866 3220 +354
- Misses 2184 2320 +136
Continue to review full report at Codecov.
Force-pushed from 03f2fb2 to 13a3602
* 3. have reply callback put the data back into the key
*
* This way we avoid any race condition. The only gotcha is making sure no one
* overwrites the model until it's done computing.
I find this comment a bit misleading: RAI_ModelRunCtxCreate will shallow copy the model. Since this command runs in the main thread, there's no need for locking or waiting, because the run context holds the model that was found in the key at the time this portion of the command was executed. If a new model is set while the run is waiting in the queue, the old model is disposed at the key level but retained in the run context, so it still runs; it is then deallocated when the run context is deallocated.
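The ownership scheme described here can be sketched as a small refcount example. All names below (`model_retain`, `run_ctx_create`, the `Model` fields) are hypothetical simplifications, not the actual RedisAI API:

```c
#include <stdlib.h>

/* Hypothetical, simplified sketch of the ownership pattern described
 * above; names and fields are illustrative, not the actual RedisAI API. */

typedef struct Model {
    int refcount;
    int version;               /* stands in for the model payload */
} Model;

static Model *model_retain(Model *m) { m->refcount++; return m; }

static void model_release(Model *m) {
    if (--m->refcount == 0) free(m);
}

typedef struct RunCtx {
    Model *model;              /* shallow copy: shares payload, bumps refcount */
} RunCtx;

/* Runs on the main thread: snapshot the model currently held at the key. */
static RunCtx *run_ctx_create(Model *at_key) {
    RunCtx *ctx = malloc(sizeof(*ctx));
    ctx->model = model_retain(at_key);
    return ctx;
}

static void run_ctx_free(RunCtx *ctx) {
    model_release(ctx->model); /* last reference frees the old model */
    free(ctx);
}
```

If a new model is set while the run waits in the queue, the key drops its reference but the run context keeps one, so the queued run still executes against its snapshot and the old model is freed only when the run context is.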
@lantiga agree. This was an old comment that just moved from inside the function to the function description.
https://github.com/RedisAI/RedisAI/blob/master/src/redisai.c#L1139
Should we make this change in a separate pull request, merge it to master and then update here? Or is updating here enough?
Right, let's just update here. I'd like to fast-track this work anyway.
Looking very good so far.
I propose to refactor RunInfo, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers. I think it's better to do it now than wait too long and have to change code pervasively again. This will open the door to the full DAG implementation which at this point I'd like to get done shortly.
@lantiga the latest commit makes use of this change for DAGRUN, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers (within RAI_DagOp). To be gradually adopted in modelrun and scriptrun (for now, to be able to fully test and harden it, only in dagrun).
What do you think?
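For concreteness, the proposed layout might look like the following sketch. The names (`DagOpKind`, the union field `ctx`) are illustrative, not the actual RedisAI definitions:

```c
#include <stddef.h>

/* Sketch only: illustrative names, not the actual RedisAI definitions. */

typedef enum { DAG_OP_MODELRUN, DAG_OP_SCRIPTRUN } DagOpKind;

typedef struct RAI_DagOp {
    DagOpKind kind;
    union {
        struct RAI_ModelRunCtx *mctx;  /* valid when kind == DAG_OP_MODELRUN */
        struct RAI_ScriptRunCtx *sctx; /* valid when kind == DAG_OP_SCRIPTRUN */
    } ctx;
} RAI_DagOp;

typedef struct RedisAI_RunInfo {
    RAI_DagOp **ops;   /* ordered ops; a plain MODELRUN becomes a 1-op DAG */
    size_t nops;
} RedisAI_RunInfo;
```

A plain MODELRUN or SCRIPTRUN then becomes the degenerate case of a run info holding a single op, which is what lets those commands adopt the same code path gradually.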
I'll be happy to do that as soon as you're done with the rest, just let me know. We need to be careful with the semantics of values we reply with from the local context. Say we have:
[example not captured]
then we should get:
[expected reply not captured]
Ok, this uses two MODELRUNs, but in the current code it's the same thing as:
[example not captured]
I would expect:
[expected reply not captured]
Another comment: should we call the command simply [name not captured]?
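The semantic point about local-context replies, that a TENSORGET inside a DAG should snapshot the tensor at execution time so a later write cannot alter an earlier read's reply, can be sketched with a toy single-slot local context (hypothetical names, not RedisAI code):

```c
#include <string.h>
#include <assert.h>

#define MAX_REPLIES 8

/* Toy single-slot local context; illustrative only. */
typedef struct {
    const char *name;
    double value;
    double replies[MAX_REPLIES];  /* replies captured in execution order */
    int nreplies;
} DagCtx;

static void dag_tensorset(DagCtx *c, const char *name, double v) {
    c->name = name;
    c->value = v;
}

/* Capture the reply by value at execution time, not by reference:
 * later writes to the local context must not affect earlier replies. */
static void dag_tensorget(DagCtx *c, const char *name) {
    assert(strcmp(c->name, name) == 0);
    c->replies[c->nreplies++] = c->value;
}
```

Copying at read time is what the test mentioned in the commit list ("write after read does not alter the tensorget read value") is checking.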
…nfo, background_workers, and model_script_run_session
…nter, it holds an array of (mctx | sctx) pointers (within RAI_DagOp). To be gradually adopted on modelrun and scriptrun ( for now only on dagrun )
…for more complex patterns )
…ext (ensuring that write after read does not alter the tensorget read value)
…. If no MODELRUN default to CPU
…mmand from RedisAI_DagRunSession
…n to better describe the context in which RedisAI blocking commands MODELRUN and SCRIPTRUN operate
… working as expected for TF
LGTM at a first complete review pass. I'd say let's do a last review pass after rebasing on master + leak fixes.
…e of RedisAI_RunInfo helper methods ( consistent constructors and destructors among modelrun,scriptrun and dagrun)
…or datatype (removed possibility to retain RString)
while (next_item != NULL) {
    RedisAI_RunInfo *next_rinfo = (RedisAI_RunInfo *)next_item->value;

    if (RAI_RunInfoBatchable(rinfo, next_rinfo) == 0) {
Here we need to prevent a DAG run info from being considered batchable. I'm pushing a commit to make that happen.
BTW, we should note this limitation and extend batching to DAG as well at some point. This will require some DAG refactoring: individual MODELRUN commands should be placed in the queue, so they have a chance to be batched. We'll have to make the batching logic slightly more complicated to avoid the order of DAG commands changing for an individual client due to batching.
We can leave this out for now, but I'll open an issue.
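The guard itself is small. Below is a sketch of the queue scan with the DAG check in place; all names are illustrative, and the real `RAI_RunInfoBatchable` also checks criteria such as batch size and input shapes:

```c
#include <stddef.h>
#include <string.h>

/* Sketch with illustrative names; the real batching criteria are richer. */
typedef struct RedisAI_RunInfo {
    int is_dag;                      /* nonzero for an AI.DAGRUN run info */
    const char *modelkey;
    struct RedisAI_RunInfo *next;    /* queue link */
} RedisAI_RunInfo;

/* Two runs are batchable only if neither is a DAG and they target the
 * same model (the real check also covers batchsize, inputs, etc.). */
static int run_info_batchable(const RedisAI_RunInfo *a,
                              const RedisAI_RunInfo *b) {
    if (a->is_dag || b->is_dag) return 0;
    return strcmp(a->modelkey, b->modelkey) == 0;
}

/* Collect up to cap batchable runs from the queue, starting at head. */
static int collect_batch(RedisAI_RunInfo *head,
                         RedisAI_RunInfo **batch, int cap) {
    int n = 0;
    batch[n++] = head;
    for (RedisAI_RunInfo *it = head->next; it != NULL && n < cap; it = it->next)
        if (run_info_batchable(head, it))
            batch[n++] = it;
    return n;
}
```

A DAG run info is simply never merged into a batch, which sidesteps the per-client ordering question until batching is extended to DAGs.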
WIP - AI.DAGRUN, commands code refactoring and improvement (#322)

* [wip] refactored TENSORGET, TENSORSET, and MODELRUN to be re-used by AI.DAGRUN
* [add] first version of dagrun with modelrun and persist working
* [wip] refactored non-command methods within redisai.c into dag, run_info, background_workers, and model_script_run_session
* [fix] fixed wrong includes
* [add] adding init methods to RAI_DagOp and RedisAI_RunInfo
* [wip] ai.tensorset, PERSIST and LOAD working as expected
* [add] dagrun's tensorset and tensorget working as expected
* [add] extended test for tensorset and tensorget
* [wip] wip on modelrun within dagrun
* [wip] first version of tensorget |> modelrun |> tensorget working
* [add] refactor RunInfo, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers (within RAI_DagOp). To be gradually adopted on modelrun and scriptrun (for now only on dagrun)
* [add] added redisai-py as a requirement for tests (it helps testing for more complex patterns)
* [add] added test for semantics of values we reply from the local context (ensuring that write after read does not alter the tensorget read value)
* [wip] discover the DAGRUN device queue from the arguments of MODELRUN. If no MODELRUN default to CPU
* [fix] fixed wrong reference passing on RedisAI_Parse_ModelRun_RedisCommand
* [fix] fixed wrong reference passing on RedisAI_Parse_ModelRun_RedisCommand from RedisAI_DagRunSession
* [wip] wip on minor optimizations
* [add] extended dag.h to have proper documentation
* [add] extended model_script_run_session header file with documentation to better describe the context in which RedisAI blocking commands MODELRUN and SCRIPTRUN operate
* [add] moved configuration properties and parsing out of redisai.c to config.h/c
* [add] backends_intra_op_parallelism and backends_inter_op_parallelism working as expected for TF
* [add] intra_op and inter_op parallelism working as expected for TF backend
* [add] exclude perf profile reports from git
* [add] wip on mem sanitizer
* [add] working on RAI_FreeRunInfo and RAI_FreeDagOp
* [add] using RAI_InitRunInfo on RedisAI_ScriptRun_RedisCommand
* [add] using array data type on RedisAI_RunInfo rinfo->outkeys
* [add] small leaks fix for dag
* [add] partial refactor of RedisAI_ScriptRun_RedisCommand to make use of RedisAI_RunInfo helper methods (consistent constructors and destructors among modelrun, scriptrun and dagrun)
* [add] kickoff negative testing of AI.DAGRUN
* [add] extended negative testing on dag and removed complexity of tensor datatype (removed possibility to retain RString)
* [add] extended AI.DAGRUN negative testing. fixed negative testing leaks
* [add] more extensive tests and several touches on same keys on AI.DAGRUN ci
* Fixes for macOS and in general (#327)
* Prevent a DAG run info from being considered batchable
* Ensure sync on failing ONNX test

Co-authored-by: Luca Antiga <[email protected]>
kickoff version that enables #88
* RedisAI_ModelRun_RedisCommand (part 1 of 3 (main thread portion)) to be capable of using local tensors as input
* RedisAI_RunInfo (… first)
Extra (enable further optimizations):