Conversation
@filipecosta90 filipecosta90 commented Apr 12, 2020

Kickoff version that enables #88.

  • refactored ensureRunQueue to also pass RunQueueInfo
  • refactored TENSORGET and TENSORSET to be re-used by AI.DAGRUN
  • refactored MODELRUN's RedisAI_ModelRun_RedisCommand (part 1 of 3: main-thread portion) to be capable of using local tensors as input
  • added a local tensors structure to RedisAI_RunInfo
  • refactored MODELRUN's RedisAI_Run_Reply (part 3 of 3: main-thread portion) to be capable of using local tensors as output (requires changes to RedisAI_RunInfo first)
  • perform the persist in the unblock callback
  • enabled running more than one blocking command inside a DAG, reusing the same RunInfo data
  • simple tensorset |> modelrun |> tensorget, with and without touching the keyspace via writes (reads only via LOAD)
  • discover the DAGRUN device queue from the arguments of MODELRUN; if there is no MODELRUN, default to CPU
  • moved the model context from rinfo to DagOp: instead of a single mctx and sctx pointer, RAI_DagOp holds an array of (mctx | sctx) pointers, to be gradually adopted by MODELRUN and SCRIPTRUN (for now only on DAGRUN)
  • write-after-read test on the DAG's local context
  • proper negative testing
  • leak check on negative testing
  • rebase on current master (minor batching refactoring)
  • leak check on positive testing
  • removed complexity and potentially unsafe behaviour from tensorset

    Extra (enables further optimizations):
  • enabled INTRA_OP_PARALLELISM and INTER_OP_PARALLELISM load-time configurations, for now only enforced on the TF backend (kickoff for "RedisAI needs to manage the number of background threads for backends" #203)
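The local-tensor flow described above (keyspace touched only via LOAD and PERSIST, everything else running against a per-DAG local context) can be sketched as a small pure-Python model. This is a hypothetical illustration of the semantics, not RedisAI code; all names here are made up.

```python
# Hypothetical sketch of AI.DAGRUN local-tensor semantics (not RedisAI code):
# ops run against a local context; the keyspace is only read via LOAD and
# only written via PERSIST, at the end of the DAG (the unblock callback).

def run_dag(keyspace, load, persist, ops):
    local = {k: keyspace[k] for k in load}   # LOAD: copy tensors in
    replies = []
    for op, args in ops:
        if op == "TENSORSET":
            name, value = args
            local[name] = value
        elif op == "TENSORGET":
            replies.append(local[args])
        elif op == "MODELRUN":
            fn, inputs, outputs = args       # fn stands in for a backend run
            results = fn(*(local[i] for i in inputs))
            for name, r in zip(outputs, results):
                local[name] = r
    for k in persist:                        # PERSIST: write back at the end
        keyspace[k] = local[k]
    return replies

keyspace = {"a": [1.0, 2.0]}
double = lambda t: ([x * 2 for x in t],)     # toy "model" with one output
replies = run_dag(
    keyspace,
    load=["a"], persist=["b"],
    ops=[("MODELRUN", (double, ["a"], ["b"])),
         ("TENSORGET", "b")],
)
# replies == [[2.0, 4.0]] and keyspace["b"] == [2.0, 4.0]
```

Note how intermediate tensors never touch the keyspace unless they are listed under PERSIST, which is what allows the whole DAG to run off the main thread without races.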

@filipecosta90 filipecosta90 requested a review from lantiga April 12, 2020 19:30
@codecov
codecov bot commented Apr 12, 2020

Codecov Report

Merging #322 into master will increase coverage by 1.37%.
The diff coverage is 72.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
+ Coverage   56.75%   58.12%   +1.37%     
==========================================
  Files          27       32       +5     
  Lines        5050     5540     +490     
==========================================
+ Hits         2866     3220     +354     
- Misses       2184     2320     +136     
Impacted Files Coverage Δ
src/redisai.h 0.00% <ø> (ø)
src/util/dict.c 35.30% <ø> (+2.73%) ⬆️
src/config.c 18.18% <18.18%> (ø)
src/backends/tensorflow.c 67.53% <24.00%> (-3.16%) ⬇️
src/backends/util.c 41.66% <33.33%> (ø)
src/stats.c 80.85% <50.00%> (-19.15%) ⬇️
src/dag.c 69.18% <69.18%> (ø)
src/model.c 70.43% <72.61%> (+2.38%) ⬆️
src/run_info.c 80.76% <80.76%> (ø)
src/redisai.c 79.63% <81.35%> (+2.49%) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d7f241f...d7f241f.

* 3. have reply callback put the data back into the key
*
* This way we avoid any race condition. The only gotcha is making sure no one
* overwrites the model until it's done computing.
lantiga (Contributor):

I find this comment a bit misleading:
RAI_ModelRunCtxCreate will shallow-copy the model, so since this command runs in the main thread there's no need for locking or waiting: the run context will hold the model that was found in the key at the time this portion of the command was executed. If a new model is set while the run is waiting in the queue, the old model is disposed at the key level but retained in the run context, and still runs; it is then deallocated when the run context is deallocated.
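The retain/release behaviour described here can be illustrated with a tiny refcount model. This is an assumed sketch (the class and method names are hypothetical, not the actual RedisAI API): the run context takes its own reference at parse time, so a later SET on the key drops the key's reference but cannot free the model while a queued run still holds it.

```python
# Illustrative refcounting sketch (hypothetical names, not RedisAI code).

class Model:
    def __init__(self, name):
        self.name = name
        self.refcount = 1        # the key's reference
        self.freed = False

    def retain(self):
        self.refcount += 1
        return self

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True

keyspace = {"m": Model("v1")}
run_ctx_model = keyspace["m"].retain()   # shallow copy at command-parse time

old = keyspace["m"]
old.release()                            # key drops its reference...
keyspace["m"] = Model("v2")              # ...model overwritten while run is queued

assert not run_ctx_model.freed           # old model still alive for the queued run
run_ctx_model.release()                  # run context deallocated after the run
assert run_ctx_model.freed
```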

filipecosta90 (Collaborator, Author):

@lantiga agree. This was an old comment that just moved from inside the function to the function description.
https://github.com/RedisAI/RedisAI/blob/master/src/redisai.c#L1139
Should we make this change in a separate pull request, merge it on master and then update here? Or is just updating here enough?

lantiga (Contributor):

Right, let's just update here. I'd like to fast-track this work anyway.

filipecosta90 (Collaborator, Author):

> Looking very good so far.
>
> I propose to refactor RunInfo, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers. I think it's better to do it now than wait too long and have to change code pervasively again. This will open the door to the full DAG implementation which at this point I'd like to get done shortly.

@lantiga the latest commit makes use of this change for DAGRUN, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers (within RAI_DagOp). To be gradually adopted by MODELRUN and SCRIPTRUN (for now only on DAGRUN, to be able to fully test and harden it).
What do you think?

@lantiga (Contributor) commented Apr 13, 2020

Looking very good so far.

I propose to refactor RunInfo, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers. I think it's better to do it now than wait too long and have to change code pervasively again. This will open the door to the full DAG implementation which at this point I'd like to get done shortly. I'll be happy to do that as soon as you're done with the rest, just let me know.

We need to be careful with the semantics of values we reply with from the local context. Say we have:

TENSORSET foo ...
MODELRUN m1 INPUTS foo OUTPUTS bar
TENSORGET bar
MODELRUN m2 INPUTS foo OUTPUTS bar

then we should get bar after m1 has run, not m2.

OK, this uses two MODELRUNs, but in the current code it's the same as

TENSORSET foo ...
TENSORSET bar ...
TENSORGET bar ...
MODELRUN m INPUTS foo OUTPUTS bar
TENSORGET bar ...

I would expect bar to have two different values in output. What do you think?
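The semantics being proposed here can be pinned down with a short sketch (pure Python with hypothetical names, not RedisAI code): each TENSORGET reply is captured at its position in the DAG, so a later write to the same local key must not alter an earlier reply.

```python
# Sketch of the reply semantics discussed above: replies are appended as each
# op executes, so a TENSORGET between two writes to `bar` sees the first
# value, not the second.

def run_dag_ops(ops):
    local, replies = {}, []
    for op, name, value in ops:
        if op == "TENSORSET":
            local[name] = value
        elif op == "TENSORGET":
            # reply captured now; later writes must not change it
            replies.append(list(local[name]))
    return replies

replies = run_dag_ops([
    ("TENSORSET", "foo", [1]),
    ("TENSORSET", "bar", [10]),   # first value of bar
    ("TENSORGET", "bar", None),
    ("TENSORSET", "bar", [20]),   # write after read (stand-in for a MODELRUN output)
    ("TENSORGET", "bar", None),
])
# replies == [[10], [20]]: bar has two different values in the two replies
```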

@lantiga (Contributor) commented Apr 13, 2020

Another comment: should we call the command simply AI.RUN?

@filipecosta90 (Collaborator, Author)

> I propose to refactor RunInfo

Agree @lantiga. I'm also refactoring RedisAI_DagRun_RedisCommand to only parse LOAD and PERSIST, split the commands, and pass all the logic to the background thread, as you've mentioned.

Commit: added test for semantics of values we reply from the local context (ensuring that write after read does not alter the tensorget read value)
@filipecosta90 filipecosta90 requested a review from lantiga April 14, 2020 17:22
@lantiga (Contributor) commented Apr 17, 2020

LGTM at a first complete review pass. I'd say let's do a last review pass after rebasing on master + leak fixes.

lantiga previously approved these changes Apr 18, 2020
while (next_item != NULL) {
RedisAI_RunInfo *next_rinfo = (RedisAI_RunInfo *)next_item->value;

if (RAI_RunInfoBatchable(rinfo, next_rinfo) == 0) {
lantiga (Contributor):

Here we need to prevent a DAG run info from being considered batchable. I'm pushing a commit to make that happen.

lantiga (Contributor):

BTW, we should note this limitation and extend batching to DAGs as well at some point. This will require some DAG refactoring: individual MODELRUN commands should be placed in the queue so they have a chance to be batched. We'll also have to make the batching logic slightly more complicated, to avoid the order of DAG commands changing for an individual client due to batching.
We can leave this out for now, but I'll open an issue.
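A minimal sketch of the restriction (hypothetical field and function names, not the actual queue code): while scanning the run queue for entries batchable with the current one, anything carrying a DAG is skipped and left in the queue.

```python
# Hypothetical sketch of the batching restriction: collect run infos to batch
# with the head of the queue, skipping any entry that is a DAG run.

def collect_batch(queue, batchable_with):
    batch, rest = [queue[0]], []
    for rinfo in queue[1:]:
        if not rinfo.get("is_dag") and batchable_with(batch[0], rinfo):
            batch.append(rinfo)          # e.g. same model, compatible shapes
        else:
            rest.append(rinfo)           # DAG runs (and non-batchable) stay queued
    return batch, rest

same_model = lambda a, b: a["model"] == b["model"]
queue = [
    {"id": 1, "model": "m", "is_dag": False},
    {"id": 2, "model": "m", "is_dag": True},    # DAG: must not be batched
    {"id": 3, "model": "m", "is_dag": False},
]
batch, rest = collect_batch(queue, same_model)
# batch ids: [1, 3]; the DAG entry (id 2) stays in the queue
```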

@lantiga lantiga merged commit 6f36dab into master Apr 18, 2020
@filipecosta90 filipecosta90 changed the title WIP - AI.DAGRUN, commands code refactoring and improvement AI.DAGRUN support, code refactoring and improvement May 1, 2020
@filipecosta90 filipecosta90 deleted the perf-dagrun branch May 1, 2020 10:49
lantiga added a commit that referenced this pull request May 6, 2020
 
WIP - AI.DAGRUN, commands code refactoring and improvement (#322)

* [wip] refactored TENSORGET, TENSORSET, and MODELRUN  to be re-used by AI.DAGRUN

* [add] first version of dagrun with modelrun and persist working

* [wip] refactored non-command methods within redisai.c into dag, run_info, background_workers, and model_script_run_session

* [fix] fixed wrong includes

* [add] adding init methods to RAI_DagOp and RedisAI_RunInfo

* [wip] ai.tensorset, PERSIST and LOAD working as expected

* [add] dagrun's tensorset and tensorget working as expected

* [add] extended test for tensorset and tensorget

* [wip] wip on modelrun within dagrun

* [wip] first version of tensorget |> modelrun |> tensorget working

* [add] refactor RunInfo, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers (within RAI_DagOp). To be gradually adopted on modelrun and scriptrun ( for now only on dagrun )

* [add] added redisai-py as a requirement for tests ( it helps testing for more complex patterns )

* [add] added test for semantics of values we reply from the local context (ensuring that write after read does not alter the tensorget read value)

* [wip] discover the DAGRUN device queue from the arguments of MODELRUN. If no MODELRUN default to CPU

* [fix] fixed wrong reference passing on RedisAI_Parse_ModelRun_RedisCommand

* [fix] fixed wrong reference passing on RedisAI_Parse_ModelRun_RedisCommand from RedisAI_DagRunSession

* [wip] wip on minor optimizations

* [add] extended dag.h to have proper documentation

* [add] extended model_script_run_session header file with documentation to better describe the context in which RedisAI blocking commands MODELRUN and SCRIPTRUN operate

* [add] moved configuration properties and parsing out of redisai.c to config.h/c

* [add] backends_intra_op_parallelism and backends_inter_op_parallelism working as expected for TF

* [add] intra_op and inter_op parallelism working as expected for TF backend

* [add] exclude perf profile reports from git

* [add] wip on mem sanitizer

* [add] working on RAI_FreeRunInfo and RAI_FreeDagOp

* [add] using RAI_InitRunInfo on RedisAI_ScriptRun_RedisCommand

* [add] using array data type on RedisAI_RunInfo rinfo->outkeys

* [add] small leaks fix for dag

* [add] partial refactor of RedisAI_ScriptRun_RedisCommand to make usage of RedisAI_RunInfo helper methods ( consistent constructors and destructors among modelrun,scriptrun and dagrun)

* [add] kickoff negative testing of AI.DAGRUN

* [add] extended negative testing on dag and removed complexity of tensor datatype (removed possibility to retain RString)

* [add] extended AI.DAGRUN negative testing. fixed negative testing leaks

* [add] more extensive tests and several touches on same keys in the AI.DAGRUN CI

* Fixes for macOS and in general (#327)

* Prevent a DAG run info to be considered batchable

* Ensure sync on failing ONNX test

Co-authored-by: Luca Antiga <[email protected]>