Conversation
@filipecosta90 filipecosta90 commented Apr 12, 2020

Kickoff version that enables #88.

  • refactored ensureRunQueue to also pass RunQueueInfo
  • refactored TENSORGET and TENSORSET to be re-used by AI.DAGRUN
  • refactored MODELRUN's RedisAI_ModelRun_RedisCommand (part 1 of 3: main-thread portion) to be capable of using local tensors as input
  • added a local tensors structure to RedisAI_RunInfo
  • refactored MODELRUN's RedisAI_Run_Reply (part 3 of 3: main-thread portion) to be capable of using local tensors as output (requires changes to RedisAI_RunInfo first)
  • perform the persist in the unblock callback
  • enabled running more than one blocking command inside a DAG, reusing the same RunInfo data
  • simple tensorset |> modelrun |> tensorget, with and without touching the keyspace via writes (reads only via LOAD)
  • discover the DAGRUN device queue from the arguments of MODELRUN; if there is no MODELRUN, default to CPU
  • moved the model context from rinfo to DagOp: instead of a single mctx and sctx pointer, RAI_DagOp holds an array of (mctx | sctx) pointers, to be gradually adopted by MODELRUN and SCRIPTRUN (for now only on DAGRUN)
  • write-after-read test on the DAG's local context
  • proper negative testing
  • leak check on negative testing
  • rebase on current master (minor batching refactoring)
  • leak check on positive testing
  • removed complexity and potentially unsafe behaviour from tensorset

    Extra (enables further optimizations):
  • enabled INTRA_OP_PARALLELISM and INTER_OP_PARALLELISM load-time configurations, for now only enforced on the TF backend (kickoff for "RedisAI needs to manage the number of background threads for backends" #203)
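The local-tensor flow described above (keyspace touched only via LOAD and PERSIST, everything else running against a per-DAG local context) can be sketched as a small pure-Python model. This is a hypothetical illustration of the semantics, not RedisAI code; all names here are made up.

```python
# Hypothetical sketch of AI.DAGRUN local-tensor semantics (not RedisAI code):
# ops run against a local context; the keyspace is only read via LOAD and
# only written via PERSIST, at the end of the DAG (the unblock callback).

def run_dag(keyspace, load, persist, ops):
    local = {k: keyspace[k] for k in load}   # LOAD: copy tensors in
    replies = []
    for op, args in ops:
        if op == "TENSORSET":
            name, value = args
            local[name] = value
        elif op == "TENSORGET":
            replies.append(local[args])
        elif op == "MODELRUN":
            fn, inputs, outputs = args       # fn stands in for a backend run
            results = fn(*(local[i] for i in inputs))
            for name, r in zip(outputs, results):
                local[name] = r
    for k in persist:                        # PERSIST: write back at the end
        keyspace[k] = local[k]
    return replies

keyspace = {"a": [1.0, 2.0]}
double = lambda t: ([x * 2 for x in t],)     # toy "model" with one output
replies = run_dag(
    keyspace,
    load=["a"], persist=["b"],
    ops=[("MODELRUN", (double, ["a"], ["b"])),
         ("TENSORGET", "b")],
)
# replies == [[2.0, 4.0]] and keyspace["b"] == [2.0, 4.0]
```

Note how intermediate tensors never touch the keyspace unless they are listed under PERSIST, which is what allows the whole DAG to run off the main thread without races.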

@filipecosta90 filipecosta90 requested a review from lantiga April 12, 2020 19:30
@codecov
codecov bot commented Apr 12, 2020

Codecov Report

Merging #322 into master will increase coverage by 1.37%.
The diff coverage is 72.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
+ Coverage   56.75%   58.12%   +1.37%     
==========================================
  Files          27       32       +5     
  Lines        5050     5540     +490     
==========================================
+ Hits         2866     3220     +354     
- Misses       2184     2320     +136     
Impacted Files Coverage Δ
src/redisai.h 0.00% <ø> (ø)
src/util/dict.c 35.30% <ø> (+2.73%) ⬆️
src/config.c 18.18% <18.18%> (ø)
src/backends/tensorflow.c 67.53% <24.00%> (-3.16%) ⬇️
src/backends/util.c 41.66% <33.33%> (ø)
src/stats.c 80.85% <50.00%> (-19.15%) ⬇️
src/dag.c 69.18% <69.18%> (ø)
src/model.c 70.43% <72.61%> (+2.38%) ⬆️
src/run_info.c 80.76% <80.76%> (ø)
src/redisai.c 79.63% <81.35%> (+2.49%) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d7f241f...d7f241f.

* 3. have reply callback put the data back into the key
*
* This way we avoid any race condition. The only gotcha is making sure no one
* overwrites the model until it's done computing.
lantiga (Contributor):

I find this comment a bit misleading:
RAI_ModelRunCtxCreate will shallow-copy the model, so since this command runs in the main thread there's no need for locking or waiting: the run context will hold the model that was found in the key at the time this portion of the command was executed. If a new model is set while the run is waiting in the queue, the old model is disposed at the key level but retained in the run context, and still runs; it is then deallocated when the run context is deallocated.
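The retain/release behaviour described here can be illustrated with a tiny refcount model. This is an assumed sketch (the class and method names are hypothetical, not the actual RedisAI API): the run context takes its own reference at parse time, so a later SET on the key drops the key's reference but cannot free the model while a queued run still holds it.

```python
# Illustrative refcounting sketch (hypothetical names, not RedisAI code).

class Model:
    def __init__(self, name):
        self.name = name
        self.refcount = 1        # the key's reference
        self.freed = False

    def retain(self):
        self.refcount += 1
        return self

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True

keyspace = {"m": Model("v1")}
run_ctx_model = keyspace["m"].retain()   # shallow copy at command-parse time

old = keyspace["m"]
old.release()                            # key drops its reference...
keyspace["m"] = Model("v2")              # ...model overwritten while run is queued

assert not run_ctx_model.freed           # old model still alive for the queued run
run_ctx_model.release()                  # run context deallocated after the run
assert run_ctx_model.freed
```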

filipecosta90 (Collaborator, Author):

@lantiga agree. This was an old comment that just moved from inside the function to the function description.
https://github.com/RedisAI/RedisAI/blob/master/src/redisai.c#L1139
Should we make this change in a separate pull request, merge it on master and then update here? Or is just updating here enough?

lantiga (Contributor):

Right, let's just update here. I'd like to fast-track this work anyway.

filipecosta90 (Collaborator, Author):

> Looking very good so far.
>
> I propose to refactor RunInfo, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers. I think it's better to do it now than wait too long and have to change code pervasively again. This will open the door to the full DAG implementation which at this point I'd like to get done shortly.

@lantiga the latest commit makes use of this change for DAGRUN, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers (within RAI_DagOp). To be gradually adopted by MODELRUN and SCRIPTRUN (for now only on DAGRUN, to be able to fully test and harden it).
What do you think?

@lantiga (Contributor) commented Apr 13, 2020

Looking very good so far.

I propose to refactor RunInfo, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers. I think it's better to do it now than wait too long and have to change code pervasively again. This will open the door to the full DAG implementation which at this point I'd like to get done shortly. I'll be happy to do that as soon as you're done with the rest, just let me know.

We need to be careful with the semantics of values we reply with from the local context. Say we have:

TENSORSET foo ...
MODELRUN m1 INPUTS foo OUTPUTS bar
TENSORGET bar
MODELRUN m2 INPUTS foo OUTPUTS bar

then we should get bar after m1 has run, not m2.

OK, this uses two MODELRUNs, but in the current code it's the same as

TENSORSET foo ...
TENSORSET bar ...
TENSORGET bar ...
MODELRUN m INPUTS foo OUTPUTS bar
TENSORGET bar ...

I would expect bar to have two different values in output. What do you think?
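The semantics being proposed here can be pinned down with a short sketch (pure Python with hypothetical names, not RedisAI code): each TENSORGET reply is captured at its position in the DAG, so a later write to the same local key must not alter an earlier reply.

```python
# Sketch of the reply semantics discussed above: replies are appended as each
# op executes, so a TENSORGET between two writes to `bar` sees the first
# value, not the second.

def run_dag_ops(ops):
    local, replies = {}, []
    for op, name, value in ops:
        if op == "TENSORSET":
            local[name] = value
        elif op == "TENSORGET":
            # reply captured now; later writes must not change it
            replies.append(list(local[name]))
    return replies

replies = run_dag_ops([
    ("TENSORSET", "foo", [1]),
    ("TENSORSET", "bar", [10]),   # first value of bar
    ("TENSORGET", "bar", None),
    ("TENSORSET", "bar", [20]),   # write after read (stand-in for a MODELRUN output)
    ("TENSORGET", "bar", None),
])
# replies == [[10], [20]]: bar has two different values in the two replies
```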

@lantiga (Contributor) commented Apr 13, 2020

Another comment: should we call the command simply AI.RUN?

@filipecosta90 (Collaborator, Author)

> I propose to refactor RunInfo

Agree @lantiga. I'm also refactoring RedisAI_DagRun_RedisCommand to only parse LOAD and PERSIST, split the commands, and pass all the logic to the background thread, as you've mentioned.

Commit: added test for semantics of values we reply from the local context (ensuring that write after read does not alter the tensorget read value)
@filipecosta90 filipecosta90 requested a review from lantiga April 14, 2020 17:22
@lantiga (Contributor) commented Apr 17, 2020

LGTM at a first complete review pass. I'd say let's do a last review pass after rebasing on master + leak fixes.

lantiga previously approved these changes Apr 18, 2020
while (next_item != NULL) {
RedisAI_RunInfo *next_rinfo = (RedisAI_RunInfo *)next_item->value;

if (RAI_RunInfoBatchable(rinfo, next_rinfo) == 0) {
lantiga (Contributor):

Here we need to prevent a DAG run info from being considered batchable. I'm pushing a commit to make that happen.

lantiga (Contributor):

BTW, we should note this limitation and extend batching to DAGs as well at some point. This will require some DAG refactoring: individual MODELRUN commands should be placed in the queue so they have a chance to be batched. We'll also have to make the batching logic slightly more complicated, to avoid the order of DAG commands changing for an individual client due to batching.
We can leave this out for now, but I'll open an issue.
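A minimal sketch of the restriction (hypothetical field and function names, not the actual queue code): while scanning the run queue for entries batchable with the current one, anything carrying a DAG is skipped and left in the queue.

```python
# Hypothetical sketch of the batching restriction: collect run infos to batch
# with the head of the queue, skipping any entry that is a DAG run.

def collect_batch(queue, batchable_with):
    batch, rest = [queue[0]], []
    for rinfo in queue[1:]:
        if not rinfo.get("is_dag") and batchable_with(batch[0], rinfo):
            batch.append(rinfo)          # e.g. same model, compatible shapes
        else:
            rest.append(rinfo)           # DAG runs (and non-batchable) stay queued
    return batch, rest

same_model = lambda a, b: a["model"] == b["model"]
queue = [
    {"id": 1, "model": "m", "is_dag": False},
    {"id": 2, "model": "m", "is_dag": True},    # DAG: must not be batched
    {"id": 3, "model": "m", "is_dag": False},
]
batch, rest = collect_batch(queue, same_model)
# batch ids: [1, 3]; the DAG entry (id 2) stays in the queue
```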

@lantiga lantiga merged commit 6f36dab into master Apr 18, 2020
@filipecosta90 filipecosta90 changed the title WIP - AI.DAGRUN, commands code refactoring and improvement AI.DAGRUN support, code refactoring and improvement May 1, 2020
@filipecosta90 filipecosta90 deleted the perf-dagrun branch May 1, 2020 10:49
lantiga added a commit that referenced this pull request May 6, 2020
 
WIP - AI.DAGRUN, commands code refactoring and improvement (#322)

* [wip] refactored TENSORGET, TENSORSET, and MODELRUN  to be re-used by AI.DAGRUN

* [add] first version of dagrun with modelrun and persist working

* [wip] refactored non-command methods within redisai.c into dag, run_info, background_workers, and model_script_run_session

* [fix] fixed wrong includes

* [add] adding init methods to RAI_DagOp and RedisAI_RunInfo

* [wip] ai.tensorset, PERSIST and LOAD working as expected

* [add] dagrun's tensorset and tensorget working as expected

* [add] extended test for tensorset and tensorget

* [wip] wip on modelrun within dagrun

* [wip] first version of tensorget |> modelrun |> tensorget working

* [add] refactor RunInfo, so that instead of a single mctx and sctx pointer, it holds an array of (mctx | sctx) pointers (within RAI_DagOp). To be gradually adopted on modelrun and scriptrun ( for now only on dagrun )

* [add] added redisai-py as a requirement for tests ( it helps testing for more complex patterns )

* [add] added test for semantics of values we reply from the local context (ensuring that write after read does not alter the tensorget read value)

* [wip] discover the DAGRUN device queue from the arguments of MODELRUN. If no MODELRUN default to CPU

* [fix] fixed wrong reference passing on RedisAI_Parse_ModelRun_RedisCommand

* [fix] fixed wrong reference passing on RedisAI_Parse_ModelRun_RedisCommand from RedisAI_DagRunSession

* [wip] wip on minor optimizations

* [add] extended dag.h to have proper documentation

* [add] extended model_script_run_session header file with documentation to better describe the context in which RedisAI blocking commands MODELRUN and SCRIPTRUN operate

* [add] moved configuration properties and parsing out of redisai.c to config.h/c

* [add] backends_intra_op_parallelism and backends_inter_op_parallelism working as expected for TF

* [add] intra_op and inter_op parallelism working as expected for TF backend

* [add] exclude perf profile reports from git

* [add] wip on mem sanitizer

* [add] working on RAI_FreeRunInfo and RAI_FreeDagOp

* [add] using RAI_InitRunInfo on RedisAI_ScriptRun_RedisCommand

* [add] using array data type on RedisAI_RunInfo rinfo->outkeys

* [add] small leaks fix for dag

* [add] partial refactor of RedisAI_ScriptRun_RedisCommand to make usage of RedisAI_RunInfo helper methods ( consistent constructors and destructors among modelrun,scriptrun and dagrun)

* [add] kickoff negative testing of AI.DAGRUN

* [add] extended negative testing on dag and removed complexity of tensor datatype (removed possibility to retain RString)

* [add] extended AI.DAGRUN negative testing. fixed negative testing leaks

* [add] more extensive tests and several touches on same keys in the AI.DAGRUN CI

* Fixes for macOS and in general (#327)

* Prevent a DAG run info to be considered batchable

* Ensure sync on failing ONNX test

Co-authored-by: Luca Antiga <[email protected]>