Error handling updates in quota management usage #367
Conversation
Improved error handling.
Thanks for the PR. Can you please address the logger issue and resolve the merge issue?
pkg/controller/quota/quotaforestmanager/qm_lib_backend_with_quotasubt_mgr.go
Removed unused code. Fixed other linter warnings.
Error handling updates in QM trees and requests.
Added rules to update the deployment CRDs.
I am OK to approve this PR; a single test failed due to resource jitter issues.
fixes: #335
More error handling updates
Minor cleanup in the Fits function.
Logging improvements. Additional test cases in the quota forest manager.
```diff
@@ -115,9 +115,9 @@ endif

 run-test:
 	$(info Running unit tests...)
-	hack/make-rules/test.sh $(WHAT) $(TESTS)
+	go test -v -coverprofile cover.out -race -parallel 8 ./pkg/...
```
@asm582, @KPostOffice, @dmatch01, @metalcycling let me know if you are OK with the verbose output from testing and with the exclusion of the ./cmd package.
```diff
@@ -1235,8 +1241,11 @@ func (qjm *XController) ScheduleNext() {
 	// HeadOfLine logic
 	HOLStartTime := time.Now()
 	forwarded := false
+	fowardingLoopCount := 1
```
@asm582, @dmatch01, @KPostOffice and @metalcycling please take a very close look at the changes starting from this line.
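For context, a heavily hedged sketch of what a counter like this typically does in a head-of-line retry loop; `maxForwardingLoops` and `tryDispatch` are hypothetical names, not the PR's actual logic:

```go
// Sketch only: the new counter presumably bounds how many times the same
// head-of-line AppWrapper is retried within one dispatch cycle.
const maxForwardingLoops = 5 // hypothetical bound

forwarded := false
fowardingLoopCount := 1
for !forwarded && fowardingLoopCount <= maxForwardingLoops {
	forwarded = tryDispatch() // hypothetical stand-in for the dispatch attempt
	fowardingLoopCount++
}
```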
```go
if fits {
	// aw is ready to go!
	apiQueueJob, e := qjm.queueJobLister.AppWrappers(qj.Namespace).Get(qj.Name)
	// apiQueueJob's ControllerFirstTimestamp is only microsecond level instead of nanosecond level
	if e != nil {
		klog.Errorf("[ScheduleNext] Unable to get AW %s from API cache &aw=%p Version=%s Status=%+v err=%#v", qj.Name, qj, qj.ResourceVersion, qj.Status, err)
		if qjm.quotaManager != nil && quotaFits {
```
@asm582, @dmatch01, @KPostOffice and @metalcycling I believe that this is one of the causes for the `client already allocated` set of errors.
cc: @tardieu
Quota is not released when the AW is deleted by clients, which is a different issue.
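To make the failure mode concrete, here is a minimal sketch of the release-on-error pattern under discussion, reusing the names from the snippet above; the `Release` method name is an assumption, not a confirmed API:

```go
// Sketch: any exit path taken after quota was acquired must give the quota
// back, otherwise the next allocation for the same consumer can fail with
// a "client already allocated" error. Release is an assumed method name.
if e != nil {
	klog.Errorf("[ScheduleNext] Unable to get AW %s from API cache, err=%#v", qj.Name, e)
	if qjm.quotaManager != nil && quotaFits {
		qjm.quotaManager.Release(qj) // return the reserved quota to the forest
	}
	return
}
```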
```go
} else if schedulingTimeExpired {
	// stop trying to dispatch after HeadOfLineHoldingTime
	// release quota if allocated
	if qjm.quotaManager != nil && quotaFits {
```
@asm582, @KPostOffice, @metalcycling in your testing, have you seen this condition happen?
@z103cb We have not seen this condition happen. Who sets `HeadOfLineHoldingTime`? Is this part of the helm command?
@asm582 the `HeadOfLineHoldingTime` is a command line option, defined at:

```go
fs.IntVar(&s.HeadOfLineHoldingTime, "headoflineholdingtime", s.HeadOfLineHoldingTime, "Number of seconds a job can stay at the Head Of Line without being bumped. Default is 0.")
```

AFAIK, the helm chart does not define an override.
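For illustration, a sketch of how a holding-time option of this shape is typically consumed; the `serverOption` field name and the exact comparison are assumptions based on the flag definition above, not code from this PR:

```go
// Sketch: schedulingTimeExpired in the snippets above plausibly derives
// from a check like this, using the HOLStartTime captured in the diff.
holdingTime := time.Duration(qjm.serverOption.HeadOfLineHoldingTime) * time.Second
schedulingTimeExpired := holdingTime > 0 && time.Now().After(HOLStartTime.Add(holdingTime))
```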
```go
	break
} else { // Try to dispatch again after one second
	if qjm.quotaManager != nil && quotaFits {
```
@asm582, @metalcycling have you seen this happen in your testing?
I think retries could be avoided here explicitly; an undispatched AW should be picked up in the next dispatch cycle.
@asm582 can you be more specific? I really don't understand your comment.
```diff
@@ -39,6 +40,9 @@ type Manager struct {
 	consumerInfos map[string]*ConsumerInfo
 	// mode of the quota manager; initially in maintenance mode
 	mode Mode
+
+	// mutex to serialize access to the manager's state
+	mutex sync.RWMutex
```
@asm582, @metalcycling and @dmatch01 I believe a single mutex would be sufficient for now and for MCAD's use of the library. The entire quota forest library is not thread safe; we'll have to figure out how to make it so, and perhaps simplify it.
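A minimal sketch of the single-lock approach described here, assuming hypothetical method names: every entry point into the quota forest takes the manager's `RWMutex`, so callers never touch the non-thread-safe forest concurrently.

```go
// Reads can share the lock.
func (m *Manager) GetMode() Mode {
	m.mutex.RLock()
	defer m.mutex.RUnlock()
	return m.mode
}

// Mutations take the writer lock. Allocate is a hypothetical entry point;
// its body stands in for whatever walks and updates the quota trees.
func (m *Manager) Allocate(forestName, consumerID string) error {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	// ... reserve capacity on the trees of forestName for consumerID ...
	return nil
}
```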
I would like to understand more about the threading issue; ideally, the whole dispatching should be single-threaded.
```go
}
// Execute tests in parallel
for _, tc := range tests {
	tc := tc // capture range variable
```
This is necessary to allow all the tests to be executed.
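A self-contained sketch of why the copy matters; the test name and case fields are illustrative, not from this PR. Before Go 1.22, the loop variable is reused across iterations, so parallel subtests that close over it would all observe the final case.

```go
package quota

import "testing"

func TestFitsParallel(t *testing.T) {
	tests := []struct {
		name      string
		request   int
		available int
		want      bool
	}{
		{"fits", 1, 2, true},
		{"does not fit", 3, 2, false},
	}
	for _, tc := range tests {
		tc := tc // capture range variable for the parallel closure
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel() // body runs after the enclosing loop has advanced
			if got := tc.request <= tc.available; got != tc.want {
				t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
			}
		})
	}
}
```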
Please address the review.
```diff
@@ -1,5 +1,5 @@
 /*
-Copyright 2022 The Multi-Cluster App Dispatcher Authors.
+Copyright 2022, 2203 The Multi-Cluster App Dispatcher Authors.
```
Nit: fix date
We discussed offline, LGTM 🎉
All quota management tests pass, but there are still issues with MCAD resource jitter in the tests. Merging.
Small fixes around error handling inside the QM management initialisation path.