Skip to content

Error handling updates in quota management usage #367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
May 26, 2023
Merged

Conversation

z103cb
Copy link
Contributor

@z103cb z103cb commented May 15, 2023

Small fixes around error handling inside the QM management initialisation path.

Copy link
Member

@asm582 asm582 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for PR, can you please address the logger issue and resolve the merge issue ?

z103cb added 2 commits May 17, 2023 09:05
Removed un-used code
Fixed other linter warnings.
@z103cb z103cb changed the title Error handling updates in QM management initialization Error handling updates in quite management initialisation May 17, 2023
@asm582
Copy link
Member

asm582 commented May 17, 2023

I am ok to approve this PR, a single test has failed due to resource jitter issues

asm582
asm582 previously approved these changes May 17, 2023
@KPostOffice
Copy link
Contributor

fixes: #335

z103cb added 2 commits May 19, 2023 11:01
More error handling updates
Minor cleanup in the Fits function.
@z103cb
Copy link
Contributor Author

z103cb commented May 23, 2023

The failures of the build are related to #388, #392

@tardieu tardieu mentioned this pull request May 25, 2023
Logging improvements
Additional test cases in quotaforest manager
@z103cb z103cb changed the title Error handling updates in quite management initialisation Error handling updates in quota management usage May 26, 2023
@@ -115,9 +115,9 @@ endif

run-test:
$(info Running unit tests...)
hack/make-rules/test.sh $(WHAT) $(TESTS)
go test -v -coverprofile cover.out -race -parallel 8 ./pkg/...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asm582, @KPostOffice, @dmatch01, @metalcycling let me know if you are OK with the verbose output from testing and with the exclusion of the ./cmd package.

@@ -1235,8 +1241,11 @@ func (qjm *XController) ScheduleNext() {
// HeadOfLine logic
HOLStartTime := time.Now()
forwarded := false
fowardingLoopCount := 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asm582, @dmatch01, @KPostOffice and @metalcycling please take a very close look at the changes starting from this line.

if fits {
// aw is ready to go!
apiQueueJob, e := qjm.queueJobLister.AppWrappers(qj.Namespace).Get(qj.Name)
// apiQueueJob's ControllerFirstTimestamp is only microsecond level instead of nanosecond level
if e != nil {
klog.Errorf("[ScheduleNext] Unable to get AW %s from API cache &aw=%p Version=%s Status=%+v err=%#v", qj.Name, qj, qj.ResourceVersion, qj.Status, err)
if qjm.quotaManager != nil && quotaFits {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asm582, @dmatch01, @KPostOffice and @metalcycling I believe that this one of the causes for client already allocated set of errors.

cc: @tardieu

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quota is not released when AW is deleted by clients, which is a different issue.

} else if schedulingTimeExpired {
// stop trying to dispatch after HeadOfLineHoldingTime
// release quota if allocated
if qjm.quotaManager != nil && quotaFits {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asm582, @KPostOffice, @metalcycling in your testing have you seen this condition happen ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@z103cb We have not seen this condition happen, who sets HeadOfLineHoldingTime ? is this part of the helm command?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break
} else { // Try to dispatch again after one second
if qjm.quotaManager != nil && quotaFits {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asm582, @metalcycling have you seen this happen in your testing ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think retries could be avoided here explicitly, undispatched AW should be picked up in the next dispatch cycle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asm582 can you be more specific I really don't understand your comment.

@@ -39,6 +40,9 @@ type Manager struct {
consumerInfos map[string]*ConsumerInfo
// mode of the quota mnager; initially in maintenance mode
mode Mode

// mutex to
mutex sync.RWMutex
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asm582, @metalcycling and @dmatch01 I believe a single mutex would be sufficient for now and MCAD use of the library. The entire quota library forest is not thread safe, we'll have to figure out how to make it so, and perhaps simplify it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like to understand more about the threading issue, ideally, the whole dispatching should be single-threaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asm582 take a look at #392

}
// Execute tests in parallel
for _, tc := range tests {
tc := tc // capture range variable
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary to allow for all the tests to be executed.

Copy link
Member

@asm582 asm582 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the review

} else if schedulingTimeExpired {
// stop trying to dispatch after HeadOfLineHoldingTime
// release quota if allocated
if qjm.quotaManager != nil && quotaFits {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@z103cb We have not seen this condition happen, who sets HeadOfLineHoldingTime ? is this part of the helm command?

@@ -1,5 +1,5 @@
/*
Copyright 2022 The Multi-Cluster App Dispatcher Authors.
Copyright 2022, 2203 The Multi-Cluster App Dispatcher Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: fix date

@@ -39,6 +40,9 @@ type Manager struct {
consumerInfos map[string]*ConsumerInfo
// mode of the quota mnager; initially in maintenance mode
mode Mode

// mutex to
mutex sync.RWMutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like to understand more about the threading issue, ideally, the whole dispatching should be single-threaded.

if fits {
// aw is ready to go!
apiQueueJob, e := qjm.queueJobLister.AppWrappers(qj.Namespace).Get(qj.Name)
// apiQueueJob's ControllerFirstTimestamp is only microsecond level instead of nanosecond level
if e != nil {
klog.Errorf("[ScheduleNext] Unable to get AW %s from API cache &aw=%p Version=%s Status=%+v err=%#v", qj.Name, qj, qj.ResourceVersion, qj.Status, err)
if qjm.quotaManager != nil && quotaFits {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quota is not released when AW is deleted by clients, which is a different issue.

break
} else { // Try to dispatch again after one second
if qjm.quotaManager != nil && quotaFits {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think retries could be avoided here explicitly, undispatched AW should be picked up in the next dispatch cycle.

Copy link
Member

@asm582 asm582 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline, LGTM 🎉

@asm582
Copy link
Member

asm582 commented May 26, 2023

All quota management test pass, but there is still issues with MCAD resource jitter in tests. merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

3 participants