
Modules can now start additional startup tasks, while server is starting #2119


Closed
wants to merge 5 commits

Conversation

pstibrany
Contributor

@pstibrany pstibrany commented Feb 12, 2020

Modules can now have a start function for starting additional tasks on startup.

Start functions are called concurrently, and only after all modules are initialized. The server's HTTP service starts while modules are still starting, so modules must be able to handle that; both the ingester and queriers will return an error in that case.

This is currently used by the blocks-based ingester (to replay the WAL for locally stored TSDBs) and by the querier (to do the initial fetch of blocks).
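As a rough sketch of what this means (illustrative names only, not the actual Cortex types):

package example

// module sketches the idea described above: each module may carry an
// optional start function in addition to its init function.
type moduleName string

type module struct {
	deps []moduleName
	init func() error // runs first, in dependency order
	// start is optional; all start functions run concurrently,
	// and only after every module's init has finished.
	start func() error
}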

This is an alternative to #2104.

Note: this is not ideal, but it is a relatively simple change. I think we should overhaul the lifecycle of our modules and have separate NEW (instantiated, but nothing started yet), STARTING (replaying WALs, fetching blocks, whatever needs to be done for the service to be ready) and RUNNING (service is running) states, make those states observable by other modules (for dependencies and for a global view of the server), and also expose them via metrics / a simple HTML page. However, this PR doesn't go that far; I'll write a design document for that first.
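Purely as an illustration of the proposed lifecycle (an assumption, not part of this PR or of Cortex today):

package example

// moduleState illustrates the states named above.
type moduleState int

const (
	stateNew      moduleState = iota // instantiated, nothing started yet
	stateStarting                    // replaying WAL, fetching blocks, etc.
	stateRunning                     // service is running and ready
)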

Signed-off-by: Peter Štibraný [email protected]

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]


@thorfour thorfour left a comment


LGTM! This looks like a great QoL change, thank you!


@jtlisi jtlisi left a comment


LGTM

It might be worth noting the new startup behavior in the Blocks documentation. If I'm not mistaken, the user must be aware that the startup function needs to complete before a transfer can take place, and tune accordingly?

func (t *Cortex) start(target moduleName) error {
	deps := orderedDeps(target)

	wg := sync.WaitGroup{}
Contributor


Since you are only returning errors, you should consider using an errgroup (golang.org/x/sync/errgroup) instead of the WaitGroup and channel.

Contributor Author


Thanks, looks exactly like what I need.
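For reference, a minimal sketch of the suggested change using golang.org/x/sync/errgroup; startModule and the module list are placeholders, not the actual Cortex code:

package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// startModule stands in for calling a module's start function.
func startModule(name string) error {
	fmt.Println("starting", name)
	return nil
}

// startAll runs every start function concurrently and returns the first
// error, replacing the WaitGroup-plus-error-channel pattern.
func startAll(modules []string) error {
	g := errgroup.Group{}
	for _, m := range modules {
		m := m // capture loop variable for the goroutine
		g.Go(func() error {
			return startModule(m)
		})
	}
	return g.Wait() // nil if all start functions succeeded
}

func main() {
	if err := startAll([]string{"ingester", "querier"}); err != nil {
		fmt.Println("startup failed:", err)
	}
}

errgroup.Wait returns the first non-nil error from any of the goroutines, which covers the "only returning errors" case the reviewer mentions.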


@jtlisi jtlisi left a comment


Withdrawing approval until integration tests are sorted

@pstibrany
Contributor Author

Before approving this, I will check return values from gRPC methods to make sure we send the correct errors and status codes back.

@pstibrany
Contributor Author

pstibrany commented Feb 13, 2020

Withdrawing approval until integration tests are sorted

Integration tests found a real bug (start was only called on dependencies, but not on the target module itself), which is now fixed. I didn't catch it in my own testing, as I was only running with the All target, which doesn't have a start function.

…startup.

Start functions are called concurrently, and only after all modules
are initialized. Server's HTTP service starts while modules are still
starting, so modules must be able to handle that.

This is currently used by blocks-based ingester and querier.

Signed-off-by: Peter Štibraný <[email protected]>
…d yet.

codes.Unavailable seems to be the correct code, and client is expected
to retry later (with some backoff).

Signed-off-by: Peter Štibraný <[email protected]>
@pstibrany
Contributor Author

Before approving this, I will check return values from gRPC methods to make sure we send the correct errors and status codes back.

After some reading, I believe returning status.Error(codes.Unavailable, message) is the right thing to do from gRPC handlers when the ingester or querier is not ready yet. When a client receives Unavailable, it is expected to retry (with some backoff), which is exactly what we want to express here.

I've updated the code accordingly.
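For illustration, a minimal sketch of this pattern under assumed request/response types and readiness check (not the actual Cortex handlers):

package example

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Placeholder types standing in for the real gRPC request/response messages.
type PushRequest struct{}
type PushResponse struct{}

type ingester struct {
	ready func() bool // placeholder readiness check
}

// Push returns codes.Unavailable while the ingester is still starting, so
// clients know to retry with backoff instead of treating it as a hard failure.
func (i *ingester) Push(ctx context.Context, req *PushRequest) (*PushResponse, error) {
	if !i.ready() {
		return nil, status.Error(codes.Unavailable, "ingester is not ready yet")
	}
	// ... normal request handling would go here ...
	return &PushResponse{}, nil
}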

@pstibrany
Contributor Author

After some reading, I believe returning status.Error(codes.Unavailable, message) is the right thing to do from gRPC handlers when the ingester or querier is not ready yet. When a client receives Unavailable, it is expected to retry (with some backoff), which is exactly what we want to express here.

I've updated the code accordingly.

Of course, the ingester doesn't even register itself into the ring until it is ready.

With the blocks store the situation is more complicated: both the querier and the ruler that use it are started immediately and don't wait for the block store to be ready first. All requests to the block store will, however, fail until it is ready. This may cause trouble, and I will investigate further how to solve it. Ideally, they should observe the component's readiness before using it.


@codesome codesome left a comment


This is exactly what I was looking for atm. Perfect timing!

@@ -67,7 +67,9 @@ require (
	github.com/weaveworks/common v0.0.0-20200201141823-27e183090ab1
	go.etcd.io/bbolt v1.3.3
	go.etcd.io/etcd v0.0.0-20190709142735-eb7dd97135a5
	go.uber.org/atomic v1.5.0
Contributor


Instead of this additional dependency, can we use https://golang.org/pkg/sync/atomic/ for atomic loads and stores?

Contributor Author

@pstibrany pstibrany Feb 14, 2020


We already have this dependency, although under its old name (github.com/uber-go/atomic). I've removed the old version in PR #2127.
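For context, a readiness flag of the kind discussed here might look roughly like this with go.uber.org/atomic; the type and method names are illustrative, and the standard library's sync/atomic would work equally well with an int32:

package example

import (
	"go.uber.org/atomic"
)

// blockQuerier is a placeholder for the component that fetches block
// metadata on startup.
type blockQuerier struct {
	synced atomic.Bool // false until the initial block sync has completed
}

// markSynced is called once startup work has finished.
func (q *blockQuerier) markSynced() { q.synced.Store(true) }

// isReady lets callers (and gRPC handlers) check readiness before serving.
func (q *blockQuerier) isReady() bool { return q.synced.Load() }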

@gouthamve
Contributor

@pstibrany Is the querier and ruler using the ingester only a problem in single binary mode? Can we not register the ingester in the ring (or mark it pending) until it starts up?

@codesome codesome mentioned this pull request Feb 14, 2020
@pstibrany
Contributor Author

@pstibrany Is the querier and ruler using the ingester only a problem in single binary mode? Can we not register the ingester in the ring (or mark it pending) until it starts up?

The ruler isn't using the ingester. The querier is, but only indirectly, via gRPC calls.

The problem that this PR is solving for QuerierChunkStore is that the blocks-based QuerierChunkStore needs to fetch block metadata on startup, which can take a while. However, the Ruler uses QuerierChunkStore and has no way to see whether QuerierChunkStore has finished its fetching and is now ready to use or not.

I'd like to solve this in a general way, and will send a new PR showing how to do that. I'd suggest keeping this PR open for now, so that we can compare both approaches and see which one works better. (I don't really like the approach in this PR, to be honest.)

@pstibrany
Contributor Author

I have sent an alternative PR to this one, #2166. It revamps the services module and supports everything that we need to implement this PR correctly, in a much better way. If you have time, please take a look.

@gouthamve
Contributor

Closing this in favour of #2166

@gouthamve gouthamve closed this Mar 4, 2020