Skip to content

support for multiple namenodes with failover #77

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 26, 2017

Conversation

itszootime
Copy link
Contributor

Hello! Here's a basic implementation for multiple namenode support.

  • If a request results in a non-NamenodeError (i.e. EOF) or StandbyException, the client will close the current connection, try to establish a new connection with another host, and retry the request.
  • The failing host will be marked as down and a connection attempt won't be made for at least 5 seconds (not configurable at the moment, but could be).
  • If all hosts were unreachable or in standby, the client request will result in a "No available namenodes" error.

I realise there isn't much in terms of testing for this behaviour - would be happy to hear some suggestions for how this could be done.

@colinmarc
Copy link
Owner

This looks like it should work fine, although testing is a concern. The other issue is that we can't change the signature of New and NewNamenodeConnection. A few other issues require a NewOptions method, so that'd be a good place to add this functionality, but I haven't gotten around to implementing that.

@ccl0326
Copy link

ccl0326 commented Jul 6, 2017

any progress?

@itszootime
Copy link
Contributor Author

I've recently refactored this to support passing options, just need to do a bit more local testing before pushing it up.

@itszootime itszootime force-pushed the support_multiple_namenodes branch from 6ed6a15 to 0e45348 Compare August 14, 2017 11:46
@itszootime
Copy link
Contributor Author

@colinmarc only took 3 months, but I've updated the PR. Let me know what you think. Cheers!

}
return namenodes[0], nil
return namenodes, nil
}

// NewForUser returns a connected Client with the user specified, or an error if
// it can't connect.
Copy link
Owner

@colinmarc colinmarc Aug 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably deprecate these, but that can be a separate PR.

rpc/namenode.go Outdated
func WrapNamenodeConnection(conn net.Conn, user string) (*NamenodeConnection, error) {
// NewNamenodeConnectionWithOptions creates a new connection to a namenode with
// the given options and performs an initial handshake.
func NewNamenodeConnectionWithOptions(options NamenodeConnectionOptions) (*NamenodeConnection, error) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, because I'm dumb and made this exported way back when I started this library, we can't just remove the old function. Can you leave it and make it deprecated?

rpc/namenode.go Outdated
}

if c.conn == nil {
return fmt.Errorf("No available namenodes")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No shouldn't be capitalized. This can also just be errors.New. Better, though, would be to include the last error, so that this isn't completely opaque in degenerate cases: something like fmt.Errorf("no available namenodes: %s", lastErr)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, for the single-namenode case, that would report the error, which is what we want)

rpc/namenode.go Outdated
@@ -55,45 +67,96 @@ func (err *NamenodeError) Error() string {
return s
}

// NewNamenodeConnection creates a new connection to a Namenode, and preforms an
type host struct {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

namenodeHost?

rpc/namenode.go Outdated
}

return err
c.markFailure(err)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing, but I think this treats StandbyExceptions the same as any non-NamenodeError error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the operation will only retry if it's either a non-NamenodeError or a StandbyException

rpc/namenode.go Outdated
err = c.writeRequest(method, req)
if err != nil {
c.markFailure(err)
goto R
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere in the code I use for loops and continue/break rather than goto. This is probably a bit shorter, but I think harder to read.

@colinmarc
Copy link
Owner

Thanks for the changes! The netcode here looks pretty solid, but I'm a bit nervous about not really having a good test framework for it. Has anyone run this in production?

@itszootime
Copy link
Contributor Author

@colinmarc: I think I've implemented all of your suggestions now.

I understand your concerns about testing this. Locally I've been using a hadoop-ha Docker setup - maybe this would be an option for automated testing too?

We've been using this code in production for a distributed service that mirrors files from HDFS. We haven't had any issues - it handled an active namenode switchover as expected.

@colinmarc
Copy link
Owner

Sorry for the delay. I poked around with this locally and it does seem to do the thing, so I'm going to merge it. I wish we had better tests for the different connection failure modes, but I don't really see a way to do that.

Thanks so much for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants