meln5674
Contributor

@meln5674 meln5674 commented Nov 18, 2024

Add flags --tls-cert, --tls-key, --tls-ca/--tls-client-ca to the commands podman remote, podman system service, and podman system connection add to support serving the API socket using TLS and mTLS, as well as connecting to such a socket.

This relies on containers/common#2249 and will fail CI until merged.

Fixes #24583

Does this PR introduce a user-facing change?

* The `podman system service` command now supports serving over tcp with TLS and mTLS
* The `podman system connection add` command now supports creating connections to TLS and mTLS tcp sockets
* The `podman remote` commands now support connecting to TLS and mTLS tcp sockets
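
For readers less familiar with the two modes named in these notes, the difference between TLS and mTLS on the serving side can be sketched with the Go standard library. This is only an illustration under assumptions: `serverTLSConfig`, `demoConfigs`, and `requiresClientCert` are hypothetical names, and the PR's actual wiring is not shown in this thread.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
)

// serverTLSConfig is a hypothetical helper, not code from this PR.
// With a nil clientCAs pool it describes plain TLS (only the server is
// authenticated). With a pool it describes mTLS: every client must present
// a certificate that chains to one of the given CAs.
func serverTLSConfig(serverCert tls.Certificate, clientCAs *x509.CertPool) *tls.Config {
	cfg := &tls.Config{
		Certificates: []tls.Certificate{serverCert},
		MinVersion:   tls.VersionTLS12,
	}
	if clientCAs != nil {
		cfg.ClientCAs = clientCAs
		cfg.ClientAuth = tls.RequireAndVerifyClientCert
	}
	return cfg
}

// requiresClientCert reports whether cfg rejects clients without a verified certificate.
func requiresClientCert(cfg *tls.Config) bool {
	return cfg.ClientAuth == tls.RequireAndVerifyClientCert
}

// demoConfigs builds one config per mode, using empty placeholder values.
func demoConfigs() (plain, mutual *tls.Config) {
	return serverTLSConfig(tls.Certificate{}, nil),
		serverTLSConfig(tls.Certificate{}, x509.NewCertPool())
}

func main() {
	plain, mutual := demoConfigs()
	fmt.Println("TLS requires client cert:", requiresClientCert(plain))
	fmt.Println("mTLS requires client cert:", requiresClientCert(mutual))
}
```

In this sketch the only difference between the two modes is whether a client CA pool is supplied, which is one plausible reading of how a `--tls-client-ca` flag could toggle mTLS.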

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note labels Nov 18, 2024
@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Nov 18, 2024

Ephemeral COPR build failed. @containers/packit-build please check.

@Luap99
Member

Luap99 commented Nov 19, 2024

@jwhonce @baude @mheon FYI. Had you ever looked at TLS support for the podman service and remote client?
With the amount of code here this looks easy enough to accept to me but I was wondering if there is/was any reason why we do not support it.

@mheon
Member

mheon commented Nov 19, 2024 via email

@mheon
Member

mheon commented Nov 19, 2024

Just skimmed, but LGTM on approach. A lot less code than I was expecting for certificate auth.

Now, if we have to start providing docs for how to do certificate auth right, I think the amount of changes grows substantially, but I don't think it's really worth it at this point.

@TomSweeneyRedHat
Member

Sending the directory location of the private key out over the wire doesn't give me the warm fuzzies. Is it something that other software does?

@jwhonce
Member

jwhonce commented Nov 21, 2024

@Luap99 I seem to remember the idea behind using the ssh executable as well as the golang libraries was to push this kind of handshake into that code vs. podman. But @cdoern did that work, so I don't know where it stands.

@meln5674
Contributor Author

Sending the directory location of the private key out over the wire doesn't give me the warm fuzzies. Is it something that other software does?

That certainly wasn't my intent, but I'm not sure what you're referring to. Is that a consequence of adding it to the conf file that I'm unaware of?

@Luap99
Member

Luap99 commented Nov 25, 2024

Sending the directory location of the private key out over the wire doesn't give me the warm fuzzies. Is it something that other software does?

That certainly wasn't my intent, but I'm not sure what you're referring to. Is that a consequence of adding it to the conf file that I'm unaware of?

It is not being sent anywhere; both the client and the server read the files locally and then use the certificates to perform a normal TLS handshake via the Go std lib, AFAICT.
@TomSweeneyRedHat What are you referring to by sending the private key over the wire?

@Luap99 I seem to remember the idea behind using the ssh executable as well as the golang libraries was to push this kind of handshake into that code vs. podman. But @cdoern did that work, so I don't know where it stands.

@jwhonce I am not sure what you are referring to. The question for me was whether there was a specific reason why TLS support was not done before. If not, then I think we should accept this PR.

Member

@Luap99 Luap99 left a comment

Also, this will need some e2e or system tests where we spawn a server with TLS and then connect with the remote client.

"os"
)

func ReadCertBundle(path string) (*x509.CertPool, error) {
Member

Can you move this into a new, separate package, maybe call it cert, and then name the function ReadBundle()?

We should really move away from dumping everything into util packages as this causes a lot of unwanted side effects via other imports that really should not have to be imported on the remote client, i.e. #23818. Fixing that will be quite some work but in the meantime we should not add new things there that are needed by the remote client.
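
A minimal sketch of what such a `ReadBundle` function might look like follows. The PR's actual function body is not shown in this thread, so the implementation is an assumption built on `x509.CertPool.AppendCertsFromPEM`. It lives in `package main` here only so it can run on its own, and `writeThrowawayCA` and `demo` are hypothetical helpers that exist purely to exercise it.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"os"
	"path/filepath"
	"time"
)

// ReadBundle loads every PEM certificate in the file at path into a new pool.
// Sketch only: the real function in the PR may differ.
func ReadBundle(path string) (*x509.CertPool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading cert bundle %s: %w", path, err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(data) {
		return nil, errors.New("no PEM certificates found in " + path)
	}
	return pool, nil
}

// writeThrowawayCA writes a short-lived self-signed certificate for the demo.
func writeThrowawayCA(path string) error {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example-ca"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return err
	}
	block := &pem.Block{Type: "CERTIFICATE", Bytes: der}
	return os.WriteFile(path, pem.EncodeToMemory(block), 0o600)
}

// demo returns nil when a valid bundle loads and an invalid one is rejected.
func demo() error {
	dir, err := os.MkdirTemp("", "bundle")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	good := filepath.Join(dir, "ca.pem")
	if err := writeThrowawayCA(good); err != nil {
		return err
	}
	if _, err := ReadBundle(good); err != nil {
		return err
	}
	bad := filepath.Join(dir, "bad.pem")
	if err := os.WriteFile(bad, []byte("not a certificate"), 0o600); err != nil {
		return err
	}
	if _, err := ReadBundle(bad); err == nil {
		return errors.New("garbage bundle was accepted")
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("ok")
}
```

Keeping the function free of any podman-specific imports is the point of the request above: a small package like this can be pulled into the remote client without dragging in the rest of a util package.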

Contributor Author

Done

Comment on lines 323 to 362
if len(tlsCAFile) != 0 || len(tlsCertFile) != 0 || len(tlsKeyFile) != 0 {
	logrus.Debugf("using TLS cert=%s key=%s ca=%s", tlsCertFile, tlsKeyFile, tlsCAFile)
	transport.TLSClientConfig = &tls.Config{}
	connection.tls = true
}
if len(tlsCAFile) != 0 {
	pool, err := util.ReadCertBundle(tlsCAFile)
	if err != nil {
		return connection, fmt.Errorf("unable to read CA bundle: %w", err)
	}
	transport.TLSClientConfig.RootCAs = pool
}
if len(tlsCertFile) != 0 && len(tlsKeyFile) != 0 {
Member

The conditions here do not seem to line up perfectly.

If only tlsCertFile is set, then the first condition matches and says TLS is used, but we never add the key in the last one. That case should return a hard error, I would assume.
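
One way to express the hard error described here is a single up-front check, sketched below. `checkTLSFiles` is a hypothetical helper name and not the code that was eventually committed; it only shows the rule that a certificate and key must be supplied together.

```go
package main

import (
	"errors"
	"fmt"
)

// checkTLSFiles is a hypothetical helper, not code from this PR.
// It reports whether TLS should be enabled and rejects a client
// certificate without its key (or a key without its certificate).
func checkTLSFiles(certFile, keyFile, caFile string) (bool, error) {
	if (certFile == "") != (keyFile == "") {
		return false, errors.New("TLS client certificate and key must be given together")
	}
	return certFile != "" || caFile != "", nil
}

func main() {
	cases := [][3]string{
		{"", "", ""},
		{"", "", "ca.pem"},
		{"cert.pem", "key.pem", ""},
		{"cert.pem", "", ""},
	}
	for _, c := range cases {
		useTLS, err := checkTLSFiles(c[0], c[1], c[2])
		fmt.Printf("cert=%q key=%q ca=%q -> tls=%v err=%v\n", c[0], c[1], c[2], useTLS, err)
	}
}
```

Running the check before any `tls.Config` is created means a half-specified key pair never silently degrades to server-only verification.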

Contributor Author

My intent was to have that check performed in the cmd/podman/... packages, but on reflection it makes sense to also perform the same check here, as that wouldn't account for the env vars. Done.

Comment on lines 160 to 182
	if cmd.Flags().Changed("tls-cert") {
		return errors.New("--tls-cert option not supported for ssh scheme")
	}
	if cmd.Flags().Changed("tls-key") {
		return errors.New("--tls-key option not supported for ssh scheme")
	}
	if cmd.Flags().Changed("tls-ca") {
		return errors.New("--tls-ca option not supported for ssh scheme")
	}
	return ssh.Create(entities, sshMode)
case "unix":
	if cmd.Flags().Changed("identity") {
		return errors.New("--identity option not supported for unix scheme")
	}
	if cmd.Flags().Changed("tls-cert") {
		return errors.New("--tls-cert option not supported for unix scheme")
	}
	if cmd.Flags().Changed("tls-key") {
		return errors.New("--tls-key option not supported for unix scheme")
	}
	if cmd.Flags().Changed("tls-ca") {
		return errors.New("--tls-ca option not supported for unix scheme")
	}
Member

This duplicates the same conditions. Move them outside the switch case and match if scheme != tcp so you do not duplicate that
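
A rough sketch of the suggested shape, with the check hoisted out of the switch, could look like the following. `rejectTLSFlags` and `setOf` are hypothetical names, and the `changed` callback merely stands in for `cmd.Flags().Changed` so the example runs without cobra.

```go
package main

import "fmt"

// rejectTLSFlags is a hypothetical helper, not code from this PR.
// changed reports whether a flag was set explicitly; in the real command
// it would be cmd.Flags().Changed.
func rejectTLSFlags(scheme string, changed func(name string) bool) error {
	if scheme == "tcp" {
		return nil
	}
	for _, name := range []string{"tls-cert", "tls-key", "tls-ca"} {
		if changed(name) {
			return fmt.Errorf("--%s option not supported for %s scheme", name, scheme)
		}
	}
	return nil
}

// setOf builds a changed callback from a fixed list of flag names.
func setOf(names ...string) func(string) bool {
	set := map[string]bool{}
	for _, n := range names {
		set[n] = true
	}
	return func(name string) bool { return set[name] }
}

func main() {
	fmt.Println(rejectTLSFlags("ssh", setOf("tls-key")))
	fmt.Println(rejectTLSFlags("tcp", setOf("tls-key")))
}
```

With this in place, each `case` in the switch only needs the checks that are genuinely specific to its scheme, such as `--identity` for unix.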

Contributor Author

Done

@meln5674
Contributor Author

Also, this will need some e2e or system tests where we spawn a server with TLS and then connect with the remote client.

Of course. I was hoping to have my environment capable of passing tests on main before I went on vacation last week, but that didn't end up happening. After throwing up my hands and just making a fresh CentOS VM, I have that working now.

I'd like to use any existing tests for SSH/TCP remotes as a basis for my tests, as well as ensure my tests have parity with them, but for the life of me, I cannot seem to find any such tests. Would you mind pointing me to where they are?

@TomSweeneyRedHat
Member

@Luap99 sorry, just back to this. This line: https://github.com/containers/podman/pull/24601/files#diff-cf5cc76e70a369c5221f08f29ec0452710af1f4ecd423c162d5b669688416caeR525 seemed to indicate that you were looking for the directory where the private key resided. Unless I misinterpreted, that doesn't feel right.

@meln5674
Contributor Author

meln5674 commented Dec 9, 2024

@Luap99 sorry, just back to this. This line: https://github.com/containers/podman/pull/24601/files#diff-cf5cc76e70a369c5221f08f29ec0452710af1f4ecd423c162d5b669688416caeR525 seemed to indicate that you were looking for the directory where the private key resided. Unless I misinterpreted, that doesn't feel right.

Yes, the purpose of this line is to create a command line argument so that the user can provide the path to their TLS client private key, which is necessary to perform authentication. Unless I too have misinterpreted something else in the codebase, neither that path, nor especially not the data in that file, ever goes over the wire. This is the analog of the --identity flag for SSH authentication.
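
To make the point about paths concrete, here is a small sketch of how a client typically turns those two paths into a `tls.Config` with the Go standard library. `loadClientCert` is a hypothetical helper and the paths are placeholders; the PR's own code is not reproduced here.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// loadClientCert is a hypothetical helper, not code from this PR.
// The two paths are used only to open local files: tls.LoadX509KeyPair
// parses the PEM data in-process, and during the handshake the key is used
// to sign, so neither the path nor the key bytes are transmitted.
func loadClientCert(certFile, keyFile string) (*tls.Config, error) {
	pair, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("unable to load client key pair: %w", err)
	}
	return &tls.Config{Certificates: []tls.Certificate{pair}}, nil
}

func main() {
	// The paths below are placeholders that do not exist, so this reports an error.
	_, err := loadClientCert("/nonexistent/client.crt", "/nonexistent/client.key")
	fmt.Println("error:", err)
}
```

Only the certificate (the public half) is presented to the server; the flag value itself never leaves the process that parsed it.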

@meln5674
Contributor Author

Also, this will need some e2e or system tests where we spawn a server with TLS and then connect with the remote client.

Apologies for the delay on this, a number of other items were pushed onto my priority stack, but I've been able to resume work on it.

As mentioned, there weren't any substantial e2e tests for the TCP API, and the libpod remote test suite was entirely using the unix socket, so I have refactored it to allow running it for unix, plain TCP, TLS, and mTLS. This might seem excessive, but it was only by running this complete suite that I found additional issues I had to fix. Suggestions are welcome for how to achieve similar coverage with fewer tests.

Speaking of said issues, it would seem that the container attach client has a subtle race condition in it, as well as a memory leak.

The attach client overwrites the http transport and captures the net.Conn in a local variable. If the same Connection is used concurrently, the net.Conn held in the first attach call's local variable can be overwritten by a second call, resulting in the two calls proxying each other's streams, or just outright corrupting them.

As well, because the original transport is never restored, if the same client is used to attach to multiple containers, the closure is never freed, meaning the dial function will call an ever-increasing stack of closures. Fixing this issue felt outside the scope of this PR, but I have added a note warning of it to whoever feels like addressing it in the future.

I would suggest adding some sort of DoRequestWithConn method which handles using the new transport for that request alone and manually performs the dialing, writing the request, and reading the response, and validating the upgrade.
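
A rough sketch of what such a helper could look like is below. Everything in it is an assumption: `doRequestWithConn`, `fakeServer`, and `attachOnce` are hypothetical names, dialing is left to the caller, and the server is assumed to answer with `101 Switching Protocols`. The point is only that the connection belongs to one call and no shared transport is modified.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"net/http"
)

// doRequestWithConn is a hypothetical take on the helper suggested above.
// It writes req on a connection owned by this call alone and checks that the
// server agreed to upgrade. Callers must keep reading through the returned
// reader, because it may already hold bytes that followed the headers.
func doRequestWithConn(conn net.Conn, req *http.Request) (*bufio.Reader, error) {
	if err := req.Write(conn); err != nil {
		return nil, fmt.Errorf("writing request: %w", err)
	}
	br := bufio.NewReader(conn)
	resp, err := http.ReadResponse(br, req)
	if err != nil {
		return nil, fmt.Errorf("reading response: %w", err)
	}
	if resp.StatusCode != http.StatusSwitchingProtocols {
		resp.Body.Close()
		return nil, fmt.Errorf("expected upgrade, got %s", resp.Status)
	}
	return br, nil
}

// fakeServer answers one request on conn with the given status and payload.
func fakeServer(conn net.Conn, status, payload string) {
	defer conn.Close()
	if _, err := http.ReadRequest(bufio.NewReader(conn)); err != nil {
		return
	}
	io.WriteString(conn, "HTTP/1.1 "+status+"\r\nConnection: Upgrade\r\nUpgrade: tcp\r\n\r\n"+payload)
}

// attachOnce runs one request against an in-memory fake server and returns
// the bytes that arrive on the upgraded stream.
func attachOnce(status, payload string) (string, error) {
	client, server := net.Pipe()
	defer client.Close()
	go fakeServer(server, status, payload)

	req, err := http.NewRequest(http.MethodPost, "http://d/containers/example/attach", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Connection", "Upgrade")
	req.Header.Set("Upgrade", "tcp")

	br, err := doRequestWithConn(client, req)
	if err != nil {
		return "", err
	}
	buf := make([]byte, len(payload))
	if _, err := io.ReadFull(br, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	out, err := attachOnce("101 Switching Protocols", "hello")
	fmt.Println(out, err)
}
```

Because each call dials and owns its own `net.Conn`, two concurrent attaches cannot swap streams, and nothing accumulates on the shared client between calls.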

Finally, there were a number of other e2e tests that were failing in my fresh centos 9 environment, even on main, so I have simply ignored them for the time being.

Member

@Luap99 Luap99 left a comment

just a few comments, I need more time to look at the test changes. But we definitely cannot just explode the matrix and run 4 times the amount.

Makefile Outdated
Comment on lines 685 to 673
$(MAKE) ginkgo-run TAGS="$(REMOTETAGS) remote_testing remote_unix_testing"
$(MAKE) ginkgo-run TAGS="$(REMOTETAGS) remote_testing remote_tcp_testing"
$(MAKE) ginkgo-run TAGS="$(REMOTETAGS) remote_testing remote_tls_testing"
$(MAKE) ginkgo-run TAGS="$(REMOTETAGS) remote_testing remote_mtls_testing"
Member

This is not acceptable to me. It is certainly great to force coverage, but we run this many times on each PR, and a 4x time increase is not acceptable.

What we can consider is some split testing. We run the tests on Fedora rawhide, 41, 40 and Debian sid, so technically we could wire this up in CI to run each case on a different distro, adding no new overhead while still getting full coverage. The transport layer should certainly not care about the distro (except for underlying kernel bugs, of course), so I think that may be the best option.

Contributor Author

I agree, this was a "nuclear option" in order to get tests to run with the least amount of change to the tests themselves, and to make sure that I wasn't missing anything.

One thought I had in the interim was that ginkgo allows tagging tests and sets of tests, and selecting a subset of them on the command line. Using this, it would be possible to run the full remote tests via unix sockets, and then a subset of all tests against tcp, tls, and mtls. It would then also be possible to add a separate target that runs all tests against all remotes; it wouldn't be run in CI, only on demand in development environments.

The main thing I would need for this is guidance from the podman core team on which tests they believed were critical to be tested over every possible remote, and which are "good enough" to only be tested over unix. For example, the attach endpoints would definitely need to be tested on all different remotes, as that's how I discovered the additional fixes I had to make.

Contributor Author

I haven't heard back, so I've gone ahead with this strategy with a small selection of, admittedly arbitrary, tests to pare down to when testing TCP, TLS, and mTLS, based mostly on gut feeling of "things that might go wrong when using a different transport".

Member

@Luap99 WDYT of the current labeling tests solution vs running past-fedora tests using tls? The labels are a clever spot checking solution, but I think editing the cirrus matrix for split testing feels more maintainable?

} else {
rpt, err = rpt.Parse(report.OriginPodman,
-	"{{range .}}{{.Name}}\t{{.URI}}\t{{.Identity}}\t{{.Default}}\t{{.ReadWrite}}\n{{end -}}")
+	"{{range .}}{{.Name}}\t{{.URI}}\t{{.Identity}}\t{{.TLSCAFile}}\t{{.TLSCertFile}}\t{{.TLSKeyFile}}\t{{.Default}}\t{{.ReadWrite}}\n{{end -}}")
Member

IMO this is a breaking change. While nobody should rely on the order of the output, and anyone using it in scripts should use --format, we can never know.
I guess I could be convinced to add them as last keys after readwrite.

Contributor Author

I would agree, and while I'm not sure if podman or the umbrella containers project has strict guidelines on the topic, I've always considered "human readable" data to be excluded from breaking changes. I would no sooner worry about breaking scripts that scrape this output than scripts that break if a new log message was added.

Contributor Author

After thinking about this for a bit, if breaking changes are that much of a concern, I have hidden the new outputs behind a flag. My thought is that this flag can be removed and this output made the default at the next major version bump when breaking changes are permissible.

Member

Looks like this is breaking some system tests ATM.

Member

I think this still counts as a breaking change ATM. I think the best thing to do is just to drop them, or to move them to the end, after readwrite. I prefer to drop them and just document them as format options, so that if a user wants to see them, they can use --format to display them.
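
The preferred option, leaving the default row untouched and exposing the TLS fields only through an explicit format, can be sketched with `text/template`. The `conn` type, the `tlsRow` template, and the sample values are hypothetical; only the field names and the default row are taken from the diff quoted in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// conn mirrors the template fields named in the diff; the type is hypothetical.
type conn struct {
	Name, URI, Identity                string
	TLSCAFile, TLSCertFile, TLSKeyFile string
	Default, ReadWrite                 bool
}

// defaultRow is the pre-existing column order, left unchanged.
const defaultRow = "{{range .}}{{.Name}}\t{{.URI}}\t{{.Identity}}\t{{.Default}}\t{{.ReadWrite}}\n{{end -}}"

// tlsRow is an illustrative opt-in format a user could pass explicitly.
const tlsRow = "{{range .}}{{.Name}}\t{{.TLSCAFile}}\t{{.TLSCertFile}}\t{{.TLSKeyFile}}\n{{end -}}"

// render executes format against conns and returns the resulting text.
func render(format string, conns []conn) (string, error) {
	tmpl, err := template.New("row").Parse(format)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, conns); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// sample returns one made-up connection for the demo.
func sample() []conn {
	return []conn{{
		Name:        "prod",
		URI:         "tcp://example:8443",
		TLSCAFile:   "/etc/ca.pem",
		TLSCertFile: "/etc/cert.pem",
		TLSKeyFile:  "/etc/key.pem",
		Default:     true,
		ReadWrite:   true,
	}}
}

func main() {
	for _, format := range []string{defaultRow, tlsRow} {
		out, err := render(format, sample())
		if err != nil {
			fmt.Println("render failed:", err)
			return
		}
		fmt.Print(out)
	}
}
```

Scripts that parse the default output by column position keep working, while anyone who wants the TLS paths asks for them by name.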


Context("without running API service", func() {
It("add ssh://", func() {
cmd := []string{"system", "connection", "add",
Member

I guess these changes are done by your formatter? Can you revert them, they bloat the diff unnecessarily and make reviewing harder

Contributor Author

I believe all of the errant auto-formats have been reverted. Please let me know if I have missed any.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2025
@meln5674 meln5674 force-pushed the feature/remote-tls branch 4 times, most recently from 3e9683f to 817a696 Compare February 18, 2025 01:26
@ashley-cui
Member

This feature has become a priority for us, so I'm going to try to push this over the finish line, working on top of this draft. @meln5674 If you have any concerns, please feel free to reach out and let me know 😊

@meln5674
Contributor Author

The feature is basically done, and I'm happy to finish it myself, but I've been waiting for a second review since February. It's only marked as draft because it depends on the other PR in common being merged first.

Member

@ashley-cui ashley-cui left a comment

@meln5674 Apologies, thanks for all your work! Happy to give reviews to push this through 😊

Just one nit.

Other than that, I think we need to wire up and test the common changes through Podman's CI before either PR merges. Could you do a go mod replace pointing to your c/common branch, and vendor that in a separate commit here, just so we can throw it at CI? Once that passes, we can drop the commit here before merging.

return NewConnectionWithIdentityOrTLS(ctx, uri, identity, "", "", "", machine)
}

func NewConnectionWithIdentityOrTLS(ctx context.Context, uri string, identity string, tlsCertFile, tlsKeyFile, tlsCAFile string, machine bool) (context.Context, error) {
Member

Nit, but could we put the TLS stuff into a TLS struct? The number of args going into this function is getting long enough to become a maintenance hassle.
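
As an illustration of the refactor being asked for, the three file paths could travel as one value. The names `TLSFiles`, `Enabled`, and `describeConnection` are hypothetical; the struct that was actually added in the PR is not shown in this thread.

```go
package main

import "fmt"

// TLSFiles is a hypothetical grouping of the three PEM file paths.
type TLSFiles struct {
	CertFile string
	KeyFile  string
	CAFile   string
}

// Enabled reports whether any TLS file was configured.
func (t TLSFiles) Enabled() bool {
	return t.CertFile != "" || t.KeyFile != "" || t.CAFile != ""
}

// describeConnection stands in for a constructor such as
// NewConnectionWithIdentityOrTLS; it takes four parameters instead of six
// and only reports which mode would be used.
func describeConnection(uri, identity string, tlsFiles TLSFiles, machine bool) string {
	mode := "plain"
	if tlsFiles.Enabled() {
		mode = "tls"
	}
	return fmt.Sprintf("%s (%s)", uri, mode)
}

func main() {
	fmt.Println(describeConnection("tcp://example:8443", "", TLSFiles{}, false))
	fmt.Println(describeConnection("tcp://example:8443", "", TLSFiles{CAFile: "ca.pem"}, false))
}
```

A struct also gives a natural home for validation and keeps call sites readable, since three adjacent string arguments are easy to pass in the wrong order.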

Contributor Author

Done

@meln5674
Contributor Author

Will do. I'll need to spend a bit of time resurrecting the dev VM I had for this today to make sure tests still all pass locally after rebasing.

@meln5674
Contributor Author

@ashley-cui Understood. I made the nit'ed changes and rebased this branch on them. Once the container-libs PR is merged, I will point common back to upstream and squash all changes into a single commit for final review and CI run.

@meln5674 meln5674 force-pushed the feature/remote-tls branch 4 times, most recently from 7eb37ec to 0c0578e Compare September 25, 2025 03:45
Member

@ashley-cui ashley-cui left a comment

c/container-libs PR merged, please re-vendor and squash.

@containers/podman-maintainers PTAL

@meln5674
Contributor Author

meln5674 commented Sep 25, 2025

It doesn't look like container-libs has been assigned a new semantic version. I'll go ahead and re-vendor on the commit hash for now, but should it be like that for the final merge? Should there be a separate PR that just bumps the dependency?

It also appears that the CI is complaining about differences in tool versions, which I suspect are between HEAD on container-libs and podman.

EDIT: Didn't realize there was a special Makefile target for vendoring, going to try that. For whatever reason that wasn't needed for vendoring the fork.

@meln5674 meln5674 marked this pull request as ready for review September 25, 2025 23:44
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2025
@meln5674
Contributor Author

Not sure if this is a me problem, but github has been stuck on "Checking for the ability to merge automatically..." for the last hour and change, and hasn't detected the new commit I pushed. Unsure of what I can do about this.

@ashley-cui
Member

@meln5674 Strange, feel free to re-push or to close and re-open the PR, see if that kicks the CI.

@mheon
Member

mheon commented Sep 26, 2025

Force-push is usually the most reliable way

* Added flags to point to TLS PEM files to use for exposing and connecting
  to an encrypted remote API socket with server and client authentication.
* Added TLS fields for system connection ls templates.
* Added special "tls" format for system connection ls to list TLS fields
  in human-readable table format.
* Updated remote integration and system tests to allow specifying a
  "transport" to run the full suite against a unix, tcp, tls, or mtls
  system service.
* Added system tests to verify basic operation of unix, tcp, tls, and mtls
  services, clients, and connections.

Signed-off-by: Andrew Melnick <[email protected]>
@meln5674
Contributor Author

Force push did the trick. I think the three failures are flakes again, as they don't look related to remotes, could those be checked and restarted?

@mheon
Member

mheon commented Sep 26, 2025

Concur all look like flakes, restarted

@meln5674
Contributor Author

One more WSL flake, it looks like.

@ashley-cui
Member

re-started the flaked test.

@ashley-cui
Member

LGTM @containers/podman-maintainers PTAL and merge

@Luap99
Member

Luap99 commented Sep 30, 2025

It doesn't look like container-libs has been assigned a new semantic version. I'll go ahead and re-vendor on the commit hash for now, but should it be like that for the final merge? Should there be a separate PR that just bumps the dependency?

To add some context: we update by commit during the dev cycle. Only before we cut new podman/buildah releases do we cut tags on the other modules and then use those. Otherwise we would end up with a ton of new tags during dev, which doesn't seem that helpful.

Member

@Luap99 Luap99 left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
Contributor

openshift-ci bot commented Sep 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99, meln5674

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 3747e3d into containers:main Sep 30, 2025
89 checks passed
@ashley-cui
Member

Thanks for your patience and work on this one, @meln5674 !

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bloat_approved Approve a PR in which binary file size grows by over 50k kind/api-change Change to remote API; merits scrutiny lgtm Indicates that a PR is ready to be merged. release-note
Development

Successfully merging this pull request may close these issues.

Support (m)TLS API socket
8 participants