Skip to content

use TLS to communicate with Consul Connect services #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Scandiravian
Copy link

@Scandiravian Scandiravian commented Jan 19, 2021

Hi Gufran :)

What does this PR do?

As mentioned in this comment on your PR to traefik, I've fixed the problems you were experiencing when trying to establish a TLS connection to a Consul Connect service

I decided to remove the "ConnectNative" configuration, since I couldn't find a way to use the Consul Connect package, without registering traefik as a Connect Native service and it simplified the configuration slightly to remove the option.

Slightly off-topic stuff

I've only made small changes to your original design, but I'd like to do some additional work on it such as:

  • Merging connect proxies and their upstream service into one service in traefik (right now, the connect service only works when using the name of the proxy service, i.e. web-sidecar-proxy.example.com connects to the proxy, while web.example.com connects to the upstream service)
  • Ability to configure whether to use the non-connect service as a fallback in case traefik fails to use Consul Connect
  • Separate service and certificate refresh intervals (maybe do certificate refreshes based on the time they expire)

I want to make sure, which of these ideas (if any), that you're onboard with, as it's your PR and you probably have some plans of your own.

I can make new PRs for these if you think any of the suggestions will be a good idea. Let me know what you think :)

type ConnectConfig struct {
ConnectAware bool `description:"Enable Consul Connect support." json:"connectAware,omitEmpty" toml:"connectAware,omitempty" yaml:"connectAware,omitEmpty"`
// NOTE: Unless the certificates are issued by a trusted authority, this needs to be set as true
InsecureSkipVerify bool `description:"Skip verifying certificates for communicating with Consul Connect services." json:"insecureSkipVerify,omitempty" toml:"insecureSkipVerify,omitempty" yaml:"insecureSkipVerify,omitempty"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no way to teach traefik to trust the Connect CA?

@Gufran
Copy link
Owner

Gufran commented Jan 19, 2021

Hi @RavensburgOP
Thank you very much for picking this up, and I'm sorry for the shoddy work that you had to clean up.
The configuration related changed look alright to me. I have some remarks on your other points though:

  • Merging connect proxies and their upstream service into one service in traefik (right now, the connect service only works when using the name of the proxy service, i.e. web-sidecar-proxy.example.com connects to the proxy, while web.example.com connects to the upstream service)

Not sure if I understand this. Do you mean you want to eliminate the distinction between the service and its proxy? If so then it won't work because the services are not exposed over the network when they run with a connect sidecar. Every service that participates in connect mesh gets its own isolated network, and the only way to get inside the network is to go via the sidecar. Only the proxy is able to perform TLS handshake.
See figure:

+---------- isolated network --------------+
|                                          |
|  +------------ Service ---------------+  |
|  |              :port                 |  |
|  +----------------^-------------------+  |
|                   |                      |
|  +----------------v-------------------+  |
|  |           sidecar:port             |  |
|  +----------------^-------------------+  |
|                   | exposed port         |
|                   | binding              |
+-------------------|----------------------+
                    v
                  Client
  • Ability to configure whether to use the non-connect service as a fallback in case traefik fails to use Consul Connect

I think it is better left for the cluster operator to decide. Consul can implement this fallback logic using the L7 traffic management features. Implementing this in traefik will not only make it difficult for people to take full advantage of the L7 features in consul, it will also make the code more involved with all the things that consul is already doing for us.

  • Separate service and certificate refresh intervals (maybe do certificate refreshes based on the time they expire)

it already works this way. The watcher loop relies on consul blocking queries and the state is refreshed if and only if there is an update available in catalog. See https://github.com/Gufran/traefik/pull/1/files#diff-c3e25341e36d91fba700c36777fee7b575bb7004a9521a76ec17d2fb3f748364R461-R473

I want to make sure, which of these ideas (if any), that you're onboard with, as it's your PR and you probably have some plans of your own.

I don't have any reservation or preference, I only wanted to make it work with Connect, so everything else you can do on top of that is an added bonus. Although you might want to think a little bit from the perspective of a person who is responsible for managing consul and consul connect, sometimes all we want from a proxy is just the proxy and nothing else. The more things you add to it the more likely it is that you'll begin to contend with Consul's L7 features.

At a cursory glance the code looks fine to me, I'll be able to test it in a few hours.

@apollo13
Copy link
Collaborator

Ok, I figured out my current issues. the watchConnectTLS did not properly set insecureSkipVerify. It would be great if you two could apply the following patch on the existing changes:

diff --git a/pkg/provider/consulcatalog/config.go b/pkg/provider/consulcatalog/config.go
index 6df6803c..48ee1785 100644
--- a/pkg/provider/consulcatalog/config.go
+++ b/pkg/provider/consulcatalog/config.go
@@ -4,6 +4,8 @@ import (
        "context"
        "errors"
        "fmt"
+       "net"
+
        "github.com/hashicorp/consul/api"
        "github.com/traefik/traefik/v2/pkg/config/dynamic"
        "github.com/traefik/traefik/v2/pkg/config/label"
@@ -11,7 +13,6 @@ import (
        "github.com/traefik/traefik/v2/pkg/provider"
        "github.com/traefik/traefik/v2/pkg/provider/constraints"
        "github.com/traefik/traefik/v2/pkg/tls"
-       "net"
 )
 
 func (p *Provider) buildConfiguration(ctx context.Context, items []itemData, certInfo *connectCert) *dynamic.Configuration {
@@ -67,7 +68,7 @@ func (p *Provider) buildConfiguration(ctx context.Context, items []itemData, cer
                        confFromLabel.HTTP.ServersTransports = make(map[string]*dynamic.ServersTransport)
                }
 
-               if certInfo != nil {
+               if item.ExtraConf.ConnectEnabled {
                        confFromLabel.HTTP.ServersTransports[connectTransportName(item.Name)] = certInfo.serverTransport(connectTransportName(item.Name))
                }
 
diff --git a/pkg/provider/consulcatalog/consul_catalog.go b/pkg/provider/consulcatalog/consul_catalog.go
index 6fea9d65..1fac2446 100644
--- a/pkg/provider/consulcatalog/consul_catalog.go
+++ b/pkg/provider/consulcatalog/consul_catalog.go
@@ -497,9 +497,10 @@ func (p *Provider) watchConnectTLS(ctx context.Context) {
 
                case <-ticker.C:
                        p.certChan <- &connectCert{
-                               service: p.ServiceName,
-                               root:    rootCerts,
-                               leaf:    leafCerts,
+                               service:            p.ServiceName,
+                               insecureSkipVerify: p.Connect.InsecureSkipVerify,
+                               root:               rootCerts,
+                               leaf:               leafCerts,
                        }
                }
        }

After that we still have to figure out how to actually verfiy the certs properly :/

@Scandiravian
Copy link
Author

Scandiravian commented Jan 19, 2021

Hi @Gufran

Thanks for taking a look at my PR and for such a thorough reply! 😄

Not sure if I understand this. Do you mean you want to eliminate the distinction between the service and its proxy? If so then it won't work because the services are not exposed over the network when they run with a connect sidecar. Every service that participates in connect mesh gets its own isolated network, and the only way to get inside the network is to go via the sidecar. Only the proxy is able to perform TLS handshake.

I think I might be a bit confused, because as far as I can understand in the code, the L7 traffic management in Consul is circumvented, since the "consulService.ServiceAddress" returns the IP of the service. There'll be both a service for the connect-proxy and one for the service itself.

I've setup a test environment with a service (web), that has an envoy sidecar (web-sidecar-proxy). The host name is defined with:

defaultRule: 'Host(`{{ .Name }}.{{ index .Labels "com.consul.service"}}example.com`)'

When I then call web-sidecar-proxy.example.com I'll use the TLS connection, whereas web.example.com will be a regular HTTP call.

I might be missing something here, so if you can see where I'm going wrong, I'd be very happy to get an explanation 😄

it already works this way. The watcher loop relies on consul blocking queries and the state is refreshed if and only if there is an update available in catalog.

I think my inexperience in Go probably is showing here (my experience boils down to a few hours last Friday learning it and the time I spent figuring out, what was going wrong)

What tripped me up was this

ticker := time.NewTicker(time.Duration(p.RefreshInterval))

for {
	select {
	case <-ctx.Done():
		ticker.Stop()
		return

	case rootCerts = <-rootChan:
	case leafCerts = <-leafChan:

	case <-ticker.C:
		p.certChan <- &connectCert{
			service: p.Connect.ServiceName,
			root:    rootCerts,
			leaf:    leafCerts,
		}
	}
}

Since I understood it as a pull mechanism, where the watcher functions were called on each tick. Thanks for helping me understand Go a bit better 😄

@Scandiravian
Copy link
Author

@apollo13 Good catch! I'll add a small todo and fix it later tonight or tomorrow 😄

@apollo13
Copy link
Collaborator

I think I might be a bit confused, because as far as I can understand in the code, the L7 traffic management in Consul is circumvented, since the "consulService.ServiceAddress" returns the IP of the service. There'll be both a service for the connect-proxy and one for the service itself.

No, ServiceAddress & ServicePort of the connect enabled service will be the sidecar proxy. It is true that the original service might have the same ServiceAddress in most cases since it is in the same pod; but for security reasons the actual service will listen only on localhost. Traefik should never be able to connect to it.

I might be missing something here, so if you can see where I'm going wrong, I'd be very happy to get an explanation smile

I guess it depends on how you look at it; in your example you should probably rename the sidecar service from web-sidecar-proxy to web and the web service to web-backend. Generally you do not want to expose both services to traefik; what do you gain by using connect if you can also reach the unsecured http port…

Ability to configure whether to use the non-connect service as a fallback in case traefik fails to use Consul Connect

Imo a strong "no", if connect is configured then you want it to be used. If done right traefik cannot (physically) connect to the non-connect service anyways

@apollo13
Copy link
Collaborator

apollo13 commented Jan 19, 2021

I fetched a consul leaf certificate:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 1591 (0x637)
        Signature Algorithm: ecdsa-with-SHA256
        Issuer: CN = pri-own2hh5.consul.ca.9243f0dc.consul
        Validity
            Not Before: Jan 19 19:29:58 2021 GMT
            Not After : Jan 22 19:29:58 2021 GMT
        Subject: CN = redissidecarproxy.svc.default.9243f0dc.consul
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:12:58:8f:fe:25:53:98:b0:ae:93:5e:e3:aa:03:
                    6a:98:6c:f3:44:1a:d9:e6:12:2b:cd:82:6d:66:2e:
                    d3:30:e1:59:b4:4f:d6:0d:52:d9:6b:1c:58:44:e9:
                    1f:ae:a7:21:aa:27:b4:24:51:99:d6:c5:48:dc:9c:
                    63:0e:8a:46:34
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment, Data Encipherment, Key Agreement
            X509v3 Extended Key Usage: 
                TLS Web Client Authentication, TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier: 
                85:50:9C:2B:2F:4D:5A:01:09:F1:2A:56:5E:43:8B:24:55:B2:58:D0:81:08:50:67:DC:DE:73:BE:AA:00:69:E4
            X509v3 Authority Key Identifier: 
                keyid:99:50:42:14:54:78:F9:C6:00:23:F2:4C:C5:FB:94:D4:B9:5E:3A:EE:8B:E5:36:53:12:12:18:D5:0B:B8:3E:57

            X509v3 Subject Alternative Name: 
                URI:spiffe://9243f0dc-9e90-e0db-07dc-0f22dd5b2e93.consul/ns/default/dc/dc1/svc/redis-sidecar-proxy
    Signature Algorithm: ecdsa-with-SHA256
         30:46:02:21:00:a1:5d:f9:00:15:8e:05:ae:3c:fc:f1:e6:dd:
         81:7a:99:1b:c3:62:67:ef:69:80:0e:3b:bc:8f:90:5e:3b:7f:
         f5:02:21:00:f9:dc:c9:77:b9:8b:99:f8:39:98:bd:d5:d1:ba:
         16:59:ee:ff:d9:9e:11:44:9b:0b:e9:b4:2f:b4:c7:98:7d:35

We somehow need to tell Traefik/go to not validate the hostname; I wonder of a low-cost solution would be a validateHostname flag on the ServersTransport instead of allowing to override dial/tls_configs somehow.

EDIT:// what does insecureSkipVerify do exactly? does it also skip CA checks? If not setting it to true would be okay.

@Scandiravian
Copy link
Author

Scandiravian commented Jan 20, 2021

No, ServiceAddress & ServicePort of the connect enabled service will be the sidecar proxy. It is true that the original service might have the same ServiceAddress in most cases since it is in the same pod; but for security reasons the actual service will listen only on localhost. Traefik should never be able to connect to it.

I'm a bit confused about this statement, since that's not what I'm seeing here. I'm running a mock environment through docker-compose with 2 services (api and web), they each have a sidecar proxy (api-sidecar-proxy and web-sidecar-proxy). When traefik calls Consul to ask for the list of services, both the service and the service proxy is returned

With a default rule for consul services in traefik, web.example.com will route to the web-service on its non-tls port, while web-sidecar-proxy.example.com will route to the TLS port on the sidecar service

In the traefik's logs and the dashboard, it also shows both the service itself and its sidecar

It could be that this is because I'm using docker-compose to spin up the services, since docker-compose doesn't have a concept of sidecars and everything simple shares a network bridge. I'd like to see if this also happens if I set things up in Nomad.

I guess it depends on how you look at it; in your example you should probably rename the sidecar service from web-sidecar-proxy to web and the web service to web-backend. Generally you do not want to expose both services to traefik; what do you gain by using connect if you can also reach the unsecured http port.

I think I've phrased my previous posts poorly. I'm not saying "I want" to expose both services to traefik, but in my setup they are by default, which made me assume that this is the expected behaviour with sidecar proxies. It now seems to me that I'm probably mistaken and something is not as it should be 😄

@Scandiravian
Copy link
Author

Scandiravian commented Jan 20, 2021

what does insecureSkipVerify do exactly? does it also skip CA checks? If not setting it to true would be okay.

If set to true it skips both hostname and chain verification. There was a change to the docs back in June to acknowledge, that it's okay to disable these checks, as long as some of the other verification methods are used source

I did a quick look through the crypto/tls source code and it doesn't seem like there is a way to disable only the hostname validation 😕

Edit: I might have been wrong https://github.com/golang/go/blob/928bda4f4a88efe2e53f3607e8d2ad0796b449c0/src/crypto/tls/example_test.go#L186-L228

@apollo13
Copy link
Collaborator

I did a quick look through the crypto/tls source code and it doesn't seem like there is a way to disable only the hostname validation 😕

Yes that is right, one has to provide a VerifyConnection callback to implement your own handling then, the test you linked shows an example of that. That said the biggest question is how to teach traefik to do that :)

I'm running a mock environment through docker-compose with 2 services (api and web), they each have a sidecar proxy (api-sidecar-proxy and web-sidecar-proxy). When traefik calls Consul to ask for the list of services, both the service and the service proxy is returned

the sidecar proxy running in the same containers or do you have 4 containers in total? But yes this will (even in nomad) register 4 services in consul (2 "real" and 2 "proxies")

With a default rule for consul services in traefik, web.example.com will route to the web-service on its non-tls port, while web-sidecar-proxy.example.com will route to the TLS port on the sidecar service

That is also correct but I generally disable exposeByDefault and explicitly enable exposing on the services where I want them exposed. Note that this only concerns traefik; in consul you'd always see all services.

It could be that this is because I'm using docker-compose to spin up the services, since docker-compose doesn't have a concept of sidecars and everything simple shares a network bridge. I'd like to see if this also happens if I set things up in Nomad.

When deploying that to nomad you will also see the same number of services; but the non-proxy services will not be reachable if (emphasis on if) you configured them properly (ie the should only listen on 127.0.0.1)

@Scandiravian
Copy link
Author

Scandiravian commented Jan 20, 2021

Yes that is right, one has to provide a VerifyConnection callback to implement your own handling then, the test you linked shows an example of that. That said the biggest question is how to teach traefik to do that :)

I think the best way to solve this is to ask the Consul maintainers to implement the callback on their end, then add a field in the ServersTransport in traefik, which can be set to the aforementioned callback function in config.go. I can open a feature request on the Consul repo to request this, if we can agree on this approach? 😄

the sidecar proxy running in the same containers or do you have 4 containers in total? But yes this will (even in nomad) register 4 services in consul (2 "real" and 2 "proxies")

That is also correct but I generally disable exposeByDefault and explicitly enable exposing on the services where I want them exposed. Note that this only concerns traefik; in consul you'd always see all services.

That makes sense. I think it will be important to give a good description of this in the traefik docs as best practices, since it might not be intuitively obvious, how to do this setup and preserve safety :)

Thanks for the explanation, it helped me understand the underlying mechanisms much better!

@Scandiravian
Copy link
Author

Consul already has one:
https://github.com/hashicorp/consul/blob/6ecf3b72ca3e43c143fc10f1c3589f2ed4eefbd5/connect/tls.go#L257-L260
which is then used as:
https://github.com/hashicorp/consul/blob/6ecf3b72ca3e43c143fc10f1c3589f2ed4eefbd5/connect/tls.go#L362-L366

They both work on the tls config, not the tls connection, which I think is needed to verify the calls themselves (like here). They are also private functions (they start with lower case), so they would need to be made public for us to get access to them

@apollo13
Copy link
Collaborator

Ah sorry don't know enough about go wrt private functions. But working on the TLS config object should be fine; when the roundtrippper is generated we have a TLS config: https://github.com/traefik/traefik/blob/2747e240c1a97031367a1a566a1401a2367a54d2/pkg/server/service/roundtripper.go#L143-L148 -- at this point you'd also have access to the ServersTransport, no?

@apollo13
Copy link
Collaborator

Mhm, I am leaning towards simply copying verifyChain https://github.com/hashicorp/consul/blob/6ecf3b72ca3e43c143fc10f1c3589f2ed4eefbd5/connect/tls.go#L264 and adjusting to our needs.

@apollo13
Copy link
Collaborator

This is the code that upstream Go does for verification if InsecureSkipVerify is not set: https://github.com/golang/go/blob/520f3b72db7befab2028d9a47376267cf2d274a9/src/crypto/tls/handshake_client.go#L837-L851 -- we should be able to use that in a verifier function minus the DNSName.

@apollo13
Copy link
Collaborator

@Gufran The PR is now working, it would be great if you could merge this so we can continue the discussion on the original traefik PR.

@Gufran
Copy link
Owner

Gufran commented Jan 21, 2021

Thank you guys, I really appreciate all the help.

@Gufran Gufran merged commit fec2c5e into Gufran:consul-connect-backend Jan 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants