|
| 1 | +Token Scanning |
| 2 | +============== |
| 3 | + |
| 4 | +People make mistakes. Sometimes, they post their PyPI tokens publicly. Some |
| 5 | +content managers run regexes to try and identify published secrets, and ideally |
| 6 | +have them deactivated. PyPI has started integrating with such systems in order |
| 7 | +to help secure packages. |
| 8 | + |
| 9 | +.. contents:: |
| 10 | + :local: |
| 11 | + |
| 12 | +How to recognize a PyPI secret |
| 13 | +------------------------------ |
| 14 | + |
| 15 | +A PyPI API token is a string consisting of a prefix (``pypi``), a separator |
| 16 | +(``-``) and a macaroon serialized with PyMacaroonv2, which means it's the |
| 17 | +``base64`` encoding of a byte string that starts with:: |
| 18 | + |
| 19 | + \x02\x01\x08pypi.org\x02\x01b |
| 20 | + |
| 21 | +Thanks to this, we know that a PyPI token is bound to match:: |
| 22 | + |
| 23 | + pypi-AgEIcHlwaS5vcmc[A-Za-z0-9-_]{70,} |
| 24 | + |
| 25 | +A token can be arbitrarily long because we may add arbitrarily many caveats. For |
| 26 | +more details on the token format, see `pypitoken |
| 27 | +<https://pypitoken.readthedocs.io>`_. |
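As a quick sanity check of the format described above, the following sketch encodes the quoted header bytes and compiles the quoted pattern. It assumes the URL-safe, unpadded ``base64`` alphabet (suggested by the ``-`` and ``_`` in the character class); the names ``MACAROON_HEADER``, ``encoded_header`` and ``find_candidate_tokens`` are illustrative and not part of Warehouse.

```python
import base64
import re

# Leading bytes of the serialized macaroon, as quoted in the text above.
MACAROON_HEADER = b"\x02\x01\x08pypi.org\x02\x01b"

# The pattern quoted in the text above, compiled unchanged.
TOKEN_PATTERN = re.compile(r"pypi-AgEIcHlwaS5vcmc[A-Za-z0-9-_]{70,}")


def encoded_header() -> str:
    """Return the header bytes in URL-safe base64 without padding (assumed)."""
    return base64.urlsafe_b64encode(MACAROON_HEADER).decode("ascii").rstrip("=")


def find_candidate_tokens(text: str) -> list:
    """Return every substring of ``text`` that looks like a PyPI token."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # The encoded header begins with the fixed part of the pattern.
    print(encoded_header().startswith("AgEIcHlwaS5vcmc"))
    # A made-up, non-functional token used only to exercise the pattern.
    fake = "pypi-AgEIcHlwaS5vcmc" + "A" * 70
    print(find_candidate_tokens(f"password = '{fake}'") == [fake])
```

The pattern only flags candidates; whether a candidate corresponds to a real token still has to be checked against the database.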
| 28 | + |
| 29 | +GitHub Secret Scanning |
| 30 | +---------------------- |
| 31 | + |
| 32 | +GitHub's feature used to be called "Token Scanning" and is now "Secret |
| 33 | +Scanning"; you may encounter both names. GitHub scans public commits with |
| 34 | +the regex above (in practice, they only match tokens at least 130 characters |
| 35 | +long). For all tokens identified within a "push" event, they send us reports in |
| 36 | +bulk. The format is explained thoroughly in `their doc |
| 37 | +<https://docs.github.com/en/developers/overview/secret-scanning>`_ as well as |
| 38 | +in the `warehouse implementation ticket |
| 39 | +<https://github.com/pypa/warehouse/issues/6051>`_. |
| 40 | + |
| 41 | +In short: they send us a cryptographically signed payload describing each |
| 42 | +leaked token along with a public URL pointing to it. |
| 43 | + |
| 44 | +How to test it manually |
| 45 | +^^^^^^^^^^^^^^^^^^^^^^^ |
| 46 | + |
| 47 | +A fake GitHub service is launched by Docker Compose. Point your browser to |
| 48 | +``http://localhost:8964``. Create/reorder/... one or more public keys, make |
| 49 | +sure one key is marked as current, then write your payload, using the following |
| 50 | +format: |
| 51 | + |
| 52 | +.. code-block:: json |
| 53 | + |
| 54 | + [{ |
| 55 | + "type": "pypi_api_token", |
| 56 | + "token": "pypi-...", |
| 57 | + "url": "https://example.com" |
| 58 | + }] |
| 59 | + |
| 60 | +Send your payload. The fake service sends it to your local Warehouse. If a match is found, you |
| 61 | +should find that: |
| 62 | + |
| 63 | +- the token you sent has disappeared from the user account page, |
| 64 | +- 2 new security events have been recorded: one for the token deletion, one for the |
| 65 | + notification email. |
| 66 | + |
| 67 | +After you send the token, the page will reload, and you'll find the details of |
| 68 | +the request at the bottom. If all went well, you should see a ``204`` ('No |
| 69 | +Content'). |
| 70 | + |
| 71 | +Whether it worked or not, a bunch of metrics have been issued. You can see them |
| 72 | +in the ``notdatadog`` container log. |
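If you prefer generating the payload over typing it by hand, a minimal sketch follows. It only reproduces the JSON shape shown above; ``build_payload`` is a hypothetical helper, not something shipped with Warehouse, and the output still has to be pasted into the fake GitHub service, which takes care of signing and sending.

```python
import json


def build_payload(disclosures) -> str:
    """Serialize (token, url) pairs into the report format shown above."""
    return json.dumps(
        [
            {"type": "pypi_api_token", "token": token, "url": url}
            for token, url in disclosures
        ],
        indent=2,
    )


if __name__ == "__main__":
    # Replace the elided token with one created in your local Warehouse.
    print(build_payload([("pypi-...", "https://example.com")]))
```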
| 73 | + |
| 74 | +GitLab Secret Detection |
| 75 | +----------------------- |
| 76 | + |
| 77 | +GitLab also has an equivalent mechanism, named "Secret Detection", not |
| 78 | +implemented in Warehouse yet (see `#9280 |
| 79 | +<https://github.com/pypa/warehouse/issues/9280>`_). |
| 80 | + |
| 81 | +PyPI token disclosure infrastructure |
| 82 | +------------------------------------ |
| 83 | + |
| 84 | +The code is mainly in ``warehouse/integration/github``. |
| 85 | +There are 3 main parts in handling a token disclosure report: |
| 86 | + |
| 87 | +- The Web view, which is the top-level glue but does not implement the logic |
| 88 | +- Vendor-specific authenticity check & loading. In the case of GitHub, we check |
| 89 | + that the payload and the associated signature match the public keys |
| 90 | + available in their meta-API |
| 91 | +- (Supposedly-)Vendor-independent disclosure analysis: |
| 92 | + |
| 93 | + - Each token is processed individually in its own celery task |
| 94 | + - Each token is analyzed: we check if its format is correct and if it |
| 95 | + corresponds to a macaroon we have in the DB |
| 96 | + - We don't check the signature. This is something that could change in the |
| 97 | + future but for now, we consider that if a token identifier leaked, even |
| 98 | + without a valid signature, it's enough to warrant deleting it. |
| 99 | + - If it's valid, we delete it, log a security event and send an email |
| 100 | + (which will spawn a second celery task) |
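The vendor-independent analysis steps listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Warehouse implementation: ``FakeDatabase``, ``parse_identifier`` and ``analyze_disclosure`` are hypothetical names, the database is an in-memory dictionary, and the identifier is taken to be whatever follows the prefix instead of being read from a deserialized macaroon.

```python
from dataclasses import dataclass, field
from typing import Optional

PREFIX = "pypi-"


@dataclass
class FakeDatabase:
    """Hypothetical in-memory stand-in for the macaroon table and logs."""

    macaroons: dict = field(default_factory=dict)  # identifier -> owner
    security_events: list = field(default_factory=list)
    emails: list = field(default_factory=list)


def parse_identifier(token: str) -> Optional[str]:
    """Placeholder for macaroon deserialization (assumption).

    Only the format is checked. The signature is deliberately not
    verified, matching the behavior described above.
    """
    if not token.startswith(PREFIX) or len(token) == len(PREFIX):
        return None
    return token[len(PREFIX):]


def analyze_disclosure(db: FakeDatabase, token: str, url: str) -> str:
    """Process one leaked token; in Warehouse this runs in its own task."""
    identifier = parse_identifier(token)
    if identifier is None:
        return "invalid_format"
    owner = db.macaroons.pop(identifier, None)  # delete the token if known
    if owner is None:
        return "not_found"
    db.security_events.append((owner, "token_removed", url))
    db.emails.append((owner, url))  # stands in for the second task
    return "deleted"


if __name__ == "__main__":
    db = FakeDatabase(macaroons={"abc": "alice"})
    print(analyze_disclosure(db, "pypi-abc", "https://example.com"))
```

Returning a short status string per token keeps each task independent, so one malformed entry in a bulk report does not affect the others.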