Commit 434ee92

Add corresponding documentation
1 parent 6389d73 commit 434ee92

2 files changed

+101
-0
lines changed

docs/development/index.rst

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ or the `distutils-sig mailing list`_, to ask questions or get involved.
     development-database
     cloud
     malware-checks
+    token-scanning
 
 .. _`GitHub`: https://github.com/pypa/warehouse
 .. _`"What to put in your bug report"`: http://www.contribution-guide.org/#what-to-put-in-your-bug-report

docs/development/token-scanning.rst

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
Token Scanning
==============

People make mistakes. Sometimes, they post their PyPI tokens publicly. Some
content managers run regexes to try and identify published secrets, and ideally
have them deactivated. PyPI has started integrating with such systems in order
to help secure packages.

.. contents::
    :local:


How to recognize a PyPI secret
------------------------------

A PyPI API token is a string consisting of a prefix (``pypi``), a separator
(``-``) and a macaroon serialized with PyMacaroonv2, which means it's the
``base64`` encoding of a byte string starting with::

    \x02\x01\x08pypi.org\x02\x01b

Thanks to this, we know that a PyPI token is bound to match::

    pypi-AgEIcHlwaS5vcmc[A-Za-z0-9-_]{70,}

A token can be arbitrarily long because we may add arbitrarily many caveats.
For more details on the token format, see `pypitoken
<https://pypitoken.readthedocs.io>`_.

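
As a minimal sketch (not Warehouse's own code), the regex and the byte prefix from this section can be checked with the Python standard library. The helper name ``looks_like_pypi_token`` and the sample strings are made up for illustration; URL-safe base64 is assumed because the character class in the regex includes ``-`` and ``_``.

```python
import base64
import re

# Regex quoted in the section above, used unchanged.
TOKEN_RE = re.compile(r"pypi-AgEIcHlwaS5vcmc[A-Za-z0-9-_]{70,}")

# Leading bytes of the serialized macaroon, as given in the section above.
MACAROON_HEADER = b"\x02\x01\x08pypi.org\x02\x01b"


def looks_like_pypi_token(candidate: str) -> bool:
    """Return True if the whole string matches the token regex (hypothetical helper)."""
    return TOKEN_RE.fullmatch(candidate) is not None


# URL-safe base64 of the header; its first characters are the fixed part of the regex.
encoded_header = base64.urlsafe_b64encode(MACAROON_HEADER).decode("ascii")

# Made-up strings for illustration only, not real tokens.
plausible = "pypi-AgEIcHlwaS5vcmc" + "A" * 70
too_short = "pypi-AgEIcHlwaS5vcmc" + "A" * 10

print(encoded_header.startswith("AgEIcHlwaS5vcmc"))
print(looks_like_pypi_token(plausible), looks_like_pypi_token(too_short))
```

The check only says that a string has the shape of a token; it says nothing about whether the token was ever issued.
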

GitHub Secret Scanning
----------------------

GitHub's secret scanning feature used to be called "Token Scanning" and is now
called "Secret Scanning"; you may come across both names. GitHub scans public
commits with the regex above (in practice, it only reports tokens that are at
least 130 characters long). For all tokens identified within a "push" event,
they send us reports in bulk. The format is explained thoroughly in `their doc
<https://docs.github.com/en/developers/overview/secret-scanning>`_ as well as
in the `warehouse implementation ticket
<https://github.com/pypa/warehouse/issues/6051>`_.

In short: they send us a cryptographically signed payload describing each
leaked token along with a public URL pointing to it.


How to test it manually
^^^^^^^^^^^^^^^^^^^^^^^

A fake GitHub service is launched by Docker Compose. Point your browser to
``http://localhost:8964``. Create/reorder/... one or more public keys, make
sure one key is marked as current, then write your payload, using the following
format:

.. code-block:: json

    [{
        "type": "pypi_api_token",
        "token": "pypi-...",
        "url": "https://example.com"
    }]

Send your payload. The fake service forwards it to your local Warehouse. If a
match is found, you should find that:

- the token you sent has disappeared from the user account page,
- 2 new security events have been recorded: one for the token deletion, one
  for the notification email.

After you send the token, the page will reload, and you'll find the details of
the request at the bottom. If all went well, you should see a ``204`` ('No
Content').

Whether it worked or not, a bunch of metrics have been issued; you can see them
in the ``notdatadog`` container log.

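
When preparing a payload to paste into the fake service, it can help to generate and sanity-check the JSON first. The sketch below uses only the three fields from the payload format in this section; the function names are hypothetical and are not part of Warehouse or of the fake service.

```python
import json

REQUIRED_FIELDS = {"type", "token", "url"}


def build_disclosure_payload(token: str, url: str) -> str:
    """Serialize a single report as a JSON list of objects (hypothetical helper)."""
    return json.dumps(
        [{"type": "pypi_api_token", "token": token, "url": url}], indent=2
    )


def validate_disclosure_payload(raw: str) -> list:
    """Parse a payload and check that each record has the expected fields."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("payload must be a JSON list")
    for record in records:
        if not isinstance(record, dict):
            raise ValueError("each record must be a JSON object")
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
    return records


# "pypi-..." is the placeholder from the format above; paste a token from your
# local development instance instead.
raw_payload = build_disclosure_payload("pypi-...", "https://example.com")
print(raw_payload)
print(len(validate_disclosure_payload(raw_payload)))
```
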

GitLab Secret Detection
-----------------------

GitLab also has an equivalent mechanism, named "Secret Detection", not
implemented in Warehouse yet (see `#9280
<https://github.com/pypa/warehouse/issues/9280>`_).


PyPI token disclosure infrastructure
------------------------------------

The code is mainly in ``warehouse/integration/github``.
There are 3 main parts in handling a token disclosure report:

- The Web view, which is the top-level glue but does not implement the logic
- Vendor-specific authenticity check & loading. In the case of GitHub, we check
  that the payload and the associated signature match the public keys
  available in their meta-API
- (Supposedly-)Vendor-independent disclosure analysis:

  - Each token is processed individually in its own Celery task
  - The token is analyzed: we check whether its format is correct and whether
    it corresponds to a macaroon we have in the DB
  - We don't check the signature. This is something that could change in the
    future but for now, we consider that if a token identifier leaked, even
    without a valid signature, it's enough to warrant deleting it.
  - If it's valid, we delete it, log a security event and send an email
    (which will spawn a second Celery task)
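
The vendor-independent analysis steps listed above can be sketched as follows. This is an illustration under loud assumptions, not the Warehouse implementation: ``FakeStore``, ``extract_identifier`` and ``analyze_disclosure`` are hypothetical names, the database is replaced by a dictionary, and the identifier is faked by slicing the token string, whereas the real code deserializes the macaroon.

```python
import re
from dataclasses import dataclass, field

TOKEN_RE = re.compile(r"pypi-AgEIcHlwaS5vcmc[A-Za-z0-9-_]{70,}")


def extract_identifier(token: str) -> str:
    """Hypothetical stand-in: slices the string instead of deserializing the macaroon."""
    return token[len("pypi-"):][:40]


@dataclass
class FakeStore:
    """Hypothetical in-memory replacement for the database."""
    macaroons: dict  # identifier -> owner name
    events: list = field(default_factory=list)
    emails: list = field(default_factory=list)


def analyze_disclosure(store: FakeStore, record: dict) -> str:
    """Handle one leaked token; the text above runs this step as one task per token."""
    token = record.get("token", "")
    if TOKEN_RE.fullmatch(token) is None:
        return "invalid_format"
    # Only the identifier is looked up; the signature is deliberately not checked.
    owner = store.macaroons.pop(extract_identifier(token), None)
    if owner is None:
        return "not_found"
    store.events.append(("token_deleted", owner, record.get("url")))
    # Per the text above, the email would be sent from a second task.
    store.emails.append(owner)
    store.events.append(("email_sent", owner, record.get("url")))
    return "deleted"


issued = "pypi-AgEIcHlwaS5vcmc" + "B" * 70
# Same identifier, different tail: stands in for a token with a bad signature.
leaked = "pypi-AgEIcHlwaS5vcmc" + "B" * 60 + "X" * 10
store = FakeStore(macaroons={extract_identifier(issued): "alice"})

report = [
    {"type": "pypi_api_token", "token": leaked, "url": "https://example.com"},
    {"type": "pypi_api_token", "token": "not-a-token", "url": "https://example.com"},
]
results = [analyze_disclosure(store, record) for record in report]
print(results)
```

Because the lookup uses only the identifier, the altered token still causes the stored macaroon to be deleted, which mirrors the choice described above of not checking the signature.
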
