[DRAFT] PyPI Observation Reporting Payload #14503

Closed
miketheman opened this issue Sep 7, 2023 · 9 comments
Labels
malware-detection Issues related to automated malware detection. meta Meta issues (rollouts, etc) needs discussion a product management/policy issue maintainers and users should discuss security Security-related issues and pull requests

Comments

@miketheman
Member

Background

As part of a broader malware handling project, we’re looking at creating the first implementation by which Reporters may submit Observations about packages on PyPI.
(Specific terminology may change as we learn more.)

We’re still working out other parts of the API infrastructure, but wanted to get a conversation started on what the API transaction payload between Reporters and PyPI might look like.

As a refresher, our current security reporting process requests the following via email:

  • A URL to the project in question
  • An explanation of what makes the project a security issue
  • If applicable: a link to the problematic lines in the project's distributions via inspector.pypi.io

We have talked with a variety of security researchers and reporters who have been reporting malicious packages to PyPI Security, to understand what aspects of reporting they prefer. We’ve learned more about their processes and use cases, and believe we’ve come up with something that is more automation-friendly and leverages an evolving standard.

Proposal

We’ve learned that there’s a general desire for more standards in the overall security ecosystem, and the OSV project has defined a machine-friendly format for collecting published advisories.
The OSV Schema 1.6.0 is used for advisory databases.

While PyPI isn’t an advisory database, we thought that basing the inbound payload on a format similar to the OSV schema would be more sustainable long term: we don’t invent our own standard, but rather layer some extras on top of the existing one.

Minimal Example

A terse, minimal example that expresses only the absolutely required keys:

{
  "schema_version": "1.6.0+pypi",
  "modified": "2021-01-01T00:00:00Z",
  "summary": "During installation of package, Bitcoin miner installed and activated",
  "affected": [
    {
      "package": {
        "name": "request3",
        "ecosystem": "PyPI"
      },
      "versions": ["2.19.5"]
    }
  ],
  "references": [
    {
      "type": "INSPECTOR_URL",
      "url": "https://inspector.pypi.io/project/request3/2.19.5/..."
    }
  ]
}

Changes from OSV schema

  • remove id as a required top-level key
    • As a reporter, you are unlikely to have an ID yet when reporting. We could preserve the requirement, and allow any string, and discard it?
  • add schema_version, summary, affected, references as required top-level keys
    • These already exist in the schema, we're making them required for transacting with our (future) API
  • add some validations of:
    • at least 1 package
    • ecosystem is PyPI
    • references.[INSPECTOR_URL] starts with inspector.pypi.io

The only extension to the OSV Schema here (beyond validations) is adding INSPECTOR_URL to references - to be explicit vs WEB, EVIDENCE or any other keys in the base schema.
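
To make the validations listed above concrete, here is a minimal sketch of how they might be checked in plain Python. The function name `validate_observation` and the error strings are hypothetical, and requiring at least one INSPECTOR_URL reference is an assumption drawn from the discussion rather than something the draft states outright.

```python
from urllib.parse import urlparse

# Required top-level keys, as listed in "Changes from OSV schema".
REQUIRED_KEYS = ("schema_version", "summary", "affected", "references")


def validate_observation(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    errors = [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in payload]

    # At least one package, and every package must be in the PyPI ecosystem.
    affected = payload.get("affected") or []
    if not affected:
        errors.append("at least one affected package is required")
    for entry in affected:
        ecosystem = entry.get("package", {}).get("ecosystem")
        if ecosystem != "PyPI":
            errors.append(f"unsupported ecosystem: {ecosystem!r}")

    # Assumption: at least one INSPECTOR_URL reference must be present.
    inspector_refs = [
        ref for ref in payload.get("references") or []
        if ref.get("type") == "INSPECTOR_URL"
    ]
    if not inspector_refs:
        errors.append("at least one INSPECTOR_URL reference is required")
    for ref in inspector_refs:
        # Compare the hostname rather than a raw string prefix.
        if urlparse(ref.get("url", "")).hostname != "inspector.pypi.io":
            errors.append(f"not an inspector.pypi.io URL: {ref.get('url')!r}")
    return errors
```

Returning a list of problems instead of raising on the first one lets a reporter fix everything in a single round trip, which may matter for automated submitters.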

A lot of potential extras are in database_specific, and we can extend the schema to decide what’s required or not. That said, database_specific seems geared more towards advisory databases and long-term storage than towards interaction payloads.

The verbose example below shows what that might look like.

Verbose Example

{
  "schema_version": "1.6.0+pypi",
  "modified": "2021-01-01T00:00:00Z",
  "summary": "request3 downloads bitcoin miner on install",
  "details": "request3 downloads a bitcoin miner on install and exhausts the CPU \n and then makes the system unresponsive.",
  "affected": [
    {
      "package": {
        "ecosystem": "PyPI",
        "name": "request3"
      },
      "versions": [
        "2.19.5"
      ]
    }
  ],
  "references": [
    {
      "type": "PACKAGE",
      "url": "https://pypi.org/project/request3/2.19.5/"
    },
    {
      "type": "INSPECTOR_URL",
      "url": "https://inspector.pypi.io/project/request3/2.19.5/packages/a8/72/e876c05c2af349ac10a140f4e53b1163723be3f77db729377a127df480cd/request3_2.19.5.tar.gz/request3-2.19.5/setup.py"
    }
  ],
  "credits": [
    {
      "name": "Awesome Researcher Team",
      "contact": [
        "[email protected]"
      ],
      "type": "FINDER"
    }
  ],
  "database_specific": {
    "confidence": "HIGH",
    "iocs": [
      {
        "type": "INJECTION_COMPONENT",
        "name": "Trojan Horse"
      },
      {
        "type": "OBFUSCATION_TECHNIQUE",
        "name": "PyArmor"
      },
      {
        "type": "TRIGGER",
        "name": "install"
      },
      {
        "type": "TARGET_PLATFORM",
        "name": "Windows"
      },
      {
        "type": "OBJECTIVE",
        "name": "Financial Gain"
      }
    ]
  }
}

iocs is shorthand for Indicators Of Compromise

I pulled the iocs values from an index curated by Backstabbers Knife Collection (private).
There's also ATT&CK, CAPEC, CWE & CVE from MITRE - seems pretty useful to me.

Cool, so what?

We need your input on what other fields we should consider required, and what names for categories we'd allow.
Also, whether this format makes sense, or something else entirely is better or worse?

Please feel free to comment here, or if you'd prefer to converse privately, email me at mike at python dot org

Thanks for your feedback and engagement!

@louislang

Seems really reasonable to me @ewdurbin! Will iocs be a predefined thing, or will it be free form? We've been kicking around using ATT&CK internally for our IOCs. While not explicitly supply chain, it's got coverage of most techniques we're likely to encounter.

@miketheman
Member Author

Will iocs be a predefined thing, or will it be free form?

That's one of the questions we're trying to answer here. If there's particular IOC collections folks are pretty used to and happy with, we'd like to make that easy for y'all to report.

@import-pandas-as-numpy

import-pandas-as-numpy commented Sep 9, 2023

Echoing @louislang here, looks good from our (Vipyr Security) end as far as what we were anticipating this system would look like, at least. The IOC discussion is a little complicated-- if this is something that's going to be extended forward in the future, I'm not sure ATT&CK is sustainable in that regard. (Can novice reporters reliably generate ATT&CK mappings, or are we going to use the e-mail system forever?)

Like Phylum, we've also been looking at ATT&CK behavior mapping internally, but chaining automated detection rules to produce comprehensive summaries of the capabilities and behaviors specific to ATT&CK can be quite the task.

Ultimately, I think IOC's is going to be whatever PyPI administrators want to make of it-- if you (PyPI) want to see direct IOC mappings with some standard for data aggregation reasons, then that seems reasonable. However, I'm not sure it does a whole lot for us (the researchers/reporters). It seems reasonable to me to make this a soft requirement-- and allow individuals within the trusted reporting sphere to classify reports after the fact if they so choose, as well as review and update those reports as necessary if possible.

I can think of a fair few situations where our organization can discern malicious intent with a few IOC's, only to peel back the layers and discover that this isn't a discrete 'hey it runs some obfuscated code'. And honestly, there are situations where we (collectively) simply don't have the time nor the bandwidth to accurately classify these as well.

So I can see this being a situation that degrades that internal warehousing of malicious packages, and subsequently, makes the system less effective.

That was a little wordy, summary:

  • Looks like what we were expecting, good work!
  • ATT&CK is a good starting point for IOC's, and is something we discussed in the face to face meeting.
  • It drives a fair bit of work on our end for automated IOC synthesis.
  • There are situations where we can probably make a determination on maliciousness without a full understanding of all capabilities, and there are situations where we wouldn't want to degrade timely removal over full enumeration of capabilities.
  • I don't actually have any strong opinions against the simple example you provided in the original post. Seems to convey enough information to inform the takedown action (whether automated or manual), as well as give us some things to chew on in the security sphere if this information is accessible to us in any way to help us develop further rulesets or pull packages based on some standard behaviors.

@calebbrown

Hi!

Thanks for the proposal - this is definitely on the right track!

Basing the reporting payload on OSV is a good idea. We (OpenSSF) are using OSV for tracking malware (see https://github.com/ossf/malicious-packages), which seems like a good application of OSV.

That said, I do have some comments on the specification above.

  1. The extensions technically make this proposal separate to OSV as defined by https://github.com/ossf/osv-schema. In particular the schema_version: "1.6.0+pypi" and the addition of INSPECTOR_URL to references are extensions that don't exist upstream.

    Diverging from OSV may create future problems, as these OSV reports won't be compatible with OSV tools.

    To help me understand, what was the rationale to not use something like EVIDENCE and parse the url for inspector.pypi.io? Or add a database_specific entry for these references instead?

  2. Are the IOCs proposed actually IOCs? Aren't IOCs explicit indicators (e.g. domains, URLs, IPs, file hashes, etc.)? What is proposed is certainly useful, but appears to be more oriented towards classifying the malware based on its behavior and target.

    For https://github.com/ossf/package-analysis and https://github.com/ossf/malicious-packages we are planning on automatically passing through IOCs such as domains, and URLs.

  3. For IOCs, I was also wondering what the rationale was for choosing a list of objects (e.g. "iocs": [ { "type": "FOO", ... } ]) rather than just an object (e.g. "iocs": { "FOO": ... })?

I have some vested interest in the IOCs being defined well, as it would be convenient to share the same definition for IOCs and other meta-data with our use in https://github.com/ossf/malicious-packages.

@miketheman
Member Author

@import-pandas-as-numpy Thanks for your detailed response!

... or are we going to use the e-mail system forever?

I think we'll likely always keep the email system active, since we use it for more than malware reports, but are looking for ways to automate and speed up the handling.

Re: IOCs - I totally hear you! We wanted to start the conversation, and see what folks are using.

It seems reasonable to me to make this a soft requirement...

Indeed! With our initial proposal, they are not required, rather we'd like to get clear on the format that folks might want to use. And it looks like there's definitely interest in ATT&CK, and others - so what if we left that flexibility open, and y'all provide whatever IOCs from those sets you have when you have them.
Maybe part of our API design could be that once you've reported something, you could enrich the report with more details as you get them, in which case we'd follow "last one in wins" (something OSV schema does as well).
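
As a rough illustration of the "last one in wins" idea, the sketch below keeps only the most recently received submission per reporter and project. The `reporter` and `project` keys and the function name are hypothetical; nothing here reflects an actual PyPI API.

```python
def apply_submissions(submissions: list[dict]) -> dict[tuple[str, str], dict]:
    """Keep the most recently received submission per (reporter, project)."""
    current: dict[tuple[str, str], dict] = {}
    for submission in submissions:  # assumed to be in arrival order
        key = (submission["reporter"], submission["project"])
        # A later submission replaces the earlier one wholesale; no field merging.
        current[key] = submission
    return current
```

Replacing the whole record keeps the semantics simple: a reporter who enriches a report resends everything they know, rather than sending a partial patch.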

@miketheman
Member Author

@calebbrown Thanks for taking the time!

Diverging from OSV may create future problems ...

I tried to make this apparent earlier, maybe I missed the mark?

We're intending on using this payload as a transaction payload between Reporters and PyPI, and we can then re-materialize the OSV-compliant payload for a database to store (likely https://github.com/pypa/advisory-database).

To help me understand, what was the rationale to not use something like EVIDENCE and parse the url for inspector.pypi.io? Or add a database_specific entry for these references instead?

I considered using EVIDENCE since its description kinda matched what we want, but was also overly broad for validation.
If we want to require and validate that a report payload must contain an inspector URL, applying that validation at the EVIDENCE seemed restrictive, since someone could supply other URLs in an EVIDENCE value.
It might be my lack of jsonschema validation expertise, but I couldn't figure out a way to express "at least oneOf reference.[] entry with EVIDENCE that starts with pattern https://inspector...".

If there's some invocation I can layer on to jsonschema to get that, that'd be awesome! It might mean some more concrete Definitions for more types of things, so that they can be used as a contains.
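
One possible way to express that constraint is JSON Schema's `contains` keyword (draft 6 and later) combined with `const` and `pattern`. The sketch below is an assumption about how it could look, not the schema PyPI uses. It validates with the third-party `jsonschema` package when that is installed and otherwise falls back to an equivalent hand-written check. `REFERENCES_SCHEMA` and `has_inspector_evidence` are hypothetical names.

```python
import re

try:  # optional third-party dependency
    from jsonschema import Draft7Validator
except ImportError:
    Draft7Validator = None

INSPECTOR_PATTERN = r"^https://inspector\.pypi\.io/"

# "contains" passes when at least one array item matches the subschema,
# so other EVIDENCE or WEB references with different URLs are still allowed.
REFERENCES_SCHEMA = {
    "type": "object",
    "required": ["references"],
    "properties": {
        "references": {
            "type": "array",
            "contains": {
                "type": "object",
                "required": ["type", "url"],
                "properties": {
                    "type": {"const": "EVIDENCE"},
                    "url": {"type": "string", "pattern": INSPECTOR_PATTERN},
                },
            },
        }
    },
}


def has_inspector_evidence(payload: dict) -> bool:
    """True if some EVIDENCE reference points at inspector.pypi.io."""
    if Draft7Validator is not None:
        return Draft7Validator(REFERENCES_SCHEMA).is_valid(payload)
    # Fallback that mirrors the schema above without the dependency.
    refs = payload.get("references")
    if not isinstance(refs, list):
        return False
    return any(
        isinstance(ref, dict)
        and ref.get("type") == "EVIDENCE"
        and isinstance(ref.get("url"), str)
        and re.search(INSPECTOR_PATTERN, ref["url"]) is not None
        for ref in refs
    )
```

Because `contains` only needs one matching item, the upstream EVIDENCE type could stay unrestricted for every other reference in the list.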

Are the IOCs proposed actually IOCs? Aren't IOCs explicit indicators (e.g. domains, URLs, IPs, file hashes, etc.)?
...
I have some vested interest in the IOCs being defined well,

This is why I love these conversations - I'm not entirely certain, so let's figure that out together! It's apparent that different shops use this term, as well as classifications and categorizations, differently.

Our intent is to have reporters supply as much structured details as they can, completely recognizing that there's a ton of hard-to-categorize-in-time issues. If I'm misusing the term IOC, then let's figure out what the right one is 😁

3. For IOCs, I was also wondering what the rationale was for choosing a list of objects (e.g. "iocs": [ { "type": "FOO", ... } ]) rather than just an object (e.g. "iocs": { "FOO": ... })?

Nothing specific, other than I probably didn't think of it that way.
If we were taking the "use specific IDs from any framework", it could also be "iocs": ["FOO", ...]
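
To compare the three shapes discussed here, this sketch converts the list-of-objects form from the verbose example into the other two. The function names and the "TYPE:name" identifier format are illustrative assumptions, not part of any agreed schema.

```python
from collections import defaultdict


def iocs_as_mapping(iocs: list[dict]) -> dict[str, list[str]]:
    """Group a list of {"type", "name"} objects into {type: [names]}."""
    grouped = defaultdict(list)
    for ioc in iocs:
        # Values are lists because one type can occur more than once.
        grouped[ioc["type"]].append(ioc["name"])
    return dict(grouped)


def iocs_as_ids(iocs: list[dict]) -> list[str]:
    """Flatten to "TYPE:name" strings, standing in for framework identifiers."""
    return [f'{ioc["type"]}:{ioc["name"]}' for ioc in iocs]
```

One design point this surfaces: the object form needs list values as soon as a report names two entries of the same type (say, two target platforms), whereas the list-of-objects form handles repeats without any change.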

@oliverchang

I tried to make this apparent earlier, maybe I missed the mark?
We're intending on using this payload as a transaction payload between Reporters and PyPI, and we can then re-materialize the OSV-compliant payload for a database to store (likely https://github.com/pypa/advisory-database).

That makes sense! That said, I'm also concerned about the "schema_version": "1.6.0+pypi" approach, because while it's very similar to OSV, it's not compliant OSV in some subtle ways (e.g. INSPECTOR_URL). Having such similar formats differ in such minor ways might introduce a bit of confusion for users in the future.

Alternatively, would it work at all to place all necessary new fields/types (e.g. INSPECTOR_URL) into database_specific, and keep the meaning of all fields the same? We could consider the PyPI reporting format to be a "stricter subset of OSV".

The other differences you listed are around different requirements for fields (e.g. requiring more fields to exist, and the ecosystem be PyPI). But this seems all OK because you're just enforcing stricter validation on top of the existing OSV rules.

The one exception is the non-enforcement of requiring id, but I think that still fits under the definition of a "stricter subset of OSV". WDYT?
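
A sketch of what that "stricter subset" rewrite could look like, assuming the extended draft payload from the original post as input. The `inspector_urls` key under database_specific and the function name are hypothetical. EVIDENCE is used as the upstream reference type because it is the one suggested earlier in the thread.

```python
import copy


def to_strict_subset(payload: dict) -> dict:
    """Rewrite the extended draft payload to use only upstream OSV keys and types."""
    result = copy.deepcopy(payload)  # leave the caller's payload untouched
    # Drop the "+pypi" suffix so the version names a real OSV schema release.
    result["schema_version"] = payload["schema_version"].split("+", 1)[0]

    inspector_urls = []
    for ref in result.get("references", []):
        if ref.get("type") == "INSPECTOR_URL":
            ref["type"] = "EVIDENCE"  # reference type that exists upstream
            inspector_urls.append(ref["url"])

    # Hypothetical key: record which references were inspector links.
    if inspector_urls:
        result.setdefault("database_specific", {})["inspector_urls"] = inspector_urls
    return result
```

Note that a fully compliant record would still need an id, which this sketch does not generate.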

@miketheman
Member Author

Hey gang! It's been a while, thanks for your responses and patience! There's been a lot of work streams leading back to this conversation. An update!

I've since gotten a lot more clarity on this general topic, and am reconsidering the initial desire to have our inbound API reporting payload be very similar to the OSV schema "documentation" payload.

As I've gone down the path of API design, since we're looking at using structural API segments, there's less of a need to duplicate the same information in the payload.
For example, if I'm performing a POST api.py.org/api/projects/deadbeef/observations I can already satisfy a large amount of the data we'd need to create an OSV schema-compliant record, such as:

  • package name (deadbeef in the URL)
  • ecosystem (it's us, PyPI!)
  • modified (when it's submitted to us)
  • versions would be handled by posting to a specific project's release endpoint, so if there were specific releases identified as problematic, but not all of them, then use the release for that version

And since we'd want specific keys for our inbound reports, like an inspector URL, we no longer have to conflict with the OSV spec on keys like schema_version; instead we can use API versioning for payload formats, and can produce an OSV schema document based on the inbound format if we so desire.
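
The idea of deriving OSV fields from the request context could be sketched as follows. Everything here is an assumption for illustration: the function name, the `summary` and `inspector_url` body keys, and the choice of EVIDENCE as the reference type are not taken from the draft PR.

```python
from datetime import datetime, timezone
from typing import Optional


def build_osv_record(
    project: str,            # taken from the URL path, e.g. "deadbeef"
    version: Optional[str],  # set only when posting to a release endpoint
    body: dict,              # hypothetical inbound payload
    received_at: datetime,   # timezone-aware time the request was received
) -> dict:
    """Combine URL-derived context with the request body into an OSV-shaped dict."""
    affected = {"package": {"ecosystem": "PyPI", "name": project}}
    if version is not None:
        affected["versions"] = [version]
    return {
        "schema_version": "1.6.0",
        # "modified" is when the report was submitted, rendered in UTC.
        "modified": received_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "summary": body["summary"],
        "affected": [affected],
        "references": [{"type": "EVIDENCE", "url": body["inspector_url"]}],
    }
```

An id is deliberately left out, since assigning one would presumably happen when the record is stored rather than when it is received.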

I've got a draft PR underway, there's still some more work to do. But the general inbound reporting payload won't require any specific IOC mappings just yet, only our required keys for submitting, but we're very much open to watching the space unfold and learning more as we go.

Any thoughts you have that conflict with this approach, please let me know!

@miketheman
Member Author

I'm going to close this issue, as we've made progress on our API design and have at least a single API payload for Projects in preview now.
