Project detail statistics: respect robots.txt #12791
Thanks for the issue. Given that we have to manually integrate with statistics providers (and likely would have to integrate with an API to acquire the data), I'm not sure this is necessary. Can you give me a more specific example of how we'd be creating traffic that would violate a given site's robots.txt?
The PyPI service itself wouldn't generate the traffic, so I don't think there's a requirement to do this.

And on reflection, perhaps you're right - most of the traffic is (should be?) coming from human users, so perhaps this isn't needed. It felt to me that if we could determine that a provider has indicated that a particular URL shouldn't be fetched automatically, then we could filter unnecessary/unwanted requests to it on behalf of the provider. (It's definitely possible that I'm a bit confused here.)
Yep; in particular I think I'd conflated the ability of release authors to list fairly arbitrary URLs on their releases (not too dissimilar to handling untrusted user input) with the possibility of arbitrary downstream requests from distributed browsers. That shouldn't be possible, because the server filters to specific URL patterns and the browser then generates requests only within that relatively narrow channel (which GitHub would presumably complain about if and when it received unwanted levels of traffic). This can probably be closed, although I do wonder if it could be a useful additional safety check (especially in the context of multiple statistics providers).
(And to actually answer your question: I don't have any examples of this occurring.)
I think if we move forward on #12789 we can revisit, but my gut says that this probably won't be necessary. Closing for now!
Is that true? (In context here, the user doesn't necessarily have a say in whether a request is made by their user agent -- somewhat robotically -- when the relevant scripts are evaluated. Nor does the destination site have any of their own code, as far as I know, in the relevant scripting, so they can't do much to adjust traffic from the origin.)
What's the problem this feature will solve?
During detection of statistics provider URLs, it would probably make sense to apply a post-filter that removes URLs disallowed by the relevant `robots.txt` rules. This would give downstream statistics providers a mechanism to reduce their traffic load.

Describe the solution you'd like
Integrating with the `robotexclusionrulesparser` package could be one way to achieve this. Python's standard library also includes a `robots.txt` parser -- although in my experience Python's default user-agent is frequently blocked by webservers, making it difficult to retrieve the `robots.txt` file using that approach.

Additional context
Relates to #12789.
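For concreteness, here is a minimal sketch of the kind of post-filter described above, using only Python's standard library (`urllib.robotparser`). The `filter_disallowed` helper, the `USER_AGENT` string, and the per-URL fetch are illustrative assumptions rather than anything in Warehouse; `robots.txt` is requested with an explicit User-Agent header precisely because the default one is often blocked:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

# Assumed User-Agent for fetching robots.txt; Python's default urllib
# User-Agent is frequently blocked, so an explicit one is sent instead.
USER_AGENT = "pypi-statistics-check/0.1 (hypothetical)"


def _robots_for(url: str) -> RobotFileParser:
    """Fetch and parse robots.txt for the host that serves ``url``."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    parser = RobotFileParser(robots_url)
    try:
        request = Request(robots_url, headers={"User-Agent": USER_AGENT})
        with urlopen(request, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
        parser.parse(body.splitlines())
    except OSError:
        # If robots.txt cannot be retrieved, fall back to "no rules",
        # which can_fetch() treats as everything being allowed.
        parser.parse([])
    return parser


def filter_disallowed(candidate_urls: list[str], user_agent: str = USER_AGENT) -> list[str]:
    """Post-filter: keep only the candidate URLs that robots.txt allows."""
    # A real implementation would presumably cache one parser per host
    # rather than re-fetching robots.txt for every candidate URL.
    return [url for url in candidate_urls if _robots_for(url).can_fetch(user_agent, url)]
```

Swapping in the `robotexclusionrulesparser` package mentioned above would presumably only change how the rules are fetched and parsed; the post-filtering step itself would stay the same.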