Project detail statistics: respect robots.txt #12791
Thanks for the issue. Given that we have to manually integrate with statistics providers (and likely would have to integrate with an API to acquire the data), I'm not sure this is necessary. Can you give me a more specific example of how we'd be creating traffic that would violate a given site's robots.txt?
The PyPI service itself wouldn't generate the traffic, so I don't think there's a requirement to do this.

And on reflection, perhaps you're right - most of the traffic is (should be?) coming from human users, so perhaps this isn't needed. It felt to me that if we could determine that a provider has indicated that a particular URL shouldn't be fetched automatically, then we could filter unnecessary/unwanted requests to it on behalf of the provider. (It's definitely possible that I'm a bit confused here.)
Yep; in particular I think I'd conflated the ability of release authors to list fairly arbitrary URLs on their releases (not too dissimilar to handling untrusted user input) with the possibility of arbitrary downstream requests from distributed browsers. That shouldn't be possible, because the server filters to specific URL patterns and the browser then generates requests only within that relatively narrow channel (which GitHub would presumably complain about if and when it received unwanted levels of traffic). This can probably be closed, although I do wonder if it could be a useful additional safety check (especially in the context of multiple statistics providers).
(And to actually answer your question: I don't have any examples of this occurring.)
I think if we move forward on #12789 we can revisit, but my gut says that this probably won't be necessary. Closing for now!
Is that true? (In context here, the user doesn't necessarily have a say in whether a request is made by their user agent -- somewhat robotically -- when the relevant scripts are evaluated. Nor does the destination site have any of their own code, as far as I know, in the relevant scripting, so they can't do much to adjust traffic from the origin.)
What's the problem this feature will solve?
During detection of statistics provider URLs, it would probably make sense to apply a post-filter that removes URLs disallowed by the relevant `robots.txt` rules. This would give downstream statistics providers a mechanism to reduce their traffic load.

Describe the solution you'd like
Integrating with the `robotexclusionrulesparser` package could be one way to achieve this. Python's standard library also includes a `robots.txt` parser -- although in my experience Python's default user-agent is frequently blocked by webservers, making it difficult to retrieve the `robots.txt` file using that approach.

Additional context
Relates to #12789.
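For concreteness, here is a minimal sketch of the kind of post-filter described above, using only Python's standard library (`urllib.robotparser`). The `filter_disallowed` helper, the `USER_AGENT` string, and the per-URL fetch are illustrative assumptions rather than anything in Warehouse; `robots.txt` is requested with an explicit User-Agent header precisely because the default one is often blocked:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

# Assumed User-Agent for fetching robots.txt; Python's default urllib
# User-Agent is frequently blocked, so an explicit one is sent instead.
USER_AGENT = "pypi-statistics-check/0.1 (hypothetical)"


def _robots_for(url: str) -> RobotFileParser:
    """Fetch and parse robots.txt for the host that serves ``url``."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    parser = RobotFileParser(robots_url)
    try:
        request = Request(robots_url, headers={"User-Agent": USER_AGENT})
        with urlopen(request, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
        parser.parse(body.splitlines())
    except OSError:
        # If robots.txt cannot be retrieved, fall back to "no rules",
        # which can_fetch() treats as everything being allowed.
        parser.parse([])
    return parser


def filter_disallowed(candidate_urls: list[str], user_agent: str = USER_AGENT) -> list[str]:
    """Post-filter: keep only the candidate URLs that robots.txt allows."""
    # A real implementation would presumably cache one parser per host
    # rather than re-fetching robots.txt for every candidate URL.
    return [url for url in candidate_urls if _robots_for(url).can_fetch(user_agent, url)]
```

Swapping in the `robotexclusionrulesparser` package mentioned above would presumably only change how the rules are fetched and parsed; the post-filtering step itself would stay the same.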