
Why do allowed_by_robots and one_agent_allowed_by_robots parse robots.txt for each request? #3


Open
let4be opened this issue May 31, 2021 · 2 comments

Comments


let4be commented May 31, 2021

This API and its example are really confusing...
Why can't we simply parse the robots.txt once and then call methods to check whether a URL is allowed?...
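To make the concern concrete, here is a minimal sketch of how the crate is used today, based on the example in its README (the robots body and URLs below are made up). Each call to `one_agent_allowed_by_robots` takes the raw robots.txt text, so the file is parsed again for every URL that gets checked. I'm not sure whether a single matcher instance is meant to be reused across calls, so this sketch conservatively creates a fresh one per check:

```rust
use robotstxt::DefaultMatcher;

fn main() {
    let robots_body = "user-agent: FooBot\ndisallow: /private/\n";

    // The raw robots.txt text is handed to the matcher on every call,
    // so it is re-parsed for each URL being checked.
    let mut matcher = DefaultMatcher::default();
    assert!(matcher.one_agent_allowed_by_robots(
        robots_body,
        "FooBot",
        "https://example.com/public/page"
    ));

    let mut matcher = DefaultMatcher::default();
    assert!(!matcher.one_agent_allowed_by_robots(
        robots_body,
        "FooBot",
        "https://example.com/private/page"
    ));
}
```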

Folyd (Owner) commented Jun 3, 2021

Good point. This crate is simply a port of Google's original library to Rust, so we kept its logic unchanged: parse -> emit for each request. Indeed, this could be optimized to a single parse for multiple requests. (p.s. I don't know why Google never did this.) Of course, contributions are always welcome.
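To illustrate the "one parse for multiple requests" idea, here is a purely hypothetical sketch; none of these types exist in the crate, and the parsing is deliberately simplified (no wildcards, no `$` anchors, no merging of multiple groups). The point is only the shape: the rules for a user agent are extracted once, and any number of path checks then run against the cached rules rather than the raw text.

```rust
/// Hypothetical cached form of a robots.txt file: the rules for one
/// user-agent group are extracted once and reused for every URL check.
struct ParsedRobots {
    /// (is_allow, path_prefix) pairs, in file order.
    rules: Vec<(bool, String)>,
}

impl ParsedRobots {
    /// Very simplified parse: collects Allow/Disallow lines from the
    /// group matching `user_agent` (or `*`) -- just enough to show the idea.
    fn parse(robots_body: &str, user_agent: &str) -> Self {
        let mut rules = Vec::new();
        let mut group_applies = false;
        for line in robots_body.lines() {
            // Strip comments and whitespace, then split into "key: value".
            let line = line.split('#').next().unwrap_or("").trim();
            let Some((key, value)) = line.split_once(':') else { continue };
            let (key, value) = (key.trim().to_ascii_lowercase(), value.trim());
            match key.as_str() {
                "user-agent" => {
                    group_applies =
                        value == "*" || value.eq_ignore_ascii_case(user_agent);
                }
                "allow" if group_applies => rules.push((true, value.to_string())),
                "disallow" if group_applies => rules.push((false, value.to_string())),
                _ => {}
            }
        }
        ParsedRobots { rules }
    }

    /// Longest matching rule wins; allowed by default.
    fn is_allowed(&self, path: &str) -> bool {
        let mut verdict = true;
        let mut best_len = 0;
        for (allow, prefix) in &self.rules {
            if !prefix.is_empty() && path.starts_with(prefix) && prefix.len() > best_len {
                best_len = prefix.len();
                verdict = *allow;
            }
        }
        verdict
    }
}

fn main() {
    let robots_body = "user-agent: *\nallow: /private/help\ndisallow: /private/\n";
    // Parse once...
    let robots = ParsedRobots::parse(robots_body, "FooBot");
    // ...then check as many URLs as needed without touching the text again.
    assert!(robots.is_allowed("/"));
    assert!(!robots.is_allowed("/private/data"));
    assert!(robots.is_allowed("/private/help"));
}
```

A real change would presumably keep the crate's existing parser and matching strategy and only cache their output; the sketch just shows that it is the parsed rules, not the raw text, that should be reused per URL.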

let4be (Author) commented Jun 3, 2021

I was looking for a decent robots.txt library written in Rust to integrate into my Broad Web Crawler (an open-source toy project), and so far this one seems like the best bet because of the Google lineage and the tests...

But I don't like the "parse for each request" approach; it seems unnecessary and harmful for performance.
From a quick look at the source code, I think the change would be somewhere around here:

impl<S: RobotsMatchStrategy> RobotsParseHandler for RobotsMatcher<'_, S> {

If I get some time in the next couple of weeks I might go for it :)
