
Fix solution in Link Checker in Concurrency Morning exercises #904


Merged (2 commits) on Jul 13, 2023

Conversation

pwnall
Member

@pwnall pwnall commented Jul 5, 2023

This change fixes the following issues with the current solution:

  1. It is not listed on the "solutions" page.
  2. It is not multi-threaded and does not use channels.

Fixes #816.

@pwnall
Member Author

pwnall commented Jul 5, 2023

@mgeisler This is a lot of code, and I'm pretty new at Rust. I won't be upset if you merge and change / rewrite, if that's easier than writing up a lot of feedback 😄

@pwnall pwnall force-pushed the link-check-threaded branch from 855c5f9 to ab46ed5 on July 5, 2023 01:12
@djmitche djmitche requested review from djmitche and qwandor July 5, 2023 13:32
Collaborator

@djmitche djmitche left a comment


This looks good! I added a few rustdoc comments. There are some interesting things to discuss here, such as using a Mutex to make a "mpmc" channel.

@djmitche
Collaborator

djmitche commented Jul 5, 2023

@qwandor, ptal

@pwnall
Member Author

pwnall commented Jul 5, 2023

@djmitche Thank you very much for the comment improvements!

In case it helps, I came up with the Arc+Mutex while trying to follow the exercise instructions:

> Use threads to check the links in parallel: send the URLs to be checked to a
> channel and let a few threads check the URLs in parallel.

In particular, I understood "send the URLs [...] to a channel" as asking me to figure out how to use a single mpsc::channel to coordinate between the "main" thread and the threads checking URLs.

I also considered the crossbeam crate, which offers "mpmc" channels. I thought that solution would be less compatible with the spirit of the exercise, because the instructions already include a list of crates for us to use. By contrast, the proposal above ties in a few concepts introduced in the morning session.

I wrote my thoughts above hoping they help us come up with the best possible prompt text for future students.

@djmitche
Collaborator

djmitche commented Jul 6, 2023

Hm, those instructions might be too limiting. I could see solving this with a condition variable or a semaphore, too.

In general, we've tried to make the exercises fairly open-ended. Of course, the solution will only pick one way to do it. The way you've chosen is a good one!

@mgeisler
Collaborator

mgeisler commented Jul 7, 2023

> @mgeisler This is a lot of code, and I'm pretty new at Rust. I won't be upset if you merge and change / rewrite, if that's easier than writing up a lot of feedback 😄

Thanks a lot for putting this up — people have been writing me asking for a solution 😬 In fact, we have #816 open for this 😄

Fun fact, to test this, I downloaded the code and tried to crawl https://google.github.io/comprehensive-rust/ instead — I was curious if we have broken links there! That soon resulted in

Got crawling error: bad http response: 429 Too Many Requests

errors and a bit later, I was blocked from posting this comment... luckily, the rate limit was lifted a few minutes later.

I noticed one thing:

Checking https://google.github.io/comprehensive-rust/print.html#assumptions
Checking https://google.github.io/comprehensive-rust/print.html#running-the-course
Checking https://google.github.io/comprehensive-rust/print.html#course-structure
Checking https://google.github.io/comprehensive-rust/print.html#deep-dives

It would be great to clean the URLs of any query strings and anchors (fragments) before inserting them into visited_pages.
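A minimal sketch of that cleanup (a hypothetical `clean_url` helper; in the actual solution, the `url` crate's `Url::set_query(None)` / `Url::set_fragment(None)` would likely be the more robust choice):

```rust
/// Strip the query string and fragment so that `page.html`,
/// `page.html?q=1`, and `page.html#anchor` all map to the same
/// key in the visited-pages set.
fn clean_url(url: &str) -> &str {
    match url.find(|c: char| c == '?' || c == '#') {
        Some(idx) => &url[..idx],
        None => url,
    }
}

fn main() {
    let cleaned =
        clean_url("https://google.github.io/comprehensive-rust/print.html#assumptions");
    println!("{cleaned}");
}
```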

@mgeisler
Collaborator

mgeisler commented Jul 7, 2023

Another thing I've noticed when doing this exercise in class: it's super hard and I wish I had provided much more scaffolding for people to use. Your PR is already a great improvement, but I feel we should

  1. Write comments above the functions we give to the students, and
  2. Consider giving the major building blocks as well — with comments explaining what we expect people to fill in.

I'm an old-school web developer, but I've had people in my classes who've never done any web development whatsoever. HTTP status codes are foreign to them, HTML is foreign to them, URL fragments are foreign to them. Adding more scaffolding would probably help a lot there and make the exercise useful to people from more backgrounds.

After this PR is merged, I would also like to include a diagram or two which shows the intended interaction between the crawler threads, the channels, and the main thread.

I've had people in my classes who took the old link extraction logic and ran it in a multi-threaded way, even though that is a purely local and super fast operation. So we would need more detail explaining that it is the I/O operations that are slow and that we seek to parallelize here.


@pwnall
Member Author

pwnall commented Jul 7, 2023

@mgeisler I fully agree with your thoughts above! Would you be OK if I copied your thoughts (and my reply) into a separate Issue?

I see this PR as making a first step in what is hopefully the right direction. This PR brings the solution in line with the requirements outlined in the problem, with minimal scaffolding changes. I think that, once we agree that there are no obvious simplifications to the solution, we can conclude that the problem is too difficult. Once we're there, we can think of possible fixes.

I would be most interested in prototyping a scaffolding change where we give students a spawn_crawler_threads that works with a single crawler thread, and ask them to figure out how to support multiple crawler threads. I worry that this may make things too simple, but I figure I can iterate once I see what that looks like.

WDYT?

@jmpfar

jmpfar commented Jul 8, 2023

My extremely unedited solution uses a single channel and no mutex. It's full of .unwrap() and far from my proudest moment, but it might be useful if you want to try adding that kind of solution:
https://gist.github.com/jmpfar/4f6f0c4258037c0e4dda981a6d5dc6dc

@mgeisler
Collaborator

> @mgeisler I fully agree with your thoughts above! Would you be OK if I copied your thoughts (and my reply) into a separate Issue?

Yes, definitely!

> I see this PR as making a first step in what is hopefully the right direction. This PR brings the solution in line with the requirements outlined in the problem, with minimal scaffolding changes. I think that, once we agree that there are no obvious simplifications to the solution, we can conclude that the problem is too difficult. Once we're there, we can think of possible fixes.

I completely agree, we should merge this now since it's a huge improvement on what we have.

> I would be most interested in prototyping a scaffolding change where we give students a spawn_crawler_threads that works with a single crawler thread, and ask them to figure out how to support multiple crawler threads. I worry that this may make things too simple, but I figure I can iterate once I see what that looks like.

My experience is that people find this to be a hard exercise — most find it too hard for the 30-40 minutes we normally have at this point in the course. I think less than 25% of the people in a class manage to write a working solution in that time.

@mgeisler
Collaborator

> My extremely unedited solution with a single channel and no mutex. It's extremely full of .unwrap() and far from my proudest moment, but might be useful if you want to try adding that kind of solution: https://gist.github.com/jmpfar/4f6f0c4258037c0e4dda981a6d5dc6dc

Hi @jmpfar, thanks for posting this! Your solution is not bad: it delegates the I/O-heavy operations to new threads and does the CPU-intensive work in the main thread. Many people don't get to that stage.

A small problem with your solution is that you have an unbounded number of threads: if a page has 100 outgoing links, you spawn 100 threads. To solve that, you more or less have to create a multi-producer-multi-consumer abstraction: you need one so that a bounded number of threads can read from a single channel.

@jmpfar

jmpfar commented Jul 10, 2023

@mgeisler ah, you are right 😬
I figured that having a bounded channel would limit the number of threads, but it still means that new threads can be created only to end up waiting on the queue later, after the request completes.
I thought about sending a Started message before fetching a link to force a wait, but this would not put a cap on the number of running threads, only on the number of concurrent requests.
Moving the Started message to before starting a thread, or using a semaphore, would block the main thread, so I cannot think of a better solution than @pwnall's.

Anyway, from my side I can attest that this was the hardest exercise in the course (minus Android and bare-metal, which I didn't do) and it took me significant time (but it was fun, all in all :)

@mgeisler
Collaborator

> Anyway from my side I can attest this was the hardest exercise in the course (minus android and baremetal which I didn't do) and it took me significant time (but was fun all in all :)

Thanks, that matches my impression too. I'm glad you liked it despite it being hard 😄 Perhaps it would be good to call it a "take-home exercise" instead, since it's really a small project and not a regular exercise with a clear answer.

Collaborator

@mgeisler mgeisler left a comment


Let's get this merged since it's an excellent start for further updates. @fw-immunant has put up some PRs to restructure the course so he might have further input down the line.

@mgeisler
Collaborator

Hi @pwnall, the build fails because a code block is not marked with compile_fail:

---- /tmp/mdbook-uKXC6H/exercises/concurrency/solutions-morning.md - Concurrency_Morning_Exercise::Link_Checker (line 108) stdout ----
error[E0433]: failed to resolve: use of undeclared crate or module `reqwest`
  --> /tmp/mdbook-uKXC6H/exercises/concurrency/solutions-morning.md:126:5

When we run mdbook test, all code blocks are tested — but they're tested without dependencies. So we need to mark them as compile_fail if they depend on anything outside the standard library. @mauvealerts is looking to see if we can use mdbook-keeper (#175) to get more test coverage here.
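For reference, the annotation goes on the code fence's info string. A sketch (the exact snippet in the solution may differ):

````markdown
```rust,compile_fail
// Depends on the `reqwest` crate, which is not available when
// `mdbook test` compiles the book's code samples.
use reqwest::blocking::Client;
```
````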

@pwnall
Member Author

pwnall commented Jul 12, 2023

Thank you for decoding the error for me!

Let me figure out how to modify this chain to make the build green. I haven't updated a branch with someone else's commits before 😄

pwnall and others added 2 commits July 12, 2023 11:42
This change fixes the following issues with the current solution:

1. It is not listed on the "solutions" page.
2. It is not multi-threaded and does not use channels.
@pwnall pwnall force-pushed the link-check-threaded branch from 674e813 to 104ab9c on July 12, 2023 18:49
@pwnall
Member Author

pwnall commented Jul 12, 2023

@mgeisler Thank you again for the tips! The build is green now :)

@mgeisler
Collaborator

> @mgeisler Thank you again for the tips! The build is green now :)

Yeah, looks good! Just for your information, we squash-merge all PRs in this repository. This makes it a bit easier for people since they can just add more and more commits to fix review comments. The commits all disappear from the main history when the branch is squash-merged into main. (But force-pushing like you did is also completely fine: that's the workflow you would use if you do a normal merge and want to preserve a nice history.)

@mgeisler
Collaborator

> Thank you for decoding the error for me!

A small note: it feels terrible to have large code snippets that say compile_fail — how can we know that the code still works? In this case we're saved by the fact that the code is included from an external link-checker.rs file, which in turn is referenced by our Cargo.toml file. This means that we do compile the link checker as part of our CI, but we do it outside of mdbook test.
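The inclusion mechanism described here is mdbook's `{{#include}}` directive. Roughly (a sketch, assuming the file sits next to the markdown page):

````markdown
```rust,compile_fail
{{#include link-checker.rs}}
```
````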

@mgeisler mgeisler merged commit ef99d15 into google:main Jul 13, 2023
yohcop pushed a commit to yohcop/comprehensive-rust that referenced this pull request Sep 12, 2023
Successfully merging this pull request may close these issues:

Concurrency morning solution does not contain a solution for Link Checker