
Fix solution in Link Checker in Concurrency Morning exercises #904


Merged (2 commits) on Jul 13, 2023

Conversation

pwnall
Member

@pwnall pwnall commented Jul 5, 2023

This change fixes the following issues with the current solution:

  1. It is not listed on the "solutions" page.
  2. It is not multi-threaded and does not use channels.

Fixes #816.

@pwnall
Member Author

pwnall commented Jul 5, 2023

@mgeisler This is a lot of code, and I'm pretty new at Rust. I won't be upset if you merge and change / rewrite, if that's easier than writing up a lot of feedback 😄

@pwnall pwnall force-pushed the link-check-threaded branch from 855c5f9 to ab46ed5 on July 5, 2023 01:12
@djmitche djmitche requested review from djmitche and qwandor July 5, 2023 13:32
Collaborator

@djmitche djmitche left a comment


This looks good! I added a few rustdoc comments. There are some interesting things to discuss here, such as using a Mutex to make a "mpmc" channel.

@djmitche
Collaborator

djmitche commented Jul 5, 2023

@qwandor, ptal

@pwnall
Member Author

pwnall commented Jul 5, 2023

@djmitche Thank you very much for the comment improvements!

In case it helps, I came up with the Arc+Mutex while trying to follow the exercise instructions:

> Use threads to check the links in parallel: send the URLs to be checked to a
> channel and let a few threads check the URLs in parallel.

In particular, I understood "send the URLs [...] to a channel" as asking me to figure out how to use a single mpsc::channel to coordinate between the "main" thread and the threads checking URLs.

I also considered the crossbeam crate, which offers "mpmc" channels. I thought that solution would be less compatible with the spirit of the exercise, because the instructions already include a list of crates for us to use. By contrast, the proposal above ties in a few concepts introduced in the morning session.

I wrote my thoughts above hoping they help us come up with the best possible prompt text for future students.

@djmitche
Collaborator

djmitche commented Jul 6, 2023

Hm, those instructions might be too limiting. I could see solving this with a condition variable or a semaphore, too.

In general, we've tried to make the exercises fairly open-ended. Of course, the solution will only pick one way to do it. The way you've chosen is a good one!

@mgeisler
Collaborator

mgeisler commented Jul 7, 2023

> @mgeisler This is a lot of code, and I'm pretty new at Rust. I won't be upset if you merge and change / rewrite, if that's easier than writing up a lot of feedback 😄

Thanks a lot for putting this up — people have been writing me asking for a solution 😬 In fact, we have #816 open for this 😄

Fun fact, to test this, I downloaded the code and tried to crawl https://google.github.io/comprehensive-rust/ instead — I was curious if we have broken links there! That soon resulted in

Got crawling error: bad http response: 429 Too Many Requests

errors and a bit later, I was blocked from posting this comment... luckily, the rate limit was lifted a few minutes later.

I noticed one thing:

Checking https://google.github.io/comprehensive-rust/print.html#assumptions
Checking https://google.github.io/comprehensive-rust/print.html#running-the-course
Checking https://google.github.io/comprehensive-rust/print.html#course-structure
Checking https://google.github.io/comprehensive-rust/print.html#deep-dives

It would be great to clean the URLs of any query strings and anchors (fragments) before inserting them into visited_pages.
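A minimal sketch of that cleanup (a hypothetical `clean_url` helper; in the actual solution, the `url` crate's `Url::set_query(None)` / `Url::set_fragment(None)` would likely be the more robust choice):

```rust
/// Strip the query string and fragment so that `page.html`,
/// `page.html?q=1`, and `page.html#anchor` all map to the same
/// key in the visited-pages set.
fn clean_url(url: &str) -> &str {
    match url.find(|c: char| c == '?' || c == '#') {
        Some(idx) => &url[..idx],
        None => url,
    }
}

fn main() {
    let cleaned =
        clean_url("https://google.github.io/comprehensive-rust/print.html#assumptions");
    println!("{cleaned}");
}
```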

@mgeisler
Collaborator

mgeisler commented Jul 7, 2023

Another thing I've noticed when doing this exercise in class: it's super hard and I wish I had provided much more scaffolding for people to use. Your PR is already a great improvement, but I feel we should

  1. Write comments above the functions we give to the students, and
  2. Consider giving the major building blocks as well — with comments explaining what we expect people to fill in.

I'm an old-school web developer, but I've had people in my classes who've never done any web development whatsoever. HTTP status codes are foreign to them, HTML is foreign to them, URL fragments are foreign to them. Adding more scaffolding would probably help a lot there and make the exercise useful to people from more backgrounds.

After this PR is merged, I would also like to include a diagram or two which shows the intended interaction between the crawler threads, the channels, and the main thread.

I've had people in my classes who took the old link extraction logic and ran it in a multi-threaded way, even though that is a purely local and super fast operation. So we would need more detail explaining that it is the I/O operations that are slow and that we seek to parallelize here.


@pwnall
Member Author

pwnall commented Jul 7, 2023

@mgeisler I fully agree with your thoughts above! Would you be OK if I copied your thoughts (and my reply) into a separate Issue?

I see this PR as making a first step in what is hopefully the right direction. This PR brings the solution in line with the requirements outlined in the problem, with minimal scaffolding changes. I think that, once we agree that there are no obvious simplifications to the solution, we can conclude that the problem is too difficult. Once we're there, we can think of possible fixes.

I would be most interested in prototyping a scaffolding change where we give students a spawn_crawler_threads that works with a single crawler thread, and ask them to figure out how to support multiple crawler threads. I worry that this may make things too simple, but I figure I can iterate once I see what that looks like.

WDYT?

@jmpfar

jmpfar commented Jul 8, 2023

My extremely unedited solution uses a single channel and no mutex. It's full of .unwrap() and far from my proudest moment, but it might be useful if you want to try adding that kind of solution:
https://gist.github.com/jmpfar/4f6f0c4258037c0e4dda981a6d5dc6dc

@mgeisler
Collaborator

> @mgeisler I fully agree with your thoughts above! Would you be OK if I copied your thoughts (and my reply) into a separate Issue?

Yes, definitely!

> I see this PR as making a first step in what is hopefully the right direction. This PR brings the solution in line with the requirements outlined in the problem, with minimal scaffolding changes. I think that, once we agree that there are no obvious simplifications to the solution, we can conclude that the problem is too difficult. Once we're there, we can think of possible fixes.

I completely agree, we should merge this now since it's a huge improvement on what we have.

> I would be most interested in prototyping a scaffolding change where we give students a spawn_crawler_threads that works with a single crawler thread, and ask them to figure out how to support multiple crawler threads. I worry that this may make things too simple, but I figure I can iterate once I see what that looks like.

My experience is that people find this to be a hard exercise — most find it too hard for the 30-40 minutes we normally have at this point in the course. I think less than 25% of the people in a class manage to write a working solution in that time.

@mgeisler
Collaborator

> My extremely unedited solution with a single channel and no mutex. It's extremely full of .unwrap() and far from my proudest moment, but might be useful if you want to try adding that kind of solution: https://gist.github.com/jmpfar/4f6f0c4258037c0e4dda981a6d5dc6dc

Hi @jmpfar, thanks for posting this! Your solution is not bad: it delegates the I/O-heavy operations to new threads and does the CPU-intensive work in the main thread. Many people don't get to that stage.

A small problem with your solution is that you have an unbounded number of threads: if a page has 100 outgoing links, you spawn 100 threads. To solve that, you more or less have to create a multi-producer-multi-consumer abstraction: you need one so that a bounded number of threads can read from a single channel.

@jmpfar

jmpfar commented Jul 10, 2023

@mgeisler ah, you are right 😬
I figured that having a bounded channel would limit the number of threads, but it still means that new threads can be created only to end up waiting on the queue later, after the request completes.
I thought about sending a Started message before fetching a link to force a wait, but this would not put a cap on the number of running threads, only on the number of concurrent requests.
Moving the Started message to before starting a thread, or using a semaphore, would block the main thread, so I cannot think of a better solution than @pwnall's.

Anyway, from my side I can attest that this was the hardest exercise in the course (minus Android and bare-metal, which I didn't do) and it took me significant time (but it was fun, all in all :)

@mgeisler
Collaborator

> Anyway from my side I can attest this was the hardest exercise in the course (minus android and baremetal which I didn't do) and it took me significant time (but was fun all in all :)

Thanks, that matches my impression too. I'm glad you liked it despite it being hard 😄 Perhaps it would be good to call it a "take-home exercise" instead, since it's really a small project and not a regular exercise with a clear answer.

Collaborator

@mgeisler mgeisler left a comment


Let's get this merged since it's an excellent start for further updates. @fw-immunant has put up some PRs to restructure the course so he might have further input down the line.

@mgeisler
Collaborator

Hi @pwnall, the build fails because a code block is not marked with compile_fail:

---- /tmp/mdbook-uKXC6H/exercises/concurrency/solutions-morning.md - Concurrency_Morning_Exercise::Link_Checker (line 108) stdout ----
error[E0433]: failed to resolve: use of undeclared crate or module `reqwest`
  --> /tmp/mdbook-uKXC6H/exercises/concurrency/solutions-morning.md:126:5

When we run mdbook test, all code blocks are tested — but they're tested without dependencies. So we need to mark them as compile_fail if they depend on anything outside the standard library. @mauvealerts is looking to see if we can use mdbook-keeper (#175) to get more test coverage here.
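For reference, the annotation goes on the code fence's info string. A sketch (the exact snippet in the solution may differ):

````markdown
```rust,compile_fail
// Depends on the `reqwest` crate, which is not available when
// `mdbook test` compiles the book's code samples.
use reqwest::blocking::Client;
```
````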

@pwnall
Member Author

pwnall commented Jul 12, 2023

Thank you for decoding the error for me!

Let me figure out how to modify this chain to make the build green. I haven't updated a branch with someone else's commits before 😄

pwnall and others added 2 commits July 12, 2023 11:42
This change fixes the following issues with the current solution:

1. It is not listed on the "solutions" page.
2. It is not multi-threaded and does not use channels.
@pwnall pwnall force-pushed the link-check-threaded branch from 674e813 to 104ab9c on July 12, 2023 18:49
@pwnall
Member Author

pwnall commented Jul 12, 2023

@mgeisler Thank you again for the tips! The build is green now :)

@mgeisler
Collaborator

> @mgeisler Thank you again for the tips! The build is green now :)

Yeah, looks good! Just for your information, we squash-merge all PRs in this repository. This makes it a bit easier for people since they can just add more and more commits to fix review comments. The commits all disappear from the main history when the branch is squash-merged into main. (But force-pushing like you did is also completely fine: that's the workflow you would use if you do a normal merge and want to preserve a nice history.)

@mgeisler
Collaborator

> Thank you for decoding the error for me!

A small note: it feels terrible to have large code snippets that say compile_fail — how can we know that the code still works? In this case we're saved by the fact that the code is included from an external link-checker.rs file, which in turn is referenced by our Cargo.toml file. This means that we do compile the link checker as part of our CI, but we do it outside of mdbook test.
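The inclusion mechanism described here is mdbook's `{{#include}}` directive. Roughly (a sketch, assuming the file sits next to the markdown page):

````markdown
```rust,compile_fail
{{#include link-checker.rs}}
```
````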

@mgeisler mgeisler merged commit ef99d15 into google:main Jul 13, 2023
yohcop pushed a commit to yohcop/comprehensive-rust that referenced this pull request Sep 12, 2023
Successfully merging this pull request may close these issues:

Concurrency morning solution does not contain a solution for Link Checker