When retrying/resuming an interrupted download, wget2 should not send an If-Modified-Since header #269
Comments
I have access to the logs of this server, so I did some checking there. The server is running Apache on Ubuntu. There are no errors in the server logs, and the missed/skipped files don't show up in the server logs at all. wget2 does hit the server pretty hard: it's a dual-core VPS, and with wget2 downloading I see both cores at about 60%, with one of the Apache processes registering about 120% CPU usage. I ran one additional wget2 download, which missed 92 files. I selected 10 of them to further investigate via the server logs
Of the 10 skipped files I selected for further investigation, for 9 of them wget2 did hit the server and get a 307 redirect, as expected. However, I can't tell for sure whether it followed the redirect, because I have many different URLs forwarding to the same destination and can't really distinguish them in the log. Whether or not it followed the redirect, it didn't actually end up writing the file to disk as it should have. The odd one out of the 10 should have been a 200 OK, but it didn't show up in the server log at all on this run. |
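For reference, a quick way to check whether a given skipped file was ever requested at all is to grep the access log for its request path. The log location below assumes a stock Apache setup on Ubuntu, and REQUEST_PATH is a placeholder for one of the skipped files:

# REQUEST_PATH is a placeholder; /var/log/apache2/access.log assumes the default Ubuntu vhost log
grep 'REQUEST_PATH' /var/log/apache2/access.log | awk '{print $4, $6, $7, $9}'   # timestamp, method, path, status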
Looking into this deeper, I was able to account for 40 of the missing/skipped files. I observed that those 40 were consistently missing from all 12 wget2 tests, and I can see what was going on with them specifically; it seems like there was a change to the … So subtracting those 40, the actual number of skipped/missing files on the wget2 runs ranges from 1-63 |
Updated stats after removing those 40
Of what remains, it looks like basically everything should have been a 200 OK; however, on the runs where wget2 failed to download them, I don't see them in the Apache logs at all. |
First of all, thank you so much for the detailed work on this! The results are interesting and very useful for further investigation. Regarding WolfSSL: the implementation is mostly unused/experimental. Let's look at it again when we have stable behavior with GnuTLS. |
Okay, I ran 3 more tests with default settings (HTTP2, 5 threads), then 3 with HTTP2 and 1 thread, then 3 with HTTP1 and 5 threads, all with GnuTLS. This time I'll only be listing the number of files actually downloaded, not the number of files reported downloaded (which is usually less)
With HTTP2 there seems to be little difference between 1 thread and 5 threads; download speeds are the same, although 5 threads hits the server CPU harder. Not sure what the point is of using multiple threads with HTTP2. The HTTP1 tests all had only a single file missing, and that same file turned out to be missing from all the other runs as well (except yesterday's wget1 tests). It's another redirect situation, but different from the ones I mentioned yesterday, because the redirect target is within the target directory: https://skyqueen.cc/archive/71master/cracky/kareha.pl/1351024951/ is an HTTP 307 redirect to https://skyqueen.cc/archive/71master/cracky/kareha.pl/missing-thread.txt (which is also linked directly from https://skyqueen.cc/archive/71master/cracky/kareha.pl/ ). Could it be that wget2 isn't following the 307 redirect because it already has the redirect target on its download list? wget1 will happily download the same file many times if multiple URLs redirect to it, but maybe this is an intentional change? It seems like using HTTP1 does get rid of the random file skips. Investigating the random file skips further, I tried doing one more HTTP2 1-thread test with debugging turned on. However, it's going extremely slowly; it's been 10 minutes and it's still processing the first page. I'll post back later when/if it finishes |
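For reference, the redirect described above can be verified independently of wget2 with a plain HEAD request (curl is used here only as a cross-check; it is not part of the original test setup):

curl -sI https://skyqueen.cc/archive/71master/cracky/kareha.pl/1351024951/
# Expect a 307 status line and a Location: header pointing at missing-thread.txt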
I don't think the debug download is ever going to finish; it seems to be processing links extremely slowly, like 1 per second, and it hasn't gotten halfway through the first page yet
I'll leave it running though |
For some reason, debugging and logging to a file with … caused the extreme slowdown described above. However, doing it like this worked, with only a slight slowdown:
Completed in 4m1s, reported 21373 files downloaded, actually 21374 files downloaded (reported+1), 45 files missing/skipped. log.txt was about 25MB and log-err.txt was about 470MB. The 45 missing files:
There is no trace of the missing files in the Apache log (except the 1351024951 one, which I mentioned in a prior post). This seems relevant:
So it apparently had some kind of issue downloading that one, which explains 1191886349/index.html being missing, as well as others linked from it such as 1191886349/39, but it shows up as a 200 OK in the Apache log. That doesn't account for all of them, though. It downloaded 1249158438/index.html successfully*, but of the pages linked from it, such as 1249158438/449, some are missing and I'm finding no trace of the missing ones in the logs. *Actually, I looked at the copy of 1249158438/index.html that it downloaded, and it appears to have been truncated during download. Why are the files being randomly truncated, though? Why only with HTTP2 and not HTTP1? Why is wget2 not indicating that the file was truncated? Taking another look at the Apache log to try to figure out what happened with the truncated file, I actually see wget request the file twice: the first request got a 200 OK, and the second got a 304 Not Modified
Why did wget request the file twice? I looked at other, non-truncated files, and I only see them once in the Apache log. In fact, that was the only 304 in the Apache log for the entire run. In the wget log, I see the 304 but not the 200?! So it looks like the skipped files from this run can be divided into three root causes:
The other missing files were consequences of 2 & 3 |
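The 200-followed-by-304 pattern can be reproduced by hand with a conditional request. This is only an illustration of the mechanism, assuming the server honors If-Modified-Since for these files; the directory URL from above is used as the example target:

# Unconditional request: the server answers 200 OK (what the first wget2 request got)
curl -sI https://skyqueen.cc/archive/71master/cracky/kareha.pl/
# Conditional request with a date newer than the file's Last-Modified (but not in the future):
# the server answers 304 Not Modified with no body, which is why a retry that sends
# If-Modified-Since leaves an already-truncated local copy untouched
curl -sI -H 'If-Modified-Since: Mon, 01 Jan 2024 00:00:00 GMT' https://skyqueen.cc/archive/71master/cracky/kareha.pl/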
Doing some additional runs, deleting the files in between each one (so that there will be no legitimate 304s), I'm seeing response 0 and response 304 for different files each time. Any time there's a 304 in the wget log (which shows in the Apache log as a 200 followed by a 304), I look at the downloaded file and confirm that it's truncated
|
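One way to confirm truncation mechanically instead of eyeballing each file is to compare the on-disk size against the server's Content-Length. This is a rough sketch: FILE and URL are placeholders for a suspect local copy and its source URL, and it assumes the server sends a Content-Length for these files:

# FILE and URL are placeholders
remote_size=$(curl -sI "$URL" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
local_size=$(stat -c %s "$FILE")
[ "$remote_size" = "$local_size" ] || echo "$FILE looks truncated ($local_size of $remote_size bytes)"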
Awesome report ! |
Correction to some earlier stuff -- it turns out my wget2_openssl and wget2_wolfssl builds were actually using GnuTLS as well... even though they reported OpenSSL or WolfSSL. I tried it again a different way and verified that they all worked as intended this time:
I know you said to focus on GnuTLS, but to cover the bases I ran some quick re-checks with my fixed OpenSSL build. I do still see the random failures sometimes with OpenSSL, but they seem less likely to happen than with GnuTLS. Can't say for certain though. With OpenSSL, I had 3 "clean" runs out of 5 (the only skipped file was the 307 redirect thing) and 2 dirty runs (9 truncated files, 1 missing file), whereas with GnuTLS, "clean" runs are rare, like 1 out of 5. |
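A quick sanity check on which TLS backend a given wget2 binary really uses is to look at what it links against (the library names in the grep are an assumption about typical sonames, not something from the original report):

ldd "$(command -v wget2)" | grep -Ei 'gnutls|libssl|wolfssl'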
I have a hypothesis about what might be going on with the truncated files |
Sounds reasonable. I have to think about implementing a test for this. |
Looking further into the server side of this, I found something I definitely should have noticed sooner: it looks like hammering the server with HTTP2 is resulting in Apache sub-process crashes. In fact, each crash corresponds with a "Failed to read 102400 bytes" message, and probably with a truncated file as well. So I'm starting to get a clearer picture of what's going on. Example:
As far as I can tell these Apache crashes are only happening when a wget2 HTTP2 download is running; I'm not seeing them with normal HTTP2 browser traffic, although I'll keep checking the logs. I'll look further into the server crashes; I also have a higher-end server with newer software that I'd like to test on as well. I'm doing most of my HTTP2 testing with … For turning down the intensity further to try to alleviate the server crashes, I guess I need to start looking at … Based on initial testing, I'm still seeing server crashes even with … Are there any other options for running in a more server-friendly manner? I guess … |
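For cross-checking this on the server, Apache normally records a crashed child in its error log; the path and wording below assume a default Ubuntu/Apache layout:

grep -Ei 'segfault|exit signal|child pid' /var/log/apache2/error.log | tail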
Request window set to 1 should do sequential requests, one at a time. You can see that in the debug logs. Sorry, I am afk, can't test myself until the weekend. |
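If the option being referred to is --http2-request-window (the exact spelling is my assumption; check wget2 --help), a gentler invocation might look something like this:

wget2 --max-threads=1 --http2-request-window=1 -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/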
Changed the title. I obviously have some stuff going on server-side that I need to investigate, but in terms of wget2 behavior specifically, I think the main issue is that when attempting to resume/retry after an interrupted download, wget2 sends an If-Modified-Since header, which will generally just result in the server sending back an HTTP 304 Not Modified. If the original attempt was interrupted and you think you have a truncated file, it's better not to send an If-Modified-Since header. |
I agree that sending an If-Modified-Since header in this case doesn't make sense. Instead, wget2 should try a Range request, and if the server doesn't support it, the full file needs to be re-downloaded. |
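For illustration, here is roughly what the proposed behavior looks like at the HTTP level: resume from the current partial size with a Range request, and fall back to a full re-download if the server answers 200 instead of 206. FILE and URL are placeholders, and this is only a sketch of the idea, not wget2's actual code path:

# FILE is the truncated local copy, URL its source (placeholders)
offset=$(stat -c %s "$FILE")
status=$(curl -s -o tail.part -w '%{http_code}' -H "Range: bytes=${offset}-" "$URL")
if [ "$status" = "206" ]; then
    cat tail.part >> "$FILE"      # server honored the range: append only the missing tail
else
    curl -s -o "$FILE" "$URL"     # no range support (full 200 reply): re-download from scratch
fi

curl's own -C - option implements essentially the same resume-or-restart logic in a single step.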
I created 12 empty subdirectories for testing (3 for wget1 and 9 for wget2)
I ran a test download 3 times with wget1 and got identical results each time (21459 files). Command used:
time wget -v -m -np -o log.txt https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Then I ran the same test download with wget2 9 times (3 each with the 3 TLS backends). Commands used:
For the wget2 tests, the number of files actually downloaded ranged from 21356 to 21418, meaning each run was missing between 41 and 103 files compared to the wget1 tests
Additionally, the number of files reported downloaded was usually less than the number of files actually downloaded, by anywhere from 1 to 10 files
After all 12 wget2 tests, I re-ran the download in the same directory to see if any additional files were downloaded; in 3 tests, 1 additional file was downloaded, while in 9 tests, no additional files were downloaded
Test results:
I then generated a listing.txt for all 12 directories and ran a line count to verify the numbers:
Sorted by number of files:
Duplication checking revealed that the three wget1 listings were identical, as expected
wget2-gnutls-run3 and wget2-openssl-run2 were also identical
wget2-gnutls-run2 and wget2-gnutls-run1 downloaded the same number of files but the file list is not identical
wget2-openssl-run3 and wget2-openssl-run1 downloaded the same number of files but the file list is not identical
Comparison between ./wget1-run3/listing.txt and ./wget2-openssl-run2/listing.txt (best wget2 run) confirming 41 files missing:
Comparison between ./wget1-run3/listing.txt and ./wget2-openssl-run3/listing.txt (tied for worst wget2 run) confirming 103 files missing:
Comparison between the two worst wget2 runs, ./wget2-openssl-run3/listing.txt and ./wget2-openssl-run1/listing.txt, which downloaded the same number of files but not identical file list:
In other words, 62 files were downloaded in wget2-openssl-run3 but not in wget2-openssl-run1, and a different 62 files were downloaded in wget2-openssl-run1 but not in wget2-openssl-run3.
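For reference, a comparison like this can be reduced to the two "only in one run" sets with comm over the sorted listings (the listing.txt paths follow the directory names used above):

comm -23 <(sort wget2-openssl-run3/listing.txt) <(sort wget2-openssl-run1/listing.txt)   # files only in run3
comm -13 <(sort wget2-openssl-run3/listing.txt) <(sort wget2-openssl-run1/listing.txt)   # files only in run1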