when retrying/resuming an interrupted download, wget2 should not send a If-Modified-Since header #269

Open
catharsis71 opened this issue Oct 9, 2022 · 16 comments

Comments

@catharsis71

I created 12 empty subdirectories for testing (3 for wget1 and 9 for wget2)

I ran a test download 3 times with wget1 and got identical results each time (21459 files). Command used:

time wget -v -m -np -o log.txt https://skyqueen.cc/archive/71master/cracky/kareha.pl/

Then I ran the same test download with wget2 9 times (3 each using the 3 TLS suites). Commands used:

time wget2_openssl -v -m -np -o log.txt https://skyqueen.cc/archive/71master/cracky/kareha.pl/
time wget2_gnutls -v -m -np -o log.txt https://skyqueen.cc/archive/71master/cracky/kareha.pl/
time wget2_wolfssl -v -m -np -o log.txt https://skyqueen.cc/archive/71master/cracky/kareha.pl/

For the wget2 tests, the number of files actually downloaded ranged from 21356 to 21418, meaning between 41 and 103 files were always missing compared to the wget1 tests

Additionally, the number of files reported as downloaded was usually lower than the number of files actually downloaded, by 1 to 10 files

After each of the 9 wget2 tests, I re-ran the download in the same directory to see if any additional files were downloaded; in 3 tests, 1 additional file was downloaded, while in 6 tests, no additional files were downloaded

Test results:

wget1-run1: 59m44.541s, reported 21459 files downloaded, actual 21459 (match)
wget1-run2: 58m36.737s, reported 21459 files downloaded, actual 21459 (match)
wget1-run3: 58m33.169s, reported 21459 files downloaded, actual 21459 (match)
wget2-gnutls-run1: 3m14.264s, reported 21390 files downloaded, actual 21392 (reported+2), 67 files missing, no change after redownload
wget2-gnutls-run2: 3m13.831s, reported 21381 files downloaded, actual 21391 (reported+10), 68 files missing, after redownload 21392 (previous+1), still 67 files missing
wget2-gnutls-run3: 3m11.720s, reported 21418 files downloaded, actual 21418 (match), 41 files missing, no change after redownload
wget2-openssl-run1: 3m43.024s, reported 21348 files downloaded, actual 21356 (reported+8), 103 files missing, no change after redownload
wget2-openssl-run2: 3m7.452s, reported 21414 files downloaded, actual 21417 (reported+3), 42 files missing, after redownload 21418 (previous+1), still 41 files missing
wget2-openssl-run3: 3m4.584s, reported 21349 files downloaded, actual 21356 (reported+7), 103 files missing, no change after redownload
wget2-wolftls-run1: 2m58.913s, reported 21381 files downloaded, actual 21393 (reported+2), 66 files missing, no change after redownload
wget2-wolftls-run2: 3m14.559s, reported 21401 files downloaded, actual 21405 (reported+4), 54 files missing, after redownload 21406 (previous+1), still 53 files missing
wget2-wolftls-run3: 3m28.003s, reported 21414 files downloaded, actual 21415 (reported+1), 44 files missing, no change after redownload

I then generated a listing.txt for all 12 directories and ran a line count to verify the numbers:

   21459 ./wget1-run1/listing.txt
   21459 ./wget1-run2/listing.txt
   21459 ./wget1-run3/listing.txt
   21392 ./wget2-gnutls-run1/listing.txt
   21392 ./wget2-gnutls-run2/listing.txt
   21418 ./wget2-gnutls-run3/listing.txt
   21356 ./wget2-openssl-run1/listing.txt
   21418 ./wget2-openssl-run2/listing.txt
   21356 ./wget2-openssl-run3/listing.txt
   21393 ./wget2-wolftls-run1/listing.txt
   21406 ./wget2-wolftls-run2/listing.txt
   21415 ./wget2-wolftls-run3/listing.txt
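
The command used to generate each listing.txt isn't shown here. For anyone reproducing the count, a minimal sketch could look like the following. The helper names make_listing and count_listings are hypothetical, and the sketch assumes log.txt and listing.txt sit at the top of each run directory and should not be counted as downloads.

```shell
# Hypothetical helper: write a sorted list of downloaded files for one run.
# Assumption: log.txt and listing.txt live at the top of the run directory
# and are excluded from the listing.
make_listing() {
    dir=$1
    (
        cd "$dir" || exit 1
        # LC_ALL=C gives a byte-wise sort order that comm(1) can rely on later.
        find . -type f ! -path ./log.txt ! -path ./listing.txt -print |
            LC_ALL=C sort > listing.txt
    )
}

# Print "<count> <dir>/listing.txt" for every run directory given.
count_listings() {
    for dir in "$@"; do
        n=$(wc -l < "$dir/listing.txt")
        printf '%8d %s/listing.txt\n' "$((n))" "$dir"
    done
}
```

Possible usage: `for d in ./wget1-run* ./wget2-*; do make_listing "$d"; done`, followed by `count_listings ./wget1-run* ./wget2-*`.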

Sorted by number of files:

   21459 ./wget1-run3/listing.txt
   21459 ./wget1-run2/listing.txt
   21459 ./wget1-run1/listing.txt
   21418 ./wget2-openssl-run2/listing.txt
   21418 ./wget2-gnutls-run3/listing.txt
   21415 ./wget2-wolftls-run3/listing.txt
   21406 ./wget2-wolftls-run2/listing.txt
   21393 ./wget2-wolftls-run1/listing.txt
   21392 ./wget2-gnutls-run2/listing.txt
   21392 ./wget2-gnutls-run1/listing.txt
   21356 ./wget2-openssl-run3/listing.txt
   21356 ./wget2-openssl-run1/listing.txt

Duplication checking revealed that the three wget1 listings were identical, as expected

wget2-gnutls-run3 and wget2-openssl-run2 were also identical

wget2-gnutls-run2 and wget2-gnutls-run1 downloaded the same number of files but the file list is not identical

wget2-openssl-run3 and wget2-openssl-run1 downloaded the same number of files but the file list is not identical

Comparison between ./wget1-run3/listing.txt and ./wget2-openssl-run2/listing.txt (best wget2 run) confirming 41 files missing:

XXXXXXXXX:/mnt/s/wget-temp/test$ comm -13 ./wget1-run3/listing.txt ./wget2-openssl-run2/listing.txt
XXXXXXXXX:/mnt/s/wget-temp/test$ comm -23 ./wget1-run3/listing.txt ./wget2-openssl-run2/listing.txt
./skyqueen.cc/archive/71master/cracky/kareha.pl/1175545451/5
./skyqueen.cc/archive/71master/cracky/kareha.pl/1178845554/24
./skyqueen.cc/archive/71master/cracky/kareha.pl/1178845554/50
./skyqueen.cc/archive/71master/cracky/kareha.pl/1178845554/6
./skyqueen.cc/archive/71master/cracky/kareha.pl/1179585228/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1179585228/37
./skyqueen.cc/archive/71master/cracky/kareha.pl/1179585228/6
./skyqueen.cc/archive/71master/cracky/kareha.pl/1179585228/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1185304526/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1185774837/73
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/21
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/25
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/3
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186254038/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186349078/18
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186359009/136
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186609935/56
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/13
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/39
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/5
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186707560/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186871731/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186871731/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186871731/20
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1189970310/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1197674844/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1197674844/5
./skyqueen.cc/archive/71master/cracky/kareha.pl/1214504867/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1325567524/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1336091931/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1351024951/index.html
./skyqueen.cc/archive/71master/cracky/kareha.pl/1369399115/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1386306579/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1389128895/34
./skyqueen.cc/archive/71master/cracky/kareha.pl/1390164898/3
./skyqueen.cc/archive/71master/cracky/kareha.pl/1391372958/10
XXXXXXXXX:/mnt/s/wget-temp/test$ comm -23 ./wget1-run3/listing.txt ./wget2-openssl-run2/listing.txt | wc -l
41
XXXXXXXXX:/mnt/s/wget-temp/test$
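
The comparison above relies on both listings being sorted the same way, since comm only works on sorted input. A small wrapper that re-sorts both files before comparing might look like this; the name missing_from is hypothetical and this is only a sketch of the same `comm -23` check.

```shell
# Hypothetical helper: print paths present in the reference listing but
# absent from the candidate listing (the same question `comm -23` answers).
# Both files are re-sorted under LC_ALL=C first, because comm(1) requires
# its inputs to be sorted in the collation order it runs under.
missing_from() {
    ref=$1 cand=$2
    tmpdir=$(mktemp -d) || return 1
    LC_ALL=C sort "$ref" > "$tmpdir/ref"
    LC_ALL=C sort "$cand" > "$tmpdir/cand"
    LC_ALL=C comm -23 "$tmpdir/ref" "$tmpdir/cand"
    rm -rf "$tmpdir"
}
```

Possible usage: `missing_from ./wget1-run3/listing.txt ./wget2-openssl-run2/listing.txt | wc -l`.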

Comparison between ./wget1-run3/listing.txt and ./wget2-openssl-run3/listing.txt (tied for worst wget2 run) confirming 103 files missing:

XXXXXXXXX:/mnt/s/wget-temp/test$ comm -13 ./wget1-run3/listing.txt ./wget2-openssl-run3/listing.txt
XXXXXXXXX:/mnt/s/wget-temp/test$ comm -23 ./wget1-run3/listing.txt ./wget2-openssl-run3/listing.txt
./skyqueen.cc/archive/71master/cracky/kareha.pl/1175545451/5
./skyqueen.cc/archive/71master/cracky/kareha.pl/1178845554/24
./skyqueen.cc/archive/71master/cracky/kareha.pl/1178845554/50
./skyqueen.cc/archive/71master/cracky/kareha.pl/1178845554/6
./skyqueen.cc/archive/71master/cracky/kareha.pl/1179585228/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1179585228/37
./skyqueen.cc/archive/71master/cracky/kareha.pl/1179585228/6
./skyqueen.cc/archive/71master/cracky/kareha.pl/1179585228/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1185304526/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1185774837/73
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/21
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/25
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186206956/3
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186254038/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186349078/18
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186359009/136
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/12
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/18
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/24
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/28
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/29
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/33
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/34
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/37
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/39
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/4
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/44
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/45
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/47
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/48
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/7
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/9
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186609935/56
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/13
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/39
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186670859/5
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186707560/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186871731/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186871731/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186871731/20
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/5
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/7
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/9
./skyqueen.cc/archive/71master/cracky/kareha.pl/1189359669/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1189359669/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1189359669/6
./skyqueen.cc/archive/71master/cracky/kareha.pl/1189970310/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/13
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/19
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/20
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/22
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/24
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/25
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/30
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/4
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/9
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190231688/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/18
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/21
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/22
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/24
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/26
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/28
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/29
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/4
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/9
./skyqueen.cc/archive/71master/cracky/kareha.pl/1197674844/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1197674844/5
./skyqueen.cc/archive/71master/cracky/kareha.pl/1214504867/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1325567524/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1336091931/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1351024951/index.html
./skyqueen.cc/archive/71master/cracky/kareha.pl/1369399115/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1386306579/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1389128895/34
./skyqueen.cc/archive/71master/cracky/kareha.pl/1390164898/3
./skyqueen.cc/archive/71master/cracky/kareha.pl/1391372958/10
XXXXXXXXX:/mnt/s/wget-temp/test$ comm -23 ./wget1-run3/listing.txt ./wget2-openssl-run3/listing.txt | wc -l
103
XXXXXXXXX:/mnt/s/wget-temp/test$

Comparison between the two worst wget2 runs, ./wget2-openssl-run3/listing.txt and ./wget2-openssl-run1/listing.txt, which downloaded the same number of files but not identical file list:

XXXXXXXXX:/mnt/s/wget-temp/test$ comm -13 ./wget2-openssl-run3/listing.txt ./wget2-openssl-run1/listing.txt
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/12
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/18
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/24
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/28
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/29
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/33
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/34
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/37
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/39
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/4
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/44
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/45
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/47
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/48
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/7
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186450965/9
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/5
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/7
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186911053/9
./skyqueen.cc/archive/71master/cracky/kareha.pl/1189359669/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1189359669/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1189359669/6
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/13
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/19
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/20
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/22
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/24
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/25
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/30
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/4
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190056040/9
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190231688/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/18
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/21
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/22
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/24
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/26
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/28
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/29
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/4
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1190268199/9
XXXXXXXXX:/mnt/s/wget-temp/test$ comm -13 ./wget2-openssl-run3/listing.txt ./wget2-openssl-run1/listing.txt | wc -l
62
XXXXXXXXX:/mnt/s/wget-temp/test$ comm -23 ./wget2-openssl-run3/listing.txt ./wget2-openssl-run1/listing.txt
./skyqueen.cc/archive/71master/cracky/kareha.pl/1248980400/14
./skyqueen.cc/archive/71master/cracky/kareha.pl/1248980400/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1301923222/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1302193637/3
./skyqueen.cc/archive/71master/cracky/kareha.pl/1302193637/6
./skyqueen.cc/archive/71master/cracky/kareha.pl/1302193637/7
./skyqueen.cc/archive/71master/cracky/kareha.pl/1302193637/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1303005278/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/100
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/103
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/104
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/105
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/106
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/107
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/114
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/116
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/122
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/123
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/125
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/127
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/129
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/134
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/139
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/144
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/146
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/149
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/150
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/152
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/154
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/156
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/23
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/29
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/30
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/33
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/44
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/48
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/50
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/56
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/57
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/59
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/61
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/63
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/64
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/69
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/70
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/71
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/72
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/73
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/75
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/78
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/79
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/81
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/85
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/92
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/94
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/97
./skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/98
./skyqueen.cc/archive/71master/cracky/kareha.pl/1319036436/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1319036436/374
./skyqueen.cc/archive/71master/cracky/kareha.pl/1319036436/431
./skyqueen.cc/archive/71master/cracky/kareha.pl/1319036436/478
./skyqueen.cc/archive/71master/cracky/kareha.pl/1319036436/481
XXXXXXXXX:/mnt/s/wget-temp/test$ comm -13 ./wget2-openssl-run3/listing.txt ./wget2-openssl-run1/listing.txt | wc -l
62
XXXXXXXXX:/mnt/s/wget-temp/test$

In other words, 62 files were downloaded in wget2-openssl-run3 but not in wget2-openssl-run1, and a different 62 files were downloaded in wget2-openssl-run1 but not in wget2-openssl-run3
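
To get both directions of such a comparison in one step, a sketch like the following could be used. The function name compare_listings and its output format are made up for illustration.

```shell
# Hypothetical helper: summarise how two listings differ.
# Prints three counts: paths only in A, paths only in B, and paths in both.
compare_listings() {
    a=$1 b=$2
    tmpdir=$(mktemp -d) || return 1
    LC_ALL=C sort -u "$a" > "$tmpdir/a"
    LC_ALL=C sort -u "$b" > "$tmpdir/b"
    only_a=$(LC_ALL=C comm -23 "$tmpdir/a" "$tmpdir/b" | wc -l)
    only_b=$(LC_ALL=C comm -13 "$tmpdir/a" "$tmpdir/b" | wc -l)
    both=$(LC_ALL=C comm -12 "$tmpdir/a" "$tmpdir/b" | wc -l)
    rm -rf "$tmpdir"
    # $((...)) strips any padding that some wc implementations add.
    printf 'only_a=%d only_b=%d both=%d\n' "$((only_a))" "$((only_b))" "$((both))"
}
```

Two runs with the same file count but different contents would show equal, non-zero only_a and only_b values.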

@catharsis71
Author

I have access to the logs of this server so I did some checking there

server is running Apache on Ubuntu

no errors in the server logs

the missed/skipped files don't show up in the server logs at all

wget2 does hit the server pretty hard: it's a dual-core VPS, and with wget2 downloading I see both cores at about 60%, with one of the Apache processes registering about 120% CPU usage

I ran one additional wget2 download, which missed 92 files. I selected 10 of them to further investigate via server logs

./skyqueen.cc/archive/71master/cracky/kareha.pl/1175545451/5 - in server log as a 307 (as expected)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1178845554/24 - in server log as a 307 (as expected)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186707560/2 - in server log as a 307 (as expected)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1186871731/11 - in server log as a 307 (as expected)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1197770253/44 - not in server log (should have been a 200)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1369399115/2 - in server log as a 307 (as expected)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1386306579/14 - in server log as a 307 (as expected)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1389128895/34 - in server log as a 307 (as expected)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1390164898/3 - in server log as a 307 (as expected)
./skyqueen.cc/archive/71master/cracky/kareha.pl/1391372958/10 - in server log as a 307 (as expected)

So of the 10 skipped files I selected for further investigation, wget2 did hit the server and get a 307 redirect for 9 of them, as expected. However, I can't tell for sure whether it followed the redirect, because I have many different URLs forwarding to the same destination and can't really distinguish them in the log. Whether or not it followed the redirect, it didn't end up writing the file to disk as it should have. The odd one out of the 10 should have been a 200 OK, but it didn't show up in the server log at all on this run.
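
A log check of this kind can be scripted. The sketch below assumes the access log uses Apache's common or combined format, where whitespace-separated field 7 is the request path and field 9 is the status code; the function name statuses_for is hypothetical.

```shell
# Hypothetical helper: list the HTTP status codes logged for one request path,
# each followed by how many times it was logged.
# Assumption: common/combined log format (field 7 = path, field 9 = status).
statuses_for() {
    path=$1 logfile=$2
    awk -v p="$path" '
        $7 == p { n[$9]++ }
        END { for (s in n) print s, n[s] }
    ' "$logfile" | LC_ALL=C sort
}
```

Note that the listing entries start with `./skyqueen.cc`, while the log records only the URL path, so that prefix has to be stripped before the lookup. An empty result means the path was never requested.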

@catharsis71
Author

Looking into this deeper, I was able to account for 40 of the missing/skipped files. I observed that 40 were consistently missing from all 9 wget2 tests, and I see what was going on with them specifically: it seems there was a change to the --no-parent behavior. wget1 will follow redirects into parent directories even if --no-parent is active, whereas wget2 will not. So that's fine, the wget2 way makes more sense anyway.

So subtracting those 40, the actual number of skipped/missing files on the wget2 runs ranges from 1-63
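
One way to find the files that are missing from every run is to start with the reference listing and repeatedly remove whatever each run did download. This is only a sketch, and the name always_missing is hypothetical.

```shell
# Hypothetical helper: print paths from the reference listing that are
# missing from every one of the given run listings.
always_missing() {
    ref=$1
    shift
    tmpdir=$(mktemp -d) || return 1
    LC_ALL=C sort -u "$ref" > "$tmpdir/acc"
    for run in "$@"; do
        LC_ALL=C sort -u "$run" > "$tmpdir/run"
        # Keep only the paths that this run also failed to download.
        LC_ALL=C comm -23 "$tmpdir/acc" "$tmpdir/run" > "$tmpdir/next"
        mv "$tmpdir/next" "$tmpdir/acc"
    done
    cat "$tmpdir/acc"
    rm -rf "$tmpdir"
}
```

Possible usage: `always_missing ./wget1-run1/listing.txt ./wget2-*/listing.txt | wc -l`.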

@catharsis71
Author

Updated stats after removing those 40

wget1-run1: 21419 files
wget1-run2: 21419 files
wget1-run3: 21419 files
wget2-gnutls-run1: 21392 files (missing 27)
wget2-gnutls-run2: 21392 files (missing 27, but NOT identical to wget2-gnutls-run1)
wget2-gnutls-run3: 21418 files (missing 1)
wget2-openssl-run1: 21356 files (missing 63)
wget2-openssl-run2: 21418 files (missing 1, identical to wget2-gnutls-run3)
wget2-openssl-run3: 21356 files (missing 63, but NOT identical to wget2-openssl-run1)
wget2-wolftls-run1: 21393 files (missing 26)
wget2-wolftls-run2: 21406 files (missing 13)
wget2-wolftls-run3: 21415 files (missing 4)

Of what remains, it looks like basically everything should have been a 200 OK. However, on the runs where wget2 failed to download them, I don't see them in the Apache logs at all.

@rockdaboot
Owner

First of all, thank you so much for the detailed work on this! The results are interesting and very useful for further investigations.
I am currently having a few days of vacation with almost no network (just ok for text messages), but will pick this up when back.
The --no-parent issue is a good finding, this needs to be fixed in wget1.
With -d -o log.txt --no-progress, you should be able to see what wget2 does with those missing files. The flakiness of the results indicates a server issue (e.g. overload), but that's more of a guess.
Could you test (just with wget2-gnutls) with --no-http2 ?
And also with --max-threads 1 ?

Regarding WolfSSL: The implementation is mostly unused / experimental. Let's look at it again when we are at a stable behavior with gnutls.

@catharsis71
Author

catharsis71 commented Oct 10, 2022

Okay, I ran 3 more tests with default settings (HTTP2, 5 threads), then 3 with HTTP2 and 1 thread, then 3 with HTTP1 and 5 threads, all GnuTLS

This time I'll only be listing the number of files actually downloaded, not the number of files reported downloaded (which is usually less)

wget2_gnutls_new1 - 3m05s, 21356 files (63 files missing)
wget2_gnutls_new2 - 3m12s, 21418 files (1 file missing)
wget2_gnutls_new3 - 3m26s, 21417 files (2 files missing)

wget2_gnutls_1thread_run1 - 3m18s, 21303 files (116 files missing)
wget2_gnutls_1thread_run2 - 3m23s, 21418 files (1 file missing)
wget2_gnutls_1thread_run3 - 3m07s, 21381 files (38 files missing)

wget2_gnutls_http1_run1 - 11m05s, 21418 files (1 file missing)
wget2_gnutls_http1_run2 - 11m06s, 21418 files (1 file missing)
wget2_gnutls_http1_run3 - 11m20s, 21418 files (1 file missing)

With HTTP2 there seems to be little difference between 1 thread and 5 threads: download speeds are the same, although 5 threads hits the server CPU harder. Not sure what the point is of using multiple threads with HTTP2.

The HTTP1 tests all had only a single file missing, and that same file turned out to be missing from all the others as well (except yesterday's wget1 tests)

It's another redirect situation, but different than the ones I mentioned yesterday, because the redirect target is within the target directory, meaning --no-parent shouldn't be excluding it

https://skyqueen.cc/archive/71master/cracky/kareha.pl/1351024951/ is an HTTP 307 redirect to https://skyqueen.cc/archive/71master/cracky/kareha.pl/missing-thread.txt (which is also linked directly from https://skyqueen.cc/archive/71master/cracky/kareha.pl/ )

Is it possible that wget2 isn't following the 307 redirect because it already has the redirect target on its download list? wget1 will happily download the same file many times if multiple URLs redirect to it, but maybe this is an intentional change?
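
To confirm what a single URL redirects to, independent of either wget version, the response headers can be inspected directly. The sketch below reads raw response headers on stdin and prints the status code and any Location value; the function name redirect_info is hypothetical.

```shell
# Hypothetical helper: read raw HTTP response headers on stdin and print
# the status code of the first response, followed by the Location header
# value if one is present.
redirect_info() {
    awk '
        # Strip the trailing carriage return that HTTP header lines carry.
        { sub(/\r$/, "") }
        NR == 1 { status = $2 }
        tolower($1) == "location:" { loc = $2 }
        END {
            if (loc != "") print status, loc
            else print status
        }
    '
}
```

Possible usage, assuming curl is available: `curl -sI https://skyqueen.cc/archive/71master/cracky/kareha.pl/1351024951/ | redirect_info`.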

Seems like using HTTP1 does get rid of the random file skips

Investigating the random file skips further, I tried doing one more HTTP2 1-thread test with debugging turned on (time wget2_gnutls -d --progress none --max-threads 1 -m -np -o log.txt https://skyqueen.cc/archive/71master/cracky/kareha.pl/)

However, it's going extremely slowly, it's been 10 minutes and it's still processing the first page

I'll post back later when/if it finishes

@catharsis71
Author

catharsis71 commented Oct 10, 2022

I don't think the debug download is ever going to finish. It seems to be processing links extremely slowly, like 1 per second, and it hasn't gotten halfway through processing the first page yet

10.114710.051 tr/@class=odd
10.114710.148 td/@class=indexcolicon
10.114710.246 a/@href=1325606203/
10.114710.344 img/@src=/icons/folder.gif
10.114710.441 img/@alt=[DIR]
10.114710.537 td/@class=indexcolname
10.114710.635 a/@href=1325606203/
10.114710.732 ='1325606203/'
10.114710.830 td/@class=indexcollastmod
10.114710.927 ='2022-08-24 17:16  '
10.114711.024 td/@class=indexcolsize
10.114711.121 ='  - '
10.114711.219 ='
   '
10.114711.317 tr/@class=even
10.114711.414 td/@class=indexcolicon
10.114711.511 a/@href=1325642112/
10.114711.608 img/@src=/icons/folder.gif
10.114711.706 img/@alt=[DIR]
10.114711.805 td/@class=indexcolname
10.114711.902 a/@href=1325642112/
10.114711.999 ='1325642112/'
10.114712.096 td/@class=indexcollastmod
10.114712.193 ='2022-08-24 17:16  '
10.114712.290 td/@class=indexcolsize
10.114712.388 ='  - '
10.114712.485 ='

I'll leave it running though

@catharsis71
Author

For some reason, debugging and logging to file with -o log.txt didn't work, it was running glacially slowly and never would have finished

However, doing it like this worked, with only slight slowdown:

wget2_gnutls -d --progress none --max-threads 1 -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/ 2> log-err.txt > log.txt

Completed in 4m1s, reported 21373 files downloaded, actually 21374 files downloaded (reported+1), 45 files missing/skipped. log.txt was about 25MB and log-err.txt was about 470MB.

The 45 missing files:

./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/1
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/10
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/11
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/12
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/15
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/17
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/18
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/2
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/20
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/21
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/25
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/26
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/27
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/28
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/3
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/30
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/31
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/32
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/33
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/35
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/39
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/6
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/7
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/8
./skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/index.html
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/318
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/319
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/320
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/336
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/390
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/391
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/392
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/393
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/396
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/397
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/398
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/399
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/400
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/401
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/403
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/443
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/445
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/447
./skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/449
./skyqueen.cc/archive/71master/cracky/kareha.pl/1351024951/index.html
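A list like the one above can be produced by diffing the file tree of a complete mirror (for example a wget1 run) against the tree from a wget2 run. The following is only a minimal sketch: the function names and the idea of keeping the two mirrors in separate directories are assumptions for illustration, not something wget2 provides.

```python
import os

def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found

def missing_files(reference_root, candidate_root):
    """Files present in the reference mirror but absent from the candidate."""
    reference = relative_files(reference_root)
    candidate = relative_files(candidate_root)
    return sorted(reference - candidate)
```

Calling `missing_files("wget1-run1", "wget2-gnutls-run1")` (hypothetical directory names) would return the relative paths that exist only in the first tree. Note that this only detects missing files; truncated files would need a size or checksum comparison as well.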

No trace of the missing files in the Apache log (except the 1351024951, which I mentioned in a prior post)

This seems relevant:

HTTP response 0  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1191886349/]

So it apparently had some kind of issue downloading that one, which explains the 1191886349/index.html being missing, as well as others linked from it such as 1191886349/39

but it shows up as a 200 OK in the Apache log

That doesn't account for all of them, though. It downloaded 1249158438/index.html successfully*, but of the pages linked from it, such as 1249158438/449, some of them are missing and I'm finding no trace of the missing ones in the logs

*actually I looked at the copy of 1249158438/index.html that it downloaded, and it appears to have truncated during download. Why are the files being randomly truncated, though? Why only with HTTP2 and not HTTP1? Why is wget2 not indicating that the file was truncated?

Taking another look at the Apache log to try to figure out what happened with the truncated file, I actually see wget request the file twice, the first responded with a 200 OK, and the second responded with a 304 Not Modified

[2022-10-10/13:20] skyqueen.cc XXXX "GET /archive/71master/cracky/kareha.pl/1249158438/ HTTP/2.0" 200 "https://skyqueen.cc/archive/71master/cracky/kareha.pl/" "wget2/2.0.1" - - - -
[2022-10-10/13:20] skyqueen.cc XXXX "GET /archive/71master/cracky/kareha.pl/1249158438/ HTTP/2.0" 304 "https://skyqueen.cc/archive/71master/cracky/kareha.pl/" "wget2/2.0.1" - - - -

Why did wget request the file twice?

I looked at other non-truncated files, and I only see them once in the Apache log.

In fact, that was the only 304 in the Apache log for the entire run

in the wget log, I see the 304 but not the 200?!

So it looks like the skipped files from this run can be divided into three root causes:

  1. /1351024951/ is a 307 redirect and wget2 always fails to follow this redirect (unless doing a non-recursive direct download), but I'm not sure why; it's not a --no-parent situation as the redirect target is within the scope of the download

  2. /1191886349/ showed up as an "HTTP response 0" in the wget log but showed as a 200 OK in the Apache log

  3. /1249158438/ was truncated during download. Per the Apache log, wget made two requests, the first answered with a 200 OK, and the second with a 304 Not Modified. But I only see the 304 in the wget log.

The other missing files were consequences of 2 & 3

@catharsis71
Author

Doing some additional runs, deleting files in between each (so that there will be no legitimate 304's), I'm seeing response 0 and response 304 for different files each time

Any time there's a 304 in the wget log (which shows in the Apache log as a 200 followed by a 304), I look at the downloaded file to confirm it's truncated
Any time there's a 0 in the wget log (which shows in the Apache log as a 200), the file is always missing from disk


Run 1:

log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1274452416/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1274504470/]
(examined these files and confirmed they were truncated)

Run 2:

log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1249158438/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1250456905/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1250126361/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1250460753/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1274504201/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1274504470/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1319036436/]
(examined these files and confirmed they were truncated)

Run 3:

log.txt:HTTP response 0  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1191551218/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1274504470/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1304612348/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1304603144/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1309614638/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1310219441/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1310030666/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1310292997/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1310001577/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1310028049/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1310133372/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1310418243/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1310417728/]
(first file (the response 0) was missing from disk, the others were truncated)

Run 4:

log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1319036436/]
(file was truncated)

Run 5:

log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1274504201/]
log.txt:HTTP response 0  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1278913088/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1319036436/]
log.txt:HTTP response 0  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1325565899/]
(the 0's were missing from disk and the 304's were truncated, in fact, one of them was 0 bytes)

Run 6:

log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1193082056/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1193048627/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1193083280/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1193155533/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1193209146/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1193167618/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1274504201/]
log.txt:HTTP response 304  [https://skyqueen.cc/archive/71master/cracky/kareha.pl/1274504470/]
(files were truncated)
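To pull these entries out of a large log without grepping by hand, something like the sketch below could be used. It assumes the "HTTP response N  [URL]" line format shown in the runs above (optionally with a "log.txt:" prefix from grep); the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   HTTP response 304  [https://example.org/path/]
# with or without a leading "log.txt:" prefix from grep output.
RESPONSE_RE = re.compile(r"HTTP response (\d+)\s+\[(.+?)\]\s*$")

def suspicious_responses(lines):
    """Group URLs by status, keeping only 0 (file missing) and 304 (file truncated)."""
    by_status = defaultdict(list)
    for line in lines:
        match = RESPONSE_RE.search(line)
        if not match:
            continue
        status = int(match.group(1))
        if status in (0, 304):
            by_status[status].append(match.group(2))
    return dict(by_status)
```

With the mapping observed in these runs, the URLs under key `0` are the ones to expect missing from disk, and the URLs under key `304` are the ones to check for truncation.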

@rockdaboot
Owner

Awesome report!
The +1 file might be due to robots.txt not being reported.
Redirections to the same file likely are only downloaded once (needs to be fixed).
Using multiple threads via http/2 on a single host can speed up downloads, given that the connections are load balanced to different servers.
The http/2 flakiness needs to be investigated further (I can do that when I'm back).

@catharsis71
Author

catharsis71 commented Oct 10, 2022

Correction to some earlier stuff -- turns out my wget2_openssl and wget2_wolfssl were actually using GnuTLS as well... even though they reported OpenSSL or WolfSSL on --version, my plan to keep them separate apparently didn't work and all 3 of my compiled copies were using libraries from the one I compiled last, which happened to be GnuTLS

I tried it again a different way and verified that they all worked as intended this time:

$ wget2-gnutls -d https://www.google.com/ 2>&1 | grep -i init
10.154715.388 GnuTLS init
10.154715.398 GnuTLS init done
$ wget2-openssl -d https://www.google.com/ 2>&1 | grep -i init
10.154722.170 OpenSSL initialized
$ wget2-wolfssl -d https://www.google.com/ 2>&1 | grep -i init
10.154728.607 WolfSSL init
10.154728.654 WolfSSL init done

I know you said to focus on GnuTLS but to cover the bases I ran some quick re-checks with my fixed OpenSSL compile

I do still see the random failures sometimes with OpenSSL, but it seems the random failures are less likely to happen than with GnuTLS. Can't say for certain though. With OpenSSL, I had 3 "clean" runs out of 5 (the only skipped file was the 307 redirect thing), and 2 dirty runs (9 truncated files, 1 missing file). Whereas with GnuTLS, "clean" runs are rare, like 1 out of 5.

@catharsis71
Author

I have a hypothesis about what might be going on with the truncated files.
You mentioned in another bug that if wget thinks a file was truncated, it will re-request the file from the server.
But I think when this happens, it's including an If-Modified-Since header,
and the server is responding with an HTTP 304 because the file's Last-Modified time matches.
Shouldn't wget refrain from sending the If-Modified-Since header when re-requesting a file that it believes was truncated?
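To illustrate the proposed behavior: the conditional request header (If-Modified-Since, which the server compares against the file's Last-Modified time) would be sent for normal timestamp checks but left out when retrying a download believed to be truncated. This is a hypothetical sketch, not wget2's actual code; `build_request_headers` and its parameters are invented for illustration.

```python
from email.utils import formatdate

def build_request_headers(local_mtime=None, retry_after_truncation=False):
    """Decide whether to send If-Modified-Since for a (re-)request.

    local_mtime: Unix timestamp of the local copy, or None if there is none.
    retry_after_truncation: True when the previous attempt was cut off.
    """
    headers = {}
    # A truncated local copy can carry the same timestamp as the complete
    # remote file, so a conditional request would just earn a 304.
    if local_mtime is not None and not retry_after_truncation:
        headers["If-Modified-Since"] = formatdate(local_mtime, usegmt=True)
    return headers
```

In this sketch a retry falls back to an unconditional GET, which costs a full re-download but cannot be answered with 304 Not Modified.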

@rockdaboot
Owner

Sounds reasonable. I have to think about implementing a test for this.

@catharsis71
Author

Looking further into the server side of this, I found something I definitely should have noticed sooner: hammering the server with HTTP2 is resulting in Apache sub-process crashes. In fact, each crash corresponds with a "Failed to read 102400 bytes" message, and probably with a truncated file as well. So I'm starting to get a clearer picture of what's going on.

Example:

Client:
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=2)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=2)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=32)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=2)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=2)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=2)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=2)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=11)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=2)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=2a04:dd00:16:7:195:242:99:71, errno=2)

Server:
[Mon Oct 31 20:52:34.237237 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3769483 exit signal Segmentation fault (11)
[Mon Oct 31 20:52:40.252944 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3769482 exit signal Segmentation fault (11)
[Mon Oct 31 20:52:48.266106 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3770001 exit signal Segmentation fault (11)
[Mon Oct 31 20:52:54.279294 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3770066 exit signal Segmentation fault (11)
[Mon Oct 31 20:52:57.287511 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3770131 exit signal Segmentation fault (11)
[Mon Oct 31 20:52:59.298819 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3770196 exit signal Segmentation fault (11)
[Mon Oct 31 20:53:03.308400 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3770261 exit signal Segmentation fault (11)
[Mon Oct 31 20:53:05.316149 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3770326 exit signal Segmentation fault (11)
[Mon Oct 31 20:53:16.335511 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3770391 exit signal Segmentation fault (11)
[Mon Oct 31 20:53:25.355039 2022] [core:notice] [pid 3768453:tid 140109948124224] AH00052: child pid 3770456 exit signal Segmentation fault (11)

As far as I can tell these Apache crashes are only happening when a wget HTTP2 download is running; I'm not seeing them with normal HTTP2 browser traffic, although I'll keep checking the logs.
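For keeping an eye on these crashes over a longer period, the AH00052 lines can be extracted from the Apache error log with a small script. The sketch below assumes the error log format shown above; the function name is invented for illustration.

```python
import re
from datetime import datetime

# Matches Apache error-log lines such as:
#   [Mon Oct 31 20:52:34.237237 2022] [core:notice] [pid ...] AH00052: child pid 3769483 exit signal Segmentation fault (11)
CRASH_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[core:notice\] .*"
    r"AH00052: child pid (?P<pid>\d+) exit signal (?P<signal>.+?) \(\d+\)"
)

def parse_crashes(lines):
    """Return a list of (timestamp, child_pid, signal_name) for child crashes."""
    crashes = []
    for line in lines:
        match = CRASH_RE.match(line)
        if not match:
            continue
        timestamp = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S.%f %Y")
        crashes.append((timestamp, int(match.group("pid")), match.group("signal")))
    return crashes
```

Since the client-side "Failed to read" messages carry no timestamps in the output above, matching them to individual crashes would still need the wget2 debug log, which does prefix its lines with a time.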

I'll look further into the server crashes. I also have a higher-end server with newer software that I'd like to test on as well.

I'm doing most of my HTTP2 testing with --max-threads 1 because multiple threads with HTTP2 don't seem to have any benefit, at least not with my server. --max-threads 1 vs. the default doesn't seem to affect the number of crashes I see on the Apache side, although multiple threads do increase the number of "failed to read" messages, presumably because each thread reports the error for each crash.

For turning down the intensity further to try to alleviate the server crashes, I guess I need to start looking at --http2-request-window?

Based on initial testing, I'm still seeing server crashes even with --http2-request-window 1 --max-threads=1

any other options for running in a more server-friendly manner? I guess --wait would be an option but at that point it'd probably be much faster to just use HTTP1

@rockdaboot
Owner

Request window set to 1 should do sequential requests, one at a time. You can see that in the debug logs. Sorry, I am afk, can't test myself until the weekend.

@catharsis71 catharsis71 changed the title wget2 randomly skips files and always downloads 41-103 fewer files than wget1 in the same test when resuming an interrupted download, wget2 should not send a last-modified header Nov 6, 2022
@catharsis71 catharsis71 changed the title when resuming an interrupted download, wget2 should not send a last-modified header when retrying/resuming an interrupted download, wget2 should not send a last-modified header Nov 6, 2022
@catharsis71 catharsis71 changed the title when retrying/resuming an interrupted download, wget2 should not send a last-modified header when retrying/resuming an interrupted download, wget2 should not send a If-Modified-Since header Nov 6, 2022
@catharsis71
Author

Changed the title. I obviously have some stuff going on server-side that I need to investigate, but in terms of wget2 behavior specifically, I think the main issue is that when attempting to resume or retry after an interrupted download, wget2 sends an If-Modified-Since header, which will generally just result in the server sending back an HTTP 304 Not Modified. If the original attempt was interrupted and you think you have a truncated file, it's better not to send an If-Modified-Since header.

@rockdaboot
Owner

I agree that sending an If-Modified-Since header in this case doesn't make sense. Instead, wget2 should try a Range request, and if the server doesn't support it, the full file needs to be re-downloaded.
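A rough sketch of that retry logic, using standard HTTP semantics (206 Partial Content when the range is honored, 200 with the full body when it is not). The function names are hypothetical and this is not taken from the wget2 source.

```python
def resume_headers(bytes_on_disk, validator=None):
    """Headers for retrying a download that was cut off after bytes_on_disk bytes."""
    headers = {}
    if bytes_on_disk > 0:
        # Ask only for the missing tail of the file.
        headers["Range"] = f"bytes={bytes_on_disk}-"
        if validator:
            # ETag or Last-Modified value from the first response. If the
            # resource changed meanwhile, the server replies 200 with the
            # full body instead of a partial one.
            headers["If-Range"] = validator
    return headers

def handle_resume_response(status, bytes_on_disk):
    """Return (action, write_offset) for the response to a resume request."""
    if status == 206:
        # Server honored the range: append after the bytes already saved.
        return ("append", bytes_on_disk)
    if status == 200:
        # No range support (or resource changed): full body, start over.
        return ("overwrite", 0)
    # Anything else, including a 304, is not a valid answer to this request.
    return ("error", bytes_on_disk)
```

Because no If-Modified-Since header is sent here, a 304 should not come back at all, and the sketch treats it as an error rather than as "file is complete".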
