Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow excluding cache based on status code #1403

Merged
merged 14 commits into from
Oct 14, 2024

Conversation

dmathieu
Copy link
Contributor

Closes #1400

This introduces an option --cache-exclude-status, which allows specifying a range of HTTP status codes which will be ignored from the cache.

lychee-bin/src/options.rs Outdated Show resolved Hide resolved
lychee-bin/src/commands/check.rs Outdated Show resolved Hide resolved
@dmathieu dmathieu marked this pull request as ready for review April 11, 2024 12:55
@mre
Copy link
Member

mre commented Apr 12, 2024

Perfect. Great work!

Perhaps the only thing I'd add is an integration test, similar to this:

#[tokio::test]
async fn test_lycheecache_file() -> Result<()> {
let base_path = fixtures_path().join("cache");
let cache_file = base_path.join(LYCHEE_CACHE_FILE);
// Unconditionally remove cache file if it exists
let _ = fs::remove_file(&cache_file);
let mock_server_ok = mock_server!(StatusCode::OK);
let mock_server_err = mock_server!(StatusCode::NOT_FOUND);
let mock_server_exclude = mock_server!(StatusCode::OK);
let dir = tempfile::tempdir()?;
let mut file = File::create(dir.path().join("c.md"))?;
writeln!(file, "{}", mock_server_ok.uri().as_str())?;
writeln!(file, "{}", mock_server_err.uri().as_str())?;
writeln!(file, "{}", mock_server_exclude.uri().as_str())?;
let mut cmd = main_command();
let test_cmd = cmd
.current_dir(&base_path)
.arg(dir.path().join("c.md"))
.arg("--verbose")
.arg("--no-progress")
.arg("--cache")
.arg("--exclude")
.arg(mock_server_exclude.uri());
assert!(
!cache_file.exists(),
"cache file should not exist before this test"
);
// run first without cache to generate the cache file
test_cmd
.assert()
.stderr(contains(format!("[200] {}/\n", mock_server_ok.uri())))
.stderr(contains(format!(
"[404] {}/ | Failed: Network error: Not Found\n",
mock_server_err.uri()
)));
// check content of cache file
let data = fs::read_to_string(&cache_file)?;
assert!(data.contains(&format!("{}/,200", mock_server_ok.uri())));
assert!(data.contains(&format!("{}/,404", mock_server_err.uri())));
// run again to verify cache behavior
test_cmd
.assert()
.stderr(contains(format!(
"[200] {}/ | Cached: OK (cached)\n",
mock_server_ok.uri()
)))
.stderr(contains(format!(
"[404] {}/ | Cached: Error (cached)\n",
mock_server_err.uri()
)));
// clear the cache file
fs::remove_file(&cache_file)?;
Ok(())
}

Probably it's just a matter of copy-pasting that block and changing the parameter, to test the new flag.
Would you be interested in adding that?
I could understand if it's outside the scope for now, though, and I'd be fine with merging it like it is as well. Let me know what you prefer.

@dmathieu
Copy link
Contributor Author

I'm happy to do it, but it'll have to wait for a few days. I can then do it either in this PR, or in a new one.

@mre
Copy link
Member

mre commented Apr 15, 2024

Just tested the changes locally.
When I run lychee like below, the cache is empty:

❯❯❯ cargo run -- --verbose --cache https://lychee.cli.rs/
    Finished dev [unoptimized + debuginfo] target(s) in 0.19s
     Running `target/debug/lychee --verbose --cache 'https://lychee.cli.rs/'`
[INFO ] Cache is recent (age: 1m 6s, max age: 1d 0h 0m 0s). Using.
✔ [200] https://lychee.cli.rs/introduction/
✔ [200] https://lychee.cli.rs/#_top
✔ [200] https://lychee.cli.rs/favicon.svg
✔ [200] https://lychee.cli.rs/
✔ [200] https://lychee.cli.rs/_astro/logo.BJx6koUn.svg
✔ [200] https://lychee.cli.rs/sitemap-index.xml
✔ [200] https://lychee.cli.rs/_astro/index.fVW1leCO.css
✔ [200] https://lychee.cli.rs/
✔ [200] https://lychee.cli.rs/
✔ [200] https://lychee.cli.rs/_astro/logo.BJx6koUn_1qY8Fi.svg
✔ [200] https://github.com/lycheeverse/lycheeverse.github.io/edit/master/src/content/docs/index.mdx
✔ [200] https://github.com/lycheeverse/lychee/

🔍 12 Total (in 1s) ✅ 12 OK 🚫 0 Errors
~/C/p/l/lychee ❯❯❯ echo $?
0
~/C/p/l/lychee ❯❯❯ cat .lycheecache
~/C/p/l/lychee ❯❯❯ 

@mre
Copy link
Member

mre commented Apr 15, 2024

I've added the integration test we discussed, and that one seems to work. Something isn't adding up here. 🤔

@dmathieu
Copy link
Contributor Author

dmathieu commented Apr 22, 2024

Could that be related to a cache file already existing? (in which case, the issue could be that in the case of running tests with an existing cache, the file isn't written again?)

@dmathieu dmathieu force-pushed the exclude-cache branch 3 times, most recently from 31bc11e to 8631028 Compare June 5, 2024 12:23
@dmathieu
Copy link
Contributor Author

dmathieu commented Jun 5, 2024

I'm not sure whether there's still something missing in this PR?

@mre
Copy link
Member

mre commented Jun 13, 2024

Sorry for the late response.

The feature doesn't work right now.
I tested it by excluding only a single status code, but it excludes everything.

Could that be related to a cache file already existing? (in which case, the issue could be that in the case of running tests with an existing cache, the file isn't written again?)

Nope, tried in a directory without a cache file.

I didn't have time to look at it, but I guess it has something to do with the defaults of StatusCodeSelector. I know we don't use StatusCodeSelector::default for the excluded status codes, but it looks like the defaults get picked up. As a consequence, status codes like 200 get excluded by default when the setting is used, rendering the feature sort of useless.

If you have some time to investigate or test on your side, I would greatly appreciate the effort. :)

@dmathieu
Copy link
Contributor Author

dmathieu commented Oct 4, 2024

Sorry for the delay. I have been trying this branch explicitly, and I'm not seeing any issue in the behavior.

I'm testing a single file which has the following content:

[Valid](https://github.com)
[Error](https://example.ai)
[Missing](https://github.com/404missingpage)

I run lychee and then inspect the cache:

rm .lycheecache; cargo run -- --cache --cache-exclude-status 404 /Users/dmathieu/code/src/github.com/dmathieu/experiments/tmp; cat .lycheecache
   Compiling lychee-lib v0.15.1 (/Users/dmathieu/code/src/github.com/lycheeverse/lychee/lychee-lib)
   Compiling lychee v0.15.1 (/Users/dmathieu/code/src/github.com/lycheeverse/lychee/lychee-bin)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 4.07s
     Running `target/debug/lychee --cache --cache-exclude-status 404 /Users/dmathieu/code/src/github.com/dmathieu/experiments/tmp`
  3/3 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                                                                                                                                                                                                                                                                        Issues found in 1 input. Find details below.

[/Users/dmathieu/code/src/github.com/dmathieu/experiments/tmp/README.md]:
     [404] https://github.com/404missingpage
   [ERROR] https://example.ai/

🔍 3 Total (in 0s) ✅ 1 OK 🚫 2 Errors
   [WARN ] There were issues with GitHub URLs. You could try setting a GitHub token and running lychee again.
https://github.com/,200,1728031740
https://example.ai/,,1728031739

The 404 URL is properly not cached, while the 200 one is cached.
The fact that the erroring address is cached too is problematic, but it feels like fixing it should be another issue, as there's no notion of status code.

@mre
Copy link
Member

mre commented Oct 7, 2024

No worries and thanks for the update.

What I'm seeing is the following:

--cache-exclude-status works correctly, as you described.

❯ rm .lycheecache
❯ cargo run -- --cache --cache-exclude-status 404 tmp                                                ✘ 
     Running `target/debug/lychee --cache --cache-exclude-status 404 tmp`
  3/3 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                     Issues found in 1 input. Find details below.

[tmp]:
   [ERROR] https://example.ai/
     [404] https://github.com/404missingpage

🔍 3 Total (in 0s) ✅ 1 OK 🚫 2 Errors
   [WARN ] There were issues with GitHub URLs. You could try setting a GitHub token and running lychee again.
❯ cat .lycheecache                                                                                   ✘ 
https://example.ai/,,1728333536
https://github.com/,200,1728333536

It caches the 200 status code from github.com and excludes the 404
status code.

Now let us remove the cache and run lychee without --cache-exclude-status.

❯ rm .lycheecache
❯ cargo run -- --cache tmp
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.60s
     Running `target/debug/lychee --cache tmp`
  3/3 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                     Issues found in 1 input. Find details below.

[tmp]:
     [404] https://github.com/404missingpage
   [ERROR] https://example.ai/

🔍 3 Total (in 0s) ✅ 1 OK 🚫 2 Errors
   [WARN ] There were issues with GitHub URLs. You could try setting a GitHub token and running lychee again.
❯ cat .lycheecache                                                                                   ✘ 
https://example.ai/,,1728333603
https://github.com/404missingpage,404,1728333603

As we can see, the cache now contains the 404 status code as expected.
However, it does not contain the 200 status code from github.com. This is unexpected.
In the currently released version of lychee, the cache contains the
200 status code from github.com:

❯ rm .lycheecache
❯ lychee --cache tmp
  3/3 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                     Issues found in 1 input. Find details below.

[tmp]:
✗ [404] https://github.com/404missingpage | Failed: Network error: Not Found
✗ [ERR] https://example.ai/ | Failed: Network error: error sending request for url (https://example.ai/)

🔍 3 Total (in 0s) ✅ 1 OK 🚫 2 Errors
💡 There were issues with GitHub URLs. You could try setting a GitHub token and running lychee again.%                   
❯ cat .lycheecache                                                                                   ✘ 
https://github.com/404missingpage,404,1728333682
https://github.com/,200,1728333682
https://example.ai/,,1728333682

I think we want the behavior of the current release, where the cache
contains the 200 status code from github.com. This is because the
--cache-exclude-status flag is supposed to exclude additional status
codes from the cache but keep the 200 status codes in the cache
unless they are excluded by using --cache-exclude-status 200.

@dmathieu
Copy link
Contributor Author

dmathieu commented Oct 8, 2024

Ah yes, you're right. It seems this is because we reuse the same StatusCodeSelector, which has a default for the accept option, but we don't want that for our exclusion.

I've been looking around, but I haven't found a clean way to fix this so far. It seems like the derivative crate could be a solution, but I'd rather not add a dependency on this project for no reason.

An alternative would be to somehow have a StatusCodeSelector, and a StatusCodeExcluder, both with different defaults but similar behavior under the hood.
Or just to copy both behaviors to be entirely separate.

@mre
Copy link
Member

mre commented Oct 8, 2024

You're right. StatusCodeExcluder sounds like the way to go. Just copy the other one and make the modifications. They are different enough to merit a separate struct.

@dmathieu
Copy link
Contributor Author

dmathieu commented Oct 8, 2024

The CI failure seems to be a flake? I'm not reproducing it locally.

@mre
Copy link
Member

mre commented Oct 14, 2024

Fixed a couple of typos and split up the tests into individual test-cases.
As you pointed out, the failing test is probably flaky. It passes now.

Thanks for the contribution @dmathieu!

@mre mre merged commit f0ebac2 into lycheeverse:master Oct 14, 2024
7 checks passed
@mre mre mentioned this pull request Oct 14, 2024
@dmathieu dmathieu deleted the exclude-cache branch October 14, 2024 06:59
This was referenced Oct 21, 2024
mre pushed a commit that referenced this pull request Oct 27, 2024
This introduces an option `--cache-exclude-status`, which allows specifying a range of HTTP status codes which will be ignored from the cache.

Closes #1400.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow option to ignore failures in cache
3 participants