Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#197] Canonicalize filepaths #230

Merged
merged 4 commits into from
Dec 22, 2022
Merged

Conversation

aeqz
Copy link
Contributor

@aeqz aeqz commented Dec 2, 2022

Description

Problem: The current usage of filepaths is error-prone and can be simplified.

Solution: Canonicalize filepaths at the boundaries, so their management will be safer and will simplify the codebase. This has the side effect of changing the program output regarding file paths, so we should also probably decide how these will be shown after the refactor.

Related issue(s)

Fixes #197

✅ Checklist for your Pull Request

Ideally a PR has all of the checkmarks set.

If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).

Related changes (conditional)

  • Tests

    • If I added new functionality, I added tests covering it.
    • If I fixed a bug, I added a regression test to prevent the bug from
      silently reappearing again.
  • Documentation

    • I checked whether I should update the docs and did so if necessary:
  • Public contracts

    • Any modifications of public contracts comply with the Evolution
      of Public Contracts
      policy.
    • I added an entry to the changelog if my changes are visible to the users
      and
    • provided a migration guide for breaking changes if possible

Stylistic guide (mandatory)

src/Xrefcheck/Core.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Core.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Verify.hs Outdated Show resolved Hide resolved
@Martoon-00
Copy link
Member

Cool, looking at your comments I like how you approach all this.

Gonna take a thorough look by Monday. Please switch to another ticket meanwhile, unless you have something else to do here before my review.

@aeqz aeqz changed the title Aeqz/#197 canonicalize filepaths [#197] Canonicalize filepaths Dec 5, 2022
Copy link
Member

@Martoon-00 Martoon-00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, better than what I expected to be done in this issue 👍

And your code generally looks well, cool.

I'm sorry for the delayed review, hope it wasn't a blocker for you.

src/Xrefcheck/Core.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Core.hs Show resolved Hide resolved
src/Xrefcheck/Core.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Scanners/Markdown.hs Outdated Show resolved Hide resolved
src/Xrefcheck/System.hs Show resolved Hide resolved
src/Xrefcheck/Scan.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Verify.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Verify.hs Outdated Show resolved Hide resolved
src/Xrefcheck/System.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Core.hs Outdated Show resolved Hide resolved
@Martoon-00
Copy link
Member

On the failing tests:

I believe that the switch from displaying absolute paths in most cases is very good.

However I wonder why in case of Windows (this CI is in GitHub actions) we get relative paths with /, not with \, despite everything is canonicalized. Maybe there is a quick answer to this? If not, no need to setup Windows locally.

Also, I find that in case of File not found error it may be beneficial to print an absolute path too (perhaps on a separate line), this way it may be simpler for the user to investigate the issue if it was not expected.


There are tests that seem broken now while they shouldn't, e.g. "24 Ignore file with broken xrefcheck annotation: config file".

And some diff seems to point to places that were invalid but now are valid, not sure how did this happen. Apparently, we had too many "Link targets a local file outside repository" errors, now it seems to work better.

Could you please regenerate the golden files and carefully commit only those changes where we had "Link targets a local file outside repository" illegally, if that does not take much effort? This should give us an opportunity to document the reasoning of this change (in the commit description) and should simplify sorting through the remaining failing tests.

@aeqz
Copy link
Contributor Author

aeqz commented Dec 6, 2022

I have moved CanonicalPath related code under the Xrefcheck.System module, and now I will try to fix the issues related to path printing and unexpected golden test diffs.

@aeqz
Copy link
Contributor Author

aeqz commented Dec 6, 2022

Now the Golden test expected outputs reflect how the program output has changed. I will investigate the couple of diffs that are unexpectedly not being reported as invalid.

@aeqz aeqz force-pushed the aeqz/#197-canonicalize_filepaths branch from 429459e to 9bc47b2 Compare December 7, 2022 09:29
@aeqz
Copy link
Contributor Author

aeqz commented Dec 7, 2022

The golden test that is failing reports as "local file outside repository" a link that is not really outside the repository, but passes trough its root: the root directory is dir1 and the link, relative from dir1, is ../dir1/file.md.

After the refactor, its path is treated as file.md, so it does not report that it is outside the repository. If we want to preserve that behaviour, we should add an extra check that, before canonicalizing the file link path, follows its indirections and detects whether it passes through the root or not.

Copy link
Member

@Martoon-00 Martoon-00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the changes to the golden tests that were applied, looks neat.

A few more comments.

src/Xrefcheck/System.hs Outdated Show resolved Hide resolved
src/Xrefcheck/System.hs Show resolved Hide resolved
src/Xrefcheck/System.hs Show resolved Hide resolved
src/Xrefcheck/System.hs Outdated Show resolved Hide resolved
tests/Test/Xrefcheck/LocalSpec.hs Show resolved Hide resolved
src/Xrefcheck/Verify.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Verify.hs Outdated Show resolved Hide resolved
@Martoon-00
Copy link
Member

The golden test that is failing reports as "local file outside repository" a link that is not really outside the repository, but passes trough its root: the root directory is dir1 and the link, relative from dir1, is ../dir1/file.md.

After the refactor, its path is treated as file.md, so it does not report that it is outside the repository. If we want to preserve that behaviour, we should add an extra check that, before canonicalizing the file link path, follows its indirections and detects whether it passes through the root or not.

Interesting. So we got more flexible after your changes.

I think this may be helpful in case the user has auto-generated documentation, so let's try to preserve this new feature of yours.

But, canonicalizePath seems to poke the filesystem taking the prefixes of the given filepath, and I believe we must avoid touching anything outside of the repository: some users may care about files' permissions and what is being touched by third party programs. So let's make sure to add the new check in a way that excludes canonicalizePaths calls over links that even "temporarily" exit the repository.

@aeqz
Copy link
Contributor Author

aeqz commented Dec 13, 2022

Another drawback that I have noticed with having file paths canonicalised is that, on case-insensitive systems, if the file example/dir/file.md exists, then exAMple/DIR/FILe.md is canonicalised as example/dir/file.md, so we cannot report it as invalid even if we wanted a case-sensitive setting such as in #211.

This is why I have not fixed the Golden tests yet: some case-related errors that are reported in CI are not reported in my computer.

src/Xrefcheck/System.hs Outdated Show resolved Hide resolved
@aeqz aeqz force-pushed the aeqz/#197-canonicalize_filepaths branch from 511e5d0 to dd17047 Compare December 13, 2022 13:48
@Martoon-00
Copy link
Member

if the file example/dir/file.md exists, then exAMple/DIR/FILe.md is canonicalised as example/dir/file.md, so we cannot report it as invalid even if we wanted a case-sensitive setting such as in #211.

Oooh, that's a very good spot and a severe issue.

This is a completely different concern compared to #197, so could you please create a separate ticket for this? And under it investigate how GitHub and GitLab handle the casing of filepaths, is it similar to anchors handling or not.

Maybe we will close that new ticket with this PR too.


Now on replacing canonicalizePath: it's a serious take; on the one hand, it's easy to mess up with things here, on the other, given that we try to do things partially manually, we could try to go further and avoid using makeAbsolute.

Let's really go with the weakened canonicalization as you have it now, and please create another ticket for trying to get rid of makeAbsoule and work with filepaths in a pure way.

tests/Test/Xrefcheck/RedirectRequestsSpec.hs Show resolved Hide resolved
src/Xrefcheck/Verify.hs Outdated Show resolved Hide resolved
src/Xrefcheck/System.hs Show resolved Hide resolved
tests/Test/Xrefcheck/CanonicalPathSpec.hs Outdated Show resolved Hide resolved
src/Xrefcheck/RepoInfo.hs Outdated Show resolved Hide resolved
src/Xrefcheck/Verify.hs Outdated Show resolved Hide resolved
-- | A FilePath that has been canonicalized. Should be created via
-- 'canonicalizePath'. Currently, canonical paths have been made absolute,
-- normalised regarding the running platform (e.g. Posix or Windows), with
-- indirections syntractically expanded as much as possible and with no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, what is "syntractically"? Is it a typo?

Nothing -> []

-- | Get a relative 'FilePath' from the second given path (child) with
-- repect to the first one (root).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*respect

prepare :: FilePath -> FilePath
prepare path = case dropWhile FP.isPathSeparator path of
"" -> "."
other -> other
Copy link
Contributor

@YuriRomanowski YuriRomanowski Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this name is confusing a bit, prepare for what? Wouldn't it be better to call it dropLeadingPathSeparators?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will give it a proper name. It also handles specially the empty case, returning "." instead of "".

--
-- Within 'Gather', 'seFile' stores the 'FilePath' corresponding
-- to the file in where the error was found.
data ScanError (a :: ScanStage) = ScanError
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that this 'Trees that grow' style fits in here?

Copy link
Member

@Martoon-00 Martoon-00 Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I like your type families solution here, it makes particular sense. At your place, I would probably write it in the same way :)

However I think that two separate datatypes would work better here, and here's why: our core part serves as sort of API for potential scanners developers, and later there will be a lot of code that works with ScanError 'Parse.

With the current approach, we win at the core side (fewer number of datatypes) and lose at the parsers' side (one has to always initialize that seFile = ()), and the latter is more significant for the code amount and API-wise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be an option to use mkParseScanError instead of initialising seFile to ()? I can also give type aliases to ScanError 'Parse and ScanError 'Gather also.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, we already use a smart constructor there, so avoiding () each time won't be a problem.

Okay, let's go with it.

RIFileLocal -> checkRef rAnchor riRoot file ""
RIFileRelative -> do
let relativeToRoot = getPosixRelativeOrAbsoluteChild riRoot (takeDirectory file)
</> toString rLink
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, AFAIU, this is not always relative, from the function name. But even better it would be to call it shownPath, like inside the checkRef function, it seems that it doesn't really matter that it is relative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually so, I agree here. Having relativeToRoot would make the code slightly more readable IMO, but here it is not necessarily relative.

[ testGroup "Canonicalization"
[ testCase "Trailing separator" $ do
path <- canonicalizePath "./example/dir/"
getPosixRelativeOrAbsoluteChild current path @?= "example/dir",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please format it like this:

exceptions =
  [ InvalidStatusCode
  , MissingContentHeader
  , InternalServerError
  ]

@YuriRomanowski
Copy link
Contributor

I'm perfectly fine with this PR, titanic work! I approved it.

@aeqz
Copy link
Contributor Author

aeqz commented Dec 16, 2022

I will wait also for @Martoon-00 's review. And then, get ready for conflicts in our currently open PRs 🙊

@Martoon-00
Copy link
Member

Oh, looks like hasIndirectionThroughParent is gone now, why so?

@aeqz
Copy link
Contributor Author

aeqz commented Dec 19, 2022

Oh, looks like hasIndirectionThroughParent is gone now, why so?

I added hasIndirectionThroughParent, but then removed it because of the following comment:

my previous concern about poking files outside of the repository does not hold for makeAbsolute. So nevermind my previous comment.

Do we need it for something that I have missed? Or should I keep it just in case it may be useful in the future?

@Martoon-00
Copy link
Member

Do we need it for something that I have missed?

Ah, I didn't put my thought concrete.

My comment "about poking files outside of the repository" only affected where hasIndirectionThroughParent check had to be performed, that we shouldn't run canonicalizePath on paths that are yet unchecked on indirection.

But checking for indirections has to be there in either way, we cannot throw this check away altogether.

My subsequent comment mentioning virtual files meant that we probably want to check on indirections inside checkRef function, at the same position where this check was located before this PR.

@aeqz
Copy link
Contributor Author

aeqz commented Dec 19, 2022

@Martoon-00 right! I misunderstood it. Now the Golden tests output shows the same behaviour as before the refactor regarding both virtual files and links that pass through the repository root.

@aeqz aeqz force-pushed the aeqz/#197-canonicalize_filepaths branch from c095d6e to 742aa00 Compare December 20, 2022 11:08
@Martoon-00
Copy link
Member

Martoon-00 commented Dec 20, 2022

1 Warning
⚠️ Please, only use letters, digits and dashes in the branch name.
Found: ["_"]

Generated by 🚫 Danger

@Martoon-00
Copy link
Member

Feel free to prettify the commits history, I'll make the final review pass then.

Problem: the danger checks were failing because it was configured to
fetch only and partially the current PR branch.

Solution: force the danger checks CI to get all the repository branches.
@aeqz aeqz force-pushed the aeqz/#197-canonicalize_filepaths branch from 5309f71 to 1d3e863 Compare December 20, 2022 16:01
Copy link
Member

@Martoon-00 Martoon-00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your titanic work!

I believe we achieved quite a lot, both from the perspective of code cleanness and in how much better the output now looks like.

I'm leaving one more comment about commits history prettification, and going to approve after that.

src/Xrefcheck/Scanners/Markdown.hs Show resolved Hide resolved
RIFileLocal -> checkRef rAnchor riRoot file ""
RIFileRelative -> do
let relativeToRoot = getPosixRelativeOrAbsoluteChild riRoot (takeDirectory file)
</> toString rLink
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually so, I agree here. Having relativeToRoot would make the code slightly more readable IMO, but here it is not necessarily relative.

Problem: our code contains a where statement with a 5-space indentation.

Solution: indent it with 6 spaces instead.
Problem: the current usage of filepaths is error-prone and can be
simplified.

Solution: canonicalize filepaths at the boundaries, so their management
will be safer and will simplify the codebase.
@aeqz aeqz force-pushed the aeqz/#197-canonicalize_filepaths branch from 1d3e863 to 0886062 Compare December 22, 2022 15:30
@Martoon-00
Copy link
Member

Hmhm, danger complains about Style tags.

But we can actually allow them. You can update danger/helpers.rb to allow both Chore and Style tags.

@aeqz aeqz merged commit 16041d0 into master Dec 22, 2022
@aeqz aeqz deleted the aeqz/#197-canonicalize_filepaths branch December 22, 2022 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor filepath equality checks
3 participants