Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change working directory of C++ parser #572

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

andocz
Copy link
Contributor

@andocz andocz commented Mar 28, 2022

This PR makes Clang tool invocations use the working directories of each compilation command. This fixes relative paths in arguments not being handled properly (#571), as previously the tool ran in the default working directory of ..

Callers of SourceManager::getFile in cppparser were changed to provide an absolute path.

Additionally, inaccurate comments in SourceManager::getFile were corrected.

Fixes compilation commands containing relative paths not getting parsed
properly.
Reverts functional change of 04826d2 as it's no longer necessary.
@mcserep mcserep linked an issue Mar 28, 2022 that may be closed by this pull request
@mcserep mcserep added Kind: Bug ⚠️ Kind: Important 🥇 Plugin: C++ Issues related to the parsing and presentation of C++ projects. labels Mar 28, 2022
@mcserep mcserep added this to the Release Gershwin milestone Mar 28, 2022
@mcserep mcserep requested a review from whisperity March 28, 2022 11:21
@andocz andocz marked this pull request as draft March 28, 2022 12:13
@andocz
Copy link
Contributor Author

andocz commented Mar 28, 2022

It turns out there's a bug I didn't catch: if the build directory doesn't exist, the parsing fails, because RealFileSystem requires that the chdir target exists. To fix this, I'll have to reimplement RealFileSystem without this check.

@whisperity
Copy link
Collaborator

It turns out there's a bug I didn't catch: if the build directory doesn't exist, the parsing fails, because RealFileSystem requires that the chdir target exists. To fix this, I'll have to reimplement RealFileSystem without this check.

I do not understand this. How is it possible to have a build generated with CMake (which creates directory structures) or logged (in which case the build was executed and should have created the directories) but do not have the "build directory" (what does this refer to?) exist?

Is it meaningful to parse a file in such a context? If the directory where the build was executed (is this what "build directory" refers to?) doesn't even exist, how was the build executed? We'll end up having issues with (potentially generated?) files gone missing...

In general, I would prefer if we tried not to reimplement things from LLVM too much, unless really needed. We're way behind the upstream and all these reimplements will cause problems when we try supporting newer releases of Ubuntu even.

Perhaps we could either create the directory, or give a hard error and have the user reconsider what they messed up with their build system...

@whisperity
Copy link
Collaborator

Okay, I tried giving a second read to the patch itself but I am still confused. What do we need the chdir for, and why do we try chdiring to that directory? Or is that chdir needed by Clang instead? What about VirtualFileSystem? Can't we create a VirtualFileSystem that always says "Yes, sir, this directory surely exists!" in case it doesn't really exist on the real device, and overlay that on top of the RFS in the VFS stack the ClangTool uses and then make clang believe that directory is there?

@andocz
Copy link
Contributor Author

andocz commented Mar 28, 2022

How is it possible to have a build generated with CMake (which creates directory structures) or logged (in which case the build was executed and should have created the directories) but do not have the "build directory" (what does this refer to?) exist?

It can happen if the build directory is selectively transferred between computers, like what happens on the CI where we zip up just the source files. The build directory (where CMake was invoked) exists here actually, but the compilation directory ("directory" in compile_commands.json) of some commands doesn't, since directories containing no source files go missing.

We'll end up having issues with (potentially generated?) files gone missing...

We can have the compilation dependencies all intact with only compilation directories containing none missing, which is what happened on the CI.

In general, I would prefer if we tried not to reimplement things from LLVM too much, unless really needed.

Oh, I didn't mean reimplementing it from scratch, just creating a wrapper FileSystem subclass that adjust arguments before passing them on to the methods of a RealFileSystem instance, converting relative paths to absolute ones. It would work like ProxyFileSystem.

Or is that chdir needed by Clang instead?

Yes, it chdirs to the directory of each compilation command before running the tool here. This is what will make future references to relative paths get resolved correctly.

Can't we create a VirtualFileSystem that always says [...]

I don't think that would work. RealFileSystem uses llvm::sys::fs::is_directory to verify that the work dir exists; the ClangTool's stack isn't used at this level.

But even if it worked, I think my "redirecting wrapper class" idea is more straightforward.

Fixes the previous implementation not working if compilation directories are
missing.
@andocz andocz marked this pull request as ready for review March 29, 2022 05:30
@whisperity
Copy link
Collaborator

Oh, there is ProxyFileSystem now... neat. Back when I implemented reparse, that wasn't a thing yet. The setCWD would've worked though, because that's also what reparse uses:

std::error_code
DatabaseFileSystem::setCurrentWorkingDirectory(const Twine& path_)
{
SmallString<128> fullPath;
sys::fs::make_absolute(_currentWorkingDirectory, fullPath);
sys::path::append(fullPath, path_);
std::error_code error = sys::fs::make_absolute(fullPath);
if (!error)
_currentWorkingDirectory = fullPath.str().str();
getCurrentWorkingDirectory();
return error;
}

And in this case, almost none (apart from Clang's own intrinsic headers that install with CodeCompass...) of the directories exist on the server. And the overlay is passed to ClangTool:

ClangTool tool(
*compileDb, getFilenameForId(fileId_),
std::make_shared<clang::PCHContainerOperations>(), overlayFs);

@andocz
Copy link
Contributor Author

andocz commented Mar 30, 2022

The reason missing directories never caused an issue with reparse is because it doesn't set the working directory properly. It's using the default value of . from FixedCompilationDatabase::loadFromCommandLine. It's a good thing you mentioned reparse, because I realized it's still broken with relative paths.

DatabaseFileSystem doesn't actually use its _currentWorkingDirectory for anything. It passes the paths it receives as-is to getFile. Also, setCurrentWorkingDirectory doesn't work properly if path_ is absolute; the path_ gets appended to the work dir instead of being used in itself. These will have to be corrected.

This name reflects its purpose better.
It didn't work correctly (`<existing_dir>/<non_existing_dir>/..` still gives an
error), and if it did, it would be making potentially incorrect assumptions
about the file system.

This reverts commit 19cda25.
This reverts commit 96345c4.
The reparser needs this information to correctly execute its commands.
@andocz
Copy link
Contributor Author

andocz commented Apr 5, 2022

The reparser properly supports relative paths now. I had to add a new table to the database (BuildDirectory) for this.

I decided to scrap the FakeWorkDirFileSystem, because to make it work properly, it would need to resolve ..s in paths lexically, which is too messy for my liking. (I can think of symlinks as a case when this would produce wrong results.) Now if the build directory is missing, a warning is emitted and / is used as a fallback.

For parsing to fail, a build command needs to have its directory missing and use a relative path. I don't think it's worth going out of our way to recover from this scenario given how unlikely it is. You suggested creating the build directory, but I think the parser shouldn't modify its inputs.

struct BuildAction;

#pragma db object
struct BuildDirectory
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this has to be a brand new table, foreign key, and all that jazz for the storage. BuildAction already contains string command;, for example. If "directory" is just another member of the entries in a compilation database JSON (just like "command"), then why isn't this reflected in the schema? Every build action MUST have a build directory (it just may or may not exist or point to anything meaningful).

This struct, as it is now, won't do uniqueing of build directories either, because it is the build directory that refers to one build action each. If we want a separate record to only store one directory ("string") exactly once, then it is the buildaction that has to refer into this relation's key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep backwards compatibility with existing databases, that's why I didn't want to change BuildAction's schema. Is that not a priority? In that case you're right, storing it in BuildAction would make more sense.

Deduplicating the strings is not worth the hassle, in the CodeCompass SQLite database BuildAction only takes up 0.012% of the space.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can easily reparse projects, backward-compatibility with existing databases is not a high priority.

@mcserep mcserep added the Status: At risk 🚨 Issue or PR which got stuck and progress in work is at risk. label Feb 11, 2023
@mcserep mcserep removed this from the Release Gershwin milestone Feb 5, 2024
@mcserep mcserep added this to the Release H* milestone Feb 5, 2024
@mcserep
Copy link
Collaborator

mcserep commented Apr 9, 2024

The PR should also handle relative paths for the file in the compilation command source.

As part of #637, it was discover that in case the directory is /source/, the file is file.cpp and the execution directory of CodeCompass_parser is /cc/, then instead of /source/file.cpp, the non-existent file /cc/file.cpp is being parse, leading to error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kind: Bug ⚠️ Kind: Important 🥇 Plugin: C++ Issues related to the parsing and presentation of C++ projects. Status: At risk 🚨 Issue or PR which got stuck and progress in work is at risk.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Relative include flags with spaces cause parsing failure
3 participants