Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor codebase dictionary into CodeBase class #98

Merged
merged 8 commits into from
Aug 26, 2024

Conversation

Pennycook
Copy link
Contributor

Related issues

Proposed changes

  • Add a new CodeBase class storing all information about which directories make up a code base and which files should be excluded from analysis.
  • Move legacy functionality into CodeBase where appropriate: whether a file should be excluded from analysis is now implemented via __contains__; and listing the contents of a code base is now implemented via __iter__.
  • Rewrite all tests to use CodeBase. Note that most of the changes here are actually related to casting between Path and strings, required because some legacy internals do not consider these representations to be equivalent.
  • Update the documentation and worked example to highlight that tracking all source files in a directory may result in unexpected files being included in the analysis.

Note that although most uses of CodeBase here only use a single directory, the intent is to enable a list of explicit directories to be passed in the future as:

[codebase]
directories = [
  "src1/",
  "src2/",
]

...in order to support analysis of disjoint codebases, and to allow codebasin to be run from any directory.

Represents a code base as a combination of:
- A list of (source) directories; and
- A list of exclude patterns.

The intent of this object is to replace usages of untyped dictionaries to store
code base information, to improve documentation and usability.

Signed-off-by: John Pennycook <[email protected]>
Since a lot of the functionality driven by the tests has not yet been updated
to use Path, a lot of casts between Path and string are introduced here. These
casts will eventually be removed, but require changes to other functionality.

Signed-off-by: John Pennycook <[email protected]>
A side-effect of tracking all source files in the code base is that analysis
can now pick up unexpected files that were previously never encountered in a
compilation database.

For example, any C++ files automatically generated by CMake will be identified
as unused code if codebasin is invoked in a directory containing both build/
and src/ directories.

This commit updates the documentation to highlight the importance of running
codebasin in the right directory (or otherwise separating build and src).

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook added documentation Improvements or additions to documentation enhancement New feature or request labels May 1, 2024
@Pennycook Pennycook added this to the 2.0.0 milestone May 1, 2024
@Pennycook Pennycook requested a review from laserkelvin May 1, 2024 18:13
@Pennycook
Copy link
Contributor Author

The recent failure isn't related to my latest push (which was a simple formatting change). The issue appears to be that one of the tests is relying on behavior that isn't guaranteed: the __iter__ implementation calls rglob, and the operating system is free to return the files in any order.

I'm not sure what we should do here, so I'm interested in your opinion @laserkelvin. The simplest thing to do would probably be to document that we can't guarantee that we'll visit files in any particular order, and then update the test to use assertCountEqual instead of assertEqual.

@laserkelvin
Copy link
Contributor

laserkelvin commented Aug 22, 2024

You need to run sorted on the results returned by rglob to be consistent - at least for a given OS. I think Windows and Linux might potentially come up with different results, however.

codebasin/__init__.py Show resolved Hide resolved
codebasin/__init__.py Outdated Show resolved Hide resolved
codebasin/__init__.py Show resolved Hide resolved
codebasin/__init__.py Show resolved Hide resolved
codebasin/finder.py Outdated Show resolved Hide resolved
tests/macro_expansion/test_macro_expansion.py Show resolved Hide resolved
rglob is not guaranteed to return a sorted list, and the output is
OS-dependent.

We may want to revisit this decision in future, but it will be easier to
further constrain the behavior of the CodeBase iterator than to remove
functionality that users rely on.

Signed-off-by: John Pennycook <[email protected]>
finder.find() expects a string for compatibility with legacy interfaces.
Performing the cast inside of the function is less error-prone, and will
mean fewer updates later when all of the strings are replaced with
Paths.

Signed-off-by: John Pennycook <[email protected]>
Copy link
Contributor

@laserkelvin laserkelvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Pennycook Pennycook merged commit 3e8eae2 into intel:main Aug 26, 2024
3 checks passed
@Pennycook Pennycook deleted the codebase branch August 26, 2024 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Track all source files in the source directory Paths in tests should not be relative to the root
2 participants