Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

is_dirty() is very slow when using diff.astextplain.textconv #1962

Open
idbrii opened this issue Sep 19, 2024 · 1 comment
Open

is_dirty() is very slow when using diff.astextplain.textconv #1962

idbrii opened this issue Sep 19, 2024 · 1 comment

Comments

@idbrii
Copy link
Contributor

idbrii commented Sep 19, 2024

Running is_dirty() on my repo takes 5 minutes because it's a large repo, has text conversion enabled for diffs, and is_dirty() is outputting a full diff. is_dirty() should be a relatively simple operation, but since it uses git diff instead of a plumbing command like git diff-files it incurs the cost of displaying nice output for users.

The diff.astextplain.textconv git option converts pdf files to text before diffing. This option appears to come with msys git. It's useful when diffing interactively, but a lot of overhead when just checking for dirty state.

GitPython doesn't look at the output of the diff, it just checks that it's not empty:

GitPython/git/repo/base.py

Lines 957 to 977 in 3470fb3

# Start from the one which is fastest to evaluate.
default_args = ["--abbrev=40", "--full-index", "--raw"]
if not submodules:
default_args.append("--ignore-submodules")
if path:
default_args.extend(["--", str(path)])
if index:
# diff index against HEAD.
if osp.isfile(self.index.path) and len(self.git.diff("--cached", *default_args)):
return True
# END index handling
if working_tree:
# diff index against working tree.
if len(self.git.diff(*default_args)):
return True
# END working tree handling
if untracked_files:
if len(self._get_untracked_files(path, ignore_submodules=not submodules)):
return True
# END untracked files
return False

If we switch from the diff to diff-index, we can see that it's comparable in speed to turning off the text conversions:

$ time (git diff --abbrev=40 --full-index --raw | cat)
...snip...
:100644 100644 66130ffa9aa8b4bc98a1918a946919f94c9a819d 0000000000000000000000000000000000000000 M      src/thread.cpp

real    5m14.842s
user    1m6.922s
sys     4m20.000s


$ time (git diff-index --quiet HEAD -- && echo "clean" || echo "dirty")

real    0m4.517s
user    0m0.203s
sys     0m32.281s


$ echo "*.pdf	-diff=astextplain" > .gitattributes

$ time (git diff --abbrev=40 --full-index --raw | cat)
...snip...
:100644 100644 66130ffa9aa8b4bc98a1918a946919f94c9a819d 0000000000000000000000000000000000000000 M      src/thread.cpp

real    0m4.394s
user    0m0.109s
sys     0m32.250s

(These timings are all after running these command several times. When I first ran git diff it took 26 minutes!)

Workaround

Add a .gitattributes that disables the text conversion:

*.pdf	-diff=astextplain

Solution

I think is_dirty should instead use diff-files and diff-index. This answer looks like a good explanation of how they work.

Here's my rough replacement for is_dirty:

import git


def is_dirty(
    repo,
    index: bool = True,
    working_tree: bool = True,
    untracked_files: bool = False,
    submodules: bool = True,
    path=None,
):
    default_args = []
    if submodules:
        default_args.append("--ignore-submodules")

    if index:
        try:
            # Always want to end with -- (even without path).
            args = default_args + ["--quiet", "--cached", "HEAD", "--", path]
            repo.git.diff_index(*args)
        except git.exc.GitCommandError:
            return True
    if working_tree:
        try:
            args = default_args + ["--quiet", "--", path]
            repo.git.diff_files(*args)
        except git.exc.GitCommandError:
            return True
    if untracked_files:
        if len(repo._get_untracked_files(path, ignore_submodules=not submodules)):
            return True
    return False


repo = git.Repo.init("c:/code/project")
print("is_dirty", is_dirty(repo))

I'll try to find time to make a proper patch if that sounds good.

@Byron
Copy link
Member

Byron commented Sep 20, 2024

Thanks so much for this wonderful issue, and for figuring out a solution as well!
Yes, please do post a PR for this, the approach sounds perfect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants