git - How to troubleshoot an abnormally slow git-diff?

Tags: git git-diff

I recently cloned a remote repo in which some git commands run very slowly. For example, running

git diff --quiet

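...takes about 40 seconds. (For what it's worth, the repo is clean. I am using git version 2.20.1.)

While trying to find out what causes this slowness, I stumbled on some procedures that eliminate it, although I don't know why.

Of these procedures, the simplest/fastest one I have found is this: (starting from a freshly cloned instance of the repo) create a branch off master, then check it out. After this, if I check out master again, git diff --quiet completes quickly (under 50 ms).

Below is a sample interaction, showing timing information for the various operations [1]: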
rm -rf ./"$REPONAME"      #  0.174 s
git clone "$URL"          # 54.118 s
cd ./"$REPONAME"          #  0.007 s

git diff --quiet          # 39.438 s

git branch VOODOO         #  0.032 s
git checkout VOODOO       # 31.247 s
git diff --quiet          #  0.014 s

git checkout master       #  0.034 s
git diff --quiet          #  0.012 s

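As I have already stressed, this is only one of several possible procedures that "fix" the repo, and they are all equally mysterious to me. It is just the simplest/fastest one I have found.

The timing sequence above is very reproducible (i.e., every time I run that particular sequence exactly as shown, I get roughly the same timings).

It is, however, quite sensitive to seemingly small changes. For example, if I replace git branch VOODOO; git checkout VOODOO with git checkout -b VOODOO, the subsequent timing profile changes radically: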
rm -rf ./"$REPONAME"      #  0.015 s
git clone "$URL"          # 45.312 s
cd ./"$REPONAME"          #  0.007 s

git diff --quiet          # 46.145 s

git checkout -b VOODOO    # 42.363 s
git diff --quiet          # 41.180 s

git checkout master       # 47.345 s
git diff --quiet          #  0.018 s

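I would like to figure out what is going on. How can I troubleshoot the problem further?

Is there a permanent ("committable") way to "fix" the repo? (By "fix" I mean: get rid of the long delays in git diff --quiet, git checkout ..., etc.)

(By the way, git gc does not fix the repo, not even temporarily; I tried it.)

I figure that what ultimately "fixes" the repo is that git starts building and caching some auxiliary data structure that lets it perform certain operations efficiently. If this hypothesis is correct, my question can be rephrased as: what is the most direct way to cause git to build such auxiliary data structures?

EDIT: One more piece of information that may shed light on the above is that the repository contains one exceptionally large (1GB) file. (This explains the slowness of the git clone step. I don't know whether it is related to the slowness of git diff --quiet etc., and if so, how.)

[1] Needless to say, I named the branch VOODOO to reflect my ignorance of what is going on.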

Best answer

First, check whether the issue persists with Git 2.27, or even the upcoming 2.28 (Q3 2020).
I would use GIT_TRACE2_PERF to measure performance (as I did here).
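
As a minimal sketch of what such a measurement could look like: the script below records a trace2 performance log for git diff --quiet. It assumes Git 2.22 or later (where trace2 was introduced), and the scratch repository and the trace_file name are hypothetical stand-ins so the example runs on its own; in practice the traced command would be run inside the slow clone.

```shell
#!/bin/sh
# Sketch: capture a trace2 performance log for `git diff --quiet`.
# Assumes Git 2.22 or later; older versions ignore GIT_TRACE2_PERF.
set -e

# Hypothetical scratch repository so the example is self-contained.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo "hello" > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# An absolute file path makes Git append the perf-format trace to that file.
trace_file=$(mktemp)
GIT_TRACE2_PERF="$trace_file" git diff --quiet

# region_enter/region_leave lines bracket the phases Git instruments;
# there may be none for a tiny repository, hence the `|| true`.
grep -E 'region_(enter|leave)' "$trace_file" || true
echo "full trace written to $trace_file"
```

Comparing the elapsed-time columns of a slow run and a fast run in the same clone may show which phase accounts for the difference.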
With Git 2.28 (Q3 2020), memory usage during "diff --quiet" in a working tree with too many stat-unmatched paths has been greatly reduced.
Its patch description illustrates a use case where "diff --quiet" can be slow:
See commit d2d7fbe by Jeff King (peff) (1 Jun 2020).
(Merged by Junio C Hamano -- gitster -- in commit 0cd0afc, 18 Jun 2020)

diff: discard blob data from stat-unmatched pairs

Reported-by: Jan Christoph Uhde
Signed-off-by: Jeff King


When performing a tree-level diff against the working tree, we may find that our index stat information is dirty, so we queue a filepair to be examined later.
If the actual content hasn't changed, we call this a stat-unmatch; the stat information was out of date, but there's no actual diff.

Normally diffcore_std() would detect and remove these identical filepairs via diffcore_skip_stat_unmatch().

However, when "--quiet" is used, we want to stop the diff as soon as we see any changes, so we check for stat-unmatches immediately in diff_change().

That check may require us to actually load the file contents into the pair of diff_filespecs.
If we find that the pair isn't a stat-unmatch, then no big deal; we'd likely load the contents later anyway to generate a patch, do rename detection, etc, so we want to hold on to it.
But if it is a stat-unmatch, then we have no more use for that data; the whole point is that we're going to discard the pair. However, we never free the allocated diff_filespec data.

In most cases, keeping that data isn't a problem. We don't expect a lot of stat-unmatch entries, and since we're using --quiet, we'd quit as soon as we saw such a real change anyway.

However, there are extreme cases where it makes a big difference:

  1. We'd generally mmap() the working tree half of the pair.
    And since the OS may limit the total number of maps, we can run afoul of this in large repositories. E.g.:

    $ cd linux
    $ git ls-files | wc -l
    67959
    $ sysctl vm.max_map_count
    vm.max_map_count = 65530
    $ git ls-files | xargs touch ;# everything is stat-dirty!
    $ git diff --quiet
    fatal: mmap failed: Cannot allocate memory
    

It should be unusual to have so many files stat-dirty, but it's possible if you've just run a script like "sed -i" or similar.

After this patch, the above correctly exits with code 0.

  2. Even if you don't hit mmap limits, the index half of the pair will have been pulled from the object database into heap memory.
    Again in a clone of linux.git, running:

    $ git ls-files | head -n 10000 | xargs touch
    $ git diff --quiet
    

peaks at 145MB heap before this patch, and 94MB after.

This patch solves the problem by freeing any diff_filespec data we picked up during the "--quiet" stat-unmatch check in diff_change().
Nobody is going to need that data later, so there's no point holding on to it.
There are a few things to note:

  • we could skip queueing the pair entirely, which could in theory save a little work. But there's not much to save, as we need a diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
    And since we cache the result of the stat-unmatch checks, a later call to diffcore_skip_stat_unmatch() will quickly skip over them.
    The diffcore code also counts up the number of stat-unmatched pairs as it removes them. It's doubtful any callers would care about that in combination with --quiet, but we'd have to reimplement the logic here to be on the safe side. So it's not really worth the trouble.

  • I didn't write a test, because we always produce the correct output unless we run up against system mmap limits, which are both unportable and expensive to test against. Measuring peak heap would be interesting, but our perf suite isn't yet capable of that.

  • note that diff without "--quiet" does not suffer from the same problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch entries and drop them immediately, so we're not carrying their data around.

  • you can still trigger the mmap limit problem if you truly have that many files with actual changes. But it's rather unlikely. The stat-unmatch check avoids loading the file contents if the sizes don't match, so you'd need a pretty trivial change in every single file.
    Likewise, inexact rename detection might load the data for many files all at once. But you'd need not just 64k changes, but that many deletions and additions. The most likely candidate is perhaps break-detection, which would load the data for all pairs and keep it around for the content-level diff. But again, you'd need 64k actually changed files in the first place.

So it's still possible to trigger this case, but it seems like "I accidentally made all my files stat-dirty" is the most likely case in the real world.
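
The "stat-dirty but content-clean" state described in this commit message can be reproduced on a small scale. The following is a rough sketch under stated assumptions, not a diagnosis of the repository in the question: the scratch repository, file names, and the diff_status variable are hypothetical. It touches every tracked file, confirms that git diff --quiet still reports no changes, and then refreshes the cached stat information with git update-index --refresh.

```shell
#!/bin/sh
# Sketch: make tracked files stat-dirty without changing their content.
set -e

# Hypothetical scratch repository with a few small files.
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"
for i in 1 2 3; do echo "content $i" > "file$i.txt"; done
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Changing only the mtime makes the index stat data stale for every file.
sleep 1
git ls-files | xargs touch

# With stale stat data, diff --quiet has to compare contents;
# exit code 0 means no real changes were found.
git diff --quiet
diff_status=$?
echo "diff --quiet exit code: $diff_status"

# Rewrite the cached stat information so later comparisons can trust it.
git update-index -q --refresh
echo "index refreshed"
```

If a freshly cloned repository turns out to be in this state, refreshing the index this way (or running git status, which also refreshes it) may avoid repeated content comparisons; whether that applies to the repository in the question is an assumption that a trace would have to confirm.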



With Git 2.30 (Q1 2021), "git diff" and other commands that share the same machinery for comparing against working tree files have been taught to take advantage of fsmonitor data when it is available.
See commit 2bfa953, commit 471b115, commit ed5a245, commit 89afd5f, commit 5851462, commit dc69d47 (20 Oct 2020) by Nipunn Koorapati (nipunn1313).
See commit c9052a8 (20 Oct 2020) by Alex Vandiver (alexmv).
(Merged by Junio C Hamano -- gitster -- in commit bf69da5, 09 Nov 2020)

t/perf: add fsmonitor perf test for git diff

Signed-off-by: Nipunn Koorapati


Results for the git-diff fsmonitor optimization in patch in the parent-rev (using a 400k file repo to test)

As you can see here - git diff with fsmonitor running is significantly better with this patch series (80% faster on my workload)!

GIT_PERF_LARGE_REPO=~/src/server ./run v2.29.0-rc1 . -- p7519-fsmonitor.sh

Test                                                                     v2.29.0-rc1       this tree
-----------------------------------------------------------------------------------------------------------------
7519.2: status (fsmonitor=.git/hooks/fsmonitor-watchman)                 1.46(0.82+0.64)   1.47(0.83+0.62) +0.7%
7519.3: status -uno (fsmonitor=.git/hooks/fsmonitor-watchman)            0.16(0.12+0.04)   0.17(0.12+0.05) +6.3%
7519.4: status -uall (fsmonitor=.git/hooks/fsmonitor-watchman)           1.36(0.73+0.62)   1.37(0.76+0.60) +0.7%
7519.5: diff (fsmonitor=.git/hooks/fsmonitor-watchman)                   0.85(0.22+0.63)   0.14(0.10+0.05) -83.5%
7519.6: diff -- 0_files (fsmonitor=.git/hooks/fsmonitor-watchman)        0.12(0.08+0.05)   0.13(0.11+0.02) +8.3%
7519.7: diff -- 10_files (fsmonitor=.git/hooks/fsmonitor-watchman)       0.12(0.08+0.04)   0.13(0.09+0.04) +8.3%
7519.8: diff -- 100_files (fsmonitor=.git/hooks/fsmonitor-watchman)      0.12(0.07+0.05)   0.13(0.07+0.06) +8.3%
7519.9: diff -- 1000_files (fsmonitor=.git/hooks/fsmonitor-watchman)     0.12(0.09+0.04)   0.13(0.08+0.05) +8.3%
7519.10: diff -- 10000_files (fsmonitor=.git/hooks/fsmonitor-watchman)   0.14(0.09+0.05)   0.13(0.10+0.03) -7.1%
7519.12: status (fsmonitor=)                                             1.67(0.93+1.49)   1.67(0.99+1.42) +0.0%
7519.13: status -uno (fsmonitor=)                                        0.37(0.30+0.82)   0.37(0.33+0.79) +0.0%
7519.14: status -uall (fsmonitor=)                                       1.58(0.97+1.35)   1.57(0.86+1.45) -0.6%
7519.15: diff (fsmonitor=)                                               0.34(0.28+0.83)   0.34(0.27+0.83) +0.0%
7519.16: diff -- 0_files (fsmonitor=)                                    0.09(0.06+0.04)   0.09(0.08+0.02) +0.0%
7519.17: diff -- 10_files (fsmonitor=)                                   0.09(0.07+0.03)   0.09(0.06+0.05) +0.0%
7519.18: diff -- 100_files (fsmonitor=)                                  0.09(0.06+0.04)   0.09(0.06+0.04) +0.0%
7519.19: diff -- 1000_files (fsmonitor=)                                 0.09(0.06+0.04)   0.09(0.05+0.05) +0.0%
7519.20: diff -- 10000_files (fsmonitor=)                                0.10(0.08+0.04)   0.10(0.06+0.05) +0.0%

I also added a benchmark for a tiny git diff workload w/ a pathspec. I see an approximately .02 second overhead added w/ and w/o fsmonitor.

From looking at these results, I suspected that refresh_fsmonitor is already happening during git diff - independent of this patch series' optimization.
Confirmed that suspicion by breaking on refresh_fsmonitor.

(gdb) bt  [simplified]
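
The benchmark rows above use fsmonitor=.git/hooks/fsmonitor-watchman. As a hedged sketch of how that setting might be applied: the script below points core.fsmonitor at the Watchman hook in a hypothetical scratch repository and reads the value back. It assumes Watchman is installed whenever Git actually invokes the hook; the script itself only touches configuration, and the fsmonitor_setting variable is introduced purely for illustration.

```shell
#!/bin/sh
# Sketch: enable the Watchman-based fsmonitor hook for one repository.
# Assumes Watchman is installed when Git later runs the hook.
set -e

# Hypothetical scratch repository; use the real clone in practice.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Git usually ships the hook as a sample; copy it into place if present.
if [ -f .git/hooks/fsmonitor-watchman.sample ]; then
    cp .git/hooks/fsmonitor-watchman.sample .git/hooks/fsmonitor-watchman
    chmod +x .git/hooks/fsmonitor-watchman
fi

# Same hook path as in the benchmark table.
git config core.fsmonitor .git/hooks/fsmonitor-watchman
fsmonitor_setting=$(git config --get core.fsmonitor)
echo "core.fsmonitor = $fsmonitor_setting"
```

Unsetting it again with git config --unset core.fsmonitor returns the repository to the fsmonitor= rows of the table.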

Regarding "git - How to troubleshoot an abnormally slow git-diff?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57523348/
