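I recently cloned a remote repository in which some git commands run extremely slowly. For example, running
git diff --quiet
...takes about 40 seconds. (For what it's worth, the repo is clean. I am using git version 2.20.1.)
While trying to find the cause of this slowness, I came across several procedures that make it go away, although I don't know why.
Of these procedures, the simplest/fastest one I've found is this: (starting from a freshly cloned instance of the repo) create a branch off master and check it out. After this, if I check out master again, git diff --quiet completes quickly (under 50 ms).
Below is a sample interaction, showing timing information for the various operations¹: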
rm -rf ./"$REPONAME" # 0.174 s
git clone "$URL" # 54.118 s
cd ./"$REPONAME" # 0.007 s
git diff --quiet # 39.438 s
git branch VOODOO # 0.032 s
git checkout VOODOO # 31.247 s
git diff --quiet # 0.014 s
git checkout master # 0.034 s
git diff --quiet # 0.012 s
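As I've already stressed, this is just one of several possible procedures that "fix" the repo, and they are all equally mysterious to me. This one just happens to be the simplest/fastest one I've found.
The timing sequence above is highly reproducible (i.e., every time I run that exact sequence as shown, I get roughly the same timings).
It is, however, very sensitive to seemingly small changes. For example, if I replace
git branch VOODOO; git checkout VOODOO
with
git checkout -b VOODOO
the subsequent timing profile changes radically:
rm -rf ./"$REPONAME" # 0.015 s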
git clone "$URL" # 45.312 s
cd ./"$REPONAME" # 0.007 s
git diff --quiet # 46.145 s
git checkout -b VOODOO # 42.363 s
git diff --quiet # 41.180 s
git checkout master # 47.345 s
git diff --quiet # 0.018 s
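I'd like to figure out what is going on. How can I troubleshoot this further?
Is there a permanent ("committable") way to "fix" the repo? (By "fix" I mean: get rid of the long delays in git diff --quiet, git checkout ..., etc.)
(BTW, git gc does not fix the repo, not even temporarily; I've tried it.)
I figure that what ultimately "fixes" the repo is that git builds and caches some auxiliary data structure that lets it perform certain operations efficiently. If this hypothesis is correct, then my question can be rephrased as: what is the most direct way to get git to build such an auxiliary data structure?
EDIT: One more piece of information that may shed light on the above is that the repository contains one particularly large (1 GB) file. (This explains the slowness of the git clone step. I don't know whether it has anything to do with the slowness of git diff --quiet, etc., and if so, how.)
¹ Needless to say, I named the branch VOODOO to reflect my ignorance of what is going on.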
Best answer
First check whether the problem persists with Git 2.27, or even the upcoming 2.28 (Q3 2020).
I would use GIT_TRACE2_PERF to measure any performance. (As I did here.)
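As a minimal sketch of that kind of measurement, the script below runs git diff --quiet with GIT_TRACE2_PERF pointing at a log file and then prints the recorded regions and timings. It assumes Git 2.22 or later (the trace2 facility is not present in the asker's 2.20.1). The throwaway repository and the trace_file variable are only there to make the example self-contained.

```shell
#!/bin/sh
# Hypothetical scratch repository, used only so the example runs anywhere.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
echo "hello" > file.txt
git add file.txt

# GIT_TRACE2_PERF takes an absolute path; trace lines are appended to that file.
trace_file=$(mktemp)

GIT_TRACE2_PERF="$trace_file" git diff --quiet
diff_rc=$?

# Each line carries elapsed-time columns plus the region or event name,
# which shows where the command spent its time.
cat "$trace_file"
echo "git diff --quiet exited with $diff_rc"
```

In the slow clone, the same environment variable can be set in front of the 40-second git diff --quiet call to see which region accounts for the delay.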
With Git 2.28 (Q3 2020), the memory usage during "diff --quiet" in a worktree with too many stat-unmatched paths has been reduced significantly.
Its patch description illustrates a use case where "diff --quiet" can be slow:
See commit d2d7fbe (01 Jun 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0cd0afc, 18 Jun 2020)
diff: discard blob data from stat-unmatched pairs
Reported-by: Jan Christoph Uhde
Signed-off-by: Jeff King
When performing a tree-level diff against the working tree, we may find that our index stat information is dirty, so we queue a filepair to be examined later.
If the actual content hasn't changed, we call this a stat-unmatch; the stat information was out of date, but there's no actual diff.
Normally diffcore_std() would detect and remove these identical filepairs via diffcore_skip_stat_unmatch().
However, when "--quiet" is used, we want to stop the diff as soon as we see any changes, so we check for stat-unmatches immediately in diff_change().
That check may require us to actually load the file contents into the pair of diff_filespecs.
If we find that the pair isn't a stat-unmatch, then no big deal; we'd likely load the contents later anyway to generate a patch, do rename detection, etc, so we want to hold on to it.
But if it is a stat-unmatch, then we have no more use for that data; the whole point is that we're going to discard the pair. However, we never free the allocated diff_filespec data.
In most cases, keeping that data isn't a problem. We don't expect a lot of stat-unmatch entries, and since we're using --quiet, we'd quit as soon as we saw such a real change anyway.
However, there are extreme cases where it makes a big difference:
1. We'd generally mmap() the working tree half of the pair. And since the OS may limit the total number of maps, we can run afoul of this in large repositories. E.g.:
       $ cd linux
       $ git ls-files | wc -l
       67959
       $ sysctl vm.max_map_count
       vm.max_map_count = 65530
       $ git ls-files | xargs touch ;# everything is stat-dirty!
       $ git diff --quiet
       fatal: mmap failed: Cannot allocate memory
   It should be unusual to have so many files stat-dirty, but it's possible if you've just run a script like "sed -i" or similar. After this patch, the above correctly exits with code 0.
2. Even if you don't hit mmap limits, the index half of the pair will have been pulled from the object database into heap memory. Again in a clone of linux.git, running:
       $ git ls-files | head -n 10000 | xargs touch
       $ git diff --quiet
   peaks at 145MB heap before this patch, and 94MB after.
This patch solves the problem by freeing any diff_filespec data we picked up during the "--quiet" stat-unmatch check in diff_change().
Nobody is going to need that data later, so there's no point holding on to it.
There are a few things to note:
- we could skip queueing the pair entirely, which could in theory save a little work. But there's not much to save, as we need a diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway. And since we cache the result of the stat-unmatch checks, a later call to diffcore_skip_stat_unmatch() will quickly skip over them. The diffcore code also counts up the number of stat-unmatched pairs as it removes them. It's doubtful any callers would care about that in combination with --quiet, but we'd have to reimplement the logic here to be on the safe side. So it's not really worth the trouble.
- I didn't write a test, because we always produce the correct output unless we run up against system mmap limits, which are both unportable and expensive to test against. Measuring peak heap would be interesting, but our perf suite isn't yet capable of that.
- note that diff without "--quiet" does not suffer from the same problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch entries and drop them immediately, so we're not carrying their data around.
- you can still trigger the mmap limit problem if you truly have that many files with actual changes. But it's rather unlikely. The stat-unmatch check avoids loading the file contents if the sizes don't match, so you'd need a pretty trivial change in every single file. Likewise, inexact rename detection might load the data for many files all at once. But you'd need not just 64k changes, but that many deletions and additions. The most likely candidate is perhaps break-detection, which would load the data for all pairs and keep it around for the content-level diff. But again, you'd need 64k actually changed files in the first place. So it's still possible to trigger this case, but it seems like "I accidentally made all my files stat-dirty" is the most likely case in the real world.
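The "stat-dirty" state described in that commit message can be reproduced on a tiny scale. The following is a sketch under stated assumptions, not a diagnosis of the asker's repository: it uses a hypothetical scratch repo, makes one file stat-dirty by changing only its mtime, and then uses git update-index --refresh to rewrite the cached stat data. The plumbing command git diff-files is used as the probe because it reports entries whose cached stat data no longer matches, without comparing contents.

```shell
#!/bin/sh
# Hypothetical scratch repository, used only for illustration.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
echo "hello" > file.txt
git add file.txt

# Change only the modification time; the content stays identical.
touch -t 200001010000 file.txt

# The cached stat data in the index no longer matches the file,
# so the plumbing diff flags the entry (non-zero exit status).
git diff-files --quiet
before_rc=$?

# Re-stat the working tree and update the index entries whose
# content turns out to be unchanged.
git update-index -q --refresh

# With fresh stat data the entry is clean again.
git diff-files --quiet
after_rc=$?

echo "before refresh: $before_rc, after refresh: $after_rc"
```

Whether stale stat data is what makes the 1 GB clone slow is an assumption that would need to be confirmed with a trace; the sketch only shows the mechanism the commit message refers to.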
With Git 2.30 (Q1 2021), "git diff" and other commands that share the same machinery to compare with working tree files have been taught to take advantage of the fsmonitor data when available.
See commit 2bfa953, commit 471b115, commit ed5a245, commit 89afd5f, commit 5851462, commit dc69d47 (20 Oct 2020) by Nipunn Koorapati (nipunn1313).
See commit c9052a8 (20 Oct 2020) by Alex Vandiver (alexmv).
(Merged by Junio C Hamano -- gitster -- in commit bf69da5, 09 Nov 2020)
t/perf: add fsmonitor perf test for git diff
Signed-off-by: Nipunn Koorapati
Results for the git-diff fsmonitor optimization in patch in the parent-rev (using a 400k file repo to test).
As you can see here, git diff with fsmonitor running is significantly better with this patch series (80% faster on my workload)!
    GIT_PERF_LARGE_REPO=~/src/server ./run v2.29.0-rc1 . -- p7519-fsmonitor.sh
Test                                                                    v2.29.0-rc1       this tree
---------------------------------------------------------------------------------------------------------------
7519.2: status (fsmonitor=.git/hooks/fsmonitor-watchman)                1.46(0.82+0.64)   1.47(0.83+0.62) +0.7%
7519.3: status -uno (fsmonitor=.git/hooks/fsmonitor-watchman)           0.16(0.12+0.04)   0.17(0.12+0.05) +6.3%
7519.4: status -uall (fsmonitor=.git/hooks/fsmonitor-watchman)          1.36(0.73+0.62)   1.37(0.76+0.60) +0.7%
7519.5: diff (fsmonitor=.git/hooks/fsmonitor-watchman)                  0.85(0.22+0.63)   0.14(0.10+0.05) -83.5%
7519.6: diff -- 0_files (fsmonitor=.git/hooks/fsmonitor-watchman)       0.12(0.08+0.05)   0.13(0.11+0.02) +8.3%
7519.7: diff -- 10_files (fsmonitor=.git/hooks/fsmonitor-watchman)      0.12(0.08+0.04)   0.13(0.09+0.04) +8.3%
7519.8: diff -- 100_files (fsmonitor=.git/hooks/fsmonitor-watchman)     0.12(0.07+0.05)   0.13(0.07+0.06) +8.3%
7519.9: diff -- 1000_files (fsmonitor=.git/hooks/fsmonitor-watchman)    0.12(0.09+0.04)   0.13(0.08+0.05) +8.3%
7519.10: diff -- 10000_files (fsmonitor=.git/hooks/fsmonitor-watchman)  0.14(0.09+0.05)   0.13(0.10+0.03) -7.1%
7519.12: status (fsmonitor=)                                            1.67(0.93+1.49)   1.67(0.99+1.42) +0.0%
7519.13: status -uno (fsmonitor=)                                       0.37(0.30+0.82)   0.37(0.33+0.79) +0.0%
7519.14: status -uall (fsmonitor=)                                      1.58(0.97+1.35)   1.57(0.86+1.45) -0.6%
7519.15: diff (fsmonitor=)                                              0.34(0.28+0.83)   0.34(0.27+0.83) +0.0%
7519.16: diff -- 0_files (fsmonitor=)                                   0.09(0.06+0.04)   0.09(0.08+0.02) +0.0%
7519.17: diff -- 10_files (fsmonitor=)                                  0.09(0.07+0.03)   0.09(0.06+0.05) +0.0%
7519.18: diff -- 100_files (fsmonitor=)                                 0.09(0.06+0.04)   0.09(0.06+0.04) +0.0%
7519.19: diff -- 1000_files (fsmonitor=)                                0.09(0.06+0.04)   0.09(0.05+0.05) +0.0%
7519.20: diff -- 10000_files (fsmonitor=)                               0.10(0.08+0.04)   0.10(0.06+0.05) +0.0%
I also added a benchmark for a tiny git diff workload w/ a pathspec. I see an approximately .02 second overhead added w/ and w/o fsmonitor.
From looking at these results, I suspected that refresh_fsmonitor is already happening during git diff, independent of this patch series' optimization.
Confirmed that suspicion by breaking on refresh_fsmonitor.
    (gdb) bt [simplified]
    #0  refresh_fsmonitor at fsmonitor.c:176
    #1  ie_match_stat at read-cache.c:375
    #2  match_stat_with_submodule at diff-lib.c:237
    #4  builtin_diff_files at builtin/diff.c:260
    #5  cmd_diff at builtin/diff.c:541
    #6  run_builtin at git.c:450
    #7  handle_builtin at git.c:700
    #8  run_argv at git.c:767
    #9  cmd_main at git.c:898
    #10 main at common-main.c:52
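The benchmark rows labelled fsmonitor=.git/hooks/fsmonitor-watchman correspond to a repository whose core.fsmonitor setting points at the Watchman hook. A rough sketch of that setup follows. It assumes that Watchman is installed and that the local Git templates ship the fsmonitor-watchman.sample hook; the scratch repository and the hook variable are only for illustration.

```shell
#!/bin/sh
# Hypothetical scratch repository, used only for illustration.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

hook=.git/hooks/fsmonitor-watchman

# Install the sample hook shipped with Git, if the templates provide it.
if [ -f "$hook.sample" ]; then
    cp "$hook.sample" "$hook"
    chmod +x "$hook"
    # Tell Git to ask this hook which files changed since the last query.
    git config core.fsmonitor "$hook"
else
    echo "no fsmonitor-watchman.sample in this Git installation" >&2
fi

# Empty when the sample hook was not available.
fsmonitor_setting=$(git config core.fsmonitor)
echo "core.fsmonitor=$fsmonitor_setting"
```

With the hook active and Watchman running, Git 2.30 and later can skip stat calls in git diff for paths the monitor reports as unchanged, which is the effect measured in test 7519.5 above.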
Source: Stack Overflow, "How to troubleshoot an abnormally slow git-diff?": https://stackoverflow.com/questions/57523348/