git - What is the storage mechanism behind Git Large File Storage?

Tags: git github

GitHub recently introduced an extension to git that stores large files in a different way. What exactly does "extension replaces large files with text pointers in Git" mean?

Best answer

You can see in the git-lfs sources how a "text pointer" is defined:

type Pointer struct {
    Version string
    Oid     string
    Size    int64
    OidType string
} 

The smudge and clean sources mean that git-lfs can use a content filter driver in order to:

  • download the actual files on checkout
  • store them in an external source on commit.

See the pointer specs:

The core Git LFS idea is that instead of writing large blobs to a Git repository, only a pointer file is written.

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
(ending \n)
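
To make the layout concrete, here is a minimal Go sketch that writes and reads the three-line pointer text shown above. The `Pointer` struct repeats the one quoted earlier; `Encode` and `ParsePointer` are hypothetical helpers of my own, not functions from git-lfs, and the sketch only handles the three keys that appear in the excerpt.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Pointer mirrors the fields of the struct quoted from the git-lfs sources.
type Pointer struct {
	Version string
	Oid     string
	Size    int64
	OidType string
}

// Encode renders the pointer in the "key value\n" layout from the spec excerpt.
// (Hypothetical helper, not the git-lfs API.)
func (p Pointer) Encode() string {
	return fmt.Sprintf("version %s\noid %s:%s\nsize %d\n", p.Version, p.OidType, p.Oid, p.Size)
}

// ParsePointer reads "key value" lines back into a Pointer. It rejects
// any key other than version, oid and size (a simplification).
func ParsePointer(text string) (Pointer, error) {
	var p Pointer
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, value, ok := strings.Cut(sc.Text(), " ")
		if !ok {
			return p, fmt.Errorf("malformed line %q", sc.Text())
		}
		switch key {
		case "version":
			p.Version = value
		case "oid":
			typ, hash, ok := strings.Cut(value, ":")
			if !ok {
				return p, fmt.Errorf("malformed oid %q", value)
			}
			p.OidType, p.Oid = typ, hash
		case "size":
			n, err := strconv.ParseInt(value, 10, 64)
			if err != nil {
				return p, fmt.Errorf("bad size: %w", err)
			}
			p.Size = n
		default:
			return p, fmt.Errorf("unknown key %q", key)
		}
	}
	if p.Version == "" || p.Oid == "" {
		return p, fmt.Errorf("missing version or oid")
	}
	return p, sc.Err()
}

func main() {
	p := Pointer{
		Version: "https://git-lfs.github.com/spec/v1",
		Oid:     "4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393",
		Size:    12345,
		OidType: "sha256",
	}
	text := p.Encode()
	fmt.Print(text)

	back, err := ParsePointer(text)
	fmt.Println(back == p, err)
}
```

The point of the round trip is that this small text is all Git itself ever stores for the file; the large content lives elsewhere under the `oid`.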

Git LFS needs a URL endpoint to talk to a remote server.
A Git repository can have different Git LFS endpoints for different remotes.

The actual files are uploaded to, or downloaded from, a server that implements the Git-LFS API.

The git-lfs man page confirms this, mentioning:

The actual file gets pushed to a Git LFS API

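You need a Git server that implements that API in order to support uploading and downloading binary content.


Regarding the content filter driver (which has existed in Git for a long time, well before lfs, and which lfs uses here to add this "large file management" feature), this is where most of the work happens: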

The smudge filter runs as files are being checked out from the Git repository to the working directory.
Git sends the content of the Git blob as STDIN, and expects the content to write to the working directory as STDOUT.

  • Read 100 bytes.
  • If the content is ASCII and matches the pointer file format:
    • Look for the file in .git/lfs/objects/{OID}.
    • If it's not there, download it from the server.
    • Read its contents to STDOUT
  • Otherwise, simply pass the STDIN out through STDOUT.

The clean filter runs as files are added to repositories.
Git sends the content of the file being added as STDIN, and expects the content to write to Git as STDOUT.

  • Stream binary content from STDIN to a temp file, while calculating its SHA-256 signature.
  • Check for the file at .git/lfs/objects/{OID}.
  • If it does not exist:
    • Queue the OID to be uploaded.
    • Move the temp file to .git/lfs/objects/{OID}.
  • Delete the temp file.
  • Write the pointer file to STDOUT.

Git 2.11 (November 2016) has a commit that details how this works: commit edcc858, helped by Martin-Louis Bright and signed off by Lars Schneider.

convert: add filter.<driver>.process option

Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382

This patch adds the filter.<driver>.process string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output.
The full protocol is explained in detail in Documentation/gitattributes.txt.

A few key decisions:

  • The long running filter process is referred to as filter protocol version 2 because the existing single shot filter invocation is considered version 1.
  • Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user.
  • The status of a filter operation (e.g. "success" or "error") is set before the actual response and (if necessary!) re-set after the response. The advantage of this two step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response.
  • All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future.

This led to a warning being added in Git 2.12 (Q1 2017):

See commit 7eeda8b (18 Dec 2016), and commit c6b0831 (03 Dec 2016) by Lars Schneider (larsxschneider).
(Merged by Junio C Hamano -- gitster -- in commit 08721a0, 27 Dec 2016)

docs: warn about possible '=' in clean/smudge filter process values

A pathname value in a clean/smudge filter process "key=value" pair can contain the '=' character (introduced in edcc858).
Make the user aware of this issue in the docs, add a corresponding test case, and fix the issue in filter process value parser of the example implementation in contrib.

Regarding "git - What is the storage mechanism behind Git Large File Storage?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29530200/
