wget - How does recursive downloading work in wget?

wget is used to mirror sites, but I would like to know how the utility downloads all the URLs of a domain.

wget -r www.xyz.com

How does wget download all the URLs of the domain xyz? Does it visit the index page, parse it, and extract links like a crawler?

Best answer

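Short answer: generally yes, Wget will crawl all the URLs, but there are some exceptions:

URLs blocked by robots.txt

URLs that sit deeper in the site than the default crawl depth

Files referenced from CSS in certain cases, which older versions of Wget may fail to retrieve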

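As for the starting point, Wget simply starts from whatever URL you give it, in this case www.xyz.com. Since most web server software returns an index page when no specific page is requested, Wget receives the index page and begins from there.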
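To make the "parse the page and extract links like a crawler" idea concrete, here is a minimal Python sketch of link extraction. It is only an illustration of the general technique, not Wget's actual implementation; the names LinkExtractor and extract_links and the sample HTML are assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from href/src attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every linked URL found in the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Made-up index page standing in for whatever the server returns.
index_html = '<a href="/about.html">About</a> <img src="img/logo.png">'
print(extract_links("http://www.xyz.com/", index_html))
```

A recursive downloader would then fetch each extracted URL and repeat the same step on every HTML page it receives.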
Details
From the man page for GNU Wget 1.17.1:

Wget can follow links in HTML, XHTML, and CSS pages ... This is sometimes referred to as "recursive downloading."

It goes on to add:

While doing that, Wget respects the Robot Exclusion Standard (/robots.txt).

So if /robots.txt says not to index /some/secret/page.htm, that page is of course excluded by default, just as it would be by any other crawler that respects robots.txt.
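The effect of such a rule can be sketched with Python's standard urllib.robotparser module. This only illustrates how a robots.txt-respecting crawler decides what to skip; the robots.txt content, the allowed helper, and the "Wget" user-agent string are assumptions for the example, and Wget's own matching logic may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that blocks everything under /some/secret/.
robots_txt = """\
User-agent: *
Disallow: /some/secret/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url, user_agent="Wget"):
    """Return True if the rules above permit fetching the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://www.xyz.com/some/secret/page.htm"))  # False
print(allowed("http://www.xyz.com/index.html"))            # True
```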
In addition, there is a default depth limit:

-r

--recursive

Turn on recursive retrieving. The default maximum depth is 5.

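So if for some reason the site has links deeper than 5 levels, then to satisfy your original wish to capture all URLs you may want to use the -l option, for example -l 6 to go six levels deep: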

-l depth

--level=depth

Specify recursion maximum depth level depth.
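Why a depth limit can leave pages behind is easier to see on a toy example. The sketch below runs a breadth-first crawl over a made-up in-memory site. The site dictionary, the crawl function, and the convention that the start page counts as depth 0 are all assumptions of this illustration, not a description of Wget's internals.

```python
from collections import deque

# Made-up site: each page maps to the pages it links to.
site = {
    "/": ["/a"],
    "/a": ["/b"],
    "/b": ["/c"],
    "/c": [],
}


def crawl(start, max_depth):
    """Visit pages breadth-first, following links at most max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth == max_depth:
            continue  # pages at the limit are fetched, but their links are not followed
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("/", 2))  # ['/', '/a', '/b']  (the page /c is never reached)
print(crawl("/", 3))  # ['/', '/a', '/b', '/c']
```

Raising the limit by one is enough to reach the deepest page here, which is the same reasoning behind passing a larger value to -l.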

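Also, note that earlier versions of Wget had trouble with assets referenced inside CSS files that were themselves pulled in through @import url, as reported in: wget downloads CSS @import, but ignores files referenced within them. However, that report does not say which version was used, and I have not tested the latest version. My workaround at the time was to work out the missing assets by hand and write separate Wget commands specifically for them.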
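Working out the missing assets by hand can be partly automated. The following is a rough sketch that lists the URLs a stylesheet refers to, so they can be compared against what was actually downloaded. The regular expression, the css_references helper, and the sample CSS are assumptions for illustration; the pattern covers only common url(...) and @import "..." forms and is not a full CSS parser.

```python
import re

# Matches url(...) with optional quotes, or @import "..." written without url().
CSS_URL = re.compile(
    r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)|@import\s+['"]([^'"]+)['"]"""
)


def css_references(css_text):
    """Return the URLs referenced by url(...) and @import in a stylesheet."""
    refs = []
    for match in CSS_URL.finditer(css_text):
        # Exactly one of the two groups is set, depending on which form matched.
        refs.append(match.group(1) or match.group(2))
    return refs


# Made-up stylesheet for the example.
css = (
    '@import url("theme.css");\n'
    "body { background: url(img/bg.png); }\n"
    '@import "print.css";'
)
print(css_references(css))  # ['theme.css', 'img/bg.png', 'print.css']
```

Any reference that is absent from the mirror directory is a candidate for a separate, targeted Wget command.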

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/36292939/
