I'm writing a scraper. I collect all the links on a page, and they may be relative paths. For example:
foo.html
/foo.html
../foo.html
../../foo.html
I could join them to the URL of the page they appear on (the base path), but that isn't entirely trivial. For example:
http://www.example.com/foo + /bar.html = http://www.example.com/bar.html
http://www.example.com/bla/?foo=bar + ../foo.html = http://www.example.com/foo.html
Is there an Erlang lib, a C lib, or a CLI program that can work out the correct join for me?
Best answer
As far as a CLI goes, wget has a --base switch:
-B URL --base=URL
Resolves relative links using URL as the point of reference, when reading links from an HTML file specified via the -i/--input-file option (together with --force-html, or when the input file was fetched remotely from a server describing it as HTML). This is equivalent to the presence of a "BASE" tag in the HTML input file, with URL as the value for the "href" attribute.
For instance, if you specify http://foo/bar/a.html for URL, and Wget reads ../baz/b.html from the input file, it would be resolved to http://foo/baz/b.html.
So if you run it so that it outputs the files to stdout and read that with your Erlang script, that should work.
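
For comparison, the resolution rules themselves can be sketched in a few lines. This is only an illustration in Python rather than Erlang or C, using the standard library's urllib.parse.urljoin; the helper name resolve_link is hypothetical and not part of the question or of wget. It reproduces the two joins from the question and the example from the wget manual:

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, link: str) -> str:
    """Resolve a possibly relative link against the URL of the page it was found on.

    resolve_link is a hypothetical helper name; the work is done by urljoin.
    """
    return urljoin(base_url, link)


if __name__ == "__main__":
    # (base URL of the page, link found on that page)
    examples = [
        ("http://www.example.com/foo", "/bar.html"),
        ("http://www.example.com/bla/?foo=bar", "../foo.html"),
        ("http://foo/bar/a.html", "../baz/b.html"),
    ]
    for base, link in examples:
        print(f"{base} + {link} = {resolve_link(base, link)}")
```

A scraper that is not written in Python could call a small script like this as a subprocess and read the resolved URLs from its stdout, in the same way as suggested for wget above.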
This question, "bash - Looking for a library to join relative/full URLs", comes from Stack Overflow: https://stackoverflow.com/questions/8833011/