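I am using linkchecker to crawl the UK government website, map the hyperlink relationships, and output a GML file.
I don't want to include image URLs, i.e. any URL that references a jpeg or png file (for example "www.gov.uk/somefile.jpeg").
I have spent several hours trying to achieve this with the --ignore-url command-line parameter and various regular expressions. This is my last attempt before giving up: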
linkchecker --ignore-url='(png|jpg|jpeg|gif|tiff|bmp|svg|js)$' -r1 --verbose --no-warnings -ogml/utf_8 --file-output=gml/utf_8/www.gov.uk_RECURSION_1_LEVEL_NO_IMAGES.gml https://www.gov.uk
Can anyone tell me whether this is possible and, if so, suggest a solution?
Best answer
Trivia:
According to the docs:
--ignore-url=REGEX
URLs matching the given regular expression will be ignored and not checked.
This option can be given multiple times.
LinkChecker accepts Python regular expressions. See http://docs.python.org/howto/regex.html for an introduction. An addition is that a leading exclamation mark negates the regular expression.
So we can easily check your regular expression in Python to see why it doesn't work (live test):
import re
our_pattern = re.compile(r'(png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
input_data = '''
www.gov.uk/
www.gov.uk/index.html
www.gov.uk/admin.html
www.gov.uk/somefile.jpeg
www.gov.uk/anotherone.png
'''
input_data = input_data.strip().split('\n')
for address in input_data:
    # re.match only tries to match at the start of the string;
    # our_pattern.fullmatch could be used here instead.
    print('Address: %s\t Matched as Image: %s' % (address, bool(our_pattern.match(address))))
Output:
Address: www.gov.uk/ Matched as Image: False
Address: www.gov.uk/index.html Matched as Image: False
Address: www.gov.uk/admin.html Matched as Image: False
Address: www.gov.uk/somefile.jpeg Matched as Image: False
Address: www.gov.uk/anotherone.png Matched as Image: False
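I think the problem here is partial matching: re.match only tries to match at the start of the string, and the pattern describes only the end of it. So let's try a pattern that covers the whole address (pattern, live test):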
...
our_pattern = re.compile(r'.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
# Note the leading .* (matches any character, any number of times)
...
...and the output is:
Address: www.gov.uk/ Matched as Image: False
Address: www.gov.uk/index.html Matched as Image: False
Address: www.gov.uk/admin.html Matched as Image: False
Address: www.gov.uk/somefile.jpeg Matched as Image: True
Address: www.gov.uk/anotherone.png Matched as Image: True
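The difference between the two runs comes down to how Python's re module anchors a pattern. Below is a minimal sketch comparing match, search and fullmatch on one sample address. It only illustrates Python's behaviour; the docs quoted above do not say whether LinkChecker applies --ignore-url with match or search semantics.

```python
import re

# Patterns from the question and from the fix above.
original = re.compile(r'(png|jpg|jpeg|gif|tiff|bmp|svg|js)$')
prefixed = re.compile(r'.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$')

address = 'www.gov.uk/somefile.jpeg'

results = {
    # match() only tries at position 0, where the address starts with "www".
    'original.match': bool(original.match(address)),
    # search() scans the whole string, so the extension at the end is found.
    'original.search': bool(original.search(address)),
    # With a leading .* the pattern can start matching at position 0.
    'prefixed.match': bool(prefixed.match(address)),
    # fullmatch() requires the pattern to consume the entire string.
    'prefixed.fullmatch': bool(prefixed.fullmatch(address)),
}

for name, matched in results.items():
    print('%s: %s' % (name, matched))
```

If the tool you are driving uses search-style matching, the leading .* is harmless but not required; with match-style matching it is what makes the pattern apply.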
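Solution:
As you can see, in your attempt the URLs did not match the given regular expression and so were not ignored. On its own, that regular expression only matches the listed extensions (png, jpg, ...).
To overcome this, use .* to match all the characters before the extension.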
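One thing the .* pattern does not do is require a dot before the extension, so an address that merely ends in "js" or "png" would match as well. A possible tightening is sketched below, assuming only a literal file extension at the very end of the URL should count. build_ignore_pattern is a hypothetical helper name for illustration, not part of LinkChecker.

```python
import re


def build_ignore_pattern(extensions):
    """Return a regex string matching URLs that end in '.<ext>' (hypothetical helper)."""
    # re.escape guards against extensions that contain regex metacharacters.
    alternatives = '|'.join(re.escape(ext) for ext in extensions)
    # \. requires a literal dot before the extension; .* lets the match start at position 0.
    return r'.*\.(?:%s)$' % alternatives


pattern_text = build_ignore_pattern(['png', 'jpg', 'jpeg', 'gif', 'tiff', 'bmp', 'svg', 'js'])
pattern = re.compile(pattern_text)

for address in ['www.gov.uk/somefile.jpeg', 'www.gov.uk/nodejs', 'www.gov.uk/index.html']:
    print('Address: %s\t Matched as Image: %s' % (address, bool(pattern.match(address))))
```

The resulting pattern_text could be passed to --ignore-url in place of the hand-written pattern. Note that URLs carrying a query string after the extension would not be covered by this sketch.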
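Another possible issue is the quotes: if your shell does not strip them, they become part of the pattern.
An example from the docs: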
Don't check mailto: URLs. All other links are checked as usual:
linkchecker --ignore-url=^mailto: mysite.example.org
So your final option is:
--ignore-url=.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$
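Because the pattern contains characters such as ( and | that many shells treat specially, it can help to see exactly what quoting a given shell needs. Below is a small sketch using Python's standard shlex module, assuming a POSIX-style shell. The argument list reuses only options from the question, and nothing here runs linkchecker itself.

```python
import shlex

# Arguments as linkchecker should receive them, with no shell quoting applied.
argv = [
    'linkchecker',
    '--ignore-url=.*(?:png|jpg|jpeg|gif|tiff|bmp|svg|js)$',
    '-r1',
    'https://www.gov.uk',
]

# shlex.join (Python 3.8+) adds the quoting a POSIX shell needs
# to pass each argument through unchanged.
command_line = shlex.join(argv)
print(command_line)

# Round trip: splitting the quoted command line gives back the original arguments.
assert shlex.split(command_line) == argv
```

Passing the same argv list to subprocess.run would sidestep shell quoting altogether, since no shell is involved in that case.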
Hope this helps!
Regarding python - how to ignore URLs containing image formats with linkchecker, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45050452/