import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

with open('/home/timmy/myamazon/bannedasins.txt') as f:
    banned_asins = f.read().split('\n')

class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                           process_value=lambda i: f"https://www.amazon.com/dp/{re.search('dp/(.*)/', i).groups()[0]}"),
             callback="parse_item"),
    )
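For reference, the lambda in the second rule rewrites a search-result href into a canonical product URL. A quick standalone check of that expression, using only `re` (the href and ASIN below are made up for illustration):

```python
import re

# Hypothetical href as it might appear on a search results page.
href = "https://www.amazon.com/Some-Product-Name/dp/B000TEST00/ref=sr_1_1?keywords=test"

# Same expression as in the rule: capture the segment between "dp/" and the
# following "/" -- that segment is the ASIN.
asin = re.search('dp/(.*)/', href).groups()[0]
print(f"https://www.amazon.com/dp/{asin}")  # https://www.amazon.com/dp/B000TEST00
```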
I have the two rules above to extract Amazon product links, and they work correctly. Now I want to exclude some ASINs from the crawl. The expression re.search('dp/(.*)/', i).groups()[0] retrieves the ASIN and puts the link in the format https://www.amazon.com/dp/{ASIN}. What I want is: if the ASIN is in banned_asins, do not extract that link.

After reading the Link Extractors page of the Scrapy docs, I believe this is done with deny_extensions, but I am not sure how to use it.

banned_asins = ['B07RTX74L7','B07D9JCH5X',......]
Best answer
deny_extensions will not work: it refers to common file extensions that are not followed when they appear in links (see here for the defaults used when it is not given). You simply filter out the banned ASINs in the process_value function. If it returns None, the given link is ignored:
process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
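In other words, the extractor passes each candidate value through process_value and keeps only the non-None results. A simplified sketch of that behavior (not Scrapy's actual implementation, just the filtering idea):

```python
# Each extracted value goes through process_value; None results are dropped.
candidates = ["https://example.com/a", "https://example.com/skip", "https://example.com/b"]

def process_value(url):
    # Hypothetical filter: ignore links ending in "/skip".
    return None if url.endswith("/skip") else url

links = [v for v in (process_value(c) for c in candidates) if v is not None]
print(links)  # ['https://example.com/a', 'https://example.com/b']
```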
So it should be:
def process_value(i):
    asin = re.search('dp/(.*)', i).groups()[0]
    return f"https://www.amazon.com/dp/{asin}" if asin not in banned_asins else None

....

rules = (
    Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
    Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                       process_value=process_value), callback="parse_item"),
)
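You can sanity-check the filter without running the spider by calling process_value directly on sample links (the second ASIN below is made up; the banned list is a small sample):

```python
import re

banned_asins = ['B07RTX74L7', 'B07D9JCH5X']  # sample banned list

def process_value(i):
    # Extract the ASIN (everything after "dp/") and rebuild the canonical URL,
    # or return None so the LinkExtractor drops the link entirely.
    asin = re.search('dp/(.*)', i).groups()[0]
    return f"https://www.amazon.com/dp/{asin}" if asin not in banned_asins else None

print(process_value("https://www.amazon.com/dp/B07RTX74L7"))   # None -> link ignored
print(process_value("https://www.amazon.com/dp/B000EXAMPLE"))  # kept
```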
On "python - deny certain links in scrapy linkextractor", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57137698/