import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

with open('/home/timmy/myamazon/bannedasins.txt') as f:
    banned_asins = f.read().split('\n')

class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                           process_value=lambda i: f"https://www.amazon.com/dp/{re.search('dp/(.*)/', i).groups()[0]}"),
             callback="parse_item"),
    )
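For reference, the lambda in the second rule rewrites a search-result href into a canonical product URL. A quick standalone check of that expression, using only `re` (the href and ASIN below are made up for illustration):

```python
import re

# Hypothetical href as it might appear on a search results page.
href = "https://www.amazon.com/Some-Product-Name/dp/B000TEST00/ref=sr_1_1?keywords=test"

# Same expression as in the rule: capture the segment between "dp/" and the
# following "/" -- that segment is the ASIN.
asin = re.search('dp/(.*)/', href).groups()[0]
print(f"https://www.amazon.com/dp/{asin}")  # https://www.amazon.com/dp/B000TEST00
```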
I have the two rules above to extract Amazon product links, and they work correctly. Now I want to exclude some ASINs from the crawl. The expression re.search('dp/(.*)/', i).groups()[0] retrieves the ASIN and puts the link in the format https://www.amazon.com/dp/{ASIN}. What I want is: if the ASIN is in banned_asins, do not extract that link.

After reading the Link Extractors page of the Scrapy docs, I believe this is done with deny_extensions, but I am not sure how to use it.

banned_asins = ['B07RTX74L7','B07D9JCH5X',......]
Best answer
deny_extensions will not work: it refers to common file extensions that are not followed when they appear in links (see here for the defaults used when it is not given). You simply filter out the banned ASINs in the process_value function. If it returns None, the given link is ignored:
process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
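In other words, the extractor passes each candidate value through process_value and keeps only the non-None results. A simplified sketch of that behavior (not Scrapy's actual implementation, just the filtering idea):

```python
# Each extracted value goes through process_value; None results are dropped.
candidates = ["https://example.com/a", "https://example.com/skip", "https://example.com/b"]

def process_value(url):
    # Hypothetical filter: ignore links ending in "/skip".
    return None if url.endswith("/skip") else url

links = [v for v in (process_value(c) for c in candidates) if v is not None]
print(links)  # ['https://example.com/a', 'https://example.com/b']
```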
So it should be:
def process_value(i):
    asin = re.search('dp/(.*)', i).groups()[0]
    return f"https://www.amazon.com/dp/{asin}" if asin not in banned_asins else None

....

rules = (
    Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
    Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                       process_value=process_value), callback="parse_item"),
)
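You can sanity-check the filter without running the spider by calling process_value directly on sample links (the second ASIN below is made up; the banned list is a small sample):

```python
import re

banned_asins = ['B07RTX74L7', 'B07D9JCH5X']  # sample banned list

def process_value(i):
    # Extract the ASIN (everything after "dp/") and rebuild the canonical URL,
    # or return None so the LinkExtractor drops the link entirely.
    asin = re.search('dp/(.*)', i).groups()[0]
    return f"https://www.amazon.com/dp/{asin}" if asin not in banned_asins else None

print(process_value("https://www.amazon.com/dp/B07RTX74L7"))   # None -> link ignored
print(process_value("https://www.amazon.com/dp/B000EXAMPLE"))  # kept
```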
On "python - deny certain links in scrapy linkextractor", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57137698/