I have some rules that I fetch dynamically from a database and add to my spider:
self.name = exSettings['site']
self.allowed_domains = [exSettings['root']]
self.start_urls = ['http://' + exSettings['root']]
self.rules = [Rule(SgmlLinkExtractor(allow=(exSettings['root'] + '$',)), follow=True)]
denyRules = []
for rule in exSettings['settings']:
    linkRegex = rule['link_regex']
    if rule['link_type'] == 'property_url':
        propertyRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True, callback='parseProperty')
        self.rules.insert(0, propertyRule)
        self.listingEx.append({'link_regex': linkRegex, 'extraction': rule['extraction']})
    elif rule['link_type'] == 'project_url':
        # not set to crawl yet due to conflict if same links appear for both
        projectRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True)
        self.rules.insert(0, projectRule)
    elif rule['link_type'] == 'favorable_url':
        favorableRule = Rule(SgmlLinkExtractor(allow=(linkRegex,)), follow=True)
        self.rules.append(favorableRule)
    elif rule['link_type'] == 'ignore_url':
        denyRules.append(linkRegex)
# somehow all urls will get ignored if allow is empty and put as the first rule
d = Rule(SgmlLinkExtractor(allow=('testingonly',), deny=tuple(denyRules)), follow=True)
# self.rules.insert(0, d)  # I have tried both positions, with the same results
self.rules.append(d)
My database contains the following rules:
link_regex: /listing/\d+/.+ (property_url)
link_regex: /project-listings/.+ (favorable_url)
link_regex: singapore-property-listing/ (favorable_url)
link_regex: /mrt/ (ignore_url)
I see this in the log:
http://www.propertyguru.com.sg/singapore-property-listing/property-for-sale/mrt/125/ang-mo-kio-mrt-station> (referer: http://www.propertyguru.com.sg/listing/8277630/for-sale-thomson-grand-6-star-development-)
Shouldn't /mrt/ be denied? Why is the link above still being crawled?
Best answer
As far as I know, the deny argument must be given to the same SgmlLinkExtractor that holds the allow patterns.
In your case, you created an SgmlLinkExtractor that allows favorable_url ('singapore-property-listing/'). That extractor has no deny patterns, so it extracts the /mrt/ link as well.
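To make the point above concrete, here is a minimal sketch that models the allow/deny check of a single link extractor in plain Python. It is a simplified stand-in, not Scrapy's own code: the helper `link_allowed` and the `extractors` dictionary are hypothetical names introduced for illustration, and the URL is the one from the log in the question. Each extractor is evaluated independently, so a link is followed as soon as any one of them accepts it.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """Simplified model of one link extractor's allow/deny check (hypothetical helper)."""
    # An empty allow list accepts every URL; otherwise at least one pattern must match.
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # Any matching deny pattern rejects the URL.
    if any(re.search(p, url) for p in deny):
        return False
    return True


url = ("http://www.propertyguru.com.sg/singapore-property-listing/"
       "property-for-sale/mrt/125/ang-mo-kio-mrt-station")

# One (allow, deny) pair per extractor, mirroring two of the rules in the question.
extractors = {
    "favorable": (("singapore-property-listing/",), ()),
    "deny_only": (("testingonly",), ("/mrt/",)),
}

results = {name: link_allowed(url, allow, deny)
           for name, (allow, deny) in extractors.items()}

# The link is followed if any single extractor accepts it.
followed = any(results.values())
print(results, followed)
```

In this model the separate deny-only rule rejects the URL, but that has no effect on the favorable rule, which accepts it on its own.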
To fix this, add the deny patterns to each corresponding SgmlLinkExtractor. See also the related question.
There may be some way to define global deny patterns, but I haven't seen one.
Regarding "python - Scrapy deny rules not being ignored", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/8794693/