Hi, I'm trying to use CrawlSpider, and I created my own deny rules:
class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["careers-cooperhealth.icims.com"]
    start_urls = ["careers-cooperhealth.icims.com"]
    d = [0-9]
    path_deny_base = ['.(login)', '.(intro)', '(candidate)', '(referral)', '(reminder)', '(/search)',]
    rules = (Rule(SgmlLinkExtractor(deny=path_deny_base,
                                    allow=('careers-cooperhealth.icims.com/jobs/…;*')),
                  callback="parse_items",
                  follow=True),)
My spider still crawls pages like https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn/login, where the login page should not be crawled. What's wrong?
Best answer
Just change it like this (no dots and parentheses):
deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
allow = ['jobs']
rules = (Rule(SgmlLinkExtractor(deny=deny,
                                allow=allow,
                                restrict_xpaths=('*')),
              callback="parse_items",
              follow=True),)
This means links containing login, intro, etc. are not extracted; only links containing jobs are.
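Scrapy's link extractors treat each allow/deny entry as a regular expression and test it against the full URL with re.search, so a plain substring like 'login' matches anywhere in the URL. The following is a minimal sketch of that filtering logic using only the standard library; the is_allowed helper is hypothetical, for illustration, not Scrapy's actual implementation:

```python
import re

# Hypothetical helper mimicking how a link extractor applies allow/deny
# lists: each entry is a regex tested with re.search(); deny wins.
def is_allowed(url, allow, deny):
    if any(re.search(pattern, url) for pattern in deny):
        return False  # URL matches a deny pattern: dropped
    return any(re.search(pattern, url) for pattern in allow)

urls = [
    "https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn/login",
    "https://careers-cooperhealth.icims.com/jobs/intro?hashed=0",
    "https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn",
]

deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
allow = ['jobs']

for url in urls:
    print(url, "->", is_allowed(url, allow, deny))
# The first two URLs are denied ('login', 'intro'); the last is allowed ('jobs').
```

Note that the original patterns like '.(login)' are valid regexes too ('.' matches any character, parentheses form a group), which is why plain substrings are both sufficient and easier to reason about here.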
Here is the complete spider code that crawls the link https://careers-cooperhealth.icims.com/jobs/intro?hashed=0 and prints "YAHOO!":
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["careers-cooperhealth.icims.com"]
    start_urls = ["https://careers-cooperhealth.icims.com"]
    deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
    allow = ['jobs']
    rules = (Rule(SgmlLinkExtractor(deny=deny,
                                    allow=allow,
                                    restrict_xpaths=('*')),
                  callback="parse_items",
                  follow=True),)

    def parse_items(self, response):
        print "YAHOO!"
Hope this helps.
Regarding "python - scrapy spider bypasses my deny rules", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18482813/