python - Scrapy honor rel=nofollow

Can Scrapy ignore rel="nofollow" links? Looking at sgml.py in Scrapy 0.22, it looks like it can:

How do I enable it?

Best answer

Paul was right. This is how I did it:

rules = (
    # Extract all pages, follow links, call method 'parse_page' for response callback, before processing links call method links_processor
    Rule(LinkExtractor(allow=('', '/')), follow=True, callback='parse_page', process_links='links_processor'),
)

Here is the actual function (I'm new to Python, and I'm sure there is a better way to drop items inside a for loop without building a new list):

def links_processor(self, links):
    # A hook into the links processing from an existing page, done in order to not follow "nofollow" links
    ret_links = list()
    if links:
        for link in links:
            if not link.nofollow:
                ret_links.append(link)
    return ret_links

Pretty simple.
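
On the "better way" remark: the same filtering can be written as a list comprehension. The sketch below is not from the original answer. `FakeLink` and `drop_nofollow` are hypothetical names, and `FakeLink` only stands in for Scrapy's link objects (which carry a `nofollow` attribute) so that the snippet runs without Scrapy installed.

```python
from collections import namedtuple

# Hypothetical stand-in for a Scrapy link object; only the attributes
# used below (url, nofollow) are modelled.
FakeLink = namedtuple("FakeLink", ["url", "nofollow"])


def drop_nofollow(links):
    """Return a new list containing only links not marked rel="nofollow"."""
    # `links or []` covers the case where no links were passed at all,
    # mirroring the `if links:` guard in the original function.
    return [link for link in (links or []) if not link.nofollow]


links = [
    FakeLink("http://example.com/a", False),
    FakeLink("http://example.com/b", True),
]
kept = drop_nofollow(links)
print([link.url for link in kept])
```

Inside the spider, the body of links_processor could presumably be reduced to the same single return statement. A new list is still built, which is generally safer than removing items from a list while iterating over it.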

Regarding python - Scrapy honor rel=nofollow, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21392222/
