python - How to find out which links a Scrapy Rule extracted

I am trying to extract links using Rule and LinkExtractor. This is my code in the scrapy shell:

from urllib.parse import quote
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

url = f'https://www.google.com/search?q={quote("Hello World")}'
fetch(url)  # scrapy shell helper; it populates the `response` variable
x = LinkExtractor(restrict_xpaths='//div[@class="r"]/a')
y = Rule(x)

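I tried dir(x) to see which methods I could call. The closest thing I found was x.__sizeof__(), but it returns 32 while there are actually 10 links. My question is how to find out which links are actually extracted (as something like a list). This is what dir(x) shows: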

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_csstranslator', '_extract_links', '_link_allowed', '_process_links', 'allow_domains', 'allow_res', 'canonicalize', 'deny_domains', 'deny_extensions', 'deny_res', 'extract_links', 'link_extractor', 'matches', 'restrict_xpaths']

Best answer

You can use the following to get exactly what was extracted:

x = LinkExtractor(restrict_xpaths='//div[@class="r"]/a')
links_objects = x.extract_links(response)  # a list of Link objects
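
Because the result is an ordinary list, the number of extracted links comes from len(), not from __sizeof__(). The 32 seen in the question is the memory size of the extractor object in bytes, which has nothing to do with how many links matched. A minimal sketch of the difference, using a made-up list of URL strings as a stand-in for the real result:

```python
# Stand-in for the list returned by extract_links(); these URLs are invented.
links = [f"https://example.com/result/{i}" for i in range(10)]

count = len(links)                  # number of items in the list
size_in_bytes = links.__sizeof__()  # memory used by the list object itself

print(count)
print(size_in_bytes)
```

The exact byte count varies by Python version and platform, so it should not be relied on for anything beyond memory diagnostics.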

For the actual URLs, you can use:

for link in links_objects:
    print(link.url)  # the extracted URL

This question, "python - How to find out which links a Scrapy Rule extracted", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/56959097/
