python - requests_html does not extract the expected result

Tags: python python-3.x pyquery python-requests-html

I cannot get the correct result out of requests_html:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://www.amazon.com/dp/B07569DYGN')
>>> r.html.find("#productDetails_detailBullets_sections1")
[]

Yet the id 'productDetails_detailBullets_sections1' is present in the page source:

>>> """<table id="productDetails_detailBullets_sections1" class="a-keyvalue prodDetTable" role="presentation">""" in r.text
True

The same problem also occurs with PyQuery.

Why can't requests_html find this element?

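Best answer

I searched for #comparison_price_row, and that was still found. The next ID in the source is comparison_shipping_info_row, but searching for #comparison_shipping_info_row returns an empty list. The two elements are at the same level (same parent). I checked all of the source between them and found no problem.

At first.

Then I saw that there is a NUL byte between the two, which probably throws the library off.

After removing the NUL bytes from the input, the desired element can be found: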
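One way to confirm this kind of diagnosis is to list where the NUL characters sit in the response text. The following is a minimal sketch: the helper name find_nul_offsets and the sample string are made up for illustration, and in the session above you would pass r.text instead.

```python
def find_nul_offsets(text):
    """Return the index of every NUL character in text."""
    return [i for i, ch in enumerate(text) if ch == "\0"]


# Hypothetical stand-in for r.text: two sibling elements with a NUL between them.
sample = '<div id="a"></div>\0<div id="b"></div>'
offsets = find_nul_offsets(sample)

for pos in offsets:
    # Show a little context around each NUL so it can be located in the source.
    print(pos, repr(sample[max(0, pos - 10):pos + 10]))
```

An empty list would suggest the missing element has some other cause.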

import requests_html
r2 = requests_html.HTML(html=r.text.replace('\0', ''))  # strip NUL bytes before parsing
r2.find('#productDetails_detailBullets_sections1')

[<Element 'table' role='presentation' class=('a-keyvalue', 'prodDetTable') id='productDetails_detailBullets_sections1'>]
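If the same cleanup is needed for several pages, it could be wrapped in a small helper. This is only a sketch: strip_nul and parse_clean are hypothetical names, not part of requests_html, and the import is done inside parse_clean so that strip_nul can also be used on its own (for example before handing the text to PyQuery).

```python
def strip_nul(markup):
    """Return markup with every NUL character removed."""
    return markup.replace("\0", "")


def parse_clean(markup):
    """Build a requests_html.HTML object from markup after removing NUL characters."""
    # Imported lazily so strip_nul stays usable without requests_html installed.
    from requests_html import HTML
    return HTML(html=strip_nul(markup))
```

With the session from the question, parse_clean(r.text).find('#productDetails_detailBullets_sections1') would then replace the direct r.html.find(...) call.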

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52699466/
