scrapy - Scraping HTML embedded in JSON with Scrapy

I am requesting a website whose response is JSON like this:

{
    "success": true,
    "response": "<html>... html goes here ...</html>"
}

I have seen ways to scrape HTML and ways to scrape JSON, but I have not found how to scrape HTML that is embedded in JSON. Is it possible to do this with Scrapy?

Best answer

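Well, there is another way, and you definitely don't need to construct a response object. You can use lxml to parse the HTML text. You don't need to install any new library, because Scrapy's selectors are built on top of lxml. Just add the following line to import lxml: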

from lxml import etree

Here is an example, assuming the JSON response is:

{
    "success": true,
    "htmlinjson": "<html><body> <p id='p1'>p111111</p> <p id='p2'>p22222</p> </html>"
}

Extract the HTML text from the JSON response like this:

import json

htmlText = json.loads(response.text)['htmlinjson']

Then build an lxml element tree that you can query with XPath:

from lxml import etree

resultPage = etree.HTML(htmlText)

Now use the lxml tree to extract the text of the node with id="p1" via XPath, just as a Scrapy XPath selector would:

print(resultPage.xpath('//p[@id="p1"]')[0].text)

You will get:

p111111
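
Put together, the steps above can be sketched as one small helper. This is a minimal sketch, not a definitive implementation: the function name `extract_text_by_id` and the `sample` string are made up for illustration, and the JSON key `htmlinjson` comes from the example response above. In a spider you would pass `response.text` instead of `sample`.

```python
import json

from lxml import etree


def extract_text_by_id(json_text, html_key, element_id):
    """Return the text of the first element with the given id, or None.

    Hypothetical helper for illustration; not part of Scrapy or lxml.
    """
    # Decode the JSON envelope and pull out the embedded HTML string.
    html_text = json.loads(json_text)[html_key]
    # etree.HTML uses lxml's lenient HTML parser, so the missing
    # </body> tag in the sample is tolerated.
    tree = etree.HTML(html_text)
    # Pass the id as an XPath variable to avoid quoting problems.
    nodes = tree.xpath('//*[@id=$eid]', eid=element_id)
    return nodes[0].text if nodes else None


# Stand-in for response.text, mirroring the example response above.
sample = json.dumps({
    "success": True,
    "htmlinjson": "<html><body> <p id='p1'>p111111</p> <p id='p2'>p22222</p> </html>",
})

print(extract_text_by_id(sample, "htmlinjson", "p1"))
```

Returning `None` when nothing matches avoids the `IndexError` that `[0]` would raise on an empty result list.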

Hope that helps :)

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/37791989/
