python - Scraping text without JavaScript code using Scrapy

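I'm currently setting up a bunch of spiders with Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the target sites.

The problem is that my target node sometimes contains a <script> tag, so the scraped text contains JavaScript code.

Here is a link to a real example of what I'm working with. In this case, my target node is //td[@id='contenuStory']. The problem is that there is a <script> tag in the first child div.

I've spent a lot of time searching the web and SO for a solution, but I couldn't find anything. I hope I haven't missed something obvious!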

Example

HTML response (target node only):

<div id="content">
    <div id="part1">Some text</div>
    <script>var s = 'javascript I don't want';</script>
    <div id="part2">Some other text</div>
</div>

What I want:

Some text
Some other text

What I get:

Some text
var s = 'javascript I don't want';
Some other text

My code

Given an XPath selector, I use the following function to extract the text:

def getText(hxs):
    if len(hxs) > 0:
        l = hxs.select('string(.)')
        if len(l) > 0:
            s = l[0].extract().encode('utf-8')
        else:
            s = hxs[0].extract().encode('utf-8')
        return s
    else:
        return 0

I've tried using XPath axes (things like child::script), but to no avail.

Best answer

Try the utility functions from w3lib.html:

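from w3lib.html import remove_tags, remove_tags_with_content

# extract() returns a list of strings; the w3lib helpers expect a single string
html = hxs.select('//div[@id="content"]').extract()[0]
# Drop <script> elements together with their content, then strip the remaining tags
text = remove_tags(remove_tags_with_content(html, ('script', )))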

This answer comes from a similar question on Stack Overflow, "python - Scraping text without JavaScript code using Scrapy": https://stackoverflow.com/questions/19774340/
