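I'm a complete beginner, so please go easy on me. I googled how to fix this, but every answer I found was for XPath, and I'm using CSS.
I'm following this tutorial: https://hexfox.com/p/scrape-your-cinemas-listings-to-get-a-daily-email-of-films-with-a-high-imdb-rating/ and got this far: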
import scrapy

class CinemaSpider(scrapy.Spider):
    name = "cinema"
    allowed_domains = ['cineroxy.com.br']
    start_urls = [
        'http://cineroxy.com.br/programacao-brisamar',
    ]

    def parse(self, response):
        movie_names = response.css('.titulo p::text').extract()
        for movie_name in movie_names:
            yield {
                'name': movie_name
            }
I ran it as follows, so that it scrapes the information and writes a JSON file:
C:\Python27\Scripts>scrapy runspider cinema_scraper.py -o movies.json
But the result looks like this:
[
{"name": "\r\n A Bailarina\r\n "},
{"name": "\r\n Assassins Creed - O Filme\r\n "},
{"name": "\r\n Cinquenta Tons Mais Escuros\r\n "},
{"name": "\r\n Minha M\u00e3e \u00e9 uma Pe\u00e7a 2\r\n "},
{"name": "\r\n Moana - Um Mar de Aventura\r\n "},
{"name": "\r\n Os Penetras 2 - Quem D\u00e1 Mais?\r\n "},
{"name": "\r\n Quatro Vidas de Um Cachorro\r\n "},
{"name": "\r\n Resident Evil 6: O \u00daltimo Cap\u00edtulo\r\n "},
{"name": "\r\n xXx: Reativado\r\n "}
]
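Now I have three output/extraction problems to solve: the \r\n sequences, the large runs of whitespace, and the escaping of accented characters (Resident Evil 6: O \u00daltimo Cap\u00edtulo should be Resident Evil 6: O Último Capítulo).
The source code of this site differs in one way from the other sites I've studied: it puts a line break before the title text: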
<a href='../filme/resident-evil-6-o-ultimo-capitulo'>
<img id="cphConteudo_rptBusca_imgFilme_7" title="Resident Evil 6: O Último Capítulo" class="img" src="http://www.cineroxy.com.br/suiteinstitucional/arquivos/filmes/040920161914411.jpg" />
<div class="titulo">
<p>
Resident Evil 6: O Último Capítulo
</p>
</div>
<div class="passar-mouse">
clique para ver os horários <img src="Arquitetura/Imagens/Icones/drop.png" alt="" />
</div>
</a>
Sorry for the long post and for any glaring silly mistakes. Thanks in advance.
Best answer
yield {
    'name': movie_name.strip()
}
Code:
"\r\n A Bailarina\r\n ".strip()
Output:
'A Bailarina'
strip() removes leading and trailing whitespace, which includes the \r and \n characters as well as spaces.
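To see the effect on several scraped values at once, here is a minimal sketch in plain Python (no Scrapy needed). The helper name clean_titles is hypothetical, and skipping entries that become empty after stripping is an extra assumption, not part of the answer above.

```python
def clean_titles(raw_titles):
    """Strip surrounding whitespace and skip entries that end up empty."""
    cleaned = []
    for title in raw_titles:
        # strip() with no arguments removes leading and trailing whitespace,
        # which covers spaces, \r and \n.
        stripped = title.strip()
        if stripped:  # assumption: whitespace-only text nodes are not wanted
            cleaned.append(stripped)
    return cleaned


raw = [
    "\r\n        A Bailarina\r\n    ",
    "\r\n        xXx: Reativado\r\n    ",
    "\r\n    ",
]
print(clean_titles(raw))  # ['A Bailarina', 'xXx: Reativado']
```

Inside the spider, the same idea is simply the movie_name.strip() call shown in the answer.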
JSON: to have accented characters written as-is instead of as \uXXXX escapes, add this to your settings.py:
FEED_EXPORT_ENCODING = 'utf-8'
Docs: https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-export-encoding
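It may help to know that the \u00da style sequences are valid JSON escapes rather than corrupted data. The sketch below uses only Python's standard json module (assuming Python 3) to show the two ways of writing the same string; it illustrates the general behaviour and is not Scrapy's own exporter code.

```python
import json

title = "Resident Evil 6: O Último Capítulo"

# ensure_ascii defaults to True, so non-ASCII characters become \uXXXX escapes.
escaped = json.dumps({"name": title})

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps({"name": title}, ensure_ascii=False)

print(escaped)
print(readable)

# Either form decodes back to the same original string.
print(json.loads(escaped)["name"] == json.loads(readable)["name"])  # True
```

Because the command in the question runs a standalone file with scrapy runspider, there may be no settings.py to edit. In that case the setting can probably be supplied on the command line instead, with -s FEED_EXPORT_ENCODING=utf-8, or through a custom_settings dictionary on the spider class.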
Regarding "css - How to remove \r\n, whitespace, and enable accented characters in Scrapy CSS?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41931682/