python - Scraping with the correct character encoding (Python requests + BeautifulSoup)

I'm having trouble parsing this website: http://fm4-archiv.at/files.php?cat=106

It contains special characters such as umlauts.

My Chrome browser displays the umlauts on that page correctly. On other pages (for example http://fm4-archiv.at/files.php?cat=105), however, the umlauts are not displayed correctly.

The HTML meta tag on the page declares the following charset:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>

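I use the Python requests package to fetch the HTML and then Beautiful Soup to scrape the data I need. My code looks like this:

import requests
from bs4 import BeautifulSoup
r = requests.get(URL)
soup = BeautifulSoup(r.content,"lxml")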

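If I print the encoding (print(r.encoding)), the result is UTF-8. If I manually change the encoding to ISO-8859-1 or cp1252 by setting r.encoding = 'ISO-8859-1', nothing changes when I print the data to the console. This is my main problem.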

r = requests.get(URL)
r.encoding = 'ISO-8859-1'
soup = BeautifulSoup(r.content,"lxml")

still produces the following string in the console output of my Python IDE:

Der WildlÃ¶wenpfleger

Instead, it should be

Der Wildlöwenpfleger

How can I change my code so that the umlauts are parsed correctly?

Best answer

In general, instead of using r.content, which is the byte string as received, use r.text, which is the content decoded using the encoding determined by requests.
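The difference can be reproduced without any network access. The sketch below is only an illustration: `raw` is a hypothetical stand-in for r.content, not data fetched from the site. It shows that the bytes themselves never change and that only the codec chosen at decode time matters, which is presumably why assigning r.encoding has no visible effect when r.content is what gets handed to Beautiful Soup.

```python
# Hypothetical stand-in for r.content: the UTF-8 bytes of one link text.
raw = "Der Wildlöwenpfleger".encode("utf-8")

# Decoding with the codec the bytes were written in gives the intended text.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 (the charset the meta tag declares)
# turns the two-byte sequence for "ö" into two separate characters.
as_latin1 = raw.decode("iso-8859-1")

print(type(raw).__name__)  # bytes
print(as_utf8)             # Der Wildlöwenpfleger
print(as_latin1)           # Der WildlÃ¶wenpfleger
```

If the raw bytes have to be passed to Beautiful Soup anyway, its from_encoding argument is another way to state the codec explicitly.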

In this case requests will decode the incoming byte string as UTF-8, because that is the encoding the server reports in the Content-Type header:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://fm4-archiv.at/files.php?cat=106')

>>> type(r.content)    # raw content
<class 'bytes'>
>>> type(r.text)       # decoded to unicode
<class 'str'>
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'

>>> soup = BeautifulSoup(r.text, 'lxml')

This fixes the "Wildlöwenpfleger" problem, but other parts of the page then start to break, for example:

>>> soup = BeautifulSoup(r.text, 'lxml')     # using decoded string... should work
>>> soup.find_all('a')[39]
<a href="details.php?file=1882">Der Wildlöwenpfleger</a>
>>> soup.find_all('a')[10]
<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)
</a>

This shows that "Wildlöwenpfleger" is fixed, but "übergeben" and other words in the second link are now broken.

It appears that more than one encoding is used within a single HTML document. The first link is encoded as UTF-8:

>>> r.content[8013:8070].decode('iso-8859-1')
'<a href="details.php?file=1882">Der WildlÃ¶wenpfleger</a>'

>>> r.content[8013:8070].decode('utf8')
'<a href="details.php?file=1882">Der Wildlöwenpfleger</a>'

But the second link is encoded as ISO-8859-1:

>>> r.content[2868:3132].decode('iso-8859-1')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

>>> r.content[2868:3132].decode('utf8', 'replace')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

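Obviously it is incorrect to use multiple encodings in the same HTML document. Short of contacting the document's author and asking for a correction, there is not much you can easily do to handle the mixed encoding. Perhaps you could run chardet.detect() over the data as you process it, but it won't be pleasant.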
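One possible workaround, sketched here under stated assumptions rather than offered as a fix from the original answer, is to decode the raw bytes piece by piece: treat every well-formed multi-byte UTF-8 sequence as UTF-8 and everything else as ISO-8859-1. The function name decode_mixed, the pattern name and the sample bytes are hypothetical and exist only for this illustration; only the standard library is used.

```python
import re

# Loose pattern for one multi-byte UTF-8 sequence (2 to 4 bytes).
# It can over-match (e.g. overlong forms), so each hit is still
# validated by a strict decode below.
_UTF8_MULTIBYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"
    rb"|[\xF0-\xF4][\x80-\xBF]{3}"
)


def decode_mixed(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode bytes that mix UTF-8 with a single-byte fallback encoding."""
    parts = []
    pos = 0
    for match in _UTF8_MULTIBYTE.finditer(raw):
        # Bytes between UTF-8 sequences are decoded with the fallback codec.
        parts.append(raw[pos:match.start()].decode(fallback))
        try:
            parts.append(match.group().decode("utf-8"))
        except UnicodeDecodeError:
            # Looked like UTF-8 but is not valid; use the fallback instead.
            parts.append(match.group().decode(fallback))
        pos = match.end()
    parts.append(raw[pos:].decode(fallback))
    return "".join(parts)


# Hypothetical sample imitating the page: one fragment in each encoding.
sample = (
    "<a>Der Wildlöwenpfleger</a>".encode("utf-8")
    + "<a>Salon übergeben, stößt, große</a>".encode("iso-8859-1")
)
repaired = decode_mixed(sample)
print(repaired)
```

With the real page, the result of decode_mixed(r.content) could then be passed to BeautifulSoup as a string. This is a heuristic: genuine ISO-8859-1 text that happens to contain a byte pair shaped like a UTF-8 sequence (for example a literal "Ã¶") would be misread, so the output is worth spot-checking.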

This question and answer on scraping with the correct character encoding (Python requests + BeautifulSoup) comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/46253288/
