python - urlopen, BeautifulSoup and UTF-8 issue

Tags: python utf-8 urllib2 beautifulsoup

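I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. The character is not visible when I use "View Source".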

import urllib2
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # This line causes the error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)

I also tried...

html = BeautifulSoup(page.encode('utf-8'))

How can I read this web page into BeautifulSoup without getting this error?

Best answer

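This error is probably happening when you try to print the representation of the BeautifulSoup object, not when you parse the page. Printing happens automatically if, as I suspect, you are working in the interactive console: evaluating a bare expression such as html makes the console display it, and the u'\xa0' character (a non-breaking space) cannot be encoded as ASCII.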

# This code will work fine, note we are assigning the result 
# of the BeautifulSoup object to prevent it from printing immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')

# This will probably show the error you saw
print soup

# And this would probably be fine
print soup.encode('utf-8')
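
The behaviour above can be reproduced without BeautifulSoup or a network request. The following is a minimal sketch, assuming the console's output encoding is ASCII, as the traceback suggests. The names text, ascii_ok, utf8_bytes and to_console_bytes are illustrative and do not come from the original post.

```python
# -*- coding: utf-8 -*-
# Minimal sketch: the failure comes from encoding U+00A0 as ASCII,
# not from parsing the page. All names here are illustrative.

text = u'\xa0'  # non-breaking space, the character named in the traceback

# Encoding to ASCII fails, which is what an ASCII console attempts.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields a two-byte sequence.
utf8_bytes = text.encode('utf-8')


def to_console_bytes(value, encoding='ascii'):
    """Encode text for display, replacing unencodable characters with '?'."""
    return value.encode(encoding, 'replace')
```

If the output really must be ASCII, a helper like the hypothetical to_console_bytes trades fidelity for safety: characters that cannot be encoded are replaced rather than raising an exception.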

Source: a similar question on Stack Overflow, "python - urlopen, BeautifulSoup and UTF-8 issue": https://stackoverflow.com/questions/1320591/
