python - Beautiful Soup and character encoding

I'm trying to extract text and HTML from a website containing Scandinavian characters, using Beautiful Soup and Python 2.6.5.

html = open('page.html', 'r').read()
soup = BeautifulSoup(html)

descriptions = soup.findAll(attrs={'class' : 'description' })

for i in descriptions:
    description_html = i.a.__str__()
    description_text = i.a.text.__str__()
    description_html = description_html.replace("/subdir/", "http://www.domain.com/subdir/")
    print description_html

However, when I run it, the program fails with the following error message:

Traceback (most recent call last):
  File "test01.py", line 40, in <module>
    description_text = i.a.text.__str__()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 19: ordinal not in range(128)

In case it helps, the input page appears to be encoded in ISO-8859-1. I tried setting the correct source encoding with BeautifulSoup(html, fromEncoding="latin-1"), but that didn't help either.

It's 2011 and I'm still struggling with trivial character encoding problems. I'm sure there is a very simple solution to all of this.

Best answer
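The traceback points at the output side rather than the parsing side: in Python 2, calling __str__() on a unicode object encodes it with the default 'ascii' codec, which cannot represent u'\xe4'. That would also explain why changing fromEncoding made no difference. The following is only a minimal sketch of the usual approach (decode once on input, encode explicitly on output), not necessarily the original answer. It deliberately avoids Beautiful Soup so that it stays self-contained; the sample bytes and the choice of UTF-8 as the output encoding are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical sample: bytes as they might be read from an ISO-8859-1 page.
# 0xE4 is the Latin-1 byte for the character "ä".
raw = b'<a href="/subdir/x">p\xe4iv\xe4</a>'

# Decode once, at the input boundary, into a unicode string.
text = raw.decode('iso-8859-1')

# Encoding with the ASCII codec is what __str__() does implicitly on a
# Python 2 unicode object; it fails as soon as a non-ASCII character appears.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Work on unicode, then encode explicitly at the output boundary.
# UTF-8 is an assumption here; use whatever the terminal or file expects.
fixed = text.replace(u'/subdir/', u'http://www.domain.com/subdir/')
encoded = fixed.encode('utf-8')

print(ascii_ok)
print(repr(encoded))
```

Applied to the code in the question, this would presumably mean replacing i.a.text.__str__() with something like i.a.text.encode('utf-8'), or keeping the value as unicode and encoding only when printing.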