python - How do I get unicode strings when extracting data in Python?

Tags: python unicode web-scraping

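I am trying to extract text from a Vietnamese website whose charset is utf-8. However, the text I get back always looks like ASCII, and I cannot find a way to convert it to unicode or to get the text exactly as it appears on the site. As a result, I cannot save it to a file as expected.
I know this is a very common unicode problem in Python, but I still hope someone can help me solve it. Thanks.
My code: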

import requests, re, io
import simplejson as json
from lxml import html, etree

base = "http://www.amthuc365.vn/cong-thuc/"
page = requests.get(base + "trang-" + str(1) + ".html")
pageTree = html.fromstring(page.text)

links = pageTree.xpath('//ul[contains(@class, "mt30")]/li/a/@href')
names = pageTree.xpath('//h3[@class="title"]/a/text()')
for name in names[:1]:
    print name
    # LÃ m bÃ¡nh oreo nhÃ¢n bÆ¡ Äáºu phá»ng thÆ¡m bÃ¹i

But what I need is "Làm bánh oreo nhân bơ đậu phộng thơm bùi".
Thanks.

Best answer

Simply switch from page.text to page.content, that is, pass page.content to html.fromstring, and it works.

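Explanation: requests builds page.text by decoding the response with the encoding it infers from the HTTP headers. When the Content-Type header of a text response carries no charset, requests falls back to ISO-8859-1, which most likely is what turned the UTF-8 bytes into the garbled output above. page.content is the raw bytes, so lxml can detect the encoding itself, for example from the charset declared in the page.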
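
The difference between page.text and page.content can be reproduced offline. The following is a minimal sketch using only the standard library, not the code from the answer. It assumes the wrong guess was ISO-8859-1, and the variable names (raw_bytes, garbled, repaired) are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the garbling without any network access.
# Assumption: the wrong codec that was applied is ISO-8859-1.

expected = u"Làm bánh oreo nhân bơ đậu phộng thơm bùi"

# The server sends the title as UTF-8 bytes; this is the kind of
# data that page.content holds.
raw_bytes = expected.encode("utf-8")

# Decoding those bytes with the wrong codec yields the garbled
# text, similar to what page.text returned in the question.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 yields the real title.
correct = raw_bytes.decode("utf-8")

# Already-garbled text can be repaired by undoing the wrong decode
# and redoing it with the right codec.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(correct == expected)
print(repaired == expected)
```

If the rest of the script needs page.text, an alternative under the same assumption is to set page.encoding = "utf-8" before reading page.text, so that requests decodes the bytes with the correct codec.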

Regarding "python - How do I get unicode strings when extracting data in Python?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32675046/
