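Python Beautiful Soup: 'ascii' codec can't encode character u'\xa5'

Tags: python html web-scraping beautifulsoup

I'm getting some strange characters in certain elements of the pages I'm scraping. The characters that seem to trigger the error are: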

? ????Á¢¢Á? /?? />? /??? ?/¢¥Á ??%% ?Á ?????Á? ?> /???¥??> ¥? ¥©Á ?>¢¥/%%/¥??> ?Â >Á? Â?Á ©???¢ ñ%Á?¥???/% Á%Á?¥??>?? />? Â??Á? ??¥?? ??¢¥????¥??> ¢`¢¥Á¢ ??%% ?Á ??À?/?Á? ¥? _ÁÁ¥ ?>??Á/¢?>À Á????Á>¥ ????¥Á? />? ??__?>??/¥??>¢ ?Á

My code is as follows:

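import time
import requests
from bs4 import BeautifulSoup

url = "http://www.nsf.gov#######@#@#@##"
# webbrowser.open(url, new=new)
flagcnt += 1  # request counter, defined elsewhere in the script
if flagcnt % 20 == 0:  # sleep every 20 requests to avoid being locked out
    print "flagcount: "
    print flagcnt
    time.sleep(5)
# Fetch the page and parse it
r = requests.get(url)
sp = BeautifulSoup(r.content)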

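Page: http://www.nsf.gov/awardsearch

I have read every page I could find about this error. Some of them suggest decoding and encoding, but that did not seem to help. I don't know which encoding is used here. I also tried downgrading the Beautiful Soup version, which did not help either. Any help is appreciated. Python 2.7, BS 4.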
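For context, here is a minimal sketch of where this kind of error typically comes from. It uses only the standard library. The sample bytes are hypothetical and stand in for r.content, and treating them as UTF-8 is an assumption, since the real page encoding is not known from the question. Decoding turns bytes into text; the error appears when that text is later encoded with the 'ascii' codec, which Python 2 may do implicitly when printing or writing a unicode string.

```python
# Hypothetical page bytes standing in for r.content (assumed to be UTF-8).
raw = u"Award amount: \xa5100".encode("utf-8")

# decode: bytes -> text. This step needs the page's real encoding.
text = raw.decode("utf-8")

# encode: text -> bytes. The 'ascii' codec cannot represent u'\xa5'.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with a codec that covers the character succeeds.
utf8_bytes = text.encode("utf-8")

print(ascii_failed)
print(utf8_bytes == raw)
```

If the goal is only to print or save the scraped text, encoding it explicitly as UTF-8 at the output step may be enough and keeps every character.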

Best answer

This worked for me:

page_text = r.text.encode('utf-8').decode('ascii', 'ignore')
page_soupy = BeautifulSoup.BeautifulSoup(page_text)
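Note that BeautifulSoup.BeautifulSoup(...) is the Beautiful Soup 3 import style; with BS 4 the usual form is from bs4 import BeautifulSoup followed by BeautifulSoup(page_text). The sketch below shows what the encode/decode chain above does to the text, using only the standard library. The sample string and the helper names strip_non_ascii and replace_non_ascii are hypothetical, introduced for illustration only. The chain does not repair the encoding: it silently drops every non-ASCII character, which avoids the error at the cost of losing those characters.

```python
# Hypothetical sample text standing in for r.text.
sample = u"Price: \xa5100 for the caf\xe9"


def strip_non_ascii(text):
    """Encode to UTF-8 bytes, then drop every byte that is not ASCII."""
    return text.encode("utf-8").decode("ascii", "ignore")


def replace_non_ascii(text):
    """Alternative: leave a '?' wherever a non-ASCII character was."""
    return text.encode("ascii", "replace").decode("ascii")


stripped = strip_non_ascii(sample)
replaced = replace_non_ascii(sample)

print(stripped)  # Price: 100 for the caf
print(replaced)  # Price: ?100 for the caf?
```

If the dropped characters matter (currency symbols, accented names), it is probably better to keep the text as unicode and encode it as UTF-8 only when writing it out.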

Regarding "Python Beautiful Soup 'ascii' codec can't encode character u'\xa5'", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29688440/
