python - Encoding error when scraping a website with Beautiful Soup

Tags: python python-3.x web-scraping beautifulsoup

I am trying to scrape text from this website. It returns text like this:

डा. भà¥à¤·à¤¬à¤¹à¤¾à¤¦à¥à¤° थापालाठपà¥à¤¤à¥à¤°à¥à¤¶à¥à¤, à¤à¤®à¥à¤°à¤¿à¤à¤¾à¤®à¤¾ तà¥à¤à¤¶à¥à¤°à¥à¤à¥ निधन

instead of:

भारतीय विदेश सचिव गोखले आज नेपाल आउँदै.

Current code:

import requests
from bs4 import BeautifulSoup

headers = {
    'Connection': 'close',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
}

def get_url_soup(url):
    url_request = requests.get(url, headers=headers, allow_redirects=True)
    # .text is decoded with whatever encoding requests infers from the HTTP headers
    soup = BeautifulSoup(url_request.text, 'lxml')
    return soup

soup = get_url_soup('https://www.onlinekhabar.com/2019/03/753522')
title_card = soup.find('div', {'class': 'nws__title--card'})
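The garbled text shown earlier is consistent with UTF-8 bytes being decoded as a single-byte encoding. A likely explanation, assuming the server sends a text Content-Type header without a charset, is that requests falls back to ISO-8859-1 when building `.text`. The offline sketch below reproduces the effect on a short sample string (the sample word is an illustration, not text taken from the page):

```python
# Sample Devanagari text, chosen only for illustration.
original = 'नेपाल'

# What the server would send on the wire: UTF-8 encoded bytes.
raw = original.encode('utf-8')

# Decoding those bytes with the wrong single-byte codec yields mojibake:
# every byte becomes one unrelated Latin-1 character.
garbled = raw.decode('iso-8859-1')

# The damage is reversible as long as the wrong codec is known:
# re-encode with it, then decode as UTF-8.
recovered = garbled.encode('iso-8859-1').decode('utf-8')

print(repr(garbled))
print(recovered)
```

Because each byte maps to exactly one character under ISO-8859-1, the garbled string is as long as the raw byte sequence, which is why the broken output looks so much longer than the real headline.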

Best answer

Use EncodingDetector:

import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

headers = {
    'Connection': 'close',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
}

def get_url_soup(url):
    url_request = requests.get(url, headers=headers, allow_redirects=True)
    # Trust the HTTP encoding only if the Content-Type header declares a charset
    http_encoding = url_request.encoding if 'charset' in url_request.headers.get('content-type', '').lower() else None
    # Look for an encoding declared inside the document itself (e.g. <meta charset=...>)
    html_encoding = EncodingDetector.find_declared_encoding(url_request.content, is_html=True)
    encoding = html_encoding or http_encoding
    # Pass the raw bytes (.content), not .text, so Beautiful Soup does the decoding
    soup = BeautifulSoup(url_request.content, 'lxml', from_encoding=encoding)
    return soup

soup = get_url_soup('https://www.onlinekhabar.com/2019/03/753522')
title_card = soup.find('div', {'class': 'nws__title--card'})

print(title_card.text)

Output:

होमपेज / 
समाचार / 
राष्ट्रिय समाचार

भारतीय विदेश सचिव गोखले आज नेपाल आउँदै
प्रधानमन्त्रीलगायत शीर्ष नेतासँग भेट्ने 
.
.
.
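The encoding choice in this answer can be tried without a network request. The following is a minimal sketch, assuming a made-up in-memory page (`SAMPLE_HTML`) and a hypothetical helper name (`pick_encoding`); it uses the built-in `html.parser` instead of `lxml` only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

# Hypothetical page: UTF-8 bytes that declare their encoding in a <meta> tag.
SAMPLE_HTML = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><div class="nws__title--card">नेपाल</div></body></html>'
).encode('utf-8')

def pick_encoding(content, content_type, response_encoding):
    """Prefer the encoding declared in the HTML, then an explicit HTTP charset."""
    # Use the HTTP-level encoding only when the header actually names a charset.
    http_encoding = response_encoding if 'charset' in content_type.lower() else None
    # Scan the raw bytes for a declared encoding such as <meta charset=...>.
    html_encoding = EncodingDetector.find_declared_encoding(content, is_html=True)
    return html_encoding or http_encoding

# Simulate a response whose header has no charset and whose fallback is ISO-8859-1.
encoding = pick_encoding(SAMPLE_HTML, 'text/html', 'ISO-8859-1')

# Hand Beautiful Soup the raw bytes together with the chosen encoding.
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser', from_encoding=encoding)
title = soup.find('div', {'class': 'nws__title--card'}).text

print(encoding)
print(title)
```

If neither source yields an encoding, `from_encoding=None` leaves the detection to Beautiful Soup itself, so the helper only narrows the guess when the page or the header is explicit.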

Regarding python - Encoding error when scraping a website with Beautiful Soup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55391495/
