The code below prints hi in the output and then hangs. Can you check what is wrong with it? Could it be that the site is protected and I need some special authentication?
from bs4 import BeautifulSoup
import requests
print('hi')
rooturl='http://www.hoovers.com/company-information/company-search.html'
r=requests.get(rooturl)
print('hi1')
soup=BeautifulSoup(r.content,"html.parser")
print('hi2')
print(soup)
Best Answer
The reason you hit this problem is that the website thinks you are a bot, so it will not send you anything. It may even hold the connection open and leave you waiting forever. You just need to imitate a browser's request, and then the server will consider you not a robot.

Adding headers is the simplest way to handle this, but you should not pass only a User-Agent (as in this case). Remember to copy your browser's actual request and strip out the useless elements by testing. If you are lazy, just use your browser's headers wholesale, but when uploading files be sure not to copy all of them.
from bs4 import BeautifulSoup
import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
with requests.Session() as se:
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    resp = se.get(rooturl)

print(resp.content)
soup = BeautifulSoup(resp.content, "html.parser")
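Separately, the hang itself can be bounded: requests waits indefinitely by default, so a server that silently drops you blocks the script forever. A minimal sketch of a defensive fetch, assuming the same example URL, passes an explicit timeout and checks the status code before parsing:

```python
import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'

try:
    # timeout=(connect, read) in seconds bounds each phase of the request,
    # so a stalled server raises Timeout instead of hanging forever
    resp = requests.get(
        rooturl,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=(5, 15),
    )
    # surface 4xx/5xx responses instead of silently parsing an error page
    resp.raise_for_status()
except requests.exceptions.RequestException as exc:
    print(f"request failed: {exc}")
```

Any failure (timeout, connection reset, HTTP error) lands in the single `RequestException` handler, so the script always terminates with a diagnostic instead of appearing stuck after the first print.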
Regarding "python - Unable to read html page from beautiful soup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53782607/