Python requests: check whether a URL is not an HTML web page

So I have a crawler that uses something like this:

#if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text

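This works OK. What happens, though, is that if a file like .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs, and on Linux, when I try to run the script, it just prints: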

Killed

Is there a more efficient way to catch these non-HTML pages?

Best answer

You can send a HEAD request and check the content type. Proceed only if it is text/html:

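import requests

# HEAD fetches only the headers; follow redirects to reach the final resource.
r = requests.head(url, allow_redirects=True, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url, timeout=3).text
else:
    print("non html page")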

If you only want to make a single request:

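import requests

# stream=True defers the body download until r.text is accessed.
r = requests.get(url, stream=True, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    r.close()  # release the connection without downloading the body
    print("non html page")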
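The single-request approach can be wrapped in a small helper so the crawler never reads the body of a media file. This is only a sketch: the names `is_html` and `fetch_html` are hypothetical, the `session` argument is assumed to behave like a `requests.Session()`, and treating `application/xhtml+xml` as HTML is an extra assumption not made in the answer.

```python
def is_html(headers):
    """Return True if the Content-Type header names an HTML media type."""
    content_type = headers.get("Content-Type", "")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Accepting XHTML here is an assumption; the answer checks text/html only.
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_html(session, url, timeout=3):
    """Return the page text for HTML responses, or None for anything else.

    `session` is assumed to behave like requests.Session().
    """
    # stream=True defers the body download until .text is accessed,
    # so a large .mp4 or .m4a body is never read into memory.
    response = session.get(url, timeout=timeout, stream=True)
    try:
        if is_html(response.headers):
            return response.text
        return None
    finally:
        # Release the connection whether or not the body was read.
        response.close()
```

In the crawler this could be called as `html = fetch_html(requests.Session(), baseUrl[0])`, skipping the URL whenever the result is `None`. Note that some servers send a wrong or missing Content-Type, so the check is a heuristic rather than a guarantee.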

This question, "Python requests: check whether a URL is not an HTML web page", comes from Stack Overflow: https://stackoverflow.com/questions/25392321/
