python - Avoid getting stuck at conn.getresponse() (httplib.HTTPConnection)

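I wrote a crawler in Python that downloads web pages from a site, given a list of URLs. I've noticed that my program occasionally hangs at the line "conn.getresponse()". No exception is thrown; the program just sits there waiting.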

import httplib  # Python 2; the module is named http.client in Python 3

conn = httplib.HTTPConnection(component.netloc)
conn.request("GET", component.path + "?" + component.query)
resp = conn.getresponse()  # hangs here

I read the API documentation, which says to add a timeout:

conn = httplib.HTTPConnection(component.netloc, timeout=10)

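However, it does not allow me to "retry" the connection. What is the best practice to retry the crawling after a timeout?

For example, I'm thinking of the following solution: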

trials = 3
while trials > 0:
    try:
        ... code here ...
        break  # success: stop retrying
    except Exception:  # ideally narrow this, e.g. to socket.timeout
        trials -= 1

Am I on the right track?

Best answer

However, it does not allow me to "retry" the connection.

Yes, timeouts are designed to push this policy back to where it belongs: in your code (and out of httplib).
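
As a rough illustration of keeping the retry policy in your own code, here is a minimal sketch. It is written for Python 3, where httplib is named http.client. The names fetch, fetch_with_retries and the fetch_once parameter are hypothetical helpers invented for this example, and the defaults (3 attempts, 10 seconds) are placeholders rather than recommendations.

```python
import http.client  # named httplib in Python 2
import socket
from urllib.parse import urlsplit  # the urlparse module in Python 2


def fetch(url, timeout=10):
    """Issue a single GET; a stalled server raises socket.timeout."""
    component = urlsplit(url)
    conn = http.client.HTTPConnection(component.netloc, timeout=timeout)
    try:
        path = component.path or "/"
        if component.query:
            path += "?" + component.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def fetch_with_retries(url, attempts=3, timeout=10, fetch_once=fetch):
    """Call fetch_once up to `attempts` times; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fetch_once(url, timeout)
        # socket.timeout is a subclass of OSError in Python 3.
        except (OSError, http.client.HTTPException) as exc:
            last_error = exc
    raise last_error
```

Passing the single-request function in as fetch_once is only a convenience for this sketch: it lets the retry loop be exercised without touching the network.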

What is the best practice to retry the crawling after a timeout?

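It depends heavily on the application. How long can your crawler tolerate putting off its other work? How badly do you want it to dig deep into each site? Do you need to be able to tolerate slow, oversubscribed servers? What about servers that throttle crawlers or use other countermeasures? And while I'm asking, are you respecting robots.txt?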
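
On the robots.txt question, Python's standard library includes a parser (urllib.robotparser in Python 3, robotparser in Python 2). The sketch below feeds it a made-up robots.txt body, so the rules, the "mycrawler" user agent and the example.com URLs are all illustrative assumptions. A real crawler would load the file from each site, for instance with set_url() and read().

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt_lines):
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


# Hypothetical rules, for illustration only.
rules = build_robot_rules([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

allowed = rules.can_fetch("mycrawler", "http://example.com/index.html")
blocked = rules.can_fetch("mycrawler", "http://example.com/private/data")
delay = rules.crawl_delay("mycrawler")  # available in Python 3.6+
```

The Crawl-delay value, when a site sets one, is a natural lower bound for how long to wait before retrying that site.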

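Since the answers to these questions can vary widely, you should tune the retry behavior to your crawler's needs, the sites you tend to crawl (assuming there are trends), and your WAN performance.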
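
One way to keep those knobs adjustable is to gather them in a small policy object. The sketch below is an assumption-laden example, not something the answer prescribes: RetryPolicy and run_with_policy are hypothetical names, and exponential backoff is just one common choice of delay schedule.

```python
import time


class RetryPolicy:
    """Tunable settings; the defaults are placeholders, not recommendations."""

    def __init__(self, attempts=3, timeout=10.0, base_delay=1.0, max_delay=30.0):
        self.attempts = attempts
        self.timeout = timeout
        self.base_delay = base_delay
        self.max_delay = max_delay

    def delays(self):
        """Yield the pause before each retry: base, 2*base, 4*base, capped."""
        for n in range(self.attempts - 1):
            yield min(self.base_delay * 2 ** n, self.max_delay)


def run_with_policy(operation, policy, sleep=time.sleep, retry_on=(OSError,)):
    """Call operation(timeout) until it succeeds or attempts run out."""
    delays = list(policy.delays())
    for attempt in range(policy.attempts):
        try:
            return operation(policy.timeout)
        except retry_on:
            if attempt == policy.attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(delays[attempt])
```

A crawler could then keep one policy per site, for example a patient one with more attempts and longer delays for a slow server and a strict one for hosts that are not worth waiting on.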

Regarding "python - Avoid getting stuck at conn.getresponse() (httplib.HTTPConnection)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8571466/