Currently, if an error occurs while fetching the web page, soup never gets populated with the page; instead I get BeautifulSoup's default return. I'm looking for a way to check for this, so that if fetching the page fails I can skip a block of code, e.g.

if soup:
    do stuff

but without killing the whole script. Sorry for the newbie question.
def getwebpage(address):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib2.Request(address, None, headers)
        web_handle = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        error_desc = BaseHTTPServer.BaseHTTPRequestHandler.responses[e.code][0]
        appendlog('HTTP Error: ' + str(e.code) + ': ' + address)
        return
    except urllib2.URLError, e:
        appendlog('URL Error: ' + e.reason[1] + ': ' + address)
        return
    except:
        appendlog('Unknown Error: ' + address)
        return
    return web_handle
def test():
    soup = BeautifulSoup(getwebpage('http://doesnotexistblah.com/'))
    print soup
    if soup:
        do stuff

test()
Best Answer

Structure your code so that one function encapsulates the whole process of retrieving the data from a URL and another encapsulates processing that data:
import urllib2, httplib
from BeautifulSoup import BeautifulSoup

def append_log(message):
    print message

def get_web_page(address):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        request = urllib2.Request(address, None, headers)
        response = urllib2.urlopen(request, timeout=20)
        try:
            return response.read()
        finally:
            response.close()
    except urllib2.HTTPError as e:
        error_desc = httplib.responses.get(e.code, '')
        append_log('HTTP Error: ' + str(e.code) + ': ' +
                   error_desc + ': ' + address)
    except urllib2.URLError as e:
        append_log('URL Error: ' + str(e.reason) + ': ' + address)
    except Exception as e:
        append_log('Unknown Error: ' + str(e) + ': ' + address)

def process_web_page(data):
    if data is not None:
        print BeautifulSoup(data)
    else:
        pass  # do something else

data = get_web_page('http://doesnotexistblah.com/')
process_web_page(data)

data = get_web_page('http://docs.python.org/copyright.html')
process_web_page(data)
Related question on Stack Overflow: python - How to handle it when BeautifulSoup fails to load a web page: https://stackoverflow.com/questions/7922362/