I'm using the code below to fetch the titles of websites.
from bs4 import BeautifulSoup
import urllib2

line_in_list = ['www.dailynews.lk', 'www.elpais.com', 'www.dailynews.co.zw']

for websites in line_in_list:
    url = "http://" + websites
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    site_title = soup.find_all("title")
    print site_title
If the list contains a "bad" (non-existent) website/page, or a site returns some kind of error such as "404 Page Not Found", the script raises an exception and stops.
How can I make the script ignore/skip the "bad" (non-existent) and otherwise problematic websites/pages?
Best Answer
line_in_list = ['www.dailynews.lk', 'www.elpais.com', "www.no.dede", 'www.dailynews.co.zw']

for websites in line_in_list:
    url = "http://" + websites
    try:
        page = urllib2.urlopen(url)
    except Exception, e:  # urlopen raises HTTPError (e.g. 404) and URLError (unknown host)
        print e
        continue          # skip this site and move on to the next one
    soup = BeautifulSoup(page.read())
    site_title = soup.find_all("title")
    print site_title
[<title>Popular News Items | Daily News Online : Sri Lanka's National News</title>]
[<title>EL PAÍS: el periódico global</title>]
<urlopen error [Errno -2] Name or service not known>
[<title>
DailyNews - Telling it like it is
</title>]
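For readers on Python 3 (where urllib2 was split into urllib.request and urllib.error), the same try/except/continue pattern applies. Below is a minimal sketch of that pattern; the fetch_title helper is a hypothetical stand-in for the urlopen-plus-BeautifulSoup step, so the snippet can illustrate the control flow without network access.

```python
from urllib.error import URLError

def fetch_title(url):
    # Hypothetical stand-in for urlopen() + BeautifulSoup parsing;
    # it raises URLError the way urlopen does for an unknown host.
    if "no.dede" in url:
        raise URLError("Name or service not known")
    return "<title>%s</title>" % url

sites = ['www.dailynews.lk', 'www.no.dede', 'www.elpais.com']
titles = []
for site in sites:
    url = "http://" + site
    try:
        titles.append(fetch_title(url))
    except URLError as e:   # skip unreachable sites instead of crashing
        print(e)
        continue

print(titles)
```

In real code you would catch urllib.error.HTTPError (covers 404 and other HTTP status errors) and urllib.error.URLError (covers DNS and connection failures) around the actual urlopen call, exactly as the try/except does here.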
Regarding "python - Parsing web pages with BeautifulSoup — skipping 404 error pages", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24322368/