python - Parsing web pages with BeautifulSoup: skipping 404 error pages

I am using the code below to get the titles of some websites.

from bs4 import BeautifulSoup
import urllib2

line_in_list = ['www.dailynews.lk','www.elpais.com','www.dailynews.co.zw']

for websites in line_in_list:
    url = "http://" + websites
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    site_title = soup.find_all("title")
    print site_title

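If the list contains "bad" (non-existent) websites or pages, or a site returns some kind of error such as "404 Page Not Found", the script breaks and stops.

How can I make the script ignore/skip the "bad" (non-existent) and otherwise problematic websites or pages?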

Best answer

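Wrap the urlopen call in try/except and continue to the next site when it fails. This catches both sites that cannot be reached at all and HTTP errors such as 404. The list below includes a deliberately non-existent host, www.no.dede, to show the behaviour.

from bs4 import BeautifulSoup
import urllib2

line_in_list = ['www.dailynews.lk','www.elpais.com',"www.no.dede",'www.dailynews.co.zw']

for websites in line_in_list:
    url = "http://" + websites
    try:
        page = urllib2.urlopen(url)
    except Exception as e:
        # Report the failure and move on to the next site.
        print e
        continue

    soup = BeautifulSoup(page.read())
    site_title = soup.find_all("title")
    print site_title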

Output (the third line is the error printed for www.no.dede):

[<title>Popular News Items | Daily News Online : Sri Lanka's National News</title>]
[<title>EL PAÍS: el periódico global</title>]
<urlopen error [Errno -2] Name or service not known>
[<title>
DailyNews - Telling it like it is
</title>]
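
Catching every Exception works, but it also hides unrelated bugs. As a rough sketch of a narrower variant, the function below is a Python 3 port (where urllib2 became urllib.request and urllib.error) that tells HTTP error statuses apart from unreachable hosts. The names fetch_html, opener and timeout are my own and not part of the answer above, and the 10 second timeout is an arbitrary assumption.

```python
# Python 3 sketch: urllib2 was split into urllib.request and urllib.error.
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen, timeout=10):
    """Return the page body as bytes, or None if the site should be skipped.

    fetch_html, opener and timeout are illustrative names (assumptions),
    not something taken from the original answer.
    """
    try:
        response = opener(url, timeout=timeout)
        return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError must be caught before URLError because it subclasses it.
        print("skipping %s: HTTP %s" % (url, e.code))
    except URLError as e:
        # No usable answer at all: DNS failure, refused connection, etc.
        print("skipping %s: %s" % (url, e.reason))
    except OSError as e:
        # Low-level socket problems, including timeouts while reading.
        print("skipping %s: %s" % (url, e))
    return None
```

In the loop you would call html = fetch_html("http://" + websites), continue when it returns None, and otherwise pass html to BeautifulSoup as before. The opener parameter is only there so the function can be exercised without network access.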

This question about parsing web pages with BeautifulSoup and skipping 404 error pages comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/24322368/
