python - Getting the first URL from a list of URLs using BeautifulSoup

Tags: python python-2.7 html-parsing beautifulsoup

I am trying to use BeautifulSoup to extract the first URL from a list of link tags, but I am stuck. So far, the following code gets me close to the result I am looking for.

rows = results.findAll('p',{'class':'row'})
for row in rows:
  for link in row.findAll('a'):
    print(link)

This prints three <a> tags similar to the following.

<a href="http://something.foo">1</a>
<a href="http://something.bar">2</a>
<a href="http://something.foobar">3</a>

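What I want to do is extract the URL from the first a href. I found another post that describes doing this with a regular expression, but so far I have not been able to get it to work.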

I keep getting this error message:

Traceback (most recent call last):
  File "./scraper.py", line 25, in <module>
    for link in row.find('a', href=re.compile('^http://')):
TypeError: 'NoneType' object is not iterable

Any help or guidance would be appreciated. Let me know what other details I need to post.

Best answer

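If you only want the first result, there is no need to use findAll; you can use find, which returns the first matching tag, or None when nothing matches. A tag's HTML attributes are exposed in BeautifulSoup through dictionary-style access. Finally, if the second argument to find is a string rather than a dictionary, it is matched against the class attribute. You can also pass the class as a keyword argument, but because class is a reserved word in Python, BeautifulSoup 4 spells it class_: find('p', class_='row').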
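The three points above can be checked with a small, self-contained sketch. It assumes BeautifulSoup 4 (the bs4 package) and the built-in html.parser backend; the html string and the variable names are made up for illustration and only mimic the tags shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the <a> tags printed in the question.
html = """
<div id="results">
  <p class="row">
    <a href="http://something.foo">1</a>
    <a href="http://something.bar">2</a>
    <a href="http://something.foobar">3</a>
  </p>
</div>
"""

results = BeautifulSoup(html, "html.parser")

# Three equivalent ways to select the first <p class="row">.
first_row = results.find("p", "row")             # a string second argument is matched against class
same_row = results.find("p", class_="row")       # keyword form; note the trailing underscore
also_same = results.find("p", {"class": "row"})  # explicit attribute dictionary

# find returns a single Tag, while find_all returns every match.
first_link = first_row.find("a")
all_links = first_row.find_all("a")

# Attributes are read with dictionary-style access on the Tag.
print(first_link["href"])
print(len(all_links))
```

In BeautifulSoup 4, findAll is still accepted as an older spelling of find_all, so the code in the question keeps working there.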

With that in mind, you can do what you want with one short line of code:

results.find('p','row').find('a')['href']
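The one-liner raises an error if either find call matches nothing, and the TypeError in the question has the same cause: find returns None when there is no match, and the for loop then tries to iterate over None. A more defensive variant might look like the sketch below. It assumes BeautifulSoup 4; first_href is a hypothetical helper name and the sample markup is invented for illustration.

```python
import re

from bs4 import BeautifulSoup


def first_href(soup):
    """Return the href of the first http:// link in the first <p class="row">, or None."""
    row = soup.find("p", class_="row")
    if row is None:
        return None  # no matching paragraph at all
    # Same filter as in the question, but the result is checked instead of iterated.
    link = row.find("a", href=re.compile(r"^http://"))
    if link is None:
        return None  # the row contains no matching link
    return link["href"]


# Hypothetical inputs: one row with links, one row without any.
with_links = BeautifulSoup(
    '<p class="row"><a href="http://something.foo">1</a>'
    '<a href="http://something.bar">2</a></p>',
    "html.parser",
)
without_links = BeautifulSoup('<p class="row">no links here</p>', "html.parser")

print(first_href(with_links))
print(first_href(without_links))
```

Returning None keeps the decision about a missing link with the caller; raising an exception would be an equally reasonable choice if a missing link should count as a failure.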

A similar question about "python - Getting the first URL from a list of URLs using BeautifulSoup" can be found on Stack Overflow: https://stackoverflow.com/questions/20383865/
