Python + BeautifulSoup : How to get ‘href’ attribute of ‘a’ element?

I have the following:

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>'''

and I want to get only the value of the href, which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

for a in soup.find_all('a', href=True, text=True):
    link_text = a['href']

print("Link: " + link_text)

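But it just prints a blank, nothing at all: only "Link: ". So I tested it on another site, with different HTML, and it worked.

What am I doing wrong? Or is it possible that the site was deliberately programmed not to return the href?

Thanks in advance, I will definitely upvote/accept an answer!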

Best answer

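The 'a' tag in your HTML does not directly contain any text; it contains an 'h3' tag, and that tag contains the text. This means the tag's text (its .string) is None, so .find_all() cannot select the tag when text=True is passed. As a rule, do not use the text argument if a tag contains any HTML elements other than plain text content.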
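To make the point above concrete, here is a minimal sketch using a trimmed copy of the HTML from the question. It assumes Beautiful Soup 4.4.0 or later, where `string` is the newer name for the `text` argument; the variable names `a_string`, `a_text` and `matches` are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a', href=True)

# .string is None because <a> has several children
# (whitespace strings plus the <h3> tag), not a single string.
a_string = a.string

# get_text() collects the text of all descendants instead.
a_text = a.get_text(strip=True)

# Filtering on string=True (formerly text=True) therefore skips this <a>.
matches = soup.find_all('a', href=True, string=True)

print(a_string)  # None
print(a_text)    # File One
print(matches)   # []
```

This would also explain why the same code worked on another site: a link whose only content is plain text has a non-None `.string`, so the filter matches it.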

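You can work around this by selecting elements using only the tag name (and the href keyword argument), then adding a condition inside the loop to check whether they contain text.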

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or, if you prefer a one-liner, you can use a list comprehension.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you can pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
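Note that the function-based filter returns Tag objects rather than href strings, so one more step is needed to get the links. The sketch below uses a named function instead of a lambda and made-up sample HTML. The helper name `is_link_with_text` is hypothetical, and stripping whitespace before testing for text is an assumption of this example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Made-up sample: one link with nested text, one empty link, one anchor without href.
html = '''<a href="/with-text"><h3>Has text</h3></a>
<a href="/empty"></a>
<a name="no-href">No href</a>'''

soup = BeautifulSoup(html, 'html.parser')


def is_link_with_text(tag):
    """Return True for <a> tags that have an href and non-whitespace text."""
    return (
        tag.name == 'a'
        and tag.has_attr('href')
        and bool(tag.get_text(strip=True))
    )


# find_all() calls the function once per tag and keeps those that return True.
tags = soup.find_all(is_link_with_text)

# The result is a list of Tag objects; pull the attribute out afterwards.
hrefs = [tag['href'] for tag in tags]
print(hrefs)  # ['/with-text']
```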

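If you want to collect all links regardless of whether they have text, just select every 'a' tag that has an 'href' attribute. Anchor tags usually have a link, but that is not required, so I think it is best to use the href argument.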

Using .find_all():

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with a CSS selector:

links = [a['href'] for a in soup.select('a[href]')]
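As a quick check that the two approaches select the same tags, the sketch below runs both against made-up HTML that includes an anchor without an href. The sample markup and the variable names `via_find_all` and `via_select` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up sample: two real links and one named anchor with no href.
html = '''<a href="/file-one/additional">File One</a>
<a name="top">Anchor without href</a>
<a href="/file-two">File Two</a>'''

soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only <a> tags that carry an href attribute.
via_find_all = [a['href'] for a in soup.find_all('a', href=True)]

# The CSS attribute selector a[href] expresses the same condition.
via_select = [a['href'] for a in soup.select('a[href]')]

print(via_find_all)  # ['/file-one/additional', '/file-two']
print(via_select)    # ['/file-one/additional', '/file-two']
```

Without the href filter, `a['href']` would raise a KeyError on the named anchor, which is why filtering on the attribute is the safer choice.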

Regarding "Python + BeautifulSoup : How to get ‘href’ attribute of ‘a’ element?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43814754/
