python - How to get the last URL link element using BeautifulSoup

Tags: python python-3.x beautifulsoup

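How can I get the last HTML link on a given page using BeautifulSoup? I am trying to get the link that contains lenta.ru. However, if a page contains several lenta.ru links, every one of them is printed. I only want the last lenta.ru link on each page, which is the pointer link for the translation.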

I get these results

http://lenta.ru/news/2012/09/03/ipsos/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/news/2012/09/04/endofobama/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/news/2012/09/04/response/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://www.lenta.ru/articles/2012/09/05/threat/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/articles/2012/08/21/terranova/ https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

Expected output

http://www.lenta.ru/articles/2012/09/05/threat/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
http://lenta.ru/articles/2012/08/21/terranova/ https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

My code

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Read the list of pages to scan, one URL per line.
with open("./uynaa.txt") as inFile:
    uynaa_txt = inFile.readlines()

for tmp in uynaa_txt:
    # Download and parse each page (the "lxml" parser must be installed).
    html = urlopen(tmp).read()
    soup = BeautifulSoup(html, "lxml")

    # This prints every matching link, not only the last one.
    for a in soup.select('div.entry a'):
        if "lenta.ru" in a.get('href', ''):
            print(a, tmp)

uynaa.txt

https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/
https://uynaa.wordpress.com/2012/08/23/%d1%85%d2%af%d0%bd-%d0%b1%d0%b0-%d0%bc%d3%a9%d1%81/

Best answer

Solution

soup.select('div.entry a')[-1]

Explanation

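soup.select returns a list, and [-1] retrieves the last item in that list. If the page has only one matching link, the last item is also the first item, so this causes no problems. Note that the selector above matches every link inside div.entry; to get the last lenta.ru link specifically, filter the list by href before indexing.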
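A small sketch of the indexing behaviour described above, using plain lists in place of the result of soup.select. The names links, single and last_or_none are made up for illustration. It also shows the one case the explanation does not mention: on an empty result, [-1] raises IndexError, so a guard may be useful.

```python
# Plain lists stand in for the list that soup.select() returns.
links = ["first", "second", "third"]
print(links[-1])  # third

single = ["only"]
print(single[-1] == single[0])  # True: the last item is also the first


def last_or_none(items):
    """Return the last item, or None when the list is empty (hypothetical helper)."""
    return items[-1] if items else None


print(last_or_none([]))  # None, whereas [][-1] would raise IndexError
```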

# full working code

from bs4 import BeautifulSoup
example_page = """
<body>
<a href="http://lenta.ru/news/2012/09/03/ipsos/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/"></a>
<a href="http://lenta.ru/news/2012/09/04/endofobama/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/" ></a>
<a href="http://lenta.ru/news/2012/09/04/response/ https://uynaa.wordpress.com/2012/09/06/%d0%b0%d1%80%d0%b0%d0%b2%d0%b4%d1%83%d0%b3%d0%b0%d0%b0%d1%80-%d1%81%d0%b0%d1%80%d1%8b%d0%bd-%d0%b1%d1%8d%d0%bb%d1%8d%d0%b3/" ></a>
</body>
"""
soup = BeautifulSoup(example_page, "lxml")

print(soup.body.select("a")[-1])
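The question asks for the last link that contains lenta.ru rather than the last link of any kind, so one possible approach is to filter first and index afterwards. The following is a minimal sketch under stated assumptions: the function name last_link_containing, the sample markup, and the choice of the built-in "html.parser" (used here only to avoid depending on lxml) are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup


def last_link_containing(markup, needle, selector="div.entry a"):
    """Return the href of the last selected <a> whose href contains needle, else None."""
    soup = BeautifulSoup(markup, "html.parser")
    # Collect every href matched by the selector; links without href become "".
    hrefs = [a.get("href", "") for a in soup.select(selector)]
    matches = [h for h in hrefs if needle in h]
    # Guard against pages with no matching link instead of raising IndexError.
    return matches[-1] if matches else None


# Made-up sample page: two lenta.ru links followed by an unrelated link.
sample = """
<div class="entry">
  <a href="http://lenta.ru/news/2012/09/03/ipsos/">one</a>
  <a href="http://www.lenta.ru/articles/2012/09/05/threat/">two</a>
  <a href="https://example.com/share">share</a>
</div>
"""

print(last_link_containing(sample, "lenta.ru"))
# http://www.lenta.ru/articles/2012/09/05/threat/
```

In the loop from the question, the same function could be called once per page with the downloaded HTML, printing its result next to tmp so that each page yields at most one line.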

This question, "python - How to get the last URL link element using BeautifulSoup", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/63904967/
