Python 3 BeautifulSoup: get the URL (href or baseURL) if the text of the div with class "caption" contains "English"

Tags: python html beautifulsoup

<div class="gallery" data-tags="19 16 40193 41706 40476 7921 815 425 900 362 229 154 146 13 65 129 766 25 9 51931 188">
    <a href="/g/987654/" class="cover" style="padding:0 0 142.79999999999998% 0">
    <img is="lazyload-image" class="" width="250" height="357" data-src="https://abc.cloud.xyz/galleries/123456/thumb.jpg" alt="" src="https://abc.cloud.xyz/galleries/123456/thumb.jpg">
    <div class="caption">[User] Text ABCDEFGH [English] </div>
    </a>
</div>

The program does not save the URL/href to the txt file. I think it cannot find the href.

If the div element with class "caption" contains the word "English", then the href of the element with class "cover" (/g/987654/) should be saved to a txt file.

from bs4 import BeautifulSoup
import requests

url = "https://google.com"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

base_urls = []
for div in soup.find_all("div", {"class": "caption"}):
    if "English" in div.text:
        a_tag = div.find_previous_sibling("a")
        if a_tag:
            base_urls.append(a_tag["baseURL"])

with open("base_urls.txt", "w") as f:
    for base_url in base_urls:
        f.write(base_url + "\n")

**What I have tried so far:** This code works, but it saves *all* hrefs to the txt file...

from bs4 import BeautifulSoup
import requests

url = "https://google.com"

page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

links = soup.find_all("a")

hrefs = [link["href"] for link in links]

with open("links_test1.txt", "w") as file:
    for href in hrefs:
        file.write(href + "\n")

########################################################################

New section

from bs4 import BeautifulSoup
import requests

lurl = ["https://web.com/page1", "https://web.com/page2", "https://web.com/page3"]

base_urls = []
for url in lurl:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for div in soup.find_all("div", {"class": "caption"}):
        if "English" in div.text:
            a_tag = div.find_previous("a")
            if a_tag:
                base_urls.append(a_tag["href"])

with open("base_urls2.txt", "w") as f:
    for base_url in base_urls:
        f.write(base_url + "\n")

Ideally I could feed in a list, or a txt file containing all the URLs. Any ideas??? Importing a txt file into BeautifulSoup is beyond me... I am still quite new to Python...

The txt file could look like this, with exactly one URL per line:

https://web1.com
https://user1.com
https://web.com
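Reading such a file into a list of URLs needs no BeautifulSoup at all; plain file I/O is enough. A minimal sketch (the filename `links.txt` is an assumption, not from the original code):

```python
def read_urls(path):
    """Read one URL per line from a text file, skipping blank lines.

    The path is whatever file the caller keeps their URLs in, e.g. a
    hypothetical "links.txt" with one URL per line.
    """
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

The resulting list can then replace the hard-coded `lurl` list and be iterated with `for url in read_urls("links.txt"):`.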

Best answer

Looking at the HTML snippet, you should use .find_previous instead of .find_previous_sibling (the `<a>` is an ancestor of the caption div, not a sibling). Also, use a_tag['href'], not a_tag['baseURL']:

from bs4 import BeautifulSoup


html_doc = """\
<div class="gallery" data-tags="19 16 40193 41706 40476 7921 815 425 900 362 229 154 146 13 65 129 766 25 9 51931 188">
    <a href="/g/987654/" class="cover" style="padding:0 0 142.79999999999998% 0">
    <img is="lazyload-image" class="" width="250" height="357" data-src="https://abc.cloud.xyz/galleries/123456/thumb.jpg" alt="" src="https://abc.cloud.xyz/galleries/123456/thumb.jpg">
    <div class="caption">[User] Text ABCDEFGH [English] </div>
    </a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")


base_urls = []
for div in soup.find_all("div", {"class": "caption"}):
    if "English" in div.text:
        a_tag = div.find_previous("a")
        if a_tag:
            base_urls.append(a_tag["href"])

print(base_urls)

Prints:

['/g/987654/']
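Putting this fix together with the multi-URL request, one sketch is to isolate the parsing in a helper and resolve the relative hrefs (such as `/g/987654/`) against each page's URL with `urllib.parse.urljoin`. The filenames `links.txt` and `base_urls.txt` below are assumptions for illustration:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def english_hrefs(html, page_url):
    """Return the hrefs of <a> tags preceding 'caption' divs that mention
    'English', resolved to absolute URLs against page_url."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for div in soup.find_all("div", {"class": "caption"}):
        if "English" in div.text:
            a_tag = div.find_previous("a")
            if a_tag and a_tag.has_attr("href"):
                found.append(urljoin(page_url, a_tag["href"]))
    return found


def save_english_hrefs(in_path="links.txt", out_path="base_urls.txt"):
    """Read one URL per line from in_path, scrape each page, and write
    every matching href to out_path (one per line)."""
    with open(in_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(out_path, "w") as out:
        for url in urls:
            response = requests.get(url)
            for href in english_hrefs(response.text, url):
                out.write(href + "\n")
```

Calling `save_english_hrefs()` runs the whole pipeline; keeping the parsing in `english_hrefs` also makes it easy to test against a static HTML string without hitting the network.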

A similar question to "Python 3 BeautifulSoup: get the URL (href or baseURL) if the caption div text contains 'English'" can be found on Stack Overflow: https://stackoverflow.com/questions/74740364/
