python - Unable to scrape similar links from different depths of a webpage

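I've created a script in python to parse different links from a webpage. There are two sections on the landing page: one is Top Experiences and the other is More Experiences. My current attempt can fetch the links from both categories.

The type of links I want to collect are, for now, the (few) ones under the Top Experiences section. However, when I traverse the links under the More Experiences section, I can see that they all lead to pages containing a section named Experiences, and under that section there are links similar to the ones under Top Experiences on the landing page. I want to grab those as well.

One of the links I'm after looks like this: https://www.airbnb.com/experiences/20712?source=seo.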

website link

My current attempt fetches the links from both categories:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    items = [urljoin(link,item.get("href")) for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq")]
    return items

if __name__ == '__main__':
    for item in get_links(URL):
        print(item)

How can I parse all the links under Top Experiences section along with the links under Experiences section that can be found upon traversing the links under More Experiences?

Please check out the image if anything is unclear. I drew on it with a paint tool, so the writing may be a little hard to read.

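Best answer

The solution is slightly tricky, and it can be achieved in several ways. The approach I find most useful is to call get_links() recursively on the links under More Experiences. All the links under More Experiences share a common keyword, _pdp-.

So, when you define a conditional statement inside the function that sends every link containing _pdp- back through get_links() recursively, the else block yields the desired links. The most important thing to notice is that all the desired links carry the class _1f0v6pq, so the logic for getting the links is fairly simple.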

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

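def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    # Both kinds of links carry the class _1f0v6pq inside the same container.
    for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq"):
        href = item.get("href")
        if not href:
            # Skip anchors that have no href attribute.
            continue
        if "_pdp-" in href:
            # A link from More Experiences: recurse into that page.
            get_links(urljoin(link,href))
        else:
            # An experience link: this is one of the desired links.
            print(urljoin(link,href))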

if __name__ == '__main__':
    get_links(URL)
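
The recursive idea above can also be written so that it returns the links instead of printing them, and so that it never visits the same page twice. The following is only a minimal sketch under stated assumptions: collect_links, fetch_hrefs, FAKE_PAGES and fake_fetch_hrefs are hypothetical names that do not appear in the answer, and the paths /sitemaps/v2/experiences_pdp-L1-0 and /experiences/1 are made-up test data. In real use, fetch_hrefs would presumably wrap the requests and BeautifulSoup calls shown in the answer; here it is replaced with an in-memory stand-in so the traversal can be tried without touching the network.

```python
from urllib.parse import urljoin

START_URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"


def collect_links(link, fetch_hrefs, seen=None):
    """Return experience links reachable from `link`, following "_pdp-" pages.

    `fetch_hrefs` is a hypothetical callable: given a page URL, it returns
    the href values found on that page.
    """
    seen = set() if seen is None else seen
    if link in seen:
        # This page was already crawled, so avoid looping forever.
        return []
    seen.add(link)

    found = []
    for href in fetch_hrefs(link):
        url = urljoin(link, href)
        if "_pdp-" in url:
            # A More Experiences style link: descend into that page.
            found.extend(collect_links(url, fetch_hrefs, seen))
        else:
            # An experience link: keep it.
            found.append(url)
    # Drop duplicates while preserving first-seen order.
    return list(dict.fromkeys(found))


# Made-up stand-in pages (assumption, not real site data).
FAKE_PAGES = {
    START_URL: [
        "/experiences/20712?source=seo",
        "/sitemaps/v2/experiences_pdp-L1-0",
    ],
    "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L1-0": [
        "/experiences/1?source=seo",
        "/sitemaps/v2/experiences_pdp-L0-0",  # points back to the start page
        "/experiences/20712?source=seo",      # duplicate of a landing-page link
    ],
}


def fake_fetch_hrefs(link):
    """Return the hrefs of a stand-in page, or nothing for unknown pages."""
    return FAKE_PAGES.get(link, [])


if __name__ == "__main__":
    for item in collect_links(START_URL, fake_fetch_hrefs):
        print(item)
```

Passing the page fetcher in as an argument keeps the traversal separate from the scraping, and the seen set guards against sitemap pages that link back to each other.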

This question, python - Unable to scrape similar links from different depths of a webpage, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/54362974/
