python - Parsing the HTML source of all hyperlinks on a page

Tags: python beautifulsoup

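I am trying to get the HTML source of all (specified) hyperlinks on a page. The page is https://dota2.gamepedia.com/Category:Counters , and the subsequent pages whose source I want to retrieve are https://dota2.gamepedia.com/Abaddon/Counters , https://dota2.gamepedia.com/Alchemist/Counters , and so on.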

I tried the following code, but got no result:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://dota2.gamepedia.com/Category:Counters').text

soup = BeautifulSoup(source, 'lxml')
links = soup.find_all('div', class_="mw-category-group")


for c in links:
    b = c.find_all('a')
    for a in b:
        u = a.get('href')
        url = "https://dota2.gamepedia.com" + u
        # print("https://dota2.gamepedia.com" + u)
        for sources in url:
            sources = requests.get(url).text
            soup = BeautifulSoup(sources, "lxml")
            print(sources)

#
# print(url)

Best answer
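
One likely reason the loop in the question appears to give no result: `url` is a string, so `for sources in url:` runs once per character and requests the same page again on every pass. A minimal sketch of that behaviour, with the network call replaced by a counter (the counter is an illustrative stand-in, not part of the original code):

```python
url = "https://dota2.gamepedia.com/Abaddon/Counters"

# Iterating over a str yields its characters one at a time,
# so the loop body runs len(url) times for a single URL.
requests_made = 0
for _ in url:
    requests_made += 1  # stands in for requests.get(url) in the question

print(requests_made == len(url))  # True
```

Dropping that inner loop and requesting each URL once avoids the repeated downloads.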

Using a CSS selector is simple and fast. I've added a few print statements to confirm that we are on the right track.

from bs4 import BeautifulSoup
import requests

BASE_URL = "https://dota2.gamepedia.com"

# Fetch the category page and parse it.
source = requests.get(BASE_URL + "/Category:Counters").text
soup = BeautifulSoup(source, "lxml")

# The links to follow sit inside the .mw-category-group blocks.
for link in soup.select(".mw-category-group a"):
    url = BASE_URL + link["href"]
    print(url)
    # Request each linked page once and parse its HTML source.
    sources = requests.get(url).text
    page_soup = BeautifulSoup(sources, "lxml")
    print("Page Header of Subsequest page")
    print(page_soup.select_one("#firstHeading").text)

Output: given the print statements, your console output will look like this.

https://dota2.gamepedia.com/Abaddon/Counters
Page Header of Subsequest page
Abaddon/Counters
https://dota2.gamepedia.com/Alchemist/Counters
Page Header of Subsequest page
Alchemist/Counters
https://dota2.gamepedia.com/Ancient_Apparition/Counters
Page Header of Subsequest page
Ancient Apparition/Counters
https://dota2.gamepedia.com/Anti-Mage/Counters
Page Header of Subsequest page
Anti-Mage/Counters
https://dota2.gamepedia.com/Arc_Warden/Counters
Page Header of Subsequest page
Arc Warden/Counters
https://dota2.gamepedia.com/Axe/Counters
Page Header of Subsequest page
Axe/Counters
https://dota2.gamepedia.com/Bane/Counters
Page Header of Subsequest page
Bane/Counters
https://dota2.gamepedia.com/Batrider/Counters
Page Header of Subsequest page
Batrider/Counters
https://dota2.gamepedia.com/Beastmaster/Counters
Page Header of Subsequest page
Beastmaster/Counters
https://dota2.gamepedia.com/Bloodseeker/Counters
Page Header of Subsequest page
Bloodseeker/Counters
https://dota2.gamepedia.com/Bounty_Hunter/Counters
Page Header of Subsequest page
Bounty Hunter/Counters
https://dota2.gamepedia.com/Brewmaster/Counters
Page Header of Subsequest page
Brewmaster/Counters
https://dota2.gamepedia.com/Bristleback/Counters
Page Header of Subsequest page
Bristleback/Counters
https://dota2.gamepedia.com/Broodmother/Counters
Page Header of Subsequest page
Broodmother/Counters
https://dota2.gamepedia.com/Centaur_Warrunner/Counters
Page Header of Subsequest page
Centaur Warrunner/Counters
https://dota2.gamepedia.com/Chaos_Knight/Counters
Page Header of Subsequest page
Chaos Knight/Counters
https://dota2.gamepedia.com/Chen/Counters
Page Header of Subsequest page
Chen/Counters
https://dota2.gamepedia.com/Clinkz/Counters
Page Header of Subsequest page
Clinkz/Counters
https://dota2.gamepedia.com/Clockwerk/Counters
Page Header of Subsequest page
Clockwerk/Counters
https://dota2.gamepedia.com/Crystal_Maiden/Counters
Page Header of Subsequest page
Crystal Maiden/Counters
https://dota2.gamepedia.com/Dark_Seer/Counters
Page Header of Subsequest page

and so on...
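
If you want to keep the page sources rather than print a heading, the two steps (collecting links, then fetching each one) can be separated so the link collection is checkable without a network connection. The following is a minimal sketch, not the original answer's code: `collect_links`, `fetch_sources`, `SAMPLE_HTML`, and the `get_text` parameter are hypothetical names, the sample markup is a trimmed-down assumption about the category page, and `html.parser` is used in place of `lxml` only to avoid the extra dependency.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://dota2.gamepedia.com"

# Hypothetical, trimmed-down stand-in for the category page markup.
SAMPLE_HTML = """
<div class="mw-category-group">
  <h3>A</h3>
  <ul>
    <li><a href="/Abaddon/Counters">Abaddon/Counters</a></li>
    <li><a href="/Alchemist/Counters">Alchemist/Counters</a></li>
  </ul>
</div>
<div class="other"><a href="/Main_Page">Main Page</a></div>
"""


def collect_links(html, base_url=BASE_URL):
    """Return absolute URLs for every link inside a .mw-category-group block."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select(".mw-category-group a[href]")
    # urljoin turns the site-relative href into an absolute URL.
    return [urljoin(base_url, a["href"]) for a in anchors]


def fetch_sources(urls, get_text):
    """Map each URL to its HTML source, fetching every URL exactly once.

    get_text is any callable that takes a URL and returns the page text,
    for example: lambda u: requests.get(u, timeout=10).text
    """
    return {url: get_text(url) for url in urls}


links = collect_links(SAMPLE_HTML)
print(links)
```

Passing the fetch function in as an argument keeps the sketch free of live requests; with `requests` installed, the lambda shown in the docstring could be supplied to download the real pages.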

Regarding "python - Parsing the HTML source of all hyperlinks on a page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58917614/
