python - How to get full URLs with BeautifulSoup

Tags: python beautifulsoup web-crawler

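I can't figure out how to get the full address of a link. For example, I get "/wiki/Main_Page" instead of "https://en.wikipedia.org/wiki/Main_Page". I can't simply prepend the page URL to the link, because that gives "https://en.wikipedia.org/wiki/WKIK/wiki/Main_Page", which is incorrect. My goal is to make this work for any website, so I'm looking for a general solution.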

Here is the code:

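from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(url)
data = r.text
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(data, "html.parser")

# href=True skips <a> tags that have no href attribute.
for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])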

Here is part of what it returns:

>Found the URL: /wiki/WKIK_(AM)
>Found the URL: /wiki/WKIK-FM
>Found the URL: /wiki/File:Disambig_gray.svg
>Found the URL: /wiki/Help:Disambiguation
>Found the URL: //en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/WKIK&namespace=0

Best answer

The other answers here may run into problems with certain relative URLs, such as those containing dots (../page).

Python's requests library exposes a function called urljoin (re-exported from the standard library's urllib.parse) that builds the full URL:

requests.compat.urljoin(currentPage, link)

So if you are on https://en.wikipedia.org/wiki/WKIK and the page contains a link whose href is /wiki/Main_Page, the function returns https://en.wikipedia.org/wiki/Main_Page.
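
To make the behaviour concrete, here is a minimal sketch that resolves several kinds of href against the page URL. It imports urljoin from urllib.parse, which is the function requests.compat re-exports, so it runs without any third-party package. The helper name absolutize and the sample hrefs (including the bare relative link and the example.org one) are assumptions made for illustration, not part of the original question.

```python
from urllib.parse import urljoin  # same function as requests.compat.urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against the URL of the page it was found on.

    `absolutize` is a hypothetical helper name used only for this sketch.
    """
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == "__main__":
    page_url = "https://en.wikipedia.org/wiki/WKIK"
    sample_hrefs = [
        "/wiki/Main_Page",                  # root-relative path
        "//en.wikipedia.org/w/index.php",   # scheme-relative URL
        "../page",                          # parent-directory reference
        "WKIK-FM",                          # relative to the current directory
        "https://example.org/x",            # already absolute, left unchanged
    ]
    for full_url in absolutize(page_url, sample_hrefs):
        print("Found the URL:", full_url)
```

In the loop from the question, this would amount to printing urljoin(url, link['href']) instead of link['href']. One thing to keep in mind: the base should be the URL the page was actually served from (r.url after redirects is a reasonable choice), since relative links are resolved against it.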

Source: this question originally appeared on Stack Overflow: https://stackoverflow.com/questions/44746021/
