python - How to find all (full) sub-links from a web page using lxml in Python

The following code takes a URL and returns a list of the links contained in the page at that URL.

import urllib
import lxml.html

def getSubLinks(url):
    sublinks = []
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    for link in dom.xpath('//a/@href'):
        sublinks.append(link)
    return sublinks

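This seems to work, but for pages on the same domain it strips the domain from the URL, which is not what I want. I want to get back the full, unmodified links. For example, using it on this web page:

"http://www.nufc.com"

returns this list (and more):

['http://www.altoonativetravel.com/', 'index.htm', '2015-16html/fixtures.html', .....

However, as you can see, the leading "http://www.nufc.com" has been removed from '2015-16html/fixtures.html' and the other entries, which I don't want. I want 'http://www.nufc.com/2015-16html/fixtures.html'. How can I fix this?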

Best answer

You can use the following:

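import urllib  # Python 2; on Python 3, import urllib.request and call urllib.request.urlopen
import lxml.html

def getSubLinks(url):
    sublinks = []
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    for link in dom.xpath('//a/@href'):
        if not link.startswith('http'):
            # Relative link: prefix it with the page URL
            sublinks.append(url+link)
        else:
            # Already an absolute link: keep it unchanged
            sublinks.append(link)
    return sublinks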

When calling the function, use getSubLinks('http://www.nufc.com/') (note the / at the end of the URL).
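
The trailing slash matters because url+link is plain string concatenation: without it, 'index.htm' would be glued directly onto the host name. As a rough sketch of a more forgiving alternative, assuming Python 3, the standard library's urllib.parse.urljoin resolves a relative href against a base URL and leaves absolute URLs untouched. The helper name absolutize is hypothetical and is not part of the original answer.

```python
from urllib.parse import urljoin  # Python 3 (Python 2: from urlparse import urljoin)


def absolutize(base_url, hrefs):
    """Hypothetical helper: resolve every href against base_url."""
    return [urljoin(base_url, href) for href in hrefs]


# The hrefs quoted in the question
hrefs = ['http://www.altoonativetravel.com/', 'index.htm', '2015-16html/fixtures.html']

# Plain concatenation without the trailing slash glues host and path together
broken = 'http://www.nufc.com' + 'index.htm'
print(broken)

# urljoin gives the same result with or without the trailing slash
without_slash = absolutize('http://www.nufc.com', hrefs)
with_slash = absolutize('http://www.nufc.com/', hrefs)
print(without_slash)
```

Inside getSubLinks, this would amount to appending urljoin(url, link) for every link, which also removes the need for the startswith('http') check.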

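The loop in getSubLinks goes through every href attribute of the a tags on the page. For each link that does not start with "http", it appends url+link, i.e. "http://www.nufc.com/" + link. This produces the result set you want.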
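
Note that the startswith('http') test treats every other href as a path relative to the site root, so a link such as '/news' or 'mailto:someone@example.com' would also get the prefix. lxml itself offers make_links_absolute, which rewrites the links in a parsed tree against a base URL. The following is only a sketch under assumptions: the helper name get_sub_links_from_html and the sample page are made up for illustration, and the HTML is passed in as a string so the example runs without network access.

```python
import lxml.html


def get_sub_links_from_html(html_text, base_url):
    """Hypothetical helper: return every <a href> in html_text as an absolute URL."""
    dom = lxml.html.fromstring(html_text)
    # Rewrite every link in the tree so it is resolved against base_url
    dom.make_links_absolute(base_url)
    return [str(href) for href in dom.xpath('//a/@href')]


# Made-up sample page, only for illustration
sample = """<html><body>
<a href="http://www.altoonativetravel.com/">external</a>
<a href="index.htm">home</a>
<a href="2015-16html/fixtures.html">fixtures</a>
</body></html>"""

result = get_sub_links_from_html(sample, 'http://www.nufc.com/')
print(result)
```

In real use, html_text would be the bytes returned by connection.read() and base_url the URL that was fetched.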

This question, "python - How to find all (full) sub-links from a web page using lxml in Python", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/34615152/
