I found a nice tool on GitHub that lets you enter a URL and extracts the links from it: https://github.com/devharsh/Links-Extractor
However, I want to extract every URL on a page, not just the clickable links. For example, if a site's HTML contains:
<a href="www.example.com">test</a>
in plaintext HTML: www.example.com
and <img src="www.example.com/picture.png">
it should print:
www.example.com
www.example.com
www.example.com/picture.png
I am new to Python, and I have not found any online tool that extracts URLs from multiple pages. I want to be able to enter several URLs, run the script once, and have it pull every URL from each page I entered; the tools I have found only accept one URL at a time and extract the links from that single page.
Here is the Python code (edited to handle UTF-8 and percent encoding):
#!/usr/bin/python

__author__ = "Devharsh Trivedi"
__copyright__ = "Copyright 2018, Devharsh Trivedi"
__license__ = "GPL"
__version__ = "1.4"
__maintainer__ = "Devharsh Trivedi"
__email__ = "devharsh@live.in"
__status__ = "Production"

import sys
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

try:
    for link in sys.argv[1:]:
        page = requests.get(link)
        soup = BeautifulSoup(page.text, "lxml")

        extlist = set()
        intlist = set()

        # Only <a href="..."> tags are examined; fragments and non-HTTP
        # schemes (javascript:, mailto:, tel:) are skipped.
        for a in soup.findAll("a", attrs={"href": True}):
            if len(a['href'].strip()) > 1 and a['href'][0] != '#' and 'javascript:' not in a['href'].strip() and 'mailto:' not in a['href'].strip() and 'tel:' not in a['href'].strip():
                if 'http' in a['href'].strip() or 'https' in a['href'].strip():
                    # Same host as the requested page -> internal, else external.
                    if urlparse(link).netloc.lower() in urlparse(a['href'].strip()).netloc.lower():
                        intlist.add(a['href'])
                    else:
                        extlist.add(a['href'])
                else:
                    # Relative links are treated as internal.
                    intlist.add(a['href'])

        print('\n')
        print(link)
        print('---------------------')
        print('\n')
        print(str(len(intlist)) + ' internal links found:')
        print('\n')
        for il in intlist:
            print(il.encode("utf-8"))
        print('\n')
        print(str(len(extlist)) + ' external links found:')
        print('\n')
        for el in extlist:
            print(el.encode("utf-8"))
        print('\n')

except Exception as e:
    print(e)
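For what it's worth, the script above only walks <a href> tags, so it misses the <img src> and plain-text cases described earlier. Below is a minimal sketch of my own (not part of the linked repo) showing how the same requests + BeautifulSoup setup could be extended; the plain-text pattern is deliberately rough and the function name is just for illustration:

import re
import sys

import requests
from bs4 import BeautifulSoup

def extract_all_urls(page_url):
    """Collect href values, img src values, and bare URLs in the page text."""
    soup = BeautifulSoup(requests.get(page_url).text, "lxml")
    found = set()
    for a in soup.find_all("a", href=True):      # clickable links
        found.add(a["href"].strip())
    for img in soup.find_all("img", src=True):   # image sources
        found.add(img["src"].strip())
    # Rough pattern for bare URLs in the visible text; it will miss some forms.
    for match in re.findall(r"(?:https?://|www\.)[^\s\"'<>]+", soup.get_text()):
        found.add(match)
    return found

if __name__ == "__main__":
    for url in sys.argv[1:]:
        for u in sorted(extract_all_urls(url)):
            print(u)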
Best answer
Here is a quick regex for identifying URLs:
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
In practice, that looks like:
import re
import requests
import sys

def find_urls(links):
    url_list = []
    for link in links:
        page = requests.get(link).text
        # findall returns a (scheme, domain, path) tuple per match,
        # so the full URL is reassembled from the three groups.
        parts = re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', page)
        true_url = [p + '://' + d + sd for p, d, sd in parts]
        url_list.extend(true_url)
    return url_list

print(find_urls(sys.argv[1:]))
The output of:
print(find_urls(['https://www.google.com']))
is:
['http://schema.org/WebPage', 'https://www.google.com/imghp?hl=en&tab=wi', 'https://maps.google.com/maps?hl=en&tab=wl', 'https://play.google.com/?hl=en&tab=w8', 'https://www.youtube.com/?gl=US&tab=w1', 'https://news.google.com/nwshp?hl=en&tab=wn', 'https://mail.google.com/mail/?tab=wm', 'https://drive.google.com/?tab=wo', 'https://www.google.com/intl/en/about/products?tab=wh', 'http://www.google.com/history/optout?hl=en', 'https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/']
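As a side note (my addition, not part of the original answer), re.findall returns one tuple per set of capture groups here, which is why the snippet above reassembles each URL from its scheme, domain, and path pieces. A quick check on a made-up string (example.com is just a placeholder):

import re

pattern = r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
sample = 'see <a href="https://example.com/a?b=1">x</a> and http://schema.org/WebPage'
# Each match comes back as (scheme, domain, path):
print(re.findall(pattern, sample))
# [('https', 'example.com', '/a?b=1'), ('http', 'schema.org', '/WebPage')]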
Thanks to Rajeev here for the regex.
Edit: given the author's updated use case, after some trial and error I found this new regex:
((https?:\/\/.+)?(\/.*)+)
Here it is in practice:
def find_urls(links):
    url_list = []
    for link in links:
        page = requests.get(link).text
        parts = re.findall(r'((https?:\/\/.+)?(\/.*)+)', page)
        # findall returns a tuple per capture group; keep the full match (group 1).
        url_list.extend(p[0] for p in parts)
    return url_list
I can't guarantee this works for every use case (I'm not a regex expert), but it should handle most of the URLs and file paths you will find in a web page.
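If you go this route, a usage sketch of my own, mirroring the earlier command-line call, would be:

import sys

# set() drops duplicates, since this broader pattern tends to over-match.
print(sorted(set(find_urls(sys.argv[1:]))))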
Original question on Stack Overflow (Python command prompt - automatically extract links): https://stackoverflow.com/questions/57153534/