I found a nice tool on GitHub that lets you enter a URL and extracts the links from it: https://github.com/devharsh/Links-Extractor
However, I want to extract every URL on a page, not just the clickable links. For example, if a site's HTML contains:
<a href="www.example.com">test</a>
in plaintext HTML: www.example.com
and <img src="www.example.com/picture.png">
it should print:
www.example.com
www.example.com
www.example.com/picture.png
I am new to Python, and I have not found any online tool that extracts URLs from multiple pages. I want to be able to enter several URLs, run the script once, and have it pull every URL from each page I entered; the tools I have found only accept one URL at a time and extract the links from that single page.
Here is the Python code (edited to handle UTF-8 and percent encoding):
#!/usr/bin/python

__author__ = "Devharsh Trivedi"
__copyright__ = "Copyright 2018, Devharsh Trivedi"
__license__ = "GPL"
__version__ = "1.4"
__maintainer__ = "Devharsh Trivedi"
__email__ = "devharsh@live.in"
__status__ = "Production"

import sys
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

try:
    for link in sys.argv[1:]:
        page = requests.get(link)
        soup = BeautifulSoup(page.text, "lxml")

        extlist = set()
        intlist = set()

        # Only <a href="..."> tags are examined; fragments and non-HTTP
        # schemes (javascript:, mailto:, tel:) are skipped.
        for a in soup.findAll("a", attrs={"href": True}):
            if len(a['href'].strip()) > 1 and a['href'][0] != '#' and 'javascript:' not in a['href'].strip() and 'mailto:' not in a['href'].strip() and 'tel:' not in a['href'].strip():
                if 'http' in a['href'].strip() or 'https' in a['href'].strip():
                    # Same host as the requested page -> internal, else external.
                    if urlparse(link).netloc.lower() in urlparse(a['href'].strip()).netloc.lower():
                        intlist.add(a['href'])
                    else:
                        extlist.add(a['href'])
                else:
                    # Relative links are treated as internal.
                    intlist.add(a['href'])

        print('\n')
        print(link)
        print('---------------------')
        print('\n')
        print(str(len(intlist)) + ' internal links found:')
        print('\n')
        for il in intlist:
            print(il.encode("utf-8"))
        print('\n')
        print(str(len(extlist)) + ' external links found:')
        print('\n')
        for el in extlist:
            print(el.encode("utf-8"))
        print('\n')

except Exception as e:
    print(e)
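For what it's worth, the script above only walks <a href> tags, so it misses the <img src> and plain-text cases described earlier. Below is a minimal sketch of my own (not part of the linked repo) showing how the same requests + BeautifulSoup setup could be extended; the plain-text pattern is deliberately rough and the function name is just for illustration:

import re
import sys

import requests
from bs4 import BeautifulSoup

def extract_all_urls(page_url):
    """Collect href values, img src values, and bare URLs in the page text."""
    soup = BeautifulSoup(requests.get(page_url).text, "lxml")
    found = set()
    for a in soup.find_all("a", href=True):      # clickable links
        found.add(a["href"].strip())
    for img in soup.find_all("img", src=True):   # image sources
        found.add(img["src"].strip())
    # Rough pattern for bare URLs in the visible text; it will miss some forms.
    for match in re.findall(r"(?:https?://|www\.)[^\s\"'<>]+", soup.get_text()):
        found.add(match)
    return found

if __name__ == "__main__":
    for url in sys.argv[1:]:
        for u in sorted(extract_all_urls(url)):
            print(u)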
Best answer
Here is a quick regex for identifying URLs:
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
In practice, that looks like:
import re
import requests
import sys

def find_urls(links):
    url_list = []
    for link in links:
        page = requests.get(link).text
        # findall returns a (scheme, domain, path) tuple per match,
        # so the full URL is reassembled from the three groups.
        parts = re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', page)
        true_url = [p + '://' + d + sd for p, d, sd in parts]
        url_list.extend(true_url)
    return url_list

print(find_urls(sys.argv[1:]))
The output of:
print(find_urls(['https://www.google.com']))
is:
['http://schema.org/WebPage', 'https://www.google.com/imghp?hl=en&tab=wi', 'https://maps.google.com/maps?hl=en&tab=wl', 'https://play.google.com/?hl=en&tab=w8', 'https://www.youtube.com/?gl=US&tab=w1', 'https://news.google.com/nwshp?hl=en&tab=wn', 'https://mail.google.com/mail/?tab=wm', 'https://drive.google.com/?tab=wo', 'https://www.google.com/intl/en/about/products?tab=wh', 'http://www.google.com/history/optout?hl=en', 'https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/']
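As a side note (my addition, not part of the original answer), re.findall returns one tuple per set of capture groups here, which is why the snippet above reassembles each URL from its scheme, domain, and path pieces. A quick check on a made-up string (example.com is just a placeholder):

import re

pattern = r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
sample = 'see <a href="https://example.com/a?b=1">x</a> and http://schema.org/WebPage'
# Each match comes back as (scheme, domain, path):
print(re.findall(pattern, sample))
# [('https', 'example.com', '/a?b=1'), ('http', 'schema.org', '/WebPage')]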
Thanks to Rajeev here for the regex.
Edit: given the author's updated use case, after some trial and error I found this new regex:
((https?:\/\/.+)?(\/.*)+)
Here it is in practice:
def find_urls(links):
    url_list = []
    for link in links:
        page = requests.get(link).text
        parts = re.findall(r'((https?:\/\/.+)?(\/.*)+)', page)
        # findall returns a tuple per capture group; keep the full match (group 1).
        url_list.extend(p[0] for p in parts)
    return url_list
I can't guarantee this works for every use case (I'm not a regex expert), but it should handle most of the URLs and file paths you will find in a web page.
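If you go this route, a usage sketch of my own, mirroring the earlier command-line call, would be:

import sys

# set() drops duplicates, since this broader pattern tends to over-match.
print(sorted(set(find_urls(sys.argv[1:]))))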
Original question on Stack Overflow (Python command prompt - automatically extract links): https://stackoverflow.com/questions/57153534/