python - Find all HTML and non-HTML encoded URLs in a string

I want to find all the URLs in a string. I have found several solutions on StackOverflow, and which one applies depends on what the string contains.

For example, if my string contains HTML, this answer recommends using BeautifulSoup or lxml.

On the other hand, if my string contains only plain URLs with no HTML markup, this answer recommends using a regular expression.

My string contains both URLs inside HTML markup and plain URLs, and I have not been able to find a good solution for that case. Here is some sample code:

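import lxml.html

# Sample input: one URL inside an <a> tag, one bare URL in plain text
example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
dom = lxml.html.fromstring(example_data)
# The XPath selects only the href attributes of <a> elements
for link in dom.xpath('//a/@href'):
    print("Found Link: ", link)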

As expected, this prints only the URL from the anchor tag:

Found Link:  http://www.some-random-domain.com/abc123/def.html

I also tried the twitter-text-python library mentioned by @Yannisp, but it does not seem to extract both URLs either; it returns only the plain one:

>>> from ttp.ttp import Parser
>>> p = Parser()
>>> r = p.parse(example_data)
>>> r.urls
['http://www.another-random-domain.com/xyz.html']

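What is the best way to extract both kinds of URL from a string that mixes HTML and non-HTML data? Is there a good module that does this, or am I forced to combine a regular expression with BeautifulSoup/lxml?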

Best answer

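I upvoted the question because it piqued my curiosity. There seems to be a library called twitter-text-python, which parses Twitter posts to detect URLs and hrefs. Otherwise, I would go with a regex + lxml combination.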
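A minimal sketch of that combined approach follows. To keep it self-contained it swaps lxml for the standard library's `html.parser`: the parser collects `href` values from `<a>` tags, and a regular expression scans the remaining text nodes for bare URLs. The names `LinkCollector`, `find_all_urls`, and `URL_PATTERN` are hypothetical, and the pattern is deliberately simple, so treat this as an illustration rather than a complete URL matcher.

```python
import re
from html.parser import HTMLParser

# Assumption: a deliberately simple pattern for illustration only.
# It will miss or over-match some real-world URLs.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and bare URLs from text nodes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # URLs that live inside HTML markup
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

    def handle_data(self, data):
        # URLs that appear as plain text between or after tags
        self.urls.extend(URL_PATTERN.findall(data))


def find_all_urls(text):
    """Return the URLs found in text, in document order, without duplicates."""
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return list(dict.fromkeys(collector.urls))


example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""

for url in find_all_urls(example_data):
    print("Found Link: ", url)
```

Scanning only the text nodes with the regex, instead of the raw string, avoids reporting an `href` value twice when the link text repeats the URL. The same structure should carry over to lxml by taking the `href` values from `dom.xpath('//a/@href')` and applying the regex to the document's text content.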

This post on python - Find all HTML and non-HTML encoded URLs in a string is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/30330696/
