python - Remove all forms of URLs from a given string in Python

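I am new to Python and was wondering whether there is a better solution for matching all forms of URLs that might be found in a given string. Googling turns up plenty of solutions that extract the domain, replace URLs with links, and so on, but none that remove them from a string. I have included an example below for reference. Thanks!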

str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Best answer

Include an encoding declaration at the top of the source file (the regex string contains non-ASCII symbols such as »), for example:

# -*- coding: utf-8 -*-
import re
...

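Also wrap your regex string in triple single (or double) quotes, ''' or """, instead of single quotes, because the string itself already contains quote characters (' and ").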

r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
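Putting both suggestions together, a minimal sketch might look like the following. It is written for Python 3 (an assumption; the question uses Python 2.7, where print is a statement), and the names URL_PATTERN and remove_urls are hypothetical, chosen only for illustration. In Python 3 the source encoding already defaults to UTF-8, so the coding line is harmless but only strictly needed on Python 2.

```python
# -*- coding: utf-8 -*-
import re

# The pattern from the answer, kept in a triple-quoted raw string so that the
# ' and " characters inside it do not terminate the literal.
# URL_PATTERN is a hypothetical name used for this sketch.
URL_PATTERN = re.compile(
    r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
)


def remove_urls(text):
    """Return text with every match of URL_PATTERN replaced by an empty string."""
    return URL_PATTERN.sub('', text)


# Shortened version of the sample sentence from the question.
text = 'for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
print('==' + remove_urls(text) + '==')
```

Note that only the URLs themselves are removed, so the whitespace that surrounded each URL stays in the result.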

This question about removing all forms of URLs from a given string in Python was originally asked on Stack Overflow: https://stackoverflow.com/questions/14081050/
