python - Extracting text between two markers with a Python regex and handling backslashes

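My file contains a number of URLs. Some of them are embedded between specific start and end markers, and others are not. I only need to extract the URLs embedded between the start marker and the end marker.

A line in my inputfile.txt looks like this: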

some gibberish data-start=\"https:\/\/cdn.net\/hphotos-ak-xfa1\/1.jpg\" data-end this is useless text, some gibberishhh data-start=\"https:\/\/cdn.net\/hphotos-xaf1\/2.jpg\" data-end some gibberish fake-data-start=\"https:\/\/cdn.net\/hphotos-xaf1\/2.jpg\" fake-data-end

The start and end markers for the URLs I need are data-start and data-end, not fake-data-start and fake-data-end.

Right now I am using the following regex in Python to extract those URLs:

(?<=\ data-start=\\\")([^"]+\.[^"]+\.[^"]+)(?=\"\ data-end)

I believe the regex above is valid; I tested it using this link.

My Python code is:

import re
import string
import sys

s = re.compile('(?<=\ data-start=\\\")([^"]+\.[^"]+\.[^"]+)(?=\"\ data-end)')

fin = open('inputfile.txt')

for line in fin:
    m = s.findall(line)
    if m:
        print(m)

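However, my Python code fails to find the URLs. On the other hand, if I remove all the backslashes from the file, the code above works fine. I cannot explain this difference.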

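Best answer

The backslash is the escape character in a regular expression, so to match one literal backslash (\) the pattern needs two (\\). Your file has a backslash in front of every quote, but the \" in your pattern matches only a bare quote, which is why the code works once the backslashes are removed from the file. You can use the following regex here: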

(?<=data-start=\\").*?(?=\\" data-end)

Explanation:

(?<=              # look behind to see if there is:
  data-start=     #   'data-start='
  \\              #   '\'
  "               #   '"'
)                 # end of look-behind
.*?               # any character except \n (0 or more times, as few as possible)
(?=               # look ahead to see if there is:
  \\              #   '\'
  " data-end      #   '" data-end'
)                 # end of look-ahead
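To see why the pattern from the question never matches, it may help to print what the regex engine actually receives. The sketch below is only an illustration: the sample line is shortened, and the names non_raw and raw are made up for this example. In a non-raw string literal Python itself consumes one level of backslashes before the regex engine sees the pattern.

```python
import re

# Shortened sample line (an assumption); every quote is preceded by a literal backslash.
line = r'x data-start=\"https:\/\/cdn.net\/1.jpg\" data-end y'

# Non-raw literal, as in the question: Python turns \\ into \ and \" into ",
# so the regex engine receives \" and that matches a bare quote only.
non_raw = '(?<=data-start=\\\")'

# Raw literal: the regex engine receives \\" and that matches backslash + quote.
raw = r'(?<=data-start=\\")'

print(non_raw)                    # (?<=data-start=\")
print(raw)                        # (?<=data-start=\\")
print(re.search(non_raw, line))   # None
print(re.search(raw, line) is not None)   # True
```

Writing patterns as raw strings (r'...') keeps the backslashes you type identical to the backslashes the regex engine reads.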

Note: if your data spans multiple lines, use the inline (?s) modifier to force the dot to match newlines.

(?s)(?<=data-start=\\").*?(?=\\" data-end)
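A small sketch of the difference, assuming a made-up value that is split across two lines. Passing re.DOTALL has the same effect as putting (?s) at the start of the pattern. Keep in mind that a match can only span lines if the text is searched as one string, for example after reading the whole file with read(), rather than line by line.

```python
import re

# Hypothetical sample: the URL is broken by a newline in the middle.
text = r'data-start=\"https:\/\/cdn.net' + '\n' + r'\/1.jpg\" data-end'

pattern = r'(?<=data-start=\\").*?(?=\\" data-end)'

# Without the flag the dot stops at the newline, so nothing is found.
without_flag = re.findall(pattern, text)

# With re.DOTALL (equivalent to the inline (?s)) the dot also matches \n.
with_flag = re.findall(pattern, text, re.DOTALL)

print(without_flag)
print(with_flag)
```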

Final solution:

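import re

# Raw string: \\ reaches the regex engine intact and matches one literal backslash.
regex = re.compile(r'(?<=data-start=\\").*?(?=\\" data-end)')

# The with block closes the file once the loop is done.
with open('inputfile.txt', 'r') as myfile:
    for line in myfile:
        # findall returns every non-overlapping match on the line.
        for m in regex.findall(line):
            print(m)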

Output

https:\/\/cdn.net\/hphotos-ak-xfa1\/1.jpg
https:\/\/cdn.net\/hphotos-xaf1\/2.jpg
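The URLs above still carry the escaped slashes (\/), and the look-behind on data-start= is also satisfied inside fake-data-start=. On the sample line that causes no harm because the fake marker comes last, but a fake start followed later by a real data-end could produce a wrong match. The following is only a sketch of one possible tightening, not part of the original answer: the helper name extract_urls, the shortened sample, and the assumption that a real marker is never preceded by a word character or a hyphen are all mine.

```python
import re

# Assumption: a real data-start is never directly preceded by a word
# character or a hyphen, which rules out fake-data-start.
MARKED_URL = re.compile(r'(?<![\w-])data-start=\\"(.*?)\\" data-end')

def extract_urls(line):
    """Return the URLs between data-start and data-end with escaped slashes unescaped."""
    # findall returns the captured group for each match; '\\/' is backslash + slash.
    return [url.replace('\\/', '/') for url in MARKED_URL.findall(line)]

# Hypothetical line where a fake marker sits between two real ones.
sample = (r'x data-start=\"https:\/\/cdn.net\/1.jpg\" data-end y '
          r'fake-data-start=\"https:\/\/cdn.net\/2.jpg\" fake-data-end z '
          r'data-start=\"https:\/\/cdn.net\/3.jpg\" data-end')

for url in extract_urls(sample):
    print(url)
```

A capturing group is used here instead of look-arounds so that the markers are consumed as part of the match and only the URL is returned.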

Regarding python - extracting text between two markers with a Python regex and handling backslashes, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24194075/
