python - Printing out optional regex groups

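I am trying to print out values stored in a dictionary. The values are produced by a regular expression.

At the moment I have some optional fields, but I am not sure I am handling them correctly: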

(field A(field B)? field C (field D)?)?

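I read a quick reference that says ? means 0 or 1 occurrences.

When I try to look up a field such as reputation or content-type, I get None, because those fields are optional in my regex. My regex may well be wrong, but I would like to know why it prints None whenever I look up a field that sits inside an optional (...)? group.

My code: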

import re

httpproxy515139 = re.compile(r'....url\=\"(?P<url>(.*))\"(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?')

f  = open("sophos-httpproxy.out", "r")
fw = open("sophosfilter.log", "w+")

HttpProxyCount = 0
otherCount = 0

for line in f.readlines():
    HttpProxy = re.search(httpproxy515139, line)
    if HttpProxy is None:
        # re.search returns None when the line does not match at all
        continue
    fields = HttpProxy.groupdict()

    print("AV Field: ")
    print("Date/Time: " + str(fields['categoryname']))

Here is the full regex:

(?P<datetime>\w\w\w\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*httpproxy\[(?P<HTTPcode>(.*))\]:\s+id\=\"(?P<id>([^\"]*))\"\s+severity\=\"(?P<Severity>([^\"]*))\"\s+sys\=\"(?P<sys>([^\"]*))\"\s+sub\=\"(?P<sub>([^\"]*))\"\s+name\=\"(?P<name>([^\"]*))\"\s+action\=\"(?P<action>([^\"]*))\"\s+method\=\"(?P<method>([^\"]*))\"\s+srcip\=\"(?P<srcip>([^\"]*))\"\s+dstip\=\"(?P<dstip>([^\"]*))\"\s+user\=\"(?P<user>[^\"]*)\"\s+statuscode\=\"(?P<statuscode>([^\"]*))\"\s+cached\=\"(?P<cached>([^\"]*))\"\s+profile\=\"(?P<profile>([^\"]*))\"\s+filteraction\=\"(?P<filteraction>([^\"]*))\"\s+size\=\"(?P<size>([^\"]*))\"\s+request\=\"(?P<request>([^\"]*))\"\s+url\=\"(?P<url>(.*))\"(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?

Here is a sample input:

Oct 7 13:22:55 192.168.10.2 2013: 10:07-13:22:54 httpproxy[15359]: id="0001" severity="info" sys="SecureWeb" sub="http" name="http access" action="pass" method="GET" srcip="192.168.8.47" dstip="64.94.90.108" user="" statuscode="200" cached="0" profile="REF_DefaultHTTPProfile (Default Proxy)" filteraction="REF_DefaultHTTPCFFAction (Default content filter action)" size="1502" request="0x10870200" url="http://www.concordmonitor.com/csp/mediapool/sites/dt.common.streams.StreamServer.cls?STREAMOID=6rXcvJGqsPivgZ7qnO$Sic$daE2N3K4ZzOUsqbU5sYvZF78hLWDhaM8n_FuBV1yRWCsjLu883Ygn4B49Lvm9bPe2QeMKQdVeZmXF$9l$4uCZ8QDXhaHEp3rvzXRJFdy0KqPHLoMevcTLo3h8xh70Y6N_U_CryOsw6FTOdKL_jpQ-&CONTENTTYPE=image/jpeg" exceptions="" error="" category="134" reputation="neutral" categoryname="General News" content-type="image/jpeg"

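I want to capture the entire log line.

But sometimes the url contains a lot of quotation marks, which confuses things. Also, in some logs there is an extra reputation field between error and category (the reputation_opt group in my regex). content-type does not always appear either. Sometimes every data field after url is missing as well. That is why I added all the optional ? markers. I am trying to account for these cases and print None where necessary.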

Best answer

Let's split your regex into two parts:

....url\=\"(?P<url>(.*))\"

and

(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?

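The .* in the first part is greedy. It matches everything it possibly can and backtracks only when it absolutely has to.

The second part is one giant optional group.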

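When the regex runs, .* first matches everything up to the end of the string, then backtracks only as far as needed for \" to match a quote. That will be the last quote in the string, which is probably not the quote you wanted.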
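The backtracking behaviour can be seen on a much smaller input. The following is only a sketch: the log line is a shortened, hypothetical stand-in for the real one, and the pattern keeps just the url part.

```python
import re

# Hypothetical, shortened log line (not taken from the real log file).
line = 'url="http://a/b" category="134" reputation="neutral"'

# Greedy .* runs to the end of the line, then backs up to the LAST quote.
greedy = re.search(r'url="(?P<url>.*)"', line)
print(greedy.group('url'))
# http://a/b" category="134" reputation="neutral
```

The url group ends up holding the category and reputation fields as well, because the closing \" matched the final quote on the line.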

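The giant optional group then tries to match, but because the greedy .* has already consumed everything the group was supposed to parse, it fails. Since the group is optional, the regex engine is fine with that: the overall match still succeeds, and every named group inside the skipped group is reported as None.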
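As a small illustration of where the None comes from, here is a sketch with a hypothetical two-field line and a cut-down pattern (one optional group instead of the full one). It also shows the default argument of groupdict(), which substitutes a value of your choice for groups that did not take part in the match.

```python
import re

# Hypothetical, shortened log line and pattern for illustration only.
line = 'url="http://a/b" category="134"'
pattern = re.compile(r'url="(?P<url>.*)"(\s+category="(?P<category>[^"]*)")?')

match = pattern.search(line)

# The optional group was skipped, so its named group is None.
fields = match.groupdict()
print(fields['category'])        # None

# groupdict(default=...) replaces None for groups that did not participate.
filled = match.groupdict(default='')
print(repr(filled['category']))  # ''
```

Note that supplying a default only hides the symptom: the category text is still sitting inside the url group.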

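How do you fix it? Well, a non-greedy quantifier may help with the immediate problem, but a better solution is probably to stop trying to parse this with a regex. Look for an existing parser for your data format. Are you trying to extract data from HTML or XML? I see BeautifulSoup recommended a lot.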
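For the non-greedy route, one possible sketch is shown below. It assumes a simplified line with only url, category and content-type fields; the pattern and the parse helper are illustrative names, not part of the original code. The end-of-line anchor matters here: without it, a lazy .*? would stop at the first quote and the optional groups would be skipped again.

```python
import re

# Simplified, hypothetical pattern: lazy url, optional trailing fields,
# and an end-of-line anchor that forces the optional groups to be tried.
pattern = re.compile(
    r'url="(?P<url>.*?)"'
    r'(?:\s+category="(?P<category>[^"]*)")?'
    r'(?:\s+content-type="(?P<content_type>[^"]*)")?'
    r'\s*$'
)

def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = pattern.search(line)
    return match.groupdict() if match else None

# All fields present.
print(parse('url="http://a/b" category="134" content-type="image/jpeg"'))
# Everything after url missing: the optional fields come back as None.
print(parse('url="http://a/b"'))
# Quotes inside the url: the anchor makes .*? extend until the rest fits.
print(parse('url="http://a/?q="x"" category="134"'))
```

Here None means the field really was absent from the line, which is the behaviour the question asks for. The same idea would need to be extended to the remaining fields of the full pattern.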

Regarding python - printing out optional regex groups, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21342251/
