python - Regex extracts unwanted IP addresses from a log file

Tags: python regex

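I have a sever.log file. My regex extracts every run of four dot-separated numbers, not only the IP addresses at the start of each line. Sample log lines: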

192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042 "http://cooper.com/homepage/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5342 (KHTML, like Gecko) Chrome/14.0.870.0 Safari/5342"
10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939 "http://www.cole-brown.net/category/main/list/privacy/" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/5322 (KHTML, like Gecko) Chrome/14.0.843.0 Safari/5322"
98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 "http://knight-chase.com/post.jsp" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3; rv:1.9.6.20) Gecko/2013-11-03 17:44:01 Firefox/3.8"

My code

import re
with open (r'C:\Users\ubuntu\Desktop\Tests\apache.log', 'r') as fr1:
    line1 = fr1.read()
regex = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
#print(re.findall(regex, line1, re.DOTALL))
listofip = (re.findall(regex, line1))
result ={}
for i in listofip:
    result[i] = listofip.count(i)
result

My output

{'192.168.10.20': 1,
 '14.0.870.0': 1,
 '10.30.24.3': 1,
 '14.0.843.0': 1,
 '98.5.45.3': 1,
 '1.9.6.20': 1}

Expected output

{'192.168.10.20': 1,
 '10.30.24.3': 1,
 '98.5.45.3': 1}

Best answer

If every line starts with an IP, you can simply read the file line by line, split each line on whitespace, and take the first item:

#line1=r'''192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042 "http://cooper.com/homepage/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5342 (KHTML, like Gecko) Chrome/14.0.870.0 Safari/5342"
#10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939 "http://www.cole-brown.net/category/main/list/privacy/" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/5322 (KHTML, like Gecko) Chrome/14.0.843.0 Safari/5322"
#98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 "http://knight-chase.com/post.jsp" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3; rv:1.9.6.20) Gecko/2013-11-03 17:44:01 Firefox/3.8"
#98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 "http://knight-chase.com/post.jsp" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3; rv:1.9.6.20) Gecko/2013-11-03 17:44:01 Firefox/3.8"'''
result ={}
with open (r'C:\Users\ubuntu\Desktop\Tests\apache.log', 'r') as fr1:
    for line in fr1:
        ip = line.split()[0]
        if ip in result:
            result[ip] += 1
        else:
            result[ip] = 1
print(result)
# => {'192.168.10.20': 1, '10.30.24.3': 1, '98.5.45.3': 2}

See the Python demo.
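The same split-based idea can be sketched a little more defensively. This is only an illustration: `count_leading_ips` and the shortened `sample` lines are hypothetical names and data introduced here, and `collections.Counter` stands in for the manual dictionary bookkeeping. It assumes, like the answer, that the IP is always the first field; it additionally skips blank lines, where `line.split()[0]` would raise an `IndexError`.

```python
from collections import Counter


def count_leading_ips(lines):
    """Count the first whitespace-separated field of each non-empty line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # a blank line yields an empty list, so skip it
            counts[fields[0]] += 1
    return counts


# Shortened versions of the log lines from the question, plus one blank line.
sample = [
    '192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042',
    '',
    '98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958',
    '98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958',
]
print(dict(count_leading_ips(sample)))
# => {'192.168.10.20': 1, '98.5.45.3': 2}
```

Passing an open file object instead of `sample` should work the same way, since iterating a file yields its lines.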

To get only the IPs at the start of each line with a regex alone, you can use

r'(?m)^\d{1,3}(?:\.\d{1,3}){3}'

See the regex demo.
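To see why the anchor matters, here is a small comparison on the first log line from the question. The variable names `line`, `unanchored`, and `anchored` are chosen for this sketch only. Without `^`, the browser version number in the user agent string has the same four-group shape as an IP and is picked up as well.

```python
import re

# First sample line from the question, split across literals for readability.
line = ('192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042 '
        '"http://cooper.com/homepage/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5342 '
        '(KHTML, like Gecko) Chrome/14.0.870.0 Safari/5342"')

# The question's pattern: matches anywhere in the text.
unanchored = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)

# The answer's pattern: (?m)^ restricts matches to the start of a line.
anchored = re.findall(r"(?m)^\d{1,3}(?:\.\d{1,3}){3}", line)

print(unanchored)  # => ['192.168.10.20', '14.0.870.0']
print(anchored)    # => ['192.168.10.20']
```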

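Note that a stricter IP regex for matching at the start of a line, one that limits each octet to 0-255 (see this reference), is the following. Pass re.M or prefix the pattern with (?m) when applying it to multi-line text: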

r'^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}'

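Or even this one, which also requires the IP to be followed by whitespace or the end of the string, given that every IP here is followed by a space: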

r'^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}(?!\S)'
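As a rough check of what the stricter pattern accepts, the sketch below builds it from a shared `OCTET` fragment and runs it over a few made-up lines. `OCTET`, `STRICT_IP`, and the test lines are illustrative names and data, not part of the original answer, and `re.M` is passed so that `^` matches at every line start.

```python
import re

# One octet in the range 0-255, as in the answer's pattern.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Four octets at the start of a line, not followed by a non-whitespace character.
STRICT_IP = re.compile(rf"^{OCTET}(?:\.{OCTET}){{3}}(?!\S)", re.M)

# Made-up input: one valid line, two invalid ones, and a bare IP at the end.
text = "\n".join([
    "192.168.10.20 - - ok",
    "999.10.10.10 - - octet out of range",
    "10.30.24.3.7 - - five groups",
    "98.5.45.3",
])

print(STRICT_IP.findall(text))
# => ['192.168.10.20', '98.5.45.3']
```

The `(?!\S)` lookahead is what rejects `10.30.24.3.7`: after four octets the next character is a dot, which is not whitespace.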

Details

  • (?m)^ - start of a line
  • \d{1,3} - 1 to 3 digits
  • (?:\.\d{1,3}){3} - three occurrences of a . followed by 1 to 3 digits.

See the Python demo:

import re
line1=r'''192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042 "http://cooper.com/homepage/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5342 (KHTML, like Gecko) Chrome/14.0.870.0 Safari/5342"
10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939 "http://www.cole-brown.net/category/main/list/privacy/" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/5322 (KHTML, like Gecko) Chrome/14.0.843.0 Safari/5322"
98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 "http://knight-chase.com/post.jsp" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3; rv:1.9.6.20) Gecko/2013-11-03 17:44:01 Firefox/3.8"'''

rx = r"^\d{1,3}(?:\.\d{1,3}){3}\b"
listofip = re.findall(rx, line1, re.M)
result ={}
for ip in listofip:
    if ip in result:
        result[ip] += 1
    else:
        result[ip] = 1
print(result)
# => {'192.168.10.20': 1, '10.30.24.3': 1, '98.5.45.3': 1} 
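For a large log it may be preferable not to read the whole file into one string. A possible variant, sketched under that assumption, applies the same pattern line by line with `match()`, which only tries position 0 and so needs neither `^` nor `re.M`. `IP_AT_START`, `count_ips`, and `fake_log` are hypothetical names, and `io.StringIO` stands in for the real file with abbreviated sample lines.

```python
import io
import re
from collections import Counter

# Same shape as the answer's pattern; no ^ needed because match() starts at position 0.
IP_AT_START = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}\b")


def count_ips(stream):
    """Count leading IPs in any iterable of lines, such as an open file."""
    counts = Counter()
    for line in stream:
        m = IP_AT_START.match(line)
        if m:  # lines that do not start with an IP are ignored
            counts[m.group()] += 1
    return counts


# Stand-in for the log file, using abbreviated lines based on the question's sample.
fake_log = io.StringIO(
    '10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939 Chrome/14.0.843.0\n'
    '98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 rv:1.9.6.20\n'
    'not a log line\n'
    '98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 rv:1.9.6.20\n'
)
print(dict(count_ips(fake_log)))
# => {'10.30.24.3': 1, '98.5.45.3': 2}
```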

This question, python - Regex extracts unwanted IP addresses from a log file, comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/57520033/
