python - Using a regular expression to extract information from a specific text format

Tags: python regex string python-3.x

I have a text that contains lines like these:

(some text)
libncursesw5-dev:amd64 depends on libc6-dev | libc-dev;(some text)
libx32ncursesw5 depends on libc6-x32 (>= 2.16);(some text)
libx32ncurses5-dev depends on libncurses5-dev (= 5.9+20150516-2ubuntu1);(some text)
libx32ncursesw5-dev depends on libc6-dev-x32;(some text)
lib32tinfo-dev depends on lib32c-dev;(some text)

Here is a complete example of one of these sentences in context:

dpkg: error processing package lib32tinfo5 (--install):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of libncurses5-dev:amd64:
 libncurses5-dev:amd64 depends on libc6-dev | libc-dev; however:
    Package libc6-dev is not installed.
    Package libc-dev is not installed.

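The whole text is divided into paragraphs like the one above, and each paragraph contains one of these sentences.

I would like a regular expression, used with Python's re library, that gives me something like this via findall: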

('libc6-dev', '', 'libc-dev', '')
('libc6-x32','2.16')
('libncurses5-dev','5.9+20150516-2ubuntu1')
('libc6-dev-x32','')
('lib32c-dev','')

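In other words, I need help extracting tuples that contain each package and its version (if specified) from this kind of text.

I came up with this regular expression: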

(?<=depends on )([a-zA-Z0-9\-]*)(?: \([=> ]*([a-zA-Z0-9-+.]*)(?:\)))?|(?: \| )([a-zA-Z0-9\-]*)(?: \([=> ]*([a-zA-Z0-9-+.]*)(?:\)))?(?=;)

And I got this result:

('libc6-dev', '', '', '')
('', '', 'libc-dev', '')
('libc6-x32', '2.16', '', '')
('libncurses5-dev', '5.9+20150516-2ubuntu1', '', '')
('libc6-dev-x32', '', '', '')
('lib32c-dev', '', '', '')

As you can see, for this sentence:

libncursesw5-dev:amd64 depends on libc6-dev | libc-dev;

I get this answer:

('libc6-dev', '', '', '')
('', '', 'libc-dev', '')

instead of this:

('libc6-dev', '', 'libc-dev', '')

Thank you for your help.

Best answer

#!/usr/bin/env python3

import re

# Named "text" rather than "input" so the built-in input() is not shadowed.
text = """(some text)
libncursesw5-dev:amd64 depends on libc6-dev | libc-dev;(some text)
libx32ncursesw5 depends on libc6-x32 (>= 2.16);(some text)
libx32ncurses5-dev depends on libncurses5-dev (= 5.9+20150516-2ubuntu1);(some text)
libx32ncursesw5-dev depends on libc6-dev-x32;(some text)
lib32tinfo-dev depends on lib32c-dev;(some text)"""

# The three cases can also be matched one at a time and concatenated:
# a = []
# a += re.findall(r"depends on ([^\s;]+) \| ([^\s;]+)", text)           # line 1
# a += re.findall(r"depends on ([^\s;]+) \([><=]{,2} ([^;]+)\)", text)  # lines 2, 3
# a += re.findall(r"depends on ([^\s;]+)", text)  # meant for lines 4, 5; run alone it matches every line

# Or all at once, joined with "|", most specific alternative first.
m = re.findall(
    r"depends on ([^\s;]+) \| ([^\s;]+)"
    r"|depends on ([^\s;]+) \([><=]{,2} ([^;]+)\)"
    r"|depends on ([^\s;]+)",
    text,
)

print(m)

Output:

[
    ('libc6-dev', 'libc-dev', '', '', ''),
    ('', '', 'libc6-x32', '2.16', ''),
    ('', '', 'libncurses5-dev', '5.9+20150516-2ubuntu1', ''),
    ('', '', '', '', 'libc6-dev-x32'),
    ('', '', '', '', 'lib32c-dev')
]

You can match the cases one at a time, or all at once by joining the patterns with |. I'm not sure whether this helps you.
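The single alternation above still leaves empty groups in every tuple, because findall always returns one slot per capture group in the pattern. One possible way to get the flat tuples the question asks for is to parse in two steps: first capture the whole dependency list, then split it into alternatives. This is only a minimal sketch. The names DEPENDS, ALTERNATIVE and parse_dependencies are illustrative and not from the original post, and it assumes that alternatives are separated by " | " and that each dependency list ends with ";".

```python
import re

text = """(some text)
libncursesw5-dev:amd64 depends on libc6-dev | libc-dev;(some text)
libx32ncursesw5 depends on libc6-x32 (>= 2.16);(some text)
libx32ncurses5-dev depends on libncurses5-dev (= 5.9+20150516-2ubuntu1);(some text)
libx32ncursesw5-dev depends on libc6-dev-x32;(some text)
lib32tinfo-dev depends on lib32c-dev;(some text)"""

# Step 1: capture everything between "depends on " and the next ";".
DEPENDS = re.compile(r"depends on ([^;]+);")
# Step 2: one alternative is a package name plus an optional "(op version)".
ALTERNATIVE = re.compile(r"^([^\s()]+)(?: \([<>=]{1,2} ([^)]+)\))?$")


def parse_dependencies(text):
    """Return one flat (name, version, name, version, ...) tuple per sentence."""
    results = []
    for dep_list in DEPENDS.findall(text):
        flat = []
        # Assumption: alternatives are separated by " | ".
        for alt in dep_list.split(" | "):
            match = ALTERNATIVE.match(alt.strip())
            if match is None:
                continue  # skip anything that does not look like a dependency
            # An unspecified version becomes an empty string.
            flat.extend([match.group(1), match.group(2) or ""])
        results.append(tuple(flat))
    return results


for row in parse_dependencies(text):
    print(row)
```

With the sample text, the first sentence yields ('libc6-dev', '', 'libc-dev', '') and the second yields ('libc6-x32', '2.16'). Splitting outside the regular expression also means a sentence with three or more alternatives needs no extra capture groups.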

Regarding python - Using a regular expression to extract information from a specific text format, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36528348/
