python - Using a string length defined in the matched string

Tags: python parsing pyparsing

Given the following string:

commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <XXXXXXXXXXXXXXX@XXXXX.XXX>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit

3       0       README.MD

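How can I use the value 110 in a grammar definition to match the rest? The "log size" covers the fields (here Author and Date, but there can be any number of fields) and the actual message.

The last line is not part of the "log message".

What I want to get is the commit value, a dictionary holding the metadata such as Author and Date, and the actual log message, here "First commit".

The problem is that log size tells me how long the message is, but that length also includes the Author and Date fields.

110 is the size of this string: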

Author: XXXXXX XXXXXXXX <XXXXXXXXXXXXXXX@XXXXX.XXX>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit
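
The arithmetic behind the 110 can be checked with a short sketch. It assumes the masked e-mail address stands for a 25-character placeholder and that every line ends with a single "\n"; the names log_block and line_lengths are only illustrative.

```python
# Assumption: the masked address is 25 characters long and lines end with "\n".
log_block = (
    "Author: XXXXXX XXXXXXXX <" + "X" * 15 + "@" + "X" * 5 + ".XXX>\n"
    "Date:   Tue, 10 Apr 2012 11:19:44 +0300\n"
    "\n"
    "    First commit\n"
)

# keepends=True keeps the newline of each line so that it is counted too
line_lengths = [len(line) for line in log_block.splitlines(keepends=True)]
print(line_lengths)       # [52, 40, 1, 17]
print(sum(line_lengths))  # 110
```

So the count covers the two field lines, the blank separator line and the indented message, each with its newline.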

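Best answer

My algorithm follows the same idea as NPE's.
But I took it a step further by bringing in regular expressions.

I extended the analysed text with a second log entry, taking care to put the correct number of characters in its "log size xxx\n" line.

regx1 splits each occurrence into 4 groups. The third group contains the lines that hold the dictionary, and the fourth group contains the trailing lines that come after the dictionary lines and before the next occurrence.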
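
For comparison, here is a minimal sketch of the length-driven idea without any regex. It is not NPE's code, only an illustration under a few assumptions: a single record, "\n" line endings, ASCII text (so characters equal bytes), and a blank line between the fields and the message. The function name parse_commit is made up for this example.

```python
def parse_commit(text):
    """Parse one 'commit / log size / fields / message' record (hypothetical helper)."""
    commit_line, size_line, rest = text.split("\n", 2)
    commit = commit_line.split(None, 1)[1]
    size = int(size_line.rsplit(None, 1)[1])

    # "log size" counts every character of the fields plus the message
    log_block = rest[:size]

    # Assumption: the first blank line separates the fields from the message
    header, _, message = log_block.partition("\n\n")

    fields = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")  # split at the first colon only
        fields[key.strip()] = value.strip()
    return commit, fields, message.strip()


sample = (
    "commit a8c11fcee68881dfb86095aa36290fb304047cf1\n"
    "log size 110\n"
    "Author: XXXXXX XXXXXXXX <XXXXXXXXXXXXXXX@XXXXX.XXX>\n"
    "Date:   Tue, 10 Apr 2012 11:19:44 +0300\n"
    "\n"
    "    First commit\n"
    "\n"
    "3       0       README.MD\n"
)

commit, fields, message = parse_commit(sample)
print(commit)
print(fields)
print(repr(message))
```

Because the slice stops after exactly 110 characters, the "3       0       README.MD" line never reaches the message.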

import re

ss = """commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <XXXXXXXXXXXXXXX@XXXXX.XXX>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit
3       0       README.MD
blablah bla
commit 12458777AFDRE1254
log size 170
   Author: Jim Bluefish <jimblfsh@gmail.com>
Date   :   Yesterday 21:45:01 +0800
  A key with whitespace :       A_stupid_value    

    Funny commit
  From far from you
457      popo       not_README.MD"""

n = 0
print('------ DISPLAY OF THE TEXT ------\n'
      ' col 1: index of line,\n'
      ' col 2: number of chars in the line\n'
      ' col 3: total of the numbers of chars of lines\n'
      ' col 4: repr(line)\n')
# splitlines(True) keeps the newline of each line so that it is counted too
for j, line in enumerate(ss.splitlines(True)):
    n += len(line)
    print('%2d  %2d  %3d  %r' % (j, len(line), n, line))


print('=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=')
print('\n\n\n------ ANALYSER 2 OF THE TEXT ------')

# group 1: commit value, group 2: log size,
# group 3: the "key: value" lines, group 4: trailing lines up to the next commit
regx1 = re.compile(r'^commit +(.+) *\r?\n'
                   r'log size +(\d+) *\r?\n'
                   r'((?:^ *.+?(?<! ) *: *.+(?<! ) *\r?\n)+)'
                   r'((?:.*\r?\n(?!commit))+)',
                   re.MULTILINE)

# splits one "key: value" line, dropping the spaces around key and value
regx2 = re.compile(r'^ *(.+?)(?<! ) *: *(.+)(?<! ) *\r?\n',
                   re.MULTILINE)

for mat in regx1.finditer(ss):

    commit_value, logsize, dicolines, msg = mat.groups()

    print('\ncommit_value == %s\n'
          'logsize == %s'
          % (commit_value, logsize))

    print('dictionary :')
    print(dict(regx2.findall(dicolines)))

    # log size covers the dictionary lines plus the message, so the message
    # is what remains once the length of the dictionary lines is subtracted
    actual_log_message = msg[0:int(logsize) - len(dicolines)].strip(' \r\n')
    print('actual_log_message ==', repr(actual_log_message))

Result

------ DISPLAY OF THE TEXT ------
 col 1: index of line,
 col 2: number of chars in the line
 col 3: total of the numbers of chars of lines
 col 4: repr(line)

 0  48   48  'commit a8c11fcee68881dfb86095aa36290fb304047cf1\n'
 1  13   61  'log size 110\n'
 2  52  113  'Author: XXXXXX XXXXXXXX <XXXXXXXXXXXXXXX@XXXXX.XXX>\n'
 3  40  153  'Date:   Tue, 10 Apr 2012 11:19:44 +0300\n'
 4   1  154  '\n'
 5  17  171  '    First commit\n'
 6  26  197  '3       0       README.MD\n'
 7  12  209  'blablah bla\n'
 8  25  234  'commit 12458777AFDRE1254\n'
 9  13  247  'log size 170\n'
10  45  292  '   Author: Jim Bluefish <jimblfsh@gmail.com>\n'
11  36  328  'Date   :   Yesterday 21:45:01 +0800\n'
12  51  379  '  A key with whitespace :       A_stupid_value    \n'
13   1  380  '\n'
14  17  397  '    Funny commit\n'
15  20  417  '  From far from you\n'
16  33  450  '457      popo       not_README.MD'
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=



------ ANALYSER 2 OF THE TEXT ------

commit_value == a8c11fcee68881dfb86095aa36290fb304047cf1
logsize == 110
dictionary :
{'Author': 'XXXXXX XXXXXXXX <XXXXXXXXXXXXXXX@XXXXX.XXX>', 'Date': 'Tue, 10 Apr 2012 11:19:44 +0300'}
actual_log_message == 'First commit'


commit_value == 12458777AFDRE1254
logsize == 170
dictionary :
{'Author': 'Jim Bluefish <jimblfsh@gmail.com>', 'Date': 'Yesterday 21:45:01 +0800', 'A key with whitespace': 'A_stupid_value'}
actual_log_message == 'Funny commit\n  From far from you'
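
To see in isolation how the dictionary lines are split, here is a small sketch that applies the same pattern as regx2 (written as a raw string) to the two awkward lines of the second entry. The names lines and pairs are only illustrative.

```python
import re

# Same pattern as regx2 above, written as a raw string
regx2 = re.compile(r'^ *(.+?)(?<! ) *: *(.+)(?<! ) *\r?\n', re.MULTILINE)

lines = ('Date   :   Yesterday 21:45:01 +0800\n'
         '  A key with whitespace :       A_stupid_value    \n')

pairs = regx2.findall(lines)
print(pairs)
# [('Date', 'Yesterday 21:45:01 +0800'), ('A key with whitespace', 'A_stupid_value')]
```

The key is matched lazily, so the split happens at the first colon and the colons of the time stay in the value. The (?<! ) lookbehinds stop the key and the value from ending in a space, which is what trims the padding around them.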

Regarding python - Using a string length defined in the matched string, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/13658806/
