python - Editing pyparsing parse results

Tags: python parsing pyparsing logfile

This is similar to a question I've asked before.

I have written a pyparsing grammar, logparser, for a text file containing multiple logs. A log records every function call and every function completion. The underlying process is multithreaded, so it is possible that a slow function A is called, then a fast function B is called and completes almost immediately, and only then does function A complete and give us its return value. This makes the log file very hard to read by hand, because the call information and the return-value information of a single function can be thousands of lines apart.
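For illustration only (the question does not show the raw log format), a hypothetical excerpt consistent with the example data below might interleave like this:

12:01  thread 123  call Foo(...)
12:02  thread 456  call Bar(...)
12:03  thread 456  Bar returned ...  (func_time 1)
12:04  thread 123  Foo returned ...  (func_time 3)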

My parser is able to parse the function calls (called input_blocks from now on) and their return values (called output_blocks from now on). My parse results (logparser.searchString(logfile)) look like this:

[0]:                            # first log
  - input_blocks:
    [0]:
      - func_name: 'Foo'
      - parameters: ...
      - thread: '123'
      - timestamp_in: '12:01'
    [1]:
      - func_name: 'Bar'
      - parameters: ...
      - thread: '456'
      - timestamp_in: '12:02'
  - output_blocks:
    [0]:
      - func_name: 'Bar'
      - func_time: '1'
      - parameters: ...
      - thread: '456'
      - timestamp_out: '12:03'
    [1]:
      - func_name: 'Foo'
      - func_time: '3'
      - parameters: ...
      - thread: '123'
      - timestamp_out: '12:04'
[1]:                            # second log
  - input_blocks:
    ...
  - output_blocks:
    ...
...                             # n-th log

I want to fix the problem that the input and output information of a function call are separated, so I want to put each input_block and its corresponding output_block into a function_block. My final parse results should look like this:

[0]:                            # first log
  - function_blocks:
    [0]:
        - input_block:
            - func_name: 'Foo'
            - parameters: ...
            - thread: '123'
            - timestamp_in: '12:01'
        - output_block:
            - func_name: 'Foo'
            - func_time: '3'
            - parameters: ...
            - thread: '123'
            - timestamp_out: '12:04'
    [1]:
        - input_block:
            - func_name: 'Bar'
            - parameters: ...
            - thread: '456'
            - timestamp_in: '12:02'
        - output_block:
            - func_name: 'Bar'
            - func_time: '1'
            - parameters: ...
            - thread: '456'
            - timestamp_out: '12:03'
[1]:                            # second log
  - function_blocks:
    [0]: ...
    [1]: ...
...                             # n-th log

To achieve this, I defined a function rearrange that iterates over input_blocks and output_blocks and checks whether func_name, thread, and the timestamps match. The part I am missing is moving the matched blocks into one function_block. I then set this function as the parse action of the log grammar: logparser.setParseAction(rearrange)

def rearrange(log_token):
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                and output_block.thread == input_block.thread
                and check_timestamp(output_block.timestamp_out,
                                    output_block.func_time,
                                    input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                # modify log_token
                pass
    return log_token
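check_timestamp is not shown in the question. A minimal sketch of what it could look like, assuming 'HH:MM' timestamps and func_time given in minutes (which is consistent with the example data above, e.g. Foo: 12:01 + 3 minutes = 12:04), might be:

from datetime import datetime, timedelta

def check_timestamp(timestamp_out, func_time, timestamp_in):
    # Hypothetical helper (not from the question): treat the timestamps as
    # 'HH:MM' strings and func_time as a duration in minutes, and consider
    # the blocks a match when timestamp_in + func_time equals timestamp_out.
    t_in = datetime.strptime(timestamp_in, '%H:%M')
    t_out = datetime.strptime(timestamp_out, '%H:%M')
    return t_in + timedelta(minutes=int(func_time)) == t_out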

My question is: how do I put the matched output_block and input_block into a function_block while still keeping the convenient access methods of pyparsing.ParseResults?

My idea was something like this:

def rearrange(log_token):
    # define a new ParseResults object in which I store matching input & output blocks
    function_blocks = pp.ParseResults(name='function_blocks')

    # find matching blocks
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                and output_block.thread == input_block.thread
                and check_timestamp(output_block.timestamp_out,
                                    output_block.func_time,
                                    input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                function_blocks.append(input_block.pop() + output_block.pop())  # this addition causes a maximum recursion error?
    log_token.append(function_blocks)
    return log_token

But this doesn't work. The addition causes a maximum recursion error, and .pop() doesn't work as expected. It doesn't pop the whole block, only the last entry of that block. It also doesn't really delete that entry; it only removes it from the list, but it can still be accessed via its results name.

It is also possible that some input_blocks have no corresponding output_block (for example, if the process crashes before all functions have completed). So my parse results should have the attributes input_blocks and output_blocks (for the leftover blocks) as well as function_blocks (for the matched blocks).

Thanks for your help!

EDIT:

I made a simpler example to illustrate my problem. I also experimented and found a solution that sort of works, but it is a bit messy. I have to admit it involved a lot of trial and error, since I could neither find documentation on ParseResults nor figure out how ParseResults works internally and how to properly build my own nested ParseResults structure.

from pyparsing import *

def main():
    log_data = '''\
    Func1_in
    Func2_in
    Func2_out
    Func1_out
    Func3_in'''

    ParserElement.inlineLiteralsUsing(Suppress)
    input_block = Group(Word(alphanums)('func_name') + '_in').setResultsName('input_blocks', listAllMatches=True)
    output_block = Group(Word(alphanums)('func_name') +'_out').setResultsName('output_blocks', listAllMatches=True)
    log = OneOrMore(input_block | output_block)

    parse_results = log.parseString(log_data)
    print('***** before rearranging *****')
    print(parse_results.dump())

    parse_results = rearrange(parse_results)
    print('***** after rearranging *****')
    print(parse_results.dump())

def rearrange(log_token):
    function_blocks = list()

    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and delete them from their original positions in log_token
                # I have to do both __setitem__ and .append so it shows up in the dict and in the list
                # and .copy() is necessary because I delete the original objects later
                tmp_function_block = ParseResults()
                tmp_function_block.__setitem__('input', input_block.copy())
                tmp_function_block.append(input_block.copy())
                tmp_function_block.__setitem__('output', output_block.copy())
                tmp_function_block.append(output_block.copy())
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate data

                function_blocks.append(function_block)
                # delete from original position in log_token
                input_block.clear()
                output_block.clear()
    log_token.__setitem__('function_blocks', sum(function_blocks))
    return log_token


if __name__ == '__main__':
    main()

Output:

***** before rearranging *****
[['Func1'], ['Func2'], ['Func2'], ['Func1'], ['Func3']]
- input_blocks: [['Func1'], ['Func2'], ['Func3']]
  [0]:
    ['Func1']
    - func_name: 'Func1'
  [1]:
    ['Func2']
    - func_name: 'Func2'
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [['Func2'], ['Func1']]
  [0]:
    ['Func2']
    - func_name: 'Func2'
  [1]:
    ['Func1']
    - func_name: 'Func1'
***** after rearranging *****
[[], [], [], [], ['Func3']]
- function_blocks: [['Func1'], ['Func1'], ['Func2'], ['Func2'], [], []]   # why is this duplicated? I just want the inner function_blocks!
  - function_blocks: [[['Func1'], ['Func1']], [['Func2'], ['Func2']], [[], []]]
    [0]:
      [['Func1'], ['Func1']]
      - input: ['Func1']
        - func_name: 'Func1'
      - output: ['Func1']
        - func_name: 'Func1'
    [1]:
      [['Func2'], ['Func2']]
      - input: ['Func2']
        - func_name: 'Func2'
      - output: ['Func2']
        - func_name: 'Func2'
    [2]:                              # where does this come from?
      [[], []]
      - input: []
      - output: []
- input_blocks: [[], [], ['Func3']]
  [0]:                                # how do I delete these indexes?
    []                                #  I think I only cleared their contents
  [1]:
    []
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [[], []]
  [0]:
    []
  [1]:
    []

Best answer

This version of rearrange fixes most of the issues I see in your example:

def rearrange(log_token):
    function_blocks = list()

    for input_block in log_token.input_blocks:
        # look for match among output blocks that have not been cleared
        for output_block in filter(None, log_token.output_blocks):

            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and clear them in their original positions in log_token

                # create rearranged block, first with a list of the two blocks
                # instead of append()'ing, just initialize with a list containing
                # the two block copies
                tmp_function_block = ParseResults([input_block.copy(), output_block.copy()])

                # now assign the blocks by name
                # x.__setitem__(key, value) is the same as x[key] = value
                tmp_function_block['input'] = tmp_function_block[0]
                tmp_function_block['output'] = tmp_function_block[1]

                # wrap that all in another ParseResults, as if we had matched a Group
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output

                del function_block['input'], function_block['output']  # remove duplicate name references

                function_blocks.append(function_block)
                # clear blocks in their original positions in log_token, so they won't be matched any more
                input_block.clear()
                output_block.clear()

                # match found, no need to keep looking for a matching output block
                break

    # find all input blocks that weren't cleared (had matching output blocks) and append as input-only blocks
    for input_block in filter(None, log_token.input_blocks):
        # no matching output for this input
        tmp_function_block = ParseResults([input_block.copy()])
        tmp_function_block['input'] = tmp_function_block[0]
        function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                      modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
        del function_block['input']  # remove duplicate data
        function_blocks.append(function_block)
        input_block.clear()

    # clean out log_token, and reload with rearranged function blocks
    log_token.clear()
    log_token.extend(function_blocks)
    log_token['function_blocks'] =  sum(function_blocks)

    return log_token

Since it takes the input tokens and returns the rearranged tokens, you can make it a parse action as-is:

    # trailing '*' on the results name is equivalent to listAllMatches=True
    input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
    output_block = Group(Word(alphanums)('func_name') +'_out')('output_blocks*')
    log = OneOrMore(input_block | output_block)
    log.addParseAction(rearrange)

Since rearrange updates log_token in place, you don't need the trailing return statement once it is used as a parse action.
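For completeness, here is a minimal usage sketch that combines this grammar with the sample log_data from the question, assuming the rearrange above is defined in the same module. With the parse action attached, parseString already returns the rearranged result:

from pyparsing import *

log_data = '''\
Func1_in
Func2_in
Func2_out
Func1_out
Func3_in'''

ParserElement.inlineLiteralsUsing(Suppress)
input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
output_block = Group(Word(alphanums)('func_name') + '_out')('output_blocks*')
log = OneOrMore(input_block | output_block)
log.addParseAction(rearrange)  # rearrange as defined above

print(log.parseString(log_data).dump())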

It's interesting how you update the lists in place by clearing the blocks for which a match was found - very clever.

In general, assembling the tokens into a ParseResults is an internal function, so the documentation says little on this topic. I've just skimmed the module docs and I don't really see a good home for this material.

The original question, "python - Editing pyparsing parse results", can be found on Stack Overflow: https://stackoverflow.com/questions/51293558/
