python - Getting a grammar to read multiple keywords in text

Tags: python pyparsing

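I still consider myself a newcomer to pyparsing. I threw together two quick grammars, and neither succeeds at what I am trying to do. The grammar looks like it should be very simple to write, but it is turning out (at least for me) to be less trivial than expected. The language has one basic definition: the text is split into keywords and bodies. A body can span multiple lines. A keyword appears at the beginning of a line, within the first 20 or so characters, and is terminated by a ';' (without the quotes). I wrote a quick demo program so that I could test a couple of grammars. However, when I run them, they always get the first keyword and nothing after it.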

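I have attached the source code as an example, along with the output it produces. Even though this is only test code, I documented it out of habit. In the example below, the two keywords are NOW; and lastly; Ideally, I would not want the semicolon included in the keyword.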

Any ideas on what I should do to make this work?

from pyparsing import *

def testString(text,grammar):
    """
    @summary: perform a test of a grammar
    @type text: text
    @param text: text buffer for input (a message to be parsed)
    @type grammar: MatchFirst or equivalent pyparsing construct
    @param grammar: some grammar defined somewhere else
    @type pgm: text
    @param pgm: typically name of the program, which invoked this function.
    @status: 20130802 CODED
    """
    print 'Input Text is %s' % text
    print 'Grammar is %s' % grammar
    tokens = grammar.parseString(text)
    print 'After parse string: %s' % tokens
    tokens.dump()
    tokens.keys()

    return tokens


def getText(msgIndex):
    """
    @summary: make a text string suitable for parsing
    @returns: returns a text buffer
    @type msgIndex: int
    @param msgIndex: a number corresponding to a text buffer to retrieve
    @status: 20130802 CODED
    """

    msg = [  """NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
    """,
         '',
      ]

    return msg[msgIndex]

def getGrammar(grammarIndex):
    """
    @summary: make a grammar given an index
    @type: grammarIndex: int
    @param grammarIndex: a number corresponding to the grammar to be retrieved
    @Note: a good run will return 2 keys: NOW: and lastly:  and each key will have an associated body. The body is all
    words and text up to the next keyword or eof which ever is first.
    """
    kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
    kw.setDebug(True)
    body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
    body1.setDebug(True)
    g1 = OneOrMore(Group(kw + body1))

    # ok start defining a new grammar (borrow kw from grammar).

    body2 = SkipTo(~kw, include=False)('BODY')
    body2.setDebug(True)

    g2 = OneOrMore(Group(kw+body2))
    grammar = [g1,
           g2,
          ]
    return grammar[grammarIndex]


if __name__ == '__main__':
    # list indices [ text, grammar ]
    tests = {1: [0,0],
         2: [0,1],
        }
    check = tests.keys()
    check.sort()
    for testno in check:
        print 'STARTING Test %d' % testno
        text = getText(tests[testno][0])
        grammar = getGrammar(tests[testno][1])
        tokens = testString(text, grammar)
        print 'Tokens found %s' % tokens
        print 'ENDING Test %d' % testno

The output looks like this (using Python 2.7 and pyparsing 2.0.1):

    STARTING Test 1
    Input Text is NOW; is the time for a few good ones to come to the aid
    of new things to come for it is almost time for
    a tornado to strike upon a small hill
    when least expected.
    lastly; another day progresses and
    then we find that which we seek
    and finally we will
    find our happiness perhaps its closer than 1 or 2 years or not so

    Grammar is {Group:({Combine:({W:(abcd...) ";"}) {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]...})}...
    Match Combine:({W:(abcd...) ";"}) at loc 0(1,1)
    Matched Combine:({W:(abcd...) ";"}) -> ['NOW;']
    Match {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... at loc 4(1,5)
    Match Combine:({W:(abcd...) ";"}) at loc 161(4,20)
    Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20)
    Matched {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... -> ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
    Match Combine:({W:(abcd...) ";"}) at loc 161(4,20)
    Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20)
    After parse string: [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']]
    Tokens found [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']]
    ENDING Test 1
    STARTING Test 2
    Input Text is NOW; is the time for a few good ones to come to the aid
    of new things to come for it is almost time for
    a tornado to strike upon a small hill
    when least expected.
    lastly; another day progresses and
    then we find that which we seek
    and finally we will
    find our happiness perhaps its closer than 1 or 2 years or not so

    Grammar is {Group:({Combine:({W:(abcd...) ";"}) SkipTo:(~{Combine:({W:(abcd...) ";"})})})}...
    Match Combine:({W:(abcd...) ";"}) at loc 0(1,1)
    Matched Combine:({W:(abcd...) ";"}) -> ['NOW;']
    Match SkipTo:(~{Combine:({W:(abcd...) ";"})}) at loc 4(1,5)
    Match Combine:({W:(abcd...) ";"}) at loc 4(1,5)
    Exception raised:Expected ";" (at char 7), (line:1, col:8)
    Matched SkipTo:(~{Combine:({W:(abcd...) ";"})}) -> ['']
    Match Combine:({W:(abcd...) ";"}) at loc 5(1,6)
    Exception raised:Expected ";" (at char 7), (line:1, col:8)
    After parse string: [['NOW;', '']]
    Tokens found [['NOW;', '']]
    ENDING Test 2

    Process finished with exit code 0

Best answer

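I'm all for TDD, but here your whole testing and alternative-selecting infrastructure really gets in the way of seeing where the grammar is and what is going on with it. If I strip out all the extra machinery, I see that your grammar is just: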

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
g1 = OneOrMore(Group(kw + body1))

The first issue I see is your definition of body1:

body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')

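You are on the right track with a negative lookahead, but for it to work in pyparsing you have to put it at the beginning of the expression, not the end. Think of it as "before I match another valid word, I will first rule out that it is a keyword":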

body1 = delimitedList(OneOrMore(~kw + Word(alphas + nums)))('Body')

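(By the way, why is this a delimitedList? delimitedList is usually reserved for true lists with delimiters, such as the comma-delimited arguments to a program function. All it does here is accept any commas that might be mixed into the body, which would be handled more directly with a list of punctuation.)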
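To make the placement of the lookahead concrete, here is a minimal sketch written for Python 3 (the code in this thread is Python 2). The names `trailing`, `leading`, and the short sample `text` are illustrative assumptions, not part of the original post. It compares a lookahead placed after the words with one placed before each word:

```python
from pyparsing import Combine, Literal, OneOrMore, Word, alphas, nums

kw = Combine(Word(alphas + nums) + Literal(';'))
word = Word(alphas + nums)

# Lookahead at the end: OneOrMore first consumes every word it can,
# including the keyword's own name, and only then is ~kw checked.
trailing = OneOrMore(word) + ~kw

# Lookahead at the start: each word is matched only after ruling out
# that a keyword begins at the current position.
leading = OneOrMore(~kw + word)

text = "a b KEY; c"  # hypothetical sample input

trailing_tokens = trailing.parseString(text).asList()
leading_tokens = leading.parseString(text).asList()

print(trailing_tokens)  # ['a', 'b', 'KEY']  (keyword name swallowed)
print(leading_tokens)   # ['a', 'b']         (stops before the keyword)
```

With the trailing form, the name of the next keyword ends up inside the body, so a following `kw` has nothing left to match.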

Here is my test version of the code:

from pyparsing import *

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))

msg = [  """NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
    """,
             '',
          ][0]

result = g1.parseString(msg)
# we expect multiple groups, each containing "KEY" and "Body" names,
# so iterate over groups, and dump the contents of each
for res in result:
    print res.dump()

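I still get the same result you do: only the first keyword matches. So, to see where the disconnect is, I use scanString, which returns not only the matched tokens but also the start and end locations of the match: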

result,start,end = next(g1.scanString(msg))
print len(msg),end

This gives me:

320 161

So I see that we stop at location 161 in a string whose total length is 320, so I'll add one more print statement:

print msg[end:end+10]

I get:

.
lastly;
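For readers on Python 3, the same scanString check can be sketched on a shortened message. The sample `text` and the variable name `unparsed` are assumptions made for illustration; the grammar is the one defined above:

```python
from pyparsing import Combine, Group, Literal, OneOrMore, Word, alphas, nums

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))

# Shortened, hypothetical stand-in for the original message
text = "NOW; first body.\nlastly; second body"

# scanString yields (tokens, start, end) for each match; take the first
tokens, start, end = next(g1.scanString(text))

# Everything from 'end' onward was never consumed by the grammar
unparsed = text[end:]
print(len(text), end)
print(repr(unparsed[:10]))
```

Printing the first few characters of the unconsumed text shows the exact character the grammar could not get past.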

The trailing period in the body is the culprit. If I remove it from the message and try parseString again, I now get:

['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- Body: ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- KEY: NOW;
['lastly;', 'another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- Body: ['another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- KEY: lastly;
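A complementary way to surface this kind of silent early stop, not used in the answer itself, is the `parseAll=True` argument of parseString, which makes pyparsing raise a ParseException when the grammar does not consume the whole input. The following is a rough Python 3 sketch; the shortened `text` and the names `partial` and `stop_char` are assumptions for illustration:

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, ParseException,
                       Word, alphas, nums)

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))

# Shortened, hypothetical stand-in for the original message
text = "NOW; first body.\nlastly; second body"

# Default behaviour: parsing stops quietly at the first unmatched character
partial = g1.parseString(text)
print(len(partial))  # number of keyword groups that were parsed

# With parseAll=True, leftover input is reported as an error instead
stop_char = None
try:
    g1.parseString(text, parseAll=True)
except ParseException as exc:
    stop_char = text[exc.loc]  # character where parsing gave up
    print(exc)
```

This trades a partial result for an explicit error, which can be easier to notice while a grammar is still being developed.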

If you want to handle punctuation, I suggest you add something like:

PUNC = oneOf(". , ? ! : & $")

and add it to body1:

body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
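Putting the pieces together, here is a minimal Python 3 sketch of the full grammar with punctuation handling. It also addresses the wish in the question to keep the semicolon out of the keyword by wrapping the ';' in Suppress inside the Combine; that part is an assumption of this sketch, not something the answer proposes. The sample `text` and the `parsed` dictionary are likewise illustrative:

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, Suppress, Word,
                       alphas, nums, oneOf)

# Keyword: a word immediately followed by ';'. Suppress drops the ';'
# from the result, so KEY holds only the word (assumption of this sketch).
kw = Combine(Word(alphas + nums) + Suppress(Literal(';')))('KEY')

PUNC = oneOf(". , ? ! : & $")

# Body: words or punctuation, each checked first against the keyword
body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
g1 = OneOrMore(Group(kw + body1))

# Shortened, hypothetical stand-in for the original message
text = """NOW; is the time.
lastly; another day, perhaps"""

# Map each keyword to its list of body tokens
parsed = {group['KEY']: list(group['Body'])
          for group in g1.parseString(text, parseAll=True)}
print(parsed)
```

Any punctuation character that is not listed in PUNC would still stop the parse, so the list needs to cover whatever the real messages contain.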

Regarding python - Getting a grammar to read multiple keywords in text, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18039236/
