This is my first attempt at using pyparsing, and I'm having a hard time setting it up. I want to use pyparsing to parse lexc files. The lexc format is used to declare lexicons that are compiled into finite-state transducers.
Special characters:
: divides 'upper' and 'lower' sides of a 'data' declaration
; terminates entry
# reserved LEXICON name. end-of-word or final state
' ' (space) universal delimiter
! introduces comment to the end of the line
< introduces xfst-style regex
> closes xfst-style regex
% escape character: %: %; %# % %! %< %> %%
There are multiple levels to parse. In general, anything from an unescaped ! to the end of the line is a comment. This can be handled separately at every level.
At the document level, there are three distinct sections:
Multichar_Symbols Optional one-time declaration
LEXICON Usually many of these
END Anything after this is ignored
At the Multichar_Symbols level, anything separated by whitespace is a declaration. The section ends at the first declaration of a LEXICON.
Multichar_Symbols the+first-one thesecond_one
third_one ! comment that this one is special
+Pl ! plural
At the LEXICON level, the name of the LEXICON is declared as:
LEXICON the_name ! whitespace delimited
After the name declaration, a LEXICON is made up of entries: data continuation ;. The semicolon terminates an entry. data is optional.
At the data level, there are three possible forms: upper:lower; simple (which is exploded into upper and lower as simple:simple); and <xfst-style regex>.
Examples:
! # is a reserved continuation that means "end of word".
dog+Pl:dogs # ; ! upper:lower continuation ;
cat # ; ! automatically exploded to "cat:cat # ;" by interpreter
Num ; ! no data, only a continuation to LEXICON named "Num"
<[1|2|3]+> # ; ! xfst-style regex enclosed in <>
Everything after END is ignored.
A complete lexc file might look like this:
! Comments begin with !
! Multichar_Symbols (separated by whitespace, terminated by first declared LEXICON)
Multichar_Symbols +A +N +V ! +A is adjectives, +N is nouns, +V is verbs
+Adv ! This one is for adverbs
+Punc ! punctuation
! +Cmpar ! This is broken for now, so I commented it out.
! The bulk of lexc is made of up LEXICONs, which contain entries that point to
! other LEXICONs. "Root" is a reserved lexicon name, and the start state.
! "#" is also a reserved lexicon name, and the end state.
LEXICON Root ! Root is a reserved lexicon name, if it is not declared, then the first LEXICON is assumed to be the root
big Adj ; ! This
bigly Adv ; ! Not sure if this is a real word...
dog Noun ;
cat Noun ;
crow Noun ;
crow Verb ;
Num ; ! This continuation class generates numbers using xfst-style regex
! NB all the following are reserved characters
sour% cream Noun ; ! escaped space
%: Punctuation ; ! escaped :
%; Punctuation ; ! escaped ;
%# Punctuation ; ! escaped #
%! Punctuation ; ! escaped !
%% Punctuation ; ! escaped %
%< Punctuation ; ! escaped <
%> Punctuation ; ! escaped >
%:%:%::%: # ; ! Should map ::: to :
LEXICON Adj
+A: # ; ! # is a reserved lexicon name which means end-of-word (final state).
! +Cmpar:er # ; ! Broken, so I commented it out.
LEXICON Adv
+Adv: # ;
LEXICON Noun
+N+Sg: # ;
+N+Pl:s # ;
LEXICON Num
<[0|1|2|3|4|5|6|7|8|9]> Num ; ! This is an xfst regular expression and a cyclic continuation
# ; ! After the first cycle, this makes sense, but as it is, this is bad.
LEXICON Verb
+V+Inf: # ;
+V+Pres:s # ;
LEXICON Punctuation
+Punc: # ;
END
This text is ignored because it is after END
So there are several different levels to parse. What is the best way to set this up in pyparsing? Are there any examples of hierarchical languages like this that I could follow as a model?
Best Answer
The strategy when working with pyparsing is to break the parsing problem down into small pieces, and then compose them into larger ones.
Start with the first high-level structural definition:
Multichar_Symbols Optional one-time declaration
LEXICON Usually many of these
END Anything after this is ignored
Your final overall parser will look like this:
parser = (Optional(multichar_symbols_section)('multichar_symbols')
+ Group(OneOrMore(lexicon_section))('lexicons')
+ END)
The names in parentheses after each part give us labels that make it easy to access the different parts of the parse results.
Getting deeper into the details, let's look at how to define the parser for lexicon_section.
First, define the punctuation and special keywords:
COLON,SEMI = map(Suppress, ":;")
HASH = Literal('#')
LEXICON, END = map(Keyword, "LEXICON END".split())
Your identifiers and values can contain '%' escape characters, so we need to build them up in pieces:
# use regex and Combine to handle % escapes
escaped_char = Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = Word(printables, excludeChars=':%;')
xfst_regex = Regex(r'<.*?>')
ident = Combine(OneOrMore(escaped_char | ident_lit_part)) | xfst_regex
value_expr = ident()
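As a quick sanity check, the escape handling can be exercised on its own (this restates the same pieces so the snippet runs standalone):

```python
from pyparsing import Combine, OneOrMore, Regex, Word, printables

# same ident pieces as above, restated so this snippet runs standalone
escaped_char = Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = Word(printables, excludeChars=':%;')
xfst_regex = Regex(r'<.*?>')
ident = Combine(OneOrMore(escaped_char | ident_lit_part)) | xfst_regex

print(ident.parseString("sour% cream")[0])  # -> sour cream
print(ident.parseString("%:%:%:")[0])       # -> :::
print(ident.parseString("<[1|2|3]+>")[0])   # -> <[1|2|3]+>
```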
With these pieces, we can now define an individual lexicon declaration:
# handle the following lexicon declarations:
# name ;
# name:value ;
# name value ;
# name value # ;
lexicon_decl = Group(ident("name")
+ Optional(Optional(COLON)
+ value_expr("value")
+ Optional(HASH)('hash'))
+ SEMI)
This part is a little messy, since it turns out value can come back as a string, as a structured result (a pyparsing ParseResults), or may even be missing entirely. We can use a parse action to normalize all of these forms into a single string form.
# use a parse action to normalize the parsed values
def fixup_value(tokens):
if 'value' in tokens[0]:
# pyparsing makes this a nested element, just take zero'th value
if isinstance(tokens[0].value, ParseResults):
tokens[0]['value'] = tokens[0].value[0]
else:
# no value was given, expand 'name' as if parsed 'name:name'
tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)
Now the value will be cleaned up at parse time, so no extra code is needed after calling parseString.
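To see the normalization in action, here is the declaration parser exercised on a couple of the forms listed above (the earlier definitions are restated so this runs standalone):

```python
import pyparsing as pp

# restate the pieces from above
COLON, SEMI = map(pp.Suppress, ":;")
HASH = pp.Literal('#')
escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;')
xfst_regex = pp.Regex(r'<.*?>')
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part)) | xfst_regex
value_expr = ident()

lexicon_decl = pp.Group(ident("name")
                        + pp.Optional(pp.Optional(COLON)
                                      + value_expr("value")
                                      + pp.Optional(HASH)('hash'))
                        + SEMI)

def fixup_value(tokens):
    if 'value' in tokens[0]:
        # pyparsing may make this a nested element, just take the zero'th value
        if isinstance(tokens[0].value, pp.ParseResults):
            tokens[0]['value'] = tokens[0].value[0]
    else:
        # no value was given, expand 'name' as if 'name:name' had been parsed
        tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)

d = lexicon_decl.parseString("dog+Pl:dogs # ;")[0]
print(d.name, d.value, d.hash)   # -> dog+Pl dogs #
d = lexicon_decl.parseString("Num ;")[0]
print(d.name, d.value)           # -> Num Num
```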
We are finally ready to define a whole LEXICON section:
# TBD - make name optional, define as 'Root'
lexicon_section = Group(LEXICON
+ ident("name")
+ ZeroOrMore(lexicon_decl, stopOn=LEXICON | END)("declarations"))
One last bit of tidying up - we need to ignore comments. We can call ignore on the topmost parser expression, and comments will be ignored throughout the whole parser:
# ignore comments anywhere in our parser
comment = '!' + Optional(restOfLine)
parser.ignore(comment)
Here is all the code in a single copy-pasteable section:
import pyparsing as pp
# define punctuation and special words
COLON,SEMI = map(pp.Suppress, ":;")
HASH = pp.Literal('#')
LEXICON, END = map(pp.Keyword, "LEXICON END".split())
# use regex and Combine to handle % escapes
escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;')
xfst_regex = pp.Regex(r'<.*?>')
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part | xfst_regex))
value_expr = ident()
# handle the following lexicon declarations:
# name ;
# name:value ;
# name value ;
# name value # ;
lexicon_decl = pp.Group(ident("name")
+ pp.Optional(pp.Optional(COLON)
+ value_expr("value")
+ pp.Optional(HASH)('hash'))
+ SEMI)
# use a parse action to normalize the parsed values
def fixup_value(tokens):
if 'value' in tokens[0]:
# pyparsing makes this a nested element, just take zero'th value
if isinstance(tokens[0].value, pp.ParseResults):
tokens[0]['value'] = tokens[0].value[0]
else:
# no value was given, expand 'name' as if parsed 'name:name'
tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)
# define a whole LEXICON section
# TBD - make name optional, define as 'Root'
lexicon_section = pp.Group(LEXICON
+ ident("name")
+ pp.ZeroOrMore(lexicon_decl, stopOn=LEXICON | END)("declarations"))
# this part still TBD - just put in a placeholder for now
multichar_symbols_section = pp.empty()
# tie it all together
parser = (pp.Optional(multichar_symbols_section)('multichar_symbols')
+ pp.Group(pp.OneOrMore(lexicon_section))('lexicons')
+ END)
# ignore comments anywhere in our parser
comment = '!' + pp.Optional(pp.restOfLine)
parser.ignore(comment)
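As a quick end-to-end check, the assembled parser can be run on a reduced sample. This is a condensed restatement of the listing above so it runs standalone; the still-TBD Multichar_Symbols placeholder is dropped here:

```python
import pyparsing as pp

# condensed restatement of the parser above (Multichar_Symbols part omitted)
COLON, SEMI = map(pp.Suppress, ":;")
HASH = pp.Literal('#')
LEXICON, END = map(pp.Keyword, "LEXICON END".split())
escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;')
xfst_regex = pp.Regex(r'<.*?>')
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part | xfst_regex))
value_expr = ident()
lexicon_decl = pp.Group(ident("name")
                        + pp.Optional(pp.Optional(COLON) + value_expr("value")
                                      + pp.Optional(HASH)('hash'))
                        + SEMI)
def fixup_value(tokens):
    if 'value' in tokens[0]:
        if isinstance(tokens[0].value, pp.ParseResults):
            tokens[0]['value'] = tokens[0].value[0]
    else:
        tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)
lexicon_section = pp.Group(LEXICON + ident("name")
                           + pp.ZeroOrMore(lexicon_decl,
                                           stopOn=LEXICON | END)("declarations"))
parser = pp.Group(pp.OneOrMore(lexicon_section))('lexicons') + END
parser.ignore('!' + pp.Optional(pp.restOfLine))

sample = """
LEXICON Root
dog Noun ;   ! a noun
cat Noun ;
LEXICON Noun
+N+Pl:s # ;
END
this trailing text is ignored
"""
result = parser.parseString(sample)
for lex in result.lexicons:
    print(lex.name, '->', [d.name for d in lex.declarations])
```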
Parsing the "Root" sample you posted, we can dump the results using dump():
# lexicon_sample holds the "LEXICON Root" section posted in the question
result = lexicon_section.parseString(lexicon_sample)[0]
print(result.dump())
giving:
['LEXICON', 'Root', ['big', 'Adj'], ['bigly', 'Adv'], ['dog', 'Noun'], ['cat', 'Noun'], ['crow', 'Noun'], ['crow', 'Verb'], ['Num'], ['sour cream', 'Noun'], [':', 'Punctuation'], [';', 'Punctuation'], ['#', 'Punctuation'], ['!', 'Punctuation'], ['%', 'Punctuation'], ['<', 'Punctuation'], ['>', 'Punctuation'], [':::', ':', '#']]
- declarations: [['big', 'Adj'], ['bigly', 'Adv'], ['dog', 'Noun'], ['cat', 'Noun'], ['crow', 'Noun'], ['crow', 'Verb'], ['Num'], ['sour cream', 'Noun'], [':', 'Punctuation'], [';', 'Punctuation'], ['#', 'Punctuation'], ['!', 'Punctuation'], ['%', 'Punctuation'], ['<', 'Punctuation'], ['>', 'Punctuation'], [':::', ':', '#']]
[0]:
['big', 'Adj']
- name: 'big'
- value: 'Adj'
[1]:
['bigly', 'Adv']
- name: 'bigly'
- value: 'Adv'
[2]:
['dog', 'Noun']
- name: 'dog'
- value: 'Noun'
...
[13]:
['<', 'Punctuation']
- name: '<'
- value: 'Punctuation'
[14]:
['>', 'Punctuation']
- name: '>'
- value: 'Punctuation'
[15]:
[':::', ':', '#']
- hash: '#'
- name: ':::'
- value: ':'
- name: 'Root'
This code shows how to iterate over the parts of the section and access its named fields:
# try out a lexicon against the posted sample
result = lexicon_section.parseString(lexicon_sample)[0]
print(result.dump())
print('Name:', result.name)
print('\nDeclarations')
for decl in result.declarations:
print("{name} -> {value}".format_map(decl), "(END)" if decl.hash else '')
giving:
Name: Root
Declarations
big -> Adj
bigly -> Adv
dog -> Noun
cat -> Noun
crow -> Noun
crow -> Verb
Num -> Num
sour cream -> Noun
: -> Punctuation
; -> Punctuation
# -> Punctuation
! -> Punctuation
% -> Punctuation
< -> Punctuation
> -> Punctuation
::: -> : (END)
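The multichar_symbols_section placeholder left as TBD above could be filled in along the same lines. This sketch is not part of the original answer; it assumes symbols are whitespace-delimited tokens terminated by the first LEXICON keyword:

```python
import pyparsing as pp

LEXICON = pp.Keyword("LEXICON")
# symbols are plain whitespace-delimited tokens; stop at the first LEXICON
symbol = ~LEXICON + pp.Word(pp.printables)
multichar_symbols_section = (pp.Suppress(pp.Keyword("Multichar_Symbols"))
                             + pp.ZeroOrMore(symbol)("symbols"))
multichar_symbols_section.ignore('!' + pp.Optional(pp.restOfLine))

result = multichar_symbols_section.parseString(
    "Multichar_Symbols +A +N ! adjectives and nouns\n+Pl")
print(list(result.symbols))   # -> ['+A', '+N', '+Pl']
```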
Hopefully this is enough to get you started from here.
Source: "python - Parsing lexc using python (pyparsing)" on Stack Overflow: https://stackoverflow.com/questions/42840321/